Time |
Nickname |
Message |
00:10
🔗
|
mgrandi |
maybe just try X amount of times until you get a non 404 in your script |
00:10
🔗
|
mgrandi |
also file list 01 finished, just waiting on 2 which is like 60% done |
00:11
🔗
|
JAA |
Yeah, I might do that after the current scan. |
00:18
🔗
|
yano |
okay, if i spin up a cpx51 can we get another target |
00:18
🔗
|
yano |
this is getting bonkers |
00:18
🔗
|
yano |
i still have 300 VMs waiting to upload |
00:20
🔗
|
Craigle |
The real question is what would cost less, the workers, or the single VM? |
00:21
🔗
|
yano |
probably the single VM, as I can pay per hour |
00:21
🔗
|
yano |
kiska: you around? lol |
00:21
🔗
|
yano |
kiska: wanna make a target? :3 |
00:21
🔗
|
Craigle |
I assume so too, just saying, lol |
00:23
🔗
|
yano |
i'm bored lol, too many storms, can't play with my ham radio, damn lightning |
00:24
🔗
|
yano |
ooh down to 307 |
00:24
🔗
|
yano |
lol |
00:25
🔗
|
yano |
phuzion: hetzner cloud now has a referral |
00:25
🔗
|
yano |
i can give you 20 euros and if you spend 10 euros i get 10 euros |
00:25
🔗
|
yano |
\o/ |
00:25
🔗
|
yano |
not that profitable but could be worth it for some folks |
00:28
🔗
|
mgrandi |
i have a large hard drive but the problem is the residential ISP internet |
00:28
🔗
|
yano |
Kaz: wanna spin up a target for the microsoft project? |
00:30
🔗
|
Craigle |
I believe Kaz and EggplantN are already running 1-2 each |
00:31
🔗
|
yano |
i can donate a machine but i don't have access to set it up as a target |
00:31
🔗
|
Craigle |
Yeah, according to this: https://usercontent.irccloud-cdn.com/file/PAJkkzW6/image.png There are 6 targets currently |
00:31
🔗
|
Craigle |
Oh, gotcha. I misunderstood |
00:31
🔗
|
yano |
ah, no worries |
00:31
🔗
|
yano |
i'm just poking folks who i know who have set them up in the past |
00:31
🔗
|
|
antomatic has quit IRC (Ping timeout: 265 seconds) |
00:32
🔗
|
Craigle |
Right, I completely see that now. |
00:32
🔗
|
yano |
no worries :) |
00:34
🔗
|
|
antomatic has joined #archiveteam-bs |
01:13
🔗
|
|
RichardG_ has quit IRC (Quit: Keyboard not found, press F1 to continue) |
01:15
🔗
|
|
RichardG has joined #archiveteam-bs |
01:16
🔗
|
|
RichardG has quit IRC (Client Quit) |
01:18
🔗
|
|
RichardG has joined #archiveteam-bs |
01:31
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
01:33
🔗
|
|
Mateon1 has joined #archiveteam-bs |
01:52
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
01:53
🔗
|
|
Mateon1 has joined #archiveteam-bs |
02:16
🔗
|
yano |
there we go |
02:23
🔗
|
kiska |
Alright awake now reading scrollback |
02:42
🔗
|
arkiver |
I'm off to bed |
02:42
🔗
|
arkiver |
please do requeue the microsoft items if nothing new is being returned (and then assuming all out items are inactive) |
02:43
🔗
|
kiska |
Wilco |
03:03
🔗
|
yano |
all the out items are waiting to be uploaded |
03:03
🔗
|
yano |
most of them are me |
03:10
🔗
|
|
VADemon has quit IRC (left4dead) |
03:30
🔗
|
|
antomati_ has joined #archiveteam-bs |
03:32
🔗
|
|
antomatic has quit IRC (Read error: Operation timed out) |
03:37
🔗
|
|
qw3rty_ has joined #archiveteam-bs |
03:38
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
03:41
🔗
|
|
Mateon1 has joined #archiveteam-bs |
03:45
🔗
|
|
qw3rty__ has quit IRC (Read error: Operation timed out) |
04:23
🔗
|
kiska |
Who is ready for me to requeue 6k items :D |
04:24
🔗
|
kiska |
YEET! |
04:45
🔗
|
mgrandi |
so on the subject of 'targets', is there any documentation on what is needed to be a target? |
04:46
🔗
|
mgrandi |
my friend says he has a lot of disk space and is curious |
04:48
🔗
|
Craigle |
I don't know all of the details, but I know upload bandwidth is a big one. I believe 1 Gbit/s up and down is considered a minimum. |
04:56
🔗
|
mgrandi |
"I have gigabit symmetric" |
04:59
🔗
|
mgrandi |
but regardless, would be nice to document it |
05:05
🔗
|
kiska |
https://www.archiveteam.org/index.php?title=Dev/Staging :D |
05:10
🔗
|
kiska |
An addendum: fast disks (SATA SSDs at minimum, NVMe drives preferred; either way with redundancy: RAID1 minimum, RAID5/6 preferred) |
06:04
🔗
|
kiska |
mgrandi: ^ |
06:18
🔗
|
endrift |
oh I blinked and missed it |
06:19
🔗
|
endrift |
no need for my bandwidth this month it seems |
06:23
🔗
|
|
purplebot has joined #archiveteam-bs |
06:42
🔗
|
kiska |
endrift: Will be doing requeue shortly |
06:47
🔗
|
endrift |
ah, ok |
06:48
🔗
|
|
DigiDigi has quit IRC (Read error: Operation timed out) |
06:50
🔗
|
kiska |
endrift: Wanna spin something up? |
06:51
🔗
|
endrift |
I spun up 3 docker containers in advance |
06:51
🔗
|
endrift |
I don't know if/what the deadline is |
06:51
🔗
|
endrift |
I have not been paying attention |
06:51
🔗
|
endrift |
had other priorities today |
06:51
🔗
|
kiska |
lol |
06:51
🔗
|
kiska |
I requeued and all gone |
06:53
🔗
|
endrift |
I don't appear to have caught any of them haha |
06:53
🔗
|
endrift |
ah well |
06:53
🔗
|
kiska |
https://server8.kiska.pw/uploads/47ad6e57ff56618a/image.png And the targets appear... HOT |
07:18
🔗
|
|
DigiDigi has joined #archiveteam-bs |
07:50
🔗
|
mgrandi |
i'm at 9800/10,000 for the last of the files i was getting |
07:50
🔗
|
mgrandi |
i'll talk to you @kiska later tomorrow possibly about when i should rsync these files over |
07:51
🔗
|
|
NOTBARN has joined #archiveteam-bs |
07:53
🔗
|
mgrandi |
2.5 TiB so far, in 5 GiB .warc.gz chunks |
07:53
🔗
|
|
NOTBARN has quit IRC (Remote host closed the connection) |
07:53
🔗
|
Kaz |
morning |
07:54
🔗
|
Kaz |
whats the situ |
07:54
🔗
|
mgrandi |
i think pretty much everything on the microsoft download site has been archived, minus the 200 files i have left |
08:01
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
08:04
🔗
|
|
Mateon1 has joined #archiveteam-bs |
08:17
🔗
|
|
MillerBOS has joined #archiveteam-bs |
08:27
🔗
|
|
mgrandi has quit IRC (Leaving) |
08:37
🔗
|
|
mgrandi has joined #archiveteam-bs |
08:54
🔗
|
kiska |
Did we do this with the warrior project? |
08:54
🔗
|
kiska |
Kaz: My machine is happily accepting data |
08:55
🔗
|
Kaz |
does it still idle at 50% disk usage |
08:55
🔗
|
kiska |
Cause of mixer |
08:55
🔗
|
kiska |
Also it's not idling, I am still accepting 3 conns for mixer |
08:56
🔗
|
Kaz |
why do you have 250G of mixer sitting around |
08:56
🔗
|
Kaz |
those are some pretty huge videos if so |
08:57
🔗
|
kiska |
I am pretty sure 1 or 2 of them are stale given I killed the connection just now, so I rm'd them |
08:57
🔗
|
kiska |
I kicked them over to buyvm where I have more disk space |
08:58
🔗
|
Kaz |
you've had 250G of mixer sitting there for a week? https://usercontent.irccloud-cdn.com/file/bJgYPpA0/image.png |
08:58
🔗
|
kiska |
Yep, didn't know what finished and what didn't |
08:59
🔗
|
Kaz |
O.o |
08:59
🔗
|
kiska |
And since this project kicked off, I just killed all the connections and punted them over to buyvm |
08:59
🔗
|
Kaz |
colour me confused |
09:04
🔗
|
|
DLoader_ has joined #archiveteam-bs |
09:04
🔗
|
kiska |
7.5G in reddit, 12G in bitbucket(mercurial), 4.5G in the artists union, etc etc |
09:04
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
09:05
🔗
|
kiska |
85G in mixer, so I need to poke it further |
09:05
🔗
|
|
SketchCo1 has joined #archiveteam-bs |
09:05
🔗
|
|
swebb sets mode: +o SketchCo1 |
09:06
🔗
|
kiska |
Also the size is 468G :D |
09:06
🔗
|
|
Doran has joined #archiveteam-bs |
09:06
🔗
|
|
Hecatz- has joined #archiveteam-bs |
09:07
🔗
|
|
Meli-sama has joined #archiveteam-bs |
09:07
🔗
|
|
synm0nger has joined #archiveteam-bs |
09:07
🔗
|
|
Sanqui has joined #archiveteam-bs |
09:07
🔗
|
|
Silvan has joined #archiveteam-bs |
09:08
🔗
|
|
sirvy has joined #archiveteam-bs |
09:08
🔗
|
|
Coderjo has joined #archiveteam-bs |
09:08
🔗
|
|
tapedriv1 has joined #archiveteam-bs |
09:08
🔗
|
|
Zebranky_ has joined #archiveteam-bs |
09:09
🔗
|
|
sknebel_ has joined #archiveteam-bs |
09:09
🔗
|
|
betamax_ has joined #archiveteam-bs |
09:10
🔗
|
|
Meroje_ has quit IRC (Ping timeout: 745 seconds) |
09:11
🔗
|
|
girst_ has joined #archiveteam-bs |
09:12
🔗
|
|
N4Y_ has joined #archiveteam-bs |
09:12
🔗
|
|
Meroje has joined #archiveteam-bs |
09:12
🔗
|
|
gandalf_ has joined #archiveteam-bs |
09:12
🔗
|
|
ephemer0l has quit IRC (Ping timeout: 745 seconds) |
09:13
🔗
|
|
AlsoJAA_ has joined #archiveteam-bs |
09:13
🔗
|
|
JAA sets mode: +o AlsoJAA_ |
09:13
🔗
|
|
Sanky has quit IRC (Ping timeout: 745 seconds) |
09:13
🔗
|
|
nepeat_ has quit IRC (Ping timeout: 745 seconds) |
09:14
🔗
|
|
DFJustin has quit IRC (Ping timeout: 745 seconds) |
09:14
🔗
|
|
SynMonger has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
Craigle has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
acridAxid has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
Meli has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
atg has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
tapedrive has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
PotcFdk has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
step has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
Doranwen has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
Jonimoose has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
sirvy_ has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
underscor has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
Zebranky has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
Coderjo_ has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
BlueMax has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
SilSte has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
DLoader has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
girst has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
K4k__ has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
zhongfu has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
ivan has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
second has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
sknebel has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
coderobe has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
apache2 has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
mr_archiv has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
betamax has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
N4Y has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
gandalf has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
AlsoJAA has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
Hecatz has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
N4Y_ is now known as N4Y |
09:15
🔗
|
|
gandalf_ is now known as gandalf |
09:15
🔗
|
|
Hecatz- is now known as Hecatz |
09:15
🔗
|
|
SketchCow has quit IRC (Ping timeout: 745 seconds) |
09:15
🔗
|
|
girst_ is now known as girst |
09:15
🔗
|
|
DLoader_ is now known as DLoader |
09:18
🔗
|
|
second has joined #archiveteam-bs |
09:21
🔗
|
|
Jonimoose has joined #archiveteam-bs |
09:23
🔗
|
|
K4k__ has joined #archiveteam-bs |
09:23
🔗
|
|
zhongfu has joined #archiveteam-bs |
09:23
🔗
|
|
nepeat has joined #archiveteam-bs |
09:23
🔗
|
|
apache2 has joined #archiveteam-bs |
09:23
🔗
|
|
DFJustin has joined #archiveteam-bs |
09:23
🔗
|
|
PotcFdk has joined #archiveteam-bs |
09:24
🔗
|
|
step has joined #archiveteam-bs |
09:24
🔗
|
|
Craigle has joined #archiveteam-bs |
09:24
🔗
|
|
coderobe has joined #archiveteam-bs |
09:24
🔗
|
|
ivan has joined #archiveteam-bs |
09:24
🔗
|
|
atg has joined #archiveteam-bs |
09:30
🔗
|
|
acridAxid has joined #archiveteam-bs |
10:02
🔗
|
mgrandi |
My files for Microsoft dl center finished |
10:03
🔗
|
mgrandi |
I dunno if wget has problems counting a file if it has an error (and retries) but it says it downloaded like 9997/10,000 at the "finished" message |
10:03
🔗
|
JAA |
My third scan finished as well with lots of false 404s. I fucking hate this site. |
10:03
🔗
|
JAA |
mgrandi: Can you check whether there were any 404s or other non-200s? |
10:04
🔗
|
mgrandi |
Even with retries? |
10:04
🔗
|
JAA |
No, without. |
10:04
🔗
|
JAA |
This is the slower scan I started before we talked about that. |
10:05
🔗
|
mgrandi |
I'm not sure if the wget log logs any of those, I got a handful of "read error: (success) in headers" |
10:06
🔗
|
mgrandi |
I think my command told it to retry forever for files |
10:09
🔗
|
JAA |
121 files were missed by this third scan but found by one of the previous two. Based on some brief spot-checking, they all still exist. This sucks. |
10:10
🔗
|
JAA |
11 additional files found in the third scan. |
10:10
🔗
|
JAA |
https://transfer.notkiska.pw/10q0Kt/microsoft-download-center-files-scan-3-sorted-new.jsonl |
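The comparison JAA describes (files missed by the latest scan but seen earlier, and files seen only in the latest scan) is two set differences. A minimal sketch, assuming each scan is a JSONL file with one object per line and that the file URL sits under a `url` key; that key name and the helper names are hypothetical, since the log does not show the file layout:

```python
import json
from typing import Iterable, List, Set, Tuple


def urls_from_jsonl(lines: Iterable[str], key: str = "url") -> Set[str]:
    """Collect the value of `key` from every non-empty JSONL line."""
    return {json.loads(line)[key] for line in lines if line.strip()}


def compare_scans(previous: List[Set[str]], latest: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missed, new) for the latest scan relative to all earlier scans.

    missed: URLs seen in at least one earlier scan but absent from the latest.
    new:    URLs present in the latest scan but in none of the earlier ones.
    """
    seen_before = set().union(*previous)
    return seen_before - latest, latest - seen_before


if __name__ == "__main__":
    scan1 = urls_from_jsonl(['{"url": "https://example.invalid/a.exe"}'])
    scan2 = urls_from_jsonl(['{"url": "https://example.invalid/b.msi"}'])
    scan3 = urls_from_jsonl(['{"url": "https://example.invalid/b.msi"}',
                             '{"url": "https://example.invalid/c.zip"}'])
    missed, new = compare_scans([scan1, scan2], scan3)
    print(sorted(missed), sorted(new))
```

Anything in `missed` still needs spot-checking against the live site, because a missed URL can be either a false 404 or a file that was really removed.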
10:15
🔗
|
mgrandi |
Maybe scan just the 404s in your next scan? |
10:15
🔗
|
mgrandi |
But yeah, I'll check out my data tomorrow, I'm just gonna move it to a smaller drive real fast and zzz |
10:26
🔗
|
JAA |
mgrandi: The reason I asked about 404s/non-200s is that my extraction of the download links isn't entirely clean. I'm not parsing HTML but just extracting strings from it. So in some edge cases, the links might be slightly incorrect and result in 404s or other errors. |
10:28
🔗
|
JAA |
Queued the 11 extra files. |
10:28
🔗
|
JAA |
And naturally, Fusl already ate them all. |
10:30
🔗
|
mgrandi |
Ah, well I don't think wget encountered any of those |
10:30
🔗
|
JAA |
Alright, that's good. |
10:30
🔗
|
mgrandi |
I would need strings to search for, I'll compare against the list tomorrow to see what files are missing or if it is just a reporting bug when wget finishes |
10:37
🔗
|
JAA |
Yeah, not sure about wget's log format. |
10:53
🔗
|
|
mtntmnky has quit IRC (Remote host closed the connection) |
10:53
🔗
|
|
mtntmnky has joined #archiveteam-bs |
11:00
🔗
|
|
trc has joined #archiveteam-bs |
11:18
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
11:20
🔗
|
|
ephemer0l has joined #archiveteam-bs |
12:10
🔗
|
|
Pixi` has quit IRC (Ping timeout: 260 seconds) |
13:27
🔗
|
|
AlsoJAA_ is now known as AlsoJAA |
13:43
🔗
|
JAA |
I started a fourth scan. I'm now retrying 404 redirects four times with a 5-sec sleep between requests. I'm only writing the last 404 redirect to WARC. Hopefully, that produces a clean dataset now. |
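The retry policy described here can be sketched as a small loop. This is an illustration only, not JAA's actual scanner: `fetch_status` is a hypothetical callable that requests a URL and returns its HTTP status code, and "retrying four times" is read as one initial request plus up to four retries, which is an assumption. Cookie clearing and WARC writing are left out.

```python
import time
from typing import Callable, Tuple


def fetch_with_404_retries(
    url: str,
    fetch_status: Callable[[str], int],
    retries: int = 4,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Request `url` until it stops returning 404 or the retries run out.

    Returns (last_status, number_of_requests_made). Only the last response
    would be kept, mirroring "only writing the last 404" to the WARC.
    """
    attempts = 0
    while True:
        status = fetch_status(url)
        attempts += 1
        if status != 404 or attempts > retries:
            return status, attempts
        sleep(delay)  # give the flaky backend time before asking again


if __name__ == "__main__":
    # Fake fetcher standing in for a real HTTP client: two false 404s, then 200.
    answers = iter([404, 404, 200])
    print(fetch_with_404_retries("https://example.invalid/file",
                                 lambda _url: next(answers), delay=0.0))
```

Passing `sleep` and `fetch_status` as parameters keeps the loop testable without network access or real waiting.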
13:45
🔗
|
|
britmob has quit IRC (Ping timeout: 265 seconds) |
13:48
🔗
|
arkiver |
Craigle: you're doing a 64 GB item for the microsoft project |
13:48
🔗
|
arkiver |
is that one still running? |
13:50
🔗
|
JAA |
I noticed that as well. Fusl was running it before I requeued it. |
13:50
🔗
|
JAA |
Had been out for 4 hours at the time though with little stuff still coming in, so I expected that to be dead. |
13:55
🔗
|
arkiver |
I'll run a machine as well and get that one archived |
13:55
🔗
|
arkiver |
if we have two copies of it, it's not a problem |
13:58
🔗
|
arkiver |
downloading at 400 Mbit :) |
14:15
🔗
|
kiska |
And I requeued it at some point as well |
14:16
🔗
|
kiska |
Oh well we'll have some number of copies of it |
15:33
🔗
|
|
Arcorann has quit IRC (Read error: Connection reset by peer) |
15:39
🔗
|
|
Jens has quit IRC (Quit: Jens) |
15:40
🔗
|
|
Jens has joined #archiveteam-bs |
15:40
🔗
|
|
Lord_Nigh has quit IRC (Read error: Operation timed out) |
15:55
🔗
|
JAA |
Fourth scan should be done within the next hour I think. |
16:04
🔗
|
|
TC01 has quit IRC (Read error: Operation timed out) |
16:30
🔗
|
Craigle |
arkiver: I don't show that running. But I would also expect it to fail since it's larger than the storage on those cloud machines |
16:30
🔗
|
Craigle |
I didn't expect any items that large |
16:42
🔗
|
arkiver |
all done |
16:42
🔗
|
arkiver |
I just finished that 65 GB item |
16:47
🔗
|
JAA |
Yay |
16:48
🔗
|
JAA |
Fourth scan should be done in 10 minutes or so. |
16:48
🔗
|
JAA |
That should finally be complete, I hope. |
16:48
🔗
|
arkiver |
awesome |
16:49
🔗
|
kiska |
So what do we do with the files that mgrandi has downloaded? |
16:50
🔗
|
JAA |
I'd say either verify that we got all of it in the DPoS project data or just throw it on IA as well to be safe. |
16:52
🔗
|
arkiver |
yeah if it's just a TB or two, I guess it can be dumped to IA |
17:03
🔗
|
|
britmob has joined #archiveteam-bs |
17:13
🔗
|
JAA |
WTF? There are *still* false 404s. |
17:13
🔗
|
JAA |
I retried those four times with sleeps and clearing cookies... |
17:15
🔗
|
kiska |
Broken? |
17:15
🔗
|
JAA |
Actually, hold on, will need to investigate more. |
17:22
🔗
|
JAA |
Soo, it turns out that Microsoft is replacing files on items occasionally. |
17:22
🔗
|
JAA |
E.g. https://www.microsoft.com/en-us/download/details.aspx?id=41653 had a file from last week until yesterday. |
17:26
🔗
|
JAA |
Yeah, no false 404s anymore now I think. |
17:26
🔗
|
JAA |
A bunch of replaced files and three new IDs. |
17:27
🔗
|
JAA |
Just to underline that it's not at all sequential: the new IDs are 39717, 51495, and 100429. |
17:29
🔗
|
JAA |
And the only new files I discovered on this fourth scan are two updated ones. Cool. |
17:30
🔗
|
JAA |
Looks like they delete the replaced files, by the way, e.g. https://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_20200727.xml from 41653 is a 404 now. |
17:35
🔗
|
mgrandi |
Did you guys queue the entire list of data? Including IDs like 0-30,000? (Aka what i downloaded) |
17:35
🔗
|
JAA |
mgrandi: Yes |
17:36
🔗
|
mgrandi |
Oh so you don't need my data then lol |
17:37
🔗
|
JAA |
*If* everything went fine, no. There's always a chance something went wrong for some reason though. |
17:37
🔗
|
mgrandi |
Is there a way to like get a list/hash of files within a warc so we can see if all the files were grabbed by someone else? |
17:37
🔗
|
mgrandi |
Yeah, I won't delete it yet |
17:37
🔗
|
JAA |
That's the CDX. |
17:38
🔗
|
mgrandi |
Well I haven't uploaded mine to IA yet |
17:38
🔗
|
mgrandi |
So I don't have that |
17:38
🔗
|
JAA |
Right |
17:39
🔗
|
JAA |
You could generate a CDX somehow from your WARCs for comparison I guess. |
17:42
🔗
|
JAA |
But probably not all data is on IA yet from the DPoS project. |
17:43
🔗
|
JAA |
Two more files from my fourth scan queued. |
17:43
🔗
|
kiska |
I think all of it is, besides the 2 items |
17:43
🔗
|
JAA |
And done already. :-) |
17:43
🔗
|
kiska |
:D |
17:44
🔗
|
kiska |
Actually I have 16G on my machine to upload |
17:47
🔗
|
JAA |
Also, another correction, those three IDs mentioned above are also updated, not new. |
17:48
🔗
|
mgrandi |
Yeah, I'll have to look to see if I can somehow generate them without too much hassle |
17:49
🔗
|
|
Lord_Nigh has joined #archiveteam-bs |
18:16
🔗
|
|
TC01 has joined #archiveteam-bs |
18:27
🔗
|
|
systwi_ has joined #archiveteam-bs |
18:33
🔗
|
|
systwi has quit IRC (Ping timeout: 622 seconds) |
18:53
🔗
|
|
not_barn has joined #archiveteam-bs |
19:08
🔗
|
|
MaximeleG has joined #archiveteam-bs |
19:12
🔗
|
|
Larsenv_ has joined #archiveteam-bs |
19:12
🔗
|
|
MaximeleG has quit IRC (Client Quit) |
19:16
🔗
|
|
Larsenv has quit IRC (Read error: Operation timed out) |
20:28
🔗
|
lennier1 |
mgrandi: Warcat is good for extracting files from a .warc. https://github.com/chfoo/warcat |
20:32
🔗
|
mgrandi |
I don't really want to extract them since I have 3 TB of files...I just mainly want to iterate over the files in the warc and get a file size / sha256 hash or something so I can compare it to the files that everyone else got to make sure we have everything before I nuke the files I have |
20:34
🔗
|
mgrandi |
I can probably whip up a script to do it myself later if one doesn't exist |
20:46
🔗
|
|
Larsenv_ is now known as Larsenv |
20:51
🔗
|
JAA |
mgrandi: The CDX contains the URL and the SHA-1 in base32, so you just need to extract those from the file. |
20:52
🔗
|
JAA |
Here's a dirty way: `zgrep '^WARC-Target-URI: \|^WARC-Payload-Digest: \|^WARC-Type: ' $file` + further processing that into some format you can diff/comm against an extract from the CDXs. |
20:52
🔗
|
mgrandi |
So wget-at creates that inside the warc.gz? |
20:52
🔗
|
JAA |
Yeah, the WARC-Payload-Digest is in that format. |
20:52
🔗
|
mgrandi |
I didn't realize, thought IA added those |
20:52
🔗
|
mgrandi |
Well that makes it easier |
20:53
🔗
|
JAA |
IA might recompute them, not sure. There's some ambiguity/misunderstandings about what the payload digest should be exactly in some cases (chunked responses), but let's hope that doesn't apply here. :-) |
20:54
🔗
|
JAA |
I don't see chunked TE in a brief test, so I think that's fine. |
20:55
🔗
|
JAA |
Note that WARC lines end with CRLF, not just LF. You might want to `tr -d '\r'` somewhere in the processing. |
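As an alternative to the zgrep approach, the same URL and digest pairs can be pulled out by walking the WARC record headers and skipping each record body via its Content-Length, so header-like lines inside a payload are never matched. A rough sketch using only the standard library; the function name is hypothetical, the `sha1:` prefix and optional angle brackets around the URI are stripped on the assumption that this makes the output comparable to the CDX columns, and it has not been run against real wget-at output:

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_response_digests(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (target_uri, payload_digest) for each WARC response record."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # CRLF separator lines between records
        headers: Dict[str, str] = {}
        while True:
            # WARC header lines end with CRLF, so strip both characters.
            line = stream.readline().rstrip(b"\r\n")
            if not line:
                break  # blank line: end of the record header
            name, _, value = line.partition(b":")
            headers[name.decode("ascii", "replace").strip().lower()] = (
                value.decode("utf-8", "replace").strip()
            )
        # Skip the record body in bounded chunks; items can be tens of GB.
        remaining = int(headers.get("content-length", "0"))
        while remaining > 0:
            chunk = stream.read(min(remaining, 1 << 20))
            if not chunk:
                break
            remaining -= len(chunk)
        if headers.get("warc-type") == "response" and "warc-payload-digest" in headers:
            uri = headers.get("warc-target-uri", "").strip("<>")
            digest = headers["warc-payload-digest"]
            if digest.lower().startswith("sha1:"):
                digest = digest[5:]
            yield uri, digest


def digests_from_file(path: str) -> Iterator[Tuple[str, str]]:
    """Open a .warc.gz (possibly many concatenated gzip members) and scan it."""
    with gzip.open(path, "rb") as stream:
        yield from iter_response_digests(stream)
```

This still decompresses every byte, so for 5 GiB chunks it is not fast, but nothing is extracted to disk.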
20:56
🔗
|
mgrandi |
I mainly just want a sorted list of the files (and hashes) I have, and then I need to get a sorted list of the files that were downloaded by the seesaw project and then diff them |
20:56
🔗
|
mgrandi |
Cause I started downloading earlier than everyone so maybe I have some files that weren't picked up |
20:58
🔗
|
JAA |
Looks like at least some of the DPoS items aren't derived yet, so no CDX for those. |
20:58
🔗
|
JAA |
Actually, only the most recent couple items. |
20:59
🔗
|
mgrandi |
I'm at "work" now so I'll look at it later |
21:06
🔗
|
JAA |
mgrandi: Here's how to generate a list of URL + digest from the data on IA: download them using `ia download --search 'Archive Team Microsoft Download Center collection:archiveteam' --glob '*.warc.os.cdx.gz'` and process with `zcat */*.cdx.gz | awk '{ print $3 " " $6 }' | grep -v '^b s$'` |
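The awk and grep step, plus the comparison mgrandi wants, can also be sketched in Python. This assumes the CDX lines are whitespace-separated with the original URL in column 3 and the digest in column 6, exactly as the awk command reads them; the function names are hypothetical, and the local pairs would come from whatever extraction is run on the local WARCs:

```python
from typing import Iterable, Iterator, List, Set, Tuple

Pair = Tuple[str, str]


def cdx_pairs(lines: Iterable[str]) -> Iterator[Pair]:
    """Yield (url, digest) from CDX lines, like awk '{ print $3 " " $6 }'."""
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # blank or truncated line
        url, digest = fields[2], fields[5]
        if (url, digest) == ("b", "s"):
            continue  # the " CDX N b a m s ..." header, as grep -v '^b s$' drops
        yield url, digest


def missing_from_remote(local: Iterable[Pair], remote: Iterable[Pair]) -> List[Pair]:
    """Return sorted (url, digest) pairs held locally but absent remotely."""
    remote_set: Set[Pair] = set(remote)
    return sorted(set(local) - remote_set)


if __name__ == "__main__":
    cdx = [
        " CDX N b a m s k r M S V g",
        "com,example)/a 20200801000000 http://example.com/a text/html 200 ABC234 - - 10 0 x.warc.gz",
    ]
    local = [("http://example.com/a", "ABC234"), ("http://example.com/b", "DEF567")]
    print(missing_from_remote(local, cdx_pairs(cdx)))
```

An empty result would suggest every local file is already covered, with the caveat raised above that items which have not finished deriving have no CDX yet.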
21:06
🔗
|
mgrandi |
Thanks! |
21:08
🔗
|
JAA |
But you might want to wait for all items to finish deriving before running that. |
21:09
🔗
|
JAA |
Also, not sure if there's any data still on the targets. |
21:40
🔗
|
|
Wingy has quit IRC (Read error: Operation timed out) |
21:44
🔗
|
|
Wingy has joined #archiveteam-bs |
21:48
🔗
|
|
Wingy has quit IRC (Client Quit) |
21:48
🔗
|
|
Pixi has joined #archiveteam-bs |
22:09
🔗
|
|
Wingy has joined #archiveteam-bs |
22:09
🔗
|
mgrandi |
Yeah I'll wait |
22:52
🔗
|
|
Wingy has quit IRC (Read error: Operation timed out) |
22:56
🔗
|
|
Wingy has joined #archiveteam-bs |
23:05
🔗
|
|
Arcorann has joined #archiveteam-bs |
23:30
🔗
|
|
Wingy has quit IRC (Read error: Operation timed out) |
23:31
🔗
|
|
Wingy has joined #archiveteam-bs |
23:35
🔗
|
|
lunik19 has joined #archiveteam-bs |
23:36
🔗
|
|
lunik1 has quit IRC (Ping timeout: 265 seconds) |
23:36
🔗
|
|
lunik19 is now known as lunik1 |