#archiveteam-bs 2020-08-04,Tue


Time Nickname Message
00:10 🔗 mgrandi maybe just try X amount of times until you get a non 404 in your script
00:10 🔗 mgrandi also file list 01 finished, just waiting on 2 which is like 60% done
00:11 🔗 JAA Yeah, I might do that after the current scan.
00:18 🔗 yano okay, if i spin up a cpx51 can we get another target
00:18 🔗 yano this is getting bonkers
00:18 🔗 yano i still have 300 VMs waiting to upload
00:20 🔗 Craigle The real question is what would cost less, the workers, or the single VM?
00:21 🔗 yano probably the single VM, as I can pay per hour
00:21 🔗 yano kiska: you around? lol
00:21 🔗 yano kiska: wanna make a target? :3
00:21 🔗 Craigle I assume so too, just saying, lol
00:23 🔗 yano i'm bored lol, too many storms, can't play with my ham radio, damn lightning
00:24 🔗 yano ooh down to 307
00:24 🔗 yano lol
00:25 🔗 yano phuzion: hetzner cloud now has a referral
00:25 🔗 yano i can give you 20 euros and if you spend 10 euros i get 10 euros
00:25 🔗 yano \o/
00:25 🔗 yano not that profitable but could be worth it for some folks
00:28 🔗 mgrandi i have a large hard drive but the problem is the residential ISP internet
00:28 🔗 yano Kaz: wanna spin up a target for the microsoft project?
00:30 🔗 Craigle I believe Kaz and EggplantN are already running 1-2 each
00:31 🔗 yano i can donate a machine but i don't have access to set it up as a target
00:31 🔗 Craigle Yeah, according to this: https://usercontent.irccloud-cdn.com/file/PAJkkzW6/image.png There are 6 targets currently
00:31 🔗 Craigle Oh, gotcha. I misunderstood
00:31 🔗 yano ah, no worries
00:31 🔗 yano i'm just poking folks who i know who have set them up in the past
00:31 🔗 antomatic has quit IRC (Ping timeout: 265 seconds)
00:32 🔗 Craigle Right, I completely see that now.
00:32 🔗 yano no worries :)
00:34 🔗 antomatic has joined #archiveteam-bs
01:13 🔗 RichardG_ has quit IRC (Quit: Keyboard not found, press F1 to continue)
01:15 🔗 RichardG has joined #archiveteam-bs
01:16 🔗 RichardG has quit IRC (Client Quit)
01:18 🔗 RichardG has joined #archiveteam-bs
01:31 🔗 Mateon1 has quit IRC (Remote host closed the connection)
01:33 🔗 Mateon1 has joined #archiveteam-bs
01:52 🔗 Mateon1 has quit IRC (Remote host closed the connection)
01:53 🔗 Mateon1 has joined #archiveteam-bs
02:16 🔗 yano there we go
02:23 🔗 kiska Alright awake now reading scrollback
02:42 🔗 arkiver I'm off to bed
02:42 🔗 arkiver please do requeue the microsoft items if nothing new is being returned (and then assuming all out items are inactive)
02:43 🔗 kiska Wilco
03:03 🔗 yano all the out items are waiting to be uploaded
03:03 🔗 yano most of them are me
03:10 🔗 VADemon has quit IRC (left4dead)
03:30 🔗 antomati_ has joined #archiveteam-bs
03:32 🔗 antomatic has quit IRC (Read error: Operation timed out)
03:37 🔗 qw3rty_ has joined #archiveteam-bs
03:38 🔗 Mateon1 has quit IRC (Remote host closed the connection)
03:41 🔗 Mateon1 has joined #archiveteam-bs
03:45 🔗 qw3rty__ has quit IRC (Read error: Operation timed out)
04:23 🔗 kiska Who is ready for me to requeue 6k items :D
04:24 🔗 kiska YEET!
04:45 🔗 mgrandi so on the subject of 'targets', is there any documentation on what is needed to be a target?
04:46 🔗 mgrandi my friend says he has a lot of disk space and is curious
04:48 🔗 Craigle I don't know all of the details, but I know upload bandwidth is a big one. I believe 1 Gbit/s up and down is considered a minimum.
04:56 🔗 mgrandi "I have gigabit symmetric"
04:59 🔗 mgrandi but regardless, would be nice to document it
05:05 🔗 kiska https://www.archiveteam.org/index.php?title=Dev/Staging :D
05:10 🔗 kiska An addendum: Fast disks(min SATA SSDs with redundancy, RAID1 minimum, RAID5/6 preferred, NVMe drives preferred with redundancy RAID1 min, RAID5/6 preferred)
06:04 🔗 kiska mgrandi: ^
06:18 🔗 endrift oh I blinked and missed it
06:19 🔗 endrift no need for my bandwidth this month it seems
06:23 🔗 purplebot has joined #archiveteam-bs
06:42 🔗 kiska endrift: Will be doing requeue shortly
06:47 🔗 endrift ah, ok
06:48 🔗 DigiDigi has quit IRC (Read error: Operation timed out)
06:50 🔗 kiska endrift: Wanna spin something up?
06:51 🔗 endrift I spun up 3 docker containers in advance
06:51 🔗 endrift I don't know if/what the deadline is
06:51 🔗 endrift I have not been paying attention
06:51 🔗 endrift had other priorities today
06:51 🔗 kiska lol
06:51 🔗 kiska I requeued and all gone
06:53 🔗 endrift I don't appear to have caught any of them haha
06:53 🔗 endrift ah well
06:53 🔗 kiska https://server8.kiska.pw/uploads/47ad6e57ff56618a/image.png And the targets appear... HOT
07:18 🔗 DigiDigi has joined #archiveteam-bs
07:50 🔗 mgrandi i'm at 9800/10,000 for the last of the files i was getting
07:50 🔗 mgrandi i'll talk to you @kiska later tomorrow possibly about when i should rsync these files over
07:51 🔗 NOTBARN has joined #archiveteam-bs
07:53 🔗 mgrandi 2.5 TiB so far, in 5 GiB .warc.gz chunks
07:53 🔗 NOTBARN has quit IRC (Remote host closed the connection)
07:53 🔗 Kaz morning
07:54 🔗 Kaz whats the situ
07:54 🔗 mgrandi i think pretty much everything on the microsoft download site has been archived, minus the 200 files i have left
08:01 🔗 Mateon1 has quit IRC (Remote host closed the connection)
08:04 🔗 Mateon1 has joined #archiveteam-bs
08:17 🔗 MillerBOS has joined #archiveteam-bs
08:27 🔗 mgrandi has quit IRC (Leaving)
08:37 🔗 mgrandi has joined #archiveteam-bs
08:54 🔗 kiska Did we do this with the warrior project?
08:54 🔗 kiska Kaz: My machine is happily accepting data
08:55 🔗 Kaz does it still idle at 50% disk usage
08:55 🔗 kiska Cause of mixer
08:55 🔗 kiska Also it's not idling, I am still accepting 3 conns for mixer
08:56 🔗 Kaz why do you have 250G of mixer sitting around
08:56 🔗 Kaz those are some pretty huge videos if so
08:57 🔗 kiska I am pretty sure 1 or 2 of them are stale given I killed the connection just now, so I rm'd them
08:57 🔗 kiska I kicked them over to buyvm where I have more disk space
08:58 🔗 Kaz you've had 250G of mixer sitting there for a week? https://usercontent.irccloud-cdn.com/file/bJgYPpA0/image.png
08:58 🔗 kiska Yep, didn't know what finished and what didn't
08:59 🔗 Kaz O.o
08:59 🔗 kiska And since this project kicked off, I just killed all the connections and punted them over to buyvm
08:59 🔗 Kaz colour me confused
09:04 🔗 DLoader_ has joined #archiveteam-bs
09:04 🔗 kiska 7.5G in reddit, 12G in bitbucket(mercurial), 4.5G in the artists union, etc etc
09:04 🔗 BlueMaxim has joined #archiveteam-bs
09:05 🔗 kiska 85G in mixer, so I need to poke it further
09:05 🔗 SketchCo1 has joined #archiveteam-bs
09:05 🔗 swebb sets mode: +o SketchCo1
09:06 🔗 kiska Also the size is 468G :D\
09:06 🔗 Doran has joined #archiveteam-bs
09:06 🔗 Hecatz- has joined #archiveteam-bs
09:07 🔗 Meli-sama has joined #archiveteam-bs
09:07 🔗 synm0nger has joined #archiveteam-bs
09:07 🔗 Sanqui has joined #archiveteam-bs
09:07 🔗 Silvan has joined #archiveteam-bs
09:08 🔗 sirvy has joined #archiveteam-bs
09:08 🔗 Coderjo has joined #archiveteam-bs
09:08 🔗 tapedriv1 has joined #archiveteam-bs
09:08 🔗 Zebranky_ has joined #archiveteam-bs
09:09 🔗 sknebel_ has joined #archiveteam-bs
09:09 🔗 betamax_ has joined #archiveteam-bs
09:10 🔗 Meroje_ has quit IRC (Ping timeout: 745 seconds)
09:11 🔗 girst_ has joined #archiveteam-bs
09:12 🔗 N4Y_ has joined #archiveteam-bs
09:12 🔗 Meroje has joined #archiveteam-bs
09:12 🔗 gandalf_ has joined #archiveteam-bs
09:12 🔗 ephemer0l has quit IRC (Ping timeout: 745 seconds)
09:13 🔗 AlsoJAA_ has joined #archiveteam-bs
09:13 🔗 JAA sets mode: +o AlsoJAA_
09:13 🔗 Sanky has quit IRC (Ping timeout: 745 seconds)
09:13 🔗 nepeat_ has quit IRC (Ping timeout: 745 seconds)
09:14 🔗 DFJustin has quit IRC (Ping timeout: 745 seconds)
09:14 🔗 SynMonger has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 Craigle has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 acridAxid has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 Meli has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 atg has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 tapedrive has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 PotcFdk has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 step has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 Doranwen has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 Jonimoose has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 sirvy_ has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 underscor has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 Zebranky has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 Coderjo_ has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 BlueMax has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 SilSte has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 DLoader has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 girst has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 K4k__ has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 zhongfu has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 ivan has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 second has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 sknebel has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 coderobe has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 apache2 has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 mr_archiv has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 betamax has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 N4Y has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 gandalf has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 AlsoJAA has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 Hecatz has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 N4Y_ is now known as N4Y
09:15 🔗 gandalf_ is now known as gandalf
09:15 🔗 Hecatz- is now known as Hecatz
09:15 🔗 SketchCow has quit IRC (Ping timeout: 745 seconds)
09:15 🔗 girst_ is now known as girst
09:15 🔗 DLoader_ is now known as DLoader
09:18 🔗 second has joined #archiveteam-bs
09:21 🔗 Jonimoose has joined #archiveteam-bs
09:23 🔗 K4k__ has joined #archiveteam-bs
09:23 🔗 zhongfu has joined #archiveteam-bs
09:23 🔗 nepeat has joined #archiveteam-bs
09:23 🔗 apache2 has joined #archiveteam-bs
09:23 🔗 DFJustin has joined #archiveteam-bs
09:23 🔗 PotcFdk has joined #archiveteam-bs
09:24 🔗 step has joined #archiveteam-bs
09:24 🔗 Craigle has joined #archiveteam-bs
09:24 🔗 coderobe has joined #archiveteam-bs
09:24 🔗 ivan has joined #archiveteam-bs
09:24 🔗 atg has joined #archiveteam-bs
09:30 🔗 acridAxid has joined #archiveteam-bs
10:02 🔗 mgrandi My files for Microsoft dl center finished
10:03 🔗 mgrandi I dunno if wget has problems counting a file if it has an error (and retries) but it says it downloaded like 9997/10,000 at the "finished" message
10:03 🔗 JAA My third scan finished as well with lots of false 404s. I fucking hate this site.
10:03 🔗 JAA mgrandi: Can you check whether there were any 404s or other non-200s?
10:04 🔗 mgrandi Even with retries?
10:04 🔗 JAA No, without.
10:04 🔗 JAA This is the slower scan I started before we talked about that.
10:05 🔗 mgrandi I'm not sure if the wget log logs any of those, I got a handful of "read error: (success) in headers"
10:06 🔗 mgrandi I think my command told it to retry forever for files
10:09 🔗 JAA 121 files were missed by this third scan but found by one of the previous two. Based on some brief spot-checking, they all still exist. This sucks.
10:10 🔗 JAA 11 additional files found in the third scan.
10:10 🔗 JAA https://transfer.notkiska.pw/10q0Kt/microsoft-download-center-files-scan-3-sorted-new.jsonl
10:15 🔗 mgrandi Maybe scan just the 404s in your next scan?
10:15 🔗 mgrandi But yeah, I'll check out my data tomorrow, I'm just gonna move it to a smaller drive real fast and zzz
10:26 🔗 JAA mgrandi: The reason I asked about 404s/non-200s is that my extraction of the download links isn't entirely clean. I'm not parsing HTML but just extracting strings from it. So in some edge cases, the links might be slightly incorrect and result in 404s or other errors.
10:28 🔗 JAA Queued the 11 extra files.
10:28 🔗 JAA And naturally, Fusl already ate them all.
10:30 🔗 mgrandi Ah, well I don't think wget encountered any of those
10:30 🔗 JAA Alright, that's good.
10:30 🔗 mgrandi I would need strings to search for, I'll compare against the list tomorrow to see what files are missing or if it is just a reporting bug when wget finishes
10:37 🔗 JAA Yeah, not sure about wget's log format.
10:53 🔗 mtntmnky has quit IRC (Remote host closed the connection)
10:53 🔗 mtntmnky has joined #archiveteam-bs
11:00 🔗 trc has joined #archiveteam-bs
11:18 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
11:20 🔗 ephemer0l has joined #archiveteam-bs
12:10 🔗 Pixi` has quit IRC (Ping timeout: 260 seconds)
13:27 🔗 AlsoJAA_ is now known as AlsoJAA
13:43 🔗 JAA I started a fourth scan. I'm now retrying 404 redirects four times with a 5-sec sleep between requests. I'm only writing the last 404 redirect to WARC. Hopefully, that produces a clean dataset now.
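The retry approach JAA describes here can be sketched in Python roughly as follows. This is a minimal illustration, not the actual scanner: `fetch_with_retries` and the `fetch` callable are hypothetical names, and `fetch` is assumed to return the final HTTP status for a URL (the real scan dealt with redirects to a 404 page and WARC writing, which are left out).

```python
import time
from typing import Callable


def fetch_with_retries(
    fetch: Callable[[str], int],
    url: str,
    retries: int = 4,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return the status for url, retrying while the result is a 404.

    fetch is a hypothetical callable returning an HTTP status code.
    At most `retries` extra attempts are made, with `delay` seconds
    between requests; the status of the last attempt is returned.
    """
    status = fetch(url)
    attempts = 0
    while status == 404 and attempts < retries:
        sleep(delay)  # back off before asking again
        status = fetch(url)
        attempts += 1
    return status
```

Only the result of the last attempt is returned, which mirrors the idea of keeping just the final 404 instead of every intermediate one.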
13:45 🔗 britmob has quit IRC (Ping timeout: 265 seconds)
13:48 🔗 arkiver Craigle: you're doing a 64 GB item for the microsoft project
13:48 🔗 arkiver is that one still running?
13:50 🔗 JAA I noticed that as well. Fusl was running it before I requeued it.
13:50 🔗 JAA Had been out for 4 hours at the time though with little stuff still coming in, so I expected that to be dead.
13:55 🔗 arkiver I'll run a machine as well and get that one archived
13:55 🔗 arkiver if we have two copies of it, it's not a problem
13:58 🔗 arkiver downloading at 400 Mbit :)
14:15 🔗 kiska And I requeued it at some point as well
14:16 🔗 kiska Oh well we'll have some number of copies of it
15:33 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
15:39 🔗 Jens has quit IRC (Quit: Jens)
15:40 🔗 Jens has joined #archiveteam-bs
15:40 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
15:55 🔗 JAA Fourth scan should be done within the next hour I think.
16:04 🔗 TC01 has quit IRC (Read error: Operation timed out)
16:30 🔗 Craigle arkiver: I don't show that running. But I would also expect it to fail since it's larger than the storage on those cloud machines
16:30 🔗 Craigle I didn't expect any items that large
16:42 🔗 arkiver all done
16:42 🔗 arkiver I just finished that 65 GB item
16:47 🔗 JAA Yay
16:48 🔗 JAA Fourth scan should be done in 10 minutes or so.
16:48 🔗 JAA That should finally be complete, I hope.
16:48 🔗 arkiver awesome
16:49 🔗 kiska So what do we do with the files that mgrandi has downloaded?
16:50 🔗 JAA I'd say either verify that we got all of it in the DPoS project data or just throw it on IA as well to be safe.
16:52 🔗 arkiver yeah if it's just a TB or two, I guess it can be dumped to IA
17:03 🔗 britmob has joined #archiveteam-bs
17:13 🔗 JAA WTF? There are *still* false 404s.
17:13 🔗 JAA I retried those four times with sleeps and clearing cookies...
17:15 🔗 kiska Broken?
17:15 🔗 JAA Actually, hold on, will need to investigate more.
17:22 🔗 JAA Soo, it turns out that Microsoft is replacing files on items occasionally.
17:22 🔗 JAA E.g. https://www.microsoft.com/en-us/download/details.aspx?id=41653 had a file from last week until yesterday.
17:26 🔗 JAA Yeah, no false 404s anymore now I think.
17:26 🔗 JAA A bunch of replaced files and three new IDs.
17:27 🔗 JAA Just to underline that it's not at all sequential: the new IDs are 39717, 51495, and 100429.
17:29 🔗 JAA And the only new files I discovered on this fourth scan are two updated ones. Cool.
17:30 🔗 JAA Looks like they delete the replaced files, by the way, e.g. https://download.microsoft.com/download/0/1/8/018E208D-54F8-44CD-AA26-CD7BC9524A8C/PublicIPs_20200727.xml from 41653 is a 404 now.
17:35 🔗 mgrandi Did you guys queue the entire list of data? Including IDs like 0-30,000? (Aka what i downloaded)
17:35 🔗 JAA mgrandi: Yes
17:36 🔗 mgrandi Oh so you don't need my data then lol
17:37 🔗 JAA *If* everything went fine, no. There's always a chance something went wrong for some reason though.
17:37 🔗 mgrandi Is there a way to like get a list/hash of files within a warc so we can see if all the files were grabbed by someone else?
17:37 🔗 mgrandi Yeah, I won't delete it yet
17:37 🔗 JAA That's the CDX.
17:38 🔗 mgrandi Well I haven't uploaded mine to IA yet
17:38 🔗 mgrandi So I don't have that
17:38 🔗 JAA Right
17:39 🔗 JAA You could generate a CDX somehow from your WARCs for comparison I guess.
17:42 🔗 JAA But probably not all data is on IA yet from the DPoS project.
17:43 🔗 JAA Two more files from my fourth scan queued.
17:43 🔗 kiska I think all of it is, besides the 2 items
17:43 🔗 JAA And done already. :-)
17:43 🔗 kiska :D
17:44 🔗 kiska Actually I have 16G on my machine to upload
17:47 🔗 JAA Also, another correction, those three IDs mentioned above are also updated, not new.
17:48 🔗 mgrandi Yeah, I'll have to look to see if I can somehow generate them without too much hassle
17:49 🔗 Lord_Nigh has joined #archiveteam-bs
18:16 🔗 TC01 has joined #archiveteam-bs
18:27 🔗 systwi_ has joined #archiveteam-bs
18:33 🔗 systwi has quit IRC (Ping timeout: 622 seconds)
18:53 🔗 not_barn has joined #archiveteam-bs
19:08 🔗 MaximeleG has joined #archiveteam-bs
19:12 🔗 Larsenv_ has joined #archiveteam-bs
19:12 🔗 MaximeleG has quit IRC (Client Quit)
19:16 🔗 Larsenv has quit IRC (Read error: Operation timed out)
20:28 🔗 lennier1 mgrandi: Warcat is good for extracting files from a .warc. https://github.com/chfoo/warcat
20:32 🔗 mgrandi I don't really want to extract them since I have 3 TB of files...I just mainly want to iterate over the files in the warc and get a file size / sha256 hash or something so I can compare it to the files that everyone else got to make sure we have everything before I nuke the files I have
20:34 🔗 mgrandi I can probably whip up a script to do it myself later if one doesn't exist
20:46 🔗 Larsenv_ is now known as Larsenv
20:51 🔗 JAA mgrandi: The CDX contains the URL and the SHA-1 in base32, so you just need to extract those from the file.
20:52 🔗 JAA Here's a dirty way: zgrep '^WARC-Target-URI: \|^WARC-Payload-Digest: \|^WARC-Type: ' $file + further processing that into some format you can diff/comm against an extract from the CDXs.
20:52 🔗 mgrandi So wget-at creates that inside the warc.gz?
20:52 🔗 JAA Yeah, the WARC-Payload-Digest is in that format.
20:52 🔗 mgrandi I didn't realize, thought IA added those
20:52 🔗 mgrandi Well that makes it easier
20:53 🔗 JAA IA might recompute them, not sure. There's some ambiguity/misunderstandings about what the payload digest should be exactly in some cases (chunked responses), but let's hope that doesn't apply here. :-)
20:54 🔗 JAA I don't see chunked TE in a brief test, so I think that's fine.
20:55 🔗 JAA Note that WARC lines end with CRLF, not just LF. You might want to `tr -d '\r'` somewhere in the processing.
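A slightly less dirty variant of the zgrep idea, as a Python sketch. It walks a `.warc.gz` record by record, reads only the WARC headers, skips each record body using `Content-Length`, and yields URL and payload digest for response records. Assumptions, not confirmed by the log: the function name `iter_url_digests` is made up, the digest header is assumed to look like `sha1:` followed by a Base32 string, and the target URI is assumed to be possibly wrapped in `<...>`.

```python
import gzip
from typing import Iterator, Tuple


def iter_url_digests(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, digest) for every response record in a .warc.gz file."""
    with gzip.open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break  # end of file
            if not line.startswith(b"WARC/"):
                continue  # blank separator lines between records
            headers = {}
            while True:
                # WARC header lines end with CRLF, so strip both bytes
                line = f.readline().rstrip(b"\r\n")
                if not line:
                    break  # empty line ends the header block
                name, _, value = line.partition(b":")
                headers[name.strip().lower()] = value.strip()
            # Skip the record body in chunks to keep memory use small
            remaining = int(headers.get(b"content-length", b"0"))
            while remaining > 0:
                chunk = f.read(min(remaining, 1 << 20))
                if not chunk:
                    break
                remaining -= len(chunk)
            if headers.get(b"warc-type") != b"response":
                continue
            digest = headers.get(b"warc-payload-digest")
            if digest is None:
                continue
            # Assumption: URI may be written as <url>; drop the brackets
            url = headers.get(b"warc-target-uri", b"").strip(b"<>").decode()
            # Assumption: drop an algorithm prefix such as "sha1:"
            yield url, digest.decode().split(":", 1)[-1]
```

Sorting the output of this generator gives a list that can be compared against an extract from the CDX files.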
20:56 🔗 mgrandi I mainly just want a sorted list of the files (and hashes) I have, and then I need to get a sorted list of the files that were downloaded by the seesaw project and then diff them
20:56 🔗 mgrandi Cause I started downloading earlier than everyone so maybe I have some files that weren't picked up
20:58 🔗 JAA Looks like at least some of the DPoS items aren't derived yet, so no CDX for those.
20:58 🔗 JAA Actually, only the most recent couple items.
20:59 🔗 mgrandi I'm at "work" now so I'll look at it later
21:06 🔗 JAA mgrandi: Here's how to generate a list of URL + digest from the data on IA: download them using ia download --search 'Archive Team Microsoft Download Center collection:archiveteam' --glob '*.warc.os.cdx.gz' and process with zcat */*.cdx.gz | awk '{ print $3 " " $6 }' | grep -v '^b s$'
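The awk step above, plus the comparison mgrandi is after, could look roughly like this in Python. It is a sketch under the same assumptions as the one-liner: field 3 of each CDX line is the original URL, field 6 is the digest, and the header line starts with `CDX`. The names `parse_cdx_lines` and `missing_from_reference` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def parse_cdx_lines(lines: Iterable[str]) -> Set[Pair]:
    """Collect (url, digest) pairs from CDX lines, skipping header lines."""
    pairs: Set[Pair] = set()
    for line in lines:
        fields = line.split()
        # The header line looks like " CDX N b a m s k r M S V g"
        if len(fields) < 6 or fields[0] == "CDX":
            continue
        pairs.add((fields[2], fields[5]))  # same columns as awk's $3 and $6
    return pairs


def missing_from_reference(local: Set[Pair], reference: Set[Pair]) -> List[Pair]:
    """Return the sorted pairs present locally but absent from the reference."""
    return sorted(local - reference)
```

Feeding the decompressed CDX lines into `parse_cdx_lines` and the locally extracted pairs into `missing_from_reference` gives the files that only exist in the local grab.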
21:06 🔗 mgrandi Thanks!
21:08 🔗 JAA But you might want to wait for all items to finish deriving before running that.
21:09 🔗 JAA Also, not sure if there's any data still on the targets.
21:40 🔗 Wingy has quit IRC (Read error: Operation timed out)
21:44 🔗 Wingy has joined #archiveteam-bs
21:48 🔗 Wingy has quit IRC (Client Quit)
21:48 🔗 Pixi has joined #archiveteam-bs
22:09 🔗 Wingy has joined #archiveteam-bs
22:09 🔗 mgrandi Yeah I'll wait
22:52 🔗 Wingy has quit IRC (Read error: Operation timed out)
22:56 🔗 Wingy has joined #archiveteam-bs
23:05 🔗 Arcorann has joined #archiveteam-bs
23:30 🔗 Wingy has quit IRC (Read error: Operation timed out)
23:31 🔗 Wingy has joined #archiveteam-bs
23:35 🔗 lunik19 has joined #archiveteam-bs
23:36 🔗 lunik1 has quit IRC (Ping timeout: 265 seconds)
23:36 🔗 lunik19 is now known as lunik1
