[00:00] *** Gallifrey has joined #archiveteam-bs
[00:02] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[00:05] *** Gallifrey has joined #archiveteam-bs
[00:07] Oh, let's just back up all the Microsofts
[00:14] *** Gallifrey has quit IRC (Ping timeout: 265 seconds)
[00:21] *** Gallifrey has joined #archiveteam-bs
[00:24] *** BlueMax has joined #archiveteam-bs
[00:27] *** Raccoon has quit IRC (Read error: Operation timed out)
[00:29] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[00:31] *** Gallifrey has joined #archiveteam-bs
[00:38] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[00:40] 566/51298
[00:40] might need to start multiple then
[00:54] *** Gallifrey has joined #archiveteam-bs
[01:00] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[01:07] *** Gallifrey has joined #archiveteam-bs
[01:21] *** lennier2 has joined #archiveteam-bs
[01:21] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[01:27] *** lennier1 has quit IRC (Read error: Operation timed out)
[01:27] *** lennier2 is now known as lennier1
[02:56] *** Gallifrey has joined #archiveteam-bs
[03:27] is it safe to ctrl+c wget-at?
[03:27] or can i have it stop gracefully?
[03:29] What's the status of the grab?
[03:30] its going but probably not fast enough
[03:30] was gonna try and split up the file
[03:30] Oof :c
[03:30] file 3463
[03:30] out of 51298
[03:31] Try touching "STOP", maybe?
[03:31] Not sure how that's implemented
[03:33] I.e. where in the process it checks
[03:34] i think thats seesaw
[03:34] *** mtntmnky has joined #archiveteam-bs
[03:34] mainly concerned that the warc will be like malformed or something
[03:34] Oh, I think you're right
[03:35] *** mtntmnky_ has quit IRC (Remote host closed the connection)
[03:35] Yeah, it is seesaw
[03:37] If you're really worried about it, you can just wait until it makes a new warc, stop it, and then redo that item
[03:37] Or item(s), if they're small
[03:37] yeah, i think it completes the item even if it goes over the 'warc limit'
[03:37] Files, items, same thing
[03:37] so i think there are no half finished items
[03:38] That makes sense; I don't think that would be a valid warc
[03:38] *** qw3rty__ has joined #archiveteam-bs
[03:43] 566 (https://www.microsoft.com/en-us/download/details.aspx?id=1387) looks like it puts you in 2009
[03:43] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[03:45] yeah last item was https://download.microsoft.com/download/C/6/3/C63FC9B6-CC3C-4AA4-97CC-38D6E5EB43FC/WindowsServer2003.WindowsXP-KB2286198-x64-ENU.exe
[03:46] *** qw3rty_ has quit IRC (Read error: Operation timed out)
[03:48] *** Gallifrey has joined #archiveteam-bs
[03:48] Oh, missed the second update; that's better (by a year or so)
[03:58] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[04:13] ok i split it into 10,000 line files, doing 3 of them at once
[04:15] doing like 200MBit/s now
[04:18] Hopefully that keeps up
[04:19] *** Gallifrey has joined #archiveteam-bs
[04:23] warrior bot would make this much easier lol
[04:24] Yeah
[04:28] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[04:30] Would be good for short-notice things like this
[04:31] Where there's a lot of data that needs to be moved in a shorter time than all the necessary people can be expected to get on and coordinate
[04:32] *** Gallifrey has joined #archiveteam-bs
[04:32] well yeah, even if its just 1 person, i had to basically take the file, split it up, start 4 tabs in byobu, edit the command to be slightly different 4 times, etc
[04:33] and then its not very good at any one process stopping, etc
[04:33] just having a queueing system would help a lot
[04:33] and yes the addition of other people helping would be great too
[05:03] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[05:08] *** Gallifrey has joined #archiveteam-bs
[05:21] "Read error (Success.) in headers."
[05:21] whats that mean
[05:27] Which item was it on?
[05:29] The only thing I can find (besides spam pages) is https://marc.info/?t=106397208600003&r=1 , which relates it to TLS problems
[05:29] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[05:33] *** Gallifrey has joined #archiveteam-bs
[05:48] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[05:51] i don't know which item, i have retry turned on so i think it should retry it..
[05:52] *** Gallifrey has joined #archiveteam-bs
[06:02] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[06:05] *** Gallifrey has joined #archiveteam-bs
[06:15] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[06:17] *** Gallifrey has joined #archiveteam-bs
[06:22] *** bsmith093 has quit IRC (Quit: Leaving.)
[06:23] *** bsmith093 has joined #archiveteam-bs
[07:01] 2020-08-03 07:01:08 URL:https://download.microsoft.com/download/4/B/3/4B300C83-A439-4E9F-B889-60FD8B83D7F2/Lync2013_SP-10.bin [1000000000/1000000000]
[07:01] what is that file size
[07:19] also i assume the '8' means it retried 8 times:
[07:19] 2020-08-03 07:14:21 URL:https://download.microsoft.com/download/4/B/3/4B300C83-A439-4E9F-B889-60FD8B83D7F2/Lync2013_SP-3.bin [1000000000/1000000000] -> "/mnt/volume_sfo3_04/ms_dl/wget02.tmp.tmp.tmp.tmp.tmp.tmp.tmp.tmp.tmp" [8]
[07:26] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[07:27] *** Gallifrey has joined #archiveteam-bs
[07:29] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[07:36] *** Gallifrey has joined #archiveteam-bs
[07:45] @JAA, did you download the Details.aspx pages?
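
The manual fan-out described above (split the URL list into 10,000-line chunks and run a few wget-at instances side by side, each writing its own WARC series) might look roughly like the sketch below. The chunk size, file names and paths are illustrative only, and wget-at is assumed to accept the standard GNU Wget options it inherits (--input-file, --warc-file, --warc-max-size, --tries).

```python
#!/usr/bin/env python3
# Sketch of the manual fan-out: split a URL list into 10,000-line chunks and
# run one wget-at process per chunk, each writing its own WARC series.
import subprocess
from pathlib import Path

CHUNK_SIZE = 10_000
urls = Path("download_urls.txt").read_text().splitlines()  # hypothetical input list

procs = []
for n, start in enumerate(range(0, len(urls), CHUNK_SIZE)):
    chunk = Path(f"section_{n:02d}.txt")
    chunk.write_text("\n".join(urls[start:start + CHUNK_SIZE]) + "\n")
    # In the log only a few of these ran at once; launching them all is the lazy version.
    procs.append(subprocess.Popen([
        "wget-at",
        "--input-file", str(chunk),
        "--warc-file", f"ms_dl_{n:02d}",   # one WARC series per section
        "--warc-max-size", "5G",           # roll over to a new WARC part
        "--tries", "8",
        "--output-document", f"wget{n:02d}.tmp",
    ]))

for p in procs:
    p.wait()
```
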
[08:10] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[08:27] mgrandi: Looks like it's a single file (a VM image, I think) split into multiple parts
[08:28] Hence the size - that was the cutoff someone decided to use when splitting
[08:28] https://www.microsoft.com/en-us/download/details.aspx?id=40267
[08:28] 953.7 MiB ~= the size you got
[08:28] w/i rounding error
[08:30] Don't know about the multiple tries, though; works fine in curl
[08:31] If that's what that output is indicating
[08:34] *** Gallifrey has joined #archiveteam-bs
[08:42] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[08:43] *** Mayeau has joined #archiveteam-bs
[08:43] *** Mayonaise has quit IRC (Read error: Operation timed out)
[08:45] *** Gallifrey has joined #archiveteam-bs
[09:02] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[09:08] *** Raccoon has joined #archiveteam-bs
[09:51] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[09:53] *** Gallifrey has joined #archiveteam-bs
[09:59] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[10:03] *** Gallifrey has joined #archiveteam-bs
[10:13] *** TC01 has joined #archiveteam-bs
[10:14] *** TC01_ has quit IRC (Read error: Operation timed out)
[10:25] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[10:29] *** Gallifrey has joined #archiveteam-bs
[10:35] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[10:38] *** Gallifrey has joined #archiveteam-bs
[10:40] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[10:44] *** Gallifrey has joined #archiveteam-bs
[10:45] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[10:49] *** Gallifrey has joined #archiveteam-bs
[11:05] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[11:09] *** Gallifrey has joined #archiveteam-bs
[11:28] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[11:33] *** Dallas8 has quit IRC (Quit: Dallas8)
[11:33] *** Dallas has joined #archiveteam-bs
[11:34] *** BlueMax has joined #archiveteam-bs
[12:09] *** Gallifrey has joined #archiveteam-bs
[12:10] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[12:13] *** Gallifrey has joined #archiveteam-bs
[12:15] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[12:16] *** Gallifrey has joined #archiveteam-bs
[12:36] *** Mayeau is now known as Mayonaise
[12:42] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[12:44] *** Gallifrey has joined #archiveteam-bs
[12:47] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[12:50] *** Gallifrey has joined #archiveteam-bs
[12:52] *** Gallifrey has quit IRC (Remote host closed the connection)
[12:59] *** Gallifrey has joined #archiveteam-bs
[13:05] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[13:07] *** Gallifrey has joined #archiveteam-bs
[13:18] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[13:20] *** yano has quit IRC (Quit: WeeChat, The Better IRC Client, https://weechat.org/)
[13:22] *** Gallifrey has joined #archiveteam-bs
[13:23] *** TC01 has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
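
For reference, the arithmetic behind the "953.7 MiB" figure in the [08:28] messages: each Lync2013_SP-*.bin part is exactly 1,000,000,000 bytes, which works out to that size in binary mebibytes.

```python
# 1,000,000,000-byte parts expressed in binary mebibytes:
print(1_000_000_000 / 2**20)  # ~953.67, i.e. the "953.7 MiB" quoted above
```
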
[13:23] *** TC01 has joined #archiveteam-bs
[13:26] *** yano has joined #archiveteam-bs
[13:27] *** svchfoo3 sets mode: +o yano
[13:30] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[13:37] *** Gallifrey has joined #archiveteam-bs
[13:42] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[13:46] *** Gallifrey has joined #archiveteam-bs
[14:07] mgrandi: I downloaded details.aspx and confirmation.aspx for all IDs between 1 and 60k (inclusive).
[14:07] how is microsoft going?
[14:08] if I can help with a project, please let me know
[14:08] will try to get some more projects started now for other upcoming deadlines
[14:08] arkiver: mgrandi is trying to grab it and was going too slow, but they're asleep now so not sure about the most current status.
[14:09] i guess the "too slow" part is not on the side of microsoft?
[14:09] if we have a list of URLs we can do a quick project to get everything backed up
[14:09] but will maybe wait until mgrandi is back online
[14:09] I would assume that Microsoft's servers are fast.
[14:09] It's a CDN
[14:10] warriorbot would be great for this :)
[14:10] :D
[14:10] JAA: ^
[14:10] Except it doesn't work atm
[14:10] Last statement was 200 Mb/s, which is way too slow to grab it in time.
[14:10] 1 Gb/s = ~16 hours
[14:10] how much do we expect?
[14:10] 7.1 TB
[14:10] Across 51k files or so.
[14:10] I still need to scan the higher IDs though.
[14:11] we can scan them with a project?
[14:11] what do they look like?
[14:11] We can, but some downloads are huge.
[14:11] well ping me if you need help, I have time :)
[14:12] if it's large, then we can put a warning on the project
[14:13] 20:45:37 <@JAA> https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl
[14:13] 19:17:02 <@JAA> Microsoft Download Center: I found 51298 files with a total size of about 7.1 TB.
[14:13] 20:58:31 <@JAA> Here are the ten most frequent file extensions: 13780 msu, 13111 exe, 6812 zip, 3927 msi, 3770 pdf, 2228 pptx, 1214 docx, 888 bin, 828 doc, 483 xps
[14:13] 21:06:32 <@JAA> Further statistics: top ten by size in GiB: 4100.1 .zip, 864.1 .exe, 597.3 .bin, 355.1 .iso, 182.3 .rar, 149.6 .cab, 118.1 .msi, 90.5 .msu, 17.6 .wmv, 16.7 .ISO
[14:14] That's the most important bits from last night.
[14:14] Only some of this is getting deleted, but we have no way to tell what.
[14:14] (Well, not without downloading it first.)
[14:14] yeah 7 TB is fine for this stuff
[14:15] I only scanned up to ID 60k, but I found some downloads at higher IDs. They're weird and not at all sequential.
[14:15] how high are those IDs?
[14:15] Couldn't figure out the upper bound.
[14:15] Highest I've seen is https://www.microsoft.com/en-us/download/details.aspx?id=100688
[14:16] we could also make it a discovery project, with a txt file in case something is found
[14:16] (Note, the "Date Published" is completely unreliable.)
[14:16] if microsoft can handle it, we can probably scan up to a billion or so IDs in a day
[14:19] One thing worth mentioning is that Microsoft's 404 handling is ... weird.
[14:19] What they do is 302-redirect to https://www.microsoft.com/en-us/download/404Error.aspx
[14:19] that's fine
[14:19] But I've seen two cases where I got such a redirect even though the page existed.
[14:20] uh
[14:20] Retrying succeeded there.
[14:20] oh no
[14:20] *** ephemer0l has quit IRC (Read error: Connection reset by peer)
[14:20] So uh yeah
[14:20] I was only able to detect this on the two cases because it failed on confirmation.aspx.
[14:20] If I got any false 404 redirects on details.aspx, my scan might be incomplete.
[14:31] arkiver: Here's my qwarc spec file for the scan. For the most part, it should be pretty obvious what's happening even if you're not familiar with qwarc. https://transfer.notkiska.pw/67hNZ/microsoft-download-center.py
[14:34] (I handled the one case of 404 idiocy manually.)
[14:37] *** prq has joined #archiveteam-bs
[14:37] @ me if y'all need some instances spun up, it's trivial for me to spin up DO droplets with custom code.
[15:13] In other news, my Clutch discovery scan has arrived at March 2020.
[15:17] Should we just repurpose #stops.tv for clutch? "D
[15:24] I'm at 644k posts/clips so far, by the way.
[15:26] can we move clutch to hackint?
[15:26] JAA: do we have a channel?
[15:26] Nope, not yet.
[15:26] #pearls ?
[15:26] sure :P
[15:27] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[16:22] getting a project ready for microsoft since there's so little time left
[16:22] for the URLs in the jsonl file initially, can add more later
[16:27] Sounds good. I'm setting up another scan now to higher IDs and grabbing confirmation.aspx regardless of whether details.aspx exists. That should massively reduce the chance of missing something.
[16:27] You know my mobile, call me when you have things ready, I am in bed
[16:28] I can spin up reserves on hetz cloud if needed
[16:31] *** ephemer0l has joined #archiveteam-bs
[16:40] New scan running now.
[16:43] *** VADemon has joined #archiveteam-bs
[16:44] JAA: yes please, are you getting all the HTML into WARCs?
[16:53] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[17:03] *** RichardG_ has joined #archiveteam-bs
[17:04] kiska: we could use a target :)
[17:09] *** RichardG has quit IRC (Read error: Operation timed out)
[17:10] is there a channel for MS download center?
[17:13] on
[17:19] items are online, grab script coming in a bit
[17:35] also JAA THOSE LONG LINES
[17:36] JAA: I won't be adding support for checking IDs yet to the project
[17:36] will initially just be about archiving the download.microsoft.com lines
[17:36] from the jsonl files
[17:36] file*
[17:38] all we need now is targets
[17:38] HCross: kiska: do you have some target we can add for the microsoft downloads project?
[17:39] the tracker name will be microsoft-download-center
[17:39] official name is Microsoft Download Center
[17:45] arkiver: I'm grabbing details.aspx and confirmation.aspx into WARC, yes. I'm not grabbing page requisites etc.
[17:46] Already got a couple of cases of 404 on details.aspx and 200 on confirmation.aspx. :-|
[17:46] rsync://rsync.hel1.kiska.pw/microshaft/:downloader/
[17:46] I'm rescanning 1-60k as well due to that mess.
[17:46] :D
[17:47] Correction: I rescanned*
[17:47] Now scanning 60k-250k.
[17:48] IA_ITEM_TITLE="Archive Team Microsoft Download Center:"
[17:48] IA_ITEM_PREFIX="archiveteam_microsoft_"
[17:48] FILE_PREFIX="microsoft_"
[17:49] Yes? No?
[17:50] 'microsoft' seems a bit too generic.
[17:50] Just a reminder there is like 4 active projects on this machine, so don't slam it too hard?
[17:50] Just 7 TB in a few hours, not too big.
[17:50] :-)
[17:50] IA_ITEM_PREFIX="archiveteam_microsoft_download_"
[17:50] FILE_PREFIX="microsoft_download_"
[17:50] Better?
[17:51] Sounds good to me.
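
A much-simplified, requests-based stand-in for the scan logic discussed above (the real thing is the qwarc spec linked at [14:31], which runs asynchronously and writes everything into WARCs): probe both details.aspx and confirmation.aspx without following redirects, treat a 302 to 404Error.aspx as "missing", and retry the exact same request a few times because that redirect is sometimes spurious.

```python
#!/usr/bin/env python3
# Simplified sketch of the ID scan; not the actual qwarc spec.
import requests

ERROR_PAGE = "https://www.microsoft.com/en-us/download/404Error.aspx"

def page_exists(url, session, attempts=3):
    for _ in range(attempts):
        r = session.get(url, allow_redirects=False, timeout=60)
        if r.status_code == 200:
            return True
        if r.status_code == 302 and r.headers.get("Location", "").startswith(ERROR_PAGE):
            continue  # possibly a false 404; retry the exact same request
        # 403s (rate limiting), 5xx, etc. are also worth another attempt
    return False

def download_id_exists(dl_id):
    with requests.Session() as s:  # fresh session so cookies don't leak between IDs
        details = page_exists(f"https://www.microsoft.com/en-us/download/details.aspx?id={dl_id}", s)
        confirmation = page_exists(f"https://www.microsoft.com/en-us/download/confirmation.aspx?id={dl_id}", s)
        # Checking both pages catches the cases where only one of them falsely 404s.
        return details or confirmation

if __name__ == "__main__":
    print(download_id_exists(40267))
```
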
[17:51] If so, I am going to run those settings
[17:52] For now 1 packer and 1 uploader is running
[18:00] I am trying to not make the item prefix too long as there is a char limit
[18:01] If uploads are backed up just get clients to buffer the uploads
[18:02] There is a limit, but I think it's around 100 chars, so that should be fine.
[18:02] I hit it during the NRATV uploads.
[18:16] i have 1.7tb right now
[18:16] i also think that the box i'm running on is too slow
[18:18] i can also give you what files i have downloaded if this is gonna be made an actual project
[18:19] (i split the list of files into sections and have 3 instances of wget-at running)
[18:27] stuff still seems to be downloading @JAA @kiska
[18:29] arkiver: ^
[18:32] i have a 8 tb drive that i'm paying for atm, so if we get a proper warrior set up i can beef up the computer
[18:37] and i am hitting like 200MBits/s but every time a file finishes it has to add it to the warc which stops any download activity for that process so its not consistent
[18:43] 11571 (or 8969/10,000) on section 00
[18:44] 6064/10,000 on section 1
[18:45] 5478/10,000 on section 2
[18:45] section 3 and 4 haven't started
[18:46] 3 is 10,000 files, 4 is 7685 files
[18:49] and yeah, the 1.7 TB is compressed warcs so i don't know how much uncompressed data that is
[19:43] So my second scan finished and discovered 54045 files (instead of 51298 in the first).
[19:48] Oof
[19:48] I did get 404s on both details.aspx and confirmation.aspx. :-(
[19:48] E.g. https://www.microsoft.com/en-us/download/details.aspx?id=1557 is missing on my second scan.
[19:49] What the fuck, Microsoft?
[19:49] Overall, 22 files were discovered on the first scan but are missing from the second.
[19:50] So yeah, my scan is definitely incomplete...
[19:50] Works here, maybe you hit some rate limiter?
[19:50] Laverne: Nope, Microsoft's servers suck and occasionally return 404s (really 302s to a 404 page) under load.
[19:51] yeah
[19:51] There is rate limiting as well. I got a few 403s.
[19:51] project starting a few minutes
[19:51] now actually
[19:51] (the question)
[19:52] On a positive note, it looks like nothing was removed since my first scan.
[19:53] The highest existing ID up to 250k is 101583.
[19:54] (Well, assuming I didn't miss any higher ones due to 404s.)
[20:10] microsoft download center project is online
[20:10] kiska: added that as target, thanks
[20:10] do you want the 2 files that i haven't started on?
[20:10] err lists of files
[20:11] mgrandi: yeah, please PR them in the raw/ dir in repo microsoft-download-center-items
[20:11] likely we'll just get everything though
[20:11] I'm fine with queuing everything
[20:12] well if microsoft is indeed deleting stuff later today shouldn't we prioritize stuff that hasn't been downloaded by me?
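
The "Overall, 22 files were discovered on the first scan but are missing from the second" figure above boils down to a set difference over the two scans' JSONL outputs, along the lines of the sketch below; the "url" field name is an assumption about the file format.

```python
#!/usr/bin/env python3
# Diff the two discovery scans to quantify how many entries flap due to false 404s.
import json

def url_set(path):
    with open(path) as f:
        return {json.loads(line)["url"] for line in f}

first = url_set("microsoft-download-center-files-below-id-60000-sorted.jsonl")
second = url_set("microsoft-download-center-files-below-id-250000-sorted.jsonl")

print(f"first scan: {len(first)}, second scan: {len(second)}")
print(f"in first but not second (lost to false 404s on rescan): {len(first - second)}")
print(f"in second but not first (newly discovered or previously missed): {len(second - first)}")
```
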
[20:12] yeah
[20:12] *** Ravenloft has quit IRC (Remote host closed the connection)
[20:12] so if you have the lists, let me know
[20:12] kind of need them now then to queue something
[20:13] https://gist.github.com/fe8360f7bc6380aef52e028bc0749d07
[20:13] that is 'file 3'
[20:13] I didn't read much backlog, so I'll ping in case of questions
[20:13] https://gist.github.com/95a836abc19210d07384d294eaba22f7 file 4
[20:13] i am doing files 0, 1 and 2
[20:14] ok, if these lists are done the other will likely be queued as well
[20:14] i have not started on those two files at all ^ the other sections i am at least half way done with the rest of them
[20:18] oh hai
[20:19] hui
[20:19] hi
[20:19] So there are 2769 more files I discovered during my second scan. (That's 54045 in the second scan minus 51298 in the first scan plus 22 that should've been discovered in the first scan but were missed due to 404s.)
[20:19] file 00: 91% done, file 01: 66% done, file 02: 54% done
[20:20] Another 1.65 TB in those.
[20:20] i say its easier to just queue the 2 links i posted above (sections 3 and 4 of the initial list JAA gave me) and plus the additional files JAA just found now
[20:21] 2769 = 1 TB? geeze
[20:21] those are some big files
[20:21] Results of my second scan: https://transfer.notkiska.pw/7dyt8/microsoft-download-center-files-below-id-250000-sorted.jsonl
[20:22] The 2769 newly discovered files: https://transfer.notkiska.pw/WlXAX/microsoft-download-center-files-below-id-250000-sorted-new.jsonl
[20:23] the files i have are gonna have to be pulled apart and sorted anyway
[20:23] items added to the tracker!
[20:24] haha yano have fun :P
[20:24] is kickthebucket basically done?
[20:24] :D
[20:24] lol
[20:24] JAA: can you please put the new lists in raw in the repo?
[20:24] or the processed lists starting with num 03
[20:25] i'm technically working right now so if someone can take my two gist urls and PR them for me, thanks :)
[20:25] phuzion: You wanted to be pinged.
[20:25] mgrandi: those 'file 3' and 'file 4'?
[20:25] yeah
[20:25] arkiver: Yeah, will do in a second.
[20:25] mgrandi: I already processed them
[20:26] ok cool
[20:26] I was talking about the new stuff from JAA
[20:26] (after this i will contribute help to warrior bot, this was a very messy project lol )
[20:26] Kaz: but we could sure use another target, the faster this gets done the better
[20:27] _screams_
[20:27] how hard is it to set up a target?
[20:27] i have 8 tb i'm paying for on this machine
[20:27] anywhere from 'dead simple' to 'horrifyingly complicated' depending inversely on the number of braincells you have
[20:28] yano: someone is scaling up :P
[20:28] JAA: can you find the config in here? have scrolled but couldn't see from a quick glance
[20:28] :3
[20:28] everyone's favourite irc service doesn't let me search easily either
[20:29] Kaz: Around 17:50
[20:29] yeah just clocked it, ta
[20:29] yano: hmm, update coming up though
[20:29] dangit
[20:30] dictionary required?
[20:30] URL is in
[20:30] Kaz: no
[20:30] what's it gonna do if I specify one
[20:30] break, or?
[20:30] nothing
[20:30] ok cool
[20:30] I would add it by default for every project
[20:31] what is the github url for the project? i can use my kickthebucket boxes for it
[20:31] yeah, it's in my boilerplate for new projects
[20:31] target is in, anyway
[20:31] (For reference, some people can't be arsed to connect to EFnet and are chatting about this project in -bs on hackint.)
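
The 2769 figure is 54045 − 51298 + 22, and both it and the "Another 1.65 TB" estimate can be sanity-checked directly against the -new.jsonl linked above; the "url" and "size" field names are assumptions about the file format.

```python
import json

# 54045 files in the second scan, minus 51298 in the first, plus the 22 the
# first scan should have found but lost to false 404s:
assert 54045 - 51298 + 22 == 2769

count = 0
total_bytes = 0
with open("microsoft-download-center-files-below-id-250000-sorted-new.jsonl") as f:
    for line in f:
        count += 1
        total_bytes += json.loads(line)["size"]  # "size" is an assumed field name

print(count, f"~{total_bytes / 1e12:.2f} TB")  # expected: 2769 files, roughly 1.65 TB
```
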
[20:32] https://github.com/ArchiveTeam/microsoft-download-center-grab
[20:32] arsed
[20:32] ^mgrandi
[20:32] arkiver: The commit messages on -items are maximally useless, so what was queued so far?
[20:33] JAA: everything in ADDED
[20:33] Well yeah, but what's 00, 01, and 02?
[20:33] 01 and 02 are the two lists mgrandi isn't working on
[20:33] 00, 01, and 02 *
[20:33] 00 is 'the rest' of the jsonl file that's not in 01 and 02
[20:34] without the project, just plain wget-at --input-file
[20:34] Ok, so 00 corresponds to mgrandi's first three chunks?
[20:34] uhhh
[20:34] so basically i had the entire list, i did X amount of it, ctrl+ced it, then basically removed the entries that I had completed
[20:34] then i did split --lines 10000 and got 5 files
[20:34] Ah
[20:35] so i guess i have sections -1, 00, 01, 02, 03, 04
[20:35] 01 = file 3 from mgrandi
[20:35] 02 = file 4 from mgrandi
[20:35] section -1 is already done as that was a ctrl+c, then i'm running 3 wget-at instances for sections 00 01 and 02
[20:35] 00 = https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl - 01 - 02
[20:35] (notice the - 01 - 02)
[20:35] Got it.
[20:36] all items are taken already :P
[20:36] RIP kiska's target.
[20:36] and Kaz
[20:36] pfft, we're fine
[20:36] Ok, so I guess I just need to add the -new.jsonl from my second scan then.
[20:37] JAA: I'd think so?
[20:37] Since 00+01+02 should be the same as my first scan?
[20:37] let me upload all of the files
[20:38] JAA: 00+01+02 = https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl
[20:38] :-)
[20:38] *** Doranwen has joined #archiveteam-bs
[20:38] (technically that is files 0 -> 4 but yeah, this is confusing)
[20:38] mgrandi: I think we're good now. 00 is what you've grabbed or are grabbing.
[20:41] wait, you guys have already taken all 20,000 remaining files?
[20:41] yeah
[20:41] where were you yesterday D:
[20:42] we fast
[20:42] lol
[20:42] JAA: do you have an ETA on the new lists?
[20:42] else I might queue 00_url.txt ('the rest') as well
[20:42] else as in if it takes a few more hours
[20:43] arkiver: Few minutes
[20:43] nice
[20:44] arkiver: 03_url.txt is in now. Can you spot-check whether I derived it right?
[20:46] https://www.dropbox.com/s/ratsewjzpyu4kox/mgrandi_microsoft_dl_center_file_lists.zip?dl=0 these are my file lists
[20:46] Also, I'll run another scan up to 110k overnight at a lower rate to reduce the risk of missing stuff.
[20:47] JAA: didn't check for duplicates, but looks fine!
[20:47] shall I queue it?
[20:48] queued it :P
[20:48] The new raw file is already deduped against -60000-sorted.jsonl, so there shouldn't be any dupes in it.
[20:48] :-)
[20:49] and gone
[20:50] lol
[20:51] that took, like, 30 seconds <.<
[20:51] will queue 00_url.txt as well
[20:52] tracker ded?
[20:52] By the way, I think it'd be nice to keep this project going in the future as well and regularly archive new downloads.
[20:52] sure
[20:53] yano: nah, it just handed out 17k jobs in the last 60 seconds
[20:53] Although I guess we can switch to warriorbot once that's up and running.
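
Deriving a tracker-ready URL list (in the spirit of 03_url.txt) from the newly discovered files, plus the duplicate spot-check that was skipped above, could look roughly like this; again the "url" field name is an assumption about the JSONL format, and the output file name just mirrors the naming used in the -items repo.

```python
#!/usr/bin/env python3
# Turn the newly discovered JSONL entries into a plain URL list and check for
# overlap with what has already been queued.
import json

def url_list(path):
    with open(path) as f:
        return [json.loads(line)["url"] for line in f]

already_queued = set(url_list("microsoft-download-center-files-below-id-60000-sorted.jsonl"))
new_urls = url_list("microsoft-download-center-files-below-id-250000-sorted-new.jsonl")

dupes = [u for u in new_urls if u in already_queued]
print(f"{len(new_urls)} new URLs, {len(dupes)} already queued")  # expect 0 dupes

with open("03_url.txt", "w") as f:
    f.writelines(u + "\n" for u in new_urls)
```
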
[20:53] again where were you all yesterday lol
[20:53] rip my wallet
[20:53] ah
[20:53] you missed the magic words, clearly
[20:53] they're "new project, no limits"
[20:54] hehe :D
[20:54] those are the magic words
[20:54] well we didn't have a project set up
[20:54] Fusl Fusl Fusl Fusl Fusl Fusl Fusl yano Fusl Fusl Fusl Fusl :-)
[20:54] if you want one Fusl-cane and a Yanado
[20:54] i'm brand new and didn't know how to set one up
[20:55] can always ping me as well if you think we need one
[20:55] I still dislike the bus factor of 1 that is arkiver being the only person to have set up DPoS projects in years.
[20:55] just click your heels three times and whisper "new project, no limits" 3 times and Fusl and me will show up
[20:55] "bus factor"?
[20:55] lol
[20:55] oh wait
[20:55] I remember bus factor
[20:55] https://en.wikipedia.org/wiki/Bus_factor
[20:55] didnt they call it truck factor as well?
[20:55] "The "bus factor" is the minimum number of team members that have to suddenly disappear from a project before the project stalls due to lack of knowledgeable or competent personnel."
[20:56] would be nice to have some sort of automation to setting up some stuff like a target
[20:56] it's called warriorbot :P
[20:57] it'll be awesome if we can get running
[20:57] can queue lists of 100 thousands of URLs to that
[20:57] especially would have been nice for this
[20:57] Yeah, this would've been perfect.
[20:57] yes
[20:57] saves me from having to manually split lists and run wget manually in 3 byobu tabs
[20:59] how many computers does fusl have geez
[20:59] byobu *twitch*
[20:59] https://usercontent.irccloud-cdn.com/file/HkdLkD13/image.png
[20:59] mgrandi: Fusl is on hackint
[20:59] fuck byobu
[21:00] Kaz: nice painting skills
[21:00] thanks, I've practiced from a very young age
[21:00] I think it shows
[21:00] definitely does
[21:01] isn't byobu just...tmux lol
[21:01] So why not just use plain tmux then?
[21:01] that is indeed a lot of ip addresses
[21:01] cause i'm basic
[21:01] and byobu hasn't done me dirty i guess
[21:05] anyway, we can talk later arkiver about once my files are done on what to do with them
[21:07] i have my 20 jobs done, but here i am, waiting for an rsync slot so that i can upload them. I feel like grandpa compared to fusl :D
[21:07] i hate waiting in the rsync queue :c
[21:07] yeah, 4 got lost because apparently i ran out of space on kickthebucket and it never deleted the half downloaded file
[21:08] does anyone here know what's happening with marked1? his nick's been sitting in channels on Hackint but he's not responded to any communications since late May… we're needing to get a hold of him for the Yahoo Groups project and I was told maybe ask here and see if anyone knows
[21:08] Aoede: honestly, i don't know how fusl got so much bandwidth :O
[21:08] Doranwen: what do you need him for?
[21:09] and what is the channel you all are in?
[21:09] BartoCH: it's all hetzner cloud
[21:09] we're in #yahoosucks - the fandom project is getting a set of all the GMD data and he's got some that isn't anywhere else now
[21:10] what if we just nailed 4 boxes to the wall with uploads https://usercontent.irccloud-cdn.com/file/PAJkkzW6/image.png
[21:10] I've been transferring a lot of stuff back and forth with other project members but there's some sitting on his server that isn't accessible
[21:13] is that all microsoft dl center uploads?
[21:14] 4G of it is, yep
[21:20] That third slower scan I mentioned earlier is now running.
[21:21] I expect that to take about 9 hours.
[21:21] but at least all of it is downloaded off of microsoft's site
[21:29] JAA: nice, might be ready just in time
[21:29] And we're at 0 todo again. :-)
[21:35] JAA: Today is the removal date, so they may remove files (if that hasn't happened already), and you may see that on your scan
[21:36] I know. Nothing has been removed as of my second scan that finished about 2 hours ago.
[21:37] Not only would warriorbot "solve" the bus crash problem, it would be able to get rid of the delay that comes with having to have "replacable" people come to start a project
[21:37] Someone to set up workers (or switch the warrior HQ over to the project), someone to run a target, someone with permissions for the tracker, etc.
[21:38] Had they removed files for this at midnight, would have lost them because of this delay
[21:38] Obviously can't do this if the "project" isn't simple (list of files, in this case)
[21:40] i'm sitting on so many things to upload but the targets are too busy :p
[21:41] we're at 1 TB now!
[21:41] yano: yep :P
[21:41] yano: blame yourself and Fusl
[21:41] :)
[21:43] yeah, thats why i started downloading it on my own, but i def couldn't get all of them by midnight
[21:43] but i also work at microsoft and they are washington state based, they are not gonna do something at midnight on a sunday xD
[21:43] arkiver: lol
[21:44] mgrandi: so we can bug you if we need insider help? :P
[21:44] i have so many VMs running right now
[21:44] lol
[21:44] possibly, i just started but i can probably ask
[21:44] i told them all to stop gracefully; and they are all waiting to upload lol
[21:45] i have access to the entire MS directory and can just...message people on teams
[21:49] nice
[21:49] it's sometimes very hard to reach these companies
[23:02] *** Arcorann has joined #archiveteam-bs
[23:02] *** Arcorann has quit IRC (Remote host closed the connection)
[23:03] *** Arcorann has joined #archiveteam-bs
[23:20] so, my 'file list section 00' is 100% done now, so that is the first 10,300 or so files
[23:22] file list section 2 is 5794/10,000 done, and file list section 1 is 9695/10,000 done
[23:22] Meanwhile the DPoS project is almost halfway done with the uploads.
[23:23] yeah, heh
[23:23] impressive we got all of them this fast
[23:30] did you manage to get the rest of the URLs past 60k @JAA
[23:31] mgrandi: Yes, I scanned up to 250k and found another 2.7k files. Link's somewhere above, search for 'jsonl' I guess.
[23:31] did that get added to the tracker and already retrieved? lol
[23:32] Yep
[23:32] geez
[23:32] thankfully they have no rate limiting or anything...
[23:32] All those items were gone in less than 30 seconds. lol
[23:34] how big was the data up to id 250k ?
[23:34] *** SilSte has joined #archiveteam-bs
[23:35] That was the extra 1.65 TB on top of the 7.1.
[23:35] ah ok
[23:35] so the first 60k is the majority of the size
[23:35] JAA: still need instances?
[23:35] no, lol
[23:36] mgrandi: Yes. But note that my scan is incomplete as explained above.
[23:36] we needed them sunday >:(
[23:36] phuzion: Nope, Fusl ate all the items.
[23:36] (the others hoovered up everything and i'm in process of finishing up)
[23:37] JAA: alright. For future reference, you can @ me any time you think you might need a bunch of DO droplets spun up quickly. Only takes me about 5 minutes to re-adapt my ansible script for a new project, and spin up the droplets to do it.
[23:38] can you post that ansible script? :eyes:
[23:38] well the problem was we needed a project set up
[23:38] so even with a bunch of instances its a lot of manual fiddling
[23:38] Ah, my script mostly solves the "we don't have enough workers" issue.
[23:39] but does it just start a bunch of warrior workers?
[23:39] It starts a bunch of pipeline scripts, without the warrior overhead
[23:39] well yeah, i guess seesaw pipelines
[23:39] but the point is we didn't have a pipeline set up yesterday
[23:40] arkiver wrote one quickly this morning
[23:40] mgrandi: https://github.com/phuzion/archiveteam-deploy/blob/master/setup.yml
[23:40] it's super old and probably breaks a bunch of best practices, but I've successfully used it a few times to spin up like 50 droplets without a problem.
[23:40] but since it was just me and jaa yesterday we were just like 'aaa' so i just started downloading them onto a large DO droplet volume
[23:41] yeah my script is basically the solution to "Oh fuck this project isn't gonna succeed unless we get like 100 more instances downloading stat"
[23:41] or "hey I have some DO credit I wanna piss away" lol
[23:42] speaking of DO, I need to check my account, my invoice this month was more than I expected.
[23:43] I'm still getting false 404s at the low concurrency. :-(
[23:43] how are you downloading them? maybe its still setting a cookie?
[23:44] I'm clearing cookies after each ID though.
[23:45] Actually, it can't be cookie-based since I'm sometimes seeing the false 404s on details.aspx but not on confirmation.aspx. I'm not clearing cookies between those requests.
[23:45] maybe the detail pages go away but not the confirmation or the files? :thinking:
[23:46] Nope, the pages exist. The server's just broken I think.
[23:46] My suspicion was that the cookie thing was something auth-related
[23:46] Maybe some DB query fails occasionally and that causes it to 404.
[23:51] do those urls work manually?
[23:52] Yes
[23:52] It's just random false 404s.
[23:52] Retrying the exact same request succeeds usually.
[23:56] *** BlueMax has joined #archiveteam-bs
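
Going back to phuzion's deployment script mentioned above: once a box is provisioned, "a bunch of pipeline scripts, without the warrior overhead" amounts to cloning the grab repo and running seesaw's run-pipeline against it, roughly as sketched below. The nickname and concurrency are placeholders, seesaw-kit has to be installed first, and the exact run-pipeline flags may differ between seesaw versions.

```python
#!/usr/bin/env python3
# Rough sketch of what each provisioned box ends up doing: run the project's
# seesaw pipeline directly, skipping the warrior VM. Assumes seesaw-kit is
# installed (pip install seesaw); NICK and CONCURRENCY are placeholders.
import subprocess

REPO = "https://github.com/ArchiveTeam/microsoft-download-center-grab"
NICK = "yournickhere"   # tracker nickname (placeholder)
CONCURRENCY = "6"       # simultaneous items per pipeline (placeholder)

subprocess.run(["git", "clone", REPO, "microsoft-download-center-grab"], check=True)
subprocess.run(
    ["run-pipeline", "--concurrent", CONCURRENCY, "pipeline.py", NICK],
    cwd="microsoft-download-center-grab",
    check=True,
)
```
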