#archiveteam-bs 2020-08-03,Mon


Time Nickname Message
00:00 🔗 Gallifrey has joined #archiveteam-bs
00:02 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
00:05 🔗 Gallifrey has joined #archiveteam-bs
00:07 🔗 SketchCow Oh, let's just back up all the Microsofts
00:14 🔗 Gallifrey has quit IRC (Ping timeout: 265 seconds)
00:21 🔗 Gallifrey has joined #archiveteam-bs
00:24 🔗 BlueMax has joined #archiveteam-bs
00:27 🔗 Raccoon has quit IRC (Read error: Operation timed out)
00:29 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
00:31 🔗 Gallifrey has joined #archiveteam-bs
00:38 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
00:40 🔗 mgrandi 566/51298
00:40 🔗 mgrandi might need to start multiple then
00:54 🔗 Gallifrey has joined #archiveteam-bs
01:00 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
01:07 🔗 Gallifrey has joined #archiveteam-bs
01:21 🔗 lennier2 has joined #archiveteam-bs
01:21 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
01:27 🔗 lennier1 has quit IRC (Read error: Operation timed out)
01:27 🔗 lennier2 is now known as lennier1
02:56 🔗 Gallifrey has joined #archiveteam-bs
03:27 🔗 mgrandi is it safe to ctrl+c wget-at?
03:27 🔗 mgrandi or can i have it stop gracefully?
03:29 🔗 Ryz What's the status of the grab?
03:30 🔗 mgrandi its going but probably not fast enough
03:30 🔗 mgrandi was gonna try and split up the file
03:30 🔗 Ryz Oof :c
03:30 🔗 mgrandi file 3463
03:30 🔗 mgrandi out of 51298
03:31 🔗 OrIdow6 Try touching "STOP", maybe?
03:31 🔗 OrIdow6 Not sure how that's implemented
03:33 🔗 OrIdow6 I.e. where in the process it checks
03:34 🔗 mgrandi i think thats seesaw
03:34 🔗 mtntmnky has joined #archiveteam-bs
03:34 🔗 mgrandi mainly concerned that the warc will be like malformed or something
03:34 🔗 OrIdow6 Oh, I think you're right
03:35 🔗 mtntmnky_ has quit IRC (Remote host closed the connection)
03:35 🔗 OrIdow6 Yeah, it is seesaw
03:37 🔗 OrIdow6 If you're really worried about it, you can just wait until it makes a new warc, stop it, and then redo that item
03:37 🔗 OrIdow6 Or item(s), if they're small
03:37 🔗 mgrandi yeah, i think it completes the item even if it goes over the 'warc limit'
03:37 🔗 OrIdow6 Files, items, same thing
03:37 🔗 mgrandi so i think there are no half finished items
03:38 🔗 OrIdow6 That makes sense; I don't think that would be a valid warc
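The STOP-file mechanism discussed above belongs to seesaw, not wget-at: the pipeline checks for a sentinel file named STOP between items and exits cleanly once the current item finishes, so no WARC is left half-written. A minimal sketch of that pattern (the item list and `process_item` are placeholders, not seesaw's actual code):

```python
import os

def process_item(item):
    # placeholder for the real work: download one item into its own WARC
    print(f"processed {item}")

def run_pipeline(items, workdir="."):
    """Process items one at a time; between items, check for a
    sentinel file named STOP and stop cleanly if present, so no
    item (and no WARC) is ever left half-finished."""
    done = []
    for item in items:
        if os.path.exists(os.path.join(workdir, "STOP")):
            break  # the previous item already completed; safe to exit here
        process_item(item)
        done.append(item)
    return done
```

Checking only between items is what makes the stop safe: an in-flight item is never interrupted, which matches the observation above that there are no half-finished items.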
03:38 🔗 qw3rty__ has joined #archiveteam-bs
03:43 🔗 OrIdow6 566 (https://www.microsoft.com/en-us/download/details.aspx?id=1387) looks like it puts you in 2009
03:43 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
03:45 🔗 mgrandi yeah last item was https://download.microsoft.com/download/C/6/3/C63FC9B6-CC3C-4AA4-97CC-38D6E5EB43FC/WindowsServer2003.WindowsXP-KB2286198-x64-ENU.exe
03:46 🔗 qw3rty_ has quit IRC (Read error: Operation timed out)
03:48 🔗 Gallifrey has joined #archiveteam-bs
03:48 🔗 OrIdow6 Oh, missed the second update; that's better (by a year or so)
03:58 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
04:13 🔗 mgrandi ok i split it into 10,000 line files, doing 3 of them at once
04:15 🔗 mgrandi doing like 200MBit/s now
04:18 🔗 OrIdow6 Hopefully that keeps up
04:19 🔗 Gallifrey has joined #archiveteam-bs
04:23 🔗 mgrandi warrior bot would make this much easier lol
04:24 🔗 OrIdow6 Yeah
04:28 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
04:30 🔗 OrIdow6 Would be good for short-notice things like this
04:31 🔗 OrIdow6 Where there's a lot of data that needs to be moved in a shorter time than all the necessary people can be expected to get on and coordinate
04:32 🔗 Gallifrey has joined #archiveteam-bs
04:32 🔗 mgrandi well yeah, even if its just 1 person, i had to basically take the file, split it up, start 4 tabs in byobu, edit the command to be slightly different 4 times, etc
04:33 🔗 mgrandi and then its not very good at any one process stopping, etc
04:33 🔗 mgrandi just having a queueing system would help a lot
04:33 🔗 mgrandi and yes the addition of other people helping would be great too
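The manual procedure described above (split the URL list, run one wget-at per chunk) can be scripted; here is a hedged sketch of the splitting step, equivalent to `split --lines 10000` (the output filenames are illustrative):

```python
from itertools import islice

def split_url_list(path, lines_per_chunk=10000, prefix="section_"):
    """Split a URL list into fixed-size chunk files, one per
    wget-at instance (mirrors `split --lines 10000`)."""
    chunks = []
    with open(path) as f:
        i = 0
        while True:
            block = list(islice(f, lines_per_chunk))
            if not block:
                break  # input exhausted
            out = f"{prefix}{i:02d}.txt"
            with open(out, "w") as g:
                g.writelines(block)
            chunks.append(out)
            i += 1
    return chunks
```

Each chunk file can then be fed to a separate `wget-at --input-file section_NN.txt` process, which is the setup used here.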
05:03 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
05:08 🔗 Gallifrey has joined #archiveteam-bs
05:21 🔗 mgrandi "Read error (Success.) in headers."
05:21 🔗 mgrandi whats that mean
05:27 🔗 OrIdow6 Which item was it on?
05:29 🔗 OrIdow6 The only thing I can find (besides spam pages) is https://marc.info/?t=106397208600003&r=1 , which relates it to TLS problems
05:29 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
05:33 🔗 Gallifrey has joined #archiveteam-bs
05:48 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
05:51 🔗 mgrandi i don't know which item, i have retry turned on so i think it should retry it..
05:52 🔗 Gallifrey has joined #archiveteam-bs
06:02 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
06:05 🔗 Gallifrey has joined #archiveteam-bs
06:15 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
06:17 🔗 Gallifrey has joined #archiveteam-bs
06:22 🔗 bsmith093 has quit IRC (Quit: Leaving.)
06:23 🔗 bsmith093 has joined #archiveteam-bs
07:01 🔗 mgrandi 2020-08-03 07:01:08 URL:https://download.microsoft.com/download/4/B/3/4B300C83-A439-4E9F-B889-60FD8B83D7F2/Lync2013_SP-10.bin [1000000000/1000000000]
07:01 🔗 mgrandi what is that file size
07:19 🔗 mgrandi also i assume the '8' means it retried 8 times:
07:19 🔗 mgrandi 2020-08-03 07:14:21 URL:https://download.microsoft.com/download/4/B/3/4B300C83-A439-4E9F-B889-60FD8B83D7F2/Lync2013_SP-3.bin [1000000000/1000000000] -> "/mnt/volume_sfo3_04/ms_dl/wget02.tmp.tmp.tmp.tmp.tmp.tmp.tmp.tmp.tmp" [8]
07:26 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
07:27 🔗 Gallifrey has joined #archiveteam-bs
07:29 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
07:36 🔗 Gallifrey has joined #archiveteam-bs
07:45 🔗 mgrandi @JAA, did you download the Details.aspx pages?
08:10 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
08:27 🔗 OrIdow6 mgrandi: Looks like it's a single file (a VM image, I think) split into multiple parts
08:28 🔗 OrIdow6 Hence the size - that was the cutoff someone decided to use when splitting
08:28 🔗 OrIdow6 https://www.microsoft.com/en-us/download/details.aspx?id=40267
08:28 🔗 OrIdow6 953.7 MiB ~= the size you got
08:28 🔗 OrIdow6 w/i rounding error
08:30 🔗 OrIdow6 Don't know about the multiple tries, though; works fine in curl
08:31 🔗 OrIdow6 If that's what that output is indicating
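The 1,000,000,000-byte parts do line up with the 953.7 MiB size listed on the download page: the splitter evidently used a round decimal gigabyte as its cutoff, and 10^9 bytes is about 953.67 MiB:

```python
part_bytes = 1_000_000_000   # size wget reported for each .bin part
mib = part_bytes / 2**20     # convert bytes to MiB (1 MiB = 2**20 bytes)
print(round(mib, 1))         # 953.7, matching the size on details.aspx
```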
08:34 🔗 Gallifrey has joined #archiveteam-bs
08:42 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
08:43 🔗 Mayeau has joined #archiveteam-bs
08:43 🔗 Mayonaise has quit IRC (Read error: Operation timed out)
08:45 🔗 Gallifrey has joined #archiveteam-bs
09:02 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
09:08 🔗 Raccoon has joined #archiveteam-bs
09:51 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
09:53 🔗 Gallifrey has joined #archiveteam-bs
09:59 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
10:03 🔗 Gallifrey has joined #archiveteam-bs
10:13 🔗 TC01 has joined #archiveteam-bs
10:14 🔗 TC01_ has quit IRC (Read error: Operation timed out)
10:25 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
10:29 🔗 Gallifrey has joined #archiveteam-bs
10:35 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
10:38 🔗 Gallifrey has joined #archiveteam-bs
10:40 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
10:44 🔗 Gallifrey has joined #archiveteam-bs
10:45 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
10:49 🔗 Gallifrey has joined #archiveteam-bs
11:05 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
11:09 🔗 Gallifrey has joined #archiveteam-bs
11:28 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
11:33 🔗 Dallas8 has quit IRC (Quit: Dallas8)
11:33 🔗 Dallas has joined #archiveteam-bs
11:34 🔗 BlueMax has joined #archiveteam-bs
12:09 🔗 Gallifrey has joined #archiveteam-bs
12:10 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
12:13 🔗 Gallifrey has joined #archiveteam-bs
12:15 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
12:16 🔗 Gallifrey has joined #archiveteam-bs
12:36 🔗 Mayeau is now known as Mayonaise
12:42 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
12:44 🔗 Gallifrey has joined #archiveteam-bs
12:47 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
12:50 🔗 Gallifrey has joined #archiveteam-bs
12:52 🔗 Gallifrey has quit IRC (Remote host closed the connection)
12:59 🔗 Gallifrey has joined #archiveteam-bs
13:05 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
13:07 🔗 Gallifrey has joined #archiveteam-bs
13:18 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
13:20 🔗 yano has quit IRC (Quit: WeeChat, The Better IRC Client, https://weechat.org/)
13:22 🔗 Gallifrey has joined #archiveteam-bs
13:23 🔗 TC01 has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
13:23 🔗 TC01 has joined #archiveteam-bs
13:26 🔗 yano has joined #archiveteam-bs
13:27 🔗 svchfoo3 sets mode: +o yano
13:30 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
13:37 🔗 Gallifrey has joined #archiveteam-bs
13:42 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
13:46 🔗 Gallifrey has joined #archiveteam-bs
14:07 🔗 JAA mgrandi: I downloaded details.aspx and confirmation.aspx for all IDs between 1 and 60k (inclusive).
14:07 🔗 arkiver how is microsoft going?
14:08 🔗 arkiver if I can help with a project, please let me know
14:08 🔗 arkiver will try to get some more project started now for other upcoming deadlines
14:08 🔗 JAA arkiver: mgrandi is trying to grab it and was going too slow, but they're asleep now so not sure about the most current status.
14:09 🔗 arkiver i guess the "too slow" part is not on the side of microsoft?
14:09 🔗 arkiver if we have a list of URLs we can do a quick project to get everything backed up
14:09 🔗 arkiver but will maybe wait until mgrandi is back online
14:09 🔗 JAA I would assume that Microsoft's servers are fast.
14:09 🔗 OrIdow6 It's a CDN
14:10 🔗 arkiver warriorbot would be great for this :)
14:10 🔗 kiska :D
14:10 🔗 arkiver JAA: ^
14:10 🔗 kiska Except it doesn't work atm
14:10 🔗 JAA Last statement was 200 Mb/s, which is way too slow to grab it in time.
14:10 🔗 JAA 1 Gb/s = ~16 hours
14:10 🔗 arkiver how much do we expect?
14:10 🔗 JAA 7.1 TB
14:10 🔗 JAA Across 51k files or so.
14:10 🔗 JAA I still need to scan the higher IDs though.
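The back-of-the-envelope numbers above check out, taking 7.1 TB as 7.1×10^12 bytes:

```python
total_bytes = 7.1e12                    # ~7.1 TB across ~51k files
bits = total_bytes * 8

hours_at_1gbps = bits / 1e9 / 3600      # the "~16 hours" figure
hours_at_200mbps = bits / 200e6 / 3600  # the observed 200 Mb/s rate

print(round(hours_at_1gbps, 1))         # ~15.8 h at 1 Gb/s
print(round(hours_at_200mbps, 1))       # ~78.9 h, i.e. over 3 days -- too slow
```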
14:11 🔗 arkiver we can scan them with a project?
14:11 🔗 arkiver what do they look like?
14:11 🔗 JAA We can, but some downloads are huge.
14:11 🔗 arkiver well ping me if you need help, I have time :)
14:12 🔗 arkiver if it's large, then we can put a warning on the project
14:13 🔗 JAA 20:45:37 <@JAA> https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl
14:13 🔗 JAA 19:17:02 <@JAA> Microsoft Download Center: I found 51298 files with a total size of about 7.1 TB.
14:13 🔗 JAA 20:58:31 <@JAA> Here are the ten most frequent file extensions: 13780 msu, 13111 exe, 6812 zip, 3927 msi, 3770 pdf, 2228 pptx, 1214 docx, 888 bin, 828 doc, 483 xps
14:13 🔗 JAA 21:06:32 <@JAA> Further statistics: top ten by size in GiB: 4100.1 .zip, 864.1 .exe, 597.3 .bin, 355.1 .iso, 182.3 .rar, 149.6 .cab, 118.1 .msi, 90.5 .msu, 17.6 .wmv, 16.7 .ISO
14:14 🔗 JAA That's the most important bits from last night.
14:14 🔗 JAA Only some of this is getting deleted, but we have no way to tell what.
14:14 🔗 JAA (Well, not without downloading it first.)
14:14 🔗 arkiver yeah 7 TB is fine for this stuff
14:15 🔗 JAA I only scanned up to ID 60k, but I found some downloads at higher IDs. They're weird and not at all sequential.
14:15 🔗 arkiver how high are those IDs?
14:15 🔗 JAA Couldn't figure out the upper bound.
14:15 🔗 JAA Highest I've seen is https://www.microsoft.com/en-us/download/details.aspx?id=100688
14:16 🔗 arkiver we could also make it a discovery project, with a txt file in case something is found
14:16 🔗 JAA (Note, the "Date Published" is completely unreliable.)
14:16 🔗 arkiver if microsoft can handle it, we can probably scan up to a billion or so IDs in a day
14:19 🔗 JAA One thing worth mentioning is that Microsoft's 404 handling is ... weird.
14:19 🔗 JAA What they do is 302-redirect to https://www.microsoft.com/en-us/download/404Error.aspx
14:19 🔗 arkiver that's fine
14:19 🔗 JAA But I've seen two cases where I got such a redirect even though the page existed.
14:20 🔗 arkiver uh
14:20 🔗 JAA Retrying succeeded there.
14:20 🔗 arkiver oh no
14:20 🔗 ephemer0l has quit IRC (Read error: Connection reset by peer)
14:20 🔗 JAA So uh yeah
14:20 🔗 JAA I was only able to detect this on the two cases because it failed on confirmation.aspx.
14:20 🔗 JAA If I got any false 404 redirects on details.aspx, my scan might be incomplete.
14:31 🔗 JAA arkiver: Here's my qwarc spec file for the scan. For the most part, it should be pretty obvious what's happening even if you're not familiar with qwarc. https://transfer.notkiska.pw/67hNZ/microsoft-download-center.py
14:34 🔗 JAA (I handled the one case of 404 idiocy manually.)
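The scan itself was done with qwarc (spec file linked above); the retry logic it needs can be sketched as pure functions. The tuple shape returned by the fetch callable and the retry count are assumptions for illustration, not qwarc's actual API:

```python
def looks_like_false_404(status, location):
    """The Download Center signals 404 with a 302 redirect to
    404Error.aspx -- but under load it sometimes does this even for
    pages that exist, so one such redirect is not conclusive."""
    return status in (301, 302) and "404Error.aspx" in (location or "")

def fetch_with_retry(fetch, url, attempts=3):
    """fetch(url) -> (status, location_header, body).  Retry a few
    times before trusting a redirect-to-404; anything else is
    returned immediately as a real response."""
    last = None
    for _ in range(attempts):
        last = fetch(url)
        status, location, _body = last
        if not looks_like_false_404(status, location):
            return last
    return last  # redirected every attempt; probably really gone
```

This mirrors the observation above that retrying a spurious 404 redirect succeeded, while a consistently missing page kept redirecting.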
14:37 🔗 prq has joined #archiveteam-bs
14:37 🔗 phuzion @ me if y'all need some instances spun up, it's trivial for me to spin up DO droplets with custom code.
15:13 🔗 JAA In other news, my Clutch discovery scan has arrived at March 2020.
15:17 🔗 kiska Should we just repurpose #stops.tv for clutch? "D
15:24 🔗 JAA I'm at 644k posts/clips so far, by the way.
15:26 🔗 arkiver can we move clutch to hackint?
15:26 🔗 arkiver JAA: do we have a channel?
15:26 🔗 JAA Nope, not yet.
15:26 🔗 JAA #pearls ?
15:26 🔗 arkiver sure :P
15:27 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
16:22 🔗 arkiver getting a project ready for microsoft since there's so little time left
16:22 🔗 arkiver for the URLs in the jsonl file initially, can add more later
16:27 🔗 JAA Sounds good. I'm setting up another scan now to higher IDs and grabbing confirmation.aspx regardless of whether details.aspx exists. That should massively reduce the chance of missing something.
16:27 🔗 kiska You know my mobile call me when you have things ready I am in bed
16:28 🔗 kiska I can spin up reserves on hetz cloud if needed
16:31 🔗 ephemer0l has joined #archiveteam-bs
16:40 🔗 JAA New scan running now.
16:43 🔗 VADemon has joined #archiveteam-bs
16:44 🔗 arkiver JAA: yes please, are you getting all the HTML into WARCs?
16:53 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
17:03 🔗 RichardG_ has joined #archiveteam-bs
17:04 🔗 arkiver kiska: we could use a target :)
17:09 🔗 RichardG has quit IRC (Read error: Operation timed out)
17:10 🔗 Aoede is there a channel for MS download center?
17:13 🔗 arkiver on
17:19 🔗 arkiver items are online, grab script coming in a bit
17:35 🔗 arkiver also JAA THOSE LONG LINES
17:36 🔗 arkiver JAA: I won't be adding support for checking IDs yet to the project
17:36 🔗 arkiver will initially just be about archiving the download.microsoft.com lines
17:36 🔗 arkiver from the jsonl files
17:36 🔗 arkiver file*
17:38 🔗 arkiver all we need now is targets
17:38 🔗 arkiver HCross: kiska: do you have some target we can add for the microsoft downloads project?
17:39 🔗 arkiver the tracker name will be microsoft-download-center
17:39 🔗 arkiver official name is Microsoft Download Center
17:45 🔗 JAA arkiver: I'm grabbing details.aspx and confirmation.aspx into WARC, yes. I'm not grabbing page requisites etc.
17:46 🔗 JAA Already got a couple of cases of 404 on details.aspx and 200 on confirmation.aspx. :-|
17:46 🔗 kiska rsync://rsync.hel1.kiska.pw/microshaft/:downloader/
17:46 🔗 JAA I'm rescanning 1-60k as well due to that mess.
17:46 🔗 kiska :D
17:47 🔗 JAA Correction: I rescanned*
17:47 🔗 JAA Now scanning 60k-250k.
17:48 🔗 kiska IA_ITEM_TITLE="Archive Team Microsoft Download Center:"
17:48 🔗 kiska IA_ITEM_PREFIX="archiveteam_microsoft_"
17:48 🔗 kiska FILE_PREFIX="microsoft_"
17:49 🔗 kiska Yes? No?
17:50 🔗 JAA 'microsoft' seems a bit too generic.
17:50 🔗 kiska Just a reminder there is like 4 active projects on this machine, so don't slam it too hard?
17:50 🔗 JAA Just 7 TB in a few hours, not too big.
17:50 🔗 JAA :-)
17:50 🔗 kiska IA_ITEM_PREFIX="archiveteam_microsoft_download_"
17:50 🔗 kiska FILE_PREFIX="microsoft_download_"
17:50 🔗 kiska Better?
17:51 🔗 JAA Sounds good to me.
17:51 🔗 kiska If so, I am going to run those settings
17:52 🔗 kiska For now 1 packer and 1 uploader is running
18:00 🔗 kiska I am trying to not make the item prefix too long as there is a char limit
18:01 🔗 kiska If uploads are backed up just get clients to buffer the uploads
18:02 🔗 JAA There is a limit, but I think it's around 100 chars, so that should be fine.
18:02 🔗 JAA I hit it during the NRATV uploads.
18:16 🔗 mgrandi i have 1.7tb right now
18:16 🔗 mgrandi i also think that the box i'm running on is too slow
18:18 🔗 mgrandi i can also give you what files i have downloaded if this is gonna be made an actual project
18:19 🔗 mgrandi (i split the list of files into sections and have 3 instances of wget-at running)
18:27 🔗 mgrandi stuff still seems to be downloading @JAA @kiska
18:29 🔗 JAA arkiver: ^
18:32 🔗 mgrandi i have a 8 tb drive that i'm paying for atm, so if we get a proper warrior set up i can beef up the computer
18:37 🔗 mgrandi and i am hitting like 200MBits/s but every time a file finishes it has to add it to the warc which stops any download activity for that process so its not consistent
18:43 🔗 mgrandi 11571 (or 8969/10,000) on section 00
18:44 🔗 mgrandi 6064/10,000 on section 1
18:45 🔗 mgrandi 5478/10,000 on section 2
18:45 🔗 mgrandi section 3 and 4 haven't started
18:46 🔗 mgrandi 3 is 10,000 files, 4 is 7685 files
18:49 🔗 mgrandi and yeah, the 1.7 TB is compressed warcs so i don't know how much uncompressed data that is
19:43 🔗 JAA So my second scan finished and discovered 54045 files (instead of 51298 in the first).
19:48 🔗 JAA Oof
19:48 🔗 JAA I did get 404s on both details.aspx and confirmation.aspx. :-(
19:48 🔗 JAA E.g. https://www.microsoft.com/en-us/download/details.aspx?id=1557 is missing on my second scan.
19:49 🔗 JAA What the fuck, Microsoft?
19:49 🔗 JAA Overall, 22 files were discovered on the first scan but are missing from the second.
19:50 🔗 JAA So yeah, my scan is definitely incomplete...
19:50 🔗 Laverne Works here, maybe you hit some rate limiter?
19:50 🔗 JAA Laverne: Nope, Microsoft's servers suck and occasionally return 404s (really 302s to a 404 page) under load.
19:51 🔗 arkiver yeah
19:51 🔗 JAA There is rate limiting as well. I got a few 403s.
19:51 🔗 arkiver project starting a few minutes
19:51 🔗 arkiver now actually
19:51 🔗 arkiver (the question)
19:52 🔗 JAA On a positive note, it looks like nothing was removed since my first scan.
19:53 🔗 JAA The highest existing ID up to 250k is 101583.
19:54 🔗 JAA (Well, assuming I didn't miss any higher ones due to 404s.)
20:10 🔗 arkiver microsoft download center project is online
20:10 🔗 arkiver kiska: added that as target, thanks
20:10 🔗 mgrandi do you want the 2 files that i haven't started on?
20:10 🔗 mgrandi err lists of files
20:11 🔗 arkiver mgrandi: yeah, please PR them in the raw/ dir in repo microsoft-download-center-items
20:11 🔗 arkiver likely we'll just get everything though
20:11 🔗 arkiver I'm fine with queuing everything
20:12 🔗 mgrandi well if microsoft is indeed deleting stuff later today shouldn't we prioritize stuff that hasn't been downloaded by me?
20:12 🔗 arkiver yeah
20:12 🔗 Ravenloft has quit IRC (Remote host closed the connection)
20:12 🔗 arkiver so if you have the lists, let me know
20:12 🔗 arkiver kind of need them now then to queue something
20:13 🔗 mgrandi https://gist.github.com/fe8360f7bc6380aef52e028bc0749d07
20:13 🔗 mgrandi that is 'file 3'
20:13 🔗 arkiver I didn't read much backlog, so I'll ping in case of questions
20:13 🔗 mgrandi https://gist.github.com/95a836abc19210d07384d294eaba22f7 file 4
20:13 🔗 mgrandi i a doing files 0, 1 and 2
20:14 🔗 arkiver ok, if these lists are done the other will likely be queued as well
20:14 🔗 mgrandi i have not started on those two files at all ^ the other sections i am at least half way done with the rest of htem
20:14 🔗 mgrandi with them &
20:18 🔗 yano oh hai
20:19 🔗 arkiver hui
20:19 🔗 arkiver hi
20:19 🔗 JAA So there are 2769 more files I discovered during my second scan. (That's 54045 in the second scan minus 51298 in the first scan plus 22 that should've been discovered in the first scan but were missed due to 404s.)
20:19 🔗 mgrandi file 00: 91% done, file 01: 66% done, file 02: 54% done
20:20 🔗 JAA Another 1.65 TB in those.
20:20 🔗 mgrandi i say its easier to just queue the 2 links i posted above (sections 3 and 4 of the initial list JAA gave me) and plus the additional files JAA just found now
20:21 🔗 mgrandi 2769 = 1 TB? geeze
20:21 🔗 mgrandi those are some big files
20:21 🔗 JAA Results of my second scan: https://transfer.notkiska.pw/7dyt8/microsoft-download-center-files-below-id-250000-sorted.jsonl
20:22 🔗 JAA The 2769 newly discovered files: https://transfer.notkiska.pw/WlXAX/microsoft-download-center-files-below-id-250000-sorted-new.jsonl
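The arithmetic above works out (54045 − 51298 + 22 = 2769), and the "-new" list is the second scan deduplicated against the first, keyed on URL. A sketch of that derivation, assuming each JSONL line carries a "url" field (the actual field name in the scan files is an assumption):

```python
import json

def new_urls(first_scan_path, second_scan_path):
    """Return the lines of the second scan whose URL is absent from
    the first scan.  Assumes one JSON object per line with a 'url'
    key (field name hypothetical)."""
    with open(first_scan_path) as f:
        seen = {json.loads(line)["url"] for line in f if line.strip()}
    new = []
    with open(second_scan_path) as f:
        for line in f:
            if line.strip() and json.loads(line)["url"] not in seen:
                new.append(line)
    return new
```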
20:23 🔗 mgrandi the files i have are gonna have to be pulled apart and sorted anyway
20:23 🔗 arkiver items added to the tracker!
20:24 🔗 arkiver haha yano have fun :P
20:24 🔗 mgrandi is kickthebucket basically done?
20:24 🔗 yano :D
20:24 🔗 yano lol
20:24 🔗 arkiver JAA: can you please put the new lists in raw in the repo?
20:24 🔗 arkiver or the processed lists starting with num 03
20:25 🔗 mgrandi i'm technically working right now so if someone can take my two gist urls and PR them for me, thanks :)
20:25 🔗 JAA phuzion: You wanted to be pinged.
20:25 🔗 arkiver mgrandi: those 'file 3' and 'file 4'?
20:25 🔗 mgrandi yeah
20:25 🔗 JAA arkiver: Yeah, will do in a second.
20:25 🔗 arkiver mgrandi: I already processed them
20:26 🔗 mgrandi ok cool
20:26 🔗 arkiver I was talking about the new stuff from JAA
20:26 🔗 mgrandi (after this i will contribute help to warrior bot, this was a very messy project lol )
20:26 🔗 arkiver Kaz: but we could sure use another target, the faster this gets done the better
20:27 🔗 Kaz _screams_
20:27 🔗 mgrandi how hard is it to set up a target?
20:27 🔗 mgrandi i have 8 tb i'm paying for on this machine
20:27 🔗 Kaz anywhere from 'dead simple' to 'horrifyingly complicated' depending inversely on the number of braincells you have
20:28 🔗 arkiver yano: someone is scaling up :P
20:28 🔗 Kaz JAA: can you find the config in here? have scrolled but couldn't see from a quick glance
20:28 🔗 yano :3
20:28 🔗 Kaz everyone's favourite irc service doesn't let me search easily either
20:29 🔗 JAA Kaz: Around 17:50
20:29 🔗 Kaz yeah just clocked it, ta
20:29 🔗 arkiver yano: hmm, update coming up though
20:29 🔗 yano dangit
20:30 🔗 Kaz dictionary required?
20:30 🔗 arkiver URL is in
20:30 🔗 arkiver Kaz: no
20:30 🔗 Kaz what's it gonna do if I specify one
20:30 🔗 Kaz break, or?
20:30 🔗 arkiver nothing
20:30 🔗 Kaz ok cool
20:30 🔗 arkiver I would add it by default for every project
20:31 🔗 mgrandi what is the github url for the project? i can use my kickthebucket boxes for it
20:31 🔗 Kaz yeah, it's in my boilerplate for new projects
20:31 🔗 Kaz target is in, anyway
20:31 🔗 JAA (For reference, some people can't be arsed to connect to EFnet and are chatting about this project in -bs on hackint.)
20:32 🔗 Craigle https://github.com/ArchiveTeam/microsoft-download-center-grab
20:32 🔗 arkiver arsed
20:32 🔗 Craigle ^mgrandi
20:32 🔗 JAA arkiver: The commit messages on -items are maximally useless, so what was queued so far?
20:33 🔗 arkiver JAA: everything in ADDED
20:33 🔗 JAA Well yeah, but what's 00, 01, and 02?
20:33 🔗 arkiver 01 and 02 are the two lists mgrandi isn't working on
20:33 🔗 mgrandi 00, 01, and 02 *
20:33 🔗 arkiver 00 is 'the rest' of the jsonl file that's not in 01 and 02
20:34 🔗 mgrandi without the project, just plain wget-at --input-file
20:34 🔗 JAA Ok, so 00 corresponds to mgrandi's first three chunks?
20:34 🔗 mgrandi uhhh
20:34 🔗 mgrandi so basically i had the entire list, i did X amount of it, ctrl+ced it, then basically removed the entries that I had completed
20:34 🔗 mgrandi then i did split --lines 10000 and got 5 files
20:34 🔗 JAA Ah
20:35 🔗 mgrandi so i guess i have sections -1, 00, 01, 02, 03, 04
20:35 🔗 arkiver 01 = file 3 from mgrandi
20:35 🔗 arkiver 02 = file 4 from mgrandi
20:35 🔗 mgrandi section -1 is already done as that was a ctrl+c, then i'm running 3 wget-at instances for sections 00 01 and 02
20:35 🔗 arkiver 00 = https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl - 01 - 02
20:35 🔗 arkiver (notice the - 01 - 02)
20:35 🔗 JAA Got it.
20:36 🔗 arkiver all items are taken already :P
20:36 🔗 JAA RIP kiska's target.
20:36 🔗 arkiver and Kaz
20:36 🔗 Kaz pfft, we're fine
20:36 🔗 JAA Ok, so I guess I just need to add the -new.jsonl from my second scan then.
20:37 🔗 arkiver JAA: I'd think so?
20:37 🔗 JAA Since 00+01+02 should be the same as my first scan?
20:37 🔗 mgrandi let me upload all of the files
20:38 🔗 arkiver JAA: 00+01+02 = https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl
20:38 🔗 JAA :-)
20:38 🔗 Doranwen has joined #archiveteam-bs
20:38 🔗 mgrandi (technically that is files 0 -> 4 but yeah, this is confusing)
20:38 🔗 JAA mgrandi: I think we're good now. 00 is what you've grabbed or are grabbing.
20:41 🔗 mgrandi wait, you guys have already taken all 20,000 remaining files?
20:41 🔗 yano yeah
20:41 🔗 mgrandi where were you yesterday D:
20:42 🔗 yano we fast
20:42 🔗 yano lol
20:42 🔗 arkiver JAA: do you have an ETA on the new lists?
20:42 🔗 arkiver else I might queue 00_url.txt ('the rest') as well
20:42 🔗 arkiver else as in if it takes a few more hours
20:43 🔗 JAA arkiver: Few minutes
20:43 🔗 arkiver nice
20:44 🔗 JAA arkiver: 03_url.txt is in now. Can you spot-check whether I derived it right?
20:46 🔗 mgrandi https://www.dropbox.com/s/ratsewjzpyu4kox/mgrandi_microsoft_dl_center_file_lists.zip?dl=0 these are my file lists
20:46 🔗 JAA Also, I'll run another scan up to 110k overnight at a lower rate to reduce the risk of missing stuff.
20:47 🔗 arkiver JAA: didn't check for duplicates, but looks fine!
20:47 🔗 arkiver shall I queue it?
20:48 🔗 arkiver queued it :P
20:48 🔗 JAA The new raw file is already deduped against -60000-sorted.jsonl, so there shouldn't be any dupes in it.
20:48 🔗 JAA :-)
20:49 🔗 arkiver and gone
20:50 🔗 JAA lol
20:51 🔗 Aoede that took, like, 30 seconds <.<
20:51 🔗 arkiver will queue 00_url.txt as well
20:52 🔗 yano tracker ded?
20:52 🔗 JAA By the way, I think it'd be nice to keep this project going in the future as well and regularly archive new downloads.
20:52 🔗 arkiver sure
20:53 🔗 Kaz yano: nah, it just handed out 17k jobs in the last 60 seconds
20:53 🔗 JAA Although I guess we can switch to warriorbot once that's up and running.
20:53 🔗 mgrandi again where were you all yesterday lol
20:53 🔗 mgrandi rip my wallet
20:53 🔗 yano ah
20:53 🔗 Kaz you missed the magic words, clearly
20:53 🔗 Kaz they're "new project, no limits"
20:54 🔗 yano hehe :D
20:54 🔗 yano those are the magic words
20:54 🔗 mgrandi well we didn't have a project set up
20:54 🔗 JAA Fusl Fusl Fusl Fusl Fusl Fusl Fusl yano Fusl Fusl Fusl Fusl :-)
20:54 🔗 yano if you want one Fusl-cane and a Yanado
20:54 🔗 mgrandi i'm brand new and didn't know how to set one up
20:55 🔗 arkiver can always ping me as well if you think we need one
20:55 🔗 JAA I still dislike the bus factor of 1 that is arkiver being the only person to have set up DPoS projects in years.
20:55 🔗 yano just click your heels three times and whisper "new project, no limits" 3 times and Fusl and me will show up
20:55 🔗 arkiver "bus factor"?
20:55 🔗 yano lol
20:55 🔗 arkiver oh wait
20:55 🔗 arkiver I remember bus factor
20:55 🔗 JAA https://en.wikipedia.org/wiki/Bus_factor
20:55 🔗 arkiver didnt they call it truck factor as well?
20:55 🔗 JAA "The "bus factor" is the minimum number of team members that have to suddenly disappear from a project before the project stalls due to lack of knowledgeable or competent personnel."
20:56 🔗 mgrandi would be nice to have some sort of automation to setting up some stuff like a target
20:56 🔗 arkiver it's called warriorbot :P
20:57 🔗 arkiver it'll be awesome if we can get running
20:57 🔗 arkiver can queue lists of 100 thousands of URLs to that
20:57 🔗 arkiver especially would have been nice for this
20:57 🔗 JAA Yeah, this would've been perfect.
20:57 🔗 mgrandi yes
20:57 🔗 mgrandi saves me from having to manually split lists and run wget manually in 3 byobu tabs
20:59 🔗 mgrandi how many computers does fusl have geez
20:59 🔗 JAA byobu *twitch*
20:59 🔗 Kaz https://usercontent.irccloud-cdn.com/file/HkdLkD13/image.png
20:59 🔗 arkiver mgrandi: Fusl is on hackint
20:59 🔗 Kaz fuck byobu
21:00 🔗 arkiver Kaz: nice painting skills
21:00 🔗 Kaz thanks, I've practiced from a very young age
21:00 🔗 Kaz I think it shows
21:00 🔗 arkiver definitely does
21:01 🔗 mgrandi isn't byobu just...tmux lol
21:01 🔗 JAA So why not just use plain tmux then?
21:01 🔗 mgrandi that is indeed a lot of ip addresses
21:01 🔗 mgrandi cause i'm basic
21:01 🔗 mgrandi and byobu hasn't done me dirty i guess
21:05 🔗 mgrandi anyway, we can talk later arkiver about once my files are done on what to do with them
21:07 🔗 BartoCH i have my 20 jobs done, but here i am, waiting for an rsync slot so that i can upload them. I feel like grandpa compared to fusl :D
21:07 🔗 Aoede i hate waiting in the rsync queue :c
21:07 🔗 mgrandi yeah, 4 got lost because apparently i ran out of space on kickthebucket and it never deleted the half downloaded file
21:08 🔗 Doranwen does anyone here know what's happening with marked1? his nick's been sitting in channels on Hackint but he's not responded to any communications since late May… we're needing to get a hold of him for the Yahoo Groups project and I was told maybe ask here and see if anyone knows
21:08 🔗 BartoCH Aoede: honestly, i don't know how fusl got so much bandwidth :O
21:08 🔗 arkiver Doranwen: what do you need him for?
21:09 🔗 arkiver and what is the channel you all are in?
21:09 🔗 Kaz BartoCH: it's all hetzner cloud
21:09 🔗 Doranwen we're in #yahoosucks - the fandom project is getting a set of all the GMD data and he's got some that isn't anywhere else now
21:10 🔗 Kaz what if we just nailed 4 boxes to the wall with uploads https://usercontent.irccloud-cdn.com/file/PAJkkzW6/image.png
21:10 🔗 Doranwen I've been transferring a lot of stuff back and forth with other project members but there's some sitting on his server that isn't accessible
21:13 🔗 mgrandi is that all microsoft dl center uploads?
21:14 🔗 Kaz 4G of it is, yep
21:20 🔗 JAA That third slower scan I mentioned earlier is now running.
21:21 🔗 JAA I expect that to take about 9 hours.
21:21 🔗 mgrandi but at least all of it is downloaded off of microsoft's site
21:29 🔗 arkiver JAA: nice, might be ready just in time
21:29 🔗 JAA And we're at 0 todo again. :-)
21:35 🔗 OrIdow6 JAA: Today is the removal date, so they may remove files (if that hasn't happened already), and you may see that on your scan
21:36 🔗 JAA I know. Nothing has been removed as of my second scan that finished about 2 hours ago.
21:37 🔗 OrIdow6 Not only would warriorbot "solve" the bus crash problem, it would be able to get rid of the delay that comes with having to have "replacable" people come to start a project
21:37 🔗 OrIdow6 Someone to set up workers (or switch the warrior HQ over to the project), someone to run a target, someone with permissions for the tracker, etc.
21:38 🔗 OrIdow6 Had they removed files for this at midnight, would have lost them because of this delay
21:38 🔗 OrIdow6 Obviously can't do this if the "project" isn't simple (list of files, in this case)
21:40 🔗 yano i'm sitting on so many things to upload but the targets are too busy :p
21:41 🔗 arkiver we're at 1 TB now!
21:41 🔗 arkiver yano: yep :P
21:41 🔗 arkiver yano: blame yourself and Fusl
21:41 🔗 arkiver :)
21:43 🔗 mgrandi yeah, that's why i started downloading it on my own, but i def couldn't get all of them by midnight
21:43 🔗 mgrandi but i also work at microsoft and they are washington state based, they are not gonna do something at midnight on a sunday xD
21:43 🔗 yano arkiver: lol
21:44 🔗 arkiver mgrandi: so we can bug you if we need insider help? :P
21:44 🔗 yano i have so many VMs running right now
21:44 🔗 yano lol
21:44 🔗 mgrandi possibly, i just started but i can probably ask
21:44 🔗 yano i told them all to stop gracefully; and they are all waiting to upload lol
21:45 🔗 mgrandi i have access to the entire MS directory and can just...message people on teams
21:49 🔗 arkiver nice
21:49 🔗 arkiver it's sometimes very hard to reach these companies
23:02 🔗 Arcorann has joined #archiveteam-bs
23:02 🔗 Arcorann has quit IRC (Remote host closed the connection)
23:03 🔗 Arcorann has joined #archiveteam-bs
23:20 🔗 mgrandi so, my 'file list section 00' is 100% done now, so that is the first 10,300 or so files
23:22 🔗 mgrandi file list section 2 is 5794/10,000 done, and file list section 1 is 9695/10,000 done
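The section splitting mgrandi describes (one ~10,000-line section per wget-at instance) can be sketched in Python. The file names and section size here are assumptions based on the chat, not the actual scripts used:

```python
# Split a large file list into fixed-size sections so several
# wget-at instances can each work through one section in parallel.
# "file_list_section_NN.txt" and size=10000 are illustrative names.

def split_list(lines, size):
    """Yield successive chunks of at most `size` lines each."""
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

def write_sections(path, size=10000):
    """Write numbered section files from the input list at `path`."""
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for n, chunk in enumerate(split_list(urls, size)):
        with open(f"file_list_section_{n:02d}.txt", "w") as out:
            out.write("\n".join(chunk) + "\n")
```

Each section file can then be fed to its own wget-at invocation.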
23:22 🔗 JAA Meanwhile the DPoS project is almost halfway done with the uploads.
23:23 🔗 mgrandi yeah, heh
23:23 🔗 mgrandi impressive we got all of them this fast
23:30 🔗 mgrandi did you manage to get the rest of the URLS past 60k @JAA
23:31 🔗 JAA mgrandi: Yes, I scanned up to 250k and found another 2.7k files. Link's somewhere above, search for 'jsonl' I guess.
23:31 🔗 mgrandi did that get added to the tracker and already retrieved? lol
23:32 🔗 JAA Yep
23:32 🔗 mgrandi geez
23:32 🔗 mgrandi thankfully they have no rate limiting or anything...
23:32 🔗 JAA All those items were gone in less than 30 seconds. lol
23:34 🔗 mgrandi how big was the data up to id 250k ?
23:34 🔗 SilSte has joined #archiveteam-bs
23:35 🔗 JAA That was the extra 1.65 TB on top of the 7.1.
23:35 🔗 mgrandi ah ok
23:35 🔗 mgrandi so the first 60k is the majority of the size
23:35 🔗 phuzion JAA: still need instances?
23:35 🔗 mgrandi no, lol
23:36 🔗 JAA mgrandi: Yes. But note that my scan is incomplete as explained above.
23:36 🔗 mgrandi we needed them sunday >:(
23:36 🔗 JAA phuzion: Nope, Fusl ate all the items.
23:36 🔗 mgrandi (the others hoovered up everything and i'm in process of finishing up)
23:37 🔗 phuzion JAA: alright. For future reference, you can @ me any time you think you might need a bunch of DO droplets spun up quickly. Only takes me about 5 minutes to re-adapt my ansible script for a new project, and spin up the droplets to do it.
23:38 🔗 mgrandi can you post that ansible script? :eyes:
23:38 🔗 mgrandi well the problem was we needed a project set up
23:38 🔗 mgrandi so even with a bunch of instances its a lot of manual fiddling
23:38 🔗 phuzion Ah, my script mostly solves the "we don't have enough workers" issue.
23:39 🔗 mgrandi but does it just start a bunch of warrior workers?
23:39 🔗 phuzion It starts a bunch of pipeline scripts, without the warrior overhead
23:39 🔗 mgrandi well yeah, i guess seesaw pipelines
23:39 🔗 mgrandi but the point is we didn't have a pipeline set up yesterday
23:40 🔗 mgrandi arkiver wrote one quickly this morning
23:40 🔗 phuzion mgrandi: https://github.com/phuzion/archiveteam-deploy/blob/master/setup.yml
23:40 🔗 phuzion it's super old and probably breaks a bunch of best practices, but I've successfully used it a few times to spin up like 50 droplets without a problem.
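phuzion's actual setup is the Ansible playbook linked above; the same "spin up N workers fast" idea can be sketched with DigitalOcean's doctl CLI. The region, size, image, worker names, and cloud-init file below are all placeholders:

```shell
# Spin up N droplets that each start a seesaw pipeline on boot via
# cloud-init. All names and parameters here are illustrative; the
# real deployment in this channel used phuzion's Ansible playbook.
spawn_workers() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        doctl compute droplet create "at-worker-$i" \
            --region nyc1 \
            --size s-1vcpu-2gb \
            --image ubuntu-20-04-x64 \
            --user-data-file pipeline-init.yml \
            --wait
        i=$((i + 1))
    done
}

# Example: spawn_workers 10
```

Tearing the fleet down again is one `doctl compute droplet delete` per worker once the tracker runs dry.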
23:40 🔗 mgrandi but since it was just me and jaa yesterday we were just like 'aaa' so i just started downloading them onto a large DO droplet volume
23:41 🔗 phuzion yeah my script is basically the solution to "Oh fuck this project isn't gonna succeed unless we get like 100 more instances downloading stat"
23:41 🔗 phuzion or "hey I have some DO credit I wanna piss away" lol
23:42 🔗 phuzion speaking of DO, I need to check my account, my invoice this month was more than I expected.
23:43 🔗 JAA I'm still getting false 404s at the low concurrency. :-(
23:43 🔗 mgrandi how are you downloading them? maybe it's still setting a cookie?
23:44 🔗 JAA I'm clearing cookies after each ID though.
23:45 🔗 JAA Actually, it can't be cookie-based since I'm sometimes seeing the false 404s on details.aspx but not on confirmation.aspx. I'm not clearing cookies between those requests.
23:45 🔗 mgrandi maybe the detail pages go away but not the confirmation or the files? :thinking:
23:46 🔗 JAA Nope, the pages exist. The server's just broken I think.
23:46 🔗 OrIdow6 My suspicion was that the cookie thing was something auth-related
23:46 🔗 JAA Maybe some DB query fails occasionally and that causes it to 404.
23:51 🔗 mgrandi do those urls work manually?
23:52 🔗 JAA Yes
23:52 🔗 JAA It's just random false 404s.
23:52 🔗 JAA Retrying the exact same request succeeds usually.
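JAA's workaround combines two things: fresh cookies for each ID and retrying the same request when the server returns one of its intermittent false 404s. A minimal standard-library sketch of that pattern, where the URL template, retry count, and delay are assumptions rather than JAA's actual scanner:

```python
# Retry intermittent false 404s from the Download Center and use a
# fresh cookie jar per ID. URL pattern and retry policy are
# illustrative only.
import time
import urllib.error
import urllib.request
from http.cookiejar import CookieJar

DETAILS_URL = "https://www.microsoft.com/en-us/download/details.aspx?id={}"

def get_status(opener, url):
    """Return the HTTP status for `url` (404 instead of raising)."""
    try:
        with opener.open(url) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

def fetch_status(item_id, opener, retries=3, delay=1.0, get=None):
    """Retry while the server returns a (possibly false) 404."""
    get = get or get_status
    url = DETAILS_URL.format(item_id)
    status = get(opener, url)
    for _ in range(retries):
        if status != 404:
            break
        time.sleep(delay)  # the same request usually succeeds on retry
        status = get(opener, url)
    return status

def scan(ids):
    for item_id in ids:
        # Fresh cookie jar per ID, mirroring the cookie clearing above.
        opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(CookieJar()))
        yield item_id, fetch_status(item_id, opener)
```

The `get` parameter only exists so the retry logic can be exercised without hitting the network.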
23:56 🔗 BlueMax has joined #archiveteam-bs
