[00:06] *** Gallifrey has joined #archiveteam-bs
[00:10] *** chirlu has quit IRC (Quit: Bye)
[00:16] Has anyone looked into https://www.bleepingcomputer.com/news/microsoft/microsoft-to-remove-all-windows-downloads-signed-with-sha-1/ ?
[00:18] Not here
[00:19] *** Arcorann has joined #archiveteam-bs
[00:20] *** Arcorann has quit IRC (Remote host closed the connection)
[00:20] *** Arcorann has joined #archiveteam-bs
[00:32] *** systwi_ is now known as systwi
[02:19] *** VADemon has quit IRC (left4dead)
[02:30] *** HP_Archiv has quit IRC (Quit: Leaving)
[02:32] *** HP_Archiv has joined #archiveteam-bs
[02:42] *** SmileyG has quit IRC (Read error: Operation timed out)
[02:42] *** Smiley has joined #archiveteam-bs
[03:24] *** VADemon has joined #archiveteam-bs
[03:40] *** qw3rty_ has joined #archiveteam-bs
[03:47] *** qw3rty__ has quit IRC (Read error: Operation timed out)
[04:28] *** HP_Archiv has quit IRC (Quit: Leaving)
[05:34] *** Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
[05:48] *** mgrandi has joined #archiveteam-bs
[05:49] has anyone looked into saving SHA1-signed stuff from microsoft's download center?
[05:51] oof, it's happening august 3rd
[05:52] *** bsmith093 has quit IRC (Read error: Operation timed out)
[05:56] https://www.zdnet.com/google-amp/article/microsoft-to-remove-all-sha-1-windows-downloads-next-week/
[05:57] I looked *at* it for about 5 minutes
[06:01] Looking at it again...
[06:02] *** jmtd is now known as Jon
[06:03] It's hard to tell where the Microsoft Download Center ends and other things begin. They use the microsoft.com-wide search function to list downloads (and it only goes up to page 126, so creativity is needed with the various parameters); and is anything at update.microsoft.com being removed?
[06:05] And is there any way to tell what's using sha1 besides downloading the file, figuring out what format it's in (exe, msi), and extracting the information in a format-specific way?
[06:08] *** bsmith093 has joined #archiveteam-bs
[06:12] Looks like it's divided into (what I will call) items, e.g. https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658 - I give that one as an example because it contains multiple files
[06:12] The site started giving me 400s for everything at one point; going away and clearing cookies worked in that case
[06:13] They're using radio buttons where checkboxes should be used, and using JS to make them behave like checkboxes
[06:16] wget 403s, but UA of "abc" (and presumably most other things) work
[06:16] *makes it 403
[06:18] Looks like enumerating and downloading them should be straightforward (assuming they don't block) - can't say the same about figuring out what uses sha1 and playback
[06:18] Maybe it's best to get the whole thing anyhow, if it's not too big (hopefully)
[06:19] Obviously the rest of the site is going to go down some day
[06:19] *sha1, and
[06:58] *** Craigle has joined #archiveteam-bs
[07:11] *** auror__ has joined #archiveteam-bs
[07:17] *** mgrandi has quit IRC (Read error: Operation timed out)
[07:22] *** HP_Archiv has joined #archiveteam-bs
[07:24] *** VADemon_ has joined #archiveteam-bs
[07:26] *** Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
[07:28] *** VADemon has quit IRC (Ping timeout: 492 seconds)
[07:39] *** auror__ is now known as mgrandi
[07:39] do you see an easy way to get a list of urls?
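A minimal sketch of what probing a single Download Center item could look like, based on the observations above ([06:12] on cookies, [06:16] on the User-Agent) and the ID-enumeration idea suggested just below at [07:41]. The "no longer available" marker text and the choice of User-Agent are assumptions, not verified values:

```python
# Hypothetical probe of one Download Center item. A non-default User-Agent is
# used because stock wget reportedly gets 403'd ([06:16]); a fresh session per
# item avoids the cookie weirdness mentioned at [06:12] and [07:50].
import requests

CONFIRM = "https://www.microsoft.com/en-us/download/confirmation.aspx?id={id}"
HEADERS = {"User-Agent": "abc"}  # anything other than the default wget UA

def probe(item_id):
    with requests.Session() as s:
        r = s.get(CONFIRM.format(id=item_id), headers=HEADERS, timeout=60)
        if r.status_code != 200 or "no longer available" in r.text:
            return None
        return r.text

if __name__ == "__main__":
    print("found item" if probe(41658) else "nothing there")
```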
[07:40] i also can't see what things are SHA1 to make it easier
[07:40] without downloading everything and checking
[07:41] mgrandi: Just go through all the numerical IDs
[07:41] ah. well, thats easy lol
[07:41] also, that link you posted, it seems to be 1 file but then it has like 'popular downloads' underneath it?
[07:43] See "click here to download manually" if it doesn't get all 3 automatically
[07:44] An msi and 2 pdfs
[07:46] for me its just auto prompting a download for a .exe
[07:47] wait no i must have been on a different page
[07:47] ok yeah i see it
[07:49] so it seems like if it has multiple files it has an element with a class of `multifile-failover-list`
[07:50] wait, did that link suddenly stop working for you?
[07:50] i'm getting "item is no longer available"
[07:54] No
[07:54] Try clearing your cookies
[07:54] ok, its weird, yeah it must be like using the cookies to try and see if you downloaded it recently
[07:56] hmm, should i try just writing a python script that iterates over all 50000 ids and gets the links and then downloads them?
[07:56] i don't even think archiveteam can host these probably, but at least someone will have a copy
[07:56] archive.org *
[07:59] There's a lot of old software hosted by the IA, even though it's technically in-copyright
[08:00] These are free downloads, of security updates and other "technical" files, with no ads, that are being removed permanently; I would think (though you never know) that they're fairly "safe"
[08:00] SwiftOnSecurity on twitter brought up the point that a lot of these are needed even by modern software
[08:01] stuff like the VC2XXX c++ redist packages and stuff
[08:01] Are you talking about something else when you say that the IA has an inability to host them?
[08:01] And I don't know who that is
[08:01] yeah i meant that they are in copyright
[08:01] slash by a company that would probably DMCA them being on the IA
[08:02] uhhh, some twitter half taylor swift parody account slash technology/infosec account
[08:03] "In 2017, Microsoft removed downloads for Movie Maker.
[08:03] What resulted was years of customers looking for the file being infected and scammed by malware."
[08:04] (whoops, didn't mean to copy the newline) https://www.welivesecurity.com/2017/11/09/eset-detected-windows-movie-maker-scam-2017/
[08:04] You're preaching to the choir here with the public access thing
[08:04] yeah, heh
[08:05] so if you aren't working on anything i'll probably see if i can just whip up a quick script to download the files, if there are no complications, i don't think its worth getting WARCs of the entire pages
[08:06] You might want to get warcs of the description pages, at least, to get all the metadata
[08:08] And I know it's somewhat contrary to the orthodoxy, but I think these would better be in the form of individual IA items rather than in warcs locked behind playback problems
[08:09] As they are practically already separated like that
[08:09] i have never done anything like this before so i assume i'll be making them per item, i dunno what is the best practice
[08:09] what is the best thing that has warc integration? would it be manually downloading them with python requests into a warc file or using something like wpull?
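For the [08:09] question, this is roughly what the requests-into-a-WARC option looks like using the warcio helper that gets linked just below at [08:11]. The output filename and example ID are arbitrary choices:

```python
# Sketch of warcio's capture_http pattern (per the quick-start linked at
# [08:11]): every request made inside the block is recorded into the WARC.
from warcio.capture_http import capture_http
import requests  # note: must be imported after capture_http for capture to work

ids = [41658]  # arbitrary example; a real run would iterate over all item IDs

with capture_http("msdl-confirmation-pages.warc.gz"):
    for item_id in ids:
        requests.get(
            f"https://www.microsoft.com/en-us/download/confirmation.aspx?id={item_id}",
            headers={"User-Agent": "abc"},
        )
```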
[08:10] First things first, get the data, seeing that, at minimum, it may be removed in 16 hours
[08:10] (i have experience with page scraping and all that but not with generating warcs)
[08:11] https://github.com/webrecorder/warcio#quick-start-to-writing-a-warc - easy way to write to warc when using requests
[08:12] *** Craigle has joined #archiveteam-bs
[08:13] thank goodness someone has that
[08:13] Though you could also make a list of the confirm pages, wpull them all, extract it after the fact from there, and then get the URLs of the downloads from that; or any number of other things; but this looks like it would disrupt your current idea the least
[08:13] "This" being the thing I linked (thanks J A A)
[08:14] wpull would probably be easiest yeah
[08:14] if i can just figure out what urls to get and tell it to not go off recursively on some other microsoft site
[08:17] You could whitelist instead of blacklist
[08:18] yeah
[08:20] *** Laverne has quit IRC (Ping timeout: 272 seconds)
[08:21] *** Aoede has quit IRC (Ping timeout: 272 seconds)
[08:21] *** brayden has quit IRC (Ping timeout: 272 seconds)
[08:22] *** mgrytbak has quit IRC (Ping timeout: 272 seconds)
[08:35] ok, i will work on this when i get up
[08:35] *** mgrandi has quit IRC (Leaving)
[08:37] *** i0npulse has quit IRC (Quit: leaving)
[09:06] *** i0npulse has joined #archiveteam-bs
[09:10] *** jshoard has joined #archiveteam-bs
[09:22] *** Raccoon has quit IRC (Ping timeout: 745 seconds)
[09:25] *** Aoede has joined #archiveteam-bs
[09:25] *** Laverne has joined #archiveteam-bs
[09:26] *** brayden has joined #archiveteam-bs
[09:31] *** OrIdow6 has quit IRC (Ping timeout: 265 seconds)
[09:33] *** OrIdow6 has joined #archiveteam-bs
[09:34] *** mgrytbak has joined #archiveteam-bs
[09:39] *** VADemon_ has quit IRC (Read error: Connection reset by peer)
[10:13] *** BartoCH has quit IRC (Quit: WeeChat 2.9)
[10:13] *** BartoCH has joined #archiveteam-bs
[10:32] *** jshoard has quit IRC (Read error: Operation timed out)
[10:59] *** HP_Archiv has quit IRC (Quit: Leaving)
[12:19] *** jshoard has joined #archiveteam-bs
[13:34] *** BlueMax has quit IRC (Quit: Leaving)
[14:24] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[14:30] *** Gallifrey has joined #archiveteam-bs
[14:34] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[14:35] *** Gallifrey has joined #archiveteam-bs
[14:36] *** Ravenloft has joined #archiveteam-bs
[14:48] I feel like we should "simply" continuously mirror all downloads Microsoft makes available at this point.
[15:03] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[15:05] *** Gallifrey has joined #archiveteam-bs
[15:07] Oh yeah, my listing of Clutch's S3 finished and discovered some 33M files totalling 188 TB.
[15:13] *** schbirid has joined #archiveteam-bs
[15:13] The video counts are ... interesting.
[15:13] High-resolution videos have no suffix on the filename after the SHA-1. There are 5800580 of them totalling 97.6 TB.
[15:14] Watermarked videos (*-watermarked.mp4): 4855831 files, 59.5 TB
[15:14] Low-resolution videos (*-480.mp4): 5816979 files, 29.5 TB
[15:27] *** godane has quit IRC (Read error: Connection reset by peer)
[15:27] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[15:44] *** fredgido_ has joined #archiveteam-bs
[15:46] *** fredgido has quit IRC (Ping timeout: 622 seconds)
[15:49] *** godane has joined #archiveteam-bs
[15:53] *** schbirid has quit IRC (Quit: Leaving)
[15:54] *** Raccoon has joined #archiveteam-bs
[16:06] *** Ctrl has quit IRC (Read error: Operation timed out)
[16:27] *** RichardG has quit IRC (Keyboard not found, press F1 to continue)
[16:30] *** RichardG has joined #archiveteam-bs
[16:51] *** prq has quit IRC (Remote host closed the connection)
[17:01] My discovery on the API is running now. I'm simply iterating over the recent posts endpoint and extracting posts and users with a bunch of interesting metadata. The API is slooooow though, so that might take a bit.
[17:24] OrIdow6: So are you grabbing the Microsoft downloads?
[17:25] Er, mgrandi I guess.
[17:52] I'm running something on it now.
[18:06] *** fivechan_ has joined #archiveteam-bs
[18:08] I have a question. If I have WARC files and upload them to InternetArchive with keyword "archiveteam", will they show up in Wayback Machine?
[18:10] fivechan_: No, they won't. Only WARCs from trusted accounts are included in the Wayback Machine.
[18:14] *** Mateon1 has quit IRC (Ping timeout: 260 seconds)
[18:15] To show web page in Wayback, I must ask archive team to archive team?
[18:16] I must ask archive team to archive them?
[18:18] Yes, or use the Wayback Machine's save tool.
[18:19] Thank you!! I understood.
[18:24] *** fivechan_ has quit IRC (Ping timeout: 252 seconds)
[18:25] Turns out that Microsoft doesn't really like it when their Download Center gets hammered with requests. 403 pretty quickly.
[18:25] To be fair, I sent 100+ req/s at times. :-)
[18:26] *** mgrandi has joined #archiveteam-bs
[18:27] *** Mateon1 has joined #archiveteam-bs
[18:31] Hey mgrandi. In case you didn't check the logs, I've been looking into Microsoft's downloads a bit.
[18:39] My Clutch discovery is at 2020-07-21 after 1.5 hours. Yeah, this is going to take a while.
[18:42] ah so you are already working on it?
[18:43] or are you just getting URLs
[18:45] The latter. Trying to, anyway.
[18:46] I'm investigating Clutch's cursor format to see if I can speed this up a bit.
[18:46] hmm, not familiar with Clutch
[18:48] Two separate things I'm working on at the moment.
[18:51] ok, let me know if you need help
[18:53] i was personally just gonna iterate over 1->50000 and see if a page has anything vs a 404 and then download it there
[18:53] Well, maybe you have an idea, so here it goes: I'm iterating over https://clutch.win/v1/posts/recent/ . The next page is https://clutch.win/v1/posts/recent/?cursor= . I'm trying to figure out how to construct a cursor value to start at a particular point in time. The cursors are opaque though and have a weird format.
[18:54] E.g. CksKFwoKY3JlYXRlZF9hdBIJCN6b6YuB_eoCEixqCXN-ZnR3LXV0bHIfCxIEdXNlchiAgICp8PqGCQwLEgRjbGlwGKzBzdIIDBgAIAE= which decodes to b'\nK\n\x17\n\ncreated_at\x12\t\x08\xde\x9b\xe9\x8b\x81\xfd\xea\x02\x12,j\ts~ftw-utlr\x1f\x0b\x12\x04user\x18\x80\x80\x80\xa9\xf0\xfa\x86\t\x0c\x0b\x12\x04clip\x18\xac\xc1\xcd\xd2\x08\x0c\x18\x00 \x01' (in Python notation).
[18:54] The created_at part controls the time axis, but I can't figure out what the rest is.
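The pagination loop described at [18:53] presumably looks something like the sketch below. Only the endpoint and the ?cursor= parameter are confirmed in the log; the JSON field names ("posts", "cursor") are guesses for illustration:

```python
# Rough shape of the cursor-following discovery loop from [18:53]. The
# response field names ("posts", "cursor") are assumptions; only the endpoint
# and the ?cursor= parameter appear in the log.
import requests

BASE = "https://clutch.win/v1/posts/recent/"

def iterate_recent():
    cursor = None
    while True:
        params = {"cursor": cursor} if cursor else {}
        data = requests.get(BASE, params=params, timeout=60).json()
        yield from data.get("posts", [])  # assumed field name
        cursor = data.get("cursor")       # assumed field name
        if not cursor:
            break
```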
[18:56] oh clutch is a website, i was mentioning my idea for microsoft download center heh
[18:57] the cursors are probably dynamic, as in it represents a view of the database at that point in time
[18:57] so it does a query, stores it, and then cursors iterate over it so its not constantly changing while you are iterating over it
[18:57] you probably need to just iterate over the cursor values it gives you until you reach the end, i don't think you would be able to create one dynamically
[18:58] Possibly, but often cursors work as opaque identifiers similar to "before X time/DB ID".
[18:58] hmm, is that a pickled object?
[19:00] I can just iterate over it until done, but it's slow, so I'm trying to slice it into chunks of e.g. one day to process those in parallel.
[19:00] it looks like its a serialized object format of some kind
[19:01] Yeah
[19:01] it has `created_at`, `clip`, and `ftw-utl`, `user` fields in it
[19:01] but yeah, are you working on archiving the microsoft download center stuff? or should i still work on that
[19:02] I'm about 2/3 done enumerating the downloads now.
[19:03] ok cool, how did you do it? just wpull over the urls with item 0->50000?
[19:03] Just retrieving details.aspx and confirmation.aspx, extracting file URLs and sizes from the latter.
[19:03] qwarc for IDs 1 to 60k.
[19:03] k. are you actually downloading the files yet?
[19:04] Nope
[19:04] those are probably the biggest ones
[19:04] Yeah, definitely. Just wanted to collect the URLs and get a size estimate first.
[19:04] i don't think microsoft is gonna nuke these from history, i wonder if they are gonna put them back up.
[19:04] By the way, I saw occasional weird things where the details.aspx page would work but confirmation.aspx would redirect to the 404 page.
[19:04] I hope it doesn't happen the other way around...
[19:04] full disclosure, i just started working for microsoft, but not on any team that deals with that
[19:05] did you see where you have to like clear your cookies?
[19:05] that was happening to me
[19:06] Yeah, I'm clearing them after every ID because why not.
[19:06] (not sure why it does that, seems weird)
[19:06] also not ALL of the downloads are going away, just SHA1 ones
[19:06] although i have no idea if there is a way to tell which ones are going away without...downloading them first
[19:08] Yeah, exactly.
[19:08] Hence why I suggested just grabbing everything and also doing so continuously in the future.
[19:11] Retrieval is done, just need to fix that one weird 404 now.
[19:12] i still have my box with a 500gb hard drive for kickthebucket if you want me to download the actual files
[19:12] It's not reproducible by the way, just seems to happen under load or something like that.
[19:15] hmm
[19:15] also, is that base64 string you posted complete? or did you leave off a few = signs
[19:16] Nope, that's complete.
[19:16] is it base64?
[19:17] Microsoft Download Center: I found 51298 files with a total size of about 7.1 TB.
[19:17] it's saying it's not divisible by 4 characters
[19:17] @JAA hmm, thats a bit big
[19:18] Hmm, yeah, odd. Python's base64.urlsafe_b64decode doesn't have any issues with it though.
[19:19] oh its url safe, duh
[19:23] The highest ID I found was 58507, by the way, which was uploaded ... a year ago (assuming weird US date format)?
[19:23] https://www.microsoft.com/en-us/download/details.aspx?id=58507
[19:23] It's protobuf
[19:24] I don't like protobuf.
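For reference, the decode from [18:54] takes only a couple of lines; the `-` and `_` characters just mean the cursor uses the URL-safe base64 alphabet ([19:19]), and the resulting bytes are a serialized protobuf message ([19:23]) that could be inspected with something like `protoc --decode_raw` even without the .proto file:

```python
# Decoding the opaque Clutch cursor from [18:54]. urlsafe_b64decode handles
# the '-' and '_' characters of the URL-safe alphabet; the output is raw
# protobuf, which (lacking the .proto definition) can still be poked at with
# e.g. `protoc --decode_raw`.
import base64

cursor = "CksKFwoKY3JlYXRlZF9hdBIJCN6b6YuB_eoCEixqCXN-ZnR3LXV0bHIfCxIEdXNlchiAgICp8PqGCQwLEgRjbGlwGKzBzdIIDBgAIAE="
print(base64.urlsafe_b64decode(cursor))
```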
[19:24] oof, protobuf is not great if we dont have the proto file
[19:25] Yeah
[19:25] but i assume you just need to figure out the created_at
[19:25] which is probably one of the integer types
[19:25] It seems that the other values also influence the results, sadly.
[19:26] This is getting messy here, discussing the Microsoft Download Center and Clutch at the same time.
[19:27] Yeah
[19:27] Let's focus on Microsoft first since it has such a short deadline.
[19:27] Surely the Download Center is still in use, right? Any ideas why the highest ID is a year old?
[19:28] maybe they migrated to other things?
[19:28] I noticed some big gaps in the IDs in some places though.
[19:28] There's almost nothing between 31k and 34k for example, just a few files.
[19:28] like i know that visual studio stuff has its own download page now instead of the download center
[19:31] Well, apparently the IDs aren't sequential *at all*.
[19:31] https://www.microsoft.com/en-us/download/details.aspx?id=1230 is from December...
[19:31] lol wut
[19:32] Damn, IDs go much higher also: https://www.microsoft.com/en-us/download/details.aspx?id=100688
[19:32] so, how do EXEs work? are the signing certificates at the very start?
[19:32] if so we could like...download 32kb of the file, check the cert and see if its a SHA1 cert or something?
[19:33] But 1230 is apparently a security patch from 2010 (https://support.microsoft.com/en-us/help/2345000/ms10-079-description-of-the-security-update-for-word-2010-october-12-2)
[19:34] Huh
[19:34] So the 'Date Published' is completely unreliable.
[19:34] cause while everything should be archived eventually, only the SHA1 stuff is getting removed soon
[19:38] Yeah, but I'm also seeing .bin, .msi, .zip, even .tar.gz...
[19:39] .msu
[19:39] .msp
[19:39] etc.
[19:40] If you can figure something out to selectively archive those, that's great. Otherwise, we should just grab everything.
[19:41] well, i assume the associated files with the SHA1 downloads will be removed
[19:42] but if the cert is in a predictable spot, my strategy would be: for every page, download some amount of data for each EXE, see if its signed with a SHA1 cert, if it is, download everything for that 'item', else skip it
[19:42] There are more .msu than .exe.
[19:42] what is a .msu?
[19:43] I have no idea. 'Microsoft Update' maybe?
[19:43] http://fileformats.archiveteam.org/wiki/Microsoft_Update_Standalone_Package
[19:43] This fucking mess is precisely why I left the Windows world years ago. lol
[19:43] *** Raccoon has quit IRC (Ping timeout: 610 seconds)
[19:43] not sure why its listed under EA files, heh
[19:44] well to be fair, they added these so these are not direct executables so they are a bit safer than just EXE files
[19:44] are MSU files signed?
[19:44] do you have an example download link? i'll check it out
[19:44] First one my scan found: https://download.microsoft.com/download/0/B/8/0B8852B8-8A3A-4A70-97CE-A84B5F4C5FC8/IE9-Windows6.0-KB2618444-x86.msu from ID 28401.
[19:45] yeah, that actually doesn't run cause windows 10 doesn't accept SHA1 certs anymore
[19:46] so they are Cabinet files (mszip i guess?)
[19:47] I found two different search interfaces for the Download Center, and they both suck.
[19:47] https://www.microsoft.com/en-us/search/downloadresults?FORM=DLC&ftapplicableproducts=^AllDownloads&sortby=+weight returns only 1000 results.
[19:47] https://www.microsoft.com/en-us/download/search.aspx is just broken.
[19:48] Sometimes returns the same results as you go through the pagination etc.
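On the [19:32] question about where the signature lives: in a PE file the Authenticode blob is pointed to by the "security" data directory and in practice sits at the end of the file, so grabbing only the first 32kb wouldn't reach it. A sketch using pefile to locate it (the filename is a placeholder, and actually checking whether the digest is SHA-1 would additionally require parsing the PKCS#7 blob, which this doesn't do):

```python
# Locate the Authenticode signature blob in a signed .exe via the PE
# "security" data directory. For this directory, VirtualAddress is a plain
# file offset, which is typically near the end of the file.
import pefile

def signature_location(path):
    pe = pefile.PE(path, fast_load=True)
    sec = pe.OPTIONAL_HEADER.DATA_DIRECTORY[
        pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_SECURITY"]
    ]
    return sec.VirtualAddress, sec.Size  # (0, 0) means the file isn't signed

if __name__ == "__main__":
    offset, size = signature_location("some-download.exe")  # placeholder name
    print(f"signature blob at offset {offset}, {size} bytes")
```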
[19:48] Trying to establish the upper bound for the IDs.
[19:48] and the cert seems to be at the end of the MSU file
[19:49] I'm going to guess that generally, the lower the ID, the older it is, and the more likely it is to use sha1
[19:50] that probably seems like a safe assumption
[19:50] Depending on how slowly that search goes until it reaches the present (~11 hours left until midnight Pacific), it might be useful to start downloading before it finishes
[19:51] microsoft is US based, i'm not sure if they are gonna nuke it right at midnight on a sunday, so hopefully we have a bit more time
[19:51] but yeah, probably. how should we handle...the data? are we allowed to upload these to archive.org?
[19:52] https://transfer.notkiska.pw/Kwk8n/microsoft-download-center-files-below-id-60000
[19:56] Actually, let me do that differently.
[20:01] so we gonna split it up and curl our way to victory?
[20:04] I like the idea of "mirror everything as individual IA items"
[20:04] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[20:06] *** wyatt8740 has joined #archiveteam-bs
[20:14] That'd actually be nice, yeah. With full metadata etc.
[20:14] But for now, we just need to grab everything we can.
[20:14] Download as WARCs, further processing later.
[20:16] yeah, but how are we gonna download them
[20:16] is it easy to set up a tracker thingy like kickthebucket?
[20:44] *** jshoard has quit IRC (Quit: Leaving)
[20:45] https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl
[20:54] nice
[20:54] so how do we coordinate the downloads
[20:58] Here are the ten most frequent file extensions: 13780 msu, 13111 exe, 6812 zip, 3927 msi, 3770 pdf, 2228 pptx, 1214 docx, 888 bin, 828 doc, 483 xps
[21:03] so the ID corresponds to what page its on?
[21:04] That's the 'id' parameter from the URLs.
[21:04] yeah so if an 'item' has multiple downloads then they have different IDs
[21:05] Uh
[21:05] If an entry on the Download Center has multiple files, those all have the same ID in my list.
[21:05] E.g. https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658 -> three entries with ID 41658.
[21:06] ok
[21:06] thanks for setting this up :)
[21:06] Further statistics: top ten by size in GiB: 4100.1 .zip, 864.1 .exe, 597.3 .bin, 355.1 .iso, 182.3 .rar, 149.6 .cab, 118.1 .msi, 90.5 .msu, 17.6 .wmv, 16.7 .ISO
[21:09] *** Mateon1 has quit IRC (Remote host closed the connection)
[21:09] I don't have a good idea how to do the actual retrieval though. Warriorbot isn't ready yet I think. :-/
[21:10] *** Mateon1 has joined #archiveteam-bs
[21:10] is that the thing that sets up a 'warrior' project?
[21:11] No, it's a distributed ("warrior") project that simply retrieves lists of URLs.
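The per-extension statistics at [20:58] and [21:06] can be reproduced from the sorted .jsonl list ([20:45]) with something like the following; the "url" and "size" field names are assumptions, since the list's actual schema isn't shown in the log:

```python
# Tally file counts and total sizes per extension from the .jsonl file list
# ([20:45]). The "url" and "size" field names are assumed, not confirmed.
import json
import os
from collections import Counter
from urllib.parse import urlparse

counts, sizes = Counter(), Counter()
with open("microsoft-download-center-files-below-id-60000-sorted.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        ext = os.path.splitext(urlparse(entry["url"]).path)[1].lstrip(".")
        counts[ext] += 1
        sizes[ext] += entry.get("size", 0)

for ext, n in counts.most_common(10):
    print(f"{n:6d} {ext:6s} {sizes[ext] / 2**30:10.1f} GiB")
```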
[21:12] Assuming everyone necessary is here, we could set up a quick warrior project
[21:12] setting up a warrior project pipeline seems easy right
[21:12] like you don't even need a lua script right, no recursion necessary or allow/deny list of urls
[21:12] Seems like a lot of overhead considering this is basically just "wget --input-file" at scale
[21:12] though
[21:12] Yeah
[21:12] *** Gallifrey has quit IRC (Ping timeout: 265 seconds)
[21:13] (Yeah, a lot of overhead)
[21:13] but given the space requirements it might be good, because at least the rsync upload has the nice property of failing/retrying endlessly until the rsync target frees up space
[21:13] i assume thats what 'warriorbot' is meant to fix, to have a premade warrior project for just lists of urls
[21:13] Yes
[21:13] I'm more concerned about getting the infrastructure set up
[21:14] Viz. temporary storage (target or similar)
[21:14] yeah, the infrastructure of offloading the data is gonna be the hard part
[21:15] In other words... anyone have 17TB free?
[21:15] *** Gallifrey has joined #archiveteam-bs
[21:15] 17?
[21:16] 7.1
[21:16] :-)
[21:16] Transposed the digits
[21:20] uhh
[21:21] i have 500 gb + 100 + 100 on my boxes i was using for kick the bucket
[21:21] we can possibly alleviate some of it if we upload to archive.org like a normal warrior project right?
[21:22] If S3 is feeling nice today
[21:23] (Narrator: It wasn't.)
[21:25] 17 tb is $170 a month on digital ocean which isn't terrible
[21:26] oh no, left off a 0, never mind, it doesn't even support that lol
[21:28] but as long as i'm not paying for this for a full month i probably could swing enough volumes to do 17 tb
[21:30] How's network/transfer pricing?
[21:30] inbound is free
[21:31] yeah the outbound is what gets you
[21:31] Droplets include free outbound data transfer, starting at 1,000 GiB/month for the smallest plan. Excess data transfer is billed at $0.01/GiB. For example, the cost of 1,000 GiB of overage is $10. Inbound bandwidth to Droplets is always free.
[21:32] so $170, minus the free outbound bandwidth i get for my droplets which i think is 3tb for 3 droplets i have running now
[21:32] again, not terrible, but if someone else has a cheaper option =P
[21:34] It's 7.1 TB, not 17 TB.
[21:34] now i'm doing it xD
[21:35] so even better then
[21:35] any other ideas before i pull the lever?
[21:37] (brb like 40 minutes)
[21:47] buyvm and some other VPS providers have unmetered bandwidth
[21:52] $30/mo VPS + $10/mo for 2TB (1TB/$5) attached block storage might work
[21:53] scratch that... they're all out of stock....
[22:37] well guess i'm buying the 7tb volume then
[22:38] what is the format that wget/curl takes for a file list @OrIdow6
[22:38] mgrandi: Please write WARCs, not plain files.
[22:39] *** Gallifrey has quit IRC (Ping timeout: 265 seconds)
[22:39] wget/wpull has --input-file or -i for that.
[22:40] so what tool should i use?
[22:40] i have wget-at for the kickthebucket archive
[22:40] Yeah, wget-at seems good.
[22:40] it will take a jsonl file?
[22:40] Nope
[22:41] Plain lines of URLs
[22:41] ok
[22:41] do you have a convenient list of urls or do you want me to make one?
[22:42] I don't, but I can easily make one.
[22:42] if you can make it easily that would be good
[22:44] i'll do a 7.5 tb volume
[22:45] https://transfer.notkiska.pw/3SHDe/microsoft-download-center-files-below-id-60000-sorted-urls
[22:45] *** Gallifrey has joined #archiveteam-bs
[22:47] do we have a way of exfilling these files to somewhere else?
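Converting the .jsonl list into the plain-lines-of-URLs format that wget-at's --input-file expects ([22:41]) is a couple of lines; as before, the "url" field name is an assumption about the list's schema, and the output filename is arbitrary:

```python
# Flatten the .jsonl file list into plain URL lines for wget-at's -i /
# --input-file ([22:41]). The "url" field name is assumed.
import json

with open("microsoft-download-center-files-below-id-60000-sorted.jsonl") as src, \
        open("msdl-urls.txt", "w") as dst:
    for line in src:
        dst.write(json.loads(line)["url"] + "\n")
```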
[22:47] thats a big number for the monthly cost that i'd rather not pay lol
[22:48] I don't have any free storage at the moment I'm afraid.
[22:50] wait, i have 15tb at home, but will cox hate me
[22:51] Maybe SketchCow can set you up with space on FOS, although probably not the whole thing at once.
[22:51] i think i'll start with this, but its gonna cost $24/day
[22:52] and helps since its a commercial data center without residential ISP limits
[22:52] Or upload to IA as you grab.
[22:52] Make sure to split it up, instead of getting one huge warc
[22:52] Yeah
[22:52] so does anyone know the wget-at args to do that?
[22:52] I tend to do 5 GiB WARCs.
[22:52] or just partition the file into chunks
[22:54] ArchiveBot's wpull options are a good starting point: https://github.com/ArchiveTeam/ArchiveBot/blob/3585ed999010665a7b367e37fd6f325f30a23983/pipeline/archivebot/seesaw/wpull.py#L12
[22:54] But wpull isn't fully compatible with wget.
[22:56] Or the DPoS project code repos, e.g. https://github.com/ArchiveTeam/mercurial-grab/blob/20b40049911bb721603de491d4e8a3aa5c4d3a81/pipeline.py#L173
[22:56] --warc-max-size to get multiple WARCs instead of one huge file.
[22:59] ok, yeah let me craft one based on that one
[23:00] Another important one is --delete-after so the plain file isn't kept after download.
[23:02] so --output-document outputs to a temp file, it writes it to a WARC, and then --delete-after deletes the temp file?
[23:04] JAA: Is that list in any particular order?
[23:04] Yeah, something like that. I don't know what the exact data flow is in wget though. I think it writes it to the WARC immediately as the data is retrieved, not from the temp file.
[23:04] OrIdow6: Yes, sorted by ID.
[23:05] so do i need --output-document?
[23:05] JAA: Good, that's what I was going to ask about
[23:05] or does wget-at need to write to something
[23:06] IIRC output-file is only useful when dealing with a single file
[23:06] mgrandi: Not entirely sure to be honest. I'd include it though to be safe. Might have something to do with not creating directory structures or dealing with filenames containing odd characters.
[23:07] i'll include it anyway to be safe
[23:08] *output-document (output-file sets the logfile location)
[23:09] By the way, ~16 hours at 1 Gb/s to retrieve it all.
[23:10] Worst case is that it goes down at midnight automatically - not enough
[23:10] Though I don't know the speed of whatever's downloading it
[23:12] its in digitalocean so it should be pretty fast
[23:12] so how do the warc file names impact the split on size?
[23:13] Maybe split the list up, in case more people want to start downloading?
[23:14] *** Arcorann has joined #archiveteam-bs
[23:14] I've never actually used wget(-lua/at) directly myself, but at least in wpull, --warc-file sets the filename prefix when --warc-max-size is used. `--warc-file foo --warc-max-size 1234` would produce foo-00000.warc.gz, foo-00001.warc.gz, etc., each "about" 1234 bytes (in wpull, the split happens as soon as possible after reaching that size).
[23:14] ok
[23:16] What what
[23:17] SketchCow: Microsoft deleting SHA-1-signed downloads from the Download Center tomorrow. No good way to determine which downloads are affected, total size 7.1 TB.
[23:18] https://gist.github.com/mgrandi/0904bbeeaba2a4c1bc7084ad26ec236e
[23:18] Not covered very well last time I checked.
[23:18] commands look good? any warc headers i should add?
[23:19] mgrandi: I'd remove --page-requisites --span-hosts --recursive --level inf since recursion isn't necessary here.
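One way to partition the URL list ([22:52], [23:13]) so several wget-at processes, or several people, can each take a slice; the input filename, chunk size, and output naming below are arbitrary choices, not values from the log:

```python
# Split the plain URL list into fixed-size chunks for parallel wget-at runs.
# "msdl-urls.txt" and the chunk size are placeholders.
def split_url_list(path="msdl-urls.txt", chunk_size=5000):
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i in range(0, len(urls), chunk_size):
        with open(f"{path}.part{i // chunk_size:03d}", "w") as out:
            out.write("\n".join(urls[i:i + chunk_size]) + "\n")

if __name__ == "__main__":
    split_url_list()
```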
[23:20] --warc-max-size is missing.
[23:20] oh good call
[23:21] is that a number like `5gb`?
[23:21] it just says 'NUMBER'
[23:21] Bytes as an int, I think.
[23:21] And the extra --warc-headers
[23:21] 5368709120
[23:22] Test it with a small --warc-max-size and the first couple URLs maybe to see if it does what you expect.
[23:22] what headers should i include?
[23:22] or does that matter / we can edit it later
[23:23] mgrandi: I'm just referring to the two pointless lines: --warc-header ""
[23:23] Not sure. I often don't add any and document things in the item description on IA instead. It doesn't matter for the grab itself.
[23:24] ok
[23:24] *** Raccoon has joined #archiveteam-bs
[23:25] updated: https://gist.github.com/mgrandi/0904bbeeaba2a4c1bc7084ad26ec236e
[23:25] i'm gonna try that with 12 urls and then 10 mb warc limit
[23:27] You might also want to split the list up and run multiple processes in parallel for higher throughput.
[23:27] Depending on transfer and disk speed obviously.
[23:28] ok
[23:32] 162 MB/s apparently
[23:33] Nice
[23:33] still think i need to split it up and run multiple processes?
[23:34] If it stays at that speed, probably not.
[23:41] and --delete-after is safe to have?
[23:41] since its saving it to the WARC?
[23:44] looks like its fine, i'll just leave it
[23:45] Yes, should be safe.
[23:46] Although it probably doesn't even matter that much since you have --output-document, so each download overwrites that file anyway.
[23:47] cool, lets begin
[23:47] if its going too slow i can always just start another one with different sections of the list and possibly have duplicates or ctrl+c after a certain point
[23:52] Good luck, and let me know if you see anything that isn't status 200.
[23:55] average is 27 MBit/s
[23:56] 2.5GB done already (compressed)
[23:59] *** Gallifrey has quit IRC (Read error: Connection reset by peer)