#archiveteam-bs 2020-08-02,Sun


Time Nickname Message
00:06 🔗 Gallifrey has joined #archiveteam-bs
00:10 🔗 chirlu has quit IRC (Quit: Bye)
00:16 🔗 JAA Has anyone looked into https://www.bleepingcomputer.com/news/microsoft/microsoft-to-remove-all-windows-downloads-signed-with-sha-1/ ?
00:18 🔗 SketchCow Not here
00:19 🔗 Arcorann has joined #archiveteam-bs
00:20 🔗 Arcorann has quit IRC (Remote host closed the connection)
00:20 🔗 Arcorann has joined #archiveteam-bs
00:32 🔗 systwi_ is now known as systwi
02:19 🔗 VADemon has quit IRC (left4dead)
02:30 🔗 HP_Archiv has quit IRC (Quit: Leaving)
02:32 🔗 HP_Archiv has joined #archiveteam-bs
02:42 🔗 SmileyG has quit IRC (Read error: Operation timed out)
02:42 🔗 Smiley has joined #archiveteam-bs
03:24 🔗 VADemon has joined #archiveteam-bs
03:40 🔗 qw3rty_ has joined #archiveteam-bs
03:47 🔗 qw3rty__ has quit IRC (Read error: Operation timed out)
04:28 🔗 HP_Archiv has quit IRC (Quit: Leaving)
05:34 🔗 Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
05:48 🔗 mgrandi has joined #archiveteam-bs
05:49 🔗 mgrandi has anyone looked into saving SHA1 signed stuff from microsoft's download center?
05:51 🔗 mgrandi oof, its happening august 3rd
05:52 🔗 bsmith093 has quit IRC (Read error: Operation timed out)
05:56 🔗 mgrandi https://www.zdnet.com/google-amp/article/microsoft-to-remove-all-sha-1-windows-downloads-next-week/
05:57 🔗 OrIdow6 I looked *at* it for about 5 minutes
06:01 🔗 OrIdow6 Looking at it again...
06:02 🔗 jmtd is now known as Jon
06:03 🔗 OrIdow6 It's hard to tell where the Microsoft Download Center ends and other things begin. They use the microsoft.com-wide search function to list downloads (and it only goes up to page 126, so creativity is needed with the various parameters); and is anything at update.microsoft.com being removed?
06:05 🔗 OrIdow6 And is there any way to tell what's using sha1 besides downloading the file, figuring out what format it's in (exe, msi), and extracting the information in a format-specific way?
06:08 🔗 bsmith093 has joined #archiveteam-bs
06:12 🔗 OrIdow6 Looks like it's divided into (what I will call) items, e.g. https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658 - I give that one as an example because it contains multiple files
06:12 🔗 OrIdow6 The site started giving me 400s for everything at one point; going away and clearing cookies worked in that case
06:13 🔗 OrIdow6 They're using radio buttons where checkboxes should be used, and using JS to make them behave like checkboxes
06:16 🔗 OrIdow6 wget 403s, but UA of "abc" (and presumably most other things) work
06:16 🔗 OrIdow6 *makes it 403
06:18 🔗 OrIdow6 Looks like enumerating and downloading them should be straightforward (assuming they don't block) - can't say the same about figuring out what uses sha1 and playback
06:18 🔗 OrIdow6 Maybe it's best to get the whole thing anyhow, if it's not too big (hopefully)
06:19 🔗 OrIdow6 Obviously the rest of the site is going to go down some day
06:19 🔗 OrIdow6 *sha1, and
06:58 🔗 Craigle has joined #archiveteam-bs
07:11 🔗 auror__ has joined #archiveteam-bs
07:17 🔗 mgrandi has quit IRC (Read error: Operation timed out)
07:22 🔗 HP_Archiv has joined #archiveteam-bs
07:24 🔗 VADemon_ has joined #archiveteam-bs
07:26 🔗 Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
07:28 🔗 VADemon has quit IRC (Ping timeout: 492 seconds)
07:39 🔗 auror__ is now known as mgrandi
07:39 🔗 mgrandi do you see an easy way to get a list of urls?
07:40 🔗 mgrandi i also can't see what things are SHA1 to make it easier
07:40 🔗 mgrandi without downloading everything and checking
07:41 🔗 OrIdow6 mgrandi: Just go through all the numerical IDs
07:41 🔗 mgrandi ah. well, thats easy lol
07:41 🔗 mgrandi also, that link you posted, it seems to be 1 file but then it has like 'popular downloads' underneath it?
07:43 🔗 OrIdow6 See "click here to download manually" if it doesn't get all 3 automatically
07:44 🔗 OrIdow6 An msi and 2 pdfs
07:46 🔗 mgrandi for me its just auto prompting a download for a .exe
07:47 🔗 mgrandi wait no i must have been on a different page
07:47 🔗 mgrandi ok yeah i see it
07:49 🔗 mgrandi so it seems like if it has multiple files it has a <div> with a class of `multifile-failover-list`
07:50 🔗 mgrandi wait, did that link suddenly stop working for you?
07:50 🔗 mgrandi i'm getting "item is no longer available"
07:54 🔗 OrIdow6 No
07:54 🔗 OrIdow6 Try clearing your cookies
07:54 🔗 mgrandi ok, its weird, yeah it must be like using the cookies to try and see if you downloaded it recently
07:56 🔗 mgrandi hmm, should i try just writing a python script that iterates over all 50000 ids and gets the links and then downloads them?
07:56 🔗 mgrandi i don't even think archiveteam can host these probably , but at least someone will have a copy
07:56 🔗 mgrandi archive.org *
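A minimal sketch of the enumeration approach being discussed here, assuming the behaviors reported in this log hold: sequential numeric IDs, 403s for default tool User-Agents (any other UA such as "abc" reportedly works), and cookies that eventually poison requests with "item is no longer available". The exact page wording is taken from the messages above and may need adjusting.

```python
import requests

# Default wget/requests User-Agents reportedly get 403s; almost anything else works.
HEADERS = {"User-Agent": "abc"}

def enumerate_items(max_id=60000):
    """Yield (item_id, html) for Download Center items that still resolve."""
    with requests.Session() as session:
        session.headers.update(HEADERS)
        for item_id in range(1, max_id + 1):
            url = ("https://www.microsoft.com/en-us/download/"
                   f"confirmation.aspx?id={item_id}")
            resp = session.get(url, timeout=30)
            session.cookies.clear()  # stale cookies trigger 400s / "no longer available"
            if resp.status_code == 200 and "no longer available" not in resp.text:
                yield item_id, resp.text  # parse download.microsoft.com links from this
```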
07:59 🔗 OrIdow6 There's a lot of old software hosted by the IA, even though it's technically in-copyright
08:00 🔗 OrIdow6 These are free downloads, of security updates and other "technical" files, with no ads, that are being removed permanently; I would think (though you never know) that they're fairly "safe"
08:00 🔗 mgrandi Swift on security on twitter brought up the point that a lot of these are needed even by modern software
08:01 🔗 mgrandi stuff like the VC2XXX c++ redist packages and stuff
08:01 🔗 OrIdow6 Are you talking about something else when you say that the IA has an inability to host them?
08:01 🔗 OrIdow6 And I don't know who that is
08:01 🔗 mgrandi yeah i meant that they are in copyright
08:01 🔗 mgrandi slash by a company that would probably DMCA them being on the IA
08:02 🔗 mgrandi uhhh, some twitter half taylor swift parody account slash technology/infosec account
08:03 🔗 mgrandi "In 2017, Microsoft removed downloads for Movie Maker.
08:03 🔗 mgrandi What resulted was years of customers looking for the file being infected and scammed by malware."
08:04 🔗 mgrandi (whoops, didn't mean to copy the newline) https://www.welivesecurity.com/2017/11/09/eset-detected-windows-movie-maker-scam-2017/
08:04 🔗 OrIdow6 You're preaching to the choir here with the public access thing
08:04 🔗 mgrandi yeah, heh
08:05 🔗 mgrandi so If you aren't working on anything i'll probably see if i can just whip up a quick script to download the files, if there are no complications, i don't think its worth getting WARCs of the entire pages
08:06 🔗 OrIdow6 You might want to get warcs of the description pages, at least, to get all the metadata
08:08 🔗 OrIdow6 And I know it's somewhat contrary to the orthodoxy, but I think these would better be in the form of individual IA items rather than in warcs locked behind playback problems
08:09 🔗 OrIdow6 As they are practically already separated like that
08:09 🔗 mgrandi i have never done anything like this before so i assume i'll be making them per item, i dunno what is the best practice
08:09 🔗 mgrandi what is the best thing that has warc integration? would it be manually downloading them with python requests into a warc file or using something like wpull?
08:10 🔗 OrIdow6 First things first, get the data, seeing that, at minimum, it may be removed in 16 hours
08:10 🔗 mgrandi (i have experience with page scraping and all that but not with generating warcs)
08:11 🔗 OrIdow6 https://github.com/webrecorder/warcio#quick-start-to-writing-a-warc - easy way to write to warc when using requests
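For reference, the quick-start pattern from that warcio README looks like this; note the import order, since requests has to be imported after capture_http for its traffic to be hooked (the URL and filename here are just placeholders):

```python
from warcio.capture_http import capture_http
import requests  # must be imported after capture_http so its HTTP calls are captured

# Every request made inside the context manager is recorded into the WARC,
# request and response records both.
with capture_http('microsoft-download-center.warc.gz'):
    requests.get('https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658',
                 headers={'User-Agent': 'abc'})
```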
08:12 🔗 Craigle has joined #archiveteam-bs
08:13 🔗 mgrandi thank goodness someone has that
08:13 🔗 OrIdow6 Though you could also make a list of the confirm pages, wpull them all, extract it after the fact from there, and then get the URLs of the downloads from that; or any number of other things; but this looks like it would disrupt your current idea the least
08:13 🔗 OrIdow6 "This" being the thing I linked (thanks J A A)
08:14 🔗 mgrandi wpull would probably be easiest yeah
08:14 🔗 mgrandi if i can just figure out what urls to get and tell it to not go off recursively on some other microsoft site
08:17 🔗 OrIdow6 You could whitelist instead of blacklist
08:18 🔗 mgrandi yeah
08:20 🔗 Laverne has quit IRC (Ping timeout: 272 seconds)
08:21 🔗 Aoede has quit IRC (Ping timeout: 272 seconds)
08:21 🔗 brayden has quit IRC (Ping timeout: 272 seconds)
08:22 🔗 mgrytbak has quit IRC (Ping timeout: 272 seconds)
08:35 🔗 mgrandi ok, i will work on this when i get up
08:35 🔗 mgrandi has quit IRC (Leaving)
08:37 🔗 i0npulse has quit IRC (Quit: leaving)
09:06 🔗 i0npulse has joined #archiveteam-bs
09:10 🔗 jshoard has joined #archiveteam-bs
09:22 🔗 Raccoon has quit IRC (Ping timeout: 745 seconds)
09:25 🔗 Aoede has joined #archiveteam-bs
09:25 🔗 Laverne has joined #archiveteam-bs
09:26 🔗 brayden has joined #archiveteam-bs
09:31 🔗 OrIdow6 has quit IRC (Ping timeout: 265 seconds)
09:33 🔗 OrIdow6 has joined #archiveteam-bs
09:34 🔗 mgrytbak has joined #archiveteam-bs
09:39 🔗 VADemon_ has quit IRC (Read error: Connection reset by peer)
10:13 🔗 BartoCH has quit IRC (Quit: WeeChat 2.9)
10:13 🔗 BartoCH has joined #archiveteam-bs
10:32 🔗 jshoard has quit IRC (Read error: Operation timed out)
10:59 🔗 HP_Archiv has quit IRC (Quit: Leaving)
12:19 🔗 jshoard has joined #archiveteam-bs
13:34 🔗 BlueMax has quit IRC (Quit: Leaving)
14:24 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
14:30 🔗 Gallifrey has joined #archiveteam-bs
14:34 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
14:35 🔗 Gallifrey has joined #archiveteam-bs
14:36 🔗 Ravenloft has joined #archiveteam-bs
14:48 🔗 JAA I feel like we should "simply" continuously mirror all downloads Microsoft makes available at this point.
15:03 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
15:05 🔗 Gallifrey has joined #archiveteam-bs
15:07 🔗 JAA Oh yeah, my listing of Clutch's S3 finished and discovered some 33M files totalling 188 TB.
15:13 🔗 schbirid has joined #archiveteam-bs
15:13 🔗 JAA The video counts are ... interesting.
15:13 🔗 JAA High-resolution videos have no suffix on the filename after the SHA-1. There are 5800580 of them totalling 97.6 TB.
15:14 🔗 JAA Watermarked videos (*-watermarked.mp4): 4855831 files, 59.5 TB
15:14 🔗 JAA Low-resolution videos (*-480.mp4): 5816979 files, 29.5 TB
15:27 🔗 godane has quit IRC (Read error: Connection reset by peer)
15:27 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
15:44 🔗 fredgido_ has joined #archiveteam-bs
15:46 🔗 fredgido has quit IRC (Ping timeout: 622 seconds)
15:49 🔗 godane has joined #archiveteam-bs
15:53 🔗 schbirid has quit IRC (Quit: Leaving)
15:54 🔗 Raccoon has joined #archiveteam-bs
16:06 🔗 Ctrl has quit IRC (Read error: Operation timed out)
16:27 🔗 RichardG has quit IRC (Keyboard not found, press F1 to continue)
16:30 🔗 RichardG has joined #archiveteam-bs
16:51 🔗 prq has quit IRC (Remote host closed the connection)
17:01 🔗 JAA My discovery on the API is running now. I'm simply iterating over the recent posts endpoint and extracting posts and users with a bunch of interesting metadata. The API is slooooow though, so that might take a bit.
17:24 🔗 JAA OrIdow6: So are you grabbing the Microsoft downloads?
17:25 🔗 JAA Er, mgrandi I guess.
17:52 🔗 JAA I'm running something on it now.
18:06 🔗 fivechan_ has joined #archiveteam-bs
18:08 🔗 fivechan_ I have a question. If I have WARC files and upload them to InternetArchive with keyword "archiveteam", will they show up in Wayback Machine?
18:10 🔗 JAA fivechan_: No, they won't. Only WARCs from trusted accounts are included in the Wayback Machine.
18:14 🔗 Mateon1 has quit IRC (Ping timeout: 260 seconds)
18:15 🔗 fivechan_ To show web pages in Wayback, I must ask the archive team to archive them?
18:18 🔗 JAA Yes, or use the Wayback Machine's save tool.
18:19 🔗 fivechan_ Thank you!! I understood.
18:24 🔗 fivechan_ has quit IRC (Ping timeout: 252 seconds)
18:25 🔗 JAA Turns out that Microsoft doesn't really like it when their Download Center gets hammered with requests. 403 pretty quickly.
18:25 🔗 JAA To be fair, I sent 100+ req/s at times. :-)
18:26 🔗 mgrandi has joined #archiveteam-bs
18:27 🔗 Mateon1 has joined #archiveteam-bs
18:31 🔗 JAA Hey mgrandi. In case you didn't check the logs, I've been looking into Microsoft's downloads a bit.
18:39 🔗 JAA My Clutch discovery is at 2020-07-21 after 1.5 hours. Yeah, this is going to take a while.
18:42 🔗 mgrandi ah so you are already working on it?
18:43 🔗 mgrandi or are you just getting URLs
18:45 🔗 JAA The latter. Trying to, anyway.
18:46 🔗 JAA I'm investigating Clutch's cursor format to see if I can speed this up a bit.
18:46 🔗 mgrandi hmm, not familiar with Clutch
18:48 🔗 JAA Two separate things I'm working on at the moment.
18:51 🔗 mgrandi ok, let me know if you need help
18:53 🔗 mgrandi i was personally just gonna iterate over 1->50000 and see if a page has anything vs a 404 and then download it there
18:53 🔗 JAA Well, maybe you have an idea, so here it goes: I'm iterating over https://clutch.win/v1/posts/recent/ . The next page is https://clutch.win/v1/posts/recent/?cursor=<cursor value from previous page> . I'm trying to figure out how to construct a cursor value to start at a particular point in time. The cursors are opaque though and have a weird format.
18:54 🔗 JAA E.g. CksKFwoKY3JlYXRlZF9hdBIJCN6b6YuB_eoCEixqCXN-ZnR3LXV0bHIfCxIEdXNlchiAgICp8PqGCQwLEgRjbGlwGKzBzdIIDBgAIAE= which decodes to b'\nK\n\x17\n\ncreated_at\x12\t\x08\xde\x9b\xe9\x8b\x81\xfd\xea\x02\x12,j\ts~ftw-utlr\x1f\x0b\x12\x04user\x18\x80\x80\x80\xa9\xf0\xfa\x86\t\x0c\x0b\x12\x04clip\x18\xac\xc1\xcd\xd2\x08\x0c\x18\x00 \x01' (in Python notation).
18:54 🔗 JAA The created_at part controls the time axis, but I can't figure out what the rest is.
18:56 🔗 mgrandi oh clutch is a website, i was mentioning my idea for microsoft download center heh
18:57 🔗 mgrandi the cursors are probably dynamic, as in it represents a view of the database at that point in time
18:57 🔗 mgrandi so it does a query, stores it, and then cursors iterate over it so its not constantly changing while you are iterating over it
18:57 🔗 mgrandi you probably need to just iterate over the cursor values it gives you until you reach the end, i don't think you would be able to create one dynamically
18:58 🔗 JAA Possibly, but often cursors work as opaque identifiers similar to "before X time/DB ID".
18:58 🔗 mgrandi hmm, is that a pickled object?
19:00 🔗 JAA I can just iterate over it until done, but it's slow, so I'm trying to slice it into chunks of e.g. one day to process those in parallel.
19:00 🔗 mgrandi it looks like its a serialized object format of some kind
19:01 🔗 JAA Yeah
19:01 🔗 mgrandi it has `created_at`, `clip`, and `ftw-utl`, `user` fields in it
19:01 🔗 mgrandi but yeah, are you working on archiving the microsoft download center stuff ? or should i still work on that
19:02 🔗 JAA I'm about 2/3 done enumerating the downloads now.
19:03 🔗 mgrandi ok cool, how did you do it? just wpull over the urls with item 0->50000?
19:03 🔗 JAA Just retrieving details.aspx and confirmation.aspx, extracting file URLs and sizes from the latter.
19:03 🔗 JAA qwarc for IDs 1 to 60k.
19:03 🔗 mgrandi k. are you actually downloading the files yet?
19:04 🔗 JAA Nope
19:04 🔗 mgrandi those are probably the biggest ones
19:04 🔗 JAA Yeah, definitely. Just wanted to collect the URLs and get a size estimate first.
19:04 🔗 mgrandi i don't think microsoft is gonna nuke these from history, i wonder if they are gonna put them back up.
19:04 🔗 JAA By the way, I saw occasional weird things where the details.aspx page would work but confirmation.aspx would redirect to the 404 page.
19:04 🔗 JAA I hope it doesn't happen the other way around...
19:04 🔗 mgrandi full disclosure, i just started working for microsoft, but not on any team that deals with that
19:05 🔗 mgrandi did you see where you have to like clear your cookies?
19:05 🔗 mgrandi that was happening to me
19:06 🔗 JAA Yeah, I'm clearing them after every ID because why not.
19:06 🔗 mgrandi (not sure why it does that, seems weird)
19:06 🔗 mgrandi also not ALL of the downloads are going away, just SHA1 ones
19:06 🔗 mgrandi although i have no idea if there is a way to tell which ones are going away without..downloading them first
19:08 🔗 JAA Yeah, exactly.
19:08 🔗 JAA Hence why I suggested just grabbing everything and also doing so continuously in the future.
19:11 🔗 JAA Retrieval is done, just need to fix that one weird 404 now.
19:12 🔗 mgrandi i still have my box with a 500gb hard drive for kickthebucket if you want me to download the actual files
19:12 🔗 JAA It's not reproducible by the way, just seems to happen under load or something like that.
19:15 🔗 mgrandi hmm
19:15 🔗 mgrandi also, is that base64 string you posted complete? or did you leave off a few = signs
19:16 🔗 JAA Nope, that's complete.
19:16 🔗 mgrandi is it base64?
19:17 🔗 JAA Microsoft Download Center: I found 51298 files with a total size of about 7.1 TB.
19:17 🔗 mgrandi it's saying it's not divisible by 4 characters
19:17 🔗 mgrandi @JAA hmm, thats a bit big
19:18 🔗 JAA Hmm, yeah, odd. Python's base64.urlsafe_b64decode doesn't have any issues with it though.
19:19 🔗 mgrandi oh its url safe, duh
19:23 🔗 JAA The highest ID I found was 58507, by the way, which was uploaded ... a year ago (assuming weird US date format)?
19:23 🔗 JAA https://www.microsoft.com/en-us/download/details.aspx?id=58507
19:23 🔗 OrIdow6 It's protobuf
19:24 🔗 JAA I don't like protobuf.
19:24 🔗 mgrandi oof, protobuf is not great if we dont have the proto file
19:25 🔗 JAA Yeah
19:25 🔗 mgrandi but i assume you just need to figure out the created_at
19:25 🔗 mgrandi which is probably one of the integer types
19:25 🔗 JAA It seems that the other values also influence the results, sadly.
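A sketch of poking at that cursor, treating it as the protobuf OrIdow6 identifies above: urlsafe base64 handles the '-'/'_' characters, and a hand-rolled varint read pulls out the value stored under the created_at key. Interpreting that value as microseconds since the Unix epoch is a guess, though it does land on this log's date:

```python
import base64
import datetime

cursor = ("CksKFwoKY3JlYXRlZF9hdBIJCN6b6YuB_eoCEixqCXN-ZnR3LXV0bHIfCxIEdXNlchiAgICp"
          "8PqGCQwLEgRjbGlwGKzBzdIIDBgAIAE=")
raw = base64.urlsafe_b64decode(cursor)  # '-' and '_' mean the URL-safe alphabet

def read_varint(buf, pos):
    """Decode one protobuf varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        value |= (byte & 0x7F) << shift
        pos += 1
        if not byte & 0x80:
            return value, pos
        shift += 7

# Skip the 'created_at' key plus the three tag/length bytes after it
# (field 2, length 9, inner field 1) to land on the timestamp varint.
start = raw.index(b"created_at") + len(b"created_at") + 3
ts_us, _ = read_varint(raw, start)
print(datetime.datetime.utcfromtimestamp(ts_us / 1e6))  # ~2020-08-02 if the guess holds
```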
19:26 🔗 JAA This is getting messy here, discussing Microsoft Download Center and Clutch at the same time.
19:27 🔗 OrIdow6 Yeah
19:27 🔗 JAA Let's focus on Microsoft first since it has such a short deadline.
19:27 🔗 JAA Surely the Download Center is still in use, right? Any ideas why the highest ID is a year old?
19:28 🔗 mgrandi maybe they migrated to other things?
19:28 🔗 JAA I noticed some big gaps in the IDs in some places though.
19:28 🔗 JAA There's almost nothing between 31k and 34k for example, just a few files.
19:28 🔗 mgrandi like i know that visual studio stuff has its own download page now instead of the download center
19:31 🔗 JAA Well, apparently the IDs aren't sequential *at all*.
19:31 🔗 JAA https://www.microsoft.com/en-us/download/details.aspx?id=1230 is from December...
19:31 🔗 mgrandi lol wut
19:32 🔗 JAA Damn, IDs go much higher also: https://www.microsoft.com/en-us/download/details.aspx?id=100688
19:32 🔗 mgrandi so, how do EXEs work? are the signing certificates at the very start?
19:32 🔗 mgrandi if so we could like...download 32kb of the file, check the cert and see if its a SHA1 cert or something?
19:33 🔗 OrIdow6 But 1230 is apparently a security patch from 2010 (https://support.microsoft.com/en-us/help/2345000/ms10-079-description-of-the-security-update-for-word-2010-october-12-2)
19:34 🔗 JAA Huh
19:34 🔗 JAA So the 'Date Published' is completely unreliable.
19:34 🔗 mgrandi cause while everything should be archived eventually, only the SHA1 stuff is getting removed soon
19:38 🔗 JAA Yeah, but I'm also seeing .bin, .msi, .zip, even .tar.gz...
19:39 🔗 JAA .msu
19:39 🔗 JAA .msp
19:39 🔗 JAA etc.
19:40 🔗 JAA If you can figure something out to selectively archive those, that's great. Otherwise, we should just grab everything.
19:41 🔗 mgrandi well, i assume the associated files with the SHA1 downloads will be removed
19:42 🔗 mgrandi but if the cert is in a predictable spot, my strategy would be: for every page, download some amount of data for each EXE, see if it's signed with a SHA1 cert, if it is, download everything for that 'item', else skip it
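A sketch of that strategy for PE files (exe/dll), with one wrinkle noted further down: the Authenticode blob sits at the *end* of the file, located via the security data directory in the PE header, so it takes two ranged reads rather than just the first 32 KB. Scanning the PKCS#7 blob for the DER encoding of the SHA-1 OID is a heuristic, not a real signature parser, and MSU/CAB containers would need separate handling:

```python
import pefile  # pip install pefile
import requests

HEADERS = {"User-Agent": "abc"}
SHA1_OID_DER = bytes.fromhex("06052b0e03021a")  # OID 1.3.14.3.2.26 (SHA-1)

def pe_signature_mentions_sha1(url):
    """Heuristic: does a remote PE's Authenticode blob reference SHA-1?"""
    # The PE headers (and data directories) normally fit in the first few KB;
    # this also assumes the server honors HTTP Range requests.
    head = requests.get(url, headers={**HEADERS, "Range": "bytes=0-8191"}).content
    pe = pefile.PE(data=head, fast_load=True)
    sec = pe.OPTIONAL_HEADER.DATA_DIRECTORY[
        pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_SECURITY"]]
    if sec.Size == 0:
        return False  # unsigned
    # For the security directory, VirtualAddress is a plain file offset near EOF.
    blob = requests.get(url, headers={
        **HEADERS,
        "Range": f"bytes={sec.VirtualAddress}-{sec.VirtualAddress + sec.Size - 1}",
    }).content
    return SHA1_OID_DER in blob
```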
19:42 🔗 JAA There are more .msu than .exe.
19:42 🔗 mgrandi what is a .msu?
19:43 🔗 JAA I have no idea. 'Microsoft Update' maybe?
19:43 🔗 OrIdow6 http://fileformats.archiveteam.org/wiki/Microsoft_Update_Standalone_Package
19:43 🔗 JAA This fucking mess is precisely why I left the Windows world years ago. lol
19:43 🔗 Raccoon has quit IRC (Ping timeout: 610 seconds)
19:43 🔗 mgrandi not sure why its listed under EA files, heh
19:44 🔗 mgrandi well to be fair, they added these so these are not direct executables so they are a bit safer than just EXE files
19:44 🔗 mgrandi are MSU files signed?
19:44 🔗 mgrandi do you have an example download link? i'll check it out
19:44 🔗 JAA First one my scan found: https://download.microsoft.com/download/0/B/8/0B8852B8-8A3A-4A70-97CE-A84B5F4C5FC8/IE9-Windows6.0-KB2618444-x86.msu from ID 28401.
19:45 🔗 mgrandi yeah, that actually doesn't run cause windows 10 doesn't accept SHA1 certs anymore
19:46 🔗 mgrandi so they are Cabinet files (mszip i guess? )
19:47 🔗 JAA I found two different search interfaces for the Download Center, and they both suck.
19:47 🔗 JAA https://www.microsoft.com/en-us/search/downloadresults?FORM=DLC&ftapplicableproducts=^AllDownloads&sortby=+weight returns only 1000 results.
19:47 🔗 JAA https://www.microsoft.com/en-us/download/search.aspx is just broken.
19:48 🔗 JAA Sometimes returns the same results as you go through the pagination etc.
19:48 🔗 JAA Trying to establish the upper bound for the IDs.
19:48 🔗 mgrandi and the cert seems to be at the end of the MSU file
19:49 🔗 OrIdow6 I'm going to guess that generally, the lower the ID, the older it is, and the more likely it is to use sha1
19:50 🔗 mgrandi that probably seems like a safe assumption
19:50 🔗 OrIdow6 Depending on how slowly that search goes until it reaches the present (~11 hours left until midnight Pacific), it might be useful to start downloading before it finishes
19:51 🔗 mgrandi microsoft is US based, i'm not sure if they are gonna nuke it right at midnight on a sunday, so hopefully have a bit more time
19:51 🔗 mgrandi but yeah, probably. how should we handle...the data? are we allowed to upload these to archive.org ?
19:52 🔗 JAA https://transfer.notkiska.pw/Kwk8n/microsoft-download-center-files-below-id-60000
19:56 🔗 JAA Actually, let me do that differently.
20:01 🔗 mgrandi so we gonna split it up and curl our way to victory?
20:04 🔗 OrIdow6 I like the idea of "mirror everything as individual IA items"
20:04 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
20:06 🔗 wyatt8740 has joined #archiveteam-bs
20:14 🔗 JAA That'd actually be nice, yeah. With full metadata etc.
20:14 🔗 JAA But for now, we just need to grab everything we can.
20:14 🔗 JAA Download as WARCs, further processing later.
20:16 🔗 mgrandi yeah, but how are we gonna download them
20:16 🔗 mgrandi is it easy to set up a tracker thingy like kickthebucket?
20:44 🔗 jshoard has quit IRC (Quit: Leaving)
20:45 🔗 JAA https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl
20:54 🔗 mgrandi nice
20:54 🔗 mgrandi so how do we coordinate the downloads
20:58 🔗 JAA Here are the ten most frequent file extensions: 13780 msu, 13111 exe, 6812 zip, 3927 msi, 3770 pdf, 2228 pptx, 1214 docx, 888 bin, 828 doc, 483 xps
21:03 🔗 mgrandi so the ID corresponds to what page it's on?
21:04 🔗 JAA That's the 'id' parameter from the URLs.
21:04 🔗 mgrandi yeah so if an 'item' has multiple downloads then they have different IDs
21:05 🔗 JAA Uh
21:05 🔗 JAA If an entry on the Download Center has multiple files, those all have the same ID in my list.
21:05 🔗 JAA E.g. https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658 -> three entries with ID 41658.
21:06 🔗 mgrandi ok
21:06 🔗 mgrandi thanks for setting this up :)
21:06 🔗 JAA Further statistics: top ten by size in GiB: 4100.1 .zip, 864.1 .exe, 597.3 .bin, 355.1 .iso, 182.3 .rar, 149.6 .cab, 118.1 .msi, 90.5 .msu, 17.6 .wmv, 16.7 .ISO
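Statistics like those can be recomputed from the JSONL list in a few lines; the 'url' and 'size' field names here are guesses at the list's schema, not confirmed:

```python
import json
from collections import Counter
from urllib.parse import urlparse

counts, sizes = Counter(), Counter()
with open("microsoft-download-center-files-below-id-60000-sorted.jsonl") as f:
    for line in f:
        record = json.loads(line)            # 'url'/'size' fields are assumptions
        path = urlparse(record["url"]).path
        ext = path.rsplit(".", 1)[-1] if "." in path else "(none)"
        counts[ext] += 1
        sizes[ext] += record.get("size", 0)

for ext, n in counts.most_common(10):
    print(f"{n:6d} .{ext:6} {sizes[ext] / 2**30:10.1f} GiB")
```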
21:09 🔗 Mateon1 has quit IRC (Remote host closed the connection)
21:09 🔗 JAA I don't have a good idea how to do the actual retrieval though. Warriorbot isn't ready yet I think. :-/
21:10 🔗 Mateon1 has joined #archiveteam-bs
21:10 🔗 mgrandi is that the thing that sets up a 'warrior' project?
21:11 🔗 JAA No, it's a distributed ("warrior") project that simply retrieves lists of URLs.
21:12 🔗 OrIdow6 Assuming everyone necessary is here, we could set up a quick warrior project
21:12 🔗 mgrandi setting up a warrior project pipeline seems easy right
21:12 🔗 mgrandi like you don't even need a lua script right, no recursion necessary or allow/deny list of urls
21:12 🔗 OrIdow6 Seems like a lot of overhead considering this is basically just "wget --input-file" at scale
21:12 🔗 OrIdow6 though
21:12 🔗 JAA Yeah
21:12 🔗 Gallifrey has quit IRC (Ping timeout: 265 seconds)
21:13 🔗 JAA (Yeah, a lot of overhead)
21:13 🔗 mgrandi but given the space requirements it might be good, because at least the rsync upload has the nice property of failing/retrying endlessly until the rsync target frees up space
21:13 🔗 mgrandi i assume thats what 'warriorbot' is meant to fix, to have a premade warrior project for just lists of urls
21:13 🔗 JAA Yes
21:13 🔗 OrIdow6 I'm more concerned about getting the infrastructure set up
21:14 🔗 OrIdow6 Viz. temporary storage (target or similar)
21:14 🔗 mgrandi yeah, the infrastructure of offloading the data is gonna be the hard part
21:15 🔗 OrIdow6 In other words... anyone have 17TB free?
21:15 🔗 Gallifrey has joined #archiveteam-bs
21:15 🔗 JAA 17?
21:16 🔗 OrIdow6 7.1
21:16 🔗 JAA :-)
21:16 🔗 OrIdow6 Transposed the digits
21:20 🔗 mgrandi uhh
21:21 🔗 mgrandi i have 500 gb + 100 + 100 on my boxes i was using for kick the bucket
21:21 🔗 mgrandi we can possibly alleviate some of it if we upload to archive.org like a normal warrior project right?
21:22 🔗 OrIdow6 If S3 is feeling nice today
21:23 🔗 JAA (Narrator: It wasn't.)
21:25 🔗 mgrandi 17 tb is $170 a month on digital ocean which isn't terrible
21:26 🔗 mgrandi oh no, left off a 0, never mind, it doesn't even support that lol
21:28 🔗 mgrandi but as long as i'm not paying for this for a full month i probably could swing enough volumes to do 17 tb
21:30 🔗 OrIdow6 How's network/transfer pricing?
21:30 🔗 mgrandi inbound is free
21:31 🔗 mgrandi yeah the outbound is what gets you
21:31 🔗 mgrandi Droplets include free outbound data transfer, starting at 1,000 GiB/month for the smallest plan. Excess data transfer is billed at $0.01/GiB. For example, the cost of 1,000 GiB of overage is $10. Inbound bandwidth to Droplets is always free.
21:32 🔗 mgrandi so $170, minus the free outbound bandwidth i get for my droplets which i think is 3tb for 3 droplets i have running now
21:32 🔗 mgrandi again, not terrible, but if someone else has a cheaper option =P
21:34 🔗 JAA It's 7.1 TB, not 17 TB.
21:34 🔗 mgrandi now i'm doing it xD
21:35 🔗 mgrandi so even better then
21:35 🔗 mgrandi any other ideas before i pull the lever?
21:37 🔗 mgrandi (brb like 40 minutes)
21:47 🔗 Terbium buyvm and some other VPS providers have unmetered bandwidth
21:52 🔗 Terbium $30/mo VPS + $10/mo for 2TB (1TB/$5) attached block storage might work
21:53 🔗 Terbium scratch that... they're all out of stock....
22:37 🔗 mgrandi well guess i'm buying the 7tb volume then
22:38 🔗 mgrandi what is the format that wget/curl takes for a file list @OrIdow6
22:38 🔗 JAA mgrandi: Please write WARCs, not plain files.
22:39 🔗 Gallifrey has quit IRC (Ping timeout: 265 seconds)
22:39 🔗 JAA wget/wpull has --input-file or -i for that.
22:40 🔗 mgrandi so what tool should i use?
22:40 🔗 mgrandi i have wget-at for the kickthebucket archive
22:40 🔗 JAA Yeah, wget-at seems good.
22:40 🔗 mgrandi it will take a jsonl file?
22:40 🔗 JAA Nope
22:41 🔗 JAA Plain lines of URLs
22:41 🔗 mgrandi ok
22:41 🔗 mgrandi do you have a convenient list of urls or do you want me to make one?
22:42 🔗 JAA I don't, but I can easily make one.
22:42 🔗 mgrandi if you can make it easily that would be good
22:44 🔗 mgrandi i'll do a 7.5 tb volume
22:45 🔗 JAA https://transfer.notkiska.pw/3SHDe/microsoft-download-center-files-below-id-60000-sorted-urls
22:45 🔗 Gallifrey has joined #archiveteam-bs
22:47 🔗 mgrandi do we have a way of exfilling these files to somewhere else?
22:47 🔗 mgrandi thats a big number for the monthly cost that i'd rather not pay lol
22:48 🔗 JAA I don't have any free storage at the moment I'm afraid.
22:50 🔗 mgrandi wait, i have 15tb at home, but will cox hate me
22:51 🔗 JAA Maybe SketchCow can set you up with space on FOS, although probably not the whole thing at once.
22:51 🔗 mgrandi i think i'll start with this, but its gonna cost 24$/day
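(For scale, assuming DigitalOcean's usual block-storage rate of about $0.10/GB-month: 7.5 TB ≈ 7,500 GB, and 7,500 GB × $0.10 ≈ $750/month, i.e. roughly $25/day, consistent with the ~$24/day figure above.)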
22:52 🔗 mgrandi and helps since its a commercial data center without residential ISP limits
22:52 🔗 JAA Or upload to IA as you grab.
22:52 🔗 OrIdow6 Make sure to split it up, instead of getting one huge warc
22:52 🔗 JAA Yeah
22:52 🔗 mgrandi so does anyone know the wget-at args to do that?
22:52 🔗 JAA I tend to do 5 GiB WARCs.
22:52 🔗 mgrandi or just partition the file into chunks
22:54 🔗 JAA ArchiveBot's wpull options are a good starting point: https://github.com/ArchiveTeam/ArchiveBot/blob/3585ed999010665a7b367e37fd6f325f30a23983/pipeline/archivebot/seesaw/wpull.py#L12
22:54 🔗 JAA But wpull isn't fully compatible with wget.
22:56 🔗 JAA Or the DPoS project code repos, e.g. https://github.com/ArchiveTeam/mercurial-grab/blob/20b40049911bb721603de491d4e8a3aa5c4d3a81/pipeline.py#L173
22:56 🔗 JAA --warc-max-size to get multiple WARCs instead of one huge file.
22:59 🔗 mgrandi ok, yeah let me craft one based on that one
23:00 🔗 JAA Another important one is --delete-after so the plain file isn't kept after download.
23:02 🔗 mgrandi so --output-document outputs to a temp file, it writes it to a WARC, and then --delete-after deletes the temp file?
23:04 🔗 OrIdow6 JAA: Is that list in any particular order?
23:04 🔗 JAA Yeah, something like that. I don't know what the exact data flow is in wget though. I think it writes it to the WARC immediately as the data is retrieved, not from the temp file.
23:04 🔗 JAA OrIdow6: Yes, sorted by ID.
23:05 🔗 mgrandi so do i need --output-document?
23:05 🔗 OrIdow6 JAA: Good, that's what I was going to ask about
23:05 🔗 mgrandi or does wget-at need to write to something
23:06 🔗 OrIdow6 IIRC output-file is only useful when dealing with a single file
23:06 🔗 JAA mgrandi: Not entirely sure to be honest. I'd include it though to be safe. Might have something to do with not creating directory structures or dealing with filenames containing odd characters.
23:07 🔗 mgrandi i'll include it anyway to be safe
23:08 🔗 OrIdow6 *output-document (output-file sets the logfile location)
23:09 🔗 JAA By the way, ~16 hours at 1 Gb/s to retrieve it all.
23:10 🔗 OrIdow6 Worst case is that it goes down at midnight automatically - not enough
23:10 🔗 OrIdow6 Though I don't know the speed of whatever's downloading it
23:12 🔗 mgrandi its in digitalocean so it should be pretty fast
23:12 🔗 mgrandi so how do the warc file names impact the split on size?
23:13 🔗 OrIdow6 Maybe split the list up, in case more people want to start downloading?
23:14 🔗 Arcorann has joined #archiveteam-bs
23:14 🔗 JAA I've never actually used wget(-lua/at) directly myself, but at least in wpull, --warc-file sets the filename prefix when --warc-max-size is used. `--warc-file foo --warc-max-size 1234` would produce foo-00000.warc.gz, foo-00001.warc.gz, etc., each "about" 1234 bytes (in wpull, the split happens as soon as possible after reaching that size).
23:14 🔗 mgrandi ok
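Putting the flags from this discussion together, a sketch of the kind of invocation being assembled (wrapped in Python only to keep these examples in one language; paths and values are illustrative, not mgrandi's exact gist):

```python
import subprocess

cmd = [
    "wget-at",
    "--input-file", "microsoft-download-center-files-below-id-60000-sorted-urls",
    "--user-agent", "abc",                       # default wget UA reportedly gets 403s
    "--warc-file", "microsoft-download-center",  # prefix: ...-00000.warc.gz, -00001, ...
    "--warc-max-size", "5368709120",             # split WARCs at ~5 GiB
    "--output-document", "scratch.bin",          # one temp file, overwritten per URL
    "--delete-after",                            # drop the plain copy; data lives in the WARC
    "--tries", "3",
]
subprocess.run(cmd, check=True)
```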
23:16 🔗 SketchCow What what
23:17 🔗 JAA SketchCow: Microsoft deleting SHA-1-signed downloads from the Download Center tomorrow. No good way to determine which downloads are affected, total size 7.1 TB.
23:18 🔗 mgrandi https://gist.github.com/mgrandi/0904bbeeaba2a4c1bc7084ad26ec236e
23:18 🔗 JAA Not covered very well last time I checked.
23:18 🔗 mgrandi commands look good? any warc headers i should add?
23:19 🔗 JAA mgrandi: I'd remove --page-requisites --span-hosts --recursive --level inf since recursion isn't necessary here.
23:20 🔗 JAA --warc-max-size is missing.
23:20 🔗 mgrandi oh good call
23:21 🔗 mgrandi is that a number like `5gb` ?
23:21 🔗 mgrandi it just says 'NUMBER'
23:21 🔗 JAA Bytes as an int, I think.
23:21 🔗 OrIdow6 And the extra --warc-headers
23:21 🔗 JAA 5368709120
23:22 🔗 JAA Test it with a small --warc-max-size and the first couple URLs maybe to see if it does what you expect.
23:22 🔗 mgrandi what headers should i include?
23:22 🔗 mgrandi or does that matter / we can edit it later
23:23 🔗 OrIdow6 mgrandi: I'm just referring to the two pointless lines: --warc-header ""
23:23 🔗 JAA Not sure. I often don't add any and document things in the item description on IA instead. It doesn't matter for the grab itself.
23:24 🔗 mgrandi ok
23:24 🔗 Raccoon has joined #archiveteam-bs
23:25 🔗 mgrandi updated: https://gist.github.com/mgrandi/0904bbeeaba2a4c1bc7084ad26ec236e
23:25 🔗 mgrandi i'm gonna try that with 12 urls and then 10 mb warc limit
23:27 🔗 JAA You might also want to split the list up and run multiple processes in parallel for higher throughput.
23:27 🔗 JAA Depending on transfer and disk speed obviously.
23:28 🔗 mgrandi ok
23:32 🔗 mgrandi 162 MB/s apparently
23:33 🔗 JAA Nice
23:33 🔗 mgrandi still think i need to split it up and run multiple processes?
23:34 🔗 JAA If it stays at that speed, probably not.
23:41 🔗 mgrandi and --delete-after is safe to have?
23:41 🔗 mgrandi since its saving it to the WARC?
23:44 🔗 mgrandi looks like its fine, i'll just leave it
23:45 🔗 JAA Yes, should be safe.
23:46 🔗 JAA Although it probably doesn't even matter that much since you have --output-document, so each download overwrites that file anyway.
23:47 🔗 mgrandi cool, lets begin
23:47 🔗 mgrandi if its going too slow i can always just start another one with different sections of the list and possibly have duplicates or ctrl+c after a certain point
23:52 🔗 JAA Good luck, and let me know if you see anything that isn't status 200.
23:55 🔗 mgrandi average is 27 MBit/s
23:56 🔗 mgrandi 2.5GB done already (compressed)
23:59 🔗 Gallifrey has quit IRC (Read error: Connection reset by peer)
