00:06 -- Gallifrey has joined #archiveteam-bs
00:10 -- chirlu has quit IRC (Quit: Bye)
00:16 <JAA> Has anyone looked into https://www.bleepingcomputer.com/news/microsoft/microsoft-to-remove-all-windows-downloads-signed-with-sha-1/ ?
00:18 <SketchCow> Not here
00:19 -- Arcorann has joined #archiveteam-bs
00:20 -- Arcorann has quit IRC (Remote host closed the connection)
00:20 -- Arcorann has joined #archiveteam-bs
00:32 -- systwi_ is now known as systwi
02:19 -- VADemon has quit IRC (left4dead)
02:30 -- HP_Archiv has quit IRC (Quit: Leaving)
02:32 -- HP_Archiv has joined #archiveteam-bs
02:42 -- SmileyG has quit IRC (Read error: Operation timed out)
02:42 -- Smiley has joined #archiveteam-bs
03:24 -- VADemon has joined #archiveteam-bs
03:40 -- qw3rty_ has joined #archiveteam-bs
03:47 -- qw3rty__ has quit IRC (Read error: Operation timed out)
04:28 -- HP_Archiv has quit IRC (Quit: Leaving)
05:34 -- Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
05:48 -- mgrandi has joined #archiveteam-bs
05:49 <mgrandi> has anyone looked into saving SHA1 signed stuff from microsoft's download center?
05:51 <mgrandi> oof, its happening august 3rd
05:52 -- bsmith093 has quit IRC (Read error: Operation timed out)
05:56 <mgrandi> https://www.zdnet.com/google-amp/article/microsoft-to-remove-all-sha-1-windows-downloads-next-week/
05:57 <OrIdow6> I looked *at* it for about 5 minutes
06:01 <OrIdow6> Looking at it again...
06:02 -- jmtd is now known as Jon
06:03 <OrIdow6> It's hard to tell where the Microsoft Download Center ends and other things begin. They use the microsoft.com-wide search function to list downloads (and it only goes up to page 126, so creativity is needed with the various parameters); and is anything at update.microsoft.com being removed?
06:05 <OrIdow6> And is there any way to tell what's using sha1 besides downloading the file, figuring out what format it's in (exe, msi), and extracting the information in a format-specific way?
06:08 -- bsmith093 has joined #archiveteam-bs
06:12 <OrIdow6> Looks like it's divided into (what I will call) items, e.g. https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658 - I give that one as an example because it contains multiple files
06:12 <OrIdow6> The site started giving me 400s for everything at one point; going away and clearing cookies worked in that case
06:13 <OrIdow6> They're using radio buttons where checkboxes should be used, and using JS to make them behave like checkboxes
06:16 <OrIdow6> wget 403s, but UA of "abc" (and presumably most other things) work
06:16 <OrIdow6> *makes it 403
06:18 <OrIdow6> Looks like enumerating and downloading them should be straightforward (assuming they don't block) - can't say the same about figuring out what uses sha1 and playback
06:18 <OrIdow6> Maybe it's best to get the whole thing anyhow, if it's not too big (hopefully)
06:19 <OrIdow6> Obviously the rest of the site is going to go down some day
06:19 <OrIdow6> *sha1, and
06:58 -- Craigle has joined #archiveteam-bs
07:11 -- auror__ has joined #archiveteam-bs
07:17 -- mgrandi has quit IRC (Read error: Operation timed out)
07:22 -- HP_Archiv has joined #archiveteam-bs
07:24 -- VADemon_ has joined #archiveteam-bs
07:26 -- Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
07:28 -- VADemon has quit IRC (Ping timeout: 492 seconds)
07:39 -- auror__ is now known as mgrandi
07:39 <mgrandi> do you see an easy way to get a list of urls?
07:40 <mgrandi> i also can't see what things are SHA1 to make it easier
07:40 <mgrandi> without downloading everything and checking
07:41 <OrIdow6> mgrandi: Just go through all the numerical IDs
07:41 <mgrandi> ah. well, thats easy lol
07:41 <mgrandi> also, that link you posted, it seems to be 1 file but then it has like 'popular downloads' underneath it?
07:43 <OrIdow6> See "click here to download manually" if it doesn't get all 3 automatically
07:44 <OrIdow6> An msi and 2 pdfs
07:46 <mgrandi> for me its just auto prompting a download for a .exe
07:47 <mgrandi> wait no i must have been on a different page
07:47 <mgrandi> ok yeah i see it
07:49 <mgrandi> so it seems like if it has multiple files it has a <div> with a class of `multifile-failover-list`
07:50 <mgrandi> wait, did that link suddenly stop working for you?
07:50 <mgrandi> i'm getting "item is no longer available"
07:54 <OrIdow6> No
07:54 <OrIdow6> Try clearing your cookies
07:54 <mgrandi> ok, its weird, yeah it must be like using the cookies to try and see if you downloaded it recently
07:56 <mgrandi> hmm, should i try just writing a python script that iterates over all 50000 ids and gets the links and then downloads them?
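
[Editor's note: a minimal sketch of the ID-scan idea floated here, assuming the confirmation.aspx URL pattern and the earlier observations that a non-default User-Agent avoids the 403 and that stale cookies cause bogus "no longer available" responses. The regex and UA string are illustrative, not anyone's actual script.]

    import re
    import requests

    CONFIRM = "https://www.microsoft.com/en-us/download/confirmation.aspx?id={}"
    # Hypothetical pattern; the actual file links point at download.microsoft.com.
    FILE_RE = re.compile(r'https://download\.microsoft\.com/[^"\']+')

    for item_id in range(1, 50001):
        # Fresh session per ID, since the site's cookies eventually make it
        # claim items are "no longer available" (see above).
        with requests.Session() as s:
            s.headers["User-Agent"] = "abc"  # wget's default UA reportedly gets 403'd
            r = s.get(CONFIRM.format(item_id), timeout=60)
            if r.status_code != 200:
                continue  # gap in the ID space (or throttling)
            for url in sorted(set(FILE_RE.findall(r.text))):
                print(item_id, url)
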
07:56 <mgrandi> i don't even think archiveteam can host these probably, but at least someone will have a copy
07:56 <mgrandi> archive.org *
07:59 <OrIdow6> There's a lot of old software hosted by the IA, even though it's technically in-copyright
08:00 <OrIdow6> These are free downloads, of security updates and other "technical" files, with no ads, that are being removed permanently; I would think (though you never know) that they're fairly "safe"
08:00 <mgrandi> SwiftOnSecurity on twitter brought up the point that a lot of these are needed even by modern software
08:01 <mgrandi> stuff like the VC2XXX c++ redist packages and stuff
08:01 <OrIdow6> Are you talking about something else when you say that the IA has an inability to host them?
08:01 <OrIdow6> And I don't know who that is
08:01 <mgrandi> yeah i meant that they are in copyright
08:02 <mgrandi> slash by a company that would probably DMCA them being on the IA
08:03 <mgrandi> uhhh, some twitter half taylor swift parody account slash technology/infosec account
08:03 <mgrandi> "In 2017, Microsoft removed downloads for Movie Maker.
08:04 <mgrandi> What resulted was years of customers looking for the file being infected and scammed by malware."
08:04 <mgrandi> (whoops, didn't mean to copy the newline) https://www.welivesecurity.com/2017/11/09/eset-detected-windows-movie-maker-scam-2017/
08:04 <OrIdow6> You're preaching to the choir here with the public access thing
08:05 <mgrandi> yeah, heh
08:05
🔗
|
mgrandi |
so If you aren't working on anything i'll probably see if i can just whip up a quick script to download the files, if there are no complications, i don't think its worth getting WARCs of the entire pages |
08:06
🔗
|
OrIdow6 |
You might want to get warcs of the description pages, at least, to get all the metadata |
08:08
🔗
|
OrIdow6 |
And I know it's somewhat contrary to the orthodoxy, but I think these would better be in the form of individual IA items rather than in warcs locked behind playback problems |
08:09
🔗
|
OrIdow6 |
As they are practically already separated like that |
08:09
🔗
|
mgrandi |
i have never done anything like this before so i assume i'll be making them per item,, i dunno what is the best practice |
08:09
🔗
|
mgrandi |
what is the best thing that has warc integration? would be manually downloading them with python requests into a warc file or using something like wpull? |
08:10
🔗
|
OrIdow6 |
First things first, get the data, seeing that, at minimum, it may be removed in 16 hours |
08:10
🔗
|
mgrandi |
(i have experience with page scraping and all that but not with generating warcs) |
08:11
🔗
|
OrIdow6 |
https://github.com/webrecorder/warcio#quick-start-to-writing-a-warc - easy way to write to warc when using requests |
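
[Editor's note: for reference, the quick start OrIdow6 links here boils down to a few lines; this sketch uses a placeholder output filename.]

    from warcio.capture_http import capture_http
    import requests  # per the warcio README, import requests *after* capture_http

    # Everything requests fetches inside this block is recorded into the WARC.
    with capture_http("microsoft-download-center.warc.gz"):
        requests.get("https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658")
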
08:13 -- Craigle has joined #archiveteam-bs
08:13 <mgrandi> thank goodness someone has that
08:13 <OrIdow6> Though you could also make a list of the confirm pages, wpull them all, extract it after the fact from there, and then get the URLs of the downloads from that; or any number of other things; but this looks like it would disrupt your current idea the least
08:14 <OrIdow6> "This" being the thing I linked (thanks J A A)
08:14 <mgrandi> wpull would probably be easiest yeah
08:17 <mgrandi> if i can just figure out what urls to get and tell it to not go off recursively on some other microsoft site
08:18 <OrIdow6> You could whitelist instead of blacklist
08:20 <mgrandi> yeah
08:21 -- Laverne has quit IRC (Ping timeout: 272 seconds)
08:21 -- Aoede has quit IRC (Ping timeout: 272 seconds)
08:22 -- brayden has quit IRC (Ping timeout: 272 seconds)
08:35 -- mgrytbak has quit IRC (Ping timeout: 272 seconds)
08:35 <mgrandi> ok, i will work on this when i get up
08:37 -- mgrandi has quit IRC (Leaving)
09:06 -- i0npulse has quit IRC (Quit: leaving)
09:10 -- i0npulse has joined #archiveteam-bs
09:22 -- jshoard has joined #archiveteam-bs
09:25 -- Raccoon has quit IRC (Ping timeout: 745 seconds)
09:25 -- Aoede has joined #archiveteam-bs
09:26 -- Laverne has joined #archiveteam-bs
09:31 -- brayden has joined #archiveteam-bs
09:33 -- OrIdow6 has quit IRC (Ping timeout: 265 seconds)
09:34 -- OrIdow6 has joined #archiveteam-bs
09:39 -- mgrytbak has joined #archiveteam-bs
10:13 -- VADemon_ has quit IRC (Read error: Connection reset by peer)
10:13 -- BartoCH has quit IRC (Quit: WeeChat 2.9)
10:32 -- BartoCH has joined #archiveteam-bs
10:59 -- jshoard has quit IRC (Read error: Operation timed out)
12:19 -- HP_Archiv has quit IRC (Quit: Leaving)
13:34 -- jshoard has joined #archiveteam-bs
14:24 -- BlueMax has quit IRC (Quit: Leaving)
14:30 -- Gallifrey has quit IRC (Read error: Connection reset by peer)
14:34 -- Gallifrey has joined #archiveteam-bs
14:35 -- Gallifrey has quit IRC (Read error: Connection reset by peer)
14:36 -- Gallifrey has joined #archiveteam-bs
14:48 -- Ravenloft has joined #archiveteam-bs
14:48
🔗
|
JAA |
I feel like we should "simply" continuously mirror all downloads Microsoft makes available at this point. |
15:03
🔗
|
|
Gallifrey has quit IRC (Read error: Connection reset by peer) |
15:05
🔗
|
|
Gallifrey has joined #archiveteam-bs |
15:07
🔗
|
JAA |
Oh yeah, my listing of Clutch's S3 finished and discovered some 33M files totalling 188 TB. |
15:13
🔗
|
|
schbirid has joined #archiveteam-bs |
15:13
🔗
|
JAA |
The video counts are ... interesting. |
15:13
🔗
|
JAA |
High-resolution videos have no suffix on the filename after the SHA-1. There are 5800580 of them totalling 97.6 TB. |
15:14
🔗
|
JAA |
Watermarked videos (*-watermarked.mp4): 4855831 files, 59.5 TB |
15:14
🔗
|
JAA |
Low-resolution videos (*-480.mp4): 5816979 files, 29.5 TB |
15:27
🔗
|
|
godane has quit IRC (Read error: Connection reset by peer) |
15:27
🔗
|
|
Arcorann has quit IRC (Read error: Connection reset by peer) |
15:44
🔗
|
|
fredgido_ has joined #archiveteam-bs |
15:46
🔗
|
|
fredgido has quit IRC (Ping timeout: 622 seconds) |
15:49
🔗
|
|
godane has joined #archiveteam-bs |
15:53
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
15:54
🔗
|
|
Raccoon has joined #archiveteam-bs |
16:06
🔗
|
|
Ctrl has quit IRC (Read error: Operation timed out) |
16:27
🔗
|
|
RichardG has quit IRC (Keyboard not found, press F1 to continue) |
16:30
🔗
|
|
RichardG has joined #archiveteam-bs |
16:51
🔗
|
|
prq has quit IRC (Remote host closed the connection) |
17:01
🔗
|
JAA |
My discovery on the API is running now. I'm simply iterating over the recent posts endpoint and extracting posts and users with a bunch of interesting metadata. The API is slooooow though, so that might take a bit. |
17:24
🔗
|
JAA |
OrIdow6: So are you grabbing the Microsoft downloads? |
17:25
🔗
|
JAA |
Er, mgrandi I guess. |
17:52
🔗
|
JAA |
I'm running something on it now. |
18:06
🔗
|
|
fivechan_ has joined #archiveteam-bs |
18:08
🔗
|
fivechan_ |
I have a question. If I have WARC files and upload them to InternetArchive with keyward "archiveteam", will they show up in Wayback Machine? |
18:14 <JAA> fivechan_: No, they won't. Only WARCs from trusted accounts are included in the Wayback Machine.
18:15 -- Mateon1 has quit IRC (Ping timeout: 260 seconds)
18:16 <fivechan_> To show web page in Wayback, I must ask archive team to archive team?
18:18 <fivechan_> I must ask archive team to archive them?
18:19 <JAA> Yes, or use the Wayback Machine's save tool.
18:24 <fivechan_> Thank you!! I understood.
18:25 -- fivechan_ has quit IRC (Ping timeout: 252 seconds)
18:25 <JAA> Turns out that Microsoft doesn't really like it when their Download Center gets hammered with requests. 403 pretty quickly.
18:26 <JAA> To be fair, I sent 100+ req/s at times. :-)
18:27 -- mgrandi has joined #archiveteam-bs
18:31 -- Mateon1 has joined #archiveteam-bs
18:39 <JAA> Hey mgrandi. In case you didn't check the logs, I've been looking into Microsoft's downloads a bit.
18:42 <JAA> My Clutch discovery is at 2020-07-21 after 1.5 hours. Yeah, this is going to take a while.
18:43 <mgrandi> ah so you are already working on it?
18:45 <mgrandi> or are you just getting URLs
18:46 <JAA> The latter. Trying to, anyway.
18:46 <JAA> I'm investigating Clutch's cursor format to see if I can speed this up a bit.
18:48 <mgrandi> hmm, not familiar with Clutch
18:51 <JAA> Two separate things I'm working on at the moment.
18:53 <mgrandi> ok, let me know if you need help
18:53 <mgrandi> i was personally just gonna iterate over 1->50000 and see if a page has anything vs a 404 and then download it there
18:54 <JAA> Well, maybe you have an idea, so here it goes: I'm iterating over https://clutch.win/v1/posts/recent/ . The next page is https://clutch.win/v1/posts/recent/?cursor=<cursor value from previous page> . I'm trying to figure out how to construct a cursor value to start at a particular point in time. The cursors are opaque though and have a weird format.
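
[Editor's note: the enumeration JAA describes is a plain cursor-paginated walk; a minimal sketch, with the JSON field names ("posts", "cursor") being hypothetical since the log doesn't show the response schema.]

    import requests

    URL = "https://clutch.win/v1/posts/recent/"

    params = {}
    while True:
        page = requests.get(URL, params=params, timeout=60).json()
        for post in page.get("posts", []):   # hypothetical field name
            ...                              # extract post/user metadata here
        cursor = page.get("cursor")          # hypothetical field name
        if not cursor:
            break
        params = {"cursor": cursor}          # next page, per the URL pattern above
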
18:54 <JAA> E.g. CksKFwoKY3JlYXRlZF9hdBIJCN6b6YuB_eoCEixqCXN-ZnR3LXV0bHIfCxIEdXNlchiAgICp8PqGCQwLEgRjbGlwGKzBzdIIDBgAIAE= which decodes to b'\nK\n\x17\n\ncreated_at\x12\t\x08\xde\x9b\xe9\x8b\x81\xfd\xea\x02\x12,j\ts~ftw-utlr\x1f\x0b\x12\x04user\x18\x80\x80\x80\xa9\xf0\xfa\x86\t\x0c\x0b\x12\x04clip\x18\xac\xc1\xcd\xd2\x08\x0c\x18\x00 \x01' (in Python notation).
18:56 <JAA> The created_at part controls the time axis, but I can't figure out what the rest is.
18:57 <mgrandi> oh clutch is a website, i was mentioning my idea for microsoft download center heh
18:57 <mgrandi> the cursors are probably dynamic, as in it represents a view of the database at that point in time
18:57 <mgrandi> so it does a query, stores it, and then cursors iterate over it so its not constantly changing while you are iterating over it
18:58 <mgrandi> you probably need to just iterate over the cursor values it gives you until you reach the end, i don't think you would be able to create one dynamically
18:58 <JAA> Possibly, but often cursors work as opaque identifiers similar to "before X time/DB ID".
19:00 <mgrandi> hmm, is that a pickled object?
19:00 <JAA> I can just iterate over it until done, but it's slow, so I'm trying to slice it into chunks of e.g. one day to process those in parallel.
19:01 <mgrandi> it looks like its a serialized object format of some kind
19:01 <JAA> Yeah
19:01 <mgrandi> it has `created_at`, `clip`, and `ftw-utl`, `user` fields in it
19:02 <mgrandi> but yeah, are you working on archiving the microsoft download center stuff? or should i still work on that
19:03 <JAA> I'm about 2/3 done enumerating the downloads now.
19:03 <mgrandi> ok cool, how did you do it? just wpull over the urls with item 0->50000?
19:03 <JAA> Just retrieving details.aspx and confirmation.aspx, extracting file URLs and sizes from the latter.
19:03 <JAA> qwarc for IDs 1 to 60k.
19:04 <mgrandi> k. are you actually downloading the files yet?
19:04 <JAA> Nope
19:04 <mgrandi> those are probably the biggest ones
19:04 <JAA> Yeah, definitely. Just wanted to collect the URLs and get a size estimate first.
19:04 <mgrandi> i don't think microsoft is gonna nuke these from history, i wonder if they are gonna put them back up.
19:04 <JAA> By the way, I saw occasional weird things where the details.aspx page would work but confirmation.aspx would redirect to the 404 page.
19:04 <JAA> I hope it doesn't happen the other way around...
19:05 <mgrandi> full disclosure, i just started working for microsoft, but not on any team that deals with that
19:05 <mgrandi> did you see where you have to like clear your cookies?
19:06 <mgrandi> that was happening to me
19:06 <JAA> Yeah, I'm clearing them after every ID because why not.
19:06 <mgrandi> (not sure why it does that, seems weird)
19:06 <mgrandi> also not ALL of the downloads are going away, just SHA1 ones
19:06 <mgrandi> although i have no idea if there is a way to tell which ones are going away without... downloading them first
19:08 <JAA> Yeah, exactly.
19:08 <JAA> Hence why I suggested just grabbing everything and also doing so continuously in the future.
19:11 <JAA> Retrieval is done, just need to fix that one weird 404 now.
19:12 <mgrandi> i still have my box with a 500gb hard drive for kickthebucket if you want me to download the actual files
19:12 <JAA> It's not reproducible by the way, just seems to happen under load or something like that.
19:15 <mgrandi> hmm
19:15 <mgrandi> also, is that base64 string you posted complete? or did you leave off a few = signs
19:16 <JAA> Nope, that's complete.
19:16 <mgrandi> is it base64?
19:17 <JAA> Microsoft Download Center: I found 51298 files with a total size of about 7.1 TB.
19:17 <mgrandi> its saying its length isn't a multiple of 4 characters
19:18 <mgrandi> @JAA hmm, thats a bit big
19:19 <JAA> Hmm, yeah, odd. Python's base64.urlsafe_b64decode doesn't have any issues with it though.
19:23 <mgrandi> oh its url safe, duh
19:23 <JAA> The highest ID I found was 58507, by the way, which was uploaded ... a year ago (assuming weird US date format)?
19:23 <JAA> https://www.microsoft.com/en-us/download/details.aspx?id=58507
19:24 <OrIdow6> It's protobuf
19:24 <JAA> I don't like protobuf.
19:25 <mgrandi> oof, protobuf is not great if we dont have the proto file
19:25 <JAA> Yeah
19:25 <mgrandi> but i assume you just need to figure out the created_at
19:26 <mgrandi> which is probably one of the integer types
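
[Editor's note: a sketch of poking at the cursor along the lines discussed; protoc's --decode_raw flag dumps arbitrary protobuf without a .proto file. Reading created_at as microseconds since the epoch is an editorial guess from the magnitude of the varint, not something confirmed in the log.]

    import base64
    import subprocess

    cursor = "CksKFwoKY3JlYXRlZF9hdBIJCN6b6YuB_eoCEixqCXN-ZnR3LXV0bHIfCxIEdXNlchiAgICp8PqGCQwLEgRjbGlwGKzBzdIIDBgAIAE="
    # URL-safe alphabet ('-' and '_'), which is why a strict standard-base64
    # decoder complains while urlsafe_b64decode is happy.
    raw = base64.urlsafe_b64decode(cursor)

    # Pretty-print the nested fields without a schema (requires protoc installed):
    out = subprocess.run(["protoc", "--decode_raw"], input=raw, capture_output=True)
    print(out.stdout.decode())
    # The varint next to created_at is ~1.59e15, i.e. plausibly microseconds
    # since the epoch (mid-2020), matching the "integer type" guess above.
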
19:26 <JAA> It seems that the other values also influence the results, sadly.
19:27 <JAA> This is getting messy here discussing about Microsoft Download Center and Clutch at the same time.
19:27 <OrIdow6> Yeah
19:27 <JAA> Let's focus on Microsoft first since it has such a short deadline.
19:28 <JAA> Surely the Download Center is still in use, right? Any ideas why the highest ID is a year old?
19:28 <mgrandi> maybe they migrated to other things?
19:28 <JAA> I noticed some big gaps in the IDs in some places though.
19:28 <JAA> There's almost nothing between 31k and 34k for example, just a few files.
19:31 <mgrandi> like i know that visual studio stuff has its own download page now instead of the download center
19:31 <JAA> Well, apparently the IDs aren't sequential *at all*.
19:31 <JAA> https://www.microsoft.com/en-us/download/details.aspx?id=1230 is from December...
19:32 <mgrandi> lol wut
19:32 <JAA> Damn, IDs go much higher also: https://www.microsoft.com/en-us/download/details.aspx?id=100688
19:32 <mgrandi> so, how do EXEs work? are the signing certificates at the very start?
19:33 <mgrandi> if so we could like...download 32kb of the file, check the cert and see if its a SHA1 cert or something?
19:34 <OrIdow6> But 1230 is apparently a security patch from 2010 (https://support.microsoft.com/en-us/help/2345000/ms10-079-description-of-the-security-update-for-word-2010-october-12-2)
19:34 <JAA> Huh
19:34 <JAA> So the 'Date Published' is completely unreliable.
19:38 <mgrandi> cause while everything should be archived eventually, only the SHA1 stuff is getting removed soon
19:39 <JAA> Yeah, but I'm also seeing .bin, .msi, .zip, even .tar.gz...
19:39 <JAA> .msu
19:39 <JAA> .msp
19:40 <JAA> etc.
19:41 <JAA> If you can figure something out to selectively archive those, that's great. Otherwise, we should just grab everything.
19:42 <mgrandi> well, i assume the associated files with the SHA1 downloads will be removed
19:42 <mgrandi> but if the cert is in a predictable spot, my strategy would be: for every page, download some amount of data for each EXE, see if its signed with a SHA1 cert, if it is, download everything for that 'item', else skip it
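
[Editor's note: a sketch of that strategy for PE files (exe/dll), under the assumption that pefile and asn1crypto are available. One catch for the "download 32kb" idea: the Authenticode blob lives wherever the security data directory points, which is usually near the end of the file, so a fixed-size prefix isn't enough; MSU/cabinet files would need separate handling.]

    import pefile
    from asn1crypto import cms

    def authenticode_digest_algs(path):
        """Return the digest algorithms (e.g. ['sha1']) used by a PE file's signature."""
        pe = pefile.PE(path, fast_load=True)
        # Data directory 4 is IMAGE_DIRECTORY_ENTRY_SECURITY; unusually, its
        # VirtualAddress is a raw file offset rather than an RVA.
        sec = pe.OPTIONAL_HEADER.DATA_DIRECTORY[4]
        if sec.Size == 0:
            return []  # unsigned
        blob = bytes(pe.__data__[sec.VirtualAddress : sec.VirtualAddress + sec.Size])
        # Skip the 8-byte WIN_CERTIFICATE header to reach the PKCS#7 DER data.
        signed = cms.ContentInfo.load(blob[8:])
        return [a["algorithm"].native for a in signed["content"]["digest_algorithms"]]

    # e.g.: 'sha1' in authenticode_digest_algs("somefile.exe") -> candidate for removal
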
19:42 <JAA> There are more .msu than .exe.
19:43 <mgrandi> what is a .msu?
19:43 <JAA> I have no idea. 'Microsoft Update' maybe?
19:43 <OrIdow6> http://fileformats.archiveteam.org/wiki/Microsoft_Update_Standalone_Package
19:43 <JAA> This fucking mess is precisely why I left the Windows world years ago. lol
19:43 -- Raccoon has quit IRC (Ping timeout: 610 seconds)
19:44 <mgrandi> not sure why its listed under EA files, heh
19:44 <mgrandi> well to be fair, they added these so these are not direct executables so they are a bit safer than just EXE files
19:44 <mgrandi> are MSU files signed?
19:44 <mgrandi> do you have an example download link? i'll check it out
19:45 <JAA> First one my scan found: https://download.microsoft.com/download/0/B/8/0B8852B8-8A3A-4A70-97CE-A84B5F4C5FC8/IE9-Windows6.0-KB2618444-x86.msu from ID 28401.
19:46 <mgrandi> yeah, that actually doesn't run cause windows 10 doesn't accept SHA1 certs anymore
19:47 <mgrandi> so they are Cabinet files (mszip i guess?)
19:47 <JAA> I found two different search interfaces for the Download Center, and they both suck.
19:47 <JAA> https://www.microsoft.com/en-us/search/downloadresults?FORM=DLC&ftapplicableproducts=^AllDownloads&sortby=+weight returns only 1000 results.
19:48 <JAA> https://www.microsoft.com/en-us/download/search.aspx is just broken.
19:48 <JAA> Sometimes returns the same results as you go through the pagination etc.
19:48 <JAA> Trying to establish the upper bound for the IDs.
19:49 <mgrandi> and the cert seems to be at the end of the MSU file
19:50 <OrIdow6> I'm going to guess that generally, the lower the ID, the older it is, and the more likely it is to use sha1
19:50 <mgrandi> that probably seems like a safe assumption
19:51 <OrIdow6> Depending on how slowly that search goes until it reaches the present (~11 hours left until midnight Pacific), it might be useful to start downloading before it finishes
19:51 <mgrandi> microsoft is US based, i'm not sure if they are gonna nuke it right at midnight on a sunday, so hopefully have a bit more time
19:52 <mgrandi> but yeah, probably. how should we handle...the data? are we allowed to upload these to archive.org?
19:56 <JAA> https://transfer.notkiska.pw/Kwk8n/microsoft-download-center-files-below-id-60000
20:01 <JAA> Actually, let me do that differently.
20:04 <mgrandi> so we gonna split it up and curl our way to victory?
20:04 <OrIdow6> I like the idea of "mirror everything as individual IA items"
20:06 -- wyatt8740 has quit IRC (Read error: Operation timed out)
20:14 -- wyatt8740 has joined #archiveteam-bs
20:14 <JAA> That'd actually be nice, yeah. With full metadata etc.
20:14 <JAA> But for now, we just need to grab everything we can.
20:16 <JAA> Download as WARCs, further processing later.
20:16 <mgrandi> yeah, but how are we gonna download them
20:44 <mgrandi> is it easy to set up a tracker thingy like kickthebucket?
20:45 -- jshoard has quit IRC (Quit: Leaving)
20:54 <JAA> https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl
20:54 <mgrandi> nice
20:58 <mgrandi> so how do we coordinate the downloads
21:03 <JAA> Here are the ten most frequent file extensions: 13780 msu, 13111 exe, 6812 zip, 3927 msi, 3770 pdf, 2228 pptx, 1214 docx, 888 bin, 828 doc, 483 xps
21:04 <mgrandi> so the ID corresponds to what page its on?
21:04 <JAA> That's the 'id' parameter from the URLs.
21:05 <mgrandi> yeah so if an 'item' has multiple downloads then they have different IDs
21:05 <JAA> Uh
21:05 <JAA> If an entry on the Download Center has multiple files, those all have the same ID in my list.
21:06 <JAA> E.g. https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658 -> three entries with ID 41658.
21:06 <mgrandi> ok
21:06 <mgrandi> thanks for setting this up :)
21:09 <JAA> Further statistics: top ten by size in GiB: 4100.1 .zip, 864.1 .exe, 597.3 .bin, 355.1 .iso, 182.3 .rar, 149.6 .cab, 118.1 .msi, 90.5 .msu, 17.6 .wmv, 16.7 .ISO
21:09 -- Mateon1 has quit IRC (Remote host closed the connection)
21:10 <JAA> I don't have a good idea how to do the actual retrieval though. Warriorbot isn't ready yet I think. :-/
21:10 -- Mateon1 has joined #archiveteam-bs
21:11 <mgrandi> is that the thing that sets up a 'warrior' project?
21:12 <JAA> No, it's a distributed ("warrior") project that simply retrieves lists of URLs.
21:12 <OrIdow6> Assuming everyone necessary is here, we could set up a quick warrior project
21:12 <mgrandi> setting up a warrior project pipeline seems easy right
21:12 <mgrandi> like you don't even need a lua script right, no recursion necessary or allow/deny list of urls
21:12 <OrIdow6> Seems like a lot of overhead considering this is basically just "wget --input-file" at scale
21:12 <OrIdow6> though
21:12 <JAA> Yeah
21:13 -- Gallifrey has quit IRC (Ping timeout: 265 seconds)
21:13 <JAA> (Yeah, a lot of overhead)
21:13 <mgrandi> but given the space requirements it might be good, because at least the rsync upload has the nice property of failing/retrying endlessly until the rsync target frees up space
21:13 <mgrandi> i assume thats what 'warriorbot' is meant to fix, to have a premade warrior project for just lists of urls
21:13 <JAA> Yes
21:14 <OrIdow6> I'm more concerned about getting the infrastructure set up
21:14 <OrIdow6> Viz. temporary storage (target or similar)
21:15 <mgrandi> yeah, the infrastructure of offloading the data is gonna be the hard part
21:15 <OrIdow6> In other words... anyone have 17TB free?
21:15 -- Gallifrey has joined #archiveteam-bs
21:16 <JAA> 17?
21:16 <OrIdow6> 7.1
21:16 <JAA> :-)
21:16 <OrIdow6> Transposed the digits
21:20 <mgrandi> uhh
21:21 <mgrandi> i have 500 gb + 100 + 100 on my boxes i was using for kick the bucket
21:21 <mgrandi> we can possibly alleviate some of it if we upload to archive.org like a normal warrior project right?
21:22 <OrIdow6> If S3 is feeling nice today
21:23 <JAA> (Narrator: It wasn't.)
21:25 <mgrandi> 17 tb is $170 a month on digital ocean which isn't terrible
21:26 <mgrandi> oh no, left off a 0, never mind, it doesn't even support that lol
21:28 <mgrandi> but as long as i'm not paying for this for a full month i probably could swing enough volumes to do 17 tb
21:30 <OrIdow6> How's network/transfer pricing?
21:30 <mgrandi> inbound is free
21:31 <mgrandi> yeah the outbound is what gets you
21:31 <mgrandi> Droplets include free outbound data transfer, starting at 1,000 GiB/month for the smallest plan. Excess data transfer is billed at $0.01/GiB. For example, the cost of 1,000 GiB of overage is $10. Inbound bandwidth to Droplets is always free.
21:32 <mgrandi> so $170, minus the free outbound bandwidth i get for my droplets which i think is 3tb for 3 droplets i have running now
21:32 <mgrandi> again, not terrible, but if someone else has a cheaper option =P
21:34 <JAA> It's 7.1 TB, not 17 TB.
21:34 <mgrandi> now i'm doing it xD
21:35 <mgrandi> so even better then
21:35 <mgrandi> any other ideas before i pull the lever?
21:37 <mgrandi> (brb like 40 minutes)
21:47 <Terbium> buyvm and some other VPS providers have unmetered bandwidth
21:52 <Terbium> $30/mo VPS + $10/mo for 2TB (1TB/$5) attached block storage might work
21:53 <Terbium> scratch that... they're all out of stock....
22:37 <mgrandi> well guess i'm buying the 7tb volume then
22:38 <mgrandi> what is the format that wget/curl takes for a file list @OrIdow6
22:38 <JAA> mgrandi: Please write WARCs, not plain files.
22:39 -- Gallifrey has quit IRC (Ping timeout: 265 seconds)
22:39 <JAA> wget/wpull has --input-file or -i for that.
22:40 <mgrandi> so what tool should i use?
22:40 <mgrandi> i have wget-at for the kickthebucket archive
22:40 <JAA> Yeah, wget-at seems good.
22:40 <mgrandi> it will take a jsonl file?
22:41 <JAA> Nope
22:41 <JAA> Plain lines of URLs
22:41 <mgrandi> ok
22:42 <mgrandi> do you have a convenient list of urls or do you want me to make one?
22:42 <JAA> I don't, but I can easily make one.
22:44 <mgrandi> if you can make it easily that would be good
22:45 <mgrandi> i'll do a 7.5 tb volume
22:45 <JAA> https://transfer.notkiska.pw/3SHDe/microsoft-download-center-files-below-id-60000-sorted-urls
22:47 -- Gallifrey has joined #archiveteam-bs
22:47 <mgrandi> do we have a way of exfilling these files to somewhere else?
22:48 <mgrandi> thats a big number for the monthly cost that i'd rather not pay lol
22:50 <JAA> I don't have any free storage at the moment I'm afraid.
22:51 <mgrandi> wait, i have 15tb at home, but will cox hate me
22:51 <JAA> Maybe SketchCow can set you up with space on FOS, although probably not the whole thing at once.
22:52 <mgrandi> i think i'll start with this, but its gonna cost $24/day
22:52 <mgrandi> and helps since its a commercial data center without residential ISP limits
22:52 <JAA> Or upload to IA as you grab.
22:52 <OrIdow6> Make sure to split it up, instead of getting one huge warc
22:52 <JAA> Yeah
22:52 <mgrandi> so does anyone know the wget-at args to do that?
22:52 <JAA> I tend to do 5 GiB WARCs.
22:54 <mgrandi> or just partition the file into chunks
22:54 <JAA> ArchiveBot's wpull options are a good starting point: https://github.com/ArchiveTeam/ArchiveBot/blob/3585ed999010665a7b367e37fd6f325f30a23983/pipeline/archivebot/seesaw/wpull.py#L12
22:56 <JAA> But wpull isn't fully compatible with wget.
22:56 <JAA> Or the DPoS project code repos, e.g. https://github.com/ArchiveTeam/mercurial-grab/blob/20b40049911bb721603de491d4e8a3aa5c4d3a81/pipeline.py#L173
22:59 <JAA> --warc-max-size to get multiple WARCs instead of one huge file.
23:00 <mgrandi> ok, yeah let me craft one based on that one
23:02 <JAA> Another important one is --delete-after so the plain file isn't kept after download.
23:04 <mgrandi> so --output-document outputs to a temp file, it writes it to a WARC, and then --delete-after deletes the temp file?
23:04 <OrIdow6> JAA: Is that list in any particular order?
23:04 <JAA> Yeah, something like that. I don't know what the exact data flow is in wget though. I think it writes it to the WARC immediately as the data is retrieved, not from the temp file.
23:05 <JAA> OrIdow6: Yes, sorted by ID.
23:05 <mgrandi> so do i need --output-document?
23:05 <OrIdow6> JAA: Good, that's what I was going to ask about
23:06 <mgrandi> or does wget-at need to write to something
23:06 <OrIdow6> IIRC output-file is only useful when dealing with a single file
23:07 <JAA> mgrandi: Not entirely sure to be honest. I'd include it though to be safe. Might have something to do with not creating directory structures or dealing with filenames containing odd characters.
23:08 <mgrandi> i'll include it anyway to be safe
23:09 <OrIdow6> *output-document (output-file sets the logfile location)
23:10 <JAA> By the way, ~16 hours at 1 Gb/s to retrieve it all.
23:10 <OrIdow6> Worst case is that it goes down at midnight automatically - not enough
23:12 <OrIdow6> Though I don't know the speed of whatever's downloading it
23:12 <mgrandi> its in digitalocean so it should be pretty fast
23:13 <mgrandi> so how do the warc file names impact the split on size?
23:14 <OrIdow6> Maybe split the list up, in case more people want to start downloading?
23:14 -- Arcorann has joined #archiveteam-bs
23:14 <JAA> I've never actually used wget(-lua/at) directly myself, but at least in wpull, --warc-file sets the filename prefix when --warc-max-size is used. `--warc-file foo --warc-max-size 1234` would produce foo-00000.warc.gz, foo-00001.warc.gz, etc., each "about" 1234 bytes (in wpull, the split happens as soon as possible after reaching that size).
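
[Editor's note: pulling the flags discussed in this stretch into one place. A sketch only: it assumes wget-at accepts the same WARC options as stock wget (it is a wget fork), and the file names, UA string, and prefix are placeholders, not mgrandi's actual gist.]

    import subprocess

    cmd = [
        "wget-at",
        "--input-file", "urls.txt",            # plain lines of URLs, one per line
        "--output-document", "tmp.dat",        # scratch file; each download overwrites it
        "--delete-after",                      # drop the plain file; the WARC keeps the data
        "--no-verbose",
        "--user-agent", "...",                 # something non-default; stock wget got 403'd earlier
        "--warc-file", "microsoft-download-center",  # prefix -> ...-00000.warc.gz, -00001, ...
        "--warc-max-size", "5368709120",       # 5 GiB in bytes, as suggested above
    ]
    subprocess.run(cmd, check=True)
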
23:14 <mgrandi> ok
23:16 <SketchCow> What what
23:17 <JAA> SketchCow: Microsoft deleting SHA-1-signed downloads from the Download Center tomorrow. No good way to determine which downloads are affected, total size 7.1 TB.
23:18 <mgrandi> https://gist.github.com/mgrandi/0904bbeeaba2a4c1bc7084ad26ec236e
23:18 <JAA> Not covered very well last time I checked.
23:18 <mgrandi> commands look good? any warc headers i should add?
23:19 <JAA> mgrandi: I'd remove --page-requisites --span-hosts --recursive --level inf since recursion isn't necessary here.
23:20 <JAA> --warc-max-size is missing.
23:20 <mgrandi> oh good call
23:21 <mgrandi> is that a number like `5gb`?
23:21 <mgrandi> it just says 'NUMBER'
23:21 <JAA> Bytes as an int, I think.
23:21 <OrIdow6> And the extra --warc-headers
23:21 <JAA> 5368709120
23:22 <JAA> Test it with a small --warc-max-size and the first couple URLs maybe to see if it does what you expect.
23:22 <mgrandi> what headers should i include?
23:22 <mgrandi> or does that matter / we can edit it later
23:23 <OrIdow6> mgrandi: I'm just referring to the two pointless lines: --warc-header ""
23:23 <JAA> Not sure. I often don't add any and document things in the item description on IA instead. It doesn't matter for the grab itself.
23:24 <mgrandi> ok
23:24 -- Raccoon has joined #archiveteam-bs
23:25 <mgrandi> updated: https://gist.github.com/mgrandi/0904bbeeaba2a4c1bc7084ad26ec236e
23:25 <mgrandi> i'm gonna try that with 12 urls and then 10 mb warc limit
23:27 <JAA> You might also want to split the list up and run multiple processes in parallel for higher throughput.
23:27 <JAA> Depending on transfer and disk speed obviously.
23:28 <mgrandi> ok
23:32 <mgrandi> 162 MB/s apparently
23:33 <JAA> Nice
23:33 <mgrandi> still think i need to split it up and run multiple processes?
23:34 <JAA> If it stays at that speed, probably not.
23:41 <mgrandi> and --delete-after is safe to have?
23:41 <mgrandi> since its saving it to the WARC?
23:44 <mgrandi> looks like its fine, i'll just leave it
23:45 <JAA> Yes, should be safe.
23:46 <JAA> Although it probably doesn't even matter that much since you have --output-document, so each download overwrites that file anyway.
23:47 <mgrandi> cool, lets begin
23:47 <mgrandi> if its going too slow i can always just start another one with different sections of the list and possibly have duplicates or ctrl+c after a certain point
23:52 <JAA> Good luck, and let me know if you see anything that isn't status 200.
23:55 <mgrandi> average is 27 MBit/s
23:56 <mgrandi> 2.5GB done already (compressed)
23:59 -- Gallifrey has quit IRC (Read error: Connection reset by peer)