00:06 -- Gallifrey has joined #archiveteam-bs
00:10 -- chirlu has quit IRC (Quit: Bye)
00:16 <JAA> Has anyone looked into https://www.bleepingcomputer.com/news/microsoft/microsoft-to-remove-all-windows-downloads-signed-with-sha-1/ ?
00:18 <SketchCow> Not here
00:19 -- Arcorann has joined #archiveteam-bs
00:20 -- Arcorann has quit IRC (Remote host closed the connection)
00:20 -- Arcorann has joined #archiveteam-bs
00:32 -- systwi_ is now known as systwi
02:19 -- VADemon has quit IRC (left4dead)
02:30 -- HP_Archiv has quit IRC (Quit: Leaving)
02:32 -- HP_Archiv has joined #archiveteam-bs
02:42 -- SmileyG has quit IRC (Read error: Operation timed out)
02:42 -- Smiley has joined #archiveteam-bs
03:24 -- VADemon has joined #archiveteam-bs
03:40 -- qw3rty_ has joined #archiveteam-bs
03:47 -- qw3rty__ has quit IRC (Read error: Operation timed out)
04:28 -- HP_Archiv has quit IRC (Quit: Leaving)
05:34 -- Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
05:48 -- mgrandi has joined #archiveteam-bs
05:49 <mgrandi> has anyone looked into saving SHA1 signed stuff from microsoft's download center?
05:51 <mgrandi> oof, its happening august 3rd
05:52 -- bsmith093 has quit IRC (Read error: Operation timed out)
05:56 <mgrandi> https://www.zdnet.com/google-amp/article/microsoft-to-remove-all-sha-1-windows-downloads-next-week/
05:57 <OrIdow6> I looked *at* it for about 5 minutes
06:01 <OrIdow6> Looking at it again...
06:02 -- jmtd is now known as Jon
06:03 <OrIdow6> It's hard to tell where the Microsoft Download Center ends and other things begin. They use the microsoft.com-wide search function to list downloads (and it only goes up to page 126, so creativity is needed with the various parameters); and is anything at update.microsoft.com being removed?
06:05 <OrIdow6> And is there any way to tell what's using sha1 besides downloading the file, figuring out what format it's in (exe, msi), and extracting the information in a format-specific way?
06:08 -- bsmith093 has joined #archiveteam-bs
06:12 <OrIdow6> Looks like it's divided into (what I will call) items, e.g. https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658 - I give that one as an example because it contains multiple files
06:12 <OrIdow6> The site started giving me 400s for everything at one point; going away and clearing cookies worked in that case
06:13 <OrIdow6> They're using radio buttons where checkboxes should be used, and using JS to make them behave like checkboxes
06:16 <OrIdow6> wget 403s, but UA of "abc" (and presumably most other things) work
06:16 <OrIdow6> *makes it 403
06:18 <OrIdow6> Looks like enumerating and downloading them should be straightforward (assuming they don't block) - can't say the same about figuring out what uses sha1 and playback
06:18 <OrIdow6> Maybe it's best to get the whole thing anyhow, if it's not too big (hopefully)
06:19 <OrIdow6> Obviously the rest of the site is going to go down some day
06:19 <OrIdow6> *sha1, and
06:58 -- Craigle has joined #archiveteam-bs
07:11 -- auror__ has joined #archiveteam-bs
07:17 -- mgrandi has quit IRC (Read error: Operation timed out)
07:22 -- HP_Archiv has joined #archiveteam-bs
07:24 -- VADemon_ has joined #archiveteam-bs
07:26 -- Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
07:28 -- VADemon has quit IRC (Ping timeout: 492 seconds)
07:39 -- auror__ is now known as mgrandi
07:39 <mgrandi> do you see an easy way to get a list of urls?
07:40 <mgrandi> i also can't see what things are SHA1 to make it easier
07:40 <mgrandi> without downloading everything and checking
07:41 <OrIdow6> mgrandi: Just go through all the numerical IDs
07:41 <mgrandi> ah. well, thats easy lol
07:41 <mgrandi> also, that link you posted, it seems to be 1 file but then it has like 'popular downloads' underneath it?
07:43 <OrIdow6> See "click here to download manually" if it doesn't get all 3 automatically
07:44 <OrIdow6> An msi and 2 pdfs
07:46 <mgrandi> for me its just auto prompting a download for a .exe
07:47 <mgrandi> wait no i must have been on a different page
07:47 <mgrandi> ok yeah i see it
07:49 <mgrandi> so it seems like if it has multiple files it has a <div> with a class of `multifile-failover-list`
07:50 <mgrandi> wait, did that link suddenly stop working for you?
07:50 <mgrandi> i'm getting "item is no longer available"
07:54 <OrIdow6> No
07:54 <OrIdow6> Try clearing your cookies
07:54 <mgrandi> ok, its weird, yeah it must be like using the cookies to try and see if you downloaded it recently
07:56 <mgrandi> hmm, should i try just writing a python script that iterates over all 50000 ids and gets the links and then downloads them?
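
[Editor's note: a minimal sketch of the ID-scan idea floated here, assuming the confirmation.aspx URL pattern and the earlier observations that a non-default User-Agent avoids the 403 and that stale cookies cause bogus "no longer available" responses. The regex and UA string are illustrative, not anyone's actual script.]

    import re
    import requests

    CONFIRM = "https://www.microsoft.com/en-us/download/confirmation.aspx?id={}"
    # Hypothetical pattern; the actual file links point at download.microsoft.com.
    FILE_RE = re.compile(r'https://download\.microsoft\.com/[^"\']+')

    for item_id in range(1, 50001):
        # Fresh session per ID, since the site's cookies eventually make it
        # claim items are "no longer available" (see above).
        with requests.Session() as s:
            s.headers["User-Agent"] = "abc"  # wget's default UA reportedly gets 403'd
            r = s.get(CONFIRM.format(item_id), timeout=60)
            if r.status_code != 200:
                continue  # gap in the ID space (or throttling)
            for url in sorted(set(FILE_RE.findall(r.text))):
                print(item_id, url)
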
07:56 <mgrandi> i don't even think archiveteam can host these probably, but at least someone will have a copy
07:56 <mgrandi> archive.org *
07:59 <OrIdow6> There's a lot of old software hosted by the IA, even though it's technically in-copyright
08:00 <OrIdow6> These are free downloads, of security updates and other "technical" files, with no ads, that are being removed permanently; I would think (though you never know) that they're fairly "safe"
08:00 <mgrandi> SwiftOnSecurity on twitter brought up the point that a lot of these are needed even by modern software
08:01 <mgrandi> stuff like the VC2XXX c++ redist packages and stuff
08:01 <OrIdow6> Are you talking about something else when you say that the IA has an inability to host them?
08:01 <OrIdow6> And I don't know who that is
08:01 <mgrandi> yeah i meant that they are in copyright
08:02 <mgrandi> slash by a company that would probably DMCA them being on the IA
08:03 <mgrandi> uhhh, some twitter half taylor swift parody account slash technology/infosec account
08:03 <mgrandi> "In 2017, Microsoft removed downloads for Movie Maker.
08:04 <mgrandi> What resulted was years of customers looking for the file being infected and scammed by malware."
08:04 <mgrandi> (whoops, didn't mean to copy the newline) https://www.welivesecurity.com/2017/11/09/eset-detected-windows-movie-maker-scam-2017/
08:04 <OrIdow6> You're preaching to the choir here with the public access thing
08:05 <mgrandi> yeah, heh
08:05
🔗
|
mgrandi |
so If you aren't working on anything i'll probably see if i can just whip up a quick script to download the files, if there are no complications, i don't think its worth getting WARCs of the entire pages |
08:06
🔗
|
OrIdow6 |
You might want to get warcs of the description pages, at least, to get all the metadata |
08:08
🔗
|
OrIdow6 |
And I know it's somewhat contrary to the orthodoxy, but I think these would better be in the form of individual IA items rather than in warcs locked behind playback problems |
08:09
🔗
|
OrIdow6 |
As they are practically already separated like that |
08:09
🔗
|
mgrandi |
i have never done anything like this before so i assume i'll be making them per item,, i dunno what is the best practice |
08:09
🔗
|
mgrandi |
what is the best thing that has warc integration? would be manually downloading them with python requests into a warc file or using something like wpull? |
08:10
🔗
|
OrIdow6 |
First things first, get the data, seeing that, at minimum, it may be removed in 16 hours |
08:10
🔗
|
mgrandi |
(i have experience with page scraping and all that but not with generating warcs) |
08:11
🔗
|
OrIdow6 |
https://github.com/webrecorder/warcio#quick-start-to-writing-a-warc - easy way to write to warc when using requests |
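
[Editor's note: for reference, the quick start OrIdow6 links here boils down to a few lines; this sketch uses a placeholder output filename.]

    from warcio.capture_http import capture_http
    import requests  # per the warcio README, import requests *after* capture_http

    # Everything requests fetches inside this block is recorded into the WARC.
    with capture_http("microsoft-download-center.warc.gz"):
        requests.get("https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658")
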
08:13 -- Craigle has joined #archiveteam-bs
08:13 <mgrandi> thank goodness someone has that
08:13 <OrIdow6> Though you could also make a list of the confirm pages, wpull them all, extract it after the fact from there, and then get the URLs of the downloads from that; or any number of other things; but this looks like it would disrupt your current idea the least
08:14 <OrIdow6> "This" being the thing I linked (thanks J A A)
08:14 <mgrandi> wpull would probably be easiest yeah
08:17 <mgrandi> if i can just figure out what urls to get and tell it to not go off recursively on some other microsoft site
08:18 <OrIdow6> You could whitelist instead of blacklist
08:20 <mgrandi> yeah
08:21 -- Laverne has quit IRC (Ping timeout: 272 seconds)
08:21 -- Aoede has quit IRC (Ping timeout: 272 seconds)
08:22 -- brayden has quit IRC (Ping timeout: 272 seconds)
08:35 -- mgrytbak has quit IRC (Ping timeout: 272 seconds)
08:35 <mgrandi> ok, i will work on this when i get up
08:37 -- mgrandi has quit IRC (Leaving)
09:06 -- i0npulse has quit IRC (Quit: leaving)
09:10 -- i0npulse has joined #archiveteam-bs
09:22 -- jshoard has joined #archiveteam-bs
09:25 -- Raccoon has quit IRC (Ping timeout: 745 seconds)
09:25 -- Aoede has joined #archiveteam-bs
09:26 -- Laverne has joined #archiveteam-bs
09:31 -- brayden has joined #archiveteam-bs
09:33 -- OrIdow6 has quit IRC (Ping timeout: 265 seconds)
09:34 -- OrIdow6 has joined #archiveteam-bs
09:39 -- mgrytbak has joined #archiveteam-bs
10:13 -- VADemon_ has quit IRC (Read error: Connection reset by peer)
10:13 -- BartoCH has quit IRC (Quit: WeeChat 2.9)
10:32 -- BartoCH has joined #archiveteam-bs
10:59 -- jshoard has quit IRC (Read error: Operation timed out)
12:19 -- HP_Archiv has quit IRC (Quit: Leaving)
13:34 -- jshoard has joined #archiveteam-bs
14:24 -- BlueMax has quit IRC (Quit: Leaving)
14:30 -- Gallifrey has quit IRC (Read error: Connection reset by peer)
14:34 -- Gallifrey has joined #archiveteam-bs
14:35 -- Gallifrey has quit IRC (Read error: Connection reset by peer)
14:36 -- Gallifrey has joined #archiveteam-bs
14:48 -- Ravenloft has joined #archiveteam-bs
14:48
🔗
|
JAA |
I feel like we should "simply" continuously mirror all downloads Microsoft makes available at this point. |
15:03
🔗
|
|
Gallifrey has quit IRC (Read error: Connection reset by peer) |
15:05
🔗
|
|
Gallifrey has joined #archiveteam-bs |
15:07
🔗
|
JAA |
Oh yeah, my listing of Clutch's S3 finished and discovered some 33M files totalling 188 TB. |
15:13
🔗
|
|
schbirid has joined #archiveteam-bs |
15:13
🔗
|
JAA |
The video counts are ... interesting. |
15:13
🔗
|
JAA |
High-resolution videos have no suffix on the filename after the SHA-1. There are 5800580 of them totalling 97.6 TB. |
15:14
🔗
|
JAA |
Watermarked videos (*-watermarked.mp4): 4855831 files, 59.5 TB |
15:14
🔗
|
JAA |
Low-resolution videos (*-480.mp4): 5816979 files, 29.5 TB |
15:27
🔗
|
|
godane has quit IRC (Read error: Connection reset by peer) |
15:27
🔗
|
|
Arcorann has quit IRC (Read error: Connection reset by peer) |
15:44
🔗
|
|
fredgido_ has joined #archiveteam-bs |
15:46
🔗
|
|
fredgido has quit IRC (Ping timeout: 622 seconds) |
15:49
🔗
|
|
godane has joined #archiveteam-bs |
15:53
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
15:54
🔗
|
|
Raccoon has joined #archiveteam-bs |
16:06
🔗
|
|
Ctrl has quit IRC (Read error: Operation timed out) |
16:27
🔗
|
|
RichardG has quit IRC (Keyboard not found, press F1 to continue) |
16:30
🔗
|
|
RichardG has joined #archiveteam-bs |
16:51
🔗
|
|
prq has quit IRC (Remote host closed the connection) |
17:01
🔗
|
JAA |
My discovery on the API is running now. I'm simply iterating over the recent posts endpoint and extracting posts and users with a bunch of interesting metadata. The API is slooooow though, so that might take a bit. |
17:24
🔗
|
JAA |
OrIdow6: So are you grabbing the Microsoft downloads? |
17:25
🔗
|
JAA |
Er, mgrandi I guess. |
17:52
🔗
|
JAA |
I'm running something on it now. |
18:06
🔗
|
|
fivechan_ has joined #archiveteam-bs |
18:08
🔗
|
fivechan_ |
I have a question. If I have WARC files and upload them to InternetArchive with keyward "archiveteam", will they show up in Wayback Machine? |
18:14 <JAA> fivechan_: No, they won't. Only WARCs from trusted accounts are included in the Wayback Machine.
18:15 -- Mateon1 has quit IRC (Ping timeout: 260 seconds)
18:16 <fivechan_> To show web page in Wayback, I must ask archive team to archive team?
18:18 <fivechan_> I must ask archive team to archive them?
18:19 <JAA> Yes, or use the Wayback Machine's save tool.
18:24 <fivechan_> Thank you!! I understood.
18:25 -- fivechan_ has quit IRC (Ping timeout: 252 seconds)
18:25 <JAA> Turns out that Microsoft doesn't really like it when their Download Center gets hammered with requests. 403 pretty quickly.
18:26 <JAA> To be fair, I sent 100+ req/s at times. :-)
18:27 -- mgrandi has joined #archiveteam-bs
18:31 -- Mateon1 has joined #archiveteam-bs
18:39 <JAA> Hey mgrandi. In case you didn't check the logs, I've been looking into Microsoft's downloads a bit.
18:42 <JAA> My Clutch discovery is at 2020-07-21 after 1.5 hours. Yeah, this is going to take a while.
18:43 <mgrandi> ah so you are already working on it?
18:45 <mgrandi> or are you just getting URLs
18:46 <JAA> The latter. Trying to, anyway.
18:46 <JAA> I'm investigating Clutch's cursor format to see if I can speed this up a bit.
18:48 <mgrandi> hmm, not familiar with Clutch
18:51 <JAA> Two separate things I'm working on at the moment.
18:53 <mgrandi> ok, let me know if you need help
18:53 <mgrandi> i was personally just gonna iterate over 1->50000 and see if a page has anything vs a 404 and then download it there
18:54 <JAA> Well, maybe you have an idea, so here it goes: I'm iterating over https://clutch.win/v1/posts/recent/ . The next page is https://clutch.win/v1/posts/recent/?cursor=<cursor value from previous page> . I'm trying to figure out how to construct a cursor value to start at a particular point in time. The cursors are opaque though and have a weird format.
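
[Editor's note: the enumeration JAA describes is a plain cursor-paginated walk; a minimal sketch, with the JSON field names ("posts", "cursor") being hypothetical since the log doesn't show the response schema.]

    import requests

    URL = "https://clutch.win/v1/posts/recent/"

    params = {}
    while True:
        page = requests.get(URL, params=params, timeout=60).json()
        for post in page.get("posts", []):   # hypothetical field name
            ...                              # extract post/user metadata here
        cursor = page.get("cursor")          # hypothetical field name
        if not cursor:
            break
        params = {"cursor": cursor}          # next page, per the URL pattern above
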
18:54 <JAA> E.g. CksKFwoKY3JlYXRlZF9hdBIJCN6b6YuB_eoCEixqCXN-ZnR3LXV0bHIfCxIEdXNlchiAgICp8PqGCQwLEgRjbGlwGKzBzdIIDBgAIAE= which decodes to b'\nK\n\x17\n\ncreated_at\x12\t\x08\xde\x9b\xe9\x8b\x81\xfd\xea\x02\x12,j\ts~ftw-utlr\x1f\x0b\x12\x04user\x18\x80\x80\x80\xa9\xf0\xfa\x86\t\x0c\x0b\x12\x04clip\x18\xac\xc1\xcd\xd2\x08\x0c\x18\x00 \x01' (in Python notation).
18:56 <JAA> The created_at part controls the time axis, but I can't figure out what the rest is.
18:57 <mgrandi> oh clutch is a website, i was mentioning my idea for microsoft download center heh
18:57 <mgrandi> the cursors are probably dynamic, as in it represents a view of the database at that point in time
18:57 <mgrandi> so it does a query, stores it, and then cursors iterate over it so its not constantly changing while you are iterating over it
18:58 <mgrandi> you probably need to just iterate over the cursor values it gives you until you reach the end, i don't think you would be able to create one dynamically
18:58 <JAA> Possibly, but often cursors work as opaque identifiers similar to "before X time/DB ID".
19:00 <mgrandi> hmm, is that a pickled object?
19:00 <JAA> I can just iterate over it until done, but it's slow, so I'm trying to slice it into chunks of e.g. one day to process those in parallel.
19:01 <mgrandi> it looks like its a serialized object format of some kind
19:01 <JAA> Yeah
19:01 <mgrandi> it has `created_at`, `clip`, and `ftw-utl`, `user` fields in it
19:02 <mgrandi> but yeah, are you working on archiving the microsoft download center stuff? or should i still work on that
19:03 <JAA> I'm about 2/3 done enumerating the downloads now.
19:03 <mgrandi> ok cool, how did you do it? just wpull over the urls with item 0->50000?
19:03 <JAA> Just retrieving details.aspx and confirmation.aspx, extracting file URLs and sizes from the latter.
19:03 <JAA> qwarc for IDs 1 to 60k.
19:04 <mgrandi> k. are you actually downloading the files yet?
19:04 <JAA> Nope
19:04 <mgrandi> those are probably the biggest ones
19:04 <JAA> Yeah, definitely. Just wanted to collect the URLs and get a size estimate first.
19:04 <mgrandi> i don't think microsoft is gonna nuke these from history, i wonder if they are gonna put them back up.
19:04 <JAA> By the way, I saw occasional weird things where the details.aspx page would work but confirmation.aspx would redirect to the 404 page.
19:04 <JAA> I hope it doesn't happen the other way around...
19:05 <mgrandi> full disclosure, i just started working for microsoft, but not on any team that deals with that
19:05 <mgrandi> did you see where you have to like clear your cookies?
19:06 <mgrandi> that was happening to me
19:06 <JAA> Yeah, I'm clearing them after every ID because why not.
19:06 <mgrandi> (not sure why it does that, seems weird)
19:06 <mgrandi> also not ALL of the downloads are going away, just SHA1 ones
19:06 <mgrandi> although i have no idea if there is a way to tell which ones are going away without... downloading them first
19:08 <JAA> Yeah, exactly.
19:08 <JAA> Hence why I suggested just grabbing everything and also doing so continuously in the future.
19:11 <JAA> Retrieval is done, just need to fix that one weird 404 now.
19:12 <mgrandi> i still have my box with a 500gb hard drive for kickthebucket if you want me to download the actual files
19:12 <JAA> It's not reproducible by the way, just seems to happen under load or something like that.
19:15 <mgrandi> hmm
19:15 <mgrandi> also, is that base64 string you posted complete? or did you leave off a few = signs
19:16 <JAA> Nope, that's complete.
19:16 <mgrandi> is it base64?
19:17 <JAA> Microsoft Download Center: I found 51298 files with a total size of about 7.1 TB.
19:17 <mgrandi> its saying its length isn't a multiple of 4 characters
19:18 <mgrandi> @JAA hmm, thats a bit big
19:19 <JAA> Hmm, yeah, odd. Python's base64.urlsafe_b64decode doesn't have any issues with it though.
19:23 <mgrandi> oh its url safe, duh
19:23 <JAA> The highest ID I found was 58507, by the way, which was uploaded ... a year ago (assuming weird US date format)?
19:23 <JAA> https://www.microsoft.com/en-us/download/details.aspx?id=58507
19:24 <OrIdow6> It's protobuf
19:24 <JAA> I don't like protobuf.
19:25 <mgrandi> oof, protobuf is not great if we dont have the proto file
19:25 <JAA> Yeah
19:25 <mgrandi> but i assume you just need to figure out the created_at
19:26 <mgrandi> which is probably one of the integer types
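
[Editor's note: a sketch of poking at the cursor along the lines discussed; protoc's --decode_raw flag dumps arbitrary protobuf without a .proto file. Reading created_at as microseconds since the epoch is an editorial guess from the magnitude of the varint, not something confirmed in the log.]

    import base64
    import subprocess

    cursor = "CksKFwoKY3JlYXRlZF9hdBIJCN6b6YuB_eoCEixqCXN-ZnR3LXV0bHIfCxIEdXNlchiAgICp8PqGCQwLEgRjbGlwGKzBzdIIDBgAIAE="
    # URL-safe alphabet ('-' and '_'), which is why a strict standard-base64
    # decoder complains while urlsafe_b64decode is happy.
    raw = base64.urlsafe_b64decode(cursor)

    # Pretty-print the nested fields without a schema (requires protoc installed):
    out = subprocess.run(["protoc", "--decode_raw"], input=raw, capture_output=True)
    print(out.stdout.decode())
    # The varint next to created_at is ~1.59e15, i.e. plausibly microseconds
    # since the epoch (mid-2020), matching the "integer type" guess above.
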
19:26 <JAA> It seems that the other values also influence the results, sadly.
19:27 <JAA> This is getting messy here discussing about Microsoft Download Center and Clutch at the same time.
19:27 <OrIdow6> Yeah
19:27 <JAA> Let's focus on Microsoft first since it has such a short deadline.
19:28 <JAA> Surely the Download Center is still in use, right? Any ideas why the highest ID is a year old?
19:28 <mgrandi> maybe they migrated to other things?
19:28 <JAA> I noticed some big gaps in the IDs in some places though.
19:28 <JAA> There's almost nothing between 31k and 34k for example, just a few files.
19:31 <mgrandi> like i know that visual studio stuff has its own download page now instead of the download center
19:31 <JAA> Well, apparently the IDs aren't sequential *at all*.
19:31 <JAA> https://www.microsoft.com/en-us/download/details.aspx?id=1230 is from December...
19:32 <mgrandi> lol wut
19:32 <JAA> Damn, IDs go much higher also: https://www.microsoft.com/en-us/download/details.aspx?id=100688
19:32 <mgrandi> so, how do EXEs work? are the signing certificates at the very start?
19:33 <mgrandi> if so we could like...download 32kb of the file, check the cert and see if its a SHA1 cert or something?
19:34 <OrIdow6> But 1230 is apparently a security patch from 2010 (https://support.microsoft.com/en-us/help/2345000/ms10-079-description-of-the-security-update-for-word-2010-october-12-2)
19:34 <JAA> Huh
19:34 <JAA> So the 'Date Published' is completely unreliable.
19:38 <mgrandi> cause while everything should be archived eventually, only the SHA1 stuff is getting removed soon
19:39 <JAA> Yeah, but I'm also seeing .bin, .msi, .zip, even .tar.gz...
19:39 <JAA> .msu
19:39 <JAA> .msp
19:40 <JAA> etc.
19:41 <JAA> If you can figure something out to selectively archive those, that's great. Otherwise, we should just grab everything.
19:42 <mgrandi> well, i assume the associated files with the SHA1 downloads will be removed
19:42 <mgrandi> but if the cert is in a predictable spot, my strategy would be: for every page, download some amount of data for each EXE, see if its signed with a SHA1 cert, if it is, download everything for that 'item', else skip it
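
[Editor's note: a sketch of that strategy for PE files (exe/dll), under the assumption that pefile and asn1crypto are available. One catch for the "download 32kb" idea: the Authenticode blob lives wherever the security data directory points, which is usually near the end of the file, so a fixed-size prefix isn't enough; MSU/cabinet files would need separate handling.]

    import pefile
    from asn1crypto import cms

    def authenticode_digest_algs(path):
        """Return the digest algorithms (e.g. ['sha1']) used by a PE file's signature."""
        pe = pefile.PE(path, fast_load=True)
        # Data directory 4 is IMAGE_DIRECTORY_ENTRY_SECURITY; unusually, its
        # VirtualAddress is a raw file offset rather than an RVA.
        sec = pe.OPTIONAL_HEADER.DATA_DIRECTORY[4]
        if sec.Size == 0:
            return []  # unsigned
        blob = bytes(pe.__data__[sec.VirtualAddress : sec.VirtualAddress + sec.Size])
        # Skip the 8-byte WIN_CERTIFICATE header to reach the PKCS#7 DER data.
        signed = cms.ContentInfo.load(blob[8:])
        return [a["algorithm"].native for a in signed["content"]["digest_algorithms"]]

    # e.g.: 'sha1' in authenticode_digest_algs("somefile.exe") -> candidate for removal
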
19:42 <JAA> There are more .msu than .exe.
19:43 <mgrandi> what is a .msu?
19:43 <JAA> I have no idea. 'Microsoft Update' maybe?
19:43 <OrIdow6> http://fileformats.archiveteam.org/wiki/Microsoft_Update_Standalone_Package
19:43 <JAA> This fucking mess is precisely why I left the Windows world years ago. lol
19:43 -- Raccoon has quit IRC (Ping timeout: 610 seconds)
19:44 <mgrandi> not sure why its listed under EA files, heh
19:44 <mgrandi> well to be fair, they added these so these are not direct executables so they are a bit safer than just EXE files
19:44 <mgrandi> are MSU files signed?
19:44 <mgrandi> do you have an example download link? i'll check it out
19:45 <JAA> First one my scan found: https://download.microsoft.com/download/0/B/8/0B8852B8-8A3A-4A70-97CE-A84B5F4C5FC8/IE9-Windows6.0-KB2618444-x86.msu from ID 28401.
19:46 <mgrandi> yeah, that actually doesn't run cause windows 10 doesn't accept SHA1 certs anymore
19:47 <mgrandi> so they are Cabinet files (mszip i guess?)
19:47 <JAA> I found two different search interfaces for the Download Center, and they both suck.
19:47 <JAA> https://www.microsoft.com/en-us/search/downloadresults?FORM=DLC&ftapplicableproducts=^AllDownloads&sortby=+weight returns only 1000 results.
19:48 <JAA> https://www.microsoft.com/en-us/download/search.aspx is just broken.
19:48 <JAA> Sometimes returns the same results as you go through the pagination etc.
19:48 <JAA> Trying to establish the upper bound for the IDs.
19:49 <mgrandi> and the cert seems to be at the end of the MSU file
19:50 <OrIdow6> I'm going to guess that generally, the lower the ID, the older it is, and the more likely it is to use sha1
19:50 <mgrandi> that probably seems like a safe assumption
19:51 <OrIdow6> Depending on how slowly that search goes until it reaches the present (~11 hours left until midnight Pacific), it might be useful to start downloading before it finishes
19:51 <mgrandi> microsoft is US based, i'm not sure if they are gonna nuke it right at midnight on a sunday, so hopefully have a bit more time
19:52 <mgrandi> but yeah, probably. how should we handle...the data? are we allowed to upload these to archive.org?
19:56 <JAA> https://transfer.notkiska.pw/Kwk8n/microsoft-download-center-files-below-id-60000
20:01 <JAA> Actually, let me do that differently.
20:04 <mgrandi> so we gonna split it up and curl our way to victory?
20:04 <OrIdow6> I like the idea of "mirror everything as individual IA items"
20:06 -- wyatt8740 has quit IRC (Read error: Operation timed out)
20:14 -- wyatt8740 has joined #archiveteam-bs
20:14 <JAA> That'd actually be nice, yeah. With full metadata etc.
20:14 <JAA> But for now, we just need to grab everything we can.
20:16 <JAA> Download as WARCs, further processing later.
20:16 <mgrandi> yeah, but how are we gonna download them
20:44 <mgrandi> is it easy to set up a tracker thingy like kickthebucket?
20:45 -- jshoard has quit IRC (Quit: Leaving)
20:54 <JAA> https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl
20:54 <mgrandi> nice
20:58 <mgrandi> so how do we coordinate the downloads
21:03 <JAA> Here are the ten most frequent file extensions: 13780 msu, 13111 exe, 6812 zip, 3927 msi, 3770 pdf, 2228 pptx, 1214 docx, 888 bin, 828 doc, 483 xps
21:04 <mgrandi> so the ID corresponds to what page its on?
21:04 <JAA> That's the 'id' parameter from the URLs.
21:05 <mgrandi> yeah so if an 'item' has multiple downloads then they have different IDs
21:05 <JAA> Uh
21:05 <JAA> If an entry on the Download Center has multiple files, those all have the same ID in my list.
21:06 <JAA> E.g. https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658 -> three entries with ID 41658.
21:06 <mgrandi> ok
21:06 <mgrandi> thanks for setting this up :)
21:09 <JAA> Further statistics: top ten by size in GiB: 4100.1 .zip, 864.1 .exe, 597.3 .bin, 355.1 .iso, 182.3 .rar, 149.6 .cab, 118.1 .msi, 90.5 .msu, 17.6 .wmv, 16.7 .ISO
21:09 -- Mateon1 has quit IRC (Remote host closed the connection)
21:10 <JAA> I don't have a good idea how to do the actual retrieval though. Warriorbot isn't ready yet I think. :-/
21:10 -- Mateon1 has joined #archiveteam-bs
21:11 <mgrandi> is that the thing that sets up a 'warrior' project?
21:12 <JAA> No, it's a distributed ("warrior") project that simply retrieves lists of URLs.
21:12 <OrIdow6> Assuming everyone necessary is here, we could set up a quick warrior project
21:12 <mgrandi> setting up a warrior project pipeline seems easy right
21:12 <mgrandi> like you don't even need a lua script right, no recursion necessary or allow/deny list of urls
21:12 <OrIdow6> Seems like a lot of overhead considering this is basically just "wget --input-file" at scale
21:12 <OrIdow6> though
21:12 <JAA> Yeah
21:13 -- Gallifrey has quit IRC (Ping timeout: 265 seconds)
21:13 <JAA> (Yeah, a lot of overhead)
21:13 <mgrandi> but given the space requirements it might be good, because at least the rsync upload has the nice property of failing/retrying endlessly until the rsync target frees up space
21:13 <mgrandi> i assume thats what 'warriorbot' is meant to fix, to have a premade warrior project for just lists of urls
21:13 <JAA> Yes
21:14 <OrIdow6> I'm more concerned about getting the infrastructure set up
21:14 <OrIdow6> Viz. temporary storage (target or similar)
21:15 <mgrandi> yeah, the infrastructure of offloading the data is gonna be the hard part
21:15 <OrIdow6> In other words... anyone have 17TB free?
21:15 -- Gallifrey has joined #archiveteam-bs
21:16 <JAA> 17?
21:16 <OrIdow6> 7.1
21:16 <JAA> :-)
21:16 <OrIdow6> Transposed the digits
21:20 <mgrandi> uhh
21:21 <mgrandi> i have 500 gb + 100 + 100 on my boxes i was using for kick the bucket
21:21 <mgrandi> we can possibly alleviate some of it if we upload to archive.org like a normal warrior project right?
21:22 <OrIdow6> If S3 is feeling nice today
21:23 <JAA> (Narrator: It wasn't.)
21:25 <mgrandi> 17 tb is $170 a month on digital ocean which isn't terrible
21:26 <mgrandi> oh no, left off a 0, never mind, it doesn't even support that lol
21:28 <mgrandi> but as long as i'm not paying for this for a full month i probably could swing enough volumes to do 17 tb
21:30 <OrIdow6> How's network/transfer pricing?
21:30 <mgrandi> inbound is free
21:31 <mgrandi> yeah the outbound is what gets you
21:31 <mgrandi> Droplets include free outbound data transfer, starting at 1,000 GiB/month for the smallest plan. Excess data transfer is billed at $0.01/GiB. For example, the cost of 1,000 GiB of overage is $10. Inbound bandwidth to Droplets is always free.
21:32 <mgrandi> so $170, minus the free outbound bandwidth i get for my droplets which i think is 3tb for 3 droplets i have running now
21:32 <mgrandi> again, not terrible, but if someone else has a cheaper option =P
21:34 <JAA> It's 7.1 TB, not 17 TB.
21:34 <mgrandi> now i'm doing it xD
21:35 <mgrandi> so even better then
21:35 <mgrandi> any other ideas before i pull the lever?
21:37 <mgrandi> (brb like 40 minutes)
21:47 <Terbium> buyvm and some other VPS providers have unmetered bandwidth
21:52 <Terbium> $30/mo VPS + $10/mo for 2TB (1TB/$5) attached block storage might work
21:53 <Terbium> scratch that... they're all out of stock....
22:37 <mgrandi> well guess i'm buying the 7tb volume then
22:38 <mgrandi> what is the format that wget/curl takes for a file list @OrIdow6
22:38 <JAA> mgrandi: Please write WARCs, not plain files.
22:39 -- Gallifrey has quit IRC (Ping timeout: 265 seconds)
22:39 <JAA> wget/wpull has --input-file or -i for that.
22:40 <mgrandi> so what tool should i use?
22:40 <mgrandi> i have wget-at for the kickthebucket archive
22:40 <JAA> Yeah, wget-at seems good.
22:40 <mgrandi> it will take a jsonl file?
22:41 <JAA> Nope
22:41 <JAA> Plain lines of URLs
22:41 <mgrandi> ok
22:42 <mgrandi> do you have a convenient list of urls or do you want me to make one?
22:42 <JAA> I don't, but I can easily make one.
22:44 <mgrandi> if you can make it easily that would be good
22:45 <mgrandi> i'll do a 7.5 tb volume
22:45 <JAA> https://transfer.notkiska.pw/3SHDe/microsoft-download-center-files-below-id-60000-sorted-urls
22:47 -- Gallifrey has joined #archiveteam-bs
22:47 <mgrandi> do we have a way of exfilling these files to somewhere else?
22:48 <mgrandi> thats a big number for the monthly cost that i'd rather not pay lol
22:50 <JAA> I don't have any free storage at the moment I'm afraid.
22:51 <mgrandi> wait, i have 15tb at home, but will cox hate me
22:51 <JAA> Maybe SketchCow can set you up with space on FOS, although probably not the whole thing at once.
22:52 <mgrandi> i think i'll start with this, but its gonna cost $24/day
22:52 <mgrandi> and helps since its a commercial data center without residential ISP limits
22:52 <JAA> Or upload to IA as you grab.
22:52 <OrIdow6> Make sure to split it up, instead of getting one huge warc
22:52 <JAA> Yeah
22:52 <mgrandi> so does anyone know the wget-at args to do that?
22:52 <JAA> I tend to do 5 GiB WARCs.
22:54 <mgrandi> or just partition the file into chunks
22:54 <JAA> ArchiveBot's wpull options are a good starting point: https://github.com/ArchiveTeam/ArchiveBot/blob/3585ed999010665a7b367e37fd6f325f30a23983/pipeline/archivebot/seesaw/wpull.py#L12
22:56 <JAA> But wpull isn't fully compatible with wget.
22:56 <JAA> Or the DPoS project code repos, e.g. https://github.com/ArchiveTeam/mercurial-grab/blob/20b40049911bb721603de491d4e8a3aa5c4d3a81/pipeline.py#L173
22:59 <JAA> --warc-max-size to get multiple WARCs instead of one huge file.
23:00 <mgrandi> ok, yeah let me craft one based on that one
23:02 <JAA> Another important one is --delete-after so the plain file isn't kept after download.
23:04 <mgrandi> so --output-document outputs to a temp file, it writes it to a WARC, and then --delete-after deletes the temp file?
23:04 <OrIdow6> JAA: Is that list in any particular order?
23:04 <JAA> Yeah, something like that. I don't know what the exact data flow is in wget though. I think it writes it to the WARC immediately as the data is retrieved, not from the temp file.
23:05 <JAA> OrIdow6: Yes, sorted by ID.
23:05 <mgrandi> so do i need --output-document?
23:05 <OrIdow6> JAA: Good, that's what I was going to ask about
23:06 <mgrandi> or does wget-at need to write to something
23:06 <OrIdow6> IIRC output-file is only useful when dealing with a single file
23:07 <JAA> mgrandi: Not entirely sure to be honest. I'd include it though to be safe. Might have something to do with not creating directory structures or dealing with filenames containing odd characters.
23:08 <mgrandi> i'll include it anyway to be safe
23:09 <OrIdow6> *output-document (output-file sets the logfile location)
23:10 <JAA> By the way, ~16 hours at 1 Gb/s to retrieve it all.
23:10 <OrIdow6> Worst case is that it goes down at midnight automatically - not enough
23:12 <OrIdow6> Though I don't know the speed of whatever's downloading it
23:12 <mgrandi> its in digitalocean so it should be pretty fast
23:13 <mgrandi> so how do the warc file names impact the split on size?
23:14 <OrIdow6> Maybe split the list up, in case more people want to start downloading?
23:14 -- Arcorann has joined #archiveteam-bs
23:14 <JAA> I've never actually used wget(-lua/at) directly myself, but at least in wpull, --warc-file sets the filename prefix when --warc-max-size is used. `--warc-file foo --warc-max-size 1234` would produce foo-00000.warc.gz, foo-00001.warc.gz, etc., each "about" 1234 bytes (in wpull, the split happens as soon as possible after reaching that size).
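
[Editor's note: pulling the flags discussed in this stretch into one place. A sketch only: it assumes wget-at accepts the same WARC options as stock wget (it is a wget fork), and the file names, UA string, and prefix are placeholders, not mgrandi's actual gist.]

    import subprocess

    cmd = [
        "wget-at",
        "--input-file", "urls.txt",            # plain lines of URLs, one per line
        "--output-document", "tmp.dat",        # scratch file; each download overwrites it
        "--delete-after",                      # drop the plain file; the WARC keeps the data
        "--no-verbose",
        "--user-agent", "...",                 # something non-default; stock wget got 403'd earlier
        "--warc-file", "microsoft-download-center",  # prefix -> ...-00000.warc.gz, -00001, ...
        "--warc-max-size", "5368709120",       # 5 GiB in bytes, as suggested above
    ]
    subprocess.run(cmd, check=True)
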
23:14 <mgrandi> ok
23:16 <SketchCow> What what
23:17 <JAA> SketchCow: Microsoft deleting SHA-1-signed downloads from the Download Center tomorrow. No good way to determine which downloads are affected, total size 7.1 TB.
23:18 <mgrandi> https://gist.github.com/mgrandi/0904bbeeaba2a4c1bc7084ad26ec236e
23:18 <JAA> Not covered very well last time I checked.
23:18 <mgrandi> commands look good? any warc headers i should add?
23:19 <JAA> mgrandi: I'd remove --page-requisites --span-hosts --recursive --level inf since recursion isn't necessary here.
23:20 <JAA> --warc-max-size is missing.
23:20 <mgrandi> oh good call
23:21 <mgrandi> is that a number like `5gb`?
23:21 <mgrandi> it just says 'NUMBER'
23:21 <JAA> Bytes as an int, I think.
23:21 <OrIdow6> And the extra --warc-headers
23:21 <JAA> 5368709120
23:22 <JAA> Test it with a small --warc-max-size and the first couple URLs maybe to see if it does what you expect.
23:22 <mgrandi> what headers should i include?
23:22 <mgrandi> or does that matter / we can edit it later
23:23 <OrIdow6> mgrandi: I'm just referring to the two pointless lines: --warc-header ""
23:23 <JAA> Not sure. I often don't add any and document things in the item description on IA instead. It doesn't matter for the grab itself.
23:24 <mgrandi> ok
23:24 -- Raccoon has joined #archiveteam-bs
23:25 <mgrandi> updated: https://gist.github.com/mgrandi/0904bbeeaba2a4c1bc7084ad26ec236e
23:25 <mgrandi> i'm gonna try that with 12 urls and then 10 mb warc limit
23:27 <JAA> You might also want to split the list up and run multiple processes in parallel for higher throughput.
23:27 <JAA> Depending on transfer and disk speed obviously.
23:28 <mgrandi> ok
23:32 <mgrandi> 162 MB/s apparently
23:33 <JAA> Nice
23:33 <mgrandi> still think i need to split it up and run multiple processes?
23:34 <JAA> If it stays at that speed, probably not.
23:41 <mgrandi> and --delete-after is safe to have?
23:41 <mgrandi> since its saving it to the WARC?
23:44 <mgrandi> looks like its fine, i'll just leave it
23:45 <JAA> Yes, should be safe.
23:46 <JAA> Although it probably doesn't even matter that much since you have --output-document, so each download overwrites that file anyway.
23:47 <mgrandi> cool, lets begin
23:47 <mgrandi> if its going too slow i can always just start another one with different sections of the list and possibly have duplicates or ctrl+c after a certain point
23:52 <JAA> Good luck, and let me know if you see anything that isn't status 200.
23:55 <mgrandi> average is 27 MBit/s
23:56 <mgrandi> 2.5GB done already (compressed)
23:59 -- Gallifrey has quit IRC (Read error: Connection reset by peer)