[00:06] *** Gallifrey has joined #archiveteam-bs
[00:10] *** chirlu has quit IRC (Quit: Bye)
[00:16] Has anyone looked into https://www.bleepingcomputer.com/news/microsoft/microsoft-to-remove-all-windows-downloads-signed-with-sha-1/ ?
[00:18] Not here
[00:19] *** Arcorann has joined #archiveteam-bs
[00:20] *** Arcorann has quit IRC (Remote host closed the connection)
[00:20] *** Arcorann has joined #archiveteam-bs
[00:32] *** systwi_ is now known as systwi
[02:19] *** VADemon has quit IRC (left4dead)
[02:30] *** HP_Archiv has quit IRC (Quit: Leaving)
[02:32] *** HP_Archiv has joined #archiveteam-bs
[02:42] *** SmileyG has quit IRC (Read error: Operation timed out)
[02:42] *** Smiley has joined #archiveteam-bs
[03:24] *** VADemon has joined #archiveteam-bs
[03:40] *** qw3rty_ has joined #archiveteam-bs
[03:47] *** qw3rty__ has quit IRC (Read error: Operation timed out)
[04:28] *** HP_Archiv has quit IRC (Quit: Leaving)
[05:34] *** Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
[05:48] *** mgrandi has joined #archiveteam-bs
[05:49] has anyone looked into saving SHA1-signed stuff from microsoft's download center?
[05:51] oof, it's happening august 3rd
[05:52] *** bsmith093 has quit IRC (Read error: Operation timed out)
[05:56] https://www.zdnet.com/google-amp/article/microsoft-to-remove-all-sha-1-windows-downloads-next-week/
[05:57] I looked *at* it for about 5 minutes
[06:01] Looking at it again...
[06:02] *** jmtd is now known as Jon
[06:03] It's hard to tell where the Microsoft Download Center ends and other things begin. They use the microsoft.com-wide search function to list downloads (and it only goes up to page 126, so creativity is needed with the various parameters); and is anything at update.microsoft.com being removed?
[06:05] And is there any way to tell what's using sha1 besides downloading the file, figuring out what format it's in (exe, msi), and extracting the information in a format-specific way?
[06:08] *** bsmith093 has joined #archiveteam-bs
[06:12] Looks like it's divided into (what I will call) items, e.g. https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658 - I give that one as an example because it contains multiple files
[06:12] The site started giving me 400s for everything at one point; going away and clearing cookies worked in that case
[06:13] They're using radio buttons where checkboxes should be used, and using JS to make them behave like checkboxes
[06:16] wget 403s, but UA of "abc" (and presumably most other things) work
[06:16] *makes it 403
[06:18] Looks like enumerating and downloading them should be straightforward (assuming they don't block) - can't say the same about figuring out what uses sha1 and playback
[06:18] Maybe it's best to get the whole thing anyhow, if it's not too big (hopefully)
[06:19] Obviously the rest of the site is going to go down some day
[06:19] *sha1, and
[06:58] *** Craigle has joined #archiveteam-bs
[07:11] *** auror__ has joined #archiveteam-bs
[07:17] *** mgrandi has quit IRC (Read error: Operation timed out)
[07:22] *** HP_Archiv has joined #archiveteam-bs
[07:24] *** VADemon_ has joined #archiveteam-bs
[07:26] *** Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
[07:28] *** VADemon has quit IRC (Ping timeout: 492 seconds)
[07:39] *** auror__ is now known as mgrandi
[07:39] do you see an easy way to get a list of urls?
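A minimal sketch of what probing a single Download Center item could look like, based on the observations above ([06:12] on cookies, [06:16] on the User-Agent) and the ID-enumeration idea suggested just below at [07:41]. The "no longer available" marker text and the choice of User-Agent are assumptions, not verified values:

```python
# Hypothetical probe of one Download Center item. A non-default User-Agent is
# used because stock wget reportedly gets 403'd ([06:16]); a fresh session per
# item avoids the cookie weirdness mentioned at [06:12] and [07:50].
import requests

CONFIRM = "https://www.microsoft.com/en-us/download/confirmation.aspx?id={id}"
HEADERS = {"User-Agent": "abc"}  # anything other than the default wget UA

def probe(item_id):
    with requests.Session() as s:
        r = s.get(CONFIRM.format(id=item_id), headers=HEADERS, timeout=60)
        if r.status_code != 200 or "no longer available" in r.text:
            return None
        return r.text

if __name__ == "__main__":
    print("found item" if probe(41658) else "nothing there")
```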
[07:40] i also can't see what things are SHA1 to make it easier
[07:40] without downloading everything and checking
[07:41] mgrandi: Just go through all the numerical IDs
[07:41] ah. well, thats easy lol
[07:41] also, that link you posted, it seems to be 1 file but then it has like 'popular downloads' underneath it?
[07:43] See "click here to download manually" if it doesn't get all 3 automatically
[07:44] An msi and 2 pdfs
[07:46] for me its just auto prompting a download for a .exe
[07:47] wait no i must have been on a different page
[07:47] ok yeah i see it
[07:49] so it seems like if it has multiple files it has an element with a class of `multifile-failover-list`
[07:50] wait, did that link suddenly stop working for you?
[07:50] i'm getting "item is no longer available"
[07:54] No
[07:54] Try clearing your cookies
[07:54] ok, its weird, yeah it must be like using the cookies to try and see if you downloaded it recently
[07:56] hmm, should i try just writing a python script that iterates over all 50000 ids and gets the links and then downloads them?
[07:56] i don't even think archiveteam can host these probably, but at least someone will have a copy
[07:56] archive.org *
[07:59] There's a lot of old software hosted by the IA, even though it's technically in-copyright
[08:00] These are free downloads, of security updates and other "technical" files, with no ads, that are being removed permanently; I would think (though you never know) that they're fairly "safe"
[08:00] SwiftOnSecurity on twitter brought up the point that a lot of these are needed even by modern software
[08:01] stuff like the VC2XXX c++ redist packages and stuff
[08:01] Are you talking about something else when you say that the IA has an inability to host them?
[08:01] And I don't know who that is
[08:01] yeah i meant that they are in copyright
[08:01] slash by a company that would probably DMCA them being on the IA
[08:02] uhhh, some twitter half taylor swift parody account slash technology/infosec account
[08:03] "In 2017, Microsoft removed downloads for Movie Maker.
[08:03] What resulted was years of customers looking for the file being infected and scammed by malware."
[08:04] (whoops, didn't mean to copy the newline) https://www.welivesecurity.com/2017/11/09/eset-detected-windows-movie-maker-scam-2017/
[08:04] You're preaching to the choir here with the public access thing
[08:04] yeah, heh
[08:05] so if you aren't working on anything i'll probably see if i can just whip up a quick script to download the files, if there are no complications, i don't think its worth getting WARCs of the entire pages
[08:06] You might want to get warcs of the description pages, at least, to get all the metadata
[08:08] And I know it's somewhat contrary to the orthodoxy, but I think these would better be in the form of individual IA items rather than in warcs locked behind playback problems
[08:09] As they are practically already separated like that
[08:09] i have never done anything like this before so i assume i'll be making them per item, i dunno what is the best practice
[08:09] what is the best thing that has warc integration? would it be manually downloading them with python requests into a warc file or using something like wpull?
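For the [08:09] question, this is roughly what the requests-into-a-WARC option looks like using the warcio helper that gets linked just below at [08:11]. The output filename and example ID are arbitrary choices:

```python
# Sketch of warcio's capture_http pattern (per the quick-start linked at
# [08:11]): every request made inside the block is recorded into the WARC.
from warcio.capture_http import capture_http
import requests  # note: must be imported after capture_http for capture to work

ids = [41658]  # arbitrary example; a real run would iterate over all item IDs

with capture_http("msdl-confirmation-pages.warc.gz"):
    for item_id in ids:
        requests.get(
            f"https://www.microsoft.com/en-us/download/confirmation.aspx?id={item_id}",
            headers={"User-Agent": "abc"},
        )
```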
[08:10] First things first, get the data, seeing that, at minimum, it may be removed in 16 hours
[08:10] (i have experience with page scraping and all that but not with generating warcs)
[08:11] https://github.com/webrecorder/warcio#quick-start-to-writing-a-warc - easy way to write to warc when using requests
[08:12] *** Craigle has joined #archiveteam-bs
[08:13] thank goodness someone has that
[08:13] Though you could also make a list of the confirm pages, wpull them all, extract it after the fact from there, and then get the URLs of the downloads from that; or any number of other things; but this looks like it would disrupt your current idea the least
[08:13] "This" being the thing I linked (thanks J A A)
[08:14] wpull would probably be easiest yeah
[08:14] if i can just figure out what urls to get and tell it to not go off recursively on some other microsoft site
[08:17] You could whitelist instead of blacklist
[08:18] yeah
[08:20] *** Laverne has quit IRC (Ping timeout: 272 seconds)
[08:21] *** Aoede has quit IRC (Ping timeout: 272 seconds)
[08:21] *** brayden has quit IRC (Ping timeout: 272 seconds)
[08:22] *** mgrytbak has quit IRC (Ping timeout: 272 seconds)
[08:35] ok, i will work on this when i get up
[08:35] *** mgrandi has quit IRC (Leaving)
[08:37] *** i0npulse has quit IRC (Quit: leaving)
[09:06] *** i0npulse has joined #archiveteam-bs
[09:10] *** jshoard has joined #archiveteam-bs
[09:22] *** Raccoon has quit IRC (Ping timeout: 745 seconds)
[09:25] *** Aoede has joined #archiveteam-bs
[09:25] *** Laverne has joined #archiveteam-bs
[09:26] *** brayden has joined #archiveteam-bs
[09:31] *** OrIdow6 has quit IRC (Ping timeout: 265 seconds)
[09:33] *** OrIdow6 has joined #archiveteam-bs
[09:34] *** mgrytbak has joined #archiveteam-bs
[09:39] *** VADemon_ has quit IRC (Read error: Connection reset by peer)
[10:13] *** BartoCH has quit IRC (Quit: WeeChat 2.9)
[10:13] *** BartoCH has joined #archiveteam-bs
[10:32] *** jshoard has quit IRC (Read error: Operation timed out)
[10:59] *** HP_Archiv has quit IRC (Quit: Leaving)
[12:19] *** jshoard has joined #archiveteam-bs
[13:34] *** BlueMax has quit IRC (Quit: Leaving)
[14:24] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[14:30] *** Gallifrey has joined #archiveteam-bs
[14:34] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[14:35] *** Gallifrey has joined #archiveteam-bs
[14:36] *** Ravenloft has joined #archiveteam-bs
[14:48] I feel like we should "simply" continuously mirror all downloads Microsoft makes available at this point.
[15:03] *** Gallifrey has quit IRC (Read error: Connection reset by peer)
[15:05] *** Gallifrey has joined #archiveteam-bs
[15:07] Oh yeah, my listing of Clutch's S3 finished and discovered some 33M files totalling 188 TB.
[15:13] *** schbirid has joined #archiveteam-bs
[15:13] The video counts are ... interesting.
[15:13] High-resolution videos have no suffix on the filename after the SHA-1. There are 5800580 of them totalling 97.6 TB.
[15:14] Watermarked videos (*-watermarked.mp4): 4855831 files, 59.5 TB
[15:14] Low-resolution videos (*-480.mp4): 5816979 files, 29.5 TB
[15:27] *** godane has quit IRC (Read error: Connection reset by peer)
[15:27] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[15:44] *** fredgido_ has joined #archiveteam-bs
[15:46] *** fredgido has quit IRC (Ping timeout: 622 seconds)
[15:49] *** godane has joined #archiveteam-bs
[15:53] *** schbirid has quit IRC (Quit: Leaving)
[15:54] *** Raccoon has joined #archiveteam-bs
[16:06] *** Ctrl has quit IRC (Read error: Operation timed out)
[16:27] *** RichardG has quit IRC (Keyboard not found, press F1 to continue)
[16:30] *** RichardG has joined #archiveteam-bs
[16:51] *** prq has quit IRC (Remote host closed the connection)
[17:01] My discovery on the API is running now. I'm simply iterating over the recent posts endpoint and extracting posts and users with a bunch of interesting metadata. The API is slooooow though, so that might take a bit.
[17:24] OrIdow6: So are you grabbing the Microsoft downloads?
[17:25] Er, mgrandi I guess.
[17:52] I'm running something on it now.
[18:06] *** fivechan_ has joined #archiveteam-bs
[18:08] I have a question. If I have WARC files and upload them to InternetArchive with keyword "archiveteam", will they show up in Wayback Machine?
[18:10] fivechan_: No, they won't. Only WARCs from trusted accounts are included in the Wayback Machine.
[18:14] *** Mateon1 has quit IRC (Ping timeout: 260 seconds)
[18:15] To show web page in Wayback, I must ask archive team to archive team?
[18:16] I must ask archive team to archive them?
[18:18] Yes, or use the Wayback Machine's save tool.
[18:19] Thank you!! I understood.
[18:24] *** fivechan_ has quit IRC (Ping timeout: 252 seconds)
[18:25] Turns out that Microsoft doesn't really like it when their Download Center gets hammered with requests. 403 pretty quickly.
[18:25] To be fair, I sent 100+ req/s at times. :-)
[18:26] *** mgrandi has joined #archiveteam-bs
[18:27] *** Mateon1 has joined #archiveteam-bs
[18:31] Hey mgrandi. In case you didn't check the logs, I've been looking into Microsoft's downloads a bit.
[18:39] My Clutch discovery is at 2020-07-21 after 1.5 hours. Yeah, this is going to take a while.
[18:42] ah so you are already working on it?
[18:43] or are you just getting URLs
[18:45] The latter. Trying to, anyway.
[18:46] I'm investigating Clutch's cursor format to see if I can speed this up a bit.
[18:46] hmm, not familiar with Clutch
[18:48] Two separate things I'm working on at the moment.
[18:51] ok, let me know if you need help
[18:53] i was personally just gonna iterate over 1->50000 and see if a page has anything vs a 404 and then download it there
[18:53] Well, maybe you have an idea, so here it goes: I'm iterating over https://clutch.win/v1/posts/recent/ . The next page is https://clutch.win/v1/posts/recent/?cursor= . I'm trying to figure out how to construct a cursor value to start at a particular point in time. The cursors are opaque though and have a weird format.
[18:54] E.g. CksKFwoKY3JlYXRlZF9hdBIJCN6b6YuB_eoCEixqCXN-ZnR3LXV0bHIfCxIEdXNlchiAgICp8PqGCQwLEgRjbGlwGKzBzdIIDBgAIAE= which decodes to b'\nK\n\x17\n\ncreated_at\x12\t\x08\xde\x9b\xe9\x8b\x81\xfd\xea\x02\x12,j\ts~ftw-utlr\x1f\x0b\x12\x04user\x18\x80\x80\x80\xa9\xf0\xfa\x86\t\x0c\x0b\x12\x04clip\x18\xac\xc1\xcd\xd2\x08\x0c\x18\x00 \x01' (in Python notation).
[18:54] The created_at part controls the time axis, but I can't figure out what the rest is.
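The pagination loop described at [18:53] presumably looks something like the sketch below. Only the endpoint and the ?cursor= parameter are confirmed in the log; the JSON field names ("posts", "cursor") are guesses for illustration:

```python
# Rough shape of the cursor-following discovery loop from [18:53]. The
# response field names ("posts", "cursor") are assumptions; only the endpoint
# and the ?cursor= parameter appear in the log.
import requests

BASE = "https://clutch.win/v1/posts/recent/"

def iterate_recent():
    cursor = None
    while True:
        params = {"cursor": cursor} if cursor else {}
        data = requests.get(BASE, params=params, timeout=60).json()
        yield from data.get("posts", [])  # assumed field name
        cursor = data.get("cursor")       # assumed field name
        if not cursor:
            break
```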
[18:56] oh clutch is a website, i was mentioning my idea for microsoft download center heh
[18:57] the cursors are probably dynamic, as in it represents a view of the database at that point in time
[18:57] so it does a query, stores it, and then cursors iterate over it so its not constantly changing while you are iterating over it
[18:57] you probably need to just iterate over the cursor values it gives you until you reach the end, i don't think you would be able to create one dynamically
[18:58] Possibly, but often cursors work as opaque identifiers similar to "before X time/DB ID".
[18:58] hmm, is that a pickled object?
[19:00] I can just iterate over it until done, but it's slow, so I'm trying to slice it into chunks of e.g. one day to process those in parallel.
[19:00] it looks like its a serialized object format of some kind
[19:01] Yeah
[19:01] it has `created_at`, `clip`, and `ftw-utl`, `user` fields in it
[19:01] but yeah, are you working on archiving the microsoft download center stuff? or should i still work on that
[19:02] I'm about 2/3 done enumerating the downloads now.
[19:03] ok cool, how did you do it? just wpull over the urls with item 0->50000?
[19:03] Just retrieving details.aspx and confirmation.aspx, extracting file URLs and sizes from the latter.
[19:03] qwarc for IDs 1 to 60k.
[19:03] k. are you actually downloading the files yet?
[19:04] Nope
[19:04] those are probably the biggest ones
[19:04] Yeah, definitely. Just wanted to collect the URLs and get a size estimate first.
[19:04] i don't think microsoft is gonna nuke these from history, i wonder if they are gonna put them back up.
[19:04] By the way, I saw occasional weird things where the details.aspx page would work but confirmation.aspx would redirect to the 404 page.
[19:04] I hope it doesn't happen the other way around...
[19:04] full disclosure, i just started working for microsoft, but not on any team that deals with that
[19:05] did you see where you have to like clear your cookies?
[19:05] that was happening to me
[19:06] Yeah, I'm clearing them after every ID because why not.
[19:06] (not sure why it does that, seems weird)
[19:06] also not ALL of the downloads are going away, just SHA1 ones
[19:06] although i have no idea if there is a way to tell which ones are going away without...downloading them first
[19:08] Yeah, exactly.
[19:08] Hence why I suggested just grabbing everything and also doing so continuously in the future.
[19:11] Retrieval is done, just need to fix that one weird 404 now.
[19:12] i still have my box with a 500gb hard drive for kickthebucket if you want me to download the actual files
[19:12] It's not reproducible by the way, just seems to happen under load or something like that.
[19:15] hmm
[19:15] also, is that base64 string you posted complete? or did you leave off a few = signs
[19:16] Nope, that's complete.
[19:16] is it base64?
[19:17] Microsoft Download Center: I found 51298 files with a total size of about 7.1 TB.
[19:17] it's saying it's not divisible by 4 characters
[19:17] @JAA hmm, thats a bit big
[19:18] Hmm, yeah, odd. Python's base64.urlsafe_b64decode doesn't have any issues with it though.
[19:19] oh its url safe, duh
[19:23] The highest ID I found was 58507, by the way, which was uploaded ... a year ago (assuming weird US date format)?
[19:23] https://www.microsoft.com/en-us/download/details.aspx?id=58507
[19:23] It's protobuf
[19:24] I don't like protobuf.
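For reference, the decode from [18:54] takes only a couple of lines; the `-` and `_` characters just mean the cursor uses the URL-safe base64 alphabet ([19:19]), and the resulting bytes are a serialized protobuf message ([19:23]) that could be inspected with something like `protoc --decode_raw` even without the .proto file:

```python
# Decoding the opaque Clutch cursor from [18:54]. urlsafe_b64decode handles
# the '-' and '_' characters of the URL-safe alphabet; the output is raw
# protobuf, which (lacking the .proto definition) can still be poked at with
# e.g. `protoc --decode_raw`.
import base64

cursor = "CksKFwoKY3JlYXRlZF9hdBIJCN6b6YuB_eoCEixqCXN-ZnR3LXV0bHIfCxIEdXNlchiAgICp8PqGCQwLEgRjbGlwGKzBzdIIDBgAIAE="
print(base64.urlsafe_b64decode(cursor))
```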
[19:24] oof, protobuf is not great if we dont have the proto file
[19:25] Yeah
[19:25] but i assume you just need to figure out the created_at
[19:25] which is probably one of the integer types
[19:25] It seems that the other values also influence the results, sadly.
[19:26] This is getting messy here, discussing the Microsoft Download Center and Clutch at the same time.
[19:27] Yeah
[19:27] Let's focus on Microsoft first since it has such a short deadline.
[19:27] Surely the Download Center is still in use, right? Any ideas why the highest ID is a year old?
[19:28] maybe they migrated to other things?
[19:28] I noticed some big gaps in the IDs in some places though.
[19:28] There's almost nothing between 31k and 34k for example, just a few files.
[19:28] like i know that visual studio stuff has its own download page now instead of the download center
[19:31] Well, apparently the IDs aren't sequential *at all*.
[19:31] https://www.microsoft.com/en-us/download/details.aspx?id=1230 is from December...
[19:31] lol wut
[19:32] Damn, IDs go much higher also: https://www.microsoft.com/en-us/download/details.aspx?id=100688
[19:32] so, how do EXEs work? are the signing certificates at the very start?
[19:32] if so we could like...download 32kb of the file, check the cert and see if its a SHA1 cert or something?
[19:33] But 1230 is apparently a security patch from 2010 (https://support.microsoft.com/en-us/help/2345000/ms10-079-description-of-the-security-update-for-word-2010-october-12-2)
[19:34] Huh
[19:34] So the 'Date Published' is completely unreliable.
[19:34] cause while everything should be archived eventually, only the SHA1 stuff is getting removed soon
[19:38] Yeah, but I'm also seeing .bin, .msi, .zip, even .tar.gz...
[19:39] .msu
[19:39] .msp
[19:39] etc.
[19:40] If you can figure something out to selectively archive those, that's great. Otherwise, we should just grab everything.
[19:41] well, i assume the associated files with the SHA1 downloads will be removed
[19:42] but if the cert is in a predictable spot, my strategy would be: for every page, download some amount of data for each EXE, see if its signed with a SHA1 cert, if it is, download everything for that 'item', else skip it
[19:42] There are more .msu than .exe.
[19:42] what is a .msu?
[19:43] I have no idea. 'Microsoft Update' maybe?
[19:43] http://fileformats.archiveteam.org/wiki/Microsoft_Update_Standalone_Package
[19:43] This fucking mess is precisely why I left the Windows world years ago. lol
[19:43] *** Raccoon has quit IRC (Ping timeout: 610 seconds)
[19:43] not sure why its listed under EA files, heh
[19:44] well to be fair, they added these so these are not direct executables so they are a bit safer than just EXE files
[19:44] are MSU files signed?
[19:44] do you have an example download link? i'll check it out
[19:44] First one my scan found: https://download.microsoft.com/download/0/B/8/0B8852B8-8A3A-4A70-97CE-A84B5F4C5FC8/IE9-Windows6.0-KB2618444-x86.msu from ID 28401.
[19:45] yeah, that actually doesn't run cause windows 10 doesn't accept SHA1 certs anymore
[19:46] so they are Cabinet files (mszip i guess?)
[19:47] I found two different search interfaces for the Download Center, and they both suck.
[19:47] https://www.microsoft.com/en-us/search/downloadresults?FORM=DLC&ftapplicableproducts=^AllDownloads&sortby=+weight returns only 1000 results.
[19:47] https://www.microsoft.com/en-us/download/search.aspx is just broken.
[19:48] Sometimes returns the same results as you go through the pagination etc.
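On the [19:32] question about where the signature lives: in a PE file the Authenticode blob is pointed to by the "security" data directory and in practice sits at the end of the file, so grabbing only the first 32kb wouldn't reach it. A sketch using pefile to locate it (the filename is a placeholder, and actually checking whether the digest is SHA-1 would additionally require parsing the PKCS#7 blob, which this doesn't do):

```python
# Locate the Authenticode signature blob in a signed .exe via the PE
# "security" data directory. For this directory, VirtualAddress is a plain
# file offset, which is typically near the end of the file.
import pefile

def signature_location(path):
    pe = pefile.PE(path, fast_load=True)
    sec = pe.OPTIONAL_HEADER.DATA_DIRECTORY[
        pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_SECURITY"]
    ]
    return sec.VirtualAddress, sec.Size  # (0, 0) means the file isn't signed

if __name__ == "__main__":
    offset, size = signature_location("some-download.exe")  # placeholder name
    print(f"signature blob at offset {offset}, {size} bytes")
```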
[19:48] Trying to establish the upper bound for the IDs.
[19:48] and the cert seems to be at the end of the MSU file
[19:49] I'm going to guess that generally, the lower the ID, the older it is, and the more likely it is to use sha1
[19:50] that probably seems like a safe assumption
[19:50] Depending on how slowly that search goes until it reaches the present (~11 hours left until midnight Pacific), it might be useful to start downloading before it finishes
[19:51] microsoft is US based, i'm not sure if they are gonna nuke it right at midnight on a sunday, so hopefully we have a bit more time
[19:51] but yeah, probably. how should we handle...the data? are we allowed to upload these to archive.org?
[19:52] https://transfer.notkiska.pw/Kwk8n/microsoft-download-center-files-below-id-60000
[19:56] Actually, let me do that differently.
[20:01] so we gonna split it up and curl our way to victory?
[20:04] I like the idea of "mirror everything as individual IA items"
[20:04] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[20:06] *** wyatt8740 has joined #archiveteam-bs
[20:14] That'd actually be nice, yeah. With full metadata etc.
[20:14] But for now, we just need to grab everything we can.
[20:14] Download as WARCs, further processing later.
[20:16] yeah, but how are we gonna download them
[20:16] is it easy to set up a tracker thingy like kickthebucket?
[20:44] *** jshoard has quit IRC (Quit: Leaving)
[20:45] https://transfer.notkiska.pw/AzcCd/microsoft-download-center-files-below-id-60000-sorted.jsonl
[20:54] nice
[20:54] so how do we coordinate the downloads
[20:58] Here are the ten most frequent file extensions: 13780 msu, 13111 exe, 6812 zip, 3927 msi, 3770 pdf, 2228 pptx, 1214 docx, 888 bin, 828 doc, 483 xps
[21:03] so the ID corresponds to what page its on?
[21:04] That's the 'id' parameter from the URLs.
[21:04] yeah so if an 'item' has multiple downloads then they have different IDs
[21:05] Uh
[21:05] If an entry on the Download Center has multiple files, those all have the same ID in my list.
[21:05] E.g. https://www.microsoft.com/en-us/download/confirmation.aspx?id=41658 -> three entries with ID 41658.
[21:06] ok
[21:06] thanks for setting this up :)
[21:06] Further statistics: top ten by size in GiB: 4100.1 .zip, 864.1 .exe, 597.3 .bin, 355.1 .iso, 182.3 .rar, 149.6 .cab, 118.1 .msi, 90.5 .msu, 17.6 .wmv, 16.7 .ISO
[21:09] *** Mateon1 has quit IRC (Remote host closed the connection)
[21:09] I don't have a good idea how to do the actual retrieval though. Warriorbot isn't ready yet I think. :-/
[21:10] *** Mateon1 has joined #archiveteam-bs
[21:10] is that the thing that sets up a 'warrior' project?
[21:11] No, it's a distributed ("warrior") project that simply retrieves lists of URLs.
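The per-extension statistics at [20:58] and [21:06] can be reproduced from the sorted .jsonl list ([20:45]) with something like the following; the "url" and "size" field names are assumptions, since the list's actual schema isn't shown in the log:

```python
# Tally file counts and total sizes per extension from the .jsonl file list
# ([20:45]). The "url" and "size" field names are assumed, not confirmed.
import json
import os
from collections import Counter
from urllib.parse import urlparse

counts, sizes = Counter(), Counter()
with open("microsoft-download-center-files-below-id-60000-sorted.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        ext = os.path.splitext(urlparse(entry["url"]).path)[1].lstrip(".")
        counts[ext] += 1
        sizes[ext] += entry.get("size", 0)

for ext, n in counts.most_common(10):
    print(f"{n:6d} {ext:6s} {sizes[ext] / 2**30:10.1f} GiB")
```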
[21:12] Assuming everyone necessary is here, we could set up a quick warrior project
[21:12] setting up a warrior project pipeline seems easy right
[21:12] like you don't even need a lua script right, no recursion necessary or allow/deny list of urls
[21:12] Seems like a lot of overhead considering this is basically just "wget --input-file" at scale
[21:12] though
[21:12] Yeah
[21:12] *** Gallifrey has quit IRC (Ping timeout: 265 seconds)
[21:13] (Yeah, a lot of overhead)
[21:13] but given the space requirements it might be good, because at least the rsync upload has the nice property of failing/retrying endlessly until the rsync target frees up space
[21:13] i assume thats what 'warriorbot' is meant to fix, to have a premade warrior project for just lists of urls
[21:13] Yes
[21:13] I'm more concerned about getting the infrastructure set up
[21:14] Viz. temporary storage (target or similar)
[21:14] yeah, the infrastructure of offloading the data is gonna be the hard part
[21:15] In other words... anyone have 17TB free?
[21:15] *** Gallifrey has joined #archiveteam-bs
[21:15] 17?
[21:16] 7.1
[21:16] :-)
[21:16] Transposed the digits
[21:20] uhh
[21:21] i have 500 gb + 100 + 100 on my boxes i was using for kick the bucket
[21:21] we can possibly alleviate some of it if we upload to archive.org like a normal warrior project right?
[21:22] If S3 is feeling nice today
[21:23] (Narrator: It wasn't.)
[21:25] 17 tb is $170 a month on digital ocean which isn't terrible
[21:26] oh no, left off a 0, never mind, it doesn't even support that lol
[21:28] but as long as i'm not paying for this for a full month i probably could swing enough volumes to do 17 tb
[21:30] How's network/transfer pricing?
[21:30] inbound is free
[21:31] yeah the outbound is what gets you
[21:31] Droplets include free outbound data transfer, starting at 1,000 GiB/month for the smallest plan. Excess data transfer is billed at $0.01/GiB. For example, the cost of 1,000 GiB of overage is $10. Inbound bandwidth to Droplets is always free.
[21:32] so $170, minus the free outbound bandwidth i get for my droplets which i think is 3tb for 3 droplets i have running now
[21:32] again, not terrible, but if someone else has a cheaper option =P
[21:34] It's 7.1 TB, not 17 TB.
[21:34] now i'm doing it xD
[21:35] so even better then
[21:35] any other ideas before i pull the lever?
[21:37] (brb like 40 minutes)
[21:47] buyvm and some other VPS providers have unmetered bandwidth
[21:52] $30/mo VPS + $10/mo for 2TB (1TB/$5) attached block storage might work
[21:53] scratch that... they're all out of stock....
[22:37] well guess i'm buying the 7tb volume then
[22:38] what is the format that wget/curl takes for a file list @OrIdow6
[22:38] mgrandi: Please write WARCs, not plain files.
[22:39] *** Gallifrey has quit IRC (Ping timeout: 265 seconds)
[22:39] wget/wpull has --input-file or -i for that.
[22:40] so what tool should i use?
[22:40] i have wget-at for the kickthebucket archive
[22:40] Yeah, wget-at seems good.
[22:40] it will take a jsonl file?
[22:40] Nope
[22:41] Plain lines of URLs
[22:41] ok
[22:41] do you have a convenient list of urls or do you want me to make one?
[22:42] I don't, but I can easily make one.
[22:42] if you can make it easily that would be good
[22:44] i'll do a 7.5 tb volume
[22:45] https://transfer.notkiska.pw/3SHDe/microsoft-download-center-files-below-id-60000-sorted-urls
[22:45] *** Gallifrey has joined #archiveteam-bs
[22:47] do we have a way of exfilling these files to somewhere else?
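Converting the .jsonl list into the plain-lines-of-URLs format that wget-at's --input-file expects ([22:41]) is a couple of lines; as before, the "url" field name is an assumption about the list's schema, and the output filename is arbitrary:

```python
# Flatten the .jsonl file list into plain URL lines for wget-at's -i /
# --input-file ([22:41]). The "url" field name is assumed.
import json

with open("microsoft-download-center-files-below-id-60000-sorted.jsonl") as src, \
        open("msdl-urls.txt", "w") as dst:
    for line in src:
        dst.write(json.loads(line)["url"] + "\n")
```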
[22:47] thats a big number for the monthly cost that i'd rather not pay lol
[22:48] I don't have any free storage at the moment I'm afraid.
[22:50] wait, i have 15tb at home, but will cox hate me
[22:51] Maybe SketchCow can set you up with space on FOS, although probably not the whole thing at once.
[22:51] i think i'll start with this, but its gonna cost $24/day
[22:52] and helps since its a commercial data center without residential ISP limits
[22:52] Or upload to IA as you grab.
[22:52] Make sure to split it up, instead of getting one huge warc
[22:52] Yeah
[22:52] so does anyone know the wget-at args to do that?
[22:52] I tend to do 5 GiB WARCs.
[22:52] or just partition the file into chunks
[22:54] ArchiveBot's wpull options are a good starting point: https://github.com/ArchiveTeam/ArchiveBot/blob/3585ed999010665a7b367e37fd6f325f30a23983/pipeline/archivebot/seesaw/wpull.py#L12
[22:54] But wpull isn't fully compatible with wget.
[22:56] Or the DPoS project code repos, e.g. https://github.com/ArchiveTeam/mercurial-grab/blob/20b40049911bb721603de491d4e8a3aa5c4d3a81/pipeline.py#L173
[22:56] --warc-max-size to get multiple WARCs instead of one huge file.
[22:59] ok, yeah let me craft one based on that one
[23:00] Another important one is --delete-after so the plain file isn't kept after download.
[23:02] so --output-document outputs to a temp file, it writes it to a WARC, and then --delete-after deletes the temp file?
[23:04] JAA: Is that list in any particular order?
[23:04] Yeah, something like that. I don't know what the exact data flow is in wget though. I think it writes it to the WARC immediately as the data is retrieved, not from the temp file.
[23:04] OrIdow6: Yes, sorted by ID.
[23:05] so do i need --output-document?
[23:05] JAA: Good, that's what I was going to ask about
[23:05] or does wget-at need to write to something
[23:06] IIRC output-file is only useful when dealing with a single file
[23:06] mgrandi: Not entirely sure to be honest. I'd include it though to be safe. Might have something to do with not creating directory structures or dealing with filenames containing odd characters.
[23:07] i'll include it anyway to be safe
[23:08] *output-document (output-file sets the logfile location)
[23:09] By the way, ~16 hours at 1 Gb/s to retrieve it all.
[23:10] Worst case is that it goes down at midnight automatically - not enough
[23:10] Though I don't know the speed of whatever's downloading it
[23:12] its in digitalocean so it should be pretty fast
[23:12] so how do the warc file names impact the split on size?
[23:13] Maybe split the list up, in case more people want to start downloading?
[23:14] *** Arcorann has joined #archiveteam-bs
[23:14] I've never actually used wget(-lua/at) directly myself, but at least in wpull, --warc-file sets the filename prefix when --warc-max-size is used. `--warc-file foo --warc-max-size 1234` would produce foo-00000.warc.gz, foo-00001.warc.gz, etc., each "about" 1234 bytes (in wpull, the split happens as soon as possible after reaching that size).
[23:14] ok
[23:16] What what
[23:17] SketchCow: Microsoft deleting SHA-1-signed downloads from the Download Center tomorrow. No good way to determine which downloads are affected, total size 7.1 TB.
[23:18] https://gist.github.com/mgrandi/0904bbeeaba2a4c1bc7084ad26ec236e
[23:18] Not covered very well last time I checked.
[23:18] commands look good? any warc headers i should add?
[23:19] mgrandi: I'd remove --page-requisites --span-hosts --recursive --level inf since recursion isn't necessary here.
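One way to partition the URL list ([22:52], [23:13]) so several wget-at processes, or several people, can each take a slice; the input filename, chunk size, and output naming below are arbitrary choices, not values from the log:

```python
# Split the plain URL list into fixed-size chunks for parallel wget-at runs.
# "msdl-urls.txt" and the chunk size are placeholders.
def split_url_list(path="msdl-urls.txt", chunk_size=5000):
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i in range(0, len(urls), chunk_size):
        with open(f"{path}.part{i // chunk_size:03d}", "w") as out:
            out.write("\n".join(urls[i:i + chunk_size]) + "\n")

if __name__ == "__main__":
    split_url_list()
```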
[23:20] --warc-max-size is missing.
[23:20] oh good call
[23:21] is that a number like `5gb`?
[23:21] it just says 'NUMBER'
[23:21] Bytes as an int, I think.
[23:21] And the extra --warc-headers
[23:21] 5368709120
[23:22] Test it with a small --warc-max-size and the first couple URLs maybe to see if it does what you expect.
[23:22] what headers should i include?
[23:22] or does that matter / we can edit it later
[23:23] mgrandi: I'm just referring to the two pointless lines: --warc-header ""
[23:23] Not sure. I often don't add any and document things in the item description on IA instead. It doesn't matter for the grab itself.
[23:24] ok
[23:24] *** Raccoon has joined #archiveteam-bs
[23:25] updated: https://gist.github.com/mgrandi/0904bbeeaba2a4c1bc7084ad26ec236e
[23:25] i'm gonna try that with 12 urls and then 10 mb warc limit
[23:27] You might also want to split the list up and run multiple processes in parallel for higher throughput.
[23:27] Depending on transfer and disk speed obviously.
[23:28] ok
[23:32] 162 MB/s apparently
[23:33] Nice
[23:33] still think i need to split it up and run multiple processes?
[23:34] If it stays at that speed, probably not.
[23:41] and --delete-after is safe to have?
[23:41] since its saving it to the WARC?
[23:44] looks like its fine, i'll just leave it
[23:45] Yes, should be safe.
[23:46] Although it probably doesn't even matter that much since you have --output-document, so each download overwrites that file anyway.
[23:47] cool, lets begin
[23:47] if its going too slow i can always just start another one with different sections of the list and possibly have duplicates or ctrl+c after a certain point
[23:52] Good luck, and let me know if you see anything that isn't status 200.
[23:55] average is 27 MBit/s
[23:56] 2.5GB done already (compressed)
[23:59] *** Gallifrey has quit IRC (Read error: Connection reset by peer)