Well, that reduced the number of pending URLs from 390k to 80k. Nice. :D
JAA: I kind of forgot about it.
arkiver: sweet baby Jesus, that's a ton of data
arkiver: yep, so the future of the project is a little unclear
Yeah, at more than 200 TB you need to start getting the IA tech team on it
well, we can just pump the data in, but they'll notice, especially with this size
It looks like an amazing datastore, and they might not be happy with this much data
yeah, we coordinated with IA, but it was paused after the estimate turned out to be 800 TB anyway
It would be best if they rsynced it right into their own search tool for it, so it had the right metadata
not sure
but I'm off, good day :P
So just curious, anybody have any scripts for downloading, say, WikiProjects? Trying to download one in particular and not the entirety of Wikipedia. Otherwise, looks like I'll need to write a spider and a basic implementation of PageRank to sort each topic into the correct subgrouping...
Gilfoyle: We have a set of scripts for scraping the API of MediaWiki installations: https://raw.githubusercontent.com/WikiTeam/wikiteam/master/dumpgenerator.py
A bit more info on the WikiTeam page: http://archiveteam.org/index.php?title=WikiTeam
Has anyone recorded things on Video 8 cassettes? https://www.youtube.com/watch?v=8t5TYw2bkOk
okay, NOW pixiv is done, right? or have we still not done the +18 sweep?
Doesn't look even slightly done. Still 200k items out.
Yep, we still need to do the 18+ rooms.
Tanobb grab complete. That was way quicker than I expected. I grabbed all 13 languages and the mobile versions as well. I skipped some redundant pages though, e.g. those listing all posts by one user in a thread.
arkiver: No, I'm using the same format bsmith093 used (https://github.com/JimmXinu/FanFicFare), which extracts the story text into markdown. Any other way and my limited disk space would be completely used up.
I see
What are you currently using?
maybe the WARCs could be uploaded to IA for the Wayback Machine
if there are any copyright issues, they'll probably block the pages or the website in the Wayback Machine
arkiver: Example format: http://storage.savefanfiction.tk/Prince%20Consort-ffnet_8902231.txt (re-archived today)
All fanfiction.net URLs are robots.txt blocked anyway, actually.
Yet another good reason to grab them all as WARCs... so when the site goes down in the future and someone removes the robots.txt block, there will be content to show
Also, I wonder how well git would handle it if you put all these files into it...
you could point FFF at warcproxy maybe?
Total of 7,382,393 plaintext files, all in the same folder? A few hundred gigabytes? No idea if git would cope with that.
you could move them into their own folders based on the first few letters of the SHA-1 of the file name, or even the first few letters of the title of the file
I've thought about doing WARCs and stuff like that, but I'm running everything on a couple of Raspberry Pis and a 2 TB HD, so...
thing is that once it's not in WARCs, it will probably never go into WARCs and therefore will never get into an archive like the Wayback Machine
tapedrive: you will probably get better performance in any case if you split that into subdirectories :D
username1: Yeah, I can't really do much in that directory any more
But it makes merging newly updated stories much easier.
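(Aside: a minimal sketch of the sharding suggestion above, i.e. moving each story file into a subdirectory named after the first couple of hex characters of the SHA-1 of its file name, so no single directory holds millions of entries. The source and destination paths are made up for illustration.)

```python
# Minimal sketch of the sharding idea: move each file into a subdirectory named
# after the first two hex characters of the SHA-1 of its file name, so no single
# directory ends up holding millions of entries.
# SRC and DST are hypothetical paths.
import hashlib
import os
import shutil

SRC = "/data/fanfic/flat"      # the one huge directory
DST = "/data/fanfic/sharded"   # new layout: <DST>/<ab>/<original file name>

for name in os.listdir(SRC):
    prefix = hashlib.sha1(name.encode("utf-8")).hexdigest()[:2]
    target_dir = os.path.join(DST, prefix)
    os.makedirs(target_dir, exist_ok=True)
    shutil.move(os.path.join(SRC, name), os.path.join(target_dir, name))
```

Two characters gives 256 buckets (roughly 29k files each for 7.4 million files); three characters gives 4096 buckets if that is still too many per directory. Sharding on the hash rather than on the title keeps the buckets evenly sized.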
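(Aside on the WikiProject question earlier: dumpgenerator.py is the WikiTeam tool for mirroring whole wikis, but if only one WikiProject is wanted, hitting the MediaWiki API directly may be enough. A rough sketch, assuming the project's pages can be enumerated from a category; the category name below is only an example, and on Wikipedia the WikiProject banners usually tag the talk pages, so the real category and title mapping may differ.)

```python
# Rough sketch (not dumpgenerator.py): pull just the pages of one WikiProject
# via the public MediaWiki API instead of mirroring the whole wiki.
# CATEGORY is a made-up example.
import requests

API = "https://en.wikipedia.org/w/api.php"
CATEGORY = "Category:WikiProject Mathematics articles"  # hypothetical example

def category_members(category):
    """Yield the titles of all pages in a category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # cmcontinue etc. for the next batch

def page_wikitext(title):
    """Fetch the current wikitext of a single page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page["revisions"][0]["slots"]["main"]["*"]

if __name__ == "__main__":
    for title in category_members(CATEGORY):
        print(title)  # or save page_wikitext(title) to disk
```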
arkiver: I'll think about WARCs for the future. To get them into the Wayback Machine, do I just upload them to IA with type "web"?
and let us know about them
we'll first have to make sure the WARCs are valid of course, and not edited
And if I upload more WARCs into that item, will they be auto-added to the Wayback Machine?
I think so, but multiple items might be better in that case
also, again depending on the number of WARCs, you could do like 1 or 10 GB per item
Okay, I'll see if I can add that in.
nice :)
quick question: do you guys know how to download from veoh.com? there's this rare video that i can only find there
link?
http://www.veoh.com/watch/v1313996wddKMNqf?h1=Super+Spacefortress+Macross
it's the only copy i can find of the uncut, hilariously bad dub of the first Macross movie
Ugh, Flash.
ndiddy: uh
ndiddy: youtube-dl "http://www.veoh.com/watch/v1313996wddKMNqf?h1=Super+Spacefortress+Macross"
done
i tried jdownloader but it only downloads the first 1/4 of it
go to the page source, search for fullPreviewHashLowPath, and download that URL
Low vs. High quality/resolution?
ah sorry
[download] 10.0% of 881.42MiB at 321.50KiB/s ETA 42:07
i can up it elsewhere afterwards if needed
tell me if it downloads all the way :)
also, youtube-dl gives me an "unsupported url" error
update: search for fullPreviewHashHighPath
that's the same resolution as when HQ is selected in the player
uh
i assumed so
same size
What does "Preview" mean in there, though?
no idea
it looks to be the same size though
but it's the version that's loaded in the browser
yeah, the way they have the site set up is kinda strange
it downloads the first couple megs at full speed, then throttles you to 300 kbps
Yeah, the whole page is really cancerous.
many streaming sites do that
The amount of third-party JavaScript code being pulled in there is ridiculous.
That kind of throttling is pretty common because most people don't watch most of the video anyway
So they give you just enough to ensure there's a decent buffer, then feed you the rest at approximately the playback rate
So they don't waste too much bandwidth if you close the tab after a minute
have you tried watching the network tab while watching via Flash?
no, why?
yeah, it's the same
to figure out how it works?
Could it be that the Preview is only the first 25% (what JDownloader grabs)?
looks like jdownloader was using a different url: http://fcache.veoh.com/file/f/h1313996.mp4?e=1497209817&ri=6000&rs=300&h=fa032ed6a31a75daab4b4503f60e35f9 vs the page source, which gives you http://content.veoh.com/flash/p/2/v1313996wddKMNqf/h1313996.mp4?ct=bdd4eff4404837d31915158fe9f6a3fe3c9aaf63da409283
I see.
damn it, same thing
arkiver: can you download the whole video?
well, yeah, why?
i just got 200 MB again from that link
in the source, search for fullPreviewHashHighPath and download that
that's what i did
i'll see if watching the video in a muted tab while downloading fixes anything
i'm assuming that that url is the one the player buffers from, and it won't buffer more than 25% into the video
before you can see fullPreviewHashHighPath you need to have the 18+ cookie, of course
sure you didn't use fullPreviewHashLowPath? that gives you a 200 MB one
fullPreviewHashHighPath gives 881 or so
881 MB*
[download] 100% of 881.42MiB in 12:40
ERROR: content too short (expected 924240150 bytes and served 239785474)
see what i mean
well, can you actually skip to later in the video via Flash?
yep
it's not like NND or something
Does anything happen in the Network tab?
you mean, like the Windows one?
no, the browser one
The browser's.
:O is that a thing
zomg
:P you have a lot to learn
Press F12 in the tab
Essential tool #1 for web devs.
:-P And people trying to download special things.
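(Aside on the "upload to IA with type web" question above: with the internetarchive Python library, that could look roughly like the sketch below. The item identifier and title are invented placeholders, and as discussed above the WARCs still have to be valid and coordinated before they actually end up in the Wayback Machine.)

```python
# Rough sketch, using the `internetarchive` Python library (pip install
# internetarchive, then `ia configure` for credentials). The identifier and
# title below are invented placeholders; whether the WARCs then get indexed
# into the Wayback Machine still depends on validation and coordination.
from internetarchive import upload

item_id = "fanfiction-warcs-batch-0001"   # hypothetical identifier
warcs = [
    "ffnet-batch-0001-part1.warc.gz",
    "ffnet-batch-0001-part2.warc.gz",
]

responses = upload(
    item_id,
    files=warcs,
    metadata={
        "mediatype": "web",   # the "type web" from the question above
        "title": "fanfiction.net WARCs, batch 0001",
    },
)
print([r.status_code for r in responses])
```

Keeping each item at roughly 1 to 10 GB, as suggested above, just means calling upload() once per batch with a different identifier.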
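(Aside: a rough sketch of the "search the page source for fullPreviewHashHighPath" step in Python, assuming the value is embedded as a quoted string somewhere in the HTML/JS. The regex is a guess at that embedding, and age-restricted videos would also need the 18+ cookie mentioned above.)

```python
# Sketch: fetch the watch page, pull the fullPreviewHashHighPath URL out with a
# regex, and stream the video to disk. The regex is a guess at how the value is
# embedded (it may live in a JSON blob or a JS variable).
import re
import requests

watch_url = "http://www.veoh.com/watch/v1313996wddKMNqf"

html = requests.get(watch_url).text
match = re.search(r'"?fullPreviewHashHighPath"?\s*[:=]\s*"([^"]+)"', html)
if not match:
    raise SystemExit("fullPreviewHashHighPath not found in the page source")

video_url = match.group(1).replace("\\/", "/")  # undo JSON escaping, if any
with requests.get(video_url, stream=True) as resp, open("macross.mp4", "wb") as out:
    for chunk in resp.iter_content(chunk_size=1 << 20):
        out.write(chunk)
```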
looks like there's a ?start parameter
sorry, &start
ex: http://content.veoh.com/flash/f/2/v1313996wddKMNqf/l1313996.mp4?ct=466b8df98e1762caf22b228f86934f1623d189f7d4c429bb&start=4659.76
it will only send you the video starting from that time
i guess what i have to do is download the video from the first link, count how many seconds are in it, then redownload and splice all the clips together
tapedrive: thank you for grabbing the text of that fanfiction, in any case. It's useful, even though WARCs would be very nice too.
I'm seeing how easy it would be to add WARC output to my system now. Although I think it would mean an entire recrawl, which has taken several (about 10) months due to their rate limiting.
https://archive.org/details/SuperSpacefortressMacross
SketchCow: Fix it please - Could not store file "/tmp/phphIFGty" at "mwstore://local-backend/local-public/d/da/Chatpixivicon.gif".
it's still broken :(
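(Aside on the &start idea above: a rough sketch of collecting the clips. It assumes the server always starts from the requested offset and cuts off after a while, that ffprobe is installed to measure how many seconds each clip actually contains, and that the ct token from the example URL is still valid for the session; the clips would still need to be spliced together afterwards, e.g. with ffmpeg concat.)

```python
# Rough sketch of that plan: download from offset 0, measure how many seconds
# actually arrived (via ffprobe), then request again with &start=<seconds
# reached>, until a request adds nothing new. base_url is the example URL from
# the log, without the &start part.
import subprocess
import requests

base_url = ("http://content.veoh.com/flash/f/2/v1313996wddKMNqf/l1313996.mp4"
            "?ct=466b8df98e1762caf22b228f86934f1623d189f7d4c429bb")

def clip_seconds(path):
    """Return the playable duration of a media file in seconds, via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

offset = 0.0
for part in range(100):                      # hard cap as a safety net
    path = f"clip_{part:03d}.mp4"
    with requests.get(f"{base_url}&start={offset}", stream=True) as resp, \
         open(path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
    got = clip_seconds(path)
    if got < 1.0:                            # (almost) nothing new came back: done
        break
    offset += got
```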