[01:18] Foursquare contacted Internet Archive and wants that Foursquare dataset GONE
[01:18] So if you want it from there, grab it
[01:19] https://archive.org/details/201309_foursquare_dataset_umn
[01:33] Why would they want it taken down?
[01:35] and if they were extracted from the public API what's the big deal?
[01:38] SketchCow: i'm doing a mirror of dailymail
[01:38] just for the fun of it
[01:38] turns out just grabbing dailymail.co.uk/news/article-#id will give me the article
[01:39] even if its not in news
[03:22] https://thepiratebay.sx/torrent/9027196/UMN_Sarwat_Foursquare_Dataset_%28September_2013%29 oops :)
[03:25] the keys are like right next to each other
[05:04] https://github.com/robots.txt
[08:25] I have no idea of your command structure, or who decides what to archive and what not to, but I will just put this out here. http://forum.gaijinpot.com/forum/ is going to bite the dust on the 15th.
[08:26] There's a lot of information for the English-speaking/foreign community in Japan on that board. A lot of it is valuable advice from long-time expats.
[08:26] I hope that this message will reach the right people and this information can be saved before being lost forever.
[08:27] cat____, It has already reached the right people. We save everything and let the historians later figure out the value of the data
[08:29] That gives me some relief.
[08:29] It is a forum with a sane url scheme and no javascript is needed to page through results
[08:29] did they just announce the closing?
[08:29] Yes. Click on any post. There will be a box on the top.
[08:29] I saw that, no emails or a forum post just on that topic? We like to save those too
[08:30] I've sent the admin an email. No response so far. The members have made a post at http://forum.gaijinpot.com/forum/community-center/general-discussion/1527872-forum-closing
[08:31] What I get from it, they consider forums Web 1.0 and Facebook/Twitter as the future. :(
[08:32] Forums are social media
[08:32] duh
[08:32] The backstory goes something like this: the board ran vBulletin 3.x, got hacked. They upgraded to 5.x which happens to be a maintenance nightmare.
[08:33] So rather than using a sane forum software, they just shut it down. *sigh*
[08:33] We'll certainly run a scan against it.
[08:33] thank you <3
[08:36] 863,537 is how many posts that site has
[08:59] omf_: Let's wait for archivebot to be moved to my infrastructure (storage) and we'll do it.
[08:59] SketchCow, we can warrior gaikinpot. I am already collecting the urls to do it
[09:12] omf_: feel free to continue collecting, but keep an eye on http://archivebot.at.ninjawedding.org:4567/ too
[09:12] archivebot's working on it right now
[09:14] I don't think throwing dozens of machines at it would help, but it's possible that they have some per-connection limits
[09:14] "they" being gaijinpot
[09:44] yipdw, I don't think archivebot is going to get it all. It closes in 5 days which is 432,000 seconds. There are 863,537 posts meaning it would have to get 2 pages a second minimum to get the whole site with no interruptions. It doesn't look to be going that fast according to the dashboard hence my suggestion of the warrior or maybe just a few clown instances grabbing at different points
[09:46] yipdw is ignoring individual post pages
[09:46] divide by some number to get the # of topics?
[09:48] 62669 topics according to forum index
[09:51] according to the dashboard it is grabbing everything
[09:59] another fun fact I learned about this site
[09:59] it only lets you page up to 20 pages but there are more beyond that
[10:04] vBulletin 5 is very broken in that regard. The bottom paginator is broken. The textbox at the top works.
[10:07] also if you have web storage disabled in your browser, the entire ajax construct behind the forum just dies.
[10:08] yeah I test with javascript off
[12:02] Hello again. I'm the same "cat" that posted about GaijinPot earlier. I just found out they have an API that gets all forum posts for a thread. description at: http://pastebin.com/E19xTpGC Will try archiving through that.
[12:12] well it looks like I might have all the thread ids since I crawled all the category list pages
[12:12] will have to give this a try
[12:55] http://pastebin.com/E0piGg16 < C# to get all topic html; turns out its similar to the API to get all threads. rush job coding, sorry.
[13:50] SketchCow: how do I rederive (re-OCR) 5000 items?
[15:31] omf_: I don't think it's going to go any faster; the dominating delay is on gaijinpot's end
[15:32] also now it's throwing 500s
[15:32] is someone slamming the forum right now?
[15:32] this is way slower than it was last night
[15:34] well, that' great
[15:34] gaijinpot just ate it
[15:39] RIP :(
[15:39] seriously, what the hell
[15:39] it was working fine last night
[15:40] I really hope this isn't due to someone killing the server with page requests
[15:40] oh, there it is again
[18:10] Nemo_bis: You ask me
[18:27] SketchCow: is a list of items on CSV good enough?
[18:27] actually it's two sets of 5000, one lower priority
[18:29] They're all books being manually transcribed on Wikisource, but half of them are inactive so I don't need them immediately.
[18:40] Are these items all uploaded by you, in the same collection, or what.
[19:23] SketchCow: no, random items
[19:23] Needing an OCR update for Wikisource users using them
[23:03] I need someone comfortable running seesaw from the command line to try out the blip.tv grab pipeline
[23:04] https://github.com/ArchiveTeam/blip.tv-grab-video-only
[23:04] I have a few items in the tracker already
[23:05] get-wget-lua.sh-ing
[23:06] run-pipeline pipeline.py nickname
[23:06] rgr
[23:07] chug chug chug
[23:09] The --contimeout option may only be used when connecting to an rsync daemon.
[23:09] rsync error: syntax or usage error (code 1) at main.c(1275) [sender=3.0.9]
[23:09] Process RsyncUpload returned exit code 1 for Item http://blip.tv/file/get/00c6-00c6Episode1IslandC001713.mpeg
[23:09] Failed RsyncUpload for Item http://blip.tv/file/get/00c6-00c6Episode1IslandC001713.mpeg
[23:12] so i'm guessing that fos.textfiles.com::blooper isn't ready to take the bits?
[23:13] I just changed the upload target on the tracker
[23:13] restart the script and see if it works this time
[23:14] ok chugchuging again
[23:14] The rsync code in the pipeline looks good
[23:15] Starting RsyncUpload for Item http://blip.tv/file/get/00c6-00c6Episode4Bowlders946.mpeg
[23:15] @ERROR: Unknown module 'warrior'
[23:16] Uploading with Rsync to rsync://fos.textfiles.com/warrior/blooper/Levon/
[23:17] I tried the syntax both ways
[23:17] hmm
[23:17] rsyncd is so gross
[23:26] do you know the commandline rsync uses to push the files up?
[23:26] i could try manually running it
[23:41] About GaijinPot: I managed to save a list of all threads in all boards last night. I'll probably have to re-run the scraper for that again today because new threads appeared.
[23:41] What isn't working well is downloading all threads using the API. While speed is good (ETA 2 days), it's impossible to go from 0 to 999999 because unless the ID is an actual thread, it seems the API will just return a single post.
[23:41] Next step: Re-run thread list scraper, extract thread IDs, download those. That might cut download time to just one day.
[23:41] Out of curiosity, where would I upload any data that I manage to save? Is there some sort of central repository?
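(Editor's note: the "extract thread IDs" step described at 23:41 could look roughly like the sketch below. It is not the scraper used in the channel. It assumes thread URLs end in "/<numeric id>-<slug>", as in the forum-closing link quoted at 08:30; the function names are made up for illustration, and paginated or otherwise different URL shapes would need their own handling.)

```python
import re

# Assumption: a thread URL ends in "/<numeric id>-<slug>", like
# .../general-discussion/1527872-forum-closing from the log.
THREAD_ID_RE = re.compile(r"/(\d+)-[^/]*/?$")


def extract_thread_ids(urls):
    """Return the sorted, de-duplicated numeric thread IDs found in urls."""
    ids = set()
    for url in urls:
        match = THREAD_ID_RE.search(url.strip())
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)


def new_thread_ids(previous_ids, current_ids):
    """Return IDs seen in the latest scrape but not in the earlier one."""
    return sorted(set(current_ids) - set(previous_ids))


if __name__ == "__main__":
    scraped = [
        "http://forum.gaijinpot.com/forum/community-center/general-discussion/1527872-forum-closing",
        "http://forum.gaijinpot.com/forum/community-center/general-discussion/1527872-forum-closing",
        "http://forum.gaijinpot.com/forum/",  # board index, no thread ID
    ]
    print(extract_thread_ids(scraped))
```

(Feeding only these IDs to a downloader avoids walking the whole 0 to 999999 range, and new_thread_ids gives the delta after a scraper re-run.)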
[23:56] pft: the --contimeout error you're seeing is because the rsync version you've got is too old
[23:56] pft: seeing that error is pretty common on OS X Mountain Lion, FWIW
[23:56] thanks yipdw
[23:57] i'm on debian somethingorother
[23:57] whatever stable is
[23:57] so i'd estimate my rsync is about 37 years old
[23:57] you can get the debian version by running
[23:57] lsb_release -a
[23:57] just a little younger than Andrew Tridgell
[23:57] "wheezy" 7.1
[23:57] hmm, that should work
[23:58] I have run RsyncUpload tasks on Debian 7 before
[23:58] maybe it's a different cause
[23:58] well the rsync is failing
[23:58] cat___: hold on to it for now, ping SketchCow later
[23:58] rsync version 3.0.9 protocol version 30
[23:59] cat___: or you can throw it straight into the Internet Archive
[23:59] omf_: any idea why the rsync upload is failing?
[23:59] odd
[23:59] pft: that should be good enough
[23:59] also is there anything i can do with these ~900mb files in data
[23:59] hmm, I'm not sure what's up with that then