#archiveteam 2013-10-10,Thu

Time Nickname Message
01:18 🔗 SketchCow Foursquare contacted Internet Archive and wants that Foursquare dataset GONE
01:18 🔗 SketchCow So if you want it from there, grab it
01:19 🔗 SketchCow https://archive.org/details/201309_foursquare_dataset_umn
01:33 🔗 arkhive Why would they want it taken down?
01:35 🔗 arkhive and if they were extracted from the public API what's the big deal?
01:38 🔗 godane SketchCow: i'm doing a mirror of dailymail
01:38 🔗 godane just for the fun of it
01:38 🔗 godane turns out just grabbing dailymail.co.uk/news/article-#id will give me the article
01:39 🔗 godane even if it's not in news
03:22 🔗 Cameron_D https://thepiratebay.sx/torrent/9027196/UMN_Sarwat_Foursquare_Dataset_%28September_2013%29 oops :)
03:25 🔗 DFJustin the keys are like right next to each other
05:04 🔗 ivan` https://github.com/robots.txt
08:25 🔗 cat____ I have no idea of your command structure, or who decides what to archive and what not to, but I will just put this out here. http://forum.gaijinpot.com/forum/ is going to bite the dust on the 15th.
08:26 🔗 cat____ There's a lot of information for the English-speaking/foreign community in Japan on that board. A lot of it is valuable advice from long-time expats.
08:26 🔗 cat____ I hope that this message will reach the right people and this information can be saved before being lost forever.
08:27 🔗 omf_ cat____, It has already reached the right people. We save everything and let the historians later figure out the value of the data
08:29 🔗 cat____ That gives me some relief.
08:29 🔗 omf_ It is a forum with a sane url scheme and no javascript is needed to page through results
08:29 🔗 omf_ did they just announce the closing?
08:29 🔗 cat____ Yes. Click on any post. There will be a box on the top.
08:29 🔗 omf_ I saw that, no emails or a forum post just on that topic? We like to save those too
08:30 🔗 cat____ I've sent the admin an email. No response so far. The members have made a post at http://forum.gaijinpot.com/forum/community-center/general-discussion/1527872-forum-closing
08:31 🔗 cat____ What I get from it, they consider forums Web 1.0 and Facebook/Twitter as the future. :(
08:32 🔗 omf_ Forums are social media
08:32 🔗 omf_ duh
08:32 🔗 cat____ The backstory goes something like this: the board ran vBulletin 3.x, got hacked. They upgraded to 5.x which happens to be a maintenance nightmare.
08:33 🔗 cat____ So rather than using a sane forum software, they just shut it down. *sigh*
08:33 🔗 SketchCow We'll certainly run a scan against it.
08:33 🔗 cat____ thank you <3
08:36 🔗 omf_ 863,537 is how many posts that site has
08:59 🔗 SketchCow omf_: Let's wait for archivebot to be moved to my infrastructure (storage) and we'll do it.
08:59 🔗 omf_ SketchCow, we can warrior gaijinpot. I am already collecting the urls to do it
09:12 🔗 yipdw omf_: feel free to continue collecting, but keep an eye on http://archivebot.at.ninjawedding.org:4567/ too
09:12 🔗 yipdw archivebot's working on it right now
09:14 🔗 yipdw I don't think throwing dozens of machines at it would help, but it's possible that they have some per-connection limits
09:14 🔗 yipdw "they" being gaijinpot
09:44 🔗 omf_ yipdw, I don't think archivebot is going to get it all. It closes in 5 days which is 432,000 seconds. There are 863,537 posts meaning it would have to get 2 pages a second minimum to get the whole site with no interruptions. It doesn't look to be going that fast according to the dashboard hence my suggestion of the warrior or maybe just a few clown instances grabbing at different points
09:46 🔗 ivan` yipdw is ignoring individual post pages
09:46 🔗 ivan` divide by some number to get the # of topics?
09:48 🔗 cat____ 62669 topics according to forum index
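The rate estimate discussed above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the post and topic counts and the 5-day deadline come from the log, while the function name `required_rate` is made up for illustration, and the topic figure is a lower bound because it assumes each topic fits on a single page.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(pages: int, days: float) -> float:
    """Minimum pages per second needed to fetch `pages` within `days`,
    assuming a constant rate and no interruptions."""
    return pages / (days * SECONDS_PER_DAY)


# Figures quoted in the channel: 863,537 posts, 62,669 topics, 5 days left.
posts_rate = required_rate(863_537, 5)
topics_rate = required_rate(62_669, 5)

print(f"{posts_rate:.2f} pages/s if every post is fetched as its own page")
print(f"{topics_rate:.3f} pages/s if only topic pages are fetched (one page per topic assumed)")
```

Under those assumptions the per-post figure matches omf_'s "2 pages a second", and skipping individual post pages, as ivan` suggests, lowers the floor by roughly an order of magnitude.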
09:51 🔗 omf_ according to the dashboard it is grabbing everything
09:59 🔗 omf_ another fun fact I learned about this site
09:59 🔗 omf_ it only lets you page up to 20 pages but there are more beyond that
10:04 🔗 cat____ vBulletin 5 is very broken in that regard. The bottom paginator is broken. The textbox at the top works.
10:07 🔗 cat____ also if you have web storage disabled in your browser, the entire ajax construct behind the forum just dies.
10:08 🔗 omf_ yeah I test with javascript off
12:02 🔗 cat___ Hello again. I'm the same "cat" that posted about GaijinPot earlier. I just found out they have an API that gets all forum posts for a thread. description at: http://pastebin.com/E19xTpGC Will try archiving through that.
12:12 🔗 omf_ well it looks like I might have all the thread ids since I crawled all the category list pages
12:12 🔗 omf_ will have to give this a try
12:55 🔗 cat___ http://pastebin.com/E0piGg16 < C# to get all topic html; turns out it's similar to the API to get all threads. rush job coding, sorry.
13:50 🔗 Nemo_bis SketchCow: how do I rederive (re-OCR) 5000 items?
15:31 🔗 yipdw omf_: I don't think it's going to go any faster; the dominating delay is on gaijinpot's end
15:32 🔗 yipdw also now it's throwing 500s
15:32 🔗 yipdw is someone slamming the forum right now?
15:32 🔗 yipdw this is way slower than it was last night
15:34 🔗 yipdw well, that's great
15:34 🔗 yipdw gaijinpot just ate it
15:39 🔗 ersi RIP :(
15:39 🔗 yipdw seriously, what the hell
15:39 🔗 yipdw it was working fine last night
15:40 🔗 yipdw I really hope this isn't due to someone killing the server with page requests
15:40 🔗 yipdw oh, there it is again
18:10 🔗 SketchCow Nemo_bis: You ask me
18:27 🔗 Nemo_bis SketchCow: is a list of items on CSV good enough?
18:27 🔗 Nemo_bis actually it's two sets of 5000, one lower priority
18:29 🔗 Nemo_bis They're all books being manually transcribed on Wikisource, but half of them are inactive so I don't need them immediately.
18:40 🔗 SketchCow Are these items all uploaded by you, in the same collection, or what.
19:23 🔗 Nemo_bis SketchCow: no, random items
19:23 🔗 Nemo_bis Needing an OCR update for Wikisource users using them
23:03 🔗 omf_ I need someone comfortable running seesaw from the command line to try out the blip.tv grab pipeline
23:04 🔗 omf_ https://github.com/ArchiveTeam/blip.tv-grab-video-only
23:04 🔗 omf_ I have a few items in the tracker already
23:05 🔗 pft get-wget-lua.sh-ing
23:06 🔗 omf_ run-pipeline pipeline.py nickname
23:06 🔗 pft rgr
23:07 🔗 pft chug chug chug
23:09 🔗 pft The --contimeout option may only be used when connecting to an rsync daemon.
23:09 🔗 pft rsync error: syntax or usage error (code 1) at main.c(1275) [sender=3.0.9]
23:09 🔗 pft Process RsyncUpload returned exit code 1 for Item http://blip.tv/file/get/00c6-00c6Episode1IslandC001713.mpeg
23:09 🔗 pft Failed RsyncUpload for Item http://blip.tv/file/get/00c6-00c6Episode1IslandC001713.mpeg
23:12 🔗 pft so i'm guessing that fos.textfiles.com::blooper isn't ready to take the bits?
23:13 🔗 omf_ I just changed the upload target on the tracker
23:13 🔗 omf_ restart the script and see if it works this time
23:14 🔗 pft ok chugchuging again
23:14 🔗 omf_ The rsync code in the pipeline looks good
23:15 🔗 pft Starting RsyncUpload for Item http://blip.tv/file/get/00c6-00c6Episode4Bowlders946.mpeg
23:15 🔗 pft @ERROR: Unknown module 'warrior'
23:16 🔗 pft Uploading with Rsync to rsync://fos.textfiles.com/warrior/blooper/Levon/
23:17 🔗 omf_ I tried the syntax both ways
23:17 🔗 omf_ hmm
23:17 🔗 pft rsyncd is so gross
23:26 🔗 pft do you know the commandline rsync uses to push the files up?
23:26 🔗 pft i could try manually running it
23:41 🔗 cat___ About GaijinPot: I managed to save a list of all threads in all boards last night. I'll probably have to re-run the scraper for that again today because new threads appeared.
23:41 🔗 cat___ What isn't working well is downloading all threads using the API. While speed is good (ETA 2 days), it's impossible to go from 0 to 999999 because unless the ID is an actual thread, it seems the API will just return a single post.
23:41 🔗 cat___ Next step: Re-run thread list scraper, extract thread IDs, download those. That might cut download time to just one day.
23:41 🔗 cat___ Out of curiosity, where would I upload any data that I manage to save? Is there some sort of central repository?
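The "extract thread IDs" step in cat___'s plan above could look roughly like the following. It is a minimal sketch, not cat___'s actual scraper (that was written in C# and lives in the pastebin). It assumes every thread URL ends in a path segment of the form `<numeric id>-<slug>`, which is inferred from the single forum-closing URL quoted earlier in the log; the helper names `thread_id` and `unique_thread_ids` are hypothetical.

```python
import re
from typing import Iterable, List, Optional
from urllib.parse import urlparse

# Assumption: a thread's last path segment starts with its numeric ID and a hyphen.
_THREAD_SEGMENT = re.compile(r"^(\d+)-")


def thread_id(url: str) -> Optional[int]:
    """Return the numeric thread ID from a thread URL, or None if the URL
    does not look like a thread (e.g. a board index page)."""
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    match = _THREAD_SEGMENT.match(last_segment)
    return int(match.group(1)) if match else None


def unique_thread_ids(urls: Iterable[str]) -> List[int]:
    """Collect thread IDs from scraped URLs, dropping non-thread URLs and
    duplicates while keeping first-seen order."""
    seen = set()
    ordered = []
    for url in urls:
        tid = thread_id(url)
        if tid is not None and tid not in seen:
            seen.add(tid)
            ordered.append(tid)
    return ordered


if __name__ == "__main__":
    scraped = [
        "http://forum.gaijinpot.com/forum/community-center/general-discussion/1527872-forum-closing",
        "http://forum.gaijinpot.com/forum/",
    ]
    print(unique_thread_ids(scraped))
```

Feeding only these IDs to the downloader avoids walking the whole 0 to 999999 range, where non-thread IDs return a single post.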
23:56 🔗 yipdw pft: the --contimeout error you're seeing is because the rsync version you've got is too old
23:56 🔗 yipdw pft: seeing that error is pretty common on OS X Mountain Lion, FWIW
23:56 🔗 omf_ thanks yipdw
23:57 🔗 pft i'm on debian somethingorother
23:57 🔗 pft whatever stable is
23:57 🔗 pft so i'd estimate my rsync is about 37 years old
23:57 🔗 omf_ you can get the debian version by running
23:57 🔗 omf_ lsb_release -a
23:57 🔗 yipdw just a little younger than Andrew Tridgell
23:57 🔗 pft "wheezy" 7.1
23:57 🔗 yipdw hmm, that should work
23:58 🔗 yipdw I have run RsyncUpload tasks on Debian 7 before
23:58 🔗 yipdw maybe it's a different cause
23:58 🔗 pft well the rsync is failing
23:58 🔗 yipdw cat___: hold on to it for now, ping SketchCow later
23:58 🔗 pft rsync version 3.0.9 protocol version 30
23:59 🔗 yipdw cat___: or you can throw it straight into the Internet Archive
23:59 🔗 pft omf_: any idea why the rsync upload is failing?
23:59 🔗 yipdw odd
23:59 🔗 yipdw pft: that should be good enough
23:59 🔗 pft also is there anything i can do with these ~900mb files in data
23:59 🔗 yipdw hmm, I'm not sure what's up with that then
