Time | Nickname | Message
01:18 | SketchCow | Foursquare contacted Internet Archive and wants that Foursquare dataset GONE
01:18 | SketchCow | So if you want it from there, grab it
01:19 | SketchCow | https://archive.org/details/201309_foursquare_dataset_umn
01:33 | arkhive | Why would they want it taken down?
01:35 | arkhive | and if they were extracted from the public API what's the big deal?
01:38 | godane | SketchCow: i'm doing a mirror of dailymail
01:38 | godane | just for the fun of it
01:38 | godane | turns out just grabbing dailymail.co.uk/news/article-#id will give me the article
01:39 | godane | even if it's not in news
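A rough sketch of the ID-walk grab godane describes: requesting dailymail.co.uk/news/article-<id> returns the article even when it isn't filed under news. The ID window, politeness delay, and output layout below are assumptions for illustration, not godane's actual script.

    import time
    import urllib.error
    import urllib.request

    BASE = "http://www.dailymail.co.uk/news/article-%d"   # article-#id pattern from the log

    def grab(article_id):
        req = urllib.request.Request(BASE % article_id,
                                     headers={"User-Agent": "dailymail-mirror-sketch/0.1"})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code == 404:            # nothing behind this ID
                return None
            raise

    for article_id in range(2400000, 2400100):   # hypothetical ID window
        html = grab(article_id)
        if html is not None:
            with open("article-%d.html" % article_id, "wb") as out:
                out.write(html)
        time.sleep(1)                      # keep the request rate polite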
03:22 | Cameron_D | https://thepiratebay.sx/torrent/9027196/UMN_Sarwat_Foursquare_Dataset_%28September_2013%29 oops :)
03:25 | DFJustin | the keys are like right next to each other
05:04 | ivan` | https://github.com/robots.txt
08:25 | cat____ | I have no idea of your command structure, or who decides what to archive and what not to, but I will just put this out here. http://forum.gaijinpot.com/forum/ is going to bite the dust on the 15th.
08:26 | cat____ | There's a lot of information for the English-speaking/foreign community in Japan on that board. A lot of it is valuable advice from long-time expats.
08:26 | cat____ | I hope that this message will reach the right people and this information can be saved before being lost forever.
08:27 | omf_ | cat____, It has already reached the right people. We save everything and let the historians later figure out the value of the data
08:29 | cat____ | That gives me some relief.
08:29 | omf_ | It is a forum with a sane url scheme and no javascript is needed to page through results
08:29 | omf_ | did they just announce the closing?
08:29 | cat____ | Yes. Click on any post. There will be a box on the top.
08:29
🔗
|
omf_ |
I saw that, no emails or a forum post just on that topic? We like to save those too |
08:30
🔗
|
cat____ |
I've sent the admin an email. No response so far. The members have made a post at http://forum.gaijinpot.com/forum/community-center/general-discussion/1527872-forum-closing |
08:31
🔗
|
cat____ |
What I get from it, they consider forums Web 1.0 and Facebook/Twitter as the future. :( |
08:32 | omf_ | Forums are social media
08:32 | omf_ | duh
08:32
🔗
|
cat____ |
The backstory goes something like this: the board ran vBulletin 3.x, got hacked. They upgraded to 5.x which happens to be a maintenance nightmare. |
08:33
🔗
|
cat____ |
So rather than using a sane forum software, they just shut it down. *sigh* |
08:33
🔗
|
SketchCow |
We'll certainly run a scan against it. |
08:33
🔗
|
cat____ |
thank you <3 |
08:36
🔗
|
omf_ |
863,537 is how many posts that site has |
08:59
🔗
|
SketchCow |
omf_: Let's wait for archivebot to be moved to my infrastructure (storage) and we'll do it. |
08:59
🔗
|
omf_ |
SketchCow, we can warrior gaikinpot. I am already collecting the urls to do it |
09:12 | yipdw | omf_: feel free to continue collecting, but keep an eye on http://archivebot.at.ninjawedding.org:4567/ too
09:14 | yipdw | archivebot's working on it right now
09:14 | yipdw | I don't think throwing dozens of machines at it would help, but it's possible that they have some per-connection limits
09:44 | yipdw | "they" being gaijinpot
09:46 | omf_ | yipdw, I don't think archivebot is going to get it all. It closes in 5 days, which is 432,000 seconds. There are 863,537 posts, meaning it would have to get 2 pages a second minimum to get the whole site with no interruptions. It doesn't look to be going that fast according to the dashboard, hence my suggestion of the warrior or maybe just a few clown instances grabbing at different points.
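For reference, the arithmetic behind that estimate, spelled out; it treats every post as one page to fetch, which ivan` and cat____ refine just below (topics, not individual posts, are what is actually being walked).

    # omf_'s back-of-the-envelope crawl-rate estimate
    posts = 863537
    seconds_left = 5 * 24 * 60 * 60          # 5 days = 432,000 seconds
    print(posts / seconds_left)              # ~2.0 fetches per second, sustained
    # Using the 62,669 topics cat____ reports a bit further down instead:
    print(62669 / seconds_left)              # ~0.15 fetches per second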
09:46 | ivan` | yipdw is ignoring individual post pages
09:48 | ivan` | divide by some number to get the # of topics?
09:51 | cat____ | 62669 topics according to forum index
09:59 | omf_ | according to the dashboard it is grabbing everything
09:59 | omf_ | another fun fact I learned about this site
10:04 | omf_ | it only lets you page up to 20 pages but there are more beyond that
10:07 | cat____ | vBulletin 5 is very broken in that regard. The bottom paginator is broken. The textbox at the top works.
10:08 | cat____ | also if you have web storage disabled in your browser, the entire ajax construct behind the forum just dies.
12:02 | omf_ | yeah I test with javascript off
12:02
🔗
|
cat___ |
Hello again. I'm the same "cat" that posted about GaijinPot earlier. I just found out they have an API that gets all forum posts for a thread. description at: http://pastebin.com/E19xTpGC Will try archiving through that. |
12:12 | omf_ | well it looks like I might have all the thread ids since I crawled all the category list pages
12:55 | omf_ | will have to give this a try
13:50 | cat___ | http://pastebin.com/E0piGg16 < C# to get all topic html; turns out it's similar to the API to get all threads. Rush job coding, sorry.
13:50
🔗
|
Nemo_bis |
SketchCow: how do I rederive (re-OCR) 5000 items? |
15:31
🔗
|
yipdw |
omf_: I don't think it's going to go any faster; the dominating delay is on gaijinpot's end |
15:32
🔗
|
yipdw |
also now it's throwing 500s |
15:32
🔗
|
yipdw |
is someone slamming the forum right now? |
15:32
🔗
|
yipdw |
this is way slower than it was last night |
15:34 | yipdw | well, that's great
15:34
🔗
|
yipdw |
gaijinpot just ate it |
15:39
🔗
|
ersi |
RIP :( |
15:39
🔗
|
yipdw |
seriously, what the hell |
15:39
🔗
|
yipdw |
it was working fine last night |
15:40
🔗
|
yipdw |
I really hope this isn't due to someone killing the server with page requests |
15:40
🔗
|
yipdw |
oh, there it is again |
18:10
🔗
|
SketchCow |
Nemo_bis: You ask me |
18:27
🔗
|
Nemo_bis |
SketchCow: is a list of items on CSV good enough? |
18:27
🔗
|
Nemo_bis |
actually it's two sets of 5000, one lower priority |
18:29
🔗
|
Nemo_bis |
They're all books being manually transcribed on Wikisource, but half of them are inactive so I don't need them immediately. |
18:40
🔗
|
SketchCow |
Are these items all uploaded by you, in the same collection, or what. |
19:23
🔗
|
Nemo_bis |
SketchCow: no, random items |
19:23
🔗
|
Nemo_bis |
Needing an OCR update for Wikisource users using them |
23:03
🔗
|
omf_ |
I need someone comfortable running seesaw from the command line to try out the blip.tv grab pipeline |
23:04
🔗
|
omf_ |
https://github.com/ArchiveTeam/blip.tv-grab-video-only |
23:04
🔗
|
omf_ |
I have a few items in the tracker already |
23:05
🔗
|
pft |
get-wget-lua.sh-ing |
23:06
🔗
|
omf_ |
run-pipeline pipeline.py nickname |
23:06
🔗
|
pft |
rgr |
23:07
🔗
|
pft |
chug chug chug |
23:09
🔗
|
pft |
The --contimeout option may only be used when connecting to an rsync daemon. |
23:09
🔗
|
pft |
rsync error: syntax or usage error (code 1) at main.c(1275) [sender=3.0.9] |
23:09
🔗
|
pft |
Process RsyncUpload returned exit code 1 for Item http://blip.tv/file/get/00c6-00c6Episode1IslandC001713.mpeg |
23:09
🔗
|
pft |
Failed RsyncUpload for Item http://blip.tv/file/get/00c6-00c6Episode1IslandC001713.mpeg |
23:12
🔗
|
pft |
so i'm guessing that fos.textfiles.com::blooper isn't ready to take the bits? |
23:13
🔗
|
omf_ |
I just changed the upload target on the tracker |
23:13
🔗
|
omf_ |
restart the script and see if it works this time |
23:14
🔗
|
pft |
ok chugchuging again |
23:14
🔗
|
omf_ |
The rsync code in the pipeline looks good |
23:15
🔗
|
pft |
Starting RsyncUpload for Item http://blip.tv/file/get/00c6-00c6Episode4Bowlders946.mpeg |
23:15
🔗
|
pft |
@ERROR: Unknown module 'warrior' |
23:16
🔗
|
pft |
Uploading with Rsync to rsync://fos.textfiles.com/warrior/blooper/Levon/ |
23:17
🔗
|
omf_ |
I tried the syntax both ways |
23:17
🔗
|
omf_ |
hmm |
23:17
🔗
|
pft |
rsyncd is so gross |
23:26
🔗
|
pft |
do you know the commandline rsync uses to push the files up? |
23:26
🔗
|
pft |
i could try manually running it |
23:41
🔗
|
cat___ |
About GaijinPot: I managed to save a list of all threads in all boards last night. I'll probably have to re-run the scraper for that again today because new threads appeared. |
23:41 | cat___ | What isn't working well is downloading all threads using the API. While speed is good (ETA 2 days), it's not practical to just walk IDs from 0 to 999999, because unless the ID is an actual thread, the API seems to just return a single post.
23:41 | cat___ | Next step: re-run the thread list scraper, extract thread IDs, and download those. That might cut download time to just one day.
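A minimal sketch of the thread-ID-driven pass cat___ outlines. The endpoint and parameter names below are placeholders standing in for the API described in the pastebin, and thread_ids.txt is assumed to be the ID list extracted from the scraped board index; none of this is the actual GaijinPot API.

    import time
    import urllib.error
    import urllib.request

    # Placeholder URL; the real endpoint and parameters are in the pastebin above.
    API = "http://forum.gaijinpot.com/api/thread?threadid=%s"

    with open("thread_ids.txt") as f:              # IDs pulled from the thread-list scrape
        thread_ids = [line.strip() for line in f if line.strip()]

    for tid in thread_ids:
        try:
            with urllib.request.urlopen(API % tid, timeout=60) as resp:
                data = resp.read()
        except urllib.error.HTTPError:
            continue                               # ID wasn't a real thread; skip it
        with open("thread-%s.json" % tid, "wb") as out:
            out.write(data)
        time.sleep(0.2)                            # don't hammer an already-struggling server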
23:41 | cat___ | Out of curiosity, where would I upload any data that I manage to save? Is there some sort of central repository?
23:56 | yipdw | pft: the --contimeout error you're seeing is because the rsync version you've got is too old
23:56 | yipdw | pft: seeing that error is pretty common on OS X Mountain Lion, FWIW
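Going only by the wording of the error above, there are two quick things worth checking on the client side: the rsync has to be new enough to know --contimeout at all, and --contimeout is only accepted for daemon-style destinations (rsync://host/module/... or host::module/...). A small standalone check, not part of seesaw:

    import re
    import subprocess

    def daemon_style(dest):
        # --contimeout only applies when talking to an rsync daemon,
        # i.e. rsync://host/module/... or host::module/... destinations.
        return dest.startswith("rsync://") or "::" in dest

    def rsync_version():
        out = subprocess.run(["rsync", "--version"],
                             capture_output=True, text=True).stdout
        m = re.search(r"version\s+(\d+)\.(\d+)\.(\d+)", out)
        return tuple(int(x) for x in m.groups()) if m else None

    print(daemon_style("rsync://fos.textfiles.com/warrior/blooper/Levon/"))  # True
    print(daemon_style("fos.textfiles.com:blooper/"))                        # False (hypothetical non-daemon form)
    print(rsync_version())   # e.g. (3, 0, 9); --contimeout appeared in the 3.0 series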
23:57 | omf_ | thanks yipdw
23:57 | pft | i'm on debian somethingorother
23:57 | pft | whatever stable is
23:57 | pft | so i'd estimate my rsync is about 37 years old
23:57 | omf_ | you can get the debian version by running
23:57 | omf_ | lsb_release -a
23:57 | yipdw | just a little younger than Andrew Tridgell
23:57 | pft | "wheezy" 7.1
23:58 | yipdw | hmm, that should work
23:58 | yipdw | I have run RsyncUpload tasks on Debian 7 before
23:58 | yipdw | maybe it's a different cause
23:58 | pft | well the rsync is failing
23:58 | yipdw | cat___: hold on to it for now, ping SketchCow later
23:59 | pft | rsync version 3.0.9 protocol version 30
23:59 | yipdw | cat___: or you can throw it straight into the Internet Archive
23:59 | pft | omf_: any idea why the rsync upload is failing?
23:59 | yipdw | odd
23:59 | yipdw | pft: that should be good enough
23:59 | pft | also is there anything i can do with these ~900mb files in data
23:59 | yipdw | hmm, I'm not sure what's up with that then