#archiveteam 2014-05-08,Thu


Time Nickname Message
00:31 🔗 ryonaloli anyone have advice on scraping a site from archive.org with wget?
00:32 🔗 APerti Can't you just get the WARCs for the site?
00:33 🔗 ryonaloli i have no idea what those are :/
00:33 🔗 ryonaloli i'm kinda new to this
01:53 🔗 SketchCow What are you trying to get?
02:53 🔗 zenguy_pc hi
02:54 🔗 zenguy_pc i was thinking about archiving fsrn.org, i saw that it's creative commons licensed.. is that being archived already?
02:57 🔗 giganticp wat is sekrit word
02:57 🔗 zenguy_pc ?
02:57 🔗 balrog The secret word is "yahoosucks"
02:58 🔗 zenguy_pc https://clbin.com/mgntE
02:58 🔗 zenguy_pc if anyone is interested
02:58 🔗 zenguy_pc didn't grab it yet
04:55 🔗 Jonimus trying to emulate old games which had even basic copy protections sucks.
04:57 🔗 APerti Which games/copy protection?
05:04 🔗 Jonimus nvm, it turns out it was a "have you read the manual" check
05:04 🔗 Jonimus and of course Archive.org has a scan of the manual ;D
05:08 🔗 APerti Nice.
06:25 🔗 ryonaloli <@SketchCow> What are you trying to get?
06:25 🔗 ryonaloli an imageboard called gurochan which died weeks ago. i'm part of the team that created a new one but we need the original's archives
06:28 🔗 ryonaloli https://web.archive.org/web/20140106164316/http://gurochan.net/ (link is sfw, following links will be nsfw)
07:09 🔗 SketchCow OK, so you want to pull from the internet archive WARCs
07:10 🔗 SketchCow https://archive.org/details/gurochan_archive_2006-2010
07:10 🔗 SketchCow (Obviously not perfect, I just happened to notice this)
07:12 🔗 ryonaloli SketchCow: that's only the images with unix timestamps, and we already have those. what we need is the original thread structure
07:12 🔗 SketchCow Anyway, what I'm seeing here is that archive.org has semi-irregular grabs.
07:12 🔗 ryonaloli how do i use WARCs? i looked it up but could only find descriptions of the format, not how to create it
07:13 🔗 SketchCow But it's probably what you're looking for.
07:13 🔗 SketchCow You can probably yank from the wayback.
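[Editor's note: "yanking from the wayback" usually starts with the Wayback Machine's CDX index, which lists every capture under a URL prefix. A minimal sketch of building such a query (the CDX endpoint is real; the helper function is hypothetical):]

```python
from urllib.parse import urlencode

def cdx_query_url(site, **params):
    """Build a Wayback CDX API query listing all captures under a URL prefix."""
    base = "http://web.archive.org/cdx/search/cdx"
    query = {"url": site, "matchType": "prefix", "output": "json"}
    query.update(params)  # e.g. filter="statuscode:200"
    return base + "?" + urlencode(query)

# List successful captures of the site discussed in this log:
print(cdx_query_url("gurochan.net/", filter="statuscode:200"))
```

Fetching that URL returns one row per capture (urlkey, timestamp, original URL, mimetype, status, digest, length), which is enough to drive a scraper.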
07:14 🔗 ryonaloli i'm not sure how to use them to take from wayback
07:16 🔗 SketchCow http://waybackdownloader.com/
07:16 🔗 SketchCow Maybe
07:16 🔗 SketchCow I'm looking for utilities.
07:17 🔗 ryonaloli >pricing and order form
07:18 🔗 SketchCow Yes.
07:19 🔗 SketchCow $15, might not be bad.
07:19 🔗 ryonaloli we're already on a tight budget to run the current site. we can't afford to spend $15 when there's probably a way to do it even with a firefox macro
07:19 🔗 SketchCow Otherwise, write a script and scrape like crazy.
07:20 🔗 SketchCow Sounds like you have it all under control. Good luck.
07:20 🔗 ryonaloli it requires javascript to view pages though, right?
07:20 🔗 SketchCow No idea.
07:21 🔗 ryonaloli hm
07:22 🔗 ryonaloli it seems the web archive tries its hardest to make scraping impossible
07:33 🔗 ryonaloli heh, that site's faq links all go to 404
07:48 🔗 yipdw you can check out https://github.com/alard/warc-proxy
07:48 🔗 yipdw it's a tool which reads WARCs and reconstructs HTTP responses from those WARCs
07:49 🔗 ryonaloli but how do i create a warc?
07:49 🔗 midas but remember kids, just because it's an archive file doesn't make it a backup.
07:49 🔗 midas ryonaloli: wget has a special flag for that
07:50 🔗 yipdw you can also use wpull --warc-file
07:51 🔗 yipdw if there's a bunch of WARCs in a tarball, you can use https://github.com/ArchiveTeam/megawarc
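[Editor's note: since ryonaloli asked what a WARC even is: a WARC file is just a sequence of records, each a "WARC/1.0" version line, header lines, a blank line, then the payload. A minimal parser sketch for one record header (synthetic record, not an official library):]

```python
def parse_warc_header(data: bytes):
    """Parse one WARC record header: version line plus key: value fields,
    terminated by an empty line. Returns (version, headers, body_offset)."""
    head, sep, _ = data.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    version = lines[0].decode()          # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(b":")
        headers[key.decode().strip()] = value.decode().strip()
    return version, headers, len(head) + len(sep)

record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"Content-Length: 5\r\n"
          b"\r\n"
          b"hello")
version, headers, off = parse_warc_header(record)
print(version, headers["WARC-Type"],
      record[off:off + int(headers["Content-Length"])])
```

Tools like warc-proxy do this at scale (plus gzip handling and HTTP reconstruction), but the format itself is this simple.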
07:52 🔗 yipdw I'm not sure why you need to create a WARC to retrieve thread structure from (some hypothetical) WARC, though
07:52 🔗 ryonaloli i'm still not sure how to turn a wayback link into a warc
07:52 🔗 yipdw oh, the Wayback Machine's WARCs aren't publicly accessible
07:52 🔗 yipdw well, most of them aren't, but that's not an important detail
07:52 🔗 midas neither are we ryonaloli, making a warc from wayback would only recreate the wayback http response
07:53 🔗 ryonaloli heh
07:53 🔗 midas besides that, grabbing all of the wayback machine might fill your drive up pretty fast
07:53 🔗 ryonaloli then what would be the best way to scrape a site without paying $15?
07:53 🔗 ryonaloli all? nah, just a website with <10 gigs
07:53 🔗 midas pay 15 bucks.
07:53 🔗 midas just pay the 15 bucks.
07:53 🔗 yipdw you could write your own scraper
07:55 🔗 ryonaloli i'm not sure how i'd write it if archive.org tries its best to block those. as for the $15, this is for a site with a very low budget
07:55 🔗 yipdw looking at gurochan.net captures it doesn't seem like it'd be all that difficult
07:55 🔗 yipdw eh?
07:55 🔗 yipdw I've never been blocked from downloading on any archive.org subdomain
07:55 🔗 yipdw what gave you the impression that you'd be blocked?
07:55 🔗 yipdw I mean, okay, maybe if you consume a ridiculous proportion of their bandwidth
07:55 🔗 yipdw but you don't need to do that
07:56 🔗 ryonaloli oh, i looked it up and most answers said it requires javascript to get internal links
07:56 🔗 yipdw what does, Wayback?
07:56 🔗 ryonaloli i think so
07:58 🔗 yipdw I don't know what that means
07:58 🔗 yipdw I can access any archived URL on gurochan with curl
07:59 🔗 yipdw e.g. $ curl -vvv 'http://web.archive.org/web/20100611210558/http://gurochan.net/dis/res/1109.html' works
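[Editor's note: one detail that helps when scraping captures like this: inserting `id_` after the timestamp in a Wayback URL makes the Wayback Machine serve the original response without its injected toolbar and rewritten links. The `id_` modifier is a real Wayback feature; the rewrite helper is a sketch:]

```python
import re

def raw_capture_url(wayback_url):
    """Rewrite /web/<14-digit timestamp>/ to /web/<timestamp>id_/ so the
    Wayback Machine returns the unmodified archived response."""
    return re.sub(r"(/web/\d{14})/", r"\1id_/", wayback_url, count=1)

print(raw_capture_url(
    "http://web.archive.org/web/20100611210558/http://gurochan.net/dis/res/1109.html"))
# -> http://web.archive.org/web/20100611210558id_/http://gurochan.net/dis/res/1109.html
```

That removes the main reason a naive scrape of Wayback pages looks "javascript-only": the wrapper chrome, not the archived content itself.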
07:59 🔗 ryonaloli hm, i'll probably have to try again then
07:59 🔗 yipdw I don't know where you read that accessing Wayback either (a) results in bans or (b) requires Javascript
08:00 🔗 yipdw wherever you read that is wrong
08:14 🔗 ryonaloli when i try "wget -np -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org 'https://web.archive.org/web/20140106164316/http://gurochan.net/'", only the main page is downloaded, it doesn't go into any other links that don't have '20140106164316'
08:14 🔗 ryonaloli how do i let it go recursively into the rest without it trying to archive all of archive.org?
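[Editor's note: one way to keep a recursive grab on target is to follow only links that are Wayback captures of the site itself, whatever their timestamp. A sketch of such a filter (the helper is hypothetical; with wget itself, the `--accept-regex` option can express the same pattern):]

```python
import re

# Follow only Wayback URLs that are captures of gurochan.net, any timestamp.
CAPTURE = re.compile(
    r"^https?://web\.archive\.org/web/\d{14}(?:id_)?/"
    r"https?://(?:www\.)?gurochan\.net(/|$)")

def should_follow(url):
    """Return True if this link is a capture of the target site."""
    return bool(CAPTURE.match(url))

print(should_follow("https://web.archive.org/web/20140106164316/http://gurochan.net/dis/"))
print(should_follow("https://web.archive.org/web/20140106164316/http://example.com/"))
```

This follows captures across timestamps (which Wayback's internal links require) without wandering off into the rest of archive.org.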
09:22 🔗 midas I still don't understand what you're trying to do. but i'd start with getting this warc file: https://archive.org/details/gurochan_archive_2006-2010
09:23 🔗 midas grab the warc-proxy and start working from there.
09:23 🔗 midas FYI, warc proxy has a readme.
09:23 🔗 ryonaloli i already have that file. midas: what i'm trying to do is retrieve the threads from the archive. the wget command doesn't seem to recursively follow links
09:24 🔗 midas that's because you're trying to use the wayback machine, it's not made for doing that
09:24 🔗 midas warc proxy + that warc file, should be enough to get you going
09:24 🔗 ryonaloli that's the only thing i can use though. that 2006-2010 archive is just a bunch of images, not the threads or the original filenames
09:41 🔗 midas ryonaloli: wget -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org https://web.archive.org/web/20140207233054/http://gurochan.net/
09:42 🔗 midas grabs it all, good luck getting it into something useful, can't help you with that
09:44 🔗 ryonaloli midas: does that also grab every previous version?
09:45 🔗 midas everything is everything ryonaloli
09:46 🔗 ryonaloli damn, that's gotta be hundreds of gb
09:46 🔗 midas ..
09:47 🔗 nico probably not
09:52 🔗 ryonaloli but there are over a hundred snapshots, and the whole site is 7gb iirc
09:52 🔗 midas well use the warc proxy.
09:52 🔗 midas now you're getting all of the snapshots, all of them
09:52 🔗 midas that's what you wanted.
10:02 🔗 ryonaloli how will the warc proxy be different?
10:02 🔗 ryonaloli and, i didn't want all of the snapshots. just the most recent one for each page
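[Editor's note: "just the most recent capture of each page" is exactly the question the CDX index answers: fetch the capture list and keep the newest timestamp per original URL. A sketch over CDX-style lines (field order follows the CDX API's default `urlkey timestamp original ...` columns; the sample rows are made up):]

```python
def latest_captures(cdx_lines):
    """Keep only the newest capture per original URL from CDX text lines
    (default field order: urlkey, timestamp, original, ...)."""
    newest = {}
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        timestamp, original = fields[1], fields[2]
        # 14-digit timestamps compare correctly as strings
        if original not in newest or timestamp > newest[original]:
            newest[original] = timestamp
    return newest

sample = [
    "net,gurochan)/ 20100611210558 http://gurochan.net/ text/html 200 ABC 123",
    "net,gurochan)/ 20140106164316 http://gurochan.net/ text/html 200 DEF 456",
]
print(latest_captures(sample))
```

Each (url, timestamp) pair then maps straight to one Wayback fetch, so the download is one copy of the site rather than every snapshot.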
11:51 🔗 fexx any plans to grab 800notes.com / other phone number indexing sites?
11:53 🔗 ersi None what I know of, but anyone can do what they please. If it's interesting, feel free to take 'em on
12:02 🔗 schbirid https://pay.reddit.com/r/opendirectories/comments/25002s/meta_a_tool_for_tree_mapping_remote_directories/
12:04 🔗 schbirid not very useful output, http://dirmap.krakissi.net/?path=https%3A%2F%2Fwww.quaddicted.com%2Ffiles%2Fmaps%2F
12:05 🔗 midas so it has a open dir and loops it to find all files
12:06 🔗 schbirid wget --spider -nv and some regexping is more suitable for people like us
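[Editor's note: the "regexping" half of that can be as simple as pulling URLs back out of the spider log. A sketch (the log snippet is invented; real `wget -nv` output varies by version):]

```python
import re

URL_RE = re.compile(r"https?://\S+")

def urls_from_log(log_text):
    """Extract every URL mentioned in a wget spider/log dump."""
    return URL_RE.findall(log_text)

log = """2014-05-08 12:06:00 URL: https://www.quaddicted.com/files/maps/ 200 OK
2014-05-08 12:06:01 URL: https://www.quaddicted.com/files/maps/foo.zip 200 OK"""
print(urls_from_log(log))
```

The resulting list can be deduplicated, filtered, and fed back into a downloader, which is the tree-mapping workflow without actually downloading anything during the spider pass.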
12:07 🔗 midas with the strange twitch of downloading everything
12:08 🔗 schbirid it does not download everything
12:11 🔗 midas spider doesn't, but people like us do
12:11 🔗 midas ;-)
12:11 🔗 schbirid >:)
14:15 🔗 DFJustin wow you guys fail at reading comprehension
14:15 🔗 DFJustin what he needs is https://code.google.com/p/warrick/ but he's gone now naturally
14:38 🔗 SketchCow There you go.
14:38 🔗 SketchCow The $15 thing didn't inspire me to keep going on.
14:53 🔗 midas SketchCow: next time: http://archiveteam.org/index.php?title=Restoring (will add more data tonight, mostly made by DFJustin now)
14:53 🔗 midas aka, all made by him atm :p
14:56 🔗 SketchCow Yeah, then we won't have to break someone's back suggesting $15
14:58 🔗 midas well, we can just point
15:36 🔗 balrog SketchCow: it doesn't inspire me either
15:37 🔗 DFJustin it's a recurring problem so it's worth documenting
15:41 🔗 SketchCow Agreed, absolutely.
15:43 🔗 balrog I'd put a disclaimer saying we don't endorse that paid service though
15:45 🔗 DFJustin you know what they say about wikis
15:46 🔗 SketchCow Everybody's got one
15:46 🔗 DFJustin that too
15:53 🔗 SketchCow Using the internetarchive python interface.
15:53 🔗 SketchCow Hardcore.
15:53 🔗 SketchCow Running into bugs and limits, so you know I'm being cruel
16:34 🔗 midas badass.py
16:48 🔗 SketchCow https://archive.org/details/gg_Aerial_Assault_Rev_1_1992_Sega
16:48 🔗 SketchCow Title, year and creator added by script. Cover and screenshot also.
23:21 🔗 ivan` http://dealbook.nytimes.com/2014/05/08/delicious-social-site-is-sold-by-youtube-founders/
