[00:31] anyone have advice on scraping a site from archive.org with wget?
[00:32] Can't you just get the WARCs for the site?
[00:33] i have no idea what those are :/
[00:33] i'm kinda new to this
[01:53] What are you trying to get?
[02:53] hi
[02:54] i was thinking about archiving fsrn.org, i saw that it's under a creative commons license.. is that being archived already?
[02:57] wat is sekrit word
[02:57] ?
[02:57] The secret word is "yahoosucks"
[02:58] https://clbin.com/mgntE
[02:58] if anyone is interested
[02:58] didn't grab it yet
[04:55] trying to emulate old games which had even basic copy protection sucks.
[04:57] Which games/copy protection?
[05:04] nvm, it turns out it was a "have you read the manual" check
[05:04] and of course Archive.org has a scan of the manual ;D
[05:08] Nice.
[06:25] <@SketchCow> What are you trying to get?
[06:25] an imageboard called gurochan which died weeks ago. i'm part of the team that created a new one, but we need the original's archives
[06:28] https://web.archive.org/web/20140106164316/http://gurochan.net/ (link is sfw, following links will be nsfw)
[07:09] OK, so you want to pull from the Internet Archive WARCs
[07:10] https://archive.org/details/gurochan_archive_2006-2010
[07:10] (Obviously not perfect, I just happened to notice this)
[07:12] SketchCow: that's only the images with unix timestamps, and we already have those. what we need is the original thread structure
[07:12] Anyway, what I'm seeing here is that archive.org has semi-irregular grabs.
[07:12] how do i use WARCs? i looked it up but could only find descriptions of the format, not how to create one
[07:13] But it's probably what you're looking for.
[07:13] You can probably yank from the wayback.
[07:14] i'm not sure how to use them to take from wayback
[07:16] http://waybackdownloader.com/
[07:16] Maybe
[07:16] I'm looking for utilities.
[07:17] >pricing and order form
[07:18] Yes.
[07:19] $15, might not be bad.
[07:19] we're already on a tight budget to run the current site. we can't afford to spend $15 when there's probably a way to do it even with a firefox macro
[07:19] Otherwise, write a script and scrape like crazy.
[07:20] Sounds like you have it all under control. Good luck.
[07:20] it requires javascript to view pages though, right?
[07:20] No idea.
[07:21] hm
[07:22] it seems the web archive tries its hardest to make scraping impossible
[07:33] heh, that site's faq links all go to 404s
[07:48] you can check out https://github.com/alard/warc-proxy
[07:48] it's a tool which reads WARCs and reconstructs the HTTP responses stored in them
[07:49] but how do i create a warc?
[07:49] but remember kids, just because it's an archive file doesn't make it a backup.
[07:49] ryonaloli: wget has a special flag for that
[07:50] you can also use wpull --warc-file
[07:51] if there's a bunch of WARCs in a tarball, you can use https://github.com/ArchiveTeam/megawarc
[07:52] I'm not sure why you need to create a WARC to retrieve thread structure from (some hypothetical) WARC, though
[07:52] i'm still not sure how to turn a wayback link into a warc
[07:52] oh, the Wayback Machine's WARCs aren't publicly accessible
[07:52] well, most of them aren't, but that's not an important detail
[07:52] neither are we, ryonaloli; making a warc from wayback would only recreate the wayback http response
[07:53] heh
[07:53] besides that, grabbing all of the wayback machine might fill your drive up pretty fast
[07:53] then what would be the best way to scrape a site without paying $15?
[07:53] all? nah, just a website with <10 gigs
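The "special flag" mentioned at [07:49] is wget's --warc-file, available in wget 1.14 and later. A minimal sketch of creating a WARC while mirroring; the output name and URL here are placeholders, not taken from the log:

    # mirror a site and write a gzipped WARC (mysite.warc.gz) alongside the normal files
    wget --mirror --warc-file=mysite http://example.com/

    # wpull takes the same option, as mentioned at [07:50]
    wpull --warc-file mysite http://example.com/

Both tools compress the WARC by default, so the result is mysite.warc.gz.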
[07:53] pay 15 bucks.
[07:53] just pay the 15 bucks.
[07:53] you could write your own scraper
[07:55] i'm not sure how i'd write it if archive.org tries its best to block those. as for the $15, this is for a site with a very low budget
[07:55] looking at gurochan.net captures, it doesn't seem like it'd be all that difficult
[07:55] eh?
[07:55] I've never been blocked from downloading on any archive.org subdomain
[07:55] what gave you the impression that you'd be blocked?
[07:55] I mean, okay, maybe if you consume a ridiculous proportion of their bandwidth
[07:55] but you don't need to do that
[07:56] oh, i looked it up and most answers said it requires javascript to get internal links
[07:56] what does, Wayback?
[07:56] i think so
[07:58] I don't know what that means
[07:58] I can access any archived URL on gurochan with curl
[07:59] e.g. $ curl -vvv 'http://web.archive.org/web/20100611210558/http://gurochan.net/dis/res/1109.html' works
[07:59] hm, i'll probably have to try again then
[07:59] I don't know where you read that accessing Wayback either (a) results in bans or (b) requires Javascript
[08:00] wherever you read that is wrong
[08:14] when i try "wget -np -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org 'https://web.archive.org/web/20140106164316/http://gurochan.net/'", only the main page is downloaded; it doesn't go into any other links that don't have '20140106164316'
[08:14] how do i let it recurse into the rest without it trying to archive all of archive.org?
[09:22] I still don't understand what you're trying to do. but i'd start with getting this warc file: https://archive.org/details/gurochan_archive_2006-2010
[09:23] grab the warc-proxy and start working from there.
[09:23] FYI, warc-proxy has a readme.
[09:23] i already have that file. midas: what i'm trying to do is retrieve the threads from the archive. the wget command doesn't seem to recursively follow links
[09:24] that's because you're trying to use the wayback machine, it's not made for doing that
[09:24] warc-proxy + that warc file should be enough to get you going
[09:24] that's the only thing i can use though. that 2006-2010 archive is just a bunch of images, not the threads or the original filenames
[09:41] ryonaloli: wget -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org https://web.archive.org/web/20140207233054/http://gurochan.net/
[09:42] grabs it all, good luck getting it into something useful, can't help you with that
[09:44] midas: does that also grab every previous version?
[09:45] everything is everything ryonaloli
[09:46] damn, that's gotta be hundreds of gb
[09:46] ..
[09:47] probably not
[09:52] but there are over a hundred snapshots, and the whole site is 7gb iirc
[09:52] well, use the warc-proxy.
[09:52] now you're getting all of the snapshots, all of them
[09:52] that's what you wanted.
[10:02] how will the warc-proxy be different?
[10:02] and, i didn't want all of the snapshots. just the most recent one of each page
[11:51] any plans to grab 800notes.com / other phone number indexing sites?
[11:53] None that I know of, but anyone can do what they please. If it's interesting, feel free to take 'em on
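On the two wget invocations above: the [08:14] command included -np (no-parent), which keeps wget inside the /web/20140106164316/... path. Since Wayback rewrites every internal link to point at that page's own nearest capture, almost every link carries a different timestamp and falls outside that path, so nothing gets followed. Dropping -np, as in the [09:41] command, follows them all and therefore pulls every snapshot. A hedged middle ground is wget's --accept-regex (also wget 1.14+); the pattern below is an assumption for illustration and will still skip any page whose nearest capture falls outside it:

    # only follow Wayback URLs whose capture timestamp starts with 2014
    # (tune the pattern to the captures you actually want)
    wget -e robots=off --mirror \
        --accept-regex='/web/2014[0-9]{10}' \
        --domains=staticweb.archive.org,web.archive.org \
        'https://web.archive.org/web/20140207233054/http://gurochan.net/'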
[12:02] https://pay.reddit.com/r/opendirectories/comments/25002s/meta_a_tool_for_tree_mapping_remote_directories/
[12:04] not very useful output, http://dirmap.krakissi.net/?path=https%3A%2F%2Fwww.quaddicted.com%2Ffiles%2Fmaps%2F
[12:05] so it takes an open dir and loops over it to find all files
[12:06] wget --spider -nv and some regexing is more suitable for people like us
[12:07] with the strange twitch of downloading everything
[12:08] it does not download everything
[12:11] spider doesn't, but people like us do
[12:11] ;-)
[12:11] >:)
[14:15] wow you guys fail at reading comprehension
[14:15] what he needs is https://code.google.com/p/warrick/ but he's gone now naturally
[14:38] There you go.
[14:38] The $15 thing didn't inspire me to keep going.
[14:53] SketchCow: next time: http://archiveteam.org/index.php?title=Restoring (will add more data tonight, mostly made by DFJustin now)
[14:53] aka, all made by him atm :p
[14:56] Yeah, then we won't have to break someone's back suggesting $15
[14:58] well, we can just point
[15:36] SketchCow: it doesn't inspire me either
[15:37] it's a recurring problem so it's worth documenting
[15:41] Agreed, absolutely.
[15:43] I'd put a disclaimer saying we don't endorse that paid service though
[15:45] you know what they say about wikis
[15:46] Everybody's got one
[15:46] that too
[15:53] Using the internetarchive python interface.
[15:53] Hardcore.
[15:53] Running into bugs and limits, so you know I'm being cruel
[16:34] badass.py
[16:48] https://archive.org/details/gg_Aerial_Assault_Rev_1_1992_Sega
[16:48] Title, year and creator added by script. Cover and screenshot also.
[23:21] http://dealbook.nytimes.com/2014/05/08/delicious-social-site-is-sold-by-youtube-founders/
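On the [15:53]/[16:48] exchange: the internetarchive Python package also ships an `ia` command-line client, so a scripted metadata edit like the one described might look roughly like the sketch below. The --modify keys and values are assumptions for illustration, not the actual script from the log:

    # pip install internetarchive; run `ia configure` once for credentials
    # (title/year/creator values here are guesses based on the item name)
    ia metadata gg_Aerial_Assault_Rev_1_1992_Sega \
        --modify="title:Aerial Assault (Rev 1)" \
        --modify="year:1992" \
        --modify="creator:Sega"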