#archiveteam 2014-05-08,Thu


Time Nickname Message
00:31 🔗 ryonaloli anyone have advice on scraping a site from archive.org with wget?
00:32 🔗 APerti Can't you just get the WARCs for the site?
00:33 🔗 ryonaloli i have no idea what those are :/
00:33 🔗 ryonaloli i'm kinda new to this
01:53 🔗 SketchCow What are you trying to get?
02:53 🔗 zenguy_pc hi
02:54 🔗 zenguy_pc i was thinking about archiving fsrn.org, i saw that it's creative commons licensed.. is that being archived already?
02:57 🔗 giganticp wat is sekrit word
02:57 🔗 zenguy_pc ?
02:57 🔗 balrog The secret word is "yahoosucks"
02:58 🔗 zenguy_pc https://clbin.com/mgntE
02:58 🔗 zenguy_pc if anyone is interested
02:58 🔗 zenguy_pc didn't grab it yet
04:55 🔗 Jonimus trying to emulate old games which had even basic copy protections sucks.
04:57 🔗 APerti Which games/copy protection?
05:04 🔗 Jonimus nvm, it turns out it was a "have you read the manual" check
05:04 🔗 Jonimus and of course Archive.org has a scan of the manual ;D
05:08 🔗 APerti Nice.
06:25 🔗 ryonaloli <@SketchCow> What are you trying to get?
06:25 🔗 ryonaloli an imageboard called gurochan which died weeks ago. i'm part of the team that created a new one but we need the original's archives
06:28 🔗 ryonaloli https://web.archive.org/web/20140106164316/http://gurochan.net/ (link is sfw, following links will be nsfw)
07:09 🔗 SketchCow OK, so you want to pull from the internet archive WARCs
07:10 🔗 SketchCow https://archive.org/details/gurochan_archive_2006-2010
07:10 🔗 SketchCow (Obviously not perfect, I just happened to notice this)
07:12 🔗 ryonaloli SketchCow: that's only the images with unix timestamps, and we already have those. what we need is the original thread structure
07:12 🔗 SketchCow Anyway, what I'm seeing here is that archive.org has semi-irregular grabs.
07:12 🔗 ryonaloli how do i use WARCs? i looked it up but could only find descriptions of the format, not how to create it
07:13 🔗 SketchCow But it's probably what you're looking for.
07:13 🔗 SketchCow You can probably yank from the wayback.
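[Editor's note: "yanking from the wayback" usually starts with the Wayback Machine's CDX index, which lists every capture under a URL prefix. A minimal sketch of building such a query (the CDX endpoint is real; the helper function is hypothetical):]

```python
from urllib.parse import urlencode

def cdx_query_url(site, **params):
    """Build a Wayback CDX API query listing all captures under a URL prefix."""
    base = "http://web.archive.org/cdx/search/cdx"
    query = {"url": site, "matchType": "prefix", "output": "json"}
    query.update(params)  # e.g. filter="statuscode:200"
    return base + "?" + urlencode(query)

# List successful captures of the site discussed in this log:
print(cdx_query_url("gurochan.net/", filter="statuscode:200"))
```

Fetching that URL returns one row per capture (urlkey, timestamp, original URL, mimetype, status, digest, length), which is enough to drive a scraper.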
07:14 🔗 ryonaloli i'm not sure how to use them to take from wayback
07:16 🔗 SketchCow http://waybackdownloader.com/
07:16 🔗 SketchCow Maybe
07:16 🔗 SketchCow I'm looking for utilities.
07:17 🔗 ryonaloli >pricing and order form
07:18 🔗 SketchCow Yes.
07:19 🔗 SketchCow $15, might not be bad.
07:19 🔗 ryonaloli we're already on a tight budget to run the current site. we can't afford to spend $15 when there's probably a way to do it even with a firefox macro
07:19 🔗 SketchCow Otherwise, write a script and scrape like crazy.
07:20 🔗 SketchCow Sounds like you have it all under control. Good luck.
07:20 🔗 ryonaloli it requires javascript to view pages though, right?
07:20 🔗 SketchCow No idea.
07:21 🔗 ryonaloli hm
07:22 🔗 ryonaloli it seems the web archive tries its hardest to make scraping impossible
07:33 🔗 ryonaloli heh, that site's faq links all go to 404
07:48 🔗 yipdw you can check out https://github.com/alard/warc-proxy
07:48 🔗 yipdw it's a tool which reads WARCs and reconstructs HTTP responses from those WARCs
07:49 🔗 ryonaloli but how do i create a warc?
07:49 🔗 midas but remember kids, just because it's an archive file doesn't make it a backup.
07:49 🔗 midas ryonaloli: wget has a special flag for that
07:50 🔗 yipdw you can also use wpull --warc-file
07:51 🔗 yipdw if there's a bunch of WARCs in a tarball, you can use https://github.com/ArchiveTeam/megawarc
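[Editor's note: since ryonaloli asked what a WARC even is: a WARC file is just a sequence of records, each a "WARC/1.0" version line, header lines, a blank line, then the payload. A minimal parser sketch for one record header (synthetic record, not an official library):]

```python
def parse_warc_header(data: bytes):
    """Parse one WARC record header: version line plus key: value fields,
    terminated by an empty line. Returns (version, headers, body_offset)."""
    head, sep, _ = data.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    version = lines[0].decode()          # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(b":")
        headers[key.decode().strip()] = value.decode().strip()
    return version, headers, len(head) + len(sep)

record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"Content-Length: 5\r\n"
          b"\r\n"
          b"hello")
version, headers, off = parse_warc_header(record)
print(version, headers["WARC-Type"],
      record[off:off + int(headers["Content-Length"])])
```

Tools like warc-proxy do this at scale (plus gzip handling and HTTP reconstruction), but the format itself is this simple.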
07:52 🔗 yipdw I'm not sure why you need to create a WARC to retrieve thread structure from (some hypothetical) WARC, though
07:52 🔗 ryonaloli i'm still not sure how to turn a wayback link into a warc
07:52 🔗 yipdw oh, the Wayback Machine's WARCs aren't publicly accessible
07:52 🔗 yipdw well, most of them aren't, but that's not an important detail
07:52 🔗 midas neither are we ryonaloli, making a warc from wayback would only recreate the wayback http response
07:53 🔗 ryonaloli heh
07:53 🔗 midas besides that, grabbing all of the wayback machine might fill your drive up pretty fast
07:53 🔗 ryonaloli then what would be the best way to scrape a site without paying $15?
07:53 🔗 ryonaloli all? nah, just a website with <10 gigs
07:53 🔗 midas pay 15 bucks.
07:53 🔗 midas just pay the 15 bucks.
07:53 🔗 yipdw you could write your own scraper
07:55 🔗 ryonaloli i'm not sure how i'd write it if archive.org tries its best to block those. as for the $15, this is for a site with a very low budget
07:55 🔗 yipdw looking at gurochan.net captures it doesn't seem like it'd be all that difficult
07:55 🔗 yipdw eh?
07:55 🔗 yipdw I've never been blocked from downloading on any archive.org subdomain
07:55 🔗 yipdw what gave you the impression that you'd be blocked?
07:55 🔗 yipdw I mean, okay, maybe if you consume a ridiculous proportion of their bandwidth
07:55 🔗 yipdw but you don't need to do that
07:56 🔗 ryonaloli oh, i looked it up and most answers said it requires javascript to get internal links
07:56 🔗 yipdw what does, Wayback?
07:56 🔗 ryonaloli i think so
07:58 🔗 yipdw I don't know what that means
07:58 🔗 yipdw I can access any archived URL on gurochan with curl
07:59 🔗 yipdw e.g. $ curl -vvv 'http://web.archive.org/web/20100611210558/http://gurochan.net/dis/res/1109.html' works
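[Editor's note: one detail that helps when scraping captures like this: inserting `id_` after the timestamp in a Wayback URL makes the Wayback Machine serve the original response without its injected toolbar and rewritten links. The `id_` modifier is a real Wayback feature; the rewrite helper is a sketch:]

```python
import re

def raw_capture_url(wayback_url):
    """Rewrite /web/<14-digit timestamp>/ to /web/<timestamp>id_/ so the
    Wayback Machine returns the unmodified archived response."""
    return re.sub(r"(/web/\d{14})/", r"\1id_/", wayback_url, count=1)

print(raw_capture_url(
    "http://web.archive.org/web/20100611210558/http://gurochan.net/dis/res/1109.html"))
# -> http://web.archive.org/web/20100611210558id_/http://gurochan.net/dis/res/1109.html
```

That removes the main reason a naive scrape of Wayback pages looks "javascript-only": the wrapper chrome, not the archived content itself.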
07:59 🔗 ryonaloli hm, i'll probably have to try again then
07:59 🔗 yipdw I don't know where you read that accessing Wayback either (a) results in bans or (b) requires Javascript
08:00 🔗 yipdw wherever you read that is wrong
08:14 🔗 ryonaloli when i try "wget -np -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org 'https://web.archive.org/web/20140106164316/http://gurochan.net/'", only the main page is downloaded, it doesn't go into any other links that don't have '20140106164316'
08:14 🔗 ryonaloli how do i let it go recursively into the rest without it trying to archive all of archive.org?
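[Editor's note: one way to keep a recursive grab on target is to follow only links that are Wayback captures of the site itself, whatever their timestamp. A sketch of such a filter (the helper is hypothetical; with wget itself, the `--accept-regex` option can express the same pattern):]

```python
import re

# Follow only Wayback URLs that are captures of gurochan.net, any timestamp.
CAPTURE = re.compile(
    r"^https?://web\.archive\.org/web/\d{14}(?:id_)?/"
    r"https?://(?:www\.)?gurochan\.net(/|$)")

def should_follow(url):
    """Return True if this link is a capture of the target site."""
    return bool(CAPTURE.match(url))

print(should_follow("https://web.archive.org/web/20140106164316/http://gurochan.net/dis/"))
print(should_follow("https://web.archive.org/web/20140106164316/http://example.com/"))
```

This follows captures across timestamps (which Wayback's internal links require) without wandering off into the rest of archive.org.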
09:22 🔗 midas I still don't understand what you're trying to do. but i'd start with getting this warc file: https://archive.org/details/gurochan_archive_2006-2010
09:23 🔗 midas grab the warc-proxy and start working from there.
09:23 🔗 midas FYI, warc proxy has a readme.
09:23 🔗 ryonaloli i already have that file. midas: what i'm trying to do is retrieve the threads from the archive. the wget command doesn't seem to recursively follow links
09:24 🔗 midas that's because you're trying to use the wayback machine, it's not made for doing that
09:24 🔗 midas warc proxy + that warc file, should be enough to get you going
09:24 🔗 ryonaloli that's the only thing i can use though. that 2006-2010 archive is just a bunch of images, not the threads or the original filenames
09:41 🔗 midas ryonaloli: wget -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org https://web.archive.org/web/20140207233054/http://gurochan.net/
09:42 🔗 midas grabs it all, good luck getting it into something useful, can't help you with that
09:44 🔗 ryonaloli midas: does that also grab every previous version?
09:45 🔗 midas everything is everything ryonaloli
09:46 🔗 ryonaloli damn, that's gotta be hundreds of gb
09:46 🔗 midas ..
09:47 🔗 nico probably not
09:52 🔗 ryonaloli but there are over a hundred snapshots, and the whole site is 7gb iirc
09:52 🔗 midas well use the warc proxy.
09:52 🔗 midas now you're getting all of the snapshots, all of them
09:52 🔗 midas that's what you wanted.
10:02 🔗 ryonaloli how will the warc proxy be different?
10:02 🔗 ryonaloli and, i didn't want all of the snapshots. just the most recent one for each page
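[Editor's note: "just the most recent capture of each page" is exactly the question the CDX index answers: fetch the capture list and keep the newest timestamp per original URL. A sketch over CDX-style lines (field order follows the CDX API's default `urlkey timestamp original ...` columns; the sample rows are made up):]

```python
def latest_captures(cdx_lines):
    """Keep only the newest capture per original URL from CDX text lines
    (default field order: urlkey, timestamp, original, ...)."""
    newest = {}
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        timestamp, original = fields[1], fields[2]
        # 14-digit timestamps compare correctly as strings
        if original not in newest or timestamp > newest[original]:
            newest[original] = timestamp
    return newest

sample = [
    "net,gurochan)/ 20100611210558 http://gurochan.net/ text/html 200 ABC 123",
    "net,gurochan)/ 20140106164316 http://gurochan.net/ text/html 200 DEF 456",
]
print(latest_captures(sample))
```

Each (url, timestamp) pair then maps straight to one Wayback fetch, so the download is one copy of the site rather than every snapshot.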
11:51 🔗 fexx any plans to grab 800notes.com / other phone number indexing sites?
11:53 🔗 ersi None what I know of, but anyone can do what they please. If it's interesting, feel free to take 'em on
12:02 🔗 schbirid https://pay.reddit.com/r/opendirectories/comments/25002s/meta_a_tool_for_tree_mapping_remote_directories/
12:04 🔗 schbirid not very useful output, http://dirmap.krakissi.net/?path=https%3A%2F%2Fwww.quaddicted.com%2Ffiles%2Fmaps%2F
12:05 🔗 midas so it has a open dir and loops it to find all files
12:06 🔗 schbirid wget --spider -nv and some regexping is more suitable for people like us
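[Editor's note: the "regexping" half of that can be as simple as pulling URLs back out of the spider log. A sketch (the log snippet is invented; real `wget -nv` output varies by version):]

```python
import re

URL_RE = re.compile(r"https?://\S+")

def urls_from_log(log_text):
    """Extract every URL mentioned in a wget spider/log dump."""
    return URL_RE.findall(log_text)

log = """2014-05-08 12:06:00 URL: https://www.quaddicted.com/files/maps/ 200 OK
2014-05-08 12:06:01 URL: https://www.quaddicted.com/files/maps/foo.zip 200 OK"""
print(urls_from_log(log))
```

The resulting list can be deduplicated, filtered, and fed back into a downloader, which is the tree-mapping workflow without actually downloading anything during the spider pass.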
12:07 🔗 midas with the strange twitch of downloading everything
12:08 🔗 schbirid it does not download everything
12:11 🔗 midas spider doesn't, but people like us do
12:11 🔗 midas ;-)
12:11 🔗 schbirid >:)
14:15 🔗 DFJustin wow you guys fail at reading comprehension
14:15 🔗 DFJustin what he needs is https://code.google.com/p/warrick/ but he's gone now naturally
14:38 🔗 SketchCow There you go.
14:38 🔗 SketchCow The $15 thing didn't inspire me to keep going on.
14:53 🔗 midas SketchCow: next time: http://archiveteam.org/index.php?title=Restoring (will add more data tonight, mostly made by DFJustin now)
14:53 🔗 midas aka, all made by him atm :p
14:56 🔗 SketchCow Yeah, then we won't have to break someone's back suggesting $15
14:58 🔗 midas well, we can just point
15:36 🔗 balrog SketchCow: it doesn't inspire me either
15:37 🔗 DFJustin it's a recurring problem so it's worth documenting
15:41 🔗 SketchCow Agreed, absolutely.
15:43 🔗 balrog I'd put a disclaimer saying we don't endorse that paid service though
15:45 🔗 DFJustin you know what they say about wikis
15:46 🔗 SketchCow Everybody's got one
15:46 🔗 DFJustin that too
15:53 🔗 SketchCow Using the internetarchive python interface.
15:53 🔗 SketchCow Hardcore.
15:53 🔗 SketchCow Running into bugs and limits, so you know I'm being cruel
16:34 🔗 midas badass.py
16:48 🔗 SketchCow https://archive.org/details/gg_Aerial_Assault_Rev_1_1992_Sega
16:48 🔗 SketchCow Title, year and creator added by script. Cover and screenshot also.
23:21 🔗 ivan` http://dealbook.nytimes.com/2014/05/08/delicious-social-site-is-sold-by-youtube-founders/
