| Time | Nickname | Message | 
    
        | 00:31
            
                🔗 | ryonaloli | anyone have advice on scraping a site from archive.org with wget? | 
    
        | 00:32
            
                🔗 | APerti | Can't you just get the WARCs for the site? | 
    
        | 00:33
            
                🔗 | ryonaloli | i have no idea what those are :/ | 
    
        | 00:33
            
                🔗 | ryonaloli | i'm kinda new to this | 
    
        | 01:53
            
                🔗 | SketchCow | What are you trying to get? | 
    
        | 02:53
            
                🔗 | zenguy_pc | hi | 
    
        | 02:54
            
                🔗 | zenguy_pc | i was thinking about archiving fsrn.org, i saw that it's under a creative commons license.. is that being archived already? | 
    
        | 02:57
            
                🔗 | giganticp | wat is sekrit word | 
    
        | 02:57
            
                🔗 | zenguy_pc | ? | 
    
        | 02:57
            
                🔗 | balrog | The secret word is "yahoosucks" | 
    
        | 02:58
            
                🔗 | zenguy_pc | https://clbin.com/mgntE | 
    
        | 02:58
            
                🔗 | zenguy_pc | if anyone is interested | 
    
        | 02:58
            
                🔗 | zenguy_pc | didn't grab it yet | 
    
        | 04:55
            
                🔗 | Jonimus | trying to emulate old games which had even basic copy protections sucks. | 
    
        | 04:57
            
                🔗 | APerti | Which games/copy protection? | 
    
        | 05:04
            
                🔗 | Jonimus | nvm, it turns out it was a "have you read the manual" check | 
    
        | 05:04
            
                🔗 | Jonimus | and of course Archive.org has a scan of the manual ;D | 
    
        | 05:08
            
                🔗 | APerti | Nice. | 
    
        | 06:25
            
                🔗 | ryonaloli | <@SketchCow> What are you trying to get? | 
    
        | 06:25
            
                🔗 | ryonaloli | an imageboard called gurochan which died weeks ago. i'm part of the team that created a new one but we need the original's archives | 
    
        | 06:28
            
                🔗 | ryonaloli | https://web.archive.org/web/20140106164316/http://gurochan.net/ (link is sfw, following links will be nsfw) | 
    
        | 07:09
            
                🔗 | SketchCow | OK, so you want to pull from the internet archive WARCs | 
    
        | 07:10
            
                🔗 | SketchCow | https://archive.org/details/gurochan_archive_2006-2010 | 
    
        | 07:10
            
                🔗 | SketchCow | (Obviously not perfect, I just happened to notice this) | 
    
        | 07:12
            
                🔗 | ryonaloli | SketchCow: that's only the images with unix timestamps, and we already have those. what we need is the original thread structure | 
    
        | 07:12
            
                🔗 | SketchCow | Anyway, what I'm seeing here is that archive.org has semi-irregular grabs. | 
    
        | 07:12
            
                🔗 | ryonaloli | how do i use WARCs? i looked it up but could only get descriptions of the format, not how to create one | 
    
        | 07:13
            
                🔗 | SketchCow | But it's probably what you're looking for. | 
    
        | 07:13
            
                🔗 | SketchCow | You can probably yank from the wayback. | 
    
        | 07:14
            
                🔗 | ryonaloli | i'm not sure how to use them to take from wayback | 
    
        | 07:16
            
                🔗 | SketchCow | http://waybackdownloader.com/ | 
    
        | 07:16
            
                🔗 | SketchCow | Maybe | 
    
        | 07:16
            
                🔗 | SketchCow | I'm looking for utilities. | 
    
        | 07:17
            
                🔗 | ryonaloli | >pricing and order form | 
    
        | 07:18
            
                🔗 | SketchCow | Yes. | 
    
        | 07:19
            
                🔗 | SketchCow | $15, might not be bad. | 
    
        | 07:19
            
                🔗 | ryonaloli | we're already on a tight budget to run the current site. we can't afford to spend $15 when there's probably a way to do it even with a firefox macro | 
    
        | 07:19
            
                🔗 | SketchCow | Otherwise, write a script and scrape like crazy. | 
    
        | 07:20
            
                🔗 | SketchCow | Sounds like you have it all under control. Good luck. | 
    
        | 07:20
            
                🔗 | ryonaloli | it requires javascript to view pages though, right? | 
    
        | 07:20
            
                🔗 | SketchCow | No idea. | 
    
        | 07:21
            
                🔗 | ryonaloli | hm | 
    
        | 07:22
            
                🔗 | ryonaloli | it seems the web archive tries its hardest to make scraping impossible | 
    
        | 07:33
            
                🔗 | ryonaloli | heh, that site's faq links all go to 404 | 
    
        | 07:48
            
                🔗 | yipdw | you can check out https://github.com/alard/warc-proxy | 
    
        | 07:48
            
                🔗 | yipdw | it's a tool which reads WARCs and reconstructs HTTP responses from those WARCs | 
    
        | 07:49
            
                🔗 | ryonaloli | but how do i create a warc? | 
    
        | 07:49
            
                🔗 | midas | but remember kids, just because it's an archive file doesn't make it a backup. | 
    
        | 07:49
            
                🔗 | midas | ryonaloli: wget has a special flag for that | 
    
        | 07:50
            
                🔗 | yipdw | you can also use wpull --warc-file | 
    
        | 07:51
            
                🔗 | yipdw | if there's a bunch of WARCs in a tarball, you can use https://github.com/ArchiveTeam/megawarc | 
    
        | 07:52
            
                🔗 | yipdw | I'm not sure why you need to create a WARC to retrieve thread structure from (some hypothetical) WARC, though | 
    
        | 07:52
            
                🔗 | ryonaloli | i'm still not sure how to turn a wayback link into a warc | 
    
        | 07:52
            
                🔗 | yipdw | oh, the Wayback Machine's WARCs aren't publicly accessible | 
    
        | 07:52
            
                🔗 | yipdw | well, most of them aren't, but that's not an important detail | 
    
        | 07:52
            
                🔗 | midas | neither are we ryonaloli, making a warc from wayback would only recreate the wayback http response | 
    
        | 07:53
            
                🔗 | ryonaloli | heh | 
    
        | 07:53
            
                🔗 | midas | besides that, grabbing all of the wayback machine might fill your drive up pretty fast | 
    
        | 07:53
            
                🔗 | ryonaloli | then what would be the best way to scrape a site without paying $15? | 
    
        | 07:53
            
                🔗 | ryonaloli | all? nah, just a website with <10 gigs | 
    
        | 07:53
            
                🔗 | midas | pay 15 bucks. | 
    
        | 07:53
            
                🔗 | midas | just pay the 15 bucks. | 
    
        | 07:53
            
                🔗 | yipdw | you could write your own scraper | 
    
        | 07:55
            
                🔗 | ryonaloli | i'm not sure how i'd write it if archive.org tries its best to block those. as for the $15, this is for a site with a very low budget | 
    
        | 07:55
            
                🔗 | yipdw | looking at gurochan.net captures it doesn't seem like it'd be all that difficult | 
    
        | 07:55
            
                🔗 | yipdw | eh? | 
    
        | 07:55
            
                🔗 | yipdw | I've never been blocked from downloading on any archive.org subdomain | 
    
        | 07:55
            
                🔗 | yipdw | what gave you the impression that you'd be blocked? | 
    
        | 07:55
            
                🔗 | yipdw | I mean, okay, maybe if you consume a ridiculous proportion of their bandwidth | 
    
        | 07:55
            
                🔗 | yipdw | but you don't need to do that | 
    
        | 07:56
            
                🔗 | ryonaloli | oh, i looked it up and most answers said it requires javascript to get internal links | 
    
        | 07:56
            
                🔗 | yipdw | what does, Wayback? | 
    
        | 07:56
            
                🔗 | ryonaloli | i think so | 
    
        | 07:58
            
                🔗 | yipdw | I don't know what that means | 
    
        | 07:58
            
                🔗 | yipdw | I can access any archived URL on gurochan with curl | 
    
        | 07:59
            
                🔗 | yipdw | e.g. $ curl -vvv 'http://web.archive.org/web/20100611210558/http://gurochan.net/dis/res/1109.html' works | 
    
        | 07:59
            
                🔗 | ryonaloli | hm, i'll probably have to try again then | 
    
        | 07:59
            
                🔗 | yipdw | I don't know where you read that accessing Wayback either (a) results in bans or (b) requires Javascript | 
    
        | 08:00
            
                🔗 | yipdw | wherever you read that is wrong | 
    
        | 08:14
            
                🔗 | ryonaloli | when i try "wget -np -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org 'https://web.archive.org/web/20140106164316/http://gurochan.net/'", only the main page is downloaded, it doesn't go into any other links that don't have '20140106164316' | 
    
        | 08:14
            
                🔗 | ryonaloli | how do i let it go recursively into the rest without it trying to archive all of archive.org? | 
    
        | 09:22
            
                🔗 | midas | I still don't understand what you're trying to do. but i'd start with getting this warc file: https://archive.org/details/gurochan_archive_2006-2010 | 
    
        | 09:23
            
                🔗 | midas | grab the warc-proxy and start working from there. | 
    
        | 09:23
            
                🔗 | midas | FYI, warc proxy has a readme. | 
    
        | 09:23
            
                🔗 | ryonaloli | i already have that file. midas: what i'm trying to do is retrieve the threads from the archive. the wget command doesn't seem to recursively follow links | 
    
        | 09:24
            
                🔗 | midas | that's because you're trying to use the wayback machine, it's not made for doing that | 
    
        | 09:24
            
                🔗 | midas | warc proxy + that warc file, should be enough to get you going | 
    
        | 09:24
            
                🔗 | ryonaloli | that's the only thing i can use though. that 2006-2010 archive is just a bunch of images, not the threads or the original filenames | 
    
        | 09:41
            
                🔗 | midas | ryonaloli:  wget -e robots=off --mirror --domains=staticweb.archive.org,web.archive.org https://web.archive.org/web/20140207233054/http://gurochan.net/ | 
    
        | 09:42
            
                🔗 | midas | grabs it all, good luck getting it into something useful, can't help you with that | 
    
        | 09:44
            
                🔗 | ryonaloli | midas: does that also grab every previous version? | 
    
        | 09:45
            
                🔗 | midas | everything is everything ryonaloli | 
    
        | 09:46
            
                🔗 | ryonaloli | damn, that's gotta be hundreds of gb | 
    
        | 09:46
            
                🔗 | midas | .. | 
    
        | 09:47
            
                🔗 | nico | probably not | 
    
        | 09:52
            
                🔗 | ryonaloli | but there are over a hundred snapshots, and the whole site is 7gb iirc | 
    
        | 09:52
            
                🔗 | midas | well use the warc proxy. | 
    
        | 09:52
            
                🔗 | midas | now you're getting all of the snapshots, all of them | 
    
        | 09:52
            
                🔗 | midas | that's what you wanted. | 
    
        | 10:02
            
                🔗 | ryonaloli | how will the warc proxy be different? | 
    
        | 10:02
            
                🔗 | ryonaloli | and, i didn't want all of the snapshots. just all of the most recent one for each page | 
    
        | 11:51
            
                🔗 | fexx | any plans to grab 800notes.com / other phone number indexing sites? | 
    
        | 11:53
            
                🔗 | ersi | None that I know of, but anyone can do what they please. If it's interesting, feel free to take 'em on | 
    
        | 12:02
            
                🔗 | schbirid | https://pay.reddit.com/r/opendirectories/comments/25002s/meta_a_tool_for_tree_mapping_remote_directories/ | 
    
        | 12:04
            
                🔗 | schbirid | not very useful output, http://dirmap.krakissi.net/?path=https%3A%2F%2Fwww.quaddicted.com%2Ffiles%2Fmaps%2F | 
    
        | 12:05
            
                🔗 | midas | so it has an open dir and loops it to find all files | 
    
        | 12:06
            
                🔗 | schbirid | wget --spider -nv and some regexping is more suitable for people like us | 
    
        | 12:07
            
                🔗 | midas | with the strange twitch of downloading everything | 
    
        | 12:08
            
                🔗 | schbirid | it does not download everything | 
    
        | 12:11
            
                🔗 | midas | spider doesn't, but people like us do | 
    
        | 12:11
            
                🔗 | midas | ;-) | 
    
        | 12:11
            
                🔗 | schbirid | >:) | 
    
        | 14:15
            
                🔗 | DFJustin | wow you guys fail at reading comprehension | 
    
        | 14:15
            
                🔗 | DFJustin | what he needs is https://code.google.com/p/warrick/ but he's gone now naturally | 
    
        | 14:38
            
                🔗 | SketchCow | There you go. | 
    
        | 14:38
            
                🔗 | SketchCow | The $15 thing didn't inspire me to keep going on. | 
    
        | 14:53
            
                🔗 | midas | SketchCow: next time: http://archiveteam.org/index.php?title=Restoring (will add more data tonight, mostly made by DFJustin now) | 
    
        | 14:53
            
                🔗 | midas | aka, all made by him atm :p | 
    
        | 14:56
            
                🔗 | SketchCow | Yeah, then we won't have to break someone's back suggesting $15 | 
    
        | 14:58
            
                🔗 | midas | well, we can just point | 
    
        | 15:36
            
                🔗 | balrog | SketchCow: it doesn't inspire me either | 
    
        | 15:37
            
                🔗 | DFJustin | it's a recurring problem so it's worth documenting | 
    
        | 15:41
            
                🔗 | SketchCow | Agreed, absolutely. | 
    
        | 15:43
            
                🔗 | balrog | I'd put a disclaimer saying we don't endorse that paid service though | 
    
        | 15:45
            
                🔗 | DFJustin | you know what they say about wikis | 
    
        | 15:46
            
                🔗 | SketchCow | Everybody's got one | 
    
        | 15:46
            
                🔗 | DFJustin | that too | 
    
        | 15:53
            
                🔗 | SketchCow | Using the internetarchive python interface. | 
    
        | 15:53
            
                🔗 | SketchCow | Hardcore. | 
    
        | 15:53
            
                🔗 | SketchCow | Running into bugs and limits, so you know I'm being cruel | 
    
        | 16:34
            
                🔗 | midas | badass.py | 
    
        | 16:48
            
                🔗 | SketchCow | https://archive.org/details/gg_Aerial_Assault_Rev_1_1992_Sega | 
    
        | 16:48
            
                🔗 | SketchCow | Title, year and creator added by script. Cover and screenshot also. | 
    
        | 23:21
            
                🔗 | ivan` | http://dealbook.nytimes.com/2014/05/08/delicious-social-site-is-sold-by-youtube-founders/ |
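
Note on the 10:02 question (getting only the most recent capture of each page instead of every snapshot): one possible approach, not something anyone in the channel proposed, is to list captures through the Wayback Machine's CDX search endpoint and keep the newest timestamp per URL before downloading anything. The sketch below is a rough illustration under that assumption. The function names and the sample rows are made up for the example, and the field list (`urlkey,timestamp,original`) is just one reasonable choice.

```python
from urllib.parse import urlencode

# CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain):
    """Build a CDX query listing successful captures for a whole domain."""
    params = {
        "url": domain,
        "matchType": "domain",          # include subdomains and all paths
        "output": "json",               # rows come back as JSON arrays
        "fl": "urlkey,timestamp,original",
        "filter": "statuscode:200",     # skip redirects and errors
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def latest_captures(rows):
    """Keep only the newest capture per urlkey.

    rows is a list of [urlkey, timestamp, original] lists. With JSON
    output the first row is a header naming the fields; it is skipped.
    Timestamps are 14-digit strings, so string comparison orders them.
    """
    latest = {}
    for urlkey, timestamp, original in rows:
        if timestamp == "timestamp":    # header row
            continue
        if urlkey not in latest or timestamp > latest[urlkey][0]:
            latest[urlkey] = (timestamp, original)
    return latest


def wayback_raw_url(timestamp, original):
    """URL of one capture; the id_ suffix asks for the unmodified page."""
    return "http://web.archive.org/web/%sid_/%s" % (timestamp, original)


if __name__ == "__main__":
    # Made-up sample rows in the shape described above (not real CDX output).
    sample = [
        ["urlkey", "timestamp", "original"],
        ["net,gurochan)/dis/res/1109.html", "20100611210558",
         "http://gurochan.net/dis/res/1109.html"],
        ["net,gurochan)/dis/res/1109.html", "20090101000000",
         "http://www.gurochan.net/dis/res/1109.html"],
    ]
    print(cdx_query_url("gurochan.net"))
    for timestamp, original in latest_captures(sample).values():
        print(wayback_raw_url(timestamp, original))
```

The resulting list of URLs could then be fetched one at a time with curl or wget, with a pause between requests, which avoids both recursion into unrelated snapshots and pulling every historical copy of the site.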