[12:30] Where is the /v/ archive being hosted at again? I am sure this is asked often. I apologize in advance. [18:29] At the moment it seems like the site will cease to exist sometime in early August due to no practical way of funding this fun. The past 7 months the site has been kept up by a few very generous people combined with some good old-fashioned stubbornes. The site is unlikely to come back as is. [18:29] Hello everyone. [18:29] torrentbytes.net says 2013-07-21 03:43:00 [18:29] archiving the torrents is not worth it, they prune after 4 weeks iirc [18:30] might just be money making of course [18:36] wow tb [21:46] did someone already get a wget-warc of h-online? http://www.h-online.com/open/news/item/The-H-is-closing-down-1920027.html [21:47] ah, i see this has been linked a couple times, but anyone happen to do a grab? [21:54] Cameron's doing one, apparently [22:09] Sheesh. Is there any reasonable way to generalize Tumblr's structure to jam into a script? [22:10] Seems like there's "inf scroll over /pages/x.html, except when you don't and inf scroll over /archive/?before_time=[epoch], except when it's just flat structure" [22:11] at least their RSS feeds have the same structure ;) [22:12] Is that enough to reach back through an entire tumblr's history? [22:12] no, no way [22:12] Along with styles, themes, etc [22:12] did you get the tumblr URL list from ersi? [22:13] No; is that on AT's github? [22:13] I thought that only worked if there was no JS scrolling [22:14] not at the github, https://ludios.org/tmp/greader-db-com.tumblr.bz2 [22:14] (just domains) [22:14] Ah, thanks [22:15] I suppose it may be possible to script something, but it'll have tons of checks for structure [22:15] based on my use of tumblr's RSS feeds, tumblr does have an HTML page for each post [22:16] Ah; I'm trying to reach back and grab everything from beginning->now on an arbitrary site [22:17] I suppose it's possible to grab each post separately and forget the top-level structure [22:28] ivan`, what exactly is this list? [22:39] shaqfu: tumblr subdomains, don't know if you need them, I assumed you were backing up all of tumblr or something [22:40] if not, that would be a nice project ;) [22:40] Oh, no. I'm trying to put something together to fire at a specific site and guarantee a complete crawl+warc [22:40] But it seems like you'd have to somehow detect the structure and crawl based on that [22:40] The sites I was sampling all had wildly different structures :( [22:41] Sometimes you can just --mirror, other times, crawl /pages/x.html and capture the JS script providing infinite scroll [22:41] Except for other times, when you need to crawl /archive/before_time=x [22:42] And I'm sure there are even crazier site structures out there :) [22:43] (are any of those subdomained crawled, btw?) [22:44] all of the Google Reader feed caches for those were grabbed, when available [22:44] no images, sadly [22:44] so it is not really tumblr, just a hollow shell of tumblr ;) [22:44] Gotcha. Seems like it's easy to get an okay crawl, but really hard to get a 100% one