#archiveteam 2013-07-21,Sun


Time Nickname Message
12:30 🔗 Hitorin Where is the /v/ archive being hosted at again? I am sure this is asked often. I apologize in advance.
18:29 🔗 Schbirid torrentbytes.net says 2013-07-21 03:43:00
18:29 🔗 Schbirid Hello everyone.
18:29 🔗 Schbirid At the moment it seems like the site will cease to exist sometime in early August due to no practical way of funding this fun. For the past 7 months the site has been kept up by a few very generous people combined with some good old-fashioned stubbornness. The site is unlikely to come back as is.
18:29 🔗 Schbirid archiving the torrents is not worth it, they prune after 4 weeks iirc
18:30 🔗 Schbirid might just be money making of course
18:36 🔗 Aranje wow tb
21:46 🔗 arrith1 did someone already get a wget-warc of h-online? http://www.h-online.com/open/news/item/The-H-is-closing-down-1920027.html
21:47 🔗 arrith1 ah, i see this has been linked a couple times, but anyone happen to do a grab?
21:54 🔗 antomatic Cameron's doing one, apparently
22:09 🔗 shaqfu Sheesh. Is there any reasonable way to generalize Tumblr's structure to jam into a script?
22:10 🔗 shaqfu Seems like it's "infinite scroll over /pages/x.html, except when it isn't and it's infinite scroll over /archive/?before_time=[epoch], except when it's just a flat structure"
22:11 🔗 ivan` at least their RSS feeds have the same structure ;)
22:12 🔗 shaqfu Is that enough to reach back through an entire tumblr's history?
22:12 🔗 ivan` no, no way
22:12 🔗 shaqfu Along with styles, themes, etc
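
A minimal sketch of the RSS point above: every Tumblr blog exposes a feed at a fixed path, but it only carries the most recent posts, which is why it cannot reach back through a blog's full history on its own. The blog name is a placeholder and the depth check is purely illustrative.

    # Fetch a Tumblr blog's RSS feed and see how far back it actually goes.
    # "example" is a hypothetical blog; the feed typically holds only the
    # newest posts, so it is not enough for a full-history grab.
    import urllib.request
    import xml.etree.ElementTree as ET

    with urllib.request.urlopen("http://example.tumblr.com/rss") as resp:
        tree = ET.parse(resp)

    items = tree.findall(".//item")
    print("%d posts in the feed" % len(items))
    for item in items:
        print(item.findtext("pubDate"), item.findtext("link"))
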
22:12 🔗 ivan` did you get the tumblr URL list from ersi?
22:13 🔗 shaqfu No; is that on AT's github?
22:13 🔗 shaqfu I thought that only worked if there was no JS scrolling
22:14 🔗 ivan` not at the github, https://ludios.org/tmp/greader-db-com.tumblr.bz2
22:14 🔗 ivan` (just domains)
22:14 🔗 shaqfu Ah, thanks
22:15 🔗 shaqfu I suppose it may be possible to script something, but it'll have tons of checks for structure
22:15 🔗 ivan` based on my use of tumblr's RSS feeds, tumblr does have an HTML page for each post
22:16 🔗 shaqfu Ah; I'm trying to reach back and grab everything from beginning->now on an arbitrary site
22:17 🔗 shaqfu I suppose it's possible to grab each post separately and forget the top-level structure
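
The "each post has its own HTML page" observation suggests the fallback shaqfu describes: ignore the top-level theme and fetch post pages one by one. A hedged sketch, assuming you already have a list of post URLs from somewhere (feeds, an /archive crawl, etc.); the URLs below are placeholders.

    # Fetch each post's standalone HTML page and save it locally.
    # The post URLs are hypothetical; in practice they come from whatever
    # index of the blog you managed to build.
    import urllib.request

    post_urls = [
        "http://example.tumblr.com/post/12345678901/first-post-slug",
        "http://example.tumblr.com/post/12345678902/second-post-slug",
    ]

    for url in post_urls:
        post_id = url.split("/post/")[1].split("/")[0]
        with urllib.request.urlopen(url) as resp:
            html = resp.read()
        with open("post-%s.html" % post_id, "wb") as f:
            f.write(html)
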
22:28 🔗 shaqfu ivan`, what exactly is this list?
22:39 🔗 ivan` shaqfu: tumblr subdomains, don't know if you need them, I assumed you were backing up all of tumblr or something
22:40 🔗 ivan` if not, that would be a nice project ;)
22:40 🔗 shaqfu Oh, no. I'm trying to put something together to fire at a specific site and guarantee a complete crawl+warc
22:40 🔗 shaqfu But it seems like you'd have to somehow detect the structure and crawl based on that
22:40 🔗 shaqfu The sites I was sampling all had wildly different structures :(
22:41 🔗 shaqfu Sometimes you can just --mirror; other times you crawl /pages/x.html and capture the JS script providing infinite scroll
22:41 🔗 shaqfu Except for other times, when you need to crawl /archive/before_time=x
22:42 🔗 shaqfu And I'm sure there are even crazier site structures out there :)
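
One way to read the /archive/?before_time=x case shaqfu keeps running into: step an epoch timestamp backwards until the archive stops yielding new post links. A rough sketch under heavy assumptions: the blog name, the parameter name, and the regexes are guesses about one particular theme, and themes that use plain /pages/x.html or a flat layout need entirely different handling.

    # Walk /archive/?before_time= backwards from "now", collecting post URLs,
    # until a page yields nothing new. All selectors here are assumptions
    # about the theme; this is a sketch, not a general Tumblr crawler.
    import re
    import time
    import urllib.request

    blog = "http://example.tumblr.com"
    before = int(time.time())
    seen = set()

    while True:
        url = "%s/archive/?before_time=%d" % (blog, before)
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", "replace")
        posts = re.findall(r'href="(%s/post/\d+[^"]*)"' % re.escape(blog), html)
        new = [p for p in posts if p not in seen]
        if not new:
            break  # nothing older turned up; assume we walked off the end
        seen.update(new)
        # assume the page links to older chunks via further before_time= values
        stamps = [int(s) for s in re.findall(r"before_time=(\d+)", html)]
        if not stamps or min(stamps) >= before:
            break
        before = min(stamps)

    print("%d post URLs discovered" % len(seen))
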
22:43 🔗 shaqfu (are any of those subdomains crawled, btw?)
22:44 🔗 ivan` all of the Google Reader feed caches for those were grabbed, when available
22:44 🔗 ivan` no images, sadly
22:44 🔗 ivan` so it is not really tumblr, just a hollow shell of tumblr ;)
22:44 🔗 shaqfu Gotcha. Seems like it's easy to get an okay crawl, but really hard to get a 100% one
