12:30 <Hitorin> Where is the /v/ archive being hosted again? I am sure this is asked often. I apologize in advance.
18:29 <Schbirid> At the moment it seems like the site will cease to exist sometime in early August due to no practical way of funding this fun. For the past 7 months the site has been kept up by a few very generous people, combined with some good old-fashioned stubbornness. The site is unlikely to come back as is.
18:29 <Schbirid> Hello everyone.
18:29 <Schbirid> torrentbytes.net says 2013-07-21 03:43:00
18:29 <Schbirid> archiving the torrents is not worth it, they prune after 4 weeks iirc
18:30 <Schbirid> might just be money making of course
18:36 <Aranje> wow tb
21:46 <arrith1> did someone already get a wget-warc of h-online? http://www.h-online.com/open/news/item/The-H-is-closing-down-1920027.html
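(For context, a "wget-warc" grab is just a recursive wget crawl that also records a WARC file. A minimal sketch of how such an invocation could be assembled; the helper name and the choice of flags are illustrative, and the function only builds the command string rather than running it:)

```python
import shlex

def warc_wget_cmd(url, warc_name):
    """Build (but don't run) a wget invocation that records a WARC.

    Pass the result through shlex.split() to subprocess.run() to
    actually perform the crawl.
    """
    return shlex.join([
        "wget",
        "--mirror",                  # recursive crawl with timestamping
        "--page-requisites",         # also fetch images/CSS needed to render pages
        f"--warc-file={warc_name}",  # write request/response records to a WARC
        url,
    ])

print(warc_wget_cmd("http://www.h-online.com/", "h-online"))
```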
21:47 <arrith1> ah, i see this has been linked a couple times, but did anyone happen to do a grab?
21:54 <antomatic> Cameron's doing one, apparently
22:09 <shaqfu> Sheesh. Is there any reasonable way to generalize Tumblr's structure to jam into a script?
22:10 <shaqfu> Seems like it's "infinite scroll over /pages/x.html, except when you don't and instead infinite scroll over /archive/?before_time=[epoch], except when it's just a flat structure"
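(The two pagination schemes described here can be sketched as plain URL generators. The patterns are copied from the discussion; the blog name, timestamps, and step size below are made up, and nothing here touches the network:)

```python
def page_urls(blog, n_pages):
    """Flat pagination: the /pages/x.html style."""
    return [f"http://{blog}.tumblr.com/pages/{i}.html"
            for i in range(1, n_pages + 1)]

def archive_urls(blog, newest_epoch, step_seconds, n_pages):
    """Timestamp walking: the /archive/?before_time=[epoch] style,
    stepping backwards from the newest post's epoch timestamp."""
    return [
        f"http://{blog}.tumblr.com/archive/?before_time={newest_epoch - i * step_seconds}"
        for i in range(n_pages)
    ]
```

(A real crawler would not fix `n_pages` in advance; it would keep fetching until a page comes back empty or repeats.)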
22:11 <ivan`> at least their RSS feeds have the same structure ;)
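(Because the feeds share one structure, extracting per-post permalinks is uniform across blogs. A minimal sketch against a made-up feed document; as noted just below, feeds only expose recent posts, so this cannot reach a blog's full history:)

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed standing in for a real tumblr RSS response.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>post one</title><link>http://example.tumblr.com/post/1</link></item>
  <item><title>post two</title><link>http://example.tumblr.com/post/2</link></item>
</channel></rss>"""

def post_links(rss_text):
    """Return the per-post permalinks from an RSS 2.0 feed body."""
    root = ET.fromstring(rss_text)
    return [item.findtext("link") for item in root.iter("item")]
```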
22:12 <shaqfu> Is that enough to reach back through an entire tumblr's history?
22:12 <ivan`> no, no way
22:12 <shaqfu> Along with styles, themes, etc.
22:12 <ivan`> did you get the tumblr URL list from ersi?
22:13 <shaqfu> No; is that on AT's github?
22:13 <shaqfu> I thought that only worked if there was no JS scrolling
22:14 <ivan`> not at the github, https://ludios.org/tmp/greader-db-com.tumblr.bz2
22:14 <ivan`> (just domains)
22:14 <shaqfu> Ah, thanks
22:15 <shaqfu> I suppose it may be possible to script something, but it'll have tons of checks for structure
22:15 <ivan`> based on my use of tumblr's RSS feeds, tumblr does have an HTML page for each post
22:16 <shaqfu> Ah; I'm trying to reach back and grab everything from beginning->now on an arbitrary site
22:17 <shaqfu> I suppose it's possible to grab each post separately and forget the top-level structure
22:28 <shaqfu> ivan`, what exactly is this list?
22:39 <ivan`> shaqfu: tumblr subdomains, don't know if you need them, I assumed you were backing up all of tumblr or something
22:40 <ivan`> if not, that would be a nice project ;)
22:40 <shaqfu> Oh, no. I'm trying to put something together to fire at a specific site and guarantee a complete crawl+warc
22:40 <shaqfu> But it seems like you'd have to somehow detect the structure and crawl based on that
22:40 <shaqfu> The sites I was sampling all had wildly different structures :(
22:41 <shaqfu> Sometimes you can just --mirror; other times you crawl /pages/x.html and capture the JS script providing the infinite scroll
22:41 <shaqfu> Except for other times, when you need to crawl /archive/?before_time=x
22:42 <shaqfu> And I'm sure there are even crazier site structures out there :)
22:43 <shaqfu> (are any of those subdomains crawled, btw?)
22:44 <ivan`> all of the Google Reader feed caches for those were grabbed, when available
22:44 <ivan`> no images, sadly
22:44 <ivan`> so it is not really tumblr, just a hollow shell of tumblr ;)
22:44 <shaqfu> Gotcha. Seems like it's easy to get an okay crawl, but really hard to get a 100% one