12:30 <Hitorin> Where is the /v/ archive being hosted again? I am sure this is asked often. I apologize in advance.
18:29 <Schbirid> At the moment it seems like the site will cease to exist sometime in early August due to no practical way of funding this fun. For the past 7 months the site has been kept up by a few very generous people, combined with some good old-fashioned stubbornness. The site is unlikely to come back as is.
18:29 <Schbirid> Hello everyone.
18:29 <Schbirid> torrentbytes.net says 2013-07-21 03:43:00
18:29 <Schbirid> archiving the torrents is not worth it, they prune after 4 weeks iirc
18:30 <Schbirid> might just be money making of course
18:36 <Aranje> wow tb
21:46 <arrith1> did someone already get a wget-warc of h-online? http://www.h-online.com/open/news/item/The-H-is-closing-down-1920027.html
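(For context, a "wget-warc" grab is just a recursive wget crawl that also records a WARC file. A minimal sketch of how such an invocation could be assembled; the helper name and the choice of flags are illustrative, and the function only builds the command string rather than running it:)

```python
import shlex

def warc_wget_cmd(url, warc_name):
    """Build (but don't run) a wget invocation that records a WARC.

    Pass the result through shlex.split() to subprocess.run() to
    actually perform the crawl.
    """
    return shlex.join([
        "wget",
        "--mirror",                  # recursive crawl with timestamping
        "--page-requisites",         # also fetch images/CSS needed to render pages
        f"--warc-file={warc_name}",  # write request/response records to a WARC
        url,
    ])

print(warc_wget_cmd("http://www.h-online.com/", "h-online"))
```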
21:47 <arrith1> ah, i see this has been linked a couple times, but did anyone happen to do a grab?
21:54 <antomatic> Cameron's doing one, apparently
22:09 <shaqfu> Sheesh. Is there any reasonable way to generalize Tumblr's structure to jam into a script?
22:10 <shaqfu> Seems like it's "infinite scroll over /pages/x.html, except when you don't and instead infinite scroll over /archive/?before_time=[epoch], except when it's just a flat structure"
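(The two pagination schemes described here can be sketched as plain URL generators. The patterns are copied from the discussion; the blog name, timestamps, and step size below are made up, and nothing here touches the network:)

```python
def page_urls(blog, n_pages):
    """Flat pagination: the /pages/x.html style."""
    return [f"http://{blog}.tumblr.com/pages/{i}.html"
            for i in range(1, n_pages + 1)]

def archive_urls(blog, newest_epoch, step_seconds, n_pages):
    """Timestamp walking: the /archive/?before_time=[epoch] style,
    stepping backwards from the newest post's epoch timestamp."""
    return [
        f"http://{blog}.tumblr.com/archive/?before_time={newest_epoch - i * step_seconds}"
        for i in range(n_pages)
    ]
```

(A real crawler would not fix `n_pages` in advance; it would keep fetching until a page comes back empty or repeats.)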
22:11 <ivan`> at least their RSS feeds have the same structure ;)
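(Because the feeds share one structure, extracting per-post permalinks is uniform across blogs. A minimal sketch against a made-up feed document; as noted just below, feeds only expose recent posts, so this cannot reach a blog's full history:)

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed standing in for a real tumblr RSS response.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <item><title>post one</title><link>http://example.tumblr.com/post/1</link></item>
  <item><title>post two</title><link>http://example.tumblr.com/post/2</link></item>
</channel></rss>"""

def post_links(rss_text):
    """Return the per-post permalinks from an RSS 2.0 feed body."""
    root = ET.fromstring(rss_text)
    return [item.findtext("link") for item in root.iter("item")]
```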
22:12 <shaqfu> Is that enough to reach back through an entire tumblr's history?
22:12 <ivan`> no, no way
22:12 <shaqfu> Along with styles, themes, etc.
22:12 <ivan`> did you get the tumblr URL list from ersi?
22:13 <shaqfu> No; is that on AT's github?
22:13 <shaqfu> I thought that only worked if there was no JS scrolling
22:14 <ivan`> not at the github, https://ludios.org/tmp/greader-db-com.tumblr.bz2
22:14 <ivan`> (just domains)
22:14 <shaqfu> Ah, thanks
22:15 <shaqfu> I suppose it may be possible to script something, but it'll have tons of checks for structure
22:15 <ivan`> based on my use of tumblr's RSS feeds, tumblr does have an HTML page for each post
22:16 <shaqfu> Ah; I'm trying to reach back and grab everything from beginning->now on an arbitrary site
22:17 <shaqfu> I suppose it's possible to grab each post separately and forget the top-level structure
22:28 <shaqfu> ivan`, what exactly is this list?
22:39 <ivan`> shaqfu: tumblr subdomains, don't know if you need them, I assumed you were backing up all of tumblr or something
22:40 <ivan`> if not, that would be a nice project ;)
22:40 <shaqfu> Oh, no. I'm trying to put something together to fire at a specific site and guarantee a complete crawl+warc
22:40 <shaqfu> But it seems like you'd have to somehow detect the structure and crawl based on that
22:40 <shaqfu> The sites I was sampling all had wildly different structures :(
22:41 <shaqfu> Sometimes you can just --mirror; other times you crawl /pages/x.html and capture the JS script providing the infinite scroll
22:41 <shaqfu> Except for other times, when you need to crawl /archive/?before_time=x
22:42 <shaqfu> And I'm sure there are even crazier site structures out there :)
22:43 <shaqfu> (are any of those subdomains crawled, btw?)
22:44 <ivan`> all of the Google Reader feed caches for those were grabbed, when available
22:44 <ivan`> no images, sadly
22:44 <ivan`> so it is not really tumblr, just a hollow shell of tumblr ;)
22:44 <shaqfu> Gotcha. Seems like it's easy to get an okay crawl, but really hard to get a 100% one