Time |
Nickname |
Message |
00:03
🔗
|
Coderjoe |
yipdw: big surprise </sarcasm> https://wwws.whitehouse.gov/petitions/!/response/why-we-cant-comment |
00:09
🔗
|
Coderjoe |
underscor: yeah... and it took us over a year to learn of it. gives me the warm fuzzies. how about you? |
00:13
🔗
|
Coderjoe |
also, this AC claims that Ken Silva (cited in that article) DID know: http://it.slashdot.org/comments.pl?sid=2651017&cid=38904163 |
17:20
🔗
|
yipdw |
SketchCow: I'm going to call Proust "done" |
17:20
🔗
|
yipdw |
since it appears to now be stable |
17:20
🔗
|
yipdw |
(for the near future) |
18:18
🔗
|
bsmith094 |
so, anything for the ffnet grab? |
20:32
🔗
|
yipdw |
bsmith094: nothing new from me; I've been busy with other tasks |
20:39
🔗
|
alard |
bsmith094/yipdw: Is ffnet going away soon? |
20:40
🔗
|
alard |
Or is this a no-need-to-hurry long term project that can wait? |
20:43
🔗
|
yipdw |
alard: it is a no-need-to-hurry project |
20:43
🔗
|
yipdw |
(which is why I am not moving urgently on it :P) |
20:45
🔗
|
alard |
Good. |
20:45
🔗
|
yipdw |
I'm currently uploading some last bits of Proust |
20:46
🔗
|
yipdw |
and, after I archive the AMI (just in case SketchCow wants it reinstated or something), will have a machine ready to do more fetches of stuff |
20:46
🔗
|
yipdw |
alard: I gather Tabblo is the next high-priority one |
20:46
🔗
|
yipdw |
? |
20:47
🔗
|
alard |
It could be, but I haven't been able to find a confirmation for that. |
20:48
🔗
|
yipdw |
ok |
20:48
🔗
|
alard |
The only thing I've found is this blog post by a former employee (since November 2010), http://nedbatchelder.com/blog/201201/goodbye_tabblo.html |
20:48
🔗
|
yipdw |
uh, I guess I can switch this EC2 instance over to Mobileme |
20:48
🔗
|
yipdw |
haha |
20:49
🔗
|
yipdw |
ANOTHER storytelling site bites the dust |
20:50
🔗
|
DFJustin |
they haven't announced it's going down but it's on life support at this point |
20:50
🔗
|
alard |
Yes. And another 'social network' too, since Tabblo has all these things everyone else has too: friends, comments etc. |
20:50
🔗
|
yipdw |
the tabblo archiver tool doesn't look too reliable |
20:50
🔗
|
alard |
DFJustin: The date of 15 March is floating around, not sure where that comes from. |
20:52
🔗
|
alard |
What's wrong with the lifeboat? (Haven't tried it.) |
20:52
🔗
|
balrog_ph |
"With the latest employee departures, no one at HP even knows how to shut it down, other than to simply pull the plug." |
20:52
🔗
|
balrog_ph |
I'd archive what I can. |
20:52
🔗
|
yipdw |
I was referring to the "it doesn't always get all the images" comment |
20:52
🔗
|
yipdw |
oh wait, HP owns it |
20:52
🔗
|
yipdw |
FUCK |
20:52
🔗
|
yipdw |
yeah that shit is going to crash pretty soon |
20:52
🔗
|
DFJustin |
lol |
20:53
🔗
|
balrog_ph |
yipdw: He supposedly fixed that bug in 2.2 |
20:53
🔗
|
balrog_ph |
"Sometimes, the downloaded tabblo zip file seems OK, but is actually missing some images. Tabblo Lifeboat now checks for this when the zip file is downloaded, and will retry if parts are missing. It will also check all your previously downloaded tabblos in case you had downloaded them with an earlier version. " |
20:54
🔗
|
yipdw |
balrog_ph: yeah, I'm looking at the lifeboat mercurial repository now |
20:54
🔗
|
yipdw |
just to understand how the lifeboat works |
20:55
🔗
|
balrog_ph |
Ahh, ok |
20:55
🔗
|
yipdw |
the code is...weird |
20:55
🔗
|
yipdw |
and I don't mean structurally; it's fine in that regard |
20:55
🔗
|
yipdw |
it's just full of comments like |
20:55
🔗
|
yipdw |
# Tabblo returns short pages sometimes!? |
20:55
🔗
|
yipdw |
# Why does tabblo.com not just return 302 for redirects?? |
20:55
🔗
|
yipdw |
which, from a developer on the webapp, is NOT what I expect to see |
20:55
🔗
|
yipdw |
it's like he's doing archaeology on some digital monolith |
20:56
🔗
|
alard |
Yes, the zip file download is strange. I've tried that. It's *very* slow, then it just stops half way. The next time you try it, you get more data, then it stops again. Repeat until you have a valid zip file. |
20:58
🔗
|
yipdw |
it looks like Tabblo suffers from a similar problem as Splinder |
20:58
🔗
|
yipdw |
(and every other huge webapp, really) |
20:58
🔗
|
balrog_ph |
Which is what? |
20:58
🔗
|
yipdw |
application server timeout |
20:58
🔗
|
yipdw |
or more precisely app server overload |
20:59
🔗
|
yipdw |
there's code in the lifeboat that retries a download of a page up to ten times |
20:59
🔗
|
alard |
That seems a likely explanation. And they have caching, so the next time you try it things go faster. |
20:59
🔗
|
alard |
And eventually things are cached enough to give you the whole file within the time limit. |
20:59
🔗
|
yipdw |
yeah, assuming your request didn't get knocked out of cache |
21:00
🔗
|
yipdw |
er, response to your request |
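A minimal Python sketch of the retry-until-valid behaviour described above: keep requesting the zip until a complete archive comes back, and give up after a fixed number of attempts. `fetch` is a hypothetical callable standing in for the real HTTP download, and the default of ten attempts only mirrors the page retry count mentioned in the chat. None of this is taken from the lifeboat's code.

```python
import io
import zipfile
from typing import Callable, Optional


def is_complete_zip(data: bytes) -> bool:
    """Return True if data opens as a zip and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # testzip() returns the name of the first bad member, or None.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        # A truncated download has no end-of-central-directory record.
        return False


def fetch_until_valid(
    fetch: Callable[[], bytes], max_attempts: int = 10
) -> Optional[bytes]:
    """Call fetch() until it returns a complete zip, or give up.

    fetch is a hypothetical stand-in for the actual download request.
    """
    for _ in range(max_attempts):
        data = fetch()
        if is_complete_zip(data):
            return data
    return None
```

A polite version would also sleep between attempts, given the concern about overloading the app server.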
21:00
🔗
|
alard |
Should we organize a rescue mission? |
21:00
🔗
|
yipdw |
hmm |
21:00
🔗
|
yipdw |
I wonder if organizing a rescue mission would make things worse :P |
21:00
🔗
|
alard |
Saving the tabblos seems easy and simple enough. |
21:00
🔗
|
alard |
A rescue mission with limited admission? |
21:00
🔗
|
yipdw |
in the sense that it'd be stressing the site and causing more download failures |
21:00
🔗
|
yipdw |
probably, yeah |
21:01
🔗
|
alard |
And perhaps make such a big problem that they'll just shut it down. |
21:01
🔗
|
yipdw |
right |
21:02
🔗
|
yipdw |
I guess we'd just use the lifeboat code |
21:02
🔗
|
alard |
It isn't warc. |
21:02
🔗
|
yipdw |
true |
21:03
🔗
|
yipdw |
but it does handle a lot of Tabblo corner cases already |
21:03
🔗
|
yipdw |
how hard would it be to add WARC generation? |
21:03
🔗
|
alard |
Well, basically the only thing of real interest is the download_tabblo method. |
21:04
🔗
|
alard |
Discovering tabblo id's is less important, we just start at 1 and continue to 180000+ |
21:04
🔗
|
alard |
The download_tabblo method downloads the zip file, which we can do ourselves. |
21:05
🔗
|
yipdw |
I'm wondering how to handle things like truncated pages and error responses |
21:05
🔗
|
yipdw |
post-processing the WARC and wget log files? |
21:05
🔗
|
yipdw |
or can we use that wget-lua branch |
21:06
🔗
|
alard |
Maybe it should be a two-step process: 1. we run a wget --page-requisites on the tabblo page, which will give us a complete web page to put in a WARC. |
21:06
🔗
|
alard |
2. we also download the zip file that contains the original images, but don't add that zip file to the warc. |
21:07
🔗
|
alard |
Then we'd have a more or less browsable (as in: WARC) copy of the site, and we'd have a copy of the original photos. The rest is derived from that by the lifeboat, which can be done later, if necessary. |
21:08
🔗
|
yipdw |
hmm, let me see if I follow that |
21:08
🔗
|
alard |
(The lifeboat just downloads that one zip file per tabblo, as far as I can see.) |
21:08
🔗
|
yipdw |
we retrieve the page structure using wget-warc, and use the ZIP from the lifeboat to augment whatever's missing |
21:08
🔗
|
yipdw |
(or alternatively use the ZIP as the source of truth for photos) |
21:09
🔗
|
yipdw |
or do you propose that the ZIP and WARC remain separate? |
21:11
🔗
|
alard |
I'd think they serve different purposes: 1. the warc would give people the pages they link to now (the tabblos can be viewed via the wayback machine, for example). 2. The original content is still in the zip file. It's not directly browsable, but the data is there and can be processed by something like the lifeboat. |
21:12
🔗
|
alard |
(The lifeboat doesn't include comments, by the way, since those aren't in the zip.) |
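The two-step plan above could be sketched in Python roughly as follows. Both URL patterns are hypothetical placeholders, since the log does not give the real Tabblo paths. The id range comes from the "start at 1 and continue to 180000+" remark, so the upper bound is only approximate. The code just builds the wget argument list and the zip target; it does not run anything. `--warc-file` assumes a wget build with WARC support.

```python
from typing import Dict, Iterator, List, Union

# Hypothetical URL patterns; the real Tabblo paths are not given in the log.
PAGE_URL = "http://tabblo.example/tabblo/{id}"
ZIP_URL = "http://tabblo.example/tabblo/{id}/zip"


def tabblo_ids(first: int = 1, last: int = 180000) -> Iterator[int]:
    """Yield candidate ids in order; 180000 is the rough bound from the chat."""
    return iter(range(first, last + 1))


def wget_warc_command(tabblo_id: int) -> List[str]:
    """Step 1: argv that fetches the page and its requisites into a WARC."""
    name = f"tabblo-{tabblo_id}"
    return [
        "wget",
        "--page-requisites",
        f"--warc-file={name}",
        f"--output-file={name}.log",  # keep the wget log for later checks
        PAGE_URL.format(id=tabblo_id),
    ]


def plan(tabblo_id: int) -> Dict[str, Union[List[str], str]]:
    """Both steps for one tabblo; the zip is stored next to the WARC, not in it."""
    return {
        "warc_command": wget_warc_command(tabblo_id),
        "zip_url": ZIP_URL.format(id=tabblo_id),
        "zip_path": f"tabblo-{tabblo_id}.zip",
    }
```

Keeping the zip out of the WARC matches the split discussed above: the WARC serves playback, the zip preserves the original photos for later processing.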
22:13
🔗
|
winr4r |
hi guys |
22:13
🔗
|
winr4r |
i have word from the inside that rutnet.org.uk is going to close down in a few weeks since it lost its funding |
22:14
🔗
|
winr4r |
nothing's certain right now, but it has a lot of sub-sites for towns and villages in rutland |
22:17
🔗
|
winr4r |
and i've been told its funding will be cut and it's going to close at the end of the financial year 2012, which means early april |
22:18
🔗
|
Nemo_bis |
if you have internal contacts why don't you ask for a backup |
22:18
🔗
|
winr4r |
sorry, .co.uk* |
22:19
🔗
|
winr4r |
Nemo_bis: the internal contact is a client who has a sub-site on rutnet and needs to get it off rutnet by april |
22:19
🔗
|
Nemo_bis |
not internal enough then, ok |
22:19
🔗
|
winr4r |
Nemo_bis: yeah, not quite |
22:21
🔗
|
winr4r |
if we get the contract, then we will end up with some contact with the owners of the website now (in order to set up redirects from their old rutnet site to the new one) |
22:22
🔗
|
winr4r |
and we *might* (very slim might) get friendly enough to negotiate a database dump |
22:23
🔗
|
winr4r |
but that's a might on top of a might |
22:26
🔗
|
Coderjoe |
isn't the end of FY12 actually april 2013? |
22:27
🔗
|
winr4r |
Coderjoe: you might be right, actually, i do know it's whatever end of the FY that happens this year |
22:29
🔗
|
Nemo_bis |
it depends on the company, doesn't it |
22:29
🔗
|
winr4r |
Nemo_bis: it's going down in april 2012 |
22:29
🔗
|
winr4r |
unless something happens |
22:29
🔗
|
Nemo_bis |
the FY I mean |
22:31
🔗
|
winr4r |
Nemo_bis: i think the FY is the same anywhere here |
22:31
🔗
|
Nemo_bis |
ah |
22:31
🔗
|
winr4r |
in any case, it's funded by the government right now |
22:32
🔗
|
winr4r |
and you know, cutting £50 a month for a dedicated server would zero the national debt overnight |
22:32
🔗
|
winr4r |
meanwhile, the fucknuts who get to make decisions like that actually keep their jobs |