#archiveteam 2012-02-03,Fri


Time Nickname Message
00:03 🔗 Coderjoe yipdw: big surprise </sarcasm> https://wwws.whitehouse.gov/petitions/!/response/why-we-cant-comment
00:09 🔗 Coderjoe underscor: yeah... and it took us over a year to learn of it. gives me the warm fuzzies. how about you?
00:13 🔗 Coderjoe also, this AC claims that Ken Silva (cited in that article) DID know: http://it.slashdot.org/comments.pl?sid=2651017&cid=38904163
17:20 🔗 yipdw SketchCow: I'm going to call Proust "done"
17:20 🔗 yipdw since it appears to now be stable
17:20 🔗 yipdw (for the near future)
18:18 🔗 bsmith094 so, anything for the ffnet grab?
20:32 🔗 yipdw bsmith094: nothing new from me; I've been busy with other tasks
20:39 🔗 alard bsmith094/yipdw: Is ffnet going away soon?
20:40 🔗 alard Or is this a no-need-to-hurry long term project that can wait?
20:43 🔗 yipdw alard: it is a no-need-to-hurry project
20:43 🔗 yipdw (which is why I am not moving urgently on it :P)
20:45 🔗 alard Good.
20:45 🔗 yipdw I'm currently uploading some last bits of Proust
20:46 🔗 yipdw and, after I archive the AMI (just in case SketchCow wants it reinstated or something), will have a machine ready to do more fetches of stuff
20:46 🔗 yipdw alard: I gather Tabblo is the next high-priority one
20:46 🔗 yipdw ?
20:47 🔗 alard It could be, but I haven't been able to find a confirmation for that.
20:48 🔗 yipdw ok
20:48 🔗 alard The only thing I've found is this blog post by a former employee (since November 2010), http://nedbatchelder.com/blog/201201/goodbye_tabblo.html
20:48 🔗 yipdw uh, I guess I can switch this EC2 instance over to Mobileme
20:48 🔗 yipdw haha
20:49 🔗 yipdw ANOTHER storytelling site bites the dust
20:50 🔗 DFJustin they haven't announced it's going down but it's on life support at this point
20:50 🔗 alard Yes. And another 'social network' too, since Tabblo has all these things everyone else has too: friends, comments etc.
20:50 🔗 yipdw the tabblo archiver tool doesn't look too reliable
20:50 🔗 alard DFJustin: The date of 15 March is floating around, not sure where that comes from.
20:52 🔗 alard What's wrong with the lifeboat? (Haven't tried it.)
20:52 🔗 balrog_ph "With the latest employee departures, no one at HP even knows how to shut it down, other than to simply pull the plug."
20:52 🔗 balrog_ph I'd archive what I can.
20:52 🔗 yipdw I was referring to the "it doesn't always get all the images" comment
20:52 🔗 yipdw oh wait, HP owns it
20:52 🔗 yipdw FUCK
20:52 🔗 yipdw yeah that shit is going to crash pretty soon
20:52 🔗 DFJustin lol
20:53 🔗 balrog_ph yipdw: He supposedly fixed that bug in 2.2
20:53 🔗 balrog_ph "Sometimes, the downloaded tabblo zip file seems OK, but is actually missing some images. Tabblo Lifeboat now checks for this when the zip file is downloaded, and will retry if parts are missing. It will also check all your previously downloaded tabblos in case you had downloaded them with an earlier version. "
20:54 🔗 yipdw balrog_ph: yeah, I'm looking at the lifeboat mercurial repository now
20:54 🔗 yipdw just to understand how the lifeboat works
20:55 🔗 balrog_ph Ahh, ok
20:55 🔗 yipdw the code is...weird
20:55 🔗 yipdw and I don't mean structurally; it's fine in that regard
20:55 🔗 yipdw it's just full of comments like
20:55 🔗 yipdw # Tabblo returns short pages sometimes!?
20:55 🔗 yipdw # Why does tabblo.com not just return 302 for redirects??
20:55 🔗 yipdw which, from a developer on the webapp, is NOT what I expect to see
20:55 🔗 yipdw it's like he's doing archaeology on some digital monolith
20:56 🔗 alard Yes, the zip file download is strange. I've tried that. It's *very* slow, then it just stops half way. The next time you try it, you get more data, then it stops again. Repeat until you have a valid zip file.
20:58 🔗 yipdw it looks like Tabblo suffers from a similar problem as Splinder
20:58 🔗 yipdw (and every other huge webapp, really)
20:58 🔗 balrog_ph Which is what?
20:58 🔗 yipdw application server timeout
20:58 🔗 yipdw or more precisely app server overload
20:59 🔗 yipdw there's code in the lifeboat that retries a download of a page up to ten times
20:59 🔗 alard That seems a likely explanation. And they have caching, so the next time you try it things go faster.
20:59 🔗 alard And eventually things are cached enough to give you the whole file within the time limit.
20:59 🔗 yipdw yeah, assuming your request didn't get knocked out of cache
21:00 🔗 yipdw er, response to your request
21:00 🔗 alard Should we organize a rescue mission?
21:00 🔗 yipdw hmm
21:00 🔗 yipdw I wonder if organizing a rescue mission would make things worse :P
21:00 🔗 alard Saving the tabblos seems easy and simple enough.
21:00 🔗 alard A rescue mission with limited admission?
21:00 🔗 yipdw in the sense that it'd be stressing the site and causing more download failures
21:00 🔗 yipdw probably, yeah
21:01 🔗 alard And perhaps make such a big problem that they'll just shut it down.
21:01 🔗 yipdw right
21:02 🔗 yipdw I guess we'd just use the lifeboat code
21:02 🔗 alard It isn't warc.
21:02 🔗 yipdw true
21:03 🔗 yipdw but it does handle a lot of Tabblo corner cases already
21:03 🔗 yipdw how hard would it be to add WARC generation?
21:03 🔗 alard Well, basically the only thing of real interest is the download_tabblo method.
21:04 🔗 alard Discovering tabblo id's is less important, we just start at 1 and continue to 180000+
21:04 🔗 alard The download_tabblo method downloads the zip file, which we can do ourselves.
21:05 🔗 yipdw I'm wondering how to handle things like truncated pages and error responses
21:05 🔗 yipdw post-processing the WARC and wget log files?
21:05 🔗 yipdw or can we use that wget-lua branch
21:06 🔗 alard Maybe it should be a two-step process: 1. we run a wget --page-requisites on the tabblo page, which will give us a complete web page to put in a WARC.
21:06 🔗 alard 2. we also download the zip file that contains the original images, but don't add that zip file to the warc.
21:07 🔗 alard Then we'd have a more or less browsable (as in: WARC) copy of the site, and we'd have a copy of the original photos. The rest is derived from that by the lifeboat, which can be done later, if necessary.
21:08 🔗 yipdw hmm, let me see if I follow that
21:08 🔗 alard (The lifeboat just downloads that one zip file per tabblo, as far as I can see.)
21:08 🔗 yipdw we retrieve the page structure using wget-warc, and use the ZIP from the lifeboat to augment whatever's missing
21:08 🔗 yipdw (or alternatively use the ZIP as the source of truth for photos)
21:09 🔗 yipdw or do you propose that the ZIP and WARC remain separate?
21:11 🔗 alard I'd think they serve different purposes: 1. the warc would give people the pages they link to now (the tabblos can be viewed via the wayback machine, for example). 2. The original content is still in the zip file. It's not directly browsable, but the data is there and can be processed by something like the lifeboat.
21:12 🔗 alard (The lifeboat doesn't include comments, by the way, since those aren't in the zip.)
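The retry behaviour discussed above (a zip download that stops half way, then gets further on the next attempt, with the lifeboat retrying up to ten times) could be sketched roughly as follows. This is only an illustration, not the lifeboat's actual code: the names `is_complete_zip` and `fetch_until_valid` are hypothetical, and `fetch` stands in for whatever function performs the real HTTP request and returns the response body as bytes.

```python
import io
import time
import zipfile


def is_complete_zip(data: bytes) -> bool:
    """Return True if data opens as a zip archive and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        # A truncated download usually lacks the end-of-archive record.
        return False


def fetch_until_valid(fetch, max_attempts=10, delay=0.0):
    """Call fetch() until it returns a complete zip, up to max_attempts times.

    fetch is a hypothetical zero-argument callable returning bytes.
    Returns (data, attempts_used); raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        data = fetch()
        if is_complete_zip(data):
            return data, attempt
        # Pause between attempts so an overloaded app server is not hammered.
        time.sleep(delay)
    raise RuntimeError(f"no complete zip after {max_attempts} attempts")
```

A non-zero `delay` is probably the more polite choice here, given the concern that a rescue mission could itself stress the site.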
22:13 🔗 winr4r hi guys
22:13 🔗 winr4r i have word from the inside that rutnet.org.uk is going to close down in a few weeks since it lost its funding
22:14 🔗 winr4r nothing's certain right now, but it has a lot of sub-sites for towns and villages in rutland
22:17 🔗 winr4r and i've been told its funding will be cut and it's going to close at the end of the financial year 2012, which means early april
22:18 🔗 Nemo_bis if you have internal contacts why don't you ask for a backup
22:18 🔗 winr4r sorry, .co.uk*
22:19 🔗 winr4r Nemo_bis: the internal contact is a client who has a sub-site on rutnet and needs to get it off rutnet by april
22:19 🔗 Nemo_bis not internal enough then, ok
22:19 🔗 winr4r Nemo_bis: yeah, not quite
22:21 🔗 winr4r if we get the contract, then we will end up with some contact with the owners of the website now (in order to set up redirects from their old rutnet site to the new one)
22:22 🔗 winr4r and we *might* (very slim might) get friendly enough to negotiate a database dump
22:23 🔗 winr4r but that's a might on top of a might
22:26 🔗 Coderjoe isn't the end of FY12 actually april 2013?
22:27 🔗 winr4r Coderjoe: you might be right, actually, i do know it's whatever end of the FY that happens this year
22:29 🔗 Nemo_bis it depends on the company, doesn't it
22:29 🔗 winr4r Nemo_bis: it's going down in april 2012
22:29 🔗 winr4r unless something happens
22:29 🔗 Nemo_bis the FY I mean
22:31 🔗 winr4r Nemo_bis: i think the FY is the same anywhere here
22:31 🔗 Nemo_bis ah
22:31 🔗 winr4r in any case, it's funded by the government right now
22:32 🔗 winr4r and you know, cutting £50 a month for a dedicated server would zero the national debt overnight
22:32 🔗 winr4r meanwhile, the fucknuts who get to make decisions like that actually keep their jobs
