[00:31] Eh.
[00:31] I found what I wanted in the manual.
[00:31] It's not even that long.
[00:31] (But it sure is boring.)
[00:43] still interested in a full backup of Google Answers?
[00:44] there are ~781k sequential I can try doing
[00:44] just got into archiving and it sounds fun.
[00:52] RichardG: sure, btw if you're interested in google stuff we need crawling help for Google Reader. channel is #donereading
[00:52] oh, sure I could try helping
[00:52] specifically crawling various sites to get usernames
[00:53] i'm working on a script for livejournal but there are lots of others and time is short
[00:53] correction: Answers seems to have around 787k, and the archives do go as far back as the first questions
[00:54] RichardG: do they happen to have warcs? not totally necessary but preferable
[00:54] I'm doing it in warc
[00:55] had to compile the latest wget on cygwin to get warc support
[00:55] was afraid the 700k+ files in one dir would nuke my HDD structure but then I remembered I scraped a ton of pastebins for a few months. All the spam
[00:57] RichardG: might want to set up a vm at some point
[01:08] Do not do it on windows. Windows has many gotchas for filesystem fuckaroundery
[01:13] btw anyone with even basic coding/scripting skills, the Google Reader effort can use lists of usernames (just submit .txt to the opml page) from sites listed here: http://www.archiveteam.org/index.php?title=Google_Reader
[01:51] Guess what motherfuckers? It's been a journey unlike anything we'd imagined
[01:51] http://blog.snapjoy.com/2013/snapjoy-is-closing/
[01:51] Shit is closing July 24th
[01:52] They got purchased by dropbox
[01:54] uhoh
[01:54] How about I journey to the founder's home and beat them with a pipe
[02:01] heh
[02:01] does anyone know how the posterous archives are organized? I need to find a particular site
[02:18] WHY HELLO.
[02:18] YEAH SO GUESS WHO HAS PAID FOR THREE LUXURY NIGHTS IN A HOTEL AND GOTTEN 2 INTERVIEWS
[02:18] Not happy.
[02:19] Did people cancel on you?
[02:25] that's a shame. what are you going to do?
[02:41] I'm going to crush my fucking inbox
[02:51] SketchCow: https://www.youtube.com/watch?v=btBf3sJEAgc
[03:00] Legit
[06:19] Which of the sites we backed up have a username search?
[06:19] tabblo I believe
[06:31] omf_: that's a good idea
[06:45] i know webshots did, and i think mobileme did
[06:50] one thing is google reader's search feature only returns up to about 1000 results for feeds, but not more. so in terms of ideas for getting more feed URLs, one idea is to get lists of usernames from other sites, but other ideas are welcome
[07:12] GLaDOS and I have been building up an overview and breakdown of the ArchiveTeam process, here is our work in progress - http://pad.archivingyoursh.it/p/atpodcast
[07:15] Currently it is 4 major topic areas. An overview of the whole process, Servers, Clients, Saving the site
[07:20] morning omf_
[07:25] looks like i'm missing stuff i need to get uploaded right away
[07:25] like my g4tv.com/images/ dump
[07:25] i did the html index but forgot the images
[07:26] i think i grabbed all images in their gallery
[07:30] PS i'm getting more attack of the show episodes
[08:47] omf_: if I give you a list of com.domain things, could you query your cuil URLs?
[08:48] sometimes I need the subdomain, sometimes the thing after the first or second slash
[09:04] I can write a post-processor for the URLs if needed
[12:45] Did someone say, Snapjoy?
[12:45] [02:51:09] <@omf_> http://blog.snapjoy.com/2013/snapjoy-is-closing/
[12:47] I proposed #snappedjoy
[12:47] s/proposed/propose
[12:58] I really dislike the tone of closing announcements like that. 'After two years, the time has come for us to shut down' - as if that's a totally normal and expected thing.
[12:59] "With Snapjoy, all of your photos are always organized, safe, and together in one place."
[13:00] unless you went away somewhere for 32 days
[13:16] >always
[13:23] so, for the warrior, what's the current project? xanga?
[13:26] yes
[13:30] Is that higher priority than formspring?
[13:31] Formspring isn't closing anymore, so yes.
[13:34] Yes, Formspring is lowest priority.
[13:34] Snapjoy's a priority.
[13:34] Not sure if we can reach the photos.
[13:37] so, since dropbox is buying or taking over snapjoy, shouldn't they just host all of the pictures on their site? they are a storage site after all
[13:37] dashcloud: that makes way too much sense.
[13:38] The reason for this move is to divest the company of liability related to the old company.
[13:38] All of it.
[13:38] That's why.
[13:38] --------------------------------------
[13:38] GLaDOS: I need an address http://i.imgur.com/oyHHgAD.png
[13:38] NEW PROJECT: #snapshut
[13:38] --------------------------------------
[13:39] Er, hang on a second, let me set an MX record up
[13:41] Cameron_D: would iam@archivingyoursh.it work?
[13:41] (You don't actually need to, because it won't end up happening)
[13:41] Total wall clock time: 4m 24s, Downloaded: 54 files, 531K in 0,1s (4,29 MB/s)
[13:41] but if you want to, sure, that sounds good enough
[13:41] Google Answers being that bad, 54 valid in 1000
[14:08] I'm trying to run the xanga script, but when I run it, it says that my version of seesaw is out of date, even though I just ran sudo pip install --upgrade seesaw
[14:09] I already have 17 of ~780 batches from Google Answers
[14:09] can give a preview to see if I'm doing this right
[14:18] first 5 batches of Google Answers >>> http://dl.dropboxusercontent.com/u/861751/googleanswers-incremental-pvw.tar.bz2
[14:35] got it working
[14:36] still not sure why I couldn't get it to upgrade the globally installed version, but i uninstalled that and installed with the --user option
[14:36] GAnswers status: 56 batches, I found a really weird corrupted one where it has a normal compressed size but the warc only decompresses to the header
[14:36] I'm keeping on with it
[15:32] who has been doing the uploads of the Posterous grab?
[19:28] hem.passagen.se/user mirroring is quick and should be done in two or three days
[19:28] http://archive.org/search.php?query=hem.passagen.se
[19:28] warc.gz inside tar
[19:35] .... archive.org offline? ._.
[19:35] sadface
[19:36] we should archive archive.org
[19:36] :D
[19:36] to my butt
[19:36] ok, and we go off topic to -bs :)
[20:12] and then to another butt, butts all the way down
[21:10] Google Answers status: it's done, but I need to repair some batches that were duped due to an issue in my script
[23:36] 7.5 days left for Reader, and the Feed API they're leaving up doesn't actually serve all the historical entries
[23:39] I'm guessing heavy Reader users see the value because they had deleted posts and blogs in there, but most don't
[23:40] RichardG: you might want to use ias2upload script to get your google answers straight into IA, unless you're doing something else through dropbox
[23:43] I wish I had URLs from IA and cuil for greader-grab a while ago since href extraction would have found even more feeds
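
(Editor's note on the last message: the log does not show how the greader-grab href extraction worked. The sketch below is only one plausible reading, assuming feeds are found through the common `<link rel="alternate">` autodiscovery convention. `FeedLinkParser`, `extract_feeds`, and the example.com URLs are hypothetical names for illustration, not part of any ArchiveTeam script.)

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that conventionally mark a feed in <link rel="alternate"> tags.
# Assumption: the real greader-grab tooling may have looked at other hrefs too.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised by <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. a bare attribute), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        href = attr.get("href", "").strip()
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page the HTML came from.
            url = urljoin(self.base_url, href)
            if url not in self.feeds:
                self.feeds.append(url)


def extract_feeds(html, base_url):
    """Return the de-duplicated feed URLs found in one HTML page."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.feeds


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<link rel="alternate" type="application/rss+xml" href="/rss.xml">'
        '<link rel="stylesheet" href="/style.css">'
        '</head></html>'
    )
    for feed_url in extract_feeds(page, "http://example.com/blog/"):
        print(feed_url)
```

(Run over pages pulled from a URL list such as the IA or cuil ones mentioned above, this would yield candidate feed URLs. Feeds linked only through plain `<a href>` anchors would need a broader rule.)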