#archiveteam 2013-06-23,Sun


Time Nickname Message
00:31 🔗 namespace Eh.
00:31 🔗 namespace I found what I wanted in the manual.
00:31 🔗 namespace It's not even that long.
00:31 🔗 namespace (But it sure is boring.)
00:43 🔗 RichardG still interested in a full backup of Google Answers?
00:44 🔗 RichardG there are ~781k sequential IDs I can try doing
00:44 🔗 RichardG just got into archiving and it sounds fun.
00:52 🔗 arrith1 RichardG: sure, btw if you're interested in google stuff we need crawling help for Google Reader. channel is #donereading
00:52 🔗 RichardG oh, sure I could try helping
00:52 🔗 arrith1 specifically crawling various sites to get usernames
00:53 🔗 arrith1 i'm working on a script for livejournal but there are lots of others and time is short
00:53 🔗 RichardG correction: Answers seems to have around 787k, and the archives do go as far back as the first questions
00:54 🔗 arrith1 RichardG: do they happen to have warcs? not totally necessary but preferable
00:54 🔗 RichardG I'm doing it in warc
00:55 🔗 RichardG had to compile the latest wget on cygwin to get warc support
00:55 🔗 RichardG was afraid the 700k+ files in one dir would nuke my HDD structure but then I remembered I scraped a ton of pastebins for a few months. All the spam
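A minimal sketch of the kind of batched, sequential WARC fetch RichardG describes, assuming the Google Answers threadview URL pattern, a batch size of 1000, and placeholder file names; this is not his actual script, just one way to drive a WARC-capable wget (1.14+, hence the cygwin compile):

    import subprocess

    BATCH = 1000  # assumed: ~780 batches over ~787k sequential IDs
    URL = "http://answers.google.com/answers/threadview?id=%d"  # assumed pattern

    def fetch_batch(start):
        list_name = "batch-%06d.txt" % start
        with open(list_name, "w") as f:
            for i in range(start, start + BATCH):
                f.write(URL % i + "\n")
        # --warc-file makes wget write googleanswers-NNNNNN.warc.gz alongside
        subprocess.call([
            "wget",
            "--input-file=" + list_name,
            "--warc-file=googleanswers-%06d" % start,
            "--no-verbose",
            "--tries=3",
        ])

    for start in range(1, 788000, BATCH):
        fetch_batch(start)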
00:57 🔗 arrith1 RichardG: might want to set up a vm at some point
01:08 🔗 omf_ Do not do it on windows. Windows has many gotchas for filesystem fuckaroundery
01:13 🔗 arrith1 btw anyone with even basic coding/scripting skills, the Google Reader effort can use lists of usernames (just submit .txt to the opml page) from sites listed here: http://www.archiveteam.org/index.php?title=Google_Reader
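For anyone picking this up, a hypothetical username harvester along the lines arrith1 describes: pull profile links off a member-listing page and emit one username per line, ready to submit as a .txt. The listing URL and link regex here are made up and would need adapting per site:

    import re
    import urllib2

    LIST_URL = "http://example-community.com/members?page=%d"  # hypothetical
    USER_RE = re.compile(r'href="/users/([A-Za-z0-9_-]+)"')    # hypothetical

    usernames = set()
    for page in range(1, 51):
        html = urllib2.urlopen(LIST_URL % page).read()
        usernames.update(USER_RE.findall(html))

    with open("usernames.txt", "w") as f:
        for name in sorted(usernames):
            f.write(name + "\n")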
01:51 🔗 omf_ Guess what motherfuckers? It's been a journey unlike anything we'd imagined
01:51 🔗 omf_ http://blog.snapjoy.com/2013/snapjoy-is-closing/
01:51 🔗 omf_ Shit is closing July 24th
01:52 🔗 omf_ They got purchased by dropbox
01:54 🔗 arrith1 uhoh
01:54 🔗 omf_ How about I journey to the founders' home and beat them with a pipe
02:01 🔗 db48x heh
02:01 🔗 db48x does anyone know how the posterous archives are organized? I need to find a particular site
02:18 🔗 SketchCow WHY HELLO.
02:18 🔗 SketchCow YEAH SO GUESS WHO HAS PAID FOR THREE LUXURY NIGHTS IN A HOTEL AND GOTTEN 2 INTERVIEWS
02:18 🔗 SketchCow Not happy.
02:19 🔗 omf_ Did people cancel on you?
02:25 🔗 db48x that's a shame. what are you going to do?
02:41 🔗 SketchCow I'm going to crush my fucking inbox
02:51 🔗 xmc SketchCow: https://www.youtube.com/watch?v=btBf3sJEAgc
03:00 🔗 SketchCow Legit
06:19 🔗 omf_ Which of the sites we backed up had a username search?
06:19 🔗 omf_ tabblo I believe
06:31 🔗 arrith1 omf_: that's a good idea
06:45 🔗 winr4r i know webshots did, and i think mobileme did
06:50 🔗 arrith1 one thing is google reader's search feature only returns up to about 1000 results for feeds. so one idea for getting more feed URLs is to pull lists of usernames from other sites, but other ideas are welcome
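One way to act on that idea is to expand each harvested username into candidate feed URLs and seed the crawl with those directly, sidestepping the ~1000-result cap on Reader search. The LiveJournal pattern below was a real one; the second is a placeholder, and per-site patterns would need checking:

    FEED_PATTERNS = [
        "http://%s.livejournal.com/data/rss",
        "http://%s.example-blog-host.com/feed",  # hypothetical
    ]

    def feed_urls(usernames):
        for name in usernames:
            for pattern in FEED_PATTERNS:
                yield pattern % name

    with open("usernames.txt") as f:
        names = [line.strip() for line in f if line.strip()]
    with open("feeds.txt", "w") as out:
        for url in feed_urls(names):
            out.write(url + "\n")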
07:12 🔗 omf_ GLaDOS and I have been building up an overview and breakdown of the ArchiveTeam process, here is our work in progress - http://pad.archivingyoursh.it/p/atpodcast
07:15 🔗 omf_ Currently it has 4 major topic areas: an overview of the whole process, servers, clients, and saving the site
07:20 🔗 winr4r morning omf_
07:25 🔗 godane looks like i'm missing stuff i need to get uploaded right away
07:25 🔗 godane like my g4tv.com/images/ dump
07:25 🔗 godane i did the html index but forgot the images
07:26 🔗 godane i think i grabbed all the images in their gallery
07:30 🔗 godane PS i'm getting more Attack of the Show episodes
08:47 🔗 ivan` omf_: if I give you a list of com.domain things, could you query your cuil URLs?
08:48 🔗 ivan` sometimes I need the subdomain, sometimes the thing after the first or second slash
09:04 🔗 ivan` I can write a post-processor for the URLs if needed
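One plausible reading of the post-processor ivan` mentions: for each URL, pull out either the subdomain or the first or second path segment, depending on how a given site keys its users. The rules here are illustrative; which extraction applies to which site is a per-site judgment call:

    from urlparse import urlparse

    def extract(url, want="subdomain", depth=1):
        parsed = urlparse(url)
        if want == "subdomain":
            parts = parsed.netloc.split(".")
            return parts[0] if len(parts) > 2 else None
        segments = [s for s in parsed.path.split("/") if s]
        return segments[depth - 1] if len(segments) >= depth else None

    print extract("http://alice.example.com/", "subdomain")   # alice
    print extract("http://example.com/users/bob", "path", 2)  # bob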
12:45 🔗 GLaDOS Did someone say, Snapjoy?
12:45 🔗 Smiley [02:51:09] <@omf_> http://blog.snapjoy.com/2013/snapjoy-is-closing/
12:47 🔗 GLaDOS I proposed #snappedjoy
12:47 🔗 GLaDOS s/proposed/propose
12:58 🔗 antomatic I really dislike the tone of closing announcements like that. 'After two years, the time has come for us to shut down' - as if that's a totally normal and expected thing.
12:59 🔗 ivan` "With Snapjoy, all of your photos are always organized, safe, and together in one place."
13:00 🔗 ivan` unless you went away somewhere for 32 days
13:16 🔗 winr4r >always
13:23 🔗 dashcloud so, for the warrior, what's the current project? xanga?
13:26 🔗 winr4r yes
13:30 🔗 chazchaz Is that higher priority than formspring?
13:31 🔗 GLaDOS Formspring isn't closing anymore, so yes.
13:34 🔗 SketchCow Yes, Formspring is lowest priority.
13:34 🔗 SketchCow Snapjoy's a priority.
13:34 🔗 SketchCow Not sure if we can reach the photos.
13:37 🔗 dashcloud so, since dropbox is buying or taking over snapjoy, shouldn't they just host all of the pictures on their site? they are a storage site after all
13:37 🔗 GLaDOS dashcloud: that makes way too much sense.
13:38 🔗 SketchCow The reason for this move is to divest the company of liability related to the old company.
13:38 🔗 SketchCow All of it.
13:38 🔗 SketchCow That's why.
13:38 🔗 SketchCow --------------------------------------
13:38 🔗 Cameron_D GLaDOS: I need an address http://i.imgur.com/oyHHgAD.png
13:38 🔗 SketchCow NEW PROJECT: #snapshut
13:38 🔗 SketchCow --------------------------------------
13:39 🔗 GLaDOS Er, hang on a second, let me set an MX record up
13:41 🔗 GLaDOS Cameron_D: would iam@archivingyoursh.it work?
13:41 🔗 Cameron_D (You don't actually need to, because it won't end up happening)
13:41 🔗 RichardG Total wall clock time: 4m 24s, Downloaded: 54 files, 531K in 0,1s (4,29 MB/s)
13:41 🔗 Cameron_D but if you want to, sure, that sounds good enough
13:41 🔗 RichardG Google Answers being that bad, 54 valid in 1000
14:08 🔗 chazchaz I'm trying to run the xanga script, but when I run it, it says that my version of seesaw is out of date, even though I just ran sudo pip install --upgrade seesaw
14:09 🔗 RichardG I already have 17 of ~780 batches from Google Answers
14:09 🔗 RichardG can give a preview to see if I'm doing this right
14:18 🔗 RichardG first 5 batches of Google Answers >>> http://dl.dropboxusercontent.com/u/861751/googleanswers-incremental-pvw.tar.bz2
14:35 🔗 chazchaz got it working
14:36 🔗 chazchaz still not sure why I couldn't get it to upgrade the globally installed version, but i uninstalled that and installed with the --user option
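A quick sanity check that a script is picking up the user-installed seesaw rather than a stale global copy, after a --user install like chazchaz's workaround; whether seesaw exposes __version__ is an assumption here:

    import seesaw

    print getattr(seesaw, "__version__", "unknown")  # assumption: may not exist
    print seesaw.__file__  # should point under ~/.local/lib/... after --user install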
14:36 🔗 RichardG GAnswers status: 56 batches, I found a really weird corrupted one where it has a normal compressed size but the warc only decompresses to the header
14:36 🔗 RichardG I'm keeping on with it
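A simple validator for catching batches like the one RichardG describes, where a .warc.gz has a plausible compressed size but decompresses to almost nothing: stream-decompress each file fully and flag implausibly small output. The file glob matches the placeholder names above, and the 1 KB threshold is an arbitrary guess:

    import glob
    import gzip

    for path in sorted(glob.glob("googleanswers-*.warc.gz")):
        total = 0
        try:
            with gzip.open(path, "rb") as f:
                while True:
                    chunk = f.read(1 << 20)
                    if not chunk:
                        break
                    total += len(chunk)
        except IOError as e:
            print "%s: decompression failed (%s)" % (path, e)
            continue
        if total < 1024:
            print "%s: only %d bytes decompressed, likely truncated" % (path, total)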
15:32 🔗 db48x who has been doing the uploads of the Posterous grab?
19:28 🔗 Schbirid hem.passagen.se/user mirroring is quick and should be done in two or three days
19:28 🔗 Schbirid http://archive.org/search.php?query=hem.passagen.se
19:28 🔗 Schbirid warc.gz inside tar
19:35 🔗 joepie91 .... archive.org offline? ._.
19:35 🔗 joepie91 sadface
19:36 🔗 Schbirid we should archive archive.org
19:36 🔗 Smiley :D
19:36 🔗 Schbirid to my butt
19:36 🔗 Smiley ok, and we go off topic to -bs :)
20:12 🔗 omf_ and then to another butt, butts all the way down
21:10 🔗 RichardG Google Answers status: it's done, but I need to repair some batches that were duped due to an issue in my script
23:36 🔗 ivan` 7.5 days left for Reader, and the Feed API they're leaving up doesn't actually serve all the historical entries
23:39 🔗 ivan` I'm guessing heavy Reader users see the value because they had deleted posts and blogs in there, but most don't
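As I understand the old public Feed API ivan` refers to, the cap is easy to demonstrate: request a large num= and count what comes back. Endpoint and response shape are per the JSON API of the time, but treat the details as unverified:

    import json
    import urllib
    import urllib2

    API = "https://ajax.googleapis.com/ajax/services/feed/load"
    feed = "http://example.com/feed.xml"  # any feed URL
    query = urllib.urlencode({"v": "1.0", "q": feed, "num": 1000})
    data = json.load(urllib2.urlopen(API + "?" + query))
    entries = data["responseData"]["feed"]["entries"]
    print "%d entries returned" % len(entries)  # far fewer than requested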
23:40 🔗 arrith1 RichardG: you might want to use the ias3upload script to get your google answers straight into IA, unless you're doing something else through dropbox
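For reference, the IA S3-like API that ias3upload wraps is a plain authenticated HTTP PUT; a minimal sketch, with placeholder item name, keys, and metadata:

    import urllib2

    item = "googleanswers-archive"         # placeholder item name
    path = "googleanswers-000001.warc.gz"  # one batch file

    req = urllib2.Request(
        "http://s3.us.archive.org/%s/%s" % (item, path),
        data=open(path, "rb").read(),
    )
    req.get_method = lambda: "PUT"
    req.add_header("authorization", "LOW ACCESS_KEY:SECRET_KEY")  # your IA keys
    req.add_header("x-amz-auto-make-bucket", "1")  # create the item if missing
    req.add_header("x-archive-meta-mediatype", "web")
    urllib2.urlopen(req)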
23:43 🔗 ivan` I wish I'd had URLs from IA and cuil for greader-grab a while ago since href extraction would have found even more feeds
