#archiveteam 2013-06-23,Sun


Time Nickname Message
00:31 🔗 namespace Eh.
00:31 🔗 namespace I found what I wanted in the manual.
00:31 🔗 namespace It's not even that long.
00:31 🔗 namespace (But it sure is boring.)
00:43 🔗 RichardG still interested in a full backup of Google Answers?
00:44 🔗 RichardG there are ~781k sequential IDs I can try doing
00:44 🔗 RichardG just got into archiving and it sounds fun.
00:52 🔗 arrith1 RichardG: sure, btw if you're interested in google stuff we need crawling help for Google Reader. channel is #donereading
00:52 🔗 RichardG oh, sure I could try helping
00:52 🔗 arrith1 specifically crawling various sites to get usernames
00:53 🔗 arrith1 i'm working on a script for livejournal but there are lots of others and time is short
00:53 🔗 RichardG correction: Answers seems to have around 787k, and the archives do go as far back as the first questions
00:54 🔗 arrith1 RichardG: do they happen to have warcs? not totally necessary but preferable
00:54 🔗 RichardG I'm doing it in warc
00:55 🔗 RichardG had to compile the latest wget on cygwin to get warc support
00:55 🔗 RichardG was afraid the 700k+ files in one dir would nuke my HDD structure but then I remembered I scraped a ton of pastebins for a few months. All the spam
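A minimal sketch of the kind of batched, sequential WARC fetch RichardG describes, assuming the Google Answers threadview URL pattern, a batch size of 1000, and placeholder file names; this is not his actual script, just one way to drive a WARC-capable wget (1.14+, hence the cygwin compile):

    import subprocess

    BATCH = 1000  # assumed: ~780 batches over ~787k sequential IDs
    URL = "http://answers.google.com/answers/threadview?id=%d"  # assumed pattern

    def fetch_batch(start):
        list_name = "batch-%06d.txt" % start
        with open(list_name, "w") as f:
            for i in range(start, start + BATCH):
                f.write(URL % i + "\n")
        # --warc-file makes wget write googleanswers-NNNNNN.warc.gz alongside
        subprocess.call([
            "wget",
            "--input-file=" + list_name,
            "--warc-file=googleanswers-%06d" % start,
            "--no-verbose",
            "--tries=3",
        ])

    for start in range(1, 788000, BATCH):
        fetch_batch(start)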
00:57 🔗 arrith1 RichardG: might want to set up a vm at some point
01:08 🔗 omf_ Do not do it on windows. Windows has many gotchas for filesystem fuckaroundery
01:13 🔗 arrith1 btw anyone with even basic coding/scripting skills, the Google Reader effort can use lists of usernames (just submit .txt to the opml page) from sites listed here: http://www.archiveteam.org/index.php?title=Google_Reader
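For anyone picking this up, a hypothetical username harvester along the lines arrith1 describes: pull profile links off a member-listing page and emit one username per line, ready to submit as a .txt. The listing URL and link regex here are made up and would need adapting per site:

    import re
    import urllib2

    LIST_URL = "http://example-community.com/members?page=%d"  # hypothetical
    USER_RE = re.compile(r'href="/users/([A-Za-z0-9_-]+)"')    # hypothetical

    usernames = set()
    for page in range(1, 51):
        html = urllib2.urlopen(LIST_URL % page).read()
        usernames.update(USER_RE.findall(html))

    with open("usernames.txt", "w") as f:
        for name in sorted(usernames):
            f.write(name + "\n")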
01:51 🔗 omf_ Guess what motherfuckers? It's been a journey unlike anything we'd imagined
01:51 🔗 omf_ http://blog.snapjoy.com/2013/snapjoy-is-closing/
01:51 🔗 omf_ Shit is closing July 24th
01:52 🔗 omf_ They got purchased by dropbox
01:54 🔗 arrith1 uhoh
01:54 🔗 omf_ How about I journey to the founders' home and beat them with a pipe
02:01 🔗 db48x heh
02:01 🔗 db48x does anyone know how the posterous archives are organized? I need to find a particular site
02:18 🔗 SketchCow WHY HELLO.
02:18 🔗 SketchCow YEAH SO GUESS WHO HAS PAID FOR THREE LUXURY NIGHTS IN A HOTEL AND GOTTEN 2 INTERVIEWS
02:18 🔗 SketchCow Not happy.
02:19 🔗 omf_ Did people cancel on you?
02:25 🔗 db48x that's a shame. what are you going to do?
02:41 🔗 SketchCow I'm going to crush my fucking inbox
02:51 🔗 xmc SketchCow: https://www.youtube.com/watch?v=btBf3sJEAgc
03:00 🔗 SketchCow Legit
06:19 🔗 omf_ Which of the sites we backed up had a username search?
06:19 🔗 omf_ tabblo I believe
06:31 🔗 arrith1 omf_: that's a good idea
06:45 🔗 winr4r i know webshots did, and i think mobileme did
06:50 🔗 arrith1 one thing is google reader's search feature only returns up to about 1000 results for feeds. so one idea for getting more feed URLs is to pull lists of usernames from other sites, but other ideas are welcome
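One way to act on that idea is to expand each harvested username into candidate feed URLs and seed the crawl with those directly, sidestepping the ~1000-result cap on Reader search. The LiveJournal pattern below was a real one; the second is a placeholder, and per-site patterns would need checking:

    FEED_PATTERNS = [
        "http://%s.livejournal.com/data/rss",
        "http://%s.example-blog-host.com/feed",  # hypothetical
    ]

    def feed_urls(usernames):
        for name in usernames:
            for pattern in FEED_PATTERNS:
                yield pattern % name

    with open("usernames.txt") as f:
        names = [line.strip() for line in f if line.strip()]
    with open("feeds.txt", "w") as out:
        for url in feed_urls(names):
            out.write(url + "\n")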
07:12 🔗 omf_ GLaDOS and I have been building up an overview and breakdown of the ArchiveTeam process, here is our work in progress - http://pad.archivingyoursh.it/p/atpodcast
07:15 🔗 omf_ Currently it has 4 major topic areas: an overview of the whole process, servers, clients, and saving the site
07:20 🔗 winr4r morning omf_
07:25 🔗 godane looks like i'm missing stuff i need to get uploaded right away
07:25 🔗 godane like my g4tv.com/images/ dump
07:25 🔗 godane i did the html index but forgot the images
07:26 🔗 godane i think i grabbed all the images in their gallery
07:30 🔗 godane PS i'm getting more Attack of the Show episodes
08:47 🔗 ivan` omf_: if I give you a list of com.domain things, could you query your cuil URLs?
08:48 🔗 ivan` sometimes I need the subdomain, sometimes the thing after the first or second slash
09:04 🔗 ivan` I can write a post-processor for the URLs if needed
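One plausible reading of the post-processor ivan` mentions: for each URL, pull out either the subdomain or the first or second path segment, depending on how a given site keys its users. The rules here are illustrative; which extraction applies to which site is a per-site judgment call:

    from urlparse import urlparse

    def extract(url, want="subdomain", depth=1):
        parsed = urlparse(url)
        if want == "subdomain":
            parts = parsed.netloc.split(".")
            return parts[0] if len(parts) > 2 else None
        segments = [s for s in parsed.path.split("/") if s]
        return segments[depth - 1] if len(segments) >= depth else None

    print extract("http://alice.example.com/", "subdomain")   # alice
    print extract("http://example.com/users/bob", "path", 2)  # bob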
12:45 🔗 GLaDOS Did someone say, Snapjoy?
12:45 🔗 Smiley [02:51:09] <@omf_> http://blog.snapjoy.com/2013/snapjoy-is-closing/
12:47 🔗 GLaDOS I proposed #snappedjoy
12:47 🔗 GLaDOS s/proposed/propose
12:58 🔗 antomatic I really dislike the tone of closing announcements like that. 'After two years, the time has come for us to shut down' - as if that's a totally normal and expected thing.
12:59 🔗 ivan` "With Snapjoy, all of your photos are always organized, safe, and together in one place."
13:00 🔗 ivan` unless you went away somewhere for 32 days
13:16 🔗 winr4r >always
13:23 🔗 dashcloud so, for the warrior, what's the current project? xanga?
13:26 🔗 winr4r yes
13:30 🔗 chazchaz Is that higher priority than formspring?
13:31 🔗 GLaDOS Formspring isn't closing anymore, so yes.
13:34 🔗 SketchCow Yes, Formspring is lowest priority.
13:34 🔗 SketchCow Snapjoy's a priority.
13:34 🔗 SketchCow Not sure if we can reach the photos.
13:37 🔗 dashcloud so, since dropbox is buying or taking over snapjoy, shouldn't they just host all of the pictures on their site? they are a storage site after all
13:37 🔗 GLaDOS dashcloud: that makes way too much sense.
13:38 🔗 SketchCow The reason for this move is to divest the company of liability related to the old company.
13:38 🔗 SketchCow All of it.
13:38 🔗 SketchCow That's why.
13:38 🔗 SketchCow --------------------------------------
13:38 🔗 Cameron_D GLaDOS: I need an address http://i.imgur.com/oyHHgAD.png
13:38 🔗 SketchCow NEW PROJECT: #snapshut
13:38 🔗 SketchCow --------------------------------------
13:39 🔗 GLaDOS Er, hang on a second, let me set an MX record up
13:41 🔗 GLaDOS Cameron_D: would iam@archivingyoursh.it work?
13:41 🔗 Cameron_D (You don't actually need to, because it won't end up happening)
13:41 🔗 RichardG Total wall clock time: 4m 24s, Downloaded: 54 files, 531K in 0,1s (4,29 MB/s)
13:41 🔗 Cameron_D but if you want to, sure, that sounds good enough
13:41 🔗 RichardG Google Answers being that bad, 54 valid in 1000
14:08 🔗 chazchaz I'm trying to run the xanga script, but when I run it, it says that my version of seesaw is out of date, even though I just ran sudo pip install --upgrade seesaw
14:09 🔗 RichardG I already have 17 of ~780 batches from Google Answers
14:09 🔗 RichardG can give a preview to see if I'm doing this right
14:18 🔗 RichardG first 5 batches of Google Answers >>> http://dl.dropboxusercontent.com/u/861751/googleanswers-incremental-pvw.tar.bz2
14:35 🔗 chazchaz got it working
14:36 🔗 chazchaz still not sure why I couldn't get it to upgrade the globally installed version, but i uninstalled that and installed with the --user option
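A quick sanity check that a script is picking up the user-installed seesaw rather than a stale global copy, after a --user install like chazchaz's workaround; whether seesaw exposes __version__ is an assumption here:

    import seesaw

    print getattr(seesaw, "__version__", "unknown")  # assumption: may not exist
    print seesaw.__file__  # should point under ~/.local/lib/... after --user install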
14:36 🔗 RichardG GAnswers status: 56 batches, I found a really weird corrupted one where it has a normal compressed size but the warc only decompresses to the header
14:36 🔗 RichardG I'm keeping on with it
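A simple validator for catching batches like the one RichardG describes, where a .warc.gz has a plausible compressed size but decompresses to almost nothing: stream-decompress each file fully and flag implausibly small output. The file glob matches the placeholder names above, and the 1 KB threshold is an arbitrary guess:

    import glob
    import gzip

    for path in sorted(glob.glob("googleanswers-*.warc.gz")):
        total = 0
        try:
            with gzip.open(path, "rb") as f:
                while True:
                    chunk = f.read(1 << 20)
                    if not chunk:
                        break
                    total += len(chunk)
        except IOError as e:
            print "%s: decompression failed (%s)" % (path, e)
            continue
        if total < 1024:
            print "%s: only %d bytes decompressed, likely truncated" % (path, total)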
15:32 🔗 db48x who has been doing the uploads of the Posterous grab?
19:28 🔗 Schbirid hem.passagen.se/user mirroring is quick and should be done in two or three days
19:28 🔗 Schbirid http://archive.org/search.php?query=hem.passagen.se
19:28 🔗 Schbirid warc.gz inside tar
19:35 🔗 joepie91 .... archive.org offline? ._.
19:35 🔗 joepie91 sadface
19:36 🔗 Schbirid we should archive archive.org
19:36 🔗 Smiley :D
19:36 🔗 Schbirid to my butt
19:36 🔗 Smiley ok, and we go off topic to -bs :)
20:12 🔗 omf_ and then to another butt, butts all the way down
21:10 🔗 RichardG Google Answers status: it's done, but I need to repair some batches that were duped due to an issue in my script
23:36 🔗 ivan` 7.5 days left for Reader, and the Feed API they're leaving up doesn't actually serve all the historical entries
23:39 🔗 ivan` I'm guessing heavy Reader users see the value because they had deleted posts and blogs in there, but most don't
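As I understand the old public Feed API ivan` refers to, the cap is easy to demonstrate: request a large num= and count what comes back. Endpoint and response shape are per the JSON API of the time, but treat the details as unverified:

    import json
    import urllib
    import urllib2

    API = "https://ajax.googleapis.com/ajax/services/feed/load"
    feed = "http://example.com/feed.xml"  # any feed URL
    query = urllib.urlencode({"v": "1.0", "q": feed, "num": 1000})
    data = json.load(urllib2.urlopen(API + "?" + query))
    entries = data["responseData"]["feed"]["entries"]
    print "%d entries returned" % len(entries)  # far fewer than requested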
23:40 🔗 arrith1 RichardG: you might want to use the ias3upload script to get your google answers straight into IA, unless you're doing something else through dropbox
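For reference, the IA S3-like API that ias3upload wraps is a plain authenticated HTTP PUT; a minimal sketch, with placeholder item name, keys, and metadata:

    import urllib2

    item = "googleanswers-archive"         # placeholder item name
    path = "googleanswers-000001.warc.gz"  # one batch file

    req = urllib2.Request(
        "http://s3.us.archive.org/%s/%s" % (item, path),
        data=open(path, "rb").read(),
    )
    req.get_method = lambda: "PUT"
    req.add_header("authorization", "LOW ACCESS_KEY:SECRET_KEY")  # your IA keys
    req.add_header("x-amz-auto-make-bucket", "1")  # create the item if missing
    req.add_header("x-archive-meta-mediatype", "web")
    urllib2.urlopen(req)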
23:43 🔗 ivan` I wish I'd had URLs from IA and cuil for greader-grab a while ago since href extraction would have found even more feeds
