Time | Nickname | Message
01:28 | ivan` | does anyone just happen to have a list of banned subreddits lying around
01:34 | wp494 | but by no means is it complete
01:34 | wp494 | apparently the last message didn't send
01:34 | wp494 | ./r/subredditdrama users post drama involving subreddit closures
01:35 | ivan` | ah
01:35 | tyn | ivan` I found a partial list of subreddits consisting of 23913 subreddits sorted by subscribers if that helps
01:35 | ivan` | tyn: commoncrawl? running that too right now
01:35 | ivan` | http://stattit.com/subreddits/by_subscribers/ lists 24,485
01:35 | ivan` | I'll check if wayback has more copies of that
01:35 | tyn | No, http://subreddits.org/alldata.zip
01:35 | ivan` | nice, thanks
02:10 | SketchCow | http://archive.org/search.php?query=collection%3Amadmagazine&sort=-publicdate
02:10 | SketchCow | Watch the fun
02:54 | DFJustin | um you can't do that on the internet
02:56 | samwilson | you can't watch the fun on the internet?! are you sure?
02:56 | samwilson | ;)
03:44 | ivan` | please tell me your favorite reddit users so they are on my list of feeds to grab
03:47 | ivan` | heh http://web.archive.org/web/20050808003120/http://reddit.com/
03:48 | omf_ | Just so you know I have been backing up parts of reddit for 3 years already
03:49 | ivan` | omf_: cool, can I get a user list and a subreddit list?
03:50 | omf_ | I don't track users only subreddits
03:52 | ivan` | I have discovered only 150K subreddits
03:52 | ivan` | I'm sure there are more banned ones
03:52 | ivan` | actually less, I have lowercase dupes
03:54 | omf_ | ivan`, are you trying to archive all of reddit?
03:56 | ivan` | just the reddit feeds in google's feed cache
03:56 | ivan` | if you can grep for /user/ and /r/ that would be appreciated
03:58 | omf_ | It will take me a while. I got everything as json docs in couchdb
03:58 | ivan` | that's fine
03:59 | ivan` | tried grepping couchdb's binary blobs? ;)
04:01 | ivan` | egrep --no-filename -o '/(r|user)/([^/"]*)/' * | uniq
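
A rough Python equivalent of that one-liner, as a minimal sketch: it assumes the couchdb documents have been dumped as JSON files in the current directory (a hypothetical layout), and unlike uniq over unsorted output it deduplicates globally.

    import glob
    import re

    # Same pattern as the egrep above: captures /r/<name>/ and /user/<name>/.
    PATTERN = re.compile(r'/(r|user)/([^/"]*)/')

    seen = set()
    for path in glob.glob('*.json'):  # hypothetical filenames for the dumped couchdb docs
        with open(path, encoding='utf-8', errors='replace') as f:
            seen.update(PATTERN.findall(f.read()))

    for kind, name in sorted(seen):
        print('/%s/%s/' % (kind, name))
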
12:55 | wp494 | I'm just going to sit here and assume that the yahoo messages that appeared on the main tracker site is the google reader grab
13:36 | Smiley | o_O
13:36 | Smiley | i didn't touch yahoo
13:36 | * | Smiley blames alard
13:36 | * | BlueMax blames Smiley
13:38 | Smiley | there, removed it.
16:05 | SketchCow | https://twitter.com/textfiles/status/343035981215174657
18:26 | epitron | ivan`: oh man... i wanna read all those old reddit articles :)
18:26 | epitron | look at how awesome reddit used to be
18:27 | epitron | does google reader have history going back that far?
19:33 | ivan` | epitron: it depends on how early someone added a feed
19:37 | epitron | were there feeds on reddit in 2005?
19:38 | epitron | when it first started, i think it was all frontpage
19:39 | epitron | hmm.. reddit probably predates google reader
19:39 | epitron | oh, no, they were out at the same time
19:41 | epitron | looks like http%3A%2F%2Fwww.reddit.com%2F.rss isn't in your crawled feeds
19:41 | epitron | (i added them all to a leveldb store, so i could check)
19:42 | ivan` | I think I need to switch from postgres to leveldb or similar
19:43 | ivan` | thanks, I'll add that one
19:45 | epitron | google scale \o/
19:47 | epitron | i just created 2 leveldb databases
19:47 | epitron | "crawled" and "frontier"
19:48 | epitron | that way i don't have to walk through the entire crawled database to find new ones
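
The crawled/frontier split described above is roughly the pattern below; a minimal sketch using the plyvel LevelDB binding, with placeholder database paths and a plain URL key, not epitron's actual setup.

    import plyvel

    # Two separate LevelDB databases, as described above: one for URLs already
    # fetched, one for URLs still waiting to be fetched.
    crawled = plyvel.DB('crawled.ldb', create_if_missing=True)
    frontier = plyvel.DB('frontier.ldb', create_if_missing=True)

    def discover(url):
        """Queue a URL unless it has already been crawled or is already queued."""
        key = url.encode('utf-8')
        if crawled.get(key) is None and frontier.get(key) is None:
            frontier.put(key, b'')

    def mark_crawled(url):
        """Move a URL from the frontier to the crawled set."""
        key = url.encode('utf-8')
        frontier.delete(key)
        crawled.put(key, b'')
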
19:48 | ivan` | epitron: how many items, how big is the leveldb
19:49 | epitron | 99 million
19:49 | epitron | 3.2 gigs
19:49 | ivan` | very nice
19:49 | epitron | i was looking at the storage files with a hex editor.. it looks like it's stored as a prefix tree
19:51 | ivan` | my leveldb will probably be [unencoded url] -> [array of item ids] with a single process importing and assigning the item id
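
A sketch of how that url -> item-id layout could look with a single importer process; the JSON value encoding, the counter key, and the database path are assumptions for illustration, not ivan`'s actual design.

    import json
    import plyvel

    db = plyvel.DB('feeds.ldb', create_if_missing=True)  # hypothetical path
    COUNTER_KEY = b'\x00next_item_id'  # assumed location for the id counter

    def add_item(url):
        """Assign the next item id and append it to the url's list of item ids."""
        next_id = int(db.get(COUNTER_KEY) or b'0')
        key = url.encode('utf-8')  # the unencoded url is used directly as the key
        ids = json.loads(db.get(key) or b'[]')
        ids.append(next_id)
        with db.write_batch() as batch:
            batch.put(key, json.dumps(ids).encode('utf-8'))
            batch.put(COUNTER_KEY, str(next_id + 1).encode('utf-8'))
        return next_id
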
19:51 | ivan` | my postgres setup has ridiculously bad performance
20:01 | Marcelo | is anyone getting error 500 on formspring sometimes?
20:12 | ivan` | what library turns https://google.com/page into com.google:https/page?
20:23 | jfranusic | so, now that wget has warc support, I've decided to start saving copies of websites using that
20:23 | jfranusic | but, I have no idea what flags I should be using
20:23 | jfranusic | are there guidelines somewhere for what I should be using?
20:23 | jfranusic | (I'm looking at this right now: https://github.com/ArchiveTeam/posterous-grab/blob/master/pipeline.py#L242 )
20:24 | ivan` | jfranusic: see http://www.archiveteam.org/index.php?title=User:Djsmiley2k
20:25 | Smiley | :D
20:25 | DFJustin | that should get added to http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
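
On jfranusic's question: the core of it is wget's --warc-file option (available since wget 1.14). Below is a minimal sketch of driving it from Python in the spirit of the pipeline.py linked above; the flag set is just a plausible starting point using documented wget options, not the exact one ArchiveTeam grabs use, and the URL and WARC name are placeholders.

    import subprocess

    def grab(url, warc_name):
        """Fetch a site recursively and record the traffic into <warc_name>.warc.gz."""
        subprocess.check_call([
            'wget',
            '--mirror',                    # recursive download with timestamping
            '--page-requisites',           # also grab images/CSS/JS needed to render pages
            '--warc-file=' + warc_name,    # write request/response records to a WARC
            '--warc-header=operator: Archive Team',  # optional extra warcinfo header
            '--wait=1',                    # be polite to the server
            url,
        ])

    grab('http://example.com/', 'example.com-2013-06-07')
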
20:32 | godane | looks like i have uploaded over 6.1TB
20:32 | jfranusic | yes, that would have been helpful
20:33 | The_Vole | look what I found http://www.symphonyofscience.com/sos/