[01:28] does anyone just happen to have a list of banned subreddits lying around
[01:34] but by no means is it complete
[01:34] apparently the last message didn't send
[01:34] ./r/subredditdrama users post drama involving subreddit closures
[01:35] ah
[01:35] ivan` I found a partial list of subreddits consisting of 23913 subreddits sorted by subscribers if that helps
[01:35] tyn: commoncrawl? running that too right now
[01:35] http://stattit.com/subreddits/by_subscribers/ lists 24,485
[01:35] I'll check if wayback has more copies of that
[01:35] No, http://subreddits.org/alldata.zip
[01:35] nice, thanks
[02:10] http://archive.org/search.php?query=collection%3Amadmagazine&sort=-publicdate
[02:10] Watch the fun
[02:54] um you can't do that on the internet
[02:56] you can't watch the fun on the internet?! are you sure?
[02:56] ;)
[03:44] please tell me your favorite reddit users so they are on my list of feeds to grab
[03:47] heh http://web.archive.org/web/20050808003120/http://reddit.com/
[03:48] Just so you know I have been backing up parts of reddit for 3 years already
[03:49] omf_: cool, can I get a user list and a subreddit list?
[03:50] I don't track users, only subreddits
[03:52] I have discovered only 150K subreddits
[03:52] I'm sure there are more banned ones
[03:52] actually less, I have lowercase dupes
[03:54] ivan`, are you trying to archive all of reddit?
[03:56] just the reddit feeds in google's feed cache
[03:56] if you can grep for /user/ and /r/ that would be appreciated
[03:58] It will take me a while. I got everything as json docs in couchdb
[03:58] that's fine
[03:59] tried grepping couchdb's binary blobs? ;)
[04:01] egrep --no-filename -o '/(r|user)/([^/"]*)/' * | sort -u
[12:55] I'm just going to sit here and assume that the yahoo messages that appeared on the main tracker site is the google reader grab
[13:36] o_O
[13:36] i didn't touch yahoo
[13:36] * Smiley blames alard
[13:36] * BlueMax blames Smiley
[13:38] there, removed it.
[16:05] https://twitter.com/textfiles/status/343035981215174657
[18:26] ivan`: oh man... i wanna read all those old reddit articles :)
[18:26] look at how awesome reddit used to be
[18:27] does google reader have history going back that far?
[19:33] epitron: it depends on how early someone added a feed
[19:37] were there feeds on reddit in 2005?
[19:38] when it first started, i think it was all frontpage
[19:39] hmm.. reddit probably predates google reader
[19:39] oh, no, they were out at the same time
[19:41] looks like http%3A%2F%2Fwww.reddit.com%2F.rss isn't in your crawled feeds
[19:41] (i added them all to a leveldb store, so i could check)
[19:42] I think I need to switch from postgres to leveldb or similar
[19:43] thanks, I'll add that one
[19:45] google scale \o/
[19:47] i just created 2 leveldb databases
[19:47] "crawled" and "frontier"
[19:48] that way i don't have to walk through the entire crawled database to find new ones
[19:48] epitron: how many items, how big is the leveldb
[19:49] 99 million
[19:49] 3.2 gigs
[19:49] very nice
[19:49] i was looking at the storage files with a hex editor.. it looks like it's stored as a prefix tree
[19:51] my leveldb will probably be [unencoded url] -> [array of item ids] with a single process importing and assigning the item id
[19:51] my postgres setup has ridiculously bad performance
[20:01] is anyone getting error 500 on formspring sometimes?
[20:12] what library turns https://google.com/page into com.google:https/page?
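
The "crawled"/"frontier" split described at 19:47-19:48 — keeping a second database of not-yet-fetched URLs so the whole crawled set never has to be rescanned — could look roughly like the sketch below. This is a minimal illustration only, assuming the plyvel LevelDB bindings for Python; the database paths and helper names are hypothetical and not taken from the log.

    # Minimal sketch of a "crawled" + "frontier" LevelDB pair (assumes plyvel).
    import plyvel

    crawled = plyvel.DB('crawled.ldb', create_if_missing=True)    # already fetched
    frontier = plyvel.DB('frontier.ldb', create_if_missing=True)  # still to fetch

    def enqueue(url):
        # Queue a newly discovered feed URL unless it is already known.
        key = url.encode('utf-8')
        if crawled.get(key) is None and frontier.get(key) is None:
            frontier.put(key, b'')

    def mark_crawled(url):
        # Move a fetched URL from the frontier into the crawled set.
        key = url.encode('utf-8')
        frontier.delete(key)
        crawled.put(key, b'done')

    # Only the small frontier is walked to find work; the large crawled
    # database is only ever probed by key, never scanned.
    enqueue('http://www.reddit.com/.rss')
    for key, _ in frontier:
        print('still to fetch: %s' % key.decode('utf-8'))
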
[20:23] so, now that wget has warc support, I've decided to start saving copies of websites using that
[20:23] but, I have no idea what flags I should be using
[20:23] are there guidelines somewhere for what I should be using?
[20:23] (I'm looking at this right now: https://github.com/ArchiveTeam/posterous-grab/blob/master/pipeline.py#L242 )
[20:24] jfranusic: see http://www.archiveteam.org/index.php?title=User:Djsmiley2k
[20:25] :D
[20:25] that should get added to http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
[20:32] looks like i have uploaded over 6.1TB
[20:32] yes, that would have been helpful
[20:33] look what I found http://www.symphonyofscience.com/sos/
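
On the wget/WARC question at 20:23: the sketch below shows roughly the kind of invocation the posterous-grab pipeline and the "Wget with WARC output" wiki page boil down to, using the WARC options added in GNU Wget 1.14. The target URL, WARC file name prefix, and politeness settings are placeholder assumptions, not values from the log or recommendations from the channel.

    # Rough sketch of a wget-with-WARC invocation driven from Python
    # (as the grab pipelines do); the URL and file names are hypothetical.
    import subprocess

    url = 'http://example.com/'           # hypothetical target site
    warc_prefix = 'example.com-20130607'  # hypothetical WARC name prefix

    cmd = [
        'wget',
        '--mirror',                       # recursive download with timestamping
        '--page-requisites',              # also fetch images/CSS/JS pages need
        '--warc-file=%s' % warc_prefix,   # writes example.com-20130607.warc.gz
        '--warc-header=operator: Archive Team',
        '--warc-cdx',                     # emit a CDX index alongside the WARC
        '-e', 'robots=off',               # grabs generally ignore robots.txt
        '--wait=1', '--random-wait',      # throttle requests a little
        url,
    ]
    subprocess.check_call(cmd)
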