#archiveteam 2013-06-07,Fri

Time Nickname Message
01:28 🔗 ivan` does anyone just happen to have a list of banned subreddits lying around
01:34 🔗 wp494 but by no means is it complete
01:34 🔗 wp494 apparently the last message didn't send
01:34 🔗 wp494 ./r/subredditdrama users post drama involving subreddit closures
01:35 🔗 ivan` ah
01:35 🔗 tyn ivan` I found a partial list of subreddits consisting of 23913 subreddits sorted by subscribers if that helps
01:35 🔗 ivan` tyn: commoncrawl? running that too right now
01:35 🔗 ivan` http://stattit.com/subreddits/by_subscribers/ lists 24,485
01:35 🔗 ivan` I'll check if wayback has more copies of that
01:35 🔗 tyn No, http://subreddits.org/alldata.zip
01:35 🔗 ivan` nice, thanks
02:10 🔗 SketchCow http://archive.org/search.php?query=collection%3Amadmagazine&sort=-publicdate
02:10 🔗 SketchCow Watch the fun
02:54 🔗 DFJustin um you can't do that on the internet
02:56 🔗 samwilson you can't watch the fun on the internet?! are you sure?
02:56 🔗 samwilson ;)
03:44 🔗 ivan` please tell me your favorite reddit users so they are on my list of feeds to grab
03:47 🔗 ivan` heh http://web.archive.org/web/20050808003120/http://reddit.com/
03:48 🔗 omf_ Just so you know I have been backing up parts of reddit for 3 years already
03:49 🔗 ivan` omf_: cool, can I get a user list and a subreddit list?
03:50 🔗 omf_ I don't track users only subreddits
03:52 🔗 ivan` I have discovered only 150K subreddits
03:52 🔗 ivan` I'm sure there are more banned ones
03:52 🔗 ivan` actually less, I have lowercase dupes
03:54 🔗 omf_ ivan`, are you trying to archive all of reddit?
03:56 🔗 ivan` just the reddit feeds in google's feed cache
03:56 🔗 ivan` if you can grep for /user/ and /r/ that would be appreciated
03:58 🔗 omf_ It will take me a while. I got everything as json docs in couchdb
03:58 🔗 ivan` that's fine
03:59 🔗 ivan` tried grepping couchdb's binary blobs? ;)
04:01 🔗 ivan` egrep --no-filename -o '/(r|user)/([^/"]*)/' * | uniq
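A rough Python equivalent of the egrep one-liner above, as a sketch only. It reuses the same pattern, lowercases names to fold the "lowercase dupes" mentioned earlier, and collects into a set, because `uniq` on unsorted input only collapses adjacent duplicates. The function name `extract_names` is hypothetical, and skipping empty names is an added assumption.

```python
import re

# Same pattern as the egrep above: /r/<name>/ or /user/<name>/
FEED_PATH = re.compile(r'/(r|user)/([^/"]*)/')


def extract_names(text):
    """Return sorted unique (kind, name) pairs, with names lowercased."""
    seen = set()
    for match in FEED_PATH.finditer(text):
        kind, name = match.group(1), match.group(2)
        if name:  # assumption: an empty name is noise, so skip it
            seen.add((kind, name.lower()))
    return sorted(seen)


if __name__ == "__main__":
    sample = '"/r/Pics/" "/r/pics/" "/user/ivan/"'
    for kind, name in extract_names(sample):
        print(f"/{kind}/{name}/")
```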
12:55 🔗 wp494 I'm just going to sit here and assume that the yahoo messages that appeared on the main tracker site are the google reader grab
13:36 🔗 Smiley o_O
13:36 🔗 Smiley i didn't touch yahoo
13:36 🔗 * Smiley blames alard
13:36 🔗 * BlueMax blames Smiley
13:38 🔗 Smiley there, removed it.
16:05 🔗 SketchCow https://twitter.com/textfiles/status/343035981215174657
18:26 🔗 epitron ivan`: oh man... i wanna read all those old reddit articles :)
18:26 🔗 epitron look at how awesome reddit used to be
18:27 🔗 epitron does google reader have history going back that far?
19:33 🔗 ivan` epitron: it depends on how early someone added a feed
19:37 🔗 epitron were there feeds on reddit in 2005?
19:38 🔗 epitron when it first started, i think it was all frontpage
19:39 🔗 epitron hmm.. reddit probably predates google reader
19:39 🔗 epitron oh, no, they were out at the same time
19:41 🔗 epitron looks like http%3A%2F%2Fwww.reddit.com%2F.rss isn't in your crawled feeds
19:41 🔗 epitron (i added them all to a leveldb store, so i could check)
19:42 🔗 ivan` I think I need to switch from postgres to leveldb or similar
19:43 🔗 ivan` thanks, I'll add that one
19:45 🔗 epitron google scale \o/
19:47 🔗 epitron i just created 2 leveldb databases
19:47 🔗 epitron "crawled" and "frontier"
19:48 🔗 epitron that way i don't have to walk through the entire crawled database to find new ones
19:48 🔗 ivan` epitron: how many items, how big is the leveldb
19:49 🔗 epitron 99 million
19:49 🔗 epitron 3.2 gigs
19:49 🔗 ivan` very nice
19:49 🔗 epitron i was looking at the storage files with a hex editor.. it looks like it's stored as a prefix tree
19:51 🔗 ivan` my leveldb will probably be [unencoded url] -> [array of item ids] with a single process importing and assigning the item id
19:51 🔗 ivan` my postgres setup has ridiculously bad performance
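The "crawled" and "frontier" split described above might look roughly like this. It is a sketch under assumptions: two in-memory Python sets stand in for the two leveldb databases, and the class and method names (`FeedQueue`, `discover`, `next_url`) are hypothetical, not anyone's actual code. The point is that new work comes straight out of the small frontier store, so the full crawled store is only ever probed by key.

```python
class FeedQueue:
    """Two keyed stores: URLs waiting to be fetched and URLs already fetched."""

    def __init__(self):
        self.frontier = set()  # stand-in for the "frontier" database
        self.crawled = set()   # stand-in for the "crawled" database

    def discover(self, url):
        """Queue a URL unless it is already known; return True if queued."""
        if url in self.crawled or url in self.frontier:
            return False
        self.frontier.add(url)
        return True

    def next_url(self):
        """Take one pending URL and mark it as crawled, or return None."""
        if not self.frontier:
            return None
        url = self.frontier.pop()
        self.crawled.add(url)
        return url


if __name__ == "__main__":
    queue = FeedQueue()
    queue.discover("http://www.reddit.com/.rss")
    print(queue.next_url())
```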
20:01 🔗 Marcelo is anyone getting error 500 on formspring sometimes?
20:12 🔗 ivan` what library turns https://google.com/page into com.google:https/page?
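The transformation in the question above can be sketched with the standard library alone. This does not answer which library produces it; it only illustrates the shape of the key: host labels reversed, then the scheme, then the path. The function name `reverse_url_key` is hypothetical, and ignoring the port and fragment is an assumption made for brevity.

```python
from urllib.parse import urlsplit


def reverse_url_key(url):
    """Build a key like 'com.google:https/page' from 'https://google.com/page'."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"URL has no host: {url!r}")
    # Reverse the host labels so URLs from one domain sort next to each other.
    host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{host}:{parts.scheme}{parts.path}"
    if parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    print(reverse_url_key("https://google.com/page"))
```

Keys of this shape keep every URL from one domain adjacent in a sorted key-value store, which is useful for range scans over a single site.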
20:23 🔗 jfranusic so, now that wget has warc support, I've decided to start saving copies of websites using that
20:23 🔗 jfranusic but, I have no idea what flags I should be using
20:23 🔗 jfranusic are there guidelines somewhere for what I should be using?
20:23 🔗 jfranusic (I'm looking at this right now: https://github.com/ArchiveTeam/posterous-grab/blob/master/pipeline.py#L242 )
20:24 🔗 ivan` jfranusic: see http://www.archiveteam.org/index.php?title=User:Djsmiley2k
20:25 🔗 Smiley :D
20:25 🔗 DFJustin that should get added to http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
20:32 🔗 godane looks like i have uploaded over 6.1TB
20:32 🔗 jfranusic yes, that would have been helpful
20:33 🔗 The_Vole look what I found http://www.symphonyofscience.com/sos/
