Time | Nickname | Message
01:28 | ivan` | does anyone just happen to have a list of banned subreddits lying around
01:34 | wp494 | but by no means is it complete
01:34 | wp494 | apparently the last message didn't send
01:34 | wp494 | ./r/subredditdrama users post drama involving subreddit closures
01:35 | ivan` | ah
01:35 | tyn | ivan` I found a partial list of subreddits consisting of 23913 subreddits sorted by subscribers if that helps
01:35 | ivan` | tyn: commoncrawl? running that too right now
01:35 | ivan` | http://stattit.com/subreddits/by_subscribers/ lists 24,485
01:35 | ivan` | I'll check if wayback has more copies of that
01:35 | tyn | No, http://subreddits.org/alldata.zip
01:35 | ivan` | nice, thanks
02:10 | SketchCow | http://archive.org/search.php?query=collection%3Amadmagazine&sort=-publicdate
02:10 | SketchCow | Watch the fun
02:54 | DFJustin | um you can't do that on the internet
02:56 | samwilson | you can't watch the fun on the internet?! are you sure?
02:56 | samwilson | ;)
03:44 | ivan` | please tell me your favorite reddit users so they are on my list of feeds to grab
03:47 | ivan` | heh http://web.archive.org/web/20050808003120/http://reddit.com/
03:48 | omf_ | Just so you know I have been backing up parts of reddit for 3 years already
03:49 | ivan` | omf_: cool, can I get a user list and a subreddit list?
03:50 | omf_ | I don't track users only subreddits
03:52 | ivan` | I have discovered only 150K subreddits
03:52 | ivan` | I'm sure there are more banned ones
03:52 | ivan` | actually less, I have lowercase dupes
03:54 | omf_ | ivan`, are you trying to archive all of reddit?
03:56 | ivan` | just the reddit feeds in google's feed cache
03:56 | ivan` | if you can grep for /user/ and /r/ that would be appreciated
03:58 | omf_ | It will take me a while. I got everything as json docs in couchdb
03:58 | ivan` | that's fine
03:59 | ivan` | tried grepping couchdb's binary blobs? ;)
04:01 | ivan` | egrep --no-filename -o '/(r|user)/([^/"]*)/' * | uniq
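
A rough Python equivalent of that one-liner, as a minimal sketch: it assumes the couchdb documents have been dumped as JSON files in the current directory (a hypothetical layout), and unlike uniq over unsorted output it deduplicates globally.

    import glob
    import re

    # Same pattern as the egrep above: captures /r/<name>/ and /user/<name>/.
    PATTERN = re.compile(r'/(r|user)/([^/"]*)/')

    seen = set()
    for path in glob.glob('*.json'):  # hypothetical filenames for the dumped couchdb docs
        with open(path, encoding='utf-8', errors='replace') as f:
            seen.update(PATTERN.findall(f.read()))

    for kind, name in sorted(seen):
        print('/%s/%s/' % (kind, name))
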
12:55 | wp494 | I'm just going to sit here and assume that the yahoo messages that appeared on the main tracker site is the google reader grab
13:36 | Smiley | o_O
13:36 | Smiley | i didn't touch yahoo
13:36 | * | Smiley blames alard
13:36 | * | BlueMax blames Smiley
13:38 | Smiley | there, removed it.
16:05 | SketchCow | https://twitter.com/textfiles/status/343035981215174657
18:26 | epitron | ivan`: oh man... i wanna read all those old reddit articles :)
18:26 | epitron | look at how awesome reddit used to be
18:27 | epitron | does google reader have history going back that far?
19:33 | ivan` | epitron: it depends on how early someone added a feed
19:37 | epitron | were there feeds on reddit in 2005?
19:38 | epitron | when it first started, i think it was all frontpage
19:39 | epitron | hmm.. reddit probably predates google reader
19:39 | epitron | oh, no, they were out at the same time
19:41 | epitron | looks like http%3A%2F%2Fwww.reddit.com%2F.rss isn't in your crawled feeds
19:41 | epitron | (i added them all to a leveldb store, so i could check)
19:42 | ivan` | I think I need to switch from postgres to leveldb or similar
19:43 | ivan` | thanks, I'll add that one
19:45 | epitron | google scale \o/
19:47 | epitron | i just created 2 leveldb databases
19:47 | epitron | "crawled" and "frontier"
19:48 | epitron | that way i don't have to walk through the entire crawled database to find new ones
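
The crawled/frontier split described above is roughly the pattern below; a minimal sketch using the plyvel LevelDB binding, with placeholder database paths and a plain URL key, not epitron's actual setup.

    import plyvel

    # Two separate LevelDB databases, as described above: one for URLs already
    # fetched, one for URLs still waiting to be fetched.
    crawled = plyvel.DB('crawled.ldb', create_if_missing=True)
    frontier = plyvel.DB('frontier.ldb', create_if_missing=True)

    def discover(url):
        """Queue a URL unless it has already been crawled or is already queued."""
        key = url.encode('utf-8')
        if crawled.get(key) is None and frontier.get(key) is None:
            frontier.put(key, b'')

    def mark_crawled(url):
        """Move a URL from the frontier to the crawled set."""
        key = url.encode('utf-8')
        frontier.delete(key)
        crawled.put(key, b'')
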
19:48 | ivan` | epitron: how many items, how big is the leveldb
19:49 | epitron | 99 million
19:49 | epitron | 3.2 gigs
19:49 | ivan` | very nice
19:49 | epitron | i was looking at the storage files with a hex editor.. it looks like it's stored as a prefix tree
19:51 | ivan` | my leveldb will probably be [unencoded url] -> [array of item ids] with a single process importing and assigning the item id
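
A sketch of how that url -> item-id layout could look with a single importer process; the JSON value encoding, the counter key, and the database path are assumptions for illustration, not ivan`'s actual design.

    import json
    import plyvel

    db = plyvel.DB('feeds.ldb', create_if_missing=True)  # hypothetical path
    COUNTER_KEY = b'\x00next_item_id'  # assumed location for the id counter

    def add_item(url):
        """Assign the next item id and append it to the url's list of item ids."""
        next_id = int(db.get(COUNTER_KEY) or b'0')
        key = url.encode('utf-8')  # the unencoded url is used directly as the key
        ids = json.loads(db.get(key) or b'[]')
        ids.append(next_id)
        with db.write_batch() as batch:
            batch.put(key, json.dumps(ids).encode('utf-8'))
            batch.put(COUNTER_KEY, str(next_id + 1).encode('utf-8'))
        return next_id
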
19:51 | ivan` | my postgres setup has ridiculously bad performance
20:01 | Marcelo | is anyone getting error 500 on formspring sometimes?
20:12 | ivan` | what library turns https://google.com/page into com.google:https/page?
20:23 | jfranusic | so, now that wget has warc support, I've decided to start saving copies of websites using that
20:23 | jfranusic | but, I have no idea what flags I should be using
20:23 | jfranusic | are there guidelines somewhere for what I should be using?
20:23 | jfranusic | (I'm looking at this right now: https://github.com/ArchiveTeam/posterous-grab/blob/master/pipeline.py#L242 )
20:24 | ivan` | jfranusic: see http://www.archiveteam.org/index.php?title=User:Djsmiley2k
20:25 | Smiley | :D
20:25 | DFJustin | that should get added to http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
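
On jfranusic's question: the core of it is wget's --warc-file option (available since wget 1.14). Below is a minimal sketch of driving it from Python in the spirit of the pipeline.py linked above; the flag set is just a plausible starting point using documented wget options, not the exact one ArchiveTeam grabs use, and the URL and WARC name are placeholders.

    import subprocess

    def grab(url, warc_name):
        """Fetch a site recursively and record the traffic into <warc_name>.warc.gz."""
        subprocess.check_call([
            'wget',
            '--mirror',                    # recursive download with timestamping
            '--page-requisites',           # also grab images/CSS/JS needed to render pages
            '--warc-file=' + warc_name,    # write request/response records to a WARC
            '--warc-header=operator: Archive Team',  # optional extra warcinfo header
            '--wait=1',                    # be polite to the server
            url,
        ])

    grab('http://example.com/', 'example.com-2013-06-07')
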
20:32 | godane | looks like i have uploaded over 6.1TB
20:32 | jfranusic | yes, that would have been helpful
20:33 | The_Vole | look what I found http://www.symphonyofscience.com/sos/