Time |
Nickname |
Message |
00:40
π
|
ivan` |
if anyone wants to discover blog URLs/usernames, or even the blog platforms themselves, that would be an enormous help: http://www.archiveteam.org/index.php?title=Google_Reader |
00:41
π
|
ivan` |
especially some foreign ones that we generally ignore |
00:45
π
|
ivan` |
I'm going to fix the pipeline script and set up the database that will generate the work items |
00:50
π
|
ivan` |
also, if anyone is friends with Jeff Barr or Bill Kearney of http://www.syndic8.com/ maybe you can bug them for the data |
00:50
π
|
ivan` |
their site does not respond to requests beyond the homepage |
01:30
π
|
ivan` |
did anyone grab all the opml files from opmlmanager before it went down sometime in 2012? |
01:30
π
|
ivan` |
IA doesn't seem to have much, http://web.archive.org/web/20120210125326/http://www.opmlmanager.com/user_list.php |
02:01
π
|
zenguy_pc |
i did the google takeout thing for reader |
02:01
π
|
zenguy_pc |
extracted the xml |
02:02
π
|
zenguy_pc |
it should be 3-4 years old but i don't see the stored posts |
02:02
π
|
zenguy_pc |
you guys need reddit rss? |
02:02
π
|
zenguy_pc |
i only have gonewild though for about a year |
02:02
π
|
zenguy_pc |
nevermind .. only individual posts |
02:04
π
|
zenguy_pc |
does giving you the google reader subscriptions.xml allow you to get all posts via google api that they have from those urls, like 3 years old? |
02:20
π
|
ivan` |
zenguy_pc: yes, Reader serves data even for URLs that don't exist anymore |
02:20
π
|
ivan` |
and as far as I can tell, Reader does keep all posts |
02:20
π
|
zenguy_pc |
ok thats good |
02:20
π
|
ivan` |
maybe there's some really high limit so they don't store a million spam posts |
02:20
π
|
zenguy_pc |
i wish i had more urls.. i have had 30 tops |
02:20
π
|
zenguy_pc |
i will check my other accounts later |
02:21
π
|
zenguy_pc |
i got my subscriptions . i thought they would have given me all the posts i had.. i see starred posts |
02:21
π
|
ivan` |
right |
02:21
π
|
zenguy_pc |
i hadn't touched it in a year and i have some music blogs i wanted to go through |
02:22
π
|
zenguy_pc |
i can't do that in a month |
02:22
π
|
ivan` |
which music blogs? |
02:25
π
|
ivan` |
Feed API is supposed to stay up after July 1, but the eternal spring cleaning will probably kill it soon too |
02:26
π
|
zenguy_pc |
2dopeboyz |
02:26
π
|
zenguy_pc |
http://pastie.org/7966029 |
02:27
π
|
BlueMax |
http://tracker.archiveteam.org/posterous/ 10,000 items left |
02:27
π
|
BlueMax |
wow |
02:29
π
|
zenguy_pc |
i had the overheard in NY, ATL, Office.. sites in rss, then they seem to have died in dec 2012 |
02:29
π
|
zenguy_pc |
will you guys retrieve the urls from it or can you get cached info from google api? |
02:30
π
|
zenguy_pc |
do the sites need to be live? |
02:34
π
|
ivan` |
for Reader the sites can be dead; just need the feed URL |
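Editor's sketch: since only the feed URL is needed, the URLs can be pulled straight out of a Takeout subscriptions.xml, which is an OPML file where each feed is an `outline` element carrying an `xmlUrl` attribute. The function name `feed_urls_from_opml` and the sample document below are made up for illustration; nothing here talks to Reader itself.

```python
import xml.etree.ElementTree as ET


def feed_urls_from_opml(opml_text):
    """Return every outline's xmlUrl, in document order, without duplicates."""
    root = ET.fromstring(opml_text)
    seen = set()
    urls = []
    # iter() walks nested outlines too, so feeds inside folders are included.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical sample; a real export has the same outline/xmlUrl shape.
SAMPLE_OPML = """<opml version="1.0">
  <body>
    <outline text="Music" title="Music">
      <outline type="rss" text="Example" xmlUrl="http://example.com/feed"/>
    </outline>
    <outline type="rss" text="Other" xmlUrl="http://example.org/rss.xml"/>
  </body>
</opml>"""

if __name__ == "__main__":
    for url in feed_urls_from_opml(SAMPLE_OPML):
        print(url)
```

Folder outlines have no `xmlUrl`, so they are skipped while the feeds nested under them are still collected.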
02:35
π
|
ivan` |
I have a "Deleted" folder in my Reader that has a lot of stuff |
02:44
π
|
zenguy_pc |
what about random urls? |
02:44
π
|
zenguy_pc |
if you're after google cached stuff, isn't there a way to get random rss urls and see if google has it in their db? |
02:45
π
|
ivan` |
zenguy_pc: you can do searches with https://www.google.com/reader/directory/search?q=keyword-here |
02:45
π
|
ivan` |
you can also use their recommendations feature in Reader |
03:04
π
|
zenguy_pc |
how do you intend to deal with reddit's rate limit.. i've been banned several times just updating feeds every hour.. 1000+ |
03:05
π
|
ivan` |
we grab the data from Google, reddit probably doesn't rate-limit Google |
03:05
π
|
zenguy_pc |
i wanted to get posts before users deleted them which some gonewild posters were apt to do |
03:06
π
|
zenguy_pc |
can i use the same google reader upload for reddit upload |
03:06
π
|
ivan` |
I don't know what that means |
03:06
π
|
zenguy_pc |
nevermind .. you answered the question |
03:07
π
|
zenguy_pc |
you'll index reddit through google |
03:08
π
|
ivan` |
I'm not even really interested in new data, just old data, but Reader is going to hit the site if it doesn't have the feed in its cache already |
03:09
π
|
zenguy_pc |
ah good idea. |
03:09
π
|
zenguy_pc |
i saw reddit's own backup blog post and it was a lot of data |
03:10
π
|
zenguy_pc |
wish they had proper search.. so much content/gems is lost in all that data without proper search |
03:15
π
|
zenguy_pc |
has anyone ever used this http://buzz.sourceforge.net/ |
03:18
π
|
zenguy_pc |
i could never get it to work |
03:55
π
|
SketchCow |
http://www.flickr.com/photos/textfiles/sets/72157633722203885/ |
03:59
π
|
BlueMax |
"The Lucky Byte" huh |
04:10
π
|
S[h]O[r]T |
thats awesome |
04:13
π
|
S[h]O[r]T |
i didnt realize IA was a christian organization |
04:18
π
|
S[h]O[r]T |
what are the tv tuners recording, anything special 24/7? |
05:44
π
|
ivan` |
what the heck is the password for http://areca.co/8/Feed-Item-Dataset-TUDCS5 |
05:45
π
|
ivan` |
I have tried quite a few |
05:59
π
|
ivan` |
hopefully the author knows and will reply |
08:40
π
|
Smiley |
S[h]O[r]T: It's not afaik, it's in an old "scientific christian" church tho from what that movie said yesterday. |
08:40
π
|
Smiley |
Maaan, yahoo have basically killed flickr already, it's so slow :/ |
08:41
π
|
BlueMax |
it wasn't dead when Yahoo took control of it in the first place? |
08:42
π
|
Smiley |
No. |
08:42
π
|
Smiley |
It was quick at least. |
08:44
π
|
ivan` |
http://www.archiveteam.org/index.php?title=Google_Reader please edit if you know of more blog platforms |
08:46
π
|
ivan` |
hm, I should go through my feed URLs and discover some myself |
08:49
π
|
godane |
so i'm finding more techtv stuff |
08:50
π
|
godane |
i didn't do a full scan of the 21000s yet |
08:50
π
|
godane |
i also only the 21800s video ids |
08:50
π
|
godane |
*only did |
09:14
π
|
godane |
i'm uploading the 2nd season of secret life of machines |
09:15
π
|
antomatic |
S[h]O[r]T: I think they archive TV News and stuff. |
09:15
π
|
antomatic |
(indexed via closed captioning data - very cool) |
09:17
π
|
Smiley |
"It's reported that Yahoo has formally put in a bid to buy Hulu only a week after adding Tumblr to the family. |
09:17
π
|
Smiley |
Really yahoo? REALLY? |
09:17
π
|
Smiley |
BUY ALL THE THINGS! |
09:24
π
|
BlueMax |
christ they don't have the cash to buy Hulu surely |
09:30
π
|
godane |
so i got 45k videos uploaded to my g4video-web collection |
10:22
π
|
godane |
so is anyone willing to give me money to buy a new hard drive? |
11:55
π
|
ivan` |
The following text is what triggered our spam filter: http://USERNAME.tumblr.com |
11:55
π
|
ivan` |
The text you wanted to save was blocked by the spam filter. This is probably caused by a link to a blacklisted external site. |
12:53
π
|
edoc |
jux.com is closing. |
12:59
π
|
ivan` |
google finds 30 blogs on jux.com |
12:59
π
|
ivan` |
okay actually a lot more when I click that redundant link |
15:05
π
|
omf_ |
Is anyone else building applications using the Internet Archive's APIs? |
15:05
π
|
omf_ |
I wrote up a bunch of documentation but I could be missing something since the IA lacks developer documentation |
15:36
π
|
Smiley |
ivan`: # of results? |
15:43
π
|
WiK |
afternoon |
15:48
π
|
omf_ |
WiK, will you take a pull request that expands your language coverage? |
15:51
π
|
WiK |
do what? |
15:53
π
|
WiK |
i understood the pull request part |
15:55
π
|
omf_ |
I added a few lines to pullfromdb.sh to track more programming language files |
16:04
π
|
WiK |
nopaste your changes, let me take a look |
16:05
π
|
WiK |
meh, just do a pull request |
16:05
π
|
WiK |
ill take a look at it |
20:48
π
|
soultcer |
ivan`: tumblr is on the spam blacklist, since they are bad at filtering spam and the only way to report spam to them is by first signing up on tumblr - http://www.archiveteam.org/index.php?title=MediaWiki:Spam-blacklist |
21:27
π
|
ivan` |
Smiley: 760,000ish |
21:27
π
|
ivan` |
that was just with a site: |
22:11
π
|
ivan` |
soultcer: thanks, too bad |
22:30
π
|
citruspi |
Hey, could an admin PM me? |
22:51
π
|
balrog |
site:http://www3.telus.net/ stuff needs to be archived... dunno how much longer that will survive |
23:10
π
|
citruspi |
Hey, I'm looking to help with the Archive Team. |
23:10
π
|
citruspi |
I'm primarily a python programmer |
23:10
π
|
citruspi |
Is there anything I could do to help right now? (I'll hang around in the channel in the future) |
23:11
π
|
BlueMax |
citruspi, I'm sure someone can find some use for you :P |
23:12
π
|
citruspi |
Sweet, thanks :) |
23:13
π
|
ivan` |
citruspi: http://www.archiveteam.org/index.php?title=Google_Reader needs your help |
23:13
π
|
ivan` |
you can write crawlers to discover usernames on the services mentioned, or if you want to do some C, add gzip support to wget |
23:14
π
|
ivan` |
also a crawler to hit https://www.google.com/reader/directory/search with every keyword imaginable |
23:14
π
|
citruspi |
Thanks ivan`, I'll join the channel |
23:14
π
|
ivan` |
thanks! |
23:14
π
|
* |
citruspi join #donereading |
23:15
π
|
citruspi |
yeah, didn't mean to add the /me…
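
Editor's sketch of the keyword crawl ivan` suggests at 23:14: a rough first step is turning a keyword list into directory-search URLs. Only the endpoint and its `q` parameter come from the log; the function name `search_urls` and the sample keywords are hypothetical, and fetching or parsing the responses is deliberately left out because the log does not describe the response format.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the log; only the "q" parameter is known from there.
SEARCH_ENDPOINT = "https://www.google.com/reader/directory/search"


def search_urls(keywords):
    """Yield one directory-search URL per non-empty keyword."""
    for keyword in keywords:
        keyword = keyword.strip()
        if keyword:
            # urlencode percent-encodes non-ASCII text and turns spaces into "+".
            yield SEARCH_ENDPOINT + "?" + urlencode({"q": keyword})


if __name__ == "__main__":
    for url in search_urls(["music blogs", "linux", "café"]):
        print(url)
```

Keeping this as a generator means a very large keyword list, for example a dictionary file, can be streamed into work items without holding every URL in memory.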