[00:40] if anyone wants to discover blog URLs/usernames, or even the blog platforms themselves, that would be an enormous help: http://www.archiveteam.org/index.php?title=Google_Reader
[00:41] especially some foreign ones that we generally ignore
[00:45] I'm going to fix the pipeline script and set up the database that will generate the work items
[00:50] also, if anyone is friends with Jeff Barr or Bill Kearney of http://www.syndic8.com/ maybe you can bug them for the data
[00:50] their site does not respond to requests beyond the homepage
[01:30] did anyone grab all the opml files from opmlmanager before it went down sometime in 2012?
[01:30] IA doesn't seem to have much, http://web.archive.org/web/20120210125326/http://www.opmlmanager.com/user_list.php
[02:01] i did the google takeout thing for reader
[02:01] extracted the xml
[02:02] it should be 3-4 years old but i don't see the stored posts
[02:02] you guys need reddit rss?
[02:02] i only have gonewild though, for about a year
[02:02] nevermind .. only individual posts
[02:04] does giving you the google reader subscriptions.xml allow you to get all the posts google has for those urls via the api, even ones 3 years old?
[02:20] zenguy_pc: yes, Reader serves data even for URLs that don't exist anymore
[02:20] and as far as I can tell, Reader does keep all posts
[02:20] ok thats good
[02:20] maybe there's some really high limit so they don't store a million spam posts
[02:20] i wish i had more urls.. i have had 30 tops
[02:20] i will check my other accounts later
[02:21] i got my subscriptions. i thought they would have given me all the posts i had.. i see starred posts
[02:21] right
[02:21] i hadn't touched it in a year and i have some music blogs i wanted to go through
[02:22] i can't do that in a month
[02:22] which music blogs?
[02:25] Feed API is supposed to stay up after July 1, but the eternal spring cleaning will probably kill it soon too
[02:26] 2dopeboyz
[02:26] http://pastie.org/7966029
[02:27] http://tracker.archiveteam.org/posterous/ 10,000 items left
[02:27] wow
[02:29] i had the Overheard in NY and ATL Office sites in rss, then they seem to have died in dec 2012
[02:29] will you guys retrieve the urls from it, or can you get cached info from the google api?
[02:30] do the sites need to be live?
[02:34] for Reader the sites can be dead; just need the feed URL
[02:35] I have a "Deleted" folder in my Reader that has a lot of stuff
[02:44] what about random urls?
[02:44] if you're after google cached stuff, isn't there a way to get random rss urls and see if google has them in their db?
[02:45] zenguy_pc: you can do searches with https://www.google.com/reader/directory/search?q=keyword-here
[02:45] you can also use their recommendations feature in Reader
[03:04] how do you intend to deal with reddit's rate limit.. i've been banned several times just updating 1000+ feeds every hour
[03:05] we grab the data from Google, reddit probably doesn't rate-limit Google
[03:05] i wanted to get posts before users deleted them, which some gonewild posters were apt to do
[03:06] can i use the same google reader upload for reddit upload
[03:06] I don't know what that means
[03:06] nevermind .. you answered the question
[03:07] you'll index reddit through google
[03:08] I'm not even really interested in new data, just old data, but Reader is going to hit the site if it doesn't have the feed in its cache already
[03:09] ah good idea.
[03:09] i saw reddit's own backup blog post and it was a lot of data
[03:10] wish they had proper search.. so many content gems are lost in all that data without it
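
A minimal sketch, in Python, of the approach discussed around [02:04]-[02:34]: read the feed URLs out of a Takeout subscriptions.xml (OPML) export and ask Reader's stream API for whatever items it has cached for each feed, dead or alive. The stream endpoint, the n/c parameters, and the continuation token reflect how that unofficial API was commonly used, not this project's actual pipeline; treat them, and the item field names, as assumptions.

    # Sketch only: read feed URLs from a Takeout subscriptions.xml (OPML) and
    # fetch whatever items Google Reader has cached for each one. Endpoint and
    # parameters are assumptions about the unofficial Reader stream API.
    import json
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    STREAM = "https://www.google.com/reader/api/0/stream/contents/feed/"

    def feed_urls_from_opml(path):
        """Return the xmlUrl attribute of every outline in an OPML export."""
        tree = ET.parse(path)
        return [node.attrib["xmlUrl"]
                for node in tree.iter("outline")
                if "xmlUrl" in node.attrib]

    def fetch_feed_items(feed_url, per_page=1000, max_pages=10):
        """Yield cached item dicts for one feed, following continuation tokens."""
        continuation = None
        for _ in range(max_pages):
            params = {"n": per_page, "client": "ArchiveTeam"}
            if continuation:
                params["c"] = continuation
            url = (STREAM + urllib.parse.quote(feed_url, safe="")
                   + "?" + urllib.parse.urlencode(params))
            with urllib.request.urlopen(url) as resp:
                data = json.load(resp)
            for item in data.get("items", []):
                yield item
            continuation = data.get("continuation")
            if not continuation:
                break  # no more cached pages for this feed

    if __name__ == "__main__":
        for feed in feed_urls_from_opml("subscriptions.xml"):
            for item in fetch_feed_items(feed):
                print(feed, item.get("published"), item.get("title"))
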
[03:15] has anyone ever used this? http://buzz.sourceforge.net/
[03:18] i could never get it to work
[03:55] http://www.flickr.com/photos/textfiles/sets/72157633722203885/
[03:59] "The Lucky Byte" huh
[04:10] that's awesome
[04:13] i didn't realize IA was a christian organization
[04:18] what are the tv tuners recording, anything special 24/7?
[05:44] what the heck is the password for http://areca.co/8/Feed-Item-Dataset-TUDCS5
[05:45] I have tried quite a few
[05:59] hopefully the author knows and will reply
[08:40] S[h]O[r]T: It's not afaik, it's just in an old "scientific christian" church though, from what that movie said yesterday.
[08:40] Maaan, yahoo have basically killed flickr already, it's so slow :/
[08:41] it wasn't dead when Yahoo took control of it in the first place?
[08:42] No.
[08:42] It was quick at least.
[08:44] http://www.archiveteam.org/index.php?title=Google_Reader please edit if you know of more blog platforms
[08:46] hm, I should go through my feed URLs and discover some myself
[08:49] so i'm finding more techtv stuff
[08:50] i didn't do a full scan of the 21000s yet
[08:50] i also only did the 21800s video ids
[09:14] i'm uploading the 2nd season of the secret life of machines
[09:15] S[h]O[r]T: I think they archive TV News and stuff.
[09:15] (indexed via closed captioning data - very cool)
[09:17] "It's reported that Yahoo has formally put in a bid to buy Hulu only a week after adding Tumblr to the family."
[09:17] Really yahoo? REALLY?
[09:17] BUY ALL THE THINGS!
[09:24] christ, they don't have the cash to buy Hulu, surely
[09:30] so i got 45k videos uploaded to my g4video-web collection
[10:22] so is anyone willing to give me money to buy a new hard drive?
[11:55] The following text is what triggered our spam filter: http://USERNAME.tumblr.com
[11:55] The text you wanted to save was blocked by the spam filter. This is probably caused by a link to a blacklisted external site.
[12:53] jux.com is closing.
[12:59] google finds 30 blogs on jux.com
[12:59] okay, actually a lot more when I click that redundant link
[15:05] Is anyone else building applications using the Internet Archive's APIs?
[15:05] I wrote up a bunch of documentation, but I could be missing something since the IA lacks developer documentation
[15:36] ivan`: # of results?
[15:43] afternoon
[15:48] WiK, will you take a pull request that expands your language coverage?
[15:51] do what?
[15:53] i understood the pull request part
[15:55] I added a few lines to pullfromdb.sh to track more programming language files
[16:04] nopaste your changes, let me take a look
[16:05] meh, just do a pull request
[16:05] ill take a look at it
[20:48] ivan`: tumblr is on the spam blacklist, since they are bad at filtering spam and the only way to report spam to them is by first signing up on tumblr - http://www.archiveteam.org/index.php?title=MediaWiki:Spam-blacklist
[21:27] Smiley: 760,000ish
[21:27] that was just with a site:
[22:11] soultcer: thanks, too bad
[22:30] Hey, could an admin PM me?
[22:51] site:http://www3.telus.net/ stuff needs to be archived... dunno how much longer that will survive
[23:10] Hey, I'm looking to help with the Archive Team.
[23:10] I'm primarily a python programmer
[23:10] Is there anything I could do to help right now?
[23:10] (I'll hang around in the channel in the future)
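
On the Internet Archive API question at [15:05]: a minimal Python sketch against the public search and metadata endpoints (advancedsearch.php and /metadata/). The collection:g4video-web query is just an example taken from this log, and only the identifier/numFound/docs fields are assumed in the response.

    # Sketch: query archive.org's public search API and per-item metadata API.
    import json
    import urllib.parse
    import urllib.request

    def ia_search(query, rows=50, page=1):
        """Return (num_found, docs) for an archive.org advanced search query."""
        params = urllib.parse.urlencode({
            "q": query,
            "fl[]": "identifier",
            "rows": rows,
            "page": page,
            "output": "json",
        })
        url = "https://archive.org/advancedsearch.php?" + params
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        return data["response"]["numFound"], data["response"]["docs"]

    def ia_metadata(identifier):
        """Return the full metadata record for a single item."""
        url = "https://archive.org/metadata/" + urllib.parse.quote(identifier)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    if __name__ == "__main__":
        total, docs = ia_search("collection:g4video-web")
        print(total, "items found")
        for doc in docs[:5]:
            print(doc["identifier"])
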
[23:11] citruspi, I'm sure someone can find some use for you :P
[23:12] Sweet, thanks :)
[23:13] citruspi: http://www.archiveteam.org/index.php?title=Google_Reader needs your help
[23:13] you can write crawlers to discover usernames on the services mentioned, or if you want to do some C, add gzip support to wget
[23:14] also a crawler to hit https://www.google.com/reader/directory/search with every keyword imaginable
[23:14] Thanks ivan`, I'll join the channel
[23:14] thanks!
[23:14] * citruspi join #donereading
[23:15] yeah, didn't mean to add the /me…
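
A rough sketch of the keyword crawler suggested at [23:14] for https://www.google.com/reader/directory/search. Only the endpoint comes from this log; how feed URLs appear in the response is an assumption, so the regex below is a placeholder that would need tuning against real pages, and the delay is just politeness.

    # Sketch: hit Reader's directory search with a list of keywords and collect
    # candidate feed URLs. The URL-extraction regex is a guess at the page
    # structure and will need adjusting against actual responses.
    import re
    import time
    import urllib.parse
    import urllib.request

    SEARCH = "https://www.google.com/reader/directory/search?q="
    URL_RE = re.compile(r'"(https?://[^"]+)"')  # assumed: quoted URLs in the page

    def discover_feeds(keywords, delay=2.0):
        """Return the set of URLs found across all keyword searches."""
        found = set()
        for word in keywords:
            url = SEARCH + urllib.parse.quote(word)
            try:
                with urllib.request.urlopen(url) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except OSError as exc:
                print("skipping", word, exc)
                continue
            for match in URL_RE.findall(page):
                if match not in found:
                    found.add(match)
                    print(match)
            time.sleep(delay)  # stay polite to avoid getting throttled
        return found

    if __name__ == "__main__":
        discover_feeds(["music blog", "overheard", "techtv"])
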