#archiveteam 2013-05-27,Mon

Time	Nickname	Message
00:40 ^🔗	ivan`	if anyone wants to discover blog URLs/usernames, or even the blog platforms themselves, that would be an enormous help: http://www.archiveteam.org/index.php?title=Google_Reader
00:41 ^🔗	ivan`	especially some foreign ones that we generally ignore
00:45 ^🔗	ivan`	I'm going to fix the pipeline script and set up the database that will generate the work items
00:50 ^🔗	ivan`	also, if anyone is friends with Jeff Barr or Bill Kearney of http://www.syndic8.com/ maybe you can bug them for the data
00:50 ^🔗	ivan`	their site does not respond to requests beyond the homepage
01:30 ^🔗	ivan`	did anyone grab all the opml files from opmlmanager before it went down sometime in 2012?
01:30 ^🔗	ivan`	IA doesn't seem to have much, http://web.archive.org/web/20120210125326/http://www.opmlmanager.com/user_list.php
02:01 ^🔗	zenguy_pc	i did the google takeout thing for reader
02:01 ^🔗	zenguy_pc	extracted the xml
02:02 ^🔗	zenguy_pc	it should be 3-4 years old but i don't see the stored posts
02:02 ^🔗	zenguy_pc	you guys need reddit rss?
02:02 ^🔗	zenguy_pc	i only have gonewild though for about a year
02:02 ^🔗	zenguy_pc	nevermind .. only individual posts
02:04 ^🔗	zenguy_pc	does giving you the google reader subscriptions.xml allow you to get all posts via google api that they have from those urls, like 3 years old?
02:20 ^🔗	ivan`	zenguy_pc: yes, Reader serves data even for URLs that don't exist anymore
02:20 ^🔗	ivan`	and as far as I can tell, Reader does keep all posts
02:20 ^🔗	zenguy_pc	ok thats good
02:20 ^🔗	ivan`	maybe there's some really high limit so they don't store a million spam posts
02:20 ^🔗	zenguy_pc	i wish i had more urls.. i have had 30 tops
02:20 ^🔗	zenguy_pc	i will check my other accounts later
02:21 ^🔗	zenguy_pc	i got my subscriptions . i thought they would have given me all the posts i had.. i see starred posts
02:21 ^🔗	ivan`	right
02:21 ^🔗	zenguy_pc	i hadn't touched it in a year and i have some music blogs i wanted to go through
02:22 ^🔗	zenguy_pc	i can't do that in a month
02:22 ^🔗	ivan`	which music blogs?
02:25 ^🔗	ivan`	Feed API is supposed to stay up after July 1, but the eternal spring cleaning will probably kill it soon too
02:26 ^🔗	zenguy_pc	2dopeboyz
02:26 ^🔗	zenguy_pc	http://pastie.org/7966029
02:27 ^🔗	BlueMax	http://tracker.archiveteam.org/posterous/ 10,000 items left
02:27 ^🔗	BlueMax	wow
02:29 ^🔗	zenguy_pc	i had the overheard in NY , ATL Office.. sites in rss, then they seem to have died in dec 2012
02:29 ^🔗	zenguy_pc	will you guys retrieve the urls from it or can you get cached info from google api ?
02:30 ^🔗	zenguy_pc	do the sites need to be live?
02:34 ^🔗	ivan`	for Reader the sites can be dead; just need the feed URL
02:35 ^🔗	ivan`	I have a "Deleted" folder in my Reader that has a lot of stuff
02:44 ^🔗	zenguy_pc	what about random urls?
02:44 ^🔗	zenguy_pc	if you're after google cached stuff , isn't there a way to get random rss urls and see if google has it in their db?
02:45 ^🔗	ivan`	zenguy_pc: you can do searches with https://www.google.com/reader/directory/search?q=keyword-here
02:45 ^🔗	ivan`	you can also use their recommendations feature in Reader
03:04 ^🔗	zenguy_pc	how do you intend to deal with reddit rate limit.. i've been banned several times just updating feeds every hour.. 1000+
03:05 ^🔗	ivan`	we grab the data from Google, reddit probably doesn't rate-limit Google
03:05 ^🔗	zenguy_pc	i wanted to get posts before users deleted them which some gonewild posters were apt to do
03:06 ^🔗	zenguy_pc	can i use the same google reader upload for reddit upload
03:06 ^🔗	ivan`	I don't know what that means
03:06 ^🔗	zenguy_pc	nevermind .. you answered the question
03:07 ^🔗	zenguy_pc	you'll index reddit through google
03:08 ^🔗	ivan`	I'm not even really interested in new data, just old data, but Reader is going to hit the site if it doesn't have the feed in its cache already
03:09 ^🔗	zenguy_pc	ah good idea.
03:09 ^🔗	zenguy_pc	i saw reddit's own backup blog post and it was a lot of data
03:10 ^🔗	zenguy_pc	wish they had proper search.. so much content/gems is lost in all that data without proper search
03:15 ^🔗	zenguy_pc	has anyone ever used this http://buzz.sourceforge.net/
03:18 ^🔗	zenguy_pc	i could never get it to work
03:55 ^🔗	SketchCow	http://www.flickr.com/photos/textfiles/sets/72157633722203885/
03:59 ^🔗	BlueMax	"The Lucky Byte" huh
04:10 ^🔗	S[h]O[r]T	thats awesome
04:13 ^🔗	S[h]O[r]T	i didnt realize IA was a christian organization
04:18 ^🔗	S[h]O[r]T	what are the tv tuners recording, anything special 24/7?
05:44 ^🔗	ivan`	what the heck is the password for http://areca.co/8/Feed-Item-Dataset-TUDCS5
05:45 ^🔗	ivan`	I have tried quite a few
05:59 ^🔗	ivan`	hopefully the author knows and will reply
08:40 ^🔗	Smiley	S[h]O[r]T: It's not afaik, it's in an old "scientific christian" church tho from what that movie said yesterday.
08:40 ^🔗	Smiley	Maaan, yahoo have basically killed flickr already, it's so slow :/
08:41 ^🔗	BlueMax	it wasn't dead when Yahoo took control of it in the first place?
08:42 ^🔗	Smiley	No.
08:42 ^🔗	Smiley	It was quick at least.
08:44 ^🔗	ivan`	http://www.archiveteam.org/index.php?title=Google_Reader please edit if you know of more blog platforms
08:46 ^🔗	ivan`	hm, I should go through my feed URLs and discover some myself
08:49 ^🔗	godane	so i'm finding more techtv stuff
08:50 ^🔗	godane	i didn't do a full scan of the 21000s yet
08:50 ^🔗	godane	i also only the 21800s video ids
08:50 ^🔗	godane	*only did
09:14 ^🔗	godane	i'm uploading 2nd season of secret life of machines
09:15 ^🔗	antomatic	S[h]O[r]T: I think they archive TV News and stuff.
09:15 ^🔗	antomatic	(indexed via closed captioning data - very cool)
09:17 ^🔗	Smiley	"It's reported that Yahoo has formally put in a bid to buy Hulu only a week after adding Tumblr to the family."
09:17 ^🔗	Smiley	Really yahoo? REALLY?
09:17 ^🔗	Smiley	BUY ALL THE THINGS!
09:24 ^🔗	BlueMax	christ they don't have the cash to buy Hulu surely
09:30 ^🔗	godane	so i got 45k videos uploaded to my g4video-web collection
10:22 ^🔗	godane	so is anyone willing to give me money to buy a new hard drive?
11:55 ^🔗	ivan`	The following text is what triggered our spam filter: http://USERNAME.tumblr.com
11:55 ^🔗	ivan`	The text you wanted to save was blocked by the spam filter. This is probably caused by a link to a blacklisted external site.
12:53 ^🔗	edoc	jux.com is closing.
12:59 ^🔗	ivan`	google finds 30 blogs on jux.com
12:59 ^🔗	ivan`	okay actually a lot more when I click that redundant link
15:05 ^🔗	omf_	Is anyone else building applications using the Internet Archive's APIs?
15:05 ^🔗	omf_	I wrote up a bunch of documentation but I could be missing something since the IA lacks developer documentation
15:36 ^🔗	Smiley	ivan`: # of results?
15:43 ^🔗	WiK	afternoon
15:48 ^🔗	omf_	WiK, will you take a pull request that expands your language coverage?
15:51 ^🔗	WiK	do what?
15:53 ^🔗	WiK	i understood the pull request part
15:55 ^🔗	omf_	I added a few lines to pullfromdb.sh to track more programming language files
16:04 ^🔗	WiK	nopaste your changes, let me take a look
16:05 ^🔗	WiK	meh, just do a pull request
16:05 ^🔗	WiK	ill take a look at it
20:48 ^🔗	soultcer	ivan`: tumblr is on the spam blacklist, since they are bad at filtering spam and the only way to report spam to them is by first signing up on tumblr - http://www.archiveteam.org/index.php?title=MediaWiki:Spam-blacklist
21:27 ^🔗	ivan`	Smiley: 760,000ish
21:27 ^🔗	ivan`	that was just with a site:
22:11 ^🔗	ivan`	soultcer: thanks, too bad
22:30 ^🔗	citruspi	Hey, could an admin PM me?
22:51 ^🔗	balrog	site:http://www3.telus.net/ stuff needs to be archived... dunno how much longer that will survive
23:10 ^🔗	citruspi	Hey, I'm looking to help with the Archive Team.
23:10 ^🔗	citruspi	I'm primarily a python programmer
23:10 ^🔗	citruspi	Is there anything I could do to help right now? (I'll hang around in the channel in the future)
23:11 ^🔗	BlueMax	citruspi, I'm sure someone can find some use for you :P
23:12 ^🔗	citruspi	Sweet, thanks :)
23:13 ^🔗	ivan`	citruspi: http://www.archiveteam.org/index.php?title=Google_Reader needs your help
23:13 ^🔗	ivan`	you can write crawlers to discover usernames on the services mentioned, or if you want to do some C, add gzip support to wget
23:14 ^🔗	ivan`	also a crawler to hit https://www.google.com/reader/directory/search with every keyword imaginable
23:14 ^🔗	citruspi	Thanks ivan`, I'll join the channel
23:14 ^🔗	ivan`	thanks!
23:14 ^🔗	*	citruspi join #donereading
23:15 ^🔗	citruspi	yeah, didn't mean to add the /me…
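
A rough sketch of the keyword crawler ivan` asks for at 23:14, assuming only the search URL quoted in the log (https://www.google.com/reader/directory/search?q=keyword-here). The function name and the sample keywords are made up for illustration, and the sketch only builds the URLs; fetching, rate limiting, and parsing the results are left out because the log says nothing about them.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the log; everything else here is an assumption.
DIRECTORY_SEARCH = "https://www.google.com/reader/directory/search"


def directory_search_urls(keywords):
    """Return one directory-search URL per non-empty keyword, in order."""
    urls = []
    seen = set()
    for keyword in keywords:
        keyword = keyword.strip()
        # Skip blanks and repeats so the same query is not queued twice.
        if not keyword or keyword in seen:
            continue
        seen.add(keyword)
        # urlencode percent-escapes the keyword and turns spaces into "+".
        urls.append(DIRECTORY_SEARCH + "?" + urlencode({"q": keyword}))
    return urls


if __name__ == "__main__":
    for url in directory_search_urls(["music blog", "tumblr", "music blog"]):
        print(url)
```

The resulting list could be fed to whatever generates the work items; deduplicating up front keeps repeated keywords from producing repeated requests.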
