#archiveteam-bs 2012-12-09,Sun


Time	Nickname	Message
00:02 ^🔗	SketchCow	Just checked - probably 3 hours behind now.
00:02 ^🔗	chronomex	holy moly
00:03 ^🔗	chronomex	I always feel smug when I upload faster than my items derive
00:09 ^🔗	SketchCow	The derive queue broke overnight
00:09 ^🔗	SketchCow	So they're dealing with it now.
00:13 ^🔗	SketchCow	http://archive.org/details/officialdocuments_uk_9780119898545
00:13 ^🔗	SketchCow	This was my thing - I wrote some hairy grep and sed and was able to extract the metadata out of the page.
00:13 ^🔗	SketchCow	I have to do it by the page, but that's still very fast and it's probably less than a thousand documents.
00:22 ^🔗	chronomex	aye
02:30 ^🔗	DFJustin	yay found a bunch more fps addon level cds
03:19 ^🔗	SketchCow	The queue for deriving is STILL backed up - my DNA Lounge upload from roughly 14 hours ago was finally derived, but the uploads from the late afternoon are at 6.5 hours and counting.
03:19 ^🔗	SketchCow	So yeah.
03:19 ^🔗	SketchCow	Turns out that takes a while to kill.
03:19 ^🔗	SketchCow	Also, I ripped an audio from a BBC iPlayer show and I don't care who knows
03:19 ^🔗	SketchCow	turn me in!
03:55 ^🔗	godane	SketchCow: looks like archive.org didn't eat my data: http://archive.org/details/GBTV_REAL_NEWS_02_16_2012
03:55 ^🔗	godane	i just set up
04:10 ^🔗	SketchCow	Right.
04:10 ^🔗	SketchCow	Just let it go - it's taking a long time to catch up.
04:11 ^🔗	SketchCow	A LONG time. They went 16 hours with no deriving activity, and they have a bunch of stuff blowing in they need to deal with.
04:11 ^🔗	SketchCow	TV for example.
04:11 ^🔗	tuankiet	If you can, please run this Yahoo Blog discover https://github.com/tuankiet65/yahoo-blog-archive/wiki/How-to-run
04:19 ^🔗	DFJustin	yeah the tv backlog looks insane
04:26 ^🔗	SketchCow	It really is.
04:26 ^🔗	SketchCow	TV is a hell of a check that IA has to cash now
04:28 ^🔗	underscor	http://archive.org/~hank/derive-wait.php ouch D:
05:23 ^🔗	tuankiet	@all: I have a web application and I want to upload to archive.org. What should I do?
05:29 ^🔗	GLaDOS	tuankiet: the web application being?
05:30 ^🔗	tuankiet	It's dead.
05:30 ^🔗	tuankiet	you can search for eyeOS
05:31 ^🔗	GLaDOS	You mean http://www.eyeos.com/?
05:41 ^🔗	tuankiet	yes, at first they were open source but after that they deleted the open source files and replaced them with closed source
06:07 ^🔗	GLaDOS	damnit quassel
06:08 ^🔗	GLaDOS	tuankiet: Upload it in a zipfile. I believe archive.org supports zipfiles for indexing.
06:08 ^🔗	GLaDOS	Don't take my word for it, though. I'm a moron when it comes to it.
07:05 ^🔗	chronomex	yes, IA can deal with zipfiles
07:37 ^🔗	godane	SketchCow: this should be in shareware collection: http://archive.org/details/Capcom_E3_2002_Press_CD
07:38 ^🔗	godane	i just thought i remind you cause this is in the shareware collection: http://archive.org/details/Capcom_E3_2001_Press_CD
08:27 ^🔗	SketchCow	Fixed
08:37 ^🔗	godane	uploaded: http://archive.org/details/G4.Comic-Con.2011.Live.HDTV.XviD-MOMENTUM
10:09 ^🔗	godane	the best part of vimeo is the original video that was uploaded can be downloaded
10:55 ^🔗	DFJustin	each full CD of text information can save as many as 15 mature trees https://archive.org/download/cdrom-aztech-hec4/hec4back.png
10:59 ^🔗	DFJustin	by that standard sketchcow is officially the lorax
11:53 ^🔗	godane	uploaded: http://archive.org/details/floss_weekly_2009
11:57 ^🔗	chronomex	but how many trees does it take to print out a video game?
11:57 ^🔗	chronomex	more to the point, how does one print a video game?
12:00 ^🔗	GLaDOS	3D printing, make the levels, create robotics for NPCs and environment changes, and somehow create a respawn system if for some reason you did a shooting game/game involving murder
12:01 ^🔗	chronomex	can kinkos do it yet?
12:01 ^🔗	GLaDOS	No idea.
12:01 ^🔗	GLaDOS	Should be easy to add support, though.
12:02 ^🔗	chronomex	or is that kind of a 3Q 2013 sort of problem
12:02 ^🔗	GLaDOS	Respawning, possibly.
12:04 ^🔗	godane	i may have screwed up a name of item
12:05 ^🔗	godane	i put as www.engadget-articles-2004-mirror when it should be www.engadget.com-articles-2004-mirror
12:08 ^🔗	godane	please fix the name: http://archive.org/details/www.engadget-articles-2004-mirror
12:49 ^🔗	ersi	Who was it that was running a scraper for robots.txt on sites?
13:21 ^🔗	Schbirid	me
13:21 ^🔗	Schbirid	ersi
13:50 ^🔗	godane	i'm grabbing blackhat.com
13:51 ^🔗	godane	the mp4 files are not being grabbed so it doesn't take me forever to download
13:51 ^🔗	Schbirid	SketchCow is taking care of those (recorded talks) iirc
13:51 ^🔗	godane	ok then
13:52 ^🔗	godane	but this way we at least have the site, minus the videos
13:54 ^🔗	Schbirid	nothing says competence like a CDN provider serving some zip file as http://bitgravity.com/robots.txt and that file not being a standard zip file (at least i fail to uncompress it)
13:54 ^🔗	Schbirid	inside seems to be a text document
13:57 ^🔗	ersi	Schbirid: Ah, cool cool.
13:57 ^🔗	SmileyG	patrick moore :/
13:57 ^🔗	ersi	Schbirid: What sites were you crawling?
13:58 ^🔗	Schbirid	top 10000 from the alexa toplist
13:58 ^🔗	Schbirid	https://github.com/ArchiveTeam/robots-relapse
14:00 ^🔗	ersi	Cool, thanks - I'll take a look at it :)
14:01 ^🔗	ersi	Whoa, mostly just bash
14:01 ^🔗	godane	i may be able to do io9 warc.gz at some point: http://io9.com/search/?display=all&sorting=date&q=Search&page=50
14:01 ^🔗	Schbirid	i still havent uploaded them anywhere, if you want them, just shout. ~1G i think
14:01 ^🔗	Schbirid	ersi: you better get some beer now to make your brain not implode at the hackiness
14:02 ^🔗	ersi	I was just curious, mostly what URL's you were crawling :)
14:02 ^🔗	Schbirid	actually, i am not storing them in sqlite anymore
14:02 ^🔗	Schbirid	that changes daily ;)
14:02 ^🔗	ersi	The URL's?
14:03 ^🔗	Schbirid	i fetch the toplist before each run and use the top 10k from it
14:03 ^🔗	Schbirid	so pages might appear and vanish
14:03 ^🔗	ersi	Ah, well yeah
14:03 ^🔗	Schbirid	now that i think about it, this is terribly stupid
14:03 ^🔗	ersi	"start at Alexa top 10k" is sufficiently exact for me
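A minimal sketch of the "fetch the toplist before each run and use the top 10k" step described above. It assumes the list is plain `rank,domain` CSV, like the `999105,...` and `9540,...` lines quoted in this log. The function names `top_domains` and `robots_url` are hypothetical, and this is not the robots-relapse code itself (which is mostly bash).

```python
import csv
import io


def top_domains(csv_text, limit=10_000):
    """Return the first `limit` domains from 'rank,domain' CSV text, in rank order."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or malformed lines instead of failing the whole run.
        if len(row) != 2 or not row[0].strip().isdigit():
            continue
        rows.append((int(row[0]), row[1].strip().lower()))
    rows.sort()  # sort by rank in case the list is not already ordered
    return [domain for _, domain in rows[:limit]]


def robots_url(domain):
    """Build the robots.txt URL for a domain (assumes plain http)."""
    return f"http://{domain}/robots.txt"


if __name__ == "__main__":
    sample = "1,example.com\n2,example.org\n3,example.net\n"
    for domain in top_domains(sample, limit=2):
        print(robots_url(domain))
```

Because the list is re-read on every run, the set of domains can differ from day to day, which is the "pages might appear and vanish" effect mentioned above.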
14:04 ^🔗	ersi	I'm thinking/I've started collecting URLs in general
14:04 ^🔗	*	Schbirid writes an infinite URL generator
14:06 ^🔗	ersi	Well, I'm only interested in URLs that lead to content
14:12 ^🔗	Schbirid	nice http://66dofan.com/robots.txt
14:12 ^🔗	ersi	lol, there's a lot of.. interesting URLs in alexa top 1m
14:12 ^🔗	ersi	999105,seehorsepenis.com
14:12 ^🔗	ersi	for example.. wtf
14:13 ^🔗	Schbirid	adobe.com recently added Disallow: /*.sql$ to theirs, hmmmmmmm
14:13 ^🔗	Schbirid	lol
14:13 ^🔗	ersi	haha, great
14:13 ^🔗	Schbirid	i really want to make some nice site showing recent changes but things like that scare me
14:14 ^🔗	ersi	recent changes in the sites you've crawled?
14:14 ^🔗	Schbirid	https://encrypted.google.com/search?hl=en&q=site:adobe.com+filetype:sql
14:14 ^🔗	Schbirid	in their robots.txt files
14:14 ^🔗	ersi	ah, well yeah
14:14 ^🔗	ersi	"Your search - site:adobe.com+filetype:sql - did not match any documents." :(
14:15 ^🔗	Schbirid	i got one but on closer look it is generic for some software setup
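One rough way to surface "recent changes" in robots.txt files, such as the new `Disallow: /*.sql$` rule mentioned above, is to diff two saved snapshots of the same file. The sketch below uses Python's standard `difflib`; the function name `robots_changes` and the sample snapshots are made up for illustration and are not taken from the actual crawl.

```python
import difflib


def robots_changes(old_text, new_text):
    """Return (added, removed) non-blank lines between two robots.txt snapshots."""
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        # ndiff prefixes: "+ " only in new, "- " only in old, "  " common, "? " hints
        if line.startswith("+ ") and line[2:].strip():
            added.append(line[2:])
        elif line.startswith("- ") and line[2:].strip():
            removed.append(line[2:])
    return added, removed


if __name__ == "__main__":
    # Hypothetical snapshots from two consecutive runs.
    old = "User-agent: *\nDisallow: /private/\n"
    new = "User-agent: *\nDisallow: /private/\nDisallow: /*.sql$\n"
    added, removed = robots_changes(old, new)
    print("added:", added)
    print("removed:", removed)
```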
14:15 ^🔗	soultcer	ersi: Writing a web crawler, are we?
14:15 ^🔗	ersi	How often does Alexa release their top lists? :o
14:16 ^🔗	ersi	soultcer: Hehe, been wanting to for ages.. I started on a very basic one
14:16 ^🔗	Schbirid	daily
14:17 ^🔗	soultcer	Cool. What does it do? Feeding a search engine? Archiving? ...
14:17 ^🔗	Schbirid	i save those too if you want history ;)
14:18 ^🔗	ersi	nothing yet :p Prints out all anchor hrefs
14:20 ^🔗	ersi	But when doing it, I started thinking about where one would get seeds.. and I thought of a few things; Start crawling my RSS feeds and follow links, watch IRC channels for links, unshorten urls (Urlteam, fuck yeah!), hook into yacy (p2p search engine), go through browser history occasionally
14:21 ^🔗	soultcer	Wikipedia releases a dump of its link table every couple of months
14:21 ^🔗	soultcer	Also pretty useful: Use a wordlist from password cracking and simply append .com or .net
14:21 ^🔗	ersi	Yeah
14:22 ^🔗	ersi	Also good sources :)
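The wordlist idea above (take a password-cracking wordlist and append .com or .net) can be sketched in a few lines. This is only an illustration: `candidate_domains` is a hypothetical helper, the sample words are invented, and the resulting names are guesses that would still need a DNS or HTTP check before being used as crawl seeds.

```python
import re

# A single DNS label: letters, digits and inner hyphens, at most 63 characters.
LABEL = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$")


def candidate_domains(words, tlds=(".com", ".net")):
    """Yield unique candidate domains built from wordlist entries plus each TLD."""
    seen = set()
    for word in words:
        label = word.strip().lower()
        if not LABEL.match(label):
            continue  # skip entries that cannot be a hostname label
        for tld in tlds:
            domain = label + tld
            if domain not in seen:
                seen.add(domain)
                yield domain


if __name__ == "__main__":
    for domain in candidate_domains(["Dragon", "pass word", "dragon"]):
        print(domain)
```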
15:33 ^🔗	ersi	9540,196.1.211.6
15:33 ^🔗	ersi	lol, Alexa top-1m is pretty funny
15:33 ^🔗	ersi	that's a good top site
15:59 ^🔗	Schbirid	looks like a syrian firewall http://196.1.211.6:8080/alert/
16:00 ^🔗	Schbirid	but doesnt it nicely show how stupid alexa is? (sorry brewster)
16:00 ^🔗	Schbirid	err, sudan, not syrian
16:30 ^🔗	ersi	I dunno why Alexa has been such a big deal
16:32 ^🔗	Schbirid	they were early and gave people stats/toplist
17:23 ^🔗	Schbirid	anyone here using vnstat? any idea how i can make it output data from more than a year ago?
17:42 ^🔗	SketchCow	Hi, hello, I am the lorax.
21:02 ^🔗	DFJustin	http://economistsview.typepad.com/economistsview/2012/12/gop-fires-author-of-copyright-reform-paper.html
