[00:02] Just checked - probably 3 hours behind now.
[00:02] holy moly
[00:03] I always feel smug when I upload faster than my items derive
[00:09] The derive queue broke overnight
[00:09] So they're dealing with it now.
[00:13] http://archive.org/details/officialdocuments_uk_9780119898545
[00:13] This was my thing - I wrote some hairy grep and sed and was able to extract the metadata out of the page.
[00:13] I have to do it page by page, but that's still very fast and it's probably less than a thousand documents.
[00:22] aye
[02:30] yay found a bunch more fps addon level cds
[03:19] The queue for deriving is STILL backed up - my DNA Lounge upload from roughly 14 hours ago was finally derived, but the uploads from the late afternoon are at 6.5 hours and counting.
[03:19] So yeah.
[03:19] Turns out that takes a while to kill.
[03:19] Also, I ripped audio from a BBC iPlayer show and I don't care who knows
[03:19] turn me in!
[03:55] SketchCow: looks like archive.org didn't eat my data: http://archive.org/details/GBTV_REAL_NEWS_02_16_2012
[03:55] i just set up
[04:10] Right.
[04:10] Just let it go - it's taking a long time to catch up.
[04:11] A LONG time. They went 16 hours with no deriving activity, and they have a bunch of stuff blowing in they need to deal with.
[04:11] TV, for example.
[04:11] If you can, please run this Yahoo Blog discovery: https://github.com/tuankiet65/yahoo-blog-archive/wiki/How-to-run
[04:19] yeah the tv backlog looks insane
[04:26] It really is.
[04:26] TV is a hell of a check that IA has to cash now
[04:28] http://archive.org/~hank/derive-wait.php ouch D:
[05:23] @all: I have a web application I want to upload to archive.org. What should I do?
[05:29] tuankiet: the web application being?
[05:30] It's dead.
[05:30] you can search for eyeOS
[05:31] You mean http://www.eyeos.com/?
[05:41] yes, at first they were open source, but then they deleted the open source files and replaced them with closed source
[06:07] damnit quassel
[06:08] tuankiet: Upload it in a zipfile. I believe archive.org supports zipfiles for indexing.
[06:08] Don't take my word for it, though. I'm a moron when it comes to it.
[07:05] yes, IA can deal with zipfiles
[07:37] SketchCow: this should be in the shareware collection: http://archive.org/details/Capcom_E3_2002_Press_CD
[07:38] i just thought i'd remind you, because this one is in the shareware collection: http://archive.org/details/Capcom_E3_2001_Press_CD
[08:27] Fixed
[08:37] uploaded: http://archive.org/details/G4.Comic-Con.2011.Live.HDTV.XviD-MOMENTUM
[10:09] the best part of vimeo is that the original video that was uploaded can be downloaded
[10:55] each full CD of text information can save as many as 15 mature trees https://archive.org/download/cdrom-aztech-hec4/hec4back.png
[10:59] by that standard sketchcow is officially the lorax
[11:53] uploaded: http://archive.org/details/floss_weekly_2009
[11:57] but how many trees does it take to print out a video game?
[11:57] more to the point, how does one print a video game?
[12:00] 3D printing: make the levels, create robotics for NPCs and environment changes, and somehow create a respawn system if for some reason you did a shooting game/game involving murder
[12:01] can kinkos do it yet?
[12:01] No idea.
[12:01] Should be easy to add support, though.
[12:02] or is that kind of a 3Q 2013 sort of problem
[12:02] Respawning, possibly.
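
A rough idea of the "hairy grep and sed" metadata extraction mentioned at [00:13] - a minimal sketch only, since the actual script wasn't posted; the details-page URL pattern and the meta tag field names here are assumptions:

  #!/bin/bash
  # Sketch: scrape metadata out of an archive.org details page, one page
  # at a time. The <meta> tag structure and field names are assumptions;
  # the real "hairy grep and sed" was not shared in the channel.
  for id in officialdocuments_uk_9780119898545; do
    curl -s "http://archive.org/details/$id" |
      grep -oE '<meta name="(title|date|creator)" content="[^"]*"' |
      sed -E 's/<meta name="([^"]+)" content="([^"]*)"/\1: \2/'
  done
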
[12:04] i may have screwed up the name of an item
[12:05] i put it as www.engadget-articles-2004-mirror when it should be www.engadget.com-articles-2004-mirror
[12:08] please fix the name: http://archive.org/details/www.engadget-articles-2004-mirror
[12:49] Who was it that was running a scraper for robots.txt on sites?
[13:21] me
[13:21] ersi
[13:50] i'm grabbing blackhat.com
[13:51] the mp4 files are not being grabbed, so it doesn't take me forever to download
[13:51] SketchCow is taking care of those (recorded talks) iirc
[13:51] ok then
[13:52] but this way we at least have the site, minus the videos
[13:54] nothing says competence like a CDN provider serving some zip file as http://bitgravity.com/robots.txt, and that file not being a standard zip file (at least i fail to uncompress it)
[13:54] inside seems to be a text document
[13:57] Schbirid: Ah, cool cool.
[13:57] patrick moore :/
[13:57] Schbirid: What sites were you crawling?
[13:58] top 10000 from the alexa toplist
[13:58] https://github.com/ArchiveTeam/robots-relapse
[14:00] Cool, thanks - I'll take a look at it :)
[14:01] Whoa, mostly just bash
[14:01] i may be able to do an io9 warc.gz at some point: http://io9.com/search/?display=all&sorting=date&q=Search&page=50
[14:01] i still haven't uploaded them anywhere, if you want them, just shout. ~1G i think
[14:01] ersi: you better get some beer now so your brain doesn't implode at the hackiness
[14:02] I was just curious, mostly what URLs you were crawling :)
[14:02] actually, i am not storing them in sqlite anymore
[14:02] that changes daily ;)
[14:02] The URLs?
[14:03] i fetch the toplist before each run and use the top 10k from it
[14:03] so pages might appear and vanish
[14:03] Ah, well yeah
[14:03] now that i think about it, this is terribly stupid
[14:03] "start at Alexa top 10k" is sufficiently exact for me
[14:04] I'm thinking about it / I've started collecting URLs in general
[14:04] * Schbirid writes an infinite URL generator
[14:06] Well, I'm only interested in URLs that lead to content
[14:12] nice http://66dofan.com/robots.txt
[14:12] lol, there's a lot of.. interesting URLs in the alexa top 1m
[14:12] 999105,seehorsepenis.com
[14:12] for example.. wtf
[14:13] adobe.com recently added Disallow: /*.sql$ to theirs, hmmmmmmm
[14:13] lol
[14:13] haha, great
[14:13] i really want to make some nice site showing recent changes, but things like that scare me
[14:14] recent changes in the sites you've crawled?
[14:14] https://encrypted.google.com/search?hl=en&q=site:adobe.com+filetype:sql
[14:14] in their robots.txt files
[14:14] ah, well yeah
[14:14] "Your search - site:adobe.com+filetype:sql - did not match any documents." :(
[14:15] i got one, but on closer look it is generic for some software setup
[14:15] ersi: Writing a web crawler, are we?
[14:15] How often does Alexa release their top lists? :o
[14:16] soultcer: Hehe, been wanting to for ages.. I started on a very basic one
[14:16] daily
[14:17] Cool. What does it do? Feeding a search engine? Archiving? ...
[14:17] i save those too if you want history ;)
[14:18] nothing yet :p Prints out all anchor hrefs
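
The daily robots-relapse run described above - fetch the day's Alexa toplist before each run, keep the top 10k, and grab each site's robots.txt - might look roughly like this. The toplist URL and the output layout are assumptions, not the actual repo's code:

  #!/bin/bash
  # Sketch of a daily robots.txt snapshot run: download the current
  # Alexa top-1m list, keep the top 10k domains, fetch each robots.txt.
  # The toplist URL and file layout are assumptions, not robots-relapse itself.
  day=$(date +%F)
  mkdir -p "robots/$day"
  curl -s http://s3.amazonaws.com/alexa-static/top-1m.csv.zip -o top-1m.csv.zip
  unzip -p top-1m.csv.zip | head -n 10000 | cut -d, -f2 > top10k.txt
  while read -r domain; do
    curl -s --max-time 10 "http://$domain/robots.txt" -o "robots/$day/$domain.robots.txt"
  done < top10k.txt
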
[14:20] But when doing it, I started thinking about where one would get seeds.. and I thought of a few things: start crawling my RSS feeds and follow links, watch IRC channels for links, unshorten urls (Urlteam, fuck yeah!), hook into yacy (p2p search engine), go through browser history occasionally
[14:21] Wikipedia releases a dump of its link table every couple of months
[14:21] Also pretty useful: use a wordlist from password cracking and simply append .com or .net
[14:21] Yeah
[14:22] Also good sources :)
[15:33] 9540,196.1.211.6
[15:33] lol, the Alexa top-1m is pretty funny
[15:33] that's a good top site
[15:59] looks like a syrian firewall http://196.1.211.6:8080/alert/
[16:00] but doesn't it nicely show how stupid alexa is? (sorry brewster)
[16:00] err, sudan, not syria
[16:30] I dunno why Alexa has been such a big deal
[16:32] they were early and gave people stats/toplists
[17:23] anyone here using vnstat? any idea how i can make it output data from more than a year ago?
[17:42] Hi, hello, I am the lorax.
[21:02] http://economistsview.typepad.com/economistsview/2012/12/gop-fires-author-of-copyright-reform-paper.html
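
The seed-generation trick from [14:21] - take a password-cracking wordlist and append a TLD to each entry - fits in a few lines; the wordlist path here is an assumption:

  #!/bin/bash
  # Sketch of the wordlist-to-domains seed idea: append .com and .net to
  # every entry of a cracking wordlist to get candidate crawl seeds.
  # The wordlist path is an assumption.
  while read -r word; do
    for tld in com net; do
      echo "http://$word.$tld/"
    done
  done < /usr/share/wordlists/rockyou.txt
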