#archiveteam-bs 2012-12-09,Sun

Time Nickname Message
00:02 🔗 SketchCow Just checked - probably 3 hours behind now.
00:02 🔗 chronomex holy moly
00:03 🔗 chronomex I always feel smug when I upload faster than my items derive
00:09 🔗 SketchCow The derive queue broke overnight
00:09 🔗 SketchCow So they're dealing with it now.
00:13 🔗 SketchCow http://archive.org/details/officialdocuments_uk_9780119898545
00:13 🔗 SketchCow This was my thing - I wrote some hairy grep and sed and was able to extract the metadata out of the page.
00:13 🔗 SketchCow I have to do it by the page, but that's still very fast and it's probably less than a thousand documents.
00:22 🔗 chronomex aye
02:30 🔗 DFJustin yay found a bunch more fps addon level cds
03:19 🔗 SketchCow The queue for deriving is STILL backed up - my DNA Lounge upload from roughly 14 hours ago was finally derived, but the uploads from the late afternoon are at 6.5 hours and counting.
03:19 🔗 SketchCow So yeah.
03:19 🔗 SketchCow Turns out that takes a while to kill.
03:19 🔗 SketchCow Also, I ripped an audio from a BBC iPlayer show and I don't care who knows
03:19 🔗 SketchCow turn me in!
03:55 🔗 godane SketchCow: looks like archive.org didn't eat my data: http://archive.org/details/GBTV_REAL_NEWS_02_16_2012
03:55 🔗 godane i just set up
04:10 🔗 SketchCow Right.
04:10 🔗 SketchCow Just let it go - it's taking a long time to catch up.
04:11 🔗 SketchCow A LONG time. They went 16 hours with no deriving activity, and they have a bunch of stuff blowing in they need to deal with.
04:11 🔗 SketchCow TV for example.
04:11 🔗 tuankiet If you can, please run this Yahoo Blog discover https://github.com/tuankiet65/yahoo-blog-archive/wiki/How-to-run
04:19 🔗 DFJustin yeah the tv backlog looks insane
04:26 🔗 SketchCow It really is.
04:26 🔗 SketchCow TV is a hell of a check that IA has to cash now
04:28 🔗 underscor http://archive.org/~hank/derive-wait.php ouch D:
05:23 🔗 tuankiet @all: I have a web application and I want to upload to archive.org. What should I do?
05:29 🔗 GLaDOS tuankiet: the web application being?
05:30 🔗 tuankiet It's dead.
05:30 🔗 tuankiet you can search for eyeOS
05:31 🔗 GLaDOS You mean http://www.eyeos.com/?
05:41 🔗 tuankiet yes, at first it was open source, but after that they deleted the open source files and replaced them with closed source
06:07 🔗 GLaDOS damnit quassel
06:08 🔗 GLaDOS tuankiet: Upload it in a zipfile. I believe archive.org supports zipfiles for indexing.
06:08 🔗 GLaDOS Don't take my word for it, though. I'm a moron when it comes to it.
07:05 🔗 chronomex yes, IA can deal with zipfiles
07:37 🔗 godane SketchCow: this should be in shareware collection: http://archive.org/details/Capcom_E3_2002_Press_CD
07:38 🔗 godane i just thought i'd remind you cause this is in the shareware collection: http://archive.org/details/Capcom_E3_2001_Press_CD
08:27 🔗 SketchCow Fixed
08:37 🔗 godane uploaded: http://archive.org/details/G4.Comic-Con.2011.Live.HDTV.XviD-MOMENTUM
10:09 🔗 godane the best part of vimeo is the original video that was uploaded can be downloaded
10:55 🔗 DFJustin each full CD of text information can save as many as 15 mature trees https://archive.org/download/cdrom-aztech-hec4/hec4back.png
10:59 🔗 DFJustin by that standard sketchcow is officially the lorax
11:53 🔗 godane uploaded: http://archive.org/details/floss_weekly_2009
11:57 🔗 chronomex but how many trees does it take to print out a video game?
11:57 🔗 chronomex more to the point, how does one print a video game?
12:00 🔗 GLaDOS 3D printing, make the levels, create robotics for NPCs and environment changes, and somehow create a respawn system if for some reason you did a shooting game/game involving murder
12:01 🔗 chronomex can kinkos do it yet?
12:01 🔗 GLaDOS No idea.
12:01 🔗 GLaDOS Should be easy to add support, though.
12:02 🔗 chronomex or is that kind of a 3Q 2013 sort of problem
12:02 🔗 GLaDOS Respawning, possibly.
12:04 🔗 godane i may have screwed up a name of item
12:05 🔗 godane i put as www.engadget-articles-2004-mirror when it should be www.engadget.com-articles-2004-mirror
12:08 🔗 godane please fix the name: http://archive.org/details/www.engadget-articles-2004-mirror
12:49 🔗 ersi Who was it that was running a scraper for robots.txt on sites?
13:21 🔗 Schbirid me
13:21 🔗 Schbirid ersi
13:50 🔗 godane i'm grabbing blackhat.com
13:51 🔗 godane the mp4 files are not being grabbed so it doesn't take me forever to download
13:51 🔗 Schbirid SketchCow is taking care of those (recorded talks) iirc
13:51 🔗 godane ok then
13:52 🔗 godane but this way we at least have the site, minus the videos
13:54 🔗 Schbirid nothing says competence like a CDN provider serving some zip file as http://bitgravity.com/robots.txt and that file not being a standard zip file (at least i fail to uncompress it)
13:54 🔗 Schbirid inside seems to be a text document
13:57 🔗 ersi Schbirid: Ah, cool cool.
13:57 🔗 SmileyG patrick moore :/
13:57 🔗 ersi Schbirid: What sites were you crawling?
13:58 🔗 Schbirid top 10000 from the alexa toplist
13:58 🔗 Schbirid https://github.com/ArchiveTeam/robots-relapse
14:00 🔗 ersi Cool, thanks - I'll take a look at it :)
14:01 🔗 ersi Whoa, mostly just bash
14:01 🔗 godane i may be able to do io9 warc.gz at some point: http://io9.com/search/?display=all&sorting=date&q=Search&page=50
14:01 🔗 Schbirid i still havent uploaded them anywhere, if you want them, just shout. ~1G i think
14:01 🔗 Schbirid ersi: you better get some beer now to make your brain not implode at the hackiness
14:02 🔗 ersi I was just curious, mostly what URL's you were crawling :)
14:02 🔗 Schbirid actually, i am not storing them in sqlite anymore
14:02 🔗 Schbirid that changes daily ;)
14:02 🔗 ersi The URL's?
14:03 🔗 Schbirid i fetch the toplist before each run and use the top 10k from it
14:03 🔗 Schbirid so pages might appear and vanish
14:03 🔗 ersi Ah, well yeah
14:03 🔗 Schbirid now that i think about it, this is terribly stupid
14:03 🔗 ersi "start at Alexa top 10k" is sufficiently exact for me
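The "fetch the toplist before each run and use the top 10k" step, and the way sites appear and vanish between runs, could be sketched roughly as below. This is only an illustration, not the robots-relapse code: it assumes the toplist is plain `rank,domain` CSV text (the shape of the entries quoted in this channel), and the names `top_domains` and `churn` are made up for the example.

```python
import csv
import io


def top_domains(csv_text, limit=10000):
    """Return the first `limit` domains from 'rank,domain' CSV text, ordered by rank."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or malformed lines instead of failing the whole run.
        if len(row) != 2 or not row[0].strip().isdigit():
            continue
        rows.append((int(row[0]), row[1].strip().lower()))
    rows.sort()
    return [domain for _, domain in rows[:limit]]


def churn(previous, current):
    """Return (appeared, vanished) as two sets of domains between two runs."""
    prev, cur = set(previous), set(current)
    return cur - prev, prev - cur


if __name__ == "__main__":
    # Placeholder data; a real run would read each day's downloaded list.
    day1 = "1,example.com\n2,example.org\n3,example.net\n"
    day2 = "1,example.com\n2,example.net\n3,example.edu\n"
    appeared, vanished = churn(top_domains(day1, 2), top_domains(day2, 2))
    print("appeared:", sorted(appeared))
    print("vanished:", sorted(vanished))
```

Keeping each day's list around makes it possible to crawl the union of all domains ever seen, rather than only whatever is in the top 10k on a given day.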
14:04 🔗 ersi I'm thinking/I've started collecting URLs in general
14:04 🔗 * Schbirid writes an infinite URL generator
14:06 🔗 ersi Well, I'm only interested in URLs that lead to content
14:12 🔗 Schbirid nice http://66dofan.com/robots.txt
14:12 🔗 ersi lol, there's a lot of.. interesting URLs in alexa top 1m
14:12 🔗 ersi 999105,seehorsepenis.com
14:12 🔗 ersi for example.. wtf
14:13 🔗 Schbirid adobe.com recently added Disallow: /*.sql$ to theirs, hmmmmmmm
14:13 🔗 Schbirid lol
14:13 🔗 ersi haha, great
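For anyone wondering what a rule like `Disallow: /*.sql$` would cover, here is a small sketch. It assumes the wildcard semantics commonly used by the big crawlers, where `*` matches any run of characters and a trailing `$` anchors the end of the URL path; the function names are hypothetical and this is not how any particular crawler implements it.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_disallowed(path, pattern):
    """True if `path` matches the Disallow `pattern`, matching from the start of the path."""
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    rule = "/*.sql$"
    for path in ("/backup/dump.sql", "/dump.sql.gz", "/docs/sql/index.html"):
        print(path, is_disallowed(path, rule))
```

Under these assumptions the rule only covers paths that end in `.sql`, so `/dump.sql.gz` would still be allowed.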
14:13 🔗 Schbirid i really want to make some nice site showing recent changes but things like that scare me
14:14 🔗 ersi recent changes in the sites you've crawled?
14:14 🔗 Schbirid https://encrypted.google.com/search?hl=en&q=site:adobe.com+filetype:sql
14:14 🔗 Schbirid in their robots.txt files
14:14 🔗 ersi ah, well yeah
14:14 🔗 ersi "Your search - site:adobe.com+filetype:sql - did not match any documents." :(
14:15 🔗 Schbirid i got one but on closer look it is generic for some software setup
14:15 🔗 soultcer ersi: Writing a web crawler, are we?
14:15 🔗 ersi How often does Alexa release their top lists? :o
14:16 🔗 ersi soultcer: Hehe, been wanting to for ages.. I started on a very basic one
14:16 🔗 Schbirid daily
14:17 🔗 soultcer Cool. What does it do? Feeding a search engine? Archiving? ...
14:17 🔗 Schbirid i save those too if you want history ;)
14:18 🔗 ersi nothing yet :p Prints out all anchor hrefs
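A "prints out all anchor hrefs" starting point can be done with nothing but the Python standard library. This is a minimal sketch, not ersi's crawler: the class name `AnchorCollector`, the function `extract_links`, and the sample HTML are all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for all anchor hrefs found in `html`, in document order."""
    collector = AnchorCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    page = '<p><a href="/about">About</a> <a href="http://example.org/x">X</a></p>'
    for link in extract_links(page, "http://example.com/index.html"):
        print(link)
```

Resolving relative hrefs with `urljoin` at extraction time means the output can be fed straight back in as new URLs to fetch.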
14:20 🔗 ersi But when doing it, I started thinking about where one would get seeds.. and I thought of a few things; Start crawling my RSS feeds and follow links, watch IRC channels for links, unshorten urls (Urlteam, fuck yeah!), hook into yacy (p2p search engine), go through browser history occasionally
14:21 🔗 soultcer Wikipedia releases a dump of its link table every couple of months
14:21 🔗 soultcer Also pretty useful: Use a wordlist from password cracking and simply append .com or .net
14:21 🔗 ersi Yeah
14:22 🔗 ersi Also good sources :)
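The wordlist idea ("simply append .com or .net") might look something like this. It is a rough sketch with a hypothetical function name, `candidate_domains`; the filtering rules are an assumption about what counts as a plausible hostname label, and the candidates would still need a DNS or HTTP check before being used as seeds.

```python
def candidate_domains(words, tlds=(".com", ".net")):
    """Yield word+TLD candidates for every word that is a plausible hostname label."""
    seen = set()
    for word in words:
        label = word.strip().lower()
        # Hostname labels are 1-63 characters of ASCII letters, digits and hyphens.
        if not label or len(label) > 63:
            continue
        if not all(c.isascii() and (c.isalnum() or c == "-") for c in label):
            continue
        if label.startswith("-") or label.endswith("-"):
            continue
        if label in seen:
            continue  # wordlists often repeat entries in different case
        seen.add(label)
        for tld in tlds:
            yield label + tld


if __name__ == "__main__":
    sample = ["Password", "hunter2", "not valid", "-x", "password"]
    for domain in candidate_domains(sample):
        print(domain)
```

Writing it as a generator keeps memory use flat even with a multi-million-line wordlist.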
15:33 🔗 ersi 9540,196.1.211.6
15:33 🔗 ersi lol, Alexa top-1m is pretty funny
15:33 🔗 ersi that's a good top site
15:59 🔗 Schbirid looks like a syrian firewall http://196.1.211.6:8080/alert/
16:00 🔗 Schbirid but doesnt it nicely show how stupid alexa is? (sorry brewster)
16:00 🔗 Schbirid err, sudan, not syrian
16:30 🔗 ersi I dunno why Alexa has been such a big deal
16:32 🔗 Schbirid they were early and gave people stats/toplist
17:23 🔗 Schbirid anyone here using vnstat? any idea how i can make it output data from more than a year ago?
17:42 🔗 SketchCow Hi, hello, I am the lorax.
21:02 🔗 DFJustin http://economistsview.typepad.com/economistsview/2012/12/gop-fires-author-of-copyright-reform-paper.html