Time |
Nickname |
Message |
00:02
🔗
|
SketchCow |
Just checked - probably 3 hours behind now. |
00:02
🔗
|
chronomex |
holy moly |
00:03
🔗
|
chronomex |
I always feel smug when I upload faster than my items derive |
00:09
🔗
|
SketchCow |
The derive queue broke overnight |
00:09
🔗
|
SketchCow |
So they're dealing with it now. |
00:13
🔗
|
SketchCow |
http://archive.org/details/officialdocuments_uk_9780119898545 |
00:13
🔗
|
SketchCow |
This was my thing - I wrote some hairy grep and sed and was able to extract the metadata out of the page. |
00:13
🔗
|
SketchCow |
I have to do it by the page, but that's still very fast and it's probably less than a thousand documents. |
00:22
🔗
|
chronomex |
aye |
02:30
🔗
|
DFJustin |
yay found a bunch more fps addon level cds |
03:19
🔗
|
SketchCow |
The queue for deriving is STILL backed up - my DNA Lounge upload from roughly 14 hours ago was finally derived, but the uploads from the late afternoon are at 6.5 hours and counting. |
03:19
🔗
|
SketchCow |
So yeah. |
03:19
🔗
|
SketchCow |
Turns out that takes a while to kill. |
03:19
🔗
|
SketchCow |
Also, I ripped an audio from a BBC iPlayer show and I don't care who knows |
03:19
🔗
|
SketchCow |
turn me in! |
03:55
🔗
|
godane |
SketchCow: looks like archive.org didn't eat my data: http://archive.org/details/GBTV_REAL_NEWS_02_16_2012 |
03:55
🔗
|
godane |
i just set up |
04:10
🔗
|
SketchCow |
Right. |
04:10
🔗
|
SketchCow |
Just let it go - it's taking a long time to catch up. |
04:11
🔗
|
SketchCow |
A LONG time. They went 16 hours with no deriving activity, and they have a bunch of stuff blowing in they need to deal with. |
04:11
🔗
|
SketchCow |
TV for example. |
04:11
🔗
|
tuankiet |
If you can, please run this Yahoo Blog discover https://github.com/tuankiet65/yahoo-blog-archive/wiki/How-to-run |
04:19
🔗
|
DFJustin |
yeah the tv backlog looks insane |
04:26
🔗
|
SketchCow |
It really is. |
04:26
🔗
|
SketchCow |
TV is a hell of a check that IA has to cash now |
04:28
🔗
|
underscor |
http://archive.org/~hank/derive-wait.php ouch D: |
05:23
🔗
|
tuankiet |
@all: I have a web application and I want to upload to archive.org. What should I do? |
05:29
🔗
|
GLaDOS |
tuankiet: the web application being? |
05:30
🔗
|
tuankiet |
It's dead. |
05:30
🔗
|
tuankiet |
you can search for eyeOS |
05:31
🔗
|
GLaDOS |
You mean http://www.eyeos.com/? |
05:41
🔗
|
tuankiet |
yes, at first it was open source, but after that they deleted the open source files and replaced them with closed source |
06:07
🔗
|
GLaDOS |
damnit quassel |
06:08
🔗
|
GLaDOS |
tuankiet: Upload it in a zipfile. I believe archive.org supports zipfiles for indexing. |
06:08
🔗
|
GLaDOS |
Don't take my word for it, though. I'm a moron when it comes to it. |
07:05
🔗
|
chronomex |
yes, IA can deal with zipfiles |
07:37
🔗
|
godane |
SketchCow: this should be in shareware collection: http://archive.org/details/Capcom_E3_2002_Press_CD |
07:38
🔗
|
godane |
i just thought i'd remind you cause this is in the shareware collection: http://archive.org/details/Capcom_E3_2001_Press_CD |
08:27
🔗
|
SketchCow |
Fixed |
08:37
🔗
|
godane |
uploaded: http://archive.org/details/G4.Comic-Con.2011.Live.HDTV.XviD-MOMENTUM |
10:09
🔗
|
godane |
the best part of vimeo is the original video that was uploaded can be downloaded |
10:55
🔗
|
DFJustin |
each full CD of text information can save as many as 15 mature trees https://archive.org/download/cdrom-aztech-hec4/hec4back.png |
10:59
🔗
|
DFJustin |
by that standard sketchcow is officially the lorax |
11:53
🔗
|
godane |
uploaded: http://archive.org/details/floss_weekly_2009 |
11:57
🔗
|
chronomex |
but how many trees does it take to print out a video game? |
11:57
🔗
|
chronomex |
more to the point, how does one print a video game? |
12:00
🔗
|
GLaDOS |
3D printing, make the levels, create robotics for NPCs and environment changes, and somehow create a respawn system if for some reason you did a shooting game/game involving murder |
12:01
🔗
|
chronomex |
can kinkos do it yet? |
12:01
🔗
|
GLaDOS |
No idea. |
12:01
🔗
|
GLaDOS |
Should be easy to add support, though. |
12:02
🔗
|
chronomex |
or is that kind of a 3Q 2013 sort of problem |
12:02
🔗
|
GLaDOS |
Respawning, possibly. |
12:04
🔗
|
godane |
i may have screwed up the name of an item |
12:05
🔗
|
godane |
i put it as www.engadget-articles-2004-mirror when it should be www.engadget.com-articles-2004-mirror |
12:08
🔗
|
godane |
please fix the name: http://archive.org/details/www.engadget-articles-2004-mirror |
12:49
🔗
|
ersi |
Who was it that was running a scraper for robots.txt on sites? |
13:21
🔗
|
Schbirid |
me |
13:21
🔗
|
Schbirid |
ersi |
13:50
🔗
|
godane |
i'm grabbing blackhat.com |
13:51
🔗
|
godane |
the mp4 files are not being grabbed so it doesn't take me forever to download |
13:51
🔗
|
Schbirid |
SketchCow is taking care of those (recorded talks) iirc |
13:51
🔗
|
godane |
ok then |
13:52
🔗
|
godane |
but this way we at least have the site, minus the videos |
13:54
🔗
|
Schbirid |
nothing says competence like a CDN provider serving some zip file as http://bitgravity.com/robots.txt and that file not being a standard zip file (at least i fail to uncompress it) |
13:54
🔗
|
Schbirid |
inside seems to be a text document |
13:57
🔗
|
ersi |
Schbirid: Ah, cool cool. |
13:57
🔗
|
SmileyG |
patrick moore :/ |
13:57
🔗
|
ersi |
Schbirid: What sites were you crawling? |
13:58
🔗
|
Schbirid |
top 10000 from the alexa toplist |
13:58
🔗
|
Schbirid |
https://github.com/ArchiveTeam/robots-relapse |
14:00
🔗
|
ersi |
Cool, thanks - I'll take a look at it :) |
14:01
🔗
|
ersi |
Whoa, mostly just bash |
14:01
🔗
|
godane |
i may be able to do io9 warc.gz at some point: http://io9.com/search/?display=all&sorting=date&q=Search&page=50 |
14:01
🔗
|
Schbirid |
i still havent uploaded them anywhere, if you want them, just shout. ~1G i think |
14:01
🔗
|
Schbirid |
ersi: you better get some beer now to make your brain not implode at the hackiness |
14:02
🔗
|
ersi |
I was just curious, mostly what URLs you were crawling :) |
14:02
🔗
|
Schbirid |
actually, i am not storing them in sqlite anymore |
14:02
🔗
|
Schbirid |
that changes daily ;) |
14:02
🔗
|
ersi |
The URLs? |
14:03
🔗
|
Schbirid |
i fetch the toplist before each run and use the top 10k from it |
14:03
🔗
|
Schbirid |
so pages might appear and vanish |
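The approach Schbirid describes (fetch the toplist before each run, keep the top 10k) could look roughly like the sketch below. It assumes the list is plain `rank,domain` CSV text, the shape of the lines pasted elsewhere in this log; the function names `top_domains`, `robots_urls` and `churn` are made up for illustration and are not taken from robots-relapse.

```python
import csv
import io


def top_domains(csv_text, limit=10000):
    """Return the domains whose rank is <= limit from 'rank,domain' CSV text."""
    domains = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        rank_text, domain = row[0].strip(), row[1].strip()
        if not rank_text.isdigit() or not domain:
            continue
        if int(rank_text) <= limit:
            domains.append(domain)
    return domains


def robots_urls(domains):
    """Build one robots.txt URL per domain (plain http is an assumption)."""
    return ["http://%s/robots.txt" % domain for domain in domains]


def churn(previous, current):
    """Return (appeared, vanished) domains between two runs."""
    before, after = set(previous), set(current)
    return sorted(after - before), sorted(before - after)


if __name__ == "__main__":
    sample = "1,example.com\n2,example.org\n3,example.net\n"
    for url in robots_urls(top_domains(sample, limit=2)):
        print(url)
```

Keeping each run's domain list around and passing two of them to `churn` would show which sites appeared or vanished between runs.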
14:03
🔗
|
ersi |
Ah, well yeah |
14:03
🔗
|
Schbirid |
now that i think about it, this is terribly stupid |
14:03
🔗
|
ersi |
"start at Alexa top 10k" is sufficiently exact for me |
14:04
🔗
|
ersi |
I'm thinking/I've started collecting URLs in general |
14:04
🔗
|
* |
Schbirid writes an infinite URL generator |
14:06
🔗
|
ersi |
Well, I'm only interested in URLs that lead to content |
14:12
🔗
|
Schbirid |
nice http://66dofan.com/robots.txt |
14:12
🔗
|
ersi |
lol, there's a lot of.. interesting URLs in alexa top 1m |
14:12
🔗
|
ersi |
999105,seehorsepenis.com |
14:12
🔗
|
ersi |
for example.. wtf |
14:13
🔗
|
Schbirid |
adobe.com recently added Disallow: /*.sql$ to theirs, hmmmmmmm |
14:13
🔗
|
Schbirid |
lol |
14:13
🔗
|
ersi |
haha, great |
14:13
🔗
|
Schbirid |
i really want to make some nice site showing recent changes but things like that scare me |
14:14
🔗
|
ersi |
recent changes in the sites you've crawled? |
14:14
🔗
|
Schbirid |
https://encrypted.google.com/search?hl=en&q=site:adobe.com+filetype:sql |
14:14
🔗
|
Schbirid |
in their robots.txt files |
14:14
🔗
|
ersi |
ah, well yeah |
14:14
🔗
|
ersi |
"Your search - site:adobe.com+filetype:sql - did not match any documents." :( |
14:15
🔗
|
Schbirid |
i got one but on closer look it is generic for some software setup |
14:15
🔗
|
soultcer |
ersi: Writing a web crawler, are we? |
14:15
🔗
|
ersi |
How often does Alexa release their top lists? :o |
14:16
🔗
|
ersi |
soultcer: Hehe, been wanting to for ages.. I started on a very basic one |
14:16
🔗
|
Schbirid |
daily |
14:17
🔗
|
soultcer |
Cool. What does it do? Feeding a search engine? Archiving? ... |
14:17
🔗
|
Schbirid |
i save those too if you want history ;) |
14:18
🔗
|
ersi |
nothing yet :p Prints out all anchor hrefs |
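A crawler at the stage ersi describes, one that only prints every anchor href on a page, can be sketched with the Python standard library alone. This is not ersi's code; the class and function names are hypothetical, and the sample HTML is invented for the demo.

```python
from html.parser import HTMLParser


class AnchorHrefParser(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return all anchor hrefs found in html_text, in document order."""
    parser = AnchorHrefParser()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


if __name__ == "__main__":
    sample = '<p><a href="http://archive.org/">IA</a> <a name="x">no link</a></p>'
    for href in extract_hrefs(sample):
        print(href)
```

The hrefs come back exactly as written in the page, so relative links would still need resolving against the page URL (for example with `urllib.parse.urljoin`) before they can be followed.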
14:20
🔗
|
ersi |
But when doing it, I started thinking about where one would get seeds.. and I thought of a few things; Start crawling my RSS feeds and follow links, watch IRC channels for links, unshorten urls (Urlteam, fuck yeah!), hook into yacy (p2p search engine), go through browser history occasionally |
14:21
🔗
|
soultcer |
Wikipedia releases a dump of its link table every couple of months |
14:21
🔗
|
soultcer |
Also pretty useful: Use a wordlist from password cracking and simply append .com or .net |
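soultcer's wordlist idea can be sketched as a small generator that turns each word into candidate hostnames. The function name `candidate_domains` and the filtering rules are assumptions for illustration: words that could not be a valid hostname label are dropped, and whether a candidate actually resolves is left to the crawler.

```python
def candidate_domains(words, tlds=(".com", ".net")):
    """Yield word+tld seed hostnames, skipping duplicates and invalid labels."""
    seen = set()
    for word in words:
        label = word.strip().lower()
        # A hostname label is 1-63 characters long.
        if not label or len(label) > 63:
            continue
        # It may not start or end with a hyphen.
        if label.startswith("-") or label.endswith("-"):
            continue
        # Only ASCII letters, digits and hyphens are allowed.
        if not all(c.isascii() and (c.isalnum() or c == "-") for c in label):
            continue
        if label in seen:
            continue
        seen.add(label)
        for tld in tlds:
            yield label + tld


if __name__ == "__main__":
    for host in candidate_domains(["Password", "pass word", "abc123"]):
        print(host)
```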
14:21
🔗
|
ersi |
Yeah |
14:22
🔗
|
ersi |
Also good sources :) |
15:33
🔗
|
ersi |
9540,196.1.211.6 |
15:33
🔗
|
ersi |
lol, Alexa top-1m is pretty funny |
15:33
🔗
|
ersi |
that's a good top site |
15:59
🔗
|
Schbirid |
looks like a syrian firewall http://196.1.211.6:8080/alert/ |
16:00
🔗
|
Schbirid |
but doesnt it nicely show how stupid alexa is? (sorry brewster) |
16:00
🔗
|
Schbirid |
err, sudan, not syrian |
16:30
🔗
|
ersi |
I dunno why Alexa has been such a big deal |
16:32
🔗
|
Schbirid |
they were early and gave people stats/toplist |
17:23
🔗
|
Schbirid |
anyone here using vnstat? any idea how i can make it output data from more than a year ago? |
17:42
🔗
|
SketchCow |
Hi, hello, I am the lorax. |
21:02
🔗
|
DFJustin |
http://economistsview.typepad.com/economistsview/2012/12/gop-fires-author-of-copyright-reform-paper.html |