Time |
Nickname |
Message |
00:01
|
joepie91 |
will get to work in a bit |
00:01
|
joepie91 |
:P |
00:21
|
SketchCow |
Yes |
01:34
|
joepie91 |
winr4r: starting on the scraper now |
01:34
|
joepie91 |
let's see how long it takes to write it :P |
01:36
|
winr4r |
:D |
01:40
|
SketchCow |
About to pump magazines into http://archive.org/details/byte-magazine |
01:43
|
winr4r |
woohoo! |
01:53
|
SketchCow |
root@teamarchive-1:/2/MAGS/BYTE magazine full-res scans PDF JC1.0 20120622# du -sh . |
01:53
|
SketchCow |
33G . |
01:53
|
SketchCow |
root@teamarchive-1:/2/MAGS/BYTE magazine full-res scans PDF JC1.0 20120622# ls | wc -l |
01:53
|
SketchCow |
128 |
01:56
|
winr4r |
slurp |
01:58
|
SketchCow |
Here's what I plan to do. |
01:58
|
SketchCow |
OK, then, 1986_03_BYTE_11-03_Homebound_Computing.pdf gets the love. |
01:58
|
SketchCow |
I will add an item called byte-magazine-1986-03. |
01:58
|
SketchCow |
I will say this dates to 1986-03. |
01:58
|
SketchCow |
In the collection named byte-magazine... |
01:58
|
SketchCow |
I will give it the title of Byte Magazine Volume 11 Number 03 - Homebound Computing. |
01:58
|
SketchCow |
And here we go. |
01:59
|
SketchCow |
for each in *.pdf |
01:59
|
SketchCow |
> do |
01:59
|
SketchCow |
> sh ingestor "$each" |
01:59
|
SketchCow |
> done |
01:59
|
SketchCow |
And ingestor does ALL the work. |
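The ingestor script itself is not shown in the log, so the following is only a rough sketch of the naming scheme SketchCow describes, assuming every scan follows the `YYYY_MM_BYTE_VV-NN_Topic.pdf` pattern of the one filename quoted. The function name `byte_metadata` and the dictionary keys are made up for illustration.

```python
import re

# Hypothetical helper; the real "ingestor" script is not shown in the log.
# Assumes filenames look like 1986_03_BYTE_11-03_Homebound_Computing.pdf.
FILENAME_RE = re.compile(
    r"^(?P<year>\d{4})_(?P<month>\d{2})_BYTE_"
    r"(?P<vol>\d{2})-(?P<num>\d{2})_(?P<topic>.+)\.pdf$"
)


def byte_metadata(filename):
    """Map a scan filename to the item fields described in the chat."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError("unexpected filename: %r" % filename)
    date = "%s-%s" % (match.group("year"), match.group("month"))
    topic = match.group("topic").replace("_", " ")
    return {
        "identifier": "byte-magazine-" + date,
        "date": date,
        "collection": "byte-magazine",
        "title": "Byte Magazine Volume %s Number %s - %s"
        % (match.group("vol"), match.group("num"), topic),
    }


meta = byte_metadata("1986_03_BYTE_11-03_Homebound_Computing.pdf")
```

Raising on an unexpected filename, rather than guessing, keeps a stray file from being uploaded under a wrong identifier.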
02:00
|
SketchCow |
It's finished uploading 8 already |
02:06
|
SketchCow |
30 uploaded. |
02:06
|
SketchCow |
So not so bad. |
02:06
|
SketchCow |
They'll start slowing down - some issues are 280-300mb |
02:16
|
joepie91 |
[...] |
02:16
|
joepie91 |
Archived 'Another night like this...', posted at 2005-02-06T15:42:00 by Devil's Kitchen |
02:16
|
joepie91 |
Archived 'Joe Gordon', posted at 2005-01-13T21:45:00 by Devil's Kitchen |
02:16
|
joepie91 |
Archived 'Toll Free...', posted at 2005-02-22T21:16:00 by Devil's Kitchen |
02:16
|
joepie91 |
Archived 'Well, hello...', posted at 2005-01-13T21:26:00 by Devil's Kitchen |
02:16
|
joepie91 |
Scraping http://www.devilskitchen.me.uk/2005_02_01_archive.html... |
02:16
|
joepie91 |
that seems to go pretty well |
02:16
|
joepie91 |
now to actually save it |
02:20
|
winr4r |
woohoo! |
02:34
|
SketchCow |
http://archive.org/details/byte-magazine-1985-01 |
02:34
|
SketchCow |
319mb!! |
02:34
|
joepie91 |
winr4r: scraping now |
02:34
|
joepie91 |
shouldn't take much more than a minute or 2 |
02:34
|
joepie91 |
at most |
02:34
|
winr4r |
huzzah |
02:36
|
SketchCow |
Are you using something that blows it into .warc as well? |
02:36
|
joepie91 |
lol, I was 403'd |
02:37
|
joepie91 |
SketchCow: no, I'm actually parsing the archives pages |
02:37
|
joepie91 |
archive * |
02:41
|
joepie91 |
okay, let's try it again from another IP with a bit more delay in between >.> |
02:42
|
joepie91 |
this will take a while :P |
02:42
|
joepie91 |
SketchCow: output is JSON with post title, author name, posting date, and body |
02:42
|
joepie91 |
body being the HTML of the particular post |
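For illustration, one record in the dump joepie91 describes might look roughly like this. The key names and the placeholder body are assumptions; only the four fields (title, author, date, body as HTML) come from the chat, and the title, author and timestamp are borrowed from the scraper output pasted earlier.

```python
import json


def post_record(title, author, date, body_html):
    """Bundle one blog post into a JSON-serialisable dict.

    The key names here are guesses; the log only says the output has
    a title, an author name, a posting date and an HTML body.
    """
    return {
        "title": title,
        "author": author,
        "date": date,
        "body": body_html,
    }


record = post_record(
    "Joe Gordon",
    "Devil's Kitchen",
    "2005-01-13T21:45:00",
    "<p>placeholder, not the real post body</p>",
)
encoded = json.dumps(record, sort_keys=True, indent=2)
```

Keeping the body as raw HTML means nothing is lost to an HTML-to-text conversion at scrape time.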
02:43
|
joepie91 |
root@aarnist:~/devilskitchen# find -type f | wc -l |
02:43
|
joepie91 |
so far |
02:43
|
joepie91 |
140 |
02:45
|
joepie91 |
365... |
02:45
|
joepie91 |
388... |
02:46
|
joepie91 |
I've arrived at 2006 by now :P |
02:51
|
joepie91 |
if anyone cares, scraper source: http://git.cryto.net/cgit/joepie91/tree/tools/scrapers/devilskitchen.py |
02:51
|
joepie91 |
cc winr4r |
02:51
|
joepie91 |
786 posts archived so far, around 2007-10 now |
02:52
|
chronomex |
this is archivey enough for #archiveteam |
02:52
|
chronomex |
imo |
02:52
|
joepie91 |
mm... fair enough |
02:52
|
joepie91 |
will move the convo there then :) |
02:52
|
winr4r |
k |
02:52
|
winr4r |
k |
02:52
|
winr4r |
k |
02:52
|
winr4r |
WHOAH |
02:52
|
winr4r |
sorry, trying to write on my netbook in the dark :/ |
02:52
|
joepie91 |
winr4r: you're not in #archiveteam |
02:53
|
joepie91 |
and lol |
02:53
|
winr4r |
joepie91: i'm not |
02:54
|
winr4r |
also, going to sleep with a cat tucked in behind my knees |
02:54
|
joepie91 |
haha |
02:55
|
joepie91 |
winr4r: you don't want to see the result then? :P |
02:55
|
winr4r |
there is literally nothing in the world that feels better than this |
02:55
|
joepie91 |
only 4 more years worth of posts to go |
02:55
|
joepie91 |
heh |
02:55
|
chronomex |
winr4r: sex is nice too. |
02:55
|
winr4r |
joepie91: i really do, but as for me, and right now, there is me and my neighbour's cat |
02:55
|
winr4r |
we're going to both sleep very well |
02:55
|
joepie91 |
:P |
02:55
|
winr4r |
gnight folks |
02:55
|
joepie91 |
night |
02:56
|
joepie91 |
goddamnit. |
02:56
|
joepie91 |
403'd again. |
02:57
|
joepie91 |
annoying. |
03:19
|
joepie91 |
SketchCow: suggestions for places to upload the resulting scrape? |
03:19
|
joepie91 |
fit for archive.org, for example? |
03:24
|
joepie91 |
for reference, here is the full scrape (minus the pages that 403ed for some reason): http://aarnist.cryto.net:81/devilskitchen.tar.gz cc winr4r |
03:25
|
chronomex |
joepie91: where's the .warc? |
03:25
|
joepie91 |
chronomex: there is none |
03:25
|
chronomex |
why not? |
03:25
|
joepie91 |
because I scraped the actual blog posts, and not the site as a whole |
03:26
|
chronomex |
just the content, not even the html? |
03:26
|
joepie91 |
chronomex: as mentioned earlier, it has the title, author, date, and body of every blog post |
03:26
|
joepie91 |
:P |
03:26
|
joepie91 |
if you really want a .warc, feel free to run wget-warc, because I don't have it here |
03:26
|
chronomex |
ah, ok |
03:26
|
joepie91 |
it's a pretty small site anyway |
03:27
|
chronomex |
have a list of urls I can work from? |
03:27
|
joepie91 |
saving the archive pages suffices, because it doesn't shorten the articles |
03:27
|
joepie91 |
sure, 1 sec |
03:27
|
chronomex |
archive pages don't get comments :) |
03:28
|
joepie91 |
http://pastie.org/4778385 |
03:28
|
joepie91 |
there you go |
03:28
|
joepie91 |
correct |
03:28
|
joepie91 |
but considering it's google, doing anything more is a bit tricky |
03:28
|
joepie91 |
:/ |
03:28
|
joepie91 |
google is incredibly hostile towards scrapers and bots in my experience |
03:28
|
chronomex |
:( |
03:28
|
joepie91 |
it 403d my home IP for a short while (entirely, not just for a few pages) |
03:28
|
joepie91 |
after I scraped with a 5 second interval |
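One way to pace a scrape like this, sketched under assumptions: a fixed pause between requests, plus a doubling back-off whenever the server answers 403. This is not what devilskitchen.py does (its source is only linked, not quoted); `fetch_politely` and its parameters are hypothetical, and the fetch function is passed in so the pacing can be tried without hitting the network.

```python
import time


def fetch_politely(urls, fetch, delay=5.0, max_retries=3, sleep=time.sleep):
    """Fetch URLs one by one, pausing between requests.

    `fetch` is any callable returning (status_code, body). On a 403 the
    pause before the retry doubles; after `max_retries` failed attempts
    the URL is recorded as failed and the loop moves on.
    """
    results = {}
    failed = []
    for url in urls:
        wait = delay
        for _attempt in range(max_retries):
            status, body = fetch(url)
            if status == 200:
                results[url] = body
                break
            if status == 403:
                wait *= 2  # back off harder after each refusal
            sleep(wait)
        else:
            failed.append(url)  # never got a 200 for this URL
        sleep(delay)  # baseline pause before the next URL
    return results, failed
```

Collecting the failures instead of aborting matches what happened here: the first tarball went out "minus the pages that 403ed", and a list like `failed` is what a later retry pass would start from.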
03:29
|
chronomex |
User-agent: * |
03:29
|
chronomex |
Disallow: /search |
03:29
|
chronomex |
Allow: / |
03:29
|
chronomex |
LIES |
03:29
|
joepie91 |
hmm? :P |
03:29
|
chronomex |
in /robots.txt |
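To see what the three robots.txt lines chronomex quotes actually permit, they can be fed to Python's standard `urllib.robotparser`. This is a small sketch: the rules are typed in by hand rather than downloaded, the hostname is taken from the scraper output earlier in the log, and `allowed` is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# The three rules quoted in the chat, with User-agent first.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /search",
    "Allow: /",
]


def allowed(path, agent="*"):
    """Return True if `agent` may fetch `path` under the quoted rules."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)
    return parser.can_fetch(agent, "http://www.devilskitchen.me.uk" + path)


archive_ok = allowed("/2005_02_01_archive.html")
search_ok = allowed("/search")
```

On paper the archive pages are open to any crawler, which is chronomex's point: the 403s came from rate limiting, not from anything robots.txt declares.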
03:30
|
joepie91 |
that doesn't make it not hostile towards bots/scrapers :) |
03:30
|
chronomex |
not relevant: http://www.reddit.com/r/obots |
03:30
|
chronomex |
hahahaha http://www.reddit.com/robots.txt |
03:31
|
chronomex |
User-Agent: bender |
03:31
|
chronomex |
Disallow: /my_shiny_metal_ass |
03:31
|
chronomex |
User-Agent: Gort |
03:31
|
chronomex |
Disallow: /earth |
03:31
|
joepie91 |
lol |
03:33
|
SketchCow |
joepie91: Get it all together and it has a home in the archiveteam collection at archive.org. |
03:34
|
joepie91 |
SketchCow: right, I have a JSON dump of all the articles packed up here: http://aarnist.cryto.net:81/devilskitchen.tar.gz |
03:34
|
joepie91 |
is that sufficient? |
03:34
|
joepie91 |
title, author, date, body |
03:35
|
SketchCow |
How many articles |
03:35
|
joepie91 |
1114 |
03:52
|
SketchCow |
OK, so. |
03:52
|
SketchCow |
you have a copy |
03:52
|
SketchCow |
you really want a warc copy as well. |
03:52
|
SketchCow |
You want a couple good copies, so we have something to work with in the future |
03:52
|
SketchCow |
WARC is what archive.org wants, although it's clunky in contemporary space for now |
04:02
|
joepie91 |
SketchCow: |
04:02
|
joepie91 |
cat: css.c: No such file or directory |
04:02
|
joepie91 |
make[3]: *** [css_.c] Error 1 |
04:02
|
joepie91 |
make[3]: Leaving directory `/root/wget-warc/trunk/src' |
04:02
|
joepie91 |
when compiling wget-warc |
04:02
|
joepie91 |
any suggestions? |
04:04
|
joepie91 |
debian 6 btw |
04:07
|
joepie91 |
ah, problem solved it seems |
04:07
|
joepie91 |
apt-get install flex && ./configure && make |
04:10
|
joepie91 |
help ._. |
04:10
|
joepie91 |
make[2]: *** No rule to make target `Makevars', needed by `Makefile'. Stop. |
04:13
|
joepie91 |
right, I think it works now |
04:22
|
joepie91 |
finally found a command that does the job |
04:22
|
joepie91 |
lol |
04:24
|
joepie91 |
SketchCow: okay, wget-warc'ing the blog now, let's see if I get through without google banning me |
04:24
|
joepie91 |
it ran against a no-index, so I had to ignore it |
04:24
|
joepie91 |
er |
04:24
|
joepie91 |
no-follow * |
05:51
|
DFJustin |
<joepie91> SketchCow: going to a non-archived URL via wayback machine adds it to archive queue? <-- technically it doesn't add it to a queue, it just does a grab of the page right then |
05:51
|
DFJustin |
+ any prerequisites that your browser fetches |
07:33
|
godane |
i may do a better pull of hackaday.com |
07:33
|
godane |
mostly cause the images are not in warc.gz format |
09:13
|
alard |
joepie91: The most recent Wget release (1.14) has warc support built-in. It looks like you've compiled an older version (one with a "trunk" directory), so it might be useful to upgrade if you plan to use it again. |
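The command joepie91 ended up running is not in the log. As a rough sketch only, a WARC grab with the stock Wget 1.14 alard mentions could be assembled like this. The flag selection, the 5-second wait and the output name `devilskitchen` are assumptions; `-e robots=off` is the Wget setting that stops it honouring robots.txt and `nofollow` meta tags, which matches the "had to ignore it" remark earlier.

```python
import shlex


def wget_warc_command(url, warc_name, wait_seconds=5):
    """Build an argv list for a polite, WARC-writing mirror with Wget 1.14+.

    Nothing is executed here; pass the list to subprocess.run() to run it.
    """
    return [
        "wget",
        "--mirror",                    # recursive grab of the whole site
        "--page-requisites",           # also fetch images, CSS and scripts
        "--wait=%d" % wait_seconds,    # pause between requests
        "--random-wait",               # vary the pause a little
        "-e", "robots=off",            # ignore robots.txt and nofollow
        "--warc-file=" + warc_name,    # write <warc_name>.warc.gz
        "--warc-cdx",                  # also write a CDX index
        url,
    ]


command = wget_warc_command("http://www.devilskitchen.me.uk/", "devilskitchen")
print(" ".join(shlex.quote(part) for part in command))
```

Building the argument list in Python rather than pasting a shell one-liner makes it easy to reuse the same settings for the next site.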
10:14
|
winr4r |
joepie91: you're wonderful |
10:14
|
winr4r |
good job |
13:30
|
SketchCow |
Uploading a few hundred Laptop manuals |
13:33
|
winr4r |
good morning jason! |
13:33
|
winr4r |
and hello mistym |
13:33
|
mistym |
Morning! |
13:34
|
mistym |
Ugggh, why did it have to get so cold so fast? I mean it is Winnipeg, but... :/ |
13:35
|
winr4r |
it got much colder in the last couple of days here, too |
13:40
|
joepie91 |
SketchCow, winr4r, tar.gz with both a warc and a json dump of the blog in it: http://aarnist.cryto.net:81/devilskitchen_final.tar.gz |
13:40
|
joepie91 |
warc seems to have completed successfully |
13:41
|
joepie91 |
(surprisingly) |
13:45
|
winr4r |
joepie91: good job :) |
13:53
|
SketchCow |
http://archive.org/details/archiveteam-devilskitchen-panic |
14:03
|
winr4r |
yay! |
14:08
|
joepie91 |
\o/ |
16:43
|
godane |
SketchCow: just for you to know i'm getting ~40000 external images from my underground-gamer.com dump |
16:43
|
godane |
also i think there is enough stuff in this dump just to do a talk on pirates again |
19:24
|
joepie91 |
would you look at that, WHOIS data in JSON format :) |
19:24
|
joepie91 |
http://whois.cryto.net/ :D |
20:46
|
DFJustin |
on the subject of manual uploads, might as well toot my own horn http://archive.org/search.php?query=subject%3A%22computer%20history%22%20AND%20uploader%3A%22dopefishjustin%40gmail.com%22%20AND%20collection%3Aopensource&sort=-publicdate |
20:57
|
dashcloud |
looks nice |