Time |
Nickname |
Message |
00:15
|
yipdw| |
heh |
00:15
|
yipdw| |
running ffgrab.rb on AT&T's 3G network is an exercise in omg |
00:19
|
Coderjoe |
eek |
01:21
|
bsmith093 |
/home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:212:in `block in initialize' |
01:21
|
bsmith093 |
/home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:212:in `initialize' |
01:21
|
bsmith093 |
/home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:377:in `watchdog' |
01:21
|
bsmith093 |
/home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:66:in `new' |
01:21
|
bsmith093 |
yipdw: i'm getting a whole mess of these: /home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:69:in `block (2 levels) in spawn' |
01:21
|
bsmith093 |
/home/ben/.rvm/gems/ruby-1.9.3-p0/gems/girl_friday-0.9.7/lib/girl_friday/actor.rb:66:in `block in spawn' |
01:21
|
bsmith093 |
E, [2011-12-09T20:21:18.680575 #18455] ERROR -- : Exception Errno::ECONNREFUSED (Connection refused - Unable to connect to Redis on 127.0.0.1:6379) raised while scraping /cartoon/; requeuing. |
01:26
|
yipdw| |
yeah |
01:26
|
yipdw| |
I said you need a Redis instance |
01:31
|
bsmith093 |
i figured out how to run Redis, now where is it saving these links to, or isn't it? |
01:32
|
yipdw| |
all data is being saved in the Redis database |
01:32
|
yipdw| |
the list of story IDs is present in the "stories" key |
01:32
|
yipdw| |
all other keys are discovery state |
01:32
|
bsmith093 |
which is where? |
01:33
|
yipdw| |
the stories key is in the Redis database |
01:33
|
yipdw| |
redis-cli is the Redis CLI interface |
01:33
|
yipdw| |
to query, launch it and run e.g. scard stories, smembers stories |
01:33
|
yipdw| |
more info available at http://redis.io |
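Editor's note: the two queries yipdw mentions can also be run non-interactively. The sketch below assumes a Redis server on the default 127.0.0.1:6379 (the address in the error above) and that the scraper stores IDs in a set named "stories", as stated in the log. The function names and the `REDIS_CLI` override are hypothetical conveniences, not part of ffgrab.

```shell
# REDIS_CLI is a hypothetical override; it defaults to the real redis-cli binary.
REDIS_CLI=${REDIS_CLI:-redis-cli}

# Print how many story IDs have been collected so far (SCARD = set cardinality).
story_count() {
  "$REDIS_CLI" scard stories
}

# Print every collected story ID (SMEMBERS = all members of the set).
story_ids() {
  "$REDIS_CLI" smembers stories
}
```

With a server running, `story_count` should print a single integer and `story_ids` the full list of IDs.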
01:34
|
yipdw| |
also, you should pull again, because I fixed a problem with usage of the If-Modified-Since header |
01:34
|
bsmith093 |
wow, 5k stories in 45 sec, that's fast |
01:34
|
yipdw| |
it's not actually pulling story data |
01:35
|
yipdw| |
but, yeah, it's decently quick |
01:35
|
yipdw| |
it can be faster without the random waits, but |
01:35
|
bsmith093 |
i know, it's just saving valid ids, right? |
01:35
|
yipdw| |
I feel bad about taking that out |
01:35
|
yipdw| |
if you run it on JRuby 1.6+, it is possible to scale it to a very high number of threads |
01:36
|
bsmith093 |
do i need to run anything else before running ffgrab again after a pull? |
01:36
|
yipdw| |
but again, I don't feel good about doing that, because I don't know fanfiction.net's capabilities |
01:36
|
yipdw| |
no, terminate it |
01:36
|
bsmith093 |
i did |
01:36
|
yipdw| |
ok |
01:36
|
yipdw| |
then just re-run it |
01:36
|
bsmith093 |
and i did a git pull, and now i'm running again |
01:36
|
yipdw| |
the scraper will pick up from where it left off |
01:37
|
yipdw| |
anyway, brb |
01:37
|
yipdw| |
well bbl more like, heh |
01:37
|
bsmith093 |
thanks bye |
06:55
|
Coderjoe |
http://retropc.net/ |
06:57
|
Coderjoe |
the japanese counterpart to SketchCow? |
07:01
|
Coderjoe |
man... why didn't I buy another HR-S9911U or two before they disappeared? |
07:38
|
DFJustin |
the japanese guys are like negative sketchcow, they collect shitloads of stuff and then never digitize any of it ever |
07:40
|
DFJustin |
there are a few exceptions |
08:47
|
Coderjoe |
not sure if there was a link from the main page of the site, so: http://retropc.net/alice/ |
09:49
|
yipdw |
well this is awesome |
09:49
|
yipdw |
9) "/Wrestling_and_CSI_Miami_Crossovers/230/1686/_cache_control" |
09:50
|
chronomex |
woooo |
09:50
|
chronomex |
the best |
09:50
|
Coderjoe |
O_O |
09:51
|
Coderjoe |
fan fiction gets weeeeeird |
09:51
|
yipdw |
47) "/Frasier_and_Megami_Tensei_Crossovers/381/1074/_cache_control" |
09:51
|
Coderjoe |
i wonder how many ff.net stories have self-insertion |
09:51
|
dnova |
fan fiction starts weird, remains weird, and ends weird. |
09:51
|
chronomex |
is _cache_control some sort of strange fanfic thing? |
09:52
|
yipdw |
no, that's my Cache-Control observance mechanism |
09:52
|
chronomex |
:P |
09:52
|
yipdw |
it's kind of hacky |
09:52
|
yipdw |
but it works |
09:54
|
Schbirid |
argh, what is the linux tool to compare two textfiles where you can specify eg "only show entries that appear in a but not b". not diff, something simpler |
09:55
|
chronomex |
grep -v -f b a |
09:56
|
chronomex |
works best when B is small and A is large |
09:56
|
Schbirid |
it was something with the flags -1 , -2, -3 |
09:56
|
Schbirid |
but that sounds good too :) |
09:56
|
chronomex |
it's not quite there, but that's where I would start. |
09:57
|
chronomex |
you also want to tell grep to match whole lines only (dunno the option), and interpret the lines as fixed strings rather than patterns (-F, I think) |
10:00
|
Schbirid |
chronomex: comm it is :) |
10:00
|
chronomex |
ah |
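Editor's note: both approaches from the exchange above, side by side, as a small sketch. The file names and contents are made up for illustration. `comm -23` prints lines found only in the first file and needs sorted input. For grep, the whole-line option chronomex could not recall is `-x`, and `-F` is indeed the fixed-string option.

```shell
# Work in a scratch directory so the example touches no real files.
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry date > "$workdir/a"
printf '%s\n' banana date > "$workdir/b"

# comm needs both inputs sorted the same way.
sort "$workdir/a" > "$workdir/a.sorted"
sort "$workdir/b" > "$workdir/b.sorted"

# -2 drops lines unique to b, -3 drops lines common to both: what remains is "in a but not b".
only_in_a_comm=$(comm -23 "$workdir/a.sorted" "$workdir/b.sorted")

# grep route: -v inverts the match, -x matches whole lines only,
# -F treats each pattern as a fixed string, -f reads the patterns from b.
only_in_a_grep=$(grep -v -x -F -f "$workdir/b" "$workdir/a")

printf '%s\n' "$only_in_a_comm"
rm -r "$workdir"
```

Both variables end up holding `apple` and `cherry`. The grep form keeps the original order of a and needs no sorting, while comm emits its result in sorted order.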
10:08
|
DFJustin |
ah this alice soft archive is nice |
13:36
|
Schbirid |
http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance-win/ |
21:05
|
chronomex |
Schbirid: I bet it would have really flown with fgrep. |
21:58
|
bsmith093 |
gsc-game.com wget-warc appears to be done, 1.5gb total, including cdx and warc file |
21:58
|
bsmith093 |
44.185 items, at 1.8gb |
23:14
|
SketchCow |
Yes, once again, archiveteam's site has spam on it. |
23:18
|
SketchCow |
You know, for being all twitchy about the fact I've been porting their digitized magazines to archive.org, this site has done a spectacularly shitty job getting consistent spams. |
23:18
|
SketchCow |
scans. |
23:19
|
SketchCow |
Some of these are literally mish-mashes where issues 1 2 3 8 and 9 are collections of JPG files, and then 4 5 6 are pdfs and then 7 is a set of jpg files in two file directories. |
23:47
|
underscor |
SketchCow: hah |
23:58
|
chronomex |
scanning is boring, let's get high first |