Time |
Nickname |
Message |
00:00
🔗
|
db48x2 |
let's see. error 7 is failure to connect to the host, and 56 is failure in receiving network data |
00:01
🔗
|
bsmith093 |
xargs runs curl as many times as instance is set for, but the instant any one of them hits an error they are all left to finish up, then the script quits |
00:01
🔗
|
db48x2 |
right |
00:01
🔗
|
db48x2 |
that's how xargs works |
00:02
🔗
|
bsmith093 |
oh ok, so is there a way to print an error and keep xargs going |
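The behaviour asked for here (report the error, keep the other jobs running) can be sketched in Python by catching failures per job instead of per batch. This is only an illustration under stated assumptions, not the script being discussed: `run_all` and `flaky_check` are hypothetical names, and `flaky_check` merely simulates a failing network call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(ids, check_id, workers=4):
    """Run check_id(i) for every ID in parallel; collect failures instead of stopping."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(check_id, i): i for i in ids}
        for future in as_completed(futures):
            story_id = futures[future]
            try:
                results[story_id] = future.result()
            except Exception as exc:  # one bad ID must not end the whole run
                errors[story_id] = exc
                print(f"error on {story_id}: {exc}")
    return results, errors


def flaky_check(story_id):
    # Hypothetical stand-in for the network check: it fails on one ID.
    if story_id == 3:
        raise ConnectionError("simulated curl error 7")
    return story_id % 2 == 0
```

Calling `run_all([1, 2, 3, 4], flaky_check)` prints one error line for ID 3 and still returns results for the other three IDs, so failed IDs can be retried later.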
00:03
🔗
|
bsmith093 |
http://tracker.archive.org/ff.net/numbers this file has the full list of ids to check, thats what the curl func does, xargs checks multiple at once, or at least its supposed to |
00:04
🔗
|
yipdw |
if you're getting error 7 or 56 "seemingly at random", it may be fanfiction.net |
00:04
🔗
|
yipdw |
also, that script you posted is kind of nuts |
00:05
🔗
|
bsmith093 |
i agree, not my script but the best arrith could do on short notice |
00:05
🔗
|
yipdw |
you're creating a subshell and are evaluating a function and then executing it for every job |
00:05
🔗
|
bsmith093 |
yipdw: but why is it nuts |
00:05
🔗
|
yipdw |
there's a simpler way to do it :P |
00:05
🔗
|
yipdw |
one moment |
00:05
🔗
|
bsmith093 |
yay, so whats the better way |
00:06
🔗
|
db48x2 |
well, you could put the contents of the function into another sh file, and call it that way :) |
00:06
🔗
|
bsmith093 |
the numbers, cause they are |
00:06
🔗
|
yipdw |
I'm still not quite sure why there even needs to be another function |
00:06
🔗
|
db48x2 |
but that's a minor thing |
00:07
🔗
|
bsmith093 |
this is why i gave the link, so you all could tweak and offer suggestions :P |
00:13
🔗
|
yipdw |
oh, I see why |
00:13
🔗
|
yipdw |
I guess xargs won't execute a shell function |
00:13
🔗
|
db48x2 |
no |
00:14
🔗
|
yipdw |
bsmith093: out of curiosity, do you get the curl error if you run with instance_count=1 |
00:14
🔗
|
bsmith093 |
not sure hold on |
00:15
🔗
|
yipdw |
also, is it always on the same IDs |
00:16
🔗
|
yipdw |
finally, I'm wondering if it would be more efficient to do this via spidering the fanfiction.net story indices |
00:16
🔗
|
yipdw |
76.2 megabytes of IDs is a lot |
00:16
🔗
|
yipdw |
? |
00:16
🔗
|
yipdw |
how many of those IDs actually reference stories |
00:16
🔗
|
bsmith093 |
probably several million |
00:16
🔗
|
bsmith093 |
and there's only 10 mil max so why not be methodical |
00:18
🔗
|
bsmith093 |
appears not |
00:18
🔗
|
yipdw |
finally, I don't think that script actually archives stories in full |
00:19
🔗
|
bsmith093 |
inst count =1 no problem, but its not in paralell, so it kinda defeats the puroose |
00:19
🔗
|
bsmith093 |
purpose |
00:19
🔗
|
yipdw |
how does it behave on stories with multiple chapters, e.g. http://www.fanfiction.net/s/5909536/? |
00:19
🔗
|
bsmith093 |
wait just crashed yep same error |
00:20
🔗
|
bsmith093 |
it doesn't actually grab them just checks if the id is valid |
00:20
🔗
|
bsmith093 |
im building a linklist |
00:21
🔗
|
yipdw |
hm |
00:21
🔗
|
yipdw |
I maintain it would be more efficient (for you and for them) to start at the roots on http://www.fanfiction.net/ |
00:21
🔗
|
yipdw |
and trace from there |
00:21
🔗
|
yipdw |
by using WWW::Mechanize/Mechanize/etc. |
00:21
🔗
|
yipdw |
I've got to run, though, so I can't provide an example |
00:21
🔗
|
yipdw |
maybe later |
00:22
🔗
|
yipdw |
usage of those tools does mean leaving bash and using Perl, Python, Ruby, or whatnot, but IMO those are better languages for this sort of stuff anyway |
00:22
🔗
|
yipdw |
bbl for real |
00:26
🔗
|
bsmith093 |
connection dropped out, what'd i miss |
00:41
🔗
|
underscor |
Nothing |
00:42
🔗
|
bsmith093 |
how do i get wget --spider to give up a linklist for the whole site |
00:42
🔗
|
underscor |
Dunno off the top of my head |
00:43
🔗
|
underscor |
On a side note, I'm almost to 1,500,000 |
00:43
🔗
|
underscor |
Simply using this |
00:43
🔗
|
underscor |
for i in `cat numbers_[a-e][n-z] `;do var=`curl -A "ArchiveTeam/1.0 - Email archiveteam@k-srv.info for misbehavior or complaints" -I http://www.fanfiction.net/s/$i|grep Last`;echo -n "$i - ";if [ -z "$var" ]; then echo "Not a story";else echo "Story";echo $i>>stories_aa;fi;done |
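The one-liner above sends a HEAD request per ID and treats any response header line containing "Last" as the sign of a real story. A rough Python equivalent follows, as a sketch only: the function names and the `USER_AGENT` value are made up for illustration, and narrowing the match to the `Last-Modified` header is an assumption about what the `grep Last` was meant to catch.

```python
import urllib.request

USER_AGENT = "ExampleArchiver/0.1"  # hypothetical; use a string with real contact details


def fetch_headers(story_id, timeout=10):
    """Send a HEAD request for one story ID and return its headers as a dict."""
    url = f"http://www.fanfiction.net/s/{story_id}"
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def is_story(headers):
    """Mirror the `grep Last` test: a Last-Modified header is taken to mean a story."""
    return any(name.lower() == "last-modified" for name in headers)
```

Keeping the classification separate from the network call means the rule can be checked against saved headers without hitting the site again.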
01:01
🔗
|
chronomex |
k-srv.info, who is that? |
01:07
🔗
|
arrith |
yipdw: took me like an hour to work out that xargs subshell thing. seriously. |
01:08
🔗
|
arrith |
yipdw: they want you to put stuff into another script then have xargs in your original script run *that* |
01:08
🔗
|
arrith |
yipdw: and i am quite proud of my (crazy) workaround :D |
01:12
🔗
|
arrith |
yipdw: btw underscor is doing his own script that's more thorough, what i'm doing is just a dirty/fast grab for the stories as really just a proof of concept. |
01:20
🔗
|
bsmith093 |
hey im running underscor's thing with his files from ffnet tracker, and its picking up where he left off |
01:21
🔗
|
bsmith093 |
still 81 days though |
01:22
🔗
|
underscor |
bsmith093: What do you mean where I left off? |
01:22
🔗
|
bsmith093 |
the file stories aa is growning |
01:22
🔗
|
bsmith093 |
growing |
01:24
🔗
|
underscor |
I know, I'm saying I didn't leave off anywhere |
01:25
🔗
|
arrith |
bsmith093: afaik that's a snapshot of his work, he's probably farther along than that |
01:25
🔗
|
bsmith093 |
well ok then |
01:25
🔗
|
underscor |
oh, yeah |
01:25
🔗
|
underscor |
sorry |
01:25
🔗
|
underscor |
my bad |
01:26
🔗
|
underscor |
http://tracker.archive.org/ff.net/stories_0-1299999 |
01:26
🔗
|
underscor |
That might be of interest though |
01:26
🔗
|
underscor |
Those are all the valid ones |
02:16
🔗
|
bsmith094 |
underscor: im running your script now, since you're so much further ahead than me, and it keeps failing: Running storyinator on id 0000004 Let's get some metadata. Frontpage Gotten Title is Little Helper Writen by Sheryl Nantus, whose userid is 3284 Placed in tv>>X-Files Tags are Rated: K+, English, F. Mulder & D. Scully, P:3-16-99 Published 3-16-99, updated Story has 38 reviews, which is 3 pages chapters in this story Making dir |
02:17
🔗
|
underscor |
That all looks correct |
02:17
🔗
|
underscor |
Do you have php |
02:17
🔗
|
underscor |
and do you have the php file xmlr.php? |
02:17
🔗
|
bsmith094 |
yeah, about that: could not open input file xmlr.php |
02:18
🔗
|
underscor |
Did you download it? |
02:18
🔗
|
underscor |
:P |
02:18
🔗
|
bsmith094 |
yes |
02:19
🔗
|
bsmith094 |
man php says yes i do have php, but maybe not the right version or something |
02:20
🔗
|
arrith |
bsmith094: what's wrong with my script :( |
02:21
🔗
|
bsmith094 |
still running arrith |
02:21
🔗
|
arrith |
it gets the job done of finding IDs. and in parallel! |
02:21
🔗
|
arrith |
oh |
02:21
🔗
|
arrith |
bsmith094: looks like it's working? |
02:21
🔗
|
bsmith094 |
apparently |
02:21
🔗
|
bsmith094 |
using underscor's numbers because he's got so many |
02:24
🔗
|
bsmith094 |
now as for the actual downloading of the stories, well thats more complicated, according to whatever black arts this is using http://pastebin.com/e5e4tvK5 |
02:39
🔗
|
arrith |
bsmith094: Unknown Paste ID! |
04:18
🔗
|
yipdw |
arrith: I see |
04:19
🔗
|
yipdw |
arrith: at that point, in my opinion, it probably is clearer to switch to a different programming language and use e.g. a thread pool |
04:22
🔗
|
arrith |
psssshh |
04:23
🔗
|
arrith |
yeah. i have a very small and light and elegant reimplementation of that in python written by a friend of mine using a threadpool even but eh, i don't know python yet |
04:23
🔗
|
yipdw |
or, more appropriately |
04:23
🔗
|
yipdw |
a queue |
04:23
🔗
|
yipdw |
thread pool being an implementation detail, obviously :P |
04:24
🔗
|
arrith |
http://paste.pocoo.org/show/516501/ |
04:25
🔗
|
arrith |
ah, i don't think i know the difference between a pool and a queue |
04:25
🔗
|
yipdw |
yeah, pretty much |
04:25
🔗
|
yipdw |
they're different structures, not directly related |
04:26
🔗
|
yipdw |
the idea being that you throw all of your tasks (IDs, in this case) into a queue, and then there exist multiple executors that dequeue a task, work on it, and then check it in |
04:26
🔗
|
yipdw |
the thread pool is a way to limit the number of concurrent executors |
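The queue-and-executors idea described above can be sketched in Python with the standard `queue` and `threading` modules. This is a minimal illustration, assuming every task is known up front; `process_all` and `handler` are hypothetical names, not anything from the scripts under discussion.

```python
import queue
import threading


def process_all(tasks, handler, executors=4):
    """Put every task on a queue and let a fixed number of threads drain it."""
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    results = []
    lock = threading.Lock()

    def executor():
        while True:
            try:
                task = work.get_nowait()    # dequeue a task
            except queue.Empty:
                return                      # nothing left: this executor retires
            try:
                outcome = handler(task)     # work on it
                with lock:
                    results.append((task, outcome))
            finally:
                work.task_done()            # check it in

    threads = [threading.Thread(target=executor) for _ in range(executors)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The `executors` argument is the only knob for concurrency: the queue holds all the work, while the number of threads caps how many tasks are in flight at once.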
04:26
🔗
|
arrith |
ah |
04:26
🔗
|
arrith |
does python's multiprocess use a queue? |
04:27
🔗
|
yipdw |
the map() function probably does |
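For comparison, the pool's `map()` hides the queue entirely. The sketch below uses `multiprocessing.pool.ThreadPool`, which offers the same `map` interface as the process pool but runs threads; `looks_valid` is a hypothetical stand-in for the real network check.

```python
from multiprocessing.pool import ThreadPool


def looks_valid(story_id):
    # Hypothetical stand-in for a network check.
    return story_id % 3 == 0


def check_many(ids, workers=4):
    """Map the check over all IDs with at most `workers` running at once."""
    with ThreadPool(workers) as pool:
        flags = pool.map(looks_valid, ids)  # results come back in input order
    return [i for i, ok in zip(ids, flags) if ok]
```

Because `map` returns results in input order, pairing them back up with the IDs needs no extra bookkeeping.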
04:28
🔗
|
arrith |
ah |
04:36
🔗
|
bsmith094 |
im back, is the python code just an example? |
04:37
🔗
|
bsmith094 |
anyway, now im trying to get underscor's storyinator.sh to work |
04:49
🔗
|
godane |
i have 193 episodes of crankygeeks now |
04:50
🔗
|
godane |
i also have all crankygeeks episodes posts |
04:50
🔗
|
bsmith094 |
publicly available yet? |
04:50
🔗
|
godane |
i have not uploaded anything yet |
04:51
🔗
|
godane |
i have backed them up on dvd |
04:51
🔗
|
godane |
i have md5sum file for making sure data is right |
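Checking files against an md5sum-style list can also be done without the `md5sum` tool. The following is a small sketch, assuming the usual `<hash>  <name>` line layout with names relative to the list file; `md5_of` and `verify` are illustrative names.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large videos need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest_path):
    """Check each '<hash>  <name>' line of the manifest; return names that mismatch."""
    manifest = Path(manifest_path)
    bad = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        if md5_of(manifest.parent / name) != expected.lower():
            bad.append(name)
    return bad
```

An empty list from `verify` means every listed file still matches its recorded hash.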
04:51
🔗
|
bsmith094 |
how many dvds |
04:51
🔗
|
godane |
3 so far |
04:52
🔗
|
godane |
it will be at least 6 dvds when fully done |
04:52
🔗
|
dnova |
nice. that's hardly anything |
04:52
🔗
|
bsmith094 |
wow |
04:52
🔗
|
godane |
*6 single layer dvds |
04:53
🔗
|
godane |
this is only one format |
04:53
🔗
|
dnova |
there's no reason to get the other formats |
04:53
🔗
|
dnova |
if you're getting the best one |
04:53
🔗
|
godane |
i'm getting ipod one |
04:53
🔗
|
dnova |
the smaller ones can be recreated from those if necessary. |
04:53
🔗
|
dnova |
wait... really? |
04:53
🔗
|
bsmith094 |
so mp3 |
04:53
🔗
|
dnova |
oh they're just podcasts? |
04:53
🔗
|
dnova |
I thought they were videos |
04:54
🔗
|
godane |
they're just podcasts |
04:54
🔗
|
dnova |
that should be fine then. |
04:54
🔗
|
godane |
but its video podcasts |
04:54
🔗
|
dnova |
... uh |
04:54
🔗
|
dnova |
ok... well, mp3 has no video |
04:54
🔗
|
godane |
video with audio podcasts |
04:55
🔗
|
dnova |
you should be getting the highest quality version of them |
04:55
🔗
|
godane |
i did for the first 70 |
04:55
🔗
|
godane |
mpeg4 would have had to change to quicktime |
04:56
🔗
|
godane |
cause mpeg4 became the ipod format |
04:57
🔗
|
bsmith094 |
i thought mpeg4 was basically quicktime |
04:58
🔗
|
godane |
there is .mp4 files then there is .mov files |
05:00
🔗
|
dnova |
mp4 is independent of quicktime |
05:01
🔗
|
godane |
anyways the videos are not that big |
05:01
🔗
|
godane |
backing up something is better than nothing |
05:01
🔗
|
Coderjoe |
the isomedia mp4 container format is largely based on the quicktime container format. (note, quicktime is like AVI in this regard: both can contain a number of different codecs. the mp4 container is a bit more limited) |
05:02
🔗
|
godane |
you don't need to backup the 1.6gb 720p videos of podcasts |
05:02
🔗
|
dnova |
godane: says who? |
05:02
🔗
|
dnova |
I guess if it's not that important to you, that's fine |
05:03
🔗
|
dnova |
I don't know anything about that podcast and I am personally not too concerned about it |
05:03
🔗
|
dnova |
but if it's worth doing, it's worth doing right, isn't it? |
05:03
🔗
|
godane |
it takes a long time to download and upload 1.6gb file |
05:03
🔗
|
dnova |
what's the deadline? |
05:04
🔗
|
godane |
also crankygeeks doesn't have HD |
05:04
🔗
|
godane |
the biggest file is like 140ishmb |
05:04
🔗
|
dnova |
so where did 1.6gb come from? |
05:04
🔗
|
dnova |
get the 140mb or whatever the best quality files are |
05:05
🔗
|
godane |
i'm just getting the ipod one |
05:05
🔗
|
godane |
sorry |
05:05
🔗
|
dnova |
I don't give a shit, but it sounds like you do |
05:05
🔗
|
dnova |
but only enough to half-ass it |
05:05
🔗
|
godane |
i watch the videos |
05:05
🔗
|
godane |
the is not big differents between the too |
05:05
🔗
|
dnova |
maybe they'll upload them to youtube for near-term preservation and availability |
05:05
🔗
|
godane |
*twno |
05:06
🔗
|
bsmith094 |
dnova: hey, i don't particularly care for most of the stuff on ffnet, either, but im still saving them preemptively, cause it would really suck if that much creativity went into the bitbucket |
05:07
🔗
|
dnova |
right |
05:07
🔗
|
dnova |
and I don't care about splinder on a personal level either |
05:07
🔗
|
bsmith094 |
nor me |
05:07
🔗
|
dnova |
but I've spent lots of time and a decent amount of money to grab as much as I can |
05:07
🔗
|
dnova |
and I'm still grabbing. |
05:07
🔗
|
bsmith094 |
at all, but as long as it was that easy to help out, i did' |
05:08
🔗
|
dnova |
hell yeah man! |
05:08
🔗
|
dnova |
they did a bang-up job with that. |
05:08
🔗
|
bsmith094 |
now, ffnet, for being a fully automated site, is a pita to grab, all of it any way, pinging the ids just to see which urls are valid will take about a month |
05:09
🔗
|
dnova |
I will help with that if possible. just let me know. |
05:09
🔗
|
yipdw |
I'm still not sure why you're going through all the IDs |
05:09
🔗
|
dnova |
it's not a huge time crunch with that project so don't stress too much about it |
05:09
🔗
|
yipdw |
have you identified some problem with using the fanfiction.net indices? |
05:09
🔗
|
yipdw |
e.g. have they blocked some stories from showing up? |
05:09
🔗
|
bsmith094 |
we dont really have a script yet; we (and i mean underscor and arrith) have some tentative efforts going |
05:10
🔗
|
dnova |
yeah.. when it's a little more fleshed out I'll throw some hardware at it |
05:10
🔗
|
bsmith094 |
yipdw: uhh no, i just want to save them bc thats a lot of work to disappear |
05:10
🔗
|
yipdw |
that's not the question I asked |
05:10
🔗
|
dnova |
he means why are you brute forcing the ID list |
05:10
🔗
|
yipdw |
yes |
05:10
🔗
|
yipdw |
fanfiction.net has, from what I can see, a perfectly usable story index |
05:10
🔗
|
bsmith094 |
oh well , umm, thats the easiest way ive found |
05:10
🔗
|
* |
yipdw sighs |
05:10
🔗
|
bsmith094 |
where? |
05:11
🔗
|
yipdw |
their web page |
05:11
🔗
|
yipdw |
one moment |
05:11
🔗
|
yipdw |
I have some time now, let me whip up a demo |
05:11
🔗
|
bsmith094 |
we could scrape the feed, but that goes forward not back |
05:11
🔗
|
yipdw |
no no |
05:11
🔗
|
yipdw |
I mean the page itself |
05:11
🔗
|
yipdw |
e.g. http://www.fanfiction.net/play/ |
05:11
🔗
|
bsmith094 |
ok then whip away, cause u lost me |
05:11
🔗
|
yipdw |
every story is linked from these lists |
05:11
🔗
|
yipdw |
(as far as I can tell) |
05:12
🔗
|
yipdw |
if you have some counterexamples, I would like to hear them |
05:12
🔗
|
bsmith094 |
ummm, but its easier just to grab the story ids directly |
05:13
🔗
|
bsmith094 |
all u need then is to find out how many chapters each one is, and thats on the first page of each one |
05:13
🔗
|
yipdw |
the initial implementation is easier, but: |
05:13
🔗
|
yipdw |
(1) it requires a pre-filtering step that (you say) will take a month |
05:13
🔗
|
yipdw |
(2) it's really inconsiderate |
05:13
🔗
|
yipdw |
(fanfiction.net isn't dying) |
05:13
🔗
|
dnova |
heh |
05:13
🔗
|
yipdw |
and the point of archiveteam, as far as I know, is to archive, not be assholes |
05:14
🔗
|
yipdw |
the latter sometimes happens but not as an objective |
05:14
🔗
|
dnova |
to be fair, he's not trying to be an asshole at all |
05:14
🔗
|
yipdw |
bear in mind that every GET for a story ID likely requires a database lookup, unless ff.net has done some caching along those lines |
05:14
🔗
|
yipdw |
I know he's not |
05:14
🔗
|
yipdw |
I'm just saying that brute-forcing is a pretty inconsiderate way to do things |
05:14
🔗
|
yipdw |
which is also why I'm writing up an alternative |
05:14
🔗
|
bsmith094 |
im just using curl to scrape the head of the urls |
05:16
🔗
|
bsmith094 |
yipdw: so what's your alternative |
05:16
🔗
|
yipdw |
you have a set of roots, right |
05:16
🔗
|
bsmith094 |
yeah |
05:16
🔗
|
yipdw |
namely, the fanfiction categories on the main page |
05:16
🔗
|
yipdw |
ok |
05:16
🔗
|
bsmith094 |
with u so far |
05:16
🔗
|
yipdw |
each root contains a set of categories |
05:16
🔗
|
yipdw |
each category contains a set of stories |
05:16
🔗
|
yipdw |
therefore, there is no need to test each ID |
05:17
🔗
|
yipdw |
and you can begin archiving stories immediately |
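The walk described above (roots, then categories, then stories) amounts to a breadth-first traversal. Here is a minimal sketch over an in-memory stand-in for the site. Everything in it is illustrative: `discover_stories`, the `SITE` layout, and the rule that story links start with `/s/` are assumptions, and a real crawler would fetch and parse pages inside `links_of`.

```python
from collections import deque


def discover_stories(roots, links_of, is_story):
    """Breadth-first walk from the root pages; returns story URLs in discovery order."""
    pending = deque(roots)
    seen = set(roots)
    stories = []
    while pending:
        page = pending.popleft()
        for link in links_of(page):
            if link in seen:          # skip anything already queued or recorded
                continue
            seen.add(link)
            if is_story(link):
                stories.append(link)
            else:
                pending.append(link)  # a category, or a further page of one
    return stories


# Tiny invented site map, for illustration only.
SITE = {
    "/play/": ["/play/Hamlet/", "/play/Macbeth/"],
    "/play/Hamlet/": ["/s/101/", "/s/102/", "/play/Hamlet/?p=2"],
    "/play/Hamlet/?p=2": ["/s/103/", "/s/101/"],
    "/play/Macbeth/": ["/s/104/"],
}
```

With `links_of=lambda p: SITE.get(p, [])` and `is_story=lambda u: u.startswith("/s/")`, each story is reported once even though `/s/101/` is linked from two pages, and no ID that does not exist is ever requested.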
05:17
🔗
|
bsmith094 |
uhuh, so what, wget --spider -m? |
05:17
🔗
|
yipdw |
I don't know what tool you'll use; I'm writing a tool in Ruby at the moment |
05:30
🔗
|
yipdw |
whoa my rubinius install is out of date |
05:30
🔗
|
yipdw |
time to update |
05:34
🔗
|
dnova |
update it good |
05:41
🔗
|
bsmith094 |
so anyway, wget -m with a ua changed to firefox seems to be saving the same link structure as well, so no resorting of ids back into categories |
05:47
🔗
|
Coderjoe |
did you apply a wait time (possibly with the random wait options as well?) |
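A wait with a random component is easy to add in any language. The sketch below spaces requests by a pause drawn from 0.5 to 1.5 times a base wait, which is how wget's `--random-wait` is documented to vary `--wait`, as far as I recall. `polite_delay` and `fetch_politely` are hypothetical helpers, and `fetch` is whatever function actually retrieves a URL.

```python
import random
import time


def polite_delay(base_seconds, rng=random):
    """Pick a pause between 0.5x and 1.5x the base wait, so requests are not evenly spaced."""
    return rng.uniform(0.5 * base_seconds, 1.5 * base_seconds)


def fetch_politely(urls, fetch, base_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a randomised interval between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index:  # no pause before the first request
            sleep(polite_delay(base_seconds))
        pages.append(fetch(url))
    return pages
```

Passing `sleep` as a parameter is a design choice that lets the pacing be tested without actually waiting.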
05:52
🔗
|
yipdw |
ok |
05:53
🔗
|
yipdw |
https://gist.github.com/1432483 |
05:53
🔗
|
yipdw |
that doesn't actually save anything yet, but it can be extended to do so |
05:53
🔗
|
yipdw |
the idea is to demonstrate a more targeted approach |
05:54
🔗
|
yipdw |
if you run that (use Ruby 1.9.3, JRuby in 1.9 mode, or Rubinius 2.0.0 in 1.9 mode) you'll see how it works |
05:54
🔗
|
yipdw |
attaching an example run log now |
05:54
🔗
|
yipdw |
attached |
05:54
🔗
|
yipdw |
note, too, that paginated categories are treated as just more categories |
05:55
🔗
|
yipdw |
there's some deduplication work to be done there, but |
05:55
🔗
|
yipdw |
eh |
05:56
🔗
|
yipdw |
one possibility for saving with the script I linked is to save each story as its own WARC, reviews and all; that'd eliminate the need for a separate review queue |
05:56
🔗
|
yipdw |
that assumes that the unit of work you want to save is the story |
05:56
🔗
|
yipdw |
which I think is true. |
05:57
🔗
|
underscor |
yipdw: That's pretty spiffy! |
05:58
🔗
|
yipdw |
I think it's probably buggy |
05:58
🔗
|
yipdw |
there are some duplicate names showing up; the link selection logic probably needs to be refined |
05:58
🔗
|
yipdw |
but that's the idea |
05:58
🔗
|
yipdw |
as a bonus, the number of instances you run can be carefully controlled by simply changing the size of the connection pool |
06:01
🔗
|
arrith |
aha |
06:02
🔗
|
arrith |
underscor: i was going to ping you to make sure you saw this discussion, yeah some interesting stuff |
06:02
🔗
|
yipdw |
oh oops |
06:02
🔗
|
yipdw |
my category-detection scheme fails on crossovers |
06:08
🔗
|
yipdw |
heh, that's annoying |
06:08
🔗
|
yipdw |
http://www.fanfiction.net/crossovers/movie/ has broken HTML |
06:09
🔗
|
bsmith094 |
yipdw: well, its official, your ruby kicks my wget's ass |
06:09
🔗
|
bsmith094 |
probably more efficient, too |
06:10
🔗
|
yipdw |
keep in mind that this code doesn't actually save anything yet |
06:10
🔗
|
yipdw |
I'm not sure how you want to do that |
06:10
🔗
|
bsmith094 |
im fine with category/show/userid/story |
06:11
🔗
|
bsmith094 |
and you have a repo, which makes updating SO much easier |
06:11
🔗
|
yipdw |
also, I'm not sure how hard it would be to get wget-warc to do this |
06:11
🔗
|
yipdw |
(haven't tried) |
06:11
🔗
|
yipdw |
there are advantages to using that, such as making it easier to replicate fanfiction.net's structure |
06:11
🔗
|
bsmith094 |
i still don't get why warc is important? |
06:11
🔗
|
dnova |
why did we have to compile wget-warc for splinder? |
06:12
🔗
|
yipdw |
dnova: there's no official release of wget + WARC capabilities |
06:12
🔗
|
Coderjoe |
because the warc features are not in most distro's package repos yet |
06:12
🔗
|
bsmith094 |
this code would work for fictionpress as well, since they're identical |
06:12
🔗
|
yipdw |
bsmith094: I think it's important to capture not only the story data but also the circumstances under which the capture was done |
06:12
🔗
|
Coderjoe |
the warc features have been accepted into wget's mainline, however |
06:12
🔗
|
yipdw |
WARC provides that |
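To make "the circumstances of the capture" concrete: a WARC record wraps the raw HTTP exchange in a small header block that names the URL and the capture time. The sketch below builds one simplified `response` record in the WARC/1.0 layout. It is illustrative only: `warc_response_record` is a made-up helper, and real tools such as wget's WARC support also write other record types and more header fields.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri, http_bytes, when=None):
    """Wrap a raw HTTP response (status line, headers, body) in one WARC record."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",                  # what was fetched
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",  # when it was fetched
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_bytes + b"\r\n\r\n"  # records end with two CRLFs
```

Because the payload is the untouched HTTP response, the server's own headers travel with the story text, which a plain saved HTML file would lose.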
06:12
🔗
|
dnova |
ah, interesting. |
06:12
🔗
|
bsmith094 |
huh, well ok then |
06:13
🔗
|
yipdw |
also, IA is set up to ingest WARCs, I think |
06:13
🔗
|
Coderjoe |
yes, the wayback is set up to ingest warc pretty much automatically (once someone feeds the warc to it) |
06:14
🔗
|
bsmith094 |
so something like Books/Harry Potter/1234567/2345678/blah.html |
06:14
🔗
|
yipdw |
so do you just want to archive the text of the stories? |
06:14
🔗
|
yipdw |
or are you after more than that? |
06:14
🔗
|
bsmith094 |
ok its late or early, so gnight yall |
06:14
🔗
|
yipdw |
because if it's just text, fanfiction.net's mobile site is actually better suited for this |
06:14
🔗
|
yipdw |
(it's simpler) |
06:15
🔗
|
Coderjoe |
yipdw: he's after just the text. I'd prefer a full warc set |
06:15
🔗
|
yipdw |
Coderjoe: full WARC set of all stories, one story per WARC? |
06:15
🔗
|
yipdw |
or a WARC archive of the whole site |
06:15
🔗
|
Coderjoe |
well, IIRC, he wants the text, author comments, and reviews |
06:15
🔗
|
yipdw |
ok |
06:15
🔗
|
dnova |
a warc for the entire site would require LOTS of ram, I think |
06:15
🔗
|
yipdw |
dnova: yeah |
06:16
🔗
|
yipdw |
I guess what I should be asking is |
06:16
🔗
|
bsmith094 |
actually that would be a great bonus but ill take just the stories if thats all i can grab |
06:16
🔗
|
yipdw |
what's the objective here |
06:17
🔗
|
yipdw |
is the idea to take e.g. http://www.fanfiction.net/s/6635497/1/Plotting_The_Unknown_Future and wrap it into a WARC, comments, reviews and all? |
06:17
🔗
|
yipdw |
for ingestion into IA? |
06:18
🔗
|
yipdw |
anyway, I'll clean up that ff Ruby code and dump it into an AT repo on github |
06:18
🔗
|
yipdw |
that loop { sleep 5 } bullshit needs to go |
06:19
🔗
|
yipdw |
PSA: if anyone is doing sleeps like that in threads and you're not waiting on a periodic source, you have sinned |
06:20
🔗
|
dnova |
I'll take your word for it |
06:20
🔗
|
yipdw |
arguably sleeping on periodic sources is a bad idea anyway |
06:21
🔗
|
yipdw |
er, as a wait for |
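The alternative to polling with a sleep loop is to block on something that wakes up exactly when the work is finished. In Python that could look like the sketch below, which uses a blocking `Queue.get` in the executors and `Queue.join` in the main thread; `run_until_done` and the `None` stop marker are illustrative choices, not a translation of the Ruby code.

```python
import queue
import threading


def run_until_done(tasks, handler, executors=4):
    """Process tasks on worker threads and return as soon as the last one is done."""
    work = queue.Queue()
    done = []  # list.append is atomic in CPython, so no lock is used here

    def executor():
        while True:
            task = work.get()      # blocks until a task or stop marker arrives
            try:
                if task is None:   # stop marker
                    return
                done.append(handler(task))
            finally:
                work.task_done()

    threads = [threading.Thread(target=executor) for _ in range(executors)]
    for thread in threads:
        thread.start()
    for task in tasks:
        work.put(task)
    work.join()                    # returns when every task has been checked in
    for _ in threads:
        work.put(None)
    for thread in threads:
        thread.join()
    return done
```

Nothing here wakes up on a timer: the main thread sleeps inside `join()` and the executors sleep inside `get()` until there is something to do.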
06:21
🔗
|
bsmith094 |
not a huge thing, and i feel like a jerk since i cant code worth a damn, but it would be just fantastic, if you could put the author profile page in there somewhere, as well as the reviews for each story as html, with the story |
06:21
🔗
|
arrith |
yipdw: underscor is kinda leading the design on that |
06:22
🔗
|
yipdw |
arrith: cool |
06:22
🔗
|
yipdw |
again, this Ruby stuff is just a PoC |
06:22
🔗
|
arrith |
i'm not sure what he's including but i'm hoping as much as possible |
06:22
🔗
|
yipdw |
feel free to use or not use as needed |
06:22
🔗
|
arrith |
once he's done with his bash+php+perl thing i want to look over it and try to convert it to python as much as possible, make sure it's getting everything comprehensively enough, then integrate it with the universal tracker for periodic scrapes |
06:23
🔗
|
bsmith094 |
true, that why i feel like a jerk, im throwing out ideas, that equal more work for the rest of you guys, and i cant really contribute anything, but bandwidth to run whatever scripts you finally come up with |
06:23
🔗
|
arrith |
alright. i don't know ruby but it looks pretty neat. i'll try to make sense of it |
06:23
🔗
|
Wyatt|Wor |
Before I forget yet again, SketchCow, can I have an rsync slot? I've got some berlios and a wayward chunk of Google Groups. |
06:23
🔗
|
arrith |
bsmith094: you could glance over a python tutorial :P |
06:23
🔗
|
Wyatt|Wor |
arrith: Ruby is perl in a dress. |
06:23
🔗
|
arrith |
bsmith094: http://learnpythonthehardway.org/ |
06:23
🔗
|
dnova |
arrith: do you know a good one? |
06:23
🔗
|
dnova |
beat me. |
06:23
🔗
|
arrith |
dnova: http://learnpythonthehardway.org/ |
06:23
🔗
|
dnova |
LOL |
06:23
🔗
|
yipdw |
arrith: it starts at the roots -- the major subdivisions of the site |
06:23
🔗
|
SketchCow |
OK, one moment. |
06:23
🔗
|
dnova |
thanks :P |
06:24
🔗
|
yipdw |
arrith: each root is thrown into the discovery queue, which generates more categories or story URLs |
06:24
🔗
|
arrith |
dnova: that one and How To Think Like A Computer Scientist |
06:24
🔗
|
yipdw |
arrith: from there, categories are sent to the discovery queue, story URLs are sent to the grab queue |
06:24
🔗
|
bsmith094 |
that, just now, was more activity in 2 min, than this feed has had in a week |
06:24
🔗
|
yipdw |
arrith: there's four executors for each queue, and four HTTP connections shared amongst all queues |
06:25
🔗
|
yipdw |
it's similar in structure to what one might do with the multiprocessing package in python |
06:25
🔗
|
yipdw |
just different names. |
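A rough Python rendering of that structure is sketched below: one discovery queue, one grab queue, a few executors on each, and a semaphore standing in for the shared pool of HTTP connections. It is a sketch under assumptions, not a port of the gist; `crawl`, `links_of`, `is_story` and `grab` are hypothetical names supplied by the caller.

```python
import queue
import threading


def crawl(roots, links_of, is_story, grab, executors=4, connections=4):
    """Expand categories on one queue, grab stories on another, share the connections."""
    discovery, grabs = queue.Queue(), queue.Queue()
    http_slots = threading.BoundedSemaphore(connections)  # shared by both queues
    seen, seen_lock = set(roots), threading.Lock()
    saved = []

    def discover():
        while True:
            page = discovery.get()
            try:
                with http_slots:
                    links = links_of(page)
                for link in links:
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    (grabs if is_story(link) else discovery).put(link)
            finally:
                discovery.task_done()

    def grab_worker():
        while True:
            url = grabs.get()
            try:
                with http_slots:
                    saved.append(grab(url))
            finally:
                grabs.task_done()

    for target in (discover, grab_worker):
        for _ in range(executors):
            threading.Thread(target=target, daemon=True).start()
    for root in roots:
        discovery.put(root)
    discovery.join()  # no category left to expand
    grabs.join()      # every discovered story has been grabbed
    return saved
```

Changing `connections` alone raises or lowers the load on the site, independent of how many executors are waiting on each queue.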
06:25
🔗
|
dnova |
this looks great, I'm going to check it out, thanks arrith. |
06:25
🔗
|
arrith |
dnova: good :) |
06:25
🔗
|
arrith |
yipdw: hmm yeah i'm hoping that's not too difficult to translate into python |
06:25
🔗
|
yipdw |
arrith: it shouldn't be, Python has much of the same tools |
06:26
🔗
|
yipdw |
one second |
06:26
🔗
|
yipdw |
updating support.rb with smarter logic |
06:26
🔗
|
arrith |
Wyatt|Wor: sounds about right |
06:31
🔗
|
bsmith094 |
while we're all here, has anyone checked out storyinator.sh from here, www.tracker.archive.org/ffnet |
06:32
🔗
|
yipdw |
alrighty |
06:32
🔗
|
yipdw |
https://gist.github.com/1432483/cdbfa4c8e9779e009838235da543fc0a08754862 |
06:32
🔗
|
Wyatt|Wor |
Oh? Hey now |
06:32
🔗
|
Wyatt|Wor |
bsmith094: I'm getting 404 |
06:33
🔗
|
Wyatt|Wor |
Or not even 404 |
06:33
🔗
|
bsmith094 |
http://tracker.archive.org/ff.net |
06:33
🔗
|
bsmith094 |
wrong link |
06:33
🔗
|
Wyatt|Wor |
AH |
06:34
🔗
|
arrith |
Wyatt|Wor: that's a bit of underscor's work so far |
06:34
🔗
|
zetathust |
yeah |
06:34
🔗
|
bsmith094 |
yeah i know |
06:34
🔗
|
arrith |
mk |
06:35
🔗
|
arrith |
just he's further along and it's a non portable proof of concept atm |
06:37
🔗
|
arrith |
yipdw: do you generally prefer ruby to python for quick projects? |
06:37
🔗
|
Wyatt|Wor |
non....portable? But it runs on anything with a bash interpreter... |
06:37
🔗
|
Wyatt|Wor |
;) |
06:38
🔗
|
yipdw |
arrith: I've used Ruby more recently |
06:38
🔗
|
yipdw |
so I find it easier to express programs in it |
06:38
🔗
|
yipdw |
I have nothing against Python, though; I usually use it to script Blender |
06:38
🔗
|
yipdw |
no complaints about Python there |
06:38
🔗
|
arrith |
Wyatt|Wor: heh well, requires php. a novice user getting php up and running for a small script isn't the easiest |
06:39
🔗
|
bsmith094 |
i have a layout idea for what to grab for the stories http://pastebin.com/W6tUR1VE |
06:39
🔗
|
arrith |
yipdw: ah, i was wondering if you had experience with both or just knew ruby more |
06:39
🔗
|
yipdw |
arrith: both :P |
06:40
🔗
|
arrith |
yipdw: that's exactly why i'm writing things in bash and not python ;) |
06:40
🔗
|
yipdw |
eh? |
06:40
🔗
|
yipdw |
well |
06:40
🔗
|
yipdw |
here's my problem with bash |
06:40
🔗
|
yipdw |
the language is arcane as hell, it's not really THAT portable due to lots of differences between shell versions |
06:40
🔗
|
yipdw |
and even if you have the same version, the installed utilities can differ |
06:40
🔗
|
yipdw |
GNU du does not accept the same options as e.g. BSD du, for instance |
06:40
🔗
|
yipdw |
so you end up coding abstractions for stuff like that |
06:41
🔗
|
arrith |
oh yeah i have no defense for any of that |
06:41
🔗
|
yipdw |
in the end I've found Python, Perl, Ruby to be more portable than bash :P |
06:41
🔗
|
arrith |
yeah definitely |
06:41
🔗
|
arrith |
i blame it on being 'raised wrong'. it's all i know! |
06:41
🔗
|
arrith |
for now at least |
06:41
🔗
|
Wyatt|Wor |
I love bash for the beauty that comes from some of the ugliest code on the planet. |
06:42
🔗
|
bsmith094 |
ditto |
06:42
🔗
|
bsmith094 |
i can actually follow most of it |
06:42
🔗
|
Wyatt|Wor |
But I'm not going to pretend it's more than glue. |
06:42
🔗
|
Wyatt|Wor |
Moreso than perl, even. |
06:43
🔗
|
arrith |
i've had an unfortunate feedback loop of mainly knowing bash, so i start a project in it then google to fill in areas that i lack, and not just starting over doing the hard very beginning stuff with a new lang |
06:43
🔗
|
yipdw |
for my next project, I'll get ArchiveTeam using Factor |
06:43
🔗
|
no2p |
Why use bash when you can use ksh? ;) |
06:43
🔗
|
yipdw |
http://factorcode.org/ |
06:44
🔗
|
Wyatt|Wor |
no2p: Actually people ask me this seriously here. I even developed an answer for it: Because Bash is everywhere. |
06:44
🔗
|
no2p |
Oh, no doubt. I was joking in terms of 'looks'. |
06:44
🔗
|
yipdw |
so is Java, but that hasn't helped much :P |
06:44
🔗
|
yipdw |
well, that's unfair |
06:45
🔗
|
yipdw |
in a server context it's fine |
06:45
🔗
|
Wyatt|Wor |
yipdw: But Java is a boilerplate language not a glue language |
06:45
🔗
|
yipdw |
re: portability |
06:45
🔗
|
yipdw |
Wyatt|Wor: I don't understand the distinction |
06:45
🔗
|
bsmith094 |
hey, i love java for its ubiquity |
06:45
🔗
|
Coderjoe |
geh |
06:46
🔗
|
Wyatt|Wor |
yipdw: Java you spend most of your time writing long strings of boilerplate code. Bash, you spend a lot of time gluing other things together until it does what you want. |
06:46
🔗
|
bsmith094 |
not saying its good, or fast, but a jar will run on anything with a jvm |
06:46
🔗
|
Coderjoe |
stop with all the esoteric languages. for the distributed downloading stuff, it should use a well-featured and widely-installed language (like python or perl) |
06:46
🔗
|
yipdw |
I wasn't serious about Factor |
06:46
🔗
|
bsmith094 |
nor me with java |
06:46
🔗
|
arrith |
yipdw: AT needs easier code not harder code :P |
06:47
🔗
|
bsmith094 |
**shudder** |
06:47
🔗
|
arrith |
s'why i'm evangelizing python |
06:47
🔗
|
yipdw |
python's fine |
06:47
🔗
|
bsmith094 |
ive heard great things about it |
06:47
🔗
|
yipdw |
Wyatt|Wor: I guess, although a lot of that applies to Java programming too; it's just that you write more to glue bits from libraries together |
06:47
🔗
|
Coderjoe |
Wyatt|Wor: except bash relies on other userland tools (like gnu userland or bsd userland), which are not always completely compatible. Plus there were issues with your centos being out of date, buggy versions of grep, etc |
06:47
🔗
|
arrith |
at least in terms of getting beginners up to speed and helping out with it |
06:47
🔗
|
arrith |
i suppose if a person already knows java then other jvm-ish things might be easier for them |
06:47
🔗
|
bsmith094 |
check out the archive box channel |
06:48
🔗
|
Wyatt|Wor |
I prefer perl's flavour of sugar to python's, but that's personal preference. |
06:48
🔗
|
bsmith094 |
.join #archivebox |
06:48
🔗
|
yipdw |
arrith: I dunno, how many Java programmers do you know who have picked up Clojure :P |
06:48
🔗
|
arrith |
yipdw: none that weren't told to, which i guess the AT would be doing heh |
06:48
🔗
|
Wyatt|Wor |
Coderjoe: Right, gluing things together. I'm not advocating for doing all AT stuff in Bash, don't misunderstand |
06:49
🔗
|
Wyatt|Wor |
(Or any of it, really) |
06:50
🔗
|
Coderjoe |
(and I'm not saying python is free of problems either. I had to write some hacks recently to work around problems with python's win32 file api interaction...) |
06:50
🔗
|
arrith |
Coderjoe: out of curiosity, what kind of problems? |
06:50
🔗
|
Coderjoe |
for paths longer than 256 characters |
06:50
🔗
|
arrith |
ah interesting |
06:51
🔗
|
Coderjoe |
os.walk and such have the needed hacks in the main python code, but stat and open do not |
06:52
🔗
|
Coderjoe |
https://gist.github.com/1432614 |
06:53
🔗
|
Coderjoe |
but it is partly windows' fault for being stupid with the paths |
06:54
🔗
|
Wyatt|Wor |
That looks really bizarre |
06:54
🔗
|
yipdw |
goddamnit, I just spent five minutes looking for my phone's TV-out cable and it was right next to me |
06:55
🔗
|
Wyatt|Wor |
It's going to take a while to get used to this idea that mobile phones can output 1080p video over HDMI. |
06:56
🔗
|
Coderjoe |
Wyatt|Wor: the \\?\ thing has to do with some win32 api hacks. see under "lpFileName" on http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858%28v=vs.85%29.aspx |
06:57
🔗
|
Coderjoe |
I would have thought the unicode version would be free of this MAX_PATH stupidity, but apparently not |
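The `\\?\` prefix mentioned above can be applied with a small helper. The sketch below is not the code from the linked gist; `extended_path` is a hypothetical function, and the 260 figure is the Win32 `MAX_PATH` constant as I understand it. It uses `ntpath` so the string handling behaves the same on any platform.

```python
import ntpath

MAX_PATH = 260  # classic Win32 limit, which includes the terminating NUL


def extended_path(path):
    r"""Return `path` with the \\?\ prefix if it is an absolute Windows path that needs it."""
    if path.startswith("\\\\?\\"):
        return path                      # already in extended form
    if not ntpath.isabs(path):
        raise ValueError("extended-length paths must be absolute")
    path = ntpath.normpath(path)         # the prefix disables normalisation, so do it first
    if len(path) < MAX_PATH:
        return path                      # short enough for the ordinary API
    if path.startswith("\\\\"):          # UNC form: \\server\share\...
        return "\\\\?\\UNC\\" + path[2:]
    return "\\\\?\\" + path
```

The result would then be passed to calls such as `open` or `os.stat` in place of the original path.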
06:59
🔗
|
yipdw |
Wyatt|Wor: I'm syncing contacts between two phones; one phone's screen is shattered, so need TV-out to enable Bluetooth |
06:59
🔗
|
yipdw |
I'm impressed that the sync actually seems to be working |
06:59
🔗
|
yipdw |
(granted, they're both Nokia products, but even so) |
06:59
🔗
|
Wyatt|Wor |
Ooh, bummer. What handset? |
07:00
🔗
|
yipdw |
N900 and N9 |
07:00
🔗
|
Wyatt|Wor |
Ah, those are nice. A pal of mine really digs his. |
07:00
🔗
|
yipdw |
their biggest problem is that they've both been left for dead :P |
07:01
🔗
|
Wyatt|Wor |
I know; that's really sad. |
07:02
🔗
|
Wyatt|Wor |
MeeGo, from what I've seen, is really nice, too |
07:03
🔗
|
yipdw |
I like the UI paradigm; the infrastructure has some rough spots |
07:03
🔗
|
yipdw |
like the capabilities framework |
07:03
🔗
|
yipdw |
I think a lot of that is because it was never finished |
07:03
🔗
|
yipdw |
but chronomex is probably gonna ring the off-topic bell on me so I'll shut up now :P |
07:06
🔗
|
arrith |
yipdw: well, i don't want to be ot but i commend your bravery for going for the N9 after what happened with the N900 and especially all that's happened around it. i was eying an N900 for a long time but at this point i'm waiting for cyanogenmod to get more debianish or to see what tizen turns out to be |
07:07
🔗
|
yipdw |
arrith: heh, not so much bravery as "ooh, shiny" |
07:07
🔗
|
yipdw |
MeeGo (or more precisely Nokia's Harmattan layer) irks me in that I'm trying to fix some of its problems (like no generic Jabber support) but so much of it is closed-source |
07:08
🔗
|
yipdw |
so there's a lot more "huh, I guess I'll just have to poke at it" than IMO is necessary |
07:09
🔗
|
arrith |
ah dang |
07:09
🔗
|
bsmith094 |
i realize this is kind of random, but does I have an official IRC channel? |
07:09
🔗
|
bsmith094 |
IA |
07:10
🔗
|
arrith |
maemo was the closest i've seen to 'debian on a phone' but it's gotten pretty weird since that point |
07:10
🔗
|
bsmith094 |
feel free to bite my head off, but.... android |
07:10
🔗
|
Coderjoe |
mmm |
07:10
🔗
|
yipdw |
bsmith094: I don't think it does; poke underscor or SketchCow |
07:10
🔗
|
Coderjoe |
4k-16bit pngs of sintel are large |
07:10
🔗
|
arrith |
bsmith094: #archive on freenode was mentioned back in 2005 |
07:11
🔗
|
bsmith094 |
underscor SketchCow |
07:11
🔗
|
arrith |
Coderjoe: they still need to do a 4k render. this max of 720p is insulting. |
07:11
🔗
|
Coderjoe |
they have a 4k and a 4k-16bit render |
07:11
🔗
|
Coderjoe |
and a 1080p render |
07:11
🔗
|
bsmith094 |
well its not here |
07:11
🔗
|
arrith |
bsmith094: http://www.google.com/search?q=archive.org+irc+channel |
07:12
🔗
|
arrith |
Coderjoe: eh? those weren't on the download page last i saw. they must be hidden |
07:12
🔗
|
Coderjoe |
http://media.xiph.org/sintel/ |
07:12
🔗
|
arrith |
mm nice |
07:13
🔗
|
bsmith094 |
it's empty, and automatedly dead |
07:15
🔗
|
arrith |
bsmith094: you can bide your time with that python tutorial |
07:17
🔗
|
bsmith094 |
k then, hey, speaking of sintel, whatever happened to that other foss movie, elephants dream, the dvd iso torrents are deader than luna, and i would really like them, yes i did check ia no they dont have it |
07:18
🔗
|
Coderjoe |
i don't know. I have the 1080p pngs and flac audio for it, though |
07:18
🔗
|
Coderjoe |
and I think I have the dvd somewhere too |
07:20
🔗
|
bsmith094 |
well i found a torrent its running and i am so uploading this to IA when its finished in 2 days |
07:21
🔗
|
dnova |
bsmith094: flesh out the FanFiction.Net wiki page please. |
07:21
🔗
|
bsmith094 |
do i have edit rights? |
07:22
🔗
|
dnova |
if you have any account yes |
07:23
🔗
|
dnova |
I added it to http://archiveteam.org/index.php?title=Projects#Other_Projects |
07:23
🔗
|
dnova |
but it needs some info |
07:23
🔗
|
dnova |
even if it's very preliminary |
07:25
🔗
|
bsmith094 |
<titleblacklist-forbidden-new-account> |
07:26
🔗
|
bsmith094 |
so do i have an account or not |
07:26
🔗
|
dnova |
... are you logged in? |
07:26
🔗
|
dnova |
did you make an account? |
07:26
🔗
|
dnova |
I'm not sure what to say |
07:26
🔗
|
bsmith094 |
im trying to create one and i keep getting that error |
07:27
🔗
|
Coderjoe |
I see nothing recent for the user creation log |
07:27
🔗
|
Coderjoe |
http://www.archiveteam.org/index.php?title=Special:Log/newusers |
07:28
🔗
|
Coderjoe |
what account are you trying to make? |
07:28
🔗
|
bsmith094 |
bsmith093 |
07:28
🔗
|
dnova |
there is no user "bsmith*" |
07:28
🔗
|
Coderjoe |
i wonder if something is filtering it thinking it looks too much like a spambot username? |
07:28
🔗
|
bsmith094 |
k then ill try something else |
07:29
🔗
|
Coderjoe |
though we appear to have other spambots in the roost |
07:29
🔗
|
Coderjoe |
http://www.archiveteam.org/index.php?title=Special:Contributions/Fdhbgj |
07:30
🔗
|
bsmith094 |
tried EntropyWins, same error; only thing i can think of is i screwed up the captcha |
07:30
🔗
|
bsmith094 |
but not 6 times in a row |
07:31
🔗
|
Coderjoe |
i don't know then. |
07:31
🔗
|
Wyatt|Wor |
New signups may be turned off for the moment because SketchCow was hunting another SEO spammer |
07:31
🔗
|
Coderjoe |
and I should probably hop in the time machine and go to bed 2 hours ago |
07:32
🔗
|
Wyatt|Wor |
That's my hypothesis, at least. |
07:33
🔗
|
Coderjoe |
arrith: btw, if you go to media.xiph.org, the page there lists the sizes of the different versions |
07:33
🔗
|
arrith |
ah alright |
07:35
🔗
|
arrith |
bsmith094: try a different nick type |
07:35
🔗
|
arrith |
i think spammers put numbers on their usernames at some point |
07:38
🔗
|
bsmith094 |
ok im in as NonCoderBen, now what do i say |
07:38
🔗
|
arrith |
bsmith094: what you must |
07:40
🔗
|
bsmith094 |
check it now |
07:43
🔗
|
SketchCow |
No, new signings should be fine. |
07:49
🔗
|
Wyatt|Wor |
Huh. |
07:49
🔗
|
Wyatt|Wor |
Okay then, weird. |
07:49
🔗
|
Wyatt|Wor |
Oh yeah, rsync for me? |
07:51
🔗
|
bsmith094 |
arrith: that curl script is running 400 at once |
08:09
🔗
|
SketchCow |
Ah yes, slot |
08:12
🔗
|
bsmith094 |
gnight/gmorning all |
08:14
🔗
|
* |
kennethre yawns |
08:26
🔗
|
SketchCow |
Sorry, got hung up on stupid thing. |
08:26
🔗
|
SketchCow |
See, I moved bbsdocumentary.com to the new server, but it still has php infestation. |
08:28
🔗
|
Wyatt|Wor |
No, no, it's cool. Those are nasty. |
08:28
🔗
|
Wyatt|Wor |
What sort of infestation? |
08:28
🔗
|
SketchCow |
php additions. |
08:29
🔗
|
Wyatt|Wor |
Sorry to hear that. :/ Can you diff against a backup? |
08:30
🔗
|
SketchCow |
Well, I can find the culprits, and I can shut off PHP on the new server. |
08:30
🔗
|
SketchCow |
New server uses no PHP. |
08:30
🔗
|
SketchCow |
PHP is garbage. |
08:30
🔗
|
SketchCow |
People who like it like leaving "just one" door unlocked for convenience, but it's OK because all the other doors are locked. |
08:31
🔗
|
SketchCow |
i.e. retards |
08:31
🔗
|
chronomex |
PHP is the wrong tool for any job |
08:32
🔗
|
chronomex |
kind of like a tin vise grip |
08:32
🔗
|
Wyatt|Wor |
I've never been a fan and certainly haven't grown fonder. Wordpress has killed any good will it might have had from me. |
08:32
🔗
|
chronomex |
heh |
08:32
🔗
|
chronomex |
anyway. |
08:35
🔗
|
SketchCow |
What the fuck is nef format |
08:35
🔗
|
chronomex |
it's a raw file from a camera |
08:36
🔗
|
chronomex |
I don't know what it stands for. |
08:36
🔗
|
Wyatt|Wor |
Google sayeth Nikon Electronic Format. |
08:36
🔗
|
chronomex |
https://www.google.com/search?q=nef+format |
08:37
🔗
|
chronomex |
"Nikon exclusive NEF format" |
08:37
🔗
|
SketchCow |
Well, I am excited to see what happens when I dump NEF format into archive.org |
08:37
🔗
|
SketchCow |
Theory: Nothing |
08:37
🔗
|
chronomex |
unlike products, guys, an "exclusive" designation on a file format is NOT a bonus. |
08:37
🔗
|
Wyatt|Wor |
I still don't understand why there are so many different formats for raw image data. |
08:37
🔗
|
db48x2 |
lol |
08:38
🔗
|
* |
Wyatt|Wor never put any dots in photography |
08:38
🔗
|
ersi |
SketchCow: Whoa man, that was a nice test. |
08:39
🔗
|
SketchCow |
It'd better be, for $13k of new equipment! |
08:39
🔗
|
ersi |
Wyatt|Wor: Because there's several Large Photo Corps and they all have different sensors, which most often just dumps out the raw sensor data |
08:40
🔗
|
ersi |
SketchCow: Heh, might be that you grabbed Chris for the test as well ^_^ |
08:40
🔗
|
chronomex |
Wyatt|Wor: http://en.wikipedia.org/wiki/Raw_image_format#Rationale |
08:40
🔗
|
SketchCow |
Right now, I'm cleaning up french magazines so this dead end with raw formats won't be miserable. |
08:40
🔗
|
SketchCow |
Do you know Chris? |
08:41
🔗
|
ersi |
SketchCow: No, but it feels like I do, now. |
08:41
🔗
|
SketchCow |
I can fix a french magazine item in 5 seconds now. |
08:41
🔗
|
Wyatt|Wor |
SketchCow: Is that using the new lights? If so, I totally agree with your decision to halogen. |
08:41
🔗
|
SketchCow |
Gotta type fast, but I can do it. |
08:41
🔗
|
SketchCow |
Well, the new lights are just new copies of the old lights. |
08:41
🔗
|
SketchCow |
Same light as GET LAMP |
08:41
🔗
|
Wyatt|Wor |
And it looked good there. |
08:42
🔗
|
SketchCow |
Here's the command I'm doing: |
08:42
🔗
|
SketchCow |
mv */* .;rmdir *;mv *.txt txt.txt;exit |
08:42
🔗
|
* |
ersi shrugs |
08:42
🔗
|
balrog |
SketchCow: does that free package support NEF? |
08:43
🔗
|
SketchCow |
No idea |
08:43
🔗
|
balrog |
dcraw |
08:43
🔗
|
balrog |
http://www.cybercom.net/~dcoffin/dcraw/ |
08:43
🔗
|
SketchCow |
Let me look. |
08:43
🔗
|
balrog |
yeah but cameras only |
08:43
🔗
|
ersi |
"There are dozens of raw photo formats: CRW, CR2, MRW, NEF, RAF, etc. "RAW Format" does not exist; it is an illusion created by dcraw's ability to read all raw formats. " |
08:43
🔗
|
balrog |
he provides this code for scanners: http://www.cybercom.net/~dcoffin/dcraw/scan.c |
08:43
🔗
|
SketchCow |
Wait, wait |
08:43
🔗
|
balrog |
the NEFs you have |
08:43
🔗
|
SketchCow |
I THINK the donator donated .TIFFs as well |
08:43
🔗
|
balrog |
are they from cameras or scanners |
08:43
🔗
|
SketchCow |
In that case, who gives a shit, I'll include all three. |
08:43
🔗
|
balrog |
well then probably use the tiffs |
08:44
🔗
|
ersi |
Yeah, that's the best, really. |
08:44
🔗
|
balrog |
the benefit of raw images are that you can make adjustments later |
08:44
🔗
|
SketchCow |
.tif, .nef, and the thing one |
08:44
🔗
|
balrog |
if you have the software to process them, that is. |
08:44
🔗
|
ersi |
balrog: Or you just chuck them all in, nothing is lost that way |
08:44
🔗
|
balrog |
but a .NEF from a scanner is no more useful than a .TIFF |
08:44
🔗
|
balrog |
ersi: true |
08:44
🔗
|
balrog |
:/ |
08:44
🔗
|
balrog |
scanner .NEFs don't have additional data, like camera ones do |
08:44
🔗
|
chronomex |
balrog: tifs are really damn useful. |
08:44
🔗
|
balrog |
idk why nikon even did that |
08:44
🔗
|
SketchCow |
These are 190x newspapers the guy took photos of. |
08:45
🔗
|
balrog |
SketchCow: photos with a camera? |
08:45
🔗
|
SketchCow |
They're not 100% perfect but it's a nice collection to add. |
08:45
🔗
|
SketchCow |
Yeah. |
08:45
🔗
|
ersi |
Whoa, that be many. |
08:45
🔗
|
balrog |
keep the .NEFs |
08:45
🔗
|
chronomex |
balrog: you just don't like tif because it's a pain to view on windows, but tif is perfect for actually working with images. |
08:45
🔗
|
balrog |
if someone needs to do white balance correction or such … will need them. |
08:45
🔗
|
SketchCow |
190x is the date, not the number |
08:45
🔗
|
balrog |
chronomex: I didn't say I don't like .tif |
08:45
🔗
|
balrog |
I actually do |
08:45
🔗
|
balrog |
but raw camera images contain more data |
08:45
🔗
|
chronomex |
00:45:02 < balrog> but a .NEF from a scanner is no more useful than a .TIFF |
08:45
🔗
|
chronomex |
looks like you said "tiff and nef are not useful" |
08:45
🔗
|
balrog |
chronomex: I was saying that a scanner .NEF is junk |
08:45
🔗
|
SketchCow |
/newspapers/Jimmy Swinnerton/On And Off The Ark - 1902/26b.tif' saved [2083747] |
08:45
🔗
|
SketchCow |
2mb TIF |
08:45
🔗
|
balrog |
since it has nothing that the .tiff doesn't have |
08:46
🔗
|
balrog |
yeah I have played with nikon scanners that generate .nefs |
08:46
🔗
|
chronomex |
balrog: NEF is a special case of TIFF. |
08:46
🔗
|
ersi |
SketchCow: Oh. Heh. |
08:46
🔗
|
ersi |
SketchCow: That be plenty old then, sweet find |
08:47
🔗
|
SketchCow |
So, just to explain what's going on. |
08:47
🔗
|
SketchCow |
So archive.org chokes on some character sets. |
08:47
🔗
|
balrog |
:[ |
08:47
🔗
|
ersi |
chronomex: Well, he did say that NEFs from a Nikon SCANNER is bullshit. Since it does not provide any more information than a TIFF would. Nothing was said about NEFs vs. TIFF or anything. |
08:47
🔗
|
SketchCow |
These French computer magazines? The filenames have some of those. |
08:47
🔗
|
ersi |
Now I'll stop caring |
08:47
🔗
|
balrog |
they're not utf-8? |
08:47
🔗
|
SketchCow |
So I have this script. |
08:47
🔗
|
chronomex |
ersi: good plan. |
08:48
🔗
|
SketchCow |
It takes the .zip, unpacks it, drops me into a shell so I can "fix" them, then when I exit the shell, it re-packs, and re-uploads to archive.org. |
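A rough Python sketch of the unpack, fix, re-pack pattern described above. SketchCow's actual script is not shown in the log, so the names `clean_zip` and `interactive_fix` are hypothetical, and the re-upload step is left out because the log does not say how it is done.

```python
import os
import shutil
import subprocess
import tempfile
import zipfile
from pathlib import Path
from typing import Callable


def clean_zip(zip_path: Path, fix: Callable[[Path], None]) -> None:
    """Unpack zip_path, let `fix` tidy the tree, then rewrite the zip in place."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "work"
        work.mkdir()
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(work)
        fix(work)  # manual or scripted cleanup happens here
        rebuilt = Path(tmp) / "rebuilt.zip"
        with zipfile.ZipFile(rebuilt, "w", zipfile.ZIP_DEFLATED) as zf:
            for item in sorted(work.rglob("*")):
                if item.is_file():
                    zf.write(item, item.relative_to(work).as_posix())
        # Overwrite the original only after the new archive is complete.
        shutil.copyfile(rebuilt, zip_path)


def interactive_fix(work: Path) -> None:
    """Drop into a shell inside the unpacked tree; returns when the shell exits."""
    subprocess.call([os.environ.get("SHELL", "/bin/sh")], cwd=work)
```

Calling `clean_zip(Path("some_issue.zip"), interactive_fix)` would mimic the drop-into-a-shell step; passing a scripted function instead makes the same wrapper usable in a loop.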
08:48
🔗
|
ersi |
I would assume it's ISO-8859-*, because they're French |
08:49
🔗
|
SketchCow |
This french magazine collection is unfortunately an embarrassment of riches, because they have a LOT of issues, and some of the filenames and other things have failures. |
08:49
🔗
|
chronomex |
SketchCow: hm, that's a nice design pattern. I should remember that. |
08:49
🔗
|
SketchCow |
What, the script? |
08:49
🔗
|
balrog |
ok night all |
08:49
🔗
|
chronomex |
SketchCow: yeah. |
08:51
🔗
|
SketchCow |
It gets worse. |
08:51
🔗
|
SketchCow |
Now I'm running my two step process THROUGH A LOOP |
08:52
🔗
|
SketchCow |
So I am looping a two step process to make it less than 5 seconds because I'm no longer typing in the up arrow to make the slight number change. |
08:52
🔗
|
SketchCow |
This is the ONLY way I can get so much done, as people seem to think I'm capable of superhuman productivity |
08:53
🔗
|
chronomex |
ogod |
08:54
🔗
|
SketchCow |
I just fixed 8 of them. |
08:55
🔗
|
SketchCow |
This is going to add a brutal amount of material up, like a few thousand issues. |
08:55
🔗
|
SketchCow |
All french, but still very good. |
08:55
🔗
|
SketchCow |
Occasional sub-par scanning, wouldn't mind seeing some redone. |
08:55
🔗
|
SketchCow |
Missing issues here and there, etc. |
08:56
🔗
|
SketchCow |
Newspapers still downloading from the drop point - now that I see he made three versions of each page, it makes more sense. |
08:57
🔗
|
SketchCow |
mv */* .;rmdir *;mv *.txt txt.txt;exit |
08:57
🔗
|
SketchCow |
I mean |
08:57
🔗
|
SketchCow |
for each in 133 132 131 130 129 128 127 126 125 124 123 122 121 120;do ./cleanorator.sh generation4_numero_${each}_images.zip;done |
08:58
🔗
|
SketchCow |
See, do "cleanorator" to the .zip. Then next one |
08:58
🔗
|
SketchCow |
In cleanorator, I then do this simple operation they all share. |
08:59
🔗
|
SketchCow |
That mv. Which is "move them out of the weirdly named subdirectory, make the stupidly named .txt description file into a txt.txt file. |
08:59
🔗
|
SketchCow |
" |
08:59
🔗
|
SketchCow |
Simple, but tedious |
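The `mv */* .;rmdir *;mv *.txt txt.txt` step could be sketched in Python roughly as follows. This is an approximation under stated assumptions, not SketchCow's cleanorator: the function name `flatten_issue` is hypothetical, and unlike the shell glob it also moves dotfiles and refuses to overwrite on a name clash.

```python
from pathlib import Path


def flatten_issue(root: Path) -> None:
    """Lift files out of subdirectories and rename the lone .txt to txt.txt."""
    # Roughly `mv */* .`: move every entry of each immediate subdirectory up.
    for sub in [p for p in root.iterdir() if p.is_dir()]:
        for entry in list(sub.iterdir()):
            target = root / entry.name
            if target.exists():
                raise FileExistsError(f"refusing to overwrite {target}")
            entry.rename(target)
    # Roughly `rmdir *`: remove top-level directories that are now empty.
    for sub in [p for p in root.iterdir() if p.is_dir()]:
        if not any(sub.iterdir()):
            sub.rmdir()
    # Roughly `mv *.txt txt.txt`: only safe when exactly one .txt exists.
    texts = [p for p in root.iterdir() if p.is_file() and p.suffix == ".txt"]
    if len(texts) > 1:
        raise ValueError("more than one .txt file; pick one by hand")
    if texts and texts[0].name != "txt.txt":
        texts[0].rename(root / "txt.txt")
```

Raising on a name clash or on multiple `.txt` files is a design choice for the sketch: the archives are described as inconsistent, so stopping for a manual look seems safer than guessing.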
08:59
🔗
|
SketchCow |
But each one adds a 150-200pp magazine to the archive. |
08:59
🔗
|
SketchCow |
So I'll do it. |
09:00
🔗
|
SketchCow |
Yeah, see, up in the 12x range, the individual issues are 230pages |
09:00
🔗
|
SketchCow |
Which is crazy |
09:00
🔗
|
SketchCow |
"Generation 4" magazine |
09:00
🔗
|
SketchCow |
Circa 1999-2000 |
09:03
🔗
|
SketchCow |
Mostly sharing this to give people insight into how I get so much stuff done. |
09:03
🔗
|
zetathust |
the insight |
09:03
🔗
|
SketchCow |
... |
09:03
🔗
|
SketchCow |
blah blah insight blah |
09:03
🔗
|
SketchCow |
Weird. |
09:04
🔗
|
kennethre |
Wyatt|Wor: they're 100% raw dumps from the sensors, so every camera has its own format |
09:05
🔗
|
kennethre |
Wyatt|Wor: Adobe's DNG is the only open standard for archival of "raw" images |
09:05
🔗
|
kennethre |
Wyatt|Wor : http://en.wikipedia.org/wiki/Digital_Negative |
09:06
🔗
|
ersi |
Nice at joining ages later |
09:06
🔗
|
kennethre |
nodded off ;) |
09:07
🔗
|
ersi |
We're past RAW Formats since at least 20 min ago |
09:07
🔗
|
kennethre |
better late than never |
09:07
🔗
|
SketchCow |
kennethre: Mahdi came onto my Google Hangout. We chatted. |
09:08
🔗
|
kennethre |
SketchCow: ah, nice. We're good pals |
09:08
🔗
|
kennethre |
SketchCow: or he's just a crazy stalker and has me fooled |
09:10
🔗
|
SketchCow |
Still slamming through issues of Generation 4 magazine. |
09:12
🔗
|
SketchCow |
Good lord, some issues were 280p |
09:14
🔗
|
SketchCow |
I see, just browsing an issue, that some games would get 4 page spreads. |
09:14
🔗
|
SketchCow |
That'll do it. |
09:15
🔗
|
SketchCow |
Wing Commander III article - 8 pages |
09:19
🔗
|
ersi |
Awesome |
09:25
🔗
|
SketchCow |
It's fascinating what a mess some of these archives are. |
09:29
🔗
|
chronomex |
curation is a fuckload of work. |
09:30
🔗
|
SketchCow |
Yeah, my big "innovation" is doing layered qualities of curation. |
09:31
🔗
|
chronomex |
good curation is an order of magnitude harder than adequate curation |
09:32
🔗
|
chronomex |
wonderful curation is an order of magnitude still |
09:32
🔗
|
chronomex |
granted, "good" curation is maybe 5-10 minutes per item |
09:33
🔗
|
SketchCow |
Yeah, for me, it's mostly concentrating on "was heading straight for oblivion" to "stable" |
09:33
🔗
|
chronomex |
right |
09:33
🔗
|
chronomex |
you're going from "fucked" to somewhere between "adequate" and "good" |
09:34
🔗
|
SketchCow |
VERY occasionally I get people pushing back, and I say "motherfucker, this shit was going into a fire" |
09:34
🔗
|
chronomex |
"you want it better, here go make it better" |
09:34
🔗
|
SketchCow |
Hence metadata warriors |
09:34
🔗
|
* |
chronomex nod |
09:36
🔗
|
SketchCow |
http://www.archive.org/details/generation4-magazine |
09:36
🔗
|
SketchCow |
And there we go! |
09:36
🔗
|
SketchCow |
Now they're being rendered, added, etc. |
09:37
🔗
|
SketchCow |
But they're not red rows anymore. |
09:40
🔗
|
SketchCow |
Oo, oo... I can now do the magazines and clean them BEFORE going up! |
09:40
🔗
|
SketchCow |
93 issues of some magazine (Player One) in French... total size: 7.5gb of JPGs |
09:40
🔗
|
SketchCow |
So heavy again. |
09:43
🔗
|
SketchCow |
http://video.constantvzw.org/VJ13/ |
09:43
🔗
|
SketchCow |
There's my talk at the bottom (Jason) |
09:52
🔗
|
SketchCow |
Yeah, bless you french archiving team - and your wild, WILD inconsistency from zip file to zip file. |
09:55
🔗
|
chronomex |
<3 |
09:55
🔗
|
arrith |
SketchCow: if the issue is just how the filenames are you could convert the unicode to its compatible equivalent |
09:55
🔗
|
arrith |
like the unicode snowman is xn--n3h |
09:55
🔗
|
SketchCow |
I'm doing something similar. |
09:56
🔗
|
SketchCow |
Sadly, there's little consistency to the inconsistencies. |
09:56
🔗
|
SketchCow |
Obviously this was a weird labor of love dumped from all directions. |
09:56
🔗
|
SketchCow |
I'm making them somewhat more negotiable. |
09:57
🔗
|
SketchCow |
Sometimes it went into subdirectories, sometimes not. |
09:57
🔗
|
SketchCow |
Sometimes two pages a scan, sometimes one. |
09:57
🔗
|
SketchCow |
Sometimes it was with no weird characters. Sometimes so. |
09:58
🔗
|
SketchCow |
Like, just now, someone included the included booklet as a subdirectory. |
09:58
🔗
|
SketchCow |
Now I'm making it its own item. |
10:00
🔗
|
SketchCow |
Bonus Thumbs.db! |
10:01
🔗
|
arrith |
ah. well i wonder if there's enough in common with the majority that you could script those. then manually do the leftovers |
10:01
🔗
|
arrith |
after a quick google i'm actually not quite sure how url unicode encoding is done, but it's done somehow |
10:02
🔗
|
chronomex |
%-encoding of "unicode" values is a two-step process. |
10:02
🔗
|
SketchCow |
Also, there's a bigger issue at hand. |
10:02
🔗
|
chronomex |
first, the characters are turned into bytes somehow |
10:02
🔗
|
chronomex |
the most common way is utf-8 |
10:02
🔗
|
chronomex |
then, the bytes are coded. |
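The two steps chronomex describes can be sketched in Python as below. The function name `percent_encode` is made up for illustration; the standard library's `urllib.parse.quote` does the same job. Note that the `xn--n3h` form mentioned earlier is Punycode, which is used for domain names and is a separate scheme from percent-encoding.

```python
import string

# Characters that may appear unescaped in a URL component (RFC 3986 "unreserved").
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def percent_encode(text: str, encoding: str = "utf-8") -> str:
    """Percent-encode text: characters to bytes first, then bytes to %XX."""
    raw = text.encode(encoding)  # step 1: characters become bytes
    return "".join(              # step 2: each byte is kept or written as %XX
        chr(b) if chr(b) in UNRESERVED else f"%{b:02X}" for b in raw
    )
```

With UTF-8, `percent_encode("é")` gives `%C3%A9`; with `encoding="latin-1"` the same character gives `%E9`, which is why the filename encoding has to be known before the result means anything.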
10:02
🔗
|
SketchCow |
It's only SOMETIMES that it's a unicode issue. Sometimes it's a directory structure issue, a filename issue. |
10:03
🔗
|
SketchCow |
These really are quite a mess. |
10:03
🔗
|
SketchCow |
I now have a thing where I can fix it and make it consistent in less than 20 seconds per issue. |
10:03
🔗
|
db48x2 |
sounds like it's worse than the poetry archive |
10:03
🔗
|
SketchCow |
Google Groups is the nightmare |
10:03
🔗
|
arrith |
ah |
10:04
🔗
|
arrith |
well i found, but don't really understand, this on how to: http://stackoverflow.com/questions/804336/best-way-to-convert-a-unicode-url-to-ascii-utf-8-percent-escaped-in-python |
10:04
🔗
|
arrith |
if anyone is curious |
10:09
🔗
|
SketchCow |
Oool, bonus for naming HALF the files in an archive .jpg and the other half .jpeg |
10:09
🔗
|
chronomex |
SketchCow: I fucking hate that. |
10:09
🔗
|
db48x2 |
heh |
10:10
🔗
|
chronomex |
.jpg: because 8.3 is enough for anyone. |
10:10
🔗
|
arrith |
i like it when there's JPG and JPEG and i forgot to handle case |
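One way to avoid the mixed `.jpg` / `.jpeg` / `.JPG` problem is to compare suffixes case-insensitively and rewrite them to a single spelling. A small sketch, where the helper name and the choice of `.jpg` as the canonical suffix are arbitrary assumptions:

```python
from pathlib import PurePath

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def normalize_jpeg_name(name: str) -> str:
    """Return name with any JPEG suffix spelling rewritten to lowercase .jpg."""
    path = PurePath(name)
    if path.suffix.lower() in JPEG_SUFFIXES:  # case-insensitive match
        return str(path.with_suffix(".jpg"))
    return name  # not a JPEG name, leave untouched
```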
10:10
🔗
|
chronomex |
.JPG: because your software REALLY misses 1972 |
10:10
🔗
|
arrith |
haha |
10:10
🔗
|
SketchCow |
Gets better - some of these, they photograph pages 1-96, then 99-140 |
10:10
🔗
|
SketchCow |
WHY |
10:10
🔗
|
chronomex |
shivvvv |
10:14
🔗
|
SketchCow |
Now I'm blasting This American Life while slamming through these 96 issues. |
10:16
🔗
|
SketchCow |
Up to 43. |
10:17
🔗
|
Wyatt|Wor |
Oh, my rsyncs finished. Cool |
10:23
🔗
|
db48x2 |
yea, my splinder upload finished as well |
10:23
🔗
|
db48x2 |
18 gigs |
10:23
🔗
|
SketchCow |
Damn, it is STILL downloading those newspaper issues. |
10:23
🔗
|
SketchCow |
At 4mb a second. |
10:23
🔗
|
db48x2 |
heh |
10:24
🔗
|
Wyatt|Wor |
Oh yeah, I need to massage my Splinder stuff and consolidate it all in one place |
10:25
🔗
|
db48x2 |
interesting |
10:25
🔗
|
db48x2 |
I'm uploading mobileme at 2 MB/s |
10:25
🔗
|
Wyatt|Wor |
Respectable |
10:26
🔗
|
db48x2 |
especially since I only pay for 1 MB/s |
10:26
🔗
|
Wyatt|Wor |
Haha |
10:26
🔗
|
SketchCow |
22G . |
10:26
🔗
|
SketchCow |
root@teamarchive-0:/2/thenews# du -sh . |
10:27
🔗
|
SketchCow |
And growing. |
10:27
🔗
|
SketchCow |
1.3G Jimmy Swinnerton |
10:27
🔗
|
SketchCow |
1.4G Frederick Opper |
10:27
🔗
|
SketchCow |
11G The World |
10:27
🔗
|
SketchCow |
9.2G Mutt n Jeff |
10:27
🔗
|
Wyatt|Wor |
The world fits nicely on a spinning magnetic platter. |
10:28
🔗
|
db48x2 |
heh |
10:28
🔗
|
db48x2 |
actually, I'm surprised I can upload at all |
10:28
🔗
|
db48x2 |
I expected comcast to cut me off already |
10:29
🔗
|
db48x2 |
they've called me up to threaten me every month since I signed up |
10:29
🔗
|
SketchCow |
http://www.archive.org/details/playerone-magazine-001 |
10:30
🔗
|
arrith |
db48x2: if you can afford it the business plans have no caps |
10:30
🔗
|
arrith |
i don't know in particular how much it costs more |
10:31
🔗
|
db48x2 |
gobs more |
10:31
🔗
|
arrith |
ah ;/ |
10:31
🔗
|
db48x2 |
$200-300 more per month |
10:31
🔗
|
db48x2 |
I've signed up with a dsl provider though |
10:31
🔗
|
db48x2 |
half the cost for similar bandwidth |
10:31
🔗
|
db48x2 |
and no caps |
10:31
🔗
|
db48x2 |
wish I'd known about them before |
10:32
🔗
|
Wyatt|Wor |
"I am inquiring about our website, awholeservices.com..." at which point I break down laughing. |
10:32
🔗
|
db48x2 |
lol |
10:32
🔗
|
db48x2 |
I need to finish up the poetry archive |
10:32
🔗
|
db48x2 |
I still have 362 files that are duplicated, where one of the duplicates isn't a poem |
10:32
🔗
|
db48x2 |
haven't figured out how to distinguish them reliably |
10:34
🔗
|
Wyatt|Wor |
duplicated...in name? |
10:34
🔗
|
arrith |
fuzzy duplicate finding is a tricky business |
10:34
🔗
|
db48x2 |
Wyatt|Wor: sorta |
10:34
🔗
|
Wyatt|Wor |
Ah, I think I see |
10:34
🔗
|
db48x2 |
the poetry was originally downloaded by many people |
10:34
🔗
|
db48x2 |
some of them downloaded the same thing |
10:34
🔗
|
db48x2 |
so when I combined them into a single unified directory structure, I checked for duplicates and gave them sequential names |
10:34
🔗
|
db48x2 |
[db48x@celebdil poems]$ ll ./000/901/103/ |
10:34
🔗
|
db48x2 |
drwxrwxr-x. 2 db48x db48x 4.0K Nov 25 04:42 . |
10:34
🔗
|
db48x2 |
drwxrwxr-x. 1002 db48x db48x 20K Nov 25 04:42 .. |
10:34
🔗
|
db48x2 |
total 48K |
10:35
🔗
|
db48x2 |
-rw-r--r--. 1 db48x db48x 8 May 2 2011 000901103a.html |
10:35
🔗
|
db48x2 |
-rw-r--r--. 1 db48x db48x 17K Nov 23 12:32 000901103.html |
10:35
🔗
|
db48x2 |
is an example |
10:35
🔗
|
db48x2 |
here the bad one is only 8 bytes of junk |
10:37
🔗
|
Wyatt|Wor |
How have you been approaching it? |
10:37
🔗
|
db48x2 |
I haven't; I've been putting it off |
10:38
🔗
|
Wyatt|Wor |
Nonsense! You're just planning on how to Do It Right. :P |
10:39
🔗
|
db48x2 |
lol |
10:40
🔗
|
db48x2 |
actually, a quick check shows that all of these files are 8 bytes long |
10:40
🔗
|
db48x2 |
all of the corrupt ones |
10:40
🔗
|
db48x2 |
so I can just delete them all in one go |
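A hedged Python sketch of that cleanup, assuming (as stated above) that every corrupt duplicate is exactly 8 bytes long. The function names are hypothetical, and `dry_run` defaults to `True` so nothing is deleted until the list has been reviewed.

```python
import os
from pathlib import Path
from typing import List


def find_stub_files(root: Path, size: int = 8) -> List[Path]:
    """Return every file under root whose length is exactly `size` bytes."""
    stubs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.stat().st_size == size:
                stubs.append(path)
    return sorted(stubs)


def delete_stub_files(root: Path, size: int = 8, dry_run: bool = True) -> List[Path]:
    """List the stub files and, unless dry_run is set, delete them."""
    stubs = find_stub_files(root, size)
    if not dry_run:
        for path in stubs:
            path.unlink()
    return stubs
```

Matching on size alone only works because the junk files happen to share one length; a real poem of exactly 8 bytes would be caught too, so a dry run first is worthwhile.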
10:41
🔗
|
db48x2 |
that just leaves going through and renaming the ones that are left over |
10:41
🔗
|
arrith |
hopefully not many of those |
10:42
🔗
|
db48x2 |
arrith: there were 35349 files that were just 8 bytes of garbage |
10:42
🔗
|
db48x2 |
there are 362 left |
10:42
🔗
|
db48x2 |
all of them have an alternate file that at least has html in it |
10:43
🔗
|
db48x2 |
and now there are none |
11:28
🔗
|
Wyatt|Wor |
All right, job done. Cheers, all! |
11:42
🔗
|
emijrp |
Do you think that Archive Team is a bit English-centric? |
11:46
🔗
|
SketchCow |
Somewhat |
11:46
🔗
|
SketchCow |
But that will change |
11:49
🔗
|
ersi |
It is, what you make of it |
11:55
🔗
|
SketchCow |
http://www.archive.org/stream/l-atarien-magazine-01/l-atarien-01#page/n0/mode/2up |
11:55
🔗
|
SketchCow |
The Magazine of Club Atari (French) |
12:15
🔗
|
arrith |
hard enough to write the wiki let alone translate it |
12:15
🔗
|
arrith |
although if people are up for translating, i think there are mediawiki plugins for that |
12:16
🔗
|
db48x2 |
we just archived a big italian website |
12:17
🔗
|
emijrp |
Sure, but there are 200+ countries and more than 6000+ languages in the world. |
12:17
🔗
|
emijrp |
Only talking about it, not a complaint |
12:19
🔗
|
arrith |
if someone wants to look into that i think they could. there's various pieces of software out there to ease translation |
12:20
🔗
|
underscor |
SketchCow: Are you still awake from yesterday, or did you get up really early? |
12:21
🔗
|
emijrp |
arrith: i spoke about archiving websites in other languages, not translating our wiki |
12:21
🔗
|
SketchCow |
TRADE SECRET |
12:23
🔗
|
db48x2 |
7700+ |
12:26
🔗
|
underscor |
SketchCow: :( |
12:26
🔗
|
emijrp |
what is your case underscor ? you are in the US too right? |
12:26
🔗
|
underscor |
I'm up for school |
12:26
🔗
|
underscor |
Although I'm not going because I'm awfully sick :( |
12:26
🔗
|
emijrp |
ha |
12:28
🔗
|
arrith |
oh |
12:28
🔗
|
emijrp |
http://code.google.com/p/wikiteam/downloads/detail?name=archiveteamorg-20111203-history.xml.7z |
12:31
🔗
|
SketchCow |
http://www.archive.org/details/cyberstratege-magazine&reCache=1 |
12:33
🔗
|
emijrp |
a google waves bots wiki http://code.google.com/p/wikiteam/downloads/detail?name=googlewavebotsinfo_wiki-20111201-current.xml.7z |
12:36
🔗
|
emijrp |
musicmen in black |
12:37
🔗
|
emijrp |
fucking window focus, searching for men in black OST on youtube |
12:40
🔗
|
underscor |
lol |
12:45
🔗
|
ersi |
emijrp: "< db48x2> we just archived a big italian website" <- how is that Not working on Non-english stuff? |
12:46
🔗
|
emijrp |
man, what is your problem with me? |
12:47
🔗
|
ersi |
That you apparently can't read :( |
12:47
🔗
|
ersi |
And that you often post stuff without any context |
12:47
🔗
|
ersi |
that's about it. |
12:48
🔗
|
SketchCow |
Boys, boys |
12:48
🔗
|
emijrp |
you just have to reply all my messages in bad mood, stop it |
12:48
🔗
|
ersi |
But I wasn't being a cranky asshole this time, I just asked; How is *that* not working on non-english |
12:49
🔗
|
ersi |
You asked, I replied. I've ignored you mostly |
12:49
🔗
|
ersi |
Maybe we should take this in a PM |
12:52
🔗
|
emijrp |
i have nothing to talk with you, /ignore ersi and end of story |
12:55
🔗
|
ersi |
Truth hurts. |
12:56
🔗
|
emijrp |
french open data http://www.data.gouv.fr/ |
12:57
🔗
|
emijrp |
(it is a new website) |
14:21
🔗
|
emijrp |
I'm making a list of PDFs linked from English Wikipedia. |
14:22
🔗
|
emijrp |
An experiment with Spanish Wikipedia (800,000 articles) shows 70,000 different PDFs linked. |
14:23
🔗
|
emijrp |
English version is probably 5-10x bigger. |
14:23
🔗
|
emijrp |
But about 50% of links will be 404 errors. |
14:23
🔗
|
emijrp |
Anyone interested in this idea? |
14:26
🔗
|
emijrp |
Around 500,000 random PDFs. Lol. |
15:20
🔗
|
rude___ |
SketchCow re: NEF format- it's the least destructive format to manipulate if people are going to use those files to stitch together entire spreads or comic strips. If IA won't take NEF, converting them to TIFF 16-bit is the way to go, and Bibble is probably the best way to handle that batch conversion. |
15:49
🔗
|
underscor |
Readability is pretty much the best thing ever |
15:58
🔗
|
Paradoks |
Re: Archive Team being English-centric. While true, it seems odd to hear that when we've been spending most our resources archiving an Italian website. |
16:00
🔗
|
Paradoks |
Personally, I occasionally try to find Spanish-language sites that I enjoy reading, but I'm just not immersed enough that I find out about things like I do with English things. So it also makes sense that I wouldn't hear about sites closing in Spain or latin America. |
16:01
🔗
|
Paradoks |
And it seems unlikely that that problem would entirely go away until we have lots of Archive-Team members who are immersed in lots of other languages. |
17:27
🔗
|
SketchCow |
Also, english language is superior |
17:40
🔗
|
SketchCow |
http://www.archive.org/details/computermagazinesfrench coming along. |
17:41
🔗
|
SketchCow |
rude___: Agreed. It's just annoying, until I found out the guy had put up multiple versions regardless. |
17:45
🔗
|
rude___ |
he did? I mean, I did? |
17:45
🔗
|
SketchCow |
You did, there's TIFFs ahoy |
17:46
🔗
|
SketchCow |
Pardon my complaining, we lash out to pass the time down here in the boiler room |
17:46
🔗
|
SketchCow |
Look at this amazing utility I wrote |
17:46
🔗
|
SketchCow |
Who is numero uno? I think we all know. |
17:46
🔗
|
SketchCow |
root@teamarchive-0:/3/MAGS/FRENCH/magazines/PC Assemblage# ../numero.sh |
17:46
🔗
|
SketchCow |
root@teamarchive-0:/3/MAGS/FRENCH/magazines/PC Assemblage# |
17:47
🔗
|
yipdw |
well, it isn't root, because root is numero cero |
17:47
🔗
|
SketchCow |
Fine, fine, it actually has use and converts filenames like pcassemblage_numero06.zip to pcassemblage_numero_06_images.zip |
17:47
🔗
|
yipdw |
on some systems numero uno is daemon |
17:47
🔗
|
SketchCow |
So my OTHER script can see that 06 and do the right thing, and _images.zip will make the archive.org machines turn it into all those previews. |
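The rename that `numero.sh` performs could look something like this in Python. Only the one before/after filename pair quoted above comes from the log; the regular expression and the function name `images_zip_name` are guesses at the general rule.

```python
import re
from typing import Optional

# Matches e.g. "pcassemblage_numero06.zip" (issue number directly after "numero").
_ISSUE = re.compile(r"^(?P<title>.+)_numero_?(?P<num>\d+)\.zip$")


def images_zip_name(name: str) -> Optional[str]:
    """Return the <title>_numero_<NN>_images.zip form, or None if name does not match."""
    match = _ISSUE.match(name)
    if match is None:
        return None
    return f"{match['title']}_numero_{match['num']}_images.zip"
```

Names that already end in `_images.zip` do not match the pattern, so running the sketch twice over the same directory would not rename anything a second time.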
17:48
🔗
|
yipdw |
that sounds like an import process I wrote for work -- it's a series of 27 Ruby scripts that all feed transformations into each other |
17:48
🔗
|
Schbirid |
yay, found another quake ad http://www.quaddicted.com/_media/quake/quake_is_good_for_you_2pages.jpg |
17:48
🔗
|
yipdw |
not exactly the fastest, but at least there's diagnostic output out the ass |
17:48
🔗
|
yipdw |
correctness over speed, etc. |
17:48
🔗
|
SketchCow |
Hurrah, http://www.archive.org/details/computermagazinesspanish is now populating. |
17:49
🔗
|
rude___ |
no problem, some of the items were scanned hence going straight to TIFF. The newspapers were photographed so alls you get is NEF and lower res jpg proofs. Exporting TIFFs for everything would've turned my 20 gig upload into a 160 gig upload |
17:51
🔗
|
yipdw |
on that note, I recently learned just how crazy good modern DSLRs are compared to readily-available flatbed scanners, assuming you have some knowledge of perspective, the right optics, and lighting |
17:52
🔗
|
yipdw |
a friend wanted to archive a massive painting she's donating to Child's Play |
17:52
🔗
|
yipdw |
we first tried a flatbed, which sucked |
17:52
🔗
|
yipdw |
next try was a 5D Mark II |
17:52
🔗
|
yipdw |
the sensor on that thing blows my mind every time I see things from it imported into Lightroom. |
17:53
🔗
|
rude___ |
digital backs are good for that kind of stuff |
17:53
🔗
|
yipdw |
we were using a pretty rudimentary lighting setup, too; just bounce flash |
17:53
🔗
|
SketchCow |
That's what archive.org uses. |
17:53
🔗
|
SketchCow |
For the mongo things, like rude's newspapers, they have an oversize from-above scanner. |
17:53
🔗
|
yipdw |
seems like a good choice |
17:54
🔗
|
SketchCow |
The last time I was in the scanning room, they were digitizing 1930s geological surveys. |
17:54
🔗
|
yipdw |
I don't suppose IA does tours, do they :P |
17:55
🔗
|
rude___ |
we attempted commissioning a scanner for the newspaper folios |
17:55
🔗
|
rude___ |
the thing is, the pages were literally disintegrating |
17:56
🔗
|
rude___ |
putting a plate on it didn't work so well |
17:58
🔗
|
rude___ |
diy book scanning has really taken off since then so who knows what would be possible today |
17:59
🔗
|
rude___ |
yipdw: what lens did you use for the painting? |
17:59
🔗
|
yipdw |
rude___: 24-70 f/2.8L at 24mm, f/4, 1/80s, ISO 50 |
17:59
🔗
|
yipdw |
i would have preferred to use a tilt-shift to get a more rectangular projection, but |
17:59
🔗
|
yipdw |
cost, etc. |
17:59
🔗
|
yipdw |
Adobe's lens corrections seem to do a good enough job |
18:00
🔗
|
yipdw |
er, 30mm |
18:00
🔗
|
yipdw |
http://ashleyriot.com/childsplayre.jpg |
18:01
🔗
|
yipdw |
that upload is a bit dark; I guess she took it into Photoshop |
18:08
🔗
|
rude___ |
awesome |
18:10
🔗
|
rude___ |
this is what the D1X yielded, http://bryanvaccaro.org/archive/Img4291.jpg |
18:11
🔗
|
rude___ |
beautiful details in the Berber carpet |
18:11
🔗
|
yipdw |
eh |
18:11
🔗
|
yipdw |
heh |
18:11
🔗
|
yipdw |
how much have you noticed diffraction artifacts affecting that sort of work |
18:11
🔗
|
yipdw |
? |
18:12
🔗
|
yipdw |
(EXIF tags on that image say f/16, which I usually never work at for photographic or archival purposes) |
18:14
🔗
|
yipdw |
not so much because I hate small apertures, just that I usually hover around f/2.8 - f/5.6 |
18:14
🔗
|
yipdw |
and I've heard, but not tested, that diffraction begins to impact sharpness around f/11 |
18:16
🔗
|
SketchCow |
So I guess in photographic history, at the beginning, they were trying to set the lenses and focus and stops to be painterly. |
18:16
🔗
|
SketchCow |
because everyone assumed they were like painting |
18:16
🔗
|
SketchCow |
and some group called itself some sort of lens setting |
18:17
🔗
|
SketchCow |
And they basically shot it up so high to such a level of detail to go "fuck you, lenses are superior" |
18:18
🔗
|
SketchCow |
What the.... motherfucker, this set of issues of this magazine swaps between THREE DIFFERENT FILE STRUCTURES |
18:18
🔗
|
rude___ |
yipdw: I don't think we put much thought into it at the time, but I recall that some of the lower f stops didn't look as sharp as f/16 for whatever reason |
18:19
🔗
|
yipdw |
hmm interesting |
18:19
🔗
|
yipdw |
I usually don't worry too much about it due to other factors generally being way more important to image quality :P |
18:19
🔗
|
yipdw |
(e.g. composition, lighting, whether or not your subject is a ponce) |
18:20
🔗
|
yipdw |
but for archiving it seems like a fun thing to test |
18:20
🔗
|
rude___ |
it had something to do with the size of the content, the lens, and lighting situation |
18:20
🔗
|
rude___ |
smaller items were shot at f/3.2, f/8 |
18:21
🔗
|
rude___ |
the simplest answer though is that I didn't know what I was doing |
18:21
🔗
|
yipdw |
:P |
18:21
🔗
|
SketchCow |
Whoop, here we go, structure #4 |
18:21
🔗
|
yipdw |
still works, I can make out the newspaper content |
19:22
🔗
|
SketchCow |
How'd we do with Gamepro? They close in just over an hour. |
20:41
🔗
|
SketchCow |
Jason and friends! |
20:41
🔗
|
SketchCow |
You've been duly warned! |
20:41
🔗
|
SketchCow |
http://cmdrtaco.net/2011/12/everything2-com-seeks-new-ownership/ |
20:47
🔗
|
SketchCow |
I'm jamming it up into archive.org's collection. |
20:51
🔗
|
soultcer |
Weren't they the ones who complained when someone from archiveteam but a torrent of their posts online, because they can make backups without our help? |
20:51
🔗
|
soultcer |
*put |
20:51
🔗
|
dan_ |
Heads up: Everything2.com is up for sale. http://cmdrtaco.net/2011/12/everything2-com-seeks-new-ownership/ |
20:52
🔗
|
soultcer |
dan_: SketchCow posted this seconds before you, but thanks for the warning anyway ;-) |
20:52
🔗
|
SketchCow |
http://www.archive.org/details/archiveteam-everything2 |
20:53
🔗
|
dan_ |
I shot off an e-mail, just thought i'd post in IRC just in case |
20:53
🔗
|
SketchCow |
Your commitment is charming. |
20:53
🔗
|
bsmith094 |
TO THE DOWNLOAD MANAGERS!!! Away! |
20:54
🔗
|
bsmith094 |
honest question, though, whatever happened to the simple websites where a simple, easy wget -m would grab everything in nice, neat folders? |
20:54
🔗
|
dan_ |
Funny, Rob Malda is also seeking employment. Pissing off archiveteam I don't think scores any points. :) |
20:54
🔗
|
SketchCow |
Malda's kind of an idiot. |
20:55
🔗
|
SketchCow |
You know that, right. |
20:55
🔗
|
bsmith094 |
rob malda, is he an actor? |
20:55
🔗
|
SketchCow |
He's the first Slashdot founder. |
20:55
🔗
|
soultcer |
CmdrTaco |
20:55
🔗
|
SketchCow |
I've met him. |
20:55
🔗
|
SketchCow |
He's a fucking zero. |
20:55
🔗
|
bsmith094 |
whoops, thinking of Alan Alda |
20:56
🔗
|
SketchCow |
Matt Haughey is worth 4,000 Rob Maldas. |
20:56
🔗
|
dan_ |
I live in his hometown (jason, many notacons ago we hung out in the lobby with tyger/froggy the night after the con ended) |
20:56
🔗
|
SketchCow |
Yes indeed we did |
20:57
🔗
|
bsmith094 |
the MetaFilter guy? |
20:57
🔗
|
SketchCow |
Yes |
20:58
🔗
|
bsmith094 |
(04:44:06 AM) SketchCow: There's my talk at the bottom (Jason) |
20:58
🔗
|
bsmith094 |
SketchCow: whats this talk (04:43:59 AM) SketchCow: http://video.constantvzw.org/VJ13/ |
20:59
🔗
|
SketchCow |
Yes |
20:59
🔗
|
SketchCow |
The one I gave in Belgium on Sunday |
20:59
🔗
|
bsmith094 |
ah, really nice audio for a telepresence |
20:59
🔗
|
SketchCow |
With bonus shout-outs, kicks, and the rest. |
21:02
🔗
|
bsmith094 |
hey, here's a site worth saving, localroger.com, wget -m that and throw it in IA, maybe 20 MB if you squint, author's page, has most of his work on it |
21:19
🔗
|
* |
underscor emails malda with an offer of $50 |
21:19
🔗
|
underscor |
hehe |
21:30
🔗
|
bsmith094 |
I'm trying to edit the archives page to add my own scrapes of some websites I've had lying around, can someone check my syntax? |
22:15
🔗
|
underscor |
SketchCow: Did we have anyone archiving GP? |
22:15
🔗
|
underscor |
Pulling it at 80Mbps like a boss |
22:31
🔗
|
bsmith094 |
underscor: gp? |
22:32
🔗
|
underscor |
gamepro |
22:32
🔗
|
SketchCow |
Gamepro was sort of being archived, but another shot is always welcome. |
22:32
🔗
|
bsmith094 |
is it dead yet, and is there a script for that? |
22:34
🔗
|
underscor |
no, and no |
22:35
🔗
|
bsmith094 |
any particular folder u need archived |
22:35
🔗
|
zetathust |
html archived |
22:36
🔗
|
arrith |
bsmith094: that change to the wiki isn't appearing in the table |
22:36
🔗
|
arrith |
bsmith094: try editing it and hitting 'preview' to get it to show |
22:36
🔗
|
bsmith094 |
yes i know can you fix that please? |
22:36
🔗
|
Paradoks |
bsmith: I re-arranged your entry on the archives page. It shows up, now, and the links work. I also made the assumption that "passage" had the standard two 's's, rather than three. |
22:36
🔗
|
SketchCow |
Turns out I know someone who is going to be raiding the closets of GamePro |
22:36
🔗
|
SketchCow |
Now going to talk about arranging for a set of people with a truck |
22:37
🔗
|
SketchCow |
My job, why does it never end |
22:37
🔗
|
Paradoks |
Yay! |
22:37
🔗
|
instence |
gamepro is gone |
22:37
🔗
|
instence |
just got switched over within the hour |
22:37
🔗
|
SketchCow |
GAMEPRO is gone |
22:37
🔗
|
SketchCow |
GAMEPRO the OFFICE is still there |
22:38
🔗
|
bsmith094 |
wait the Systems closets ??! holy crap you lucked out |
22:38
🔗
|
instence |
...i was just saying the site gamepro.com got switched over, chill |
22:39
🔗
|
SketchCow |
Ha ha |
22:39
🔗
|
SketchCow |
Come to #archiveteam and tell people to chill |
22:39
🔗
|
SketchCow |
Next go to #football and ask people to not be so opinionated |
22:39
🔗
|
SketchCow |
#politics could use a telling off to "use less ad hominem attacks" |
22:40
🔗
|
bsmith094 |
lol |
22:40
🔗
|
instence |
? |
22:40
🔗
|
underscor |
hahaha |
22:42
🔗
|
yipdw |
I dunno, watching Redis' MONITOR is a good way to max and relax |
22:43
🔗
|
underscor |
^ |
22:43
🔗
|
bsmith094 |
how do i run a python script from inside a shell script |
22:43
🔗
|
bsmith094 |
using vars read in from a file |
22:43
🔗
|
yipdw |
python [script name] and pass the variables as arguments |
22:43
🔗
|
arrith |
bsmith094: some of the scripts i gave you earlier did that |
22:43
🔗
|
yipdw |
or set them in the environment |
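
The two options yipdw mentions (arguments versus environment) can be sketched like this. To keep the sketch runnable without the real script, a child `sh -c` stands in for the Python process; the variable name `FORMAT` and the value `12345` are made up for illustration.

```shell
#!/bin/sh
# Stand-in for the Python script: prints its first argument and $FORMAT.
child='echo "arg=$1 env=$FORMAT"'

# 1. Positional argument (a Python script would read sys.argv[1]).
sh -c "$child" child 12345

# 2. Environment variable set for this one command only
#    (a Python script would read os.environ["FORMAT"]).
FORMAT=html sh -c "$child" child 12345
```

Arguments are usually the clearer choice for per-item values such as an ID; the environment suits settings shared by every call.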
22:44
🔗
|
bsmith094 |
yeah i know, and im trying to send linklist to downloader.py |
22:44
🔗
|
arrith |
bsmith094: any script where you set your downloader.py location |
22:44
🔗
|
bsmith094 |
they're in the same directory |
22:44
🔗
|
bsmith094 |
while read num; do echo exec python downloader.py -f html $num; done < linklist.txt |
22:45
🔗
|
bsmith094 |
what am i missing, because that just does the echo part? |
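
For reference, the loop above only prints because `echo` is the command being run; everything after it is just text to echo. Removing `echo` alone would not be enough either, since `exec` replaces the shell with the first `python` process and the loop would end after one ID. A minimal sketch of a working version, assuming one ID per line in the list file (the helper name `run_each` is hypothetical):

```shell
#!/bin/sh
# run_each LISTFILE CMD [ARGS...]
# Runs "CMD ARGS <line>" once for every non-empty line of LISTFILE.
run_each() {
    list=$1
    shift
    while IFS= read -r num || [ -n "$num" ]; do
        [ -n "$num" ] || continue            # skip blank lines
        # </dev/null keeps the command from eating the rest of the list
        "$@" "$num" </dev/null || echo "failed: $num" >&2
    done < "$list"
}

# Usage with the command quoted in the log (no echo, no exec):
#   run_each linklist.txt python downloader.py -f html
```

A failing ID is reported on stderr and the loop moves on to the next line instead of stopping.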
22:45
🔗
|
zetathust |
echo effect is ten so tiny polecat |
22:46
🔗
|
bsmith094 |
zetathust: ummm, what? |
22:46
🔗
|
arrith |
i'd agree with that |
23:04
🔗
|
SketchCow |
Well, OK, then. |
23:05
🔗
|
SketchCow |
it appears that a set of friends of mine are poised to literally take everything out of the gamepro offices not nailed down |
23:05
🔗
|
SketchCow |
Anyone in the SF area available at 11am thursday? E-mail me, jason@textfiles.com, I'll put you in touch |
23:08
🔗
|
SketchCow |
First person sputtering at the mirror of everything2 |
23:09
🔗
|
PatC |
Yay! I got a "new" (old) computer for a storage box :) |
23:14
🔗
|
pberry |
hola |