Time |
Nickname |
Message |
00:22
🔗
|
underscor |
SketchCow: You should email cogent and carpathia and just ask them if we can have a copy |
00:31
🔗
|
don |
underscor: my god, man |
00:31
🔗
|
don |
how big is it? |
00:31
🔗
|
underscor |
I think it was estimated at 20PB |
00:31
🔗
|
underscor |
But I could be mistaken |
00:33
🔗
|
dashcloud |
SketchCow: you were proved right about Yahoo & Flickr- http://nolancaudill.com/2012/01/30/the-front-line/ |
01:30
🔗
|
Zwangzug |
Hey, was wondering if anyone had advice/recommendations for archiving some forum topics? |
01:31
🔗
|
don |
do you own the forum? |
01:31
🔗
|
Zwangzug |
no, and I don't think I can get the ears of the people who do. |
01:32
🔗
|
Zwangzug |
It's a phpbb3 forum, and there are several dozen topics (many with dozens of pages) I'd like to back up if possible--I've seen other fora where there's an archive mode so there are a lot fewer pages, but there's no obvious way to replicate that here. |
01:32
🔗
|
don |
Then I am not sure of the best way to go BUT if you stick around I'm sure some of the more intelligent people here will be able to help. |
01:32
🔗
|
Zwangzug |
fair enough, thanks |
01:33
🔗
|
don |
and you never know when the admins will prune threads for whatever the fuck reason |
01:33
🔗
|
don |
I own a small regional-interest forum that I took over from a former regime who deleted old threads willy-nilly |
01:33
🔗
|
don |
infuriating |
01:33
🔗
|
don |
I vow to never do that. |
01:33
🔗
|
Zwangzug |
Fortunately, that hasn't been an issue, but it's better to cover all my bases. |
01:33
🔗
|
don |
yes, it is. |
01:33
🔗
|
don |
always. |
01:34
🔗
|
Zwangzug |
It's a very large forum, with only one (still very large) subforum I'm mainly interested in. So something to grab entire websites might be too large-scale. |
01:37
🔗
|
don |
This might make a good topic for me to write up in the wiki |
01:37
🔗
|
don |
you're definitely not the only one with interest in archiving forums |
01:37
🔗
|
yipdw |
Zwangzug: honestly, I think your best bet will be something like wget |
01:38
🔗
|
yipdw |
especially if the forum application has no "archive mode" |
01:38
🔗
|
yipdw |
it is a lot of requests, but that's what happens -- and you can instruct wget to better simulate a browser via its --random-wait option |
01:38
🔗
|
yipdw |
and changing its user-agent, etc. |
01:39
🔗
|
Zwangzug |
Would there be a good way to restrict it to just one subforum or a given set of threads? |
01:39
🔗
|
yipdw |
if you're dealing with a particularly crawler-hostile proprietor, though, it's pretty easy to detect wget |
01:39
🔗
|
yipdw |
yes, just pass the URLs of the subforum or threads in |
01:39
🔗
|
yipdw |
use recursive fetch with --no-parent and --page-requisites |
01:39
🔗
|
yipdw |
that should (I think) do what you want |
01:39
🔗
|
yipdw |
though I obviously haven't tried :P |
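A minimal sketch of the invocation described above, under stated assumptions: the thread URL on forum.example.com and the browser-like user-agent string are hypothetical placeholders, and `--wait=2` is an added assumption, since `--random-wait` only varies an existing `--wait` delay. The function prints the command for review rather than running it. Keep in mind that `--no-parent` restricts by URL path only.

```shell
#!/bin/sh
# Hypothetical thread URL (not from the discussion); substitute the real topic address.
THREAD_URL='http://forum.example.com/viewtopic.php?t=12345'
# Arbitrary browser-like user-agent string, chosen only for illustration.
UA='Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0'

# Print a wget command line that uses the options named in the discussion:
# recursive fetch, --no-parent, --page-requisites, --random-wait, custom user-agent.
# $1 = URL of the subforum or thread to start from
forum_wget_cmd() {
    printf '%s\n' "wget --recursive --no-parent --page-requisites --wait=2 --random-wait --user-agent=\"$UA\" \"$1\""
}

# Show the command; copy it to the prompt to actually run the crawl.
forum_wget_cmd "$THREAD_URL"
```

Several thread URLs can be handled by calling the function once per URL.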
01:40
🔗
|
Zwangzug |
I'll give it a go, might need some technical support though. Fingers crossed! |
01:40
🔗
|
yipdw |
yeah sure |
01:41
🔗
|
yipdw |
if you're comfortable with compiling software, try this: https://github.com/downloads/ArchiveTeam/mobileme-grab/wget-1.13.4-2581.tar.bz2 |
01:41
🔗
|
yipdw |
it's a build of wget that contains a few useful features and fixes for large crawls, namely WARC output and fixes for memory leaks |
01:42
🔗
|
Zwangzug |
if you're comfortable with compiling software <- no such luck :p |
01:45
🔗
|
Zwangzug |
Er, sorry, this is going to have to be a very tedious walkthrough |
01:45
🔗
|
Zwangzug |
at the level of "I double-clicked on the program and it opened and then disappeared" |
01:45
🔗
|
yipdw |
what OS? |
01:45
🔗
|
Zwangzug |
Windows. |
01:45
🔗
|
yipdw |
oh |
01:46
🔗
|
yipdw |
that makes things more difficult |
01:46
🔗
|
yipdw |
wget's a command-line program, so you'll need to run it from Command Prompt |
01:46
🔗
|
yipdw |
if you can, I highly recommend getting an Ubuntu installation (or something) |
01:46
🔗
|
Coderjoe |
hmm |
01:46
🔗
|
Coderjoe |
http://hardware.slashdot.org/comments.pl?sid=2646891&cid=38880617 |
01:46
🔗
|
yipdw |
a lot of the tools that we recommend here are very geared towards UNIX and its relatives |
01:47
🔗
|
zill1 |
There are wget ports for Windows; a quick Google search should give you something that you can use even if it's based on an older version |
01:47
🔗
|
Coderjoe |
I wonder what the log file for the linked file at textfiles.com looks like |
01:47
🔗
|
yipdw |
zill1: there are, but (1) they're still CLI and (2) they're probably not going to be as robust |
01:48
🔗
|
yipdw |
(3), they don't do WARCs, which IMO is a big deficiency for archival purposes |
01:48
🔗
|
Coderjoe |
(4) still have the annoying memory leaks |
01:48
🔗
|
Zwangzug |
this is, nominally, the "for windows" version |
01:49
🔗
|
yipdw |
well, it should still have the options I was talking about; invoke wget --help at a command prompt to see them |
01:50
🔗
|
zill1 |
Wget generally isn't a built in command for windows |
01:50
🔗
|
yipdw |
zill1: under the assumption that Zwangzug has a copy of wget, of course |
01:50
🔗
|
yipdw |
Zwangzug: you should see something like this -> https://gist.github.com/37a42d17696ba172d47f |
01:51
🔗
|
yipdw |
sans WARC options, maybe other groups |
01:51
🔗
|
* |
Zwangzug just tried to download and install it |
01:51
🔗
|
yipdw |
brb, grocery shopping and stuff |
01:55
🔗
|
Zwangzug |
okay, in cmd.exe mode now--how to open wget from inside there? |
01:57
🔗
|
zill1 |
If you have a windows port of Wget you're going to want to put the .exe in your Windows directory |
01:58
🔗
|
zill1 |
Then you should be able to call it from the command line |
01:59
🔗
|
zill1 |
Starting with wget --help should get you started on what it can do in general |
01:59
🔗
|
Zwangzug |
zill1 If you have a windows port of Wget you're going to want to put the .exe in your Windows directory <- and all the dlls also? |
01:59
🔗
|
Coderjoe |
grr |
01:59
🔗
|
Zwangzug |
ok, this is looking promising |
01:59
🔗
|
Coderjoe |
add it to the path, not the windows dir |
02:01
🔗
|
Zwangzug |
I got wget --help to function so it's working well enough |
02:02
🔗
|
Zwangzug |
should I be able to paste URLs directly into the program? |
02:02
🔗
|
zill1 |
Yeah a call of wget URL should pull down a given page for most things |
02:04
🔗
|
Zwangzug |
huh, ok. got one page. let's see what else I can do... |
02:05
🔗
|
DFJustin |
I don't think --no-parent will be good enough for forums, "subforums" are generally served by the same cgi script in the same directory so you will get the entire forum |
02:07
🔗
|
DFJustin |
a more user-friendly utility on windows is http://www.httrack.com/ but it has the disadvantage of not supporting warc (afaik) |
02:12
🔗
|
Coderjoe |
Heritrix? |
02:21
🔗
|
Zwangzug |
that seems rather slow. maybe right click-save as is the best after all, heh |
03:54
🔗
|
underscor |
Ning is removing networks on feb 10 that don't upgrade to a paid plan |
03:54
🔗
|
underscor |
SketchCow asked me to notify the channel |
03:54
🔗
|
underscor |
and see if we want to move on it or what |
03:56
🔗
|
underscor |
Also, abit.com.tw is closing, and has a full robots.txt disallow |
03:56
🔗
|
yipdw |
sigh |
03:56
🔗
|
underscor |
Thinking of doing a full wget mirror on it |
03:56
🔗
|
yipdw |
http://www.ninjawedding.org/whatbullshit.png |
03:57
🔗
|
underscor |
That's fucking gross |
03:57
🔗
|
underscor |
:( |
03:57
🔗
|
underscor |
Does the latest wget fix the recursive memory leak issue? |
03:57
🔗
|
yipdw |
yes |
03:57
🔗
|
yipdw |
>= r2581 in particular |
03:58
🔗
|
underscor |
And warc writing is builtin now, right? |
03:58
🔗
|
underscor |
(so I can just build HEAD) |
03:58
🔗
|
yipdw |
yes |
03:58
🔗
|
underscor |
schweet |
04:01
🔗
|
underscor |
Is there a list of good wget parameters for a full mirror anywhere? |
04:01
🔗
|
underscor |
(or what do you guys use?) |
04:02
🔗
|
dashcloud |
underscor: I've got most of abit |
04:02
🔗
|
yipdw |
depends on the job, but for a full mirror starting at / I usually go for recursive retrieval, infinite depth, span hosts, and allowing only related domains |
04:02
🔗
|
yipdw |
otherwise you will end up spidering the whole Web |
04:03
🔗
|
yipdw |
the last bit does require understanding site structure and watching what wget is doing |
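A rough sketch of the full-mirror options listed above (recursive retrieval, infinite depth, span hosts, only related domains), assuming a WARC-capable wget build. The start URL, the domain list, and the WARC base name are hypothetical placeholders; the right domain list depends on the site and has to be worked out by watching the crawl. The function only prints the command.

```shell
#!/bin/sh
# Print a wget command line for a full mirror starting at /.
# $1 = start URL
# $2 = comma-separated list of domains the crawl may follow
# $3 = base name for the WARC output file
mirror_wget_cmd() {
    printf '%s\n' "wget --recursive --level=inf --span-hosts --domains=$2 --warc-file=$3 \"$1\""
}

# Hypothetical values for illustration only.
mirror_wget_cmd 'http://www.example.com/' 'example.com,example-cdn.net' 'example-mirror'
```

Without `--domains`, `--span-hosts` lets the crawl follow links to any host, which is how a mirror turns into spidering the whole Web.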
04:03
🔗
|
underscor |
dashcloud: Oh really? Awesome! |
04:03
🔗
|
underscor |
yipdw: haha, yeah. That's never fun. |
04:03
🔗
|
yipdw |
underscor: yeah, especially nowadays when everyone includes shit from other domains |
04:04
🔗
|
underscor |
yep |
04:04
🔗
|
yipdw |
"USE YOUR OWN COPY. IT IS EXTREMELY UNWISE TO LOAD CODE FROM SERVERS YOU DO NOT CONTROL." -- Douglas Crockford |
04:04
🔗
|
underscor |
^ |
04:04
🔗
|
yipdw |
see where that got us |
04:06
🔗
|
dashcloud |
underscor: got somewhere I can push the stuff to? you can do a second check on what I got then |
04:07
🔗
|
underscor |
I can make an rsync module, that work? |
04:08
🔗
|
dashcloud |
sure |
04:08
🔗
|
Coderjoe |
http://i.imgur.com/dCjr6.jpg |
04:09
🔗
|
yipdw |
Splinder's motto |
04:09
🔗
|
underscor |
haha |
04:09
🔗
|
Coderjoe |
i had a mirror of the abit ftp site back when they were supposed to be going down before |
04:09
🔗
|
Coderjoe |
still have it somewhere |
04:09
🔗
|
yipdw |
wtf |
04:09
🔗
|
yipdw |
Proust is STILL alive |
04:10
🔗
|
yipdw |
I'm reminded of that Onion headline: "MARCEL PROUST FINALLY DIES" |
04:11
🔗
|
yipdw |
I wonder if they just forgot to shut it down |
04:11
🔗
|
underscor |
lol |
04:11
🔗
|
Coderjoe |
I don't know if everyone had seen this already: http://i.imgur.com/rR592.png |
04:11
🔗
|
PatC_ |
lol |
04:12
🔗
|
dashcloud |
is there no situation XKCD doesn't have a strip for? |
04:12
🔗
|
Coderjoe |
that was hidden in the black censored area of the SOPA xkcd comic |
04:13
🔗
|
Coderjoe |
what a deal! http://i.imgur.com/RFtnt.png |
04:13
🔗
|
yipdw |
wait, hidden how |
04:13
🔗
|
yipdw |
was it RGB (1,1,1) or something |
04:14
🔗
|
Coderjoe |
I forget which was which, but the black was #000000 and the drawing was #010101 (or vice versa) |
04:14
🔗
|
yipdw |
heh |
04:14
🔗
|
Coderjoe |
I just did a "select color" on it and removed the bar |
04:15
🔗
|
Coderjoe |
after catching a bit of it when looking at my monitor off-axis |
04:15
🔗
|
yipdw |
wait |
04:15
🔗
|
yipdw |
so you're saying that if I had a *worse* monitor |
04:15
🔗
|
yipdw |
I would have seen it |
04:15
🔗
|
underscor |
Yep |
04:15
🔗
|
yipdw |
FUCK YOU, S-IPS |
04:15
🔗
|
underscor |
Need a TN display |
04:15
🔗
|
underscor |
hahahahhaha |
04:17
🔗
|
yipdw |
oh |
04:17
🔗
|
chronomex |
passive matrix |
04:17
🔗
|
yipdw |
I can kinda see it |
04:17
🔗
|
yipdw |
if I zoom the image to 8x |
04:17
🔗
|
yipdw |
THANK YOU, S-IPS |
04:17
🔗
|
yipdw |
or H-IPS or A-TW-IPS or whatever the hell this monitor uses |
04:18
🔗
|
chronomex |
FAP-FAP-IPS |
04:19
🔗
|
yipdw |
cum and experience the next generation of display technology |
04:19
🔗
|
underscor |
lololol |
04:22
🔗
|
chronomex |
not 10 meters from me is a 120hz LCD with shutter glasses... |
04:29
🔗
|
yipdw |
ooh |
04:29
🔗
|
yipdw |
use it |
04:29
🔗
|
yipdw |
TO SEE IN 3D |
04:29
🔗
|
Coderjoe |
TO SEE FOREVER |
04:37
🔗
|
chronomex |
I played Portal in 3D the other day ... |
04:37
🔗
|
yipdw |
it was just stereoscopy, the 3D is a lie |
04:38
🔗
|
yipdw |
I used to play Skyrim in 3D. Then I took an arrow to the eye. |
04:38
🔗
|
Coderjoe |
*groan* |
04:38
🔗
|
Coderjoe |
tired of that meme |
04:38
🔗
|
chronomex |
1) what |
04:38
🔗
|
chronomex |
2) |
04:38
🔗
|
chronomex |
woop woop woop off-topic siren |
04:39
🔗
|
yipdw |
I've actually never played Skyrim |
04:39
🔗
|
yipdw |
but ok |
04:39
🔗
|
yipdw |
ON TOPIC, I guess I should rework the ffnet grabber so it's not a bunch of crazy Ruby |
04:42
🔗
|
chronomex |
perhaps |
04:42
🔗
|
chronomex |
I kind of like crazy ruby |
04:42
🔗
|
yipdw |
yeah, but I'm getting tired of fielding questions about it |
04:43
🔗
|
yipdw |
plus it doesn't scale |
04:43
🔗
|
yipdw |
(really) |
04:43
🔗
|
chronomex |
hm. |
04:50
🔗
|
yipdw |
oh my |
04:50
🔗
|
yipdw |
http://www.youtube.com/watch?v=pHAcJl4d4Lg |
05:01
🔗
|
Coderjoe |
UGH |
05:04
🔗
|
Coderjoe |
.... |
05:04
🔗
|
Coderjoe |
http://www.youtube.com/watch?v=LJRBmJJHWx0 |
05:04
🔗
|
yipdw |
Coderjoe: I found something better |
05:10
🔗
|
chronomex |
ahahaha |
05:10
🔗
|
chronomex |
I did the radiocomm for that convention |
05:11
🔗
|
Coderjoe |
man... I haven't seen Tiffany Grant in awhile |
05:13
🔗
|
Coderjoe |
huh. behind the scenes: http://www.youtube.com/watch?v=6IQpJkiDR8g |
05:15
🔗
|
yipdw |
wow |
05:15
🔗
|
yipdw |
that commercial was better than what Vic wanted, haha |
05:15
🔗
|
chronomex |
vic? |
05:16
🔗
|
yipdw |
Vic Mignogna, the guy directing in that behind the scenes video |
05:16
🔗
|
chronomex |
hrm. |
05:17
🔗
|
chronomex |
you are involved with that crew? |
05:22
🔗
|
yipdw |
not the crew that produced it, but I do have extensive experience with the animes |
05:25
🔗
|
chronomex |
"that crew" == sakuracon |
05:25
🔗
|
yipdw |
oh, no |
08:17
🔗
|
Zebranky_ |
SketchCow: I'd like your thoughts on http://www.kickstarter.com/projects/599092525/the-order-of-the-stick-reprint-drive as a Kickstarter expert, so to speak |
08:27
🔗
|
SketchCow |
in bed |
08:27
🔗
|
SketchCow |
e-mail this. not for this channel. |
08:28
🔗
|
chronomex |
wow, in bed before 4am?!? |
08:32
🔗
|
ersi |
unpossible |
08:33
🔗
|
chronomex |
not particularly relevant to this channel either, but interesting: some experiments with scanning slides using a DSLR and a light table - http://www.flickr.com/photos/afiler/sets/72157629017235485/ |
08:33
🔗
|
chronomex |
next step is to modify a carousel slide projector to accommodate a lower-intensity light source and a camera mount, to scan a whole carousel in one go |
09:30
🔗
|
yipdw |
SketchCow: http://allthingsd.com/20120131/proust-will-live-on-separate-from-iac/ |
15:44
🔗
|
Nemo_bis |
lol, in the TV news: today's anti-Putin activists have *not* been arrested |
17:57
🔗
|
don |
so, tabblo then? |
18:06
🔗
|
Nemo_bis |
sigh |
18:06
🔗
|
Nemo_bis |
2922260983 100% 82.93kB/s 9:33:32 (xfer#846, to-check=1004/2360) |
18:06
🔗
|
Nemo_bis |
d/de/der/derDoc/web.me.com/web.me.com-derDoc.warc.gz |
18:42
🔗
|
tef |
yipdw: ping me about qtwebkit hacking :-) I know how to intercept stuff without breaking. |
18:42
🔗
|
tef |
yipdw: I was going to add a http proxy to warctools that replays content from warcs |
18:45
🔗
|
tef |
i'd recommend it over qtwebkit |
18:45
🔗
|
tef |
hackery because you can't intercept flash/plugin content |
18:50
🔗
|
tef |
oh and sometimes trying to change the request body crashes qtwebkit because a thread is doing something with it elsewhere :/ |
18:50
🔗
|
tef |
the url is about the only thing you can mangle & headers, although doing it on ajax requests often breaks things too |
20:11
🔗
|
yipdw |
tef: oh, cool, that's good to know -- for some reason I thought that QtWebkit's network manager handled all requests, which in the context of plugins doesn't make sense |
20:12
🔗
|
yipdw |
and if you're going to add an HTTP proxy for WARCs, then the WARC viewer problem really reduces to one of packaging tools :P |