[00:22] SketchCow: You should email cogent and carpathia and just ask them if we can have a copy
[00:31] underscor: my god, man
[00:31] how big is it?
[00:31] I think it was estimated at 20PB
[00:31] But I could be mistaken
[00:33] SketchCow: you were proved right about Yahoo & Flickr- http://nolancaudill.com/2012/01/30/the-front-line/
[01:30] Hey, was wondering if anyone had advice/recommendations for archiving some forum topics?
[01:31] do you own the forum?
[01:31] no, and I don't think I can get the ears of the people who do.
[01:32] It's a phpbb3 forum, and there are several dozen topics (many with dozens of pages) I'd like to back up if possible--I've seen other fora where there's an archive mode so there are a lot fewer pages, but there's no obvious way to replicate that here.
[01:32] Then I am not sure of the best way to go BUT if you stick around I'm sure some of the more intelligent people here will be able to help.
[01:32] fair enough, thanks
[01:33] and you never know when the admins will prune threads for whatever the fuck reason
[01:33] I own a small regional-interest forum that I took over from a former regime who deleted old threads willy-nilly
[01:33] infuriating
[01:33] I vow to never do that.
[01:33] Fortunately, that hasn't been an issue, but it's better to cover all my bases.
[01:33] yes, it is.
[01:33] always.
[01:34] It's a very large forum, with only one (still very large) subforum I'm mainly interested in. So something to grab entire websites might be too large-scale.
[01:37] This might make a good topic for me to write up in the wiki
[01:37] you're definitely not the only one with interest in archiving forums
[01:37] Zwangzug: honestly, I think your best bet will be something like wget
[01:38] especially if the forum application has no "archive mode"
[01:38] it is a lot of requests, but that's what happens -- and you can instruct wget to better simulate a browser via its --random-wait option
[01:38] and changing its user-agent, etc.
[01:39] Would there be a good way to restrict it to just one subforum or a given set of threads?
[01:39] if you're dealing with a particularly crawler-hostile proprietor, though, it's pretty easy to detect wget
[01:39] yes, just pass the URLs of the subforum or threads in
[01:39] use recursive fetch with --no-parent and --page-requisites
[01:39] that should (I think) do what you want
[01:39] though I obviously haven't tried :P
[01:40] I'll give it a go, might need some technical support though. Fingers crossed!
[01:40] yeah sure
[01:41] if you're comfortable with compiling software, try this: https://github.com/downloads/ArchiveTeam/mobileme-grab/wget-1.13.4-2581.tar.bz2
[01:41] it's a build of wget that contains a few useful features and fixes for large crawls, namely WARC output and fixes for memory leaks
[01:42] if you're comfortable with compiling software <- no such luck :p
[01:45] Er, sorry, this is going to have to be a very tedious walkthrough
[01:45] at the level of "I double-clicked on the program and it opened and then disappeared"
[01:45] what OS?
[01:45] Windows.
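The wget options suggested above for grabbing one subforum or a set of threads (recursive fetch, --no-parent, --page-requisites, --random-wait, a changed user-agent) can be sketched as a small Python helper that only assembles the command line. This is a rough sketch, not something anyone in the channel ran: the function name, the forum URL, the user-agent string, and the 2-second --wait value are all assumptions made for illustration. Whether --no-parent really confines the crawl depends on how the forum lays out its URLs.

```python
import shlex


def build_forum_wget_args(urls, user_agent="Mozilla/5.0 (hypothetical browser string)", wait_seconds=2):
    """Return an argv list for a scoped, rate-limited wget crawl of the given URLs.

    Nothing is executed here; the list can be inspected or handed to subprocess.
    """
    if not urls:
        raise ValueError("pass at least one subforum or thread URL")
    args = [
        "wget",
        "--recursive",                 # follow links starting from the given URLs
        "--no-parent",                 # do not ascend above the starting directory
        "--page-requisites",           # also fetch the images/CSS the pages need
        f"--wait={wait_seconds}",      # base delay between requests (assumed value)
        "--random-wait",               # vary the delay around the --wait value
        f"--user-agent={user_agent}",  # replace the default Wget user-agent
    ]
    return args + list(urls)


if __name__ == "__main__":
    # Hypothetical phpBB subforum URL, for illustration only.
    example = build_forum_wget_args(["http://forum.example.com/viewforum.php?f=12"])
    print(shlex.join(example))
```

Building the argument list separately makes it easy to check what will be requested before starting a long crawl. The printed command can be pasted into a shell as-is; on Windows Command Prompt the quoting may need adjusting.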
[01:45] oh
[01:46] that makes things more difficult
[01:46] wget's a command-line program, so you'll need to run it from Command Prompt
[01:46] if you can, I highly recommend getting an Ubuntu installation (or something)
[01:46] hmm
[01:46] http://hardware.slashdot.org/comments.pl?sid=2646891&cid=38880617
[01:46] a lot of the tools that we recommend here are very geared towards UNIX and its relatives
[01:47] There's wget ports for windows, a quick google search should give you something that you can use even if it's based on an older version
[01:47] I wonder what the log file for the linked file at textfiles.com looks like
[01:47] zill1: there are, but (1) they're still CLI and (2) they're probably not going to be as robust
[01:48] (3), they don't do WARCs, which IMO is a big deficiency for archival purposes
[01:48] (4) still have the annoying memory leaks
[01:48] this is, nominally, the "for windows" version
[01:49] well, it should still have the options I was talking about; invoke wget --help at a command prompt to see them
[01:50] Wget generally isn't a built in command for windows
[01:50] zill1: under the assumption that Zwangzug has a copy of wget, of course
[01:50] Zwangzug: you should see something like this -> https://gist.github.com/37a42d17696ba172d47f
[01:51] sans WARC options, maybe other groups
[01:51] * Zwangzug just tried to download and install it
[01:51] brb, grocery shopping and stuff
[01:55] okay, in cmd.exe mode now--how to open wget from inside there?
[01:57] If you have a windows port of Wget you're going to want to put the .exe in your Windows directory
[01:58] Then you should be able to call it from the command line
[01:59] Starting with wget --help should get you started on what it can do in general
[01:59] zill1 If you have a windows port of Wget you're going to want to put the .exe in your Windows directory <- and all the dlls also?
[01:59] grr
[01:59] ok, this is looking promising
[01:59] add it to the path, not the windows dir
[02:01] I got wget --help to function so it's working well enough
[02:02] should I be able to paste URLs directly into the program?
[02:02] Yeah a call of wget URL should pull down a given page for most things
[02:04] huh, ok. got one page. let's see what else I can do...
[02:05] I don't think --no-parent will be good enough for forums, "subforums" are generally served by the same cgi script in the same directory so you will get the entire forum
[02:07] a more user-friendly utility on windows is http://www.httrack.com/ but it has the disadvantage of not supporting warc (afaik)
[02:12] heritrix?
[02:21] that seems rather slow. maybe right click-save as is the best after all, heh
[03:54] Ning is removing networks on feb 10 that don't upgrade to a paid plan
[03:54] SketchCow asked me to notify the channel
[03:54] and see if we want to move on it or what
[03:56] Also, abit.com.tw is closing, and has a full robots.txt disallow
[03:56] sigh
[03:56] Thinking of doing a full wget mirror on it
[03:56] http://www.ninjawedding.org/whatbullshit.png
[03:57] That's fucking gross
[03:57] :(
[03:57] Does the latest wget fix the recursive memory leak issue?
[03:57] yes
[03:57] >= r2581 in particular
[03:58] And warc writing is builtin now, right?
[03:58] (so I can just build HEAD)
[03:58] yes
[03:58] schweet
[04:01] Is there a list of good wget parameters for a full mirror anywhere?
[04:01] (or what do you guys use?)
[04:02] underscor: I've got most of abit
[04:02] depends on the job, but for a full mirror starting at / I usually go for recursive retrieval, infinite depth, span hosts, and allowing only related domains
[04:02] otherwise you will end up spidering the whole Web
[04:03] the last bit does require understanding site structure and watching what wget is doing
[04:03] dashcloud: Oh really? Awesome!
[04:03] yipdw: haha, yeah. That's never fun.
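The full-mirror recipe described above (recursive retrieval, infinite depth, span hosts, only related domains allowed) might look roughly like this when assembled in Python. It is a sketch under assumptions: the function name, the example domains, and the WARC file name are hypothetical, and --warc-file is only available in wget builds with WARC support such as the one mentioned earlier. The domain list still has to come from looking at the actual site structure.

```python
def build_mirror_wget_args(start_url, allowed_domains, warc_name=None):
    """Return an argv list for a full-site wget mirror limited to related domains.

    Nothing is executed here; the caller decides whether to run the command.
    """
    if not allowed_domains:
        # Spanning hosts without a domain allow-list would wander across the Web.
        raise ValueError("list the related domains the crawl may visit")
    args = [
        "wget",
        "--recursive",                             # recursive retrieval
        "--level=inf",                             # infinite recursion depth
        "--span-hosts",                            # allow following links to other hosts...
        "--domains=" + ",".join(allowed_domains),  # ...but only within these domains
    ]
    if warc_name:
        # Needs a wget build with WARC output; writes <warc_name>.warc.gz by default.
        args.append(f"--warc-file={warc_name}")
    args.append(start_url)
    return args


if __name__ == "__main__":
    # Hypothetical site and asset domain, for illustration only.
    print(build_mirror_wget_args(
        "http://www.example.com/",
        ["example.com", "cdn.example.net"],
        warc_name="example-mirror",
    ))
```

Refusing to build the command without a domain list is a deliberate choice here: it guards against the "spidering the whole Web" failure mode that --span-hosts alone would invite.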
[04:03] underscor: yeah, especially nowadays when everyone includes shit from other domains
[04:04] yep
[04:04] "USE YOUR OWN COPY. IT IS EXTREMELY UNWISE TO LOAD CODE FROM SERVERS YOU DO NOT CONTROL." -- Douglas Crockford
[04:04] ^
[04:04] see where that got us
[04:06] underscor: got somewhere I can push the stuff to? you can do a second check on what I got then
[04:07] I can make an rsync module, that work?
[04:08] sure
[04:08] http://i.imgur.com/dCjr6.jpg
[04:09] Splinder's motto
[04:09] haha
[04:09] i had a mirror of the abit ftp site back when they were supposed to be going down before
[04:09] still have it somewhere
[04:09] wtf
[04:09] Proust is STILL alive
[04:10] I'm reminded of that Onion headline: "MARCEL PROUST FINALLY DIES"
[04:11] I wonder if they just forgot to shut it down
[04:11] lol
[04:11] I don't know if everyone had seen this already: http://i.imgur.com/rR592.png
[04:11] lol
[04:12] is there no situation XKCD doesn't have a strip for?
[04:12] that was hidden in the black censored area of the SOPA xkcd comic
[04:13] what a deal! http://i.imgur.com/RFtnt.png
[04:13] wait, hidden how
[04:13] was it RGB (1,1,1) or something
[04:14] I forget which was which, but the black was #000000 and the drawing was #010101 (or vice versa)
[04:14] heh
[04:14] I just did a "select color" on it and removed the bar
[04:15] after catching a bit of it when looking at my monitor off-axis
[04:15] wait
[04:15] so you're saying that if I had a *worse* monitor
[04:15] I would have seen it
[04:15] Yep
[04:15] FUCK YOU, S-IPS
[04:15] Need a TN display
[04:15] hahahahhaha
[04:17] oh
[04:17] passive matrix
[04:17] I can kinda see it
[04:17] if I zoom the image to 8x
[04:17] THANK YOU, S-IPS
[04:17] or H-IPS or A-TW-IPS or whatever the hell this monitor uses
[04:18] FAP-FAP-IPS
[04:19] cum and experience the next generation of display technology
[04:19] lololol
[04:22] not 10 meters from me is a 120hz LCD with shutter glasses...
[04:29] ooh
[04:29] use it
[04:29] TO SEE IN 3D
[04:29] TO SEE FOREVER
[04:37] I played Portal in 3D the other day ...
[04:37] it was just stereoscopy, the 3D is a lie
[04:38] I used to play Skyrim in 3D. Then I took an arrow to the eye.
[04:38] *groan*
[04:38] tired of that meme
[04:38] 1) what
[04:38] 2)
[04:38] woop woop woop off-topic siren
[04:39] I've actually never played Skyrim
[04:39] but ok
[04:39] ON TOPIC, I guess I should rework the ffnet grabber so it's not a bunch of crazy Ruby
[04:42] perhaps
[04:42] I kind of like crazy ruby
[04:42] yeah, but I'm getting tired of fielding questions about it
[04:43] plus it doesn't scale
[04:43] (really)
[04:43] hm.
[04:50] oh my
[04:50] http://www.youtube.com/watch?v=pHAcJl4d4Lg
[05:01] UGH
[05:04] ....
[05:04] http://www.youtube.com/watch?v=LJRBmJJHWx0
[05:04] Coderjoe: I found something better
[05:10] ahahaha
[05:10] I did the radiocomm for that convention
[05:11] man... I haven't seen Tiffany Grant in awhile
[05:13] huh. behind the scenes: http://www.youtube.com/watch?v=6IQpJkiDR8g
[05:15] wow
[05:15] that commercial was better than what Vic wanted, haha
[05:15] vic?
[05:16] Vic Mignogna, the guy directing in that behind the scenes video
[05:16] hrm.
[05:17] you are involved with that crew?
[05:22] not the crew that produced it, but I do have extensive experience with the animes
[05:25] "that crew" == sakuracon
[05:25] oh, no
[08:17] SketchCow: I'd like your thoughts on http://www.kickstarter.com/projects/599092525/the-order-of-the-stick-reprint-drive as a Kickstarter expert, so to speak
[08:27] in bed
[08:27] e-mail this. not for this channel.
[08:28] wow, in bed before 4am?!?
[08:32] unpossible
[08:33] not particularly relevant to this channel either, but interesting: some experiments with scanning slides using a DSLR and a light table - http://www.flickr.com/photos/afiler/sets/72157629017235485/
[08:33] next step is to modify a carousel slide projector to accommodate a lower-intensity light source and a camera mount, to scan a whole carousel in one go
[09:30] SketchCow: http://allthingsd.com/20120131/proust-will-live-on-separate-from-iac/
[15:44] lol, in the TV news: today's anti-Putin activists have *not* been arrested
[17:57] so, tabblo then?
[18:06] sigh
[18:06] 2922260983 100% 82.93kB/s 9:33:32 (xfer#846, to-check=1004/2360)
[18:06] d/de/der/derDoc/web.me.com/web.me.com-derDoc.warc.gz
[18:42] yipdw: ping me about qtwebkit hacking :-) I know how to intercept stuff without breaking.
[18:42] yipdw: I was going to add a http proxy to warctools that replays content from warcs
[18:45] i'd recommend it over qtwebkit
[18:45] hackery because you can't intercept flash/plugin content
[18:50] oh and sometimes trying to change the request body crashes qtwebkit because a thread is doing something with it elsewhere :/
[18:50] the url is about the only thing you can mangle & headers, although doing it on ajax requests often breaks things too
[20:11] tef: oh, cool, that's good to know -- for some reason I thought that QtWebkit's network manager handled all requests, which in the context of plugins doesn't make sense
[20:12] and if you're going to add an HTTP proxy for WARCs, then the WARC viewer problem really reduces to one of packaging tools :P