#archiveteam 2012-02-01,Wed

Time Nickname Message
00:22 🔗 underscor SketchCow: You should email cogent and carpathia and just ask them if we can have a copy
00:31 🔗 don underscor: my god, man
00:31 🔗 don how big is it?
00:31 🔗 underscor I think it was estimated at 20PB
00:31 🔗 underscor But I could be mistaken
00:33 🔗 dashcloud SketchCow: you were proved right about Yahoo & Flickr- http://nolancaudill.com/2012/01/30/the-front-line/
01:30 🔗 Zwangzug Hey, was wondering if anyone had advice/recommendations for archiving some forum topics?
01:31 🔗 don do you own the forum?
01:31 🔗 Zwangzug no, and I don't think I can get the ears of the people who do.
01:32 🔗 Zwangzug It's a phpbb3 forum, and there are several dozen topics (many with dozens of pages) I'd like to back up if possible--I've seen other fora where there's an archive mode so there are a lot fewer pages, but there's no obvious way to replicate that here.
01:32 🔗 don Then I am not sure of the best way to go BUT if you stick around I'm sure some of the more intelligent people here will be able to help.
01:32 🔗 Zwangzug fair enough, thanks
01:33 🔗 don and you never know when the admins will prune threads for whatever the fuck reason
01:33 🔗 don I own a small regional-interest forum that I took over from a former regime who deleted old threads willy-nilly
01:33 🔗 don infuriating
01:33 🔗 don I vow to never do that.
01:33 🔗 Zwangzug Fortunately, that hasn't been an issue, but it's better to cover all my bases.
01:33 🔗 don yes, it is.
01:33 🔗 don always.
01:34 🔗 Zwangzug It's a very large forum, with only one (still very large) subforum I'm mainly interested in. So something to grab entire websites might be too large-scale.
01:37 🔗 don This might make a good topic for me to write up in the wiki
01:37 🔗 don you're definitely not the only one with interest in archiving forums
01:37 🔗 yipdw Zwangzug: honestly, I think your best bet will be something like wget
01:38 🔗 yipdw especially if the forum application has no "archive mode"
01:38 🔗 yipdw it is a lot of requests, but that's what happens -- and you can instruct wget to better simulate a browser via its --random-wait option
01:38 🔗 yipdw and changing its user-agent, etc.
01:39 🔗 Zwangzug Would there be a good way to restrict it to just one subforum or a given set of threads?
01:39 🔗 yipdw if you're dealing with a particularly crawler-hostile proprietor, though, it's pretty easy to detect wget
01:39 🔗 yipdw yes, just pass the URLs of the subforum or threads in
01:39 🔗 yipdw use recursive fetch with --no-parent and --page-requisites
01:39 🔗 yipdw that should (I think) do what you want
01:39 🔗 yipdw though I obviously haven't tried :P
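The wget options yipdw lists above can be put together as a small Python helper that builds the command line. This is only a sketch: the flags are standard wget options, but the function name, the example URL, the wait time, and the user-agent string are placeholders invented for illustration, and (as yipdw says) the combination was not tried against the forum in question.

```python
import shlex


def forum_wget_command(start_urls, user_agent, wait_seconds=2):
    """Build a wget argument list for a polite, restricted recursive fetch."""
    args = [
        "wget",
        "--recursive",             # follow links from the start pages
        "--no-parent",             # never climb above the starting directory
        "--page-requisites",       # also fetch images/CSS needed to render pages
        f"--wait={wait_seconds}",  # base delay between requests
        "--random-wait",           # vary that delay so requests look less mechanical
        f"--user-agent={user_agent}",
    ]
    return args + list(start_urls)


if __name__ == "__main__":
    # Hypothetical URL and user-agent; substitute the real ones.
    cmd = forum_wget_command(
        ["http://forum.example.org/viewforum.php?f=12"],
        "Mozilla/5.0 (placeholder)",
    )
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell or Command Prompt once wget is on the path.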
01:40 🔗 Zwangzug I'll give it a go, might need some technical support though. Fingers crossed!
01:40 🔗 yipdw yeah sure
01:41 🔗 yipdw if you're comfortable with compiling software, try this: https://github.com/downloads/ArchiveTeam/mobileme-grab/wget-1.13.4-2581.tar.bz2
01:41 🔗 yipdw it's a build of wget that contains a few useful features and fixes for large crawls, namely WARC output and fixes for memory leaks
01:42 🔗 Zwangzug if you're comfortable with compiling software <- no such luck :p
01:45 🔗 Zwangzug Er, sorry, this is going to have to be a very tedious walkthrough
01:45 🔗 Zwangzug at the level of "I double-clicked on the program and it opened and then disappeared"
01:45 🔗 yipdw what OS?
01:45 🔗 Zwangzug Windows.
01:45 🔗 yipdw oh
01:46 🔗 yipdw that makes things more difficult
01:46 🔗 yipdw wget's a command-line program, so you'll need to run it from Command Prompt
01:46 🔗 yipdw if you can, I highly recommend getting an Ubuntu installation (or something)
01:46 🔗 Coderjoe hmm
01:46 🔗 Coderjoe http://hardware.slashdot.org/comments.pl?sid=2646891&cid=38880617
01:46 🔗 yipdw a lot of the tools that we recommend here are very geared towards UNIX and its relatives
01:47 🔗 zill1 There's wget ports for windows, a quick google search should give you something that you can use even if it's based on an older version
01:47 🔗 Coderjoe I wonder what the log file for the linked file at textfiles.com looks like
01:47 🔗 yipdw zill1: there are, but (1) they're still CLI and (2) they're probably not going to be as robust
01:48 🔗 yipdw (3), they don't do WARCs, which IMO is a big deficiency for archival purposes
01:48 🔗 Coderjoe (4) still have the annoying memory leaks
01:48 🔗 Zwangzug this is, nominally, the "for windows" version
01:49 🔗 yipdw well, it should still have the options I was talking about; invoke wget --help at a command prompt to see them
01:50 🔗 zill1 Wget generally isn't a built in command for windows
01:50 🔗 yipdw zill1: under the assumption that Zwangzug has a copy of wget, of course
01:50 🔗 yipdw Zwangzug: you should see something like this -> https://gist.github.com/37a42d17696ba172d47f
01:51 🔗 yipdw sans WARC options, maybe other groups
01:51 🔗 * Zwangzug just tried to download and install it
01:51 🔗 yipdw brb, grocery shopping and stuff
01:55 🔗 Zwangzug okay, in cmd.exe mode now--how to open wget from inside there?
01:57 🔗 zill1 If you have a windows port of Wget you're going to want to put the .exe in your Windows directory
01:58 🔗 zill1 Then you should be able to call it from the command line
01:59 🔗 zill1 Starting with wget --help should get you started on what it can do in general
01:59 🔗 Zwangzug zill1 If you have a windows port of Wget you're going to want to put the .exe in your Windows directory <- and all the dlls also?
01:59 🔗 Coderjoe grr
01:59 🔗 Zwangzug ok, this is looking promising
01:59 🔗 Coderjoe add it to the path, not the windows dir
02:01 🔗 Zwangzug I got wget --help to function so it's working well enough
02:02 🔗 Zwangzug should I be able to paste URLs directly into the program?
02:02 🔗 zill1 Yeah a call of wget URL should pull down a given page for most things
02:04 🔗 Zwangzug huh, ok. got one page. let's see what else I can do...
02:05 🔗 DFJustin I don't think --no-parent will be good enough for forums, "subforums" are generally served by the same cgi script in the same directory so you will get the entire forum
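DFJustin's point is that every page comes from the same scripts in the same directory, so a directory-based rule like --no-parent cannot tell one subforum from another. A rough Python sketch of filtering on the query string instead is below. It assumes the usual phpBB3 URL layout, where viewforum.php and viewtopic.php carry the subforum id in an `f` parameter; the function name and example URLs are made up, and topic links that omit `f` would be missed.

```python
from urllib.parse import urlparse, parse_qs


def in_subforum(url, forum_id):
    """Return True if a phpBB-style URL points into the given subforum."""
    parsed = urlparse(url)
    script = parsed.path.rsplit("/", 1)[-1]
    if script not in ("viewforum.php", "viewtopic.php"):
        return False
    # parse_qs maps each parameter to a list of values, e.g. {"f": ["12"]}
    forum_values = parse_qs(parsed.query).get("f", [])
    return str(forum_id) in forum_values


if __name__ == "__main__":
    # Hypothetical URLs for illustration only.
    for candidate in (
        "http://forum.example.org/viewtopic.php?f=12&t=345&start=20",
        "http://forum.example.org/viewforum.php?f=7",
    ):
        print(candidate, in_subforum(candidate, 12))
```

A filter like this could be applied to a list of discovered links before handing the survivors to wget as explicit URLs.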
02:07 🔗 DFJustin a more user-friendly utility on windows is http://www.httrack.com/ but it has the disadvantage of not supporting warc (afaik)
02:12 🔗 Coderjoe heritrix?
02:21 🔗 Zwangzug that seems rather slow. maybe right click-save as is the best after all, heh
03:54 🔗 underscor Ning is removing networks on feb 10 that don't upgrade to a paid plan
03:54 🔗 underscor SketchCow asked me to notify the channel
03:54 🔗 underscor and see if we want to move on it or what
03:56 🔗 underscor Also, abit.com.tw is closing, and has a full robots.txt disallow
03:56 🔗 yipdw sigh
03:56 🔗 underscor Thinking of doing a full wget mirror on it
03:56 🔗 yipdw http://www.ninjawedding.org/whatbullshit.png
03:57 🔗 underscor That's fucking gross
03:57 🔗 underscor :(
03:57 🔗 underscor Does the latest wget fix the recursive memory leak issue?
03:57 🔗 yipdw yes
03:57 🔗 yipdw >= r2581 in particular
03:58 🔗 underscor And warc writing is builtin now, right?
03:58 🔗 underscor (so I can just build HEAD)
03:58 🔗 yipdw yes
03:58 🔗 underscor schweet
04:01 🔗 underscor Is there a list of good wget parameters for a full mirror anywhere?
04:01 🔗 underscor (or what do you guys use?)
04:02 🔗 dashcloud underscor: I've got most of abit
04:02 🔗 yipdw depends on the job, but for a full mirror starting at / I usually go for recursive retrieval, infinite depth, span hosts, and allowing only related domains
04:02 🔗 yipdw otherwise you will end up spidering the whole Web
04:03 🔗 yipdw the last bit does require understanding site structure and watching what wget is doing
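The full-mirror recipe yipdw describes (recursive, infinite depth, span hosts, only related domains) might look roughly like this as a Python helper that builds the wget arguments. The flags are real wget options, with --warc-file available in WARC-capable builds such as the one linked earlier; the function name and the domains in the example are placeholders, and picking the right domain list still depends on knowing the site.

```python
def mirror_wget_command(start_url, allowed_domains, warc_name=None):
    """Build a wget argument list for a whole-site mirror limited to known domains."""
    args = [
        "wget",
        "--recursive",                            # recursive retrieval
        "--level=inf",                            # no depth limit
        "--page-requisites",                      # grab images, CSS, scripts too
        "--span-hosts",                           # allow leaving the start host...
        "--domains=" + ",".join(allowed_domains), # ...but only to these domains
    ]
    if warc_name:
        # Only meaningful with a wget build that has WARC support.
        args.append(f"--warc-file={warc_name}")
    args.append(start_url)
    return args


if __name__ == "__main__":
    # Hypothetical site and related domain, for illustration only.
    print(" ".join(mirror_wget_command(
        "http://www.example.org/",
        ["example.org", "static-example.net"],
        warc_name="example.org-mirror",
    )))
```

Without the --domains restriction, --span-hosts is what leads to "spidering the whole Web".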
04:03 🔗 underscor dashcloud: Oh really? Awesome!
04:03 🔗 underscor yipdw: haha, yeah. That's never fun.
04:03 🔗 yipdw underscor: yeah, especially nowadays when everyone includes shit from other domains
04:04 🔗 underscor yep
04:04 🔗 yipdw "USE YOUR OWN COPY. IT IS EXTREMELY UNWISE TO LOAD CODE FROM SERVERS YOU DO NOT CONTROL." -- Douglas Crockford
04:04 🔗 underscor ^
04:04 🔗 yipdw see where that got us
04:06 🔗 dashcloud underscor: got somewhere I can push the stuff to? you can do a second check on what I got then
04:07 🔗 underscor I can make an rsync module, that work?
04:08 🔗 dashcloud sure
04:08 🔗 Coderjoe http://i.imgur.com/dCjr6.jpg
04:09 🔗 yipdw Splinder's motto
04:09 🔗 underscor haha
04:09 🔗 Coderjoe i had a mirror of the abit ftp site back when they were supposed to be going down before
04:09 🔗 Coderjoe still have it somewhere
04:09 🔗 yipdw wtf
04:09 🔗 yipdw Proust is STILL alive
04:10 🔗 yipdw I'm reminded of that Onion headline: "MARCEL PROUST FINALLY DIES"
04:11 🔗 yipdw I wonder if they just forgot to shut it down
04:11 🔗 underscor lol
04:11 🔗 Coderjoe I don't know if everyone had seen this already: http://i.imgur.com/rR592.png
04:11 🔗 PatC_ lol
04:12 🔗 dashcloud is there no situation XKCD doesn't have a strip for?
04:12 🔗 Coderjoe that was hidden in the black censored area of the SOPA xkcd comic
04:13 🔗 Coderjoe what a deal! http://i.imgur.com/RFtnt.png
04:13 🔗 yipdw wait, hidden how
04:13 🔗 yipdw was it RGB (1,1,1) or something
04:14 🔗 Coderjoe I forget which was which, but the black was #000000 and the drawing was #010101 (or vice versa)
04:14 🔗 yipdw heh
04:14 🔗 Coderjoe I just did a "select color" on it and removed the bar
04:15 🔗 Coderjoe after catching a bit of it when looking at my monitor off-axis
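What Coderjoe did by hand with "select color" can be sketched in Python: treat one exact color as the bar and push every other pixel to white so a #010101 drawing on #000000 (or the reverse) becomes visible. To stay dependency-free this works on plain nested lists of RGB tuples; loading a real image into that shape is left out, and the function name is made up.

```python
def reveal_near_black(pixels, background=(0, 0, 0)):
    """Map pixels equal to `background` to black and all others to white."""
    white = (255, 255, 255)
    black = (0, 0, 0)
    return [
        [black if px == background else white for px in row]
        for row in pixels
    ]


if __name__ == "__main__":
    # Tiny 2x3 stand-in for the censor bar: one hidden stroke of #010101.
    bar = [
        [(0, 0, 0), (1, 1, 1), (0, 0, 0)],
        [(0, 0, 0), (1, 1, 1), (0, 0, 0)],
    ]
    for row in reveal_near_black(bar):
        print(row)
```

Since Coderjoe does not remember which shade was which, passing `background=(1, 1, 1)` covers the other case.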
04:15 🔗 yipdw wait
04:15 🔗 yipdw so you're saying that if I had a *worse* monitor
04:15 🔗 yipdw I would have seen it
04:15 🔗 underscor Yep
04:15 🔗 yipdw FUCK YOU, S-IPS
04:15 🔗 underscor Need a TN display
04:15 🔗 underscor hahahahhaha
04:17 🔗 yipdw oh
04:17 🔗 chronomex passive matrix
04:17 🔗 yipdw I can kinda see it
04:17 🔗 yipdw if I zoom the image to 8x
04:17 🔗 yipdw THANK YOU, S-IPS
04:17 🔗 yipdw or H-IPS or A-TW-IPS or whatever the hell this monitor uses
04:18 🔗 chronomex FAP-FAP-IPS
04:19 🔗 yipdw cum and experience the next generation of display technology
04:19 🔗 underscor lololol
04:22 🔗 chronomex not 10 meters from me is a 120hz LCD with shutter glasses...
04:29 🔗 yipdw ooh
04:29 🔗 yipdw use it
04:29 🔗 yipdw TO SEE IN 3D
04:29 🔗 Coderjoe TO SEE FOREVER
04:37 🔗 chronomex I played Portal in 3D the other day ...
04:37 🔗 yipdw it was just stereoscopy, the 3D is a lie
04:38 🔗 yipdw I used to play Skyrim in 3D. Then I took an arrow to the eye.
04:38 🔗 Coderjoe *groan*
04:38 🔗 Coderjoe tired of that meme
04:38 🔗 chronomex 1) what
04:38 🔗 chronomex 2)
04:38 🔗 chronomex woop woop woop off-topic siren
04:39 🔗 yipdw I've actually never played Skyrim
04:39 🔗 yipdw but ok
04:39 🔗 yipdw ON TOPIC, I guess I should rework the ffnet grabber so it's not a bunch of crazy Ruby
04:42 🔗 chronomex perhaps
04:42 🔗 chronomex I kind of like crazy ruby
04:42 🔗 yipdw yeah, but I'm getting tired of fielding questions about it
04:43 🔗 yipdw plus it doesn't scale
04:43 🔗 yipdw (really)
04:43 🔗 chronomex hm.
04:50 🔗 yipdw oh my
04:50 🔗 yipdw http://www.youtube.com/watch?v=pHAcJl4d4Lg
05:01 🔗 Coderjoe UGH
05:04 🔗 Coderjoe ....
05:04 🔗 Coderjoe http://www.youtube.com/watch?v=LJRBmJJHWx0
05:04 🔗 yipdw Coderjoe: I found something better
05:10 🔗 chronomex ahahaha
05:10 🔗 chronomex I did the radiocomm for that convention
05:11 🔗 Coderjoe man... I haven't seen Tiffany Grant in awhile
05:13 🔗 Coderjoe huh. behind the scenes: http://www.youtube.com/watch?v=6IQpJkiDR8g
05:15 🔗 yipdw wow
05:15 🔗 yipdw that commercial was better than what Vic wanted, haha
05:15 🔗 chronomex vic?
05:16 🔗 yipdw Vic Mignogna, the guy directing in that behind the scenes video
05:16 🔗 chronomex hrm.
05:17 🔗 chronomex you are involved with that crew?
05:22 🔗 yipdw not the crew that produced it, but I do have extensive experience with the animes
05:25 🔗 chronomex "that crew" == sakuracon
05:25 🔗 yipdw oh, no
08:17 🔗 Zebranky_ SketchCow: I'd like your thoughts on http://www.kickstarter.com/projects/599092525/the-order-of-the-stick-reprint-drive as a Kickstarter expert, so to speak
08:27 🔗 SketchCow in bed
08:27 🔗 SketchCow e-mail this. not for this channel.
08:28 🔗 chronomex wow, in bed before 4am?!?
08:32 🔗 ersi unpossible
08:33 🔗 chronomex not particularly relevant to this channel either, but interesting: some experiments with scanning slides using a DSLR and a light table - http://www.flickr.com/photos/afiler/sets/72157629017235485/
08:33 🔗 chronomex next step is to modify a carousel slide projector to accommodate a lower-intensity light source and a camera mount, to scan a whole carousel in one go
09:30 🔗 yipdw SketchCow: http://allthingsd.com/20120131/proust-will-live-on-separate-from-iac/
15:44 🔗 Nemo_bis lol, in the TV news: today's anti-Putin activists have *not* been arrested
17:57 🔗 don so, tabblo then?
18:06 🔗 Nemo_bis sigh
18:06 🔗 Nemo_bis 2922260983 100% 82.93kB/s 9:33:32 (xfer#846, to-check=1004/2360)
18:06 🔗 Nemo_bis d/de/der/derDoc/web.me.com/web.me.com-derDoc.warc.gz
18:42 🔗 tef yipdw: ping me about qtwebkit hacking :-) I know how to intercept stuff without breaking.
18:42 🔗 tef yipdw: I was going to add a http proxy to warctools that replays content from warcs
18:45 🔗 tef i'd recommend it over qtwebkit
18:45 🔗 tef hackery because you can't intercept flash/plugin content
18:50 🔗 tef oh and sometimes trying to change the request body crashes qtwebkit because a thread is doing something with it elsewhere :/
18:50 🔗 tef the url is about the only thing you can mangle & headers, although doing it on ajax requests often breaks things too
20:11 🔗 yipdw tef: oh, cool, that's good to know -- for some reason I thought that QtWebkit's network manager handled all requests, which in the context of plugins doesn't make sense
20:12 🔗 yipdw and if you're going to add an HTTP proxy for WARCs, then the WARC viewer problem really reduces to one of packaging tools :P
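The lookup half of the replay idea tef mentions can be sketched in Python: read the records of a WARC, index the response records by target URI, and hand back the stored HTTP response for a requested URL. This is not warctools code and not a proxy; all function names are invented, the input is assumed to be already-decompressed WARC bytes, and details such as angle-bracketed URIs or several captures of the same URL are ignored.

```python
def iter_warc_records(data):
    """Yield (headers, block) for each record in uncompressed WARC bytes."""
    pos = 0
    while pos < len(data):
        header_end = data.find(b"\r\n\r\n", pos)
        if header_end == -1:
            break
        header_lines = data[pos:header_end].decode("utf-8").split("\r\n")
        headers = {}
        # header_lines[0] is the version line, e.g. "WARC/1.0"
        for line in header_lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers["content-length"])
        block_start = header_end + 4
        yield headers, data[block_start:block_start + length]
        # Each record's block is followed by two CRLFs.
        pos = block_start + length + 4


def build_replay_index(data):
    """Map each captured URL to the raw HTTP response stored for it."""
    index = {}
    for headers, block in iter_warc_records(data):
        if headers.get("warc-type") == "response":
            index[headers["warc-target-uri"]] = block
    return index


def replay(index, url):
    """Return the archived response for `url`, or a bare 404 if none exists."""
    return index.get(url, b"HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n")
```

A real proxy would wrap `replay` in an HTTP server so that a browser pointed at it sees the archived pages.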
