[00:02] so [00:02] I'm gonna start archiving all news stories re: Wikipedia blackout [00:06] http://www.cbsnews.com/8301-205_162-57360291/google-plans-to-use-home-page-to-protest-sopa/ [01:16] some guy from luxembourg is uploading old glenn beck episodes at full speed [01:16] :-D [01:16] i want to hug the guy [01:17] LOL [01:17] why, you like glenn beck, or you like the archival value of them? [01:17] xD [01:19] just what to archive thme [01:19] *them [01:19] Glenn Beck is my favorite comedian [01:19] there is like 6 months of it [01:20] if you're going to allow spam videos to get uploaded, it's kind of hard to discriminate against any particular variety [01:21] this luxembourg guy is now seeding 4 episodes [01:22] this is just crasy [01:22] it use to be like dialup speeds before [01:23] also look up george soros [02:41] yipdw: fwiw, warc2warc doesn't truncate the records now at least. going to put the wget workaround in now as an option [03:05] yipdw: python warc2warc.py -D -Z --wget-chunk-fix foo.warc.gz > bar.warc.gz [03:05] should decompress (if needed), decode each http message (removing content-encoding/transfer-encoding), deal with the corrupt wget output, and recompress it record by record [03:13] hoooray! [03:14] tef: are you the tef that's on somethingawful? [03:14] yes [03:14] oh! [03:14] *wave* [03:14] I was going to ask if you were the Ymgve on sa [03:14] I like your av :3 if it's the cos(...) one iirc [03:15] I'm basically the only Ymgve on the internet, actually picked the name mainly for uniqueness [03:15] what about my new mandelbrot avatar? [03:16] surprised it didn't exceed the size limits [03:16] ah yes that owns [03:16] I have a thing for fractals [03:17] I wrote a js1k which uses a hilbert curve to draw a mandelbrot [03:17] http://secretvolcanobase.org/~tef/js1k.html [03:17] this makes me so happy [03:18] hah, didn't know you could do that, cool! 
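A rough sketch of the Hilbert-curve ordering mentioned at 03:17, for readers who have not seen it. This is not tef's js1k code (the log does not reproduce it); it is the usual distance-to-coordinate conversion written in Python, plus a coarse-to-fine pass over a square canvas. The names `hilbert_d2xy` and `coarse_to_fine` are made up for illustration, and the canvas size is assumed to be a power of two.

```python
def hilbert_d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid.

    n is assumed to be a power of two and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before rotating it.
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def coarse_to_fine(size):
    """Yield (x, y, block) squares in Hilbert order, halving block each pass.

    The first pass covers the canvas with 2 x 2 large blocks; the last pass
    has block == 1, i.e. one entry per pixel.
    """
    n = 2
    while n <= size:
        block = size // n
        for d in range(n * n):
            gx, gy = hilbert_d2xy(n, d)
            yield gx * block, gy * block, block
        n *= 2
```

A renderer would compute the fractal value at each yielded corner and fill a `block`-sized square with it, so the picture sharpens pass by pass.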
[03:28] well I use it for enqueing progressive rendering of the fractal at higher resolutions [03:29] so I draw larger and larger hilbert curves and render the points as blocks until it's ~ 1px small [04:30] tef: awesome, thanks [04:30] ^ÂÊÎÔÛâêîôûĈĉĜĝĤĥĴĵŜŝŴŵŶŷˆ̭̂᷍ḒḓḘḙḼḽṊṋṰṱṶṷẐẑẤấẦầẨẩẪẫẬậẾếỀềỂểỄễỆệỐốỒồỔổỖỗỘộ⨣⨶⩯ꞈ^󠁞 [04:32] hmm [04:32] about 1200 SOPA/Wikipedia stories coming in [04:42] yipdw: are you going to be warcing any sites like wikipedia? Already has a pretty historic banner up [04:43] I read some google person encouraging sites to make their blackout be a 403, or some such code, so it's *not* archived. [04:43] closure: already did [04:43] well [04:43] I got their announcement at least [04:44] I guess I could grab en.wikipedia.org/wiki/Main_Page or something [04:44] they've depolyed their face of jimbo banner technology for good, at last :P [04:44] heh [04:45] ok, I got Main Page [04:45] https://plus.google.com/115984868678744352358/posts/Gas8vjZ5fmB [04:45] I guess I can grab the blackout too [04:46] so he suggested a 503 code [04:49] heh [04:49] seems reasonable [04:49] I just went looking for other status codes...on Wikipedia [04:49] CAN'T DO THAT IN TEN MINUTES [04:49] aside from not being archived [04:49] http://en.wikipedia.org/wiki/List_of_HTTP_status_codes [04:50] remember, in 10 minutes, es.wikipedia.org will still be there :) [04:50] http://es.wikipedia.org/wiki/Anexo:C%C3%B3digos_de_estado_HTTP [04:50] sweet [04:55] hee [04:55] just memorize the http status cats [04:57] http://www.flickr.com/photos/girliemac/6509400997/ [04:59] http://tricities.craigslist.org/ is already black [04:59] hmm, they leave a link to the real site [05:00] woot, wikipedia is black [05:00] really nice design too [05:01] Oh, it's just an overlay trick [05:01] heh, their Learn More link is buggy.. 
it links to a wiki page that's blacked out :) [05:01] wikipedia just went dark check it out [05:02] closure: hahaha [05:02] yeah, overlay is a good way to do it, although you do see the real page flash [05:03] I think they need to tweak their overlay JS (or whatever) to not show on that page [05:03] yep [05:03] or they could link to the eff's page [05:03] lol the learn more link [05:03] classy [05:03] http://www.fsf.org/ black [05:04] hmmm [05:04] I can make warcs with the crawler from work. [05:04] fixed [05:04] and has a title that's binary for some reason, heh [05:04] 2012-01-18 01:04:52 ERROR 503: Service Temporarily Unavailable. [05:04] that's wget.. [05:05] wonder what's the switch to archive despite failing status code [05:05] heh, wikipedia fixed their link :) [05:08] woot [05:08] I blocked the blackout JS. I can now use the english wikipedia just fine [05:09] (since I already know about sopa and have written my critters about it, I don't think I need to see the blackout overlay) [05:09] already lead story on CNN [05:10] well I have a warc of wikipedia.org's front page now [05:10] should have all the css/js too [05:10] as a plus, I probably can modify this to block the jimboface banners, too [05:11] tef: hey me too! 
:) [05:11] (I added this to my adblock filter list, with "at start" anchoring: http://meta.wikimedia.org/w/index.php*title=Special:BannerLoader*banner=blackout ) [05:11] I do wish that they really just turned off the wiki software, or set up a redirect [05:12] but maybe they just couldn't do that across all their app servers [05:12] you can also just browse it in w3m :P [05:13] I am hardcore and browse it via gunzip -c *.warc.gz [05:13] oooh, nice on google [05:13] haha [05:13] enormous censorship bar of dooom [05:13] best google logo evar [05:13] time to grab that too [05:14] yipdw: I caught http://pastebin.com/NWiV4yEV [05:15] I wonder if they are blocking the english site on the secure site [05:16] tef: https://gist.github.com/7d787f4487a80b354d8f [05:16] come to think of it [05:16] I'm not sure I got what I actually went for [05:16] one moment [05:17] oh [05:17] I forgot to turn off robots.txt obey stuff [05:18] weird [05:18] you didn't get http://upload.wikimedia.org/wikipedia/commons/9/98/WP_SOPA_Splash_Full.jpg [05:18] yeah [05:18] hooray javascript? [05:18] http://thepiratebay.org/ heh, I didn't see that blackout coming [05:18] I don't think wget's seeing it [05:19] but whatever, you got a copy [05:19] grabbed [05:19] guess I'll grab www.google.com and the landing page [05:19] might have better luck there [05:21] i'm sorta glad the thing I made for work is capturing more :3 [05:22] yay archiving [05:23] tef: haha [05:24] http://thedailywtf.com/Articles/Support-The-Daily-WTF-in-Supporting-the-Support-SOPA-Movement.aspx [05:24] wins IMHO by dragging BBSes into this [05:26] tef: can your software include embedded videos? [05:26] sorta [05:26] there's a few on https://www.google.com/landing/takeaction/sopa-pipa/ that I'm trying to get wget to somehow get [05:26] but without knowledge of YouTube's system I don't think it can really do it [05:27] http://www.metafilter.com/ black [05:27] maybe we should make a list [05:28] pls! 
[05:28] yipdw: piratepad or the wiki? [05:28] http://boingboing.net/ black [05:29] recommend a pad over wiki while it's still being updated [05:29] http://archiveteam.org/index.php?title=SOPA_blackout_pages [05:29] works here [05:30] if you like edit conflicts and captchas [05:30] FINE [05:31] wow, check view source on boing boing [05:31] http://piratepad.net/I42hnyc0zk [05:33] blah, I just got disconnected [05:33] constantly [05:33] i think there are other sites that run the etherpad sw [05:33] just i don't know of any of them off the top of my head [05:34] http://beta.etherpad.org/p/KDVzcBCKTj [05:35] 4chan is sensored too [05:35] go congress [05:36] no it's not [05:36] go to any section [05:36] ./g/ is [05:36] all the text is black until you mouse over it [05:36] oh, i see.. I actually have to eeek [05:36] censoring the text on 4chan is like.. like.. words fail me [05:39] http://wordpress.com/ [05:40] can we drop the * thing [05:40] it is hard to copy paste the urls out [05:40] and dump them into a script :v [05:40] lol [05:40] we can [05:41] awesome [05:42] oh sadface [05:42] wget doesn't retrieve page requisites if retrieval of a URL returns 503 [05:42] I wonder if I can force it to follow [05:42] http://sopastrike.com/on-strike/ [05:42] don't really want to write custom tools to do this if tef already has them :P [05:44] the sopastrike page is crashing chromium [05:44] ok.. that's a list [05:44] it's not all though [05:44] yeah but my tools are well, flakey [05:44] indeed not [05:45] haha [05:45] " [05:45] "Chromium's connection attempt to www.twitter.com was rejected. The website may be down, or your network may not be properly configured. 
[05:45] FUCK BLACKOUT, WE'RE GOING OFFLINE [05:46] (probably just my link though) [05:47] oh wait there it goes [05:47] yeah twitter's ceo guy was being a dbag about the sopa stuff [05:47] calling it idiotic or something [05:48] nah, he was just referring to Twitter specifically [05:48] twitter isn't going down for anything but mismanagement [05:48] well, twitter not going down for a day? wikipedia is effectively [05:48] I meant his tweet referred to Twitter going offline as foolish, not Wikipedia or anyone else [05:49] ah yeah [05:50] [05:50] Why Wikipedia went down at midnight - CNN.com [05:51] yipdw: Imgur Tor Project Miro iSchool at Syracuse University Oreilly.com Wikipedia Reddit Mozilla WordPress.org icanhazCheezburger Network MoveOn.org Good Old Games TwitPic Minecraft Free Press Mojang XDA Developers Destructoid Good.is [05:52] lol [05:52] we'll report on it.. but it didn't happen [05:52] at least twitpic is i gues [05:56] arrith: ? [05:56] yipdw: those sites were said to be doing a blackout thing [05:56] oh [05:56] add them to the list [06:00] ahahah I had a stupid bug [06:02] twitpic only doing logo by looks of it [06:02] aww sonuvabitch [06:02] http://savannah.gnu.org/bugs/index.php?20417 [06:02] i'm only doing logo on my site [06:02] that bug deals *directly* with wget's 503 behavior [06:02] and I can't access it [06:02] ... [06:02] fuck. [06:04] omg omg i hope you guys archived wikipedia because they took it down *ducks* [06:04] nice [06:04] ps: ?banner=0 [06:04] will take off the banner [06:07] hey guys! [06:07] you... are... awesome! [06:09] NovaKing: please provide a full url that works for that bug page? [06:10] yeah, I can't get to it [06:10] eh? 
sorry, that was in relation to wikipedia [06:11] as for 503 [06:11] oh [06:11] that is due to google crawlers [06:11] to not index the blackout page [06:11] you might have to get wget source, and remove the 503 section [06:12] there isn't one [06:12] it's probably the generic failure code handling code [06:13] there isn't one? [06:13] ...? [06:13] I cannot find a failure handler that deals with HTTP 503, no [06:14] there may be a generic one [06:14] I'm trying to find it [06:15] in fact, here's something weird in wget trunk: [06:15] $ grep -nR HTTP_STATUS_UNAVAILABLE * [06:15] src/http.c:131:#define HTTP_STATUS_UNAVAILABLE 503 [06:15] that's it [06:15] it's not used anywhere else [06:15] well, at least not straight off [06:16] could be some preprocessor tricks [06:17] I think it's only there for completeness [06:17] I was in the same place, didn't get any further [06:21] anyone having trouble connecting to twitter? [06:21] I thought they weren't participating [06:21] they participate at random thruought the year [06:21] hahah [06:21] in total it comes to a day anyway [06:22] https://twitter.com/herpderpedia lolz [06:24] worksforme [06:24] but they might be buckling under the "OMG SHIT IS DOWN" posts [06:25] twitter can't go offline [06:25] my mom told me they were a cloud and clouds never go offline [06:25] hahahah [06:26] yeah [06:26] it's the load of #wikipedia [06:26] damn [06:27] well [06:27] Those comments give me no hope for humanity [06:27] it's probably a good thing that Twitter dumps its tweets in the LoC [06:27] because we have no hope of actually archiving any of that, short of a direct feed to Twitter's message brokers [06:27] haha [06:28] mostly because the Javascript on New Twitter makes it such a fucking bear to wrestle [06:28] 19 new tweets in the last minute [06:28] 36 in 2m [06:29] kinda surprised by all the spanish tweets [06:29] google://time in spain => 7:29am; consider the impact of waking up with your coffee in your hand, and seeing half the 
internetz blacked out [06:31] ok [06:31] so [06:31] they're mentioning the #wikipedia hash tag, though [06:31] how to actually archive these [06:31] short of "let tef do it" [06:31] and only en was supposed to go black [06:35] https://twitter.com/#!/_ItsONLY1_KEN/status/159510766695878656 [06:39] hmm now wondering how to fix the 503 nonsense [06:39] well it gets captured sort of [06:42] I don't think http://sopastrike.com/on-strike/ is accurate at all [06:42] I think it's full of crap and few sites on strike, and a few that will be later [06:43] 4chan's got an odd blackout [06:43] the site still works, just the text is black on black [06:44] oh fuck [06:44] etherpad.org died [06:44] did someone archive it :P [06:46] ah ok it's back [06:48] Sopastrike seems to have a lot of cybersquatting websites [06:48] and facebook profilez [06:48] basically they let anyone put their link up there I guess [06:50] added http://www.qwantz.com/index.php [06:53] oh, cool [06:53] http://blog.nearlyfreespeech.net/2012/01/18/sopa-blackout-option/ [07:01] uuughghu [07:05] right and I have patched the 503 errors [07:05] in wget or another tool? 
[07:05] in another tools [07:05] I think I found where it's falling through [07:05] oh ok [07:05] work stuffs :/ [07:06] that's fine [07:06] so long as SOMEONE has a tool to grab this [07:06] well i'm running it now [07:06] seems to be surviving [07:12] ah ha [07:12] patched wget [07:14] https://gist.github.com/1631756 [07:14] that can be applied against wget bzr 2574 [07:15] I think it works [07:16] hurrah [07:16] well this crawl should end soonish I think [07:22] heh [07:22] it's picking up links under the blackout js and the blackout js [07:24] yeah [07:25] I'm not sure if the patched wget is handling wikipedia right [07:25] but as you've got a grab of that [07:25] its' fine [07:28] what the [07:28] http://twitter.com/JOIN__US/status/159537070338093056 [07:29] got about 200 resources now [07:31] http://twitter.com/Eastern_Star_/status/159537964865699840 [07:31] stupid [07:31] 2000 even [07:32] gonna try archiving the wtf wikipedia tweets [07:32] good luck with that [07:33] twitter is awful [07:33] I'm probably just going to have to bang their REST API [07:33] have fun [07:33] 20-40 tweets per minute on the two searches I have open [07:36] about the same on #stopsopa #sopa and #wikipedia [07:38] about the same on #wikipediablackout [07:38] #wikistrike hasn't moved in awhile [07:38] ugh [07:38] actually fuck that [07:39] I'll just continue running my news crawlers [07:40] people are discovering that the mobile site still works [07:52] http://twitter.com/HWGVictor/status/159543226984955905 [07:54] another blackout: http://cinematictitanic.com/sopa.html [07:55] well, the front page redirects to the sopa page [07:55] http://flowingdata.com/2012/01/17/watching-wtf-wikipedia-as-sopapipa-blackout-begins/ [07:55] heh [07:55] google isn't giving me the blackout [07:55] cos i'm on aws [07:57] https://github.com/zachstronaut/stop-sopa [07:57] oh, huh [08:05] heh [08:05] http://online.wsj.com/article/SB10001424052970203471004577142893718069820.html [08:06] "Rather, ..." 
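For anyone without a patched wget: the 503 problem discussed above (the blackout page is served with an error status, so the fetcher throws the body away) can be sidestepped in a few lines of Python. This is only a sketch and has nothing to do with the gist linked above; `fetch_even_if_error` is a made-up name, and it grabs a single URL without page requisites and without writing a WARC.

```python
import urllib.error
import urllib.request


def fetch_even_if_error(url, timeout=30):
    """Return (status, body) for url, keeping the body of 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on error statuses, but the exception object is also
        # the response, so a page served with 503 can still be read from it.
        try:
            return err.code, err.read()
        finally:
            err.close()
```

Connection failures (DNS errors, refused connections) still raise `urllib.error.URLError`; only responses that actually arrived with an error status are kept.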
[08:06] darn preview [08:10] lots of non-retweet repetition on #wikipediablackout: I support #wikipediablackout! Show your support here (tinyurl link) [08:11] new tending: #NoALaPincheSOPA [08:11] http://reedmorse.com/tmp/sopa-adwords.png can anyone who doesn't block ads confirm google is making anti-sopa adwords? [08:12] #NOALAPINCHESOPA hey hey hey you never say no to soup okaay? [08:12] closure: yes, there are ads for www.google.com/takeaction [08:12] if you search for "sopa" anyway [08:12] er, on Google [08:13] I tuned my adblock off and went to techcrunch and got the same google sopa banner [08:13] only there or other sites tho [08:13] they show up on Google search results too [08:13] techcrunch could have changed it.. if it's everywhere, that'd be huge [08:14] well, when I turn adblock back on, it goes away [08:14] give me another site with adwords [08:15] no idea, that's why I asked :) [08:16] it's also on ytmnd [08:16] (in an ad slot, not a site) [08:17] both ad slots, actually [08:18] geh [08:18] #stopsopa has 339 new tweets in the last 27 minutes or so [08:18] 335 new #sopa in 25 minutes [08:19] 176 new #wikipedia in 14 minutes [08:19] 159 new #wikipediablackout in 10 minutes [08:22] must be lots of school papers [08:28] well, I am really pissing off firefox [08:29] in addition to all those search pages I had open, I just opened those two trackers. told revisit to grab 1000 tweets [08:30] oops [08:30] closure: what was the etherpad link? [08:30] I just closed it and can't find it in Chromium's history [08:31] oh wait, I can undo [08:31] spot says 373 "wtf wikipedia" tweets per hour [08:31] yipdw: http://beta.etherpad.org/p/KDVzcBCKTj [08:31] thanks [08:32] a number of the ones spot is highlighting are "do this twitter search and laugh at dumb people freaking out" [08:32] http://theoatmeal.com lol [08:34] yipdw: where does a site go if its just changed its banner? 
but isn't really down [08:34] "raging vagina tractors" [08:34] woot to whoever noticed xmonad.org [08:35] * closure strokes his 300 line .xmonadrc [08:35] that is one long gif [08:36] courtesy of spot: http://twitter.com/WEIQINGZ/status/159551680587898880 [08:37] who keeps claiming I have WARCs of all of those :P [08:39] ok now I kinda do [09:09] new stats at about 50 minutes: #wikipedia 650, #wikipediablackout 655, #sopa 673, #stopsopa 676 [09:10] oh man [09:10] this looks to be silly [09:11] #FactsWithoutWikipedia [09:11] Weaves are made from abandoned foetuses. And you wondered why they rubbed you the wrong way, huh? #FactswithoutWikipedia [09:12] During the selection process for a new pope in the event of a tie, it is settled with a game of conkers #factswithoutwikipedia [09:15] Tiger woods owns 4 brothels #FactsWithoutWikipedia [09:15] The Earth is not spherical, it is actually a rectangular prism. #FactsWithoutWikipedia [09:24] i can imagine the mess when the people in the states wake up in a few hours [09:28] http://theoatmeal.com/sopa [09:28] that is excellent [09:31] "P.S. Please pirate the shit out of this animated GIF. " [09:34] yeah... #FactsWithoutWikipedia is going fast... 20 tweets in the last minute [09:36] Dubstep is the cure to diabetes #factswithoutwikipedia [09:36] well shit [09:38] 1 of them is from a collueage [09:39] oh [09:39] heh [09:39] my buds followed through [09:39] http://chicagoparkour.com/ [09:39] how weird [09:39] 78% of pregnancies occur due to the high incidence of couples sharing toothbrushes and bath towels. #FactsWithoutWikipedia [09:41] I've also been watching my news scraper and there is a suspicious dearth of pro-SOPA/pro-PIPA articles [09:41] but I'm just using Google News feeds [10:14] http://xkcd.com/ [10:54] hahah [10:56] bahahaha [10:58] did you catch the hidden message? 
[10:58] or hidden comic [10:58] ya [15:23] I've added a few dozen more sopa blackout pages to http://beta.etherpad.org/p/KDVzcBCKTj [15:26] oddly, archive.org is still up [16:32] closure: not for me [16:36] mornin [18:08] right, restarting the crawl with the current list in the pirate pad [18:09] i've been capturing pages all day from about 5am, then again at 8,9 am and then at 1pm, and again this afternoon. (gmt) [18:09] i'll clean up the warcs tomorrow and find out where to shove them [18:12] heheheh [18:12] http://support.godaddy.com/godaddy/go-daddy-many-other-internet-leaders-oppose-sopa-pipa/?ci=56582 [18:13] heh [18:13] they are so full of shit [18:13] yipdw: did the warc options play nicely with the wget warc? and has the bug been filed upstream? [18:13] tef: I haven't tried cleaning up my existing WARCs, I'll get to that sometime tonight [18:13] cool [18:14] honest I am archiving xvideos.com for work [18:14] I haven't yet been able to file bugs because GNU took down savannah.gnu.org [18:14] heee [18:14] worst day to file a bug [18:15] no kidding [18:16] my boss is happy i'm running sopa crawls cos I keep finding bugs [18:16] hooray running stuff for archive team now counts as testing. heh heh heh [18:16] hah [18:16] these are some really weird edge cases [18:17] well not the bugs I found in my stuff (unrelated to warcs) [18:17] found a page hang i nthe crawler on tags [18:18] ahh [18:18] because I am a terrible coder [18:18] and i've moved from cpickle to pickle for some ipc because of weird newline issues [18:18] thanks python [18:27] and flash is crashing my crawler again. 
i hate flash [18:33] might do a +1 crawl on news.ycombinator [18:56] so, evidently, if you get this level of ruckus going on, politicians back off [18:56] I just saw six news articles scroll by that said something to the effect of "cosponsors back off" [18:56] for a little while, yes [18:56] yeah [18:56] I bet they'll fuck it all up in a matter of time [18:56] I'm sure of it [18:57] hmm newzbin is also blacked out [19:01] what the hell [19:01] http://www.theweedblog.com/what-would-the-marijuana-movement-be-without-the-internet/ [19:01] that showed up in the news feeds for "SOPA" [19:01] oh, I see why [19:33] sigh, uploading to IA at 50 kB/s [19:33] and a 30.5 GiB is reported as 28.3 GiB bu curl [19:39] Sure both units are meant to be GiB? [19:43] nitro2k01, yes, why not? [19:44] 30.5/28.3 is close enough to 1024^3 [19:44] Or 1024^3/1000^3 rather but you get my point [19:45] yes, but curl uses GiB [19:45] and nautilus too I hope [19:45] It's the bigger one that would be using GB [19:45] although it seems you're right, because the file is 30486180948 B [19:46] bah, stupid nautilus [19:47] zombo.com is down too [19:47] SHIT [19:47] WHAT [19:47] Nemo_bis: btw I pushed my crap to SketchCow [19:47] thanks for reminding me :# [19:47] yipdw: I KNOW? NOTHING IS POSSIBLE! [19:48] YOU ARE NOT WELCOME [19:48] TO ZOMBOCOM [19:48] tef, good :) [19:50] but now who's Angra [19:50] i've got news.yc to crawl so I am capturing the front page and all the articles [19:50] hopefully [19:50] or crawl333-9 [19:51] got 141 seeds in the sopa crawl i've just restarted [19:51] I probably have about 7-8 copies of some pages by now but eh i'll dedupe when i'm dead [20:20] just had some teeth pulled [20:20] what is everyone else up to? [20:29] it's snowing out. [20:30] paaaanic [20:30] :) [20:36] yeah got a pretty good dump here on vancouver island [20:36] didnt know you were closeby [20:41] http://blumenauer.house.gov/ lol [20:42] who is archiving blackout messages? 
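The GB/GiB mix-up from 19:33 to 19:46 is easy to check with the byte count quoted there (30486180948 B). A small worked example, with `to_gb` and `to_gib` as throwaway helper names:

```python
SIZE_BYTES = 30486180948  # file size quoted in the chat


def to_gb(n_bytes):
    """Decimal gigabytes: 1 GB = 1000**3 bytes."""
    return n_bytes / 1000**3


def to_gib(n_bytes):
    """Binary gibibytes: 1 GiB = 1024**3 bytes."""
    return n_bytes / 1024**3


print("%.2f GB" % to_gb(SIZE_BYTES))
print("%.2f GiB" % to_gib(SIZE_BYTES))
print("ratio: %.4f" % (1024**3 / 1000**3))
```

So the 30.5 figure is decimal GB mislabelled as GiB, and the smaller figure from curl is the binary one; the two differ by the factor 1024^3/1000^3, about 1.074.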
[20:42] https://twitter.com/#!/herpderpedia [20:43] nitro2k01: heh, awesome [20:45] db48x: me [20:45] db48x: got a few thousand SOPA-related news stories, some blackout pages [20:46] I also just got a Zeiss CP.2 85mm T/2.1 for a rental [20:46] and am kind of freaking out because the lens is fucking awesome [20:46] heh [20:47] hrm [20:47] anesthesia is wearing off already [20:49] How many teeth? [20:50] 4 [20:50] Wisdom? [20:50] yea [20:50] Ah. Have fun! [20:50] three impacted, one that is useless without the one underneath it [20:51] db48x: me too. [20:52] good [20:52] i've got ~200 M or so but mostly dupes :/ I'm doing a fresh crawl now and a +1 from hacker news [20:53] from the ones in progress I have 78M and 87M [20:53] of stuff [20:53] approx [20:54] not bad [20:54] hmm that's including the db uncompressed [20:54] so far, 4.1 gigs of SOPANews [20:54] I'm going to guess that like tef there's a shitload of dupes in there [20:55] I have dupes across crawls but not within them I think [20:55] db48x: yipdw is using wget and i'm using work resources. I'm capturing some of the js wget is missing, etc. 
[20:57] I could probably fire up more machines next time but that's a sort of a work in progress [20:57] heh, next time [20:57] we need a catchy acronym for the next bil [20:57] l [20:57] I think my boss is looking to expose api access to make this sort of thing easier [20:57] INNOVATE [20:57] International, uh [20:58] if I had s3 write support for archive.org it would make things a lot easier [20:58] I could just make my crawler upload to them instead of amazon [20:59] heh [20:59] https://twitter.com/#!/jimmy_wales/status/159737306419433472 [21:00] oh, and http://uncyclopedia.wikia.com/wiki/Main_Page [21:02] s3 write for archive.org is not too hard [21:03] yeah I just signed up for an api key [21:04] hmmm [21:05] this anesthesia didn't last as long as I was told [21:05] I could make an irc bot that does a page capture and uploads the warc to archive.org :V [21:05] that'd actually be pretty useful [21:05] since that's the main way these are being found [21:05] I like the one at thedailywtf.com [21:06] huh.. that would be a neat service for here [21:06] well irc does make a great command and control structure [21:07] can always make an archiveteam-bots [21:07] channel with more than one bot to service requests [21:12] heh [21:12] I did seriously consider using irc as a messagebus at one point [21:16] I think i'll need the right credentials first / bucket details but I could likely get an irc bot up later this weekend [21:21] odd [21:21] my private xmpp chat room is bouncing up and down [21:29] http://www.flickr.com/photos/jcn/6721179703/sizes/l/in/photostream/ [21:30] :) [21:45] does anyone here know how the openlibrary.org source code is organized? [22:01] http://hq.deviantart.com/journal/Join-a-SOPA-and-PIPA-debate-280023798 [22:01] er [22:01] http://sakimichan.deviantart.com/art/STOP-SOPA-Bill-276510440 [22:38] hrm [22:38] I'm having trouble concentrating [22:40] Been there [22:41] yipdw: I'd love those pages when you're done. [22:41] How are you doing it? 
[22:41] I'm periodically scraping Google News for links [22:41] Are you using WGET or heretrix? [22:41] and then spawning an assload of wget-warc processes [22:42] Oh, excellent. [22:42] Everyone here is happy. [22:42] I can substitute Heritrix [22:42] No no! Use wget-warc [22:42] heh, ok [22:42] I haven't yet verified the integrity of all the WARCs; it's entirely possible that wget-warc is tripping up on some of them [22:42] I have spot-checked a few and those do look okay though [22:43] 2.2656 TiB RAM [22:43] 54.0039 TiB Disk [22:43] 826 Virtual CPUs [22:43] [5:42:22 PM EDT] Andy Bezella: fyi worker farm is: [22:43] Interesting statistics on the IA workers [22:44] actually, one problem I have periodically hit is wget-warc just...freezing up [22:44] there's no indication in the log that anything's wrong [22:45] oh wait, I just noticed that I have a shitload of TCP connections open, I bet that's it [22:45] weird [23:07] Regardless, you're a hero, yipdw [23:15] https://static.thepiratebay.org/legal/sopa.txt [23:58] http://www.wired.com/threatlevel/2012/01/scotus-re-copyright-decision/
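On the freezing wget-warc processes and the pile of open TCP connections: one way to keep that in check is to cap how many fetchers run at once and give each a hard time limit. The sketch below is not yipdw's actual script (the log does not show it), and the chat only guesses that the open connections are the cause. It assumes a wget build that accepts `--warc-file`, as GNU Wget 1.14 and later do; `build_command`, `fetch`, `crawl`, the `capture-NNNNN` naming, and the limits are all made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(url, warc_name):
    """Build a wget command line that writes a WARC for one URL."""
    return [
        "wget",
        "--page-requisites",   # also fetch CSS/JS/images the page needs
        "--timeout=60",        # network timeout per operation, in seconds
        "--tries=2",
        "--warc-file=" + warc_name,
        url,
    ]


def fetch(job, per_job_timeout=600):
    """Run one wget; return (url, exit code), or (url, None) if it hung."""
    url, warc_name = job
    try:
        done = subprocess.run(
            build_command(url, warc_name),
            timeout=per_job_timeout,  # run() kills the child when this expires
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return url, done.returncode
    except subprocess.TimeoutExpired:
        return url, None


def crawl(urls, max_workers=8, fetch_one=fetch):
    """Fetch every URL with at most max_workers wget processes alive at once."""
    jobs = [(url, "capture-%05d" % i) for i, url in enumerate(urls)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, jobs))
```

URLs that come back with `None` are the ones that hit the per-job limit and can be retried or inspected separately.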