[00:10] bsmith094: mostly north america and europe and australia.
[00:10] well that explains a lot
[00:11] australia is 16hrs ahead of me in ny
[00:11] eu is at least 6hrs ahead
[00:17] seattle is 23 hours ahead
[00:18] er, not quite, 21 hours ahead
[01:20] maybe for u, for me, im in new york
[01:29] I think the time offset between seattle and new york is more than simply a matter of perspective ...
[01:32] whoops, seattle wa is on the other coast, man i need to look at a map once in a while
[01:33] hah
[01:36] from several hrs ago: (06:22:18 PM) bsmith094: repeat from earlier: yipdw: seriously, how do i run this now? what do i use, hoard, ffgrab, what? is hoard expecting a file with id numbers? because i have that. id really like an answer, this method seems to be MUCH faster than the other thing i was using
[01:46] what im talking about: https://github.com/ArchiveTeam/ffnet-grab
[01:54] btw how IS the mac.com project going?
[01:54] or mobileme
[05:11] bsmith094: as I said, it's not complete
[05:11] bsmith094: however, the idea is that you run hoard and stalk simultaneously
[05:11] ohh ok
[05:11] hoard retrieves stories; stalk retrieves profiles
[05:11] or will retrieve profiles
[05:11] but hoard has no args?
[05:11] it's supposed to pull story IDs from a Redis instance
[05:11] the code to do that doesn't yet exist
[05:12] long story short, that stuff doesn't work yet
[05:12] I'm pretty sure it could be simpler but whatever
[05:12] ah ok, so ill stop trying to run it then
[05:12] (I expect that nobody will want to distribute this)
[05:12] distribute the workload, that is
[05:12] btw i synced the redis db with coderjoe's instance, so thats fine right?
[05:12] yes
[05:12] his is old; I've been rerunning ffgrab and have so far come up with about 3,000 more stories
[05:12] but it's close enough
[05:13] no such file to load -- connection_pool (LoadError) from stalk.rb:1
[05:13] it requires the same load convention as hoard.rb
[05:13] is there any way i could make this stop happening? ive installed all these things at least 5 times
[05:13] yeah, make sure you're using the right Ruby installation
[05:13] I guess I should modify ffgrab to insert a "date of latest fetch"
[05:13] so that one can version fetches
[05:14] apparently it remembers if i just use rvm use 1.9.2, so that works great then
[05:14] and yes, great idea
[05:15] might as well update the db when we start scraping the site, as well
[05:16] whats your local time?
[05:16] I would version it in UTC
[05:17] or perhaps use TAI64 just to be an asshole
[05:19] just asking because youre usually on here either very late or very early my time, gmt-5
[05:22] thats why i couldnt find any definition for sid? it uses the redis instance? huh, neat
[05:36] and BTW the Internet Archive just broke down
[05:36] Yeah, basically, the entire ia7* datacenter went offline
[05:37] The machine that manages IP address assignment for that DC died, afaik
[05:37] And that fucked everything
[06:05] ha, that's annoying
[06:05] http://www.fanfiction.net/s/7128202
[06:05] "story not found" returned with HTTP 200
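That is a classic soft 404: the server says 200, so a grabber has to sniff the body. A minimal sketch of the check using only Ruby's standard library; the URL pattern is the one quoted above, the method name is made up, and the error text is matched loosely since the exact page markup isn't quoted here:

    require 'net/http'
    require 'uri'

    # fanfiction.net answers deleted stories with HTTP 200 plus an error page,
    # so the status code alone can't distinguish a live story from a missing one.
    def story_missing?(story_id)
      resp = Net::HTTP.get_response(URI("http://www.fanfiction.net/s/#{story_id}"))
      resp.is_a?(Net::HTTPSuccess) && resp.body =~ /story not found/i
    end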
[06:53] hey all
[06:55] y0
[06:56] what's up?
[06:56] oh, not sure if any of you saw -- http://minnie.tuhs.org/pipermail/tuhs/2011-December/002538.html
[06:56] basically, deleted blocks are still worth preserving.
[06:56] :p
[06:57] fascinating
[07:04] which makes me wonder; does anyone here have equipment to read old DEC media?
[07:04] disk packs and tapes
[07:28] balrog: wow, that is cool
[07:28] balrog: what makes it even cooler (for me) is that it's going through the archives of someone who's no longer here
[07:28] makes you realize how long computing has been around
[07:30] yipdw: yup…
[07:31] I have access to stacks of archives
[07:31] but most, no way to read
[07:31] much is original DEC software too :/
[07:31] neat
[07:31] unfortunately I have no devices that can read those, nor do I think my employer has any
[07:32] they're rare
[07:32] :/
[07:32] I have some devices, but probably not all the interfacing stuff
[07:41] I'm going to my local retrocomputing society meeting soon
[07:41] I'll ask around
[07:48] cool
[07:48] mostly what I have here is RL01 and RK05 packs, and tape
[07:55] http://www.youtube.com/watch?v=p0t7g38sd7Y
[08:21] great to see that Archive.org has resolved its issues
[08:46] It's weird how silent this channel is...
[09:05] yeah?
[09:26] What's so weird about it?
[09:27] It's mostly silent in here, unless there's a big project going on - or someone is discussing a project
[09:36] The archives look quite long, but whenever I am here it's mostly silent
[09:39] * ersi rolls eyes and goes back to work
[09:48] Hydriz: while you're waiting for irc chan activity you could work on the wiki :D
[09:56] Hydriz: it's also early morning in the US, and a lot of people here are in those time zones
[09:56] like me, and I gotta get to bed :P
[09:57] ditto
[10:11] I would love to work on the wiki, but I don't have sufficient rights
[10:11] well, its close to evening here...
[14:21] SketchCow: I need an rsync slot thingy for splinder; Nemo_bis has been nagging me to get my butt into gear and upload splinder stuffs
[18:41] from Thin's example suite:
[18:41] it 'should not fuck up on stupid fucked IE6 headers' do
[18:42] BDD in action, ladies and gentlemen
[19:04] just did a git pull on ffnet-grab, ive got a redis instance running, what do i run, in what order?
[19:07] yipdw:
[19:07] the scripts can be run in any order
[19:07] but they are not ready for use
[19:07] because they don't retrieve CSS and JS
[19:08] due to b.fanfiction.net's always-gzip-even-if-not-requested behavior
[19:08] also, minor thing: hoard is pulling nothing from the redis queue
[19:08] you probably have nothing in Redis
[19:08] i synced the db yesterday
[19:09] redis-cli scard stories
[19:09] what does that return
[19:09] tef: msg me, and I'll do it
[19:09] 0? thats odd
[19:10] if you're not starting up redis-server in the same directory as the dump file, or if your redis.conf doesn't point to one, then redis-server will not load the dump
[19:10] the fanfiction.net grab database is not insubstantial: on my amd64 machine, Redis consumes 500 MB of memory for it
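A sketch of the two steps being discussed here ("did the dump load?" and hoard's queue population), assuming the redis gem; 'stories' is the set name from the scard command above, while 'todo' is a stand-in for whatever key hoard actually uses internally:

    require 'redis'

    redis = Redis.new  # localhost:6379, the instance that should have the dump

    # scard == 0 usually means redis-server never loaded dump.rdb: start it in
    # the directory containing the dump, or point redis.conf at it.
    abort 'dump not loaded?' if redis.scard('stories') == 0

    # Populating the todo queue amounts to copying story IDs out of the set;
    # pulling 3.6 million members in one call is crude but fine for a sketch.
    redis.smembers('stories').each do |sid|
      redis.rpush('todo', sid)
    end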
[19:10] yipdw: gzip regardless is technically http/1.1 compliant
[19:11] tef: not if the client doesn't send Accept-Encoding: gzip
[19:11] wheres the dump file by default?
[19:12] yipdw: what are you sending in the accept-encoding field?
[19:12] tef: whatever wget is sending
[19:12] because 'identity' was removed in the errata
[19:12] (also, I don't think wget is an HTTP 1.1 client)
[19:13] oh, it doesn't send an accept-encoding header, so
[19:14] If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding.
[19:14] yeah, I just read that
[19:14] and yeah, identity was removed
[19:14] in the errata - I know the httpbis draft better than 2616 - I might be confused
[19:14] although iirc you can also send chunked back if the client doesn't ask for it
[19:16] well, all that said, even setting e.g. Accept-Encoding to chunked will still give you gzipped data
[19:17] b.fanfiction.net really just sends that out, and wget can't handle it
[19:17] where "handle it", for my purposes, means "decompresses the data and writes the decompressed data into a WARC"
[19:18] it's important that wget do that because some WARC tools, like Wayback, prepend uncompressed data to archived assets
[19:18] I could always hack warc2warc in warctools to have a 'write uncompressed http' option
[19:18] :P
[19:18] and if you do that to compressed data then really bad things happen
[19:18] hmm?
[19:18] well, "really bad" meaning "it's unreadable"
[19:18] no, I mean post-processing the warcs to find http responses and decompress them if necessary, before recompressing the warc record
[19:19] oh yeah
[19:19] that'd work
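What tef is describing, sketched without warctools: take the raw HTTP response out of a WARC record, strip the gzip content coding, fix the headers, and hand the result back for per-record recompression. A rough sketch that assumes the body is one complete gzip stream and no chunked transfer coding:

    require 'zlib'
    require 'stringio'

    # Takes one raw HTTP response (status line + headers + CRLF CRLF + body)
    # and returns it with the body gunzipped and the Content-Encoding and
    # Content-Length headers adjusted to match.
    def decompress_http_response(raw)
      head, body = raw.split("\r\n\r\n", 2)
      lines = head.split("\r\n")
      return raw unless lines.any? { |l| l =~ /\AContent-Encoding:\s*(x-)?gzip/i }
      body = Zlib::GzipReader.new(StringIO.new(body)).read
      lines = lines.reject { |l| l =~ /\AContent-(Encoding|Length):/i }
      lines << "Content-Length: #{body.bytesize}"
      lines.join("\r\n") + "\r\n\r\n" + body
    end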
[19:19] bsmith094: I assume the machine being down meant you couldn't finish uploading nickel.7z
[19:19] yes
[19:19] It's back.
[19:19] tef: alternatively, I should be able to retrieve the compressed assets separately and append them to the WARC, right?
[19:19] How big is it and how did you grab it all?
[19:19] thanks, rsync up
[19:19] tef: the WARCs generated by downloading bits of fanfiction.net aren't too huge, so either approach will be fine
[19:20] actually about 1.4gb, and with DownThemAll
[19:21] wikipedia is going to black out the site on wednesday
[19:21] anyone interested in creating a mirror?
[19:22] English Wikipedia
[19:22] of the blacked-out site or of all of Wikipedia?
[19:22] emijrp: Calm down, it's just for a few hours/a day
[19:22] only enwp
[19:22] I don't think I have enough bandwidth to mirror the English Wikipedia in two days
[19:22] let alone storage space
[19:23] use ru. and google translate
[19:23] and yeah, this isn't a permanent thing
[19:23] it will be 100% the same
[19:24] ok, no bandwidth
[19:25] a firefox addon to break the CSS/JS trick that hides the content
[19:25] how do you know it's just hiding content?
[19:25] also I can't see this as a really huge problem, sorry :P
[19:26] i have privileged info
[19:26] ; )
[19:27] well, if it really is just that, just use Firebug and disable the offending styles
[19:27] (or whatever)
[19:28] yes, but i want an easy method for the million visitors that will need to read wikipedia on wednesday
[19:28] write one then
[19:29] Why don't you just twiddle your thumbs for a day and let them have their campaign?
[19:29] It's not like they're deleting everything
[19:29] my opinion is that circumventing a blackout defeats its point
[19:29] Same here.
[19:29] and I don't really care for SOPA/PIPA so
[19:30] if Wikipedia can leverage their Alexa ranking to further the anti-SOPA/PIPA agenda, I'm all for letting them do that
[19:30] ok ive got the dump file in the ffnet dir. what should hoard be doing? because it populated the todo list and its just hanging there. also yes, let them have their blackout, make an anti sopa video or something
[19:31] or link to the eff's video
[19:31] bsmith094: paste console output, I have no idea what "hanging" means
[19:31] ben@ben-laptop:~/ffnet-grab$ ruby hoard.rb
[19:31] I, [2012-01-16T14:28:06.883313 #10492]  INFO -- : Populating todo queue.
[19:31] I, [2012-01-16T14:28:11.107122 #10492]  INFO -- : Todo queue populated with 3658953 story IDs.
[19:31] not here
[19:31] oh well
[19:31] damn line breaks
[19:32] oh
[19:32] I can tell you why
[19:32] because there is no code to fetch anything
[19:32] https://github.com/ArchiveTeam/ffnet-grab/blob/master/hoard.rb#L107-109
[19:32] and that was intentional.
[19:32] as I said, the scripts are not ready
[19:33] they won't be ready until I can work out a method to retrieve usable CSS and JS, either via tef or some other way
[19:33] ohhh, i see, you commented out the cmd, sneaky, one line out and it does nothing
[19:33] I commented out nothing
[19:33] there is simply no code there
[19:36] #{...} in Ruby strings doesn't mean "comment", it means "interpolate"
[19:36] if you're referring to https://github.com/ArchiveTeam/ffnet-grab/blob/master/hoard.rb#L72
[19:52] yipdw: fwiw there is now a -D option in warctools' warc2warc.py
[19:53] so python warc2warc.py -D -Z in.warc > out.warc.gz
[19:53] should work
[19:53] it's been pushed to code.hanzoarchives.com
[19:54] tef: awesome, thanks
[19:54] i've tested it by hand on the warcs we create
[19:54] haven't tested warc-wget
[19:54] will that recompress the WARC record-by-record?
[19:54] yes
[19:54] cool
[19:55] I think that should work; I'll let you know
[19:55] if I am not around on irc, file a bug or email me at my work address: thomas.figg@hanzoarchives.com
[19:56] tef: actually, am I missing something? commit history for warc-tools on code.hanzoarchives.com doesn't show any commits later than 2011-12-07 -> http://code.hanzoarchives.com/warc-tools/changesets
[19:56] oh i'm a muppet
[19:56] heh
[19:56] *actually pushed now*
[19:56] ah ha, there we go
[19:56] cool
[19:56] memo to self: work repo != public repo
[21:59] Serious talk underway to filter or banner archive.org for SOPA
[22:00] what the heck is this SOPA I keep hearing about? I suppose it's a US law?
[22:00] or is it an organization?
[22:01] US Law
[22:01] Being discussed
[22:01] It adds heinous restrictions to internet activity
[22:01] That's why people don't like it.
[22:02] The act of streaming a copyrighted movie is a felony
[22:02] There's lots of things
[22:04] I think the internet as we know it is worth fighting over for a number of reasons. One, it's the first time humanity can mass communicate ideas, words, thoughts etc in an unrestricted way. And yes it has some downfalls, but looking at the big picture it's a really awesome invention. We can learn about stuff without it being filtered thru the regular media, for example.
[22:07] I am reading about SOPA on wikipedia and man, it doesn't sound like anything I'd like to see the future internet become.
[22:07] I think some in big content who crafted and are pushing for SOPA can see the "unintended" consequences, such as sites with user content like youtube deciding it's too much trouble and closing down. If such sites closed, they would again have most of the control over distribution to consumers.
[22:09] SOPA is actually Spanish for soup, I dunno what all this internet talk is
[22:09] It's better in Swedish
[22:09] where it means trash
[22:09] heh
[22:09] http://theswash.com/liberty/10-technologies-that-congress-tried-to-kill
[22:10] http://en.wikipedia.org/wiki/Red_Flag_Act
[22:10] I don't know what world some politicians live in. SOPA reminds me of some EU politician who wanted email to have a postage fee much like regular mail *doh*
[22:11] If that could help stop spam, maybe...
[22:11] As in, the receiver has the right to charge a small fee
[22:12] or something like hashcash
[22:12] Yeah, that might work too
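Hashcash in one breath: the sender pays with CPU by minting a stamp whose hash has enough leading zero bits, and the receiver verifies it with a single hash. A toy Ruby version of the idea - not the real versioned stamp format, and the address is made up:

    require 'digest/sha1'

    BITS = 20  # minting costs ~2^20 hashes on average; verification costs one

    # True if the SHA-1 of the stamp starts with BITS zero bits.
    def valid?(stamp)
      Digest::SHA1.digest(stamp).unpack('B*')[0].start_with?('0' * BITS)
    end

    # Mint a stamp for a recipient by brute-forcing a counter.
    def mint(recipient, date = Time.now.strftime('%y%m%d'))
      counter = 0
      counter += 1 until valid?("#{recipient}:#{date}:#{counter}")
      "#{recipient}:#{date}:#{counter}"
    end

    puts mint('archiveteam@example.org')  # a second or two of CPU in MRI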
[22:12] well I'd rather let my spam filters do their work than risk, step by step, being brought into something where sending email might require postage, a paypal account, a valid VISA card etc
[22:13] perhaps it's time to go back to BBSes
[22:13] And pay via your phone bill...
[22:13] not sure how that solves the spam issue
[22:14] my BBS suggestion was to get away from SOPA, not spam.
[22:14] oh
[23:46] yipdw: any luck with warctools?
[23:46] tef: haven't tried; I'll give them a shot when I get home
[23:46] (hopefully in a half-hour or so when this test suite goes green)
[23:50] cool
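For anyone replaying this later: the round trip being tested is the invocation tef quoted above, and a quick way to eyeball the result is to confirm the output opens as gzip and begins with a WARC version line. A sketch, assuming in.warc exists and warc2warc.py is on hand:

    require 'zlib'

    # Run the exact invocation from the log, then peek at the first gzip
    # member; in a record-compressed WARC every record is its own gzip
    # member, and each one begins with a "WARC/..." version line.
    system('python warc2warc.py -D -Z in.warc > out.warc.gz') or abort 'warc2warc failed'

    Zlib::GzipReader.open('out.warc.gz') do |gz|
      first = gz.read  # reads only the first gzip member (one WARC record)
      abort 'unexpected output' unless first.start_with?('WARC/')
    end
    puts 'first record decompresses and looks like a WARC record'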