[03:31] does anyone else here have that annoying feeling that you're the only person that can 'just get shit done' without cutting corners and without unnecessarily overcomplicating shit?
[05:08] i have that feeling but not in that context at all
[05:08] sometimes i feel like im the only one who can get shit done without everyone else thinking it cant be done or we dont have the resources or they are too retarded to bother
[05:09] I get that feeling about my entire profession
[05:12] shaqfu probably thought that about me the other day trying to manipulate a text file into excel and doing all kinds of crazy find and replaces. and he does like 2 things in bash script :P
[05:14] Nah :P
[05:14] 90% of that was schbirid anyway
[05:15] But 99% of moderately difficult problems in the archives world are "pay someone else" or "too hard, don't do it"
[05:19] like the LOC MARC21 catalog situation... they pay someone else to handle it, and if you want access, you also have to pay up.
[05:19] WORKS 4 ME
[05:19] Report went up for web archiving; 2/3 of people that do it, use Archive-It
[05:19] 2% use wget :(
[05:19] command lines are hard. let's go shopping!
[05:20] shopping for bookshelves or some shit
[05:20] Pay IA to use the same goddamn tools we can use free!
[05:20] It's not like they publicly release their spiders or something wacky like that
[05:21] hey, at least IA get some income from it. It unfortunately costs a lot to keep that amount of storage online with that much bandwidth
[05:21] Someone's gotta pay for underscor's liquor
[05:22] chronomex: Those are usually bought second-hand
[05:23] Although I honestly don't have a clue who gets rid of mobile shelving; it's not like it goes obsolete
[05:24] The only library I know of that can conspicuously consume is Harvard; maybe they supply everyone else... :P
[05:34] I'd like to have some of those movable shelves that run on rails
[05:35] you essentially have one aisle that you move around to where you need it
[05:37] With Archive-It you're paying to have a GUI that anyone can use
[05:37] and also storage
[05:37] and support
[05:37] It's sorta like you can run centos or RHEL w/ support contract
[05:37] Also, Heritrix IS open source
[05:37] The same exact version they use.
[05:38] Coderjoe: Pity it's insanely expensive just for the hardware, let alone installing it
[05:38] indeed
[05:38] Coderjoe: Yeah, those things are so cool
[05:38] Odds are good you need a reinforced floor
[05:38] Museum of country music or whatever it's called in Nashville had them in their archives
[05:38] fun to watch
[05:38] you mean concrete isn't good enough?
[05:39] Coderjoe: It probably is
[05:39] I always felt like I was going into a secret vault when using them
[05:39] Turning some big crank and opening a wall
[05:39] * kmm has changed topic for #303 to: "i just met you, and this is crazy, but here's my boner, tug it maybe"
[05:40] hahahahaha
[05:40] i certainly wouldn't try putting them on a standard 2x6 joist + 2 plywood layer home floor
[05:40] underscor: The storage/support is probably worth it, but a lot of these places are big unis that should have the ability/hardware on hand
[05:40] shaqfu: It's probably cheaper over all though
[05:40] shaqfu: why reinvent the wheel? archive-it has everything set up nice and easy
[05:40] ^
[05:41] rather than have some staffer figure things out and then try and pass that on while handling the inevitable support queries
[05:41] Coderjoe: If you have your own repository for stuff, and you have someone that can run Heritrix, why not have them do it vs. paying someone else
[05:41] cost
[05:41] shaqfu: because archive-it does a lot more than just run Heritrix
[05:41] the employee's time likely costs more than what archive-it charges
[05:42] and the employee(s) can then be doing other stuff
[05:42] Coderjoe: What's the charge? I figured it was low, but not lower than a few hours per week of someone's time
[05:42] I should see if I can get a demo account to show you guys
[05:42] I have no idea
[05:42] shaqfu: Heritrix gives you a pile of warcs. Then what?
[05:42] How do you make that accessible to your constituents?
[05:42] What about focused crawls where only one department should have access?
[05:43] underscor: Devise a sensible versioning system, store it long-term, copy over the copied data and toss it on a server somewhere
[05:43] What if you want z department only to crawl a whitelist of domains, x department to be able to crawl anything, and y to have all requests approved by x
[05:44] shaqfu: I can see I'm not going to win here
[05:44] underscor: I'm listening, really
[05:44] Why do people pay for google apps then?
[05:44] When you can just run your own mail server
[05:45] or EC2, when you can run your own hardware
[05:45] EC2 is for huge-scale work, and a mailserver is much harder than pulling down sites
[05:45] archive-it is not only about pulling down sites.
[05:45] it's about storage, retrieval, and display
[05:45] cost again. it costs money to have someone manage the server(s)
[05:45] We even have partners that upload their own WARCs to archive-it
[05:46] Just for the storage/display/user controls/permissions stuff
[05:46] It's a lot of work to set up an environment that can just spin up an arbitrary set of WARCs
[05:46] from a nice gui
[05:46] The storage I get if you're either too small to afford good infrastructure or don't have it ready, but places that do, I don't see the incentive
[05:47] Although yeah, display and retrieval are awesome
[05:48] Admittedly, I could just be bitter and lumping Archive-It with any number of unnecessary paid services places use, which is insanely unfair
[05:48] You should try it :)
[05:48] I'll talk to the people tomorrow
[05:48] see if I can get a demo session token set up for you guys to fiddle with for a week
[05:49] But if you're a big uni and paying for web + email + av + research, at some point you have to just say "this is enough"
[05:49] I think they got to the point from the other side
[05:49] They were running their own web + email + av + research, and it was too much overhead
[05:50] underscor: Could be; it just seems cheaper to centralize and hire one or two people vs. outsourcing many services
[05:50] I suppose it's not, though, because they would if it was :P
[05:50] And I don't think every outsourced solution is as kind as IA :)
[05:50] Oh, definitely.
[05:50] Most unis don't pay for email, though
[05:50] underscor: I had a uni archivist tell me once that email was "just too hard"
[05:50] gapps for universities is free
[05:51] I take it you've been quite insulated from financial accounting at your places of employ
[05:51] Also, having full text search WITHIN archived content is pretty nice
[05:51] So I think a lot of places are just letting it go fallow, and haven't gotten to the point of dealing with everything at once
[05:51] and is definitely something that plain Heritrix won't give you
[05:51] Coderjoe: Yes, I have (outside of the usual "we're poor again")
[05:52] underscor: couldn't you just grep across everything not in tags?
[05:52] Grep across 4,281,459,653 warcs?
[05:53] Point taken
[05:53] archiveteam: purveyor of Reasonably Big Data
[05:54] :D
[05:54] to run their own service, they would have to pay for: development time, hardware costs, running-hardware costs, maintainer time, support time...
[05:54] pager duty time
[05:54] or they could pay $x for all you can eat
[05:54] It's actually not like that, it's mildly tiered
[05:54] But it's still very cheap, comparatively
[05:55] Coderjoe: Again, these are places that would start seeing economy-of-scale benefits by taking on the task themselves
[05:55] and all that people time includes not just wage/salary, but also taxes, FICA, possibly health insurance
[05:55] shaqfu: but not compared to IA's scale
[05:55] underscor: Oh, absolutely not
[05:55] and not get any actual income on it to cover those costs
[05:56] IA's margins on archive-it are pretty thin, so I think that even with the added margins, a-i is still cheaper than any uni would ever need
[05:56] scalewise
[05:56] One thing that really bugs me is the inconsistent theming
[05:56] http://www.archive-it.org/organizations/369 vs http://wayback.archive-it.org/2344/*/http://action.aclu.org/
[05:56] underscor: For that one task, yes. I'm thinking of what happens when more and more gets added, if it's still reasonable to pay for it
[05:57] it is a discussion I have had a few times with my boss (very small company): any time spent on IT stuff is not being paid for by an outside source. however, that time needs to be taken in order for the paying work to get done.
[05:58] and if the turnkey solution is less expensive than the expected roll-your-own costs, almost any accountant will push the turnkey
[05:58] (depending on if it does what is needed, doesn't have security/privacy problems, etc)
[05:58] Hm, I'm curious if there are places that handle NSF research data for a fee, given that every uni in the US has to deal with it
[06:00] I honestly hope I'm wrong, and that archives outsourcing services/storage is a legitimate solution, and not a symptom of "too hard" thinking
[06:01] wow, "scalewise" highlights the window for containing my name
[06:02] How does scalewise contain winr4r?
[06:02] But I'm concerned when large research unis, that have the infrastructure in place to add more long-term storage, do it
[06:02] If they're doing it because that's honestly and truly the best solution, or if it's due to short-term thinking
[06:03] and archives generally have less funding to spend on development costs than corporations do
[06:03] underscor: lewis
[06:03] good morning, folks
[06:03] Coderjoe: Where I was, anything involving computers was handled by a different department, which got whatever it wanted
[06:04] and I thought we were talking about more than just storage, but all of the software that goes into archive-it
[06:04] winr4r: aha. my client only highlights on nick
[06:04] Coderjoe: Certainly, but again, there's a tipping point
[06:06] If you're just doing web stuff, then it makes sense to use A-It. But if you do that for *everything* involving computers, that's a problem
[06:10] Dunno; maybe it'll shake out once places stop using /dev/null to archive email...
[06:23] underscor: the latter link http://wayback.archive-it.org/2344/*/http://action.aclu.org/ totally matches the styling of every library website ever
[06:23] lol
[06:23] compare: http://www.spl.org/
[06:26] roundrects!
[08:59] library do seem uninspired
[08:59] sites
[08:59] I just checked my local one and it looks the same
[08:59] it is kinda like wikis in that way
[09:00] almost all wiki sites look the same
[09:00] and frankly I think that is boring
[09:06] why can't websites be simple? why must they be a vomit of flash and javascript and busy colors?
[09:24] nobody gives money to blue rectangles
[09:25] like seriously, look at these guys: http://www.berkshirehathaway.com/
[09:27] indeed. nobody gives money to them!
[10:02] in the corporate environment minimal is always best
[10:02] did any of you see that netflix aws panel that got open sourced
[10:02] way better design than the amazon interface
[10:07] eye bleed warning: http://emporiumchicago.com/
[11:20] uploading episode 127 of dl.tv
[11:20] starting the next batch of uploads
[11:40] uploaded: http://archive.org/details/dltv_127_episode
[11:53] uploaded: http://archive.org/details/dltv_128_episode
[12:16] uploaded: http://archive.org/details/dltv_129_episode
[12:34] uploaded: http://archive.org/details/dltv_130_episode
[12:56] uploaded: http://archive.org/details/dltv_131_episode
[12:56] has anyone tried scraping google with phantomJS
[12:56] it or selenium gets us over the "real" browser hurdle
[13:26] http://archive.org/details/dltv_132_episode
[13:26] full dvd 2 of dl.tv is uploaded
[13:26] :-D
[13:35] Wow, I REALLY need to fix the spam issue on the wiki.
[13:38] it bypasses the re-captcha eh?
[13:38] or completes it
[13:42] uploaded: http://archive.org/details/dltv_133_episode
[14:00] uploaded: http://archive.org/details/dltv_134_episode
[14:16] uploaded: http://archive.org/details/dltv_135_episode
[14:51] uploaded: http://archive.org/details/dltv_137_episode
[14:51] uploaded: http://archive.org/details/dltv_136_episode
[14:52] Busy man
[14:52] interesting, the googlebot seems to download my bigger forumplanet warcs partially. ~16mb each
[14:57] ersi: I still have another 25gb of dl.tv
[14:57] ersi: all of crankygeeks is up there
[15:13] uploaded: http://archive.org/details/dltv_138_episode
[15:16] i think the format changed with episode 139
[15:17] it's big res and smaller file
[15:17] i only know this cause the last episode was 50:18 and 228.2mb in size
[15:18] episode 139 is 50:14 but 203.2mb in size
[15:18] also looks more wide-screen
[15:45] uploaded: http://archive.org/details/dltv_139_episode
[15:50] uploaded: http://archive.org/details/dltv_140_episode
[15:52] No need to keep updating us
[15:53] If I did that, this channel would be unusable
[15:53] sorry
[15:54] #godane
[15:55] it doubles as a hashtag
[15:55] #godane-bs
[15:59] :D
[16:00] i may be becoming like SketchCow with techtv videos
[16:00] i wish you had an interview on techtv
[16:09] #sketchcow-bs would be a wasteland, since nothing he says is bs
[16:10] except "I will never take a picture of myself in a tutu"
[16:19] FANFICTION/2/24/240/u/2405042/Must_Have_Yaoi/2405042.cooked.warc.gz
[16:19] I wonder if that's still around
[16:21] heh, I forgot how slow extracting one file out of a 50 GB tar is
[16:28] how can 87% of splinder have been deleted?
[16:29] there must have been a lot of small profiles
[16:29] with next to zero pics
[17:11] did you guys get this: http://www.theregister.co.uk/2007/02/26/microsoft_archive_goes_torrent/
[17:38] I didn't
[17:38] a five-year gap is kind of large
[17:40] long tail torrents seldom work out
[18:02] yipdw: it's on groklaw.
[18:02] http://www.groklaw.net/staticpages/index.php?page=2007021720190018
[18:02] they're also working on transcribing it
[18:25] does anyone know a directory tree map tool (like baobab, filelight or sequoiaview) that accepts a textfile with one line per file location as input and maps that?
[18:26] who is this hatman chump, anyway
[18:26] he messaged me for some reason
[18:39] baobab, filelight, gdmap do not do it. jdiskreport supports saving and opening scan data but uses a binary format
[18:39] kdirstat looks promising, you can save a plaintext cache file
[18:40] yeah, awesome
[18:41] you can edit it just fine. checksums inside are not checked if you just open the cache file
[18:46] hm, this won't be easy
[18:47] directory tree map tool?
[18:47] kdirstat wants one line specifying a directory and then the files as lines below
[18:47] yeah
[18:47] http://media.cdn.ubuntu-de.org/wiki/attachments/50/28/gdmap.png
[18:47] http://media.cdn.ubuntu-de.org/wiki/thumbnails/6/64/6431f4bdde74de4697fc08067034edf7617bbf08i250x.png
[18:48] oh
[18:49] another nifty tool in that arena (though without the graphical part) is ncdu, if at the console
[18:49] yeah i use that a lot
[18:49] but it doesn't do cache files
[18:49] but the data i have are wget logs, not a filesystem
[18:49] would like to get an overview on what we have on fileplanet so far
[19:27] i am diving into sed-hell
[19:28] what needs fixing?
[19:48] oh boy it works
[19:48] slowly
[19:48] terribly
[19:48] just like i do it all the time
[19:49] sed inplace ftw
[20:19] SketchCow: i've just learned: when i buy your next documentaries please send them to me with the cheapest service you can
[20:34] I reat that as sex hell
[20:34] read*
[20:34] and got all excited, and then was disappoint when I reread
[20:34] :(
[20:48] poop, this is not trivial and i give up
[20:50] i have a list of paths to files
[20:51] i need to rework it so before each file that is in a new (sub-)directory there is a line with that directory
[21:00] if someone writes that (gnu tools) we could get something like http://i.imgur.com/OT7Et.png from wget -nv logs
[21:00] example data https://pastee.org/hyumg (the numbers at the end of lines are sizes, must be kept intact)
[21:03] example result https://pastee.org/tf98s
[21:04] good night
[22:35] SketchCow: (delayed) because in the UK apparently we pay tax in the amount that it cost to send it, not just on the value of the parcel
[22:36] mmm
[22:36] http://www.wzzm13.com/news/article/217608/14/Drivers-asked-to-detour-around-I-196-buckling
[22:36] when I went through, it was only a 2" to 3" rise
[22:36] now it's supposedly 8" to 10"
[22:36] <3 the heat
[22:40] i think it has less to do with the heat than it does with governments having less money to deal with the damage from heavy trucks
[22:42] the rise was not there yesterday. at all.
[22:43] which isn't contradictory to what i said
[22:43] and looking at the picture they just updated with, I think the measurement given may be the length of the buckled section rather than the height.
[22:45] roads deteriorate very quickly if they're used by heavy trucks and aren't maintained meticulously
[22:45] it could have changed in that time
[22:47] then why do these large buckling events only seem to happen in very hot weather?
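A minimal sketch of the path-list rework asked for at 20:51, written in Python rather than the GNU tools requested. It assumes each input line is a file path followed by a space and a size, as the 21:00 message describes; the pastee example data was not inspected, so the exact layout is a guess. The function name add_directory_lines and the sample paths are made up for illustration, and the output is not claimed to match kdirstat's cache file format, only the "directory line, then its files" grouping.

```python
import posixpath


def add_directory_lines(lines):
    """Yield each input line, preceded by a directory line whenever
    the file's directory differs from the previous file's directory."""
    previous_dir = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        # Assumption: the last space-separated field is the size and
        # everything before it is the path (which may contain spaces).
        path, _, _size = line.rpartition(" ")
        if not path:
            path = line  # line had no size field at all
        directory = posixpath.dirname(path)
        if directory != previous_dir:
            # Directory header line; "./" stands in for files with no directory.
            yield directory + "/" if directory else "./"
            previous_dir = directory
        yield line  # file line is passed through untouched, size included


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the real wget logs.
    sample = [
        "fileplanet/a/one.zip 10",
        "fileplanet/a/two.zip 20",
        "fileplanet/b/three.zip 30",
    ]
    for out in add_directory_lines(sample):
        print(out)
```

A directory header is emitted every time the directory changes, so unsorted input repeats headers; sorting the list first (for example with sort) keeps each directory's files together under one header.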
[22:55] and by heat today, we're talking 106F today, 102F yesterday, and high 90s on the 4th
[23:04] o_O
[23:04] Here in the UK our main problem is the tarmac cracks due to heavy trucks/buses
[23:04] that itself isn't much of a problem, but then the water gets in, freezes and that destroys everything.
[23:05] While we've never had your kind of heat, I'm wondering if air is somehow getting trapped under there and then heating up...
[23:17] well, yes
[23:18] Coderjoe: that's like saying "why does my pasta only grow larger when the water is hot"
[23:19] it's an odd question
[23:20] roads get damaged more in temperature extremes, when they're not very well maintained
[23:25] http://i.imgur.com/8BKpN.jpg
[23:26] Coderjoe: haha, that is awesome
[23:26] i love that
[23:26] you then admit it has more to do with the temperature extreme than ill maintenance?
[23:27] Coderjoe: i'm saying that roads under loads of heavy trucks will deteriorate very quickly if they're not maintained meticulously, that's all
[23:27] the road in question would have lasted until the next planned maintenance there had the temperatures not gotten so high today. (and I am sure the interior temperature of the blacktop was higher than the air temp of 107F)
[23:28] smaller vehicles don't exert enough axle damage to even merit repairing them
[23:29] according to a study that was done in the US, a 40 ton truck does 10,000 times as much axle damage as a 2-ton car
[23:29] as i recall
[23:30] what does damage to axles have to do with it?
[23:30] Coderjoe: as in damage to the roads transmitted by the axles
[23:34] so roads deteriorate very rapidly unless they are subsidised to well in excess of what trucks actually pay
[23:35] but wat do i no lol
[23:36] i am going to watch stupid videos then go to bed