[00:06] not me but i have ipv6 connectivity if needed [01:15] SketchCow: 200k. WTF. This community is ridiculous. [02:26] i found some xplay videos [02:26] from 2011 though [02:55] SketchCow around, or is he still knee-deep in doc shooting? [02:59] (And is he still looking for arms for the CHM project?) [03:08] most likely in the air or sleeping [03:11] <_fox> running some warriors [03:11] <_fox> this is p neat [03:19] <_fox> http://diybookscanner.myshopify.com/products/diy-book-scanner-kit [03:23] $595 seems hella steep for that [03:23] erm, bicycle level triggers though ok yeah this is a bit fancier then i thought initially [03:23] god knows i couldnt build it [07:18] so do I want to even ask about http://blog.picplz.com/day/2012/06/01/ [07:24] Not without glancing at http://picplz.heroku.com/ [07:28] oh wonderful [08:07] Aranje: #piczzz [08:08] ersi, thanks! [08:08] and http://archiveteam.org/index.php?title=Picplz of course :) [08:08] no prob [08:08] yep yep, running it already [08:09] trying to fish out exactly what the requirements are for compiling wget warc so I can make lists of things people need installed on various linuxes (and in my particular case, freebsd) [08:09] pending tomorrow though, I want to sleep [08:11] id love to calculate the size of the internet archive per year [08:11] so every active page that was up on january 1st or w/e for each year [08:12] Aranje: => lua 5.1 basically [08:12] so gnutls-dev, lua (headers, or just runtime?) build-essential [08:13] on debian/ubuntu I needed liblua-5.1-dev [08:14] bah freebsd won't let me install gnutls-dev because theirs is vulnerable to 2 different security issues [08:14] awesome [08:14] I use debian, had to apt-get install liblua5.1.0-dev [08:14] ha [08:14] portaudit is cockblocking me [08:14] er [08:14] (like a good app) [08:14] OpenSSL should work too. [08:14] apt-get install liblua5.1.0-dev lua5.1 [08:26] oof [08:26] first line of the script is a doozy [08:26] /bin/bash doesn't exist (bash exists, but it doesn't reside there!) [08:27] yeah, how about not using a bsd machine *trolls* [08:27] Hey mang, it's what I got [08:27] already running the debian machine [08:27] aka easymode [08:28] on the plus side, freebsd needs a grand total of two things [08:28] (assuming the compile works, that is) [08:28] or not. it failed to find a lua, probably because there is no development lua port [12:35] Ooh, I jsut found more Splinder data. Need to run the script over that [13:15] Aranje: there may be other things needed on bsd. I'm pretty sure that the script expects GNU userspace tools that it may use [13:17] mmm [13:18] pretty much all of the lifting is done in lua now. nice. [13:59] here is 1000 more usernames food for the me.com crawlers http://pastebin.com/ZDBY8pf1 [13:59] who has access to memac.heroku.com? [14:00] are you the famous jonas? [14:01] nevermind [14:02] mh ?:D [14:11] alard: ^ [14:22] jjonas: Thanks. I pasted them in here: http://memac.heroku.com/rescue-me (in batches) [14:25] gamespy/IGN so reminds me of yahoo [14:58] today's ovh beta server giveaway is at 21:00 UTC [14:59] yay [15:00] ;) [15:00] as before, you need to follow their twitter account https://twitter.com/#!/OVH [15:01] cant be arsed to find the urls, pick any ovh site. servers -> us flag, then you should see the signup form [15:01] iirc i hammered it starting 2 minutes before the announced time and got through [15:19] Hey, so thing. [15:19] Tomorrow, SF is doing a big fiber outage [15:19] Expected 8-12 hour outage. 
[15:20] Archive.org is working to see about keeping itself up. [15:20] But it might not happen. [15:26] ouch [15:26] O [15:26] : [15:27] Well, if they can keep themselves up, I'm sure we can arrange they'll get some traffic. [15:27] wtf why can't I type anymore :o [15:27] any public details as to why there is going to be a big fiber outage? [15:27] not got redundant linking? [15:29] SmileyG: if it affects multiple providers, redundant links are moot [15:29] .... [15:29] then its not redundant :) [15:29] I mean if the whole AREA is offline yeah sure your screwed. [15:30] But any provider having outages.... that shouldn't effect others. [15:30] affect? :/ [15:32] i thought archive.org had redundant hosting [15:33] shouldn't affect / shouldn't have an effect [15:33] It's an SF thing [15:34] http://www.worldofmule.net/tiki-index.php?page=IBM%20PC [15:34] All I know is, this was a big discussion at the lunch, I wanted you guys to know. [15:35] I was in on that M.U.L.E. thing [15:35] yeah, your name is in some of the writeups [15:35] :) [15:36] i can't seem to find anything about it, but searching for anything these days sucks [15:39] interestingly, "sf internet" on search.twitter.com turns up results for "Eduard Khil has died" [16:02] http://nedbatchelder.com/blog/201201/goodbye_tabblo.html [16:15] Schedule today: Doing a round of cleanup post-trip, sending some e-mails, then down to NYC for a memorial service. [16:31] huh, that's odd timing for the fiber outage tomorrow. It's primaries day here in california. [16:43] http://www.cbc.ca/news/canada/story/2012/06/03/pol-campaign-to-oppose-budget-bill.html [16:46] SketchCow: w.r.t your comment on IUMA pictures, for some reason the thumbnail on the wayback link doesn't work but clicking on it does [16:47] but this isn't that consistent [16:48] I has a sad: http://rt.com/art-and-culture/news/trololo-dead-stroke-stpetersburg-898/ [16:51] It's not consistent, but I think it can be done. [16:51] I want to wait, rope around, fix it up. [17:22] is wget buggy and forgetting to grab files when mirroring? [17:22] i grabbed a site and random images are missing (other images from the same dir were downloaded fine) [17:26] closure inches slowly toward $20k [17:33] <_fox> that's cool [17:35] I need help unpacking a .warc file [17:36] SketchCow: hanzo's warcextract should do what you need [17:36] Do you have the warc publically available [17:37] http://fos.textfiles.com/borscht/ [17:47] grr, netsplut [17:53] there we go, no more netspluts [17:54] SketchCow: I was wrong, warcextract only prints a human readable summary. wondering if using alard's warc-proxy and then running wget against it is the best solution... [17:55] What might be interesting is this: modify warcextract so that it doesn't just print the results, but builds a zip file. [17:55] Find a solution, top priority. [17:55] I'm now getting baragged with angry, unhappy, sad tabblo ex-users [17:55] Tabblo apparently, based on what I'm getting, did a VERY poor and possibly no job informing people of the shutdown. [18:02] httrack sucks [18:03] It does, indeed, suck. [18:04] http://archive.org/~edward/search.php [18:04] Faster search [18:04] also more powerful [18:08] SketchCow: Aren't the zip files interesting to these Tabblo people? [18:08] Give me the search link [18:09] http://archive.org/download/test-memac-index-test/tabblo.html [18:09] Thank you [18:09] (It's a temporary object, will disappear in 30 days.) [18:10] usually takes a bit longer than 30, actually [18:10] They promised 30. 
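[Editor's note: a minimal sketch of the "warcextract, but building a zip" idea SketchCow floats above. It assumes the modern warcio library (pip install warcio), which did not exist at the time; alard's actual warctozip.py, linked a few messages below, is the real implementation and works differently. The helper name warc_to_zip and the entry-naming scheme are illustrative only.]

```python
import zipfile

from warcio.archiveiterator import ArchiveIterator  # assumption: pip install warcio


def warc_to_zip(warc_path, zip_path):
    """Copy the body of every HTTP response in a (gzipped) WARC into a zip."""
    with open(warc_path, 'rb') as stream, \
         zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            uri = record.rec_headers.get_header('WARC-Target-URI')
            # Use the URI (minus scheme) as the zip entry name; give
            # directory-style URIs an explicit filename.
            name = uri.split('://', 1)[-1]
            if name.endswith('/'):
                name += 'index.html'
            zf.writestr(name, record.content_stream().read())


# e.g. warc_to_zip('somefile.warc.gz', 'zipfile.zip')
```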
:) [18:10] and if we can't trust the Internet Archive... [18:11] haha [18:11] Meanwhile, I'm halfway warctozip.py [18:11] actually, pretty sure there's an automatic task for it now [18:11] So actually, need a little hint here [18:11] It looks like the zip files have all the photos. [18:11] But the .warc files do not [18:11] Is that right? [18:13] The .warc.gz files do not contain the original photos, no. [18:13] The .warc.gz files are what you would see if you went to the web page. [18:13] Ah. [18:14] Wow, so JUST the .warc.gz files are 450gb? [18:14] The zip files were a terabyte, I think. [18:14] mommy [18:15] How does one remove "http://" from a string in python? [18:15] ? [18:15] does the normal re module not work [18:16] (note I am python baby, I just have used it before) [18:16] I had hoped you'd just taptap the answer. A bit lazy. :) [18:16] re.sub [18:16] I'd just substring it out [18:16] but I suck :D [18:17] oh, sorry [18:17] I don't know the syntax off the top of my head D: [18:17] hacky way would be to split in 3 parts and take the last [18:17] where: &w_identifier=archiveteam-tabblo* | size: 1,210,601,813 KB [18:17] print url[7:] [18:17] SketchCow: ^ [18:17] something.split('/', 3)[2] [18:18] ersi: that wouldn't deal with https though [18:18] joepie91: yeah, it wouldn't :) [18:18] (note: my split syntax may be shoddy, not that experienced in python yet) [18:18] (but it should theoretically work) [18:18] joepie91: what if you have "http://test.com/a/file/here.txt" [18:18] underscor: the 3 indicates max 3 parts [18:18] wouldn't that only return test.com? [18:18] doh [18:18] so after the first two parts it throws the rest into the 3rd element [18:18] anyhow, urllib module probably has a nice way [18:19] or urllib2 [18:19] urllib40000 [18:19] lol [18:19] seriously though, I'm particularly good at dirty hacks, but I very much doubt whether that's a good skill... [18:19] :| [18:19] * joepie91 blames his PHP background [18:20] I'm trying to figure out if there's a lot of packet loss between me and ia, or between the box I'm sshing through and the target [18:20] >:I [18:20] it's atrocious [18:20] and of course traceroute looks normal [18:20] http://archive.org/search_beta/ is the official beta search. [18:20] alard: you probably already got this, but re.sub(r'^http://', '', str) [18:21] underscor: you -> box with ssh -> ia ? [18:22] me->ia->another ia box [18:22] ok [18:22] SSH into 'ia' and ping 'another ia box' [18:22] if no packet loss, issue is between you and ia [18:22] :P [18:22] https://github.com/alard/warctozip [18:23] Reply from 207.241.224.4: bytes=32 time=175ms TTL=250 [18:23] Reply from 207.241.224.4: bytes=32 time=179ms TTL=250 [18:23] Reply from 207.241.224.4: bytes=32 time=192ms TTL=250 [18:23] Reply from 207.241.224.4: bytes=32 time=227ms TTL=250 [18:23] ./warctozip.py somefile.warc.gz zipfile.zip [18:23] gross [18:23] underscor: mtr is your friend [18:23] Reply from 74.125.228.9: bytes=32 time=34ms TTL=252 [18:23] (google.com) [18:24] hmm, wonder why my latency to ia is so high [18:24] hm, sec [18:25] where's the server you are pinging located physically? [18:25] country or state [18:25] erm [18:25] that's probably not a fair test [18:25] maybe the work started early? 
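[Editor's note: consolidating the scheme-stripping suggestions from the exchange above into one hedged sketch. Python 3 shown; on the Python 2 of the era the import would come from urlparse instead. The test URL is the one underscor used.]

```python
import re
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit

url = "http://test.com/a/file/here.txt"

# alard's regex answer, widened to cover the https case joepie91 raised:
print(re.sub(r'^https?://', '', url))   # test.com/a/file/here.txt

# a scheme-agnostic variant of the split trick:
print(url.split('://', 1)[-1])          # test.com/a/file/here.txt

# the "urllib probably has a nice way" route:
parts = urlsplit(url)
print(parts.netloc + parts.path)        # test.com/a/file/here.txt
```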
[18:25] since google is probably down the street [18:25] and IA is across the country [18:26] Chicago: 4 packets transmitted, 4 received, 0% packet loss, time 3000ms [18:26] rtt min/avg/max/mdev = 57.259/61.906/67.758/3.837 ms [18:26] Atlanta: 4 packets transmitted, 4 received, 0% packet loss, time 2998ms [18:26] rtt min/avg/max/mdev = 62.096/62.531/62.778/0.407 ms [18:26] 64 bytes from 207.241.224.4: icmp_req=1 ttl=51 time=179 ms [18:26] PING 207.241.224.4 (207.241.224.4) 56(84) bytes of data. [18:26] 64 bytes from 207.241.224.4: icmp_req=2 ttl=51 time=185 ms [18:26] --- 207.241.224.4 ping statistics --- [18:26] 64 bytes from 207.241.224.4: icmp_req=3 ttl=51 time=186 ms [18:26] 3 packets transmitted, 3 received, 0% packet loss, time 2002ms [18:26] rtt min/avg/max/mdev = 179.598/183.757/186.218/2.957 ms [18:26] thats from the UK. [18:26] OKAY STOP DOING THIS [18:26] OKAY. STOP. DOING. THIS. [18:27] :) [18:27] Phoenix: rtt min/avg/max/mdev = 30.298/30.862/31.170/0.356 ms [18:27] * SmileyG has stopped [18:27] seems there's nothing wrong with that server [18:27] Oh my god, so aspy [18:27] SketchCow: :D [18:27] Was I right, is it due to the work>? [18:27] DOES IT MATTER? [18:27] I dunno? [18:27] btw why caps? [18:27] I guarantee you 5 fulltime people RIGHT NOW are working VERY HARD on network, if there's any. [18:28] Cool, Any plans for a UK based DC? ;) [18:28] ...hahahaha [18:28] Because aspy posting of network program output deserves caps, chloroform and a dump in the river [18:28] dude so much hate :( [18:28] You don't even KNOW hate [18:28] i didn't paste anything btw, exec -o ftw [18:28] If you knew hate, I'd be in your room right now [18:28] Hm. [18:29] this has gone wildly off topic, shall we go to -bs? [18:29] watch out, SmileyG. you're playing with fire [18:29] -bs? [18:29] and SketchCowis a canister of condensed propane [18:29] archiveteam-bs [18:29] SketchCow is* [18:29] underscor: he seems sane enough to me. [18:29] offtopic channel? [18:29] Yes [18:29] yes [18:29] wtf I'm already there [18:29] see /topic [18:29] * joepie91 stares at self [18:29] it's so offtopic, you're already there. [18:29] Archive Team: You're already there [18:29] lol [18:29] lol [18:30] Archive Team: The Downloading Is Coming From Inside Your Servers [18:30] http://ia601202.us.archive.org/3/items/test-memac-index-test/tabblo.html has saved my fucking bacon. [18:30] Because OH MY GOD did Hp really fuck up the Tabblo thing. [18:31] seems to be an HP motif [18:31] ^ [18:31] It appears in the rush to shut it down, they really didn't do a good job of mailing out notifies. [18:31] SketchCow: are people twittering/emailing you about it? [18:31] Fucking your shit up, so you don't have it? [18:34] SketchCow: Before you start linking that url, should we make a permanent one? [18:34] he should just be able to pop it out of the test-items collection [18:34] (which will remove the auto-purge) [18:35] The name isn't really good. [18:35] SketchCow: I suspect someone at HP noticed 'oh fuck, those servers are still running.. quick, shut them down before the boss notices!' [18:36] he can rename too [18:36] I am sure it's actually because HP is doing a round of cost-cutting and layoffs for a different reason. [18:39] Axing it axing it axing it - cause my CEO told me soo [18:39] I have to go now - driving to NYC to take people to a memorial service [18:39] I didn't know her, but they're very broken up about it, so it'll be a tough night. [18:39] When I get back, I'd like to work on some stuff, it'll be late. 
[18:40] Good luck with the driving and attending memorial service [18:40] This tabblo search thing, jesus that saves a life [18:40] Freggin' great work [18:49] 10 minutes until ovh beta server giveaway [18:56] :o [18:56] 2 minutes [18:57] link? [18:57] aka start now [18:57] http://www.ovh.com/fr/serveurs_dedies/commande_usa_beta.xml [18:57] you need to follow them already [18:57] I do [18:57] ignore the message about the servers being gone [18:58] keeping trying [18:58] would anyone have a use for a parser that parses a load of .eml files, and generates an attachment directory + sqlite database of all of them, plus can optionally render the entire database into a bunch of static HTML including sorting by several fields? [18:58] Weren't people having serious issues with OVH earlier? [18:58] what is Désolé, nous avons atteint la limite des serveurs disponibles aujourd'hui. Revenez demain? [18:58] lol [18:58] i suck [18:58] utc 2100 is 2 hours away [18:58] sorry [18:58] damn [18:58] hahaha [18:58] np [18:58] * joepie91 feels like a captcha monkey [18:58] until then, go find the link on ovh.co.uk or so ;) [18:58] underscor: They're out of servers [18:59] ah [18:59] Holy fuck I spent literally two hours studying French and read that; +1 baller points to me [18:59] Anyway, weren't people complaining about OVH last giveaway? [19:03] Thank you, thank you, thank you. You have saved my memories and all my hard work of putting them together. Please thank everyone on your team from the bottom of my heart. You understand something that HP can never understand, the most important thing in life are the people in it. We are not just users or customers adding to your bottom line, but we are people. We are so sad to see our memories just wiped away with a switch. [19:03] Thank you so much for doing this us. [19:03] Wow. That's really touching [19:03] It's things like that that make all the effort worth it [19:04] yah [19:06] i dont get why companies dont just make such sites static [19:07] $ [19:10] SketchCow: you should xpost that to collections :) [19:11] picplz estimate 2.5TB [19:16] awesome job on virtual machine, thanks, that makes it really easy to contribute [19:20] I don't think the test collection cleanup is very aggressive, I have stuff that's been in there since february [19:21] okay, what the hell just happened [19:21] my router completely shit itself :| [19:22] what I said and probably didn't arrive: meh, anyhow, if anyone wants said email parser: git clone http://git.cryto.net/repo/projects/joepie91/emailparser/ [19:23] re: Tabblo: http://thenextweb.com/insider/2012/06/03/startups-should-bend-over-backwards-to-let-users-take-their-data-after-they-shut-down/ [19:23] (and Picplz) [19:24] the comment by "Mike Post" is really weirf [19:24] weird [19:31] SketchCow: was there anything not already archived in the geocities stuff i sent? [19:34] yipdw: sounds like the average 'you have to pay for the air you breathe' guy... [19:37] anyhow, are programmers currently needed to write crawlers or anything? [19:40] joepie91: for picplz? no [19:40] joepie91: #piczzz [19:40] just in general :P [19:41] * joepie91 enjoys crawling, parsing, etc [19:41] oh [19:42] in general, yes [19:42] please do check out the *-grab repositories on ArchiveTeam's github for AT conventions [19:42] re: file format, reporting, etc [19:45] alright, will have a read soon [19:45] is there any specific document that lists the whole thing, by any chance? or best to just try to derive it from the repos? 
[19:46] none yet [19:46] I guess documenting archival standards would be a good th ing [19:47] what? archival standards? [19:47] best practices for ArchiveTeam projects, yes [19:48] as far as I know, we have no such document [19:48] do we even have best practises? [19:48] that's news to me :) [19:49] if you look at the *-grab repositories, there are patterns and conventions [19:49] and I think usage of WARC is a best practice [19:49] well, besides A) doing is more important than thinking B) BE ON A SANE FILESYSTEM, YOU FUCK C) TIMESTAMPS AND SHIT! [19:49] yeah, WARC all the way baby [19:49] the point is to codify them so that people who want to write software to do crawling have a place to start [19:50] also, that assists with point (A) [19:50] ye [19:50] True enough, I guess [19:51] I don't know how important (B) is at this point [19:51] it *was* important before we had a standard format [19:51] rephrased -- it was important when there was no standard format [19:52] I don't know if it still is [19:52] but, yes, the idea is to hash all that sort of stuff out [19:52] It was *very* important before we started doing WARC [19:52] that's what I wrote [19:52] as well as C) - but WARC takes care of that as well [19:53] let me fucking rephrase it then [19:53] "It's not as important now, because of exactly what you pointed out a few lines up" [19:53] Anyhow, it's all good and yes - it's a good idea to write it down, good idea sir [19:54] (1) there could be other factors that rely on filesystem semantics that I haven't considered [19:54] (2) calm down [19:55] http://i.qkme.me/35et2u.jpg [20:57] ovh giveaway in 2 minutes [20:59] I got one! [20:59] yay [21:00] wheep wheep [21:02] ovh? [21:02] server company [21:05] Do we really need to link people to Google in here? I'll leave this right here in case anyone has something they don't know and is curious to find out more about: http://google.com (Hint: It's a search engine) :D [21:09] lol [21:32] oh goddamnit [21:32] missed the giveaway [21:33] ffs [21:45] joepie91: there's always tomorrow :D [21:46] it's just a bit aggravating, because the *reason* for missing it is a CERTAIN journalist that has been spreading bullshit [21:46] so it's basically someone that I was already mad at, contacting me at this exact point [21:46] making me miss the OVH giveaway on top of that [21:46] :| [21:47] also, pm [21:51] ovh giveaway eh [21:56] found it hm [22:25] holy crap that index thing for tabblo is good [22:37] hmmm [22:37] I have no idea what tabblo is :S [22:37] Do I suck? [22:41] nope- it's hard to keep up with all the websites shutting down or shutdown these days [22:50] http://owely.com/4cYUBD [23:00] oh sweet, since we made the picplz tracker faster, it looks like fetches have sped up [23:01] judging from the graph [23:27] anyone familiar with tracking down processes on a linux box that are using up lots of cpu but are not obvious with top or iotop? [23:30] why wouldn't the process in question show up? [23:33] have you tried tree view? 
(so you can see parent & child processes easily) [23:33] I like htop a lot as a better top [23:35] chronomex: i don't know, and dashcloud yes i have (with htop) [23:35] odd [23:35] 09:05:27 up 55 days, 15:47, 1 user, load average: 2.64, 2.75, 2.54 [23:35] load has been 2 for days [23:35] the cpu has 4 cores/8 threads so its not killing the box [23:35] its just annoying not knowing what it is [23:36] if you're running a desktop, make sure you check the plugin-container (or the actual plugin) should you have a web browser running [23:36] it's a server [23:37] have you tried sorting it by time? maybe your mystery process will float to the top if it's using a lot of CPU time [23:38] htop [23:38] also cumulative time including dead children [23:38] wrong window my bad [23:38] chronomex: how do i get that? [23:39] in top, press capital-S [23:39] i tried by time but couldnt see anything, at first i thought it was a minecraft server inside a VPS but i stopped the entire container and its still sitting at 2 [23:41] http://pastie.org/private/lkjsanhur7vht3rwrvyvqg [23:41] load: 50.81 50.23 51.95 49/435 18272 [23:41] does that pastie help? :/ [23:41] that's by time and cumulative [23:44] oli: nothing looks wrong [23:44] Cpu(s): 0.6%us, 0.8%sy, 0.0%ni, 97.1%id, 1.5%wa, 0.0%hi, 0.0%si, 0.0%st [23:44] ohi oli :) [23:44] 97% idle, what more do you want [23:45] so why is load sitting at average >2 [23:45] and has been for days [23:45] looks like you've got two processes in loops that are continually begging for cpu time and then yielding it immediately [23:45] dunno why they would do that [23:46] strange [23:46] iirc, loadavg is related to processes waiting on io in some way [23:47] then iotop should show it? [23:47] htmm [23:47] 809 be/3 root 0.00 B/s 15.19 K/s 0.00 % 28.53 % [jbd2/sda4-8] [23:47] i would think it would [23:47] ext4 journal [23:47] or some shit [23:47] I wouldn't worry about it much if your system is still responsive [23:47] but then, I'm the 3rd worst sysadmin in history [23:47] 3rd worst [23:48] ??? [23:48] ok i think its ext4, might reboot with noatime option next [23:48] one of the mysqldbs is working a bit maybe that is causing the journalling to be retarded [23:49] and yes its still responsive so dont care that much chronomex [23:49] :) [23:50] i have a warrior question. is it expecting to get 80gb to itself or is that reset to something else on startup? [23:51] If you any of you guys ever needs 3 brand new Que Super Disk 240MB Floppies, sealed new in packaging for whatever crazy ass reason, let me know [23:52] And on the flip side of that, I've also got the drive that can read them
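[Editor's note: chronomex's hunch is the usual answer here: on Linux the load average also counts processes in uninterruptible sleep ("D" state, e.g. the [jbd2/sda4-8] journal thread iotop showed) even though they use no CPU, which is how a box can sit at load ~2 while 97% idle. A small, hypothetical helper that lists such processes by reading /proc, for anyone chasing the same mystery:]

```python
#!/usr/bin/env python
# List processes currently in uninterruptible sleep ("D" state); these add to
# the Linux load average even when top shows the CPU as almost entirely idle.
import os

for pid in sorted((p for p in os.listdir('/proc') if p.isdigit()), key=int):
    try:
        with open('/proc/%s/stat' % pid) as f:
            stat = f.read()
        # The command name sits in parentheses and may contain spaces, so read
        # the state character from just after the closing parenthesis.
        state = stat[stat.rindex(')') + 2]
        if state == 'D':
            comm = stat[stat.index('(') + 1:stat.rindex(')')]
            print('%s %s' % (pid, comm))
    except (IOError, OSError):
        pass  # the process exited while we were reading it
```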