[00:08] arkhive: you mentioned somewhere in the backlog that some guy was offering to sell his CEDs for $5 or they go to the shredder - had you thought about buying some of them?
[01:04] CEDs?
[01:22] namespace: http://en.wikipedia.org/wiki/Capacitance_Electronic_Disc
[02:06] so i got techcrunch 2006 with no errors, it looks like
[02:06] i looked for only ERROR, with grep -v 404, and only got one 406 error
[02:07] and i checked that file and it doesn't exist
[02:47] TC isn't closing, right?
[02:51] it's not closing
[02:52] i do panic downloads so we at least have a copy in case it just shuts down
[02:52] also these are year dumps
[02:53] so i should have all the 2005 to 2012 articles and images
[02:53] based on url paths anyways
[02:58] ah, cool
[03:06] more hdnation episodes are getting uploaded
[03:06] after these 16 episodes i will have 25 episodes uploaded
[03:06] and that will be all the 2009 episodes
[03:16] * winr4r salutes
[05:26] So am I the only one who sees the bullshit that is infochimps' business plan? Take their twitter data dumps, which violate the twitter TOS about how to package and display tweets
[05:26] also overlaying a license on top of data they do not own
[05:37] I am finding datasets licensed wrong left and right. The question is how to deal with it
[05:43] I can use them all I want in non-public settings and no one would know; building something public off of them opens me up to legal shenanigans
[05:47] At the same time infochimps is smart enough not to re-license free government data
[06:14] does anyone know a channel where i could bs about keeping a website from falling over?
[06:15] Sue:) anything I could help with?
[06:20] hai Aranje
[06:20] i've been working on keeping this one guy's site from falling over
[06:20] yes hello Sue :D
[06:20] something like 5k hps
[06:20] on lilly?
[06:20] no
[06:21] his own server
[06:21] oh :D
[06:21] what kinda thing is he running?
[06:21] porn
[06:21] hot
[06:21] obviously
[06:21] no caching?
[06:21] php-apc + varnish
[06:21] and it's still dumping
[06:21] hmm
[06:21] cache misses?
[06:21] can you profile?
[06:22] very little missing, already profiled
[06:22] it's mostly because he has like 40 plugins
[06:22] :/
[06:22] but whatever
[06:22] oh fuuuu
[06:22] * Aranje pimpslaps
[06:22] 56 active
[06:23] jeez
[06:23] wp-based?
[06:23] or some other cms?
[06:23] lol soon: pornopress
[06:23] wp
[06:24] wp caches pretty hotly
[06:24] but the plugins...
[06:24] it's still getting heavy even with all the caching in place
[06:24] i've got 384 php-fpm children and it wants more
[06:25] shouldn't even need that many wtf
[06:25] how's the db handling?
[06:25] seems fine
[06:25] most of the load is php
[06:25] mm
[06:25] cause sometimes you can drop certain tables into ram
[06:25] xtradb/mariadb
[06:25] good point
[06:25] the big smashy ones
[06:26] the plugins that abuse the db really badly are the logging ones
[06:26] like the malwhatever that blocks "bad clients"
[06:26] e,e
[06:26] it puts weblogs into the db
[06:26] what i was thinking was i would set up haproxy and do some sort of cluster fs on a bunch of vms
[06:26] because php is just killing it
[06:27] yeah it's likely he should have multiple frontends... I mean if you can't get the plugins to work with apc more
[06:27] cause that's likely the issue
[06:27] apc is not a magic bullet, but it's fucking slick
[06:27] yeah
[06:28] yeah, put a haproxy with good session tracking in front and throw another box behind it
[06:28] see if that helps
[06:28] If not... you're back to plugins
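A minimal sketch of the sanity checks being discussed above, before throwing more php-fpm children or front-end boxes at the problem: how much traffic Varnish is actually absorbing, and how many PHP workers are genuinely busy. The counter names and the /status path are assumptions (they depend on the Varnish version and on php-fpm's pm.status_path being enabled), not details taken from the conversation.

    # Hit/miss counters for Varnish; exact names vary by version
    # (cache_hit / cache_miss in 3.x, MAIN.cache_hit / MAIN.cache_miss later).
    varnishstat -1 | grep -E 'cache_(hit|miss)'

    # Count php-fpm children actually running a request, assuming the pool's
    # status page is enabled (pm.status_path = /status) and served locally.
    curl -s 'http://127.0.0.1/status?full' | grep -c 'state:.*Running'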
[06:28] i also need to learn me some varnish again
[06:28] i don't remember how to push cookies through
[06:28] yus
[06:28] mmm
[06:29] dat shit is the gravy
[06:29] i'd rather just disable caching for cookie-based sessions
[06:29] but i know there's a faster way
[06:29] what's his stats on logged-in vs not
[06:29] cause if it's like 85% not-logged-in, fuckem
[06:30] static the entire site and only gen new for logged-in
[06:30] one sec
[06:30] and only every... 5 minutes or something
[06:30] cache long and cache hard
[06:30] :3
[06:31] and make sure dude's got his resources minimized. That won't help with server load much, but it'll make pageloads feel faster
[09:16] I found a new gif to explain what ArchiveTeam does - https://d24w6bsrhbeh9d.cloudfront.net/photo/a2NNAn9_460sa.gif
[09:38] no that actually isn't it
[11:17] jesus christ i hope that's not real
[11:17] or that they're big squishy inflatable gloves or something
[11:59] looks fun.
[11:59] wait, I may be getting confused.
[12:15] yes
[12:18] Damn.
[12:20] Sue: Yeah, multiple front-ends is the way to scale that piece of shit - as long as he doesn't want to make it a sane shop, of course. But scaling out instead of up is usually the way to go.
[12:21] Sue: If you want to keep it simple, just set up a bunch of front-end nodes and do DNS round robin on them. That could work in the meantime, while configuring load-balancers and shit
[14:13] i may need help with my techcrunch.com download
[14:14] my problem is with this: --accept-regex="($year|.jpg|.png|.jpeg|.gif)"
[14:14] year=2007
[14:15] it's downloading articles from like 2008 and stuff
[14:25] all of them? or just some?
[14:25] also shouldn't you be escaping those periods
[14:45] it's just grabbing stuff that i thought it shouldn't
[14:45] i think it's minor now
[14:46] it was like 20mb for all the other years before
[14:47] so there will be some double articles with these grabs i guess
[15:29] https://plus.google.com/u/0/107105551313411539773/posts/awYNoK18Q6p
[16:17] i think i figured out the problem
[16:17] i think i need to add \ before .jpg or something
[16:18] cause gif before was grabbing everything with gift in it
[16:18] *in the url
[16:23] 10 points to winr4r ;)
[17:08] you're welcome!
[18:42] http://youtu.be/3r3BOZ6QQtU
[19:48] looks like there were 2 snapshots a day of techcrunch.com since oct 2011
[19:49] it's still not a bad idea that i do this panic download
[19:50] the year 2007 doesn't have that many snapshots
[20:23] someone might want to look at this: http://tech.slashdot.org/story/13/06/28/176207/googles-blogger-to-delete-all-adult-blogs-that-have-ads?utm_source=rss1.0mainlinkanon&utm_medium=feed
[20:23] godane: we are already on it
[20:23] it came up in discussion in #archiveteam yesterday
[20:23] ok
[20:24] thanks anyway (salutes!)
[21:19] godane: i am hacking a little wget script
[21:19] to download adult blogs
[21:19] AUTH_URL=$(curl -k -s "http://www.blogger.com/blogin.g?blogspotURL=http://blogname..blogspot.fr/" |grep "kd-button-bar" |sed -e "s/.*href=\"//g" -e "s/\" target.*//g")
[21:19] CRAWL_URL=$(curl --cookie-jar cookies.db $AUTH_URL|grep "here." |sed -e "s/.*HREF=\"//g" -e "s/\">.*//g")
[21:20] wget --load-cookies cookies.db -rk -w 1 --random-wait -p -R '*\?*' -l 1 $CRAWL_URL
[21:53] http://pastebin.com/A6YvX14Z better version
[21:56] if you want to do an emergency backup of some site
[21:56] hosted on blogger
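A tidied-up sketch of the script pasted above, based only on the three lines quoted in the log (the pastebin "better version" is not reproduced here). The blog URL is parameterised, the doubled dot in the pasted example is assumed to be a typo, and the grep/sed patterns are kept as posted, so they may need adjusting if Blogger changes its interstitial page.

    #!/bin/bash
    # Emergency backup of a blogspot blog that sits behind Blogger's
    # content-warning interstitial.
    # Usage: ./blogger-backup.sh blogname.blogspot.fr
    BLOG="$1"

    # Pull the confirmation link out of the interstitial's "kd-button-bar" block.
    AUTH_URL=$(curl -k -s "http://www.blogger.com/blogin.g?blogspotURL=http://$BLOG/" \
        | grep "kd-button-bar" \
        | sed -e 's/.*href="//g' -e 's/" target.*//g')

    # Follow it, keeping the session cookie, and extract the final crawl URL.
    CRAWL_URL=$(curl -s --cookie-jar cookies.db "$AUTH_URL" \
        | grep "here." \
        | sed -e 's/.*HREF="//g' -e 's/">.*//g')

    # Mirror one level deep with a polite delay, skipping query-string URLs.
    wget --load-cookies cookies.db -rk -w 1 --random-wait -p -R '*\?*' -l 1 "$CRAWL_URL"

For reference, the --accept-regex fix worked out in the [14:14]-[16:23] exchange looks like this once the dots are escaped. The surrounding wget flags and target URL are not quoted in the log, so this invocation is only a hypothetical illustration of the corrected pattern.

    # Unescaped, ".gif" matches any character followed by "gif", so URLs containing
    # "gift" slip through; escaping the dots restricts it to real file extensions.
    year=2007
    wget --recursive --accept-regex="($year|\.jpg|\.png|\.jpeg|\.gif)" "http://techcrunch.com/"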