#archiveteam-bs 2013-06-28,Fri


Time Nickname Message
00:08 🔗 dashcloud arkhive: you mentioned somewhere in the backlog that some guy was offering to sell his CEDs for $5 or they go to the shredder- had you thought about buying some of them?
01:04 🔗 namespace CED's?
01:22 🔗 underscor namespace: http://en.wikipedia.org/wiki/Capacitance_Electronic_Disc
02:06 🔗 godane so i got techcrunch 2006 with no error it looks like
02:06 🔗 godane i looked for only ERROR and grep -v 404 and only got one 406 error
02:07 🔗 godane and i checked that file and it doesn't exist
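
The check godane describes presumably boils down to a pair of greps over the wget log; a minimal sketch, with the log filename assumed:

    # show lines flagged ERROR, ignoring the expected 404s
    grep ERROR techcrunch-2006-wget.log | grep -v 404
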
02:47 🔗 winr4r TC isn't closing, right
02:51 🔗 godane it's not closing
02:52 🔗 godane i do panic downloads so we at least have a copy in case it just shuts down
02:52 🔗 godane also these are year dumps
02:53 🔗 godane so i should have all the 2005 to 2012 articles and images
02:53 🔗 godane based on url paths anyways
02:58 🔗 winr4r ah, cool
03:06 🔗 godane more hdnation episodes are getting uploaded
03:06 🔗 godane after these 16 episodes i will have 25 episodes uploaded
03:06 🔗 godane and that will be all 2009 episodes
03:16 🔗 * winr4r salutes
05:26 🔗 omf_ So am I the only one who sees the bullshit that is infochimps' business plan? Take their twitter data dumps, which violate the twitter TOS about how to package and display tweets
05:26 🔗 omf_ also overlaying a license on top of data they do not own
05:37 🔗 omf_ I am finding datasets licensed wrong left and right. The question is how to deal with it
05:43 🔗 omf_ I can use them all I want in non-public settings and no one would know, building something public off of them opens me up to legal shenanigans
05:47 🔗 omf_ At the same time infochimps is smart enough to not re-license free government data
06:14 🔗 Sue does anyone know a channel where i could bs about keeping a website from falling over?
06:15 🔗 Aranje Sue:) anything I could help with?
06:20 🔗 Sue hai Aranje
06:20 🔗 Sue i've been working on keeping this one guys site from falling over
06:20 🔗 Aranje yes hello Sue :D
06:20 🔗 Sue something like 5k hps
06:20 🔗 Aranje on lilly?
06:20 🔗 Sue no
06:21 🔗 Sue his own server
06:21 🔗 Aranje oh :D
06:21 🔗 Aranje what kinda thing is he running?
06:21 🔗 Sue porn
06:21 🔗 Aranje hot
06:21 🔗 Sue obviously
06:21 🔗 Aranje no caching?
06:21 🔗 Sue php-apc + varnish
06:21 🔗 Sue and it's still dumping
06:21 🔗 Aranje hmm
06:21 🔗 Aranje cache misses?
06:21 🔗 Aranje can you profile?
06:22 🔗 Sue very little missing, already profiled
06:22 🔗 Sue it's mostly because he has like 40 plugins
06:22 🔗 Aranje :/
06:22 🔗 Sue but whatever
06:22 🔗 Aranje oh fuuuu
06:22 🔗 * Aranje pimpslaps
06:22 🔗 Sue 56 active
06:23 🔗 Aranje jeez
06:23 🔗 Aranje wp-based?
06:23 🔗 Aranje or some other cms?
06:23 🔗 Aranje lol soon: pornopress
06:23 🔗 Sue wp
06:24 🔗 Aranje wp caches pretty hotly
06:24 🔗 Aranje but the plugins...
06:24 🔗 Sue it's still getting heavy even with all the caching in place
06:24 🔗 Sue i've got 384 php-fpm children and it wants more
06:25 🔗 Aranje shouldn't even need that many wtf
06:25 🔗 Aranje how's the db handling?
06:25 🔗 Sue seems fine
06:25 🔗 Sue most of the load is php
06:25 🔗 Aranje mm
06:25 🔗 Aranje cause sometimes you can drop certain tables into ram
06:25 🔗 Sue xtradb/mariadb
06:25 🔗 Sue good point
06:25 🔗 Aranje the big smashy ones
06:26 🔗 Aranje the plugins that abuse db really bad are logging
06:26 🔗 Aranje like the malwhatever that blocks "bad clients"
06:26 🔗 Aranje e,e
06:26 🔗 Aranje it puts weblogs into db
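
A hypothetical example of Aranje's "drop certain tables into ram" idea, assuming one of the logging plugins writes to a table called wp_bad_client_log (a made-up name). MEMORY tables are emptied on restart and cannot hold TEXT/BLOB columns, so this only suits throwaway log-style data:

    -- move the plugin's log table onto MariaDB's in-memory engine
    ALTER TABLE wp_bad_client_log ENGINE=MEMORY;
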
06:26 🔗 Sue what i was thinking was i would set up haproxy and do some sort of cluster fs on a bunch of vms
06:26 🔗 Sue because php is just killing it
06:27 🔗 Aranje yeah it's likely he should have multiple frontends... I mean if you can't get the plugins to work with apc more
06:27 🔗 Aranje cause that's likely the issue
06:27 🔗 Aranje apc is not a magic bullet, but it's fucking slick
06:27 🔗 Sue yeah
06:28 🔗 Aranje yeah, put a haproxy with good session tracking in front and throw another box behind it
06:28 🔗 Aranje see if that helps
06:28 🔗 Aranje If not... you're back to plugins
06:28 🔗 Sue i also need to learn me some varnish again
06:28 🔗 Sue i don't remember how to push cookies through
06:28 🔗 Aranje yus
06:28 🔗 Aranje mmm
06:29 🔗 Aranje dat shit is the gravy
06:29 🔗 Sue i'd rather just disable caching for cookie based sessions
06:29 🔗 Sue but i know there's a faster way
06:29 🔗 Aranje whats his stats on logged-in vs not
06:29 🔗 Aranje cause if it's like 85% not-logged-in, fuckem
06:30 🔗 Aranje static the entire site and only gen new for logged-in
06:30 🔗 Sue one sec
06:30 🔗 Aranje and only every... 5 minutes or something
06:30 🔗 Aranje cache long and cache hard
06:30 🔗 Aranje :3
06:31 🔗 Aranje and make sure dude's got his resources minimized. That won't help with server load much, but it'll make pageloads feel faster
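
Sue's question about pushing cookies through Varnish usually comes down to a few lines of VCL. A Varnish 3-era sketch, assuming stock WordPress cookie names: logged-in users bypass the cache, everyone else has cookies stripped so their pages can be cached for a few minutes, per the "cache long and cache hard" suggestion:

    sub vcl_recv {
        # logged-in users and comment authors always hit the backend
        if (req.http.Cookie ~ "wordpress_logged_in_|comment_author_") {
            return (pass);
        }
        # anonymous visitors: drop cookies so the request is cacheable
        unset req.http.Cookie;
    }

    sub vcl_fetch {
        # cache anonymous pages for five minutes
        if (!beresp.http.Set-Cookie) {
            set beresp.ttl = 5m;
        }
    }
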
09:16 🔗 omf_ I found a new gif to explain what ArchiveTeam does - https://d24w6bsrhbeh9d.cloudfront.net/photo/a2NNAn9_460sa.gif
09:38 🔗 xmc no that actually isn't it
11:17 🔗 winr4r jesus christ i hope that's not real
11:17 🔗 winr4r or that they're big squishy inflatable gloves or something
11:59 🔗 Baljem looks fun.
11:59 🔗 Baljem wait, I may be getting confused.
12:15 🔗 winr4r yes
12:18 🔗 ersi Damn.
12:20 🔗 ersi Sue: Yeah, multiple front-ends is the way to scale that piece of shit - as long as he doesn't want to make it a sane shop, of course. But scaling out instead of up is usually the way to go.
12:21 🔗 ersi Sue: If you want to keep it simple, just setup a bunch of front-end nodes and do DNS round robin on them. That could work in the meanwhile, while configing load-balancers and shit
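
A minimal haproxy sketch of the multiple-front-end setup ersi and Aranje describe; the backend addresses are made up, and the cookie line provides the session stickiness Aranje mentioned:

    defaults
        mode http
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend www
        bind *:80
        default_backend wp_nodes

    backend wp_nodes
        balance roundrobin
        cookie SRV insert indirect nocache
        server wp1 10.0.0.11:80 cookie wp1 check
        server wp2 10.0.0.12:80 cookie wp2 check
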
14:13 🔗 godane i may need help with my techcrunch.com download
14:14 🔗 godane my problem is with this: --accept-regex="($year|.jpg|.png|.jpeg|.gif)"
14:14 🔗 godane year=2007
14:15 🔗 godane it's downloading articles from like 2008 and stuff
14:25 🔗 winr4r all of them? or just some?
14:25 🔗 winr4r also shouldn't you be escaping those periods
14:45 🔗 godane it's just grabbing stuff that i thought it shouldn't
14:45 🔗 godane i think its minor now
14:46 🔗 godane it was like 20mb for all the other years before
14:47 🔗 godane so there will be some double articles with these grabs i guess
15:29 🔗 SmileyG https://plus.google.com/u/0/107105551313411539773/posts/awYNoK18Q6p
16:17 🔗 godane i think i figured out the problem
16:17 🔗 godane i think i need to add \ before .jpg or something
16:18 🔗 godane cause gif before was grabbing everything with gift in it
16:18 🔗 godane *in the url
16:23 🔗 Baljem 10 points to winr4r ;)
17:08 🔗 winr4r you're welcome!
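
For reference, the fixed filter just needs the dots escaped, as winr4r suggested; unescaped, "." matches any character, so ".gif" also matches URLs containing "gift". The rest of the wget invocation here is assumed, since only the option appears in the log:

    year=2007
    wget --recursive --accept-regex="($year|\.jpg|\.png|\.jpeg|\.gif)" \
        "http://techcrunch.com/"
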
18:42 🔗 Ravenloft http://youtu.be/3r3BOZ6QQtU
19:48 🔗 godane looks like there were 2 snapshots a day of techcrunch.com since oct 2011
19:49 🔗 godane it's still not a bad idea that i do this panic download
19:50 🔗 godane 2007 year doesn't have that many snapshots
20:23 🔗 godane someone maybe want to look at this: http://tech.slashdot.org/story/13/06/28/176207/googles-blogger-to-delete-all-adult-blogs-that-have-ads?utm_source=rss1.0mainlinkanon&utm_medium=feed
20:23 🔗 SmileyG godane: we are already on it
20:23 🔗 SmileyG it came up in discussion in #archiveteam yesterday
20:23 🔗 godane ok
20:24 🔗 SmileyG thanks anyway (salutes!)
21:19 🔗 nico____ godane: i am hacking a little wget script
21:19 🔗 nico____ to download adult blog
21:19 🔗 nico____ AUTH_URL=$(curl -k -s "http://www.blogger.com/blogin.g?blogspotURL=http://blogname..blogspot.fr/" |grep "kd-button-bar" |sed -e "s/.*href=\"//g" -e "s/\" target.*//g")
21:19 🔗 nico____ CRAWL_URL=$(curl --cookie-jar cookies.db $AUTH_URL|grep "here</A>." |sed -e "s/.*HREF=\"//g" -e "s/\">.*//g")
21:20 🔗 nico____ wget --load-cookies cookies.db -rk -w 1 --random-wait -p -R '*\?*' -l 1 $CRAWL_URL
21:53 🔗 nico____ http://pastebin.com/A6YvX14Z better version
21:56 🔗 nico_32 if you want to do an emergency backup of some site
21:56 🔗 nico_32 hosted on blogger
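
A lightly commented reading of the three lines above (the pastebin is nico____'s own improved version); "blogname" is a placeholder for the blog being saved:

    #!/bin/bash
    BLOG="http://blogname.blogspot.fr/"

    # 1. hit Blogger's adult-content interstitial and pull the confirmation
    #    link out of the "kd-button-bar" block
    AUTH_URL=$(curl -k -s "http://www.blogger.com/blogin.g?blogspotURL=$BLOG" \
        | grep "kd-button-bar" | sed -e "s/.*href=\"//g" -e "s/\" target.*//g")

    # 2. follow it, saving the session cookies, and grab the final crawl URL
    #    from the "click here" link
    CRAWL_URL=$(curl --cookie-jar cookies.db "$AUTH_URL" \
        | grep "here</A>." | sed -e "s/.*HREF=\"//g" -e "s/\">.*//g")

    # 3. mirror one level of the blog with those cookies, politely rate-limited,
    #    skipping query-string URLs
    wget --load-cookies cookies.db -rk -w 1 --random-wait -p -R '*\?*' -l 1 "$CRAWL_URL"
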
