Time | Nickname | Message
00:08 | dashcloud | arkhive: you mentioned somewhere in the backlog that some guy was offering to sell his CEDs for $5 or they go to the shredder - had you thought about buying some of them?
01:04 | namespace | CEDs?
01:22 | underscor | namespace: http://en.wikipedia.org/wiki/Capacitance_Electronic_Disc
02:06 | godane | so i got techcrunch 2006 with no errors, it looks like
02:06 | godane | i looked for only ERROR and grep -v 404 and only got one 406 error
02:07 | godane | and that file i checked and it doesn't exist
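A minimal sketch of the check godane describes, assuming the crawl output was saved to a log file (the filename here is hypothetical):

    # show every logged error except the 404s
    grep "ERROR" techcrunch-2006.log | grep -v "404"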
02:47 | winr4r | TC isn't closing, right
02:51 | godane | it's not closing
02:52 | godane | i do panic downloads so we at least have a copy in case it just shuts down
02:52 | godane | also these are year dumps
02:53 | godane | so i should have all the 2005 to 2012 articles and images
02:53 | godane | based on url paths anyways
02:58 | winr4r | ah, cool
03:06 | godane | more hdnation episodes are getting uploaded
03:06 | godane | after these 16 episodes i will have 25 episodes uploaded
03:06 | godane | and that will be all 2009 episodes
03:16 | * | winr4r salutes
05:26 | omf_ | So am I the only one who sees the bullshit that is infochimps' business plan? Take their twitter data dumps, which violate the twitter TOS about how to package and display tweets
05:26 | omf_ | also overlaying a license on top of data they do not own
05:37 | omf_ | I am finding datasets licensed wrong left and right. The question is how to deal with it
05:43 | omf_ | I can use them all I want in non-public settings and no one would know; building something public off of them opens me up to legal shenanigans
05:47 | omf_ | At the same time infochimps is smart enough to not re-license free government data
06:14 | Sue | does anyone know a channel where i could bs about keeping a website from falling over?
06:15 | Aranje | Sue:) anything I could help with?
06:20 | Sue | hai Aranje
06:20 | Sue | i've been working on keeping this one guy's site from falling over
06:20 | Aranje | yes hello Sue :D
06:20 | Sue | something like 5k hps
06:20 | Aranje | on lilly?
06:20 | Sue | no
06:21 | Sue | his own server
06:21 | Aranje | oh :D
06:21 | Aranje | what kinda thing is he running?
06:21 | Sue | porn
06:21 | Aranje | hot
06:21 | Sue | obviously
06:21 | Aranje | no caching?
06:21 | Sue | php-apc + varnish
06:21 | Sue | and it's still dumping
06:21 | Aranje | hmm
06:21 | Aranje | cache misses?
06:22 | Aranje | can you profile?
06:22 | Sue | very little missing, already profiled
06:22 | Sue | it's mostly because he has like 40 plugins
06:22 | Aranje | :/
06:22 | Sue | but whatever
06:22 | Aranje | oh fuuuu
06:22 | * | Aranje pimpslaps
06:23 | Sue | 56 active
06:23 | Aranje | jeez
06:23 | Aranje | wp-based?
06:23 | Aranje | or some other cms?
06:23 | Aranje | lol soon: pornopress
06:24 | Sue | wp
06:24 | Aranje | wp caches pretty hotly
06:24 | Aranje | but the plugins...
06:24 | Sue | it's still getting heavy even with all the caching in place
06:25 | Sue | i've got 384 php-fpm children and it wants more
06:25 | Aranje | shouldn't even need that many wtf
06:25 | Aranje | how's the db handling?
06:25 | Sue | seems fine
06:25 | Sue | most of the load is php
06:25 | Aranje | mm
06:25 | Aranje | cause sometimes you can drop certain tables into ram
06:25 | Sue | xtradb/mariadb
06:25 | Sue | good point
06:26 | Aranje | the big smashy ones
06:26 | Aranje | the plugins that abuse the db really badly are the logging ones
06:26 | Aranje | like the malwhatever that blocks "bad clients"
06:26 | Aranje | e,e
06:26 | Aranje | it puts weblogs into db
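A hypothetical sketch of Aranje's RAM-table idea on MariaDB/XtraDB; the table name is invented, and note that the MEMORY engine loses its contents on restart and can't hold TEXT/BLOB columns, so it only suits disposable data like plugin logs:

    # move a rebuildable logging table into RAM (hypothetical table name)
    mysql wordpress -e "ALTER TABLE wp_plugin_log ENGINE=MEMORY;"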
06:26 | Sue | what i was thinking was i would set up haproxy and do some sort of cluster fs on a bunch of vms
06:27 | Sue | because php is just killing it
06:27 | Aranje | yeah it's likely he should have multiple frontends... I mean if you can't get the plugins to work with apc more
06:27 | Aranje | cause that's likely the issue
06:27 | Aranje | apc is not a magic bullet, but it's fucking slick
06:28 | Sue | yeah
06:28 | Aranje | yeah, put a haproxy with good session tracking in front and throw another box behind it
06:28 | Aranje | see if that helps
06:28 | Aranje | If not... you're back to plugins
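A minimal haproxy sketch of the setup Aranje suggests, with cookie-based stickiness so each session keeps hitting the same PHP box; the backend names and addresses are placeholders:

    # append a sticky backend pool to the haproxy config (addresses invented)
    cat >> /etc/haproxy/haproxy.cfg <<'CFG'
    backend php_pool
        balance roundrobin
        cookie SRV insert indirect nocache
        server web1 192.0.2.10:80 check cookie web1
        server web2 192.0.2.11:80 check cookie web2
    CFG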
06:28 | Sue | i also need to learn me some varnish again
06:28 | Sue | i don't remember how to push cookies through
06:28 | Aranje | yus
06:29 | Aranje | mmm
06:29 | Aranje | dat shit is the gravy
06:29 | Sue | i'd rather just disable caching for cookie-based sessions
06:29 | Sue | but i know there's a faster way
06:29 | Aranje | what's his stats on logged-in vs not
06:30 | Aranje | cause if it's like 85% not-logged-in, fuckem
06:30 | Aranje | static the entire site and only gen new for logged-in
06:30 | Sue | one sec
06:30 | Aranje | and only every... 5 minutes or something
06:30 | Aranje | cache long and cache hard
06:30 | Aranje | :3
06:31 | Aranje | and make sure dude's got his resources minimized. That won't help with server load much, but it'll make pageloads feel faster
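One way to wire up what Sue and Aranje are circling around, as a sketch only: pass logged-in users (WordPress's standard wordpress_logged_in cookie) straight to the backend, strip cookies from everyone else so their pages become cacheable, and hold cacheable responses for the five minutes Aranje suggests. Varnish 3-era VCL, appended via a heredoc:

    cat >> /etc/varnish/default.vcl <<'VCL'
    sub vcl_recv {
        if (req.http.Cookie ~ "wordpress_logged_in") {
            return (pass);         # logged-in users always hit the backend
        }
        unset req.http.Cookie;     # anonymous visitors become cacheable
    }
    sub vcl_fetch {
        if (!beresp.http.Set-Cookie) {
            set beresp.ttl = 5m;   # "cache long and cache hard"
        }
    }
    VCL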
09:16 | omf_ | I found a new gif to explain what ArchiveTeam does - https://d24w6bsrhbeh9d.cloudfront.net/photo/a2NNAn9_460sa.gif
09:38 | xmc | no that actually isn't it
11:17 | winr4r | jesus christ i hope that's not real
11:17 | winr4r | or that they're big squishy inflatable gloves or something
11:59 | Baljem | looks fun.
11:59 | Baljem | wait, I may be getting confused.
12:15 | winr4r | yes
12:18 | ersi | Damn.
12:20 | ersi | Sue: Yeah, multiple front-ends is the way to scale that piece of shit - as long as he doesn't want to make it a sane shop, of course. But scaling out instead of up is usually the way to go.
12:21 | ersi | Sue: If you want to keep it simple, just set up a bunch of front-end nodes and do DNS round robin on them. That could work in the meanwhile, while configuring load-balancers and shit
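ersi's interim trick is just several A records on one name; clients spread across them until the real load balancer is configured. A hypothetical zone-file fragment (file path, names, TTLs and addresses all invented):

    cat >> /var/named/example.com.zone <<'ZONE'
    www  300  IN  A  192.0.2.10
    www  300  IN  A  192.0.2.11
    www  300  IN  A  192.0.2.12
    ZONE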
14:13 | godane | i may need help with my techcrunch.com download
14:14 | godane | my problem is with this: --accept-regex="($year|.jpg|.png|.jpeg|.gif)"
14:14 | godane | year=2007
14:15 | godane | it's downloading articles from like 2008 and stuff
14:25 | winr4r | all of them? or just some?
14:25 | winr4r | also shouldn't you be escaping those periods
14:45 | godane | it's just grabbing stuff that i thought it shouldn't
14:45 | godane | i think it's minor now
14:46 | godane | it was like 20mb for all the other years before
14:47 | godane | so there will be some double articles with these grabs i guess
15:29 | SmileyG | https://plus.google.com/u/0/107105551313411539773/posts/awYNoK18Q6p
16:17 | godane | i think i figured out the problem
16:17 | godane | i think i need to add \ before .jpg or something
16:18 | godane | cause gif before was grabbing everything that had gift in it
16:18 | godane | *in the url
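Putting winr4r's and godane's conclusions together, the escaped pattern would look something like this sketch (wget's --accept-regex takes a POSIX regex by default; the surrounding options are assumed, and anchoring the extensions with $ is what stops .gif from also matching URLs containing "gift"):

    year=2007
    wget --recursive \
         --accept-regex="(/$year/|\.jpg$|\.jpeg$|\.png$|\.gif$)" \
         http://techcrunch.com/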
16:23 | Baljem | 10 points to winr4r ;)
17:08 | winr4r | you're welcome!
18:42 | Ravenloft | http://youtu.be/3r3BOZ6QQtU
19:48 | godane | looks like there were 2 snapshots a day of techcrunch.com since oct 2011
19:49 | godane | it's still not a bad idea that i do this panic download
19:50 | godane | the year 2007 doesn't have that many snapshots
20:23 | godane | someone may want to look at this: http://tech.slashdot.org/story/13/06/28/176207/googles-blogger-to-delete-all-adult-blogs-that-have-ads?utm_source=rss1.0mainlinkanon&utm_medium=feed
20:23 | SmileyG | godane: we are already on it
20:23 | SmileyG | it came up in discussion in #archiveteam yesterday
20:23 | godane | ok
20:24 | SmileyG | thanks anyway (salutes!)
21:19 | nico____ | godane: i am hacking a little wget script
21:19 | nico____ | to download adult blogs
21:19 | nico____ | AUTH_URL=$(curl -k -s "http://www.blogger.com/blogin.g?blogspotURL=http://blogname.blogspot.fr/" |grep "kd-button-bar" |sed -e "s/.*href=\"//g" -e "s/\" target.*//g")
21:19 | nico____ | CRAWL_URL=$(curl --cookie-jar cookies.db $AUTH_URL|grep "here</A>." |sed -e "s/.*HREF=\"//g" -e "s/\">.*//g")
21:20 | nico____ | wget --load-cookies cookies.db -rk -w 1 --random-wait -p -R '*\?*' -l 1 $CRAWL_URL
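A tidied-up version of the three lines above, with the variables quoted and comments added; the blog name is a placeholder and the Blogger URLs are nico____'s own:

    #!/bin/bash
    # back up one level of an adult-flagged Blogger blog that sits
    # behind the content-warning interstitial
    BLOG="blogname"   # placeholder

    # fetch the interstitial page and extract the confirmation link
    AUTH_URL=$(curl -k -s "http://www.blogger.com/blogin.g?blogspotURL=http://${BLOG}.blogspot.fr/" \
        | grep "kd-button-bar" | sed -e 's/.*href="//' -e 's/" target.*//')

    # follow it, saving the cookie that records the acknowledgement
    CRAWL_URL=$(curl -s --cookie-jar cookies.db "$AUTH_URL" \
        | grep 'here</A>' | sed -e 's/.*HREF="//' -e 's/">.*//')

    # crawl politely with the cookie loaded, skipping query-string URLs
    wget --load-cookies cookies.db -rk -w 1 --random-wait -p -R '*\?*' -l 1 "$CRAWL_URL"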
21:53 | nico____ | http://pastebin.com/A6YvX14Z better version
21:56 | nico_32 | if you want to do an emergency backup of some site
21:56 | nico_32 | hosted on blogger