#archiveteam 2015-11-25,Wed

Time Nickname Message
00:12 🔗 clb92 has joined #archiveteam
00:17 🔗 xk_id has quit IRC (Remote host closed the connection)
00:24 🔗 aaaaaaaaa yes it does and FOS is still set to 25 connections total, I believe. Or at least I don't remember someone saying they raised it.
00:35 🔗 Ungstein has joined #archiveteam
00:38 🔗 godane has joined #archiveteam
00:40 🔗 clb92 has quit IRC ()
00:45 🔗 remsen has quit IRC (Read error: Operation timed out)
00:46 🔗 yipdw it's at 50
00:46 🔗 yipdw the warrior stuff is typically under /chfoo/* and that module has 100 max
01:00 🔗 kyan has joined #archiveteam
01:13 🔗 pikhq has quit IRC (Ping timeout: 252 seconds)
01:14 🔗 espes__ has quit IRC (Read error: Operation timed out)
01:14 🔗 espes__ has joined #archiveteam
01:18 🔗 pikhq has joined #archiveteam
01:46 🔗 JesseW has joined #archiveteam
01:49 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
01:50 🔗 aaaaaaaaa has joined #archiveteam
01:50 🔗 swebb sets mode: +o aaaaaaaaa
02:00 🔗 Microguru has joined #archiveteam
02:17 🔗 primus104 has quit IRC (Leaving.)
02:40 🔗 JesseW has quit IRC (Leaving.)
02:47 🔗 JesseW has joined #archiveteam
02:48 🔗 bwn has quit IRC (Read error: Operation timed out)
03:00 🔗 icedice has quit IRC (Read error: Connection reset by peer)
03:01 🔗 icedice has joined #archiveteam
03:17 🔗 JesseW has quit IRC (Leaving.)
04:04 🔗 JesseW has joined #archiveteam
04:26 🔗 remsen has joined #archiveteam
04:39 🔗 aaaaaaaaa has quit IRC (Leaving)
05:00 🔗 remsen has quit IRC (Leaving)
05:00 🔗 remsen has joined #archiveteam
05:07 🔗 Sk1d has quit IRC (Read error: Operation timed out)
05:11 🔗 z00nx has quit IRC (Quit: WeeChat 1.3)
05:11 🔗 z00nx has joined #archiveteam
05:12 🔗 z00nx has quit IRC (Client Quit)
05:13 🔗 z00nx has joined #archiveteam
05:14 🔗 Sk1d has joined #archiveteam
05:40 🔗 Atluxity arkiver: I think I found some available storage capacity, and that I can ask to have a few TB allocated to me. maybe one TB for an archivebot instance and two TB for an rsync target
05:46 🔗 xk_id has joined #archiveteam
05:49 🔗 WinterFox has joined #archiveteam
06:01 🔗 JesseW has quit IRC (Leaving.)
06:09 🔗 xk_id has quit IRC (Remote host closed the connection)
06:10 🔗 icedice has quit IRC (Ping timeout: 360 seconds)
06:26 🔗 bwn has joined #archiveteam
06:35 🔗 xk_id has joined #archiveteam
06:38 🔗 nightpool has quit IRC (Read error: Operation timed out)
06:40 🔗 nightpool has joined #archiveteam
06:55 🔗 redlob_ has joined #archiveteam
06:59 🔗 redlob has quit IRC (Read error: Operation timed out)
07:02 🔗 RedType has quit IRC (Read error: Operation timed out)
07:07 🔗 RedType has joined #archiveteam
07:14 🔗 xk_id has quit IRC (Remote host closed the connection)
07:20 🔗 xk_id has joined #archiveteam
07:39 🔗 nightpool has quit IRC (Read error: Operation timed out)
07:39 🔗 xk_id has quit IRC (Remote host closed the connection)
07:48 🔗 sivoais has quit IRC (Read error: Operation timed out)
07:49 🔗 nightpool has joined #archiveteam
07:49 🔗 sivoais has joined #archiveteam
07:50 🔗 RedType has quit IRC (Read error: Operation timed out)
07:52 🔗 RedType has joined #archiveteam
07:53 🔗 icedice has joined #archiveteam
07:53 🔗 icedice Can someone voice me at #archivebot ?
07:54 🔗 nightpool has quit IRC (Read error: Operation timed out)
07:58 🔗 icedice has quit IRC (Quit: Page closed)
08:06 🔗 bwn_ has joined #archiveteam
08:06 🔗 arkiver So currently we are at 50 items/min for docstoc
08:07 🔗 arkiver We need to get that up to at least 200 items/min
08:07 🔗 arkiver Right now the site has no slowdowns due to our grab, so it's holding up fine
08:08 🔗 arkiver Atluxity: do you think you can get more concurrent online?
08:08 🔗 Atluxity sure
08:08 🔗 arkiver Atluxity: For docstoc 2 TB would be too low I think
08:08 🔗 Atluxity right
08:09 🔗 arkiver thanks
08:10 🔗 icedice has joined #archiveteam
08:11 🔗 kyan Oh, docstoc's going? I'd be delighted to start some warriors, I'll try to figure out how
08:12 🔗 bwn has quit IRC (Ping timeout: 606 seconds)
08:13 🔗 kyan arkiver: By the way, is it better to run the scripts without warriors?
08:13 🔗 arkiver whatever is easier for you
08:13 🔗 kyan Also, how much concurrency is recommended per IP?
08:13 🔗 arkiver It looks like docstoc is not banning IPs, so I'd say that depends on your hardware
08:14 🔗 kyan Yay!
08:14 🔗 arkiver also, max concurrent on warrior is 6 and for scripts 20
08:14 🔗 kyan Ok, my vps can do 16 concurrent grab-site grabs, but I'm assuming these scripts are less cpu intensive
08:15 🔗 arkiver They might be
08:15 🔗 Atluxity arkiver: 490 more concurrent started. 10 x 49 hosts
08:15 🔗 arkiver Atluxity: nice!
08:16 🔗 arkiver That should roughly double what we have now :)
08:16 🔗 kyan Wow, I feel insignificant :3
08:16 🔗 Atluxity no
08:16 🔗 Atluxity I was already running that
08:17 🔗 Atluxity mmm.. I should investigate what my bottleneck is
08:17 🔗 sivoais has quit IRC (Read error: Operation timed out)
08:17 🔗 icedice kyan, can you voice me?
08:17 🔗 icedice (on #archivebot)
08:17 🔗 Atluxity Project code is out of date and needs to be upgraded.
08:17 🔗 Atluxity of course
08:18 🔗 kyan icedice: No, sorry, I don't have ops there
08:18 🔗 icedice Ok
08:19 🔗 Atluxity arkiver: I was running old code.... :\
08:22 🔗 Atluxity updated, starting up 490 to see first
08:27 🔗 primus104 has joined #archiveteam
08:28 🔗 sivoais has joined #archiveteam
08:29 🔗 kyan arkiver: Do I touch STOP to stop it gracefully? Is it possible to change the concurrency after starting it, or is it necessary to stop it and restart?
08:29 🔗 Atluxity yes, touch STOP
08:29 🔗 Atluxity then it will log that it has noticed the stop file
08:29 🔗 Atluxity then you can rm STOP
08:30 🔗 Atluxity then start new
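The stop-file steps above can be sketched in Python. This is only an illustration of the pattern, not the actual pipeline code: `run_until_stopped`, `process_item` and the simulated items are hypothetical names made up for the example. The idea is that the worker checks for the file before claiming new work, so items already in progress finish cleanly.

```python
import os
import tempfile


def run_until_stopped(items, process_item, stop_path="STOP"):
    """Process items one at a time, checking for a stop file between items.

    Returns the list of items that were finished before the stop file appeared.
    (Hypothetical helper for illustration only.)
    """
    finished = []
    for item in items:
        # Check before claiming new work, so in-flight work always completes.
        if os.path.exists(stop_path):
            print("Stop file detected, not starting new items.")
            break
        process_item(item)
        finished.append(item)
    return finished


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        stop_file = os.path.join(workdir, "STOP")

        def fake_process(item):
            # Simulate `touch STOP` arriving while item "b" is being processed.
            if item == "b":
                open(stop_file, "w").close()

        done = run_until_stopped(["a", "b", "c"], fake_process, stop_file)
        print(done)  # item "c" is never started
```

After the worker has logged that it saw the file, the file has to be removed again (the `rm STOP` step) before a new run is started, otherwise the new run would stop immediately.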
08:30 🔗 kyan Awesome, thanks! :D
08:30 🔗 kyan (sorry to bother you arkiver)
08:31 🔗 Atluxity arkiver: let me know if you want more concurrents.
08:34 🔗 arkiver Atluxity: items/min has just more than doubled!
08:34 🔗 arkiver We might need even more than this, but let's see how the site and FOS hold up for now
08:34 🔗 arkiver kyan: every bit helps!
08:34 🔗 Atluxity yeah, it's better to do this a bit slow
08:35 🔗 kyan Getting an error
08:35 🔗 kyan http://pastebin.com/xDBc31kV
08:35 🔗 kyan any thoughts on how to fix it? Thanks :)
08:35 🔗 nightpool has joined #archiveteam
08:35 🔗 arkiver https://github.com/ArchiveTeam/docstoc-grab see below for wget.pod problems
08:36 🔗 atomotic has joined #archiveteam
08:36 🔗 schbirid has joined #archiveteam
08:36 🔗 * kyan rtfms
08:37 🔗 kyan "If anything goes wrong while running the commands below, please scroll down to the bottom of this page. There's troubleshooting information there." ...D'oh.
08:38 🔗 godane has quit IRC (Quit: Leaving.)
08:39 🔗 godane has joined #archiveteam
08:39 🔗 nightpool has quit IRC (Read error: Operation timed out)
08:49 🔗 Elegance has quit IRC (Read error: Connection reset by peer)
08:50 🔗 xk_id has joined #archiveteam
08:52 🔗 Elegance has joined #archiveteam
09:03 🔗 kyan FWIW, the docstoc scripts are WAY easier on the CPU than grab-site. http://pastebin.com/pQx8XyWe
09:06 🔗 kyan arkiver: I'm seeing some 403s like 127=403 http://img.docstoccdn.com/thumb/orig/19463956.png. Is this normal?
09:07 🔗 kyan (same thing happens in a browser.)
09:12 🔗 kyan Also 160=500 http://embed.docstoc.com/Errors/Errors.aspx?aspxerrorpath=/Pages/Documents/Browse/BrowseDocuments.aspx
09:13 🔗 kyan (Like this http://embed.docstoc.com/handlers/downloadfilefromflash.ashx?docid=2489658)
09:26 🔗 Atluxity arkiver: I started 5 x 49 more
09:28 🔗 MMovie1 has joined #archiveteam
09:29 🔗 xk_id has quit IRC (Remote host closed the connection)
09:31 🔗 MMovie has quit IRC (Ping timeout: 310 seconds)
09:33 🔗 Atluxity more items out at least
09:34 🔗 Atluxity slight increase in items/hour
09:34 🔗 Atluxity will wait for feedback regarding FOS health
09:34 🔗 bwn_ has quit IRC (Read error: Operation timed out)
09:38 🔗 kyan Should I run 2 copies of run-pipeline to go over 20 concurrency? 20 seems to be going as fine as it ever was
09:38 🔗 Atluxity I am not sure
09:39 🔗 kyan Ok, thanks :)
09:42 🔗 kyan And by the way, thank you a lot to whoever wrote the scripts (arkiver?) to save Docstoc. :)
09:52 🔗 icedice has quit IRC (Ping timeout: 240 seconds)
10:06 🔗 arkiver2 has joined #archiveteam
10:07 🔗 bwn_ has joined #archiveteam
10:12 🔗 arkiver2 has quit IRC (Ping timeout: 252 seconds)
10:13 🔗 Ungstein1 has joined #archiveteam
10:14 🔗 Ungstein has quit IRC (Ping timeout: 252 seconds)
10:18 🔗 philpem has quit IRC (Ping timeout: 252 seconds)
10:18 🔗 kyan Also docstoc companies: http://www.expertcircle.com/ http://www.license123.com/
10:19 🔗 marvinw is now known as ivan`
10:21 🔗 ivan` kyan: you can use half as much cpu with --wpull-args=--html-parser=libxml2-lxml but it will be more prone to segfaulting
10:21 🔗 kyan Huh, cool, thanks! I'll probably leave it the way it is, I'd rather not have to pay much attention to it :P
10:21 🔗 ivan` also a normal xeon server will be about 3x faster than the weird atom server that online.net sells
10:22 🔗 kyan Hmm, well I already paid the 20eur setup fee for the online.net one :P
10:22 🔗 ivan` heh
10:22 🔗 ivan` I snagged a limited edition server @ $30/mo with xeon/4TB storage/32GB a few months back
10:22 🔗 kyan nice!
10:23 🔗 kyan This is like $20 a month, which is a LOT of money for me
10:23 🔗 kyan so I'm not planning to upgrade... :P
10:26 🔗 kyan Also, the server is generally faster than my fastest computer at home (8 cores, 2.4GHz vs. 2 cores, 2.26GHz) so to me it seems... pretty damn fast.
10:27 🔗 SmileyG has quit IRC (Quit: http://www.milkme.co.uk - You'll never understand.)
10:28 🔗 Smiley has joined #archiveteam
10:30 🔗 philpem has joined #archiveteam
10:33 🔗 Smiley has quit IRC (Quit: http://www.milkme.co.uk - You'll never understand.)
10:34 🔗 Smiley has joined #archiveteam
10:39 🔗 kemi has joined #archiveteam
10:47 🔗 Microguru has quit IRC (Remote host closed the connection)
10:55 🔗 kemi Hi, dunno if it's the right place to say it but wat.tv is closing, I thought that might interest people here
11:00 🔗 xk_id has joined #archiveteam
11:01 🔗 kyan kemi: Good to know, thanks for letting us know!
11:16 🔗 xk_id has quit IRC (Read error: Operation timed out)
11:18 🔗 nightpool has joined #archiveteam
11:25 🔗 nightpool has quit IRC (Read error: Operation timed out)
11:43 🔗 kyan arkiver: Seems like the rsync upload is having problems: "@ERROR: max connections (100) reached -- try again later"
11:43 🔗 arkiver Yes
11:43 🔗 arkiver We're going to get a target from Kenshin!
11:44 🔗 kyan Yay! :D
11:58 🔗 arkiver We have a target from Kenshin!
12:00 🔗 kyan Extra yay! :D :D
12:00 🔗 Atluxity does this mean a change in code? how are workers notified about targets?
12:01 🔗 arkiver workers request a target from the tracker
12:02 🔗 Atluxity ah, nice
12:02 🔗 Atluxity seamless change for us then
12:05 🔗 WinterFox has quit IRC (Read error: Operation timed out)
12:05 🔗 Atluxity awh yeah... look at it go
12:10 🔗 xk_id has joined #archiveteam
12:12 🔗 nightpool has joined #archiveteam
12:12 🔗 arkiver we're now at 130 items/min
12:14 🔗 kyan (13000 document IDs per minute, I think that works out to! DOPE!)
12:14 🔗 * kyan has fond memories of Docstoc's promotional emails showing up every now and then...
12:16 🔗 xk_id has quit IRC (Read error: Operation timed out)
12:17 🔗 nightpool has quit IRC (Ping timeout: 258 seconds)
12:19 🔗 icedice has joined #archiveteam
12:34 🔗 arkiver Atluxity: do you think you can get more concurrent on docstoc? or maybe move the concurrent on yuku over to docstoc for now?
12:39 🔗 arkiver Atluxity: nevermind about that
12:42 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
12:43 🔗 arkiver kemi: any more info on wat.tv
12:43 🔗 arkiver ?
12:45 🔗 kemi it's owned by TF1, they're going to keep their videos on a new service but users-uploaded ones are gonna disappear on February, 17th 2016
12:45 🔗 kemi http://www.numerama.com/business/132400-tf1-ferme-wat-tv.html here some link if you can read French
12:46 🔗 arkiver I can't speak French
12:46 🔗 arkiver How do you see if a video is user uploaded or by TF1?
12:47 🔗 arkiver would this be user uploaded? http://www.wat.tv/video/oggi-pioggia-davide-esposito-7ixcx_7bqlx_.html
12:47 🔗 Atluxity arkiver: nvm?
12:47 🔗 Atluxity ok
12:47 🔗 arkiver Atluxity: looks like docstoc is now slowing down a bit
12:47 🔗 arkiver So might be good to not put more pressure on it
12:47 🔗 Atluxity yeah, seems we are pushing the limit of the target
12:47 🔗 arkiver yeah
12:48 🔗 Atluxity the head of the cloud I am beta testing was just in and complimented my network load generation
12:48 🔗 arkiver haha
12:49 🔗 arkiver it will go even higher when you go on google code too ;)
12:49 🔗 Atluxity yeah, I told him
12:49 🔗 Atluxity he warned me this arrangement would need to be more gentle from January on
12:50 🔗 Atluxity this is not beta forever :)
13:16 🔗 xk_id has joined #archiveteam
13:22 🔗 xk_id has quit IRC (Read error: Operation timed out)
13:34 🔗 bwn_ has quit IRC (Read error: Connection reset by peer)
13:34 🔗 bwn_ has joined #archiveteam
13:53 🔗 luckcolor has joined #archiveteam
13:55 🔗 luckcolor hello guys
13:56 🔗 Kenshin Atluxity: are you the one pushing traffic from basefarm?
13:56 🔗 luckcolor basefarm?
13:56 🔗 Atluxity Yes
13:56 🔗 luckcolor I don't know what that is
13:57 🔗 Kenshin Atluxity: can you get me peering with them over amsix?
13:58 🔗 Atluxity I can try
13:58 🔗 luckcolor ah sorry kenshin i hadn't seen that you were talking to Atluxity
13:58 🔗 Kenshin or should i just poke at their peering email
13:58 🔗 Kenshin luckcolor: no worries
13:59 🔗 luckcolor anyway i was just about to say that on my warrior i have two items that have been running for 11 hours
13:59 🔗 luckcolor and seem to be stuck in a url redirect loop
14:00 🔗 Kenshin Atluxity: i dropped them a mail from AS24482. if you know someone in basefarm that can help, would be appreciated
14:00 🔗 Kenshin you're doing 400M or so of traffic, would be nice if it could be done over peering links
14:01 🔗 trs80 that's a lot of traffic
14:01 🔗 luckcolor here a sample of the urls
14:01 🔗 luckcolor 37966=200 http://embed.docstoc.com/handlers/downloadfilefromflash.ashx?docid=6448510&ref_url=http://www.docstoc.com/docs/6448510/icons/core/icons/core/icons/sap/page/images/icons/core/icons/sap/page/images/icons/sap/page/NI.gif.
14:01 🔗 luckcolor 37967=200 http://embed.docstoc.com/handlers/downloadfilefromflash.ashx?docid=6448510&ref_url=http://www.docstoc.com/docs/6448510/icons/core/icons/core/icons/sap/page/images/icons/core/icons/sap/page/images/icons/sap/page/PI.gif.
14:01 🔗 Kenshin trs80: not really. you used to do like 1G remember? lol
14:01 🔗 Kenshin luckcolor: yeah i think you're hitting repeated URLs
14:02 🔗 luckcolor yeah
14:02 🔗 Kenshin luckcolor: talk to arkiver. he's managing the code
14:02 🔗 arkiver strange
14:02 🔗 Kenshin there he is
14:03 🔗 arkiver I'll have a fix in in a bit
14:04 🔗 arkiver for now I paused the grab, I want to take some load off of docstoc
14:04 🔗 arkiver please keep the concurrent running though!
14:04 🔗 trs80 kenshin: aarnet came in months later for our yearly review, and were like "you did a lot of extra traffic in august, any reason why?"
14:05 🔗 Kenshin trs80: heh, what did you use as an excuse?
14:05 🔗 trs80 it wasn't until after the meeting that I remembered that's when I was an rsync host
14:05 🔗 trs80 kenshin: just shrugged my shoulders
14:05 🔗 Kenshin it worked?
14:05 🔗 trs80 yeah. it wasn't a bad review, more like "here's how you're using your connection, what can we do for you?" sort of thing
14:05 🔗 Kenshin i assume your university pays them?
14:05 🔗 trs80 and they showed us monthly usage graphs
14:06 🔗 Kenshin i remember aarnet was very anal with their mirror server usage
14:06 🔗 trs80 school, yeah we pay a pretty cheap rate for all we can eat effectively
14:06 🔗 Kenshin but then again i don't blame them, AU bandwidth is expensive as hell
14:06 🔗 trs80 in theory there's a 10TB/student/year usage limit during business hours
14:06 🔗 trs80 but we don't come close to that
14:06 🔗 Kenshin ic, so even the rsync burst wasn't that big an issue
14:07 🔗 trs80 k12 schools get a much better deal than unis
14:07 🔗 Kenshin as long as you don't get into trouble i guess, having you as a backup rsync is kinda important
14:07 🔗 trs80 yeah, it was fine, just unusual for our traffic to increase so much for a limited period of time
14:08 🔗 trs80 I don't have a huge amount of space atm, but that's mostly because I've got a few TB of internetarchive.bak
14:08 🔗 Kenshin i guess when push comes to shove, rsync data is probably more important?
14:09 🔗 trs80 yeah, ia.bak seems a bit stagnant atm, and it can always be re-downloaded
14:09 🔗 Kenshin sadly for huge projects, rsync is like playing musical chairs
14:09 🔗 trs80 here we go, metered traffic went from ~100GB to 3TB in august 2014
14:09 🔗 Kenshin ...
14:09 🔗 Kenshin woops.
14:09 🔗 Kenshin lol
14:10 🔗 trs80 unmetered traffic was ~18TB then, and 24TB in august this year
14:10 🔗 remsen has quit IRC (Read error: Operation timed out)
14:11 🔗 Kenshin we did have a lot of crazy projects in the last 2 years
14:11 🔗 trs80 was it twitch?
14:11 🔗 Kenshin last year was twitch
14:11 🔗 Kenshin this year was bliptv
14:11 🔗 Kenshin god damn if we have another video site.
14:12 🔗 trs80 most of our traffic over aarnet isn't metered as they have excellent peering
14:12 🔗 Kenshin yeah they have SG,HK and LAX iirc
14:12 🔗 trs80 kenshin: 3h ago: <kemi> Hi, dunno if it's the right place to say it but wat.tv is closing, I thought that might interest people here
14:12 🔗 Kenshin ... f***
14:13 🔗 Atluxity Kenshin: when did you drop them an email, and how long ago?
14:14 🔗 Atluxity wait...
14:14 🔗 Kenshin 15 minutes ago
14:14 🔗 Atluxity when and to what address
14:14 🔗 Kenshin peering@basefarm.no
14:14 🔗 Atluxity aha
14:14 🔗 Atluxity cool, will talk to them now
14:14 🔗 Kenshin nice thanks
14:16 🔗 trs80 wat.tv closes feb 17 2016 fwiw
14:17 🔗 xk_id has joined #archiveteam
14:18 🔗 Atluxity Kenshin: poked the proper techy, gave some odd answers about what this was about, and he could see no reason for not peering
14:18 🔗 Atluxity he would try to get it done today or tomorrow morning
14:19 🔗 Kenshin Atluxity: cool thanks
14:21 🔗 luckcolor guys also there's a silly problem with my nickname
14:21 🔗 luckcolor the server is probably cutting a letter
14:21 🔗 luckcolor the last one
14:21 🔗 luckcolor -_-
14:21 🔗 luckcolor should end with s
14:21 🔗 Ymgve has quit IRC (Read error: Connection reset by peer)
14:22 🔗 Ymgve has joined #archiveteam
14:25 🔗 xk_id has quit IRC (Read error: Operation timed out)
14:34 🔗 Atluxity luckcolor: on this chat?
14:34 🔗 Atluxity do you get an error message if you try the command /nick luckcolors ?
14:35 🔗 luckcolor atluxity in general i get this nickname
14:35 🔗 luckcolor the command doesnt return anything
14:36 🔗 luckcolor and my nick is still the same
14:36 🔗 luckcolor -__
14:36 🔗 luckcolor -_-
14:37 🔗 Atluxity guess the server does not support such long nicknames
14:38 🔗 Atluxity I see no one here with more than 9 chars in their nick
14:43 🔗 Atluxity uh oh
14:43 🔗 Atluxity Kenshin: docstoc items per hour is dropping :\
14:43 🔗 Atluxity is it you or me?
14:43 🔗 antomatic tracker rate limiting, by the look of it
14:43 🔗 Atluxity yeah
14:44 🔗 Kenshin arkiver said he didn't want to tax docstoc too much
14:45 🔗 Atluxity ah, he said at 15:04 right
14:57 🔗 phuzion So, we don't need another 5000 threads or anything like that?
15:02 🔗 antomatic Doesn't look to me like the tracker is handing out any new items at all right now
15:04 🔗 phuzion antomatic: yeah, looks like arkiver paused the grab about an hour ago
15:15 🔗 luckcolor yeah he did
15:21 🔗 xk_id has joined #archiveteam
15:25 🔗 Kenshin there's some issue with the code that luckcolor found, so i think arkiver is working on it
15:25 🔗 Kenshin give him some time
15:26 🔗 Kenshin phuzion: no, i think Atluxity threw in a LOT of resources already
15:26 🔗 Kenshin so best not to add anymore strain to docstoc
15:26 🔗 Kenshin i think google code is coming up soon as well, that one can be raped
15:27 🔗 Kenshin Atluxity: basefarm got back, will peer tomorrow, thanks!
15:29 🔗 ozlo has quit IRC (Quit: If only I was sure that my head on the door was a dream...)
15:30 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
15:30 🔗 xk_id has quit IRC (Read error: Operation timed out)
15:36 🔗 luckcolor has quit IRC (Read error: Operation timed out)
15:48 🔗 luckcolor has joined #archiveteam
15:50 🔗 Ghost_of_ has joined #archiveteam
15:55 🔗 primus104 has quit IRC (Leaving.)
16:14 🔗 arkiver tracker restarted
16:14 🔗 arkiver The websites really slowed down a lot. Sometimes it's best to totally take off the load from it and build it up slowly again
16:20 🔗 JesseW has joined #archiveteam
16:44 🔗 Start has quit IRC (Quit: Disconnected.)
16:44 🔗 JesseW has quit IRC (Leaving.)
16:47 🔗 scyther has joined #archiveteam
16:52 🔗 Ghost_of_ has quit IRC (Quit: Leaving)
16:53 🔗 xk_id has joined #archiveteam
16:56 🔗 phuzion Are we getting banned by cloudfront? https://gist.github.com/anonymous/788d8f001402b1e45a72
16:57 🔗 luckcolor has quit IRC (Read error: Connection reset by peer)
16:57 🔗 HCross phuzion, it access denies with a 403 when I hit it from my other server over HTTP
16:57 🔗 phuzion Ok
16:58 🔗 HCross you try some of them
16:59 🔗 MrRadar I get a 403 access denied on those links as well
16:59 🔗 phuzion Same
17:00 🔗 HCross On my domestic connection too
17:16 🔗 arkiver Some links seem to be gone
17:17 🔗 arkiver they return 403
17:17 🔗 arkiver if you look at logs from other items you'll also see many 200's
17:20 🔗 arkiver I'm going to start the Google Code project in a bit
17:21 🔗 HCross ok, still waiting for Kimsufi stock, but will get on it asap
17:21 🔗 arkiver ok!
17:21 🔗 arkiver I'm not going to make google code the default warrior project yet
17:22 🔗 HCross later tonight, ill see if Scaleway have capacity, and if they do I will start on the ScaleArchiver
17:29 🔗 phuzion arkiver: I should be able to throw about 50 DO instances at google code
17:34 🔗 arkiver nice
17:36 🔗 phuzion I just need to figure out the automation part of my droplet creation.
17:36 🔗 HCross phuzion, I was going to say about that
17:37 🔗 phuzion HCross: Do you have any ideas on how to do that?
17:37 🔗 HCross some form of batch script?
17:38 🔗 phuzion Ansible supports DO, but I don't have the right version of ansible
17:38 🔗 HCross I am working on something for Scaleway at sometime, clicking order on the servers are fine, as there are a max of 10
17:40 🔗 HCross Take allok at SaltStack
17:40 🔗 HCross a look at
17:42 🔗 phuzion I'm looking at something called tugboat right now
17:42 🔗 phuzion https://github.com/pearkes/tugboat
17:55 🔗 kyan arkiver: Regarding the loop-y problem, is there somehing I should do to abort jobs that are already going and have that problem? Or will they fix themselves eventually? URL example http://embed.docstoc.com/handlers/downloadfilefromflash.ashx?docid=109$830&ref_url=http://www.docstoc.com/docs/common/common/common/common/common/comm$n/common/common/common/common/common/common/common/common/common/common/common/c
17:55 🔗 kyan ommon/common/common/common/common/common/common/common/common/common/common/com$
17:55 🔗 kyan on/common/common/common/common/common/common/common/common/common/common/common$
17:55 🔗 kyan common/common/common/common/common/common/common/common/common/common/common/co$
17:55 🔗 kyan mon/common/common/common/common/common/common/common/common/common/common/commo$
17:55 🔗 kyan Wow. Anyway, that, ish.
17:56 🔗 arkiver right. will have a fix up in a bit
17:56 🔗 kyan Ok, cool, sorry to bother you :)
17:57 🔗 Start has joined #archiveteam
17:59 🔗 primus104 has joined #archiveteam
18:03 🔗 arkiver items added to googlecode grab!
18:04 🔗 remsen has joined #archiveteam
18:07 🔗 phuzion arkiver: What do you think is a reasonable amount of concurrent per IP? 2? 10?
18:07 🔗 phuzion (I'm using small DO instances, keep in mind)
18:08 🔗 HCross what is the CPU backend?
18:08 🔗 phuzion CPU backend?
18:08 🔗 phuzion On the DO instances?
18:08 🔗 HCross because if I come along with my E5 server then ofc it's more
18:11 🔗 arkiver it can be hard to find out an URL is in a loop, but I'm trying to do it with this https://github.com/ArchiveTeam/docstoc-grab/blob/master/docstoc.lua#L120-L135
18:11 🔗 arkiver So scripts for docstoc are updated
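One simple way to catch URLs like the looping ones reported earlier is to count how often the same path segment repeats. The sketch below is an assumption-laden illustration in Python and is not a translation of the linked Lua code: the function name `looks_like_loop` and the threshold `max_repeats=3` are made up for the example.

```python
from collections import Counter


def looks_like_loop(url, max_repeats=3):
    """Return True if any '/'-separated segment occurs more than max_repeats times.

    The whole URL is split, not just the path, so that loops hidden inside a
    query parameter such as ref_url=... are counted as well.
    (Hypothetical heuristic for illustration only.)
    """
    segments = [s for s in url.split("/") if s]
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())


if __name__ == "__main__":
    looping = (
        "http://embed.docstoc.com/handlers/downloadfilefromflash.ashx"
        "?docid=6448510&ref_url=http://www.docstoc.com/docs/6448510/"
        "icons/core/icons/core/icons/sap/page/images/icons/core/icons/"
        "sap/page/images/icons/sap/page/NI.gif."
    )
    normal = "http://www.docstoc.com/docs/6448510/some-title"
    print(looks_like_loop(looping))
    print(looks_like_loop(normal))
```

A threshold that is too low would reject legitimate URLs that happen to repeat a directory name, so a real grab script would likely need to tune it per site.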
18:12 🔗 arkiver Atluxity: please let me know when you have updated your scripts and I'll set the new version in the tracker
18:12 🔗 phuzion arkiver: New version of the script is available but not required, is what you're saying?
18:12 🔗 arkiver We're making good progress and I don't want to interrupt that
18:13 🔗 phuzion Right
18:13 🔗 arkiver Basically waiting an extra hour shouldn't matter too much
18:13 🔗 phuzion Do you think 2 concurrent threads per instance is good, or can I bump that up?
18:13 🔗 arkiver and in that hour some extra thousand documents can be saved
18:13 🔗 arkiver for google code?
18:13 🔗 phuzion Yeah
18:13 🔗 arkiver maybe
18:13 🔗 arkiver I honestly have no idea
18:13 🔗 arkiver I don't think google bans IPs
18:14 🔗 arkiver But these projects sometimes have a few hundred thousand URLs
18:14 🔗 arkiver And size might also be big for some items
18:14 🔗 phuzion 2 it is then :)
18:14 🔗 arkiver Yes, I guess you can always up the limit if the machines can handle more
18:18 🔗 Atluxity arkiver: googlecode-grab is updated
18:19 🔗 Atluxity oh, docstoc
18:19 🔗 Atluxity right
18:23 🔗 phuzion Instances are being updated as we speak. I should be coming online with 100 simultaneous threads within about 10 minutes or so.
18:23 🔗 Atluxity arkiver: docstoc updated
18:26 🔗 Atluxity wondering about the process, when I upload stuff to rsync target, is it 1:1 with what was downloaded or is it compressed?
18:27 🔗 phuzion Atluxity: You talking about the warc that seesaw creates?
18:27 🔗 Atluxity probably
18:28 🔗 phuzion because that's run through gzip before being sent off to the rsync target.
18:28 🔗 Atluxity thats odd
18:28 🔗 phuzion basically, seesaw mirrors the content (and metadata) into a WARC file, then gzips the WARC, then rsyncs it to FOS or whoever the rsync target is.
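The compression step described above can be illustrated with Python's standard `gzip` module. This is a rough sketch, not what seesaw actually runs: `gzip_file` and the sample payload are hypothetical, and the file written here is plain repeated HTML rather than a real WARC.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path, dst_path):
    """Write a gzip-compressed copy of src_path to dst_path.

    Returns a (raw_size, compressed_size) tuple in bytes.
    (Hypothetical helper for illustration only.)
    """
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return os.path.getsize(src_path), os.path.getsize(dst_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        raw_path = os.path.join(workdir, "example.warc")
        # Stand-in payload: repetitive markup compresses very well.
        with open(raw_path, "wb") as f:
            f.write(b"<html><body>hello</body></html>\n" * 1000)
        raw, packed = gzip_file(raw_path, raw_path + ".gz")
        print(f"raw={raw} bytes, gzipped={packed} bytes")
```

Text-heavy pages usually shrink a lot, while images and already compressed documents barely shrink at all, so the ratio of uploaded to downloaded bytes depends heavily on what is being grabbed.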
18:28 🔗 Atluxity my beta-cloud manager was wondering why I created so much outbound traffic, and not so much inbound
18:29 🔗 Atluxity but it might be someone else doing stuff (tm)
18:30 🔗 MrRadar Anonymous kills ISIS darknet site: http://www.dailydot.com/politics/isis-tor-hidden-service-down/
18:30 🔗 MrRadar I think that was the one SketchCow threw into ArchiveBot last week
18:33 🔗 yipdw it was http://archive.fart.website/archivebot/viewer/job/hjbkk
18:33 🔗 Yukundali has joined #archiveteam
18:33 🔗 Yukundali salutations all
18:34 🔗 Atluxity greetings
18:34 🔗 Yukundali Someone has a bad script scraping our site, just wanted to come and greet before requesting the script to stop
18:34 🔗 Atluxity cool
18:34 🔗 Atluxity what site?
18:34 🔗 Yukundali Yuku
18:34 🔗 Atluxity ah, yes
18:35 🔗 Atluxity it is shutting down, no?
18:35 🔗 Yukundali http://pastebin.com/g40vDhUE
18:35 🔗 Yukundali the requests are malformed
18:35 🔗 Yukundali and its choking our memcache servers
18:35 🔗 Yukundali we run a very archaic infrastructure
18:35 🔗 yipdw oh good it's not ArchiveBot
18:35 🔗 Atluxity arkiver: poke
18:36 🔗 Atluxity Yukundali: we should fix that....
18:37 🔗 Atluxity Yukundali: this is the status of our scraping, if you had not found it already http://tracker.archiveteam.org/yuku/
18:37 🔗 Atluxity I'll stop my workers
18:37 🔗 Yukundali impressive
18:38 🔗 Atluxity there, can take a little while, but my workers will not ask for more work, just finish their ongoing items
18:38 🔗 Atluxity sadly I have no control over the code we are running
18:38 🔗 Yukundali its ok, I already blocked them
18:39 🔗 Yukundali but I might want to join you guys, I love scraping data
18:39 🔗 Atluxity sounds great
18:39 🔗 xmc adbrite is dead btw
18:39 🔗 Start has quit IRC (Quit: Disconnected.)
18:39 🔗 phuzion Yukundali: Would you mind helping us out by suggesting how we can improve the quality of the requests?
18:39 🔗 xmc they went out of business years ago
18:40 🔗 Yukundali I would suggest throttling based on response time... or TTFB
18:41 🔗 phuzion Someone generally manually adjusts the rate of requests based on what we are observing as response times.
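Throttling on response time could be automated along these lines. The Python sketch below is one possible approach under stated assumptions, not how the tracker or the grab scripts work: the function `next_delay`, the 1-second target and the backoff factors are all invented for the example, and measuring the response time (or TTFB) itself is left out.

```python
def next_delay(current_delay, response_time, target=1.0,
               min_delay=0.1, max_delay=60.0):
    """Return the delay in seconds to wait before the next request.

    Back off multiplicatively when the server answered slower than
    `target` seconds, and recover slowly when it answered faster.
    (Hypothetical helper for illustration only.)
    """
    if response_time > target:
        new_delay = current_delay * 2.0   # server is struggling: halve our rate
    else:
        new_delay = current_delay * 0.9   # server is fine: speed up a little
    # Clamp so the delay never collapses to zero or grows without bound.
    return max(min_delay, min(max_delay, new_delay))


if __name__ == "__main__":
    delay = 1.0
    # Simulated response times in seconds: the site slows down, then recovers.
    for observed in [0.3, 0.4, 2.5, 4.0, 3.0, 0.8, 0.5]:
        delay = next_delay(delay, observed)
        print(f"response={observed:.1f}s -> wait {delay:.2f}s")
```

Backing off quickly and recovering slowly is a deliberate choice here: an old site gets relief as soon as it starts to struggle, and the load only builds back up gradually.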
18:41 🔗 phuzion But you said that the requests are malformed?
18:41 🔗 MrRadar Take a look at the pastebin he linked
18:41 🔗 MrRadar The same substring is repeated in the URL dozens of times
18:41 🔗 phuzion Oh hah wow
18:42 🔗 phuzion Ok yeah, I can see memcache freaking out about that.
18:43 🔗 phuzion arkiver: ping?
18:45 🔗 kyan Will the Yuku jobs that were blocked due to the issue be detected as failed and retried? Wouldn't want to miss stuff...
18:46 🔗 kyan (but wow those requests, those are long urls)
18:46 🔗 arkiver Yukundali: hi
18:47 🔗 HCross Yukundali, as second on the list, I want to say sorry for any server melting that we might have done
18:47 🔗 arkiver Yukundali: tracker is paused. load should be 0 from our side in a bit
18:47 🔗 Yukundali it normally wouldn't be a problem, but as you may know, Yuku doesn't have the most stable infrastructure
18:47 🔗 arkiver Yeah, that's why we are grabbing your website
18:47 🔗 phuzion lmao
18:47 🔗 Atluxity seen worse....
18:47 🔗 arkiver I also know that we are grabbing the individual posts, which is on purpose
18:48 🔗 arkiver (which could be seen as malicious)
18:48 🔗 kyan arkiver: It looks like similar problem to the loop on docstoc: http://pastebin.com/g40vDhUE
18:48 🔗 phuzion But yeah, Yukundali, if we fix the malformed requests, would that help things out a little bit and make your life easier?
18:48 🔗 Yukundali yes, ohh... and maybe throttle it down a bit
18:49 🔗 Yukundali like I said, we run memcache servers... which are horrible... so they can't take too much
18:49 🔗 Yukundali I'm hoping to get couchbase running soon, we are also hiring a team to help with infrastructure... so yuku will see better days
18:49 🔗 Yukundali but until then, be gentle... she is old
18:49 🔗 arkiver ok
18:50 🔗 Yukundali none-the-less, I'm all for what you guys are doing
18:50 🔗 Yukundali and honestly quite impressed
18:50 🔗 arkiver I'll lower the limit, please let me know what you think
18:50 🔗 arkiver thanks!
18:50 🔗 arkiver Little thing though, maybe it'd be good to set some other status code than 200 next time
18:51 🔗 arkiver chfoo: can you please send me the logs of yuku?
18:51 🔗 phuzion arkiver: any ETA on opening up google code? I've got my DO instances spun and ready to go.
18:51 🔗 kyan If their infrastructure is having as much trouble as it sounds like it is from some of the threads, I think incorrect status codes are a small worry :P
18:51 🔗 Yukundali what do you need? I can give it to ya
18:52 🔗 arkiver Google Code project is started.
18:52 🔗 Yukundali ok, I'm going to lift my ban on ArchiveTeam bot
18:52 🔗 HCross Yukundali, do you have any sort of whole site backup that could be handed over?
18:52 🔗 arkiver thanks
18:52 🔗 HCross arkiver, congrats
18:53 🔗 Yukundali we have replicated data and backups, but nothing to just hand over
18:53 🔗 arkiver yeah
18:53 🔗 Yukundali but
18:53 🔗 Yukundali I can give you a secret
18:53 🔗 arkiver I'd rather get the data through http
18:53 🔗 HCross ooooh
18:53 🔗 Yukundali we use Mobique as an api to integrate into tapatalk
18:53 🔗 Yukundali all our data is accessible there
18:53 🔗 Yukundali w/o the html
18:54 🔗 Yukundali * mobiequo
18:54 🔗 arkiver Yukundali: do you know what we do with the data we grab?
18:54 🔗 Yukundali no
18:55 🔗 Yukundali I would assume archive it
18:55 🔗 MrRadar It's added to the Internet Archive's Wayback Machine
18:55 🔗 arkiver Yeah, after that it's all made public
18:55 🔗 phuzion Google Code: Process RsyncUpload returned exit code 12 for Item project:test-mysql-project
18:55 🔗 MrRadar So having browsable web pages is important
18:55 🔗 arkiver You might have heard of the Wayback Machine
18:55 🔗 Yukundali yea
18:55 🔗 kyan Would be good to get the back-end data too if that's an option
18:55 🔗 arkiver Everything we grab goes into the wayback machine
18:55 🔗 kyan IMO
18:55 🔗 Yukundali ohh really?!
18:55 🔗 arkiver yeah
18:55 🔗 Yukundali well blow me down ... lol
18:55 🔗 kyan Just for the sake of having everything
18:56 🔗 Yukundali ok, then please have at
18:56 🔗 arkiver I'll give you some examples
18:56 🔗 Yukundali archive.org has saved our butts numerous times
18:56 🔗 MrRadar We've got a special arrangement with them, thanks to our fearless leader SketchCow
18:56 🔗 MrRadar (Whom you may also know as Jason Scott of textfiles.com)
18:57 🔗 arkiver We have currently saved 682 GB from yuku
18:58 🔗 arkiver It's not all uploaded yet, some of it is here https://archive.org/search.php?query=mediatype%3A%22web%22%20AND%20%28yuku%29
18:58 🔗 arkiver Our current projects are here http://tracker.archiveteam.org/
18:59 🔗 arkiver with for example docstoc and google code
18:59 🔗 arkiver phuzion: looks like the rsync target is removed... :/
18:59 🔗 arkiver chfoo: can you please recreate the rsync target for googlecode?
19:00 🔗 Atluxity poke me when I can fire up yuku-grab again
19:00 🔗 arkiver ok
19:00 🔗 Yukundali we lost an advertiser due to "Excessive non-human traffic" : /
19:00 🔗 arkiver Atluxity: I'll first have to go through the logs since we have some bad 200 items
19:01 🔗 Atluxity I suspect there will be some work, yes
19:01 🔗 Yukundali could you restart the scrape slowly pls
19:01 🔗 Yukundali we are negotiating with them now
19:01 🔗 kyan That's not good
19:01 🔗 kyan sorry to hear it
19:01 🔗 arkiver I'll first figure out what exactly we are grabbing from advertisers
19:02 🔗 yipdw your advertiser clearly isn't prepared for the Singularity
19:02 🔗 Atluxity lol
19:02 🔗 MrRadar Yes, let us know what to block to avoid tripping their detection
19:03 🔗 arkiver Yukundali: dod you unblock us?
19:03 🔗 arkiver did*
19:04 🔗 Yukundali Yes, the new code is rolling out to our webservers now
19:04 🔗 arkiver awesome
19:09 🔗 arkiver Yukundali: what domain does your advertiser use for the advertisements?
19:10 🔗 Yukundali we have a lot. The new ad rules coming out will significantly hurt our business model however ( advertisers are kicking out our subdomains )
19:10 🔗 Yukundali we have hundreds of domains
19:11 🔗 Yukundali mostly : lefora.com, yuku.com, freeforums.org, forumer.com
19:12 🔗 arkiver I'm not very sure how all the advertising views work
19:12 🔗 arkiver Is there some kind of image that is loaded for the advertisers?
19:12 🔗 arkiver or something from one of the advertisers' domains?
19:13 🔗 arkiver or can they directly see that a page on *.yuku.com is downloaded?
19:13 🔗 Yukundali we have multiple points of detection, from inside our code, to pixels, to javascripts
19:14 🔗 arkiver This is a partial log I just grabbed from yuku: http://paste.nerds.io/raw/qeqoqenitu
19:15 🔗 arkiver Do you see something that'd have to be blocked so advertisers don't see us/
19:15 🔗 arkiver ?
19:15 🔗 Yukundali thats a private board
19:16 🔗 Yukundali but no I don't see anything
19:16 🔗 Yukundali I'll talk with advertisers more in depth and return to this channel with a better answer
19:17 🔗 arkiver I'm not sure what you mean by private, but I'm just seeing a normal forum, not much private here: http://camgirlnotes.fr.yuku.com/topic/675/
19:17 🔗 arkiver Yukundali: ok
19:18 🔗 arkiver and thanks for contacting us! :)
19:19 🔗 Yukundali ahh, i only tested the private board links on the scrape...
19:19 🔗 Yukundali thanks for being understanding, I'll return some day : )
19:19 🔗 Yukundali has quit IRC (Quit: http://chat.efnet.org )
19:23 🔗 Atluxity good guy
19:24 🔗 JesseW has joined #archiveteam
19:26 🔗 aaaaaaaaa has joined #archiveteam
19:26 🔗 swebb sets mode: +o aaaaaaaaa
19:29 🔗 atomotic has joined #archiveteam
19:30 🔗 xk_id has quit IRC (Read error: Connection reset by peer)
19:30 🔗 xk_id has joined #archiveteam
19:59 🔗 aaaaaaaaa Wow, just saw the logs. Nice to have another webmaster show up and be nice, rather than "I BLOCKED YOUR USER AGENT AND BLOCKED YOUR IPS FUCK OFF!"
20:00 🔗 phuzion aaaaaaaaa: Yeah. Have you read the Posterous story or were you involved with it?
20:01 🔗 aaaaaaaaa Yeah, maybe we should give that guy a cake too.
20:01 🔗 aaaaaaaaa at least I think it was cake
20:01 🔗 phuzion Cheesecake, but yeah
20:03 🔗 mr-b has quit IRC (Read error: Operation timed out)
20:15 🔗 wacky_ has quit IRC (Connection closed)
20:17 🔗 mr-b has joined #archiveteam
20:19 🔗 lol_ has joined #archiveteam
20:20 🔗 lol_ has quit IRC (Client Quit)
20:27 🔗 JesseW has quit IRC (Leaving.)
20:35 🔗 Start has joined #archiveteam
20:44 🔗 JesseW has joined #archiveteam
20:46 🔗 Start has quit IRC (Quit: Disconnected.)
20:52 🔗 JesseW has quit IRC (Leaving.)
21:00 🔗 remsen has quit IRC (Read error: Operation timed out)
21:05 🔗 K4k has joined #archiveteam
21:07 🔗 Start has joined #archiveteam
21:09 🔗 WinterFox has joined #archiveteam
21:11 🔗 K4k has quit IRC (WeeChat 1.3)
21:15 🔗 cvb has joined #archiveteam
21:17 🔗 K4k has joined #archiveteam
21:18 🔗 WinterFox has quit IRC (Remote host closed the connection)
21:19 🔗 bwn_ has quit IRC (Read error: Operation timed out)
21:21 🔗 phuzion SketchCow: ping?
21:23 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
21:23 🔗 Atom__ has joined #archiveteam
21:23 🔗 Infreq has quit IRC (Read error: Operation timed out)
21:24 🔗 Infreq has joined #archiveteam
21:25 🔗 RichardG_ has joined #archiveteam
21:26 🔗 Atom-- has quit IRC (Read error: Operation timed out)
21:26 🔗 lukeman has quit IRC (Read error: Operation timed out)
21:27 🔗 lukeman has joined #archiveteam
21:27 🔗 Meeh_ has joined #archiveteam
21:28 🔗 schbirid has quit IRC (Quit: Leaving)
21:28 🔗 RichardG has quit IRC (Read error: Operation timed out)
21:28 🔗 Baljem has joined #archiveteam
21:28 🔗 aliz has quit IRC (Read error: Operation timed out)
21:29 🔗 aliz has joined #archiveteam
21:30 🔗 Baljem_ has quit IRC (Read error: Operation timed out)
21:30 🔗 bwn_ has joined #archiveteam
21:30 🔗 Meeh has quit IRC (Read error: Connection reset by peer)
21:31 🔗 goekesmi_ has joined #archiveteam
21:33 🔗 goekesmi has quit IRC (Ping timeout: 499 seconds)
21:34 🔗 lysobit has quit IRC (Read error: Operation timed out)
21:34 🔗 SadDM has quit IRC (Ping timeout: 499 seconds)
21:34 🔗 midas has quit IRC (Ping timeout: 499 seconds)
21:34 🔗 Nemo_bis has quit IRC (Read error: Operation timed out)
21:37 🔗 lysobit has joined #archiveteam
21:39 🔗 Gfy has quit IRC (Ping timeout: 730 seconds)
21:39 🔗 midas has joined #archiveteam
21:43 🔗 SadDM has joined #archiveteam
21:43 🔗 swebb sets mode: +o SadDM
21:48 🔗 zenguy_pc has quit IRC (Read error: Operation timed out)
21:52 🔗 RichardG_ is now known as RichardG
21:57 🔗 Gfy has joined #archiveteam
22:13 🔗 SketchCow Whut
22:14 🔗 SimpBrain something something google code rsync i think
22:17 🔗 Start has quit IRC (Quit: Disconnected.)
22:24 🔗 Froggypwn has quit IRC (Ping timeout: 310 seconds)
22:25 🔗 Froggypwn has joined #archiveteam
22:26 🔗 icedice has quit IRC (Ping timeout: 360 seconds)
22:28 🔗 SketchCow Yeah, why not just write ping and then walk around with your hands around your ass assuming, what, I'll never look at the IRC channel again
22:28 🔗 SketchCow Or, you know, e-mail
22:28 🔗 * SketchCow is trying to dig out from this mess of a room
22:30 🔗 scyther has quit IRC (Read error: Connection reset by peer)
22:34 🔗 BlueMaxim has joined #archiveteam
22:48 🔗 K4k has quit IRC (Read error: Operation timed out)
23:04 🔗 Atluxity Ping timed out...
23:06 🔗 MrRadar Maybe it's being conveyed via avian carriers? (https://www.ietf.org/rfc/rfc1149.txt)
23:19 🔗 aaaaaaaaa Never underestimate the bandwidth of a flock of avian carriers with USB drives careening through the sky.
23:20 🔗 joepie91 do we have any ongoing AT projects for hardware drivers?
23:29 🔗 Stiletto has quit IRC ()
23:41 🔗 ironman_ has quit IRC (Quit: Connection closed for inactivity)
23:43 🔗 remsen has joined #archiveteam
23:47 🔗 arkiver joepie91: for hardware drivers?
23:53 🔗 joepie91 yes
23:53 🔗 joepie91 drivers
23:53 🔗 joepie91 for hardware
23:53 🔗 joepie91 lol
