[00:12] *** clb92 has joined #archiveteam [00:17] *** xk_id has quit IRC (Remote host closed the connection) [00:24] yes it does and FOS is still set to 25 connections total, I believe. Or at least I don't remember someone saying they raised it. [00:35] *** Ungstein has joined #archiveteam [00:38] *** godane has joined #archiveteam [00:40] *** clb92 has quit IRC () [00:45] *** remsen has quit IRC (Read error: Operation timed out) [00:46] it's at 50 [00:46] the warrior stuff is typically under /chfoo/* and that module has 100 max [01:00] *** kyan has joined #archiveteam [01:13] *** pikhq has quit IRC (Ping timeout: 252 seconds) [01:14] *** espes__ has quit IRC (Read error: Operation timed out) [01:14] *** espes__ has joined #archiveteam [01:18] *** pikhq has joined #archiveteam [01:46] *** JesseW has joined #archiveteam [01:49] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer) [01:50] *** aaaaaaaaa has joined #archiveteam [01:50] *** swebb sets mode: +o aaaaaaaaa [02:00] *** Microguru has joined #archiveteam [02:17] *** primus104 has quit IRC (Leaving.) [02:40] *** JesseW has quit IRC (Leaving.) [02:47] *** JesseW has joined #archiveteam [02:48] *** bwn has quit IRC (Read error: Operation timed out) [03:00] *** icedice has quit IRC (Read error: Connection reset by peer) [03:01] *** icedice has joined #archiveteam [03:17] *** JesseW has quit IRC (Leaving.) 
[04:04] *** JesseW has joined #archiveteam [04:26] *** remsen has joined #archiveteam [04:39] *** aaaaaaaaa has quit IRC (Leaving) [05:00] *** remsen has quit IRC (Leaving) [05:00] *** remsen has joined #archiveteam [05:07] *** Sk1d has quit IRC (Read error: Operation timed out) [05:11] *** z00nx has quit IRC (Quit: WeeChat 1.3) [05:11] *** z00nx has joined #archiveteam [05:12] *** z00nx has quit IRC (Client Quit) [05:13] *** z00nx has joined #archiveteam [05:14] *** Sk1d has joined #archiveteam [05:40] arkiver: I think I found some available storage capasity, and that I can ask to have a few TB allocated to me. maybe one TB for an archivebot instance and two TB for an rsync target [05:46] *** xk_id has joined #archiveteam [05:49] *** WinterFox has joined #archiveteam [06:01] *** JesseW has quit IRC (Leaving.) [06:09] *** xk_id has quit IRC (Remote host closed the connection) [06:10] *** icedice has quit IRC (Ping timeout: 360 seconds) [06:26] *** bwn has joined #archiveteam [06:35] *** xk_id has joined #archiveteam [06:38] *** nightpool has quit IRC (Read error: Operation timed out) [06:40] *** nightpool has joined #archiveteam [06:55] *** redlob_ has joined #archiveteam [06:59] *** redlob has quit IRC (Read error: Operation timed out) [07:02] *** RedType has quit IRC (Read error: Operation timed out) [07:07] *** RedType has joined #archiveteam [07:14] *** xk_id has quit IRC (Remote host closed the connection) [07:20] *** xk_id has joined #archiveteam [07:39] *** nightpool has quit IRC (Read error: Operation timed out) [07:39] *** xk_id has quit IRC (Remote host closed the connection) [07:48] *** sivoais has quit IRC (Read error: Operation timed out) [07:49] *** nightpool has joined #archiveteam [07:49] *** sivoais has joined #archiveteam [07:50] *** RedType has quit IRC (Read error: Operation timed out) [07:52] *** RedType has joined #archiveteam [07:53] *** icedice has joined #archiveteam [07:53] Can someone voice me at #archivebot ? 
[07:54] *** nightpool has quit IRC (Read error: Operation timed out) [07:58] *** icedice has quit IRC (Quit: Page closed) [08:06] *** bwn_ has joined #archiveteam [08:06] Soc currently we are at 50 items/min for docstoc [08:07] We need to get that up to at least 200 items/min [08:07] Right now the site has no slowdowns due to our grab, so it's holding up fine [08:08] Atlucity: do you think you can get more concurrent online? [08:08] sure [08:08] Atlucity: For docstoc 2 TB would be too low I think [08:08] right [08:09] thanks [08:10] *** icedice has joined #archiveteam [08:11] Oh, docstoc's going? I'd be delighted to to start some warriors, I'll try to figure out how [08:12] *** bwn has quit IRC (Ping timeout: 606 seconds) [08:13] arkiver: By the way, is it better to run the scripts without warriors? [08:13] whatever is easier for you [08:13] Also, how much concurrency is recommended per IP? [08:13] It looks like docstoc is not banning IPs, so I'd say that depends on your hardware [08:14] Yay! [08:14] also, max concurrent on warrior is 6 and for scripts 20 [08:14] Ok, my vps can do 16 concurrent grab-site grabs, but I'm assuming these scripts are less cpu intensive [08:15] They might be [08:15] arkiver: 490 more concurrent started. 10 x 49 hosts [08:15] Atluxity: nice! [08:16] That should around double what we have now :) [08:16] Wow, I feel insignificant :3 [08:16] no [08:16] I was already running that [08:17] mmm.. I should investage what my bottleneck is [08:17] *** sivoais has quit IRC (Read error: Operation timed out) [08:17] kyan, can you voice me? [08:17] (on #archivebot) [08:17] Project code is out of date and needs to be upgraded. [08:17] of course [08:18] icedice: No, sorry, I don't have ops there [08:18] Ok [08:19] arkiver: I was running old code.... :\ [08:22] updated, starting up 490 to see first [08:27] *** primus104 has joined #archiveteam [08:28] *** sivoais has joined #archiveteam [08:29] arkiver: Do I touch STOP to stop it gracefully? 
Is it possible to change the concurrency after starting it, or is it necessary to stop it and restart? [08:29] yes, touch STOP [08:29] then it will log that it has noticed the stop file [08:29] then you can rm STOP [08:30] then start new [08:30] Awesome, thanks! :D [08:30] (sorry to bother you arkiver) [08:31] arkiver: let me know if you want more concurrents. [08:34] Atluxity: items/min has just more then doubled! [08:34] We might need even more then this, but let's see how the site and FOS hold up for now [08:34] kyan: every bit helps! [08:34] yeah, its better to do this a bit slow [08:35] Getting an error [08:35] http://pastebin.com/xDBc31kV [08:35] any thoughts on how to fix it? Thanks :) [08:35] *** nightpool has joined #archiveteam [08:35] https://github.com/ArchiveTeam/docstoc-grab see below for wget.pod problems [08:36] *** atomotic has joined #archiveteam [08:36] *** schbirid has joined #archiveteam [08:36] * kyan rtfms [08:37] "If anything goes wrong while running the commands below, please scroll down to the bottom of this page. There's troubleshooting information there." ...D'oh. [08:38] *** godane has quit IRC (Quit: Leaving.) [08:39] *** godane has joined #archiveteam [08:39] *** nightpool has quit IRC (Read error: Operation timed out) [08:49] *** Elegance has quit IRC (Read error: Connection reset by peer) [08:50] *** xk_id has joined #archiveteam [08:52] *** Elegance has joined #archiveteam [09:03] FWIW, the docstoc scripts are WAY easier on the CPU than grab-site. http://pastebin.com/pQx8XyWe [09:06] arkiver: I'm seeing some 403s like 127=403 http://img.docstoccdn.com/thumb/orig/19463956.png. Is this normal? [09:07] (same thing happens in a browser.) 
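Note on the 08:29 exchange about stopping gracefully with `touch STOP`: a minimal sketch of how a stop-file check of this kind can work, assuming a worker loop that looks for a file named STOP before it claims each new item. The names `stop_requested` and `run_worker` are hypothetical, and this is not the actual pipeline code; it only illustrates why items already in progress finish while no new ones are started.

```python
import os
from typing import Callable, Iterable, List


def stop_requested(workdir: str, stop_name: str = "STOP") -> bool:
    """Return True if the stop file exists in the working directory."""
    return os.path.exists(os.path.join(workdir, stop_name))


def run_worker(workdir: str,
               items: Iterable[int],
               process_item: Callable[[int], None]) -> List[int]:
    """Process items one by one, checking for the stop file before each.

    The check happens only between items, so an item that has already
    started is always allowed to finish (the "graceful" part).
    """
    done: List[int] = []
    for item in items:
        if stop_requested(workdir):
            print("Stop file detected, not starting new items.")
            break
        process_item(item)
        done.append(item)
    return done
```

As in the chat, the stop file has to be removed (`rm STOP`) before a new run, otherwise the next run exits before taking its first item.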
[09:12] Also 160=500 http://embed.docstoc.com/Errors/Errors.aspx?aspxerrorpath=/Pages/Documents/Browse/BrowseDocuments.aspx [09:13] (Like this http://embed.docstoc.com/handlers/downloadfilefromflash.ashx?docid=2489658) [09:26] arkiver: I started 5 x 49 more [09:28] *** MMovie1 has joined #archiveteam [09:29] *** xk_id has quit IRC (Remote host closed the connection) [09:31] *** MMovie has quit IRC (Ping timeout: 310 seconds) [09:33] more items out at least [09:34] slight increas in items/hour [09:34] will wait for feedback regarding FOS health [09:34] *** bwn_ has quit IRC (Read error: Operation timed out) [09:38] Should I run 2 copies of run-pipeline to go over 20 concurrency? 20 seems to be going as fine as it ever was [09:38] I am not sure [09:39] Ok, thanks :) [09:42] And by the way, thank you a lot to whoever wrote the scripts (arkiver?) to save Docstoc. :) [09:52] *** icedice has quit IRC (Ping timeout: 240 seconds) [10:06] *** arkiver2 has joined #archiveteam [10:07] *** bwn_ has joined #archiveteam [10:12] *** arkiver2 has quit IRC (Ping timeout: 252 seconds) [10:13] *** Ungstein1 has joined #archiveteam [10:14] *** Ungstein has quit IRC (Ping timeout: 252 seconds) [10:18] *** philpem has quit IRC (Ping timeout: 252 seconds) [10:18] Also docstoc companies: http://www.expertcircle.com/ http://www.license123.com/ [10:19] *** marvinw is now known as ivan` [10:21] kyan: you can use half as much cpu with --wpull-args=--html-parser=libxml2-lxml but it will be more prone to segfaulting [10:21] Huh, cool, thanks! I'll probably leave it the way it is, I'd rather not have to pay much attention to it :P [10:21] also a normal xeon server will be about 3x faster than the weird atom server that online.net sells [10:22] Hmm, well I already paid the 20eur setup fee for the online.net one :P [10:22] heh [10:22] I snagged a limited edition server @ $30/mo with xeon/4TB storage/32GB a few months back [10:22] nice! 
[10:23] This is like $20 a month, which is a LOT of money for me [10:23] so I'm not planning to upgrade... :P [10:26] Also, the server is generally faster than my fastest computer at home (8 cores, 2.4GHz vs. 2 cores, 2.26GHz) so to me it seems... pretty damn fast. [10:27] *** SmileyG has quit IRC (Quit: http://www.milkme.co.uk - You'll never understand.) [10:28] *** Smiley has joined #archiveteam [10:30] *** philpem has joined #archiveteam [10:33] *** Smiley has quit IRC (Quit: http://www.milkme.co.uk - You'll never understand.) [10:34] *** Smiley has joined #archiveteam [10:39] *** kemi has joined #archiveteam [10:47] *** Microguru has quit IRC (Remote host closed the connection) [10:55] Hi, dunno if it's the right place to say it but wat.tv is closing, I thought that might interest people here [11:00] *** xk_id has joined #archiveteam [11:01] kemi: Good to know, thanks for letting us know! [11:16] *** xk_id has quit IRC (Read error: Operation timed out) [11:18] *** nightpool has joined #archiveteam [11:25] *** nightpool has quit IRC (Read error: Operation timed out) [11:43] arkiver: Seems like the rsync upload is having problems: "@ERROR: max connections (100) reached -- try again later" [11:43] Yes [11:43] We're going to get a target from Kenshin! [11:44] Yay! :D [11:58] We have a target from Kenshin! [12:00] Extra yay! :D :D [12:00] does this mean change in code? how is workers notifed about targets? [12:01] workers request a target from the tracker [12:02] ah, nice [12:02] seemless change for us then [12:05] *** WinterFox has quit IRC (Read error: Operation timed out) [12:05] awh yeah... look at it go [12:10] *** xk_id has joined #archiveteam [12:12] *** nightpool has joined #archiveteam [12:12] we're now at 130 items/min [12:14] (13000 document IDs per minute, I think that works out to! DOPE!) [12:14] * kyan has fond memories of Docstoc's promotional emails showing up every now and then... 
[12:16] *** xk_id has quit IRC (Read error: Operation timed out) [12:17] *** nightpool has quit IRC (Ping timeout: 258 seconds) [12:19] *** icedice has joined #archiveteam [12:34] Atluxity: do you think you can get more concurrent on docstoc? or maybe move the concurrent on yuku over to docstoc for now? [12:39] Atluxity: nevermind about that [12:42] *** BlueMaxim has quit IRC (Read error: Connection reset by peer) [12:43] kemi: any more info on wat.tv [12:43] ? [12:45] it's owned by TF1, they're going to keep their videos on a new service but users-uploaded ones are gonna disappear on February, 17th 2016 [12:45] http://www.numerama.com/business/132400-tf1-ferme-wat-tv.html here some link if you can read French [12:46] I can't speek french [12:46] How do you see if a video is user uploaded or by TF1? [12:47] would this be user uploaded? http://www.wat.tv/video/oggi-pioggia-davide-esposito-7ixcx_7bqlx_.html [12:47] arkiver: nvm? [12:47] ok [12:47] Atluxity: looks like docstoc is now slowing down a bit [12:47] So might be good to not put more pressure on it [12:47] yeah, seem to be we are pushing the limit of the target [12:47] yeah [12:48] the head of the cloud I am beta testing was just in and complemented my network load generation [12:48] haha [12:49] it will go even higher when you go on google code too ;) [12:49] yeah, I told him [12:49] he warned me this arrangment would need to be more gentle from January on [12:50] this is not beta forever :) [13:16] *** xk_id has joined #archiveteam [13:22] *** xk_id has quit IRC (Read error: Operation timed out) [13:34] *** bwn_ has quit IRC (Read error: Connection reset by peer) [13:34] *** bwn_ has joined #archiveteam [13:53] *** luckcolor has joined #archiveteam [13:55] hello guys [13:56] Atluxity: are you the one pushing traffic from basefarm? [13:56] basefarm? [13:56] Yes [13:56] I don't know what that is [13:57] Atluxity: can you get me peering with them over amsix? 
[13:58] I can try [13:58] ah sorry kenshin i haven't saw that you were talking to Atluxity [13:58] or should i just poke at their peering email [13:58] luckcolor: no worries [13:59] anyway i was just about to say that on my worrior i have two tiems that are running by 11 hourse [13:59] *hours [13:59] and seem to be stuck in a url redirect loop [14:00] Atluxity: i dropped them a mail from AS24482. if you know someone in basefarm that can help, would be appreciated [14:00] you're doing 400M or so of traffic, would be nice if it could be done over peering links [14:01] that's a lot of traffic [14:01] here a sample of the urls [14:01] 37966=200 http://embed.docstoc.com/handlers/downloadfilefromflash.ashx?docid=6448510&ref_url=http://www.docstoc.com/docs/6448510/icons/core/icons/core/icons/sap/page/images/icons/core/icons/sap/page/images/icons/sap/page/NI.gif. [14:01] 37967=200 http://embed.docstoc.com/handlers/downloadfilefromflash.ashx?docid=6448510&ref_url=http://www.docstoc.com/docs/6448510/icons/core/icons/core/icons/sap/page/images/icons/core/icons/sap/page/images/icons/sap/page/PI.gif. [14:01] trs80: not really. you used to do like 1G remember? lol [14:01] luckcolor: yeah i think you're hitting repeated URLs [14:02] yeah [14:02] luckcolor: talk to arkiver. he's managing the code [14:02] strange [14:02] there he is [14:03] I'll have a fix in in a bit [14:04] for now I paused the grab, I want to take some load off of docstoc [14:04] please keep the concurrent running though! [14:04] kenshin: aarnet came in months later for our yearly review, and were like "you did a lot of extra traffic in august, any reason why?" [14:05] trs80: heh, what did you use as an excuse? [14:05] it wasn't until after the meeting that I remembered that's when I was an rsync host [14:05] kenshin: just shrugged my shoulders [14:05] it worked? [14:05] yeah. it wasn't a bad review, more like "here's how you're using your connection, what can we do for you?" 
sort of thing [14:05] i assume your university pays them? [14:05] and they showed us monthly usage graphs [14:06] i remember aarnet was very anal with their mirror server usage [14:06] school, yeah we pay a pretty cheap rate for all we can eat effectively [14:06] but then again i don't blame them, AU bandwidth is expensive as hell [14:06] in theory there's a 10TB/student/year usage limit during business hours [14:06] but we don't come close to that [14:06] ic, so even the rsync burst wasn't that big an issue [14:07] k12 schools get a much better deal than unis [14:07] as long as you don't get into trouble i guess, having you as a backup rsync is kinda important [14:07] yeah, it was fine, just unusual for our traffic to increase so much for a limited period of time [14:08] I don't have a huge amount of space atm, but that's mostly because I've got a few TB of internetarchive.bak [14:08] i guess when push comes to shove, rsync data is probably more important? [14:09] yeah, ia.bak seems a bit stagnant atm, and it can always be re-downloaded [14:09] sadly for huge projects, rsync is like playing musical chairs [14:09] here we go, metered traffic went from ~100GB to 3TB in august 2014 [14:09] ... [14:09] woops. [14:09] lol [14:10] unmetered traffic was ~18TB then, and 24TB in august this year [14:10] *** remsen has quit IRC (Read error: Operation timed out) [14:11] we did have a lot of crazy projects in the last 2 years [14:11] was it twitch? [14:11] last year was twitch [14:11] this year was bliptv [14:11] god damn if we have another video site. [14:12] most of our traffic over aarnet isn't metered as they have excellent peering [14:12] yeah they have SG,HK and LAX iirc [14:12] kenshin: 3h ago: Hi, dunno if it's the right place to say it but wat.tv is closing, I thought that might interest people here [14:12] ... f*** [14:13] Kenshin: when did you drop them an email, and how long ago? [14:14] wait... 
[14:14] 15 minutes ago [14:14] when and to what adress [14:14] peering@basefarm.no [14:14] aha [14:14] cool, will talk to them now [14:14] nice thanks [14:16] wat.tv closes feb 17 2016 fwiw [14:17] *** xk_id has joined #archiveteam [14:18] Kenshin: poked the propper techy, gave some odd answeres about what this was about, and he could see no reason for not peering [14:18] he would try to get it done today or tomorrow morning [14:19] Atluxity: cool thanks [14:21] guys also there's a silly problem with my nickname [14:21] the server is probably cutting a letter [14:21] the last one [14:21] -_- [14:21] should end with s [14:21] *** Ymgve has quit IRC (Read error: Connection reset by peer) [14:22] *** Ymgve has joined #archiveteam [14:25] *** xk_id has quit IRC (Read error: Operation timed out) [14:34] luckcolor: on this chat? [14:34] do you get an error message if you try the command /nick luckcolors ? [14:35] atluxity in general i get this nickname [14:35] the command doesnt return anything [14:36] and my nick is still the same [14:36] -__ [14:36] -_- [14:37] guess the server does not support such long nicknames [14:38] I see no one here with more than 9 chars in their nick [14:43] uh oh [14:43] Kenshin: docstoc items per hour is dropping :\ [14:43] is it you or me? [14:43] tracker rate limiting, by the lok of it [14:43] yeah [14:44] arkiver said he didn't want to tax docstoc too much [14:45] ah, he said at 15:04 right [14:57] So, we don't need another 5000 threads or anything like that? 
[15:02] Doesn't look to me like the tracker is handing out any new items at all right now [15:04] antomatic: yeah, looks like arkiver paused the grab about an hour ago [15:15] yeah he did [15:21] *** xk_id has joined #archiveteam [15:25] there's some issue with the code that luckcolor found, so i think arkiver is working on it [15:25] give him some time [15:26] phuzion: no, i think Atluxity threw in a LOT of resources already [15:26] so best not to add anymore strain to docstoc [15:26] i think google code is coming up soon as well, that one can be raped [15:27] Atluxity: basefarm got back, will peer tomorrow, thanks! [15:29] *** ozlo has quit IRC (Quit: If only I was sure that my head on the door was a dream...) [15:30] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) [15:30] *** xk_id has quit IRC (Read error: Operation timed out) [15:36] *** luckcolor has quit IRC (Read error: Operation timed out) [15:48] *** luckcolor has joined #archiveteam [15:50] *** Ghost_of_ has joined #archiveteam [15:55] *** primus104 has quit IRC (Leaving.) [16:14] tracker restarted [16:14] The websites really slowed down a lot. Sometimes it's best to totally take off the load from it and build it up slowly again [16:20] *** JesseW has joined #archiveteam [16:44] *** Start has quit IRC (Quit: Disconnected.) [16:44] *** JesseW has quit IRC (Leaving.) [16:47] *** scyther has joined #archiveteam [16:52] *** Ghost_of_ has quit IRC (Quit: Leaving) [16:53] *** xk_id has joined #archiveteam [16:56] Are we getting banned by cloudfront? 
https://gist.github.com/anonymous/788d8f001402b1e45a72 [16:57] *** luckcolor has quit IRC (Read error: Connection reset by peer) [16:57] phuzion, it access denies with a 403 when I hit it from my other server over HTTP [16:57] Ok [16:58] you try some of them [16:59] I get a 403 access denied on those links as well [16:59] Same [17:00] On my domestic connection too [17:16] Some links seem to be gone [17:17] they return 403 [17:17] if you look at logs from other items you'll also see many 200's [17:20] I'm going to start the Google Code project in a bit [17:21] ok, still waiting for Kimsufi stock, but will get on it asap [17:21] ok! [17:21] I'm not going to make google code the default warrior project yet [17:22] later tonight, ill see if Scaleway have capacity, and if they do I will start on the ScaleArchiver [17:29] arkiver: I should be able to throw about 50 DO instances at google code [17:34] nice [17:36] I just need to figure out the automation part of my droplet creation. [17:36] phuzion, I was going to say about that [17:37] HCross: Do you have any ideas on how to do that? [17:37] some form of batch script? [17:38] Ansible supports DO, but I don't have the right version of ansible [17:38] I am working on something for Scaleway at sometime, clicking order on the servers are fine, as there are a max of 10 [17:40] Take allok at SaltStack [17:40] a look at [17:42] I'm looking at something called tugboat right now [17:42] https://github.com/pearkes/tugboat [17:55] arkiver: Regarding the loop-y problem, is there somehing I should do to abort jobs that are already going and have that problem? Or will they fix themselves eventually? 
URL example http://embed.docstoc.com/handlers/downloadfilefromflash.ashx?docid=109$830&ref_url=http://www.docstoc.com/docs/common/common/common/common/common/comm$n/common/common/common/common/common/common/common/common/common/common/common/c [17:55] ommon/common/common/common/common/common/common/common/common/common/common/com$ [17:55] on/common/common/common/common/common/common/common/common/common/common/common$ [17:55] common/common/common/common/common/common/common/common/common/common/common/co$ [17:55] mon/common/common/common/common/common/common/common/common/common/common/commo$ [17:55] Wow. Anyway, that, ish. [17:56] right. will have a fix up in a bit [17:56] Ok, cool, sorry to bother you :) [17:57] *** Start has joined #archiveteam [17:59] *** primus104 has joined #archiveteam [18:03] items added to googlecode grab! [18:04] *** remsen has joined #archiveteam [18:07] arkiver: What do you think is a reasonable amount of concurrent per IP? 2? 10? [18:07] (I'm using small DO instances, keep in mind) [18:08] what is the CPU backend? [18:08] CPU backend? [18:08] On the DO instances? [18:08] because if I come along with my E5 server then ofc it smore [18:11] it can be hard to find out an URL is in a loop, but I'm trying to do it with this https://github.com/ArchiveTeam/docstoc-grab/blob/master/docstoc.lua#L120-L135 [18:11] So scripts for docstoc are updated [18:12] Atluxity: please let me know when you have updated your scripts and I'll set the new version in the tracker [18:12] arkiver: New version of the script is available but not required, is what you're saying? [18:12] We're making good progress and I don't want to interrupt that [18:13] Right [18:13] Basically waiting an extra hour shouldn't matter too much [18:13] Do you think 2 concurrent threads per instance is good, or can I bump that up? [18:13] and in that hours soe extra thousand documents can be saved [18:13] for google code? 
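Note on the looping URLs discussed above (17:55 and 18:11): a rough sketch, in Python rather than the Lua used by the grab scripts, of one possible heuristic for spotting such URLs, namely counting how often the same slash-separated segment repeats. The function name `looks_like_loop` and the threshold of 3 are assumptions for illustration only; this is not the logic in docstoc.lua.

```python
from collections import Counter


def looks_like_loop(url: str, max_repeats: int = 3) -> bool:
    """Heuristic: flag a URL if any slash-separated segment repeats too often.

    The whole URL is split, not just the path, because in the examples
    from the log the repetition sits inside the ref_url query parameter.
    """
    segments = [s for s in url.split("/") if s]
    if not segments:
        return False
    # Count of the single most frequent segment.
    most_common_count = Counter(segments).most_common(1)[0][1]
    return most_common_count > max_repeats
```

A heuristic like this can give false positives on sites that legitimately repeat a directory name, which fits the remark in the chat that it can be hard to tell when a URL is in a loop.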
[18:13] Yeah [18:13] maybe [18:13] I honestly have no idea [18:13] I don't think google bans IPs [18:14] But these projects sometimes have a few hundred thousand URLs [18:14] And size might also be big for some items [18:14] 2 it is then :) [18:14] Yes, I guess you can always up the limit if the machines can handle more [18:18] arkiver: googlecode-grab is updated [18:19] oh, docstoc [18:19] right [18:23] Instances are being updated as we speak. I should be coming online with 100 simultaneous threads within about 10 minutes or so. [18:23] arkiver: docstock updated [18:26] wondering about the prossess, when I upload stuff to rsync target, is it 1:1 with what was downloaded or is it compressed? [18:27] Atluxity: You talking about the warc that seesaw creates? [18:27] probably [18:28] because that's run through gzip before being sent off to the rsync target. [18:28] thats odd [18:28] basically, seesaw mirrors the content (and metadata) into a WARC file, then gzips the WARC, then rsyncs it to FOS or whoever the rsync target is. [18:28] my beta-cloud manager was wondering why I created so much outbound traffic, and not so much inbound [18:29] but it might be someone else doing stuff (tm) [18:30] Anonymous kills ISIS darknet site: http://www.dailydot.com/politics/isis-tor-hidden-service-down/ [18:30] I think that was the one SketchCow threw into ArchiveBot last week [18:33] it was http://archive.fart.website/archivebot/viewer/job/hjbkk [18:33] *** Yukundali has joined #archiveteam [18:33] salutations all [18:34] greetings [18:34] Someone has a bad script scraping our site, just wanted to come and greet before requesting the script to stop [18:34] cool [18:34] what site? [18:34] Yuku [18:34] ah, yes [18:35] it is shutting down, no? 
[18:35] http://pastebin.com/g40vDhUE [18:35] the requests are malformed [18:35] and its chocking our memcache servers [18:35] we run a very archaic infrastructure [18:35] oh good it's not ArchiveBot [18:35] arkiver: poke [18:36] Yukundali: we should fix that.... [18:37] Yukundali: this is the status of our scraping, if you had not found it already http://tracker.archiveteam.org/yuku/ [18:37] I'll stop my workers [18:37] impressive [18:38] there, can take a little while, but my workers will not ask for more work, just finish their ongoing items [18:38] sadly I have no control over the code we are running [18:38] its ok, I already blocked them [18:39] but I might want to join you guys, I love scraping data [18:39] sounds great [18:39] adbrite is dead btw [18:39] *** Start has quit IRC (Quit: Disconnected.) [18:39] Yukundali: Would you mind helping us out by suggesting how we can improve the quality of the requests? [18:39] they went out of business years ago [18:40] I would suggest throttling based on responce time... or TTFB [18:41] Someone generally manually adjusts the rate of requests based on what we are observing as response times. [18:41] But you said that the requests are malformed? [18:41] Take a look at the pastebin he linked [18:41] The same substring is repeated in the URL dozens of times [18:41] Oh hah wow [18:42] Ok yeah, I can see memcache freaking out about that. [18:43] arkiver: ping? [18:45] Will the Yuku jobs that were blocked due to the issue be detected as failed and retried? Wouldn't want to miss stuff... [18:46] (but wow those requests, those are long urls) [18:46] Yukundali: hi [18:47] Yukundali, as second on the list, I want to say sorry for any server melting that we might have done [18:47] Yukundali: tracker is paused. 
load should be 0 from our side in a bit [18:47] it normally wouldn't be a problem, but as you may know, Yuku doesn't have the most stable infrastructure [18:47] Yeah, that's why we are grabbing your website [18:47] lmao [18:47] seen worse.... [18:47] I also knnow that we are grabbing the individual posts, which is on purpose [18:48] (which could be seen as malicious) [18:48] arkiver: It looks like similar problem to the loop on docstoc: http://pastebin.com/g40vDhUE [18:48] But yeah, Yukundali, if we fix the malformed requests, would that help things out a little bit and make your life easier? [18:48] yes, ohh... and maybe throttle it down a bit [18:49] like I said, we run memcache servers... which are horrible... so they can't take too much [18:49] I'm hoping to get couchbase running soon, we are also hiring a team to help with infrastructure... so yuku will see better days [18:49] but until then, be gentile... she is old [18:49] ok [18:50] none-the-less, I'm all for what you guys are doing [18:50] and honestly quite impressed [18:50] I'll lower the limit, please let me know what you think [18:50] thanks! [18:50] Little thing though, maybe it'd be good to set some other status code then 200 next time [18:51] chfoo: can you please send me the logs of yuku? [18:51] arkiver: any ETA on opening up google code? I've got my DO instances spun and ready to go. [18:51] If their infrastructure is havig as much trouble as it sounds like it is from some of the threads, I think incorrect status codes are a small worry :P [18:51] what do you need? I can give it to ya [18:52] Google Code project is started. [18:52] ok, I'm going to lift my ban on ArchiveTeam bot [18:52] Yukundali, do you have any sort of whole site backup that could be handed over? 
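Note on the suggestion above to throttle based on response time or TTFB: a minimal sketch of that idea, assuming the delay between requests is doubled whenever a response is slow and eased back gradually when responses are fast. The class name `AdaptiveDelay` and all thresholds are hypothetical; according to the chat, the request rate for these grabs was adjusted manually, not by code like this.

```python
class AdaptiveDelay:
    """Track a per-request delay that grows when the server slows down."""

    def __init__(self, min_delay: float = 0.5, max_delay: float = 60.0,
                 slow_threshold: float = 2.0) -> None:
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.slow_threshold = slow_threshold
        self.delay = min_delay

    def update(self, response_seconds: float) -> float:
        """Feed in the last response time and return the next delay to use."""
        if response_seconds > self.slow_threshold:
            # Server is struggling: back off quickly, up to the cap.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Server is responsive: recover slowly, down to the floor.
            self.delay = max(self.delay * 0.9, self.min_delay)
        return self.delay
```

A caller would sleep for the returned delay before its next request. Backing off faster than recovering is a deliberate choice here, in line with the earlier remark that it is sometimes best to take the load off a site and build it up slowly again.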
[18:52] thanks [18:52] arkiver, congrats [18:53] we have replicated data and backups, but nothing to just hand over [18:53] yeah [18:53] but [18:53] I can give you a secret [18:53] I'd rather get the data thrugh http [18:53] ooooh [18:53] we use Mobique as an api to integrate into tapatalk [18:53] all our data is accessable there [18:53] w/o the html [18:54] * mobiequo [18:54] Yukundali: do you know what we do with the data we grab? [18:54] no [18:55] I would assume archive it [18:55] It's added to the Internet Archive's Wayback Machine [18:55] Yeah, after that it's all made public [18:55] Google Code: Process RsyncUpload returned exit code 12 for Item project:test-mysql-project [18:55] So having browsable web pages is important [18:55] You might have heard of the Wayback Machine [18:55] yea [18:55] Would be good to get the back-end data too if that's an option [18:55] Everything we grab goes into the wayback machine [18:55] IMO [18:55] ohh really?! [18:55] yeah [18:55] well blow me down ... lol [18:55] Just for the sake of having everything [18:56] ok, then please have at [18:56] I'll give you some examples [18:56] archive.org has saved our buts numerous times [18:56] We've got a special arrangement with them, thanks to our fearless leader SketchCow [18:56] (Whom you may also know as Jason Scott of textfiles.com) [18:57] We have currently saved 682 GB from yuku [18:58] It's not all uploaded yet, some of it is here https://archive.org/search.php?query=mediatype%3A%22web%22%20AND%20%28yuku%29 [18:58] Our current projects are here http://tracker.archiveteam.org/ [18:59] with for example docstoc and google code [18:59] phuzion: looks like the rsync target is removed... :/ [18:59] chfoo: can you please recreate the rsync target for googlecode? 
[19:00] poke me when I can fire up yuku-grab again [19:00] ok [19:00] we lost an advertiser due to "Excessive non-human traffic" : / [19:00] Atluxity: I'll first have to go through the logs since we have some bad 200 items [19:01] I suspect there will be some work, yes [19:01] could you restart the scrape slowly pls [19:01] we are negotiating with them now [19:01] Thats not good [19:01] sorry to hear it [19:01] I'll first figure out what exactly we are grabbing from advertisers [19:02] your advertiser clearly isn't prepared for the Singularity [19:02] lol [19:02] Yes, let us know what to block to avoid tripping their detection [19:03] Yukundali: dod you unblock us? [19:03] did* [19:04] Yes, the new code is rolling out to our webservers now [19:04] awesome [19:09] Yukundali: what domain does your advertiser use for the advertisements? [19:10] we have a lot. The new ad rules coming out will significantly hurt our business model however ( advertisers are kicking out our subdomains ) [19:10] we have hundreds of domains [19:11] mostly : lefora.com, yuku.com, freeforums.org, forumer.com [19:12] I'm not very sure how all the advertising views work [19:12] Is there some kind of image that is loaded for the advertisers? [19:12] or something from one of the advertisers' domains? [19:13] or can they directly see that a page on *.yuku.com is downloaded? [19:13] we have multiple points of detection, from inside our code, to pixels, to javascripts [19:14] This is a partial log I just grabbed from yuku: http://paste.nerds.io/raw/qeqoqenitu [19:15] Do you see somethinig that'd have to be blocked so advertisers don't see us/ [19:15] ? 
[19:15] thats a private board [19:16] but no I don't see anything [19:16] I'll talk with advertisers more in depth and return to this channel with a better answer [19:17] I'm not sure what you mean by private, but I'm just seeing a normal forum, not much private here http://camgirlnotes.fr.yuku.com/topic/675/ [19:17] Yukundali: ok [19:18] and thanks for contacting us! :) [19:19] ahh, i only tested the private board links on the scrape... [19:19] thanks for being understanding, I'll return some day : ) [19:19] *** Yukundali has quit IRC (Quit: http://chat.efnet.org ) [19:23] good guy [19:24] *** JesseW has joined #archiveteam [19:26] *** aaaaaaaaa has joined #archiveteam [19:26] *** swebb sets mode: +o aaaaaaaaa [19:29] *** atomotic has joined #archiveteam [19:30] *** xk_id has quit IRC (Read error: Connection reset by peer) [19:30] *** xk_id has joined #archiveteam [19:59] Wow, just saw the logs. Nice to have another webmaster show up and be nice. Rather than, "I BLOCKED YOUR USER AGENT AND BLOCKED YOUR IPS FUCK OFF!" [20:00] aaaaaaaaa: Yeah. Have you read the Posterous story or were you involved with it? [20:01] Yeah, maybe we should give that guy a cake too. [20:01] at least I think it was cake [20:01] Cheesecake, but yeah [20:03] *** mr-b has quit IRC (Read error: Operation timed out) [20:15] *** wacky_ has quit IRC (Connection closed) [20:17] *** mr-b has joined #archiveteam [20:19] *** lol_ has joined #archiveteam [20:20] *** lol_ has quit IRC (Client Quit) [20:27] *** JesseW has quit IRC (Leaving.) [20:35] *** Start has joined #archiveteam [20:44] *** JesseW has joined #archiveteam [20:46] *** Start has quit IRC (Quit: Disconnected.) [20:52] *** JesseW has quit IRC (Leaving.) 
[21:00] *** remsen has quit IRC (Read error: Operation timed out) [21:05] *** K4k has joined #archiveteam [21:07] *** Start has joined #archiveteam [21:09] *** WinterFox has joined #archiveteam [21:11] *** K4k has quit IRC (WeeChat 1.3) [21:15] *** cvb has joined #archiveteam [21:17] *** K4k has joined #archiveteam [21:18] *** WinterFox has quit IRC (Remote host closed the connection) [21:19] *** bwn_ has quit IRC (Read error: Operation timed out) [21:21] SketchCow: ping? [21:23] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) [21:23] *** Atom__ has joined #archiveteam [21:23] *** Infreq has quit IRC (Read error: Operation timed out) [21:24] *** Infreq has joined #archiveteam [21:25] *** RichardG_ has joined #archiveteam [21:26] *** Atom-- has quit IRC (Read error: Operation timed out) [21:26] *** lukeman has quit IRC (Read error: Operation timed out) [21:27] *** lukeman has joined #archiveteam [21:27] *** Meeh_ has joined #archiveteam [21:28] *** schbirid has quit IRC (Quit: Leaving) [21:28] *** RichardG has quit IRC (Read error: Operation timed out) [21:28] *** Baljem has joined #archiveteam [21:28] *** aliz has quit IRC (Read error: Operation timed out) [21:29] *** aliz has joined #archiveteam [21:30] *** Baljem_ has quit IRC (Read error: Operation timed out) [21:30] *** bwn_ has joined #archiveteam [21:30] *** Meeh has quit IRC (Read error: Connection reset by peer) [21:31] *** goekesmi_ has joined #archiveteam [21:33] *** goekesmi has quit IRC (Ping timeout: 499 seconds) [21:34] *** lysobit has quit IRC (Read error: Operation timed out) [21:34] *** SadDM has quit IRC (Ping timeout: 499 seconds) [21:34] *** midas has quit IRC (Ping timeout: 499 seconds) [21:34] *** Nemo_bis has quit IRC (Read error: Operation timed out) [21:37] *** lysobit has joined #archiveteam [21:39] *** Gfy has quit IRC (Ping timeout: 730 seconds) [21:39] *** midas has joined #archiveteam [21:43] *** SadDM has joined #archiveteam [21:43] *** swebb sets mode: +o 
SadDM [21:48] *** zenguy_pc has quit IRC (Read error: Operation timed out) [21:52] *** RichardG_ is now known as RichardG [21:57] *** Gfy has joined #archiveteam [22:13] Whut [22:14] something something google code rsync i think [22:17] *** Start has quit IRC (Quit: Disconnected.) [22:24] *** Froggypwn has quit IRC (Ping timeout: 310 seconds) [22:25] *** Froggypwn has joined #archiveteam [22:26] *** icedice has quit IRC (Ping timeout: 360 seconds) [22:28] Yeah, why not just write ping and then walk around with your hands around your ass assuming, what, I'll never look at the IRC channel again [22:28] Or, you know, e-mail [22:28] * SketchCow is trying to dig out from this mess of a room [22:30] *** scyther has quit IRC (Read error: Connection reset by peer) [22:34] *** BlueMaxim has joined #archiveteam [22:48] *** K4k has quit IRC (Read error: Operation timed out) [23:04] Ping timed out... [23:06] Maybe it's being conveyed via avian carriers? (https://www.ietf.org/rfc/rfc1149.txt) [23:19] Never underestimate the bandwidth of a flock of avian carriers with USB drives careening through the sky. [23:20] do we have any ongoing AT projects for hardware drivers? [23:29] *** Stiletto has quit IRC () [23:41] *** ironman_ has quit IRC (Quit: Connection closed for inactivity) [23:43] *** remsen has joined #archiveteam [23:47] joepie91: for hardware drivers? [23:53] yes [23:53] drivers [23:53] for hardware [23:53] lol