[02:19] following up on some words in #archivebot here
[02:20] I used to be very much in the "yes we grab everything" position
[02:20] I've put a slightly finer point on it recently
[02:20] not sure what I mean exactly, putting this out in case someone wishes to discuss
[04:18] i'm starting to upload more funny or die videos and hp manuals
[09:14] https://business.twitter.com/en-gb/products/pricing -> You’ll only be charged when people follow your Promoted Account or retweet, reply, favourite or click on your Promoted Tweets. You’ll never be charged for your organic activity on Twitter.
[09:15] * schbirid has been replying to each and every promoted tweet since i found that
[09:21] schbirid: for some reason i think it would be awesome to combine https://twitter.com/markovs with promoted tweets
[09:23] ooooh hohoho >:D
[12:50] xmc: I hear you re: archivebot... we've thrown some HUGE stuff at it without much thought. That really ties it up, and when something like Ferguson happens and we really need it, it's busy downloading Linux kernel mailing lists or Edgar Rice Burroughs fan sites.
[12:51] maybe we need to keep 1 pipeline empty for shit that is going down right now
[12:51] Maybe we need to ask ourselves why folks are using it instead of running a wget themselves.
[12:52] also an option
[12:52] most likely, ease of use
[12:53] yeah, definitely, but what are the parts that it makes easy?
[12:53] for example...
[12:53] I *LOVE* that it automatically grabs media hosted on other domains.
[12:55] If somebody smarter than me could extract that bit of magic from archivebot and add a description of how to do it to the wiki's "mirroring with wget" page, I'd probably do a *bunch* more small-to-medium-sized grabs on my own.
[12:55] I think the steep learning curve of wgetting a complete domain + warc + ignore patterns and then uploading it to IA might be the biggest issue
[12:55] yeah
[12:55] we should all be able to run our own archivebot
[12:56] if it's easy enough to set up.. yeah
[12:56] I suppose the one thing that it does that is really magical is the way it uploads the warcs on a daily basis.
[12:57] If we were all to do our own little captures then we'd have to constantly be bugging SketchCow to move them for us.
[12:58] well, we could ask SketchCow to create a dump collection or 1 rsync target we dump it to
[12:58] hmm, that's a thought
[12:58] (dump it to? that sounds way too Dutch)
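A rough sketch of what a plain-wget version of that might look like, for the wiki's "mirroring with wget" page. This is not archivebot's actual recipe; the site, WARC prefix, and ignore pattern below are placeholders. One caveat: wget's --span-hosts is all-or-nothing, so it can't fetch offsite media as selectively as the bot's engine (wpull) can.

    # single-site grab with WARC output -- example.com, the WARC prefix,
    # and the --reject-regex pattern are placeholders
    wget --recursive --level=inf --no-parent \
         --page-requisites --adjust-extension \
         --warc-file=example.com-2014-08-15 --warc-max-size=1G \
         --wait=1 --random-wait -e robots=off \
         --reject-regex '/(calendar|tag)/' \
         http://example.com/

The resulting .warc.gz files could then go into an IA item, or onto whatever shared rsync dump target gets set up.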
[13:41] so
[13:41] midas: that already exists
[13:42] in some form, at least -- that's the idea behind separate !ao pipelines, and !ao < FILE, and was also the idea behind pipeline IDs (which admittedly are not yet all that usable since they're auto-generated, Zooko's Triangle etc)
[13:51] yipdw: would it be possible for a person to set up an autonomous archivebot pipeline... one that doesn't talk to the main control channel or report to the public dashboard?
[13:52] yeah
[13:52] I do that for testing
[13:52] it is somewhat documented in INSTALL; however there are a lot of bits in the bot that should really just be CLI tools
[13:53] so there's a dependency on an IRC server (and a CouchDB server for that matter) that is a bit odd
[13:53] wow really? I didn't expect that answer... I expected something along the lines of "Pffft... go figure it out yourself. I'm busy doing God's work" :-D
[13:53] there's a branch in the archivebot repo that is aimed at fixing this
[13:53] Nice... I'll be keeping an eye on that
[13:54] it's the taco-bell branch
[13:55] O_o interesting name
[13:55] http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html
[13:56] it's not really as extreme as that post espouses but it is nevertheless a simplification
[13:59] I've never heard that term, but the concept is familiar... "You have simple yet powerful tools... use them"
[14:01] so, which pieces are you looking to simplify out (just out of curiosity)?
[14:02] cogs was a pretty big mess of objects that also leaked a lot of memory
[14:02] that's now a few pipelines
[14:02] (and doesn't leak)
[14:03] the dashboard used to do a fair amount of JSON processing before it output data; that's mostly gone now and the dashboard is also part of a pipeline
[14:03] Wut
[14:03] those were changes done out of necessity to keep the bot from destroying its host
[14:04] everything else is really more of an aesthetic thing -- "I don't like that this code is duplicated here, so I'm going to make it common"
[14:04] so less urgent :P
[14:07] I am so thankful that the world is filled with intelligent people who have a bit of time on their hands and are into cool stuff.
[14:20] SadDM: yeah, me too
[14:20] archivebot wouldn't really exist without redis+wpull
[14:54] urgh
[14:55] that first
[14:55] now, i don't like my colleagues anymore
[14:55] one of them kinda broke my great deploy-from-git idea
[14:58] they made a new repo with multiple levels of folders before you get to the actual files
[14:59] empty folders?
[15:00] nope
[15:00] well, sort of
[15:00] it's project/public_html/files <-- i want to clone the files directly
[15:01] hm, maybe i can branch it
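A sparse checkout might get closer than branching: git still fetches the whole history, but only materializes the nested folder in the working tree. The repo URL and branch name below are placeholders, and the files still land under the nested path, which a symlink can paper over.

    # clone without checking anything out, then limit the checkout
    # to the nested folder; <repo-url> and master are placeholders
    git clone --no-checkout <repo-url> deploy
    cd deploy
    git config core.sparseCheckout true
    echo 'project/public_html/files/*' > .git/info/sparse-checkout
    git checkout master
    # files now appear under deploy/project/public_html/files/ --
    # a symlink from the web root can hide the extra nesting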
[15:03] https://imgur.com/gallery/Qd9ksk5
[15:06] Archive.org material :)
[15:10] derployment
[15:14] norbert79: yes, was thinking that
[15:23] Happy Friday! https://www.youtube.com/watch?v=8PVal8Fy7CM
[15:33] .t
[15:33] Fri, 15 Aug 2014 15:33:46 GMT
[15:33] um..
[15:34] .t https://www.youtube.com/watch?v=8PVal8Fy7CM
[15:34] no?
[15:34] * joepie91 boggles
[15:34] .title
[15:34] joepie91: My Name is John Daker - BEST VERSION w/ SUBTITLES - YouTube
[15:34] ah there we go
[17:02] just so you know, some funny or die videos have the description twice
[17:03] this is because some videos show up twice in my xml dump, but i added code so i could get everything onto one line
[17:03] keywords also appear twice with these videos
[17:05] also i'm past 26k
[17:05] in godaneinbox
[17:06] also i'm close to getting number 46k for the manuals collection
[18:02] Anyone know if IA offers downloads of files by anything other than HTTP? rsync? FTP? I know about the torrents
[18:02] I wanna get the WL insurance C file, and it's friggin huge
[18:05] It doesn't appear so, but you could try an accelerator like axel (see the sketch at the end of this log).
[18:05] I think they got rid of FTP downloads a while ago.
[18:05] SadDM: another thing is that the archivebot machines mostly have way better connectivity; if I crawled a 100GB site myself it would take weeks to upload to IA
[18:06] and it's more work, which means it's less likely to actually get done
[20:06] i'm up to 202k files that i have uploaded
[20:21] phuzion: I wonder how hard it'd be to build an rsync proxy for IA...
[20:21] joepie91: Not sure. Wanna try?
[20:22] I'll test it on the wikileaks insurance file if you wanna blow 325GB of data on it :)
[20:26] heh
[20:26] pft, 325GB :P
[20:27] phuzion: no rsyncd lib for node :(
[20:28] nodejs?
[20:29] ya
[20:34] for some reason this conversation got me interested in implementing archivebot on Plan 9
[20:34] I don't know why
[21:34] On the off chance anyone knows the answer: how long should the whole BGP-messing-up-routes thing last? I'm getting weird behavior the past few days that I think may be related, but my ISP insists everything is fine.
[21:45] aaaaaaaaa: indefinitely, if you're referring to the recent problems with routers not having enough memory
[21:48] Figured that's what I was going to get. Of course they'd never admit there was a problem, but I've got packets that get stuck going in loops according to traceroute, or just disappear to nowhere, and only for certain destinations.
[21:48] Oh well. That's the service you get from a duopoly.
[21:55] aaaaaaaaa: which part, Comcast or AT&T?
[22:02] arketype: forever, until they upgrade.
[22:06] Comcast. I've not seen a packet go through AT&T on any traceroute
[22:06] I think my ISP is trying to route around them
[22:07] around AT&T
[22:09] that's one way to route around any network neutrality laws
[22:09] "pay us for premium TCAM space"
[22:09] Usually my packets go through AT&T to Level3, but now they seem to be going through Comcast
[22:13] Oh well.
[22:42] https://github.com/paypal/merchant-sdk-java/blob/master/merchantsample/src/main/java/com/sample/merchant/CheckoutServlet.java <-- this is what Java developers think is a reasonable "sample" program
[23:23] as a Java developer.. *sigh* ..no comment
[23:24] though it would be the same crap as a single .php file; servlets are simple..
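Back on the [18:02] question about the WL insurance file: axel just splits a plain HTTP download across several connections, so no rsync or FTP is needed for a big IA file. The item identifier and filename below are placeholders for whatever the item's download page actually lists.

    # parallel HTTP download of one file from an IA item;
    # IDENTIFIER and FILENAME are placeholders
    axel --num-connections=8 \
         https://archive.org/download/IDENTIFIER/FILENAME

axel can also resume an interrupted transfer, so a dropped connection doesn't mean restarting 325GB from scratch.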