#archiveteam-bs 2014-08-15,Fri


Time	Nickname	Message
02:19 ^🔗	xmc	following up on some words in #archivebot here
02:20 ^🔗	xmc	I used to be very much in the "yes we grab everything" position
02:20 ^🔗	xmc	I've put a slightly finer point on it recently
02:20 ^🔗	xmc	not sure what I mean exactly, putting this out in case someone wishes to discuss
04:18 ^🔗	godane	i'm starting to upload more funny or die videos and hp manuals
09:14 ^🔗	schbirid	https://business.twitter.com/en-gb/products/pricing -> You'll only be charged when people follow your Promoted Account or retweet, reply, favourite or click on your Promoted Tweets. You'll never be charged for your organic activity on Twitter.
09:15 ^🔗	*	schbirid has been replying to each and every promoted tweet since i found that
09:21 ^🔗	midas	schbirid: for some reason i think it would be awesome to combine https://twitter.com/markovs with promoted tweets
09:23 ^🔗	schbirid	ooooh hohoho >:D
12:50 ^🔗	SadDM	xmc: I hear you re:Archivebot... we've thrown some HUGE stuff at it without much thought. That really ties it up, and when something like Ferguson happens and we really need it, it's busy downloading Linux kernel mailing lists or Edgar Rice Burroughs fan sites.
12:51 ^🔗	midas	maybe we need 1 pipeline empty for shit that is going down now
12:51 ^🔗	SadDM	Maybe we need to ask ourselves why folks are using it instead of running a wget themselves.
12:52 ^🔗	midas	also an option
12:52 ^🔗	midas	most likely, ease of usage
12:53 ^🔗	SadDM	yeah, definitely, but what are the parts that it makes easy?
12:53 ^🔗	SadDM	for example...
12:53 ^🔗	SadDM	I LOVE that it automatically grabs media hosted on other domains.
12:55 ^🔗	SadDM	If somebody smarter than me could extract that bit of magic from archivebot and add a description of how to do it to the wiki's "mirroring with wget" page, I'd probably do a bunch more small-medium sized grabs on my own.
12:55 ^🔗	midas	I think the biggest issue is the steep learning curve of wgetting a complete domain + warc + ignore patterns and uploading it to ia
12:55 ^🔗	midas	yeah
12:55 ^🔗	schbirid	we should all be able to run our own archivebot
12:56 ^🔗	SadDM	if it's easy enough to set up.. yeah
12:56 ^🔗	SadDM	I suppose the one thing that it does that is really magical is the way it uploads the warcs on a daily basis.
12:57 ^🔗	SadDM	If we were all to do our own little captures then we'd have to constantly be bugging SketchCow to move them for us.
12:58 ^🔗	midas	well, we could ask SketchCow to create a dumpcollection or 1 rsync target we dump it to
12:58 ^🔗	SadDM	hmm, that's a thought
12:58 ^🔗	midas	(dump it to? that sounds way too dutch)
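
For reference, the "complete domain + warc + ignore patterns" grab discussed above could be sketched roughly as below. This is only a minimal sketch, not how archivebot itself does it (archivebot is built on wpull). The helper name `build_wget_command`, the `media_hosts` parameter, and the example hosts are all hypothetical. The wget flags are real, but the list of off-site media hosts has to be supplied by hand, which is exactly the part archivebot automates.

```python
import shlex
from urllib.parse import urlsplit


def build_wget_command(start_url, warc_name, media_hosts=(),
                       reject_regex=None, wait_seconds=1):
    """Build a wget argument list for a recursive grab that writes a WARC.

    media_hosts is a hand-maintained list of extra hosts (CDNs, image
    hosts) that wget may also fetch from. This is an assumption of the
    sketch; wget will not discover these hosts on its own.
    """
    site_host = urlsplit(start_url).hostname
    if not site_host:
        raise ValueError("start_url must include a host: %r" % start_url)

    cmd = [
        "wget",
        "--recursive", "--level=inf",   # follow links without a depth limit
        "--page-requisites",            # also fetch images, CSS, scripts
        "--warc-file=" + warc_name,     # write <warc_name>.warc.gz
        "--warc-cdx",                   # plus a CDX index for the WARC
        "--wait=" + str(wait_seconds),  # be polite to the server
    ]
    if media_hosts:
        # Allow leaving the start host, but only for the listed domains.
        allowed = ",".join([site_host, *media_hosts])
        cmd += ["--span-hosts", "--domains=" + allowed]
    if reject_regex:
        cmd.append("--reject-regex=" + reject_regex)  # ignore pattern
    cmd.append(start_url)
    return cmd


if __name__ == "__main__":
    # Hypothetical site and CDN host, for illustration only.
    command = build_wget_command(
        "http://example.com/", "example-grab",
        media_hosts=["cdn.example.net"],
        reject_regex=r"/calendar/",
    )
    print(shlex.join(command))
```

Running the script only prints the command line; it does not start a crawl. Uploading the resulting WARC to IA is a separate step that this sketch does not cover.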
13:41 ^🔗	yipdw	so
13:41 ^🔗	yipdw	midas: that already exists
13:42 ^🔗	yipdw	in some form, at least -- that's the idea behind separate !ao pipelines, and !ao < FILE, and was also the idea behind pipeline IDs (which admittedly are not yet all that usable since they're auto-generated, Zooko's Triangle etc)
13:51 ^🔗	SadDM	yipdw: would it be possible for a person to set up an autonomous archivebot pipeline... one that doesn't talk to the main control channel or report to the public dashboard?
13:52 ^🔗	yipdw	yeah
13:52 ^🔗	yipdw	I do that for testing
13:52 ^🔗	yipdw	it is somewhat documented in INSTALL; however there's a lot of bits in the bot that should really just be CLI tools
13:53 ^🔗	yipdw	so there's a dependency on an IRC server (and CouchDB server for that matter) that is a bit odd
13:53 ^🔗	SadDM	wow really? I didn't expect that answer... I expected something along the lines of "Pffft... go figure it out yourself. I'm busy doing God's work" :-D
13:53 ^🔗	yipdw	there's a branch in the archivebot repo that is aimed at fixing this
13:53 ^🔗	SadDM	Nice... I'll be keeping an eye on that
13:54 ^🔗	yipdw	it's the taco-bell branch
13:55 ^🔗	SadDM	O_o interesting name
13:55 ^🔗	yipdw	http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html
13:56 ^🔗	yipdw	it's not really as extreme as that post espouses but it is nevertheless a simplification
13:59 ^🔗	SadDM	I've never heard that term, but the concept is familiar... "You have simple yet powerful tools... use them"
14:01 ^🔗	SadDM	so, which pieces are you looking to simplify out (just out of curiosity)?
14:02 ^🔗	yipdw	cogs was a pretty big mess of objects that also leaked a lot of memory
14:02 ^🔗	yipdw	that's now a few pipelines
14:02 ^🔗	yipdw	(and doesn't leak)
14:03 ^🔗	yipdw	the dashboard used to do a fair amount of JSON processing before it output data; that's mostly gone now and the dashboard is also part of a pipeline
14:03 ^🔗	SketchCow	Wut
14:03 ^🔗	yipdw	those were changes done out of necessity to keep the bot from destroying its host
14:04 ^🔗	yipdw	everything else is really more of an aesthetic thing -- "I don't like that this code is duplicated here, so I'm going to make it common"
14:04 ^🔗	yipdw	so less urgent :P
14:07 ^🔗	SadDM	I am so thankful that the world is filled with intelligent people who have a bit of time on their hands and are into cool stuff.
14:20 ^🔗	yipdw	SadDM: yeah, me too
14:20 ^🔗	yipdw	archivebot wouldn't really exist without redis+wpull
14:54 ^🔗	midas	urgh
14:55 ^🔗	midas	that first
14:55 ^🔗	midas	now, i don't like my colleagues anymore
14:55 ^🔗	midas	one of them kinda broke my great deployment idea from git
14:58 ^🔗	midas	they made a new repo containing multiple folders before getting to the source of the files
14:59 ^🔗	deathy	empty folders?
15:00 ^🔗	midas	nope
15:00 ^🔗	midas	well sort of
15:00 ^🔗	midas	it's project/public_html/files <-- i want to clone the files directly
15:01 ^🔗	midas	hm maybe i can branch it
15:03 ^🔗	joepie91	https://imgur.com/gallery/Qd9ksk5
15:06 ^🔗	norbert79	Archive.org material :)
15:10 ^🔗	xmc	derployment
15:14 ^🔗	joepie91	norbert79: yes, was thinking that
15:23 ^🔗	swebb	Happy friday! https://www.youtube.com/watch?v=8PVal8Fy7CM
15:33 ^🔗	joepie91	.t
15:33 ^🔗	botpie91	Fri, 15 Aug 2014 15:33:46 GMT
15:33 ^🔗	joepie91	um..
15:34 ^🔗	joepie91	.t https://www.youtube.com/watch?v=8PVal8Fy7CM
15:34 ^🔗	joepie91	no?
15:34 ^🔗	*	joepie91 boggles
15:34 ^🔗	joepie91	.title
15:34 ^🔗	botpie91	joepie91: My Name is John Daker - BEST VERSION w/ SUBTITLES - YouTube
15:34 ^🔗	joepie91	ah there we go
17:02 ^🔗	godane	just know some videos of funny or die say the description twice
17:03 ^🔗	godane	this is cause some videos show up twice in my xml dump, but i added code so i could get everything onto one line
17:03 ^🔗	godane	keywords also appear twice with these videos too
17:05 ^🔗	godane	also i'm past 26k
17:05 ^🔗	godane	in godaneinbox
17:06 ^🔗	godane	also i'm close to getting number 46k for the manuals collection
18:02 ^🔗	phuzion	Anyone know if IA offers downloads of files by anything other than HTTP? rsync? FTP? I know about the torrents
18:02 ^🔗	phuzion	I wanna get the WL insurance C file, and it's friggin huge
18:05 ^🔗	aaaaaaaaa	It doesn't appear so, but you could try an accelerator like axel.
18:05 ^🔗	aaaaaaaaa	I think they got rid of ftp downloads a while ago.
18:05 ^🔗	DFJustin	SadDM: another thing is that the archivebot machines mostly have way better connectivity, if I crawled a 100gb site myself it would take weeks to upload to ia
18:06 ^🔗	DFJustin	and it's more work which means less likely to actually get done
20:06 ^🔗	godane	i'm up to 202k files that i have uploaded
20:21 ^🔗	joepie91	phuzion: I wonder how hard it'd be to build an rsync proxy for IA...
20:21 ^🔗	phuzion	joepie91: Not sure. Wanna try?
20:22 ^🔗	phuzion	I'll test it on the wikileaks insurance file if you wanna blow 325GB of data on it :)
20:26 ^🔗	joepie91	heh
20:26 ^🔗	joepie91	pft, 325GB :P
20:27 ^🔗	joepie91	phuzion: no rsyncd lib for node :(
20:28 ^🔗	phuzion	nodejs?
20:29 ^🔗	joepie91	ya
20:34 ^🔗	yipdw	for some reason this conversation got me interested in implementing archivebot on Plan 9
20:34 ^🔗	yipdw	I don't know why
21:34 ^🔗	aaaaaaaaa	On the off chance anyone knows the answer: How long should the whole BGP messing up routes last? I'm getting weird behavior the past few days that I think may be related but my ISP insists everything is fine.
21:45 ^🔗	yipdw	aaaaaaaaa: indefinite, if you're referring to recent problems with routers not having enough memory
21:48 ^🔗	aaaaaaaaa	Figured that is what I was going to get. Of course, they'd never admit there was a problem, but I've got packets that get stuck going in loops according to traceroute, just disappear to nowhere, etc and only for certain destinations.
21:48 ^🔗	aaaaaaaaa	Oh well. That's the service you get from a duopoly.
21:55 ^🔗	yipdw	aaaaaaaaa: which part, Comcast or AT&T
22:02 ^🔗	Smiley	arketype: forever until they upgrade.
22:06 ^🔗	aaaaaaaaa	Comcast, I've not seen a packet go through at&t on any traceroute
22:06 ^🔗	aaaaaaaaa	I think my ISP is trying to route around them
22:07 ^🔗	aaaaaaaaa	around at&t
22:09 ^🔗	yipdw	that's one way to route around any network neutrality laws
22:09 ^🔗	yipdw	"pay us for premium TCAM space"
22:09 ^🔗	aaaaaaaaa	Usually my packets go through AT&T to level3 but now they seem to be going through comcast
22:13 ^🔗	aaaaaaaaa	Oh well.
22:42 ^🔗	yipdw	https://github.com/paypal/merchant-sdk-java/blob/master/merchantsample/src/main/java/com/sample/merchant/CheckoutServlet.java <-- this is what Java developers think is a reasonable "sample" program
23:23 ^🔗	deathy	as a Java developer.. sigh ..no comment
23:24 ^🔗	deathy	though it would be same crap with a single .php file, servlets are simple..
