#archiveteam 2012-10-11,Thu

Time	Nickname	Message
00:00 ^🔗	nintendud	ah, I'm seeing timeouts in my warrior
00:00 ^🔗	nintendud	it must really be getting crushed
00:00 ^🔗	primus	what does FOS stand for?
00:01 ^🔗	nintendud	Free and Open Source? Maybe?
00:02 ^🔗	sankin	just curious, what are the hardware specs for fos?
00:02 ^🔗	SketchCow	This is really bad.
00:02 ^🔗	nintendud	it's a raspberry pi hooked up to a RAID array
00:02 ^🔗	SketchCow	It shouldn't be this hammered.
00:02 ^🔗	nintendud	Oh?
00:04 ^🔗	nintendud	speaking of 'pi's, apparently you can colocate a pi in Austria for free: https://www.edis.at/en/server/colocation/austria/raspberrypi/
00:04 ^🔗	SketchCow	FOS stands for Fortress of Solitude
00:04 ^🔗	SketchCow	It replaced a machine named Batcave
00:04 ^🔗	nintendud	Hah, nice
00:04 ^🔗	SketchCow	FOS became a way to refer to it easily.
00:04 ^🔗	primus	:-) awesome, thanks
00:05 ^🔗	S[h]O[r]T	i always thought it was a fun take on FiOS because verizon sponsored it :P
00:05 ^🔗	nintendud	iFOS. By Apple.
00:05 ^🔗	S[h]O[r]T	even though i know that is no where near true
00:17 ^🔗	nintendud	I wonder if these fixed 30 second retries has us all hammering FOS at the same time.
00:17 ^🔗	chronomex	thundering herd effect?
00:18 ^🔗	nintendud	TIL that term. Essentially, although more than one can rsync at a time.
00:18 ^🔗	nintendud	it's why random backoff in ethernet is a thing
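
The thundering-herd concern raised here (every client retrying on the same fixed 30-second timer) is usually addressed with randomized backoff. The following is a minimal sketch of that idea, not the warrior's actual retry code; the function name `backoff_delay` and the `base` and `cap` values are assumptions chosen for illustration.

```python
import random

def backoff_delay(attempt, base=30.0, cap=600.0, rng=random):
    """Return a randomized delay in seconds for the given retry attempt.

    "Full jitter": pick uniformly between 0 and min(cap, base * 2**attempt),
    so clients that failed at the same moment do not all retry together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

if __name__ == "__main__":
    # Print a few example delays; the values differ on every run.
    for attempt in range(5):
        print(attempt, round(backoff_delay(attempt), 1))
```

With a fixed delay, clients that were cut off together come back together; with jitter the same number of retries is spread across the whole window.
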
00:19 ^🔗	nintendud	I keep getting about 5% uploaded before it dies
00:24 ^🔗	SketchCow	Machine is seriously getting hammered.
00:24 ^🔗	SketchCow	Not sure what to do yet.
00:24 ^🔗	SketchCow	Might set rsync.
00:25 ^🔗	nintendud	Is it coming in 30 second waves?
00:25 ^🔗	nintendud	Or is it just a constant surge of traffic?
00:25 ^🔗	SketchCow	ha ha you act like pressing keys makes anything happen.
00:25 ^🔗	nintendud	Oh right. The tubes. They are clogged.
00:26 ^🔗	S[h]O[r]T	if you have access to the switch or firewall in front of it you can block certain IP ranges or ports to slow down the flow of traffic in
00:26 ^🔗	SketchCow	I like where you said that too.
00:26 ^🔗	SketchCow	All these suggestions are well meaning and useless.
00:26 ^🔗	SketchCow	I'm going to implement a max connections as soon as I can get vi to respond.
00:27 ^🔗	S[h]O[r]T	well if you had access to the switch you could just deny all rsync or anything else and allow ssh :p
00:27 ^🔗	S[h]O[r]T	that wouldnt be useless
00:28 ^🔗	SketchCow	Yes.
00:28 ^🔗	SketchCow	So.....
00:28 ^🔗	SketchCow	If only we could turn lead into gold, we could solve a number of problems.
00:28 ^🔗	SketchCow	But the impossibility of that makes it useless.
00:30 ^🔗	SketchCow	Realize my temper is going to be short while I wrestle with a machine with over 1,100 rsync connections active.
00:30 ^🔗	nintendud	Yup. Good luck, soldier.
00:31 ^🔗	SketchCow	And advice along the line of "to fix the problem, you should fix the problem" is brain fart
00:32 ^🔗	SketchCow	It has been trying to open a vi session for 4 minutes.
00:32 ^🔗	SketchCow	That's how bad it is.
00:32 ^🔗	SketchCow	I have two other windows, trying to set up a killing of rsync
00:34 ^🔗	S[h]O[r]T	im guessing you didnt want any advice then and are just venting
00:34 ^🔗	DoubleJ	Mine finally timed out so I was able to pause the VM. So my minuscule part of the load is off.
00:39 ^🔗	SketchCow	I set it to 20
00:43 ^🔗	SketchCow	Now doing a megakill
00:44 ^🔗	SketchCow	Bitches
00:44 ^🔗	SketchCow	ps -ef | grep rsync | awk '{print $2}' | xargs kill
00:47 ^🔗	nintendud	no killall?
00:47 ^🔗	chronomex	or skill
00:48 ^🔗	SketchCow	shhh, I'm oldschool
00:48 ^🔗	*	chronomex nods knowingly
00:49 ^🔗	chronomex	you have legitimate claim to the phrase "I have underwear that's older than your home directory"
00:56 ^🔗	igelritte	nice
00:57 ^🔗	SketchCow	root@teamarchive-1:/etc# killall rsync
00:57 ^🔗	igelritte	though I think if he had used 'ps -aux | grep'... that would have been better
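
The pipeline quoted above selects the PID column of `ps -ef` for every line that mentions rsync. A rough Python sketch of that selection step follows; it only parses text and kills nothing. The function name `rsync_pids` and the sample listing are made up for illustration, and unlike a plain `grep rsync`, the sketch skips the `grep` process itself.

```python
def rsync_pids(ps_output, name="rsync"):
    """Extract PIDs (second column of `ps -ef`) for commands mentioning `name`."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) < 8:
            continue
        command = fields[7]
        if name in command and "grep" not in command:
            pids.append(int(fields[1]))
    return pids

# Invented sample output, shaped like `ps -ef` on Linux.
SAMPLE = """UID        PID  PPID  C STIME TTY          TIME CMD
root       101     1  0 00:40 ?        00:00:01 rsync --daemon
root       202   101  0 00:41 ?        00:00:00 rsync --daemon
root       303   300  0 00:44 pts/0    00:00:00 grep rsync
root       404     1  0 00:10 ?        00:00:00 /usr/sbin/sshd
"""

if __name__ == "__main__":
    print(rsync_pids(SAMPLE))
```
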
00:59 ^🔗	igelritte	looks like it's time for bed. Gettin' a little punchy.
00:59 ^🔗	igelritte	later
01:00 ^🔗	SketchCow	Machine is pretty hosed.
01:06 ^🔗	SketchCow	FOS crashed.
01:07 ^🔗	BlueMax	Wow, what happened
01:22 ^🔗	SketchCow	DJ Smiley remix of the main page of archiveteam.org now in place.
01:32 ^🔗	SketchCow	fos is back
01:32 ^🔗	SketchCow	now running with some severe rsync limits while we get shit in shape.
02:18 ^🔗	godane	i'm uploading issue 150 dvd of linux format
03:47 ^🔗	bsmith096	@ERROR: max connections (5) reached -- try again later
03:47 ^🔗	bsmith096	Starting RsyncUpload for Item woodp
03:47 ^🔗	bsmith096	getting a whole mess of these
03:47 ^🔗	bsmith096	rsync error: error starting client-server protocol (code 5) at main.c(1534) [sender=3.0.9]
03:49 ^🔗	S[h]O[r]T	the server (fos) that stuff rsyncs to is limited to 5 rsync connections atm, it was having issues earlier. SketchCow should update once it's all good at some point
03:51 ^🔗	bsmith096	so is the script gonna continue at some point cause it just keeps trying to dend those 2 users over and over
03:51 ^🔗	bsmith096	send
03:51 ^🔗	S[h]O[r]T	yeah it will keep trying until it gets through
03:51 ^🔗	S[h]O[r]T	can just leave it running
03:52 ^🔗	underscor	I thought it only tries 50 times
03:52 ^🔗	underscor	and then gives up?
03:54 ^🔗	S[h]O[r]T	if it does thats 25min and there must be a bug?
03:55 ^🔗	S[h]O[r]T	thats good tho :P
03:58 ^🔗	S[h]O[r]T	i looked a while back and again just a bit ago, was pretty sure the rsync in pipeline doesnt have a lot of overhead but i could be wrong. i know there are some options to turn off compression and use a lower encryption that generate less cpu usage.
03:58 ^🔗	S[h]O[r]T	for client and server
04:24 ^🔗	underscor	S[h]O[r]T: Well, I'm just saying
04:24 ^🔗	underscor	with the rate limit on fos
04:24 ^🔗	underscor	it's very likely you could not get in in 25m
04:24 ^🔗	underscor	and then the thing will just give up
04:24 ^🔗	underscor	and you're wasted
04:24 ^🔗	underscor	3
04:24 ^🔗	underscor	D:
04:43 ^🔗	S[h]O[r]T	im saying its been more than 25min and i havent got in and its still trying
04:47 ^🔗	underscor	oic
04:48 ^🔗	underscor	maybe I'm wrong
04:48 ^🔗	underscor	I just overheard someone say that
04:48 ^🔗	underscor	looks like SketchCow upped it to 10
04:48 ^🔗	underscor	none of my threads are doing any work still, though
04:48 ^🔗	underscor	hopefully we can reopen the floodgates soon
04:48 ^🔗	underscor	otherwise we're definitely not going to do well with webshots XD
04:51 ^🔗	underscor	yay!
04:51 ^🔗	underscor	finally got one in
04:51 ^🔗	underscor	w00t
05:13 ^🔗	S[h]O[r]T	i dont see it got upped to 10:P
06:27 ^🔗	ivan`	is anyone in the rehosting-dead-pastebins business?
06:27 ^🔗	ivan`	100K pastes from paste.lisp.org would be better off googlable
06:33 ^🔗	chronomex	do you have them??
06:42 ^🔗	ivan`	yes
06:42 ^🔗	ivan`	http://ludios.org/tmp/paste.lisp.org.7z
06:42 ^🔗	ivan`	chronomex: ^
06:43 ^🔗	deathy	something up with warrior upload? getting "@ERROR: max connections (10) reached -- try again later"
06:45 ^🔗	chronomex	<3
06:45 ^🔗	Cameron_D	The server we rsync to is currently limited because it was having problems earlier
06:46 ^🔗	chronomex	thanks ivan`! are you involved with paste.lisp.org?
06:46 ^🔗	ivan`	no, I think stassats runs it, but his reply did not indicate interest in restoring them
06:46 ^🔗	chronomex	aye.
06:47 ^🔗	chronomex	ow, this is a lot of files
06:47 ^🔗	ivan`	heh
06:47 ^🔗	deathy	hoping limit gets increased/lifted... almost all warriors waiting for upload :|
06:47 ^🔗	chronomex	*wow
06:47 ^🔗	SketchCow	WHY HELLO
06:48 ^🔗	*	chronomex shoves this into IA
06:48 ^🔗	SketchCow	You crying sallybags.
06:48 ^🔗	chronomex	wassap brah
06:48 ^🔗	SketchCow	You whip a virtual server within an inch of its life, and then woah, you all want it jogging around the track 5 minutes later.
06:49 ^🔗	SketchCow	Also, I like Underscor whining on 3 channels about me taking a reasonable attempt to prevent the machine dying.
06:49 ^🔗	SketchCow	948 simultaneous rsyncs.
06:49 ^🔗	SketchCow	Think about that.
06:49 ^🔗	chronomex	o_O
06:49 ^🔗	SketchCow	You know what you did.
06:49 ^🔗	chronomex	bitches gotta bitch
06:49 ^🔗	*	SketchCow gets the newspaper
06:49 ^🔗	deathy	good job team! :)
06:49 ^🔗	Cameron_D	haha
06:50 ^🔗	soultcer	We need support for distributing uploads to multiple servers. Next one to complain about fos being unreachable will be volunteered to code that into the seesaw kit.
06:50 ^🔗	chronomex	ivan`: can you share some info about this file? when was it captured, was the paste dead at the time, is it complete, etc
06:52 ^🔗	SketchCow	Tomorrow, FOS goes down when one of the admins increases its swap from 2gb to 6gb.
06:59 ^🔗	ivan`	chronomex: pastes were captured 2011-11-14 and 2012-05-01 and 2012-10-06 (though perhaps I should strip those); not complete, I don't have pastes 129789-131892
07:00 ^🔗	chronomex	ok
07:00 ^🔗	chronomex	:D
07:04 ^🔗	chronomex	http://archive.org/details/paste_lisp_org
07:06 ^🔗	SketchCow	So, basically I have a couple days to prepare some more archiveteam items for ingestion into the wayback.
07:09 ^🔗	SketchCow	188,329,776 14.0M/s eta 3h 58m
07:09 ^🔗	SketchCow	Now that's a spicy meatball
07:10 ^🔗	SketchCow	1,067,816,696 17.7M/s eta 4h 19m
07:10 ^🔗	SketchCow	Downloaded a gig. Going to take 4 hours. It's like that.
07:14 ^🔗	SketchCow	With luck, I can make a lot more of these things green.
07:14 ^🔗	SketchCow	If this all works, all the green ones go into the wayback machine instantly.
07:15 ^🔗	SketchCow	Instant SOPA review end of the month!
07:15 ^🔗	SketchCow	That'd be nice.
07:41 ^🔗	SketchCow	Re-initiated uploads from fos to archive.org of webshots loads.
07:46 ^🔗	alard	Hi there. I've looked up the "rsync retries 50 times" claim that I made yesterday. I now think that's wrong and that it retries indefinitely, so your warriors will just wait until they can upload.
08:37 ^🔗	SketchCow	OK, napping.
13:07 ^🔗	joepie91	SketchCow: I'm not quite sure what to do with this, but I archived the videos that someone (I think bsmith096) linked me to a while ago as rare footage: http://aarnist.cryto.net:81/youtube/all/
13:07 ^🔗	joepie91	flv/mp4/webm format
13:17 ^🔗	balrog_	having a hard time with rsync with warrior
13:17 ^🔗	balrog_	getting "max connections reached" errors
13:22 ^🔗	joepie91	same
13:22 ^🔗	balrog_	alard: ya there?
13:24 ^🔗	S[h]O[r]T	the server the scripts rsync to is currently limited because it was having problems earlier
13:24 ^🔗	balrog_	yeah, but how do I keep the warrior going?
13:24 ^🔗	balrog_	I have this bandwidth which otherwise isn't going to get used
13:24 ^🔗	S[h]O[r]T	just have to wait, it will keep retrying uploads :\
13:25 ^🔗	balrog_	need to shorten the wait time from 30 seconds to more like 5 then
13:42 ^🔗	balrog_	is there any way I can tweak this? :/
13:45 ^🔗	SmileyG	more concurrent threads?
13:45 ^🔗	SmileyG	problem is we are all downloading it faster than FOS can accept it back in
13:45 ^🔗	balrog_	yeah
13:46 ^🔗	SmileyG	The fix is FOS Accepting it faster, or us having larger caches.
13:46 ^🔗	SmileyG	larger caches are possible if you do more concurrent downloads, however depending how fast you download in ratio to the max upload, you're still going to get stuck eventually
13:47 ^🔗	SmileyG	joepie91: I'd upload them to IA and give SketchCow a link,
13:47 ^🔗	SmileyG	thats what I've done with the fish ezine I get each month.
13:48 ^🔗	SmileyG	[08:46:31] <@alard> Hi there. I've looked up the "rsync retries 50 times" claim that I made yesterday. I now think that's wrong and that it retries indefinitely, so your warriors will just wait until they can upload.
13:48 ^🔗	SmileyG	Phew! that was a worry for me.
14:02 ^🔗	balrog_	5 connections seems a bit low
14:09 ^🔗	alard	Hey, "need to shorten the wait time from 30 seconds to more like 5 then" is not a good idea. If we all do that, it will just increase the load on the server, but won't increase the number of uploads.
14:10 ^🔗	flaushy	the problem is that users like me with slow uploads (max 1 MiB/s) will use the slots for a long time :/
14:10 ^🔗	alard	We'll just have to wait until the server can handle more connections. (Or we'll have to find some other server where we can upload to, to spread the load.)
14:10 ^🔗	flaushy	right
14:11 ^🔗	flaushy	can we, until then, increase the warrior concurrency to a higher max than 6?
14:13 ^🔗	alard	No, that would require a lot of updates. (And I also don't really see how that would help. It would just add more waiting uploads.)
14:13 ^🔗	alard	Just be patient. :)
14:13 ^🔗	flaushy	okie :)
14:14 ^🔗	flaushy	at least we don't lose the queues, which is great
14:15 ^🔗	soultcer	alard: What would the requirements of such a server be?
14:15 ^🔗	SmileyG	we need a mini IA just for our uploads lol
14:16 ^🔗	flaushy	i mean underscor looks like having enough bandwidth to act as a caching server for smaller guys. But i might be wrong
14:22 ^🔗	alard	soultcer: It would need downstream and upstream bandwidth, and a not too small disk to receive files before it packages and uploads them to archive.org.
14:22 ^🔗	alard	Uploads are 50GB batches, so a multiple of that.
14:24 ^🔗	flaushy	would 100 mbit be enough?
14:24 ^🔗	soultcer	Maybe renting a cheap server from hetzner or ovh for a month would work
14:25 ^🔗	alard	Yes, 100mbit would be enough (we also don't have to send all uploads to one server).
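
For a sense of scale on the "50GB batches" and "100mbit" figures above, the arithmetic can be sketched as below. This assumes decimal units (1 GB = 10^9 bytes, 1 Mbit = 10^6 bits) and an ideal link with no protocol overhead; `transfer_seconds` is a name invented for this example.

```python
def transfer_seconds(size_bytes, link_mbit_per_s):
    """Seconds to move size_bytes over a link of link_mbit_per_s (ideal case)."""
    bits = size_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)

batch = 50 * 10**9  # one 50 GB batch, decimal gigabytes
secs = transfer_seconds(batch, 100)
print(f"{secs:.0f} s, about {secs / 60:.0f} minutes")  # 4000 s, about 67 minutes
```

So a saturated 100 Mbit/s link moves one batch in a little over an hour, in each direction, before any real-world overhead.
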
14:26 ^🔗	SmileyG	the bt ones are the issue right?
14:26 ^🔗	SmileyG	because they are so short.... ?
14:26 ^🔗	SmileyG	SHame we can't package multiple users together on the warrior?
14:26 ^🔗	alard	I do not know what the issue is. It can't be bt, I would think, since we have only a few thousand small users there.
14:27 ^🔗	SmileyG	alard: but most of them finish in sub30 seconds
14:27 ^🔗	SmileyG	thats a lot of rsync processes spawning constantly for such small transfers.
14:27 ^🔗	alard	Yes, so there aren't many active at the same time. But I don't know what the issue is, really. It could be the number of processes spawning, or just the number of concurrent uploads.
14:28 ^🔗	alard	Resuming uploads are probably also more expensive than new uploads.
14:28 ^🔗	alard	(There would have been a few of them when the server came back up, I suspect.)
14:29 ^🔗	alard	It doesn't have to be rsync, by the way. That's just what fos currently has.
14:30 ^🔗	alard	Anyway, I'll be back later.
14:30 ^🔗	SmileyG	o/
14:30 ^🔗	soultcer	Does the bundling script rely on the partial setting? You could use --inplace, then it won't have to rename/move files after finishing
14:34 ^🔗	SmileyG	partial works for the webshots but makes no sense with the BT ones
14:37 ^🔗	SmileyG	14254080 52% 166.45kB/s 0:01:18
14:41 ^🔗	SketchCow	Back once again.
14:44 ^🔗	flaushy	meh, i need a couple of minutes mostly
14:44 ^🔗	flaushy	alard: i ask at my university
14:44 ^🔗	flaushy	whether i can crawl with the pools at night, and whether a dump would be acceptable
14:48 ^🔗	tef_	alard, DFJustin: 0.18 and 1.0 warcs are the same bar the version number, yes. (I have this from one of the authors of the warc spec)
14:49 ^🔗	tef_	pps warc2warc in warctools can recompress warcs record by record. warc2warc *.warc.gz -O output.warc.gz
14:53 ^🔗	soultcer	If it can recompress warcs, can it also concatenate them? Simply create one big warc file instead of tarring multiple warc files. Would make it easier for IA to use?
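
On the concatenation question: the gzip format itself allows independently compressed members to be joined end to end, and a reader that supports multi-member streams decompresses them as one. The sketch below shows only that gzip property with made-up payloads; whether a given WARC tool or the Internet Archive accepts the result is not established in this log.

```python
import gzip

def concat_gzip_members(chunks):
    """Join independently gzipped byte strings into one multi-member stream."""
    return b"".join(chunks)

# Two stand-in "records", each compressed on its own.
a = gzip.compress(b"record one\n")
b = gzip.compress(b"record two\n")
combined = concat_gzip_members([a, b])

# gzip.decompress reads every member in sequence.
assert gzip.decompress(combined) == b"record one\nrecord two\n"
```
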
15:15 ^🔗	DFJustin	so SketchCow / underscor, can you pull bt usernames out of the wayback database, I can do stuff like http://wayback.archive.org/web//http://www.btinternet.com/~ but I only get a few hundred at a time and it will take forever
15:19 ^🔗	alard	DFJustin: underscor sent a list from the wayback database yesterday.
15:20 ^🔗	DFJustin	well I was getting usernames just now that rescue-me didn't know, although I think most of them are long gone
15:21 ^🔗	alard	Ah, I don't know what they searched for.
15:23 ^🔗	alard	soultcer: I think --partial or --inplace doesn't really matter (moving a file on the same disk isn't that expensive, is it?)
15:25 ^🔗	alard	I was playing with this for the one-big-warc problem: https://github.com/alard/megawarc Any good suggestions?
15:25 ^🔗	SketchCow	http://24.media.tumblr.com/tumblr_m9dvjezOvX1qm3r26o1_500.jpg
15:25 ^🔗	soultcer	When you have a big file half-uploaded, and then continue without --inplace, it will first make a temp copy of the already existing stuff, then write to that temp copy
15:25 ^🔗	soultcer	Only when it has finished uploading, will it remove that copy
15:26 ^🔗	soultcer	I had trouble transferring a file because rsync took more than 1.5 times the size of the file when I didn't use inplace
15:27 ^🔗	alard	In any case, --inplace can't be used here, because then half-uploaded files could be moved by the postprocessing script.
15:28 ^🔗	soultcer	alard: What do we need the original tar file for?
15:28 ^🔗	alard	It's nice to be able to find the per-user files.
15:28 ^🔗	SketchCow	yes
15:28 ^🔗	alard	And for mobileme there are wget.logs and other files.
15:32 ^🔗	alard	So even though you'd probably never want the original tar file back, it's useful to keep the data somewhere. The 'restore' function demonstrates that there's no data lost.
15:48 ^🔗	tef_	alard: if you have extra logs to put in, warc can handle that with metadata records
15:50 ^🔗	alard	tef_: I know. The new projects have one single warc file per user. The older projects, mobileme, have the logs and a few other files besides the warcs.
15:51 ^🔗	alard	(And even with mobileme the wget log is also in the warc files, I think.)
15:51 ^🔗	tef_	nice
15:51 ^🔗	tef_	but yeah converting from .tar to warcgz could happily convert non warc records into warcrecords in the final output
15:52 ^🔗	alard	Yes, so you could make one file that has everything.
15:52 ^🔗	SketchCow	Here's a hilarious one - the fortunecity collection. It's warcs AND straight html.
15:52 ^🔗	tef_	SketchCow: warc records can be of 'resource' instead of 'response' :-)
15:52 ^🔗	alard	We could put the tar file in the big warc.
15:59 ^🔗	tef_	heh
16:19 ^🔗	underscor	SketchCow: I wasn't whining!
16:22 ^🔗	underscor	alard: Does the seesaw kit support round-robining rsync servers?
16:22 ^🔗	underscor	Because I have 12 boxes at archive.org we could rr between
16:28 ^🔗	alard	underscor: Not yet, but it could. I think it would be even better to do it with HTTP PUT uploads, though. That would make round-robining easier. (And it might be less stressful for the server.)
16:28 ^🔗	underscor	Hmm, as safe as rsync though?
16:28 ^🔗	underscor	(checksum-wise)
16:28 ^🔗	SmileyG	hmmmm
16:28 ^🔗	SketchCow	alard: First test of megawarc coming up
16:28 ^🔗	alard	Does rsync make many checksums?
16:29 ^🔗	underscor	I thought it did a checksum
16:29 ^🔗	underscor	But actually, no
16:29 ^🔗	SmileyG	it does
16:29 ^🔗	underscor	In write only mode, it doesn't
16:29 ^🔗	alard	Only if you allow it, I thought. (Other than the filesize thing.)
16:29 ^🔗	SmileyG	files to check #0/1
16:29 ^🔗	SmileyG	currently it appears to check the writes...
16:30 ^🔗	SmileyG	can you just use dns RR too?....
16:30 ^🔗	underscor	Yeah, but that requires waiting for propagation, etc
16:30 ^🔗	underscor	Also a lot of places (RIT included) munge the results
16:31 ^🔗	underscor	and only return one of them until the cache expires
16:31 ^🔗	SmileyG	o
16:31 ^🔗	SmileyG	ttl 5
16:31 ^🔗	SmileyG	:D
16:31 ^🔗	underscor	haha
16:31 ^🔗	underscor	They ignore ttl :()
16:31 ^🔗	underscor	:( *
16:31 ^🔗	SmileyG	just make sure your dns server can take it
16:31 ^🔗	SmileyG	wut ¬_¬
16:31 ^🔗	underscor	Yeah
16:31 ^🔗	underscor	sux
16:32 ^🔗	SmileyG	ok, have the tracker hand out upload details?
16:32 ^🔗	SmileyG	along with username?
16:33 ^🔗	underscor	alard: Setting up a PUT server for testing
16:34 ^🔗	alard	We could write a tiny node.js PUT server with checksums. :)
16:36 ^🔗	soultcer	Why complicate it further by introducing another programming language?
16:36 ^🔗	alard	Good question.
16:41 ^🔗	soultcer	Is there no simple point to point file transfer protocol with checksumming?
16:43 ^🔗	alard	Do we need checksums? (If we're uncomplicating. :)
16:44 ^🔗	underscor	Nah.
16:44 ^🔗	underscor	I was just putting up a put accepter in nginx
16:44 ^🔗	underscor	since I already have it on these boxen
16:44 ^🔗	alard	After all, once it's on that server we uploaded to we'll be using the non-checksummed s3 api to bring it to archive.org.
16:46 ^🔗	alard	underscor: Do you happen to know if there's a way to distinguish uploaded from still-uploading files?
16:46 ^🔗	underscor	No idea. Let me see.
16:48 ^🔗	alard	That's useful to know for the postprocessing. (The current packaging script moves any file it can find.)
16:49 ^🔗	deathy	"A file uploaded with the PUT method is first written to a temporary file, then a file is renamed. Starting from version 0.8.9 temporary files and the persistent store can be put on different file systems but be aware that in this case a file is copied across two file systems instead of the cheap rename operation."
16:49 ^🔗	deathy	apparently from "ngx_http_dav_module" docs
16:50 ^🔗	alard	Ah, that's promising.
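
The write-to-temp-then-rename behaviour quoted from the nginx docs is what lets a post-processing script tell finished uploads from partial ones. A minimal sketch of the same pattern, with a checksum thrown in, might look like this. `store_upload` and the `.part` suffix are assumptions for illustration (a packaging script would need to skip `.part` files); this is not the nginx implementation.

```python
import hashlib
import os
import tempfile

def store_upload(data, dest_path):
    """Write data to a temp file next to dest_path, then rename it into place.

    Returns the SHA-1 hex digest of the bytes written.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Rename is atomic when source and target share a file system,
        # so readers never observe a half-written dest_path.
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return hashlib.sha1(data).hexdigest()
```

Keeping the temp file in the destination directory guarantees the cheap rename; a separate temp directory on another file system would turn the rename into a copy, as the quoted documentation warns.
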
16:54 ^🔗	S[h]O[r]T	FTP
16:55 ^🔗	soultcer	FTP? What are we, farmers?
16:57 ^🔗	S[h]O[r]T	lol
16:57 ^🔗	underscor	http://p.defau.lt/?SBDTYn8UhfxVvm4rSmlydw
16:57 ^🔗	underscor	cc alard
16:57 ^🔗	underscor	:D
16:57 ^🔗	underscor	and it didn't appear until after the upload fully finished
16:57 ^🔗	alard	Nice.
16:58 ^🔗	alard	Does it make directories? (As in /webshots/underscor/something.warc.gz ?)
16:58 ^🔗	underscor	I can enable it
16:59 ^🔗	underscor	So if you put to http://bt-download00.us.archive.org:8302/webshots/some/path/here/libtorrent-packages.tar.gz
16:59 ^🔗	underscor	it will create /some/path/here on the fly
16:59 ^🔗	alard	It's not necessary, but with the rsync uploads I generally let every download upload to a separate directory.
17:00 ^🔗	alard	Doesn't really serve a purpose.
17:00 ^🔗	alard	I'll be back later.
17:01 ^🔗	underscor	alard: option enabled.
17:02 ^🔗	underscor	Holler at me when you get back if you think this would be a better idea going forward, and I can push out to the rest of the boxes
17:19 ^🔗	godane	i got up to episode 43 of t3 podcast
17:36 ^🔗	joepie91	S[h]O[r]T: no, absolutely not FTP
17:36 ^🔗	joepie91	lol
18:51 ^🔗	joepie91	very relevant: I don't have time for silliness. Just let me know if you're removing our footage, or if I'm forwarding this to our attorneys. I'm not interested in your creative commons bs (which those of us who actually work in media refers to as amateur licensing) and I have told you that we do not want our work in any of your videos. Let me repeat: we want NONE of our work in ANY of your or any third party
18:51 ^🔗	joepie91	videos, and our exclusive licensing agreements exist specifically so that is enforcable.
18:51 ^🔗	joepie91	er
18:51 ^🔗	joepie91	faol
18:51 ^🔗	joepie91	fail *
18:51 ^🔗	joepie91	http://arstechnica.com/tech-policy/2012/10/court-rules-book-scanning-is-fair-use-suggesting-google-books-victory/
18:51 ^🔗	joepie91	ignore the above blob of text, it was an earlier copypaste from a pastebin :P
18:53 ^🔗	chronomex	now I'm curious
18:53 ^🔗	chronomex	however I have work to do
20:08 ^🔗	SketchCow	alard's not here, is he?
20:08 ^🔗	SketchCow	I think he went awayyyy
20:10 ^🔗	alard	Hello!
20:12 ^🔗	SketchCow	Hey, my net went wonky.
20:12 ^🔗	SketchCow	ImportError: No module named ordereddict
20:13 ^🔗	SketchCow	How do I fix that?
20:13 ^🔗	alard	Python 2.6?
20:14 ^🔗	alard	wget https://bitbucket.org/wooparadog/zkit/raw/4ce69af1742f/ordereddict.py
20:14 ^🔗	SketchCow	root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python2.7 megawarc.py
20:14 ^🔗	SketchCow	Traceback (most recent call last):
20:14 ^🔗	SketchCow	File "megawarc.py", line 64, in <module>
20:14 ^🔗	SketchCow	from ordereddict import OrderedDict
20:14 ^🔗	SketchCow	ImportError: No module named ordereddict
20:15 ^🔗	soultcer	OrderedDict is in collections for py 2.7
20:15 ^🔗	SketchCow	Bear in mind I am a perl guy at best.
20:15 ^🔗	SketchCow	We do it differently there.
20:18 ^🔗	soultcer	SketchCow: Replace "from ordereddict import OrderedDict" with this: http://pastebin.com/dQdZ0wX8
20:18 ^🔗	soultcer	Should work fine in py 2.7, and for py 2.6 you can download the ordereddict module alard told you about
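
The contents of the pastebin are not reproduced in this log. A common way to get the behaviour described (standard library on Python 2.7 and later, the separate `ordereddict` module on 2.6) is a guarded import; the sketch below is that general pattern and may differ from what was actually pasted.

```python
try:
    from collections import OrderedDict  # Python 2.7 and later
except ImportError:
    from ordereddict import OrderedDict  # Python 2.6 with the backport module

# An OrderedDict remembers insertion order, which keeps output stable.
d = OrderedDict()
d["first"] = 1
d["second"] = 2
print(list(d.keys()))  # ['first', 'second']
```
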
20:20 ^🔗	SketchCow	OK
20:20 ^🔗	SketchCow	So I just wasted some time trying that.
20:20 ^🔗	soultcer	alard: You are only using the ordered dict for cosmetics anyway, right?
20:21 ^🔗	alard	Yes.
20:21 ^🔗	SketchCow	Alard, please put it in the megawarc github if it works
20:21 ^🔗	SketchCow	because damn, I don't edit python very well.
20:21 ^🔗	chronomex	spaces, no tabs
20:21 ^🔗	chronomex	though it pains me to say so
20:21 ^🔗	SketchCow	Yeah, no, like I don't do python
20:21 ^🔗	PepsiMax	omh
20:22 ^🔗	SketchCow	And the github should be improved, not my local copy of it anyway
20:22 ^🔗	chronomex	:)
20:24 ^🔗	alard	SketchCow: I've updated the github repository. Try again. (It worked for me before and it still works now.)
20:32 ^🔗	SketchCow	root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc
20:32 ^🔗	SketchCow	Usage: megawarc [--verbose] build FILE megawarc [--verbose] restore FILE
20:32 ^🔗	SketchCow	root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc build BOARDS-COH-05.tar
20:32 ^🔗	SketchCow	Looking much better.
20:33 ^🔗	SketchCow	Now, let's see if the 11gb file that results is good.
20:33 ^🔗	SketchCow	Do you account for things being in subdirectories in the .tar?
20:43 ^🔗	alard	Well, it doesn't care. What it does is this: it walks through the tar, one entry at a time. If it is a file and the filename ends with .warc.gz, it checks to see if it is an extractable gzip. If all that is OK, the warc file is added to the warc. In all other cases (directories, unreadable warcs, other files) the file is added to the leftover tar.
20:43 ^🔗	alard	For the tar reconstruction, it pastes together the content from the leftover tar, the tar headers and parts from the warc. So directories don't matter.
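
The walk described above (readable `.warc.gz` files go to the big warc, everything else to a leftover tar) can be sketched as follows. This is a toy, in-memory illustration under stated assumptions, not megawarc's code: the names `split_tar` and `is_readable_gzip` are invented, whole files are read into memory, and the index needed to restore the original tar is omitted.

```python
import gzip
import io
import tarfile
import zlib

def is_readable_gzip(data):
    """True if data decompresses completely as gzip."""
    try:
        gzip.decompress(data)
        return True
    except (OSError, EOFError, zlib.error):
        return False

def split_tar(tar_bytes):
    """Return (warc_bytes, leftover_tar_bytes) for an uncompressed tar."""
    warc_out = io.BytesIO()
    leftover_buf = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as src, \
         tarfile.open(fileobj=leftover_buf, mode="w") as leftover:
        for entry in src:  # one tar entry at a time
            data = src.extractfile(entry).read() if entry.isfile() else None
            if (data is not None and entry.name.endswith(".warc.gz")
                    and is_readable_gzip(data)):
                warc_out.write(data)  # gzip members can simply be appended
            else:
                # Directories, unreadable warcs and other files are kept.
                fileobj = io.BytesIO(data) if data is not None else None
                leftover.addfile(entry, fileobj)
    return warc_out.getvalue(), leftover_buf.getvalue()
```

Because the test is per entry, it makes no difference how deeply a file is nested in subdirectories, which matches the answer given in the log.
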
20:53 ^🔗	SketchCow	root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc build BOARDS-COH-05.tar
20:53 ^🔗	SketchCow	root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# ls -l
20:53 ^🔗	SketchCow	total 21136664
20:53 ^🔗	SketchCow	-rw-r--r-- 1 root root 10822246400 Oct 11 19:26 BOARDS-COH-05.tar
20:53 ^🔗	SketchCow	-rw-r--r-- 1 root root 84149 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.json.gz
20:53 ^🔗	SketchCow	-rw-r--r-- 1 root root 10240 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.tar
20:53 ^🔗	SketchCow	-rw-r--r-- 1 root root 10821470898 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.warc.gz
20:53 ^🔗	SketchCow	OK, so.
20:53 ^🔗	SketchCow	That worked... but there was no progress bar, and no updates.
20:53 ^🔗	SketchCow	So I'll use this for now, but I would definitely add something to indicate work is being done.
20:58 ^🔗	alard	SketchCow: Add --verbose
20:59 ^🔗	alard	It won't show a progress bar, but it will tell you what's taking so long.
20:59 ^🔗	alard	underscor: Do you have a /webshots/alard/webshots.com-user-siebertphotoshop-20121011-225722.warc.gz ?
21:01 ^🔗	SketchCow	Oh!
21:07 ^🔗	joepie91	<SmileyG>joepie91: I'd upload them to IA and give SketchCow a link,
21:08 ^🔗	joepie91	that's a bit hard
21:08 ^🔗	joepie91	they're on a server
21:08 ^🔗	joepie91	:P
21:08 ^🔗	joepie91	can't get to them now anyway, that server seems offline
21:13 ^🔗	underscor	alard: http://p.defau.lt/?rZSWA6hEE1SROnuBQKxtAQ
21:13 ^🔗	underscor	Lookin awesome :D
21:13 ^🔗	alard	Great. Ready for more?
21:14 ^🔗	underscor	joepie91: :o what was that mispaste about :D
21:15 ^🔗	underscor	alard: yep!
21:15 ^🔗	underscor	Shall I roll out to bt-download01-11 now too?
21:15 ^🔗	underscor	(for roundrobining)
21:15 ^🔗	alard	That would be nice. The tracker picks one of the urls from a list, so it's possible to remove/add urls later.
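
The log says only that the tracker "picks one of the urls from a list"; how it picks is not stated. A minimal sketch, assuming a uniform random choice and using placeholder hosts (`upload-a.example` and friends are not real servers), could look like this. `pick_upload_url` is a hypothetical name.

```python
import random

# Placeholder targets; entries can be added or removed without client changes.
UPLOAD_TARGETS = [
    "http://upload-a.example/webshots/",
    "http://upload-b.example/webshots/",
    "http://upload-c.example/webshots/",
]

def pick_upload_url(targets, downloader, filename, rng=random):
    """Choose one base URL at random and build the PUT destination."""
    base = rng.choice(targets)
    return f"{base.rstrip('/')}/{downloader}/{filename}"

if __name__ == "__main__":
    print(pick_upload_url(UPLOAD_TARGETS, "alard", "example.warc.gz"))
```

Handing out the target from the tracker avoids the DNS caching problems discussed earlier, since each work item can carry a fresh choice.
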
21:16 ^🔗	underscor	Ah, nice!
21:16 ^🔗	underscor	ok, I'll work on pushing the config
21:16 ^🔗	flaushy	is the limit of rsync only for webshot or for all projects?
21:16 ^🔗	underscor	I'll need the "cleanup" script too
21:16 ^🔗	underscor	flaushy: bt is set to 5 right now, webshots 10
21:17 ^🔗	flaushy	would it make sense to switch underscor?
21:17 ^🔗	flaushy	from webshots to bt?
21:18 ^🔗	flaushy	or are the rsyncs on bt crowded as well?
21:18 ^🔗	alard	Webshots is now uploading over HTTP (once your warrior gets the update).
21:18 ^🔗	soultcer	Sweet
21:18 ^🔗	flaushy	so warrior restart time :)
21:18 ^🔗	flaushy	awesome
21:18 ^🔗	SketchCow	What?
21:19 ^🔗	SketchCow	So wait, stuff is now banging directly into archive? Or something else.
21:19 ^🔗	alard	SketchCow: underscor wants it.
21:19 ^🔗	underscor	SketchCow: well, I have 12 machines we can load balance between
21:19 ^🔗	SketchCow	Underscor wants a lot of things, but I like to be included while I'm over here trying to make this machine function.
21:19 ^🔗	underscor	so I thought it might be a better idea
21:21 ^🔗	SketchCow	Please at least tell me it's going into http://archive.org/details/webshots-freeze-frame with the same format structure
21:21 ^🔗	alard	(We've been discussing this for a while, but we can change it again if you think it's not a good idea.)
21:21 ^🔗	alard	It's exactly the same.
21:21 ^🔗	underscor	It's exactly the same, just that it is roundrobined between 12 boxes instead of a single one
21:21 ^🔗	SketchCow	I trust you'll do the right thing, but if we're using an environment, I just want to know, with my name being mentioned, we're going to shift gears.
21:22 ^🔗	SketchCow	Because then I can focus on it as a "clear out the rest of what we have" instead of "work my ass off on this box trying to make it function for the time being"
21:23 ^🔗	joepie91	http://p.defau.lt/?rZSWA6hEE1SROnuBQKxtAQ
21:23 ^🔗	alard	Ah, yes. It won't change immediately: the current warriors are still trying to rsync and will keep trying, until they succeed or until they're restarted.
21:23 ^🔗	joepie91	er
21:23 ^🔗	joepie91	<underscor>joepie91: :o what was that mispaste about :D
21:23 ^🔗	joepie91	is what I wanted to paste
21:23 ^🔗	joepie91	anyway
21:23 ^🔗	joepie91	wtf is with my clipboard today
21:24 ^🔗	joepie91	tl;dr guy makes movie about occupy protests, then starts demanding that videos that reuse parts of his movie are taken down
21:24 ^🔗	joepie91	let me find the full paste
21:25 ^🔗	SketchCow	drop to -bs
21:26 ^🔗	SketchCow	Anyway, I am all for solutions that increase the bandwidth away from FOS, which is meant to be a buffer of 20tb for incoming data, but doesn't function as well as it could as something to blow 50tb of insanity in, do operations on, and blow out.
21:27 ^🔗	joepie91	SketchCow: what is the main bottleneck for fos?
21:27 ^🔗	SketchCow	I just need to know that's what's going on so I know I'm bailing water out of a bathtub for a little, and not trying to rescue a sinking ship.
21:27 ^🔗	SketchCow	FOS is a virtual box that does about 20 things.
21:27 ^🔗	SketchCow	So the bottleneck for FOS is FOS
21:27 ^🔗	SketchCow	Oversubscription.
21:27 ^🔗	SketchCow	In this particular case, we had the same disk being used for file writes, file compilation, and file reads
21:28 ^🔗	SketchCow	Which is normally not THAT big a deal but it was doing a LOT, and we had 900+ rsyncs
21:28 ^🔗	SketchCow	Eventually swap exploded
21:28 ^🔗	underscor	and everything goes to hell
21:28 ^🔗	alard	Webshots on FOS is now sizzling out, but bt internet is still using rsync. But that's so small it's probably something to keep there?
21:28 ^🔗	SketchCow	I expect so, yes.
21:28 ^🔗	SketchCow	Webshots is TOO DAMN HIGH
21:29 ^🔗	joepie91	so basically disk I/O is the bottleneck?
21:29 ^🔗	joepie91	or the main one, at least
21:29 ^🔗	SketchCow	It's one of them.
21:29 ^🔗	joepie91	hmm
21:29 ^🔗	joepie91	let me think about this for a moment
21:29 ^🔗	SketchCow	I guess if we're looking to find out, we can circle the sizzling wreck and waste a few days determining why.
21:29 ^🔗	SketchCow	No, don't think about it.
21:29 ^🔗	SketchCow	Think about things and projects that need saving.
21:29 ^🔗	joepie91	there's not much ability to save if the library is burning down
21:30 ^🔗	flaushy	is there a script for bt as well?
21:30 ^🔗	SketchCow	underscor has twice your brainpower, and 400x your resources (200x mine) and has an unhealthy compulsion to optimize.
21:30 ^🔗	SketchCow	He'll fix it.
21:30 ^🔗	*	underscor giggles giddily
21:31 ^🔗	SketchCow	He literally cuddles with the internet archive infrastructure.
21:31 ^🔗	*	underscor whistles innocently
21:31 ^🔗	joepie91	... not sure why you seem so strongly opposed to my decision to invest some of my _own_ time and thought into finding a possible solution
21:31 ^🔗	underscor	but, but, but, petaboxen are so cute~
21:31 ^🔗	joepie91	I personally don't really care who has more brainpower or infrastructure - more people thinking about it instead of watching random series because boredom, means more chance of a solution
21:31 ^🔗	SketchCow	This was a rare case where miscommunication, exacerbated by a red-eye flight, meant that I fell out of the loop of a solution set.
21:31 ^🔗	SketchCow	And got surprised, and whined.
21:32 ^🔗	chronomex	SketchCow can't stand WWIC
21:34 ^🔗	SketchCow	The teamarchive/FOS machine will now get 8gb of swap instead of 2.
21:34 ^🔗	underscor	SketchCow: What script do you use to inject these into IA?
21:34 ^🔗	underscor	(and can I get it plz)
21:35 ^🔗	SketchCow	I have a custom injector that uses a s3 call.
21:35 ^🔗	*	SmileyG sighs
21:35 ^🔗	SmileyG	still borked? :(
21:35 ^🔗	SketchCow	Before we do this with your round-robin thing.
21:35 ^🔗	SketchCow	What's still borked.
21:36 ^🔗	SketchCow	Anyway, before we do this with your round-robin thing, I think we need to decide if megawarc is ready for production.
21:36 ^🔗	SmileyG	my bt uploads by the look of things - looking at backlog now
21:36 ^🔗	SketchCow	Not borked.
21:36 ^🔗	SketchCow	It was being held at a limit, a limit which I will shortly lift as we move webshots over to a round-robin, and as FOS gets 4x the allocated RAM
21:38 ^🔗	SmileyG	Ah ok, I presumed it was the number of rsyncs due to the BT one being so fast that was the issue (i'd fill my queue in 30~ seconds).
21:39 ^🔗	SketchCow	Also
21:39 ^🔗	SketchCow	http://blog.archive.org/2012/10/10/the-ten-petabyte-party/
21:39 ^🔗	SketchCow	If you're in SF, go eat some foods
21:39 ^🔗	SmileyG	I wish.
21:40 ^🔗	*	joepie91 is practically on the other side of the world
21:40 ^🔗	SketchCow	Now, I want to discuss the format we put webshots in.
21:40 ^🔗	mistym	Probably every non-SF person here is wishing they'd be there now :b
21:40 ^🔗	SketchCow	My attention is gripped a little by seeing what the result of the megawarc program is.
21:40 ^🔗	SketchCow	http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5
21:41 ^🔗	SketchCow	So first, let us see what the result of the derive is.
21:41 ^🔗	SketchCow	It's an 11gb megawarc, so it will take a few minutes.
21:42 ^🔗	joepie91	what is a megawarc?
21:42 ^🔗	soultcer	Could you teach the deriver to unpack the tar files?
21:42 ^🔗	SketchCow	soultcer: No
21:42 ^🔗	SketchCow	I sat in meetings across a week on it.
21:42 ^🔗	chronomex	teaching the deriver anything is a major undertaking
21:42 ^🔗	SketchCow	It's not the deriver, it's the wayback machine.
21:42 ^🔗	SketchCow	It's a mess.
21:42 ^🔗	chronomex	ah
21:43 ^🔗	SketchCow	So it's easier to generate a .warc.gz file that cats up all the other warcs in a specific way.
21:43 ^🔗	chronomex	the way I take it, WBM indexes tar files that remain on petaboxes?
21:43 ^🔗	chronomex	thus there's one copy of the WBM data or something
21:43 ^🔗	SketchCow	No, it's weirder.
21:43 ^🔗	SketchCow	It's all so weird.
21:43 ^🔗	chronomex	s/tar/warc.gz/
21:43 ^🔗	SketchCow	As much as we want me to go into the substance of this, here we go.
21:43 ^🔗	SketchCow	I see three audiences for our data.
21:43 ^🔗	SketchCow	1. Wayback Machine
21:44 ^🔗	SketchCow	2. The individuals who had their data on the thing, wanting their shit back
21:44 ^🔗	SketchCow	3. Historians from The Future, with The Future being 1 hour to forever from now.
21:44 ^🔗	chronomex	yeap
21:44 ^🔗	SmileyG	agreed.
21:45 ^🔗	SketchCow	So, the problem is, 1. is very, very, very old school and was designed from the ground up along a whole range of very carefully decided "things".
21:45 ^🔗	SketchCow	It is also, being from a non-profit, not overly packed with dev teams.
21:45 ^🔗	SketchCow	This translates to "it takes things a certain way"
21:45 ^🔗	chronomex	picky, brittle.
21:45 ^🔗	SketchCow	It's possible to go 'well, leave things as they are, and make a second version'
21:45 ^🔗	SketchCow	And we're doing that for the moment with some items, for the sake of stepping into it slowly.
21:46 ^🔗	SketchCow	Obviously that doesn't work with MobileMe.
21:46 ^🔗	SketchCow	Now, I asked MobileMe to miss this current load-in to Wayback, because whatever decision/process is made becomes a 274tb decision.
21:47 ^🔗	flaushy	do you have slides for a 5 minute presentation why you should join the archiveteam? i am going to a small congress from the ccc tomorrow
21:47 ^🔗	SketchCow	No, just links to my talks.
21:48 ^🔗	SmileyG	flaushy: hmmmmm not that I'm aware of - watch Jason's Defcon speech and talk about Soy Sauce?
21:48 ^🔗	flaushy	could be good enough :)
21:48 ^🔗	SketchCow	http://www.us.archive.org/log_show.php?task_id=127610039
21:48 ^🔗	chronomex	soy sauce itself is >5 minutes :P
21:48 ^🔗	SketchCow	Can you guys see that?
21:48 ^🔗	SmileyG	yes SketchCow
21:48 ^🔗	chronomex	I can
21:48 ^🔗	SketchCow	Ok, so that's the deriver working with a megawarc.
21:48 ^🔗	flaushy	need login
21:49 ^🔗	SketchCow	Get a damn library card, buddy!
21:49 ^🔗	underscor	^
21:49 ^🔗	SketchCow	They're freeeeee
21:49 ^🔗	underscor	sweet, 1.8gb already on the first node!
21:49 ^🔗	underscor	cc alard, SketchCow
21:49 ^🔗	SketchCow	OK, so turning from that experiment, and still waiting to make sure it works.....
21:49 ^🔗	SketchCow	...I'd like to consider a process where we generate the megawarc by default.
21:50 ^🔗	SketchCow	And upload THOSE as webshots.
21:50 ^🔗	SketchCow	So my current process is "grab 50gb of these delightful picture warcs, .tar them, and then shove them on the internet archive."
21:51 ^🔗	alard	underscor: My uploads are going really fast.
21:51 ^🔗	SketchCow	But those .tars are good for the 2. (individuals) with a LOT of help from additional alard scripts, and 3. And not good for 1.
21:51 ^🔗	underscor	alard: that's a good thing, right?
21:51 ^🔗	underscor	hehe
21:51 ^🔗	SketchCow	Your uploads are going to boxes that aren't maxed out to misery
21:51 ^🔗	alard	underscor: Yes. :)
21:51 ^🔗	underscor	:D
21:51 ^🔗	underscor	SketchCow: hahahah
21:51 ^🔗	alard	We'll see how long it lasts.
21:51 ^🔗	chronomex	if we start with megawarcs, it's possible to make a tool that does range-requests and gets chunks in the middle
21:52 ^🔗	underscor	http://maelstrom.foxfag.com/munin/us.archive.org/bt-download00.us.archive.org/if_eth0.html
21:52 ^🔗	SmileyG	SketchCow: can we create some kind of "index" of the megawarc which we could feed into 1. (and use for 2.)
21:52 ^🔗	SketchCow	So I guess the question I pose to alard is, if we generate megawarcs, how hard would it be to write something that takes a link to the megawarc and returns your warc inside it?
21:52 ^🔗	SketchCow	SmileyG: The megawarc, by DEFINITION, works with 1. and 3.
21:52 ^🔗	SketchCow	And if it's in the Wayback, it helps 2.
21:53 ^🔗	SmileyG	SketchCow: ah duur failing to read.
21:53 ^🔗	chronomex	SmileyG: yes, there is an index. it's called a cdx.
21:53 ^🔗	SketchCow	So in THEORY, this would be fine.
21:53 ^🔗	chronomex	deriver makes it iirc
21:53 ^🔗	alard	The json file tells you where each file is, with byte ranges.
21:53 ^🔗	SmileyG	This is why I shouldn't irc while dying.
21:53 ^🔗	chronomex	how about not dying
21:53 ^🔗	alard	So it will tell you that user-x.warc.gz is in the big-warc from bytes a-b. This byte range you can feed to http://warctozip.archive.org/, for example. (This is how the tabblo/mobileme search things work.)
21:54 ^🔗	SketchCow	OK.
21:54 ^🔗	alard	Or you could do a curl with a byte range to get the warc.gz, if you don't like zip.
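The byte-range lookup alard describes above can be sketched in Python. This is a minimal sketch, not the actual megawarc tooling: the index layout (a "files" list whose entries carry a name plus an offset and size into the big warc), the function names, and the URL are all assumptions made up for illustration, so check the real .json file before relying on any field name.

```python
import json
import urllib.request


def find_range(index, warc_name):
    """Return (offset, size) for warc_name, or None if it is not listed.

    Assumed (hypothetical) index layout:
    {"files": [{"name": "...", "offset": 0, "size": 123}, ...]}
    """
    for entry in index["files"]:
        if entry["name"] == warc_name:
            return entry["offset"], entry["size"]
    return None


def range_request(url, offset, size):
    """Build (but do not send) an HTTP request for one warc.gz slice."""
    # HTTP byte ranges are inclusive at both ends.
    last_byte = offset + size - 1
    headers = {"Range": "bytes=%d-%d" % (offset, last_byte)}
    return urllib.request.Request(url, headers=headers)


# Made-up index and URL, for illustration only.
example_index = json.loads(
    '{"files": [{"name": "user-x.warc.gz", "offset": 1000, "size": 250}]}'
)
offset, size = find_range(example_index, "user-x.warc.gz")
request = range_request("http://example.org/big.megawarc.warc.gz", offset, size)
```

Passing the request to urllib.request.urlopen would fetch just that slice, which is the same idea as running curl with a byte range. Because each small warc.gz is stored unchanged inside the big one, the slice should be usable as a standalone warc.gz.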
21:54 ^🔗	SketchCow	So it SHOULD be possible with current tools to assist 2.
21:54 ^🔗	SketchCow	Or some minor scripting to access current tools.
21:54 ^🔗	alard	Yes.
21:54 ^🔗	chronomex	current tools or minor changes, yes
21:54 ^🔗	SketchCow	Ok.
21:54 ^🔗	SketchCow	Then yes, we're going to:
21:54 ^🔗	*	SmileyG has other things on his plate he's thinking about. Time to disappear again.
21:55 ^🔗	SketchCow	1. Start pushing webshots up from underscor's Circle-Jerk to archive.org as native megawarcs
21:55 ^🔗	SketchCow	2. See about (carefully) converting both previous webshots and mobileme to native megawarcs.
21:55 ^🔗	SmileyG	99. Geocities?
21:56 ^🔗	SketchCow	Geocities as we did it will never go into the wayback.
21:56 ^🔗	SmileyG	Never? drat
21:56 ^🔗	SketchCow	As we did it.
21:56 ^🔗	chronomex	nope, we didn't manage to collect enough metadata to put it into warc
21:56 ^🔗	SketchCow	In THEORY, we could generate warcs with some sort of obviousness that it could pull in.
21:56 ^🔗	SmileyG	we can't "redo" it though so....
21:56 ^🔗	SmileyG	hmmm, as long as it's "as" accessible as the others then shrug.
21:57 ^🔗	SketchCow	But man, I don't want to stress the IA infrastructure with THAT project this exact moment.
21:57 ^🔗	SketchCow	And by infrastructure I mean people.
21:57 ^🔗	SmileyG	wtf is hitting my keyboard o_O
21:57 ^🔗	SketchCow	sperm
21:57 ^🔗	SmileyG	worrying.
21:57 ^🔗	SketchCow	It dries
21:57 ^🔗	SmileyG	then its all crispy and the keys get stuck :<
21:58 ^🔗	SketchCow	check #archiveteam-spermclean
21:58 ^🔗	SketchCow	Read the FAQ
21:58 ^🔗	joepie91	stupid idea: set up haproxy on shitty unmetered gbit vps, proxy to various backends
21:58 ^🔗	SmileyG	lol sorry, dragging this off topic ¬_¬; Really am going away, just gonna watch the convo unless someone on the internet turns out to be wrong.
21:58 ^🔗	joepie91	upload over HTTP
21:58 ^🔗	SketchCow	joepie91: We did that way back when
21:58 ^🔗	SmileyG	shitty unmetered gbit vps <<< How much $$$?
21:58 ^🔗	SketchCow	It was hilarrrrrrrrrrrious
21:59 ^🔗	joepie91	SmileyG: not necessarily that much
21:59 ^🔗	joepie91	expect disk I/O etc to suck though, but that doesn't matter if it's just a proxy
21:59 ^🔗	soultcer	joepie91: Shitty unmetered VPS have one problem: In the end they are still a shitty VPS.
21:59 ^🔗	SketchCow	We did it on batcave, as I recall
21:59 ^🔗	joepie91	soultcer: shitty in the sense of everything but the bandwidth sucks
21:59 ^🔗	joepie91	:p
21:59 ^🔗	joepie91	SketchCow: what were the results?
21:59 ^🔗	SketchCow	Oh, it was very effective
21:59 ^🔗	alard	That was to fix network weirdness, where uploads directly to s3.us.archive.org were much slower than uploads proxied through batcave, also on archive.org.
22:00 ^🔗	alard	joepie91: But I think the HTTP upload from the warriors works fine now, without proxy stuff. The tracker redirects to one of the upload servers.
22:01 ^🔗	soultcer	SketchCow: Does the Internet Archive hire remote workers?
22:01 ^🔗	joepie91	alard: alright
22:01 ^🔗	joepie91	so... the upload problems should be solved, or?
22:01 ^🔗	alard	Yes, for the time being. :)
22:02 ^🔗	alard	Update your Webshots scripts, if you aren't using a warrior.
22:06 ^🔗	SketchCow	So, alard, is there a way to make a megawarc generator that just takes a directory instead of a tar?
22:08 ^🔗	alard	That depends on your definition of "megawarc". As it is now, the json contains tar headers and the position of the warcs in the original tar file. You could leave that out, though, and keep the properties that are useful for indexing the big-warc.
22:08 ^🔗	alard	What would be the best way to get the filenames to the megawarc script? Use find and pipe to stdin?
22:09 ^🔗	alard	(There may be too many files to go as command line arguments.)
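Feeding file names over stdin, as alard suggests, avoids the command-line length limit. A minimal sketch, assuming one path per line; the function name and the example command are hypothetical and not part of the megawarc script:

```python
import sys


def read_paths(stream):
    """Yield one path per non-empty line of stream."""
    for line in stream:
        path = line.rstrip("\n")
        if path:
            yield path


# Intended use (hypothetical script name):
#   find 12345/ -type f | python pack_from_stdin.py
# inside the script: for path in read_paths(sys.stdin): ...
```

Newline-separated input breaks on file names that themselves contain newlines; a NUL-separated list (find -print0) would be the more robust variant.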
22:09 ^🔗	SketchCow	I am comfortable with doing a tar to stdin... :)
22:10 ^🔗	SketchCow	or to stdout, I guess you might say
22:10 ^🔗	soultcer	Make the script recursively search a directory for warcs?
22:10 ^🔗	alard	Well, piping tar into the megawarc script won't easily work, since the script needs two passes over each warc file. (Once to check if it can be decompressed, once to copy it to the big-warc.)
22:11 ^🔗	SketchCow	Well, I assumed a different script taking a different approach.
22:11 ^🔗	alard	Yes, but I think you want the gzip test. If you don't have that test one invalid file can ruin the whole warc.
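The two passes alard mentions (test that each file decompresses, then copy it into the big warc) can be sketched like this. It is an illustration of the idea only, not the code from the megawarc script; the function names and the shape of the returned index are assumptions.

```python
import gzip
import shutil
import zlib


def gzip_is_readable(path):
    """First pass: decompress the whole file, discarding the output."""
    try:
        with gzip.open(path, "rb") as stream:
            while stream.read(1024 * 1024):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # Bad header, truncated file, or corrupt compressed data.
        return False


def build_big_warc(paths, out_path):
    """Second pass: append each valid file to out_path as raw bytes.

    Returns (index, rejected); index holds (path, offset, size) tuples.
    """
    index, rejected = [], []
    with open(out_path, "wb") as out:
        for path in paths:
            if not gzip_is_readable(path):
                rejected.append(path)
                continue
            start = out.tell()
            with open(path, "rb") as source:
                shutil.copyfileobj(source, out)
            index.append((path, start, out.tell() - start))
    return index, rejected
```

Concatenated gzip members form a valid gzip stream, which is why the files can be appended byte for byte. The rejected files would have to be kept somewhere else (for example in a separate tar) so nothing is lost.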
22:11 ^🔗	SketchCow	I mean, let's back it up. What I'd like is a way to take a directory, instead of a .tar, and make it a megawarc.
22:11 ^🔗	SketchCow	However that's done, I approve.
22:12 ^🔗	soultcer	So you want to stop even creating the tars for new projects and just upload warcs to the IA plus a small tar for the logfiles?
22:12 ^🔗	SketchCow	It's expensive as shit, but making a .tar file, and then running megawarc against it, then removing the tar file and uploading the megawarc files....I could live with that.
22:12 ^🔗	SketchCow	that might be smartest.
22:13 ^🔗	alard	I think the tar isn't really necessary, especially not if you don't want to 'reconstruct' a tar that never was from the megawarc.
22:13 ^🔗	alard	Or you could use made-up tar headers.
22:13 ^🔗	SketchCow	Well, let's think about it.
22:14 ^🔗	SketchCow	DO we want the .tar file? By reconstructing later, you have a nice standard collection of the files.
22:14 ^🔗	SketchCow	And you can pull things from it.
22:14 ^🔗	soultcer	Well we need some way to store "these records in the warc file belong to a single user"
22:18 ^🔗	SketchCow	So, how do we feel about that? I think a .tar existing somewhere along the line works very well for what we want to do.
22:18 ^🔗	SketchCow	because then the .tars can go into The Next Thing After Internet Archive
22:19 ^🔗	underscor	TNTAIA, for short
22:19 ^🔗	soultcer	But then you have to store both the tar file and the megawarc
22:19 ^🔗	SketchCow	No, no.
22:19 ^🔗	SketchCow	you are using a .tar as the intermediary instead of the file directory to generate the megawarc
22:20 ^🔗	SketchCow	So in come the piles o' files
22:20 ^🔗	SketchCow	At some point, you have a 50gb collection (say)
22:20 ^🔗	SketchCow	You make it a .tar
22:20 ^🔗	SketchCow	You megawarc the .tar
22:20 ^🔗	SketchCow	you upload the megawarc.
22:20 ^🔗	SketchCow	Now the thing's been standardized out past the filesystem
22:20 ^🔗	SketchCow	And can be turned into 50gb chunks in the future on your holocube 2000x
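The "collect about 50gb, then package it" step above can be sketched as a simple size-based batcher. This is an illustrative sketch only: the function name and the (name, size) input format are assumptions, and the real watcher scripts mentioned in the channel may work differently.

```python
def batch_by_size(files, limit_bytes):
    """Group (name, size) pairs into batches of at most limit_bytes.

    A single file larger than the limit gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Close the current batch if this file would push it over the limit.
        if current and current_size + size > limit_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then go through the tar and megawarc steps and be uploaded as one item.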
22:21 ^🔗	soultcer	How about instead of creating a .tar and megawarcing it, you directly create the megawarc from the 50gb of source files?
22:21 ^🔗	SketchCow	This is what we just discussed.
22:22 ^🔗	soultcer	Oh, I thought you wanted to keep the "create a tar and megawarc it" step
22:22 ^🔗	SketchCow	I asked about that possibility, but it does lead to concerns.
22:22 ^🔗	SketchCow	by making something a .tar and then making it a megawarc, we have an intermediary thing it's converted back into that's able to be manipulated by other programs.
22:23 ^🔗	SketchCow	And I am saying, I think this is a good idea for future extensibility.
22:23 ^🔗	SketchCow	1. 2. and 3. are all handled.
22:23 ^🔗	soultcer	Even if you skip the tar step, you can later convert it back to a tar file.
22:24 ^🔗	soultcer	Though as long as going through the tar step doesn't create much of a bottleneck, it is probably nice to use tools that already exist and that do one thing well
22:24 ^🔗	SketchCow	http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5
22:24 ^🔗	SketchCow	OK, so update.
22:24 ^🔗	SketchCow	It definitely made a .CDX
22:25 ^🔗	underscor	nice!
22:25 ^🔗	underscor	SketchCow: So is that what you want to do? fill up 50gb->tar->megawarc->ingest->rinse/repeat?
22:26 ^🔗	SketchCow	That is my proposal - I can give you scripts that I wrote and which alard wrote.
22:26 ^🔗	underscor	okay, awesome
22:26 ^🔗	SketchCow	But first, I want to have us discuss 50gb->megawarc->ingest
22:26 ^🔗	SketchCow	Because that was also on the table. pros and cons.
22:26 ^🔗	underscor	alard gave me the "watcher" script
22:26 ^🔗	underscor	but I don't have a tar/s3-er
22:26 ^🔗	underscor	that moves things into a temp dir
22:27 ^🔗	SketchCow	Right.
22:27 ^🔗	SketchCow	No, I'll give you those, but first I want this decided.
22:27 ^🔗	SketchCow	Also, I asked people to verify the CDX just generated.
22:27 ^🔗	SketchCow	Because if it just made borscht, more borscht is not a buddy.
22:27 ^🔗	SketchCow	I'm also about to restore that megawarc-5 to see what happens.
22:29 ^🔗	soultcer	SketchCow: As I said, it would be easy to modify alard's megawarc creator so it directly takes a directory of small warcs/wget logs and creates the same output (minus some tar metadata, that isn't necessary)
22:30 ^🔗	SketchCow	As alard said, one corrupted gz makes it not work
22:31 ^🔗	alard	soultcer: Just thought that it should be possible to add tar headers as well. Let Python create them, as if it is making a tar.
22:31 ^🔗	alard	I'll have a look.
22:31 ^🔗	SketchCow	Let's put it this way.
22:31 ^🔗	alard	Other question: it's possible to put the extra-tar and the json inside the big-warc. Is that useful?
22:31 ^🔗	soultcer	alard: Would work, but why would we need the tar headers anyway? It's mostly metadata about the filesystem on fos.
22:31 ^🔗	alard	You'd have one file, but the index would be less accessible.
22:32 ^🔗	SketchCow	alard: It makes it harder to decipher later. I'd keep it outside
22:32 ^🔗	alard	soultcer: It has timestamps.
22:32 ^🔗	alard	and it makes it easier to make a tar.
22:33 ^🔗	soultcer	alard: I don't really see why we would need any of the tar metadata, but it would of course be possible to create some of it, but I have no idea how to create the tar header string you have in the json file
22:34 ^🔗	joepie91	SketchCow: I'm still having rsync issues for btinternet - does that happen for everyone?
22:34 ^🔗	SketchCow	The issue is not everything we add is a warc.gz
22:34 ^🔗	SketchCow	Sometimes it's going to be additional 'stuff'
22:35 ^🔗	joepie91	for every single job: @ERROR: max connections (5) reached -- try again later
22:35 ^🔗	soultcer	SketchCow: The additional files are simply put into a single tar archive
22:35 ^🔗	SketchCow	btinternet just got more love
22:36 ^🔗	joepie91	looks fixed now, thanks
22:36 ^🔗	joepie91	:P
22:36 ^🔗	soultcer	Together with the metadata from the json file, the tar archive with the additional files and the megawarc file, you can recreate the original directory structure, or create a tar archive with all files
22:36 ^🔗	joepie91	whoa
22:36 ^🔗	joepie91	starts scrolling like mad
22:36 ^🔗	joepie91	lol
22:37 ^🔗	joepie91	looks like people had a lot in queue
22:37 ^🔗	SketchCow	soultcer: I'm going to again defer to alard's opinion on this.
22:37 ^🔗	joepie91	especially Sue
22:37 ^🔗	joepie91	cough
22:37 ^🔗	SketchCow	Incoming crap - ends up as megawarc
22:37 ^🔗	SketchCow	I just want the megawarc that results to be useful to the historians and the individuals as much as it is to wayback.
22:38 ^🔗	soultcer	And the wget logs I assume?
22:39 ^🔗	SketchCow	I don't want anything being lost
22:41 ^🔗	joepie91	lol, at this pace, btinternet will be done in 30 minutes
22:47 ^🔗	soultcer	alard: So what do you think? Bundle as tar, then megawarc; or directly create the megawarc?
22:50 ^🔗	SketchCow	Give him a moment, I see he's been coding some stuff related to uploads.
22:51 ^🔗	soultcer	Sure.
22:53 ^🔗	soultcer	The thing with the tar is: It includes a lot of metadata on who created the tar file (uid/gid), when it was created (mtime/ctime) and the filesystem permissions. I am not sure if we want to keep those, or not
23:06 ^🔗	SketchCow	I've reconstructed a .tar from the megawarc.
23:06 ^🔗	SketchCow	Now unpacking it to see if everything comes out ok.
23:07 ^🔗	soultcer	They should be bit-for-bit copies I think
23:07 ^🔗	SketchCow	Absolutely.
23:08 ^🔗	SketchCow	Regardless, I am doing what someone in 10 years would be doing.
23:17 ^🔗	Sue	i'm sorry for what i did to btinternet
23:17 ^🔗	chronomex	they got graped
23:18 ^🔗	Sue	i had like
23:18 ^🔗	Sue	300-400 rsync jobs queued up
23:18 ^🔗	Sue	apparently 17G woth
23:18 ^🔗	Sue	*worth
23:18 ^🔗	chronomex	jesus
23:18 ^🔗	Sue	it's about to be done
23:21 ^🔗	alard	New version: https://github.com/alard/megawarc
23:21 ^🔗	alard	I renamed megawarc build TAR to megawarc convert TAR (seemed more logical).
23:22 ^🔗	alard	There's now also a megawarc pack TAR FILE-1 FILE-2 ... option that packs files/paths directly.
23:22 ^🔗	alard	You still need to provide TAR to make the file names, but that tar doesn't exist.
23:23 ^🔗	alard	E.g. ./megawarc pack webshots-12345.tar 12345/ should work.
23:24 ^🔗	alard	Then ./megawarc restore webshots-12345.tar would give you a tar file.
23:24 ^🔗	soultcer	alard: Nice work. I was just thinking about simply using the TarInfo class to create the tar_header structure, but I see you not only thought of it faster, you implemented it while I was still thinking about the details ;-)
23:27 ^🔗	alard	I copied most of it from Python's tarfile.py.
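Building a tar header for a tar that never existed, as discussed above, can be done with Python's tarfile.TarInfo. A minimal sketch, assuming GNU format, a fixed file mode, and base64 as the way to keep the header in JSON; the function name and example values are made up, and the real megawarc code may choose differently.

```python
import base64
import tarfile


def make_tar_header(name, size, mtime):
    """Return a 512-byte tar header block for a regular file."""
    info = tarfile.TarInfo(name=name)
    info.size = size
    info.mtime = mtime
    info.mode = 0o644  # assumed permissions, not taken from a real file
    # With a short name, GNU format yields exactly one 512-byte block.
    return info.tobuf(format=tarfile.GNU_FORMAT)


# Example values are invented for illustration.
header = make_tar_header("12345/user-x.warc.gz", 250, 1349997600)
header_b64 = base64.b64encode(header).decode("ascii")
```

Writing the header, then the file data padded to a multiple of 512 bytes, for every entry, followed by two zero-filled 512-byte blocks, gives a valid tar stream.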
23:27 ^🔗	soultcer	Good programmers code, better programmers reuse ;-)
23:28 ^🔗	Sue	btinternet is now in the negative
23:29 ^🔗	joepie91	ooo
23:29 ^🔗	joepie91	100MB btinternet user incoming
23:29 ^🔗	joepie91	... wat
23:29 ^🔗	joepie91	how's that even possible?
23:29 ^🔗	SketchCow	means they paid for premium
23:29 ^🔗	joepie91	SketchCow: but premium users are on a separate server
23:29 ^🔗	SketchCow	A la geocities and a few others, the old address is kept while the premium address goes up.
23:29 ^🔗	chronomex	recursion!
23:29 ^🔗	SketchCow	We found 1gb geocities users
23:29 ^🔗	joepie91	ah
23:30 ^🔗	alard	Time to find more usernames then, (There are also 1185 usernames still claimed, over 1000 by Sue.)
23:30 ^🔗	joepie91	wonder how they did that though
23:30 ^🔗	joepie91	because there's a separate server for all premium users
23:30 ^🔗	joepie91	two IPs away from the free server
23:30 ^🔗	Sue	over 1k by me? must be a glitch
23:30 ^🔗	alard	Are all your instances finished?
23:31 ^🔗	Sue	i'm still doing probably 20-30
23:31 ^🔗	Sue	the screen isn't full of no item received yet
23:31 ^🔗	joepie91	mine is
23:31 ^🔗	joepie91	or well
23:31 ^🔗	joepie91	alternating between no item received and tracker rate limiting
23:31 ^🔗	joepie91	lol
23:33 ^🔗	Sue	can you release items per user? that's strange that i have so many...
23:33 ^🔗	alard	I've put them back in the queue.
23:34 ^🔗	Sue	ok
23:34 ^🔗	alard	And with that I'm off to bed. Bye!
23:34 ^🔗	SketchCow	Thanks again, alard
23:34 ^🔗	Sue	i hunger for more
23:35 ^🔗	joepie91	suddenly, 5mbit!
23:35 ^🔗	joepie91	goodnight alard
23:35 ^🔗	joepie91	:)
23:39 ^🔗	DFJustin	wow so unless we find way more users, all of btinternet will fit on a microsd card
23:39 ^🔗	DFJustin	INCREASING COSTS
23:40 ^🔗	SketchCow	Huh, someone recorded alard's process for programming new code.
23:40 ^🔗	SketchCow	http://www.youtube.com/watch?feature=fvwp&NR=1&v=8VTW1iUn3Bg
23:40 ^🔗	SketchCow	Screencap's gotten good
23:40 ^🔗	SketchCow	(He's the one in the glasses)
23:43 ^🔗	Sue	i'm out of users, 14 left downloading
23:50 ^🔗	SketchCow	That's to be expected.
23:57 ^🔗	SketchCow	Internet Archive's teams have signed off on the megawarcs.
23:57 ^🔗	SketchCow	So guess what - FOS is making a ton of fucking megawarcs tonight.
