#archiveteam 2012-10-11,Thu


Time Nickname Message
00:00 🔗 nintendud ah, I'm seeing timeouts in my warrior
00:00 🔗 nintendud it must really be getting crushed
00:00 🔗 primus what does FOS stand for?
00:01 🔗 nintendud Free and Open Source? Maybe?
00:02 🔗 sankin just curious, what are the hardware specs for fos?
00:02 🔗 SketchCow This is really bad.
00:02 🔗 nintendud it's a raspberry pi hooked up to a RAID array
00:02 🔗 SketchCow It shouldn't be this hammered.
00:02 🔗 nintendud Oh?
00:04 🔗 nintendud speaking of 'pi's, apparently you can colocate a pi in Austria for free: https://www.edis.at/en/server/colocation/austria/raspberrypi/
00:04 🔗 SketchCow FOS stands for Fortress of Solitude
00:04 🔗 SketchCow It replaced a machine named Batcave
00:04 🔗 nintendud Hah, nice
00:04 🔗 SketchCow FOS became a way to refer to it easily.
00:04 🔗 primus :-) awesome, thanks
00:05 🔗 S[h]O[r]T i always thought it was a fun take on FiOS because verizon sponsored it :P
00:05 🔗 nintendud iFOS. By Apple.
00:05 🔗 S[h]O[r]T even though i know that is nowhere near true
00:17 🔗 nintendud I wonder if these fixed 30 second retries have us all hammering FOS at the same time.
00:17 🔗 chronomex thundering herd effect?
00:18 🔗 nintendud TIL that term. Essentially that, although more than one client can rsync at a time.
00:18 🔗 nintendud it's why random backoff in ethernet is a thing
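A minimal sketch of the jittered retry nintendud is alluding to, in Python (the command, base delay, and cap are hypothetical):

    import random
    import subprocess
    import time

    def retry_with_jitter(cmd, base=5, cap=300):
        # Retry cmd until it exits 0, sleeping a random, exponentially
        # growing interval so clients don't all retry in lockstep.
        delay = base
        while subprocess.call(cmd) != 0:
            time.sleep(random.uniform(0, delay))  # the random backoff
            delay = min(delay * 2, cap)

    # e.g. retry_with_jitter(["rsync", "-av", "payload/", "fos.example.org::uploads/"])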
00:19 🔗 nintendud I keep getting about 5% uploaded before it dies
00:24 🔗 SketchCow Machine is seriously getting hammered.
00:24 🔗 SketchCow Not sure what to do yet.
00:24 🔗 SketchCow Might set rsync.
00:25 🔗 nintendud Is it coming in 30 second waves?
00:25 🔗 nintendud Or is it just a constant surge of traffic?
00:25 🔗 SketchCow ha ha you act like pressing keys makes anything happen.
00:25 🔗 nintendud Oh right. The tubes. They are clogged.
00:26 🔗 S[h]O[r]T if you have access to the switch or firewall in front of it you can block certain IP ranges or ports to slow down the flow of traffic in
00:26 🔗 SketchCow I like where you said that too.
00:26 🔗 SketchCow All these suggestions are well meaning and useless.
00:26 🔗 SketchCow I'm going to implement a max connections as soon as I can get vi to respond.
00:27 🔗 S[h]O[r]T well if you had access to the switch you could just deny all rsync or anything else and allow ssh :p
00:27 🔗 S[h]O[r]T that wouldnt be useless
00:28 🔗 SketchCow Yes.
00:28 🔗 SketchCow So.....
00:28 🔗 SketchCow If only we could turn lead into gold, we could solve a number of problems.
00:28 🔗 SketchCow But the impossibility of that makes it useless.
00:30 🔗 SketchCow Realize my temper is going to be short while I wrestle with a machine with over 1,100 rsync connections active.
00:30 🔗 nintendud Yup. Good luck, soldier.
00:31 🔗 SketchCow And advice along the lines of "to fix the problem, you should fix the problem" is a brain fart
00:32 🔗 SketchCow It has been trying to open a vi session for 4 minutes.
00:32 🔗 SketchCow That's how bad it is.
00:32 🔗 SketchCow I have two other windows, trying to set up a killing of rsync
00:34 🔗 S[h]O[r]T im guessing you didnt want any advice then and are just venting
00:34 🔗 DoubleJ Mine finally timed out so I was able to pause the VM. So my minuscule part of the load is off.
00:39 🔗 SketchCow I set it to 20
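The knob being set is rsyncd's per-module "max connections" parameter; a hedged sketch of the relevant rsyncd.conf fragment (module name and path are hypothetical):

    [uploads]
        path = /2/incoming
        read only = false
        max connections = 20
        # rsyncd tracks the connection count through this lock file
        lock file = /var/run/rsyncd.lock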
00:43 🔗 SketchCow Now doing a megakill
00:44 🔗 SketchCow Bitches
00:44 🔗 SketchCow ps -ef | grep rsync | awk '{print $2}' | xargs kill
00:47 🔗 nintendud no killall?
00:47 🔗 chronomex or skill
00:48 🔗 SketchCow shhh, I'm oldschool
00:48 🔗 * chronomex nods knowingly
00:49 🔗 chronomex you have legitimate claim to the phrase "I have underwear that's older than your home directory"
00:56 🔗 igelritte nice
00:57 🔗 SketchCow root@teamarchive-1:/etc# killall rsync
00:57 🔗 igelritte though I think if he had used 'ps -aux | grep'... that would have been better
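For reference, equivalent one-liners for the same mass kill, using the tools named above (pkill added as the modern option):

    pkill -x rsync    # match the process name exactly
    killall rsync     # what SketchCow runs above
    skill rsync       # the older procps tool chronomex mentions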
00:59 🔗 igelritte looks like it's time for bed. Gettin' a little punchy.
00:59 🔗 igelritte later
01:00 🔗 SketchCow Machine is pretty hosed.
01:06 🔗 SketchCow FOS crashed.
01:07 🔗 BlueMax Wow, what happened
01:22 🔗 SketchCow DJ Smiley remix of the main page of archiveteam.org now in place.
01:32 🔗 SketchCow fos is back
01:32 🔗 SketchCow now running with some severe rsync limits while we get shit in shape.
02:18 🔗 godane i'm uploading issue 150 dvd of linux format
03:47 🔗 bsmith096 @ERROR: max connections (5) reached -- try again later
03:47 🔗 bsmith096 Starting RsyncUpload for Item woodp
03:47 🔗 bsmith096 getting a whole mess of these
03:47 🔗 bsmith096 rsync error: error starting client-server protocol (code 5) at main.c(1534) [sender=3.0.9]
03:49 🔗 S[h]O[r]T the server (fos) that stuff rsyncs to is limited to 5 rsync connections atm, it was having issues earlier. SketchCow should update it once it's all good at some point
03:51 🔗 bsmith096 so is the script gonna continue at some point? cause it just keeps trying to send those 2 users over and over
03:51 🔗 S[h]O[r]T yeah it will keep trying until it gets through
03:51 🔗 S[h]O[r]T can just leave it running
03:52 🔗 underscor I thought it only tries 50 times
03:52 🔗 underscor and then gives up?
03:54 🔗 S[h]O[r]T if it does thats 25min and there must be a bug?
03:55 🔗 S[h]O[r]T thats good tho :P
03:58 🔗 S[h]O[r]T i looked a while back and again just a bit ago; pretty sure the rsync in the pipeline doesn't have a lot of overhead, but i could be wrong. i know there are some options to turn off compression and use a lower-cost encryption that generates less cpu usage.
03:58 🔗 S[h]O[r]T for client and server
04:24 🔗 underscor S[h]O[r]T: Well, I'm just saying
04:24 🔗 underscor with the rate limit on fos
04:24 🔗 underscor it's very likely you could not get in in 25m
04:24 🔗 underscor and then the thing will just give up
04:24 🔗 underscor and you're wasted
04:24 🔗 underscor 3
04:24 🔗 underscor D:
04:43 🔗 S[h]O[r]T im saying its been more than 25min and i havent got in and its still trying
04:47 🔗 underscor oic
04:48 🔗 underscor maybe I'm wrong
04:48 🔗 underscor I just overheard someone say that
04:48 🔗 underscor looks like SketchCow upped it to 10
04:48 🔗 underscor none of my threads are doing any work still, though
04:48 🔗 underscor hopefully we can reopen the floodgates soon
04:48 🔗 underscor otherwise we're definitely not going to do well with webshots XD
04:51 🔗 underscor yay!
04:51 🔗 underscor finally got one in
04:51 🔗 underscor w00t
05:13 🔗 S[h]O[r]T i dont see it got upped to 10 :P
06:27 🔗 ivan` is anyone in the rehosting-dead-pastebins business?
06:27 🔗 ivan` 100K pastes from paste.lisp.org would be better off googlable
06:33 🔗 chronomex do you have them??
06:42 🔗 ivan` yes
06:42 🔗 ivan` http://ludios.org/tmp/paste.lisp.org.7z
06:42 🔗 ivan` chronomex: ^
06:43 🔗 deathy something up with warrior upload? getting "@ERROR: max connections (10) reached -- try again later"
06:45 🔗 chronomex <3
06:45 🔗 Cameron_D The server we rsync to is currently limited because it was having problems earlier
06:46 🔗 chronomex thanks ivan`! are you involved with paste.lisp.org?
06:46 🔗 ivan` no, I think stassats runs it, but his reply did not indicate interest in restoring them
06:46 🔗 chronomex aye.
06:47 🔗 chronomex wow, this is a lot of files
06:47 🔗 ivan` heh
06:47 🔗 deathy hoping limit gets increased/lifted... almost all warriors waiting for upload :|
06:47 🔗 SketchCow WHY HELLO
06:48 🔗 * chronomex shoves this into IA
06:48 🔗 SketchCow You crying sallybags.
06:48 🔗 chronomex wassap brah
06:48 🔗 SketchCow You whip a virtual server within an inch of its life, and then woah, you all want it jogging around the track 5 minutes later.
06:49 🔗 SketchCow Also, I like Underscor whining on 3 channels about me taking a reasonable attempt to prevent the machine dying.
06:49 🔗 SketchCow 948 simultaneous rsyncs.
06:49 🔗 SketchCow Think about that.
06:49 🔗 chronomex o_O
06:49 🔗 SketchCow You know what you did.
06:49 🔗 chronomex bitches gotta bitch
06:49 🔗 * SketchCow gets the newspaper
06:49 🔗 deathy good job team! :)
06:49 🔗 Cameron_D haha
06:50 🔗 soultcer We need support for distributing uploads to multiple servers. Next one to complain about fos being unreachable will be volunteered to code that into the seesaw kit.
06:50 🔗 chronomex ivan`: can you share some info about this file? when was it captured, was the paste dead at the time, is it complete, etc
06:52 🔗 SketchCow Tomorrow, FOS goes down when one of the admins increases its swap from 2gb to 6gb.
06:59 🔗 ivan` chronomex: pastes were captured 2011-11-14 and 2012-05-01 and 2012-10-06 (though perhaps I should strip those); not complete, I don't have pastes 129789-131892
07:00 🔗 chronomex ok
07:00 🔗 chronomex :D
07:04 🔗 chronomex http://archive.org/details/paste_lisp_org
07:06 🔗 SketchCow So, basically I have a couple days to prepare some more archiveteam items for ingestion into the wayback.
07:09 🔗 SketchCow 188,329,776 14.0M/s eta 3h 58m
07:09 🔗 SketchCow Now that's a spicy meatball
07:10 🔗 SketchCow 1,067,816,696 17.7M/s eta 4h 19m
07:10 🔗 SketchCow Downloaded a gig. Going to take 4 hours. It's like that.
07:14 🔗 SketchCow With luck, I can make a lot more of these things green.
07:14 🔗 SketchCow If this all works, all the green ones go into the wayback machine instantly.
07:15 🔗 SketchCow Instant SOPA review end of the month!
07:15 🔗 SketchCow That'd be nice.
07:41 🔗 SketchCow Re-initiated uploads from fos to archive.org of webshots loads.
07:46 🔗 alard Hi there. I've looked up the "rsync retries 50 times" claim that I made yesterday. I now think that's wrong and that it retries indefinitely, so your warriors will just wait until they can upload.
08:37 🔗 SketchCow OK, napping.
13:07 🔗 joepie91 SketchCow: I'm not quite sure what to do with this, but I archived the videos that someone (I think bsmith096) linked me to a while ago as rare footage: http://aarnist.cryto.net:81/youtube/all/
13:07 🔗 joepie91 flv/mp4/webm format
13:17 🔗 balrog_ having a hard time with rsync with warrior
13:17 🔗 balrog_ getting "max connections reached" errors
13:22 🔗 joepie91 same
13:22 🔗 balrog_ alard: ya there?
13:24 🔗 S[h]O[r]T the server the scripts rsync to is currently limited because it was having problems earlier
13:24 🔗 balrog_ yeah, but how do I keep the warrior going?
13:24 🔗 balrog_ I have this bandwidth which otherwise isn't going to get used
13:24 🔗 S[h]O[r]T just have to wait, it will keep retrying uploads :\
13:25 🔗 balrog_ need to shorten the wait time from 30 seconds to more like 5 then
13:42 🔗 balrog_ is there any way I can tweak this? :/
13:45 🔗 SmileyG more concurrent threads?
13:45 🔗 SmileyG problem is we are all downloading it faster than FOS can accept it back in
13:45 🔗 balrog_ yeah
13:46 🔗 SmileyG The fix is FOS Accepting it faster, or us having larger caches.
13:46 🔗 SmileyG larger caches are possible if you do more concurrent downloads, however depending on how fast you download in ratio to the max upload, you're still going to get stuck eventually
13:47 🔗 SmileyG joepie91: I'd upload them to IA and give SketchCow a link,
13:47 🔗 SmileyG thats what I've done with the fish ezine I get each month.
13:48 🔗 SmileyG [08:46:31] <@alard> Hi there. I've looked up the "rsync retries 50 times" claim that I made yesterday. I now think that's wrong and that it retries indefinitely, so your warriors will just wait until they can upload.
13:48 🔗 SmileyG Phew! that was a worry for me.
14:02 🔗 balrog_ 5 connections seems a bit low
14:09 🔗 alard Hey, "need to shorten the wait time from 30 seconds to more like 5 then" is not a good idea. If we all do that, will just increase the load on the server, but won't increase the number of uploads.
14:10 🔗 flaushy the problem is that users like me with slow uploads (max 1 MiB/s) will use the slots for a long time :/
14:10 🔗 alard We'll just have to wait until the server can handle more connections. (Or we'll have to find some other server we can upload to, to spread the load.)
14:10 🔗 flaushy right
14:11 🔗 flaushy can we, until then, increase the warrior concurrency to a higher max than 6?
14:13 🔗 alard No, that would require a lot of updates. (And I also don't really see how that would help. It would just add more waiting uploads.)
14:13 🔗 alard Just be patient. :)
14:13 🔗 flaushy okie :)
14:14 🔗 flaushy at least we don't lose the queues, which is great
14:15 🔗 soultcer alard: What would the requirements of such a server be?
14:15 🔗 SmileyG we need a mini IA just for our uploads lol
14:16 🔗 flaushy i mean underscor looks like he has enough bandwidth to act as a caching server for the smaller guys. But i might be wrong
14:22 🔗 alard soultcer: It would need downstream and upstream bandwidth, and a not too small disk to receive files before it packages and uploads them to archive.org.
14:22 🔗 alard Uploads are 50GB batches, so a multiple of that.
14:24 🔗 flaushy would 100 mbit be enough?
14:24 🔗 soultcer Maybe renting a cheap server from hetzner or ovh for a month would work
14:25 🔗 alard Yes, 100mbit would be enough (we also don't have to send all uploads to one server).
14:26 🔗 SmileyG the bt ones are the issue right?
14:26 🔗 SmileyG because they are so short.... ?
14:26 🔗 SmileyG Shame we can't package multiple users together on the warrior?
14:26 🔗 alard I do not know what the issue is. It can't be bt, I would think, since we have only a few thousand small users there.
14:27 🔗 SmileyG alard: but most of them finish in sub30 seconds
14:27 🔗 SmileyG thats a lot of rsync processes spawning constantly for such small transfers.
14:27 🔗 alard Yes, so there aren't many active at the same time. But I don't know what the issue is, really. It could be the number of processes spawning, or just the number of concurrent uploads.
14:28 🔗 alard Resuming uploads are probably also more expensive than new uploads.
14:28 🔗 alard (There would have been a few of them when the server came back up, I suspect.)
14:29 🔗 alard It doesn't have to be rsync, by the way. That's just what fos currently has.
14:30 🔗 alard Anyway, I'll be back later.
14:30 🔗 SmileyG o/
14:30 🔗 soultcer Does the bundling script rely on the partial setting? You could use --inplace, then it won't have to rename/move files after finishing
14:34 🔗 SmileyG partial works for the webshots but makes no sense with the BT ones
14:37 🔗 SmileyG 14254080 52% 166.45kB/s 0:01:18
14:41 🔗 SketchCow Back once again.
14:44 🔗 flaushy meh, i need a couple of minutes mostly
14:44 🔗 flaushy alard: i'll ask at my university
14:44 🔗 flaushy whether i can crawl with the pools at night, and whether a dump would be acceptable
14:48 🔗 tef_ alard, DFJustin: 0.18 and 1.0 warcs are the same bar the version number, yes. (I have this from one of the authors of the warc spec)
14:49 🔗 tef_ pps warc2warc in warctools can recompress warcs record by record. warc2warc *.warc.gz -O output.warc.gz
14:53 🔗 soultcer If it can recompress warcs, can it also concatenate them? Simply create one big warc file instead of tarring multiple warc files. Would make it easier for IA to use?
15:15 🔗 DFJustin so SketchCow / underscor, can you pull bt usernames out of the wayback database, I can do stuff like http://wayback.archive.org/web/*/http://www.btinternet.com/~* but I only get a few hundred at a time and it will take forever
15:19 🔗 alard DFJustin: underscor sent a list from the wayback database yesterday.
15:20 🔗 DFJustin well I was getting usernames just now that rescue-me didn't know, although I think most of them are long gone
15:21 🔗 alard Ah, I don't know what they searched for.
15:23 🔗 alard soultcer: I think --partial or --inplace doesn't really matter (moving a file on the same disk isn't that expensive, is it?)
15:25 🔗 alard I was playing with this for the one-big-warc problem: https://github.com/alard/megawarc Any good suggestions?
15:25 🔗 SketchCow http://24.media.tumblr.com/tumblr_m9dvjezOvX1qm3r26o1_500.jpg
15:25 🔗 soultcer When you have a big file half-uploaded, and then continue without --inplace, it will first make a temp copy of the already existing stuff, then write to that temp copy
15:25 🔗 soultcer Only when it has finished uploading, will it remove that copy
15:26 🔗 soultcer I had trouble transfering a file because rsync took more than 1.5 times the size of the file when I didn't use inplace
15:27 🔗 alard In any case, --inplace can't be used here, because then half-uploaded files could be moved by the postprocessing script.
15:28 🔗 soultcer alard: What do we need the original tar file for?
15:28 🔗 alard It's nice to be able to find the per-user files.
15:28 🔗 SketchCow yes
15:28 🔗 alard And for mobileme there are wget.logs and other files.
15:32 🔗 alard So even though you'd probably never want the original tar file back, it's useful to keep the data somewhere. The 'restore' function demonstrates that there's no data lost.
15:48 🔗 tef_ alard: if you have extra logs to put in, warc record can handle that metadata records
15:50 🔗 alard tef_: I know. The new projects have one single warc file per user. The older projects, mobileme, have the logs and a few other files besides the warcs.
15:51 🔗 alard (And even with mobileme the wget log is also in the warc files, I think.)
15:51 🔗 tef_ nice
15:51 🔗 tef_ but yeah converting from .tar to warc.gz could happily convert non-warc records into warc records in the final output
15:52 🔗 alard Yes, so you could make one file that has everything.
15:52 🔗 SketchCow Here's a hilarious one - the fortunecity collection. It's warcs AND straight html.
15:52 🔗 tef_ SketchCow: warc records can be of 'resource' instead of 'response' :-)
15:52 🔗 alard We could put the tar file in the big warc.
15:59 🔗 tef_ heh
16:19 🔗 underscor SketchCow: I wasn't whining!
16:22 🔗 underscor alard: Does the seesaw kit support round-robining rsync servers?
16:22 🔗 underscor Because I have 12 boxes at archive.org we could rr between
16:28 🔗 alard underscor: Not yet, but it could. I think it would be even better to do it with HTTP PUT uploads, though. That would make round-robining easier. (And it might be less stressful for the server.)
16:28 🔗 underscor Hmm, as safe as rsync though?
16:28 🔗 underscor (checksum-wise)
16:28 🔗 SmileyG hmmmm
16:28 🔗 SketchCow alard: First test of megawarc coming up
16:28 🔗 alard Does rsync make many checksums?
16:29 🔗 underscor I thought it did a checksum
16:29 🔗 underscor But actually, no
16:29 🔗 SmileyG it does
16:29 🔗 underscor In write only mode, it doesn't
16:29 🔗 alard Only if you allow it, I thought. (Other than the filesize thing.)
16:29 🔗 SmileyG files to check #0/1
16:29 🔗 SmileyG currently it appears to check the writes...
16:30 🔗 SmileyG can you just use dns RR too?....
16:30 🔗 underscor Yeah, but that requires waiting for propagation, etc
16:30 🔗 underscor Also a lot of places (RIT included) munge the results
16:31 🔗 underscor and only return one of them until the cache expires
16:31 🔗 SmileyG o
16:31 🔗 SmileyG ttl 5
16:31 🔗 SmileyG :D
16:31 🔗 underscor haha
16:31 🔗 underscor They ignore ttl :(
16:31 🔗 SmileyG just make sure your dns server can take it
16:31 🔗 SmileyG wut ¬_¬
16:31 🔗 underscor Yeah
16:31 🔗 underscor sux
16:32 🔗 SmileyG ok, have the tracker hand out upload details?
16:32 🔗 SmileyG along with username?
16:33 🔗 underscor alard: Setting up a PUT server for testing
16:34 🔗 alard We could write a tiny node.js PUT server with checksums. :)
16:36 🔗 soultcer Why complicate it further by introducing another programming language?
16:36 🔗 alard Good question.
16:41 🔗 soultcer Is there no simple point to point file transfer protocol with checksumming?
16:43 🔗 alard Do we need checksums? (If we're uncomplicating. :)
16:44 🔗 underscor Nah.
16:44 🔗 underscor I was just putting up a put accepter in nginx
16:44 🔗 underscor since I already have it on these boxen
16:44 🔗 alard After all, once it's on that server we uploaded to we'll be using the non-checksummed s3 api to bring it to archive.org.
16:46 🔗 alard underscor: Do you happen to know if there's a way to distinguish uploaded from still-uploading files?
16:46 🔗 underscor No idea. Let me see.
16:48 🔗 alard That's useful to know for the postprocessing. (The current packaging script moves any file it can find.)
16:49 🔗 deathy "A file uploaded with the PUT method is first written to a temporary file, then a file is renamed. Starting from version 0.8.9 temporary files and the persistent store can be put on different file systems but be aware that in this case a file is copied across two file systems instead of the cheap rename operation."
16:49 🔗 deathy apparently from "ngx_http_dav_module" docs
16:50 🔗 alard Ah, that's promising.
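A hedged sketch of the ngx_http_dav_module setup being described (the location path and permissions are hypothetical; dav_methods and create_full_put_path are the real directives):

    location /webshots/ {
        dav_methods PUT;            # accept HTTP PUT uploads
        create_full_put_path on;    # create intermediate directories on the fly
        dav_access user:rw group:rw all:r;
        client_max_body_size 0;     # don't cap the upload size
    }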
16:54 🔗 S[h]O[r]T FTP
16:55 🔗 soultcer FTP? What are we, farmers?
16:57 🔗 S[h]O[r]T lol
16:57 🔗 underscor http://p.defau.lt/?SBDTYn8UhfxVvm4rSmlydw
16:57 🔗 underscor cc alard
16:57 🔗 underscor :D
16:57 🔗 underscor and it didn't appear until after the upload fully finished
16:57 🔗 alard Nice.
16:58 🔗 alard Does it make directories? (As in /webshots/underscor/something.warc.gz ?)
16:58 🔗 underscor I can enable it
16:59 🔗 underscor So if you put to http://bt-download00.us.archive.org:8302/webshots/some/path/here/libtorrent-packages.tar.gz
16:59 🔗 underscor it will create /some/path/here on the fly
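An example of that upload shape using curl, which issues an HTTP PUT with -T (host, port, and path taken from underscor's message above):

    curl -T libtorrent-packages.tar.gz \
        http://bt-download00.us.archive.org:8302/webshots/some/path/here/libtorrent-packages.tar.gz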
16:59 🔗 alard It's not necessary, but with the rsync uploads I generally let every downloader upload to a separate directory.
17:00 🔗 alard Doesn't really serve a purpose.
17:00 🔗 alard I'll be back later.
17:01 🔗 underscor alard: option enabled.
17:02 🔗 underscor Holler at me when you get back if you think this would be a better idea going forward, and I can push out to the rest of the boxes
17:19 🔗 godane i got up to episode 43 of t3 podcast
17:36 🔗 joepie91 S[h]O[r]T: no, absolutely not FTP
17:36 🔗 joepie91 lol
18:51 🔗 joepie91 very relevant: I don't have time for silliness. Just let me know if you're removing our footage, or if I'm forwarding this to our attorneys. I'm not interested in your creative commons bs (which those of us who actually work in media refers to as amateur licensing) and I have told you that we do not want our work in any of your videos. Let me repeat: we want NONE of our work in ANY of your or any third party
18:51 🔗 joepie91 videos, and our exclusive licensing agreements exist specifically so that is enforcable.
18:51 🔗 joepie91 er
18:51 🔗 joepie91 fail
18:51 🔗 joepie91 http://arstechnica.com/tech-policy/2012/10/court-rules-book-scanning-is-fair-use-suggesting-google-books-victory/
18:51 🔗 joepie91 ignore the above blob of text, it was an earlier copypaste from a pastebin :P
18:53 🔗 chronomex now I'm curious
18:53 🔗 chronomex however I have work to do
20:08 🔗 SketchCow alard's not here, is he?
20:08 🔗 SketchCow I think he went awayyyy
20:10 🔗 alard Hello!
20:12 🔗 SketchCow Hey, my net went wonky.
20:12 🔗 SketchCow ImportError: No module named ordereddict
20:13 🔗 SketchCow How do I fix that?
20:13 🔗 alard Python 2.6?
20:14 🔗 alard wget https://bitbucket.org/wooparadog/zkit/raw/4ce69af1742f/ordereddict.py
20:14 🔗 SketchCow root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python2.7 megawarc.py
20:14 🔗 SketchCow Traceback (most recent call last):
20:14 🔗 SketchCow   File "megawarc.py", line 64, in <module>
20:14 🔗 SketchCow     from ordereddict import OrderedDict
20:14 🔗 SketchCow ImportError: No module named ordereddict
20:15 🔗 soultcer OrderedDict is in collections for py 2.7
20:15 🔗 SketchCow Bear in mind I am a perl guy at best.
20:15 🔗 SketchCow We do it differently there.
20:18 🔗 soultcer SketchCow: Replace "from ordereddict import OrderedDict" with this: http://pastebin.com/dQdZ0wX8
20:18 🔗 soultcer Should work fine in py 2.7, and for py 2.6 you can download the ordereddict module alard told you about
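The pastebin has since expired; the usual compatibility shim for this situation, and presumably what it contained, is:

    try:
        from collections import OrderedDict   # standard library in Python 2.7+
    except ImportError:
        from ordereddict import OrderedDict   # the 2.6 backport module alard linked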
20:20 🔗 SketchCow OK
20:20 🔗 SketchCow So I just wasted some time trying that.
20:20 🔗 soultcer alard: You are only using the ordered dict for cosmetics anyway, right?
20:21 🔗 alard Yes.
20:21 🔗 SketchCow Alard, please put it in the megawarc github if it works
20:21 🔗 SketchCow because damn, I don't edit python very well.
20:21 🔗 chronomex spaces, no tabs
20:21 🔗 chronomex though it pains me to say so
20:21 🔗 SketchCow Yeah, no, like I don't do python
20:21 🔗 PepsiMax omh
20:22 🔗 SketchCow And the github should be improved, not my local copy of it anyway
20:22 🔗 chronomex :)
20:24 🔗 alard SketchCow: I've updated the github repository. Try again. (It worked for me before and it still works now.)
20:32 🔗 SketchCow root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc
20:32 🔗 SketchCow Usage: megawarc [--verbose] build FILE | megawarc [--verbose] restore FILE
20:32 🔗 SketchCow root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc build BOARDS-COH-05.tar
20:32 🔗 SketchCow Looking much better.
20:33 🔗 SketchCow Now, let's see if the 11gb file that results is good.
20:33 🔗 SketchCow Do you account for things being in subdirectories in the .tar?
20:43 🔗 alard Well, it doesn't care. What it does is this: it walks through the tar, one entry at a time. If it is a file *and* the filename ends with .warc.gz, it checks to see if it is an extractable gzip. If all that is OK, the warc file is added to the big warc. In all other cases (directories, unreadable warcs, other files) the file is added to the leftover tar.
20:43 🔗 alard For the tar reconstruction, it pastes together the content from the leftover tar, the tar headers and parts from the warc. So directories don't matter.
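That walk, paraphrased as a Python sketch (function names are hypothetical; the real megawarc also stores tar headers in the json index and writes the leftovers to a second tar):

    import gzip
    import tarfile

    def is_readable_gzip(fileobj):
        # The gzip test: try to decompress the entire stream.
        try:
            gz = gzip.GzipFile(fileobj=fileobj)
            while gz.read(1024 * 1024):
                pass
            return True
        except Exception:
            return False

    def build_megawarc(tar_path, warc_out, index):
        # warc_out is the open big-warc file; index collects
        # (name, offset, size) entries for the json file.
        tar = tarfile.open(tar_path)
        for member in tar:
            if not (member.isfile() and member.name.endswith('.warc.gz')):
                continue  # directories and other files go to the leftover tar
            if not is_readable_gzip(tar.extractfile(member)):
                continue  # one corrupt gzip would ruin the whole big warc
            src = tar.extractfile(member)  # second pass: copy the raw bytes
            offset = warc_out.tell()
            while True:
                block = src.read(1024 * 1024)
                if not block:
                    break
                warc_out.write(block)
            index.append((member.name, offset, member.size))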
20:53 🔗 SketchCow root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc build BOARDS-COH-05.tar
20:53 🔗 SketchCow root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# ls -l
20:53 🔗 SketchCow total 21136664
20:53 🔗 SketchCow -rw-r--r-- 1 root root 10822246400 Oct 11 19:26 BOARDS-COH-05.tar
20:53 🔗 SketchCow -rw-r--r-- 1 root root 84149 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.json.gz
20:53 🔗 SketchCow -rw-r--r-- 1 root root 10240 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.tar
20:53 🔗 SketchCow -rw-r--r-- 1 root root 10821470898 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.warc.gz
20:53 🔗 SketchCow OK, so.
20:53 🔗 SketchCow That worked... but there was no progress bar, and no updates.
20:53 🔗 SketchCow So I'll use this for now, but I would definitely add something to indicate work is being done.
20:58 🔗 alard SketchCow: Add --verbose
20:59 🔗 alard It won't show a progress bar, but it will tell you what's taking so long.
20:59 🔗 alard underscor: Do you have a /webshots/alard/webshots.com-user-siebertphotoshop-20121011-225722.warc.gz ?
21:01 🔗 SketchCow Oh!
21:07 🔗 joepie91 <SmileyG>joepie91: I'd upload them to IA and give SketchCow a link,
21:08 🔗 joepie91 that's a bit hard
21:08 🔗 joepie91 they're on a server
21:08 🔗 joepie91 :P
21:08 🔗 joepie91 can't get to them now anyway, that server seems offline
21:13 🔗 underscor alard: http://p.defau.lt/?rZSWA6hEE1SROnuBQKxtAQ
21:13 🔗 underscor Lookin awesome :D
21:13 🔗 alard Great. Ready for more?
21:14 🔗 underscor joepie91: :o what was that mispaste about :D
21:15 🔗 underscor alard: yep!
21:15 🔗 underscor Shall I roll out to bt-download01-11 now too?
21:15 🔗 underscor (for roundrobining)
21:15 🔗 alard That would be nice. The tracker picks one of the urls from a list, so it's possible to remove/add urls later.
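On the tracker side, "picks one of the urls from a list" amounts to something like this sketch (the target list is hypothetical; hosts follow the naming underscor describes):

    import random

    UPLOAD_TARGETS = [
        "http://bt-download00.us.archive.org:8302/webshots/",
        "http://bt-download01.us.archive.org:8302/webshots/",
    ]

    def pick_upload_target():
        # edit UPLOAD_TARGETS at runtime to add or remove boxes
        return random.choice(UPLOAD_TARGETS)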
21:16 🔗 underscor Ah, nice!
21:16 🔗 underscor ok, I'll work on pushing the config
21:16 🔗 flaushy is the limit of rsync only for webshot or for all projects?
21:16 🔗 underscor I'll need the "cleanup" script too
21:16 🔗 underscor flaushy: bt is set to 5 right now, webshots 10
21:17 🔗 flaushy would it make sense to switch underscor?
21:17 🔗 flaushy from webshots to bt?
21:18 🔗 flaushy or are the rsyncs on bt crowded as well?
21:18 🔗 alard Webshots is now uploading over HTTP (once your warrior gets the update).
21:18 🔗 soultcer Sweet
21:18 🔗 flaushy so warrior restart time :)
21:18 🔗 flaushy awesome
21:18 🔗 SketchCow What?
21:19 🔗 SketchCow So wait, stuff is now banging directly into archive? Or something else.
21:19 🔗 alard SketchCow: underscor wants it.
21:19 🔗 underscor SketchCow: well, I have 12 machines we can load balance between
21:19 🔗 SketchCow Underscor wants a lot of things, but I like to be included while I'm over here trying to make this machine function.
21:19 🔗 underscor so I thought it might be a better idea
21:21 🔗 SketchCow Please at least tell me it's going into http://archive.org/details/webshots-freeze-frame with the same format structure
21:21 🔗 alard (We've been discussing this for a while, but we can change it again if you think it's not a good idea.)
21:21 🔗 alard It's exactly the same.
21:21 🔗 underscor It's exactly the same, just that it is roundrobined between 12 boxes instead of a single one
21:21 🔗 SketchCow I trust you'll do the right thing, but if we're using an environment, I just want to know, with my name being mentioned, we're going to shift gears.
21:22 🔗 SketchCow Because then I can focus on it as a "clear out the rest of what we have" instead of "work my ass off on this box trying to make it function for the time being"
21:23 🔗 joepie91 http://p.defau.lt/?rZSWA6hEE1SROnuBQKxtAQ
21:23 🔗 alard Ah, yes. It won't change immediately: the current warriors are still trying to rsync and will keep trying, until they succeed or until they're restarted.
21:23 🔗 joepie91 er
21:23 🔗 joepie91 <underscor>joepie91: :o what was that mispaste about :D
21:23 🔗 joepie91 is what I wanted to paste
21:23 🔗 joepie91 anyway
21:23 🔗 joepie91 wtf is with my clipboard today
21:24 🔗 joepie91 tl;dr guy makes movie about occupy protests, then starts demanding that videos that reuse parts of his movie are taken down
21:24 🔗 joepie91 let me find the full paste
21:25 🔗 SketchCow drop to -bs
21:26 🔗 SketchCow Anyway, I am all for solutions that increase the bandwidth away from FOS, which is meant to be a buffer of 20tb for incoming data, but doesn't function as well as it could as something to blow 50tb of insanity in, do operations on, and blow out.
21:27 🔗 joepie91 SketchCow: what is the main bottleneck for fos?
21:27 🔗 SketchCow I just need to know that's what's going on so I know I'm bailing water out of a bathtub for a little, and not trying to rescue a sinking ship.
21:27 🔗 SketchCow FOS is a virtual box that does about 20 things.
21:27 🔗 SketchCow So the bottleneck for FOS is FOS
21:27 🔗 SketchCow Oversubscription.
21:27 🔗 SketchCow In this particular case, we had the same disk being used for file writes, file compilation, and file reads
21:28 🔗 SketchCow Which is normally not THAT big a deal but it was doing a LOT, and we had 900+ rsyncs
21:28 🔗 SketchCow Eventually swap exploded
21:28 🔗 underscor and everything goes to hell
21:28 🔗 alard Webshots on FOS is now fizzling out, but bt internet is still using rsync. But that's so small it's probably something to keep there?
21:28 🔗 SketchCow I expect so, yes.
21:28 🔗 SketchCow Webshots is TOO DAMN HIGH
21:29 🔗 joepie91 so basically disk I/O is the bottleneck?
21:29 🔗 joepie91 or the main one, at least
21:29 🔗 SketchCow It's one of them.
21:29 🔗 joepie91 hmm
21:29 🔗 joepie91 let me think about this for a moment
21:29 🔗 SketchCow I guess if we're looking to find out, we can circle the sizzling wreck and waste a few days determining why.
21:29 🔗 SketchCow No, don't think about it.
21:29 🔗 SketchCow Think about things and projects that need saving.
21:29 🔗 joepie91 there's not much ability to save if the library is burning down
21:30 🔗 flaushy is there a script for bt as well?
21:30 🔗 SketchCow underscor has twice your brainpower, and 400x your resources (200x mine) and has an unhealthy compulsion to optimize.
21:30 🔗 SketchCow He'll fix it.
21:30 🔗 * underscor giggles giddily
21:31 🔗 SketchCow He literally cuddles with the internet archive infrastructure.
21:31 🔗 * underscor whistles innocently
21:31 🔗 joepie91 ... not sure why you seem so strongly opposed to my decision to invest some of my _own_ time and thought into finding a possible solution
21:31 🔗 underscor but, but, but, petaboxen are so cute~
21:31 🔗 joepie91 I personally don't really care who has more brainpower or infrastructure - more people thinking about it instead of watching random series because boredom, means more chance of a solution
21:31 🔗 SketchCow This was a rare case where miscommunication, exacerbated by a red-eye flight, meant that I fell out of the loop of a solution set.
21:31 🔗 SketchCow And got surprised, and whined.
21:32 🔗 chronomex SketchCow can't stand WWIC
21:34 🔗 SketchCow The teamarchive/FOS machine will now get 8gb of swap instead of 2.
21:34 🔗 underscor SketchCow: What script do you use to inject these into IA?
21:34 🔗 underscor (and can I get it plz)
21:35 🔗 SketchCow I have a custom injector that uses a s3 call.
21:35 🔗 * SmileyG sighs
21:35 🔗 SmileyG still borked? :(
21:35 🔗 SketchCow Before we do this with your round-robin thing.
21:35 🔗 SketchCow What's still borked.
21:36 🔗 SketchCow Anyway, before we do this with your round-robin thing, I think we need to decide if megawarc is ready for production.
21:36 🔗 SmileyG my bt uploads by the look of things - looking at backlog now
21:36 🔗 SketchCow Not borked.
21:36 🔗 SketchCow It was being held at a limit, a limit which I will shortly lift as we move webshots over to a round-robin, and as FOS gets 4x the allocated RAM
21:38 🔗 SmileyG Ah ok, I presumed it was the number of rsyncs due to the BT one being so fast that was the issue (i'd fill my queue in 30~ seconds).
21:39 🔗 SketchCow Also
21:39 🔗 SketchCow http://blog.archive.org/2012/10/10/the-ten-petabyte-party/
21:39 🔗 SketchCow If you're in SF, go eat some foods
21:39 🔗 SmileyG I wish.
21:40 🔗 * joepie91 is practically on the other side of the world
21:40 🔗 SketchCow Now, I want to discuss the format we put webshots in.
21:40 🔗 mistym Probably every non-SF person here is wishing they'd be there now :b
21:40 🔗 SketchCow My attention is gripped a little by seeing what the result of the megawarc program is.
21:40 🔗 SketchCow http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5
21:41 🔗 SketchCow So first, let us see what the result of the derive is.
21:41 🔗 SketchCow It's an 11gb megawarc, so it will take a few minutes.
21:42 🔗 joepie91 what is a megawarc?
21:42 🔗 soultcer Could you teach the deriver to unpack the tar files?
21:42 🔗 SketchCow soultcer: No
21:42 🔗 SketchCow I sat in meetings across a week on it.
21:42 🔗 chronomex teaching the deriver anything is a major undertaking
21:42 🔗 SketchCow It's not the deriver, it's the wayback machine.
21:42 🔗 SketchCow It's a mess.
21:42 🔗 chronomex ah
21:43 🔗 SketchCow So it's easier to generate a .warc.gz file that cats up all the other warcs in a specific way.
21:43 🔗 chronomex the way I take it, WBM indexes tar files that remain on petaboxes?
21:43 🔗 chronomex thus there's one copy of the WBM data or something
21:43 🔗 SketchCow No, it's weirder.
21:43 🔗 SketchCow It's all so weird.
21:43 🔗 chronomex s/tar/warc.gz/
21:43 🔗 SketchCow As much as we want me to go into the substance of this, here we go.
21:43 🔗 SketchCow I see three audiences for our data.
21:43 🔗 SketchCow 1. Wayback Machine
21:44 🔗 SketchCow 2. The individuals who had their data on the thing, wanting their shit back
21:44 🔗 SketchCow 3. Historians from The Future, with The Future being 1 hour to forever from now.
21:44 🔗 chronomex yeap
21:44 🔗 SmileyG agreed.
21:45 🔗 SketchCow So, the problem is, 1. is very, very, very old school and was designed from the ground up along a whole range of very carefully decided "things".
21:45 🔗 SketchCow It is also, being from a non-profit, not overly packed with dev teams.
21:45 🔗 SketchCow This translates to "it takes things a certain way"
21:45 🔗 chronomex picky, brittle.
21:45 🔗 SketchCow It's possible to go 'well, leave things as they are, and make a second version'
21:45 🔗 SketchCow And we're doing that for the moment with some items, for the sake of stepping into it slowly.
21:46 🔗 SketchCow Obviously that doesn't work with MobileMe.
21:46 🔗 SketchCow Now, I asked MobileMe to miss this current load-in to Wayback, because whatever decision/process is made becomes a 274tb decision.
21:47 🔗 flaushy do you have slides for a 5 minute presentation why you should join the archiveteam? i am going to a small congress from the ccc tomorrow
21:47 🔗 SketchCow No, just links to my talks.
21:48 🔗 SmileyG flaushy: hmmmmm not that I'm aware of - watch Jason's defcon speech and talk about Soy Sauce?
21:48 🔗 flaushy could be good enough :)
21:48 🔗 SketchCow http://www.us.archive.org/log_show.php?task_id=127610039
21:48 🔗 chronomex soy sauce itself is >5 minutes :P
21:48 🔗 SketchCow Can you guys see that?
21:48 🔗 SmileyG yes SketchCow
21:48 🔗 chronomex I can
21:48 🔗 SketchCow Ok, so that's the deriver working with a megawarc.
21:48 🔗 flaushy need login
21:49 🔗 SketchCow Get a damn library card, buddy!
21:49 🔗 underscor ^
21:49 🔗 SketchCow They're freeeeee
21:49 🔗 underscor sweet, 1.8gb already on the first node!
21:49 🔗 underscor cc alard, SketchCow
21:49 🔗 SketchCow OK, so turning from that experiment, and still waiting to make sure it works.....
21:49 🔗 SketchCow ...I'd like to consider a process where we generate the megawarc by default.
21:50 🔗 SketchCow And upload THOSE as webshots.
21:50 🔗 SketchCow So my current process is "grab 50gb of these delightful picture warcs, .tar them, and then shove them on the internet archive."
21:51 🔗 alard underscor: My uploads are going really fast.
21:51 🔗 SketchCow But those .tars are good for the 2. (individuals) with a LOT of help from additional alard scripts, and 3. And not good for 1.
21:51 🔗 underscor alard: that's a good thing, right?
21:51 🔗 underscor hehe
21:51 🔗 SketchCow Your uploads are going to boxes that aren't maxed out to misery
21:51 🔗 alard underscor: Yes. :)
21:51 🔗 underscor :D
21:51 🔗 underscor SketchCow: hahahah
21:51 🔗 alard We'll see how long it lasts.
21:51 🔗 chronomex if we start with megawarcs, it's possible to make a tool that does range-requests and gets chunks in the middle
21:52 🔗 underscor http://maelstrom.foxfag.com/munin/us.archive.org/bt-download00.us.archive.org/if_eth0.html
21:52 🔗 SmileyG SketchCow: can we create some kind of "index" of the megawarc which we could feed into 1. (and use for 2.)
21:52 🔗 SketchCow So I guess the question I pose to alard is, if we generate megawarcs, how hard would it be to write something that takes a link to the megawarc and returns your warc inside it?
21:52 🔗 SketchCow SmileyG: The megawarc, by DEFINITION, works with 1. and 3.
21:52 🔗 SketchCow And if it's in the Wayback, it helps 2.
21:53 🔗 SmileyG SketchCow: ah duur failing to read.
21:53 🔗 chronomex SmileyG: yes, there is an index. it's called a cdx.
21:53 🔗 SketchCow So in THEORY, this would be fine.
21:53 🔗 chronomex deriver makes it iirc
21:53 🔗 alard The json file tells you where each file is, with byte ranges.
21:53 🔗 SmileyG This is why I shouldn't irc while dying.
21:53 🔗 chronomex how about not dying
21:53 🔗 alard So it will tell you that user-x.warc.gz is in the big-warc from bytes a-b. This byte range you can feed to http://warctozip.archive.org/, for example. (This is how the tabblo/mobileme search things work.)
21:54 🔗 SketchCow OK.
21:54 🔗 alard Or you could do a curl with a byte range to get the warc.gz, if you don't like zip.
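Because each user's warc.gz is a complete gzip member inside the big warc, a plain HTTP range request recovers it intact; a hedged example (item name and byte offsets are hypothetical, taken from the json index):

    curl -r 1234567-7654321 \
        http://archive.org/download/some-megawarc-item/BIG.megawarc.warc.gz \
        -o user-x.warc.gz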
21:54 🔗 SketchCow So it SHOULD be possible with current tools to assist 2.
21:54 🔗 SketchCow Or some minor scripting to access current tools.
21:54 🔗 alard Yes.
21:54 🔗 chronomex current tools or minor changes, yes
21:54 🔗 SketchCow Ok.
21:54 🔗 SketchCow Then yes, we're going to:
21:54 🔗 * SmileyG has other things on his plate hes thinking about. Time to disappear again.
21:55 🔗 SketchCow 1. Start pushing webshots up from underscor's Circle-Jerk to archive.org as native megawarcs
21:55 🔗 SketchCow 2. See about (carefully) converting both previous webshots and mobileme to native megawarcs.
21:55 🔗 SmileyG 99. Geocities?
21:56 🔗 SketchCow Geocities as we did it will never go into the wayback.
21:56 🔗 SmileyG Never? drat
21:56 🔗 SketchCow As we did it.
21:56 🔗 chronomex nope, we didn't manage to collect enough metadata to put it into warc
21:56 🔗 SketchCow In THEORY, we could generate warcs with some sort of obviousness that it could pull in.
21:56 🔗 SmileyG we can't "redo" it though so....
21:56 🔗 SmileyG hmmm, as long as its "as" accessible as the others then shrug.
21:57 🔗 SketchCow But man, I don't want to stress the IA infrastructure with THAT project this exact moment.
21:57 🔗 SketchCow And by infrastructure I mean people.
21:57 🔗 SmileyG wtf is hitting my keyboard o_O
21:57 🔗 SketchCow sperm
21:57 🔗 SmileyG worrying.
21:57 🔗 SketchCow It dries
21:57 🔗 SmileyG then its all crispy and the keys get stuck :<
21:58 🔗 SketchCow check #archiveteam-spermclean
21:58 🔗 SketchCow Read the FAQ
21:58 🔗 joepie91 stupid idea: set up haproxy on shitty unmetered gbit vps, proxy to various backends
21:58 🔗 SmileyG lol sorry, dragging this off topic ¬_¬; Really am going away, just gonna watch the convo unless someone on the internet turns out to be wrong.
21:58 🔗 joepie91 upload over HTTP
21:58 🔗 SketchCow joepie91: We did that way back when
21:58 🔗 SmileyG shitty unmetered gbit vps <<< How much $$$?
21:58 🔗 SketchCow It was hilarrrrrrrrrrrious
21:59 🔗 joepie91 SmileyG: not necessarily that much
21:59 🔗 joepie91 expect disk I/O etc to suck though, but that doesn't matter if it's just a proxy
21:59 🔗 soultcer joepie91: Shitty unmetered VPS have one problem: In the end they are still a shitty VPS.
21:59 🔗 SketchCow We did it on batcave, as I recall
21:59 🔗 joepie91 soultcer: shitty in the sense of everything but the bandwidth sucks
21:59 🔗 joepie91 :p
21:59 🔗 joepie91 SketchCow: what were the results?
21:59 🔗 SketchCow Oh, it was very effective
21:59 🔗 alard That was to fix network weirdness, where uploads directly to s3.us.archive.org were much slower than uploads proxied through batcave, also on archive.org.
22:00 🔗 alard joepie91: But I think the HTTP upload from the warriors works fine now, without proxy stuff. The tracker redirects to one of the upload servers.
22:01 🔗 soultcer SketchCow: Does the Internet Archive hire remote workers?
22:01 🔗 joepie91 alard: alright
22:01 🔗 joepie91 so... the upload problems should be solved, or?
22:01 🔗 alard Yes, for the time being. :)
22:02 🔗 alard Update your Webshots scripts, if you aren't using a warrior.
22:06 🔗 SketchCow So, alard, is there a way to make a megawarc generator that just takes a directory instead of a tar?
22:08 🔗 alard That depends on your definition of "megawarc". As it is now, the json contains tar headers and the position of the warcs in the original tar file. You could leave that out, though, and keep the properties that are useful for indexing the big-warc.
22:08 🔗 alard What would be the best way to get the filenames to the megawarc script? Use find and pipe to stdin?
22:09 🔗 alard (There may be too many files to go as command line arguments.)
22:09 🔗 SketchCow I am comfortable with doing a tar to stdin... :)
22:10 🔗 SketchCow or to stdout, I guess you might say
22:10 🔗 soultcer Make the script recursively search a directory for warcs?
22:10 🔗 alard Well, piping tar into the megawarc script won't easily work, since the script needs two passes over each warc file. (Once to check if it can be decompressed, once to copy it to the big-warc.)
22:11 🔗 SketchCow Well, I assumed a different script taking a different approach.
22:11 🔗 alard Yes, but I think you want the gzip test. If you don't have that test, one invalid file can ruin the whole warc.
22:11 🔗 SketchCow I mean, let's back it up. What I'd like is a way to take a directory, instead of a .tar, and make it a megawarc.
22:11 🔗 SketchCow However that's done, I approve.
22:12 🔗 soultcer So you want to stop even creating the tars for new projects and just upload warcs to the IA plus a small tar for the logfiles?
22:12 🔗 SketchCow It's expensive as shit, but making a .tar file, and then running megawarc against it, then removing the tar file and uploading the megawarc files....I could live with that.
22:12 🔗 SketchCow that might be smartest.
22:13 🔗 alard I think the tar isn't really necessary, especially not if you don't want to 'reconstruct' a tar that never was from the megawarc.
22:13 🔗 alard Or you could use made-up tar headers.
22:13 🔗 SketchCow Well, let's think about it.
22:14 🔗 SketchCow Do we want the .tar file? By reconstructing later, you have a nice standard collection of the files.
22:14 🔗 SketchCow And you can pull things from it.
22:14 🔗 soultcer Well we need some way to store "these records in the warc file belong to a single user"
22:18 🔗 SketchCow So, how do we feel about that? I think a .tar existing somewhere along the line works very well for what we want to do.
22:18 🔗 SketchCow because then the .tars can go into The Next Thing After Internet Archive
22:19 🔗 underscor TNTAIA, for short
22:19 🔗 soultcer But then you have to store both the tar file and the megawarc
22:19 🔗 SketchCow No, no.
22:19 🔗 SketchCow you are using a .tar as the intermediary instead of the file directory to generate the megawarc
22:20 🔗 SketchCow So in come the piles o' files
22:20 🔗 SketchCow At some point, you have a 50gb collection (say)
22:20 🔗 SketchCow You make it a .tar
22:20 🔗 SketchCow You megawarc the .tar
22:20 🔗 SketchCow you upload the megawarc.
22:20 🔗 SketchCow Now the thing's been standardized out past the filesystem
22:20 🔗 SketchCow And can be turned into 50gb chunks in the future on your holocube 2000x
22:21 🔗 soultcer How about instead of creating a .tar and megawarcing it, you directly create the megawarc from the 50gb of source files?
22:21 🔗 SketchCow This is what we just discussed.
22:22 🔗 soultcer Oh, I thought you wanted to keep the "create a tar and megawarc it" step
22:22 🔗 SketchCow I asked about that possibility, but it does lead to concerns.
22:22 🔗 SketchCow by making something a .tar and then making it a megawarc, we have an intermediary thing it's converted back into that's able to be manipulated by other programs.
22:23 🔗 SketchCow And I am saying, I think this is a good idea for future extensibility.
22:23 🔗 SketchCow 1. 2. and 3. are all handled.
22:23 🔗 soultcer Even if you skip the tar step, you can later convert it back to a tar file.
22:24 🔗 soultcer Though as long as going through the tar step doesn't create much of a bottleneck, it is probably nice to use tools that already exist and that do one thing well
22:24 🔗 SketchCow http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5
22:24 🔗 SketchCow OK, so update.
22:24 🔗 SketchCow It definitely made a >CDX
22:25 🔗 underscor nice!
22:25 🔗 underscor SketchCow: So is that what you want to do? fill up 50gb->tar->megawarc->ingest->rinse/repeat?
22:26 🔗 SketchCow That is my proposal - I can give you scripts that I wrote and which alard wrote.
22:26 🔗 underscor okay, awesome
22:26 🔗 SketchCow But first, I want to have us discuss 50gb->megawarc->ingest
22:26 🔗 SketchCow Because that was also on the table. pros and cons.
22:26 🔗 underscor alard gave me the "watcher" script
22:26 🔗 underscor but I don't have an tar/s3-er
22:26 🔗 underscor that moves things into a temp dir
22:27 🔗 SketchCow Right.
22:27 🔗 SketchCow No, I'll give you those, but first I want this decided.
22:27 🔗 SketchCow Also, I asked people to verify the CDX just generated.
22:27 🔗 SketchCow Because if it just made borscht, more borscht is not a buddy.
22:27 🔗 SketchCow I'm also about to restore that megawarc-5 to see what happens.
22:29 🔗 soultcer SketchCow: As I said, it would be easy to modify alard's megawarc creator so it directly takes a directory of small warcs/wget logs and creates the same output (minus some tar metadata, that isn't necessary)
22:30 🔗 SketchCow As alard said, one corrupted gz makes it not work
22:31 🔗 alard soultcer: Just thought that it should be possible to add tar headers as well. Let Python create them, as if it is making a tar.
22:31 🔗 alard I'll have a look.
22:31 🔗 SketchCow Let's put it this way.
22:31 🔗 alard Other question: it's possible to put the extra-tar and the json inside the big-warc. Is that useful?
22:31 🔗 soultcer alard: Would work, but why would we need the tar headers anyway? It's mostly metadata about the filesystem on fos.
22:31 🔗 alard You'd have one file, but the index would be less accessible.
22:32 🔗 SketchCow alard: It makes it harder to decipher later. I'd keep it outside
22:32 🔗 alard soultcer: It has timestamps.
22:32 🔗 alard and it makes it easier to make a tar.
22:33 🔗 soultcer alard: I don't really see why we would need any of the tar metadata, but it would of course be possible to create some of it, but I have no idea how to create the tar header string you have in the json file
22:34 🔗 joepie91 SketchCow: I'm still having rsync issues for btinternet - does that happen for everyone?
22:34 🔗 SketchCow The issue is not everything we add is a warc.gz
22:34 🔗 SketchCow Sometimes it's going to be additional 'stuff'
22:35 🔗 joepie91 for every single job: @ERROR: max connections (5) reached -- try again later
22:35 🔗 soultcer SketchCow: The additional files are simply put into a single tar archive
22:35 🔗 SketchCow btinternet just got more love
22:36 🔗 joepie91 looks fixed now, thanks
22:36 🔗 joepie91 :P
22:36 🔗 soultcer Together with the metadata from the json file, the tar archive with the additional files and the megawarc file, you can recreate the original directory structure, or create a tar archive with all files
22:36 🔗 joepie91 whoa
22:36 🔗 joepie91 starts scrolling like mad
22:36 🔗 joepie91 lol
22:37 🔗 joepie91 looks like people had a lot in queue
22:37 🔗 SketchCow soultcer: I'm going to again defer to alard's opinion on this.
22:37 🔗 joepie91 especially Sue
22:37 🔗 joepie91 *cough*
22:37 🔗 SketchCow Incoming crap - ends up as megawarc
22:37 🔗 SketchCow I just want the megawarc that results to be useful to the historians and the individuals as much as it is to wayback.
22:38 🔗 soultcer And the wget logs I assume?
22:39 🔗 SketchCow I don't want anything being lost
22:41 🔗 joepie91 lol, at this pace, btinternet will be done in 30 minutes
22:47 🔗 soultcer alard: So what do you think? Bundle as tar, then megawarc; or directly create the megawarc?
22:50 🔗 SketchCow Give him a moment, I see he's been coding some stuff related to uploads.
22:51 🔗 soultcer Sure.
22:53 🔗 soultcer The thing with the tar is: It includes a lot of metadata on who created the tar file (uid/gid), when it was created (mtime/ctime) and the filesystem permissions. I am not sure if we want to keep those, or not
23:06 🔗 SketchCow I've reconstructed a .tar from the megawarc.
23:06 🔗 SketchCow Now unpacking it to see if everything comes out ok.
23:07 🔗 soultcer They should be bit-for-bit copies I think
23:07 🔗 SketchCow Absolutely.
23:08 🔗 SketchCow Regardless, I am doing what someone in 10 years would be doing.
23:17 🔗 Sue i'm sorry for what i did to btinternet
23:17 🔗 chronomex they got graped
23:18 🔗 Sue i had like
23:18 🔗 Sue 300-400 rsync jobs queued up
23:18 🔗 Sue apparently 17G worth
23:18 🔗 chronomex jesus
23:18 🔗 Sue it's about to be done
23:21 🔗 alard New version: https://github.com/alard/megawarc
23:21 🔗 alard I renamed megawarc build TAR to megawarc convert TAR (seemed more logical).
23:22 🔗 alard There's now also a megawarc pack TAR FILE-1 FILE-2 ... option that packs files/paths directly.
23:22 🔗 alard You still need to provide a TAR name to derive the file names, but that tar never actually exists.
23:23 🔗 alard E.g. ./megawarc pack webshots-12345.tar 12345/ should work.
23:24 🔗 alard Then ./megawarc restore webshots-12345.tar would give you a tar file.
23:24 🔗 soultcer alard: Nice work. I was just thinking about simply using the TarInfo class to create the tar_header structure, but I see you not only thought of it faster, you implemented it while I was still thinking about the details ;-)
23:27 🔗 alard I copied most of it from Python's tarfile.py.
23:27 🔗 soultcer Good programmers code, better programmers reuse ;-)
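A minimal sketch of the made-up-header idea with the stdlib class soultcer names (mode and mtime are invented, since the file never existed on disk):

    import tarfile
    import time

    def synthetic_tar_header(name, size, mtime=None):
        # Build the 512-byte-aligned tar header blocks for a file
        # that exists only inside the megawarc.
        info = tarfile.TarInfo(name=name)
        info.size = size
        info.mtime = int(time.time()) if mtime is None else mtime
        info.mode = 0o644
        return info.tobuf()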
23:28 🔗 Sue btinternet is now in the negative
23:29 🔗 joepie91 ooo
23:29 🔗 joepie91 100MB btinternet user incoming
23:29 🔗 joepie91 ... wat
23:29 🔗 joepie91 how's that even possible?
23:29 🔗 SketchCow means they paid for premium
23:29 🔗 joepie91 SketchCow: but premium users are on a separate server
23:29 🔗 SketchCow A la geocities and a few others, the old address is kept while the premium address goes up.
23:29 🔗 chronomex recursion!
23:29 🔗 SketchCow We found 1gb geocities users
23:29 🔗 joepie91 ah
23:30 🔗 alard Time to find more usernames then. (There are also 1185 usernames still claimed, over 1000 by Sue.)
23:30 🔗 joepie91 wonder how they did that though
23:30 🔗 joepie91 because there's a separate server for all premium users
23:30 🔗 joepie91 two IPs away from the free server
23:30 🔗 Sue over 1k by me? must be a glitch
23:30 🔗 alard Are all your instances finished?
23:31 🔗 Sue i'm still doing probably 20-30
23:31 🔗 Sue the screen isn't full of "no item received" yet
23:31 🔗 joepie91 mine is
23:31 🔗 joepie91 or well
23:31 🔗 joepie91 alternating between no item received and tracker rate limiting
23:31 🔗 joepie91 lol
23:33 🔗 Sue can you release items per user? that's strange that i have so many...
23:33 🔗 alard I've put them back in the queue.
23:34 🔗 Sue ok
23:34 🔗 alard And with that I'm off to bed. Bye!
23:34 🔗 SketchCow Thanks again, alard
23:34 🔗 Sue i hunger for more
23:35 🔗 joepie91 suddenly, 5mbit!
23:35 🔗 joepie91 goodnight alard
23:35 🔗 joepie91 :)
23:39 🔗 DFJustin wow so unless we find way more users, all of btinternet will fit on a microsd card
23:39 🔗 DFJustin INCREASING COSTS
23:40 🔗 SketchCow Huh, someone recorded alard's process for programming new code.
23:40 🔗 SketchCow http://www.youtube.com/watch?feature=fvwp&NR=1&v=8VTW1iUn3Bg
23:40 🔗 SketchCow Screencap's gotten good
23:40 🔗 SketchCow (He's the one in the glasses)
23:43 🔗 Sue i'm out of users, 14 left downloading
23:50 🔗 SketchCow That's to be expected.
23:57 🔗 SketchCow Internet Archive's teams have signed off on the megawarcs.
23:57 🔗 SketchCow So guess what - FOS is making a ton of fucking megawarcs tonight.
