[00:00] ah, I'm seeing timeouts in my warrior [00:00] it must really be getting crushed [00:00] what does FOS stand for? [00:01] Free and Open Source? Maybe? [00:02] just curious, what are the hardware specs for fos? [00:02] This is really bad. [00:02] it's a raspberry pi hooked up to a RAID array [00:02] It shouldn't be this hammered. [00:02] Oh? [00:04] speaking of 'pi's, apparently you can colocate a pi in Austria for free: https://www.edis.at/en/server/colocation/austria/raspberrypi/ [00:04] FOS stands for Fortress of Solitude [00:04] It replaced a machine named Batcave [00:04] Hah, nice [00:04] FOS became a way to refer to it easily. [00:04] :-) awesome, thanks [00:05] i always thought it was a fun take on FiOS because verizon sponsored it :P [00:05] iFOS. By Apple. [00:05] even though i know that is nowhere near true [00:17] I wonder if these fixed 30 second retries have us all hammering FOS at the same time. [00:17] thundering herd effect? [00:18] TIL that term. Essentially, although more than one can rsync at a time. [00:18] it's why random backoff in ethernet is a thing [00:19] I keep getting about 5% uploaded before it dies [00:24] Machine is seriously getting hammered. [00:24] Not sure what to do yet. [00:24] Might set rsync limits. [00:25] Is it coming in 30 second waves? [00:25] Or is it just a constant surge of traffic? [00:25] ha ha you act like pressing keys makes anything happen. [00:25] Oh right. The tubes. They are clogged. [00:26] if you have access to the switch or firewall in front of it you can block certain IP ranges or ports to slow down the flow of traffic in [00:26] I like where you said that too. [00:26] All these suggestions are well meaning and useless. [00:26] I'm going to implement a max connections as soon as I can get vi to respond. [00:27] well if you had access to the switch you could just deny all rsync or anything else and allow ssh :p [00:27] that wouldnt be useless [00:28] Yes. [00:28] So..... [00:28] If only we could turn lead into gold, we could solve a number of problems. [00:28] But the impossibility of that makes it useless. [00:30] Realize my temper is going to be short while I wrestle with a machine with over 1,100 rsync connections active. [00:30] Yup. Good luck, soldier. [00:31] And advice along the line of "to fix the problem, you should fix the problem" is a brain fart [00:32] It has been trying to open a vi session for 4 minutes. [00:32] That's how bad it is. [00:32] I have two other windows, trying to set up a killing of rsync [00:34] im guessing you didnt want any advice then and are just venting [00:34] Mine finally timed out so I was able to pause the VM. So my minuscule part of the load is off. [00:39] I set it to 20 [00:43] Now doing a megakill [00:44] Bitches [00:44] ps -ef | grep rsync | awk '{print $2}' | xargs kill [00:47] no killall? [00:47] or skill [00:48] shhh, I'm oldschool [00:48] * chronomex nods knowingly [00:49] you have legitimate claim to the phrase "I have underwear that's older than your home directory" [00:56] nice [00:57] root@teamarchive-1:/etc# killall rsync [00:57] though I think if he had used 'ps -aux | grep'... that would have been better [00:59] looks like it's time for bed. Gettin' a little punchy. [00:59] later [01:00] Machine is pretty hosed. [01:06] FOS crashed. [01:07] Wow, what happened [01:22] DJ Smiley remix of the main page of archiveteam.org now in place. [01:32] fos is back [01:32] now running with some severe rsync limits while we get shit in shape.
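For reference, the clean-up and the cap applied above amount to something like this; a sketch, not the exact commands run on FOS, and the rsyncd.conf module name is made up:

    # clear out the runaway rsync processes (same effect as the killall above)
    pkill -x rsync
    # then cap simultaneous clients per module in rsyncd.conf:
    #   [webshots]
    #       path = /data/webshots
    #       max connections = 20
    #       lock file = /var/run/rsyncd.lock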
[02:18] i'm uploading issue 150 dvd of linux format [03:47] @ERROR: max connections (5) reached -- try again later [03:47] Starting RsyncUpload for Item woodp [03:47] getting a whole mess of these [03:47] rsync error: error starting client-server protocol (code 5) at main.c(1534) [sender=3.0.9] [03:49] the server (fos) stuff rsyncs too is limited to 5 rsync connections atm, it was having issues earlier. SketchCow should updated one its all good at some point [03:51] so is the script gonna continue at some point cause it just keeps trying to dend those 2 users over and over [03:51] send [03:51] yeah it will keep trying until it gets through [03:51] can just leave it running [03:52] I thought it only tries 50 times [03:52] and then gives up? [03:54] if it does thats 25min and there must be a bug? [03:55] thats good tho :P [03:58] i looked awhile back and i just a bit ago, was pretty sure the rsync in pipeline doesnt have a lot of overhead but i could be wrong. i know there are some options to turn off compression and us a lower encryption that generate less cpu usage. [03:58] for client and server [04:24] S[h]O[r]T: Well, I'm just saying [04:24] with the rate limit on fos [04:24] it's very likely you could not get in in 25m [04:24] and then the thing will just give up [04:24] and you're wasted [04:24] 3 [04:24] D: [04:43] im saying its been more than 25min and i havent got in and its still trying [04:47] oic [04:48] maybe I'm wrong [04:48] I just overheard someone say that [04:48] looks like SketchCow upped it to 10 [04:48] none of my threads are doing any work still, though [04:48] hopefully we can reopen the floodgates soon [04:48] otherwise we're definitely not going to do well with webshots XD [04:51] yay! [04:51] finally got one in [04:51] w00t [05:13] i dont see it got upped to 10:P [06:27] is anyone in the rehosting-dead-pastebins business? [06:27] 100K pastes from paste.lisp.org would be better off googlable [06:33] do you have them?? [06:42] yes [06:42] http://ludios.org/tmp/paste.lisp.org.7z [06:42] chronomex: ^ [06:43] something up with warrior upload? getting "@ERROR: max connections (10) reached -- try again later" [06:45] <3 [06:45] The server we rsync to is currently limited because it was having problems earlier [06:46] thanks ivan`! are you involved with paste.lisp.org? [06:46] no, I think stassats runs it, but his reply did not indicate interest in restoring them [06:46] aye. [06:47] ow, this is a lot of files [06:47] heh [06:47] hoping limit gets increased/lifted... almost all warriors waiting for upload :| [06:47] *wow [06:47] WHY HELLO [06:48] * chronomex shoves this into IA [06:48] You crying sallybags. [06:48] wassap brah [06:48] You whip a virtual server within an inch of its life, and then woah, you all want it jogging around the track 5 minutes later. [06:49] Also, I like Underscor whining on 3 channels about me taking a reasonable attempt to prevent the machine dying. [06:49] 948 simultaneous rsyncs. [06:49] Think about that. [06:49] o_O [06:49] You know what you did. [06:49] bitches gotta bitch [06:49] * SketchCow gets the newspaper [06:49] good job team! :) [06:49] haha [06:50] We need support for distributing uploads to multiple servers. Next one to complain about fos being unreachable will be volunteered to code that into the seesaw kit. [06:50] ivan`: can you share some info about this file? when was it captured, was the paste dead at the time, is it complete, etc [06:52] Tomorrow, FOS goes down when one of the admins inceases its swap from 2gb to 6gb. 
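For context, the upload each warrior keeps retrying above is a plain daemon-mode rsync roughly like the one below (host and module are placeholders). The "@ERROR: max connections" reply is sent by the daemon before any data moves, so a refused attempt costs the server almost nothing:

    rsync -av --partial woodp.warc.gz rsync://fos.example/webshots/woodp/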
[06:59] chronomex: pastes were captured 2011-11-14 and 2012-05-01 and 2012-10-06 (though perhaps I should strip those); not complete, I don't have pastes 129789-131892 [07:00] ok [07:00] :D [07:04] http://archive.org/details/paste_lisp_org [07:06] So, basically I have a couple days to prepare some more archiveteam items for ingestion into the wayback. [07:09] 188,329,776 14.0M/s eta 3h 58m [07:09] Now that's a spicy meatball [07:10] 1,067,816,696 17.7M/s eta 4h 19m [07:10] Downloaded a gig. Going to take 4 hours. It's like that. [07:14] With luck, I can make a lot more of these things green. [07:14] If this all works, all the green ones go into the wayback machine instantly. [07:15] Instant SOPA review end of the month! [07:15] That'd be nice. [07:41] Re-initiated uploads from fos to archive.org of webshots loads. [07:46] Hi there. I've looked up the "rsync retries 50 times" claim that I made yesterday. I now think that's wrong and that it retries indefinitely, so your warriors will just wait until they can upload. [08:37] OK, napping. [13:07] SketchCow: I'm not quite sure what to do with this, but I archived the videos that someone (I think bsmith096) linked me to a while ago as rare footage: http://aarnist.cryto.net:81/youtube/all/ [13:07] flv/mp4/webm format [13:17] having a hard time with rsync with warrior [13:17] getting "max connections reached" errors [13:22] same [13:22] alard: ya there? [13:24] the server the scripts rsync to is currently limited because it was having problems earlier [13:24] yeah, but how do I keep the warrior going? [13:24] I have this bandwidth which otherwise isn't going to get used [13:24] just have to wait, it will keep retrying uploads :\ [13:25] need to shorten the wait time from 30 seconds to more like 5 then [13:42] is there any way I can tweak this? :/ [13:45] more concurrent threads? [13:45] problem is we are all downloading it faster than FOS can accept it back in [13:45] yeah [13:46] The fix is FOS Accepting it faster, or us having larger caches. [13:46] larger caches are possible if you do more concurrent downloads, however depending how fast you download in ratio to the max upload, your still going to get stuck eventually [13:47] joepie91: I'd upload them to IA and give SketchCow a link, [13:47] thats what I've done with the fish ezine I get each month. [13:48] [08:46:31] <@alard> Hi there. I've looked up the "rsync retries 50 times" claim that I made yesterday. I now think that's wrong and that it retries indefinitely, so your warriors will just wait until they can upload. [13:48] Phew! that was a worry for me. [14:02] 5 connections seems a bit low [14:09] Hey, "need to shorten the wait time from 30 seconds to more like 5 then" is not a good idea. If we all do that, will just increase the load on the server, but won't increase the number of uploads. [14:10] the problem is that users like me with slow uploads (max 1 MiB/s) will use the slots for a long time :/ [14:10] We'll just have to wait until the server can handle more connections. (Or we'll have to find some other server were we can upload to, to spread the load.) [14:10] right [14:11] can we, until then, increase the warrior concurrency to a higher max than 6? [14:13] No, that would require a lot of updates. (And I also don't really see how that would help. It would just add more waiting uploads.) [14:13] Just be patient. :) [14:13] okie :) [14:14] at least we don't loose the queues, which is great [14:15] alard: What would the requirements of such a server be? 
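As alard points out above, shortening the fixed 30-second wait would only synchronize the hammering; the usual cure for a thundering herd is to randomize each client's retry delay. A minimal bash sketch of the idea, not the actual seesaw code (hostname is a placeholder):

    delay=30
    until rsync -a --partial item.warc.gz rsync://fos.example/webshots/; do
        sleep $(( delay / 2 + RANDOM % delay ))   # wait 15-44 s, so clients drift apart
    done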
[14:15] we need a mini IA just for our upoloads lol [14:16] i mean underscor looks like haveing enough bandwidth to act as a caching server for smaller guys. But i might be wrong [14:22] soultcer: It would need downstream and upstream bandwidth, and a not too small disk to receive files before it packages and uploads them to archive.org. [14:22] Uploads are 50GB batches, so a multiple of that. [14:24] would 100 mbit be enough? [14:24] Maybe renting a cheap server from hetzner or ovh for a month would work [14:25] Yes, 100mbit would be enough (we also don't have to send all uploads to one server). [14:26] the bt ones are the issue right? [14:26] because they are so short.... ? [14:26] SHame we can't package multiple users together on the warrior? [14:26] I do not know what the issue is. It can't be bt, I would think, since we have only a few thousand small users there. [14:27] alard: but most of them finish in sub30 seconds [14:27] thats a lot of rsync processes spawning constantly for such small tranfers. [14:27] Yes, so there aren't many active at the same time. But I don't know what the issue is, really. It could be the number of processes spawning, or just the number of concurrent uploads. [14:28] Resuming uploads are probably also more expensive than new uploads. [14:28] (There would have been a few of them when the server came back up, I suspect.) [14:29] It doesn't have to be rsync, by the way. That's just what fos currently has. [14:30] Anyway, I'll be back later. [14:30] o/ [14:30] Does the bundling script rely on the partial setting? You could use --inplace, then it won't have to rename/move files after finishing [14:34] partial works for the webshots but makes no sense with the BT ones [14:37] 14254080 52% 166.45kB/s 0:01:18 [14:41] Back once again. [14:44] meh, i need a couple of minutes mostly [14:44] alard: i ask at my university [14:44] whether i can crawl with the pools at night, and whether a dump would be acceptable [14:48] alard, DFJustin: 0.18 and 1.0 warcsare the same bar the version number, yes. (I have this from one of the authors of the warc spec) [14:49] pps warc2warc in warctools can recompress warcs record by record. warc2warc *.warc.gz -O output.warc.gz [14:53] If it can recompress warcs, can it also concatenate them? Simply create one big warc file instead of tarring multiple warc files. Would make it easier for IA to use? [15:15] so SketchCow / underscor, can you pull bt usernames out of the wayback database, I can do stuff like http://wayback.archive.org/web/*/http://www.btinternet.com/~* but I only get a few hundred at a time and it will take forever [15:19] DFJustin: underscor sent a list from the wayback database yesterday. [15:20] well I was getting usernames just now that rescue-me didn't know, although I think most of them are long gone [15:21] Ah, I don't know what they searched for. [15:23] soultcer: I think --partial or --inplace doesn't really matter (moving a file on the same disk isn't that expensive, is it?) [15:25] I was playing with this for the one-big-warc problem: https://github.com/alard/megawarc Any good suggestions? 
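On soultcer's concatenation question: a record-compressed .warc.gz is just a series of gzip members, so joining warcs really is byte-level concatenation; megawarc (linked above) essentially does this and additionally records where each original file starts and ends. A sketch with made-up filenames:

    # valid .warc.gz files concatenate into one valid .warc.gz
    cat user-aaa.warc.gz user-bbb.warc.gz > big.warc.gz
    gzip -t big.warc.gz   # the result is still a well-formed gzip stream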
[15:25] http://24.media.tumblr.com/tumblr_m9dvjezOvX1qm3r26o1_500.jpg [15:25] When you have a big file half-uploaded, and then continue without --inplace, it will first make a temp copy of the already existing stuff, then write to that temp copy [15:25] Only when it has finished uploading, will it remove that copy [15:26] I had trouble transfering a file because rsync took more than 1.5 times the size of the file when I didn't use inplace [15:27] In any case, --inplace can't be used here, because then half-uploaded files could be moved by the postprocessing script. [15:28] alard: What do we need the original tar file for? [15:28] It's nice to be able to find the per-user files. [15:28] yes [15:28] And for mobileme there are wget.logs and other files. [15:32] So even though you'd probably never want the original tar file back, it's useful to keep the data somewhere. The 'restore' function demonstrates that there's no data lost. [15:48] alard: if you have extra logs to put in, warc record can handle that metadata records [15:50] tef_: I know. The new projects have one single warc file per user. The older projects, mobileme, have the logs and a few other files besides the warcs. [15:51] (And even with mobileme the wget log is also in the warc files, I think.) [15:51] nice [15:51] but yeah converting from .tar to warcgz could happily convert non warc records into warcrecords in the final output [15:52] Yes, so you could make one file that has everything. [15:52] Here's a hilarious one - the fortunecity collection. It's warcs AND straight html. [15:52] SketchCow: warc records can be of 'resource' instead of 'response' :-) [15:52] We could put the tar file in the big warc. [15:59] heh [16:19] SketchCow: I wasn't whining! [16:22] alard: Does the seesaw kit support round-robining rsync servers? [16:22] Because I have 12 boxes at archive.org we could rr between [16:28] underscor: Not yet, but it could. I think it would be even better to do it with HTTP PUT uploads, though. That would make round-robining easier. (And it might be less stressful for the server.) [16:28] Hmm, as safe as rsync though? [16:28] (checksum-wise) [16:28] hmmmm [16:28] alard: First test of megawarc coming up [16:28] Does rsync make many checksums? [16:29] I thought it did a checksum [16:29] But actually, no [16:29] it does [16:29] In write only mode, it doesn't [16:29] Only if you allow it, I thought. (Other than the filesize thing.) [16:29] files to check #0/1 [16:29] currently it appears to check the writes... [16:30] can you just use dns RR too?.... [16:30] Yeah, but that requires waiting for propagation, etc [16:30] Also a lot of places (RIT included) munge the results [16:31] and only return one of them until the cache expires [16:31] o [16:31] ttl 5 [16:31] :D [16:31] haha [16:31] They ignore ttl :() [16:31] :( * [16:31] just make sure your dns server can take it [16:31] wut ¬_¬ [16:31] Yeah [16:31] sux [16:32] ok, have the tracker hand out upload details? [16:32] along with username? [16:33] alard: Setting up a PUT server for testing [16:34] We could write a tiny node.js PUT server with checksums. :) [16:36] Why complicate it further by introducing another programming language? [16:36] Good question. [16:41] Is there no simple point to point file transfer protocol witch checksumming? [16:43] Do we need checksums? (If we're uncomplicating. :) [16:44] Nah. 
[16:44] I was just putting up a put accepter in nginx [16:44] since I already have it on these boxen [16:44] After all, once it's on that server we uploaded to we'll be using the non-checksummed s3 api to bring it to archive.org. [16:46] underscor: Do you happen to know if there's a way to distinguish uploaded from still-uploading files? [16:46] No idea. Let me see. [16:48] That's useful to know for the postprocessing. (The current packaging script moves any file it can find.) [16:49] "A file uploaded with the PUT method is first written to a temporary file, then a file is renamed. Starting from version 0.8.9 temporary files and the persistent store can be put on different file systems but be aware that in this case a file is copied across two file systems instead of the cheap rename operation." [16:49] apparently from "ngx_http_dav_module" docs [16:50] Ah, that's promising. [16:54] FTP [16:55] FTP? What are we, farmers? [16:57] lol [16:57] http://p.defau.lt/?SBDTYn8UhfxVvm4rSmlydw [16:57] cc alard [16:57] :D [16:57] and it didn't appear until after the upload fully finished [16:57] Nice. [16:58] Does it make directories? (As in /webshots/underscor/something.warc.gz ?) [16:58] I can enable it [16:59] So if you put to http://bt-download00.us.archive.org:8302/webshots/some/path/here/libtorrent-packages.tar.gz [16:59] it will create /some/path/here on the fly [16:59] It's not necessary, but I with the rsync uploads I generally let every download upload to a separate directory. [17:00] Doesn't really serve a purpose. [17:00] I'll be back later. [17:01] alard: option enabled. [17:02] Holler at me when you get back if you think this would be a better idea going forward, and I can push out to the rest of the boxes [17:19] i got up to episode 43 of t3 podcast [17:36] S[h]O[r]T: no, absolutely not FTP [17:36] lol [18:51] very relevant: I don't have time for silliness. Just let me know if you're removing our footage, or if I'm forwarding this to our attorneys. I'm not interested in your creative commons bs (which those of us who actually work in media refers to as amateur licensing) and I have told you that we do not want our work in any of your videos. Let me repeat: we want NONE of our work in ANY of your or any third party [18:51] videos, and our exclusive licensing agreements exist specifically so that is enforcable. [18:51] er [18:51] faol [18:51] fail * [18:51] http://arstechnica.com/tech-policy/2012/10/court-rules-book-scanning-is-fair-use-suggesting-google-books-victory/ [18:51] ignore the above blob of text, it was an earlier copypaste from a pastebin :P [18:53] now I'm curious [18:53] however I have work to do [20:08] alard's not here, is he? [20:08] I think eh went awayyyy [20:10] Hello! [20:12] Hey, my net went wonky. [20:12] ImportError: No module named ordereddict [20:13] How do I fix that? [20:13] Python 2.6? [20:14] wget https://bitbucket.org/wooparadog/zkit/raw/4ce69af1742f/ordereddict.py [20:14] File "megawarc.py", line 64, in [20:14] ImportError: No module named ordereddict [20:14] Traceback (most recent call last): [20:14] from ordereddict import OrderedDict [20:14] root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python2.7 megawarc.py [20:15] OrderedDict is in collections for py 2.7 [20:15] Bear in mind I am a perl guy at best. [20:15] We do it differently there. 
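The portable fix for the ImportError above is a guarded import, which is presumably roughly what the pastebin below contains: Python 2.7 has OrderedDict in collections, while Python 2.6 needs the standalone backport module alard pointed to.

    try:
        from collections import OrderedDict    # Python 2.7+
    except ImportError:
        from ordereddict import OrderedDict    # Python 2.6: standalone backport module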
[20:18] SketchCow: Replace "from orderecdict import OrderedDict" with this: http://pastebin.com/dQdZ0wX8 [20:18] Should work fine in py 2.7, and for py 2.6 you can download the ordereddict module alard told you about [20:20] OK [20:20] So I just wasted some time trying that. [20:20] alard: You are only using the ordered dict for cosmetics anyway, right? [20:21] Yes. [20:21] Alard, please put it in the megawarc github if it works [20:21] because damn, I don't edit python very well. [20:21] spaces, no tabs [20:21] though it pains me to say so [20:21] Yeah, no, like I don't do python [20:21] omh [20:22] And the github should be improved, not my local copy of it anyway [20:22] :) [20:24] SketchCow: I've updated the github repository. Try again. (It worked for me before and it still works now.) [20:32] Usage: megawarc [--verbose] build FILE megawarc [--verbose] restore FILE [20:32] root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc [20:32] root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc build BOARDS-COH-05.tar [20:32] Looking much better. [20:33] Now, let's see if the 11gb file that results is good. [20:33] Do you account for things being in subdirectories in the .tar? [20:43] Well, it doesn't care. What it does is this: it walks through the tar, one entry at a time. If it is a file *and* the filename ends with .warc.gz, it checks to see if it is an extractable gzip. If all that is OK, the warc file is added to the warc. In all other cases (directories, unreadable warcs, other files) the file is added to the leftover tar. [20:43] For the tar reconstruction, it pastes together the content from the leftover tar, the tar headers and parts from the warc. So directories don't matter. [20:53] root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc build BOARDS-COH-05.tar [20:53] root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# ls -l [20:53] total 21136664 [20:53] -rw-r--r-- 1 root root 10822246400 Oct 11 19:26 BOARDS-COH-05.tar [20:53] -rw-r--r-- 1 root root 84149 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.json.gz [20:53] -rw-r--r-- 1 root root 10240 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.tar [20:53] -rw-r--r-- 1 root root 10821470898 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.warc.gz [20:53] OK, so. [20:53] That worked... but there was no progress bar, and no updates. [20:53] So I'll use this for now, but I would definitely add something to indicate work is being done. [20:58] SketchCow: Add --verbose [20:59] It won't show a progress bar, but it will tell you what's taking so long. [20:59] underscor: Do you have a /webshots/alard/webshots.com-user-siebertphotoshop-20121011-225722.warc.gz ? [21:01] Oh! [21:07] joepie91: I'd upload them to IA and give SketchCow a link, [21:08] that's a bit hard [21:08] they're on a server [21:08] :P [21:08] can't get to them now anyway, that server seems offline [21:13] alard: http://p.defau.lt/?rZSWA6hEE1SROnuBQKxtAQ [21:13] Lookin awesome :D [21:13] Great. Ready for more? [21:14] joepie91: :o what was that mispaste about :D [21:15] alard: yep! [21:15] Shall I roll out to bt-download01-11 now too? [21:15] (for roundrobining) [21:15] That would be nice. The tracker picks one of the urls from a list, so it's possible to remove/add urls later. [21:16] Ah, nice! [21:16] ok, I'll work on pushing the config [21:16] is the limit of rsync only for webshot or for all projects? [21:16] I'll need the "cleanup" script too [21:16] flaushy: bt is set to 5 right now, webshots 10 [21:17] would it make sense to switch underscor? 
[21:17] from webshots to bt? [21:18] or are the rsyncs on bt crowded as well? [21:18] Webshots is now uploading over HTTP (once your warrior gets the update). [21:18] Sweet [21:18] so warrior restart time :) [21:18] awesome [21:18] What? [21:19] So wait, stuff is now banging directly into archive? Or something else. [21:19] SketchCow: underscor wants it. [21:19] SketchCow: well, I have 12 machines we can load balance between [21:19] Underscor wants a lot of things, but I like to be included while I'm over here trying to make this machine function. [21:19] so I thought it might be a better idea [21:21] Please at least tell me it's going into http://archive.org/details/webshots-freeze-frame with the same format structure [21:21] (We've been discussing this for a while, but we can change it again if you think it's not a good idea.) [21:21] It's exactly the same. [21:21] It's exactly the same, just that it is roundrobined between 12 boxes instead of a single one [21:21] I trust you'll do the right thing, but if we're using an environment, I just want to know, with my name being mentioned, we're going to shift gears. [21:22] Because then I can focus on it as a "clear out the rest of what we have" instead of "work my ass off on this box trying to make it function for the time being" [21:23] http://p.defau.lt/?rZSWA6hEE1SROnuBQKxtAQ [21:23] Ah, yes. It won't change immediately: the current warriors are still trying to rsync and will keep trying, until they succeed or until they're restarted. [21:23] er [21:23] joepie91: :o what was that mispaste about :D [21:23] is what I wanted to paste [21:23] anyway [21:23] wtf is with my clipboard today [21:24] tl;dr guy makes movie about occupy protests, then starts demanding that videos that reuse parts of his movie are taken down [21:24] let me find the full paste [21:25] drop to -bs [21:26] Anyway, I am all for solutions that increase the bandwidth away from FOS, which is meant to be a buffer of 20tb for incoming data, but doesn't function as well as it could as something to blow 50tb of insanity in, do operations on, and blow out. [21:27] SketchCow: what is the main bottleneck for fos? [21:27] I just need to know that's what's going on so I know I'm bailing water out of a bathtub for a little, and not trying to rescue a sinking ship. [21:27] FOS is a virtual box that does about 20 things. [21:27] So the bottleneck for FOS is FOS [21:27] Oversubscription. [21:27] In this particular case, we had the same disk being used for file writes, file compilation, and file reads [21:28] Which is normally not THAT big a deal but it was doing a LOT, and we had 900+ rsyncs [21:28] Eventually swap exploded [21:28] and everything goes to hell [21:28] Webshots on FOS is now sizzling out, but bt internet is still using rsync. But that's so small it's probably something to keep there? [21:28] I expect so, yes. [21:28] Webshots is TOO DAMN HIGH [21:29] so basically disk I/O is the bottleneck? [21:29] or the main one, at least [21:29] It's one of them. [21:29] hmm [21:29] let me think about this for a moment [21:29] I guess if we're looking to find out, we can circle the sizzling wreck and waste a few days determining why. [21:29] No, don't think about it. [21:29] Think about things and projects that need saving. [21:29] there's not much ability to save if the library is burning down [21:30] is there a script for bt as well? [21:30] underscor has twice your brainpower, and 400x your resources (200x mine) and has an unhealthy compulsion to optimize. [21:30] He'll fix it. 
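In curl terms, the HTTP upload the webshots warriors are switching to is just a PUT to one of the round-robined boxes, following the URL pattern underscor pasted earlier (the filename here is illustrative):

    curl -T webshots.com-user-example-20121011.warc.gz \
        http://bt-download00.us.archive.org:8302/webshots/example/webshots.com-user-example-20121011.warc.gz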
[21:30] * underscor giggles giddily [21:31] He literally cuddles with the internat archive infrastructure. [21:31] * underscor whistles innocently [21:31] ... not sure why you seem so strongly opposed to my decision to invest some of my _own_ time and thought into finding a possible solution [21:31] but, but, but, petaboxen are so cute~ [21:31] I personally don't really care who has more brainpower or infrastructure - more people thinking about it instead of watching random series because boredom, means more chance of a solution [21:31] This was a rare case where miscommunication, exacerbated by a red-eye flight, meant that I fell out of the loop of a solution set. [21:31] And got surprised, and whined. [21:32] SketchCow can't stand WWIC [21:34] The teamarchive/FOS machine will now get 8gb of swap instead of 2. [21:34] SketchCow: What script do you use to inject these into IA? [21:34] (and can I get it plz) [21:35] I have a custom injector that uses a s3 call. [21:35] * SmileyG sighs [21:35] still borked? :( [21:35] Before we do this with your round-robin thing. [21:35] What's still borked. [21:36] Anyway, before we do this with your round-robin thing, I think we need to decide if megawarc is ready for production. [21:36] my bt uploads by the look of things - looking at backlog now [21:36] Not borked. [21:36] It was being held at a limit, a limit which I will shortly lift as we move webshots over to a round-robin, and as FOS gets 4x the allocated RAM [21:38] Ah ok, I presumed it was the number of rsyncs due to the BT one being so fast that was the issue (i'd fill my queue in 30~ seconds). [21:39] Also [21:39] http://blog.archive.org/2012/10/10/the-ten-petabyte-party/ [21:39] If you're in SF, go eat some foods [21:39] I wish. [21:40] * joepie91 is practically on the other side of the world [21:40] Now, I want to discuss the format we put webshots in. [21:40] Probably every non-SF person here is wishing they'd be there now :b [21:40] My attention is gripped a little by seeing what the result of the megawarc program is. [21:40] http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5 [21:41] So first, let us see what the result of the derive is. [21:41] It's an 11gb megawarc, so it will take a few minutes. [21:42] what is a megawarc? [21:42] Could you teach the deriver to unpack the tar files? [21:42] soultcer: No [21:42] I sat in meetings across a week on it. [21:42] teaching the deriver anything is a major undertaking [21:42] It's not the deriver, it's the wayback machine. [21:42] It's a mess. [21:42] ah [21:43] So it's easier to generate a .warc.gz file that cats up all the other warcs in a specific way. [21:43] the way I take it, WBM indexes tar files that remain on petaboxes? [21:43] thus there's one copy of the WBM data or something [21:43] No, it's weirder. [21:43] It's all so weird. [21:43] s/tar/warc.gz/ [21:43] As much as we want me to go into the substance of this, here we go. [21:43] I see three audiences for our data. [21:43] 1. Wayback Machine [21:44] 2. The individuals who had their data on the thing, wanting their shit back [21:44] 3. Historians from The Future, with The Future being 1 hour to forever from now. [21:44] yeap [21:44] agreed. [21:45] So, the problem is, 1. is very, very, very old school and was designed from the ground up along a whole range of very carefully decided "things". [21:45] It is also, being from a non-profit, not overly packed with dev teams. [21:45] This translates to "it takes things a certain way" [21:45] picky, brittle. 
[21:45] It's possible to go 'well, leave things as they are, and make a second version' [21:45] And we're doing that for the moment with some items, for the sake of stepping into it slowly. [21:46] Obviously that doesn't work with MobileMe. [21:46] Now, I asked MobileMe to miss this current load-in to Wayback, because whatever decision/process is made becomes a 274tb decision. [21:47] do you have slides for a 5 minute presentation why you should join the archiveteam? i am going to a small congress from the ccc tomorrow [21:47] No, just links to my talks. [21:48] flaushy: hmmmmm not that I'm aware of - watch Jasons defcon speach and talk about Soy Sauce? [21:48] could be good enough :) [21:48] http://www.us.archive.org/log_show.php?task_id=127610039 [21:48] soy sauce itself is >5 minutes :P [21:48] Can you guys see that? [21:48] yes SketchCow [21:48] I can [21:48] Ok, so that's the deriver working with a megawarc. [21:48] need login [21:49] Get a damn library card, buddy! [21:49] ^ [21:49] They're freeeeee [21:49] sweet, 1.8gb already on the first node! [21:49] cc alard, SketchCow [21:49] OK, so turning from that experiment, and still waiting to make sure it works..... [21:49] ...I'd like to consider a process where we generate the megawarc by default. [21:50] And upload THOSE as webshots. [21:50] So my current process is "grab 50gb of these delightful picture warcs, .tar them, and then shove them on the internet archive." [21:51] underscor: My uploads are going really fast. [21:51] But those .tars are good for the 2. (individuals) with a LOT of help from additional alard scripts, and 3. And not good for 1. [21:51] alard: that's a good thing, right? [21:51] hehe [21:51] Your uploads are going to boxes that aren't maxed out to misery [21:51] underscor: Yes. :) [21:51] :D [21:51] SketchCow: hahahah [21:51] We'll see how long it lasts. [21:51] if we start with megawarcs, it's possible to make a tool that does range-requests and gets chunks in the middle [21:52] http://maelstrom.foxfag.com/munin/us.archive.org/bt-download00.us.archive.org/if_eth0.html [21:52] SketchCow: can we create some kind of "index" of the megawarc which we could feed into 1. (and use for 2.) [21:52] So I guess the question I pose to alard is, if we generate megawarcs, how hard would it be to write something that takes a link to the megawarc and returns your warc inside it? [21:52] SmileyG: The megawarc, by DEFINITION, works with 1. and 3. [21:52] And if it's in the Wayback, it helps 2. [21:53] SketchCow: ah duur failing to read. [21:53] SmileyG: yes, there is an index. it's called a cdx. [21:53] So in THEORY, this would be fine. [21:53] deriver makes it iirc [21:53] The json file tells you where each file is, with byte ranges. [21:53] This is why I shoudln't irc while dying. [21:53] how about not dying [21:53] So it will tell you that user-x.warc.gz is in the big-warc from bytes a-b. This byte range you can feed to http://warctozip.archive.org/, for example. (This is how the tabblo/mobileme search things work.) [21:54] OK. [21:54] Or you could do a curl with a byte range to get the warc.gz, if you don't like zip. [21:54] So it SHOULD be possible with current tools to assist 2. [21:54] Or some minor scripting to access current tools. [21:54] Yes. [21:54] current tools or minor changes, yes [21:54] Ok. [21:54] Then yes, we're going to: [21:54] * SmileyG has other things on his plate hes thinking about. Time to disappear again. [21:55] 1. 
Start pushing webshots up from underscor's Circle-Jerk to archive.org as native megawarcs [21:55] 2. See about (carefully) converting both previous webshots and mobileme to native megawarcs. [21:55] 99. Geocities? [21:56] Geocities as we did it will never go into the wayback. [21:56] Never? drat [21:56] As we did it. [21:56] nope, we didn't manage to collect enough metadata to put it into warc [21:56] In THEORY, we could generate warcs with some sort of obviousness that it could pull in. [21:56] we can't "redo" it though so.... [21:56] hmmm, as long as its "as" accessable as the others then shrug. [21:57] But man, I don't want to stress the IA infrastructe with THAT project this exact moment. [21:57] And by infrastructure I mean people. [21:57] wtf is hitting my keyboard o_O [21:57] sperm [21:57] worrying. [21:57] It dries [21:57] then its all crispy and the keys get stuck :< [21:58] check #archiveteam-spermclean [21:58] Read the FAQ [21:58] stupid idea: set up haproxy on shitty unmetered gbit vps, proxy to various backends [21:58] lol sorry, dragging this off topic ¬_¬; Really am going away, just gonna watch the convo unless someone on the internet turns out to be wrong. [21:58] upload over HTTP [21:58] joepie91: We did that way back when [21:58] shitty unmetered gbit vps <<< Howm uch $$$? [21:58] It was hilarrrrrrrrrrrious [21:59] SmileyG: not necessarily that much [21:59] expect disk I/O etc to suck though, but that doesn't matter if it's just a proxy [21:59] joepie91: Shitty unmetered VPS have one problem: In the end they are still a shitty VPS. [21:59] We did it on batcave, as I recall [21:59] soultcer: shitty in the sense of everything but the bandwidth sucks [21:59] :p [21:59] SketchCow: what were the results? [21:59] Oh, it was very effective [21:59] That was to fix network weirdness, where uploads directly to s3.us.archive.org were much slower than uploads proxied through batcave, also on archive.org. [22:00] joepie91: But I think the HTTP upload from the warriors works fine now, without proxy stuff. The tracker redirects to one of the upload servers. [22:01] SketchCow: Does the Internet Archive hire remote workers? [22:01] alard: alright [22:01] so... the upload problems should be solved, or? [22:01] Yes, for the time being. :) [22:02] Update your Webshots scripts, if you aren't using a warrior. [22:06] So, alard, is there a way to make a megawarc generator that just takes a directory instead of a tar? [22:08] That depends on your definition of "megawarc". As it is now, the json contains tar headers and the position of the warcs in the original tar file. You could leave that out, though, and keep the properties that are useful for indexing the big-warc. [22:08] What would be the best way to get the filenames to the megawarc script? Use find and pipe to stdin? [22:09] (There may be too many files to go as command line arguments.) [22:09] I am comfortable with doing a tar to stdin... :) [22:10] or to stdout, I guess you might say [22:10] Make the script recursively search a directory for warcs? [22:10] Well, piping tar into the megawarc script won't easily work, since the script needs two passes over each warc file. (Once to check if it can be decompressed, once to copy it to the big-warc.) [22:11] Well, I assumed a different script taking a different approach. [22:11] Yes, but I think you want the gzip test. If you don't have that test one invalid file can ruin the whole warc. [22:11] I mean, let's back it up. 
What I'd like is a way to take a directory, instead of a .tar, and make it a megawarc. [22:11] However that's done, I approve. [22:12] So you want to stop even creating the tars for new projects and just upload warcs to the IA plus a small tar for the logfiles? [22:12] It's expensive as shit, but making a .tar file, and then running megawarc against it, then removing the tar file and uploading the megawarc files....I could live with that. [22:12] that might be smartest. [22:13] I think the tar isn't really necessary, especially not if you don't want to 'reconstruct' a tar that never was from the megawarc. [22:13] Or you could use made-up tar headers. [22:13] Well, let's think about it. [22:14] DO we want the .tar file? By reconstructing later, you have a nice standard collection of the files. [22:14] And you can pull things from it. [22:14] Well we need some way to store "these records in the warc file belong to a single user" [22:18] So, how do we feel about that? I think a .tar existing somewhere along the line works very well for what we want to do. [22:18] because then the .tars can go into The Next Thing After Internet Archive [22:19] TNTAIA, for short [22:19] But then you have to store both the tar file and the megawarc [22:19] No, no. [22:19] you are using a .tar as the intermediary instead of the file directory to generate the megawarc [22:20] So in come the piles o' files [22:20] At some point, you have a 50gb collection (say) [22:20] You make it a .tar [22:20] You megawarc the .tar [22:20] you upload the megawarc. [22:20] Now the thing's been standardized out past the filesystem [22:20] And can be turned into 50gb chunks in the future on your holocube 2000x [22:21] How about instead of creating a .tar and megawarcing it, you directly create the megawarc from the 50gb of source files? [22:21] This is what we just discussed. [22:22] Oh, I thought you wanted to keep the "create a tar and megawarc it" step [22:22] I asked about that possibility, but it does lead to concerns. [22:22] by making something a .tar and then making it a megawarc, we have an intermediary thing it's converted back into that's able to be manipulated by other programs. [22:23] And I am saying, I think this is a good idea for future extensibility. [22:23] 1. 2. and 3. are all handled. [22:23] Even if you skip the tar step, you can later convert it back to a tar file. [22:24] Though as long as going through the tar step doesn't create much of a bottleneck, it is probably nice to use tools that already exist and that do one thing well [22:24] http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5 [22:24] OK, so update. [22:24] It definitely made a CDX [22:25] nice! [22:25] SketchCow: So is that what you want to do? fill up 50gb->tar->megawarc->ingest->rinse/repeat? [22:26] That is my proposal - I can give you scripts that I wrote and which alard wrote. [22:26] okay, awesome [22:26] But first, I want to have us discuss 50gb->megawarc->ingest [22:26] Because that was also on the table. pros and cons. [22:26] alard gave me the "watcher" script [22:26] but I don't have a tar/s3-er [22:26] that moves things into a temp dir [22:27] Right. [22:27] No, I'll give you those, but first I want this decided. [22:27] Also, I asked people to verify the CDX just generated. [22:27] Because if it just made borscht, more borscht is not a buddy. [22:27] I'm also about to restore that megawarc-5 to see what happens.
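One easy way to verify the restore SketchCow is about to run: keep a checksum of the original tar and compare it against the reconstruction, which should be bit-for-bit identical (filenames as in the log; the .md5 file is an extra of this sketch):

    md5sum BOARDS-COH-05.tar > BOARDS-COH-05.tar.md5   # taken before the original tar is removed
    # ... megawarc convert, upload, megawarc restore ...
    md5sum -c BOARDS-COH-05.tar.md5                    # passes only if nothing was lost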
[22:29] SketchCow: As I said, it would be easy to modify alard's megawarc creator so it directly takes a directory of small warcs/wget logs and creates the same output (minus some tar metadata, that isn't necessary) [22:30] As alard said, one corrupted gz makes it not work [22:31] soultcer: Just thought that it should be possible to add tar headers as well. Let Python create them, as if it is making a tar. [22:31] I'll have a look. [22:31] Let's put it this way. [22:31] Other question: it's possible to put the extra-tar and the json inside the big-warc. Is that useful? [22:31] alard: Would work, but why would we need the tar headers anyway? It's mostly metadata about the filesystem on fos. [22:31] You'd have one file, but the index would be less accessible. [22:32] alard: It makes it harder to decipher later. I'd keep it outside [22:32] soultcer: It has timestamps. [22:32] and it makes it easier to make a tar. [22:33] alard: I don't really see why we would need any of the tar metadata, but it would of course be possible to create some of it, but I have no idea how to create the tar header string you have in the json file [22:34] SketchCow: I'm still having rsync issues for btinternet - does that happen for everyone? [22:34] The issue is not everything we add is a warc.gz [22:34] Sometimes it's going to be additional 'stuff' [22:35] for every single job: @ERROR: max connections (5) reached -- try again later [22:35] SketchCow: The additional files are simply put into a single tar archive [22:35] btinternet just got more love [22:36] looks fixed now, thanks [22:36] :P [22:36] Together with the metadata from the json file, the tar archive with the additional files and the megawarc file, you can recreate the original directory structure, or create a tar archive with all files [22:36] whoa [22:36] starts scrolling like mad [22:36] lol [22:37] looks like people had a lot in queue [22:37] soultcer: I'm going to again defer to alard's opinion on this. [22:37] especially Sue [22:37] *cough* [22:37] Incoming crap - ends up as megawarc [22:37] I just want the megawarc that results to be useful to the historians and the individuals as much as it is to wayback. [22:38] And the wget logs I assume? [22:39] I don't want anything being lost [22:41] lol, at this pace, btinternet will be done in 30 minutes [22:47] alard: So what do you think? Bundle as tar, then megawarc; or directly create the megawarc? [22:50] Give him a moment, I see he's been coding some stuff related to uploads. [22:51] Sure. [22:53] The thing with the tar is: It includes a lot of metadata on who created the tar file (uid/gid), when it was created (mtime/ctime) and the filesystem permissions. I am not sure if we want to keep those, or not [23:06] I've reconstructed a .tar from the megawarc. [23:06] Now unpacking it to see if everything comes out ok. [23:07] They should be bit-for-bit copies I think [23:07] Absolutely. [23:08] Regardless, I am doing what someone in 10 years would be doing. [23:17] i'm sorry for what i did to btinternet [23:17] they got graped [23:18] i had like [23:18] 300-400 rsync jobs queued up [23:18] apparently 17G woth [23:18] *worth [23:18] jesus [23:18] it's about to be done [23:21] New version: https://github.com/alard/megawarc [23:21] I renamed megawarc build TAR to megawarc convert TAR (seemed more logical). [23:22] There's now also a megawarc pack TAR FILE-1 FILE-2 ... option that packs files/paths directly. [23:22] You still need to provide TAR to make the file names, but that tar doesn't exist. [23:23] E.g. 
./megawarc pack webshots-12345.tar 12345/ should work. [23:24] Then ./megawarc restore webshots-12345.tar would give you a tar file. [23:24] alard: Nice work. I was just thinking about simply using the TarInfo class to create the tar_header structure, but I see you not only thought of it faster, you implemented it while I was still thinking about the details ;-) [23:27] I copied most of it from Python's tarfile.py. [23:27] Good programmers code, better programmers reuse ;-) [23:28] btinternet is now in the negative [23:29] ooo [23:29] 100MB btinternet user incoming [23:29] ... wat [23:29] how's that even possible? [23:29] means they paid for premium [23:29] SketchCow: but premium users are on a separate server [23:29] A la geocities and a few others, the old address is kept while the premium address goes up. [23:29] recursion! [23:29] We found 1gb geocities users [23:29] ah [23:30] Time to find more usernames then, (There are also 1185 usernames still claimed, over 1000 by Sue.) [23:30] wonder how they did that though [23:30] because there's a separate server for all premium users [23:30] two IPs away from the free server [23:30] over 1k by me? must be a glitch [23:30] Are all your instances finished? [23:31] i'm still doing probably 20-30 [23:31] the screen isn't full of no item recieved yet [23:31] mine is [23:31] or well [23:31] alternating between no item received and tracker rate limiting [23:31] lol [23:33] can you release items per user? that's strange that i have so many... [23:33] I've put them back in the queue. [23:34] ok [23:34] And with that I'm off to bed. Bye! [23:34] Thanks again, alard [23:34] i hunger for more [23:35] suddenly, 5mbit! [23:35] goodnight alard [23:35] :) [23:39] wow so unless we find way more users, all of btinternet will fit on a microsd card [23:39] INCREASING COSTS [23:40] Huh, someone recorded alard's process for programming new code. [23:40] http://www.youtube.com/watch?feature=fvwp&NR=1&v=8VTW1iUn3Bg [23:40] Screencap's gotten good [23:40] (He's the one in the glasses) [23:43] i'm out of users, 14 left downloading [23:50] That's to be expected. [23:57] Internet Archive's teams have signed off on the megawarcs. [23:57] So guess what - FOS is making a ton of fucking megawarcs tonight.
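The nightly routine this sets up boils down to roughly the following; the batch and item names are placeholders, the s3.us.archive.org headers are the standard IA S3-style ones, and SketchCow's actual injector script may well differ:

    # bundle ~50 GB of finished warcs, convert to a megawarc, push to the archive
    tar -cf webshots-batch-0001.tar webshots-batch-0001/
    python megawarc --verbose convert webshots-batch-0001.tar
    for f in webshots-batch-0001.tar.megawarc.*; do
        curl -T "$f" \
            -H "authorization: LOW $IA_ACCESS_KEY:$IA_SECRET_KEY" \
            -H "x-archive-auto-make-bucket: 1" \
            http://s3.us.archive.org/webshots-megawarc-0001/"$f"
    done
    rm webshots-batch-0001.tar   # the megawarc trio can reproduce it exactly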