[00:00] ah, I'm seeing timeouts in my warrior [00:00] it must really be getting crushed [00:00] what does FOS stand for? [00:01] Free and Open Source? Maybe? [00:02] just curious, what are the hardware specs for fos? [00:02] This is really bad. [00:02] it's a raspberry pi hooked up to a RAID array [00:02] It shouldn't be this hammered. [00:02] Oh? [00:04] speaking of 'pi's, apparently you can colocate a pi in Austria for free: https://www.edis.at/en/server/colocation/austria/raspberrypi/ [00:04] FOS stands for Fortress of Solitude [00:04] It replaced a machine named Batcave [00:04] Hah, nice [00:04] FOS became a way to refer to it easily. [00:04] :-) awesome, thanks [00:05] i always thought it was a fun take on FiOS because verizon sponsored it :P [00:05] iFOS. By Apple. [00:05] even though i know that is nowhere near true [00:17] I wonder if these fixed 30 second retries have us all hammering FOS at the same time. [00:17] thundering herd effect? [00:18] TIL that term. Essentially, although more than one can rsync at a time. [00:18] it's why random backoff in ethernet is a thing [00:19] I keep getting about 5% uploaded before it dies [00:24] Machine is seriously getting hammered. [00:24] Not sure what to do yet. [00:24] Might set rsync limits. [00:25] Is it coming in 30 second waves? [00:25] Or is it just a constant surge of traffic? [00:25] ha ha you act like pressing keys makes anything happen. [00:25] Oh right. The tubes. They are clogged. [00:26] if you have access to the switch or firewall in front of it you can block certain IP ranges or ports to slow down the flow of traffic in [00:26] I like where you said that too. [00:26] All these suggestions are well meaning and useless. [00:26] I'm going to implement a max connections as soon as I can get vi to respond. [00:27] well if you had access to the switch you could just deny all rsync or anything else and allow ssh :p [00:27] that wouldnt be useless [00:28] Yes. [00:28] So..... [00:28] If only we could turn lead into gold, we could solve a number of problems. [00:28] But the impossibility of that makes it useless. [00:30] Realize my temper is going to be short while I wrestle with a machine with over 1,100 rsync connections active. [00:30] Yup. Good luck, soldier. [00:31] And advice along the line of "to fix the problem, you should fix the problem" is a brain fart [00:32] It has been trying to open a vi session for 4 minutes. [00:32] That's how bad it is. [00:32] I have two other windows, trying to set up a killing of rsync [00:34] im guessing you didnt want any advice then and are just venting [00:34] Mine finally timed out so I was able to pause the VM. So my minuscule part of the load is off. [00:39] I set it to 20 [00:43] Now doing a megakill [00:44] Bitches [00:44] ps -ef | grep rsync | awk '{print $2}' | xargs kill [00:47] no killall? [00:47] or skill [00:48] shhh, I'm oldschool [00:48] * chronomex nods knowingly [00:49] you have legitimate claim to the phrase "I have underwear that's older than your home directory" [00:56] nice [00:57] root@teamarchive-1:/etc# killall rsync [00:57] though I think if he had used 'ps -aux | grep'... that would have been better [00:59] looks like it's time for bed. Gettin' a little punchy. [00:59] later [01:00] Machine is pretty hosed. [01:06] FOS crashed. [01:07] Wow, what happened [01:22] DJ Smiley remix of the main page of archiveteam.org now in place. [01:32] fos is back [01:32] now running with some severe rsync limits while we get shit in shape.
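For reference, the clean-up and the cap applied above amount to something like this; a sketch, not the exact commands run on FOS, and the rsyncd.conf module name is made up:

    # clear out the runaway rsync processes (same effect as the killall above)
    pkill -x rsync
    # then cap simultaneous clients per module in rsyncd.conf:
    #   [webshots]
    #       path = /data/webshots
    #       max connections = 20
    #       lock file = /var/run/rsyncd.lock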
[02:18] i'm uploading issue 150 dvd of linux format [03:47] @ERROR: max connections (5) reached -- try again later [03:47] Starting RsyncUpload for Item woodp [03:47] getting a whole mess of these [03:47] rsync error: error starting client-server protocol (code 5) at main.c(1534) [sender=3.0.9] [03:49] the server (fos) stuff rsyncs too is limited to 5 rsync connections atm, it was having issues earlier. SketchCow should updated one its all good at some point [03:51] so is the script gonna continue at some point cause it just keeps trying to dend those 2 users over and over [03:51] send [03:51] yeah it will keep trying until it gets through [03:51] can just leave it running [03:52] I thought it only tries 50 times [03:52] and then gives up? [03:54] if it does thats 25min and there must be a bug? [03:55] thats good tho :P [03:58] i looked awhile back and i just a bit ago, was pretty sure the rsync in pipeline doesnt have a lot of overhead but i could be wrong. i know there are some options to turn off compression and us a lower encryption that generate less cpu usage. [03:58] for client and server [04:24] S[h]O[r]T: Well, I'm just saying [04:24] with the rate limit on fos [04:24] it's very likely you could not get in in 25m [04:24] and then the thing will just give up [04:24] and you're wasted [04:24] 3 [04:24] D: [04:43] im saying its been more than 25min and i havent got in and its still trying [04:47] oic [04:48] maybe I'm wrong [04:48] I just overheard someone say that [04:48] looks like SketchCow upped it to 10 [04:48] none of my threads are doing any work still, though [04:48] hopefully we can reopen the floodgates soon [04:48] otherwise we're definitely not going to do well with webshots XD [04:51] yay! [04:51] finally got one in [04:51] w00t [05:13] i dont see it got upped to 10:P [06:27] is anyone in the rehosting-dead-pastebins business? [06:27] 100K pastes from paste.lisp.org would be better off googlable [06:33] do you have them?? [06:42] yes [06:42] http://ludios.org/tmp/paste.lisp.org.7z [06:42] chronomex: ^ [06:43] something up with warrior upload? getting "@ERROR: max connections (10) reached -- try again later" [06:45] <3 [06:45] The server we rsync to is currently limited because it was having problems earlier [06:46] thanks ivan`! are you involved with paste.lisp.org? [06:46] no, I think stassats runs it, but his reply did not indicate interest in restoring them [06:46] aye. [06:47] ow, this is a lot of files [06:47] heh [06:47] hoping limit gets increased/lifted... almost all warriors waiting for upload :| [06:47] *wow [06:47] WHY HELLO [06:48] * chronomex shoves this into IA [06:48] You crying sallybags. [06:48] wassap brah [06:48] You whip a virtual server within an inch of its life, and then woah, you all want it jogging around the track 5 minutes later. [06:49] Also, I like Underscor whining on 3 channels about me taking a reasonable attempt to prevent the machine dying. [06:49] 948 simultaneous rsyncs. [06:49] Think about that. [06:49] o_O [06:49] You know what you did. [06:49] bitches gotta bitch [06:49] * SketchCow gets the newspaper [06:49] good job team! :) [06:49] haha [06:50] We need support for distributing uploads to multiple servers. Next one to complain about fos being unreachable will be volunteered to code that into the seesaw kit. [06:50] ivan`: can you share some info about this file? when was it captured, was the paste dead at the time, is it complete, etc [06:52] Tomorrow, FOS goes down when one of the admins inceases its swap from 2gb to 6gb. 
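For context, the upload each warrior keeps retrying above is a plain daemon-mode rsync roughly like the one below (host and module are placeholders). The "@ERROR: max connections" reply is sent by the daemon before any data moves, so a refused attempt costs the server almost nothing:

    rsync -av --partial woodp.warc.gz rsync://fos.example/webshots/woodp/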
[06:59] chronomex: pastes were captured 2011-11-14 and 2012-05-01 and 2012-10-06 (though perhaps I should strip those); not complete, I don't have pastes 129789-131892 [07:00] ok [07:00] :D [07:04] http://archive.org/details/paste_lisp_org [07:06] So, basically I have a couple days to prepare some more archiveteam items for ingestion into the wayback. [07:09] 188,329,776 14.0M/s eta 3h 58m [07:09] Now that's a spicy meatball [07:10] 1,067,816,696 17.7M/s eta 4h 19m [07:10] Downloaded a gig. Going to take 4 hours. It's like that. [07:14] With luck, I can make a lot more of these things green. [07:14] If this all works, all the green ones go into the wayback machine instantly. [07:15] Instant SOPA review end of the month! [07:15] That'd be nice. [07:41] Re-initiated uploads from fos to archive.org of webshots loads. [07:46] Hi there. I've looked up the "rsync retries 50 times" claim that I made yesterday. I now think that's wrong and that it retries indefinitely, so your warriors will just wait until they can upload. [08:37] OK, napping. [13:07] SketchCow: I'm not quite sure what to do with this, but I archived the videos that someone (I think bsmith096) linked me to a while ago as rare footage: http://aarnist.cryto.net:81/youtube/all/ [13:07] flv/mp4/webm format [13:17] having a hard time with rsync with warrior [13:17] getting "max connections reached" errors [13:22] same [13:22] alard: ya there? [13:24] the server the scripts rsync to is currently limited because it was having problems earlier [13:24] yeah, but how do I keep the warrior going? [13:24] I have this bandwidth which otherwise isn't going to get used [13:24] just have to wait, it will keep retrying uploads :\ [13:25] need to shorten the wait time from 30 seconds to more like 5 then [13:42] is there any way I can tweak this? :/ [13:45] more concurrent threads? [13:45] problem is we are all downloading it faster than FOS can accept it back in [13:45] yeah [13:46] The fix is FOS Accepting it faster, or us having larger caches. [13:46] larger caches are possible if you do more concurrent downloads, however depending how fast you download in ratio to the max upload, your still going to get stuck eventually [13:47] joepie91: I'd upload them to IA and give SketchCow a link, [13:47] thats what I've done with the fish ezine I get each month. [13:48] [08:46:31] <@alard> Hi there. I've looked up the "rsync retries 50 times" claim that I made yesterday. I now think that's wrong and that it retries indefinitely, so your warriors will just wait until they can upload. [13:48] Phew! that was a worry for me. [14:02] 5 connections seems a bit low [14:09] Hey, "need to shorten the wait time from 30 seconds to more like 5 then" is not a good idea. If we all do that, will just increase the load on the server, but won't increase the number of uploads. [14:10] the problem is that users like me with slow uploads (max 1 MiB/s) will use the slots for a long time :/ [14:10] We'll just have to wait until the server can handle more connections. (Or we'll have to find some other server were we can upload to, to spread the load.) [14:10] right [14:11] can we, until then, increase the warrior concurrency to a higher max than 6? [14:13] No, that would require a lot of updates. (And I also don't really see how that would help. It would just add more waiting uploads.) [14:13] Just be patient. :) [14:13] okie :) [14:14] at least we don't loose the queues, which is great [14:15] alard: What would the requirements of such a server be? 
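As alard points out above, shortening the fixed 30-second wait would only synchronize the hammering; the usual cure for a thundering herd is to randomize each client's retry delay. A minimal bash sketch of the idea, not the actual seesaw code (hostname is a placeholder):

    delay=30
    until rsync -a --partial item.warc.gz rsync://fos.example/webshots/; do
        sleep $(( delay / 2 + RANDOM % delay ))   # wait 15-44 s, so clients drift apart
    done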
[14:15] we need a mini IA just for our upoloads lol [14:16] i mean underscor looks like haveing enough bandwidth to act as a caching server for smaller guys. But i might be wrong [14:22] soultcer: It would need downstream and upstream bandwidth, and a not too small disk to receive files before it packages and uploads them to archive.org. [14:22] Uploads are 50GB batches, so a multiple of that. [14:24] would 100 mbit be enough? [14:24] Maybe renting a cheap server from hetzner or ovh for a month would work [14:25] Yes, 100mbit would be enough (we also don't have to send all uploads to one server). [14:26] the bt ones are the issue right? [14:26] because they are so short.... ? [14:26] SHame we can't package multiple users together on the warrior? [14:26] I do not know what the issue is. It can't be bt, I would think, since we have only a few thousand small users there. [14:27] alard: but most of them finish in sub30 seconds [14:27] thats a lot of rsync processes spawning constantly for such small tranfers. [14:27] Yes, so there aren't many active at the same time. But I don't know what the issue is, really. It could be the number of processes spawning, or just the number of concurrent uploads. [14:28] Resuming uploads are probably also more expensive than new uploads. [14:28] (There would have been a few of them when the server came back up, I suspect.) [14:29] It doesn't have to be rsync, by the way. That's just what fos currently has. [14:30] Anyway, I'll be back later. [14:30] o/ [14:30] Does the bundling script rely on the partial setting? You could use --inplace, then it won't have to rename/move files after finishing [14:34] partial works for the webshots but makes no sense with the BT ones [14:37] 14254080 52% 166.45kB/s 0:01:18 [14:41] Back once again. [14:44] meh, i need a couple of minutes mostly [14:44] alard: i ask at my university [14:44] whether i can crawl with the pools at night, and whether a dump would be acceptable [14:48] alard, DFJustin: 0.18 and 1.0 warcsare the same bar the version number, yes. (I have this from one of the authors of the warc spec) [14:49] pps warc2warc in warctools can recompress warcs record by record. warc2warc *.warc.gz -O output.warc.gz [14:53] If it can recompress warcs, can it also concatenate them? Simply create one big warc file instead of tarring multiple warc files. Would make it easier for IA to use? [15:15] so SketchCow / underscor, can you pull bt usernames out of the wayback database, I can do stuff like http://wayback.archive.org/web/*/http://www.btinternet.com/~* but I only get a few hundred at a time and it will take forever [15:19] DFJustin: underscor sent a list from the wayback database yesterday. [15:20] well I was getting usernames just now that rescue-me didn't know, although I think most of them are long gone [15:21] Ah, I don't know what they searched for. [15:23] soultcer: I think --partial or --inplace doesn't really matter (moving a file on the same disk isn't that expensive, is it?) [15:25] I was playing with this for the one-big-warc problem: https://github.com/alard/megawarc Any good suggestions? 
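On soultcer's concatenation question: a record-compressed .warc.gz is just a series of gzip members, so joining warcs really is byte-level concatenation; megawarc (linked above) essentially does this and additionally records where each original file starts and ends. A sketch with made-up filenames:

    # valid .warc.gz files concatenate into one valid .warc.gz
    cat user-aaa.warc.gz user-bbb.warc.gz > big.warc.gz
    gzip -t big.warc.gz   # the result is still a well-formed gzip stream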
[15:25] http://24.media.tumblr.com/tumblr_m9dvjezOvX1qm3r26o1_500.jpg [15:25] When you have a big file half-uploaded, and then continue without --inplace, it will first make a temp copy of the already existing stuff, then write to that temp copy [15:25] Only when it has finished uploading, will it remove that copy [15:26] I had trouble transfering a file because rsync took more than 1.5 times the size of the file when I didn't use inplace [15:27] In any case, --inplace can't be used here, because then half-uploaded files could be moved by the postprocessing script. [15:28] alard: What do we need the original tar file for? [15:28] It's nice to be able to find the per-user files. [15:28] yes [15:28] And for mobileme there are wget.logs and other files. [15:32] So even though you'd probably never want the original tar file back, it's useful to keep the data somewhere. The 'restore' function demonstrates that there's no data lost. [15:48] alard: if you have extra logs to put in, warc record can handle that metadata records [15:50] tef_: I know. The new projects have one single warc file per user. The older projects, mobileme, have the logs and a few other files besides the warcs. [15:51] (And even with mobileme the wget log is also in the warc files, I think.) [15:51] nice [15:51] but yeah converting from .tar to warcgz could happily convert non warc records into warcrecords in the final output [15:52] Yes, so you could make one file that has everything. [15:52] Here's a hilarious one - the fortunecity collection. It's warcs AND straight html. [15:52] SketchCow: warc records can be of 'resource' instead of 'response' :-) [15:52] We could put the tar file in the big warc. [15:59] heh [16:19] SketchCow: I wasn't whining! [16:22] alard: Does the seesaw kit support round-robining rsync servers? [16:22] Because I have 12 boxes at archive.org we could rr between [16:28] underscor: Not yet, but it could. I think it would be even better to do it with HTTP PUT uploads, though. That would make round-robining easier. (And it might be less stressful for the server.) [16:28] Hmm, as safe as rsync though? [16:28] (checksum-wise) [16:28] hmmmm [16:28] alard: First test of megawarc coming up [16:28] Does rsync make many checksums? [16:29] I thought it did a checksum [16:29] But actually, no [16:29] it does [16:29] In write only mode, it doesn't [16:29] Only if you allow it, I thought. (Other than the filesize thing.) [16:29] files to check #0/1 [16:29] currently it appears to check the writes... [16:30] can you just use dns RR too?.... [16:30] Yeah, but that requires waiting for propagation, etc [16:30] Also a lot of places (RIT included) munge the results [16:31] and only return one of them until the cache expires [16:31] o [16:31] ttl 5 [16:31] :D [16:31] haha [16:31] They ignore ttl :() [16:31] :( * [16:31] just make sure your dns server can take it [16:31] wut ¬_¬ [16:31] Yeah [16:31] sux [16:32] ok, have the tracker hand out upload details? [16:32] along with username? [16:33] alard: Setting up a PUT server for testing [16:34] We could write a tiny node.js PUT server with checksums. :) [16:36] Why complicate it further by introducing another programming language? [16:36] Good question. [16:41] Is there no simple point to point file transfer protocol witch checksumming? [16:43] Do we need checksums? (If we're uncomplicating. :) [16:44] Nah. 
[16:44] I was just putting up a put accepter in nginx [16:44] since I already have it on these boxen [16:44] After all, once it's on that server we uploaded to we'll be using the non-checksummed s3 api to bring it to archive.org. [16:46] underscor: Do you happen to know if there's a way to distinguish uploaded from still-uploading files? [16:46] No idea. Let me see. [16:48] That's useful to know for the postprocessing. (The current packaging script moves any file it can find.) [16:49] "A file uploaded with the PUT method is first written to a temporary file, then a file is renamed. Starting from version 0.8.9 temporary files and the persistent store can be put on different file systems but be aware that in this case a file is copied across two file systems instead of the cheap rename operation." [16:49] apparently from "ngx_http_dav_module" docs [16:50] Ah, that's promising. [16:54] FTP [16:55] FTP? What are we, farmers? [16:57] lol [16:57] http://p.defau.lt/?SBDTYn8UhfxVvm4rSmlydw [16:57] cc alard [16:57] :D [16:57] and it didn't appear until after the upload fully finished [16:57] Nice. [16:58] Does it make directories? (As in /webshots/underscor/something.warc.gz ?) [16:58] I can enable it [16:59] So if you put to http://bt-download00.us.archive.org:8302/webshots/some/path/here/libtorrent-packages.tar.gz [16:59] it will create /some/path/here on the fly [16:59] It's not necessary, but I with the rsync uploads I generally let every download upload to a separate directory. [17:00] Doesn't really serve a purpose. [17:00] I'll be back later. [17:01] alard: option enabled. [17:02] Holler at me when you get back if you think this would be a better idea going forward, and I can push out to the rest of the boxes [17:19] i got up to episode 43 of t3 podcast [17:36] S[h]O[r]T: no, absolutely not FTP [17:36] lol [18:51] very relevant: I don't have time for silliness. Just let me know if you're removing our footage, or if I'm forwarding this to our attorneys. I'm not interested in your creative commons bs (which those of us who actually work in media refers to as amateur licensing) and I have told you that we do not want our work in any of your videos. Let me repeat: we want NONE of our work in ANY of your or any third party [18:51] videos, and our exclusive licensing agreements exist specifically so that is enforcable. [18:51] er [18:51] faol [18:51] fail * [18:51] http://arstechnica.com/tech-policy/2012/10/court-rules-book-scanning-is-fair-use-suggesting-google-books-victory/ [18:51] ignore the above blob of text, it was an earlier copypaste from a pastebin :P [18:53] now I'm curious [18:53] however I have work to do [20:08] alard's not here, is he? [20:08] I think eh went awayyyy [20:10] Hello! [20:12] Hey, my net went wonky. [20:12] ImportError: No module named ordereddict [20:13] How do I fix that? [20:13] Python 2.6? [20:14] wget https://bitbucket.org/wooparadog/zkit/raw/4ce69af1742f/ordereddict.py [20:14] File "megawarc.py", line 64, in [20:14] ImportError: No module named ordereddict [20:14] Traceback (most recent call last): [20:14] from ordereddict import OrderedDict [20:14] root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python2.7 megawarc.py [20:15] OrderedDict is in collections for py 2.7 [20:15] Bear in mind I am a perl guy at best. [20:15] We do it differently there. 
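The portable fix for the ImportError above is a guarded import, which is presumably roughly what the pastebin below contains: Python 2.7 has OrderedDict in collections, while Python 2.6 needs the standalone backport module alard pointed to.

    try:
        from collections import OrderedDict    # Python 2.7+
    except ImportError:
        from ordereddict import OrderedDict    # Python 2.6: standalone backport module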
[20:18] SketchCow: Replace "from orderecdict import OrderedDict" with this: http://pastebin.com/dQdZ0wX8 [20:18] Should work fine in py 2.7, and for py 2.6 you can download the ordereddict module alard told you about [20:20] OK [20:20] So I just wasted some time trying that. [20:20] alard: You are only using the ordered dict for cosmetics anyway, right? [20:21] Yes. [20:21] Alard, please put it in the megawarc github if it works [20:21] because damn, I don't edit python very well. [20:21] spaces, no tabs [20:21] though it pains me to say so [20:21] Yeah, no, like I don't do python [20:21] omh [20:22] And the github should be improved, not my local copy of it anyway [20:22] :) [20:24] SketchCow: I've updated the github repository. Try again. (It worked for me before and it still works now.) [20:32] Usage: megawarc [--verbose] build FILE megawarc [--verbose] restore FILE [20:32] root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc [20:32] root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc build BOARDS-COH-05.tar [20:32] Looking much better. [20:33] Now, let's see if the 11gb file that results is good. [20:33] Do you account for things being in subdirectories in the .tar? [20:43] Well, it doesn't care. What it does is this: it walks through the tar, one entry at a time. If it is a file *and* the filename ends with .warc.gz, it checks to see if it is an extractable gzip. If all that is OK, the warc file is added to the warc. In all other cases (directories, unreadable warcs, other files) the file is added to the leftover tar. [20:43] For the tar reconstruction, it pastes together the content from the leftover tar, the tar headers and parts from the warc. So directories don't matter. [20:53] root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# python megawarc build BOARDS-COH-05.tar [20:53] root@teamarchive-1:/2/CITY/CITYOFHEROES-2012-09# ls -l [20:53] total 21136664 [20:53] -rw-r--r-- 1 root root 10822246400 Oct 11 19:26 BOARDS-COH-05.tar [20:53] -rw-r--r-- 1 root root 84149 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.json.gz [20:53] -rw-r--r-- 1 root root 10240 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.tar [20:53] -rw-r--r-- 1 root root 10821470898 Oct 11 20:41 BOARDS-COH-05.tar.megawarc.warc.gz [20:53] OK, so. [20:53] That worked... but there was no progress bar, and no updates. [20:53] So I'll use this for now, but I would definitely add something to indicate work is being done. [20:58] SketchCow: Add --verbose [20:59] It won't show a progress bar, but it will tell you what's taking so long. [20:59] underscor: Do you have a /webshots/alard/webshots.com-user-siebertphotoshop-20121011-225722.warc.gz ? [21:01] Oh! [21:07] joepie91: I'd upload them to IA and give SketchCow a link, [21:08] that's a bit hard [21:08] they're on a server [21:08] :P [21:08] can't get to them now anyway, that server seems offline [21:13] alard: http://p.defau.lt/?rZSWA6hEE1SROnuBQKxtAQ [21:13] Lookin awesome :D [21:13] Great. Ready for more? [21:14] joepie91: :o what was that mispaste about :D [21:15] alard: yep! [21:15] Shall I roll out to bt-download01-11 now too? [21:15] (for roundrobining) [21:15] That would be nice. The tracker picks one of the urls from a list, so it's possible to remove/add urls later. [21:16] Ah, nice! [21:16] ok, I'll work on pushing the config [21:16] is the limit of rsync only for webshot or for all projects? [21:16] I'll need the "cleanup" script too [21:16] flaushy: bt is set to 5 right now, webshots 10 [21:17] would it make sense to switch underscor? 
[21:17] from webshots to bt? [21:18] or are the rsyncs on bt crowded as well? [21:18] Webshots is now uploading over HTTP (once your warrior gets the update). [21:18] Sweet [21:18] so warrior restart time :) [21:18] awesome [21:18] What? [21:19] So wait, stuff is now banging directly into archive? Or something else. [21:19] SketchCow: underscor wants it. [21:19] SketchCow: well, I have 12 machines we can load balance between [21:19] Underscor wants a lot of things, but I like to be included while I'm over here trying to make this machine function. [21:19] so I thought it might be a better idea [21:21] Please at least tell me it's going into http://archive.org/details/webshots-freeze-frame with the same format structure [21:21] (We've been discussing this for a while, but we can change it again if you think it's not a good idea.) [21:21] It's exactly the same. [21:21] It's exactly the same, just that it is roundrobined between 12 boxes instead of a single one [21:21] I trust you'll do the right thing, but if we're using an environment, I just want to know, with my name being mentioned, we're going to shift gears. [21:22] Because then I can focus on it as a "clear out the rest of what we have" instead of "work my ass off on this box trying to make it function for the time being" [21:23] http://p.defau.lt/?rZSWA6hEE1SROnuBQKxtAQ [21:23] Ah, yes. It won't change immediately: the current warriors are still trying to rsync and will keep trying, until they succeed or until they're restarted. [21:23] er [21:23] joepie91: :o what was that mispaste about :D [21:23] is what I wanted to paste [21:23] anyway [21:23] wtf is with my clipboard today [21:24] tl;dr guy makes movie about occupy protests, then starts demanding that videos that reuse parts of his movie are taken down [21:24] let me find the full paste [21:25] drop to -bs [21:26] Anyway, I am all for solutions that increase the bandwidth away from FOS, which is meant to be a buffer of 20tb for incoming data, but doesn't function as well as it could as something to blow 50tb of insanity in, do operations on, and blow out. [21:27] SketchCow: what is the main bottleneck for fos? [21:27] I just need to know that's what's going on so I know I'm bailing water out of a bathtub for a little, and not trying to rescue a sinking ship. [21:27] FOS is a virtual box that does about 20 things. [21:27] So the bottleneck for FOS is FOS [21:27] Oversubscription. [21:27] In this particular case, we had the same disk being used for file writes, file compilation, and file reads [21:28] Which is normally not THAT big a deal but it was doing a LOT, and we had 900+ rsyncs [21:28] Eventually swap exploded [21:28] and everything goes to hell [21:28] Webshots on FOS is now sizzling out, but bt internet is still using rsync. But that's so small it's probably something to keep there? [21:28] I expect so, yes. [21:28] Webshots is TOO DAMN HIGH [21:29] so basically disk I/O is the bottleneck? [21:29] or the main one, at least [21:29] It's one of them. [21:29] hmm [21:29] let me think about this for a moment [21:29] I guess if we're looking to find out, we can circle the sizzling wreck and waste a few days determining why. [21:29] No, don't think about it. [21:29] Think about things and projects that need saving. [21:29] there's not much ability to save if the library is burning down [21:30] is there a script for bt as well? [21:30] underscor has twice your brainpower, and 400x your resources (200x mine) and has an unhealthy compulsion to optimize. [21:30] He'll fix it. 
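In curl terms, the HTTP upload the webshots warriors are switching to is just a PUT to one of the round-robined boxes, following the URL pattern underscor pasted earlier (the filename here is illustrative):

    curl -T webshots.com-user-example-20121011.warc.gz \
        http://bt-download00.us.archive.org:8302/webshots/example/webshots.com-user-example-20121011.warc.gz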
[21:30] * underscor giggles giddily [21:31] He literally cuddles with the internat archive infrastructure. [21:31] * underscor whistles innocently [21:31] ... not sure why you seem so strongly opposed to my decision to invest some of my _own_ time and thought into finding a possible solution [21:31] but, but, but, petaboxen are so cute~ [21:31] I personally don't really care who has more brainpower or infrastructure - more people thinking about it instead of watching random series because boredom, means more chance of a solution [21:31] This was a rare case where miscommunication, exacerbated by a red-eye flight, meant that I fell out of the loop of a solution set. [21:31] And got surprised, and whined. [21:32] SketchCow can't stand WWIC [21:34] The teamarchive/FOS machine will now get 8gb of swap instead of 2. [21:34] SketchCow: What script do you use to inject these into IA? [21:34] (and can I get it plz) [21:35] I have a custom injector that uses a s3 call. [21:35] * SmileyG sighs [21:35] still borked? :( [21:35] Before we do this with your round-robin thing. [21:35] What's still borked. [21:36] Anyway, before we do this with your round-robin thing, I think we need to decide if megawarc is ready for production. [21:36] my bt uploads by the look of things - looking at backlog now [21:36] Not borked. [21:36] It was being held at a limit, a limit which I will shortly lift as we move webshots over to a round-robin, and as FOS gets 4x the allocated RAM [21:38] Ah ok, I presumed it was the number of rsyncs due to the BT one being so fast that was the issue (i'd fill my queue in 30~ seconds). [21:39] Also [21:39] http://blog.archive.org/2012/10/10/the-ten-petabyte-party/ [21:39] If you're in SF, go eat some foods [21:39] I wish. [21:40] * joepie91 is practically on the other side of the world [21:40] Now, I want to discuss the format we put webshots in. [21:40] Probably every non-SF person here is wishing they'd be there now :b [21:40] My attention is gripped a little by seeing what the result of the megawarc program is. [21:40] http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5 [21:41] So first, let us see what the result of the derive is. [21:41] It's an 11gb megawarc, so it will take a few minutes. [21:42] what is a megawarc? [21:42] Could you teach the deriver to unpack the tar files? [21:42] soultcer: No [21:42] I sat in meetings across a week on it. [21:42] teaching the deriver anything is a major undertaking [21:42] It's not the deriver, it's the wayback machine. [21:42] It's a mess. [21:42] ah [21:43] So it's easier to generate a .warc.gz file that cats up all the other warcs in a specific way. [21:43] the way I take it, WBM indexes tar files that remain on petaboxes? [21:43] thus there's one copy of the WBM data or something [21:43] No, it's weirder. [21:43] It's all so weird. [21:43] s/tar/warc.gz/ [21:43] As much as we want me to go into the substance of this, here we go. [21:43] I see three audiences for our data. [21:43] 1. Wayback Machine [21:44] 2. The individuals who had their data on the thing, wanting their shit back [21:44] 3. Historians from The Future, with The Future being 1 hour to forever from now. [21:44] yeap [21:44] agreed. [21:45] So, the problem is, 1. is very, very, very old school and was designed from the ground up along a whole range of very carefully decided "things". [21:45] It is also, being from a non-profit, not overly packed with dev teams. [21:45] This translates to "it takes things a certain way" [21:45] picky, brittle. 
[21:45] It's possible to go 'well, leave things as they are, and make a second version' [21:45] And we're doing that for the moment with some items, for the sake of stepping into it slowly. [21:46] Obviously that doesn't work with MobileMe. [21:46] Now, I asked MobileMe to miss this current load-in to Wayback, because whatever decision/process is made becomes a 274tb decision. [21:47] do you have slides for a 5 minute presentation why you should join the archiveteam? i am going to a small congress from the ccc tomorrow [21:47] No, just links to my talks. [21:48] flaushy: hmmmmm not that I'm aware of - watch Jasons defcon speach and talk about Soy Sauce? [21:48] could be good enough :) [21:48] http://www.us.archive.org/log_show.php?task_id=127610039 [21:48] soy sauce itself is >5 minutes :P [21:48] Can you guys see that? [21:48] yes SketchCow [21:48] I can [21:48] Ok, so that's the deriver working with a megawarc. [21:48] need login [21:49] Get a damn library card, buddy! [21:49] ^ [21:49] They're freeeeee [21:49] sweet, 1.8gb already on the first node! [21:49] cc alard, SketchCow [21:49] OK, so turning from that experiment, and still waiting to make sure it works..... [21:49] ...I'd like to consider a process where we generate the megawarc by default. [21:50] And upload THOSE as webshots. [21:50] So my current process is "grab 50gb of these delightful picture warcs, .tar them, and then shove them on the internet archive." [21:51] underscor: My uploads are going really fast. [21:51] But those .tars are good for the 2. (individuals) with a LOT of help from additional alard scripts, and 3. And not good for 1. [21:51] alard: that's a good thing, right? [21:51] hehe [21:51] Your uploads are going to boxes that aren't maxed out to misery [21:51] underscor: Yes. :) [21:51] :D [21:51] SketchCow: hahahah [21:51] We'll see how long it lasts. [21:51] if we start with megawarcs, it's possible to make a tool that does range-requests and gets chunks in the middle [21:52] http://maelstrom.foxfag.com/munin/us.archive.org/bt-download00.us.archive.org/if_eth0.html [21:52] SketchCow: can we create some kind of "index" of the megawarc which we could feed into 1. (and use for 2.) [21:52] So I guess the question I pose to alard is, if we generate megawarcs, how hard would it be to write something that takes a link to the megawarc and returns your warc inside it? [21:52] SmileyG: The megawarc, by DEFINITION, works with 1. and 3. [21:52] And if it's in the Wayback, it helps 2. [21:53] SketchCow: ah duur failing to read. [21:53] SmileyG: yes, there is an index. it's called a cdx. [21:53] So in THEORY, this would be fine. [21:53] deriver makes it iirc [21:53] The json file tells you where each file is, with byte ranges. [21:53] This is why I shoudln't irc while dying. [21:53] how about not dying [21:53] So it will tell you that user-x.warc.gz is in the big-warc from bytes a-b. This byte range you can feed to http://warctozip.archive.org/, for example. (This is how the tabblo/mobileme search things work.) [21:54] OK. [21:54] Or you could do a curl with a byte range to get the warc.gz, if you don't like zip. [21:54] So it SHOULD be possible with current tools to assist 2. [21:54] Or some minor scripting to access current tools. [21:54] Yes. [21:54] current tools or minor changes, yes [21:54] Ok. [21:54] Then yes, we're going to: [21:54] * SmileyG has other things on his plate hes thinking about. Time to disappear again. [21:55] 1. 
Start pushing webshots up from underscor's Circle-Jerk to archive.org as native megawarcs [21:55] 2. See about (carefully) converting both previous webshots and mobileme to native megawarcs. [21:55] 99. Geocities? [21:56] Geocities as we did it will never go into the wayback. [21:56] Never? drat [21:56] As we did it. [21:56] nope, we didn't manage to collect enough metadata to put it into warc [21:56] In THEORY, we could generate warcs with some sort of obviousness that it could pull in. [21:56] we can't "redo" it though so.... [21:56] hmmm, as long as its "as" accessable as the others then shrug. [21:57] But man, I don't want to stress the IA infrastructe with THAT project this exact moment. [21:57] And by infrastructure I mean people. [21:57] wtf is hitting my keyboard o_O [21:57] sperm [21:57] worrying. [21:57] It dries [21:57] then its all crispy and the keys get stuck :< [21:58] check #archiveteam-spermclean [21:58] Read the FAQ [21:58] stupid idea: set up haproxy on shitty unmetered gbit vps, proxy to various backends [21:58] lol sorry, dragging this off topic ¬_¬; Really am going away, just gonna watch the convo unless someone on the internet turns out to be wrong. [21:58] upload over HTTP [21:58] joepie91: We did that way back when [21:58] shitty unmetered gbit vps <<< Howm uch $$$? [21:58] It was hilarrrrrrrrrrrious [21:59] SmileyG: not necessarily that much [21:59] expect disk I/O etc to suck though, but that doesn't matter if it's just a proxy [21:59] joepie91: Shitty unmetered VPS have one problem: In the end they are still a shitty VPS. [21:59] We did it on batcave, as I recall [21:59] soultcer: shitty in the sense of everything but the bandwidth sucks [21:59] :p [21:59] SketchCow: what were the results? [21:59] Oh, it was very effective [21:59] That was to fix network weirdness, where uploads directly to s3.us.archive.org were much slower than uploads proxied through batcave, also on archive.org. [22:00] joepie91: But I think the HTTP upload from the warriors works fine now, without proxy stuff. The tracker redirects to one of the upload servers. [22:01] SketchCow: Does the Internet Archive hire remote workers? [22:01] alard: alright [22:01] so... the upload problems should be solved, or? [22:01] Yes, for the time being. :) [22:02] Update your Webshots scripts, if you aren't using a warrior. [22:06] So, alard, is there a way to make a megawarc generator that just takes a directory instead of a tar? [22:08] That depends on your definition of "megawarc". As it is now, the json contains tar headers and the position of the warcs in the original tar file. You could leave that out, though, and keep the properties that are useful for indexing the big-warc. [22:08] What would be the best way to get the filenames to the megawarc script? Use find and pipe to stdin? [22:09] (There may be too many files to go as command line arguments.) [22:09] I am comfortable with doing a tar to stdin... :) [22:10] or to stdout, I guess you might say [22:10] Make the script recursively search a directory for warcs? [22:10] Well, piping tar into the megawarc script won't easily work, since the script needs two passes over each warc file. (Once to check if it can be decompressed, once to copy it to the big-warc.) [22:11] Well, I assumed a different script taking a different approach. [22:11] Yes, but I think you want the gzip test. If you don't have that test one invalid file can ruin the whole warc. [22:11] I mean, let's back it up. 
What I'd like is a way to take a directory, instead of a .tar, and make it a megawarc. [22:11] However that's done, I approve. [22:12] So you want to stop even creating the tars for new projects and just upload warcs to the IA plus a small tar for the logfiles? [22:12] It's expensive as shit, but making a .tar file, and then running megawarc against it, then removing the tar file and uploading the megawarc files....I could live with that. [22:12] that might be smartest. [22:13] I think the tar isn't really necessary, especially not if you don't want to 'reconstruct' a tar that never was from the megawarc. [22:13] Or you could use made-up tar headers. [22:13] Well, let's think about it. [22:14] DO we want the .tar file? By reconstructing later, you have a nice standard collection of the files. [22:14] And you can pull things from it. [22:14] Well we need some way to store "these records in the warc file belong to a single user" [22:18] So, how do we feel about that? I think a .tar existing somewhere along the line works very well for what we want to do. [22:18] because then the .tars can go into The Next Thing After Internet Archive [22:19] TNTAIA, for short [22:19] But then you have to store both the tar file and the megawarc [22:19] No, no. [22:19] you are using a .tar as the intermediary instead of the file directory to generate the megawarc [22:20] So in come the piles o' files [22:20] At some point, you have a 50gb collection (say) [22:20] You make it a .tar [22:20] You megawarc the .tar [22:20] you upload the megawarc. [22:20] Now the thing's been standardized out past the filesystem [22:20] And can be turned into 50gb chunks in the future on your holocube 2000x [22:21] How about instead of creating a .tar and megawarcing it, you directly create the megawarc from the 50gb of source files? [22:21] This is what we just discussed. [22:22] Oh, I thought you wanted to keep the "create a tar and megawarc it" step [22:22] I asked about that possibility, but it does lead to concerns. [22:22] by making something a .tar and then making it a megawarc, we have an intermediary thing it's converted back into that's able to be manipulated by other programs. [22:23] And I am saying, I think this is a good idea for future extensibility. [22:23] 1. 2. and 3. are all handled. [22:23] Even if you skip the tar step, you can later convert it back to a tar file. [22:24] Though as long as going through the tar step doesn't create much of a bottleneck, it is probably nice to use tools that already exist and that do one thing well [22:24] http://archive.org/details/archiveteam-city-of-heroes-forums-megawarc-5 [22:24] OK, so update. [22:24] It definitely made a CDX [22:25] nice! [22:25] SketchCow: So is that what you want to do? fill up 50gb->tar->megawarc->ingest->rinse/repeat? [22:26] That is my proposal - I can give you scripts that I wrote and which alard wrote. [22:26] okay, awesome [22:26] But first, I want to have us discuss 50gb->megawarc->ingest [22:26] Because that was also on the table. pros and cons. [22:26] alard gave me the "watcher" script [22:26] but I don't have a tar/s3-er [22:26] that moves things into a temp dir [22:27] Right. [22:27] No, I'll give you those, but first I want this decided. [22:27] Also, I asked people to verify the CDX just generated. [22:27] Because if it just made borscht, more borscht is not a buddy. [22:27] I'm also about to restore that megawarc-5 to see what happens.
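One easy way to verify the restore SketchCow is about to run: keep a checksum of the original tar and compare it against the reconstruction, which should be bit-for-bit identical (filenames as in the log; the .md5 file is an extra of this sketch):

    md5sum BOARDS-COH-05.tar > BOARDS-COH-05.tar.md5   # taken before the original tar is removed
    # ... megawarc convert, upload, megawarc restore ...
    md5sum -c BOARDS-COH-05.tar.md5                    # passes only if nothing was lost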
[22:29] SketchCow: As I said, it would be easy to modify alard's megawarc creator so it directly takes a directory of small warcs/wget logs and creates the same output (minus some tar metadata, that isn't necessary) [22:30] As alard said, one corrupted gz makes it not work [22:31] soultcer: Just thought that it should be possible to add tar headers as well. Let Python create them, as if it is making a tar. [22:31] I'll have a look. [22:31] Let's put it this way. [22:31] Other question: it's possible to put the extra-tar and the json inside the big-warc. Is that useful? [22:31] alard: Would work, but why would we need the tar headers anyway? It's mostly metadata about the filesystem on fos. [22:31] You'd have one file, but the index would be less accessible. [22:32] alard: It makes it harder to decipher later. I'd keep it outside [22:32] soultcer: It has timestamps. [22:32] and it makes it easier to make a tar. [22:33] alard: I don't really see why we would need any of the tar metadata, but it would of course be possible to create some of it, but I have no idea how to create the tar header string you have in the json file [22:34] SketchCow: I'm still having rsync issues for btinternet - does that happen for everyone? [22:34] The issue is not everything we add is a warc.gz [22:34] Sometimes it's going to be additional 'stuff' [22:35] for every single job: @ERROR: max connections (5) reached -- try again later [22:35] SketchCow: The additional files are simply put into a single tar archive [22:35] btinternet just got more love [22:36] looks fixed now, thanks [22:36] :P [22:36] Together with the metadata from the json file, the tar archive with the additional files and the megawarc file, you can recreate the original directory structure, or create a tar archive with all files [22:36] whoa [22:36] starts scrolling like mad [22:36] lol [22:37] looks like people had a lot in queue [22:37] soultcer: I'm going to again defer to alard's opinion on this. [22:37] especially Sue [22:37] *cough* [22:37] Incoming crap - ends up as megawarc [22:37] I just want the megawarc that results to be useful to the historians and the individuals as much as it is to wayback. [22:38] And the wget logs I assume? [22:39] I don't want anything being lost [22:41] lol, at this pace, btinternet will be done in 30 minutes [22:47] alard: So what do you think? Bundle as tar, then megawarc; or directly create the megawarc? [22:50] Give him a moment, I see he's been coding some stuff related to uploads. [22:51] Sure. [22:53] The thing with the tar is: It includes a lot of metadata on who created the tar file (uid/gid), when it was created (mtime/ctime) and the filesystem permissions. I am not sure if we want to keep those, or not [23:06] I've reconstructed a .tar from the megawarc. [23:06] Now unpacking it to see if everything comes out ok. [23:07] They should be bit-for-bit copies I think [23:07] Absolutely. [23:08] Regardless, I am doing what someone in 10 years would be doing. [23:17] i'm sorry for what i did to btinternet [23:17] they got graped [23:18] i had like [23:18] 300-400 rsync jobs queued up [23:18] apparently 17G woth [23:18] *worth [23:18] jesus [23:18] it's about to be done [23:21] New version: https://github.com/alard/megawarc [23:21] I renamed megawarc build TAR to megawarc convert TAR (seemed more logical). [23:22] There's now also a megawarc pack TAR FILE-1 FILE-2 ... option that packs files/paths directly. [23:22] You still need to provide TAR to make the file names, but that tar doesn't exist. [23:23] E.g. 
./megawarc pack webshots-12345.tar 12345/ should work. [23:24] Then ./megawarc restore webshots-12345.tar would give you a tar file. [23:24] alard: Nice work. I was just thinking about simply using the TarInfo class to create the tar_header structure, but I see you not only thought of it faster, you implemented it while I was still thinking about the details ;-) [23:27] I copied most of it from Python's tarfile.py. [23:27] Good programmers code, better programmers reuse ;-) [23:28] btinternet is now in the negative [23:29] ooo [23:29] 100MB btinternet user incoming [23:29] ... wat [23:29] how's that even possible? [23:29] means they paid for premium [23:29] SketchCow: but premium users are on a separate server [23:29] A la geocities and a few others, the old address is kept while the premium address goes up. [23:29] recursion! [23:29] We found 1gb geocities users [23:29] ah [23:30] Time to find more usernames then, (There are also 1185 usernames still claimed, over 1000 by Sue.) [23:30] wonder how they did that though [23:30] because there's a separate server for all premium users [23:30] two IPs away from the free server [23:30] over 1k by me? must be a glitch [23:30] Are all your instances finished? [23:31] i'm still doing probably 20-30 [23:31] the screen isn't full of no item recieved yet [23:31] mine is [23:31] or well [23:31] alternating between no item received and tracker rate limiting [23:31] lol [23:33] can you release items per user? that's strange that i have so many... [23:33] I've put them back in the queue. [23:34] ok [23:34] And with that I'm off to bed. Bye! [23:34] Thanks again, alard [23:34] i hunger for more [23:35] suddenly, 5mbit! [23:35] goodnight alard [23:35] :) [23:39] wow so unless we find way more users, all of btinternet will fit on a microsd card [23:39] INCREASING COSTS [23:40] Huh, someone recorded alard's process for programming new code. [23:40] http://www.youtube.com/watch?feature=fvwp&NR=1&v=8VTW1iUn3Bg [23:40] Screencap's gotten good [23:40] (He's the one in the glasses) [23:43] i'm out of users, 14 left downloading [23:50] That's to be expected. [23:57] Internet Archive's teams have signed off on the megawarcs. [23:57] So guess what - FOS is making a ton of fucking megawarcs tonight.
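The nightly routine this sets up boils down to roughly the following; the batch and item names are placeholders, the s3.us.archive.org headers are the standard IA S3-style ones, and SketchCow's actual injector script may well differ:

    # bundle ~50 GB of finished warcs, convert to a megawarc, push to the archive
    tar -cf webshots-batch-0001.tar webshots-batch-0001/
    python megawarc --verbose convert webshots-batch-0001.tar
    for f in webshots-batch-0001.tar.megawarc.*; do
        curl -T "$f" \
            -H "authorization: LOW $IA_ACCESS_KEY:$IA_SECRET_KEY" \
            -H "x-archive-auto-make-bucket: 1" \
            http://s3.us.archive.org/webshots-megawarc-0001/"$f"
    done
    rm webshots-batch-0001.tar   # the megawarc trio can reproduce it exactly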