[00:09] chronomex: yeah
[02:19] SketchCow: I didn't know you helped with the On Piracy documentary
[07:08] So I sub-taught today for a teacher that had ~20 years of disks/docs of educational software
[07:08] Should've left a note asking if I could've scanned/dumped them :P
[07:13] You can always tell him.
[07:15] ya
[07:15] sounds like a good haul
[07:16] Can't guarantee the software, but if she was hanging onto manuals for "Educational Software on System 7.5" I assume it's kicking around
[07:20] I figured it'd be interesting, since AFAIK educational/industrial software is hard to come by
[07:22] It is.
[07:22] Hmpf, Oracle seems to have deleted content from Sun wikis in the merge (looking for some documentation that no longer exists).
[07:47] industrial software is super important for when you inherit a machine
[07:47] as someone who's inherited multiple ancient machines, having that stuff on the net is top-notch exciting.
[07:55] shaqfu: I would have copied all that data without asking
[07:55] But that's perhaps just me
[08:33] GitHub got hacked
[08:34] should we start to do a panic backup?
[08:38] haha
[10:31] they didn't get hacked in an interesting way though
[10:31] git itself prevents hacks like that from meaning anything serious
[10:34] you can't change the underlying data in a git repository without invalidating the Merkle tree and changing all of the commit IDs
[10:35] Rails was hacked :)
[12:03] I need an upload script for fortunecity :(
[12:06] running out of space?
[12:06] no
[12:06] just in case I actually get banned
[12:07] hmm, maybe I can just patch from the mobileme uploading script...
[12:09] grr I could be pwning, if not for underscor
[12:13] Hydriz: Don't upload to the mobileme share just because
[12:13] get banned? from fortunecity? or from what?
[12:14] no
[12:14] Firstly, it's copying and configuring a little of the mobileme-grab uploading script
[12:15] Secondly, the ban would be for abuse of Toolserver resources
[12:15] Toolserver = Wikimedia
[12:15] Toolserver != ArchiveTeam
[12:15] have you been banned?
[12:15] Alrighty, didn't know you were using someone else's machine to grab stuff down to - seems a bit silly
[12:15] 13:05 < Hydriz> I need an upload script for fortunecity :(
[12:16] 13:08 < Hydriz> just in case I actually get banned
[12:16] well, don't abuse the Toolserver then please
[12:16] shyt
[12:16] Hydriz: I'd recommend stopping the downloads on that machine and waiting until you get hold of SketchCow to upload that stuff
[12:16] yeah, I stopped
[12:16] I don't intend to do it forever
[12:16] it's purely retarded
[12:17] though it doesn't really mean banning
[12:17] but it's just conscience
[12:20] So, you are the guy who makes the Toolserver slow!
[12:20] LOL no
[12:20] there are a ton of bots running on the Toolserver
[12:20] mostly with memory leaks
[12:21] and that brings trouble to the Toolserver
[12:21] anyway I just hope I have inspired you guys to grab faster?
[12:22] like the mobileme thingy
[12:23] It's hard to accelerate without running 500 workers
[12:23] since fortuneshitty is so slow
[12:23] yeah
[12:23] it's goddamn slow
[12:24] one whole night and only 22GB
[12:25] seems like you guys are actually grabbing memac faster than before...
[12:25] which is good :)
[12:27] wait, is fortunecity actually closing?
[12:27] oh yes, just saw it
[12:30] Hmm, wanted to check out the FortuneCity account I had in my childhood; the web interface works, FTP doesn't.
[12:30] link?
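(As an aside on the [10:34] point about git's Merkle tree: the way commit IDs depend on content is easy to see with git's plumbing commands. A minimal sketch, using a hypothetical throwaway repository; it assumes git is installed and user.name/user.email are configured.)

    git init /tmp/scratch-repo && cd /tmp/scratch-repo
    echo 'hello' > file.txt
    git add file.txt && git commit -m 'first commit'
    git rev-parse HEAD              # the commit ID
    git rev-parse 'HEAD^{tree}'     # the tree the commit points at
    git hash-object file.txt        # the blob ID = SHA-1 of the file's contents
    # Change a single byte of file.txt and its blob hash changes, so the tree
    # hash changes, so this commit ID and every descendant commit ID change --
    # which is why tampering with repository history is visible to anyone
    # holding an honest clone.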
[12:31] I'm getting Bad Password messages when logging in over FTP, even though my password is correct.
[12:31] FTP details are here: http://www.fortunecity.com/support/
[12:34] Trying with other FTP settings...
[12:40] http://www.ghisler.ch/board/viewtopic.php?t=30283
[12:40] Seems to be a problem for multiple users.
[12:47] hrm
[12:48] Sounds like they have abandoned their servers.
[16:10] Why hello.
[16:10] Yo.
[16:10] Schbirid: Yes, please upload the forums to archive.org and let me know where it went, at jason@textfiles.com.
[17:48] OK, I've had a chat with Brewster regarding MobileMe.
[17:49] MobileMe is a terrible problem we're going to turn into an informative experiment.
[17:49] We need a day to think on it, but there's a very rough plan.
[17:51] 1. Download MobileMe
[17:51] 2. Generate very useful content info about the MobileMe data
[17:51] 3. Put on drives
[17:51] 4. Take drives offline
[17:52] 5. Store drives
[17:52] Then
[17:52] 6. Within X amount of time, bring back drives
[17:52] 7. Leela and Bender put data onto contemporary items
[17:52] 8. Donate old drives to charities and needy
[17:53] Comments and brainstorm welcome.
[17:54] legal problem or space or other problem?
[17:54] space i'm sure
[18:00] Wrt 2., you should contact some researchers to find out what they'd be interested in and whether they're willing to help you, so that you don't forget anything before taking the drives offline.
[18:01] Do you have some contacts with researchers working on IA's content? There's emijrp, and maybe some mailing lists which could be (ab)used, like those for Wikimedia-related researchers and WikiSym (probably among the biggest on user-generated content?).
[18:05] should i include who did the work on the forumplanet.gamespy.com upload or rather not? i do not care but it might be better to not mention it i guess?
[18:07] It's entirely up to you, Schbirid
[18:08] Nemo_bis: what do you think?
[18:08] ok
[18:08] you are the only other person on it :)
[18:08] Schbirid, don't put me
[18:08] ok
[18:17] Don't put me bro
[18:27] PUT Nemo_bis HTTP/1.1
[18:27] HTTP/1.1 405 Method Not Allowed
[18:28] hem
[18:32] and 418 for me
[18:32] haha
[18:32] http://httpbin.org/status/418
[18:34] what about 402
[18:34] I was gonna say that but now i have to say 429
[18:34] 418 is the only one with a content body
[18:36] this netlabel rocks http://www.blocsonic.com/releases/show/dig-deep
[18:38] 402 surely not
[18:44] SketchCow: For your discussion: if you do intend to download all of MobileMe, it would be necessary to:
[18:44] 1. Make the upload to IA faster. (The 50MB/s upload we got to S3 via batcave is not fast enough.)
[18:44] 2. Speed up the downloads. (kennethre was going fast, but maybe not fast enough. That also depends on the upload speed.)
[18:44] 3. Collect more usernames, because we have enough to keep us going for a while, but certainly not everything.
[18:44] But of course, not doing any of that will still produce lots of data.
[18:53] alard: we need that ED guy back
[18:54] Yes, that would help. But only if the upload is faster, too.
[19:03] well he can probably keep loads of stuff locally
[19:04] upload's only an issue for heroku runners
[19:10] How long does stuff last on unpowered hard drives?
[19:26] i'd think longer than powered
[19:27] I think that multiple S3 uploads are faster than a single S3 upload. Also, if you can stage uploads on EC2, then uploads to S3 can be parallelized and should be much faster.
[19:29] so S3 scales by parallelizing transfers.
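(A rough sketch of the "parallelize the uploads" idea above: run several HTTP PUTs to the IA S3 endpoint at once instead of one serial stream. The item name, staging directory and credential variables here are hypothetical, the item is assumed to already exist, and GNU xargs is assumed for the -P flag.)

    cd /data/warcs                  # hypothetical staging directory of finished files
    ls *.warc.gz | xargs -P 8 -I{} \
      curl --fail --location \
           --header "authorization: LOW ${IAS3_KEY}:${IAS3_SECRET}" \
           --upload-file {} \
           "http://s3.us.archive.org/some-example-item/{}"
    # -P 8 keeps eight PUTs in flight; each single stream may still be slow,
    # but the aggregate throughput scales with the number of streams.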
[19:32] swebb_: We were doing 40-50 uploads to S3. (The IA S3 interface, that is.) Uploading directly was slow, something's wrong there, but through a proxy on batcave it was faster. Yet we weren't able to push that to more than 50MB/s.
[19:32] Aah. I'm not that familiar with the IA S3.
[19:33] Well, you just do an HTTP PUT with some extra headers. The problem is that, for some unknown reason, uploads directly to http://s3.us.archive.org/ never go beyond 300-400kB/s.
[19:34] Via the batcave-proxy route you can do 10MB/s with one upload stream. But that wouldn't go faster than 50MB/s in total. Which is fast, but not fast enough.
[19:34] BTW, if anyone's interested, I had to research a solution to a problem we were having at work streaming tons of data across long-haul internet spans (like TCP traffic over the internet from the US to Europe) and found this: http://udt.sourceforge.net/ Might be interesting for streaming completed data to a central location at faster speeds.
[19:39] It's possible that batcave is a Linux box or something with a better-tuned TCP stack than the s3.us.archive.org server. TCP latency can be a bitch if not tuned very well.
[19:39] if s3.us.archive.org is on the same LAN as batcave, then it won't suffer from internet latency.
[19:48] damn, the slides in "Meredith Patterson and Sergey Bratus - The Science of Insecurity" are so complex they are actually distracting me
[19:48] Meredith!
[19:48] Love that girl, hugged her madly last year
[19:49] :)
[19:53] ok it is way over my head
[19:54] any other recommendations from http://www.shmoocon.org/presentations ? i did watch the awesome "Kristin Paget - Credit Card Fraud: The Contactless Generation" so far
[20:01] swebb_: From where I'm standing, it looks like shaping or some kind of bottleneck in the network path
[20:03] ersi: It could very well be, but latency looks like traffic shaping also. The solution: move to something other than TCP for the transfer to a machine physically close to the slow server, then use TCP or whatever to move the data from there.
[20:05] Hmm, now that you say that - yes, latency actually does. It's one of those 'bottlenecks' I vaguely mentioned but didn't think about
[20:06] You can tune your Linux machine to be less picky about a high-latency connection, but you need to change things on both ends.
[20:06] cd /proc/sys/net/core
[20:06] echo 10000000 > rmem_default
[20:06] echo 10000000 > rmem_max
[20:06] echo 10000000 > wmem_default
[20:06] echo 10000000 > wmem_max
[20:06] How about intermediates?
[20:06] cd ../ipv4
[20:06] echo '4096 10000000 10000000' > tcp_mem
[20:06] echo '4096 10000000 10000000' > tcp_rmem
[20:06] echo '4096 10000000 10000000' > tcp_wmem
[20:06] cd route
[20:06] echo 1 > flush
[20:07] Those settings will help latency on all TCP traffic on any Linux machine.
[20:09] or just use UDP (or that UDT tool) to get the data 'close' to the S3 server, so it can stream quickly from there.
[20:10] swebb_, we've already mentioned that
[20:11] that "get data close" thing is already being done by relaying through batcave, which is on the same network. I don't think it is the same LAN segment, but it's within the same building.
[20:11] I was told it wasn't going to be useful. In fact, I did some experiments from a machine with 200 ms latency to the IA S3 servers and about 100 Mb/s bandwidth.
[20:11] And there was some small result, but not much; usually it flowed between 50 and 1000 kB/s, averaging maybe 500, and reached 2000-3000 kB/s at the top, or 1500 as the highest average (IIRC).
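(The /proc tweaks quoted above can also be written as sysctl invocations, which are easier to drop into a script or /etc/sysctl.conf. This is just a restatement of the values from the chat, not tuning advice, and as noted above it has to be run as root on both ends of the connection.)

    sysctl -w net.core.rmem_default=10000000
    sysctl -w net.core.rmem_max=10000000
    sysctl -w net.core.wmem_default=10000000
    sysctl -w net.core.wmem_max=10000000
    sysctl -w net.ipv4.tcp_mem='4096 10000000 10000000'
    sysctl -w net.ipv4.tcp_rmem='4096 10000000 10000000'
    sysctl -w net.ipv4.tcp_wmem='4096 10000000 10000000'
    sysctl -w net.ipv4.route.flush=1    # flush cached route metrics so the new sizes take effect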
[20:12] I doubt any such trick can do better than the batcave proxying.
[20:12] http://animalnewyork.com/2012/02/the-department-of-homeland-security-is-searching-your-facebook-and-twitter-for-these-words/ cool, my nick is on the list
[20:13] mmm
[20:13] Nemo_bis: where is your machine physically located?
[20:13] swebb_, that was in Italy. GARR network.
[20:15] I now desire a chocolate and vodka concoction
[20:15] It looks like the s3.us.archive.org machine is located in San Francisco. So Italy <--> SF will definitely suffer from latency issues if you're using TCP.
[20:15] of course
[20:16] the point is how you can improve throughput over a high-latency connection
[20:16] Don't use a protocol that waits for a return packet before sending a second packet (i.e. use UDP).
[20:16] http://www.archiveteam.org/index.php?title=User:Nemo_bis contains some info
[20:16] and what alternatives are there for s3-like upload?
[20:18] S3 requires an HTTP PUT, so you're limited to TCP there.
[20:18] well, there is the ftp interface
[20:19] FTP would work. I didn't know about that one.
[20:19] FTP is HORRIBLE
[20:19] the new beta web interface is just a pretty front end on the s3 interface
[20:20] the ftp interface is rather annoying, though, since you have to "check out" the item, which puts it on an FTP server, then interact with the FTP server to do whatever, and then finally "check in" the item, which pushes it back into the CMS
[20:20] Blech.
[20:21] FTP might be horrible, but that turd on steroids is working somewhat
[20:22] Nothing works with FTP. Nothing. Nothing.
[20:22] FTP gave me headaches for weeks. http://www.archive.org/post/267549/stale-pureftpd-upload-file-blocks-checkin
[20:23] Then an angel IA sysadmin told me to use s3 and I was saved.
[20:23] It's funny that you're getting sort of the same speed
[20:24] me? no, it's way faster
[20:24] At least from here.
[20:24] I'd do this: Italy -- [UDT] --> batcave -- [HTTP PUT] --> s3.us.archive.org. Automate the S3 part to just dump everything that's a few days old or something, so people from around the world can use a better protocol to stream data over the internet.
[20:24] Sometimes it maxes my bandwidth (only 10 Mb/s full duplex at home).
[20:26] I don't believe the latency is the issue; it's not just uploading from Italy, it's just as slow if you do an HTTP PUT from Amazon EC2.
[20:27] "The third application used a new protocol called UDX, which is a lightweight variant of the UDT protocol. UDX was able to sustain a data transfer rate of 9.2 Gb/s over a 10 Gb/s connection with a 200ms RTT (which corresponds to a 12,000 mile path, or long enough to reach halfway around the world)."
[20:27] alard: ec2 isn't known for low latency either though. :)
[20:28] No, but it should be fast enough not to have to jump through hoops to get a decent upload speed.
[20:28] Were you using a us-west EC2 instance?
[20:29] since the IA server is in SF, it would be a better connection from the west in EC2.
[20:29] What's the default?
[20:29] The default is on the east coast.
[20:29] Ah, I see. Heroku is in us-east too.
[20:29] correct
[20:30] Still, I think it's silly that you can't get a decent upload to s3.us.archive.org while you can get a decent upload to batcave, which is just around the corner.
[20:30] batcave may be tuned to handle latency better.
[20:30] swebb_: from the same EC2 instance, going to batcave (at IA) was fast, going to the IA-s3 endpoint was slow. Going to the IA-s3 endpoint via batcave was fast.
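(Putting the two halves together: the "HTTP PUT with some extra headers" from [19:33], sent through a relay that sits close to the S3 endpoint. A sketch only: the proxy host/port, item name and filename are made up, it assumes the batcave relay is exposed as an ordinary HTTP proxy, and the authorization/x-amz/x-archive-meta headers follow the IA S3-style API.)

    curl --fail \
         --proxy http://batcave.example.net:8080 \
         --header "authorization: LOW ${IAS3_KEY}:${IAS3_SECRET}" \
         --header 'x-amz-auto-make-bucket:1' \
         --header 'x-archive-meta-mediatype:web' \
         --upload-file fortunecity-chunk-001.tar \
         http://s3.us.archive.org/some-example-item/fortunecity-chunk-001.tar
    # Here the client -> proxy leg is still plain HTTP; the UDT idea above would
    # replace that long-haul leg with something latency-tolerant, keeping HTTP
    # only for the short proxy -> s3.us.archive.org hop.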
[20:31] The whole idea is to get rid of latency issues altogether.
[20:31] ok, nevermind. I can tell that nobody cares. :)
[20:32] swebb_: Well, I'm sure it can be much more efficient and elegant, but indeed, I just want HTTP to be fast enough. :)
[20:32] the proper fix is to get the IA admins to fix whatever the bottleneck on the S3 endpoint machines is.
[20:32] I shouldn't have to care, that's the point. :)
[20:33] the only way your UDP hopscotch method will work is if someone in charge of a machine there OKs it and sets it up
[20:34] Yeah, you'd need an account somewhere physically close to the S3 server.
[20:39] Twitter delivers the firehose over HTTP (TCP) from California at a rate of 70Mbps (peak). People on the east coast and overseas can't consume the stream fast enough due to nothing other than internet latency, and as a result they end up putting machines on the West Coast to consume the stream, then trickle it via other means to a more local server for use.
[20:47] and suddenly: 7MB/s to s3
[20:48] TCP can't be that bad, they must have crappy connectivity
[20:51] Schbirid, with what latency?
[20:51] (that varies as well, in my experience)
[20:51] no idea, it just finished my s3cmd stuff in no time
[20:51] heh
[20:51] was the usual 150 kB/s before
[20:51] lol
[20:51] where is the item
[20:52] currently http://www.archive.org/details/Forumplanet.gamespy.comArchive
[20:53] i uploaded two files by accident, just deleted them, still there
[20:58] ok, the files are correct now
[21:00] no, still one planetfallout too many
[21:01] and planetdoom
[21:34] heh, SketchCow, i was just going to mail you about it. thanks for moving it!
[21:54] kennethre: I made a Heroku version of the fortunecity scripts, so if you'd like to join, feel free... :) https://github.com/ArchiveTeam/fortunecity
[21:54] (The same holds for other people, by the way. You can run one dyno for free.)
[21:55] nice :)
[21:55] one dyno *per app* for free
[21:55] so everyone could run 100 of those each
[21:55] Yes.
[21:55] :)
[21:55] but you know, don't exploit it
[21:56] The rsync in your buildpack doesn't work, for some reason. It crashes, so I created a new one.
[21:59] alard: it does, i just didn't compile it right
[22:32] "Stipend: None"
[22:32] Fuck you, LoC, and your unpaid digitization too
[22:38] one problem with the "move mobileme to offline drives" idea: it makes it harder for people to retrieve whatever data (of theirs) we saved
[22:39] (the GeoCities story of the woman with a site dedicated to her deceased infant son comes to mind)
[22:40] kennethre: Are your memac clients still running?
[22:40] Coderjoe: Is it feasible to keep an index and get the data on request?
[22:40] alard: nah
[22:40] It's still really active on the tracker.
[22:41] shaqfu: only if there is someone willing to pull out one of the offline drives to retrieve the data on request.
[22:42] Coderjoe: Point taken, esp. if it's not all held by one person/entity
[22:42] shaqfu: it would be held at IA.
[22:43] Coderjoe: Oh, hm
[22:43] shaqfu: see around 2012-03-06 00:50 eastern time for more
[22:44] Hunh
[22:44] Closed stacks at the IA, who would've thought
[22:44] alard: where?
[22:44] kennethre: The memac-tamer tracker.
[22:44] alard: i only see 1
[22:44] kennethre: Just set the number of clients to 1. http://memac-tamer.heroku.com/
[22:45] Well, it's really busy with clients asking 'can I start?'.
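(For anyone trying the "one dyno per app for free" route with the fortunecity scripts mentioned at [21:54]-[21:55], the deployment is roughly the following. The app name is made up, the Heroku toolbelt is assumed to be installed and logged in, and the process type "worker" is an assumption - check the repository's Procfile for the real name.)

    git clone https://github.com/ArchiveTeam/fortunecity.git
    cd fortunecity
    heroku create fortunecity-grab-01     # one app ...
    git push heroku master                # ... deployed from this clone
    heroku ps:scale worker=1              # ... running its single free dyno
    # Repeat with a different app name for each additional free dyno,
    # within reason -- as noted above, don't exploit it.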
[22:45] ugh
[22:45] let me see
[22:45] Coderjoe: There's good precedent for it - every paper archive works like that - but I'm not fully sure how it'd work with digital data
[22:47] damnit
[22:47] found the app
[22:47] off now
[22:47] damn
[22:47] You could guess at high-interest data and have that on hand, and closed-stack the rest
[22:49] That way, you have the stuff people usually want readily available, and anything else (low-traffic data, data miners) can put in requests or wait
[22:49] kennethre: Good, less wasteful!
[22:49] (I don't know what's in MobileMe, so dunno what'd qualify as high-value)
[22:50] Well, one thing they could do is separate the web.me.com/homepage.mac.com data from the galleries and the files.
[22:51] web+homepage are 1/4th of the data, but possibly interesting to more people.
[22:53] alard: How large is the entire collection?
[22:53] We once estimated 200TB, but here's the most recent data to do the calculations yourself: http://memac.heroku.com/
[22:56] alard: That'd be a good idea, since I assume a lot of what's on there is referenced by homepages
[22:56] So people can look through the pages, and if there's a gap, they can request it
[22:57] Sites on homepage/web do not, as far as I know, really refer to files on gallery or public. If they do, they probably have a copy of the file on web or homepage.
[22:57] Washington Mutual used to have old statement archives in dark stacks - they dumped them to tape after 6 months or whatever; to see something older you requested it and the tape robot would get it within 24 hours
[22:58] Tape robot? Fancy; the archives I've worked at used high schoolers
[22:58] okay, maybe it was people, I dunno.
[22:59] They probably used a robot; WaMu had more money than all the places I've worked at combined
[22:59] That too
[23:00] The solution may be even easier: just get an account with one of those companies like http://www.bitcasa.com/
[23:01] Unlimited storage for 10 dollars a month, that's what we need. :)
[23:01] Haha
[23:01] sweet
[23:01] that'll do
[23:01] Hello, can we try your beta?
[23:02] Can we borrow your entire line for, oh, a month?
[23:03] Anyway, it's an interesting problem.