#archiveteam 2012-03-06,Tue


Time Nickname Message
00:09 🔗 yipdw chronomex: yeah
02:19 🔗 godane SketchCow: I didn't know you helped with the documentary on piracy
07:08 🔗 shaqfu So I sub taught today for a teacher that had ~20 years of disks/docs of educational software
07:08 🔗 shaqfu Should've left a note asking if I could've scanned/dumped them :P
07:13 🔗 Nemo_bis You can always tell him.
07:15 🔗 LordNlptp ya
07:15 🔗 LordNlptp sounds like a good haul
07:16 🔗 shaqfu Can't guarantee the software, but if she was hanging onto manuals for "Educational Software on System 7.5" I assume it's kicking around
07:20 🔗 shaqfu I figured it'd be interesting, since AFAIK educational/industrial software is hard to come by
07:22 🔗 Nemo_bis It is.
07:22 🔗 Nemo_bis Hmpf, Oracle seems to have deleted content from Sun wikis in the merge (looking for some documentation that no longer exists).
07:47 🔗 chronomex industrial software is super important for when you inherit a machine
07:47 🔗 chronomex as someone who's inherited multiple ancient machines, having that stuff on the net is top-notch exciting.
07:55 🔗 ersi shaqfu: I would have copied all that data without asking
07:55 🔗 ersi But that's perhaps just me
08:33 🔗 godane github got hacked
08:34 🔗 godane should we start to do a panic backup?
08:38 🔗 ersi haha
10:31 🔗 db48x` they didn't get hacked in an interesting way though
10:31 🔗 db48x` git itself prevents hacks like that from meaning anything serious
10:34 🔗 db48x` you can't change the underlying data in the git repository without invalidating the Merkle tree and changing all of the commit ids
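
(Context: a git object id is the SHA-1 of a length-prefixed header plus the content, and commits hash their tree and parents in turn, so altering any byte changes every id above it. A minimal sketch, assuming git is installed; the first hash is the well-known blob id for "hello\n":)

    $ printf 'hello\n' | git hash-object --stdin
    ce013625030ba8dba906f756967f9e9ca394464a
    $ printf 'hello!\n' | git hash-object --stdin
    # one changed byte -> a completely different id, and every commit
    # that (transitively) references this blob gets a new id too
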
10:35 🔗 kennethre rails was hacked :)
12:03 🔗 Hydriz I need an upload script for fortunecity :(
12:06 🔗 ersi running out of space?
12:06 🔗 Hydriz no
12:06 🔗 Hydriz just in case I actually get banned
12:07 🔗 Hydriz hmm, maybe I can just patch from the mobileme uploading script...
12:09 🔗 Hydriz grr I could be pwning, if not for underscor
12:13 🔗 ersi Hydriz: Don't upload to the mobileme share just because
12:13 🔗 ersi get banned? from fortunecity? or from what?
12:14 🔗 Hydriz no
12:14 🔗 Hydriz Firstly, it's copying and configuring a little of the mobileme-grab uploading script
12:15 🔗 Hydriz Secondly, the banning would be for abuse of the Toolserver's resources
12:15 🔗 Hydriz Toolserver = Wikimedia
12:15 🔗 Hydriz Toolserver != ArchiveTeam
12:15 🔗 emijrp have you been banned?
12:15 🔗 ersi Alrighty, didn't know you were using someone else's machine to grab stuff down to - seems a bit silly
12:15 🔗 ersi 13:05 < Hydriz> I need an upload script for fortunecity :(
12:16 🔗 ersi 13:08 < Hydriz> just in case I actually get banned
12:16 🔗 Schbirid well, dont abuse toolserver then please
12:16 🔗 Hydriz shyt
12:16 🔗 ersi Hydriz: I'd recommend stopping downloading from that machine and wait until you get a hold of SketchCow and upload that stuff
12:16 🔗 Hydriz yeah, I stopped
12:16 🔗 Hydriz I don't intend to do it forever
12:16 🔗 Hydriz it's purely retarded
12:17 🔗 Hydriz though it doesn't really mean banning
12:17 🔗 Hydriz but it's just conscience
12:20 🔗 emijrp So, you are the guy who makes the Toolserver slow!
12:20 🔗 Hydriz LOL no
12:20 🔗 Hydriz there are a ton of bots running on the Toolserver
12:20 🔗 Hydriz most of them with memory leaks
12:21 🔗 Hydriz and that brings trouble to the Toolserver
12:21 🔗 Hydriz anyway I just hope I have inspired you guys to grab faster?
12:22 🔗 Hydriz like the mobileme thingy
12:23 🔗 ersi It's hard to accelerate without running 500 workers
12:23 🔗 ersi since fortuneshitty is so slow
12:23 🔗 Hydriz yeah
12:23 🔗 Hydriz it's goddamn slow
12:24 🔗 Hydriz one whole night and only 22GB
12:25 🔗 Hydriz seems like you guys are actually grabbing memac faster than before...
12:25 🔗 Hydriz which is good :)
12:27 🔗 Hydriz wait, is fortunecity actually closing?
12:27 🔗 Hydriz oh yes, just saw it
12:30 🔗 eprillios Hmm, wanted to check out my FortuneCity account I had in my childhood, web interface works, FTP doesn't.
12:30 🔗 emijrp link?
12:31 🔗 eprillios I'm getting Bad Password messages when logging in over FTP even though my password is correct.
12:31 🔗 eprillios FTP details are here: http://www.fortunecity.com/support/
12:34 🔗 eprillios Trying with other FTP settings...
12:40 🔗 eprillios http://www.ghisler.ch/board/viewtopic.php?t=30283
12:40 🔗 eprillios Seems to be a problem for multiple users.
12:47 🔗 db48x hrm
12:48 🔗 eprillios Sounds like they have abandoned their servers.
16:10 🔗 SketchCow Why hello.
16:10 🔗 swebb_ Yo.
16:10 🔗 SketchCow Schbirid: Yes, please upload the forums to archive.org and let me know where it went, at jason@textfiles.com.
17:48 🔗 SketchCow OK, I've had a chat with Brewster regarding MobileMe.
17:49 🔗 SketchCow MobileMe is a terrible problem we're going to turn into an informative experiment.
17:49 🔗 SketchCow We need a day to think on it, but there's a very rough plan.
17:51 🔗 SketchCow 1. Download MobileMe
17:51 🔗 SketchCow 2. Generate very useful content info about mobileme data
17:51 🔗 SketchCow 3. Put on drives
17:51 🔗 SketchCow 4. Take drives offline
17:52 🔗 SketchCow 5. Store drives
17:52 🔗 SketchCow Then
17:52 🔗 SketchCow 6. Within X amount of time, bring back drives
17:52 🔗 SketchCow 7. Leela and Bender put data onto contemporary items
17:52 🔗 SketchCow 8. Donate old drives to charities and needy
17:53 🔗 SketchCow Comments and brainstorm welcome.
17:54 🔗 Schbirid legal problem or space or other problem?
17:54 🔗 kennethre space i'm sure
18:00 🔗 Nemo_bis Wrt 2., you should contact some researchers to find out what they'd be interested in and whether they're willing to help you, so that you don't forget anything before taking the drives offline.
18:01 🔗 Nemo_bis Do you have some contacts with researchers working on IA's content? There's emijrp, and maybe some mailing lists which could be (ab)used, like those for Wikimedia-related researchers and WikiSym (probably among the biggest on user-generated content?).
18:05 🔗 Schbirid should i include who did the work on the forumplanet.gamespy.com upload or rather not? i do not care but it might be better to not mention it i guess?
18:07 🔗 SketchCow It's entirely up to you, Schbirid
18:08 🔗 Schbirid Nemo_bis: what do you think?
18:08 🔗 Schbirid ok
18:08 🔗 Schbirid you are the only other person on it :)
18:08 🔗 Nemo_bis Schbirid, don't put me
18:08 🔗 Schbirid ok
18:17 🔗 SketchCow Don't put me bro
18:27 🔗 DoubleJ PUT Nemo_bis HTTP/1.1
18:27 🔗 DoubleJ HTTP/1.1 405 Method Not Allowed
18:28 🔗 Nemo_bis hem
18:32 🔗 Schbirid and 418 for me
18:32 🔗 kennethre haha
18:32 🔗 kennethre http://httpbin.org/status/418
18:34 🔗 DFJustin what about 402
18:34 🔗 tef I was gonna say that but now i have to say 429
18:34 🔗 kennethre 418 is the only one with a content body
18:36 🔗 Schbirid this netlabel rocks http://www.blocsonic.com/releases/show/dig-deep
18:38 🔗 Nemo_bis 402 surely not
18:44 🔗 alard SketchCow: For your discussion: If you do intend to download all of MobileMe, it would be necessary to:
18:44 🔗 alard 1. Make the upload to IA faster. (The 50MB/s upload we got to S3 via batcave is not fast enough.)
18:44 🔗 alard 2. Speed up the downloads. (kennethre was going fast, but maybe not fast enough. That also depends on the upload speed.)
18:44 🔗 alard 3. Collect more usernames, because we have enough to keep us going for a while, but certainly not everything.
18:44 🔗 alard But of course, not doing any of that will still produce lots of data.
18:53 🔗 kennethre alard: we need that ED guy back
18:54 🔗 alard Yes, that would help. But only if the upload is faster, too.
19:03 🔗 kennethre well he can probably keep loads of stuff locally
19:04 🔗 kennethre upload's only an issue for heroku runners
19:10 🔗 Nemo_bis How long does stuff last in unpowered hard drives?
19:26 🔗 kennethre i'd think longer than powered
19:27 🔗 swebb_ I think that multiple S3 uploads are faster than a single S3 upload. Also, if you can stage uploads on EC2, then uploads to S3 can be parallelized and should be much faster.
19:29 🔗 swebb_ so S3 scales by parallelizing transfers.
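
(A sketch of that fan-out idea with plain curl and xargs; the item name, the keys, and the parts/ directory are hypothetical:)

    # upload each piece in its own stream, 8 at a time
    ls parts/ | xargs -P 8 -I{} \
        curl --header "authorization: LOW ACCESSKEY:SECRET" \
             --upload-file "parts/{}" \
             "http://s3.us.archive.org/example-item/{}"
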
19:32 🔗 alard swebb_: We were doing 40-50 uploads to S3. (The IA S3 interface, that is.) Uploading directly was slow, something's wrong there, but through a proxy on batcave it was faster. Yet, we weren't able to push that to more than 50MB/s.
19:32 🔗 swebb_ Aah. I'm not that familiar with the IA S3.
19:33 🔗 alard Well, you just do an HTTP PUT with some extra headers. The problem is that, for some unknown reason, uploads directly to http://s3.us.archive.org/ never go beyond 300-400kB/s.
19:34 🔗 alard Via the batcave-proxy route you can do 10MB/s, with one upload stream. But that wouldn't go faster than 50MB/s. Which is fast, but not fast enough.
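
(For reference, a minimal sketch of such a PUT; the keys, the item name, and the file are placeholders, and the x-archive-* headers create the item and set metadata:)

    curl --header "authorization: LOW ACCESSKEY:SECRET" \
         --header "x-archive-auto-make-bucket: 1" \
         --header "x-archive-meta-title: Example item" \
         --upload-file data.tar.gz \
         http://s3.us.archive.org/example-item/data.tar.gz
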
19:34 🔗 swebb_ BTW, if anyone's interested: I had to research a solution to a problem we were having at work streaming tons of data across long-haul internet spans (like TCP traffic over the internet from the US to Europe) and found this: http://udt.sourceforge.net/ - it might be useful for streaming completed data to a central location at faster speeds.
19:39 🔗 swebb_ It's possible that batcave is a linux box or something with a more well-tuned TCP stack than the s3.us.archive.org server. TCP latency can be a bitch if not tuned very well.
19:39 🔗 swebb_ if s3.us.archive.org is on the same LAN as batcave, then it won't suffer from internet latency.
19:48 🔗 Schbirid damn, the slides in "Meredith Patterson and Sergey Bratus - The Science of Insecurity" are so complex they are actually distracting me
19:48 🔗 SketchCow Meredith!
19:48 🔗 SketchCow Love that girl, hugged her madly last year
19:49 🔗 Schbirid :)
19:53 🔗 Schbirid ok it is way over my head
19:54 🔗 Schbirid any other recommendations from http://www.shmoocon.org/presentations ? i did watch the awesome "Kristin Paget - Credit Card Fraud: The Contactless Generation" so far
20:01 🔗 ersi swebb_: From where I'm standing, it looks like shaping or some kind of bottleneck in the network path
20:03 🔗 swebb_ ersi: It could very well be, but latency looks like traffic shaping also. The solution, move to something other than TCP for the transfer to a machine physically close to the slow server, then use TCP or whatever to move the data from there.
20:05 🔗 ersi Hmm, now that you say that - yes, latency actually does. It's one of those 'bottlenecks' I vaguely mentioned but didn't think about
20:06 🔗 swebb_ You can tune your linux machine to be less picky about a high-latency connection, but you need to change things on both ends.
20:06 🔗 swebb_ cd /proc/sys/net/core
20:06 🔗 swebb_ echo 10000000 > rmem_default
20:06 🔗 swebb_ echo 10000000 > rmem_max
20:06 🔗 swebb_ echo 10000000 > wmem_default
20:06 🔗 swebb_ echo 10000000 > wmem_max
20:06 🔗 ersi How about intermediates?
20:06 🔗 swebb_ cd ../ipv4
20:06 🔗 swebb_ echo '4096 10000000 10000000' > tcp_mem
20:06 🔗 swebb_ echo '4096 10000000 10000000' > tcp_rmem
20:06 🔗 swebb_ echo '4096 10000000 10000000' > tcp_wmem
20:06 🔗 swebb_ cd route
20:06 🔗 swebb_ echo 1 > flush
20:07 🔗 swebb_ Those settings will help latency on all TCP traffic on any linux machine.
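
(The same values can be set non-persistently with sysctl, as sketched below. The rough sizing rule is bandwidth times round-trip time: a 10 MB window sustains about 50 MB/s at 200 ms RTT, since 10 MB / 0.2 s = 50 MB/s.)

    # add these to /etc/sysctl.conf to survive a reboot
    sysctl -w net.core.rmem_max=10000000
    sysctl -w net.core.wmem_max=10000000
    sysctl -w net.ipv4.tcp_rmem='4096 10000000 10000000'
    sysctl -w net.ipv4.tcp_wmem='4096 10000000 10000000'
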
20:09 🔗 swebb_ or just use UDP (or that UDT tool) to get the data 'close' to the S3 server, so it can stream quickly from there.
20:10 🔗 Nemo_bis swebb_, we've already mentioned that
20:11 🔗 Coderjoe that "get data close" thing is already being done by relaying through batcave, which is in the same network. I don't think it is the same LAN segment, but within the same building.
20:11 🔗 Nemo_bis I was told it wasn't going to be useful. In fact, I did some experiments from a machine with a 200 ms latency to IA s3 servers and about 100 Mb/s bandwidth.
20:11 🔗 Nemo_bis And there was some small result, but not much; usually it flowed between 50 and 1000 KB/s, averaging maybe 500, and peaked at 2-3000, with 1500 as the highest average.
20:12 🔗 Nemo_bis (IIRC)
20:12 🔗 Nemo_bis I doubt any such trick can do better than the batcave proxying.
20:12 🔗 closure http://animalnewyork.com/2012/02/the-department-of-homeland-security-is-searching-your-facebook-and-twitter-for-these-words/ cool, my nick is on the list
20:13 🔗 Coderjoe mmm
20:13 🔗 swebb_ Nemo_bis: where is your machine physically located?
20:13 🔗 Nemo_bis swebb_, that was in Italy. GARR network.
20:15 🔗 Coderjoe I now desire a chocolate and vodka concoction
20:15 🔗 swebb_ It looks like the s3.us.archive.org machine is located in San Francisco. So, Italy <--> SF will definitely suffer from latency issues if you're using TCP.
20:15 🔗 Nemo_bis of course
20:16 🔗 Nemo_bis the point is how you can improve through high latency connection
20:16 🔗 swebb_ Don't use a protocol that waits for a return packet before sending the next one; use something UDP-based instead.
20:16 🔗 Nemo_bis http://www.archiveteam.org/index.php?title=User:Nemo_bis contains some info
20:16 🔗 Nemo_bis and what alternatives are there for s3-like upload?
20:18 🔗 swebb_ S3 requires an HTTP PUT, so you're limited to TCP there.
20:18 🔗 Coderjoe well, there is the ftp interface
20:19 🔗 swebb_ FTP would work. I didn't know about that one.
20:19 🔗 Nemo_bis FTP is HORRIBLE
20:19 🔗 Coderjoe the new beta web interface is just a pretty front end on the s3 interface
20:20 🔗 Coderjoe the ftp interface is rather annoying, though, since you have to "check out" the item, which puts it on an FTP server, then interact with the FTP server to do whatever, and then finally "check in" the item, which pushes it back into the CMS
20:20 🔗 swebb_ Blech.
20:21 🔗 ersi FTP might be horrible, but that turd on steroids is working somewhat
20:22 🔗 Nemo_bis Nothing works with FTP. Nothing. Nothing.
20:22 🔗 Nemo_bis FTP gave me headaches for weeks. http://www.archive.org/post/267549/stale-pureftpd-upload-file-blocks-checkin
20:23 🔗 Nemo_bis Then an angel IA sysadmin told me to use s3 and I was saved.
20:23 🔗 ersi It's funny that you're getting sort of the same speed
20:24 🔗 Nemo_bis me? no, it's way faster
20:24 🔗 Nemo_bis At least from here.
20:24 🔗 swebb_ I'd do this: Italy -- [UDT] --> batcave -- [HTTP PUT] --> s3.us.archive.org. Automate the S3 part to just dump everything that's a few days old or something so people from around the world can use a better protocol to stream data over the internet.
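
(A sketch of that two-hop route with ordinary tools; the hostname, paths, and keys are hypothetical, and rsync stands in for the UDT leg just to show the shape of it:)

    # hop 1: the long-haul leg, to a relay box near IA
    rsync -avP archive/ user@batcave.example.org:/staging/archive/
    # hop 2: a short, low-latency PUT from the relay to the S3 endpoint
    ssh user@batcave.example.org \
        'curl --header "authorization: LOW ACCESSKEY:SECRET" \
              --upload-file /staging/archive/data.tar.gz \
              http://s3.us.archive.org/example-item/data.tar.gz'
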
20:24 🔗 Nemo_bis Sometimes it maxes my bandwidth (only 10 Mb/s full duplex at home).
20:26 🔗 alard I don't believe the latency is the issue; it's not just uploading from Italy, it's just as slow if you do an HTTP PUT from Amazon EC2.
20:27 🔗 swebb_ The third application used a new protocol called UDX that is a lightweight variant of the UDT protocol. UDX was able to sustain a data transfer rate of 9.2 Gb/s over a 10 Gb/s connection with a 200ms RTT (which corresponds to a 12,000 mile path, or long enough to reach halfway around the world).
20:27 🔗 swebb_ alard: ec2 isn't known for low-latency either though. :)
20:28 🔗 alard No, but it should be fast enough not to jump through hoops to get a decent upload speed.
20:28 🔗 swebb_ Were you using a us-west EC2 instance?
20:29 🔗 swebb_ since the IA server is in SF, it would be a better connection from the west in EC2.
20:29 🔗 alard What's the default?
20:29 🔗 swebb_ The default is on the east coast.
20:29 🔗 alard Ah, I see. Heroku is in us-east too.
20:29 🔗 kennethre correct
20:30 🔗 alard Still, I think it's silly that you can't get a decent upload to s3.us.archive.org while you can get a decent upload to batcave, which is just around the corner.
20:30 🔗 swebb_ batcave may be tuned to handle latency better.
20:30 🔗 Coderjoe swebb_: from the same EC2 instance, going to batcave (at IA) was fast, going to the IA-s3 endpoint was slow. going to the IA-s3 endpoint via batcave was fast.
20:31 🔗 swebb_ The whole idea is to get rid of latency issues altogether.
20:31 🔗 swebb_ ok, nevermind. I can tell that nobody cares. :)
20:32 🔗 alard swebb_: Well, I'm sure it can be much more efficient and better and more elegant, but indeed, I just want HTTP to be fast enough. :)
20:32 🔗 Coderjoe the proper fix is to get the IA admins to fix whatever the bottleneck on the S3 endpoint machines is.
20:32 🔗 alard I shouldn't have to care, that's the point. :)
20:33 🔗 Coderjoe the only way your UDP hopscotch method will work is if someone in charge of a machine there OKs it and sets it up
20:34 🔗 swebb_ Yea, you'd need an account somewhere physically close to the S3 server.
20:39 🔗 swebb_ Twitter delivers the firehose over HTTP (TCP) from California at a rate of 70Mbps (peak). People on the east coast and overseas can't consume the stream fast enough, due to nothing other than internet latency, and as a result end up putting machines on the West coast to consume the stream, then trickling it via other means to a more local server for use.
20:47 🔗 Schbirid and suddenly. 7MB/s to s3
20:48 🔗 soultcer TCP can't be that bad, they must have crappy connectivity
20:51 🔗 Nemo_bis Schbirid, with what latency?
20:51 🔗 Nemo_bis (that varies as well, in my experience)
20:51 🔗 Schbirid no idea, it just finished my s3cmd stuff in no time
20:51 🔗 Nemo_bis heh
20:51 🔗 Schbirid was the usual 150 kilobytes/s before
20:51 🔗 Nemo_bis lol
20:51 🔗 Nemo_bis where is the item
20:52 🔗 Schbirid currently http://www.archive.org/details/Forumplanet.gamespy.comArchive
20:53 🔗 Schbirid i uploaded two files by accident, just deleted them, still there
20:58 🔗 Schbirid ok, the files are correct now
21:00 🔗 Schbirid no, still one planetfallout too many
21:01 🔗 Schbirid and planetdoom
21:34 🔗 Schbirid heh, SketchCow, i was just going to mail you about it. thanks for moving!
21:54 🔗 alard kennethre: I made a Heroku version of the fortunecity scripts, so if you'd like to join, feel free... :) https://github.com/ArchiveTeam/fortunecity
21:54 🔗 alard (The same holds for other people, by the way. You can run one dyno for free.)
21:55 🔗 kennethre nice :)
21:55 🔗 kennethre one dyno *per app* for free
21:55 🔗 kennethre so everyone could run 100 of those each
21:55 🔗 alard Yes.
21:55 🔗 kennethre :)
21:55 🔗 kennethre but you know, don't exploit
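
(Running one of these on Heroku's cedar stack looked roughly like the following at the time; the "worker" process name is an assumption about the repo's Procfile:)

    git clone https://github.com/ArchiveTeam/fortunecity
    cd fortunecity
    heroku create             # one app, which gets one free dyno
    git push heroku master
    heroku ps:scale worker=1  # start the dyno defined in the Procfile
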
21:56 🔗 alard The rsync in your buildpack doesn't work, for some reason. It crashes, so I created a new one.
21:59 🔗 kennethre alard: it does, i didn't compile it right
22:32 🔗 shaqfu "Stipend: None"
22:32 🔗 shaqfu Fuck you, LoC, and your unpaid digitization too
22:38 🔗 Coderjoe one problem with the "move mobileme to offline drives" idea: it makes it harder for people to retrieve whatever data (of theirs) we saved
22:39 🔗 Coderjoe (the geocities story of the woman with a site dedicated to her deceased infant son comes to mind)
22:40 🔗 alard kennethre: Are your memac clients still running?
22:40 🔗 shaqfu Coderjoe: Is it feasible to keep an index and get the data on request?
22:40 🔗 kennethre alard: nah
22:40 🔗 alard It's still really active on the tracker.
22:41 🔗 Coderjoe shaqfu: only if there is someone willing to pull out one of the offline drives to retrieve the data on request.
22:42 🔗 shaqfu Coderjoe: Point taken, esp. if it's not all held by one person/entity
22:42 🔗 Coderjoe shaqfu: it would be held at IA.
22:43 🔗 shaqfu Coderjoe: Oh, hm
22:43 🔗 Coderjoe shaqfu: see around 2012-03-06 00:50 eastern time for more
22:44 🔗 shaqfu Hunh
22:44 🔗 shaqfu Closed stacks at the IA, who would've thought
22:44 🔗 kennethre alard: where?
22:44 🔗 alard kennethre: The memac-tamer tracker.
22:44 🔗 kennethre alard: i only see 1
22:44 🔗 alard kennethre: Just set the number of clients to 1. http://memac-tamer.heroku.com/
22:45 🔗 alard Well, it's really busy with clients asking 'can I start?'.
22:45 🔗 kennethre ugh
22:45 🔗 kennethre let me see
22:45 🔗 shaqfu Coderjoe: There's good precedent for it - every paper archive works like that - but I'm not fully sure how it'd work with digital data
22:47 🔗 kennethre damnit
22:47 🔗 kennethre found the app
22:47 🔗 kennethre off now
22:47 🔗 kennethre damn
22:47 🔗 shaqfu You could guess at high-interest data and have that on hand, and closed-stack the rest
22:49 🔗 shaqfu That way, you have the stuff people usually want readily available, and anything else (low-traffic data, miners) can put in requests or wait
22:49 🔗 alard kennethre: Good, less wasteful!
22:49 🔗 shaqfu (I don't know what's in MobileMe, so dunno what'd qualify as high-value)
22:50 🔗 alard Well, one thing they could do is separate the web.me.com/homepage.mac.com data from the galleries and the files.
22:51 🔗 alard web+homepage are 1/4th of the data, but possibly interesting to more people.
22:53 🔗 shaqfu alard: How large is the entire collection?
22:53 🔗 alard We once estimated 200TB, but here's the most recent data so you can do the calculations yourself: http://memac.heroku.com/
22:56 🔗 shaqfu alard: That'd be a good idea, since I assume a lot of what's on there is referenced by homepages
22:56 🔗 shaqfu So people can look through the pages, and if there's a gap, they can request it
22:57 🔗 alard Sites on homepage/web do not, as far as I know, really refer to files on gallery or public. If they do, they probably have a copy of the file on web or homepage.
22:57 🔗 chronomex washington mutual used to have old statement archives on dark stacks - they dumped them to tape after 6mo or whatever; to see something older you requested it and the tape robot would get it within 24 hours
22:58 🔗 shaqfu Tape robot? Fancy; the archives I've worked at used high schoolers
22:58 🔗 chronomex okay, maybe it was people, I dunno.
22:59 🔗 shaqfu They probably used a robot; WaMu has more money than all the places I've worked at combined
22:59 🔗 chronomex *had
22:59 🔗 shaqfu That too
23:00 🔗 alard The solution may be even easier: just get an account with one of the companies like http://www.bitcasa.com/
23:01 🔗 alard Unlimited storage for 10 dollars a month, that's what we need. :)
23:01 🔗 shaqfu Haha
23:01 🔗 chronomex sweet
23:01 🔗 chronomex that'll do
23:01 🔗 alard Hello, can we try your beta?
23:02 🔗 shaqfu Can we borrow your entire line for, oh, a month?
23:03 🔗 shaqfu Anyway, it's an interesting problem
