[00:00] I suspect I will eventually.
[00:02] I mean, make no mistake, I'm getting stuff done all at the same time. This is all work that needs to be done and this room needs to process this material so I can then put it into permanent storage or donate it.
[00:11] hmm, what if you get more drives? if it takes 90 seconds to rip, and you have nine of them, then you change one CD every ten seconds
[00:11] on the theory that if you're going to be babysitting it, it might as well go as fast as possible
[00:12] Right, that's the problem.
[00:12] I could start to move into custom solutions, but it gets silly.
[00:12] how do you stagger them, and isn't 10 seconds pretty close to what you need to open the drive, take the CD out, pop it back into a case, snap the case shut, get the next one, repeat?
[00:12] The fact is, these items already waited a year; if they get delayed over time because that process is running catch-as-catch-can all the time, it's cool.
[00:13] dashcloud: they'd be staggered because you're loading them serially
[00:14] load #1, load #2, ... by the time you finish, #1 is done; repeat.
[00:16] Also, remember, this is with me setting them into a "Ripped" box for later scanning of the labels and CD.
[00:16] That's a WHOLE other process.
[00:16] If I lived in SF, I could probably get someone to do it.
[00:21] See, it's a nice problem to have, but the fundamental issue is rapidly becoming not "do we have the space and bandwidth" for it, but "where do we get the volunteers"
[00:23] Also, this drive in this thing is ridiculous
[01:06] I changed the captcha on the wiki.
[01:17] I'm listening to Tim O'Reilly talk about Digital Preservation and say nothing new for 30 minutes.
[01:18] So you don't have to.
[01:19] I would probably listen to you talk about preservation and general computer history for six hours, SketchCow.
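A rough sketch of the drive-staggering arithmetic from the 00:11 to 00:14 exchange, assuming every rip takes the same fixed time and ignoring the handling time per disc (which, as pointed out in the channel, may itself be close to ten seconds). The function names are made up for illustration.

```python
def swap_interval(rip_seconds: float, drives: int) -> float:
    """Seconds between disc swaps when identical drives are loaded one after another."""
    if drives < 1:
        raise ValueError("need at least one drive")
    return rip_seconds / drives


def load_schedule(rip_seconds: float, drives: int) -> list[tuple[int, float, float]]:
    """One pass over the drives as (drive number, load time, finish time) in seconds."""
    interval = swap_interval(rip_seconds, drives)
    return [
        (n + 1, n * interval, n * interval + rip_seconds)
        for n in range(drives)
    ]


if __name__ == "__main__":
    # Figures from the chat: 90-second rips, nine drives.
    for drive, loaded, done in load_schedule(90, 9):
        print(f"drive #{drive}: load at {loaded:.0f}s, done at {done:.0f}s")
```

With the chat's figures, drive #9 is loaded at 80 seconds and drive #1 finishes at 90 seconds, which is the "by the time you finish, #1 is done" point.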
[01:20] but it's good for more people to say things you've already said - he's a well-known figure, and hopefully can get people thinking and caring about those issues
[01:32] Hooray (?) I found another cache of DVDs.
[01:36] with flea market season approaching fast, I'm hoping to find many more awesome goodies to get archived
[02:30] hi, this item got misnamed somehow: http://archive.org/details/cdrom-riscos-kosovo the correct name is in the item description
[02:36] I uploaded that, I put riscos- in front of all the risc os stuff from the piratebay so I could keep track of it
[02:38] as for the kosovo part as far as I know that's correct
[02:39] somebody apparently thought acorn shovelware would be a great way to raise money for orphans
[02:41] take a look at http://archive.org/download/cdrom-riscos-kosovo/KosovoOrphansAppeal.iso/INSTRUCTIONS in your favourite text editor
[02:45] I thought that was the wrong name because of this name: Archimedes World Magazine CD1
[02:45] Mostly because it's such an out of place name it's like they were trying to avoid selling any of them
[02:49] apparently it did well enough to sell out one pressing http://archive.org/download/cdrom-riscos-kosovo/KosovoOrphansAppeal.iso/2NDEDITION
[02:51] no shit
[03:15] lol
[03:15] how absurd
[03:28] SketchCow: since IA seems to have ABBYY for book OCR, can you re-use that for the labels to generate basic CD descriptions from the case and CD scans?
[03:34] the abbyy name still shows in a few places but AIUI it's actually luratech under the hood
[03:37] interesting - I'm not familiar with them
[04:37] Welcome to dispatchers that steal
[04:57] For those with scanners dispatchers want to see that concord is the 1st “police” station to refuse to let a person get a head in life and if it is federal level etc will stop mail and items in transit and make sure that there is interception.
[05:01] santa-ine: pretty sure buying body parts is illegal for a good reason bro
[05:02] you should only get a head in life once (your own)
[05:17] Why am I getting "rate limited. waiting for 300 seconds" on my Warrior for Yahoo Messages?
[05:18] Are they banning IPs?
[05:18] It's their special way of telling you you're awesome ;)
[05:18] well that's nice of them
[05:19] (and 'yes' to both your questions)
[05:20] also, #BurnTheMessenger is the channel for questions related to this project
[05:22] From that channel: "when you get rate-limited, it waits at least 12 periods of 300 seconds. that's per-thread, and you'll likely get one item done before you get rate-limited"
[05:28] gotcha
[06:34] is any amount of the MIDI content from AOL Composer's Showcase circa 1997 archived anywhere?
[06:37] also did anyone grab any amount of Digg before they turned to version 3 and wiped all the data ~2.5yrs ago?
[07:35] hi peeps
[10:33] Nooo, my town university is killing off all the student home pages
[10:46] noooooooooo
[10:46] fuck why do they do that
[10:52] mirror that
[10:56] Because they're "modernizing" :(
[10:56] at least they have a great main index, so no username crawling needed
[11:23] so i'm uploading more g4 videos
[11:23] wish i could do it at 5MB a second
[11:23] only cause i got like over a TB of videos
[11:42] also i found more high res edge magazine scans
[11:43] its from a different guy this time
[11:43] also i will upload the 150dpi rips i got since its more of a complete set from 1995 to 2007 of edge magazine
[12:07] so i got the trailer of the new silent hill movie
[12:08] thanks to g4tv.com
[12:08] and in hd too
[12:57] 3gb left on 4data
[13:30] \o/
[14:21] T-minus 1gb and counting.
[14:33] o_O
[14:43] It is done. 103gb spread over 380,000 images
[14:44] Another successful save
[14:55] Yeah, come on. Everyone in the channel, get a warrior running.
[14:55] It's going to be too close.
[14:56] Are we all blocked? The tracker has, like, no scrolling.
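For a sense of scale on the rate limiting quoted at 05:22 ("at least 12 periods of 300 seconds", per thread), here is a small sketch of that arithmetic and of a wait loop built on it. This is an illustration only, not the actual project script; `still_limited` is a hypothetical callback standing in for whatever check the real pipeline performs.

```python
import time
from typing import Callable

PERIOD_SECONDS = 300  # from the quoted message
MIN_PERIODS = 12      # from the quoted message


def minimum_wait_seconds(periods: int = MIN_PERIODS,
                         period_seconds: int = PERIOD_SECONDS) -> int:
    """Shortest total wait once a thread has been rate-limited."""
    return periods * period_seconds


def wait_out_rate_limit(still_limited: Callable[[], bool],
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Sleep in fixed periods: always the minimum number, then until the limit lifts.

    Returns the total number of seconds waited.
    """
    waited = 0
    periods_done = 0
    while periods_done < MIN_PERIODS or still_limited():
        sleep(PERIOD_SECONDS)
        waited += PERIOD_SECONDS
        periods_done += 1
    return waited


if __name__ == "__main__":
    print(minimum_wait_seconds() / 60, "minutes minimum per rate-limited thread")
```

Under those quoted numbers a rate-limited thread sits idle for at least an hour, which may be why lowering the thread count alone did not seem to help people much.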
[14:59] you do?
[15:17] SketchCow: yahoo's blocking is a lot more aggressive than that of posterous.
[15:19] is it per-IP?
[15:19] DrDeke: I believe so
[15:19] but I'm not 100% sure
[15:23] any idea how many concurrent we should run?
[15:23] as in, will something < 6 help prevent limiting
[15:23] People have been getting banned running 1 thread
[15:23] I get limiting with just 2, I don't think it helps at all
[15:23] I'm limited with 1 after 10 minutes or so
[15:26] hm
[15:27] We need to find what the connection limit is before banning occurs
[15:27] yeah
[15:27] Then we can sit 1 below that with the User Agent of "Fuck your scripts, we're Archive Team"
[15:27] ;)
[15:28] Anyway
[15:28] * GLaDOS pushes everyone into #BurnTheMessenger
[15:28] guys, as no one is looking in the other channel, anyone know the ID of the AMI for the warrior on EC2? The old (original) one I have seems to not exist anymore.
[15:28] I'll happily fire up a few instances if only I had a working system :D
[15:28] i have an AMI that i created myself which doesn't use the warrior, and pretty much rapes posterous (sorry)
[15:28] Smiley: remember that dedi that I gave details to?
[15:28] i could make it public or add you to its ACL if you want
[15:28] but nothing for yahoo yet
[15:28] GLaDOS: yes, but I don't know how to setup the seesaw yet ;)
[15:29] setting up seesaw is easy, but for this you'll need tons of IPs
[15:29] right
[15:29] apt-get install python-pip; pip install seesaw
[15:29] #burnthemessenger !!!!
[15:29] Smiley ^
[15:43] ------------------------------------------
[15:43] #BurnTheMessenger - Yahoo! Messages needs to be archived. Please visit the project channel and/or start the project in your warriors.
[15:43] ------------------------------------------
[15:44] What ersi said
[15:45] ◤◢◤◢◤◢ ALART ALART ALART ◤◢◤◢◤◢
[15:47] How come no one gets this excited over a project that has announced it will close but no official date set?
[15:47] god damn it, that was clear enough
[15:47] omf_: It's Yahoo! and they suck
[15:47] That could literally be off tomorrow
[15:47] 742MPH, WE DON'T NEED TO SAY ANYMORE.
[15:48] needs to be fabulous
[15:48] Please try to keep this channel A) On-topic B) As low-traffic as possible C) Low-noise
[15:48] Stop with the damn colour things. Take that to #archiveteam-bs
[15:49] It distracts.
[15:49] It's meant to distract
[15:49] Thanks.
[15:49] We're waking up the gang.
[15:49] We have 100 people in the channel, many are idle.
[15:49] Less idle now!
[15:50] They'll surely see it if we make them scroll!
[15:50] I am not going to agree with your position on this!
[15:53] Just did an interview with CBC about posterous
[15:53] And shitty monitors!
[15:55] I try not to talk too much, don't want to piss people off :P
[15:55] ^^^^ A thing I have never said
[15:56] ...well to be more specific, I meant in here
[15:56] my normal on-line behavior is carefree of who it disturbs
[15:57] so cbc, this is the Canadian Broadcasting Channel?
[15:57] Was one of my favorite cable channels growing up.
[15:57] CBC is pretty great
[15:58] Kids in the hall uncensored vs Comedy Central
[15:58] can anyone find more g4tv.com xml data?
[15:58] i'm trying to see if there is something hiding in google but not sure if i can find it there
[17:23] This is Spark at CBC
[17:23] They've talked with me before
[17:23] Posterous got some attention
[17:32] Listening to the Q&A of the 2011 Tim O'Reilly speech.
[17:32] In it, Stanford bemoans how nobody is saving the source repositories.
[17:32] We're doing it, as far as I know.
[17:32] I can't overstate how Archive Team is completely in the forefront of this horseshit
[17:33] ha ha, some toolbag asking a question about "why do we need to save all this"
[17:33] * SketchCow gets archery equipment
[17:36] We need to get you a nice pocket sized crossbow
[17:36] with poison bolts
[17:37] Archery is actually a very relaxing activity
[17:38] It'd make my presentations better
[17:38] ssssss THOOOOOOOON
[18:17] hey, I've got a linux machine in the IA's friends and family rack. I can't run virtualbox on it, tho. how can I help with the yahoo messages?
[18:21] you could run this or some variant of it: http://pastebin.com/CarmqNrt
[18:22] you might want to remove the screen part (or you might not, depends)
[18:22] also you *might* need to get rid of --concurrent 2 if you don't want to get rate limited
[18:22] that is not entirely clear at this point
[18:26] is there any way to utilize google's caches of some of the yahoo messages? example: https://bitly.com/11qet7N
[18:26] i picked a few at random, most weren't cached, some were.
[18:27] http://webcache.googleusercontent.com/search?q=cache:http://example.com is apparently the format to grab a cached copy via a direct url
[18:50] Yeah and they ban cacherippers pretty fast too iirc
[19:22] any way to get the warrior to listen on a port other than 8001? i'm already using that one
[19:24] Good question
[19:26] it's not super important, i can change the port of the other service on my machine that's listening on 8001 instead
[19:27] Looking into it - I know the underlying scripts have parameters to change the bind/listen port
[19:30] It can be changed in the network adapter settings, under advanced, port forwarding
[19:30] aha, i see that
[19:31] brilliant, didn't even have to restart the VM
[19:31] thanks
[20:15] Say, how often does the warrior upload the pages it has downloaded?
[20:17] grawity: Project? Yahoo! Messages? Posterous?
[20:19] Yahoo Messages... Hit the rate limit after ~200 URLs, that's very little but I don't want to accidentally discard those anyway.
[20:21] grawity: The script will sleep for 300 seconds and try again - you need to complete the whole Item before it'll be uploaded
[20:22] Ah, okay
[20:22] Feel free to join #BurnTheMessenger by the way, it's the project channel for archiving "Yahoo! Messages"
[20:22] and feel free to hang around in general ^_^
[21:03] hello - is there any way to specify a SOCKS proxy to the archive warrior or otherwise route all traffic through Tor, or through a Tor bridge?
[21:08] Maybe - not documented/guide written though
[21:08] It's probably very doable though
[21:09] neurophyr: I think wget listens to the http_proxy environment variable.
[21:09] yeah, it does
[21:13] SketchCow: Thanks, I've corrected the Punchfork index Name / Date thing now. http://archive.org/download/archiveteam_punchfork_index/
[21:14] ah okay so i can log into the warrior. is there documentation on how?
[21:14] Alt+F3
[21:14] User: root Password: archiveteam
[21:15] Setting the http_proxy variable might be more difficult. Perhaps you should set it in the /home/warrior/.bashrc ?
[21:15] wonderful, thank you. yeah i am just trying to get around bans (assuming tor exits aren't banned) and have a couple relays...
[21:15] tor by default presents a SOCKS proxy
[21:18] i'll head back w/questions if i can't get it working. thanks for running this project, just heard of it today :)
[21:18] np :)
[21:18] feel free to stay around anytime
[21:47] why does my warrior download new URLs after I've told it to stop? x)
[21:48] bowman__: It finishes the task it's currently working on.
[21:49] alard: ah kk so I suppose I'd better leave it alone until it's done
[21:55] Yes.
[21:55] yup
[21:55] alard: did you have modifications to warctools btw
[21:55] tef: Did I?
[21:58] tef: Not that I know of. I checked my only three repositories with a hanzo/warctools directory (warc-proxy, warctozip and warctozip-service), but there don't seem to be any changes there.
[21:59] ah, I also mean things like warctozip
[21:59] talked to IA today, got a github page
[21:59] https://github.com/internetarchive/warctools/
[21:59] going to push stuff there and start merging things too
[22:00] Ah.
[22:00] i.e. abandon hg \o/