[03:45] is ftp the way to go for bulk upload to IA?
[03:48] ftp, sftp, rsync or rfc1149
[03:49] are the rsync instructions on the IA wiki?
[03:49] i don't know
[03:50] considering how slow my internet is, rfc1149 might be good for some of the larger uploads
[03:51] you could do s3 as well
[03:51] I have the data local
[03:52] oh, you mean the s3-like api
[03:52] that's the one
[03:55] shit, I cannot remember if I even have a wiki account
[03:59] well, I just signed up for an account and now I get some access, but mainly access denied errors. This is all starting to come back to me
[04:00] google was faster
[04:02] I do not have permission to download ias3upload
[04:04] I found a copy, not sure if it is as up to date, but you take what you can get
[05:16] Getting "No item received. Retrying after 30 seconds..." does this mean that Posterous finally blocked my IP?
[05:18] jfranusic: are you @jf
[05:18] yes
[05:18] it's robbie, i'm the kid always wearing the twilio jacket who is friends with gene
[05:18] Maybe I should make that my handle on IRC
[05:18] anyway, that's because the tracker is suspended: Posterous was having uptime issues a few hours ago
[05:19] we are waiting for the person managing the tracker to come back and unsuspend it
[05:19] you may want to join #preposterus for more info
[05:19] done!
[05:20] ... and, it looks like my IP is blocked anyway
[05:21] jfranusic: yeah, i don't know how you can tell if your IP is blocked or not, haven't had that issue yet since i just keep swapping out IPs
[05:22] https://gist.github.com/jpf/5074632
[05:22] ah yes
[05:22] how are you swapping IPs?
[05:22] just using Elastic IPs?
[05:22] try running `curl http://posterous.com/`; if it fails to connect you're blocked, if you get some HTML back you're good to go
[05:23] Cameron_D: yeah, that's what's in the gist
[05:24] Yeah, I decided to hit enter before checking the link
[05:24] * jfranusic laughs
[06:01] jfranusic, robbiet48: the tracker is not handing out new tasks for the next few hours; it turns out we overloaded posterous
[06:01] we're letting them get their shit back in order
[06:01] they said it's back up a few hours ago
[06:01] they tweeted it
[06:02] correct
[06:02] ah, so we are still waiting
[06:02] ok
[06:02] we are playing it safe for the moment
[06:02] sure
[06:03] it may actually not be possible to archive all of posterous, which would be a sad
[06:03] :(
[06:03] might not be possible because of their ip banning policy?
[06:06] well, there are plenty of ips to go around
[06:06] they might not be able to render every page in time
[06:06] without going down
[06:06] too bad posterous isn't ipv6 enabled
[06:06] then we would never run out of addresses!
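A minimal sketch of a bulk upload over the s3-like api mentioned at [03:51]-[03:52], assuming archive.org's s3.us.archive.org endpoint; the ACCESSKEY:SECRETKEY pair, the item identifier "example-item", the file name, and the metadata values are placeholders, not details from the log:

    # Upload one file to an item via archive.org's S3-like API.
    # "authorization: LOW key:secret" is IA's S3 auth header,
    # x-amz-auto-make-bucket creates the item if it doesn't exist yet,
    # and x-archive-meta-* headers set item metadata (such as the
    # 'archiveteam' keyword that comes up later, at [06:28]).
    curl --location \
         --header "authorization: LOW ACCESSKEY:SECRETKEY" \
         --header "x-amz-auto-make-bucket:1" \
         --header "x-archive-meta-mediatype:texts" \
         --header "x-archive-meta-subject:archiveteam" \
         --upload-file ./data.tar \
         http://s3.us.archive.org/example-item/data.tar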
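And a rough sketch of the block test suggested at [05:22], wrapped in a shell conditional; the 30-second timeout is an assumption, not something from the log:

    # A connection failure (non-zero exit from curl) suggests the IP is
    # blocked; any response at all means the block hasn't hit yet.
    if curl --silent --max-time 30 --output /dev/null http://posterous.com/; then
        echo "got a response, this IP looks fine"
    else
        echo "connection failed, this IP is probably blocked"
    fi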
[06:07] ha
[06:07] pretty much
[06:08] so, there's this: http://webcache.googleusercontent.com/search?q=cache:jf.posterous.com
[06:09] we should just convince some twitter employee to look the other way and send us a database dump :)
[06:09] I'm writing an email right now
[06:09] problem is legality
[06:09] explain
[06:09] supposedly their ToS means they don't own the content
[06:09] so they don't have license to give it to us
[06:10] this is what someone said
[06:10] Eh, they're sorrrrta giving it to us already with the webpages they render
[06:10] presumably that issue was figured out once when Twitter bought them
[06:10] maybe they could "sell" that part of the company to someone for $1, or something
[06:11] iono
[06:11] anyway
[06:11] looks like the google cache has my posterous
[06:12] presumably they have others, would it be worth it to "scrape" from the google cache?
[06:14] I talked to a twitter employee friend of mine
[06:14] he says that it's not a part of their normal infrastructure, so he couldn't even touch the boxes
[06:17] I guess Posterous's TOS was different from Twitter's, given that Twitter has no problem giving their tweet archives to LC
[06:20] * chronomex shrugs
[06:21] the google cache version of my posterous is nearly identical to the live version
[06:21] it looks like google adds ~2 lines to their cached pages
[06:22] it is, but you don't get the additional data such as headers (which is needed to put it into the wayback)
[06:23] And Google's IP banning is even faster than Posterous' (usually happens within a few hundred requests)
[06:27] ah! I didn't know that the wayback stores headers too. That's very cool
[06:28] since normal users do not have access to many collections on IA to upload to, is the best practice to stick 'archiveteam' in the keywords and then it gets moved later?
[06:31] put it in community texts and either poke sketchcow or underscor in here if they are active, or email jason if not
[09:26] Just a reminder. The current projects are #preposterus for posterous.com
[09:27] #closedsolaris for opensolaris.org
[09:27] and #ispygames for ugo, ign, 1up, gamespy
[09:27] #BurnTheMessenger for Yahoo Messages
[09:28] yep, I was just scrolling back to find the channel name
[09:37] Are there any sites with mailing lists we need to fetch at present? I already got all of mail.opensolaris.org
[15:21] Is it possible to merge two warc files?
[15:22] Someone told me it was
[15:22] There might already be code to do it too
[15:22] grawity: Yes
[15:22] wget only overwrites... `cat` seems to work, but I'm trying to be careful here
[15:23] AFAIK just appending it to one should work and be valid. Depends on whether you have a warcinfo record, I guess
[15:49] grawity: Yes, if they're valid warc files, cat will also make a valid warc file.
[21:01] alard: ...hmm, wouldn't it damage the .cdx file then?
[21:02] grawity: Yes, that's true, you'll have to generate a new cdx file.
[21:05] And, um, howdoIdothat?
[21:06] Why do you want the cdx file?
[21:07] There's this, for example: https://github.com/internetarchive/CDX-Writer , but that doesn't work for wget deduplication.
[21:08] If you need the cdx for the wget deduplication option you can simply cat the cdx files together: wget doesn't use the file offset fields.
[21:11] alard: I don't, I've just read in some-or-other "how to grab stuff for ArchiveTeam using wget" article that it's better to include the .cdx
[21:12] It isn't. I'd just throw it away if you don't need it.
[21:12] The Wget-generated .cdx is mostly useful for the Wget deduplication option. It probably won't work for the wayback machine, for instance.
[21:12] archive.org generates its own cdx when you upload an item
[21:13] Ah, good.
[23:31] so there are 124 users left to do
[23:31] punchfork users... are they all problematic?
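Putting the advice from [15:49], [21:02] and [21:08] together, a sketch of merging two WARCs; the cdx_writer.py invocation follows my reading of the CDX-Writer README, so treat it as an assumption:

    # Concatenating valid WARC (or warc.gz) files yields a valid WARC...
    cat first.warc.gz second.warc.gz > merged.warc.gz

    # ...but the old .cdx offsets no longer match the merged file, so
    # regenerate the index, e.g. with
    # https://github.com/internetarchive/CDX-Writer :
    ./cdx_writer.py merged.warc.gz > merged.cdx

    # If the cdx is only needed for wget's deduplication option, simply
    # concatenating the old files works, since wget ignores the offset
    # fields:
    cat first.cdx second.cdx > merged.cdx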