#archiveteam 2013-03-03,Sun


Time Nickname Message
03:45 🔗 omf_ is ftp the way to go for bulk upload to IA?
03:48 🔗 Lord_Nigh ftp, sftp, rsync or rfc1149
03:49 🔗 omf_ are the rsync instructions on the IA wiki?
03:49 🔗 Lord_Nigh i don't know
03:50 🔗 omf_ considering how slow my internet is rfc1149 might be good for some of the larger downloads
03:51 🔗 Famicoman you could do s3 as well
03:51 🔗 omf_ I have the data local
03:52 🔗 omf_ oh you mean the s3 like api
03:52 🔗 Famicoman that's the one
03:55 🔗 omf_ shit I cannot remember if I even have a wiki account
03:59 🔗 omf_ well I just signed up for an account and now I get some access but mainly access denied errors. This is all starting to come back to me
04:00 🔗 omf_ google was faster
04:02 🔗 omf_ I do not have permission to download ias3upload
04:04 🔗 omf_ I found a copy, not sure if it is as up to date but you take what you can get
05:16 🔗 jfranusic Getting "No item received. Retrying after 30 seconds... " does this mean that Posterous finally blocked my IP?
05:18 🔗 robbiet48 jfranusic: are you @jf
05:18 🔗 jfranusic yes
05:18 🔗 robbiet48 it's robbie, i'm the kid always wearing the twilio jacket who is friends with gene
05:18 🔗 jfranusic Maybe I should make that my handle on IRC
05:18 🔗 robbiet48 anyway, that is because the tracker is suspended because Posterous was having uptime issues a few hours ago
05:19 🔗 robbiet48 we are waiting for the person managing the tracker to come back and unsuspend it
05:19 🔗 robbiet48 you may want to join #preposterus for more info
05:19 🔗 jfranusic done!
05:20 🔗 jfranusic ... and, it looks like my IP is blocked anyway
05:21 🔗 robbiet48 jfranusic: yeah i dont know how you can tell if your IP is blocked or not, haven't had that issue yet since i just keep swapping out IPs
05:22 🔗 jfranusic https://gist.github.com/jpf/5074632
05:22 🔗 robbiet48 ah yes
05:22 🔗 jfranusic how are you swapping IPs?
05:22 🔗 jfranusic just using Elastic IP's?
05:22 🔗 Cameron_D try running `curl http://posterous.com/`; if it fails to connect you're blocked, if you get some HTML back you are good to go
05:23 🔗 jfranusic Cameron_D: yeah, that's what's in the gist
05:24 🔗 Cameron_D Yeah, decided to hit enter before checking the link
05:24 🔗 * jfranusic laughs
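
The curl check Cameron_D describes can be scripted. The following is a minimal sketch, not the contents of the linked gist: the function name `looks_blocked`, the 10-second timeout, and the injectable `opener` parameter are all assumptions made for illustration. It treats any HTTP response, even an error status, as "not blocked", and only a failed connection as "blocked".

```python
import urllib.error
import urllib.request


def looks_blocked(url="http://posterous.com/", timeout=10,
                  opener=urllib.request.urlopen):
    """Return True if the connection fails, False if the server answers.

    `opener` is a hypothetical hook so the check can be exercised
    without touching the network; it defaults to urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            response.read(1)  # pull one byte to confirm data is flowing
        return False
    except urllib.error.HTTPError:
        # The server replied with an error status, so the connection
        # itself was not refused.
        return False
    except (urllib.error.URLError, OSError):
        # Refused, reset, or timed out: the symptom described in channel.
        return True


if __name__ == "__main__":
    print("blocked" if looks_blocked() else "good to go")
```

A timeout or refusal can also come from ordinary downtime, so a True result is only a hint that the IP is blocked.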
06:01 🔗 chronomex jfranusic, robbiet48: the tracker is not handing out new tasks for the next few hours, it turns out we overloaded posterous
06:01 🔗 chronomex we're letting them get their shit back in order
06:01 🔗 robbiet48 they said its back up a few hours ago
06:01 🔗 robbiet48 they tweeted it
06:02 🔗 chronomex correct
06:02 🔗 robbiet48 ah so we are still waiting
06:02 🔗 robbiet48 ok
06:02 🔗 chronomex we are playing it safe for the moment
06:02 🔗 robbiet48 sure
06:03 🔗 chronomex it may actually not be possible to archive all of posterous, which would be a sad
06:03 🔗 robbiet48 :(
06:03 🔗 jfranusic might not be possible because of their ip banning policy?
06:06 🔗 chronomex well there are plenty of ips to go around
06:06 🔗 chronomex they might not be able to render every page in time
06:06 🔗 chronomex without going down
06:06 🔗 robbiet48 too bad posterous isn't ipv6 enabled
06:06 🔗 robbiet48 then we would never run out of addresses!
06:07 🔗 chronomex ha
06:07 🔗 chronomex pretty much
06:08 🔗 jfranusic so, there's this: http://webcache.googleusercontent.com/search?q=cache:jf.posterous.com
06:09 🔗 robbiet48 we should just convince some twitter employee to look the other way and send us a database dump :)
06:09 🔗 jfranusic I'm writing an email right now
06:09 🔗 robbiet48 problem is legality
06:09 🔗 jfranusic explain
06:09 🔗 robbiet48 supposedly their ToS means they don't own the content
06:09 🔗 robbiet48 so they don't have license to give it to us
06:10 🔗 robbiet48 this is what someone said
06:10 🔗 atg Eh they're sorrrrta giving it to us already with the webpages they render
06:10 🔗 jfranusic presumably that issue was figured out once when Twitter bought them
06:10 🔗 jfranusic maybe they could "sell" that part of the company to someone for $1, or something
06:11 🔗 robbiet48 iono
06:11 🔗 jfranusic anyway
06:11 🔗 jfranusic looks like the google cache has my posterous
06:12 🔗 jfranusic presumably they have others, would it be worth it to "scrape" from the google cache?
06:14 🔗 chronomex I talked to a twitter employee friend of mine
06:14 🔗 chronomex he says that it's not a part of their normal infrastructure, so he couldn't even touch the boxes
06:17 🔗 mistym I guess Posterous's TOS was different from Twitter's, given that Twitter has no problem giving their tweet archives to LC
06:20 🔗 * chronomex shrugs
06:21 🔗 jfranusic the google cache version of my posterous is nearly identical to the live version
06:21 🔗 jfranusic it looks like google adds ~2 lines to their cached pages
06:22 🔗 Cameron_D it is, but you don't get the additional data such as headers (which is needed to put it into the wayback)
06:23 🔗 Cameron_D And Google's IP banning is even faster than Posterous' (usually happens within a few hundred requests)
06:27 🔗 jfranusic ah! I didn't know that the wayback stores headers too. That's very cool
06:28 🔗 omf_ since normal users do not have access to many collections on IA to upload to, is the best practice to stick 'archiveteam' in the keywords and then it gets moved later
06:31 🔗 DFJustin put it in community texts and either poke sketchcow or underscor in here if they are active or email jason if not
09:26 🔗 omf_ Just a reminder. The current projects are #preposterus for posterous.com
09:27 🔗 omf_ #closedsolaris for opensolaris.org
09:27 🔗 omf_ and #ispygames for ugo, ign, 1up, gamespy
09:27 🔗 Cameron_D #BurnTheMessenger for Yahoo Messages
09:28 🔗 omf_ yep I was just scrolling back to find the channel name
09:37 🔗 omf_ Are there any sites with mailing lists we need to fetch at present? I already got all of mail.opensolaris.org
15:21 🔗 grawity Is it possible to merge two warc files?
15:22 🔗 omf_ Someone told me it was
15:22 🔗 omf_ There might already be code to do it too
15:22 🔗 ersi grawity: Yes
15:22 🔗 grawity wget only overwrites... `cat` seems to work, but I'm trying to be careful here
15:23 🔗 ersi AFAIK just appending it to one should work and be valid. Depends on if you have a warcinfo record I guess
15:49 🔗 alard grawity: Yes, if they're valid warc files cat will also make a valid warc file.
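
The `cat` approach discussed above amounts to a byte-for-byte append. Here is a minimal sketch of the same thing in Python, assuming every input is already a valid WARC file; `concat_warcs` is a hypothetical helper name, and the destination must not be one of the sources.

```python
import shutil
from pathlib import Path


def concat_warcs(sources, destination):
    """Append each source WARC, byte for byte, into a new destination file.

    Returns the size of the merged file in bytes. No records are parsed
    or validated; this is equivalent to `cat a.warc b.warc > merged.warc`.
    """
    with open(destination, "wb") as out:
        for source in sources:
            with open(source, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(destination).stat().st_size
```

For example, `concat_warcs(["a.warc.gz", "b.warc.gz"], "merged.warc.gz")` (file names are placeholders) writes both inputs back to back. Any .cdx that pointed into the original files no longer matches the merged file's offsets.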
21:01 🔗 grawity alard: ...hmm, wouldn't it damage the .cdx file then?
21:02 🔗 alard grawity: Yes, that's true, you'll have to generate a new cdx file.
21:05 🔗 grawity And, um, howdoIdothat?
21:06 🔗 alard Why do you want the cdx file?
21:07 🔗 alard There's this, for example: https://github.com/internetarchive/CDX-Writer , but that doesn't work for wget deduplication.
21:08 🔗 alard If you need the cdx for the wget deduplication option you can simply cat the cdx files together: wget doesn't use the file offset fields.
21:11 🔗 grawity alard: I don't, I've just read in some-or-other "how to grab stuff for ArchiveTeam using wget" article that it's better to include the .cdx
21:12 🔗 alard It isn't. I'd just throw it away if you don't need it. The Wget-generated .cdx is mostly useful for the Wget deduplication option. It probably won't work for the wayback machine, for instance.
21:12 🔗 DFJustin archive.org generates its own cdx when you upload an item
21:13 🔗 grawity Ah, good.
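
If the cdx is wanted only for Wget deduplication, alard's suggestion of concatenating the cdx files can be sketched as below. This goes one step beyond plain `cat` by keeping only the first field-legend line, on the assumption that each Wget-written cdx begins with a header line starting with " CDX"; that assumption and the helper name `merge_cdx` are not from the discussion. Plain concatenation, as alard describes, is the simpler route.

```python
def merge_cdx(sources, destination):
    """Concatenate CDX files, keeping only the first header line.

    Assumes (not confirmed in the log) that header lines start with
    b" CDX". File offset fields are copied unchanged, so the result is
    only suitable for uses that ignore offsets. Returns the number of
    lines written.
    """
    header_written = False
    written = 0
    with open(destination, "wb") as out:
        for source in sources:
            with open(source, "rb") as src:
                for line in src:
                    if line.startswith(b" CDX"):
                        if header_written:
                            continue  # skip repeated headers
                        header_written = True
                    if not line.endswith(b"\n"):
                        line += b"\n"  # keep entries on separate lines
                    out.write(line)
                    written += 1
    return written
```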
23:31 🔗 balrog so there are 124 users left to do
23:31 🔗 balrog punchfork users... are they all problematic?
