#archiveteam 2013-03-03,Sun

Time	Nickname	Message
03:45 ^🔗	omf_	is ftp the way to go for bulk upload to IA?
03:48 ^🔗	Lord_Nigh	ftp, sftp, rsync or rfc1149
03:49 ^🔗	omf_	are the rsync instructions on the IA wiki?
03:49 ^🔗	Lord_Nigh	i don't know
03:50 ^🔗	omf_	considering how slow my internet is rfc1149 might be good for some of the larger downloads
03:51 ^🔗	Famicoman	you could do s3 as well
03:51 ^🔗	omf_	I have the data local
03:52 ^🔗	omf_	oh you mean the s3 like api
03:52 ^🔗	Famicoman	that's the one
03:55 ^🔗	omf_	shit I cannot remember if I even have a wiki account
03:59 ^🔗	omf_	well I just signed up for an account and now I get some access but mainly access denied errors. This is all starting to come back to me
04:00 ^🔗	omf_	google was faster
04:02 ^🔗	omf_	I do not have permission to download ias3upload
04:04 ^🔗	omf_	I found a copy, not sure if it is as up to date but you take what you can get
05:16 ^🔗	jfranusic	Getting "No item received. Retrying after 30 seconds... " does this mean that Posterous finally blocked my IP?
05:18 ^🔗	robbiet48	jfranusic: are you @jf
05:18 ^🔗	jfranusic	yes
05:18 ^🔗	robbiet48	it's robbie, i'm the kid always wearing the twilio jacket who is friends with gene
05:18 ^🔗	jfranusic	Maybe I should make that my handle on IRC
05:18 ^🔗	robbiet48	anyway, that is because the tracker is suspended because Posterous was having uptime issues a few hours ago
05:19 ^🔗	robbiet48	we are waiting for the person managing the tracker to come back and unsuspend it
05:19 ^🔗	robbiet48	you may want to join #preposterus for more info
05:19 ^🔗	jfranusic	done!
05:20 ^🔗	jfranusic	... and, it looks like my IP is blocked anyway
05:21 ^🔗	robbiet48	jfranusic: yeah i dont know how you can tell if your IP is blocked or not, haven't had that issue yet since i just keep swapping out IPs
05:22 ^🔗	jfranusic	https://gist.github.com/jpf/5074632
05:22 ^🔗	robbiet48	ah yes
05:22 ^🔗	jfranusic	how are you swapping IPs?
05:22 ^🔗	jfranusic	just using Elastic IP's?
05:22 ^🔗	Cameron_D	try running `curl http://posterous.com/`; if it fails to connect you're blocked, if you get some HTML back you are good to go
05:23 ^🔗	jfranusic	Cameron_D: yeah, that's what's in the gist
05:24 ^🔗	Cameron_D	Yeah, decided to hit enter before checking the link
05:24 ^🔗	*	jfranusic laughs
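
The check Cameron_D describes can be sketched in Python roughly as follows. This is a minimal sketch, not the contents of the gist: the function name `looks_blocked` and the injectable `fetch` parameter are assumptions made for illustration, and a failed connection can have causes other than an IP block.

```python
import urllib.error
import urllib.request


def looks_blocked(url="http://posterous.com/", timeout=10, fetch=urllib.request.urlopen):
    """Return True if connecting fails or nothing comes back, False otherwise.

    Mirrors the curl check from the chat: no connection suggests a block,
    some HTML back means good to go. `fetch` is a hypothetical hook so the
    logic can be exercised without network access.
    """
    try:
        with fetch(url, timeout=timeout) as response:
            return not response.read()
    except urllib.error.HTTPError:
        # The server answered, even if with an error status, so we connected.
        return False
    except OSError:
        # Connection refused, reset, or timed out.
        return True
```

Calling `looks_blocked()` with no arguments performs the real request against the URL from the chat.
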
06:01 ^🔗	chronomex	jfranusic, robbiet48: the tracker is not handing out new tasks for the next few hours, it turns out we overloaded posterous
06:01 ^🔗	chronomex	we're letting them get their shit back in order
06:01 ^🔗	robbiet48	they said its back up a few hours ago
06:01 ^🔗	robbiet48	they tweeted it
06:02 ^🔗	chronomex	correct
06:02 ^🔗	robbiet48	ah so we are still waiting
06:02 ^🔗	robbiet48	ok
06:02 ^🔗	chronomex	we are playing it safe for the moment
06:02 ^🔗	robbiet48	sure
06:03 ^🔗	chronomex	it may actually not be possible to archive all of posterous, which would be a sad
06:03 ^🔗	robbiet48	:(
06:03 ^🔗	jfranusic	might not be possible because of their ip banning policy?
06:06 ^🔗	chronomex	well there are plenty of ips to go around
06:06 ^🔗	chronomex	they might not be able to render every page in time
06:06 ^🔗	chronomex	without going down
06:06 ^🔗	robbiet48	too bad posterous isn't ipv6 enabled
06:06 ^🔗	robbiet48	then we would never run out of addresses!
06:07 ^🔗	chronomex	ha
06:07 ^🔗	chronomex	pretty much
06:08 ^🔗	jfranusic	so, there's this: http://webcache.googleusercontent.com/search?q=cache:jf.posterous.com
06:09 ^🔗	robbiet48	we should just convince some twitter employee to look the other way and send us a database dump :)
06:09 ^🔗	jfranusic	I'm writing an email right now
06:09 ^🔗	robbiet48	problem is legality
06:09 ^🔗	jfranusic	explain
06:09 ^🔗	robbiet48	supposedly their ToS means they don't own the content
06:09 ^🔗	robbiet48	so they don't have license to give it to us
06:10 ^🔗	robbiet48	this is what someone said
06:10 ^🔗	atg	Eh they're sorrrrta giving it to us already with the webpages they render
06:10 ^🔗	jfranusic	presumably that issue was figured out once when Twitter bought them
06:10 ^🔗	jfranusic	maybe they could "sell" that part of the company to someone for $1, or something
06:11 ^🔗	robbiet48	iono
06:11 ^🔗	jfranusic	anyway
06:11 ^🔗	jfranusic	looks like the google cache has my posterous
06:12 ^🔗	jfranusic	presumably they have others, would it be worth it to "scrape" from the google cache?
06:14 ^🔗	chronomex	I talked to a twitter employee friend of mine
06:14 ^🔗	chronomex	he says that it's not a part of their normal infrastructure, so he couldn't even touch the boxes
06:17 ^🔗	mistym	I guess Posterous's TOS was different from Twitter's, given that Twitter has no problem giving their tweet archives to LC
06:20 ^🔗	*	chronomex shrugs
06:21 ^🔗	jfranusic	the google cache version of my posterous is nearly identical to the live version
06:21 ^🔗	jfranusic	it looks like google adds ~2 lines to their cached pages
06:22 ^🔗	Cameron_D	it is, but you don't get the additional data such as headers (which is needed to put it into the wayback)
06:23 ^🔗	Cameron_D	And Google's IP banning is even faster than Posterous' (usually happens within a few hundred requests)
06:27 ^🔗	jfranusic	ah! I didn't know that the wayback stores headers too. That's very cool
06:28 ^🔗	omf_	since normal users do not have access to many collections on IA to upload to, is the best practice to stick 'archiveteam' in the keywords and then it gets moved later
06:31 ^🔗	DFJustin	put it in community texts and either poke sketchcow or underscor in here if they are active or email jason if not
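
Combining Famicoman's suggestion of the S3-like API with DFJustin's advice, an upload request might be put together as in the sketch below. It only builds the request and does not send it. The function name, the item identifier, and the key arguments are hypothetical; the endpoint and the `x-archive-*` header names are assumptions based on the Internet Archive's S3-like API documentation and should be checked against the current docs. `opensource` is assumed to be the collection identifier behind "community texts".

```python
import urllib.parse
import urllib.request

# Assumed endpoint of the Internet Archive's S3-like API.
IA_S3_ENDPOINT = "https://s3.us.archive.org"


def build_upload_request(identifier, filename, data, access_key, secret_key):
    """Build (but do not send) a PUT request that uploads one file to an item."""
    url = f"{IA_S3_ENDPOINT}/{identifier}/{urllib.parse.quote(filename)}"
    headers = {
        # Assumed auth scheme: "LOW accesskey:secret".
        "authorization": f"LOW {access_key}:{secret_key}",
        # Create the item if it does not exist yet.
        "x-archive-auto-make-bucket": "1",
        # Community texts, with 'archiveteam' as a keyword so it can be moved later.
        "x-archive-meta-collection": "opensource",
        "x-archive-meta-mediatype": "texts",
        "x-archive-meta-subject": "archiveteam",
    }
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")
```

Passing the returned object to `urllib.request.urlopen` would perform the upload. For large files a streaming client is likely a better fit than holding `data` in memory.
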
09:26 ^🔗	omf_	Just a reminder. The current projects are #preposterus for posterous.com
09:27 ^🔗	omf_	#closedsolaris for opensolaris.org
09:27 ^🔗	omf_	and #ispygames for ugo, ign, 1up, gamespy
09:27 ^🔗	Cameron_D	#BurnTheMessenger for Yahoo Messages
09:28 ^🔗	omf_	yep I was just scrolling back to find the channel name
09:37 ^🔗	omf_	Are there any sites with mailing lists we need to fetch at present? I already got all of mail.opensolaris.org
15:21 ^🔗	grawity	Is it possible to merge two warc files?
15:22 ^🔗	omf_	Someone told me it was
15:22 ^🔗	omf_	There might already be code to do it too
15:22 ^🔗	ersi	grawity: Yes
15:22 ^🔗	grawity	wget only overwrites... `cat` seems to work, but I'm trying to be careful here
15:23 ^🔗	ersi	AFAIK just appending it to one should work and be valid. Depends on if you have a warcinfo record I guess
15:49 ^🔗	alard	grawity: Yes, if they're valid warc files cat will also make a valid warc file.
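
The `cat` approach alard confirms amounts to a byte-for-byte append, which can be sketched in Python as below. The function name `concat_warcs` is hypothetical, and the sketch assumes, as alard does, that every input is already a valid WARC file; it does no validation of its own.

```python
import shutil


def concat_warcs(sources, destination):
    """Append each source WARC to destination, byte for byte, like `cat a b > out`.

    The destination must not be one of the sources, since it is truncated first.
    """
    with open(destination, "wb") as out:
        for path in sources:
            with open(path, "rb") as src:
                # Binary copy: no decoding, no newline translation.
                shutil.copyfileobj(src, out)
```
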
21:01 ^🔗	grawity	alard: ...hmm, wouldn't it damage the .cdx file then?
21:02 ^🔗	alard	grawity: Yes, that's true, you'll have to generate a new cdx file.
21:05 ^🔗	grawity	And, um, howdoIdothat?
21:06 ^🔗	alard	Why do you want the cdx file?
21:07 ^🔗	alard	There's this, for example: https://github.com/internetarchive/CDX-Writer , but that doesn't work for wget deduplication.
21:08 ^🔗	alard	If you need the cdx for the wget deduplication option you can simply cat the cdx files together: wget doesn't use the file offset fields.
21:11 ^🔗	grawity	alard: I don't, I've just read in some-or-other "how to grab stuff for ArchiveTeam using wget" article that it's better to include the .cdx
21:12 ^🔗	alard	It isn't. I'd just throw it away if you don't need it. The Wget-generated .cdx is mostly useful for the Wget deduplication option. It probably won't work for the wayback machine, for instance.
21:12 ^🔗	DFJustin	archive.org generates its own cdx when you upload an item
21:13 ^🔗	grawity	Ah, good.
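
For the case where the .cdx is kept for Wget deduplication, alard's "cat the cdx files together" could look like the sketch below. Two things here go beyond what was said in the channel and are assumptions: the function name `merge_cdx`, and the choice to keep only the first header line (a line starting with ` CDX`) rather than repeating it for every input. The offset fields in the result no longer match the merged WARC, which per alard does not matter to wget.

```python
def merge_cdx(sources, destination):
    """Concatenate CDX files, keeping only the first ' CDX ...' header line."""
    header_written = False
    with open(destination, "wb") as out:
        for path in sources:
            with open(path, "rb") as src:
                for line in src:
                    if line.startswith(b" CDX"):
                        if header_written:
                            # Assumption: skip repeated header lines.
                            continue
                        header_written = True
                    if not line.endswith(b"\n"):
                        # Keep the last entry of one file separate from the next file.
                        line += b"\n"
                    out.write(line)
```
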
23:31 ^🔗	balrog	so there are 124 users left to do
23:31 ^🔗	balrog	punchfork users... are they all problematic?
