#archiveteam 2011-11-14,Mon

Time	Nickname	Message
00:00 ^🔗	marceloan	I`ll try to delete that folder and run get-wget-warc.sh again. How do we delete things? -_-` (In MS-DOS it`s del, but when I type this it says unknown command)
00:01 ^🔗	db48x	first look at the files in the directory
00:01 ^🔗	db48x	use ls -l
00:01 ^🔗	Wyatt\|Wor	marceloan: rm
00:02 ^🔗	Wyatt\|Wor	marceloan: This may be helpful: http://www.yolinux.com/TUTORIALS/unix_for_dos_users.html
00:02 ^🔗	marceloan	Thanks, it`ll help ?)
00:02 ^🔗	marceloan	:)
00:07 ^🔗	Wyatt\|Wor	No problem! Welcome to the bird side. ;)
00:08 ^🔗	Wyatt\|Wor	"A Linux terminal emulates an emulation of a terminal of an old DEC VT100 and somehow that's still FAR better than the Windows command prompt."
00:09 ^🔗	marceloan	ERROR contacting tracker. Could not mark `4Boa` done.
00:09 ^🔗	Qwerty0	Hey guys
00:10 ^🔗	Wyatt\|Wor	Hallo
00:11 ^🔗	marceloan	Well, it could download the file but couldn`t contact the tracker...
00:12 ^🔗	Qwerty0	Btw since Friendster I haven't been hanging around IRC much but I'm not sure how to get word of new projects
00:12 ^🔗	Qwerty0	Anyone know the best way to "receive the batsignal" when there's something going down?
00:13 ^🔗	Qwerty0	I've tried @archiveteam on twitter but I dunno if I'm missing things
00:14 ^🔗	DoubleJ	IRC's all I know of. @archiveteam seems to be what's been done, not what's going on.
00:15 ^🔗	Qwerty0	Hmm, well I guess I'll just try to pop in every now and then
00:15 ^🔗	Qwerty0	It'd be great if there was a mailing list or something
00:15 ^🔗	db48x	indeed
00:15 ^🔗	db48x	we've got three projects going at the moment
00:16 ^🔗	Wyatt\|Wor	Splinder, Anyhub...and?
00:16 ^🔗	db48x	mobileme
00:16 ^🔗	Wyatt\|Wor	Would the /topic be a good place for these?
00:16 ^🔗	db48x	http://memac.heroku.com/
00:17 ^🔗	DoubleJ	Possibly. Still has the "being on IRC" problem though.
00:17 ^🔗	Qwerty0	haha, oh great, mobileme. thanks apple
00:17 ^🔗	Wyatt\|Wor	DoubleJ: IRC logs help, no?
00:17 ^🔗	Qwerty0	DoubleJ: i think it'd still help
00:18 ^🔗	DoubleJ	Wyatt\|Wor: Apparently Qwerty0 hasn't been reading the logs, either :)
00:19 ^🔗	Qwerty0	nerp
00:19 ^🔗	DoubleJ	Qwerty0: Better than nothing, yes. But if someone's not on IRC for a while they can miss a lot. And depending on client, /topic only shows ~75 characters.
00:19 ^🔗	Qwerty0	I didn't think that'd be a good way to quickly find out the active projects
00:19 ^🔗	Wyatt\|Wor	To be fair, logs are rather hard for humans to parse without grep
00:20 ^🔗	Qwerty0	DoubleJ: oh totally, it's not the best solution
00:21 ^🔗	Wyatt\|Wor	Maybe leverage the wiki top page too?
00:21 ^🔗	Wyatt\|Wor	A "current projects" sidebar or such.
00:21 ^🔗	DoubleJ	I could see that working. I'd be concerned that a mailing list would be someone else's job and never get used. Easy to update the wiki page.
00:21 ^🔗	Qwerty0	but if it's the best there is, I'll probably start checking them
00:21 ^🔗	Qwerty0	yeah, and IRC chatter is usually 95% details without explaining the project itself, let alone the other active projects
00:22 ^🔗	Qwerty0	I'd find that extremely effective
00:22 ^🔗	Wyatt\|Wor	Lesson learned: if you're going to use pv, make sure pv is installed.
00:22 ^🔗	Qwerty0	But like all things it's dependent on interest in updating it
00:23 ^🔗	DoubleJ	Something simple, like a bulleted list: $PROJECTNAME: Ask $IRCHANDLE
00:23 ^🔗	DoubleJ	(or, $PROJECTNAME: See $CHANNEL)
00:23 ^🔗	Wyatt\|Wor	Add a wiki link to the relevant page too, IMO
00:24 ^🔗	DoubleJ	Agree.
00:26 ^🔗	Qwerty0	Of course, all I need is just a single notice that something's going on
00:26 ^🔗	Qwerty0	That'd be 98% of it for me
00:27 ^🔗	Qwerty0	I just don't have the time to always be on IRC, but I like to join the effort whenever there's a crisis
00:28 ^🔗	DoubleJ	Well there's always something happening. Some of it's just really fast and doesn't even get mentioned on-channel. I'd say anything that requires more than a few people would be good to put up.
00:29 ^🔗	Qwerty0	good point
00:50 ^🔗	underscor	A googlecode mailing list for "xxxxx is going down, see this" announcements would be pretty nice
00:54 ^🔗	Wyatt\|Wor	Oh, about the dld-client, is there a graceful way of killing it, or just send SIGHUP?
00:56 ^🔗	Wyatt\|Wor	I wonder how hard it would be to hook signal 10 or 11 and make it exit after the current child finishes?
00:56 ^🔗	Wyatt\|Wor	I haven't actually done any signal handling in bash.
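The signal idea in the question above can be sketched in shell with `trap`. This is only an illustration, not the dld-client's code: `run_jobs` and `stop_requested` are hypothetical names, and SIGUSR1 is picked as an example. The shell defers a trap until the foreground command finishes, so the current job completes before the flag is seen.

```shell
# Hypothetical sketch: finish the current job, then stop, once SIGUSR1 arrives.
stop_requested=0
trap 'stop_requested=1' USR1

run_jobs() {
  for job in "$@"; do
    # The flag is only checked between jobs, never in the middle of one.
    if [ "$stop_requested" -eq 1 ]; then
      echo "stop requested, not starting $job"
      return 0
    fi
    echo "processing $job"
    :   # stand-in for the real download of one user
  done
}
```

Under these assumptions, `kill -USR1 <pid of the script>` would ask a running loop to wind down after its current job.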
01:07 ^🔗	db48x	touch STOP
01:08 ^🔗	db48x	any running clients will notice and end after they finish their current job
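A minimal sketch of how a client loop might honour a `STOP` file, assuming the check happens between jobs as described above. The function name `run_until_stopped` and the job list are hypothetical; the real client asks the tracker for its next job.

```shell
# Hypothetical sketch of a client loop that checks for a STOP file.
# The job names passed in stand for usernames handed out by the tracker.
run_until_stopped() {
  workdir=$1
  shift
  for job in "$@"; do
    # Checked before each job, so a job already running is never interrupted.
    if [ -f "$workdir/STOP" ]; then
      echo "STOP found, exiting before $job"
      return 0
    fi
    echo "finished $job"   # stand-in for downloading one user
  done
}
```

Removing the `STOP` file before the next start lets the loop run again.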
01:08 ^🔗	underscor	- Discovering urls (JSON)... ERROR (6).
01:08 ^🔗	underscor	Downloading thefonzsays - Sun Nov 13 17:08:23 PST 2011
01:08 ^🔗	underscor	Downloading web.me.com/thefonzsays
01:08 ^🔗	underscor	alard: Getting next username from tracker... done.
01:09 ^🔗	Wyatt\|Wor	Oh. Well, okay then!
01:09 ^🔗	Wyatt\|Wor	Thanks
01:09 ^🔗	db48x	you're welcome
01:12 ^🔗	underscor	alard: If I run the curl command, I get a page that says "This account does not exist"
01:12 ^🔗	underscor	http://web.me.com/c32040821/?webdav-method=truthget&feedfmt=json&depth=Infinity
01:15 ^🔗	underscor	Oh, hmm, looks like a curl problem again
01:15 ^🔗	underscor	Nevermind
01:55 ^🔗	Paradoks	Qwerty0/Wyatt/DoubleJ: http://www.archiveteam.org/index.php?title=Projects -- I added a "Projects with BASH scripts that need more people running them" section. Mind you, it's not an automatic alert or anything.
01:56 ^🔗	Paradoks	I'm not sure what'd work better. I'll try to keep the page current, but I can't say I'll reliably be the most informed person.
01:56 ^🔗	Paradoks	Err, keep the section current. I have no idea how to usefully keep the entire page current.
02:00 ^🔗	Wyatt\|Wor	That's an interesting one. How do people normally schedule temporary alerts with expiration?
03:03 ^🔗	db48x	hrm
03:03 ^🔗	db48x	I'm getting lots of errors
03:03 ^🔗	db48x	Downloading 124 media files... done, with HTTP errors
03:06 ^🔗	marceloan	It's also happening to me
03:06 ^🔗	marceloan	- Downloading 4 media files... done, with HTTP errors.
03:07 ^🔗	db48x	looks like it's all 404 errors
03:07 ^🔗	db48x	but I didn't notice any before
03:08 ^🔗	db48x	mostly from files.splinder.com
03:11 ^🔗	db48x	perhaps they have been deleted
03:12 ^🔗	marceloan	What should we do about it
03:12 ^🔗	marceloan	?
03:14 ^🔗	Cameron_D	I just got some too
03:17 ^🔗	closure	hmm.. http://files.us.splinder.com/47e09e7aa78f749b5081204479d6a5c5.png is a 404 to wget, but shows up in a web browser, or with curl
03:18 ^🔗	closure	ok, I guess they serve a 404 followed by a dummy image
03:18 ^🔗	marceloan	Maybe they recognize the user agent?
03:18 ^🔗	closure	no, I tried changing it
03:20 ^🔗	Paradoks	I tried checking downforeveryoneorjustme.com , and it's down for them, too, so my guess is that it's not a response to Archive Team.
03:22 ^🔗	Paradoks	Going to http://www.us.splinder.com/ , not terribly many of the front page things have the thumbnails.
03:23 ^🔗	Paradoks	...though I'm not sure that's saying much of anything.
04:08 ^🔗	closure	alard: There is a problem with dashes for sure - Downloading blog from -------mydi------------.splinder.com... done, with network errors.
04:40 ^🔗	Wyatt\|Wor	Question about the heroku stats: How is the size calculated? Is it sending the size back?
04:40 ^🔗	db48x	yea, when the script reports that you've finished one, it sends along the size of the data
04:40 ^🔗	closure	I don't think it's accurate, cos I have 100 gb here
04:41 ^🔗	db48x	it only counts the size of the warc files
04:41 ^🔗	Wyatt\|Wor	Aaah, I see.
04:41 ^🔗	closure	still there's just a few logs otherwise
04:41 ^🔗	Wyatt\|Wor	I've got something like 3GB across a few machines, from what I can tell
04:42 ^🔗	Wyatt\|Wor	Still, this live stats thing is really cool
04:42 ^🔗	db48x	indeed
04:43 ^🔗	Wyatt\|Wor	Hey closure, you said you had something like a thousand threads running? How were you keeping load down at the 300 level?
04:43 ^🔗	closure	1000 was a few too many.. I dropped it to around 600-800 and got that load
04:43 ^🔗	db48x	$( ./du-helper.sh -bsc "${userdir}/"*"-media.warc.gz"
04:44 ^🔗	Wyatt\|Wor	Aah, so it looks like it does scale roughly linearly.
04:49 ^🔗	closure	otoh, I have 207 wgets running now and a load of 4
04:49 ^🔗	Wyatt\|Wor	Weird...
04:49 ^🔗	closure	some of them with large downloads get bogged down on the network and don't use much resources
04:50 ^🔗	Wyatt\|Wor	Amazon EC2?
04:50 ^🔗	closure	real hw
04:50 ^🔗	Wyatt\|Wor	Oh, nice
05:08 ^🔗	underscor	http://vimeo.com/32001208
05:38 ^🔗	yipdw	wow, I didn't know about anyhub
05:38 ^🔗	yipdw	websites need to stop dying
05:58 ^🔗	db48x	then what would we archive?
06:04 ^🔗	Wyatt\|Wor	All the things that never made it to the internet but are still in digital form.
06:06 ^🔗	yipdw	db48x: I have this grand vision of a future where, given a webapp that accepts user-generated content, you could plug in https://example.com/user.warc and get back 200 OK or 202 Accepted
06:06 ^🔗	yipdw	and either get back a WARC that was current of your request date or a URL to a location that you could check whilst the WARC was built
06:06 ^🔗	yipdw	and it'd be neat to help that out with library code
06:07 ^🔗	yipdw	(or 403 Forbidden, I guess, for private stuff)
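The status codes in this idea could drive a client roughly as follows. This is a hypothetical sketch: no such endpoint exists, and `warc_action` and the action names are invented for illustration.

```shell
# Hypothetical: map the HTTP status of a GET for .../user.warc to a client action.
warc_action() {
  case "$1" in
    200) echo "save-body" ;;      # the WARC itself is in the response
    202) echo "poll-later" ;;     # the WARC is still being built
    403) echo "skip-private" ;;   # private content
    *)   echo "unexpected" ;;
  esac
}
```

A caller might obtain the status with something like `curl -s -o user.warc -w '%{http_code}' https://example.com/user.warc` and pass it to `warc_action`.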
06:07 ^🔗	yipdw	well ok it's not that grand, but whatever
06:07 ^🔗	Wyatt\|Wor	It's not?
06:07 ^🔗	db48x	heh
06:08 ^🔗	yipdw	well it would probably be less of a load on websites than what we're doing now :P
06:11 ^🔗	Wyatt\|Wor	A future where people care about their data sounds pretty grand to me. :)
06:12 ^🔗	yipdw	yeah, or -- in this case -- a future where archiving of user data is common enough that there exists code out there to plug in to your app, tell it about archivable things and preferred formats, etc
06:12 ^🔗	yipdw	oh, and I guess you'd need an account-discovery protocol
06:12 ^🔗	yipdw	Accept: application/json; GET https://example.com/accounts or something
06:13 ^🔗	yipdw	nothing really groundbreaking
06:13 ^🔗	yipdw	hm
06:13 ^🔗	yipdw	I wonder how hard that'd be to integrate with a typical e.g. Rails app
06:13 ^🔗	yipdw	obviously, an archiver library can't auto-archive your models
06:13 ^🔗	yipdw	too much domain-specific knowledge there
06:14 ^🔗	yipdw	but the gruntwork of building the WARC, yeah, that can and should be standard
06:14 ^🔗	yipdw	hmmmm.
06:14 ^🔗	yipdw	if only we had a Rails app to try this out with
06:14 ^🔗	yipdw	oh hey wait, Diaspora!
06:15 ^🔗	yipdw	shit, that means I need to try to get it running again :(
06:15 ^🔗	yipdw	whoa.
06:15 ^🔗	yipdw	http://techcrunch.com/2011/11/13/diaspora-co-founder-ilya-zhitomirskiy-passes-away-at-21/
06:19 ^🔗	db48x	wow
06:19 ^🔗	yipdw	21, goddamn
06:19 ^🔗	yipdw	that's really terrible
06:19 ^🔗	chronomex	*22
06:19 ^🔗	yipdw	er, yeah
06:22 ^🔗	Wyatt\|Wor	I saw it earlier, but I'm still saddened to hear it again.
08:03 ^🔗	ersi	Holy fuck
08:16 ^🔗	db48x	ersi: what's up
08:16 ^🔗	db48x	?
08:20 ^🔗	ersi	I was "Holy fuck"ing @ Ilya
08:20 ^🔗	ersi	other than that, werk
08:44 ^🔗	db48x	ersi: indeed
08:44 ^🔗	db48x	ersi: surprising
09:01 ^🔗	db48x	hrm
09:01 ^🔗	db48x	the users/hour on splinder has dropped off quite a bit
13:07 ^🔗	Paradoks	Evidently the Splinder "HTTP errors" have spread to the blogs, now, too. Yet we're still getting SOME data.
13:09 ^🔗	alard	Paradoks: HTTP errors is nothing new, it's just that the script now tells you about them.
13:10 ^🔗	Paradoks	Oh. Okay. So, data-wise, we're getting as much stuff as we were a day ago?
13:12 ^🔗	alard	Yes. What happened before was that the script just ignored any HTTP error. Some images are not found, that's to be expected: not everyone has a profile image, for instance, and because the script generates new urls there is a chance that you'll get 404 errors.
13:13 ^🔗	alard	But it turned out that the US version sometimes returns 502 or 504 gateway errors, which isn't good. So the new version checks if wget found HTTP errors, then looks in the log to see if any of those are 502 or 504. If there are only harmless 404 errors it continues.
13:13 ^🔗	alard	If there was a 502 or 504 error, it retries the user.
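The retry rule described above might look roughly like this. It is a sketch, not the actual Splinder script: `should_retry` is a hypothetical name, and it assumes wget's exit status 8 (server issued an error response) and log lines of the form `ERROR 502: Bad Gateway.`

```shell
# Hypothetical sketch: decide whether a user must be retried, given wget's
# exit status and the path to its log file.
should_retry() {
  wget_status=$1
  log_file=$2
  # Status 8 is wget's "server issued an error response"; anything else is
  # outside the scope of this sketch and is treated as "no retry".
  [ "$wget_status" -eq 8 ] || return 1
  # 404s are harmless here; only 502/504 gateway errors trigger a retry.
  grep -Eq 'ERROR 50[24]' "$log_file"
}
```

The function returns success when the user should be retried, so it can be used directly in an `if`.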
13:15 ^🔗	Paradoks	Cool. Thanks for the info. There was some worry (during the time you were asleep, I think) that we were getting an increasing/excessive amount of 404s.
13:51 ^🔗	DoubleJ	Wow. Still downloading the blog that was up when I switched to the new scripts yesterday.
13:52 ^🔗	DoubleJ	It's still making new files, so I guess it's working. But jeez. Thing's been going for at least 18 hours now.
13:54 ^🔗	DoubleJ	Random request: Could the dashboard be changed to use • instead of &bullet;? The old Firefox I have at work doesn't understand the latter.
15:54 ^🔗	alard	SketchCow: Are you there?
17:28 ^🔗	yipdw	DoubleJ: I've got an EC2 instance that's been downloading splinder/Redazione for about 24 hours now
17:29 ^🔗	yipdw	it is, somehow, still making progress
17:29 ^🔗	yipdw	I guess it's because (1) the journal dates back to 2002 and (2) Splinder Italy is terribly bogged right now
17:48 ^🔗	PepsiMax	yipdw: Yeah, it seems to be huge!
17:56 ^🔗	ersi	that's what She said! Whooo!
18:06 ^🔗	yipdw	PepsiMax: 119 MB and counting
18:06 ^🔗	yipdw	also, weird, I just had git totally space out on origin/* pointers in a repo at work
18:07 ^🔗	yipdw	never seen that happen before
18:08 ^🔗	PepsiMax	how mature ersi :P
18:08 ^🔗	ersi	Sometimes that just burps right out of me
18:08 ^🔗	ersi	I think it's what's holding me alive, but I'm not sure!
18:09 ^🔗	PepsiMax	sudo apt-get upgrade your-life
18:09 ^🔗	PepsiMax	E: Unable to locate package your-life
18:09 ^🔗	PepsiMax	etc
18:11 ^🔗	ersi	Havn't you heard it's The Small Things In Life?
18:11 ^🔗	ersi	Atleast I'm able to enjoy myself >_>
18:11 ^🔗	PepsiMax	:-(
18:22 ^🔗	yipdw	dpkg: dependency problems prevent configuration of your-life
18:22 ^🔗	PepsiMax	requres the source: money
18:29 ^🔗	yipdw	I was going to make an alcoholism joke, but I guess that also works
18:51 ^🔗	PepsiMax	https://imgur.com/gallery/XspuW
18:58 ^🔗	yipdw	heh
18:58 ^🔗	yipdw	one of the anyhub WARCs I've got is just a bunch of BitTorrent files
18:58 ^🔗	yipdw	interesting way to get around the legal restrictions, I guess
18:58 ^🔗	yipdw	oh
18:58 ^🔗	yipdw	Content-Disposition: inline; filename=black-pro.exe;
19:04 ^🔗	PepsiMax	yipdw: yeah, it seems to be a lot of shady files...
19:04 ^🔗	PepsiMax	even found some DoS tools...
19:04 ^🔗	PepsiMax	I reported them to antivirus vendors, though.
19:05 ^🔗	PepsiMax	and i shred'd em
19:05 ^🔗	yipdw	er, what's the point of archiving if you're gonna shred them
19:05 ^🔗	PepsiMax	yipdw: I don't want loose exe files around.
19:06 ^🔗	yipdw	they aren't loose, they're in WARCs
19:06 ^🔗	PepsiMax	I found the rar files through the /stats page
19:06 ^🔗	PepsiMax	no.
19:06 ^🔗	PepsiMax	I don't look inside the gzips.
19:06 ^🔗	PepsiMax	http://www.anyhub.net/stats
19:07 ^🔗	PepsiMax	why would someone upload loose exes
19:07 ^🔗	yipdw	I don't know, but their intention is irrelevant, IMO
19:07 ^🔗	yipdw	I mean, from that page I can't even tell if it's actually a Windows PE file
19:08 ^🔗	yipdw	I think that if the intention is to archive anyhub's public files, then you might as well archive all of it
19:08 ^🔗	ersi	PepsiMax: You suck at archiving if you're deleting stuff
19:08 ^🔗	yipdw	the security experts and lawyers etc can pick apart the archive later
19:09 ^🔗	PepsiMax	Well
19:09 ^🔗	ersi	You won't find any zero day .exe's anyway, and there will be anti virii signatures for those lame RATs
19:09 ^🔗	PepsiMax	somehwere deep I do agree
19:09 ^🔗	PepsiMax	theses viruses would be a shame to lose.
19:09 ^🔗	PepsiMax	http://www.anyhub.net/stats
19:09 ^🔗	PepsiMax	some unknown stuff
19:09 ^🔗	PepsiMax	new stuff to submit :D
19:09 ^🔗	SketchCow	Sorry, I was in lala land, now here. What up.
19:10 ^🔗	ersi	SketchCow: PepsiMax's ranting about viruses in executables, how he's not going to archive them
19:10 ^🔗	ersi	blah blah
19:10 ^🔗	yipdw	PepsiMax: one thing to keep in mind is that, unless you've actually looked at the content of those files, you can't tell if it's even a virus
19:10 ^🔗	SketchCow	So archive them without him.
19:10 ^🔗	PepsiMax	ersi: well, i have 11,7GB of stuff ready. I'm moving them to a secure location.
19:11 ^🔗	PepsiMax	I'm, just ranting about how people exploit a great fileupload
19:11 ^🔗	PepsiMax	hurr
19:11 ^🔗	yipdw	PepsiMax: I mean, sure, there's a high probability that something named "LOIC.exe" is really the LOIC
19:11 ^🔗	yipdw	but who knows
19:11 ^🔗	SketchCow	Calm down, boys.
19:11 ^🔗	yipdw	filename isn't a criteria for deletion from an archive, IMO
19:11 ^🔗	PepsiMax	yipdw: thats why I don't open the gzips. Then I did not knew.
19:11 ^🔗	SketchCow	Is that all the current news?
19:11 ^🔗	PepsiMax	yipdw: PR0DDOS.EXE
19:12 ^🔗	yipdw	PepsiMax: again, I don't know :P
19:12 ^🔗	ersi	Might be a Disk Operating System, you don't know that.
19:12 ^🔗	PepsiMax	ProDoS v1.0 by 0v3rd0z3r aka SatansWrath
19:12 ^🔗	PepsiMax	I am NOT responsible for your actions, and what you do with this program. Education purposes only, thank you.
19:12 ^🔗	PepsiMax	Note: This program is twice as powerful then LOIC or ServerAttack, so be careful with it.
19:12 ^🔗	yipdw	I also think the danger is minimal, even if you gunzip the WARC and pipe it through less
19:12 ^🔗	yipdw	especially if you're on a UNIX system where PEs are pretty hard to execute
19:13 ^🔗	yipdw	it would be a good idea to run all this archiving stuff in a sandbox, though
19:13 ^🔗	PepsiMax	I don't care about executing. I care about owning them
19:13 ^🔗	yipdw	eh, well
19:14 ^🔗	yipdw	PepsiMax: if you could just let alard know the identifiers of the files that you shredded so that he can re-add them to the tracker, that'd be nice
19:14 ^🔗	yipdw	so that someone else can grab them.
19:14 ^🔗	PepsiMax	yipdw: i never shred'd any warc downloads.
19:14 ^🔗	db48x	sheesh, I woke up late
19:14 ^🔗	yipdw	then I must have misread 13:05:31 <PepsiMax> and i shred'd em
19:14 ^🔗	PepsiMax	Because I don't know whats inside them
19:15 ^🔗	PepsiMax	yes.
19:15 ^🔗	PepsiMax	the "pxF-pro-dos.rar"
19:15 ^🔗	yipdw	I'm confused
19:15 ^🔗	yipdw	how did you actually get that file if you never gunzipped the WARC
19:15 ^🔗	PepsiMax	Im not sure either.
19:15 ^🔗	PepsiMax	http://www.anyhub.net/stats
19:15 ^🔗	yipdw	did you download it manually?
19:16 ^🔗	PepsiMax	People use anyhub for shady files. Thats all.
19:16 ^🔗	yipdw	I know
19:16 ^🔗	PepsiMax	I'm just archiving.
19:16 ^🔗	yipdw	I'm just trying to figure out whether or not "and i shred'd them" applies to any of the stuff you downloaded
19:16 ^🔗	PepsiMax	yipdw: it doesn't. I didn't tamper with any warcs.
19:17 ^🔗	PepsiMax	We do not have time for that.
19:17 ^🔗	yipdw	ok cool
19:17 ^🔗	yipdw	cool, that's been sorted out
19:18 ^🔗	PepsiMax	How do get my 18GB to the web?
19:18 ^🔗	PepsiMax	Will we use internet archive?
19:19 ^🔗	ersi	We'll shred them when we've downloaded it all
19:19 ^🔗	ersi	Because it could be used for dangerous things :P
19:23 ^🔗	yipdw	reminds me of the various "shove all the biological and electronic pathogens to the Moon" plot devices in sci-fi novels
19:23 ^🔗	yipdw	e.g. 3001
19:27 ^🔗	PepsiMax	oh hi
19:27 ^🔗	marceloan	hi :)
19:34 ^🔗	alard	SketchCow?
19:35 ^🔗	closure	the moon? surely the sun
19:39 ^🔗	db48x	the moon is better
19:39 ^🔗	db48x	you can get stuff back
19:45 ^🔗	SketchCow	Yes
19:45 ^🔗	closure	in case the last anthrax spores are needed to fight invaders from another dimension, presumably
19:45 ^🔗	SketchCow	See, I live
19:45 ^🔗	alard	SketchCow: Hi!
19:45 ^🔗	alard	Perhaps it would be handy if you could set up something on batcave to rsync anyhub, splinder stuff to.
19:46 ^🔗	alard	Assuming you can handle that, of course.
19:46 ^🔗	alard	Perhaps one module where people can rsync to a subdirectory?
19:46 ^🔗	closure	fwiw, I have been sending some splinder stuff to my rsync on batcave -- and have 100 gb I plan to send there soon, as I'm getting low on disk
19:48 ^🔗	alard	There are multiple people with small to medium-sized chunks, it would be useful if we could point them somewhere.
19:49 ^🔗	SketchCow	How big is it.
19:49 ^🔗	SketchCow	(Just need to know)
19:51 ^🔗	alard	A guess: far less than 300GB from 'little people'? (Not the underscors, Coderjoes who rake in terabytes.)
19:51 ^🔗	SketchCow	We have about 14 terabytes free on batcave at the moment.
19:52 ^🔗	closure	it'll be around a terabyte all told, I suspect
19:52 ^🔗	alard	In total, MobileMe is currently at 3586GB; AnyHub is at 265 GB; Splinder at 120GB.
19:53 ^🔗	closure	and splinder is 15% or so done
19:53 ^🔗	alard	Individual downloaders have tens of GBs each.
19:54 ^🔗	ersi	Holy fuck that's some.. data.
19:54 ^🔗	closure	note these are du --apparent-size numbers, and with lots of small files, I see up to 2x what your tracker sees with regular du
19:54 ^🔗	SketchCow	Not too bad.
19:54 ^🔗	SketchCow	You heard about the disk thing with archive.org, right.
19:55 ^🔗	SketchCow	Slowdown of purchases until the Thailand situation clears up.
19:55 ^🔗	closure	yeah.. I hope this doesn't turn out to be like the 90's with ram
19:55 ^🔗	SketchCow	It's not just a standstill thing, because a number of drives die every day.
19:55 ^🔗	SketchCow	So they're using them just to stay afloat.
19:55 ^🔗	SketchCow	So we'll keep going, but a 200tb block would be significant right now.
19:56 ^🔗	ersi	200TB T_T
19:57 ^🔗	alard	MobileMe is staying until June next year, so that's not urgent.
20:01 ^🔗	yipdw	"Downloading it:volevochiamarmipuckmaeraoccupato profile"
20:01 ^🔗	yipdw	wtf
20:01 ^🔗	SketchCow	So, this Berlios thing.
20:01 ^🔗	SketchCow	What's the opinion.
20:10 ^🔗	closure	finish transfer and see what happens.
20:10 ^🔗	closure	I looked at their mailing list for the takeover and there were only a few posts. But I don't read German
20:14 ^🔗	alard	There's our German. :) But where's the mailing list?
20:16 ^🔗	closure	https://lists.berlios.de/pipermail/berlios-verein/2011-October/thread.html
20:16 ^🔗	closure	also last month
20:18 ^🔗	alard	The German things, as far as I've seen, come down to this: https://lists.berlios.de/pipermail/berlios-verein/2011-October/000006.html
20:18 ^🔗	alard	There will be an non-profit association that will continue Berlios.
20:19 ^🔗	alard	They're looking for volunteers to help. The association will be founded 'in November 2011'.
20:20 ^🔗	SketchCow	OK.
20:20 ^🔗	alard	Der bisher angefallene Hauptkostenblock waren Personalkosten,
20:20 ^🔗	SketchCow	So what do we think is the best way to present this stuff?
20:20 ^🔗	alard	The main cost was in personnel, which a volunteer association doesn't have.
20:20 ^🔗	SketchCow	Do I do a per-site archive?
20:20 ^🔗	alard	They're hoping to find hosting sponsors.
20:22 ^🔗	closure	SketchCow: ym for rsync?
20:22 ^🔗	SketchCow	ym?
20:23 ^🔗	closure	you mean
20:23 ^🔗	SketchCow	I mean when I put these into a collection on archive.org
20:23 ^🔗	SketchCow	Because I'm going to pull them off batcave.
20:23 ^🔗	closure	ah. well, for berlios, we have a nice division into per-repo directories, which could be separate archive.org items. I don't know how hard it would be to create thousands of items though
20:24 ^🔗	chronomex	easy peasy
20:24 ^🔗	SketchCow	Yeah.
20:24 ^🔗	SketchCow	So that's the smart way? I can do that.
20:24 ^🔗	closure	that way if a project needs their git repo they can get it without hunting thru some ginormous tarball
20:25 ^🔗	closure	otoh, I have no personal problem with a ginormous tarball either. really up to you dude
20:25 ^🔗	chronomex	I vote per-project item
20:25 ^🔗	chronomex	or maybe
20:25 ^🔗	chronomex	nah, yeah, collection for berlios, 1 item per project
20:27 ^🔗	closure	there are 12 thousand projects fyi
20:27 ^🔗	chronomex	understood.
20:28 ^🔗	SketchCow	How many mobileme accounts do we think there are?
20:29 ^🔗	chronomex	SketchCow: there are ~340,000 in the queue
20:29 ^🔗	chronomex	ummm, a few million?
20:34 ^🔗	closure	SketchCow: we've rsynced up untarred directories. I can try to write a script you could run that does another rsync to get recent activity and tars them up nicely for archival
20:35 ^🔗	SketchCow	It'd probably be better to write something that converts the uploaded directories into a nice package for per-package items
20:35 ^🔗	SketchCow	metadata extraction too
20:35 ^🔗	SketchCow	Then I can keep going
20:36 ^🔗	closure	what kind of package and metadata format do you have in mind?
20:37 ^🔗	SketchCow	Like, making each item a .tar.gz or whatever, and a .txt file with the name, author, whatever else comes with the entry.
20:37 ^🔗	SketchCow	So I can blow in that information into the item.
20:38 ^🔗	closure	absolutely. author info is a bit hard, since these are everything from git repositories to mailing list archives. But at least name and original rsync url I can do
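A rough sketch of the per-project packaging discussed above: one `.tar.gz` plus a small `.txt` carrying the name and original rsync URL. The function name, the metadata field names, and the directory layout are all assumptions made for illustration.

```shell
# Hypothetical sketch: turn one mirrored project directory into a tarball
# plus a small metadata text file.
package_project() {
  src_dir=$1      # e.g. mirror/projectname
  out_dir=$2      # where the .tar.gz and .txt should go
  rsync_url=$3    # where the data was originally fetched from
  name=$(basename "$src_dir")
  mkdir -p "$out_dir" || return 1
  # -C keeps parent directories out of the archive; ./ guards odd names.
  tar -czf "$out_dir/$name.tar.gz" -C "$(dirname "$src_dir")" "./$name" || return 1
  printf 'name: %s\nsource: %s\n' "$name" "$rsync_url" > "$out_dir/$name.txt"
}
```

Looping this over every project directory would yield one tarball and one metadata file per item.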
20:39 ^🔗	closure	I will develop it and get back to you
20:41 ^🔗	SketchCow	Whatever we can get.
21:06 ^🔗	Nemo_bis	I'm helping with Splinder (42 instances and didn't do it on purpose)
21:06 ^🔗	Nemo_bis	there's a user who keeps failing: http://p.defau.lt/?J4RlPPKettnFG0loB0eX2Q
21:07 ^🔗	PepsiMax	defau.lt?
21:07 ^🔗	Nemo_bis	looks like there's some extra dash in the URL: http://ladyvengeance.splinder.com/
21:07 ^🔗	Nemo_bis	PepsiMax, yep, a pastebin
21:08 ^🔗	PepsiMax	:D
21:08 ^🔗	PepsiMax	hmm
21:08 ^🔗	*	chronomex sees PepsiMax's eyes light up
21:08 ^🔗	PepsiMax	Well, I don't see anything. Ask closure/ alard/ yipdw/ ersi etc
21:09 ^🔗	yipdw	Nemo_bis: what's in the wget logs?
21:09 ^🔗	Nemo_bis	yipdw, where are they?
21:10 ^🔗	yipdw	http://-ladyvengeance-.splinder.com/ exists
21:10 ^🔗	yipdw	Nemo_bis: they'll be in data/-/-l/-la or some such
21:10 ^🔗	closure	yes, I've also seen the problem with directories starting with a dash. It makes the downloader loop forever
21:10 ^🔗	yipdw	hold on
21:10 ^🔗	*	yipdw will try to debug this
21:10 ^🔗	closure	something needs to use ./$dir instead of $dir
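The `./$dir` suggestion can be sketched as a small helper. `safe_path` is a hypothetical name; the point is only that a relative path starting with a dash is otherwise parsed as an option by most commands.

```shell
# Hypothetical helper: make a relative path safe to pass to commands even
# when it starts with a dash, by prefixing "./".
safe_path() {
  case "$1" in
    -*) printf './%s\n' "$1" ;;   # "-la" becomes "./-la"
    *)  printf '%s\n' "$1" ;;     # everything else is returned unchanged
  esac
}
```

An alternative for commands that support it is the `--` end-of-options marker, as in `host -- -ladyvengeance-.splinder.com`.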
21:10 ^🔗	Nemo_bis	i can't open that URl in my browser
21:11 ^🔗	yipdw	works here
21:11 ^🔗	Nemo_bis	"unknown hostname"
21:11 ^🔗	yipdw	on Firefox 8
21:11 ^🔗	Nemo_bis	(on wget)
21:11 ^🔗	Nemo_bis	a DNS problem? :-?
21:11 ^🔗	yipdw	https://gist.github.com/69346f7d072a4cdd4e77
21:12 ^🔗	closure	would not be surprised if some crap dns server doesn't like dashes at start either :)
21:12 ^🔗	yipdw	what DNS server are you using?
21:12 ^🔗	closure	actually, looks like chrome has a bug with it too :)
21:12 ^🔗	closure	my dns is ok, chrome shows a dns error though
21:12 ^🔗	yipdw	before that, though, let me try to download Crystailline's profile
21:13 ^🔗	Nemo_bis	Fastweb DNS
21:13 ^🔗	yipdw	closure: the rational response is clearly that Chrome sucks and that you should abandon it for Opera
21:13 ^🔗	yipdw	assuming slashdot is any indication of logic
21:13 ^🔗	Nemo_bis	do you want IPs to try and reproduce it?
21:13 ^🔗	closure	--2011-11-14 17:13:37-- http://-ladyvengeance-.splinder.com/
21:13 ^🔗	closure	Resolving -ladyvengeance-.splinder.com (-ladyvengeance-.splinder.com)... failed: Name or service not known.
21:13 ^🔗	closure	wget: unable to resolve host address `-ladyvengeance-.splinder.com'
21:13 ^🔗	yipdw	closure: quote it
21:13 ^🔗	yipdw	otherwise it'll be interpreted as an option
21:13 ^🔗	closure	it's not a quoting problem, I ran wget http://-lady...
21:14 ^🔗	closure	-ladyvengeance-.splinder.com is an alias for blog.splinder.com.
21:14 ^🔗	closure	and my dns is ok: host -- -ladyvengeance-.splinder.com
21:14 ^🔗	closure	blog.splinder.com has address 195.110.103.13
21:14 ^🔗	yipdw	did you run wget 'http://-lady...' or wget http://-lady...
21:14 ^🔗	closure	they're absolutely equivalent. I ran both though :P
21:14 ^🔗	yipdw	got me, then
21:14 ^🔗	Nemo_bis	me too, none worked
21:15 ^🔗	Nemo_bis	and DNS working here too
21:15 ^🔗	yipdw	all I can say is https://gist.github.com/69346f7d072a4cdd4e77
21:15 ^🔗	yipdw	I'm checking if the download scripts choke on dashes
21:17 ^🔗	Nemo_bis	different wget version? :-/
21:17 ^🔗	Nemo_bis	but curl doesn't work either
21:18 ^🔗	Nemo_bis	also, if it closes on the 24th, are we going to complete it or do we need additional downloaders? Splinder servers are already overloaded at least in Italian working hours, though
21:18 ^🔗	Nemo_bis	(according to the dashboard we'd need about 20 more days at this speed)
21:18 ^🔗	yipdw	Nemo_bis: possibly; here's some environment info
21:18 ^🔗	yipdw	https://gist.github.com/69346f7d072a4cdd4e77#file_a
21:19 ^🔗	Nemo_bis	yipdw, http://p.defau.lt/?RiePg540hWv_79Wi6ieKsA
21:20 ^🔗	yipdw	Nemo_bis: what's the IPs of the nameservers you're using?
21:20 ^🔗	Nemo_bis	wait a moment
21:20 ^🔗	Nemo_bis	yipdw, 62.101.93.101 / 83.103.25.250
21:21 ^🔗	Nemo_bis	I don't know if it works from "outside", though; my ISP is a bit nasty
21:21 ^🔗	marceloan	Try Google DNS or OpenDNS
21:22 ^🔗	Nemo_bis	marceloan, to debug or in general?
21:22 ^🔗	marceloan	In general
21:22 ^🔗	yipdw	ok, what
21:22 ^🔗	yipdw	Resolv::ResolvError: no address for www.google.com
21:22 ^🔗	yipdw	ruby-1.9.2-p290 :013 > r = Resolv.new([Resolv::DNS.new(:nameserver => '62.101.93.101')]); r.getaddress('www.google.com')
21:22 ^🔗	chronomex	host: convert UTF-8 textname to IDN encoding: prohibited character found
21:22 ^🔗	chronomex	interesting, on a BSD system I get:
21:22 ^🔗	yipdw	your ISP is run by shitheads
21:23 ^🔗	yipdw	just saying.
21:23 ^🔗	Nemo_bis	marceloan, I used namebenched and mine it's usually faster
21:23 ^🔗	Nemo_bis	yipdw, yes, they're a bit strict...
21:23 ^🔗	yipdw	Nemo_bis: no idea, then. try a different DNS cache
21:24 ^🔗	yipdw	if that fixes it, run your own :P
21:24 ^🔗	Nemo_bis	but are you sure that it is a DNS problem?
21:24 ^🔗	yipdw	I don't know
21:24 ^🔗	yipdw	but
21:24 ^🔗	Nemo_bis	host -- -ladyvengeance-.splinder.com
21:24 ^🔗	Nemo_bis	-ladyvengeance-.splinder.com is an alias for blog.splinder.com.
21:24 ^🔗	Nemo_bis	blog.splinder.com has address 195.110.103.13
21:25 ^🔗	yipdw	Nemo_bis: https://gist.github.com/ec5f5921bc65e7af9ed9
21:25 ^🔗	yipdw	if you can copy-and-paste the wget logs for Crystailline that'd help
21:25 ^🔗	yipdw	from that we can see just what error wget is encountering
21:25 ^🔗	Nemo_bis	ok
21:26 ^🔗	yipdw	they'll be in data/it/-/-l/-la/-ladyvengeance-
21:27 ^🔗	yipdw	that said, the archive I have for that blog is really, really small
21:27 ^🔗	yipdw	I think it's incomplete
21:28 ^🔗	yipdw	oh, yeah, it definitely is
21:28 ^🔗	yipdw	I shouldn't be seeing "La pagina richiesta non è stata trovata o non è disponibile. Controllare che l'indirizzo della pagina sia corretto."
21:28 ^🔗	Nemo_bis	there's no such directory
21:28 ^🔗	Nemo_bis	definitely
21:28 ^🔗	yipdw	er
21:28 ^🔗	yipdw	fuck.
21:28 ^🔗	yipdw	I'm retarded
21:29 ^🔗	yipdw	Nemo_bis: sorry. check it/C/Cr/Cry/Crystailline
21:29 ^🔗	Nemo_bis	did it already, couldn't find it
21:29 ^🔗	Nemo_bis	ah no
21:29 ^🔗	*	Nemo_bis facepalms
21:30 ^🔗	Nemo_bis	ok, so what do you need?
21:30 ^🔗	yipdw	the wget logs
21:30 ^🔗	Nemo_bis	all of them?
21:30 ^🔗	yipdw	wget-phase-3--ladyvengeance-.splinder.com.log to start with
21:31 ^🔗	Nemo_bis	ehm, just got deleted, I guess I have to stop the loop :-)
21:31 ^🔗	yipdw	oh, yeah
21:33 ^🔗	yipdw	jesus this Redazione blog is huge
21:33 ^🔗	*	Nemo_bis is not fast enough
21:34 ^🔗	yipdw	try ps ax | grep dld-client | cut -f 1 -d ' ' | xargs kill
21:34 ^🔗	yipdw	or touch STOP
21:35 ^🔗	Nemo_bis	let's start with http://p.defau.lt/?35BiHhyT9RU1YXT3B_fH6w
21:36 ^🔗	yipdw	which log is that?
21:36 ^🔗	yipdw	it looks like the log for lilithqueenoftheevil
21:36 ^🔗	Nemo_bis	3-lilithqueenoftheevil.splinder.com
21:36 ^🔗	Nemo_bis	yep
21:37 ^🔗	yipdw	ok, that looks fine
21:37 ^🔗	yipdw	how about the one for -ladyvengeance-
21:37 ^🔗	Nemo_bis	it's just wget-warc: impossibile risolvere l'indirizzo dell'host "-ladyvengeance-.splinder.com"
21:37 ^🔗	Nemo_bis	can't resolve hostname
21:38 ^🔗	yipdw	that sounds like DNS :P
21:38 ^🔗	Nemo_bis	but why does "host" work then? :-/
21:38 ^🔗	yipdw	what resolver is host using?
21:38 ^🔗	Nemo_bis	how can I know?
21:39 ^🔗	yipdw	host -v
21:40 ^🔗	Nemo_bis	http://p.defau.lt/?j_5vOVBVqVUFOhJY8BhCeQ
21:40 ^🔗	yipdw	well, that's awesome
21:44 ^🔗	Nemo_bis	yipdw, did you see that your ./wget-warc --version differs from mine in its -nls (I have +nls)?
21:44 ^🔗	yipdw	that might be it
21:44 ^🔗	yipdw	I'm not sure what NLS could do there, though
21:45 ^🔗	yipdw	unless it's punycode getting in the way
21:45 ^🔗	*	Nemo_bis has no idea what it is :-p
21:45 ^🔗	yipdw	er, wait
21:46 ^🔗	yipdw	hold on
21:46 ^🔗	yipdw	I'll recompile with nls
21:47 ^🔗	yipdw	what libraries do I need for that?
21:48 ^🔗	Nemo_bis	are you asking me? because I have no idea at all, obviously ^_^
21:52 ^🔗	yipdw	huh what the hell
21:52 ^🔗	yipdw	I just built it with NLS, and it blew up
21:52 ^🔗	yipdw	chronomex: I think you're on to something there
21:53 ^🔗	yipdw	where by "blew up" I mean "I can replicate the error"
21:53 ^🔗	Nemo_bis	I feel less lonely now
21:54 ^🔗	underscor	Can you guess where the vm went down?
21:54 ^🔗	underscor	http://tracker.archive.org/tracker.png
22:04 ^🔗	alard	yipdw / Nemo_bis: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=626472
22:05 ^🔗	alard	A dash as the first or last character in the host name is not allowed by the RFC, apparently.
22:05 ^🔗	yipdw	huh
22:05 ^🔗	yipdw	interesting
22:07 ^🔗	Nemo_bis	i knew only about leading dash
22:10 ^🔗	goekesmi	shockingly large quantities of software handle it anyway.
22:11 ^🔗	goekesmi	And by handle, I mean break in interesting and arbitrary ways.
22:12 ^🔗	yipdw	I like this part of the report
22:12 ^🔗	yipdw	"tcpdump/wireshark shows that the DNS query for foo-.tumblr.com does go
22:12 ^🔗	yipdw	out, and is answered with the CNAME and A (for
22:12 ^🔗	yipdw	proxy-tumblelogs.d1.tumblr.com), but gethostbyname just returns
22:12 ^🔗	yipdw	failure with errno set to EBADMSG."
22:12 ^🔗	yipdw	gethostbyname: "Fuck you, I know what's right"
22:12 ^🔗	yipdw	sigh
22:32 ^🔗	*	Nemo_bis going to bed, downloading non-dash users
22:47 ^🔗	db48x2	(1 167 730 + 169 544) / (169 544 / (3 days)) = 23.6624239 days
22:47 ^🔗	db48x2	7 days too long
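db48x2's estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming that 1 167 730 is the number of items still to do and 169 544 the number finished in the first 3 days (db48x2 confirms later only that the figures are for Splinder); the function name is made up for illustration:

```python
def total_days(todo: int, done: int, elapsed_days: float) -> float:
    """Project the total run time in days, assuming the rate stays constant."""
    rate_per_day = done / elapsed_days      # items finished per day so far
    return (todo + done) / rate_per_day     # all items at that rate


estimate = total_days(todo=1_167_730, done=169_544, elapsed_days=3)
print(f"{estimate:.4f} days")
```

Note that the figure counts from the start of the run; under the same assumptions, the time still remaining would be `todo / rate_per_day`.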
22:48 ^🔗	SketchCow	Hey, so gang
22:48 ^🔗	SketchCow	We've been offered SCRAPERSUNITED.COM and UNITEDSCRAPERS.COM as domains
22:49 ^🔗	SketchCow	Is this interesting, or do we want to stick with archiveteam.org/scrapers or whatever
22:49 ^🔗	closure	alard: re the - problem.. I think that in DNS, the RFC actually requires the first character be alphanumeric. Which is probably why some stuff breaks, it was written to the spec. And other stuff was not :)
22:51 ^🔗	closure	<label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
22:51 ^🔗	closure	from the RFC :)
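The grammar closure quotes is the RFC 1035 label rule. A rough Python sketch of a checker for it follows; the function name is hypothetical, and this is the strict RFC 1035 form (RFC 1123 later relaxed the first character to allow a digit). It could be used to set aside usernames such as `-ladyvengeance-` before handing them to a resolver that rejects them:

```python
import re

# <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
# i.e. starts with a letter, ends with a letter or digit,
# with letters, digits and hyphens allowed in between.
LABEL_RE = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_rfc1035_label(label: str) -> bool:
    """True if `label` fits the RFC 1035 grammar and the 63-octet limit."""
    return len(label) <= 63 and LABEL_RE.fullmatch(label) is not None


for name in ["lilithqueenoftheevil", "-ladyvengeance-", "foo-", "a"]:
    print(name, is_rfc1035_label(name))
```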
22:54 ^🔗	db48x2	SketchCow: they are interesting domains
22:55 ^🔗	db48x2	SketchCow: I think Archive Team is a superior name though
22:59 ^🔗	SketchCow	I think I agree. I will tell him to let them expire.
23:05 ^🔗	db48x2	hrm
23:05 ^🔗	db48x2	I'm signing up for dsl, and I just realized that I don't even know if I have a phone jack
23:06 ^🔗	db48x2	I mean, I would be really surprised if I didn't, but I can't recall ever seeing one...
23:16 ^🔗	underscor	SketchCow: I like scrapersunited, but I suppose it's a bit late
23:16 ^🔗	underscor	db48x: What're those stats for?
23:16 ^🔗	Paradoks	I'm a sucker for additional domain names, but I agree with db48x2. Beyond that, "scraper" seems tangential. Also, I'm not keen on having "rape" in domains. I mean, "SC Rapers United" probably isn't a common thing to see, but if someone is from South Carolina...
23:17 ^🔗	underscor	Paradoks: ...hahahaha
23:18 ^🔗	underscor	I still want archivete.am
23:18 ^🔗	underscor	holy shit, scrape.rs is available!
23:19 ^🔗	underscor	scrape.rs/united
23:20 ^🔗	DFJustin	is rs even a country
23:20 ^🔗	underscor	serbia
23:20 ^🔗	db48x2	underscor: splinder
23:21 ^🔗	underscor	oh, I should be able to pour a bunch more on that
23:21 ^🔗	underscor	once anyhub is done
23:21 ^🔗	underscor	which should be pretty soon
23:22 ^🔗	db48x2	good
23:23 ^🔗	underscor	currently pulling 65mbps from anyhub
23:26 ^🔗	db48x2	astounding
23:26 ^🔗	underscor	I'd get more, but I'm hitting the cpu limit of the host
23:27 ^🔗	underscor	(not even iowait, just the compression and encryption processes)
23:30 ^🔗	db48x2	heh
23:31 ^🔗	underscor	Man, these du's are incredibly painful
23:32 ^🔗	underscor	(the storage is mounted over sshfs, so it's not happy about it)
23:44 ^🔗	bsmith093	is there a way to do a text search inside a gz archive, without exploding it?
23:47 ^🔗	db48x2	gzcat file.gz | grep foo
23:48 ^🔗	bsmith093	omg really, is that filenames only or inside txt files too?
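On bsmith093's question: the pipeline searches the decompressed contents, since a plain `.gz` stream does not expose file names to `grep`. For a `.tar.gz` the decompressed stream is the tar archive itself, so both member names and file bodies would be matched. On many Linux systems the command is spelled `zcat` or `gzip -dc` rather than `gzcat`. A small Python sketch of the same streaming search, which never unpacks the file to disk; the function name and the demo file are made up for illustration:

```python
import gzip
import os
import tempfile


def grep_gz(path, needle):
    """Yield (line_number, line) for each decompressed line containing needle."""
    # gzip.open in text mode decompresses on the fly, line by line.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, 1):
            if needle in line:
                yield lineno, line.rstrip("\n")


# Demo on a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write("alpha\nfoo bar\nbaz\nmore foo\n")
    hits = list(grep_gz(path, "foo"))

print(hits)
```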