00:06 <db48x2> PatC: welcome!
00:06 <PatC> Thank you!
00:07 <dashcloud> hi guys, any chance someone could get in contact with this guy: http://libregraphicsworld.org/blog/entry/guitar-samples-in-gig-format-from-flame-studio-collection-shared and help him get those items on archive.org?
00:25 <underscor> dashcloud: Think uploading them and then just sending him a note is sufficient?
00:26 <underscor> Since we can download straight from the torrents
00:26 <dashcloud> sure
00:26 <dashcloud> thanks!
00:27 <bsmith093> the torrents have very few seeds
00:31 <underscor> Yeah, you're right, these are pretty underseeded
00:34 <PatC> Paradoks, I have to finish archiving some files from good.net (another week or so), then I can start helping with MobileMe
00:38 <Paradoks> PatC: Great. Thankfully, we have a bit of time with MobileMe. It's a ton of data, though.
00:38 <PatC> ya, isn't it 200+ TB?
00:39 <Paradoks> Yeah. Though Archive Team seems to deal in fractions of petabytes at times. Though less so with the current hard drive shortage.
00:40 <PatC> Definitely
00:40 <PatC> I was thinking of picking up a few 2 TB drives, but then that whole thing happened, and I can't get one for less than $150
00:43 <Paradoks> As someone mentioned previously, hopefully this isn't like the RAM shortages of yesteryear. I hope Thailand recovers quickly.
00:43 <PatC> ya..
00:51 <bsmith093> is this a waste of time? wget -mcpke robots=off -U="googlebot" www.fanfiction.net
00:52 <bsmith093> I'm trying to grab all of it, and I'd like to know if this is the right way to go about it
00:53 <Coderjoe> well, you should get the googlebot UA right. also, warc support would be nice
00:53 <bsmith093> so how do I do warcs, because I have it and just can't figure it out, and what's the google UA
00:53 <Coderjoe> (or kK)
00:54 <Coderjoe> well, googlebot 2.1 is: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
00:55 <chronomex> "ARCHIVEBOT FUCK YOU ALL" always works great
00:55 <Coderjoe> and if you have wget-warc (which you do if you helped with splinder), check the --help, as it has the warc options
00:55 <chronomex> and the old standby "EAT DELICIOUS POOP"
00:55 <bsmith093> so *that's* where the warc manual is
00:56 <bsmith093> also, is the syntax correct for setting the user agent?
00:57 <Coderjoe> no. it is -U "your preferred UA string here"
00:57 <Coderjoe> no =
00:57 <bsmith093> thanks
00:58 <Coderjoe> you use the = if you use the long format of --user-agent="your ua string here"
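[Aside: the two equivalent spellings Coderjoe describes, as a minimal sketch; the trailing "..." stands for the rest of the command.]

    wget -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ...
    wget --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ...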
00:58 <bsmith093> ok thanks, that was really confusing in the wget man pages
00:58 <bsmith093> now what's the wget warc command IA would like
00:59 <PatC> May I ask... what is warc?
00:59 <chronomex> it's an archive format that archive.org uses, contains a full description of an HTTP connection
00:59 <PatC> ah
00:59 <chronomex> request/response body, all headers, etc
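[Aside: for readers new to the format, an illustrative WARC/1.0 response record; every value below is made up.]

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://www.example.com/
    WARC-Date: 2011-11-22T00:59:00Z
    WARC-Record-ID: <urn:uuid:0f31e02e-0000-0000-0000-000000000000>
    Content-Type: application/http;msgtype=response
    Content-Length: 1234

    HTTP/1.1 200 OK
    (the raw response headers and body follow, byte for byte)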
01:09 <bsmith093> should I include cdx files too?
01:20 <dashcloud> guys, I told the splinder scripts to stop, but they haven't stopped yet - is there a quicker way to force them to quit?
01:21 <chronomex> control-c?
01:21 <chronomex> how much do you care?
01:22 <dashcloud> I don't want to lose anything, but I'm getting complaints about bandwidth from other people on my connection, so I'd like to get them finished up quickly
01:25 <chronomex> how many are you running? :)
01:26 <dashcloud> not that many I thought (150), but enough to seriously impact the wireless network in the house
01:27 <chronomex> ah, wifi. wifi doesn't like archivism.
01:27 <bsmith093> so should I include cdx files as well?
01:29 <PatC> is cdx also an archive.org format?
01:31 <chronomex> I have never heard of cdx before.
01:32 <bsmith093> neither have I, it's a warc option. so that's a no, then?
01:39 <bsmith093> how's this for the command: wget-warc -mcpkKe robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --warc-file=www.fanfiction.net/warc/warc www.fanfiction.net
01:42 <bsmith093> the upshot of this method is that the stories themselves can be had from a link list made up of the file paths of nearly every file in the folder, because by some miracle, that's how they're being saved
01:59 <db48x2> bsmith093: yes, do the cdx file
02:00 <db48x2> bsmith093: then, the next time you run this command wget will skip saving things to the warc if they already exist (and have the same content)
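[Aside: a sketch of that dedup workflow. --warc-cdx writes an index beside the WARC; stock wget (1.14 and later) pairs it with --warc-dedup so already-seen payloads become short "revisit" records. Whether this particular wget-warc build supports --warc-dedup is an assumption; check its --help.]

    # first run: produces fanfic.warc.gz and fanfic.cdx
    wget-warc -mpke robots=off --warc-file=fanfic --warc-cdx http://www.fanfiction.net/
    # later run: unchanged pages are recorded as revisits instead of full copies
    wget-warc -mpke robots=off --warc-file=fanfic2 --warc-cdx --warc-dedup=fanfic.cdx http://www.fanfiction.net/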
02:23 <Paradoks> I'll admit I'm only passingly familiar with Google Knol, Wave, and Gears. Are they the sorts of things we can usefully archive?
02:23 <chronomex> gears, no user content.
02:24 <chronomex> knol, dunno. wave, certainly.
02:24 <chronomex> I'm not sure what the utility of google wave data will be sans google wave
02:25 <PatC> Doesn't wave have privacy settings?
02:25 <chronomex> that too.
02:28 <PatC> Are things open by default? Because chances are people will leave default security settings
02:28 <SketchCow> 17:15 -!- Irssi: Starting query in EFNet with anonymous
02:28 <SketchCow> 17:15 <anonymous> May I have an rsync slot?
02:28 <SketchCow> 18:02 <anonymous> May I have an rsync slot?
02:28 <SketchCow> 18:48 -!- anonymous [webchat@216-164-62-141.c3-0.slvr-ubr1.lnh-slvr.md.cable.rcn.com] has quit [Quit: Page closed]
02:28 <SketchCow> 18:48 <anonymous> May I have an rsync slot?
02:28 <SketchCow> 18:52 <anonymous> May I have an rsync slot>
02:28 <SketchCow> 19:18 -!- anonymous [webchat@216-164-62-141.c3-0.slvr-ubr1.lnh-slvr.md.cable.rcn.com] has quit [Ping timeout: 260 seconds]
02:28 <SketchCow> That's just not a good way to do it.
02:28 <PatC> :/
02:29 <SketchCow> It's called, Jason was out shopping
02:29 <PatC> Oh, your Jason S. ?
02:30 <SketchCow> Yep, my Jason S.
02:31 <chronomex> jason with the cat.
02:31 <PatC> SketchCow, I was 'Pat' from the Google hangout a few nights ago
02:31 <PatC> [20:33:09] <gmnevo> Wierd
02:31 <PatC> oops
02:31 <SketchCow> Hey
02:31 <PatC> (PuTTY right-click paste, sorry)
02:32 <PatC> SketchCow, I got an external hard drive dock, as you suggested. Very nice to have
02:34 <SketchCow> Excellent
02:34 <closure> pity about Wave. Bet there's some very interesting and historic content in there
02:34 <SketchCow> Agree
02:34 <SketchCow> Also putty: My weblog has gotten the white screen of death
02:36 <SketchCow> No idea why
02:36 <PatC> Have you guys seen these little storage boxes? http://usb.brando.com/hdd-paper-storage-box-with-cover-5-bay-_p00962c044d15.html
02:37 <PatC> They seem to be a good, cheap way of storing hard drives
02:56 <closure> Knol sounds like it has content, dunno if it's publicly available
02:59 <PatC> SketchCow, what is your blog url?
03:06 <closure> http://wayback.archive.org/web/*/http://ascii.textfiles.com
03:08 <PatC> Thank you closure
03:11 <chronomex> that is one way to get there
03:11
🔗
|
bsmith093 |
this is the command im now using ben@ben-laptop:~$ wget-warc -mpkKe robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --warc-cdx --warc-file=www.fanfiction.net/warc/warc www.fanfiction.net |
03:12
🔗
|
bsmith093 |
and this is the current output Opening WARC file `www.fanfiction.net/warc/warc.warc.gz'. Error opening WARC file `www.fanfiction.net/warc/warc.warc.gz'. Could not open WARC file. |
03:12
🔗
|
PatC |
bsmith093, that is the filepath? |
03:13
🔗
|
bsmith093 |
yes i deleted the stuff wget had akready downloaded for the sake of doing it right, and recreated the www.fanfiction.net folder for the warc file |
03:13
🔗
|
bsmith093 |
oh wait never mind hold on |
03:14
🔗
|
bsmith093 |
yeah it cant find the warc file because its not there because its supposed to creat it? |
03:16
🔗
|
PatC |
if you 'ls' in the www.fanfiction.com/warc/ folder is there the warc.warc.gz file? |
03:18
🔗
|
bsmith093 |
no its empty |
03:18
🔗
|
bsmith093 |
i havent stared the job yet, and this only happened after i added the --warc-cdx option |
03:19
🔗
|
PatC |
ah |
03:19
🔗
|
bsmith093 |
w/o it its fine |
03:19
🔗
|
bsmith093 |
do i do the cdx later after the job is done as an update to it |
03:19
🔗
|
PatC |
I'm not sure what --warc-cdx does so I can't help you with that, sorry |
03:21
🔗
|
bsmith093 |
db48x2: do i run without the cdx option for th e first donload and then everytime after with the cdx to update? |
03:25
🔗
|
bsmith093 |
im an idiot fanfiction.net not .com had the wrong filepath, all fixed now, runs great with cdx option set |
03:25
🔗
|
PatC |
ah, what would do it |
03:36
🔗
|
PatC |
Wait, i'm sorry, I was the person who mentioned .com |
03:49
🔗
|
SketchCow |
Works |
03:51
🔗
|
PatC |
nice! |
03:51
🔗
|
PatC |
What was the problem? |
03:57
🔗
|
db48x2 |
bsmith096: use the cdx option every time |
04:01
🔗
|
SketchCow |
I overlaid a new install of wordpress |
04:01
🔗
|
SketchCow |
THANKS WORDPRESS |
04:01
🔗
|
SketchCow |
THE RESET YOUR ROUTER OF REPAIRS |
04:08
🔗
|
PatC |
SketchCow, congrats on your kickstarter! I can't wait to see the end result :) |
04:14
🔗
|
bsmith096 |
SketchCow: does the wget warc create the warc file at the very end, because im running one of my own download jobs, and its not creating it yet |
04:21
🔗
|
SketchCow |
alard can tell you |
04:21
🔗
|
SketchCow |
I don't know actually. |
04:27
🔗
|
bsmith096 |
alard: does the wget warc create the warc file at the very end, because im running one of my own download jobs, and its not creating it yet |
04:28
🔗
|
bsmith096 |
documentary question, how did you decide to focus on the 6502 chip? |
04:30 <yipdw^> bsmith096: in wget 1.13.4-2574, the WARC is built as each file is downloaded
04:31 <yipdw^> if you have a different version, the behavior may be different
04:31 <chronomex> bsmith096: imo, the 6502 is the most widely used 8-bit processor for hobbyists during a certain time.
04:33 <chronomex> it also retains some lasting appeal ... the z80 is still used, but industrially and in gameboys, etc, so it's different.
04:34 <bsmith096> yipdw^: but the file doesn't exist yet, so is it building it in RAM or something and then writing at the end? because the path where I told wget warc to put the warc file is currently empty
04:43 <yipdw^> bsmith096: dunno. maybe building an index file changes the behavior
04:43 <Coderjoe> no
04:48 <Coderjoe> i think it may have failed to open the warc file because the directory didn't exist
04:48 <yipdw^> oh
04:51 <yipdw^> or that
04:52 <Coderjoe> also, may I suggest a better base filename than "warc"
04:53 <Coderjoe> (since it will have .warc or .warc.gz added to the end already)
04:53 <yipdw^> also, I have archived Kohl's terrible Black Friday ad, in the event that they develop a sense of shame
04:53 <Coderjoe> may I suggest something like "www.fanfiction.com_20111122" or the like
04:53 <Coderjoe> oh?
04:53 <Coderjoe> I've not looked at any BF ads.
04:53 <yipdw^> http://www.youtube.com/watch?v=vGiQzPi0f_E
04:53 <yipdw^> a friend linked me to it
04:54 <yipdw^> I feel sad
04:54 <Coderjoe> her, SHIT
04:54 * PatC shreds his screen
04:55 <Coderjoe> er...
04:55 <Coderjoe> after the related vids loaded, along with the description, the pieces fell together...
04:57 <Coderjoe> and THEN the video started playing
05:01 <yipdw^> I didn't get the comments or the video info page, though
05:01 <Coderjoe> not completely positive, but I think Rebecca is the woman in red: the one getting into the store after the woman singing, and the one she steals the item from and then flips off
05:01 <Coderjoe> baha. did you catch the very end of the video?
05:02 <bsmith096> Coderjoe: apparently the warc file has to be in the folder wget is saving to, not under it. also took your advice and renamed it, and now restarted the job, and the warc and cdx now exist, so whoo!
05:02 <Coderjoe> it can be in a different directory, but the directory has to exist at the start of the job
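[Aside: a minimal fix for that failure mode, using the paths from the chat; the directory in the --warc-file prefix must exist before wget starts, since wget won't create it. The "..." stands for the rest of the options.]

    mkdir -p www.fanfiction.net
    wget-warc --warc-cdx --warc-file=www.fanfiction.net/fanfiction_20111122 ... www.fanfiction.net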
05:02
🔗
|
bsmith096 |
well its working now so yay |
05:04
🔗
|
bsmith096 |
it's running beautifully |
05:37
🔗
|
chronomex |
okay guys. |
05:38
🔗
|
chronomex |
http://colour-recovery.wikispaces.com/Full+gamut+colour+recovery these guys are working on recovering color tv |
05:38
🔗
|
chronomex |
starting with scans of film that was exposed in a machine that records tv onto film |
05:38
🔗
|
chronomex |
they're re-rectangularizing the frames and extracting color from the dot crawl you get on b/w tv |
05:38
🔗
|
chronomex |
it's fucking impressive. |
05:43
🔗
|
SketchCow |
I downloaded all the the ROFLcon summit videos. |
05:43
🔗
|
SketchCow |
Including myself. |
05:43
🔗
|
SketchCow |
http://www.archive.org/details/roflconsummit-cpw |
05:52
🔗
|
underscor |
Brewster and you?! |
05:55
🔗
|
Coderjoe |
chronomex: someone was telling me about that, restoring color to old kinescopes of doctor who |
05:55
🔗
|
chronomex |
yeah |
05:55
🔗
|
chronomex |
it's cool shit. |
05:56
🔗
|
chronomex |
modern signal processing is kind of magic |
06:02
🔗
|
Coderjoe |
i stil mildly dislike the old heads at the beeb. the ones that decided to destroy their archives. |
06:02
🔗
|
chronomex |
they had reasons |
06:11
🔗
|
SketchCow |
Oh, don't justify them. |
06:27
🔗
|
chronomex |
I didn't say they were good reasons |
06:27
🔗
|
chronomex |
but they didn't just say HURRDURR BALEET |
06:37
🔗
|
SketchCow |
Someone did |
06:37
🔗
|
SketchCow |
I believe it was TALLY BALEET |
06:37
🔗
|
db48x |
lol |
06:45
🔗
|
chronomex |
this is horrible. I'm moving into a smaller space and have to winnow :( |
06:45
🔗
|
SketchCow |
Noooooo |
06:46
🔗
|
chronomex |
is why I have some understanding for what happened at the BBC |
06:46
🔗
|
chronomex |
mostly I have a bunch of paper detritus |
06:55
🔗
|
arima |
SketchCow: can I get an rsync slot? |
07:28
🔗
|
Nemo_bis |
what does it mean if I have over 110 instances of the splinder downloader running but only 70 instances of wget? |
07:29
🔗
|
Nemo_bis |
there can't be 40 in the process of parsing the profile pages because there's no such python process currently |
07:45
🔗
|
Coderjoe |
Nemo_bis: how high is your load avg? what does your memory usage look like? there can be a number trying to parse if there is heavy disk IO, or you're short on ram and not able to cache stuff for log, or swapping, etc |
07:50
🔗
|
dnova |
this is funny; we're down to the bigger profiles (not counting the ones alard has not put back into the queue) |
07:54
🔗
|
Coderjoe |
hah |
07:54
🔗
|
Coderjoe |
it:MagicaHermione |
07:56
🔗
|
Nemo_bis |
Coderjoe, I'm using all memory and disk load is very high, but there's nothing to parse, it's been downloading the same big users for days |
07:58
🔗
|
dnova |
there seem to be a small amount of extreme splinder users |
07:58
🔗
|
Coderjoe |
weird |
07:59
🔗
|
Coderjoe |
just had the same downloader-user pair show up twice on the dashboard |
07:59
🔗
|
dnova |
yeah I saw that too! |
07:59
🔗
|
dnova |
Nemo it:Luley 0MB |
07:59
🔗
|
dnova |
Nemo it:Luley 0MB |
07:59
🔗
|
Nemo_bis |
hm |
08:00
🔗
|
Nemo_bis |
I was just looking at that user |
08:01
🔗
|
Nemo_bis |
uh, there's some user being downloaded at 10 Mb/s, what a joy |
08:04
🔗
|
Nemo_bis |
machine is too slow to open browser now, sorry if I flood the channel |
08:04
🔗
|
Nemo_bis |
- Downloading profile HTML pages... done. |
08:04
🔗
|
Nemo_bis |
- Parsing profile HTML to extract media urls... done. |
08:04
🔗
|
Nemo_bis |
Deleting incomplete result for it:Luley |
08:04
🔗
|
Nemo_bis |
Downloading it:Luley profile |
08:04
🔗
|
Nemo_bis |
it:Luley contains 502 or 504 errors, needs to be fixed. |
08:04
🔗
|
Nemo_bis |
- Downloading 4 media files... done, with HTTP errors. |
08:04
🔗
|
Nemo_bis |
- Checking for important 502, 504 errors... none found. |
08:04
🔗
|
Nemo_bis |
- Result: 134K |
08:04
🔗
|
Nemo_bis |
rm: impossibile rimuovere "data/it/L/Lu/Lul/Luley/.incomplete": File o directory non esistente |
08:04
🔗
|
Nemo_bis |
Telling tracker that 'it:Luley' is done. |
08:04
🔗
|
Nemo_bis |
blogs.txt media-urls.txt splinder.com-Luley-html.warc.gz splinder.com-Luley-media.warc.gz wget-phase-1.log wget-phase-2.log |
08:04
🔗
|
Nemo_bis |
ls data/it/L/Lu/Lul/Luley |
08:05 <Coderjoe> you happened to have two threads doing the same user? friggin awesome :-\
08:05 <Nemo_bis> no, I don't think so
08:05 <Coderjoe> it would explain the double-complete and that rm error
08:06 <Nemo_bis> I think that for some reason it marked it complete two times, the second failing to remove the .incomplete (because it was already deleted) but telling the tracker two times
08:06 <Coderjoe> (one thread deleted the .incomplete file before the other)
08:07 <Nemo_bis> perhaps due to disk load it queued the "remove and tell the tracker" step two times? (I don't know how it works)
08:07 <Coderjoe> no, it wouldn't have done that
08:07 <Nemo_bis> ah, you're right
08:07 <Coderjoe> unless there were two dld-single.sh instances working on the same user at the same time
08:09 <Nemo_bis> hm, this shouldn't happen
08:09 <Coderjoe> no idea how that would have happened, unless you stopped dld-streamer.sh with ^c and didn't actually stop all the children, and then started downloaders through another means
08:09 <Nemo_bis> no, I'm running two instances of fix-dld
08:09 <Coderjoe> oh
08:09 <Coderjoe> don't do that
08:10 <Nemo_bis> because there are too many users to fix and it's very slow, and they shouldn't conflict
08:10 <Coderjoe> they will conflict
08:10 <Nemo_bis> but the first which finds a user adds .incomplete and the second ignores it
08:11 <Nemo_bis> it's always worked :-/
08:12 <Coderjoe> only because you were lucky before on the race conditions
08:12 <Nemo_bis> but how can they conflict? doesn't the fixer add the .incomplete mark as soon as it starts working on a user?
08:12 <Coderjoe> yes, but if one got past that check just before the other makes the file, they will BOTH think they can work on that user
08:13 <Coderjoe> like I said, RACE CONDITIONS
08:13 <Coderjoe> checking for the file and creating the file are not atomic operations
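[Aside: a sketch of Coderjoe's point, not code from fix-dld. A test-then-create sequence leaves a window in which a second process passes the same test; an atomic primitive closes it.]

    # racy: another process can slip in between the test and the touch
    if [ ! -e "$dir/.incomplete" ]; then
        touch "$dir/.incomplete"
        # ... both processes may end up here for the same user ...
    fi

    # atomic: mkdir tests and creates in a single step, failing if it exists
    if mkdir "$dir/.lock" 2>/dev/null; then
        # ... exactly one process gets here ...
        rmdir "$dir/.lock"
    fi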
08:13
🔗
|
Nemo_bis |
yes, I understand, but... I thought that writing an empty file wouldn't be a problem. It happened 4 times though, stopping now... |
08:14
🔗
|
Nemo_bis |
Before us.splinder.com went down I was doing one for it users and one for us user |
08:15
🔗
|
Coderjoe |
fix-dld doesn't have a means of specifying which country to work on |
08:16
🔗
|
Nemo_bis |
I just modify the script |
08:17
🔗
|
Coderjoe |
if you look at the script, there is a rather large time gap between checking for .incomplete and creating it |
08:17
🔗
|
Coderjoe |
(the grep) |
08:18
🔗
|
Nemo_bis |
ah |
08:19
🔗
|
Nemo_bis |
indeed |
08:20
🔗
|
Nemo_bis |
although in this was it checks for way more .incomplete than needed and it saves just a few greppings |
08:21
🔗
|
Coderjoe |
eh? I can't quite parse that. (I should also probably go to bed soon as well) |
08:24
🔗
|
Coderjoe |
if you wanted to make it more parallel, the way to go about it would be to write out the list of need-redo users to a file and then (carefully) modify dld-streamer.sh to read from that file instead of asking the tracker. (but do not run that dld-streamer until your fix script is done identifying all users to redo) |
08:25
🔗
|
Coderjoe |
(I think this approach is a bit easier and less error prone than trying to bring in all the job control logic from dld-streamer into fix-dld) |
08:29
🔗
|
Nemo_bis |
yes, but I don't have *that* many users to fix |
08:30
🔗
|
Coderjoe |
right, but you wanted to run the downloads in parallel. fix-dld was not written to support running multiple instances on the same data directory at the same time |
08:30
🔗
|
Nemo_bis |
what I meant is that you usually put the less expensive and more effective condition first, so it makes sense that fix-dld checks for .incomplete as first thing, but it's not actually needed if the user doesn't need to be fixed |
08:31
🔗
|
Nemo_bis |
yep, I just gave up, I'll be patient :-p |
08:31
🔗
|
Coderjoe |
the check for .incomplete was to allow fix to be run while downloaders are also running |
08:31
🔗
|
Coderjoe |
(as stated in the comment above the check) |
08:31
🔗
|
Nemo_bis |
so perhaps it could be more efficient to check for the .inomplete after grep |
08:31
🔗
|
Nemo_bis |
yes, I meant ^ |
08:32
🔗
|
Schbirid |
is there a way to create a new git repo and upload files using the website at github? |
08:32 <Schbirid> wait, nevermind
08:35 <Schbirid> i forgot i published it already ( https://github.com/SpiritQuaddicted/sourceforge-file-download ) :D
08:44 <Schbirid> hm, moddb might be a worthy target
08:47 <SketchCow> http://ascii.textfiles.com/archives/3395
08:49 <chronomex> SketchCow: you left his mail address in once, dunno if that was intentional
08:49 <SketchCow> Un.
08:50 <SketchCow> Fixing.
08:50 <SketchCow> Fixed.
08:50 * Schbirid posts unredacted version to archive.org
08:52 <Schbirid> just kidding
08:52 <chronomex> hm. google suggests that someone with that email address is into penny stocks
08:53 <chronomex> interesting.
08:53 <Nemo_bis> this makes me think of some emails in my inbox... :"(
08:54 <Nemo_bis> oh well, gotta go
08:54 <chronomex> Once I sent an email in 2004 and got a response in 2009.
08:54 <chronomex> These things happen.
08:54 <chronomex> actually 2008
08:54 <Schbirid> those things rock
08:55 <chronomex> so I waited until 2011 to reply back.
08:55 <Schbirid> sounds like my astonishing talent for conversation with girls i fancy
08:56 <chronomex> ?
08:56 <Schbirid> gets hit on in 2004, realises in 2008
08:56 <Schbirid> :D
08:56 <chronomex> smrt.
09:01 <SketchCow> Sleep with in 1996, realize she was using you in 2007
09:01 <Schbirid> )
09:02 <Schbirid> :)
09:02 <SketchCow> This was just the therapy to be able to reach some sort of thing with this guy.
09:02 <SketchCow> Send him his drive back.
09:02 <SketchCow> be done with it.
09:02 <SketchCow> Now, he perceives a crime worthy of monetary damages.
09:02 <SketchCow> That's interesting.
09:11 <SketchCow> I'd like to see that court case.
11:41 <Schbirid> opera makes it much too easy to collect too many open tabs over the week
11:41
🔗
|
emijrp |
google knol is closing |
11:43
🔗
|
db48x |
indeed |
11:43
🔗
|
db48x |
can we save it? |
11:43
🔗
|
emijrp |
DUDE. |
11:45
🔗
|
db48x |
hmm |
11:45
🔗
|
db48x |
firefox just crashed |
11:45
🔗
|
db48x |
it doesn't usually do that |
11:45
🔗
|
emijrp |
http://googleblog.blogspot.com/2011/11/more-spring-cleaning-out-of-season.html |
11:48
🔗
|
emijrp |
Google is now on the Archive Team black list. |
11:51
🔗
|
db48x |
well, we have five months |
11:52
🔗
|
chronomex |
google joins yahoo in the Bad Decision Club |
12:15
🔗
|
ZoeB |
hi! does anyone know how to get curl to play nicely with Yahoo!'s login screen? |
12:16
🔗
|
chronomex |
hm. yahoo's login screen barely works for me in a real web browser :P |
12:16
🔗
|
ZoeB |
I've got as far as trying the following: |
12:16
🔗
|
ZoeB |
curl -c cookie.txt -d "login=zoeblade&passwd=foo&submit=Sign In" https://login.yahoo.com/config/login |
12:16
🔗
|
ZoeB |
curl -b cookie.txt -A "Mozilla/4.0" -O http://launch.groups.yahoo.com/group/foo/message/[1-23217]?source=1 |
12:16
🔗
|
chronomex |
I'd suggest looking at the program 'fetchyahoo' which downloads mail from a yahoo account |
12:16
🔗
|
chronomex |
that has some rather robust login code |
12:17
🔗
|
ZoeB |
but although it does save a cookies file, it keeps on trying to redirect me to the login screen, so presumably I haven't logged in correctly |
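[Aside: a hedged guess at the redirect loop. Yahoo's login form of this era carried hidden fields (challenge tokens and the like) that had to be scraped from the login page and posted back; sending only login/passwd usually fails. The field name below is an illustrative assumption, not Yahoo's documented interface.]

    # 1. fetch the login page, saving cookies and the HTML
    curl -c cookie.txt -o login.html https://login.yahoo.com/config/login
    # 2. pull each hidden <input> out of login.html, then POST those
    #    name=value pairs along with the credentials, following redirects
    curl -b cookie.txt -c cookie.txt -L \
         -d "login=zoeblade" -d "passwd=foo" -d ".challenge=VALUE_FROM_STEP_1" \
         https://login.yahoo.com/config/login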
12:17 <chronomex> or it did when I used it a few years ago :]
12:17 <ZoeB> good idea, thanks
12:17 <chronomex> hm.
12:23 <chronomex> good luck! it's 4am and I should have been in bed hours ago :|
12:32 <ZoeB> sweet dreams ^.^
12:37 <Hydriz> Hey people
12:37 <emijrp> hi
12:39 <db48x> hello Hydriz
12:39 <Hydriz> How is the day?
12:41 <Hydriz> Just seen the Knol news
12:42 <db48x> the day is early
12:43 <Hydriz> lol
12:43 <Hydriz> just feel like jumping in and starting to archive Knol
12:46 <Hydriz> just love the archiving feeling
12:46 <db48x> great
12:46 <db48x> the place to start is by exploring the site and seeing how it's organized
12:50 <db48x> can we download things by user, or by some other enumerable index?
12:50 <Hydriz> looks like it is sorted by user?
12:51 <Hydriz> or maybe...
12:52 <Hydriz> Let's take an example: http://knol.google.com/k/scott-jenson/scott-jenson/6b7e08nms1ct/0#knols shows knols the user created
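[Aside: a first probe along the lines db48x suggests, as a sketch. The URL is Hydriz's example; the grep is naive, and whether knol links appear in the static HTML (rather than being script-built) is an assumption to verify.]

    curl -s 'http://knol.google.com/k/scott-jenson/scott-jenson/6b7e08nms1ct/0' \
      | grep -o 'http://knol\.google\.com/k/[^"]*' | sort -u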
12:52
🔗
|
db48x |
excellent |
12:59
🔗
|
Hydriz |
are you writing the script now? |
13:02
🔗
|
db48x |
no; I'm far too tired to be of any use there |
13:03
🔗
|
Hydriz |
LOL |
13:44
🔗
|
ersi |
Funny how the Knols he wrote is that they've been viewed strangely |
13:44
🔗
|
ersi |
27k times, 3k, 5k |
13:44
🔗
|
db48x |
nah, google estimates on that sort of thing all the time |
13:45
🔗
|
db48x |
number of search results, word counts in books, etc |
13:45
🔗
|
ersi |
those bastards |
13:45
🔗
|
db48x |
all estimates posing as real numbers |
16:56
🔗
|
SketchCow |
HEY IT IS THE FAMOUS ZOE BLADE |
18:01
🔗
|
closure |
SketchCow: hey, I think you left out your own talk on http://www.archive.org/details/roflcon-summit |
18:03
🔗
|
PatC |
yay, a download! |
18:06
🔗
|
SketchCow |
It's there, but you have to click on "all" |
18:10
🔗
|
closure |
doh! one day this will have a better interface, I'm sure |
18:14
🔗
|
closure |
on the plus side, that's another successfull transaction, and I didn't even blackmail you |
18:40
🔗
|
ersi |
hah |
18:48
🔗
|
Coderjoe |
wow |
18:49
🔗
|
Coderjoe |
I just finished a 243MB user |
18:52
🔗
|
closure |
I have users who have been going for 3 days now. |
18:54
🔗
|
Coderjoe |
hmmm |
18:55
🔗
|
Coderjoe |
you know what might be handy? a ramfs/tmpfs-disk hybrid filesystem |
18:56
🔗
|
Coderjoe |
you set a maximum ram usage and it keeps recently used/accessed stuff only in ram. when that ram space is exceeded, stuff that hasn't been touched in awhile gets flushed out to a backing directory on a hard drive |
18:58
🔗
|
Coderjoe |
this way, you get the benefits of tmpfs/ramfs for the files that get downloaded and grepped, but don't potentially fill it up with additional stuff that is downloaded and then not touched |
19:01
🔗
|
Coderjoe |
and is better than just the disk cache over a normal filesystem because stuff doesn't even hit rotational media until it gets evicted from ram |
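[Aside: a crude approximation of that hybrid, using a size-capped tmpfs plus a periodic eviction pass. The paths, the 4 GB cap, and the 30-minute threshold are arbitrary choices for the sketch, and it relies on the tmpfs tracking access times.]

    mount -t tmpfs -o size=4g tmpfs /mnt/hot
    # every so often, migrate files not accessed for 30+ minutes to disk
    find /mnt/hot -type f -amin +30 -print0 |
    while IFS= read -r -d '' f; do
        dest="/mnt/cold/${f#/mnt/hot/}"
        mkdir -p "$(dirname "$dest")"
        mv "$f" "$dest"
    done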
19:09
🔗
|
bsmith094 |
i just got back, google knol is going away? |
19:12
🔗
|
db48x2 |
yea, I had 26 users still going when I left this morning |
19:12
🔗
|
db48x2 |
bsmith094: yep |
19:13
🔗
|
bsmith094 |
so is splinder done yet? fully? |
19:13
🔗
|
db48x2 |
nope |
19:13
🔗
|
db48x2 |
us.splinder.com is still down |
19:13
🔗
|
db48x2 |
so those users are out of circulation |
19:13
🔗
|
bsmith094 |
not done, beause the tracker for the streamer script is down |
19:13
🔗
|
bsmith094 |
?? |
19:14
🔗
|
db48x2 |
and we all need to check our data for incomplete or broken users and finish them |
19:14
🔗
|
db48x2 |
bsmith094: the tracker isn't handing out any us users, so the script will see it as done |
19:16
🔗
|
bsmith094 |
ok i just spent 14 hrs doing a wget of fanfiction.net, and why fo i have blah and blah.orig |
19:16
🔗
|
db48x2 |
cool. how big is the result? |
19:17
🔗
|
marenostr |
Hi dear friends! Can anyone of you with Windows operating system and Internet Explorer web browser help me just for a few minutes? I'm dealing with a bug in a file of Project Gutenberg and I don't use Windows/IE, I'm on a GNU/Linux box and online IE rendering tools is not helpful for my case. I want to learn what you see on page www.gutenberg.org/dirs/etext06/8loc110h.htm under the expression "digestibility will be acquired." (without quotes). An |
19:17
🔗
|
bsmith094 |
du -ach isnt done yet |
19:17
🔗
|
marenostr |
image? Or some text? What? Thanks in advance! |
19:17
🔗
|
db48x2 |
bsmith094: :) |
19:17
🔗
|
bsmith094 |
so whats the .orig files and can i get rid of them? |
19:18
🔗
|
db48x2 |
marenostr: this isn't really the right channel for that kind of question |
19:18
🔗
|
db48x2 |
bsmith094: dunno. check your wget log and see when it created them |
19:18
🔗
|
db48x2 |
maybe the server just had files with that name on it |
19:18
🔗
|
bsmith094 |
where would the log be |
19:18
🔗
|
db48x2 |
where did you save it? |
19:18
🔗
|
marenostr |
db48x, OK. Sorry. Googling gave me that impression. Sorry. |
19:19
🔗
|
bsmith094 |
wget wwwfanfiction.net |
19:19
🔗
|
db48x2 |
hrm |
19:19
🔗
|
db48x2 |
marenostr: what specifically did you see that gave you this impression? |
19:19
🔗
|
db48x2 |
bsmith094: did you specify a -o or -a option? |
19:23
🔗
|
marenostr |
db48x, On page http://archiveteam.org/index.php?title=Project_Gutenberg , right side, at the bottom of the box says. IRC channel: #archiveteam and this is meant for Project Gutenberg. It gave me that impression. |
19:26
🔗
|
bsmith094 |
db48x2: i used this wget-warc -mpkKe robots=off -U "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --warc-cdx --warc-file=www.fanfiction.net/fanfiction_20111122 www.fanfiction.net |
19:26
🔗
|
db48x2 |
bsmith094: so wget sent it's log to stdout |
19:27
🔗
|
db48x2 |
bsmith094: did you redirect stdout to a file? |
19:27
🔗
|
db48x2 |
for example: wget ... www.fanfiction.net > archive.log |
19:27
🔗
|
bsmith094 |
errr... no, but in hindsight that would have been a good ides |
19:27
🔗
|
db48x2 |
indeed :) |
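[Aside: wget also has logging flags of its own, and note that wget writes its default log to stderr, so a bare "> archive.log" would miss it.]

    wget -o fetch.log ...       # write the log to fetch.log (truncates)
    wget -a fetch.log ...       # append to an existing log
    wget ... > fetch.log 2>&1   # plain redirection needs stderr too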
19:27
🔗
|
PatC |
Evening folks |
19:28
🔗
|
db48x2 |
marenostr: ah. Archive Team is all about archiving webpages, and one of the webpages we have archived/are archiving is Project Gutenberg :) |
19:28
🔗
|
db48x2 |
PatC: howdy |
19:28
🔗
|
bsmith094 |
to give an idea of how large this site is for being mostly text, the warc file is 850mb |
19:31
🔗
|
bsmith094 |
doesnt PG have a problem with robots bulk downloading |
19:33
🔗
|
bsmith094 |
my disk currently hates me, so im gonna let du -ach run afor a while |
19:33
🔗
|
PatC |
Pulling 2MB/s off archive.org I didn't know their internet connection was this good / not limited |
19:34
🔗
|
ndurner |
Good/unlimited internet connections are 2 GB/s nowadays |
19:35
🔗
|
PatC |
wow! |
19:39
🔗
|
Coderjoe |
bsmith094: if you used -k and -K the -K means back up the original files before changing the links. (which isn't really needed if you were doing a warc as well, since the warc would have the original data the server sent) |
19:40
🔗
|
Coderjoe |
also, if you let wget run to completion, a copy of the log should be in the warc file |
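[Aside: pulling that together: -k rewrites links in the saved HTML for local browsing, and -K first copies each rewritten file to NAME.orig, which is where the blah.orig files come from. Since the WARC keeps the untouched originals anyway, the backups can be dropped:]

    find www.fanfiction.net -name '*.orig' -delete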
19:40
🔗
|
db48x2 |
ah, right, -K |
19:44
🔗
|
Coderjoe |
haha |
19:45
🔗
|
Coderjoe |
that's mildly amusing |
19:45
🔗
|
Coderjoe |
google had already indexed the blog post before it was revised to remove the email address |
19:50
🔗
|
db48x2 |
heh |
20:24
🔗
|
bsmith094 |
how do i rsync my splinder data |
20:25
🔗
|
bsmith094 |
never mind got it |
20:29
🔗
|
alard |
Hi people; Splinder seems to be back to normal. Is it time to requeue the leftovers? |
20:34
🔗
|
closure |
leftovers? after thanksgiving |
20:34
🔗
|
closure |
srsly, if you want to queue some stuff, I'm game |
20:39
🔗
|
DFJustin |
<db48x2> and we all need to check our data for incomplete or broken users and finish them |
20:39
🔗
|
DFJustin |
how does one go about this |
20:40
🔗
|
bsmith094 |
i would like to know as well. is there a script for that? :) |
20:46
🔗
|
bsmith094 |
ive noticed a file in the splinder grab, called eror userrnames, how would i pipe that into dld-single? |
21:34
🔗
|
emijrp |
knol guys, knol |
21:35
🔗
|
PatC |
What's that? |
21:38
🔗
|
emijrp |
Google Knol. |
21:40
🔗
|
SketchCow |
Yeah |
21:40
🔗
|
SketchCow |
Saw you talking about it. |
21:41
🔗
|
SketchCow |
Let's requie splinder, that that done. |
21:43
🔗
|
alard |
SketchCow / others: I put the splinder items back into the queue about 1 hour ago. 254466 to do. (And the site is sluggish again, perhaps there is a correlation?) |
21:45
🔗
|
bsmith093 |
trackers back up ... downloading... |
21:47
🔗
|
bsmith093 |
hey cool im acutally on the board |
21:50
🔗
|
db48x2 |
bsmith093: :) |
22:03
🔗
|
SketchCow |
Good deal. |
22:03
🔗
|
SketchCow |
Now, help me here, though. |
22:03
🔗
|
SketchCow |
That means people have given me users that are, in fact, empty, and that someone will give me different users. |
22:07
🔗
|
alard |
Yes, so don't mix them up. :) |
22:07
🔗
|
alard |
(Although at this point I've requeued items that were never marked done, so probably those are still unfinished and you won't get them.) |
22:11
🔗
|
SketchCow |
We'll have to do some level of checking. |
22:12
🔗
|
SketchCow |
By the way, dragan is completely redoing/cleaning the Geocities torrent. |
22:12
🔗
|
closure |
SketchCow: did you see the Berlios cleanup script I wrote you? |
22:13
🔗
|
SketchCow |
Yes, but you'll need to e-mail me details. |
22:14
🔗
|
closure |
comments at the top should explain it all |
22:21
🔗
|
Coderjoe |
I just made some changes to dld-streamer.sh (and tested them) |
22:21
🔗
|
Coderjoe |
there is now an optional third parameter. if given, it is the name of a file to read usernames from, rather than asking the tracker. |
22:22
🔗
|
Coderjoe |
this is to allow the use of the job management of dld-streamer.sh to retry previously-failed usernames. |
22:23
🔗
|
Coderjoe |
but I am going to guess a large number of those have been moved back into the todo list on the tracker |
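[Aside: a sketch of how that might be driven, which also answers bsmith094's earlier error-usernames question. The argument shapes (nick, worker count, then the file; dld-single taking nick then username) are assumptions; check each script's usage header.]

    # with the modified streamer: read names from the file, four workers
    ./dld-streamer.sh yournick 4 error-usernames
    # or, without the streamer's job management, one user at a time:
    while read -r user; do
        ./dld-single.sh yournick "$user"
    done < error-usernames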
22:25 <bsmith093> Coderjoe: that's exactly what I was asking about, thanks!
22:39 <bsmith093> error-usernames is a list
22:40 <emijrp> how can I know which hard disk brand goes inside a verbatim 2TB drive?
22:43 <Coderjoe> crack it open, or google the model and see if anyone else has cracked it open?
22:43 <amerrykan> semi-ontopic for this channel: Someone found the original workprint of "Manos: The Hands of Fate" and intends to restore it - http://forums.somethingawful.com/showthread.php?threadid=3450845
22:48 <amerrykan> http://www.manosinhd.com/
22:48 <Coderjoe> ...