Time | Nickname | Message
03:11 | instence_ | I may have asked this before
03:11 | instence_ | er
03:11 | instence | I may have asked this before, but is it possible to prevent wget from getting a website twice?
03:12 | instence | when it grabs http://www.domain.com/, and http://domain.com/
03:12 | instence | I run into this so often, and it makes an unnecessary mess every time
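A minimal sketch of the kind of crawl that ends up with two parallel trees, assuming a site that serves the same content with and without www; example.com stands in for the real host, and the flags are ordinary wget options rather than a command taken from the log:

    # -D uses domain-suffix matching, so "example.com" covers both hosts;
    # with -H the crawl spans them and each host gets its own directory tree
    wget -r -l inf -H -D example.com --convert-links http://www.example.com/
    # result: ./www.example.com/... and ./example.com/... holding largely duplicate files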
03:13 | Coderjoe | http://i.imgur.com/UL6lGX5.jpg
03:15 | instence | nice hehe
03:52 | instence | hmm is pre-creating a symlink the only option?
03:53 | instence | hm but it's still going to download the files twice, and could screw up the recursion
03:54 | instence | or link repair that is
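For reference, the pre-created-symlink workaround being discussed would look roughly like this; it only merges the two trees on disk, so, as noted above, pages are still fetched twice and link conversion can still point at either hostname. A sketch, with example.com standing in for the real domain:

    # make the bare-domain directory a symlink to the www tree before crawling,
    # so wget writes both hosts into the same place on disk
    mkdir -p www.example.com
    ln -s www.example.com example.com
    wget -r -l inf -H -D example.com --convert-links http://www.example.com/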
04:02 | SketchCow | godane: Thanks
04:19 | DFJustin | instence: look into the --reject parameter for wget
04:27 | instence | DFJustin: Hmm, not sure how that would apply. So far I use -R for filename restrictions, but I am not sure how that would allow me to tell wget to go "Hey I noticed you are archiving content that is being linked across both domain.com and www.domain.com, let's treat this as 1 unique folder structure instead of 2, and tidy it up as such."
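-R / --reject does match against file names rather than hosts; newer wget releases (1.14 and later) also have --reject-regex, which is matched against the complete URL and so can drop the bare-domain duplicates, at the cost of skipping anything linked only from that host. A sketch, again with example.com as a stand-in:

    # -R rejects by file-name suffix or pattern, e.g. skip large archives:
    wget -r -R '*.zip,*.iso' http://www.example.com/
    # --reject-regex is applied to the whole URL, so it can exclude one host outright:
    wget -r -l inf -H -D example.com --convert-links \
         --reject-regex '^https?://example\.com/' http://www.example.com/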
04:28 | instence | I was looking at cut-dirs as well
04:28 | instence | but each solution potentially breaks the accuracy of the convert-links process
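--cut-dirs only strips leading path components, but its companion option -nH (--no-host-directories) drops the host prefix entirely, which merges the two trees on disk much like the symlink trick, with the same caveats about double downloads and link conversion. A sketch, example.com standing in:

    # -nH writes everything into one tree instead of ./example.com/ and ./www.example.com/
    wget -r -l inf -H -D example.com -nH --convert-links http://www.example.com/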
04:29 | DFJustin | ahh
04:29 | DFJustin | nm then
04:31 | instence | All it takes is 1 url to be hard linked as a href="domain.com/file.html" and then you are downloading the entire site all over again :/
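A quick way to gauge how much of a mirror is affected by such hard-coded absolute links is to grep the downloaded tree for them (a sketch, with example.com standing in for the real host):

    # list downloaded pages that point at the bare domain with an absolute URL
    grep -rlE 'href="https?://example\.com/' www.example.com/ | head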
04:31 | instence | Drives me nuts.
04:33 | instence | However I have run into situations where I break out BeyondCompare to look for differences between folder structures, and there are often orphans in each folder.
04:34 | instence | Which means if I narrowed the scope to exclude one or the other, I could be missing out on data or pathways to crawl.
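The same orphan check can be done without BeyondCompare; a recursive diff of the two host directories lists files that exist under only one of them (a sketch, example.com as the stand-in):

    # files present in one tree but not the other ("orphans")
    diff -rq example.com www.example.com | grep '^Only in'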
04:56 | DFJustin | gotcha
04:57 | DFJustin | how about we just throw everyone who doesn't redirect to a canonical domain into a volcano
05:10 | instence | I like this Volcano proposal
05:10 | instence | let's do it
05:20 | SketchCow | Seconded
05:35 | DFJustin | I hate to bring this up but textfiles.com and www.textfiles.com don't redirect to one or the other
05:57 | * | SketchCow off the side of the boat
05:58 | * | BlueMax jumps in after SketchCow
05:58 | BlueMax | I'LL SAVE YOU
05:59 | instence | DFJustin: haha
11:42 | GLaDOS | http://scr.terrywri.st/1385465533.png You are so cute, uTorrent.
12:55 | dashcloud | here's something funny for the morning: job posting at Penny Arcade: http://www.linkedin.com/jobs2/view/9887522?trk=job_nov and the hashtag to go along with it: #PennyArcadeJobPostings
15:50 | godane | SketchCow: i'm starting to upload more NGC magazine issues i got for Kiwi
15:50 | SketchCow | Excellent
15:50 | godane | there is only one issue left before there is a complete set
15:51 | godane | SketchCow: i'm also getting Digital Camera World
15:51 | godane | i think there is a collection of it out there from 2002 to 2011
15:51 | godane | sadly i can't get that one but i'm getting stuff from 2002 to 2005
16:32 | godane | so good news on finding the digital camera world collection
16:33 | godane | i found an nzb for it
16:33 | godane | i'm using my other usenet account to grab it since my default one is missing stuff like crazy
17:22 | soultcer | Anyone seen omf around lately?
18:00 | Coderjoe | huh. someone elsewhere dropped this link to a well-written article about the max headroom STL signal intrusion: http://motherboard.vice.com/blog/headroom-hacker
18:00 | Coderjoe | and I spy a textfiles.com link in the middle of it
18:02 | DFJustin | like seriously, I'm looking at this XOWA thing - wikitaxi would take the ~11gb .xml.bz2 dump file, process it for a while, and write out a ~13gb .taxi file with all the data and an index for random access
18:03 | DFJustin | xowa makes you extract the ENTIRE DUMP TO DISK, taking 45gb
18:03 | DFJustin | then generates a 25gb sqlite database
18:03 | DFJustin | this is like reinventing the wheel and making it square
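The WikiTaxi-style workflow DFJustin is comparing against amounts to reading the compressed dump as a stream and writing a single indexed file, instead of unpacking ~45gb first. In shell terms it would look roughly like this (a sketch: the dump file name is a placeholder and build-wiki-index is a hypothetical importer, not an actual XOWA or WikiTaxi tool):

    # stream the ~11gb .xml.bz2 dump without extracting it to disk;
    # the (hypothetical) importer reads page XML on stdin and writes one indexed database
    bzcat enwiki-pages-articles.xml.bz2 | ./build-wiki-index --out wiki.db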
19:02 | xmc | seems reasonable
19:57 | yipdw | Coderjoe: as an FYI, it may be worth starting work on a seesaw pipeline to archive the .org
19:58 | * | yipdw doesn't think it's going to be around that much longer
22:24 | S[h]O[r]T | http://video.bobdylan.com/desktop.html
23:07 | nico_32 | https://github.com/nsapa/cloudexchange.org/commit/a163610089c026b41968a79b02273071f78eab6c
23:08 | * | nico_32 is doing things the sed & cut way