[03:11] I may have asked this before
[03:11] er
[03:11] I may have asked this before, but is it possible to prevent wget from getting a website twice?
[03:12] when it grabs http://www.domain.com/, and http://domain.com/
[03:12] I run into this so often, and it makes an unnecessary mess every time
[03:13] http://i.imgur.com/UL6lGX5.jpg
[03:15] nice hehe
[03:52] hmm is pre-creating a symlink the only option?
[03:53] hm but it's still going to download the files twice, and could screw up the recursion
[03:54] or link repair that is
[04:02] godane: Thanks
[04:19] instence: look into the --reject parameter for wget
[04:27] DFJustin: Hmm, not sure how that would apply. So far I use -R for filename restrictions, but I am not sure how that would allow me to tell wget to go "Hey, I noticed you are archiving content that is being linked across both domain.com and www.domain.com, let's treat this as 1 unique folder structure instead of 2, and tidy it up as such."
[04:28] I was looking at cut-dirs as well
[04:28] but each solution potentially breaks the accuracy of the convert-links process
[04:29] ahh
[04:29] nm then
[04:31] All it takes is 1 URL to be hard-linked as a href="domain.com/file.html" and then you are downloading the entire site all over again :/
[04:31] Drives me nuts.
[04:33] However I have run into situations where I break out BeyondCompare to look for differences between folder structures, and there are often orphans in each folder.
[04:34] Which means if I narrowed the scope to exclude one or the other, I could be missing out on data or pathways to crawl.
[04:56] gotcha
[04:57] how about we just throw everyone who doesn't redirect to a canonical domain into a volcano
[05:10] I like this Volcano proposal
[05:10] let's do it
[05:20] Seconded
[05:35] I hate to bring this up but textfiles.com and www.textfiles.com don't redirect to one or the other
[05:57] * SketchCow off the side of the boat
[05:58] * BlueMax jumps in after SketchCow
[05:58] I'LL SAVE YOU
[05:59] DFJustin: haha
[11:42] http://scr.terrywri.st/1385465533.png You are so cute, uTorrent.
[12:55] here's something funny for the morning: job posting at Penny Arcade: http://www.linkedin.com/jobs2/view/9887522?trk=job_nov and the hashtag to go along with it: #PennyArcadeJobPostings
[15:50] SketchCow: i'm starting to upload more NGC magazine issues i got for Kiwi
[15:50] Excellent
[15:50] there is only one issue left before there is a complete set
[15:51] SketchCow: i'm also getting Digital Camera World
[15:51] i think there is a collection of it out there from 2002 to 2011
[15:51] sadly i can't get that one but i'm getting stuff from 2002 to 2005
[16:32] so good news on finding the digital camera world collection
[16:33] i found an nzb for it
[16:33] i'm using my other usenet account to grab it since my default one is missing stuff like crazy
[17:22] Anyone seen omf around lately?
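A minimal sketch of the symlink idea raised at 03:52, using hypothetical paths and a placeholder domain; the caveats discussed above still apply (pages may still be fetched twice, and --convert-links accuracy can suffer):

    # Pre-create the www host directory as a symlink to the bare-domain
    # directory so both hostnames write into one tree.
    mkdir -p domain.com
    ln -s domain.com www.domain.com

    # Mirror across both hostnames; --span-hosts plus --domains keeps the
    # crawl from wandering off-site.
    wget --mirror --page-requisites --adjust-extension --convert-links \
         --span-hosts --domains=domain.com,www.domain.com \
         http://domain.com/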
[18:00] huh. someone elsewhere dropped this link to a well-written article about the Max Headroom STL signal intrusion: http://motherboard.vice.com/blog/headroom-hacker
[18:00] and I spy a textfiles.com link in the middle of it
[18:02] like seriously, I'm looking at this XOWA thing - wikitaxi would take the ~11gb .xml.bz2 dump file, process it for a while, and write out a ~13gb .taxi file with all the data and an index for random access
[18:03] xowa makes you extract the ENTIRE DUMP TO DISK, taking 45gb
[18:03] then generates a 25gb sqlite database
[18:03] this is like reinventing the wheel and making it square
[19:02] seems reasonable
[19:57] Coderjoe: as an FYI, it may be worth starting work on a seesaw pipeline to archive the .org
[19:58] * yipdw doesn't think it's going to be around that much longer
[22:24] http://video.bobdylan.com/desktop.html
[23:07] https://github.com/nsapa/cloudexchange.org/commit/a163610089c026b41968a79b02273071f78eab6c
[23:08] * nico_32 is doing things the sed & cut way
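As a rough illustration of the streaming approach described at 18:02 (hypothetical filename, and not WikiTaxi's or XOWA's actual tooling): the compressed dump can be read through a pipe instead of being extracted to disk in full, for example to pull out a crude title index:

    # Stream the bzip2-compressed MediaWiki dump and list article titles,
    # without ever writing the ~45gb uncompressed XML to disk.
    bzcat enwiki-pages-articles.xml.bz2 \
      | grep -o '<title>[^<]*</title>' \
      | sed 's/<[^>]*>//g' \
      > titles.txt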