#archiveteam-bs 2013-11-26,Tue


Time Nickname Message
03:11 🔗 instence_ I may have asked this before
03:11 🔗 instence_ er
03:11 🔗 instence I may have asked this before, but is it possible to prevent wget from getting a website twice?
03:12 🔗 instence when it grabs http://www.domain.com/, and http://domain.com/
03:12 🔗 instence I run into this so often, and it makes an unnecessary mess every time
03:13 🔗 Coderjoe http://i.imgur.com/UL6lGX5.jpg
03:15 🔗 instence nice hehe
03:52 🔗 instence hmm is pre-creating a symlink the only option?
03:53 🔗 instence hm but it's still going to download the files twice, and could screw up the recursion
03:54 🔗 instence or link repair that is
04:02 🔗 SketchCow godane: Thanks
04:19 🔗 DFJustin instence: look into the --reject parameter for wget
04:27 🔗 instence DFJustin: Hmm, not sure how that would apply. So far I use -R for filename restrictions, but I am not sure how that would allow me to tell wget to go "Hey I noticed you are archiving content that is being linked across both domain.com and www.domain.com, let's treat this as 1 unique folder structure instead of 2, and tidy it up as such."
04:28 🔗 instence I was looking at cut-dirs as well
04:28 🔗 instence but each solution potentially breaks the accuracy of the convert-links process
04:29 🔗 DFJustin ahh
04:29 🔗 DFJustin nm then
04:31 🔗 instence All it takes is 1 URL to be hard linked as a href="domain.com/file.html" and then you are downloading the entire site all over again :/
04:31 🔗 instence Drives me nuts.
04:33 🔗 instence However I have run into situations where I break out BeyondCompare to look for differences between folder structures, and there are often orphans in each folder.
04:34 🔗 instence Which means if I narrowed the scope to exclude one or the other, I could be missing out on data or pathways to crawl.
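
One partial workaround, not one mentioned in the channel, is to whitelist a single canonical host so wget never builds a second directory tree for the bare domain. A minimal sketch with real wget flags; example.com is a stand-in domain and the exact invocation is illustrative only:

    # wget's recursion already stays on the start host by default, so the
    # double download usually appears when spanning hosts with a bare-domain
    # whitelist (-H -Dexample.com suffix-matches both example.com and
    # www.example.com). Pinning the whitelist to the canonical host keeps a
    # single directory tree, at the cost of pages reachable only through the
    # bare domain -- the orphan problem instence describes.
    wget --recursive --page-requisites --convert-links --adjust-extension \
         --span-hosts --domains=www.example.com \
         http://www.example.com/
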
04:56 🔗 DFJustin gotcha
04:57 🔗 DFJustin how about we just throw everyone who doesn't redirect to a canonical domain into a volcano
05:10 🔗 instence I like this Volcano proposal
05:10 🔗 instence let's do it
05:20 🔗 SketchCow Seconded
05:35 🔗 DFJustin I hate to bring this up but textfiles.com and www.textfiles.com don't redirect to one or the other
05:57 🔗 * SketchCow goes off the side of the boat
05:58 🔗 * BlueMax jumps in after SketchCow
05:58 🔗 BlueMax I'LL SAVE YOU
05:59 🔗 instence DFJustin: haha
11:42 🔗 GLaDOS http://scr.terrywri.st/1385465533.png You are so cute, uTorrent.
12:55 🔗 dashcloud here's something funny for the morning: job posting at Penny Arcade: http://www.linkedin.com/jobs2/view/9887522?trk=job_nov and the hashtag to go along with it: #PennyArcadeJobPostings
15:50 🔗 godane SketchCow: i'm starting to upload more NGC magazine issues i got for Kiwi
15:50 🔗 SketchCow Excellent
15:50 🔗 godane there is only one issue left before there is a complete set
15:51 🔗 godane SketchCow: i'm also getting Digital Camera World
15:51 🔗 godane i think there is a collection of it out there from 2002 to 2011
15:51 🔗 godane sadly i can't get that one but i'm getting stuff from 2002 to 2005
16:32 🔗 godane so good news on finding the digital camera world collection
16:33 🔗 godane i found an nzb for it
16:33 🔗 godane i'm using my other usenet account to grab it since my default one is missing stuff like crazy
17:22 🔗 soultcer Anyone seen omf around lately?
18:00 🔗 Coderjoe huh. someone elsewhere dropped this link to a well-written article about the Max Headroom STL signal intrusion: http://motherboard.vice.com/blog/headroom-hacker
18:00 🔗 Coderjoe and I spy a textfiles.com link in the middle of it
18:02 🔗 DFJustin like seriously, I'm looking at this XOWA thing - WikiTaxi would take the ~11GB .xml.bz2 dump file, process it for a while, and write out a ~13GB .taxi file with all the data and an index for random access
18:03 🔗 DFJustin XOWA makes you extract the ENTIRE DUMP TO DISK, taking 45GB
18:03 🔗 DFJustin then generates a 25GB SQLite database
18:03 🔗 DFJustin this is like reinventing the wheel and making it square
19:02 🔗 xmc seems reasonable
19:57 🔗 yipdw Coderjoe: as an FYI, it may be worth starting work on a seesaw pipeline to archive the .org
19:58 🔗 * yipdw doesn't think it's going to be around that much longer
22:24 🔗 S[h]O[r]T http://video.bobdylan.com/desktop.html
23:07 🔗 nico_32 https://github.com/nsapa/cloudexchange.org/commit/a163610089c026b41968a79b02273071f78eab6c
23:08 🔗 * nico_32 is doing things the 'sed & cut' way
