Time | Nickname | Message
03:11 | instence_ | I may have asked this before
03:11 | instence_ | er
03:11 | instence | I may have asked this before, but is it possible to prevent wget from getting a website twice?
03:12 | instence | when it grabs http://www.domain.com/, and http://domain.com/
03:12 | instence | I run into this so often, and it makes an unnecessary mess every time
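A minimal sketch of the kind of crawl that ends up with two parallel trees, assuming a site that serves the same content with and without www; example.com stands in for the real host, and the flags are ordinary wget options rather than a command taken from the log:

    # -D uses domain-suffix matching, so "example.com" covers both hosts;
    # with -H the crawl spans them and each host gets its own directory tree
    wget -r -l inf -H -D example.com --convert-links http://www.example.com/
    # result: ./www.example.com/... and ./example.com/... holding largely duplicate files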
03:13 | Coderjoe | http://i.imgur.com/UL6lGX5.jpg
03:15 | instence | nice hehe
03:52 | instence | hmm is pre-creating a symlink the only option?
03:53 | instence | hm but it's still going to download the files twice, and could screw up the recursion
03:54 | instence | or link repair that is
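For reference, the pre-created-symlink workaround being discussed would look roughly like this; it only merges the two trees on disk, so, as noted above, pages are still fetched twice and link conversion can still point at either hostname. A sketch, with example.com standing in for the real domain:

    # make the bare-domain directory a symlink to the www tree before crawling,
    # so wget writes both hosts into the same place on disk
    mkdir -p www.example.com
    ln -s www.example.com example.com
    wget -r -l inf -H -D example.com --convert-links http://www.example.com/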
04:02 | SketchCow | godane: Thanks
04:19 | DFJustin | instence: look into the --reject parameter for wget
04:27 | instence | DFJustin: Hmm, not sure how that would apply. So far I use -R for filename restrictions, but I am not sure how that would allow me to tell wget to go "Hey I noticed you are archiving content that is being linked across both domain.com and www.domain.com, let's treat this as 1 unique folder structure instead of 2, and tidy it up as such."
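-R / --reject does match against file names rather than hosts; newer wget releases (1.14 and later) also have --reject-regex, which is matched against the complete URL and so can drop the bare-domain duplicates, at the cost of skipping anything linked only from that host. A sketch, again with example.com as a stand-in:

    # -R rejects by file-name suffix or pattern, e.g. skip large archives:
    wget -r -R '*.zip,*.iso' http://www.example.com/
    # --reject-regex is applied to the whole URL, so it can exclude one host outright:
    wget -r -l inf -H -D example.com --convert-links \
         --reject-regex '^https?://example\.com/' http://www.example.com/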
04:28 | instence | I was looking at cut-dirs as well
04:28 | instence | but each solution potentially breaks the accuracy of the convert-links process
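--cut-dirs only strips leading path components, but its companion option -nH (--no-host-directories) drops the host prefix entirely, which merges the two trees on disk much like the symlink trick, with the same caveats about double downloads and link conversion. A sketch, example.com standing in:

    # -nH writes everything into one tree instead of ./example.com/ and ./www.example.com/
    wget -r -l inf -H -D example.com -nH --convert-links http://www.example.com/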
04:29 | DFJustin | ahh
04:29 | DFJustin | nm then
04:31 | instence | All it takes is 1 url to be hard linked as a href="domain.com/file.html" and then you are downloading the entire site all over again :/
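A quick way to gauge how much of a mirror is affected by such hard-coded absolute links is to grep the downloaded tree for them (a sketch, with example.com standing in for the real host):

    # list downloaded pages that point at the bare domain with an absolute URL
    grep -rlE 'href="https?://example\.com/' www.example.com/ | head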
04:31 | instence | Drives me nuts.
04:33 | instence | However I have run into situations where I break out BeyondCompare to look for differences between folder structures, and there are often orphans in each folder.
04:34 | instence | Which means if I narrowed the scope to exclude one or the other, I could be missing out on data or pathways to crawl.
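The same orphan check can be done without BeyondCompare; a recursive diff of the two host directories lists files that exist under only one of them (a sketch, example.com as the stand-in):

    # files present in one tree but not the other ("orphans")
    diff -rq example.com www.example.com | grep '^Only in'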
04:56 | DFJustin | gotcha
04:57 | DFJustin | how about we just throw everyone who doesn't redirect to a canonical domain into a volcano
05:10 | instence | I like this Volcano proposal
05:10 | instence | let's do it
05:20 | SketchCow | Seconded
05:35 | DFJustin | I hate to bring this up but textfiles.com and www.textfiles.com don't redirect to one or the other
05:57 | * | SketchCow off the side of the boat
05:58 | * | BlueMax jumps in after SketchCow
05:58 | BlueMax | I'LL SAVE YOU
05:59 | instence | DFJustin: haha
11:42 | GLaDOS | http://scr.terrywri.st/1385465533.png You are so cute, uTorrent.
12:55 | dashcloud | here's something funny for the morning: job posting at Penny Arcade: http://www.linkedin.com/jobs2/view/9887522?trk=job_nov and the hashtag to go along with it: #PennyArcadeJobPostings
15:50 | godane | SketchCow: i'm starting to upload more NGC magazine issues i got for Kiwi
15:50 | SketchCow | Excellent
15:50 | godane | there is only one issue left before there is a complete set
15:51 | godane | SketchCow: i'm also getting Digital Camera World
15:51 | godane | i think there is a collection of it out there from 2002 to 2011
15:51 | godane | sadly i can't get that one but i'm getting stuff from 2002 to 2005
16:32 | godane | so good news on finding the digital camera world collection
16:33 | godane | i found an nzb for it
16:33 | godane | i'm using my other usenet account to grab it since my default one is missing stuff like crazy
17:22 | soultcer | Anyone seen omf around lately?
18:00 | Coderjoe | huh. someone elsewhere dropped this link to a well-written article about the max headroom STL signal intrusion: http://motherboard.vice.com/blog/headroom-hacker
18:00 | Coderjoe | and I spy a textfiles.com link in the middle of it
18:02 | DFJustin | like seriously, I'm looking at this XOWA thing - wikitaxi would take the ~11gb .xml.bz2 dump file, process it for a while, and write out a ~13gb .taxi file with all the data and an index for random access
18:03 | DFJustin | xowa makes you extract the ENTIRE DUMP TO DISK, taking 45gb
18:03 | DFJustin | then generates a 25gb sqlite database
18:03 | DFJustin | this is like reinventing the wheel and making it square
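The WikiTaxi-style workflow DFJustin is comparing against amounts to reading the compressed dump as a stream and writing a single indexed file, instead of unpacking ~45gb first. In shell terms it would look roughly like this (a sketch: the dump file name is a placeholder and build-wiki-index is a hypothetical importer, not an actual XOWA or WikiTaxi tool):

    # stream the ~11gb .xml.bz2 dump without extracting it to disk;
    # the (hypothetical) importer reads page XML on stdin and writes one indexed database
    bzcat enwiki-pages-articles.xml.bz2 | ./build-wiki-index --out wiki.db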
19:02 | xmc | seems reasonable
19:57 | yipdw | Coderjoe: as an FYI, it may be worth starting work on a seesaw pipeline to archive the .org
19:58 | * | yipdw doesn't think it's going to be around that much longer
22:24 | S[h]O[r]T | http://video.bobdylan.com/desktop.html
23:07 | nico_32 | https://github.com/nsapa/cloudexchange.org/commit/a163610089c026b41968a79b02273071f78eab6c
23:08 | * | nico_32 is doing things the sed & cut way