[06:44] so some good news in some weird way with the glenn beck stuff
[06:44] turns out i can start getting the hd version of the show now
[06:45] the show is a single show with just glenn beck on it now
[06:47] this may only sort of be a storage problem, in that these 1-hour shows will be 1 GB each
[07:15] the other good news is there is real metadata now to go with the shows
[07:20] the other fun part is that i can't downgrade the hd to what it used to be
[07:26] godane: you were the one I grabbed the NHK podcasts for, right?
[07:36] yes
[07:36] godane: what do you want me to do with them? is it okay if I just give you an rsync endpoint to grab them from?
[07:36] ok
[07:37] (there might be duplicates, but the file modification date indicates whether they are unique... it's set to the actual recording date)
[07:37] alright
[07:37] godane: will do so later today and PM you the details
[07:37] you can do it now if you want
[13:51] ok, warrior fired up for these grabs
[13:51] I'll run as long as I can at home
[13:53] any way to grab asp files when you get errors like this: http://computerpoweruser.com/articles/archive/create/cre6197/cre6197.asp
[13:53] i want to do a full backup of all the old computerpoweruser articles if it's possible
[13:55] they've screwed up their server side includes godane so I don't know of any way.. :/
[14:09] Sigh, all my threads have "yahooed!" :(
[14:13] someone is telling me you can use GETS + grabbing a directory listing and grabbing individual files godane - no idea if it'll work
[14:16] how do i do that?
[14:22] Smiley: ?
[14:23] I don't know godane, it's just something someone said, and they won't give me more details.
[14:23] so it's likely a dead end
[14:29] Smiley: is the guy you're talking to on an irc channel?
[14:30] i need to know what the hell it is
[14:38] godane: does that asp file link to a working article or is it just broken?
[14:38] it's just broken
[14:39] but i can get to a working article that is newer
[14:39] about 2004ish
[14:39] example: http://www.computerpoweruser.com/articles/archive/c0504/50c04/50c04.asp
[15:14] So I downloaded a 10GB tar.bz2 with wget-warc, as well as some smaller stuff, and megawarced it all together. But the resulting .warc.gz file is only 7.6GB. What's up with that?
[15:22] perhaps the warc rearranged things - say, alphabetically - so it compressed better?
[17:44] Smiley: anything about the 'GETS' yet?
[18:12] non./
[20:24] kyan: the tar.bz2 likely compressed a bit more when run through gzip
[20:26] a warc.gz file is not one single gzip stream. each record is a separate stream.
[20:26] Coderjoe, actually I decompressed it; the raw WARC was only 7 or 8 GB, so I think something got left out
[20:26] hmm
[20:27] was the tar.bz2 file just in the same directory as the warc files? what did it contain?
[20:29] megawarc creates three files:
[20:29] FILE.warc.gz is the concatenated .warc.gz
[20:29] FILE.tar contains any non-warc files from the .tar
[20:29] FILE.json.gz contains metadata
[20:31] Coderjoe, I downloaded the tar.bz2 with wget-warc, with the automatic-delete option enabled (so only the warc was kept). Then that warc (and all the others) got fed to megawarc, and then deleted. I'll see if I can track down where they ended up (the files have all been sent out to the Internet Archive now).
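A minimal sketch related to the size mismatch above: since a .warc.gz is a series of concatenated gzip members (one per record) rather than a single stream, Python's standard gzip module reads straight across member boundaries, so the total uncompressed size can be compared against the warcs that were fed to megawarc. The script name and filename argument are placeholders, not the actual item.

```python
import gzip
import sys

# Sketch: report the total uncompressed size of a (possibly multi-member) .warc.gz
# so it can be compared with the sizes of the files that went into megawarc.
# Usage: python check_warc_size.py FILE.warc.gz
total = 0
with gzip.open(sys.argv[1], "rb") as f:  # gzip reads concatenated members transparently
    while True:
        chunk = f.read(1 << 20)
        if not chunk:
            break
        total += len(chunk)
print("uncompressed bytes:", total)
```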
[20:32] Coderjoe,
[20:32] https://archive.org/details/AMJ_BarrelData_6561_88c1c7eb-067b-4547-949c-e3111a189bab.2013-12-16-19-44-26-244140-_E
[20:32] The log output went to the xz file…
[20:33] it's from an automated system I've been working on; I was keeping an eye on this item especially because it's the first really big thing the program has worked with
[20:39] Coderjoe, looking at the json.gz it looks like the file megawarc got was only 7 GB. So, I conclude that Wget munged it :P
[20:48] or megawarc did
[20:51] if megawarc munged it, that's worrying
[20:51] if wget munged it, that's also worrying
[20:53] Coderjoe, I don't think megawarc did, since it doesn't (I don't think?) touch the original warc
[20:56] I don't know though, I'm not too familiar with the software involved
[21:01] it's pretty common for the connection to drop at some point in a 10 GB download, that would be my guess
[21:01] DFJustin, usually wget resumes automatically, I think? The download log seemed to indicate it got all the way to 100% complete
[21:02] (although I did some downloads of big MPEG transport stream files with wget-warc (10 to 20 GB) and they seemed to come through fine, so I'm not sure what's different)
[21:03] depends on the parameters
[21:03] and that was with a lot of dropped connections, since it was to a server in greece (I think) on a flaky connection
[21:03] Ok, I guess I'll look at those items then, and compare the settings to what my script is doing
[21:04] and on how the server on the other end behaves
[21:04] see what I changed
[21:04] I see… I'll do some investigating :)
[21:04] thanks!
[21:11] hmm
[21:11] has jason scott been online?
[21:11] haven't seen him for some time...
[21:13] he was on this morning
[21:16] ah well
[21:16] I sent SketchCow some emails a while ago and I didn't get a reply yet, so I thought maybe he was on holiday or something
[21:17] he's been flying back to new york I think
[21:18] https://twitter.com/textfiles/status/413037877791326208/photo/1
[21:22] ah, so that's why
[21:22] thank you, guess I just have to wait some time... :)
[21:45] flying or sliding, it's all the same
[23:26] His terrible crack addiction has overtaken his responsibilities.
[23:41] this might be interesting to some folks here, on the use of PhantomJS for a massive site: http://sorcery.smugmug.com/2013/12/17/using-phantomjs-at-scale/
[23:43] i wish I understood more about the massive node.js implementation we have at work
[23:45] SketchCow: ah, see, there's your mistake; try being addicted to /incredible/ crack instead.
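A hedged sketch for the json.gz check mentioned at 20:39: it peeks inside a megawarc FILE.json.gz to see what metadata (including sizes) was recorded for each packed file. The exact layout of megawarc's metadata is an assumption here; the code just dumps whatever JSON it finds, whether the file is a single document or line-delimited.

```python
import gzip
import json
import sys

# Sketch: dump the metadata entries in a megawarc FILE.json.gz.
# Usage: python peek_megawarc_json.py FILE.json.gz   (filename is a placeholder)
with gzip.open(sys.argv[1], "rt") as f:
    data = f.read()

try:
    entries = [json.loads(data)]  # single JSON document
except ValueError:
    # fall back to line-delimited JSON, one entry per packed file
    entries = [json.loads(line) for line in data.splitlines() if line.strip()]

for entry in entries:
    print(json.dumps(entry, indent=2)[:400])  # peek at the start of each entry
```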