#archiveteam-bs 2013-12-17,Tue


Time Nickname Message
06:44 πŸ”— godane so some good news, in some weird way, with the glenn beck stuff
06:44 πŸ”— godane turns out i can start getting the hd version of the show now
06:45 πŸ”— godane the show is a single show with just glenn beck on it now
06:47 πŸ”— godane this may only sort of be a storage problem, in that these 1 hour shows will be 1 GB each
07:15 πŸ”— godane the other good news is there is real metadata now to go with the shows
07:20 πŸ”— godane the other fun thing is that i can't downgrade the hd to what it used to be
07:26 πŸ”— joepie91 godane: you were the one who I grabbed the NHK podcasts for, right?
07:36 πŸ”— godane yes
07:36 πŸ”— joepie91 godane: what do you want me to do with them? is it okay if I just give you an rsync endpoint to grab them from?
07:36 πŸ”— godane ok
07:37 πŸ”— joepie91 (there might be duplicates, but the file modification date indicates whether they are unique... it's set to the actual recording date)
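The dedupe idea joepie91 describes (duplicates share a modification date set to the recording date) can be sketched in a few lines of Python. This is a hedged illustration, not anything from the actual transfer; the filenames and timestamps are invented for the demo:

```python
import os
import tempfile

def unique_by_mtime(paths):
    """Keep one file per modification timestamp; duplicate copies of a
    recording share the same mtime, so later copies are dropped."""
    seen = {}
    for p in sorted(paths):
        seen.setdefault(os.path.getmtime(p), p)
    return sorted(seen.values())

# Demo with throwaway files: two share a timestamp, one is unique.
d = tempfile.mkdtemp()
paths = []
for name, t in [("a.mp3", 100), ("b.mp3", 100), ("c.mp3", 200)]:
    p = os.path.join(d, name)
    open(p, "w").close()
    os.utime(p, (t, t))   # force the mtime to the "recording date"
    paths.append(p)

kept = unique_by_mtime(paths)
print(len(kept))  # 2
```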
07:37 πŸ”— joepie91 alright
07:37 πŸ”— joepie91 godane: will do so later today and PM you the details
07:37 πŸ”— godane you can do it now if you want
13:51 πŸ”— Smiley ok warrior fired up for these grabs
13:51 πŸ”— Smiley I'll run as long as I can at home
13:53 πŸ”— godane any way to grab asp files when you get errors like this: http://computerpoweruser.com/articles/archive/create/cre6197/cre6197.asp
13:53 πŸ”— godane i want to do a full backup of all the old computerpoweruser articles if it's possible
13:55 πŸ”— Smiley they've screwed up their server side includes godane so I don't know of any way.. :/
14:09 πŸ”— Smiley Sigh, all my threads have "yahooed!" :(
14:13 πŸ”— Smiley someone is telling me you can use GETS + grabbing a directory listing and grabbing individual files godane - no idea if it'll work
14:16 πŸ”— godane how do i do that?
14:22 πŸ”— godane Smiley: ?
14:23 πŸ”— Smiley I don't know godane, it's just something someone said, and they won't give me more details.
14:23 πŸ”— Smiley so it's likely a deadend
14:29 πŸ”— godane Smiley: is the guy you're talking to on an irc channel?
14:30 πŸ”— godane i need to know what the hell it is
14:38 πŸ”— m1das godane: does that asp file link to a working article or is it just broken?
14:38 πŸ”— godane it's just broken
14:39 πŸ”— godane but i can get to a working article that is newer
14:39 πŸ”— godane about 2004ish
14:39 πŸ”— godane example: http://www.computerpoweruser.com/articles/archive/c0504/50c04/50c04.asp
15:14 πŸ”— kyan So I downloaded a 10GB tar.bz2 with wget-warc, as well as some smaller stuff, and megawarced it all together. But, the resulting .warc.gz file is only 7.6GB. What's up with that?
15:22 πŸ”— sep332 perhaps the warc rearranged things - say, alphabetically - so it compressed better?
17:44 πŸ”— godane Smiley: anything about the 'GETS' yet?
18:12 πŸ”— Smiley non.
20:24 πŸ”— Coderjoe kyan: the tar.bz2 likely compressed a bit more when run through gzip
20:26 πŸ”— Coderjoe a warc.gz file is not one single gzip stream. each record is a separate stream.
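Coderjoe's point can be shown in a few lines of Python: a .warc.gz is a series of independent gzip members back to back, and each can be decompressed on its own. The "records" here are toy strings, not real WARC headers:

```python
import gzip
import zlib

# Build a tiny "warc.gz"-like file: two independent gzip members
# concatenated back to back (each WARC record is its own member).
data = gzip.compress(b"record one\r\n") + gzip.compress(b"record two\r\n")

# A plain gzip reader sees the whole concatenation transparently...
assert gzip.decompress(data) == b"record one\r\nrecord two\r\n"

# ...but the members can also be walked one at a time, which is what
# lets tools locate and extract a single record.
members = []
buf = data
while buf:
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip wrapper
    members.append(d.decompress(buf))
    buf = d.unused_data  # bytes after the end of this member
print(len(members))  # 2
```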
20:26 πŸ”— kyan Coderjoe, Actually I decompressed it, the raw WARC was only 7 or 8 gb so I think something got left out
20:26 πŸ”— Coderjoe hmm
20:27 πŸ”— Coderjoe was the tar.bz2 file just in the same directory as the warc files? what did it contain?
20:29 πŸ”— Coderjoe megawarc creates three files:
20:29 πŸ”— Coderjoe FILE.warc.gz is the concatenated .warc.gz; FILE.tar contains any non-warc files from the .tar; FILE.json.gz contains metadata
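Because each record is its own gzip member, the concatenation step can in principle be plain byte concatenation with no re-compression. A toy sketch of that idea (filenames invented; this is not megawarc's actual code):

```python
import gzip
import os
import shutil
import tempfile

# Toy stand-ins for individual .warc.gz files (names are made up).
d = tempfile.mkdtemp()
inputs = []
for name, payload in [("a.warc.gz", b"record A\r\n"),
                      ("b.warc.gz", b"record B\r\n")]:
    path = os.path.join(d, name)
    with open(path, "wb") as f:
        f.write(gzip.compress(payload))
    inputs.append(path)

# Concatenate the raw bytes: gzip members back to back are still
# one valid gzip file, so no decompress/recompress is needed.
mega = os.path.join(d, "MEGA.warc.gz")
with open(mega, "wb") as out:
    for path in inputs:
        with open(path, "rb") as f:
            shutil.copyfileobj(f, out)

# The combined file decompresses to both payloads in order.
with gzip.open(mega, "rb") as f:
    combined = f.read()
print(combined)  # b'record A\r\nrecord B\r\n'
```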
20:31 πŸ”— kyan Coderjoe, I downloaded the tar.bz2 with wget-warc, with the automatic-delete option enabled (so only the warc was kept). Then that warc (and all the others) got fed to megawarc, and then deleted. I'll see if I can track down where they ended up (the files have all been sent out to the Internet Archive now).
20:32 πŸ”— kyan Coderjoe,
20:32 πŸ”— kyan https://archive.org/details/AMJ_BarrelData_6561_88c1c7eb-067b-4547-949c-e3111a189bab.2013-12-16-19-44-26-244140-_E
20:32 πŸ”— kyan The log output went to the xz file…
20:33 πŸ”— kyan it's from an automated system I've been working on; I was keeping an eye on this item especially because it's the first really big thing the program has worked with
20:39 πŸ”— kyan Coderjoe, looking at the json.gz it looks like the file megawarc got was only 7gb. So, I conclude that Wget munged it :P
20:48 πŸ”— Coderjoe or megawarc did
20:51 πŸ”— xmc if megawarc munged it, that's worrying
20:51 πŸ”— xmc if wget munged it, that's also worrying
20:53 πŸ”— kyan Coderjoe, I don't think megawarc did since it doesn't (I don't think?) touch the original warc
20:56 πŸ”— kyan I don't know though, I 'm not too familiar with the software involved
21:01 πŸ”— DFJustin it's pretty common for the connection to drop at some point in a 10gb download, that would be my guess
21:01 πŸ”— kyan DFJustin, usually wget resumes automatically I think? The download log seemed to indicate it got all the way to 100% complete
21:02 πŸ”— kyan (although I did some downloads of big MPEG transport stream files with wget-warc (10 to 20 gb) and they seemed to come through fine, so I'm not sure what's different)
21:03 πŸ”— DFJustin depends on the parameters
21:03 πŸ”— kyan and that was with a lot of dropped connections since it was to a server in greece (I think) on a flaky connection
21:03 πŸ”— kyan Ok I guess I'll look at those items then, and compare the settings to what my script is doing
21:04 πŸ”— DFJustin and on how the server on the other end behaves
21:04 πŸ”— kyan see what I changed
21:04 πŸ”— kyan I see… I'll do some investigating :)
21:04 πŸ”— kyan thanks!
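One cheap check for the kind of truncation being discussed: each gzip member ends with a CRC and length trailer, so reading a cut-off .warc.gz all the way through raises an error. A hedged sketch using a toy file (not kyan's actual item):

```python
import gzip
import os
import tempfile

def warc_gz_intact(path):
    """Read the whole file; a member cut off mid-stream raises
    EOFError (or BadGzipFile), flagging a download that died partway."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):
                pass
        return True
    except (EOFError, gzip.BadGzipFile):
        return False

# Demo: an intact toy file vs. the same file with bytes chopped off.
d = tempfile.mkdtemp()
good = os.path.join(d, "good.warc.gz")
with open(good, "wb") as f:
    f.write(gzip.compress(b"x" * 100000))

bad = os.path.join(d, "bad.warc.gz")
with open(bad, "wb") as f:
    with open(good, "rb") as g:
        f.write(g.read()[:-50])  # simulate a dropped connection

print(warc_gz_intact(good), warc_gz_intact(bad))  # True False
```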
21:11 πŸ”— arkiver hmm
21:11 πŸ”— arkiver has jason scott been online?
21:11 πŸ”— arkiver haven't seen him for some time...
21:13 πŸ”— DFJustin he was on this morning
21:16 πŸ”— arkiver ah well
21:16 πŸ”— arkiver I sent SketchCow some emails a while ago and I didn't get a reply yet, so I thought maybe he was on holiday or something
21:17 πŸ”— DFJustin he's been flying back to new york I think
21:18 πŸ”— DFJustin https://twitter.com/textfiles/status/413037877791326208/photo/1
21:22 πŸ”— arkiver ah, so that's why
21:22 πŸ”— arkiver thank you, guess I just have to wait some time... :)
21:45 πŸ”— m1das flying or sliding, it's all the same
23:26 πŸ”— SketchCow His terrible crack addiction has overtaken his responsibilities.
23:41 πŸ”— dashcloud this might be interesting to some folks here, on the use of PhantomJS for a massive site: http://sorcery.smugmug.com/2013/12/17/using-phantomjs-at-scale/
23:43 πŸ”— BiggieJ i wish I understood more about the massive node.js implementation we have at work
23:45 πŸ”— Baljem SketchCow: ah, see, there's your mistake; try being addicted to /incredible/ crack instead.
