[00:00] *** Arcorann has joined #archiveteam-ot [00:01] *** Arcorann has quit IRC (Remote host closed the connection) [00:01] *** Arcorann has joined #archiveteam-ot [00:45] *** Arcorann has quit IRC (Read error: Connection reset by peer) [00:45] *** Arcorann has joined #archiveteam-ot [00:50] *** Arcorann has quit IRC (Read error: Connection reset by peer) [00:51] *** Arcorann has joined #archiveteam-ot [01:06] *** godane has quit IRC (Ping timeout: 265 seconds) [02:03] *** benjins has quit IRC (Ping timeout: 610 seconds) [02:21] *** wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES) [02:23] *** yawkat has quit IRC (Ping timeout: 260 seconds) [02:26] *** yawkat has joined #archiveteam-ot [02:33] *** godane has joined #archiveteam-ot [02:46] *** wp494 has joined #archiveteam-ot [03:49] *** qw3rty_ has joined #archiveteam-ot [03:57] *** qw3rty has quit IRC (Read error: Operation timed out) [04:21] *** OrIdow6 has quit IRC (Read error: Connection reset by peer) [04:22] *** OrIdow6 has joined #archiveteam-ot [05:19] *** jrwr has quit IRC (Ping timeout: 260 seconds) [05:21] *** namespace has quit IRC (Read error: Operation timed out) [05:21] *** nyany has quit IRC (Write error: Broken pipe) [05:21] *** prq has quit IRC (Write error: Broken pipe) [05:21] *** mtntmnky has quit IRC (Read error: Operation timed out) [05:21] *** revi has quit IRC (Ping timeout: 260 seconds) [05:23] *** DigiDigi has quit IRC (Read error: Operation timed out) [05:24] *** Igloo has quit IRC (Read error: Operation timed out) [05:39] *** nyany has joined #archiveteam-ot [05:39] *** namespace has joined #archiveteam-ot [05:40] *** jrwr has joined #archiveteam-ot [05:40] *** DigiDigi has joined #archiveteam-ot [05:40] *** revi has joined #archiveteam-ot [05:41] *** prq has joined #archiveteam-ot [05:44] *** Igloo has joined #archiveteam-ot [05:48] *** mtntmnky has joined #archiveteam-ot [11:15] *** MrRadar has quit IRC (Ping timeout: 610 seconds) [11:20] *** benjins has joined #archiveteam-ot 
[11:57] *** benjinsmi has joined #archiveteam-ot [11:57] *** benjins has quit IRC (Ping timeout: 265 seconds) [12:02] *** benjinss has joined #archiveteam-ot [12:07] *** benjinsmi has quit IRC (Read error: Operation timed out) [12:29] *** Arcorann_ has joined #archiveteam-ot [12:36] *** Arcorann has quit IRC (Read error: Operation timed out) [12:43] *** Ravenloft has quit IRC (Read error: Operation timed out) [13:02] *** Ravenloft has joined #archiveteam-ot [13:20] *** Stiletto has quit IRC (Ping timeout: 260 seconds) [13:43] *** BlueMax has quit IRC (Quit: Leaving) [13:55] So, how about those Nintendo leaks? [15:12] *** exoire has joined #archiveteam-ot [15:13] *** exoire has quit IRC (Client Quit) [16:26] *** Arcorann_ has quit IRC (Read error: Connection reset by peer) [17:29] *** yawkat has quit IRC (Ping timeout: 272 seconds) [17:29] *** Laverne has quit IRC (Ping timeout: 272 seconds) [17:29] *** NatarajBt has quit IRC (Ping timeout: 272 seconds) [17:29] *** sHATNER has quit IRC (Ping timeout: 272 seconds) [17:32] *** yawkat has joined #archiveteam-ot [18:30] *** sHATNER has joined #archiveteam-ot [18:31] *** NatarajBt has joined #archiveteam-ot [18:31] *** Laverne has joined #archiveteam-ot [19:34] *** oorw has joined #archiveteam-ot [19:39] Hello. Anyone online who could explain how to work with megawarc.warc.zst files? I didn't manage to find a tutorial anywhere... [19:44] oorw: I don't think there is any documentation on it yet. [19:44] It's a very new thing still. [19:45] Where do you find the dictionary, given an IA item? [19:46] It's inside the file as a skippable frame with some magic ID. [19:46] Oh [19:46] So, I found https://git.kiska.pw/ArchiveTeam/megawarc which seems to be the latest branch of megawarc [19:46] but I have the same question as above: where do I get the correct dictionary? [19:47] https://github.com/internetarchive/CDX-Writer/blob/zstd/cdx_writer/zstdstream.py should be useful also. 
[19:47] Might be good to separate it out into a separate file [19:47] But the idea is to write a spec for this and have it added to the WARC specification as an appendix. [19:48] You can do that if you want. It's easier to handle this way on the larger scales, I think. [19:48] Also, the file is self-contained like this. [19:49] Otherwise, there will eventually be a situation where someone only kept the .warc.zst but not the dictionary file and can't read it back again. [19:49] which file should be the input to that script? [19:50] So long as it isn't standardized, having it in the file is a threat to accessibility [19:50] In the file and not anywhere else, I should say [19:52] my problem is as follows: I find a set of interesting warc.zst's and I also find a set of zst dictionaries in an odd format (in a different collection!) [19:52] 1st problem: how do I find the matching dictionary to initial .zst since there are multiple [19:52] 2nd: how to extract the pure zst dictionary - i suppose the script from @JAA helps with that... is that right? [19:58] oorw: Yes, specifically the get_zstd_dictionary function. [19:58] right, that much I figured :) [19:58] how about question 1? [19:59] You don't need a matching dictionary since it's inside the file. That's exactly what I meant above. There's no need to search for or download a second file as it's self-contained. [19:59] zstd tooling in general still sucks. There are no equivalents of zless, zgrep, etc. (that I'm aware of). [19:59] But hopefully that will get better over time. [20:01] So in that sense, .warc.zst isn't accessible anyway, and having the dictionary hidden inside it doesn't really make a difference, OrIdow6. [20:01] aaah, I see. So *.megawarc.warc.zst is all I need? If so, then what are those _dictionary_ releases good for? [20:01] Once a specification exists, it'll probably/hopefully be implemented by warcio and other libraries to seamlessly add support for these files to all kinds of tools. 
[20:02] oorw: That's for the retrieval phase. Basically, the dictionary is recalculated every now and then to reduce the WARC size, and that dictionary needs to be kept somewhere for the crawlers. Someone decided that'd best be inside an IA item. [20:03] @JAA: Ok, thanks for the clarification. I know it's a stretch, but is there any chance that a complete, working example is available somewhere? [20:04] *of unpacking a .zst-packed release [20:04] I don't think so. CDX-Writer is probably the closest. [20:05] that's a shame - the whole point of archiving would be to be able to recover the information with reasonable effort... [20:05] JAA: Knowledge of zstd is fairly widespread; the existence or nonexistence of the tools doesn't change that [20:05] I know, I know, new tech and all, but I hope this is improved sooner rather than later [20:05] Yes, I've been meaning to write up the spec for it, but other stuff keeps interfering. [20:06] There have been 2 (3, if you count me) people who have come in here recently who know what a zstd dictionary is and are unable to get anything out of these files [20:06] OrIdow6: It's widespread as a compression algorithm itself, but reading zstd-compressed files with a custom dictionary is *much* less common. [20:06] ironically, I did manage to unpack the .zst IA for the dictionary, but didn't know what to do with it then... [20:07] JAA: But it is well-documented, in the manpages, on the website, etc. (IIRC) [20:08] oorw: Yeah, those are just simple zstd-compressed files. I guess you could use those to decompress the .warc.zst, but you'll need to do the matching yourself. [20:08] OrIdow6: .warc.zst isn't though. Anyway, we're saying the same thing I think: the current situation sucks, but it's still very new, and hopefully it'll improve once a formal spec for it exists. [20:08] right, and that is exactly what I failed to do due to lack of any explanation anywhere... [20:09] Yup, because you don't really have to do any matching. 
That likely won't be documented ever. [20:09] If someone in 50 years with decent technical knowledge and enough records of the state of things now came across a file compressed with a zstd dictionary, they would presumably be able to open it; the same is not true of these zstd-compressed warcs, which depend on esoteric knowledge which seems to be confined to a few people inside of AT and IA as well as some obscure GitHub repos [20:09] Treat the dictionary items on IA as a retrieval artefact and ignore it. :-) [20:09] @JAA : fair enough as long as the 'correct' way is documented properly [20:10] It will. :-) [20:10] But yes, standardization is preferable [20:10] OrIdow6: It will be part of the WARC spec, so yes, anyone will be able to read it if they can get their hands on the spec, which is needed anyway to properly handle WARC files. [20:10] @Orldow6: just to clarify, have you managed to extract any data from warc.zst's or not? [20:11] The thing is that the whole WARC specification process is very much "implementation first please". So you can't get anything into the spec unless it is used and has proven fairly stable. We're somewhere between those steps now. [20:12] oorw: I haven't had a practical need to yet, just wondered, so no [20:12] I.e. it has been used for a while and is working fine, so it can be standardised now. [20:12] Which probably means it'll be in the released WARC spec sometime in the next 5 years. [20:12] at least the archiving part seems to be working just fine, not so sure about unarchiving :P [20:12] Well, the WBM is able to read it back just fine. [20:13] So in that sense, yes, that's working as well. [20:13] right, but the idea is that anyone can access the data, isn't it? [20:13] But the WBM is closed-source, so it sucks. [20:16] @JAA : even the CDX-Writer source code contains fun comments like '# assumes our specific way of calling. does not support general usage.' [20:17] not really user-friendly... 
[20:18] Yup, that's IA's typical source code. Working, kind of, but not exactly great code. [20:19] the good thing is that there are at least a few tests, so it's possible to see the usage (although fairly limited code snippets) [20:31] oorw: https://transfer.notkiska.pw/TXlRo/xtract.py for an ugly standalone thing [20:37] *** britmob has quit IRC (Read error: Connection reset by peer) [20:40] *** britmob has joined #archiveteam-ot [20:41] @Orldow6 : thanks a lot for that! Looks like i'm now hitting https://github.com/dask/fastparquet/issues/381 though :( I have never liked Python... [20:44] @Orldow6 : after uninstalling zstd & zstandard and installing just zstandard, the script worked. Now I got the dictionary file with the size of 1 MB, but now I'm basically back at square one: this looks equivalent to the separate IA items, which contain only the dictionary... How do I feed this dictionary to the particular .zst file to get to the actual data? [20:45] -D [20:45] E.g. zstdcat -D in a pipe [20:45] -D [the dictionary file] [20:46] oorw [20:47] Or just the "zstd" tool, for that matter, but I usually find myself using zstdcat more [20:47] And a warning that if this is a megawarc, this may create a huge file if you're just extracting it to disk [20:48] As would be the case with or without zstd [20:48] More so with zstd though since the custom dictionary massively decreases the WARC size. [20:49] Oh, you're right [20:50] uhm, now zstd is indeed extracting something, but I actually get a smaller-sized file than the original (roughly 90% of the size)... what am I missing? [20:51] Is it a valid warc? [20:51] Is it a warc, I should say; a validator isn't necessary [20:52] file ending is *.megawarc.warc.zst [20:52] From AT? Should be valid then. [20:52] (btw, sorry for the silly questions, i'm new to this...) [20:52] File ending of what? [20:52] The original, or the output of the zstd tool? 
[20:55] haha, don't laugh (too hard), but I unintentionally compressed it more instead of decompressing, doh (btw, gained 10% in space with one more round of compression...) [20:56] now that I actually did decompression, the results look like they should - I got ~12x the original file size [20:56] file ending is now .megawarc.warc as expected [20:56] what is the recommended tool for this filetype? (I am aware of the wiki page with a looong list of tools) [20:57] Huh, -10 %, interesting. [20:57] Depends on what you're trying to do. [20:57] I'd like to get the original data back - if they are pictures, I want to see pictures; if it is text, I want to see the text etc. [20:57] Local WBM-like playback: pywb (but note that it will create a full copy of your file on importing) [20:58] Extract into a file tree, although that doesn't always make sense: warcat has something for that. [20:58] WARCs basically store HTTP requests and responses, so mapping that to files isn't really possible. [20:59] warcat - great name... I think this is indeed what I'm looking for. I'll give it a go, thanks [21:01] to summarize everything until now: I think the crucial information that I was missing is that the .zst file has a self-container dictionary file & I also didn't find anywhere how to extract it... if these two pieces of information were made generally available in an easy-to-access medium, the ease-of-use for standard users would increase greatly [21:01] *self-contained [21:07] I'd say that only the first needs documentation; the second can be found in the manpages and command help [21:09] I may see about writing a nicer tool to get the dictionary out (maybe in C, since apparently the Python API is inadequate for this and needs some ffi hacks to make it work) [21:10] @Orldow6 : fair enough. 
But tbh, xtract.py that you provided was indeed very useful - didn't have to worry about finding the various code snippets and gluing them together [21:11] oorw: Well, that's how I made it, haha [21:12] And I'm glad it was useful - I think (public) accessibility of these things is useful [21:12] Couldn't agree more [21:13] Anyways, @Orldow6 @JAA - thanks a lot for all the help! [21:13] oorw: Np [21:13] I don't think I would have managed with only the publicly available information... maybe only by a stroke of luck [21:28] Sure thing. [22:04] *** oorw has left [22:31] *** Zerote has quit IRC (Read error: Connection reset by peer) [23:45] *** BlueMax has joined #archiveteam-ot
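
The two steps discussed above (pull the dictionary out of the leading skippable frame, then hand it to `zstdcat -D`) can be sketched roughly as follows. This is not xtract.py or the CDX-Writer code, only a minimal standard-library illustration. It assumes the dictionary sits in the very first frame of the file and relies on the generic zstd skippable-frame layout (a magic number in the range 0x184D2A50 to 0x184D2A5F, then a 4-byte little-endian length, then the payload). The function names are hypothetical, and the payload check is a hint only: the embedded dictionary may itself be zstd-compressed, in which case it would need a plain zstd decompression pass before use.

```python
import struct

# Magic numbers from the zstd frame format (all stored little-endian).
SKIPPABLE_MAGIC_MIN = 0x184D2A50   # skippable frames use 0x184D2A50..0x184D2A5F
SKIPPABLE_MAGIC_MAX = 0x184D2A5F
ZSTD_FRAME_MAGIC = 0xFD2FB528      # ordinary zstd frame
ZSTD_DICT_MAGIC = 0xEC30A437       # zstd dictionary


def read_leading_skippable_frame(stream):
    """Return the payload of a skippable frame at the start of `stream`.

    Hypothetical helper: assumes the dictionary is the first frame in the file.
    """
    header = stream.read(8)
    if len(header) < 8:
        raise ValueError("too short to contain a skippable frame header")
    magic, size = struct.unpack("<II", header)
    if not SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        raise ValueError("file does not start with a skippable frame")
    payload = stream.read(size)
    if len(payload) != size:
        raise ValueError("skippable frame is truncated")
    return payload


def describe_payload(payload):
    """Guess what the payload is by looking at its first four bytes."""
    if len(payload) < 4:
        return "unknown"
    (magic,) = struct.unpack("<I", payload[:4])
    if magic == ZSTD_FRAME_MAGIC:
        return "zstd-compressed"   # decompress with plain zstd first
    if magic == ZSTD_DICT_MAGIC:
        return "raw-dictionary"    # usable with -D as-is
    return "unknown"
```

With the payload written to a file (decompressed first if it turns out to be zstd-compressed), the decompression itself is the `zstdcat -D [the dictionary file]` call mentioned in the log, applied to the .megawarc.warc.zst.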