[01:03] *** JAA sets mode: +bb Ya_ALLAH!*@* *!*@185.143.4*
[01:45] *** CoolCanuk has joined #archiveteam-bs
[02:01] *** DopefishJ is now known as DFJustin
[02:10] *** odemg has quit IRC (Ping timeout: 250 seconds)
[02:13] *** odemg has joined #archiveteam-bs
[02:21] *** pizzaiolo has quit IRC (pizzaiolo)
[03:05] *** ZexaronS has quit IRC (Read error: Operation timed out)
[03:07] *** Stilett0 has quit IRC (Read error: Operation timed out)
[03:16] *** ndiddy has joined #archiveteam-bs
[03:21] *** Stilett0 has joined #archiveteam-bs
[03:21] *** Stilett0 is now known as Stiletto
[03:38] <robogoat> Somebody2: Ok, can you explain something, regarding darking and crawl history?
[03:38] <robogoat> It is possible for someone to own domain A, have content on domain A, be fine with IA archiving it,
[03:39] <robogoat> and then sell/let the domain lapse, and the subsequent owner/squatter puts up a prohibitive robots.txt
[03:39] <robogoat> It's my understanding that the material is then non-accessible,
[03:39] <robogoat> is it "darked"?
[04:05] *** qw3rty111 has joined #archiveteam-bs
[04:09] *** qw3rty119 has quit IRC (Read error: Operation timed out)
[04:14] *** CoolCanuk has quit IRC (Quit: Connection closed for inactivity)
[04:21] <vantec> Been out of the loop for a bit, but think this mostly still stands: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
[05:28] *** bithippo has quit IRC (Read error: Connection reset by peer)
[05:44] *** wacky has quit IRC (Read error: Operation timed out)
[05:44] *** wacky_ has joined #archiveteam-bs
[06:50] <Somebody2> robogoat: First of all, note that I'm not employed by IA, and in fact have only visited once. This is just outside curious onlooker.
[06:51] <Somebody2> With that out of the way -- dark'ing applies to *items* (a jargon term) on archive.org, not web pages in the Wayback Machine.
[06:52] <Somebody2> The Wayback Machine is an *interface* to a whole bunch of WARC files, stored in various items on IA.
[06:53] <Somebody2> The WARC files are what actually contain the HTML (and URLs, and dates) that are displayed through the Wayback Machine.
[06:55] <Somebody2> If an item is darked, it can't be included in the Wayback Machine (or any other interface, like the TV News viewer, or the Emularity).
[06:56] *** Stiletto has quit IRC (Ping timeout: 250 seconds)
[06:56] <Somebody2> So none of the items containing WARCs that contain web pages visible through the Wayback Machine are darked (unless I'm missing something).
[06:57] <Somebody2> However -- that doesn't mean you can download the WARC files yourself, directly (with some exceptions).
[06:58] <Somebody2> Most of the items containing WARCs used in the Wayback Machine are "private", which means while you can see the file names, and sizes, and hashes, ...
[06:59] <Somebody2> ... you can't actually download the actual files without special permission (which the software that runs the Wayback Machine has).
[07:00] <Somebody2> The WARCs produced by ArchiveTeam generally are *NOT* private -- although we prefer not to talk about this too loudly, to avoid people complaining.
[07:01] <Somebody2> A recently added feature of the Wayback Machine provides links from a particular web page to the item containing the WARC it came from.
[07:04] <Somebody2> Actually, it looks like it only links to the *collection* containing the item containing the WARC, sorry.
[07:06] <Somebody2> So, now, to get back to robots.txt -- the Wayback Machine does (currently) include a feature to disable access to URLs whose most recent robots.txt file Disallows them.
[07:07] <Somebody2> The details of exactly how this operates (i.e. which Agent names does it recognize, how does it parse different Allow and Disallow lines, ...
[07:07] <Somebody2> ... what does it do if there is no robots.txt file) are subtle, changing, and undocumented.
[07:08] <Somebody2> And robots.txt files do *NOT* apply to themselves, so you can always see the contents of all the robots.txt files IA has captured for a domain.
[07:09] <Somebody2> (unless there was a specific complaint sent to IA asking for the domain to be excluded, which they also honor)
[07:10] <Somebody2> But the robots.txt logic doesn't apply at ALL to the underlying items -- so *if* you can download them, you can still access the data that way.
[07:10] <Somebody2> Hopefully that answers the question.
[07:10] <Somebody2> (and sorry everyone else for the literal wall of text)
[07:11] <Somebody2> There is also a robots.txt feature included in the Save Page Now feature, but that's a separate thing.
[07:22] *** bwn has quit IRC (Read error: Connection reset by peer)
[07:37] *** DFJustin has quit IRC (Remote host closed the connection)
[07:44] *** DFJustin has joined #archiveteam-bs
[07:44] *** swebb sets mode: +o DFJustin
[07:44] *** bwn has joined #archiveteam-bs
[08:01] *** ZexaronS has joined #archiveteam-bs
[08:03] *** mr_archiv has quit IRC (Quit: WeeChat 1.6)
[08:03] *** mr_archiv has joined #archiveteam-bs
[08:03] *** mr_archiv has quit IRC (Client Quit)
[08:05] *** mr_archiv has joined #archiveteam-bs
[08:54] *** Mateon1 has quit IRC (Read error: Connection reset by peer)
[08:55] *** Mateon1 has joined #archiveteam-bs
[09:11] *** MrDignity has quit IRC (Remote host closed the connection)
[09:11] *** MrDignity has joined #archiveteam-bs
[10:12] *** schbirid has joined #archiveteam-bs
[10:17] *** nyany has quit IRC (Leaving)
[10:40] *** JAA sets mode: +bb BestPrize!*@* *!pointspri@*
[10:53] *** MrDignity has quit IRC (Remote host closed the connection)
[10:53] *** MrDignity has joined #archiveteam-bs
[12:19] *** kimmer12 has joined #archiveteam-bs
[12:22] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:25] *** schbirid has quit IRC (Quit: Leaving)
[12:26] *** kimmer1 has quit IRC (Ping timeout: 633 seconds)
[12:54] *** dashcloud has quit IRC (No Ping reply in 180 seconds.)
[12:54] *** dashcloud has joined #archiveteam-bs
[13:00] *** Stilett0 has joined #archiveteam-bs
[13:46] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:48] *** dashcloud has joined #archiveteam-bs
[14:33] *** Specular has joined #archiveteam-bs
[14:34] <Specular> is there any known way of converting Web Archive files saved from Safari to the MHT format?
[14:52] *** godane has quit IRC (Quit: Leaving.)
[15:09] *** kimmer1 has joined #archiveteam-bs
[15:12] *** kimmer12 has quit IRC (Ping timeout: 633 seconds)
[15:35] *** kimmer12 has joined #archiveteam-bs
[15:36] *** kimmer12 has quit IRC (Remote host closed the connection)
[15:42] *** kimmer1 has quit IRC (Ping timeout: 633 seconds)
[16:06] <Specular> somehow my search queries were too specific prior and just found this. Mac only but will test later. https://langui.net/webarchive-to-mht/
[16:07] <Specular> oh it's commercial. Typical Mac apps, ahaha.
[16:58] *** Specular has quit IRC (Quit: Leaving)
[17:09] *** pizzaiolo has joined #archiveteam-bs
[17:52] *** ola_norsk has joined #archiveteam-bs
[17:54] <ola_norsk> how might one go about to 'archive a person' on the internet archive? I'm thinking of the youtuber Charles Green a.k.a 'Angry Granpa'
[18:04] *** kimmer1 has joined #archiveteam-bs
[18:12] <Somebody2> ola_norsk: You can't archive a person. But you could archive the work they have posted online. I'd use Archivebot and youtube-dl
[18:14] <ola_norsk> Somebody2: could those various items, from e.g the fandom wikia, twitter, to youtube videos etc; later be made into e.g a 'Collection' ?
[18:14] <ola_norsk> Somebody2: without having to be one item, i mean
[18:15] <Somebody2> Yes, once you upload them, send an email to info@archive suggesting they be made into a collection, and someone will likely do it eventually.
[18:15] <ola_norsk> ty
[18:16] *** godane has joined #archiveteam-bs
[18:17] <ola_norsk> btw, would the items need a certain meta-tag?
[18:18] <ola_norsk> other than topics, i means
[18:20] <JAA> For WARCs, you need to set the mediatype to web. Anything else is optional and can be changed post-upload (mediatype can only be set at item creation). But the more metadata, the better! :-)
[18:20] <ola_norsk> okidoki
[18:21] <JAA> (If you forget to set the mediatype correctly on upload, send them an email instead of trying to work around it by creating a new item or whatever; they can change it, I believe.)
[18:22] *** pizzaiolo has quit IRC (Read error: Operation timed out)
[18:22] <ola_norsk> speaking of which, i messed that up on this item https://archive.org/details/vidme_AfterPrisonJoe :/
[18:24] <ola_norsk> and it seems to have messed up the media format detection. I did re-download the videos locally though.
[18:26] *** pizzaiolo has joined #archiveteam-bs
[18:27] <ola_norsk> JAA: would it be a good idea to simply tar.gz the vidme videos, add that to the item; and then send an email asking the all item's content to be replaced by the content of the tar.gz
[18:27] <JAA> I doubt it.
[18:28] <JAA> Why don't you just use the 'ia' tool (Python package internetarchive)?
[18:28] <ola_norsk> i do use that
[18:29] <ola_norsk> but, there seems to be a bug that's preventing changing metadata.
[18:30] <JAA> Hm?
[18:30] <ola_norsk> JAA: https://github.com/jjjake/internetarchive/issues/228
[18:31] <ola_norsk> but, maybe it's resolved already. I've not tested yet.
[18:41] <ola_norsk> ouch, might i have accidentally closed the issue? Or do they timeout on github after some time :/
[18:46] <ola_norsk> JAA: what version of 'ia' are you on?
[18:46] * ola_norsk is on 1.7.4
[19:18] <ola_norsk>  how do i submit 'angrygrandpa.wikia.com' to get grabbed by archivebot?
[19:19] <ola_norsk> nevermind
[19:38] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[19:38] *** dashcloud has joined #archiveteam-bs
[19:56] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[20:00] *** dashcloud has joined #archiveteam-bs
[20:20] *** BlueMaxim has joined #archiveteam-bs
[20:35] <JAA> ola_norsk: Sorry, had to leave. I've been using 1.7.1 and 1.7.3.
[20:38] <JAA> ola_norsk, Somebody2: If you archive a site through ArchiveBot, you can't easily add just that site's archives to a collection afterwards though, because the archives are generally spread over multiple items and each of those items also contains loads of other jobs' archives.
[20:39] <JAA> It might be better to use grab-site/wpull and upload the WARCs yourself. Then you can create clean items for each site you archive or whatever, and these can easily be added to a collection (or multiple) afterwards.
[20:42] <Somebody2> That is a good point, thank you.
[20:47] <PurpleSym> What’s the #archivebot IRC logs password?
[20:56] <JAA> Query
[20:57] <PurpleSym> And username, JAA?
[20:59] <ola_norsk> JAA: ah. i will see if i have the space for that wikia. (though, it's already submitted to archivebot as a job) :/ I will use !abort if my harddrive runs out
[21:00] <ola_norsk> JAA: so there's no real way to recall a specific archivebot job/task?
[21:00] <JAA> ola_norsk: You could upload the WARCs already while you're still grabbing it. That's what ArchiveBot pipelines do as well.
[21:00] <JAA> What do you mean by "recall"?
[21:01] <ola_norsk> JAA: to make that specific archivebot task into an item on ia
[21:02] <ola_norsk> JAA: a warc item, i mean, with topics etc.
[21:02] <JAA> In theory, you could download the files and reupload them to a new item. Or use 'ia copy', which does a server-side copy I believe. Whether that's a good idea though is another question entirely...
[21:03] <ola_norsk> im n00b at using archivebot i'm afraid :/
[21:03] <JAA> I guess someone from IA could move the files to a separate item. But again I'm not sure whether they do that.
[21:03] <JAA> But yeah, that's all manual.
[21:04] <JAA> Some pipelines upload files directly, and there you sort-of have one item per job (though it doesn't contain all relevant files and may sometimes contain other jobs as well).
[21:04] <JAA> But other than that...
[21:05] <ola_norsk> a warc item that's manually uploaded as item at a later time though, could that use data already grabbed trough archivebot? Or would it be causing duplicate-hell ?
[21:06] <JAA> I don't know how IA handles duplicates.
[21:06] <ola_norsk> aye, me neither
[21:07] <JAA> That's why I wonder if it's a good idea.
[21:07] <JAA> If they deduplicate the files, then it would probably be fine.
[21:07] <JAA> Maybe someone else knows more about this.
[21:08] <ola_norsk> "somewhere, in the deep cellars of internet archive; There's a single gnome set to the task of checksum'ing all files and writing symlinks" lol
[21:08] <ola_norsk> :D
[21:12] *** Mateon1 has quit IRC (Read error: Operation timed out)
[21:13] *** Mateon1 has joined #archiveteam-bs
[21:15] <Somebody2> IA does not (*YET*) dedeuplicate.
[21:15] <Somebody2> (AFAIK)
[21:20] *** jschwart has joined #archiveteam-bs
[21:49] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:50] *** dashcloud has joined #archiveteam-bs
[21:51] <ola_norsk> i have no idea. if there's no pressing need it's ok i guess. And thinking they would if need be, at least on items that's not been altered for quite a while.
[21:52] <godane> dedeuplicate i can see being done on video and pdf items
[21:53] <godane> i don't think it would work with warc files
[21:53] <ola_norsk> aye
[21:53] <ola_norsk> not with derived/recompressed files either i think. not unless the original was checked beforehand
[21:57] <ola_norsk> godane: i guess with warc it would need checking content; and patching the stuff and/or link list in the warcs
[21:58] <godane> sort of my thought
[21:58] <ola_norsk> aye
[21:59] <ez> unless ia unpacks all wars on their side
[21:59] <godane> i was think it would be check important web urls that have the same checksum and making derive warc archive to only store it once
[22:00] <ez> warc is generally rather unfortunate thing to do for bulk file formats
[22:00] <ez> (not sure about the wisdom of reinventing zip files, either)
[22:00] <godane> so it would be derive warc either way
[22:01] <godane> something like this for warc makes more sense if we are doing my librarybox project idea
[22:01] <ola_norsk> got link?
[22:01] <godane> cause then people can host full archives of cbsnews.com for example without it take 100gb
[22:02] <ez> the thing is that on mass scale, dupes dont happen that often in general
[22:02] <ez> so its often not worth the time bothering with it, especially for small items
[22:04] <ola_norsk> but e.g for twitter, users might upload memes that are just copied from other sites. i don't know if twitter alters all images; but that could cause duplicates i think
[22:04] <ola_norsk> if, twitter gives each uploaded image it's own file and filename, i mean
[22:05] <ez> somewhat
[22:06] <ez> theres been study for this for 4chan, which granted, isnt representative sample of twitter
[22:06] <ez> but as far as actual md5 stored _live_, only 10-15% were dupes
[22:06] <ola_norsk> ok
[22:06] <ez> however over time, there were indeed >50% dupes in certain time periods
[22:06] <ez> exactly as you say, some image got really popular and got reposted over and over
[22:08] *** pizzaiolo has quit IRC (Read error: Operation timed out)
[22:08] <ola_norsk> in any case, it's something that's fixable in the future though i would guess. E.g picking trough all the shit and finding e,g the most reoccuring, or highquality image; either by md5 or image regocnitions
[22:09] <ola_norsk> image recognition*
[22:09] *** pizzaiolo has joined #archiveteam-bs
[22:10] <ez> well, this isnt that exact study i've seen, but mentions the distribution of dupes per post https://i.imgur.com/S9pxJqV.png
[22:11] <ez> ola_norsk: it could be done of course. note that if you were to do what you say, you'd also build index for reverse image search
[22:11] <ez> which would be a really handy thing for IA to have
[22:11] <ez> needless to say, it depends if IA wants to diversify as a search engine
[22:12] <ola_norsk> well, the meta.xml is there, containing the md5's i think :D
[22:12] <ez> md5s useless for search
[22:12] <ola_norsk> aye, but to find duplicates i mean
[22:12] <ez> on md5 level, the dupes dont happen often enough given a random kitchen sink of files
[22:13] <ez> it definitely makes sense for certain datasets
[22:13] *** pizzaiolo has quit IRC (Client Quit)
[22:13] <ez> like those ad laden pirate rapidshare upload servers, they sure do md5. they have this huge sample of highly specific cotent, and the files often are same copies.
[22:13] <ez> megaupload actually got nailed for this legally
[22:13] *** pizzaiolo has joined #archiveteam-bs
[22:13] <ez> they dmcad the link, but not the md5
[22:14] <ez> deduping 10k 1MB image files and deduping 10k 1GB files is what makes the difference
[22:15] <ez> as you get random mix of both, its obvious which part of the set to focus on
[22:15] <ola_norsk> it would be slow work i guess
[22:15] <ez> depends on how the system works, really
[22:16] <ez> most data stores working on per-file basis often compute hash on streaming upload, and make a symlink when hash found at that time, too
[22:16] <ez> but more general setups often dont have the luxury of having a hook fire per each new file
[22:16] <ola_norsk> what if there was a distributed tool to Warrior, that picked trough the IA items xml and looked for duplicate md5's ?
[22:17] <ola_norsk> (of only original files that is)
[22:18] <ez> its possible to fetch the hashes from ia directly via api
[22:18] <JAA> joepie91: FYI, I'm porting your anti-CloudFlare code to Python.
[22:19] <ez> not everything is available tho. as long it has xml sha1, you can api fetch it
[22:19] <ez> build an offline database too, etc
[22:20] <ola_norsk> ez: it's a quite a number of items on ia though. But if it was a delegated slow and steady task
[22:21] <ola_norsk> ez: like a background task or something
[22:21] <ez> no i mean you can have the hashes offline
[22:21] <ez> as comparably small structure
[22:21] <ez> unfortunately its really awkward to get it at this moment
[22:22] <ola_norsk> ez: i mean making a full list of duplicate files. where e.g the 'parent item' is by first date
[22:22] <ez> ola_norsk: a bloom filter with reasonable collision rate is like 10 bits per item
[22:23] <ez> regardless of number of items
[22:24] <ez> not sure if ia supports search by hash
[22:25] <ez> last time i checked the api (some years ago) it didnt
[22:25] <ola_norsk> the md5 hashes are in the xml of (each?) item
[22:25] <ez> if it still doesnt, you'd need to store hash, as well as xml id to locate its context as you say
[22:25] <ez> which would bload the database a great deal
[22:26] <ez> ola_norsk: yea
[22:26] <ez> the idea is that i run scan over my filesystem and compare every file to filter i scrapped from ia api
[22:26] <ola_norsk> it's not going anywhere though. So it could basically be done slow as shit don't you think?
[22:26] <ez> and upload only files which dont match. this is because querying IA with 500M+ files is not realistic
[22:27] <ez> so bloom filter would work fine for uploads of random crap and making more or less sure its not a dupe
[22:28] <ez> but it wont tell you *where* your files are on ia, unless you abuse the api with fulltext search and what not
[22:28] <ola_norsk> i have no idea man :)
[22:29] <ola_norsk> does the logs say?
[22:29] <ola_norsk> the history of items?
[22:30] * ola_norsk 's brain is broken and beered :/
[22:31] <ola_norsk> i'm guessing IA would put some gnome to work the day their harddrive is full :D
[22:32] <ola_norsk> (which i'm guessing is not tomorrow lol) :D
[22:32] <ez> just restrict some classes of uploads when space starts running short
[22:32] <ez> but yea, space can be done on cheap if you have the scale
[22:33] <ola_norsk> doh, restriction is bad :/
[22:33] <ola_norsk> might they as well check if that file already exist?
[22:34] <ez> as i said, at those scales, the content is so diverse it happens rather infrequently, especially if your files are comparably small (ie a lot of small items of diverse content)
[22:34] <ola_norsk> ...maybe they already do.. :0
[22:34] *** odemg has quit IRC (Read error: Connection reset by peer)
[22:35] <ez> its easy to do for single files, but not quite sure about warc
[22:35] <ola_norsk> aye, it would need unpacking and stuff
[22:36] <ola_norsk> and e.g youtube videos that are mkv combined would need split into audio and video i guess, then compared
[22:37] <ez> the thing is i've seen deduping rather infrequently in large setups like this - the restrictions on flexibility of what you can do (you now need some sort of fast hash index to check against, you need some symlink mechanism now)
[22:37] <ez> youtube doesnt dedupe im pretty sure
[22:37] <ez> not by the source video anyway
[22:38] <ez> since 99% of content they get is original uploads. most dupes they'd otherwise get usually gets struck by contentid
[22:38] <ez> 1% is the long trail of short meme videos reposted over and over and what not, but its just a tiny part of long trail
[22:39] <ola_norsk> i sometimes upload by getting videoes with youtube-dl, and it seems that often combines 2 files, audio and video, into a single file..would that make different md5 sum?
[22:40] <ola_norsk> (without repacking/recoding, i mean)
[22:40] <ez> (its easy to test - each reupload yields new reencode on yt, and the encode is even slightly different as it contains timestamp in mp4 header)
[22:41] <ez> ola_norsk: yes, highest quality is available only via HLS
[22:42] <ez> curiously, its an artificial restriction, as other parts of google infra which uses yt infra, together mp4 1080/4k just fine
[22:42] <ez> the restriction is specific to yt, i suppose in a bid to frustrate trivial ripping attempts via browser extensions
[22:44] <ola_norsk> but, i think what i mean is; If i run 'youtube-dl' on the same youtube video twice..Then e.g the audio and video (often webm and mp4), before they are merged into MKV file, would be the very same files each time? or?
[22:44] <ez> ola_norsk: on and off its possible to abuse youtube apis to get the actual original mp4, but it changes 2 times now (works for google+, only for your own videos when logged in)
[22:44] <ez> *has changed
[22:44] <ez> so definitely not something to rely on
[22:45] <ez> but if i were as crazy to archive yt, i'd definitely try to rip the original files, not the re-encodes
[22:45] *** odemg has joined #archiveteam-bs
[22:45] <ola_norsk> no hehe :D i'm just talking example as to how to detect duplicate videos :D
[22:45] <ola_norsk> or duplicate uploads in general
[22:46] <ez> ola_norsk: depends what you command ytdl to do
[22:46] <ez> generally if you ask it same file, same format, you overwhelmingly get the exact same file
[22:46] <ez> but google re-encodes those from time to time
[22:48] <ola_norsk> aye, most often it just combines audio and video, 2 different files, into a KVM file. And, i'm thinking if the kvm file were split again, into those two a/v files, the md5 would be the same in two instances of where youtube-dl were used to download the same video.
[22:49] <ola_norsk> and, by that, duplicate video uploads could be detected
[22:50] <ola_norsk> the generated merged file (kvm etc) would be different, but the two merged files would be same, since there's no re-encoding occurring
[22:51] <ez> hum, that sounds elaborate?
[22:51] <ola_norsk> aye, i have a headacke :D
[22:51] <ez> why not just tell ytdl to rip everything to mkv from the get-go?
[22:54] <ola_norsk> i just meant in relation to detecting duplicates in IA items (where e.g 2 youtube vidoes are uploaded twice)
[22:55] <ola_norsk> where md5sum of the two kvm files are not an option
[22:55] <ola_norsk> since each download would cause two diffrent kvm to be made locally by the downloaders
[22:57] <ola_norsk> if, however, it's possible to split a kvm, into the audio and video files contained..i'm thinking those two would yield identical md5
[23:02] * ola_norsk ran out of duplication-detection smartness :/
[23:05] <ez> i have no idea what kvm file is
[23:10] *** nyany has joined #archiveteam-bs
[23:16] <ola_norsk> ez: usually when i use youtube-dl it downloads audio and video seperate, then combine the two into a kvm file
[23:16] <ez> whats an kvm file?
[23:16] <ez> oh
[23:16] <ez> mkv
[23:17] <ola_norsk> ah, sorry, yes
[23:17] <ola_norsk> mkv
[23:18] <ez> yea its a bit annoying, mostly because the ffmpeg transcoder is non-deterministic
[23:18] <ez> i think it puts timestamp in the mkv header or something silly like that
[23:18] <ez> s/transcoder/muxer/
[23:19] <ola_norsk> does it alter the two audio and video files though? or can it be split from the mkv?
[23:21] <ola_norsk> either way, if that is possible; then detecting duplicate mkv files uploaded to ia is possible
[23:21] <ola_norsk> even if the md5 sum differs between to mkv files containing the same content
[23:22] <ez> the raw bitstream is kept as-is
[23:23] <ola_norsk> ok
[23:23] <ez> meaning the mux is as "original" as sent in the google HLS track
[23:23] <ez> but the container metadata are often unstable on account of different versions of software muxing slightly differently (think seek metadata and such)
[23:27] <ola_norsk> if the framecount doesn't differ though, and neither does the frames, that could be a further step?
[23:28] <ola_norsk> ez: the extent of my knowledge is rather spent when it comes to codecs :D
[23:28] <ez> tl;dr is that you cant rely on what ytdl gives you as muxed output
[23:29] <ez> perhaps if you just ask for the 720p mp4, as that one isnt remuxed (yet?)
[23:30] <ola_norsk> i'm thinking some kind of image/frame simularity detection then i guess
[23:30] <ola_norsk> rather*
[23:32] *** jschwart has quit IRC (Quit: Konversation terminated!)
[23:32] <ola_norsk> the md5 is in the item xml though, so i guess that is where one would have to start to find duplicates on ia
[23:39] <ola_norsk> ez: it is as you say elaborate. So i'm glad i don't have to do it :D
[23:40] <ola_norsk> ez: (and so should everyone else be, of me not doing it lol) ;)
[23:41] *** BlueMaxim has quit IRC (Quit: Leaving)