#archiveteam-bs 2017-12-16,Sat


Time Nickname Message
01:03 πŸ”— JAA sets mode: +bb Ya_ALLAH!*@* *!*@185.143.4*
01:45 πŸ”— CoolCanuk has joined #archiveteam-bs
02:01 πŸ”— DopefishJ is now known as DFJustin
02:10 πŸ”— odemg has quit IRC (Ping timeout: 250 seconds)
02:13 πŸ”— odemg has joined #archiveteam-bs
02:21 πŸ”— pizzaiolo has quit IRC (pizzaiolo)
03:05 πŸ”— ZexaronS has quit IRC (Read error: Operation timed out)
03:07 πŸ”— Stilett0 has quit IRC (Read error: Operation timed out)
03:16 πŸ”— ndiddy has joined #archiveteam-bs
03:21 πŸ”— Stilett0 has joined #archiveteam-bs
03:21 πŸ”— Stilett0 is now known as Stiletto
03:38 πŸ”— robogoat Somebody2: Ok, can you explain something, regarding darking and crawl history?
03:38 πŸ”— robogoat It is possible for someone to own domain A, have content on domain A, be fine with IA archiving it,
03:39 πŸ”— robogoat and then sell/let the domain lapse, and the subsequent owner/squatter puts up a prohibitive robots.txt
03:39 πŸ”— robogoat It's my understanding that the material is then non-accessible,
03:39 πŸ”— robogoat is it "darked"?
04:05 πŸ”— qw3rty111 has joined #archiveteam-bs
04:09 πŸ”— qw3rty119 has quit IRC (Read error: Operation timed out)
04:14 πŸ”— CoolCanuk has quit IRC (Quit: Connection closed for inactivity)
04:21 πŸ”— vantec Been out of the loop for a bit, but think this mostly still stands: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
05:28 πŸ”— bithippo has quit IRC (Read error: Connection reset by peer)
05:44 πŸ”— wacky has quit IRC (Read error: Operation timed out)
05:44 πŸ”— wacky_ has joined #archiveteam-bs
06:50 πŸ”— Somebody2 robogoat: First of all, note that I'm not employed by IA, and in fact have only visited once. This is just an outside curious onlooker's perspective.
06:51 πŸ”— Somebody2 With that out of the way -- dark'ing applies to *items* (a jargon term) on archive.org, not web pages in the Wayback Machine.
06:52 πŸ”— Somebody2 The Wayback Machine is an *interface* to a whole bunch of WARC files, stored in various items on IA.
06:53 πŸ”— Somebody2 The WARC files are what actually contain the HTML (and URLs, and dates) that are displayed through the Wayback Machine.
06:55 πŸ”— Somebody2 If an item is darked, it can't be included in the Wayback Machine (or any other interface, like the TV News viewer, or the Emularity).
06:56 πŸ”— Stiletto has quit IRC (Ping timeout: 250 seconds)
06:56 πŸ”— Somebody2 So none of the items containing WARCs that contain web pages visible through the Wayback Machine are darked (unless I'm missing something).
06:57 πŸ”— Somebody2 However -- that doesn't mean you can download the WARC files yourself, directly (with some exceptions).
06:58 πŸ”— Somebody2 Most of the items containing WARCs used in the Wayback Machine are "private", which means while you can see the file names, and sizes, and hashes, ...
06:59 πŸ”— Somebody2 ... you can't actually download the actual files without special permission (which the software that runs the Wayback Machine has).
07:00 πŸ”— Somebody2 The WARCs produced by ArchiveTeam generally are *NOT* private -- although we prefer not to talk about this too loudly, to avoid people complaining.
07:01 πŸ”— Somebody2 A recently added feature of the Wayback Machine provides links from a particular web page to the item containing the WARC it came from.
07:04 πŸ”— Somebody2 Actually, it looks like it only links to the *collection* containing the item containing the WARC, sorry.
07:06 πŸ”— Somebody2 So, now, to get back to robots.txt -- the Wayback Machine does (currently) include a feature to disable access to URLs whose most recent robots.txt file Disallows them.
07:07 πŸ”— Somebody2 The details of exactly how this operates (i.e. which User-agent names it recognizes, how it parses different Allow and Disallow lines, ...
07:07 πŸ”— Somebody2 ... what it does if there is no robots.txt file) are subtle, changing, and undocumented.
07:08 πŸ”— Somebody2 And robots.txt files do *NOT* apply to themselves, so you can always see the contents of all the robots.txt files IA has captured for a domain.
07:09 πŸ”— Somebody2 (unless there was a specific complaint sent to IA asking for the domain to be excluded, which they also honor)
07:10 πŸ”— Somebody2 But the robots.txt logic doesn't apply at ALL to the underlying items -- so *if* you can download them, you can still access the data that way.
07:10 πŸ”— Somebody2 Hopefully that answers the question.
07:10 πŸ”— Somebody2 (and sorry everyone else for the literal wall of text)
07:11 πŸ”— Somebody2 There is also a robots.txt feature included in the Save Page Now feature, but that's a separate thing.
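As an illustration of the Disallow check described above (not IA's actual implementation, which is undocumented and changing), a minimal sketch using Python's standard urllib.robotparser; the domain, paths, and agent name are hypothetical examples:

    # Minimal sketch, not IA's actual logic: apply a site's most recent
    # robots.txt Allow/Disallow rules to some captured URLs.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.parse([
        "User-agent: *",
        "Disallow: /private/",
    ])

    for url in ("http://example.com/index.html", "http://example.com/private/page.html"):
        # Which agent name a real archive checks is one of the undocumented details.
        print(url, parser.can_fetch("ia_archiver", url))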
07:22 πŸ”— bwn has quit IRC (Read error: Connection reset by peer)
07:37 πŸ”— DFJustin has quit IRC (Remote host closed the connection)
07:44 πŸ”— DFJustin has joined #archiveteam-bs
07:44 πŸ”— swebb sets mode: +o DFJustin
07:44 πŸ”— bwn has joined #archiveteam-bs
08:01 πŸ”— ZexaronS has joined #archiveteam-bs
08:03 πŸ”— mr_archiv has quit IRC (Quit: WeeChat 1.6)
08:03 πŸ”— mr_archiv has joined #archiveteam-bs
08:03 πŸ”— mr_archiv has quit IRC (Client Quit)
08:05 πŸ”— mr_archiv has joined #archiveteam-bs
08:54 πŸ”— Mateon1 has quit IRC (Read error: Connection reset by peer)
08:55 πŸ”— Mateon1 has joined #archiveteam-bs
09:11 πŸ”— MrDignity has quit IRC (Remote host closed the connection)
09:11 πŸ”— MrDignity has joined #archiveteam-bs
10:12 πŸ”— schbirid has joined #archiveteam-bs
10:17 πŸ”— nyany has quit IRC (Leaving)
10:40 πŸ”— JAA sets mode: +bb BestPrize!*@* *!pointspri@*
10:53 πŸ”— MrDignity has quit IRC (Remote host closed the connection)
10:53 πŸ”— MrDignity has joined #archiveteam-bs
12:19 πŸ”— kimmer12 has joined #archiveteam-bs
12:22 πŸ”— BlueMaxim has quit IRC (Quit: Leaving)
12:25 πŸ”— schbirid has quit IRC (Quit: Leaving)
12:26 πŸ”— kimmer1 has quit IRC (Ping timeout: 633 seconds)
12:54 πŸ”— dashcloud has quit IRC (No Ping reply in 180 seconds.)
12:54 πŸ”— dashcloud has joined #archiveteam-bs
13:00 πŸ”— Stilett0 has joined #archiveteam-bs
13:46 πŸ”— dashcloud has quit IRC (Read error: Operation timed out)
13:48 πŸ”— dashcloud has joined #archiveteam-bs
14:33 πŸ”— Specular has joined #archiveteam-bs
14:34 πŸ”— Specular is there any known way of converting Web Archive files saved from Safari to the MHT format?
14:52 πŸ”— godane has quit IRC (Quit: Leaving.)
15:09 πŸ”— kimmer1 has joined #archiveteam-bs
15:12 πŸ”— kimmer12 has quit IRC (Ping timeout: 633 seconds)
15:35 πŸ”— kimmer12 has joined #archiveteam-bs
15:36 πŸ”— kimmer12 has quit IRC (Remote host closed the connection)
15:42 πŸ”— kimmer1 has quit IRC (Ping timeout: 633 seconds)
16:06 πŸ”— Specular somehow my earlier search queries were too specific and I only just found this. Mac only but will test later. https://langui.net/webarchive-to-mht/
16:07 πŸ”— Specular oh it's commercial. Typical Mac apps, ahaha.
16:58 πŸ”— Specular has quit IRC (Quit: Leaving)
17:09 πŸ”— pizzaiolo has joined #archiveteam-bs
17:52 πŸ”— ola_norsk has joined #archiveteam-bs
17:54 πŸ”— ola_norsk how might one go about 'archiving a person' on the internet archive? I'm thinking of the youtuber Charles Green a.k.a 'Angry Grandpa'
18:04 πŸ”— kimmer1 has joined #archiveteam-bs
18:12 πŸ”— Somebody2 ola_norsk: You can't archive a person. But you could archive the work they have posted online. I'd use Archivebot and youtube-dl
18:14 πŸ”— ola_norsk Somebody2: could those various items, e.g. from the fandom wikia, twitter, youtube videos etc, later be made into a 'Collection'?
18:14 πŸ”— ola_norsk Somebody2: without having to be one item, i mean
18:15 πŸ”— Somebody2 Yes, once you upload them, send an email to info@archive suggesting they be made into a collection, and someone will likely do it eventually.
18:15 πŸ”— ola_norsk ty
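A minimal sketch of the youtube-dl route Somebody2 suggests, using the package's Python embedding API; the channel URL, output template, and options below are placeholder examples, not a recommended configuration:

    # Grab a channel's videos plus per-video metadata for later upload to IA.
    import youtube_dl  # pip install youtube-dl

    options = {
        "outtmpl": "AngryGrandpa/%(upload_date)s - %(title)s - %(id)s.%(ext)s",
        "writeinfojson": True,   # keep youtube-dl's metadata JSON next to each video
        "writethumbnail": True,
        "ignoreerrors": True,    # skip deleted/private videos instead of aborting
    }

    with youtube_dl.YoutubeDL(options) as ydl:
        ydl.download(["https://www.youtube.com/user/TheAngryGrandpaShow/videos"])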
18:16 πŸ”— godane has joined #archiveteam-bs
18:17 πŸ”— ola_norsk btw, would the items need a certain meta-tag?
18:18 πŸ”— ola_norsk other than topics, i mean
18:20 πŸ”— JAA For WARCs, you need to set the mediatype to web. Anything else is optional and can be changed post-upload (mediatype can only be set at item creation). But the more metadata, the better! :-)
18:20 πŸ”— ola_norsk okidoki
18:21 πŸ”— JAA (If you forget to set the mediatype correctly on upload, send them an email instead of trying to work around it by creating a new item or whatever; they can change it, I believe.)
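For reference, a minimal sketch of such an upload with the internetarchive Python package (which the 'ia' CLI wraps), setting mediatype at item creation; the identifier, file name, and extra metadata are hypothetical examples:

    from internetarchive import upload

    metadata = {
        "mediatype": "web",            # must be right at item creation; other fields can change later
        "title": "Example site WARC",
        "subject": ["archiveteam", "warc"],
    }

    responses = upload(
        "example-site-warc-20171216",            # item identifier (hypothetical)
        files=["example-site-20171216.warc.gz"],
        metadata=metadata,
    )
    print([r.status_code for r in responses])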
18:22 πŸ”— pizzaiolo has quit IRC (Read error: Operation timed out)
18:22 πŸ”— ola_norsk speaking of which, i messed that up on this item https://archive.org/details/vidme_AfterPrisonJoe :/
18:24 πŸ”— ola_norsk and it seems to have messed up the media format detection. I did re-download the videos locally though.
18:26 πŸ”— pizzaiolo has joined #archiveteam-bs
18:27 πŸ”— ola_norsk JAA: would it be a good idea to simply tar.gz the vidme videos, add that to the item, and then send an email asking for all of the item's content to be replaced by the contents of the tar.gz?
18:27 πŸ”— JAA I doubt it.
18:28 πŸ”— JAA Why don't you just use the 'ia' tool (Python package internetarchive)?
18:28 πŸ”— ola_norsk i do use that
18:29 πŸ”— ola_norsk but, there seems to be a bug that's preventing changing metadata.
18:30 πŸ”— JAA Hm?
18:30 πŸ”— ola_norsk JAA: https://github.com/jjjake/internetarchive/issues/228
18:31 πŸ”— ola_norsk but, maybe it's resolved already. I've not tested yet.
18:41 πŸ”— ola_norsk ouch, might i have accidentally closed the issue? Or do they timeout on github after some time :/
18:46 πŸ”— ola_norsk JAA: what version of 'ia' are you on?
18:46 πŸ”— * ola_norsk is on 1.7.4
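For what it's worth, a minimal sketch of the metadata call in question via the internetarchive package (the same API the 'ia metadata --modify' command uses); the identifier is the vidme item mentioned above, the field values are just examples, and mediatype itself still needs an email to IA as JAA noted:

    from internetarchive import get_item

    item = get_item("vidme_AfterPrisonJoe")
    response = item.modify_metadata({
        "subject": ["vidme", "video"],          # example values only
        "description": "Mirror of vid.me uploads.",
    })
    print(response.status_code, response.json())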
19:18 πŸ”— ola_norsk how do i submit 'angrygrandpa.wikia.com' to get grabbed by archivebot?
19:19 πŸ”— ola_norsk nevermind
19:38 πŸ”— dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
19:38 πŸ”— dashcloud has joined #archiveteam-bs
19:56 πŸ”— dashcloud has quit IRC (Read error: Connection reset by peer)
20:00 πŸ”— dashcloud has joined #archiveteam-bs
20:20 πŸ”— BlueMaxim has joined #archiveteam-bs
20:35 πŸ”— JAA ola_norsk: Sorry, had to leave. I've been using 1.7.1 and 1.7.3.
20:38 πŸ”— JAA ola_norsk, Somebody2: If you archive a site through ArchiveBot, you can't easily add just that site's archives to a collection afterwards though, because the archives are generally spread over multiple items and each of those items also contains loads of other jobs' archives.
20:39 πŸ”— JAA It might be better to use grab-site/wpull and upload the WARCs yourself. Then you can create clean items for each site you archive or whatever, and these can easily be added to a collection (or multiple) afterwards.
20:42 πŸ”— Somebody2 That is a good point, thank you.
20:47 πŸ”— PurpleSym What’s the #archivebot IRC logs password?
20:56 πŸ”— JAA Query
20:57 πŸ”— PurpleSym And username, JAA?
20:59 πŸ”— ola_norsk JAA: ah. i will see if i have the space for that wikia. (though, it's already submitted to archivebot as a job) :/ I will use !abort if my harddrive runs out
21:00 πŸ”— ola_norsk JAA: so there's no real way to recall a specific archivebot job/task?
21:00 πŸ”— JAA ola_norsk: You could upload the WARCs already while you're still grabbing it. That's what ArchiveBot pipelines do as well.
21:00 πŸ”— JAA What do you mean by "recall"?
21:01 πŸ”— ola_norsk JAA: to make that specific archivebot task into an item on ia
21:02 πŸ”— ola_norsk JAA: a warc item, i mean, with topics etc.
21:02 πŸ”— JAA In theory, you could download the files and reupload them to a new item. Or use 'ia copy', which does a server-side copy I believe. Whether that's a good idea though is another question entirely...
21:03 πŸ”— ola_norsk im n00b at using archivebot i'm afraid :/
21:03 πŸ”— JAA I guess someone from IA could move the files to a separate item. But again I'm not sure whether they do that.
21:03 πŸ”— JAA But yeah, that's all manual.
21:04 πŸ”— JAA Some pipelines upload files directly, and there you sort-of have one item per job (though it doesn't contain all relevant files and may sometimes contain other jobs as well).
21:04 πŸ”— JAA But other than that...
21:05 πŸ”— ola_norsk a warc item that's manually uploaded as an item at a later time though, could that use data already grabbed through archivebot? Or would it be causing duplicate-hell?
21:06 πŸ”— JAA I don't know how IA handles duplicates.
21:06 πŸ”— ola_norsk aye, me neither
21:07 πŸ”— JAA That's why I wonder if it's a good idea.
21:07 πŸ”— JAA If they deduplicate the files, then it would probably be fine.
21:07 πŸ”— JAA Maybe someone else knows more about this.
21:08 πŸ”— ola_norsk "somewhere, in the deep cellars of internet archive; There's a single gnome set to the task of checksum'ing all files and writing symlinks" lol
21:08 πŸ”— ola_norsk :D
21:12 πŸ”— Mateon1 has quit IRC (Read error: Operation timed out)
21:13 πŸ”— Mateon1 has joined #archiveteam-bs
21:15 πŸ”— Somebody2 IA does not (*YET*) deduplicate.
21:15 πŸ”— Somebody2 (AFAIK)
21:20 πŸ”— jschwart has joined #archiveteam-bs
21:49 πŸ”— dashcloud has quit IRC (Read error: Operation timed out)
21:50 πŸ”— dashcloud has joined #archiveteam-bs
21:51 πŸ”— ola_norsk i have no idea. if there's no pressing need it's ok i guess. And I'm thinking they would if need be, at least on items that haven't been altered for quite a while.
21:52 πŸ”— godane deduplication i can see being done on video and pdf items
21:53 πŸ”— godane i don't think it would work with warc files
21:53 πŸ”— ola_norsk aye
21:53 πŸ”— ola_norsk not with derived/recompressed files either i think. not unless the original was checked beforehand
21:57 πŸ”— ola_norsk godane: i guess with warc it would need checking content; and patching the stuff and/or link list in the warcs
21:58 πŸ”— godane sort of my thought
21:58 πŸ”— ola_norsk aye
21:59 πŸ”— ez unless ia unpacks all warcs on their side
21:59 πŸ”— godane i was thinking it would check important web urls that have the same checksum and make a derived warc archive to only store it once
22:00 πŸ”— ez warc is generally rather unfortunate thing to do for bulk file formats
22:00 πŸ”— ez (not sure about the wisdom of reinventing zip files, either)
22:00 πŸ”— godane so it would be a derived warc either way
22:01 πŸ”— godane something like this for warc makes more sense if we are doing my librarybox project idea
22:01 πŸ”— ola_norsk got link?
22:01 πŸ”— godane cause then people can host full archives of cbsnews.com for example without it taking 100gb
22:02 πŸ”— ez the thing is that on mass scale, dupes dont happen that often in general
22:02 πŸ”— ez so its often not worth the time bothering with it, especially for small items
22:04 πŸ”— ola_norsk but e.g for twitter, users might upload memes that are just copied from other sites. i don't know if twitter alters all images; but that could cause duplicates i think
22:04 πŸ”— ola_norsk if twitter gives each uploaded image its own file and filename, i mean
22:05 πŸ”— ez somewhat
22:06 πŸ”— ez theres been a study on this for 4chan, which granted, isnt a representative sample of twitter
22:06 πŸ”— ez but as far as actual md5 stored _live_, only 10-15% were dupes
22:06 πŸ”— ola_norsk ok
22:06 πŸ”— ez however over time, there were indeed >50% dupes in certain time periods
22:06 πŸ”— ez exactly as you say, some image got really popular and got reposted over and over
22:08 πŸ”— pizzaiolo has quit IRC (Read error: Operation timed out)
22:08 πŸ”— ola_norsk in any case, it's something that's fixable in the future though i would guess. E.g. picking through all the shit and finding e.g the most recurring, or highest-quality image; either by md5 or image regocnitions
22:09 πŸ”— ola_norsk image recognition*
22:09 πŸ”— pizzaiolo has joined #archiveteam-bs
22:10 πŸ”— ez well, this isnt that exact study i've seen, but mentions the distribution of dupes per post https://i.imgur.com/S9pxJqV.png
22:11 πŸ”— ez ola_norsk: it could be done of course. note that if you were to do what you say, you'd also build index for reverse image search
22:11 πŸ”— ez which would be a really handy thing for IA to have
22:11 πŸ”— ez needless to say, it depends if IA wants to diversify as a search engine
22:12 πŸ”— ola_norsk well, the meta.xml is there, containing the md5's i think :D
22:12 πŸ”— ez md5s useless for search
22:12 πŸ”— ola_norsk aye, but to find duplicates i mean
22:12 πŸ”— ez on md5 level, the dupes dont happen often enough given a random kitchen sink of files
22:13 πŸ”— ez it definitely makes sense for certain datasets
22:13 πŸ”— pizzaiolo has quit IRC (Client Quit)
22:13 πŸ”— ez like those ad laden pirate rapidshare upload servers, they sure do md5. they have this huge sample of highly specific content, and the files often are same copies.
22:13 πŸ”— ez megaupload actually got nailed for this legally
22:13 πŸ”— pizzaiolo has joined #archiveteam-bs
22:13 πŸ”— ez they dmcad the link, but not the md5
22:14 πŸ”— ez deduping 10k 1MB image files and deduping 10k 1GB files is what makes the difference
22:15 πŸ”— ez as you get random mix of both, its obvious which part of the set to focus on
22:15 πŸ”— ola_norsk it would be slow work i guess
22:15 πŸ”— ez depends on how the system works, really
22:16 πŸ”— ez most data stores working on per-file basis often compute hash on streaming upload, and make a symlink when hash found at that time, too
22:16 πŸ”— ez but more general setups often dont have the luxury of having a hook fire per each new file
22:16 πŸ”— ola_norsk what if there was a distributed tool for the Warrior, that picked through the IA items xml and looked for duplicate md5's?
22:17 πŸ”— ola_norsk (of only original files that is)
22:18 πŸ”— ez its possible to fetch the hashes from ia directly via api
22:18 πŸ”— JAA joepie91: FYI, I'm porting your anti-CloudFlare code to Python.
22:19 πŸ”— ez not everything is available tho. as long as it has xml sha1, you can api fetch it
22:19 πŸ”— ez build an offline database too, etc
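A minimal sketch of pulling per-file md5 hashes from IA's public metadata endpoint (https://archive.org/metadata/<identifier>), which is one way to build such an offline hash set; the identifier reused here is the vidme item mentioned earlier, and any other identifier works the same way:

    import requests

    identifier = "vidme_AfterPrisonJoe"
    data = requests.get("https://archive.org/metadata/%s" % identifier, timeout=30).json()

    for f in data.get("files", []):
        # 'source' is 'original' for uploaded files, 'derivative' for files IA generated
        if f.get("source") == "original" and "md5" in f:
            print(f["md5"], f["name"])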
22:20 πŸ”— ola_norsk ez: it's quite a number of items on ia though. But if it was a delegated slow and steady task
22:21 πŸ”— ola_norsk ez: like a background task or something
22:21 πŸ”— ez no i mean you can have the hashes offline
22:21 πŸ”— ez as comparably small structure
22:21 πŸ”— ez unfortunately its really awkward to get it at this moment
22:22 πŸ”— ola_norsk ez: i mean making a full list of duplicate files. where e.g the 'parent item' is by first date
22:22 πŸ”— ez ola_norsk: a bloom filter with reasonable collision rate is like 10 bits per item
22:23 πŸ”— ez regardless of number of items
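A minimal sketch of the bloom-filter idea ez describes: a fixed bit array that answers "definitely not seen" or "probably seen", sized at roughly 10 bits per expected hash (with about 7 hash functions that lands near a 1% false-positive rate); the sizes and md5 values below are illustrative only:

    import hashlib

    class BloomFilter:
        def __init__(self, n_items, bits_per_item=10, n_hashes=7):
            self.size = n_items * bits_per_item
            self.n_hashes = n_hashes
            self.bits = bytearray((self.size + 7) // 8)

        def _positions(self, key):
            # derive n_hashes positions from salted SHA-1 digests of the key
            for i in range(self.n_hashes):
                digest = hashlib.sha1(("%d:%s" % (i, key)).encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    bf = BloomFilter(n_items=1_000_000)
    bf.add("d41d8cd98f00b204e9800998ecf8427e")          # an md5 already seen
    print("d41d8cd98f00b204e9800998ecf8427e" in bf)     # True (probably seen)
    print("ffffffffffffffffffffffffffffffff" in bf)     # almost certainly False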
22:24 πŸ”— ez not sure if ia supports search by hash
22:25 πŸ”— ez last time i checked the api (some years ago) it didnt
22:25 πŸ”— ola_norsk the md5 hashes are in the xml of (each?) item
22:25 πŸ”— ez if it still doesnt, you'd need to store hash, as well as xml id to locate its context as you say
22:25 πŸ”— ez which would bloat the database a great deal
22:26 πŸ”— ez ola_norsk: yea
22:26 πŸ”— ez the idea is that i run a scan over my filesystem and compare every file to a filter i scraped from the ia api
22:26 πŸ”— ola_norsk it's not going anywhere though. So it could basically be done slow as shit don't you think?
22:26 πŸ”— ez and upload only files which dont match. this is because querying IA with 500M+ files is not realistic
22:27 πŸ”— ez so bloom filter would work fine for uploads of random crap and making more or less sure its not a dupe
22:28 πŸ”— ez but it wont tell you *where* your files are on ia, unless you abuse the api with fulltext search and what not
22:28 πŸ”— ola_norsk i have no idea man :)
22:29 πŸ”— ola_norsk does the logs say?
22:29 πŸ”— ola_norsk the history of items?
22:30 πŸ”— * ola_norsk 's brain is broken and beered :/
22:31 πŸ”— ola_norsk i'm guessing IA would put some gnome to work the day their harddrive is full :D
22:32 πŸ”— ola_norsk (which i'm guessing is not tomorrow lol) :D
22:32 πŸ”— ez just restrict some classes of uploads when space starts running short
22:32 πŸ”— ez but yea, space can be done on cheap if you have the scale
22:33 πŸ”— ola_norsk doh, restriction is bad :/
22:33 πŸ”— ola_norsk might they as well check if that file already exist?
22:34 πŸ”— ez as i said, at those scales, the content is so diverse it happens rather infrequently, especially if your files are comparably small (ie a lot of small items of diverse content)
22:34 πŸ”— ola_norsk ...maybe they already do.. :0
22:34 πŸ”— odemg has quit IRC (Read error: Connection reset by peer)
22:35 πŸ”— ez its easy to do for single files, but not quite sure about warc
22:35 πŸ”— ola_norsk aye, it would need unpacking and stuff
22:36 πŸ”— ola_norsk and e.g youtube videos that are mkv combined would need to be split into audio and video i guess, then compared
22:37 πŸ”— ez the thing is i've seen deduping rather infrequently in large setups like this - the restrictions on flexibility of what you can do (you now need some sort of fast hash index to check against, you need some symlink mechanism now)
22:37 πŸ”— ez youtube doesnt dedupe im pretty sure
22:37 πŸ”— ez not by the source video anyway
22:38 πŸ”— ez since 99% of content they get is original uploads. most dupes they'd otherwise get usually get struck by contentid
22:38 πŸ”— ez 1% is the long tail of short meme videos reposted over and over and what not, but its just a tiny part of the long tail
22:39 πŸ”— ola_norsk i sometimes upload by getting videos with youtube-dl, and it often combines 2 files, audio and video, into a single file..would that make a different md5 sum?
22:40 πŸ”— ola_norsk (without repacking/recoding, i mean)
22:40 πŸ”— ez (its easy to test - each reupload yields new reencode on yt, and the encode is even slightly different as it contains timestamp in mp4 header)
22:41 πŸ”— ez ola_norsk: yes, highest quality is available only via HLS
22:42 πŸ”— ez curiously, its an artificial restriction, as other parts of google infra which use yt infra can get mp4 1080/4k just fine
22:42 πŸ”— ez the restriction is specific to yt, i suppose in a bid to frustrate trivial ripping attempts via browser extensions
22:44 πŸ”— ola_norsk but, i think what i mean is; If i run 'youtube-dl' on the same youtube video twice..Then e.g the audio and video (often webm and mp4), before they are merged into MKV file, would be the very same files each time? or?
22:44 πŸ”— ez ola_norsk: on and off its possible to abuse youtube apis to get the actual original mp4, but it changes 2 times now (works for google+, only for your own videos when logged in)
22:44 πŸ”— ez *has changed
22:44 πŸ”— ez so definitely not something to rely on
22:45 πŸ”— ez but if i were as crazy to archive yt, i'd definitely try to rip the original files, not the re-encodes
22:45 πŸ”— odemg has joined #archiveteam-bs
22:45 πŸ”— ola_norsk no hehe :D i'm just talking example as to how to detect duplicate videos :D
22:45 πŸ”— ola_norsk or duplicate uploads in general
22:46 πŸ”— ez ola_norsk: depends what you command ytdl to do
22:46 πŸ”— ez generally if you ask it same file, same format, you overwhelmingly get the exact same file
22:46 πŸ”— ez but google re-encodes those from time to time
22:48 πŸ”— ola_norsk aye, most often it just combines audio and video, 2 different files, into a KVM file. And, i'm thinking if the kvm file were split again, into those two a/v files, the md5 would be the same in two instances where youtube-dl was used to download the same video.
22:49 πŸ”— ola_norsk and, by that, duplicate video uploads could be detected
22:50 πŸ”— ola_norsk the generated merged file (kvm etc) would be different, but the two files being merged would be the same, since there's no re-encoding occurring
22:51 πŸ”— ez hum, that sounds elaborate?
22:51 πŸ”— ola_norsk aye, i have a headache :D
22:51 πŸ”— ez why not just tell ytdl to rip everything to mkv from the get-go?
22:54 πŸ”— ola_norsk i just meant in relation to detecting duplicates in IA items (where e.g the same youtube video is uploaded twice)
22:55 πŸ”— ola_norsk where md5sum of the two kvm files is not an option
22:55 πŸ”— ola_norsk since each download would cause two different kvm files to be made locally by the downloaders
22:57 πŸ”— ola_norsk if, however, it's possible to split a kvm, into the audio and video files contained..i'm thinking those two would yield identical md5
23:02 πŸ”— * ola_norsk ran out of duplication-detection smartness :/
23:05 πŸ”— ez i have no idea what kvm file is
23:10 πŸ”— nyany has joined #archiveteam-bs
23:16 πŸ”— ola_norsk ez: usually when i use youtube-dl it downloads audio and video separately, then combines the two into a kvm file
23:16 πŸ”— ez whats an kvm file?
23:16 πŸ”— ez oh
23:16 πŸ”— ez mkv
23:17 πŸ”— ola_norsk ah, sorry, yes
23:17 πŸ”— ola_norsk mkv
23:18 πŸ”— ez yea its a bit annoying, mostly because the ffmpeg transcoder is non-deterministic
23:18 πŸ”— ez i think it puts timestamp in the mkv header or something silly like that
23:18 πŸ”— ez s/transcoder/muxer/
23:19 πŸ”— ola_norsk does it alter the two audio and video files though? or can it be split from the mkv?
23:21 πŸ”— ola_norsk either way, if that is possible; then detecting duplicate mkv files uploaded to ia is possible
23:21 πŸ”— ola_norsk even if the md5 sum differs between two mkv files containing the same content
23:22 πŸ”— ez the raw bitstream is kept as-is
23:23 πŸ”— ola_norsk ok
23:23 πŸ”— ez meaning the mux is as "original" as sent in the google HLS track
23:23 πŸ”— ez but the container metadata are often unstable on account of different versions of software muxing slightly differently (think seek metadata and such)
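One way to act on that, as a sketch assuming ffmpeg is installed and that its md5 muxer hashes the stream-copied packets: hash the raw audio and video bitstreams with -c copy, so two mkv files muxed at different times can be compared without the unstable container metadata getting in the way; the file names are hypothetical:

    import subprocess

    def stream_md5(path, stream_spec):
        """MD5 of one stream's compressed packets, via ffmpeg's md5 muxer."""
        result = subprocess.run(
            ["ffmpeg", "-v", "error", "-i", path,
             "-map", "0:" + stream_spec, "-c", "copy", "-f", "md5", "-"],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip().split("=", 1)[1]   # output looks like "MD5=..."

    for f in ("download-a.mkv", "download-b.mkv"):
        print(f, "video:", stream_md5(f, "v:0"), "audio:", stream_md5(f, "a:0"))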
23:27 πŸ”— ola_norsk if the framecount doesn't differ though, and neither do the frames, that could be a further step?
23:28 πŸ”— ola_norsk ez: the extent of my knowledge is rather spent when it comes to codecs :D
23:28 πŸ”— ez tl;dr is that you cant rely on what ytdl gives you as muxed output
23:29 πŸ”— ez perhaps if you just ask for the 720p mp4, as that one isnt remuxed (yet?)
23:30 πŸ”— ola_norsk i'm thinking some kind of image/frame similarity detection then i guess
23:30 πŸ”— ola_norsk rather*
23:32 πŸ”— jschwart has quit IRC (Quit: Konversation terminated!)
23:32 πŸ”— ola_norsk the md5 is in the item xml though, so i guess that is where one would have to start to find duplicates on ia
23:39 πŸ”— ola_norsk ez: it is as you say elaborate. So i'm glad i don't have to do it :D
23:40 πŸ”— ola_norsk ez: (and so should everyone else be, of me not doing it lol) ;)
23:41 πŸ”— BlueMaxim has quit IRC (Quit: Leaving)
