#archiveteam-bs 2017-12-16,Sat


Who | What | When
***JAA sets mode: +bb Ya_ALLAH!*@* *!*@185.143.4* [01:03]
......... (idle for 42mn)
CoolCanuk has joined #archiveteam-bs [01:45]
.... (idle for 16mn)
DopefishJ is now known as DFJustin [02:01]
odemg has quit IRC (Ping timeout: 250 seconds)
odemg has joined #archiveteam-bs
[02:10]
pizzaiolo has quit IRC (pizzaiolo) [02:21]
......... (idle for 44mn)
ZexaronS has quit IRC (Read error: Operation timed out)
Stilett0 has quit IRC (Read error: Operation timed out)
[03:05]
ndiddy has joined #archiveteam-bs [03:16]
Stilett0 has joined #archiveteam-bs
Stilett0 is now known as Stiletto
[03:21]
.... (idle for 17mn)
robogoatSomebody2: Ok, can you explain something, regarding darking and crawl history?
It is possible for someone to own domain A, have content on domain A, be fine with IA archiving it,
and then sell/let the domain lapse, and the subsequent owner/squatter puts up a prohibitive robots.txt
It's my understanding that the material is then non-accessible,
is it "darked"?
[03:38]
...... (idle for 26mn)
***qw3rty111 has joined #archiveteam-bs
qw3rty119 has quit IRC (Read error: Operation timed out)
[04:05]
CoolCanuk has quit IRC (Quit: Connection closed for inactivity) [04:14]
vantecBeen out of the loop for a bit, but think this mostly still stands: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ [04:21]
.............. (idle for 1h7mn)
***bithippo has quit IRC (Read error: Connection reset by peer) [05:28]
.... (idle for 16mn)
wacky has quit IRC (Read error: Operation timed out)
wacky_ has joined #archiveteam-bs
[05:44]
.............. (idle for 1h6mn)
Somebody2robogoat: First of all, note that I'm not employed by IA, and in fact have only visited once. This is just an outside, curious onlooker's view.
With that out of the way -- dark'ing applies to *items* (a jargon term) on archive.org, not web pages in the Wayback Machine.
The Wayback Machine is an *interface* to a whole bunch of WARC files, stored in various items on IA.
The WARC files are what actually contain the HTML (and URLs, and dates) that are displayed through the Wayback Machine.
If an item is darked, it can't be included in the Wayback Machine (or any other interface, like the TV News viewer, or the Emularity).
[06:50]
***Stiletto has quit IRC (Ping timeout: 250 seconds) [06:56]
Somebody2So none of the items containing WARCs that contain web pages visible through the Wayback Machine are darked (unless I'm missing something).
However -- that doesn't mean you can download the WARC files yourself, directly (with some exceptions).
Most of the items containing WARCs used in the Wayback Machine are "private", which means while you can see the file names, and sizes, and hashes, ...
... you can't actually download the actual files without special permission (which the software that runs the Wayback Machine has).
The WARCs produced by ArchiveTeam generally are *NOT* private -- although we prefer not to talk about this too loudly, to avoid people complaining.
A recently added feature of the Wayback Machine provides links from a particular web page to the item containing the WARC it came from.
Actually, it looks like it only links to the *collection* containing the item containing the WARC, sorry.
So, now, to get back to robots.txt -- the Wayback Machine does (currently) include a feature to disable access to URLs whose most recent robots.txt file Disallows them.
The details of exactly how this operates (i.e. which Agent names does it recognize, how does it parse different Allow and Disallow lines, ...
... what does it do if there is no robots.txt file) are subtle, changing, and undocumented.
And robots.txt files do *NOT* apply to themselves, so you can always see the contents of all the robots.txt files IA has captured for a domain.
(unless there was a specific complaint sent to IA asking for the domain to be excluded, which they also honor)
But the robots.txt logic doesn't apply at ALL to the underlying items -- so *if* you can download them, you can still access the data that way.
Hopefully that answers the question.
(and sorry everyone else for the literal wall of text)
There is also a robots.txt feature included in the Save Page Now feature, but that's a separate thing.
[06:56]
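Somebody2's point that robots.txt captures themselves stay visible is easy to check against the public Wayback CDX API. A minimal sketch (not from the chat; example.com is a stand-in domain) that lists the captures IA holds of a domain's robots.txt:

```python
# Minimal sketch: list Wayback Machine captures of a domain's robots.txt via
# the public CDX API. "example.com" is a placeholder domain.
import json
import urllib.request

CDX = "https://web.archive.org/cdx/search/cdx"
query = "?url=example.com/robots.txt&output=json&fl=timestamp,statuscode,digest&limit=10"

with urllib.request.urlopen(CDX + query) as resp:
    rows = json.load(resp)

# With output=json the first row is the field-name header, the rest are captures.
for timestamp, status, digest in rows[1:]:
    print(timestamp, status, digest)
```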
***bwn has quit IRC (Read error: Connection reset by peer) [07:22]
.... (idle for 15mn)
DFJustin has quit IRC (Remote host closed the connection) [07:37]
DFJustin has joined #archiveteam-bs
swebb sets mode: +o DFJustin
bwn has joined #archiveteam-bs
[07:44]
.... (idle for 17mn)
ZexaronS has joined #archiveteam-bs
mr_archiv has quit IRC (Quit: WeeChat 1.6)
mr_archiv has joined #archiveteam-bs
mr_archiv has quit IRC (Client Quit)
mr_archiv has joined #archiveteam-bs
[08:01]
.......... (idle for 49mn)
Mateon1 has quit IRC (Read error: Connection reset by peer)
Mateon1 has joined #archiveteam-bs
[08:54]
.... (idle for 16mn)
MrDignity has quit IRC (Remote host closed the connection)
MrDignity has joined #archiveteam-bs
[09:11]
............. (idle for 1h1mn)
schbirid has joined #archiveteam-bs [10:12]
nyany has quit IRC (Leaving) [10:17]
..... (idle for 23mn)
JAA sets mode: +bb BestPrize!*@* *!pointspri@* [10:40]
MrDignity has quit IRC (Remote host closed the connection)
MrDignity has joined #archiveteam-bs
[10:53]
.................. (idle for 1h26mn)
kimmer12 has joined #archiveteam-bs
BlueMaxim has quit IRC (Quit: Leaving)
schbirid has quit IRC (Quit: Leaving)
kimmer1 has quit IRC (Ping timeout: 633 seconds)
[12:19]
...... (idle for 28mn)
dashcloud has quit IRC (No Ping reply in 180 seconds.)
dashcloud has joined #archiveteam-bs
[12:54]
Stilett0 has joined #archiveteam-bs [13:00]
.......... (idle for 46mn)
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[13:46]
.......... (idle for 45mn)
Specular has joined #archiveteam-bs [14:33]
Specularis there any known way of converting Web Archive files saved from Safari to the MHT format? [14:34]
.... (idle for 18mn)
***godane has quit IRC (Quit: Leaving.) [14:52]
.... (idle for 17mn)
kimmer1 has joined #archiveteam-bs
kimmer12 has quit IRC (Ping timeout: 633 seconds)
[15:09]
..... (idle for 23mn)
kimmer12 has joined #archiveteam-bs
kimmer12 has quit IRC (Remote host closed the connection)
[15:35]
kimmer1 has quit IRC (Ping timeout: 633 seconds) [15:42]
..... (idle for 24mn)
Specularsomehow my earlier search queries were too specific, and I just found this. Mac only but will test later. https://langui.net/webarchive-to-mht/
oh it's commercial. Typical Mac apps, ahaha.
[16:06]
........... (idle for 51mn)
***Specular has quit IRC (Quit: Leaving) [16:58]
pizzaiolo has joined #archiveteam-bs [17:09]
......... (idle for 43mn)
ola_norsk has joined #archiveteam-bs [17:52]
ola_norskhow might one go about 'archiving a person' on the internet archive? I'm thinking of the youtuber Charles Green, a.k.a. 'Angry Grandpa' [17:54]
***kimmer1 has joined #archiveteam-bs [18:04]
Somebody2ola_norsk: You can't archive a person. But you could archive the work they have posted online. I'd use Archivebot and youtube-dl [18:12]
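As a rough illustration of the youtube-dl half of that suggestion, here is a hedged sketch using the youtube_dl Python package; the channel URL, output template and option choices are placeholders, not anything settled in the channel.

```python
# Hedged sketch of archiving a channel's uploads with youtube-dl; the URL and
# output template below are placeholders.
import youtube_dl  # pip install youtube-dl

opts = {
    "outtmpl": "%(uploader)s/%(upload_date)s - %(title)s - %(id)s.%(ext)s",
    "writeinfojson": True,                 # keep per-video metadata alongside the media
    "writethumbnail": True,
    "download_archive": "downloaded.txt",  # lets interrupted runs resume without re-downloading
    "ignoreerrors": True,                  # skip removed/private videos instead of aborting
}

with youtube_dl.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/user/EXAMPLE_CHANNEL"])
```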
ola_norskSomebody2: could those various items, from e.g the fandom wikia, twitter, to youtube videos etc; later be made into e.g a 'Collection' ?
Somebody2: without having to be one item, i mean
[18:14]
Somebody2Yes, once you upload them, send an email to info@archive suggesting they be made into a collection, and someone will likely do it eventually. [18:15]
ola_norskty [18:15]
***godane has joined #archiveteam-bs [18:16]
ola_norskbtw, would the items need a certain meta-tag?
other than topics, i mean
[18:17]
JAAFor WARCs, you need to set the mediatype to web. Anything else is optional and can be changed post-upload (mediatype can only be set at item creation). But the more metadata, the better! :-) [18:20]
ola_norskokidoki [18:20]
JAA(If you forget to set the mediatype correctly on upload, send them an email instead of trying to work around it by creating a new item or whatever; they can change it, I believe.) [18:21]
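A minimal sketch of that upload flow with the internetarchive Python package (the 'ia' tool discussed below); only the mediatype rule comes from JAA's advice, while the identifier, filename and descriptive metadata are invented, and configured credentials (`ia configure`) are assumed.

```python
# Minimal sketch: upload a WARC with mediatype set correctly at item creation.
# Identifier, filename and descriptive fields are placeholders; assumes
# credentials have been configured with `ia configure`.
from internetarchive import upload

metadata = {
    "mediatype": "web",                   # must be right at item creation time
    "title": "Example WARC upload",
    "subject": ["archiveteam", "warc"],   # topics; these can be edited later
}

responses = upload(
    "example-warc-item-20171216",         # hypothetical item identifier
    files=["example.warc.gz"],
    metadata=metadata,
)
print([r.status_code for r in responses])
```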
***pizzaiolo has quit IRC (Read error: Operation timed out) [18:22]
ola_norskspeaking of which, i messed that up on this item https://archive.org/details/vidme_AfterPrisonJoe :/
and it seems to have messed up the media format detection. I did re-download the videos locally though.
[18:22]
***pizzaiolo has joined #archiveteam-bs [18:26]
ola_norskJAA: would it be a good idea to simply tar.gz the vidme videos, add that to the item; and then send an email asking for all the item's content to be replaced by the content of the tar.gz [18:27]
JAAI doubt it.
Why don't you just use the 'ia' tool (Python package internetarchive)?
[18:27]
ola_norski do use that
but, there seems to be a bug that's preventing changing metadata.
[18:28]
JAAHm? [18:30]
ola_norskJAA: https://github.com/jjjake/internetarchive/issues/228
but, maybe it's resolved already. I've not tested yet.
[18:30]
ouch, might i have accidentally closed the issue? Or do they timeout on github after some time :/ [18:41]
JAA: what version of 'ia' are you on?
ola_norsk is on 1.7.4
[18:46]
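For completeness, a post-upload metadata edit with the same package looks roughly like this; a sketch assuming configured credentials, with placeholder values (and older versions may be affected by the issue #228 discussed above).

```python
# Sketch of editing an existing item's metadata with the internetarchive
# package; assumes credentials configured via `ia configure`, and the title
# and subject values are placeholders.
from internetarchive import get_item

item = get_item("vidme_AfterPrisonJoe")
resp = item.modify_metadata({
    "title": "AfterPrisonJoe (vid.me mirror)",
    "subject": ["vidme", "video"],
})
print(resp.status_code)  # 200 means the metadata write was accepted
```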
....... (idle for 32mn)
how do i submit 'angrygrandpa.wikia.com' to get grabbed by archivebot?
nevermind
[19:18]
.... (idle for 19mn)
***dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
dashcloud has joined #archiveteam-bs
[19:38]
.... (idle for 18mn)
dashcloud has quit IRC (Read error: Connection reset by peer)
dashcloud has joined #archiveteam-bs
[19:56]
..... (idle for 20mn)
BlueMaxim has joined #archiveteam-bs [20:20]
.... (idle for 15mn)
JAAola_norsk: Sorry, had to leave. I've been using 1.7.1 and 1.7.3.
ola_norsk, Somebody2: If you archive a site through ArchiveBot, you can't easily add just that site's archives to a collection afterwards though, because the archives are generally spread over multiple items and each of those items also contains loads of other jobs' archives.
It might be better to use grab-site/wpull and upload the WARCs yourself. Then you can create clean items for each site you archive or whatever, and these can easily be added to a collection (or multiple) afterwards.
[20:35]
Somebody2That is a good point, thank you. [20:42]
PurpleSymWhat’s the #archivebot IRC logs password? [20:47]
JAAQuery [20:56]
PurpleSymAnd username, JAA? [20:57]
ola_norskJAA: ah. i will see if i have the space for that wikia. (though, it's already submitted to archivebot as a job) :/ I will use !abort if my harddrive runs out
JAA: so there's no real way to recall a specific archivebot job/task?
[20:59]
JAAola_norsk: You could upload the WARCs already while you're still grabbing it. That's what ArchiveBot pipelines do as well.
What do you mean by "recall"?
[21:00]
ola_norskJAA: to make that specific archivebot task into an item on ia
JAA: a warc item, i mean, with topics etc.
[21:01]
JAAIn theory, you could download the files and reupload them to a new item. Or use 'ia copy', which does a server-side copy I believe. Whether that's a good idea though is another question entirely... [21:02]
ola_norskim n00b at using archivebot i'm afraid :/ [21:03]
JAAI guess someone from IA could move the files to a separate item. But again I'm not sure whether they do that.
But yeah, that's all manual.
Some pipelines upload files directly, and there you sort-of have one item per job (though it doesn't contain all relevant files and may sometimes contain other jobs as well).
But other than that...
[21:03]
ola_norska warc item that's manually uploaded as an item at a later time though, could that use data already grabbed through archivebot? Or would it be causing duplicate-hell? [21:05]
JAAI don't know how IA handles duplicates. [21:06]
ola_norskaye, me neither [21:06]
JAAThat's why I wonder if it's a good idea.
If they deduplicate the files, then it would probably be fine.
Maybe someone else knows more about this.
[21:07]
ola_norsk"somewhere, in the deep cellars of internet archive; There's a single gnome set to the task of checksum'ing all files and writing symlinks" lol
:D
[21:08]
***Mateon1 has quit IRC (Read error: Operation timed out)
Mateon1 has joined #archiveteam-bs
[21:12]
Somebody2IA does not (*YET*) deduplicate.
(AFAIK)
[21:15]
***jschwart has joined #archiveteam-bs [21:20]
...... (idle for 29mn)
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[21:49]
ola_norski have no idea. if there's no pressing need, it's ok i guess. And I'm thinking they would if need be, at least on items that haven't been altered for quite a while. [21:51]
godanededuplication i can see being done on video and pdf items
i don't think it would work with warc files
[21:52]
ola_norskaye
not with derived/recompressed files either i think. not unless the original was checked beforehand
godane: i guess with warc it would need checking content; and patching the stuff and/or link list in the warcs
[21:53]
godanesort of my thought [21:58]
ola_norskaye [21:58]
ezunless ia unpacks all warcs on their side [21:59]
godanei was thinking it would check important web urls that have the same checksum and make a derived warc archive to only store it once [21:59]
ezwarc is generally rather unfortunate thing to do for bulk file formats
(not sure about the wisdom of reinventing zip files, either)
[22:00]
godaneso it would be derive warc either way
something like this for warc makes more sense if we are doing my librarybox project idea
[22:00]
ola_norskgot link? [22:01]
godanecause then people can host full archives of cbsnews.com for example without it taking 100gb [22:01]
ezthe thing is that on mass scale, dupes dont happen that often in general
so its often not worth the time bothering with it, especially for small items
[22:02]
ola_norskbut e.g for twitter, users might upload memes that are just copied from other sites. i don't know if twitter alters all images; but that could cause duplicates i think
if, twitter gives each uploaded image it's own file and filename, i mean
[22:04]
ezsomewhat
theres been study for this for 4chan, which granted, isnt representative sample of twitter
but as far as actual md5 stored _live_, only 10-15% were dupes
[22:05]
ola_norskok [22:06]
ezhowever over time, there were indeed >50% dupes in certain time periods
exactly as you say, some image got really popular and got reposted over and over
[22:06]
***pizzaiolo has quit IRC (Read error: Operation timed out) [22:08]
ola_norskin any case, it's something that's fixable in the future though i would guess. E.g picking through all the shit and finding e.g the most recurring, or high-quality image; either by md5 or image regocnitions
image recognition*
[22:08]
***pizzaiolo has joined #archiveteam-bs [22:09]
ezwell, this isnt that exact study i've seen, but mentions the distribution of dupes per post https://i.imgur.com/S9pxJqV.png
ola_norsk: it could be done of course. note that if you were to do what you say, you'd also build index for reverse image search
which would be a really handy thing for IA to have
needless to say, it depends if IA wants to diversify as a search engine
[22:10]
ola_norskwell, the meta.xml is there, containing the md5's i think :D [22:12]
ezmd5s useless for search [22:12]
ola_norskaye, but to find duplicates i mean [22:12]
ezon md5 level, the dupes dont happen often enough given a random kitchen sink of files
it definitely makes sense for certain datasets
[22:12]
***pizzaiolo has quit IRC (Client Quit) [22:13]
ezlike those ad laden pirate rapidshare upload servers, they sure do md5. they have this huge sample of highly specific content, and the files often are same copies.
megaupload actually got nailed for this legally
[22:13]
***pizzaiolo has joined #archiveteam-bs [22:13]
ezthey dmcad the link, but not the md5
deduping 10k 1MB image files and deduping 10k 1GB files is what makes the difference
as you get random mix of both, its obvious which part of the set to focus on
[22:13]
ola_norskit would be slow work i guess [22:15]
ezdepends on how the system works, really
most data stores working on per-file basis often compute hash on streaming upload, and make a symlink when hash found at that time, too
but more general setups often dont have the luxury of having a hook fire per each new file
[22:15]
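A toy sketch of the per-file scheme ez describes (hash on ingest, one stored copy per hash, symlinks for repeats); the on-disk layout and paths are invented for illustration.

```python
# Toy content-addressed dedup: hash the incoming stream, keep one blob per
# hash, and symlink repeated uploads to the existing blob. Layout is invented.
import hashlib
import os
import shutil

STORE = "store"               # content-addressed blobs live here, named by hash
os.makedirs(STORE, exist_ok=True)

def ingest(src_path: str, dest_path: str) -> str:
    h = hashlib.sha1()
    with open(src_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)                    # hash while "streaming" the upload in
    blob = os.path.join(STORE, h.hexdigest())
    if not os.path.exists(blob):               # first time this content is seen
        shutil.copyfile(src_path, blob)
    if os.path.lexists(dest_path):
        os.remove(dest_path)
    os.symlink(os.path.abspath(blob), dest_path)  # repeats become symlinks
    return h.hexdigest()
```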
ola_norskwhat if there was a distributed tool for the Warrior, that picked through the IA items' xml and looked for duplicate md5's ?
(of only original files that is)
[22:16]
ezits possible to fetch the hashes from ia directly via api [22:18]
JAAjoepie91: FYI, I'm porting your anti-CloudFlare code to Python. [22:18]
eznot everything is available tho. as long it has xml sha1, you can api fetch it
build an offline database too, etc
[22:19]
ola_norskez: it's quite a number of items on ia though. But if it was a delegated slow and steady task
ez: like a background task or something
[22:20]
ezno i mean you can have the hashes offline
as comparably small structure
unfortunately its really awkward to get it at this moment
[22:21]
ola_norskez: i mean making a full list of duplicate files. where e.g the 'parent item' is by first date [22:22]
ezola_norsk: a bloom filter with reasonable collision rate is like 10 bits per item
regardless of number of items
not sure if ia supports search by hash
last time i checked the api (some years ago) it didnt
[22:22]
ola_norskthe md5 hashes are in the xml of (each?) item [22:25]
ezif it still doesnt, you'd need to store hash, as well as xml id to locate its context as you say
which would bloat the database a great deal
ola_norsk: yea
the idea is that i run a scan over my filesystem and compare every file to a filter i scraped from the ia api
[22:25]
ola_norskit's not going anywhere though. So it could basically be done slow as shit don't you think? [22:26]
ezand upload only files which dont match. this is because querying IA with 500M+ files is not realistic
so bloom filter would work fine for uploads of random crap and making more or less sure its not a dupe
but it wont tell you *where* your files are on ia, unless you abuse the api with fulltext search and what not
[22:26]
ola_norski have no idea man :)
does the logs say?
the history of items?
ola_norsk 's brain is broken and beered :/
i'm guessing IA would put some gnome to work the day their harddrive is full :D
(which i'm guessing is not tomorrow lol) :D
[22:28]
ezjust restrict some classes of uploads when space starts running short
but yea, space can be done on cheap if you have the scale
[22:32]
ola_norskdoh, restriction is bad :/
might they as well check if that file already exist?
[22:33]
ezas i said, at those scales, the content is so diverse it happens rather infrequently, especially if your files are comparably small (ie a lot of small items of diverse content) [22:34]
ola_norsk...maybe they already do.. :0 [22:34]
***odemg has quit IRC (Read error: Connection reset by peer) [22:34]
ezits easy to do for single files, but not quite sure about warc [22:35]
ola_norskaye, it would need unpacking and stuff
and e.g youtube videos that are mkv-combined would need to be split into audio and video i guess, then compared
[22:35]
ezthe thing is i've seen deduping rather infrequently in large setups like this - the restrictions on flexibility of what you can do (you now need some sort of fast hash index to check against, you need some symlink mechanism now)
youtube doesnt dedupe im pretty sure
not by the source video anyway
since 99% of content they get is original uploads. most dupes they'd otherwise get usually gets struck by contentid
1% is the long trail of short meme videos reposted over and over and what not, but its just a tiny part of long trail
[22:37]
ola_norski sometimes upload by getting videos with youtube-dl, and it seems that often combines 2 files, audio and video, into a single file.. would that make a different md5 sum?
(without repacking/recoding, i mean)
[22:39]
ez(its easy to test - each reupload yields new reencode on yt, and the encode is even slightly different as it contains timestamp in mp4 header)
ola_norsk: yes, highest quality is available only via HLS
curiously, its an artificial restriction, as other parts of google infra which use yt infra get mp4 1080/4k just fine
the restriction is specific to yt, i suppose in a bid to frustrate trivial ripping attempts via browser extensions
[22:40]
ola_norskbut, i think what i mean is; If i run 'youtube-dl' on the same youtube video twice..Then e.g the audio and video (often webm and mp4), before they are merged into MKV file, would be the very same files each time? or? [22:44]
ezola_norsk: on and off its possible to abuse youtube apis to get the actual original mp4, but it changes 2 times now (works for google+, only for your own videos when logged in)
*has changed
so definitely not something to rely on
but if i were as crazy to archive yt, i'd definitely try to rip the original files, not the re-encodes
[22:44]
***odemg has joined #archiveteam-bs [22:45]
ola_norskno hehe :D i'm just talking example as to how to detect duplicate videos :D
or duplicate uploads in general
[22:45]
ezola_norsk: depends what you command ytdl to do
generally if you ask it same file, same format, you overwhelmingly get the exact same file
but google re-encodes those from time to time
[22:46]
ola_norskaye, most often it just combines audio and video, 2 different files, into a KVM file. And, i'm thinking if the kvm file were split again, into those two a/v files, the md5 would be the same in two instances of where youtube-dl were used to download the same video.
and, by that, duplicate video uploads could be detected
the generated merged file (kvm etc) would be different, but the two contained files would be the same, since there's no re-encoding occurring
[22:48]
ezhum, that sounds elaborate? [22:51]
ola_norskaye, i have a headache :D [22:51]
ezwhy not just tell ytdl to rip everything to mkv from the get-go? [22:51]
ola_norski just meant in relation to detecting duplicates in IA items (where e.g 2 youtube videos are uploaded twice)
where md5sum of the two kvm files is not an option
since each download would cause two different kvms to be made locally by the downloaders
if, however, it's possible to split a kvm into the audio and video files contained.. i'm thinking those two would yield identical md5
[22:54]
ola_norsk ran out of duplication-detection smartness :/ [23:02]
ezi have no idea what kvm file is [23:05]
***nyany has joined #archiveteam-bs [23:10]
ola_norskez: usually when i use youtube-dl it downloads audio and video separately, then combines the two into a kvm file [23:16]
ezwhats an kvm file?
oh
mkv
[23:16]
ola_norskah, sorry, yes
mkv
[23:17]
ezyea its a bit annoying, mostly because the ffmpeg transcoder is non-deterministic
i think it puts timestamp in the mkv header or something silly like that
s/transcoder/muxer/
[23:18]
ola_norskdoes it alter the two audio and video files though? or can they be split from the mkv?
either way, if that is possible; then detecting duplicate mkv files uploaded to ia is possible
even if the md5 sum differs between two mkv files containing the same content
[23:19]
ezthe raw bitstream is kept as-is [23:22]
ola_norskok [23:23]
ezmeaning the mux is as "original" as sent in the google HLS track
but the container metadata are often unstable on account of different versions of software muxing slightly differently (think seek metadata and such)
[23:23]
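Given that the copied bitstreams are untouched while the container metadata wobbles, one way to test ola_norsk's idea is to hash each stream under stream copy with ffmpeg's md5 muxer, so the container never enters the hash. A sketch, assuming ffmpeg is on PATH and a placeholder filename; whether two youtube-dl runs really yield byte-identical streams is exactly the open question above.

```python
# Sketch: hash the copied (not re-encoded) video and audio streams of an mkv
# with ffmpeg's "md5" muxer, so unstable container metadata never enters the
# hash. Assumes ffmpeg on PATH; "example.mkv" is a placeholder.
import subprocess

def stream_md5(path: str, selector: str) -> str:
    """selector is an ffmpeg -map specifier, e.g. '0:v:0' or '0:a:0'."""
    result = subprocess.run(
        ["ffmpeg", "-v", "error", "-i", path,
         "-map", selector, "-c", "copy", "-f", "md5", "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().split("=", 1)[-1]  # output looks like "MD5=<hex>"

print(stream_md5("example.mkv", "0:v:0"))  # video bitstream hash
print(stream_md5("example.mkv", "0:a:0"))  # audio bitstream hash
```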
ola_norskif the framecount doesn't differ though, and neither does the frames, that could be a further step?
ez: the extent of my knowledge is rather spent when it comes to codecs :D
[23:27]
eztl;dr is that you cant rely on what ytdl gives you as muxed output
perhaps if you just ask for the 720p mp4, as that one isnt remuxed (yet?)
[23:28]
ola_norski'm thinking some kind of image/frame simularity detection then i guess
rather*
[23:30]
***jschwart has quit IRC (Quit: Konversation terminated!) [23:32]
ola_norskthe md5 is in the item xml though, so i guess that is where one would have to start to find duplicates on ia [23:32]
ez: it is as you say elaborate. So i'm glad i don't have to do it :D
ez: (and so should everyone else be, of me not doing it lol) ;)
[23:39]
***BlueMaxim has quit IRC (Quit: Leaving) [23:41]
