01:03 -- JAA sets mode: +bb Ya_ALLAH!*@* *!*@185.143.4*
01:45 -- CoolCanuk has joined #archiveteam-bs
02:01 -- DopefishJ is now known as DFJustin
02:10 -- odemg has quit IRC (Ping timeout: 250 seconds)
02:13 -- odemg has joined #archiveteam-bs
02:21 -- pizzaiolo has quit IRC (pizzaiolo)
03:05 -- ZexaronS has quit IRC (Read error: Operation timed out)
03:07 -- Stilett0 has quit IRC (Read error: Operation timed out)
03:16 -- ndiddy has joined #archiveteam-bs
03:21 -- Stilett0 has joined #archiveteam-bs
03:21 -- Stilett0 is now known as Stiletto
03:38 <robogoat> Somebody2: Ok, can you explain something, regarding darking and crawl history?
03:38 <robogoat> It is possible for someone to own domain A, have content on domain A, be fine with IA archiving it,
03:39 <robogoat> and then sell/let the domain lapse, and the subsequent owner/squatter puts up a prohibitive robots.txt
03:39 <robogoat> It's my understanding that the material is then non-accessible,
03:39 <robogoat> is it "darked"?
04:05 -- qw3rty111 has joined #archiveteam-bs
04:09 -- qw3rty119 has quit IRC (Read error: Operation timed out)
04:14 -- CoolCanuk has quit IRC (Quit: Connection closed for inactivity)
04:21 <vantec> Been out of the loop for a bit, but think this mostly still stands: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
05:28 -- bithippo has quit IRC (Read error: Connection reset by peer)
05:44 -- wacky has quit IRC (Read error: Operation timed out)
05:44 -- wacky_ has joined #archiveteam-bs
06:50 <Somebody2> robogoat: First of all, note that I'm not employed by IA, and in fact have only visited once. This is just a curious outside onlooker's view.
06:51 <Somebody2> With that out of the way -- dark'ing applies to *items* (a jargon term) on archive.org, not web pages in the Wayback Machine.
06:52 <Somebody2> The Wayback Machine is an *interface* to a whole bunch of WARC files, stored in various items on IA.
06:53 <Somebody2> The WARC files are what actually contain the HTML (and URLs, and dates) that are displayed through the Wayback Machine.
06:55 <Somebody2> If an item is darked, it can't be included in the Wayback Machine (or any other interface, like the TV News viewer, or the Emularity).
06:56 -- Stiletto has quit IRC (Ping timeout: 250 seconds)
06:56 <Somebody2> So none of the items containing WARCs that contain web pages visible through the Wayback Machine are darked (unless I'm missing something).
06:57 <Somebody2> However -- that doesn't mean you can download the WARC files yourself, directly (with some exceptions).
06:58 <Somebody2> Most of the items containing WARCs used in the Wayback Machine are "private", which means while you can see the file names, and sizes, and hashes, ...
06:59 <Somebody2> ... you can't actually download the actual files without special permission (which the software that runs the Wayback Machine has).
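That file-level visibility can be checked through the public metadata API; a minimal sketch using the internetarchive Python package, with a made-up identifier (any item on archive.org would do):

```python
from internetarchive import get_item

# Made-up identifier; substitute any item visible on archive.org.
item = get_item("example_warc_item")

# File names, sizes and hashes are public metadata even when the item is
# "private" and the files themselves can't be downloaded without permission.
for f in item.files:
    print(f.get("name"), f.get("size"), f.get("md5"))
```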
07:00 <Somebody2> The WARCs produced by ArchiveTeam generally are *NOT* private -- although we prefer not to talk about this too loudly, to avoid people complaining.
07:01 <Somebody2> A recently added feature of the Wayback Machine provides links from a particular web page to the item containing the WARC it came from.
07:04 <Somebody2> Actually, it looks like it only links to the *collection* containing the item containing the WARC, sorry.
07:06 <Somebody2> So, now, to get back to robots.txt -- the Wayback Machine does (currently) include a feature to disable access to URLs whose most recent robots.txt file Disallows them.
07:07 <Somebody2> The details of exactly how this operates (i.e. which Agent names does it recognize, how does it parse different Allow and Disallow lines, ...
07:07 <Somebody2> ... what does it do if there is no robots.txt file) are subtle, changing, and undocumented.
07:08 <Somebody2> And robots.txt files do *NOT* apply to themselves, so you can always see the contents of all the robots.txt files IA has captured for a domain.
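As an illustration of that point: the Wayback CDX API will list every capture of a domain's robots.txt. A sketch with a placeholder domain (the endpoint is public and needs no key):

```python
import requests

# Placeholder domain; each returned capture can be viewed through web.archive.org.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/robots.txt",
        "output": "json",
        "fl": "timestamp,statuscode,original",
    },
    timeout=60,
)
rows = resp.json()
for timestamp, status, original in rows[1:]:  # first row is the header
    print(timestamp, status, f"https://web.archive.org/web/{timestamp}/{original}")
```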
07:09 <Somebody2> (unless there was a specific complaint sent to IA asking for the domain to be excluded, which they also honor)
07:10 <Somebody2> But the robots.txt logic doesn't apply at ALL to the underlying items -- so *if* you can download them, you can still access the data that way.
07:10 <Somebody2> Hopefully that answers the question.
07:10 <Somebody2> (and sorry everyone else for the literal wall of text)
07:11 <Somebody2> There is also a robots.txt feature included in the Save Page Now feature, but that's a separate thing.
07:22 -- bwn has quit IRC (Read error: Connection reset by peer)
07:37 -- DFJustin has quit IRC (Remote host closed the connection)
07:44 -- DFJustin has joined #archiveteam-bs
07:44 -- swebb sets mode: +o DFJustin
07:44 -- bwn has joined #archiveteam-bs
08:01 -- ZexaronS has joined #archiveteam-bs
08:03 -- mr_archiv has quit IRC (Quit: WeeChat 1.6)
08:03 -- mr_archiv has joined #archiveteam-bs
08:03 -- mr_archiv has quit IRC (Client Quit)
08:05 -- mr_archiv has joined #archiveteam-bs
08:54 -- Mateon1 has quit IRC (Read error: Connection reset by peer)
08:55 -- Mateon1 has joined #archiveteam-bs
09:11 -- MrDignity has quit IRC (Remote host closed the connection)
09:11 -- MrDignity has joined #archiveteam-bs
10:12 -- schbirid has joined #archiveteam-bs
10:17 -- nyany has quit IRC (Leaving)
10:40 -- JAA sets mode: +bb BestPrize!*@* *!pointspri@*
10:53 -- MrDignity has quit IRC (Remote host closed the connection)
10:53 -- MrDignity has joined #archiveteam-bs
12:19 -- kimmer12 has joined #archiveteam-bs
12:22 -- BlueMaxim has quit IRC (Quit: Leaving)
12:25 -- schbirid has quit IRC (Quit: Leaving)
12:26 -- kimmer1 has quit IRC (Ping timeout: 633 seconds)
12:54 -- dashcloud has quit IRC (No Ping reply in 180 seconds.)
12:54 -- dashcloud has joined #archiveteam-bs
13:00 -- Stilett0 has joined #archiveteam-bs
13:46 -- dashcloud has quit IRC (Read error: Operation timed out)
13:48 -- dashcloud has joined #archiveteam-bs
14:33 -- Specular has joined #archiveteam-bs
14:34 <Specular> is there any known way of converting Web Archive files saved from Safari to the MHT format?
14:52 -- godane has quit IRC (Quit: Leaving.)
15:09 -- kimmer1 has joined #archiveteam-bs
15:12 -- kimmer12 has quit IRC (Ping timeout: 633 seconds)
15:35 -- kimmer12 has joined #archiveteam-bs
15:36 -- kimmer12 has quit IRC (Remote host closed the connection)
15:42 -- kimmer1 has quit IRC (Ping timeout: 633 seconds)
16:06 <Specular> somehow my search queries were too specific prior and just found this. Mac only but will test later. https://langui.net/webarchive-to-mht/
16:07 <Specular> oh it's commercial. Typical Mac apps, ahaha.
16:58 -- Specular has quit IRC (Quit: Leaving)
17:09 -- pizzaiolo has joined #archiveteam-bs
17:52 -- ola_norsk has joined #archiveteam-bs
17:54 <ola_norsk> how might one go about 'archiving a person' on the internet archive? I'm thinking of the youtuber Charles Green, a.k.a. 'Angry Grandpa'
18:04 -- kimmer1 has joined #archiveteam-bs
18:12 <Somebody2> ola_norsk: You can't archive a person. But you could archive the work they have posted online. I'd use ArchiveBot and youtube-dl.
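A rough sketch of the youtube-dl half of that suggestion, using the youtube-dl Python API; the channel URL is a placeholder and the option set is just one reasonable choice:

```python
import youtube_dl

# Placeholder channel URL; swap in the channel you actually want to archive.
options = {
    "outtmpl": "%(uploader)s/%(upload_date)s - %(title)s - %(id)s.%(ext)s",
    "writedescription": True,              # keep the description alongside the video
    "writeinfojson": True,                 # keep full metadata as JSON
    "writethumbnail": True,
    "download_archive": "downloaded.txt",  # lets re-runs skip already-fetched videos
    "ignoreerrors": True,                  # don't stop the whole run on one bad video
}
with youtube_dl.YoutubeDL(options) as ydl:
    ydl.download(["https://www.youtube.com/user/EXAMPLE_CHANNEL"])
```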
18:14 <ola_norsk> Somebody2: could those various items, from e.g. the fandom wikia, twitter, to youtube videos etc., later be made into e.g. a 'Collection'?
18:14 <ola_norsk> Somebody2: without having to be one item, i mean
18:15 <Somebody2> Yes, once you upload them, send an email to info@archive suggesting they be made into a collection, and someone will likely do it eventually.
18:15 <ola_norsk> ty
18:16 -- godane has joined #archiveteam-bs
18:17 <ola_norsk> btw, would the items need a certain meta-tag?
18:18 <ola_norsk> other than topics, i mean
18:20 <JAA> For WARCs, you need to set the mediatype to web. Anything else is optional and can be changed post-upload (mediatype can only be set at item creation). But the more metadata, the better! :-)
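A minimal sketch of such an upload with the internetarchive Python package (identifier, filename and metadata values are made up):

```python
from internetarchive import upload

upload(
    "examplesite-20171112-warc",            # made-up identifier
    files=["examplesite-20171112.warc.gz"],
    metadata={
        "mediatype": "web",                 # must be correct at item creation
        "title": "examplesite.com WARC, 2017-11-12",
        "subject": "archiveteam;warc;examplesite.com",  # topics; editable later
    },
)
```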
18:20 <ola_norsk> okidoki
18:21 <JAA> (If you forget to set the mediatype correctly on upload, send them an email instead of trying to work around it by creating a new item or whatever; they can change it, I believe.)
18:22 -- pizzaiolo has quit IRC (Read error: Operation timed out)
18:22 <ola_norsk> speaking of which, i messed that up on this item https://archive.org/details/vidme_AfterPrisonJoe :/
18:24 <ola_norsk> and it seems to have messed up the media format detection. I did re-download the videos locally though.
18:26 -- pizzaiolo has joined #archiveteam-bs
18:27 <ola_norsk> JAA: would it be a good idea to simply tar.gz the vidme videos, add that to the item, and then send an email asking that all the item's content be replaced by the content of the tar.gz?
18:27 <JAA> I doubt it.
18:28 <JAA> Why don't you just use the 'ia' tool (Python package internetarchive)?
18:28 <ola_norsk> i do use that
18:29 <ola_norsk> but, there seems to be a bug that's preventing changing metadata.
18:30 <JAA> Hm?
18:30 <ola_norsk> JAA: https://github.com/jjjake/internetarchive/issues/228
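For reference, the kind of metadata change being discussed, via the same internetarchive package; this is what `ia metadata <identifier> --modify=...` does on the command line (the new values here are invented):

```python
from internetarchive import modify_metadata

# Item identifier from earlier in the log; the metadata values are just examples.
r = modify_metadata(
    "vidme_AfterPrisonJoe",
    metadata={"title": "AfterPrisonJoe vid.me mirror", "subject": "vidme;video"},
)
print(r.status_code)
```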
18:31 <ola_norsk> but, maybe it's resolved already. I've not tested yet.
18:41 <ola_norsk> ouch, might i have accidentally closed the issue? Or do they timeout on github after some time :/
18:46 <ola_norsk> JAA: what version of 'ia' are you on?
18:46 * ola_norsk is on 1.7.4
19:18 <ola_norsk> how do i submit 'angrygrandpa.wikia.com' to get grabbed by archivebot?
19:19 <ola_norsk> nevermind
19:38 -- dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
19:38 -- dashcloud has joined #archiveteam-bs
19:56 -- dashcloud has quit IRC (Read error: Connection reset by peer)
20:00 -- dashcloud has joined #archiveteam-bs
20:20 -- BlueMaxim has joined #archiveteam-bs
20:35 <JAA> ola_norsk: Sorry, had to leave. I've been using 1.7.1 and 1.7.3.
20:38 <JAA> ola_norsk, Somebody2: If you archive a site through ArchiveBot, you can't easily add just that site's archives to a collection afterwards though, because the archives are generally spread over multiple items and each of those items also contains loads of other jobs' archives.
20:39 <JAA> It might be better to use grab-site/wpull and upload the WARCs yourself. Then you can create clean items for each site you archive or whatever, and these can easily be added to a collection (or multiple) afterwards.
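A sketch of that workflow, assuming grab-site is installed and writes its WARCs into a per-job directory; the directory glob and item identifier below are guesses, so check grab-site's own documentation before relying on them:

```python
import glob
import subprocess

from internetarchive import upload

# Crawl one site into its own job directory (grab-site's default behaviour).
subprocess.run(["grab-site", "https://angrygrandpa.wikia.com/"], check=True)

# Upload the resulting WARCs into one clean per-site item, which can later be
# added to a collection. Identifier and glob pattern are assumptions.
warcs = glob.glob("angrygrandpa.wikia.com-*/**/*.warc.gz", recursive=True)
upload(
    "angrygrandpa-wikia-com-warcs",
    files=warcs,
    metadata={"mediatype": "web", "subject": "archiveteam;warc;wikia"},
)
```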
20:42 <Somebody2> That is a good point, thank you.
20:47 <PurpleSym> What's the #archivebot IRC logs password?
20:56 <JAA> Query
20:57 <PurpleSym> And username, JAA?
20:59 <ola_norsk> JAA: ah. i will see if i have the space for that wikia. (though, it's already submitted to archivebot as a job) :/ I will use !abort if my harddrive runs out
21:00 <ola_norsk> JAA: so there's no real way to recall a specific archivebot job/task?
21:00 <JAA> ola_norsk: You could upload the WARCs already while you're still grabbing it. That's what ArchiveBot pipelines do as well.
21:00 <JAA> What do you mean by "recall"?
21:01 <ola_norsk> JAA: to make that specific archivebot task into an item on ia
21:02 <ola_norsk> JAA: a warc item, i mean, with topics etc.
21:02 <JAA> In theory, you could download the files and reupload them to a new item. Or use 'ia copy', which does a server-side copy I believe. Whether that's a good idea though is another question entirely...
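A sketch of the server-side copy route via the `ia` command-line tool; both identifiers and filenames are invented, and the exact subcommand syntax should be checked against `ia copy --help` for your version:

```python
import subprocess

# Copy one WARC out of a (hypothetical) ArchiveBot item into your own item
# without downloading it first; identifiers and filenames are made up.
subprocess.run(
    [
        "ia", "copy",
        "archivebot_example_item/example-job.warc.gz",
        "my-clean-site-item/example-job.warc.gz",
    ],
    check=True,
)
```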
21:03 <ola_norsk> im a n00b at using archivebot i'm afraid :/
21:03 <JAA> I guess someone from IA could move the files to a separate item. But again I'm not sure whether they do that.
21:03 <JAA> But yeah, that's all manual.
21:04 <JAA> Some pipelines upload files directly, and there you sort-of have one item per job (though it doesn't contain all relevant files and may sometimes contain other jobs as well).
21:04 <JAA> But other than that...
21:05 <ola_norsk> a warc item that's manually uploaded as an item at a later time though, could that use data already grabbed through archivebot? Or would it be causing duplicate-hell?
21:06 <JAA> I don't know how IA handles duplicates.
21:06 <ola_norsk> aye, me neither
21:07 <JAA> That's why I wonder if it's a good idea.
21:07 <JAA> If they deduplicate the files, then it would probably be fine.
21:07 <JAA> Maybe someone else knows more about this.
21:08 <ola_norsk> "somewhere, in the deep cellars of internet archive; There's a single gnome set to the task of checksum'ing all files and writing symlinks" lol
21:08 <ola_norsk> :D
21:12 -- Mateon1 has quit IRC (Read error: Operation timed out)
21:13 -- Mateon1 has joined #archiveteam-bs
21:15 <Somebody2> IA does not (*YET*) deduplicate.
21:15 <Somebody2> (AFAIK)
21:20 -- jschwart has joined #archiveteam-bs
21:49 -- dashcloud has quit IRC (Read error: Operation timed out)
21:50 -- dashcloud has joined #archiveteam-bs
21:51 <ola_norsk> i have no idea. if there's no pressing need it's ok i guess. And thinking they would if need be, at least on items that haven't been altered for quite a while.
21:52 <godane> deduplication i can see being done on video and pdf items
21:53 <godane> i don't think it would work with warc files
21:53 <ola_norsk> aye
21:53 <ola_norsk> not with derived/recompressed files either i think. not unless the original was checked beforehand
21:57 <ola_norsk> godane: i guess with warc it would need checking content; and patching the stuff and/or link list in the warcs
21:58 <godane> sort of my thought
21:58 <ola_norsk> aye
21:59 <ez> unless ia unpacks all warcs on their side
21:59 <godane> i was thinking it would check important web urls that have the same checksum and make a derived warc archive to only store it once
22:00 <ez> warc is generally a rather unfortunate thing to do for bulk file formats
22:00 <ez> (not sure about the wisdom of reinventing zip files, either)
22:00 <godane> so it would be a derived warc either way
22:01 <godane> something like this for warc makes more sense if we are doing my librarybox project idea
22:01 <ola_norsk> got link?
22:01 <godane> cause then people can host full archives of cbsnews.com for example without it taking 100gb
22:02 <ez> the thing is that on mass scale, dupes dont happen that often in general
22:02 <ez> so its often not worth the time bothering with it, especially for small items
22:04 <ola_norsk> but e.g. for twitter, users might upload memes that are just copied from other sites. i don't know if twitter alters all images; but that could cause duplicates i think
22:04 <ola_norsk> if twitter gives each uploaded image its own file and filename, i mean
22:05 <ez> somewhat
22:06 <ez> theres been a study of this for 4chan, which granted, isnt a representative sample of twitter
22:06 <ez> but as far as actual md5s stored _live_, only 10-15% were dupes
22:06 <ola_norsk> ok
22:06 <ez> however over time, there were indeed >50% dupes in certain time periods
22:06 <ez> exactly as you say, some image got really popular and got reposted over and over
22:08 -- pizzaiolo has quit IRC (Read error: Operation timed out)
22:08 <ola_norsk> in any case, it's something that's fixable in the future though i would guess. E.g. picking through all the shit and finding e.g. the most reoccurring, or highest-quality image; either by md5 or image regocnitions
22:09 <ola_norsk> image recognition*
22:09 -- pizzaiolo has joined #archiveteam-bs
22:10 <ez> well, this isnt that exact study i've seen, but it mentions the distribution of dupes per post https://i.imgur.com/S9pxJqV.png
22:11 <ez> ola_norsk: it could be done of course. note that if you were to do what you say, you'd also be building an index for reverse image search
22:11 <ez> which would be a really handy thing for IA to have
22:11 <ez> needless to say, it depends if IA wants to diversify as a search engine
22:12 <ola_norsk> well, the meta.xml is there, containing the md5's i think :D
22:12 <ez> md5s are useless for search
22:12 <ola_norsk> aye, but to find duplicates i mean
22:12 <ez> at the md5 level, the dupes dont happen often enough given a random kitchen sink of files
22:13 <ez> it definitely makes sense for certain datasets
22:13 -- pizzaiolo has quit IRC (Client Quit)
22:13 <ez> like those ad-laden pirate rapidshare upload servers, they sure do md5. they have this huge sample of highly specific content, and the files are often the same copies.
22:13 <ez> megaupload actually got nailed for this legally
22:13 -- pizzaiolo has joined #archiveteam-bs
22:13 <ez> they DMCA'd the link, but not the md5
22:14 <ez> deduping 10k 1MB image files and deduping 10k 1GB files is what makes the difference
22:15 <ez> as you get a random mix of both, its obvious which part of the set to focus on
22:15 <ola_norsk> it would be slow work i guess
22:15 <ez> depends on how the system works, really
22:16 <ez> most data stores working on a per-file basis often compute the hash on streaming upload, and make a symlink when a matching hash is found at that time, too
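A toy version of that scheme (not anything IA actually runs): hash the upload as it streams in, keep one blob per hash, and point duplicates at it with a symlink:

```python
import hashlib
import os

STORE = "store"  # content-addressed blobs, named by their sha1
os.makedirs(STORE, exist_ok=True)

def ingest(stream, link_name, chunk_size=1 << 20):
    """Hash an upload while writing it out; dedupe once the hash is known."""
    h = hashlib.sha1()
    tmp = os.path.join(STORE, link_name + ".part")
    with open(tmp, "wb") as out:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            h.update(chunk)
            out.write(chunk)
    blob = os.path.join(STORE, h.hexdigest())
    if os.path.exists(blob):
        os.remove(tmp)            # seen before: throw the new copy away
    else:
        os.rename(tmp, blob)      # first time: keep it as the canonical blob
    os.symlink(os.path.abspath(blob), link_name)  # the "file" is just a link
    return h.hexdigest()
```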
22:16 <ez> but more general setups often dont have the luxury of having a hook fire for each new file
22:16 <ola_norsk> what if there was a distributed tool for the Warrior, that picked through the IA items' xml and looked for duplicate md5's?
22:17 <ola_norsk> (of only original files that is)
22:18 <ez> its possible to fetch the hashes from ia directly via the api
22:18 <JAA> joepie91: FYI, I'm porting your anti-CloudFlare code to Python.
22:19 <ez> not everything is available tho. as long as it has an xml sha1, you can fetch it via the api
22:19 <ez> build an offline database too, etc
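A sketch of that offline database: the public https://archive.org/metadata/<identifier> endpoint returns per-file md5/sha1, so hashes can be collected locally and your own files checked against them (the identifiers below are placeholders):

```python
import requests

def item_hashes(identifier):
    """md5 -> filename for one item, from the public metadata endpoint."""
    data = requests.get(f"https://archive.org/metadata/{identifier}", timeout=60).json()
    return {f["md5"]: f["name"] for f in data.get("files", []) if "md5" in f}

# Build a small local index over a handful of items (placeholders).
index = {}
for identifier in ["vidme_AfterPrisonJoe", "some_other_item"]:
    for md5, name in item_hashes(identifier).items():
        index.setdefault(md5, []).append((identifier, name))
```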
22:20 <ola_norsk> ez: it's quite a number of items on ia though. But if it was a delegated slow and steady task
22:21 <ola_norsk> ez: like a background task or something
22:21 <ez> no i mean you can have the hashes offline
22:21 <ez> as a comparably small structure
22:21 <ez> unfortunately its really awkward to get it at this moment
22:22 <ola_norsk> ez: i mean making a full list of duplicate files. where e.g. the 'parent item' is decided by first date
22:22 <ez> ola_norsk: a bloom filter with a reasonable collision rate is like 10 bits per item
22:23 <ez> regardless of the number of items
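A minimal Bloom filter along those lines: at roughly 10 bits and 7 hash functions per element the false-positive rate is around 1%, and it answers "definitely not seen" or "probably seen" however many md5s go in:

```python
import hashlib

class Bloom:
    def __init__(self, expected_items, bits_per_item=10, n_hashes=7):
        self.size = expected_items * bits_per_item
        self.n_hashes = n_hashes
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key: bytes):
        for i in range(self.n_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Feed it md5s scraped from IA, then test local files before uploading.
seen = Bloom(expected_items=1_000_000)
seen.add(b"d41d8cd98f00b204e9800998ecf8427e")
print(b"d41d8cd98f00b204e9800998ecf8427e" in seen)   # True
print(b"ffffffffffffffffffffffffffffffff" in seen)   # almost certainly False
```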
22:24 <ez> not sure if ia supports search by hash
22:25 <ez> last time i checked the api (some years ago) it didnt
22:25 <ola_norsk> the md5 hashes are in the xml of (each?) item
22:25 <ez> if it still doesnt, you'd need to store the hash, as well as the xml id, to locate its context as you say
22:25 <ez> which would bloat the database a great deal
22:26 <ez> ola_norsk: yea
22:26 <ez> the idea is that i run a scan over my filesystem and compare every file to a filter i scraped from the ia api
22:26 <ola_norsk> it's not going anywhere though. So it could basically be done slow as shit don't you think?
22:26 <ez> and upload only files which dont match. this is because querying IA with 500M+ files is not realistic
22:27 <ez> so a bloom filter would work fine for uploads of random crap and making more or less sure its not a dupe
22:28 <ez> but it wont tell you *where* your files are on ia, unless you abuse the api with fulltext search and what not
22:28 <ola_norsk> i have no idea man :)
22:29 <ola_norsk> do the logs say?
22:29 <ola_norsk> the history of items?
22:30 * ola_norsk's brain is broken and beered :/
22:31 <ola_norsk> i'm guessing IA would put some gnome to work the day their harddrive is full :D
22:32 <ola_norsk> (which i'm guessing is not tomorrow lol) :D
22:32 <ez> just restrict some classes of uploads when space starts running short
22:32 <ez> but yea, space can be done on the cheap if you have the scale
22:33 <ola_norsk> doh, restriction is bad :/
22:33 <ola_norsk> might they as well check if that file already exists?
22:34 <ez> as i said, at those scales, the content is so diverse it happens rather infrequently, especially if your files are comparably small (ie a lot of small items of diverse content)
22:34 <ola_norsk> ...maybe they already do.. :0
22:34 -- odemg has quit IRC (Read error: Connection reset by peer)
22:35 <ez> its easy to do for single files, but not quite sure about warc
22:35 <ola_norsk> aye, it would need unpacking and stuff
22:36 <ola_norsk> and e.g. youtube videos that are mkv-combined would need to be split into audio and video i guess, then compared
22:37 <ez> the thing is i've seen deduping rather infrequently in large setups like this - the restrictions on flexibility of what you can do (you now need some sort of fast hash index to check against, you need some symlink mechanism now)
22:37 <ez> youtube doesnt dedupe im pretty sure
22:37 <ez> not by the source video anyway
22:38 <ez> since 99% of content they get is original uploads. most dupes they'd otherwise get usually get struck by contentid
22:38 <ez> 1% is the long tail of short meme videos reposted over and over and what not, but its just a tiny part of the long tail
22:39 <ola_norsk> i sometimes upload by getting videos with youtube-dl, and it seems that often combines 2 files, audio and video, into a single file..would that make a different md5 sum?
22:40 <ola_norsk> (without repacking/recoding, i mean)
22:40 <ez> (its easy to test - each reupload yields a new reencode on yt, and the encode is even slightly different as it contains a timestamp in the mp4 header)
22:41 <ez> ola_norsk: yes, the highest quality is available only via HLS
22:42 <ez> curiously, its an artificial restriction, as other parts of google infra which use the yt infra serve mp4 1080p/4k just fine
22:42 <ez> the restriction is specific to yt, i suppose in a bid to frustrate trivial ripping attempts via browser extensions
22:44 <ola_norsk> but, i think what i mean is; If i run 'youtube-dl' on the same youtube video twice..Then e.g. the audio and video (often webm and mp4), before they are merged into an MKV file, would be the very same files each time? or?
22:44 <ez> ola_norsk: on and off its possible to abuse youtube apis to get the actual original mp4, but it changes 2 times now (works for google+, only for your own videos when logged in)
22:44 <ez> *has changed
22:44 <ez> so definitely not something to rely on
22:45 <ez> but if i were crazy enough to archive yt, i'd definitely try to rip the original files, not the re-encodes
22:45 -- odemg has joined #archiveteam-bs
22:45 <ola_norsk> no hehe :D i'm just giving an example as to how to detect duplicate videos :D
22:45 <ola_norsk> or duplicate uploads in general
22:46 <ez> ola_norsk: depends what you command ytdl to do
22:46 <ez> generally if you ask it for the same file, same format, you overwhelmingly get the exact same file
22:46 <ez> but google re-encodes those from time to time
22:48 <ola_norsk> aye, most often it just combines audio and video, 2 different files, into a KVM file. And, i'm thinking if the kvm file were split again, into those two a/v files, the md5s would be the same in two instances where youtube-dl was used to download the same video.
22:49 <ola_norsk> and, by that, duplicate video uploads could be detected
22:50 <ola_norsk> the generated merged file (kvm etc) would be different, but the two files being merged would be the same, since there's no re-encoding occurring
22:51 <ez> hum, that sounds elaborate?
22:51 <ola_norsk> aye, i have a headache :D
22:51 <ez> why not just tell ytdl to rip everything to mkv from the get-go?
22:54 <ola_norsk> i just meant in relation to detecting duplicates in IA items (where e.g. the same youtube video is uploaded twice)
22:55 <ola_norsk> where an md5sum of the two kvm files is not an option
22:55 <ola_norsk> since each download would cause two different kvm files to be made locally by the downloaders
22:57 <ola_norsk> if, however, it's possible to split a kvm into the audio and video files contained..i'm thinking those two would yield identical md5s
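A sketch of that check, assuming ffmpeg is installed: copy the audio and video streams back out of the merged file without re-encoding and hash the results. Container metadata can still differ between runs (ffmpeg's muxer is not fully deterministic, as comes up later in the log), so `-fflags +bitexact` is added to reduce that; treat it as best-effort:

```python
import hashlib
import subprocess

def stream_md5(src, selector, out_path):
    """Copy one stream out of a merged file without re-encoding, then md5 it."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-map", selector, "-c", "copy",
         "-fflags", "+bitexact", out_path],
        check=True,
    )
    h = hashlib.md5()
    with open(out_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Matroska containers can hold whatever codecs youtube-dl fetched.
video_md5 = stream_md5("download.mkv", "0:v:0", "video-only.mkv")
audio_md5 = stream_md5("download.mkv", "0:a:0", "audio-only.mka")
print(video_md5, audio_md5)
```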
23:02 * ola_norsk ran out of duplication-detection smartness :/
23:05 <ez> i have no idea what a kvm file is
23:10 -- nyany has joined #archiveteam-bs
23:16 <ola_norsk> ez: usually when i use youtube-dl it downloads audio and video separately, then combines the two into a kvm file
23:16 <ez> whats a kvm file?
23:16 <ez> oh
23:16 <ez> mkv
23:17 <ola_norsk> ah, sorry, yes
23:17 <ola_norsk> mkv
23:18 <ez> yea its a bit annoying, mostly because the ffmpeg transcoder is non-deterministic
23:18 <ez> i think it puts a timestamp in the mkv header or something silly like that
23:18 <ez> s/transcoder/muxer/
23:19 <ola_norsk> does it alter the two audio and video files though? or can they be split back out of the mkv?
23:21 <ola_norsk> either way, if that is possible; then detecting duplicate mkv files uploaded to ia is possible
23:21 <ola_norsk> even if the md5 sum differs between two mkv files containing the same content
23:22 <ez> the raw bitstream is kept as-is
23:23 <ola_norsk> ok
23:23 <ez> meaning the mux is as "original" as sent in the google HLS track
23:23 <ez> but the container metadata are often unstable on account of different versions of software muxing slightly differently (think seek metadata and such)
23:27 <ola_norsk> if the framecount doesn't differ though, and neither do the frames, that could be a further step?
23:28 <ola_norsk> ez: the extent of my knowledge is rather spent when it comes to codecs :D
23:28 <ez> tl;dr is that you cant rely on what ytdl gives you as muxed output
23:29 <ez> perhaps if you just ask for the 720p mp4, as that one isnt remuxed (yet?)
23:30 <ola_norsk> i'm thinking some kind of image/frame similarity detection then i guess
23:30 <ola_norsk> rather*
23:32 -- jschwart has quit IRC (Quit: Konversation terminated!)
23:32 <ola_norsk> the md5 is in the item xml though, so i guess that is where one would have to start to find duplicates on ia
23:39 <ola_norsk> ez: it is as you say elaborate. So i'm glad i don't have to do it :D
23:40 <ola_norsk> ez: (and so should everyone else be, of me not doing it lol) ;)
23:41 -- BlueMaxim has quit IRC (Quit: Leaving)