[00:01] *** godane has joined #archiveteam-ot
[02:49] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[03:15] *** qw3rty has joined #archiveteam-ot
[03:24] *** qw3rty2 has quit IRC (Ping timeout: 745 seconds)
[03:39] *** ntntn has joined #archiveteam-ot
[06:14] *** dhyan_nat has joined #archiveteam-ot
[06:23] *** Mateon1 has quit IRC (Remote host closed the connection)
[06:23] *** Mateon1 has joined #archiveteam-ot
[07:18] *** dxrt has joined #archiveteam-ot
[07:18] *** Fusl____ sets mode: +o dxrt
[07:18] *** Fusl sets mode: +o dxrt
[07:18] *** Fusl_ sets mode: +o dxrt
[07:24] *** systwi_ is now known as systwi
[08:30] JAA: "File size collisions are definitely very unlikely. If you can live with a very small potential error, there's no point in doing anything else."
[08:31] Heh, well, I'd prefer to use a more bulletproof (err, resistant :P ) solution, but I guess the file size will have to do
[08:32] Taking into account how infrequently YT videos are edited by their creators, it'll do
[08:33] You guys don't have to spend time researching it, but if you do find a more reliable approach please let me know
[09:10] how many videos do you have where you have more than one variant?
[10:02] *** voker57_ is now known as voker57
[10:34] *** killsushi has quit IRC (Quit: Leaving)
[10:36] *** ShellyRol has quit IRC (Read error: Connection reset by peer)
[10:40] *** qw3rty2 has joined #archiveteam-ot
[10:44] *** ShellyRol has joined #archiveteam-ot
[10:46] *** qw3rty has quit IRC (Ping timeout: 745 seconds)
[11:03] i have tons of file size collisions for videos on my hdd. believe it's the nature of mpeg encoding
[11:05] are they from youtube-dl ?
[11:07] i only have a few dozen videos from youtube-dl
[11:07] no dupe sizes
[11:11] if someone has a copy of a youtube video pre-censor and post-censor, try running both videos through ffprobe to see if there are diffs
[11:11] looks like there's a CRC32 of the video and audio streams?
[11:13] if so, then just ffprobe for those values and tack them onto the filename / your database
[11:28] Where do you get the checksums from? If that's just ffprobe, then you'd still need to redownload to check whether something has changed.
[11:29] And if you redownload, you could just as well run a hash over the file or whatever.
[11:29] oh nvm, those aren't crc32 values. but maybe some other values can be sussed out that vary
[11:30] yeah, you can hash your files, but i recognize that's more time consuming than metadata readers like ffprobe
[11:30] If the checksum's in the headers, yes. If ffprobe calculates it from the streams, almost certainly no.
[11:31] censor?
[11:31] And if it's in the headers (and reliable), you can also just download the headers (i.e. first few kB or so?), then do the comparison, and close the connection if it's still the same.
[11:36] ivan: i think the discussion is a revisit of youtube's ability to mute copyright audio segments or enable channel owners to edit videos
[11:42] offtopic offcolor observation. ffprobe -colors lists "NavajoWhite == #ffdead"
[11:44] *** coderobe has quit IRC (Remote host closed the connection)
[11:45] I'm a little surprised someone who works on youtube-dl wouldn't know already, but I also feel it barely matters. In the face of uncertainty build the simplest / quickest thing that has a chance of working, and running that will either work or point you to what's needed to work. minimum viable product
[11:47] video editing is fairly cutting edge and rarely used
[11:57] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:57] *** BlueMax has joined #archiveteam-ot
[12:08] *** ntntn has quit IRC (Ping timeout: 260 seconds)
[12:10] *** schbirid has quit IRC (Remote host closed the connection)
[12:17] I am going to Melbourne tomorrow.
That means I get to experiment with znc
[12:18] *** ntntn has joined #archiveteam-ot
[12:19] *** coderobe has joined #archiveteam-ot
[12:25] re: 'yes anysoftkeyboard is nice', it's ok. it's no Hacker's Keyboard. having the workman thing is a very good bonus, but i'm really using it cos i needed a split layout
[12:29] *** ntntn has quit IRC ()
[12:38] *** bluefoo_ has quit IRC (Ping timeout: 745 seconds)
[13:16] ivan: I haven't saved any videos yet, as I'm still building my script, so I can't tell.
[13:18] Yes, Raccoon, this is because of YT muting audio and the ability to edit the video after it's been uploaded. Thanks YT, NOT what archivists need
[13:20] I also agree, markedL, I would think/hope one of the ytdl devs would know. Also, the way I do my programming is I try to get it perfect the first time. I don't mean to sound rude, it's just that I try to cover any and all bases from the start,
[13:23] what are you programming in
[13:25] you still need development testing, revisions, and re-re-revisions. might as well do some interactive coding / practice gets
[13:26] If you want to be absolutely sure, redownload and compare hashes. That's the only way to be really certain.
[13:27] the only way to archive things perfectly is to replicate the remote database exactly and run your own youtube-equivalent viewer
[13:28] if you run youtube-dl long enough you will run into things like youtube-dl extracting incorrect metadata
[13:28] depends on whether you want the new encoding formats to be flagged as new content or same content
[13:28] Yeah, if you want to detect video edits as opposed to reencoding, well, good luck.
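The "file size first, hash to be sure" idea from the messages above could look roughly like the sketch below. It assumes both the old copy and a fresh re-download are already on disk and that `sha256sum` (GNU coreutils) is available; `compare_copies` is a made-up helper name, not something from youtube-dl or from the script being discussed.

```shell
#!/usr/bin/env bash
# Sketch only: compare_copies is a hypothetical helper, not part of youtube-dl.
# Assumes sha256sum (GNU coreutils) is installed.

# Prints "same" if two local files match byte for byte, "changed" otherwise.
compare_copies() {
    local old="$1"
    local new="$2"

    # Cheap check first: different byte counts mean different files.
    if [ "$(wc -c < "$old")" -ne "$(wc -c < "$new")" ]; then
        echo "changed"
        return 0
    fi

    # Same size, so fall back to a full content hash of each file.
    local old_hash new_hash
    old_hash=$(sha256sum < "$old" | cut -d ' ' -f 1)
    new_hash=$(sha256sum < "$new" | cut -d ' ' -f 1)
    if [ "$old_hash" = "$new_hash" ]; then
        echo "same"
    else
        echo "changed"
    fi
}
```

Called as `compare_copies old.mkv new.mkv`, this only says whether the bytes differ. A re-encode without any edit would also come back as "changed", which is the limitation pointed out at 13:28.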
[13:29] if you want to waste less of your time do the thing that yields the highest ROI by capturing the most important stuff in a format probably usable in the future
[13:29] *** bluefoo has joined #archiveteam-ot
[13:31] there might be some metadata page scraping to find mentions of copyright mutings and author editing
[13:31] you'll also see things like youtube videos changing channel or usernames (e.g. whitehouse) becoming a different channel
[13:32] i think youtube is transparent about this so google doesn't get accused and abused for source citation tampering etc
[13:33] wouldn't want a video with 20 million views to start editing in obscene material, or selling videos with 20 million views to the highest bidder to replace with their own content
[13:51] how about inserting tangential video clips at regular intervals, from the highest bidder
[13:52] AKA ads
[14:18] *** DogsRNice has joined #archiveteam-ot
[14:22] *** DogsRNice has quit IRC (Remote host closed the connection)
[14:22] *** DogsRNice has joined #archiveteam-ot
[14:24] *** DogsRNice has quit IRC (Remote host closed the connection)
[14:24] *** DogsRNice has joined #archiveteam-ot
[15:39] *** deevious has quit IRC (Quit: deevious)
[16:21] *** Raccoon has quit IRC (Ping timeout: 258 seconds)
[16:40] *** icedice has joined #archiveteam-ot
[16:47] *** ats has quit IRC (leaving)
[17:03] *** ats has joined #archiveteam-ot
[17:14] *** VADemon has quit IRC (Quit: left4dead)
[17:18] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[18:02] Windows 10 has now defragmented my SSD so that it's back at 0% fragmentation, so that's nice
[18:03] It would be nice if Storage Optimizer could be run manually, but I suppose once a month is enough for most users
[18:41] Don't see the issue? https://usercontent.irccloud-cdn.com/file/OVaEyaG1/image.png
[19:01] *** bluefoo has quit IRC (Ping timeout: 496 seconds)
[19:51] *** bluefoo has joined #archiveteam-ot
[20:14] *** Meroje has quit IRC (Quit: bye!)
[20:22] You should not be literally defragging an SSD anyways
[20:23] *** Meroje has joined #archiveteam-ot
[20:26] *** bluefoo has quit IRC (Ping timeout: 252 seconds)
[20:30] let's not stir that pot again, we had that the other day
[20:30] oh sorry, didn't know
[20:32] *** bluefoo has joined #archiveteam-ot
[21:28] Ivy: It's an automatic intelligent defrag thing that Windows 10 runs once a month on all Windows 10 computers that have Storage Sense enabled
[21:29] https://www.hanselman.com/blog/TheRealAndCompleteStoryDoesWindowsDefragmentYourSSD.aspx
[21:30] "Storage Optimizer will defrag an SSD once a month if volume snapshots are enabled. This is by design and necessary due to slow volsnap copy on write performance on fragmented SSD volumes."
[21:30] -A developer on the Windows storage team
[21:30] Edit: Ok, it was volume snapshots and not Storage Sense, I misremembered that detail
[21:52] *** dhyan_nat has quit IRC (Read error: Operation timed out)
[21:57] *** Meroje has quit IRC (Quit: bye!)
[21:58] *** Meroje has joined #archiveteam-ot
[22:02] *** Meroje has quit IRC (Client Quit)
[22:02] *** Meroje has joined #archiveteam-ot
[22:06] *** Meroje has quit IRC (Client Quit)
[22:06] *** Meroje has joined #archiveteam-ot
[22:10] *** Meroje has quit IRC (Client Quit)
[22:10] *** Meroje has joined #archiveteam-ot
[22:11] *** NickN00b has joined #archiveteam-ot
[22:14] *** Meroje has quit IRC (Client Quit)
[22:14] *** Meroje has joined #archiveteam-ot
[22:28] *** ats_ has joined #archiveteam-ot
[22:29] *** ats has quit IRC (Read error: Operation timed out)
[22:32] *** ats_ has quit IRC (Read error: Operation timed out)
[22:59] *** phillipsj has joined #archiveteam-ot
[23:06] *** ScruffyB has quit IRC (Read error: Operation timed out)
[23:12] *** BlueMax has joined #archiveteam-ot
[23:27] *** ats has joined #archiveteam-ot
[23:27] ivan: It'
[23:27] ivan: It's written in bash
[23:30] And the formats I use are .json (metadata), .txt (description), .xml (annotations), .jpg/.png (thumbnail), .mkv (video), .vtt (subtitles)
[23:31] And yes, I've also taken into account changing usernames, hence why the folders consist of the channel id and display name in brackets, which will also change as the user modifies it.
[23:32] UCeR0n8d3ShTn_yrMhpwyE1Q [TheReportOfTheWeek]
[23:33] My script only cares about the beginning 24 characters. If the name changes, so does the folder
[23:33] UCeR0n8d3ShTn_yrMhpwyE1Q [ROTW]
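The folder convention described in the last few messages (a 24-character channel id, then the display name in brackets) might be handled along these lines. This is a guess at the approach, not the actual script: `channel_id_from_dir` and `sync_channel_dir` are hypothetical names, and the sketch assumes at most one folder per channel id under the archive root.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and not taken from the real script.

# Prints the channel id, i.e. the first 24 characters of the folder name.
channel_id_from_dir() {
    local name
    name=$(basename -- "$1")
    printf '%.24s\n' "$name"
}

# Makes sure "<root>/<id> [<display name>]" exists and prints its path.
# An existing folder for the same id with an older display name is renamed.
sync_channel_dir() {
    local root="$1"
    local id="$2"
    local display="$3"
    local want="$root/$id [$display]"
    local dir

    # Match on the id prefix only, so the bracketed name is free to change.
    for dir in "$root/$id "*; do
        [ -d "$dir" ] || continue
        if [ "$dir" != "$want" ]; then
            mv -- "$dir" "$want"
        fi
        printf '%s\n' "$want"
        return 0
    done

    # No folder for this channel yet: create it.
    mkdir -p -- "$want"
    printf '%s\n' "$want"
}
```

With a folder named `UCeR0n8d3ShTn_yrMhpwyE1Q [TheReportOfTheWeek]` already in place, calling `sync_channel_dir "$root" UCeR0n8d3ShTn_yrMhpwyE1Q ROTW` would rename it to `UCeR0n8d3ShTn_yrMhpwyE1Q [ROTW]`.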