#archiveteam-ot 2019-08-28,Wed

↑back Search

Time Nickname Message
01:19 🔗 killsushi has quit IRC (Quit: Leaving)
01:27 🔗 kjhota123 has joined #archiveteam-ot
01:28 🔗 kjhota123 has quit IRC (Client Quit)
02:42 🔗 m007a83 has quit IRC (Read error: Connection reset by peer)
02:58 🔗 m007a83 has joined #archiveteam-ot
03:01 🔗 Mateon1 has quit IRC (Remote host closed the connection)
03:01 🔗 Mateon1 has joined #archiveteam-ot
03:20 🔗 Mateon1 has quit IRC (Remote host closed the connection)
03:24 🔗 Mateon1 has joined #archiveteam-ot
03:35 🔗 systwiAL_ has joined #archiveteam-ot
03:51 🔗 systwiAL_ is now known as systwiALT
03:54 🔗 systwiALT Thanks for the info ivan_ and JAA. I hope I can piece this together okay; I tried following a guide on storing hierarchical info in postgres but the tutorial had some confusing typos.
03:54 🔗 systwiALT Hey, really quick, would ¬ be an allowed substitute for - in a table or column name?
04:01 🔗 godane has quit IRC (Ping timeout: 745 seconds)
04:13 🔗 godane has joined #archiveteam-ot
04:30 🔗 godane has quit IRC (Read error: Operation timed out)
05:43 🔗 godane has joined #archiveteam-ot
06:01 🔗 Atom-- has joined #archiveteam-ot
06:04 🔗 Atom has quit IRC (Read error: Operation timed out)
06:26 🔗 dhyan_nat has joined #archiveteam-ot
07:02 🔗 Terbium has quit IRC (Quit: https://quassel-irc.org - Chat comfortably. Anywhere.)
08:47 🔗 Dragnog2 has joined #archiveteam-ot
09:39 🔗 JAA systwiALT: I don't know what restrictions PostgreSQL imposes, but I've always tried to keep my column names to [a-z0-9_]. It might be possible to use other characters, but that's really asking for trouble one way or another, e.g. due to issues in certain client libraries. Best to just keep it to a minimal subset that's definitely supported everywhere.
09:51 🔗 VerifiedJ has joined #archiveteam-ot
09:51 🔗 BlueMax has quit IRC (Quit: Leaving)
10:04 🔗 hook54321 how bad of an idea is it to buy a refurbished or open box hard drive?
10:11 🔗 ivan_ Bad
10:13 🔗 hook54321 k, I won't consider this one then.
10:32 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
11:07 🔗 bluefoo has quit IRC (Remote host closed the connection)
12:35 🔗 h3ndr1k hook54321: What is an open box harddrive?? Never heard of it
12:41 🔗 h3ndr1k Oh wait, its box is just already opened. I blame english not being my native language.
12:42 🔗 h3ndr1k For some reason I thought of white box harddrive, which would be interesting.
12:44 🔗 h3ndr1k In the sense of white box pc
12:58 🔗 JAA Drives that were refurbished by the manufacturer (and not some shady third party) can be fine, but it always depends on what you want to use them for.
13:13 🔗 hook54321 I'm just looking for one to use in my laptop, I found some others though
14:23 🔗 bluefoo has joined #archiveteam-ot
15:29 🔗 Fusl ah
15:29 🔗 Fusl whoops
15:29 🔗 Fusl wrong tab
15:41 🔗 Fusl http://xor.meo.ws/4817f264/b18a/47cb/839a/2f6a9b4d3efa.png why do i not believe you?
15:46 🔗 systwiAL_ has joined #archiveteam-ot
15:56 🔗 systwiALT has quit IRC (Read error: Operation timed out)
15:57 🔗 bluefoo has quit IRC (Read error: Operation timed out)
16:28 🔗 kiska I've just got my Radeon VII, $1.1k Australian spent... :(
16:28 🔗 kiska Wish it was cheaper
16:40 🔗 systwiAL_ is now known as systwiALT
16:44 🔗 systwiALT JAA: I would normally just use _ and leave it, however some of the table names will (or might, idk yet) have the name of the youtube channel id. As we know, this can have both - and _, therefore I had wanted to use something like ¬ to avoid any collisions or misinformation
16:45 🔗 systwiALT I'm still working at trying to piece even a bare-bones template together, it's not too easy :-/
16:46 🔗 JAA Wait, why would you create a column with the name of a channel?
16:47 🔗 JAA That's almost certainly a terrible idea.
16:53 🔗 ivan_ systwiALT: did the three tables I gave you make sense
16:54 🔗 systwiALT JAA: Well, not exactly. A table.
16:54 🔗 ivan_ you don't need per-channel tables
16:55 🔗 systwiALT ivan_: They kinda did, but I wasn't sure how each channel ID would link to a column in a different table
16:55 🔗 JAA It wouldn't.
16:55 🔗 JAA You have tables with columns and rows.
16:55 🔗 systwiALT Additionally this part: files (video_id, file) PK (video_id, file) didn't make sense. Is the primary key video_id or file?
16:55 🔗 JAA The columns specify the structure of the data, the rows are the actual data.
16:56 🔗 ivan_ systwiALT: it's a composite of both columns
16:56 🔗 systwiALT Ok one sec let me try this again
16:57 🔗 ivan_ because you have channel ids in your videos table, if you wanted videos for one channel, you can just do SELECT * FROM videos WHERE channel_id = 'whatever'
16:57 🔗 JAA So if you had a simple table called "channels" which has just two fields (id INTEGER, name VARCHAR), then you'd have one row in this table for each channel you want to store.
16:58 🔗 ivan_ the channels table is really the least interesting part of the whole thing
16:58 🔗 ivan_ useful if you need to correlate channel ids and users
17:00 🔗 systwiALT What about the parents like "availability, channelname, and grabinfo? (excluding the video IDs)
17:01 🔗 systwiALT Like what the .json had
17:01 🔗 ivan_ parents?
17:01 🔗 systwiALT Parent-child relationships in .json
17:01 🔗 systwiALT Or object I guess is more specific
17:02 🔗 ivan_ is availability per-video?
17:02 🔗 systwiALT Per video and per channel
17:02 🔗 ivan_ ah so a column on both tables
17:03 🔗 ivan_ what's grabinfo?
17:07 🔗 systwiALT I'll take a screenshot in a second here
17:26 🔗 Dragnog94 has quit IRC (Read error: Operation timed out)
17:33 🔗 Raccoon has joined #archiveteam-ot
17:40 🔗 phillipsj has joined #archiveteam-ot
17:40 🔗 bluefoo has joined #archiveteam-ot
17:52 🔗 systwiALT Sorry for the delay ivan_. I can still take a screenshot but it will be difficult since it all doesn't fit on the screen. Did you try viewing the .json I linked to in a hierarchical JSON viewer?
17:58 🔗 ivan_ you mean https://gist.githubusercontent.com/systwi/413add02946e3a9cb087f6b4a8922687/raw/cc1feadc56a29237e9df2ed6ec786f8d5d81a164/youtube_database_sample.json
17:59 🔗 ivan_ I guess you have that on a channel right now
17:59 🔗 ivan_ I'm not sure what the point of it is :-)
18:00 🔗 ivan_ do you care about tracking the grabs you did on a channel, even though grabbing videos individually would have functionally equivalent results?
18:03 🔗 systwiALT Yes that file. It is just an example of one channel. The purpose is to track changes between every grab of an individual channel or that entire channel.
18:04 🔗 systwiALT I have already thought everything through, and the .json works well (in the sense of organization).
18:05 🔗 systwiALT I just need to convert this mess to SQL
18:05 🔗 ivan_ are you aware of the bugs in how YouTube returns paginated upload playlists
18:06 🔗 ivan_ https://ya.borg.xyz/logs/dl/UCN79wVFfg3yCeq0lEy0OzRg/2019-08-28T15_52_08.log
18:06 🔗 ivan_ for every "Ignoring duplicate" there's actually a video in that channel that YouTube is failing to list
18:06 🔗 systwiALT I would normally just send the channel url itself, not with /videos at the end
18:06 🔗 ivan_ so tracking the video IDs you get on each grab will have noisy variations that don't reflect real additions or removals in the channel
18:09 🔗 systwiALT Any grab to a channel, whether it be just a single video or the entire channel, would count as one grab. That number always increments by 1. The most up-to-date information will always be in the "curr" object. If that video is removed the "availability" boolean is set to false. If a video is added its video ID appears in the list under its respective channel id.
18:10 🔗 systwiALT So if the first time I grab a single video from a new channel, the grabnum is 1. If later, I grab a different video from that channel that counts as grab 2. If I then grab the entire channel the grabnum is 3.
18:10 🔗 ivan_ but what do you do with this information
18:11 🔗 systwiALT I will use it to keep track of changes to YT content.
18:11 🔗 systwiALT It's for personal use
18:12 🔗 systwiALT If I wanted to revert a video back to grabnum 2 (whether it be for a removed comment in the current one, different description, etc.) I could do that in my script
18:12 🔗 ivan_ you know the video itself can change, yes?
18:12 🔗 systwiALT Yep
18:12 🔗 systwiALT In the example I provided it didn't change
18:13 🔗 systwiALT But I have already planned out for if/when a video changes
18:13 🔗 ivan_ can you just keep both versions?
18:13 🔗 systwiALT The video file is treated the same as any other file (e.g. description)
18:13 🔗 systwiALT It will keep both versions
18:15 🔗 ivan_ it seems like your mostrecentchg would get overwritten pretty quickly and you'd lose deltas for previous grabs
18:15 🔗 systwiALT On the filesystem I will also have a curr and old folder. Inside of curr, as you guessed, will always have the most up-to-date data. Inside of the old folder, you will see grab1, grab2, basically any older grabs. Inside of those folders will be your older versions of the files.
18:15 🔗 systwiALT The database will keep track of which files were changed at whichever grab number.
18:15 🔗 ivan_ ah, so this thing reflects real changes in your storage, not changes in youtube
18:16 🔗 systwiALT Well, kinda. This will not be monitoring YouTube 24/7, what will happen is in my script I will download a video or channel and it will edit the database accordingly
18:17 🔗 ivan_ does "affectedvideos" : [ "z3aEv3EzMyQ" ] list new videos that you stored
18:17 🔗 ivan_ does it sometimes list something else?
18:17 🔗 ivan_ ah, you're redownloading info for videos that you already have?
18:18 🔗 systwiALT affectedvideos lists the video ID of whichever videos had information changed in them during the most recent grab
18:19 🔗 systwiALT Typically it will list several videos (comments are added, thumbnail might change, etc.)
18:19 🔗 ivan_ if you were storing this in SQL you could just have (video_id, retrieval_time) as the PK and SELECT * FROM videos WHERE video_id = 'whatever'; and compare the rows
18:20 🔗 ivan_ (you would get every version of the thing)
18:20 🔗 systwiALT My plan is to redownload and store information that is different from the file I currently have. If the comments.json file is the same in grabnum 1 as it is in grabnum 5 then grabnum 1's comments.json will be considered the most current and will not be redownloaded
18:21 🔗 systwiALT Once comments.json changes (let's say grabnum 6), it will move curr>grabnum 1>comments.json to "old" and curr>grabnum 6>comments.json will be in "curr"
18:22 🔗 ivan_ I would recommend rethinking this from scratch instead of porting your JSON ideas :-)
18:22 🔗 ivan_ are grabs an entity you really want to track? you _could_ but it doesn't seem necessary
18:23 🔗 ivan_ the changes happen on videos, who cares about the grab that was responsible for detecting the change
18:24 🔗 systwiALT Trust me, I have spent months on this. I really don't want to rethink this from scratch :-/ Also, "(video_id, retrieval_time) as the PK and SELECT * FROM videos WHERE video_id = 'whatever'; and compare the rows" doesn't make sense to me
18:24 🔗 systwiALT Yes, I want to track all of this information.
18:24 🔗 ivan_ you would get multiple rows out if you have multiple versions of the video
18:25 🔗 systwiALT I probably am not explaining this very thoroughly but to me this makes sense.
18:25 🔗 ivan_ do you also want to track when a video disappears (and possible reemerges?)
18:26 🔗 systwiALT ^ That I don't plan on tracking. For now I just have it as "availability", which if it
18:26 🔗 ivan_ you would have a row for each (video_id, retrieval_time) with the same video_id and different retrieval time
18:26 🔗 systwiALT ... which if it's offline then availableonline is false, if it's back up again it's true.
18:27 🔗 ivan_ is it making sense or is it not solving something
18:27 🔗 systwiALT I'm sorry this isn't making sense to me :( SQL is new to me
18:28 🔗 systwiALT I have only this so far:
18:28 🔗 systwiALT CREATE TABLE channels (channel_id TEXT NOT NULL, PRIMARY KEY (channel_id));
18:28 🔗 systwiALT And yes I have read through psql's docs and watched tutorials
18:29 🔗 ivan_ if you have a videos table with a PK on video_id you can only store one row with a certain video_id
18:29 🔗 ivan_ if you have a videos table with a PK on (video_id, retrieval_time) you can store multiple rows with the same video_id if they have a different retrieval_time
18:30 🔗 ivan_ only the entire key must be unique
18:32 🔗 * systwiALT sobs
18:33 🔗 ivan_ what's the confusing part
18:33 🔗 systwiALT "you can only store one row with a certain video_id"
18:34 🔗 systwiALT That sentence would be equivalent to: CREATE TABLE videos (video_id VARCHAR(11) NOT NULL PRIMARY KEY);
18:34 🔗 ivan_ a filesystem is keyed on a filenames, you can't have multiple files with the same filename
18:34 🔗 systwiALT Right?
18:34 🔗 ivan_ a SQL table is keyed on the PK, you can't have multiple rows with the same PK
18:34 🔗 ivan_ systwiALT: sure
18:35 🔗 ivan_ if you try to INSERT another row with the same PK it will refuse
18:35 🔗 systwiALT Ok that makes sense, so meaning: z3aEv3EzMyQ can be used only once under the video_id column if video_id is PK?
18:35 🔗 ivan_ sure
18:36 🔗 systwiALT But... that's kinda how I would like it. I mean, the video_id is unique anyway. No two videos share the same ID
18:36 🔗 ivan_ but you said you were grabbing videos multiple times and keeping different metadata
18:37 🔗 ivan_ or did I get that wrong
18:37 🔗 systwiALT No that's correct
18:37 🔗 ivan_ so you actually have multiple versions of a video and need to have multiple rows with the same video_id
18:37 🔗 systwiALT by retrieval_time do you mean grabnum?
18:37 🔗 ivan_ you could use the UTC timestamp of the time you started the grab for that video
18:38 🔗 systwiALT That's stored in grab_date
18:38 🔗 systwiALT I use grabnum because that way if one file takes 3 minutes to download but another takes a second in the same grab, it's easier to consider them both as grabnum 1 and store that grab time in the database
18:39 🔗 systwiALT than to have folders inside of "old" with each date/time
18:40 🔗 ivan_ you folder structure is relatively independent of the database because you have another files table mapping (video_id) or (video_id, retrieval_time) to the filenames for that grab
18:41 🔗 ivan_ are you re-grabbing video data?
18:41 🔗 systwiALT My script will check for changes to video.mkv (or video.mp4) and keep the file if its hash is different
18:42 🔗 systwiALT than the current one
18:42 🔗 ivan_ that's going to keep videos that have merely been re-encoded by youtube
18:44 🔗 systwiALT Heheh well then that's a bit of a problem. It's not a HUGE deal it just means there's a higher chance for me to have unnecessary duplicates
18:44 🔗 systwiALT I'll take the unnecessary dupes
18:45 🔗 systwiALT There's no way (afaik) to tell the differences other than hashes
18:45 🔗 ivan_ well, it's up to you how to represent things in your database, if you want grabnums and stuff because you think grabs are a meaningful entity then go ahead
18:46 🔗 ivan_ you need a table per entity
18:46 🔗 ivan_ per entity type
18:47 🔗 kiskabak has quit IRC (Remote host closed the connection)
18:47 🔗 kiskabak has joined #archiveteam-ot
18:47 🔗 Fusl sets mode: +o kiskabak
18:47 🔗 Fusl__ sets mode: +o kiskabak
18:47 🔗 Fusl_ sets mode: +o kiskabak
18:47 🔗 systwiALT Yes, I feel that they are. Example (to clarify): Grab #1, video.mkv is added, comments.json is added. Grab #2, video.mkv is the same (don't change), comments.json changes. This means at this point, video.mkv is the same in grab #1 as it is now. comments.json is different from what it was the first time it was grabbed.
18:48 🔗 systwiALT Therefore, curr = video.mkv (grabnum=1), comments.json (grabnum=2)
18:48 🔗 systwiALT old = comments.json (grabnum=1)
18:54 🔗 systwiALT One thing I trip up on is REFERENCES. I'm trying to follow this guide http://patshaughnessy.net/2017/12/11/trying-to-represent-a-tree-structure-using-postgres and it still doesn't click
18:57 🔗 ivan_ systwiALT: if you have a REFERENCES on something else, that other thing must exist, or it will refuse to let you INSERT/UPDATE to that value
18:58 🔗 ivan_ systwiALT: you don't need a tree structure because don't have entities that reference the same entity as a parent
18:58 🔗 ivan_ same entity type
18:59 🔗 kiskabak has quit IRC (Remote host closed the connection)
18:59 🔗 kiskabak has joined #archiveteam-ot
19:00 🔗 Fusl sets mode: +o kiskabak
19:00 🔗 Fusl__ sets mode: +o kiskabak
19:00 🔗 Fusl_ sets mode: +o kiskabak
19:00 🔗 ivan_ if you have a table of grabs, you could for example in another table have: grab integer REFERENCES grabs (id)
19:00 🔗 ivan_ then you would not be able to insert a bogus grab id
19:02 🔗 systwiALT I thought this was a tree structure: https://transfer.notkiska.pw/gTx3G/tree.tiff
19:02 🔗 ivan_ no, as far as I know you have normal relational database structure
19:03 🔗 systwiALT So REFERENCES isn't necessary anywhere in my application?
19:03 🔗 ivan_ entities with attributes with some one-to-many relationships
19:04 🔗 ivan_ sure but I highly recommend using REFERENCES to avoid keeping around bogus data pointing to things that don't exist
19:04 🔗 ivan_ REFERENCES / foreign key constraints will also prevent you from e.g. DROPing a grab that videos are pointing to
19:06 🔗 systwiALT Ahh I see
19:08 🔗 systwiALT Would I need to use an (id SERIAL PRIMARY KEY) and (parent_id INT REFERENCES channels (id)) in any of my tables like that tutorial did?
19:08 🔗 ivan_ (I meant DELETE not DROP)
19:08 🔗 ivan_ systwiALT: maybe not
19:09 🔗 systwiALT Ok
19:09 🔗 systwiALT Thank you so very much for your help. I'll take a lunch break and hopefully I can get somewhere with this. I'll fill you in with any updates
19:28 🔗 ivan_ systwiALT: the way to think of it is: you really do put every entity of the same type into one table, then use a PK or other indexed column to get the rows you need
20:15 🔗 DogsRNice has joined #archiveteam-ot
20:20 🔗 bluefoo has quit IRC (Ping timeout: 255 seconds)
20:25 🔗 superkuh has quit IRC (Remote host closed the connection)
20:26 🔗 bluefoo has joined #archiveteam-ot
20:28 🔗 superkuh has joined #archiveteam-ot
21:14 🔗 Fusl b 171
21:17 🔗 Dimtree has quit IRC ()
21:24 🔗 Joseph_ has joined #archiveteam-ot
21:28 🔗 VerifiedJ has quit IRC (Read error: Operation timed out)
21:31 🔗 Dimtree has joined #archiveteam-ot
22:27 🔗 icedice has joined #archiveteam-ot
22:32 🔗 BlueMax has joined #archiveteam-ot
22:36 🔗 Joseph_ has quit IRC (Read error: Connection reset by peer)
22:43 🔗 kiskabak has quit IRC (Read error: Operation timed out)
22:50 🔗 jeekl has quit IRC (Ping timeout: 745 seconds)
23:06 🔗 bluefoo has quit IRC (Read error: Operation timed out)
23:26 🔗 Maylay has quit IRC (Quit: Pipe Terminated)
23:30 🔗 jeekl has joined #archiveteam-ot
23:33 🔗 JH881 has quit IRC (Ping timeout: 252 seconds)
23:37 🔗 JH881 has joined #archiveteam-ot
23:42 🔗 kiskabak has joined #archiveteam-ot
23:42 🔗 Fusl sets mode: +o kiskabak
23:42 🔗 Fusl__ sets mode: +o kiskabak
23:42 🔗 Fusl_ sets mode: +o kiskabak
23:46 🔗 JH8813 has joined #archiveteam-ot
23:46 🔗 JH881 has quit IRC (Ping timeout: 252 seconds)

irclogger-viewer