#archiveteam-ot 2019-08-28,Wed


Time	Nickname	Message
01:19 ^🔗		killsushi has quit IRC (Quit: Leaving)
01:27 ^🔗		kjhota123 has joined #archiveteam-ot
01:28 ^🔗		kjhota123 has quit IRC (Client Quit)
02:42 ^🔗		m007a83 has quit IRC (Read error: Connection reset by peer)
02:58 ^🔗		m007a83 has joined #archiveteam-ot
03:01 ^🔗		Mateon1 has quit IRC (Remote host closed the connection)
03:01 ^🔗		Mateon1 has joined #archiveteam-ot
03:20 ^🔗		Mateon1 has quit IRC (Remote host closed the connection)
03:24 ^🔗		Mateon1 has joined #archiveteam-ot
03:35 ^🔗		systwiAL_ has joined #archiveteam-ot
03:51 ^🔗		systwiAL_ is now known as systwiALT
03:54 ^🔗	systwiALT	Thanks for the info ivan_ and JAA. I hope I can piece this together okay; I tried following a guide on storing hierarchical info in postgres but the tutorial had some confusing typos.
03:54 ^🔗	systwiALT	Hey, really quick, would ¬ be an allowed substitute for - in a table or column name?
04:01 ^🔗		godane has quit IRC (Ping timeout: 745 seconds)
04:13 ^🔗		godane has joined #archiveteam-ot
04:30 ^🔗		godane has quit IRC (Read error: Operation timed out)
05:43 ^🔗		godane has joined #archiveteam-ot
06:01 ^🔗		Atom-- has joined #archiveteam-ot
06:04 ^🔗		Atom has quit IRC (Read error: Operation timed out)
06:26 ^🔗		dhyan_nat has joined #archiveteam-ot
07:02 ^🔗		Terbium has quit IRC (Quit: https://quassel-irc.org - Chat comfortably. Anywhere.)
08:47 ^🔗		Dragnog2 has joined #archiveteam-ot
09:39 ^🔗	JAA	systwiALT: I don't know what restrictions PostgreSQL imposes, but I've always tried to keep my column names to [a-z0-9_]. It might be possible to use other characters, but that's really asking for trouble one way or another, e.g. due to issues in certain client libraries. Best to just keep it to a minimal subset that's definitely supported everywhere.
09:51 ^🔗		VerifiedJ has joined #archiveteam-ot
09:51 ^🔗		BlueMax has quit IRC (Quit: Leaving)
10:04 ^🔗	hook54321	how bad of an idea is it to buy a refurbished or open box hard drive?
10:11 ^🔗	ivan_	Bad
10:13 ^🔗	hook54321	k, I won't consider this one then.
10:32 ^🔗		dhyan_nat has quit IRC (Read error: Operation timed out)
11:07 ^🔗		bluefoo has quit IRC (Remote host closed the connection)
12:35 ^🔗	h3ndr1k	hook54321: What is an open box harddrive?? Never heard of it
12:41 ^🔗	h3ndr1k	Oh wait, its box is just already opened. I blame english not being my native language.
12:42 ^🔗	h3ndr1k	For some reason I thought of white box harddrive, which would be interesting.
12:44 ^🔗	h3ndr1k	In the sense of white box pc
12:58 ^🔗	JAA	Drives that were refurbished by the manufacturer (and not some shady third party) can be fine, but it always depends on what you want to use them for.
13:13 ^🔗	hook54321	I'm just looking for one to use in my laptop, I found some others though
14:23 ^🔗		bluefoo has joined #archiveteam-ot
15:29 ^🔗	Fusl	ah
15:29 ^🔗	Fusl	whoops
15:29 ^🔗	Fusl	wrong tab
15:41 ^🔗	Fusl	http://xor.meo.ws/4817f264/b18a/47cb/839a/2f6a9b4d3efa.png why do i not believe you?
15:46 ^🔗		systwiAL_ has joined #archiveteam-ot
15:56 ^🔗		systwiALT has quit IRC (Read error: Operation timed out)
15:57 ^🔗		bluefoo has quit IRC (Read error: Operation timed out)
16:28 ^🔗	kiska	I've just got my Radeon VII, $1.1k Australian spent... :(
16:28 ^🔗	kiska	Wish it was cheaper
16:40 ^🔗		systwiAL_ is now known as systwiALT
16:44 ^🔗	systwiALT	JAA: I would normally just use _ and leave it, however some of the table names will (or might, idk yet) have the name of the youtube channel id. As we know, this can have both - and _, therefore I had wanted to use something like ¬ to avoid any collisions or misinformation
16:45 ^🔗	systwiALT	I'm still working at trying to piece even a bare-bones template together, it's not too easy :-/
16:46 ^🔗	JAA	Wait, why would you create a column with the name of a channel?
16:47 ^🔗	JAA	That's almost certainly a terrible idea.
16:53 ^🔗	ivan_	systwiALT: did the three tables I gave you make sense
16:54 ^🔗	systwiALT	JAA: Well, not exactly. A table.
16:54 ^🔗	ivan_	you don't need per-channel tables
16:55 ^🔗	systwiALT	ivan_: They kinda did, but I wasn't sure how each channel ID would link to a column in a different table
16:55 ^🔗	JAA	It wouldn't.
16:55 ^🔗	JAA	You have tables with columns and rows.
16:55 ^🔗	systwiALT	Additionally this part: files (video_id, file) PK (video_id, file) didn't make sense. Is the primary key video_id or file?
16:55 ^🔗	JAA	The columns specify the structure of the data, the rows are the actual data.
16:56 ^🔗	ivan_	systwiALT: it's a composite of both columns
16:56 ^🔗	systwiALT	Ok one sec let me try this again
16:57 ^🔗	ivan_	because you have channel ids in your videos table, if you wanted videos for one channel, you can just do SELECT * FROM videos WHERE channel_id = 'whatever'
16:57 ^🔗	JAA	So if you had a simple table called "channels" which has just two fields (id INTEGER, name VARCHAR), then you'd have one row in this table for each channel you want to store.
16:58 ^🔗	ivan_	the channels table is really the least interesting part of the whole thing
16:58 ^🔗	ivan_	useful if you need to correlate channel ids and users
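A minimal sketch of the layout ivan_ and JAA describe: one table per entity type, with the channel ID stored as row data rather than in a table name, so `-` and `_` in IDs need no substitute character. SQLite from Python's standard library stands in for PostgreSQL here. The `name` column follows JAA's example, while the sample IDs are made-up placeholders and `videos_for_channel` is a hypothetical helper.

```python
import sqlite3

# In-memory SQLite as a stand-in for PostgreSQL (an assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE channels (
        channel_id TEXT NOT NULL PRIMARY KEY,
        name       TEXT
    );
    CREATE TABLE videos (
        video_id   TEXT NOT NULL PRIMARY KEY,
        channel_id TEXT NOT NULL
    );
    CREATE TABLE files (
        video_id TEXT NOT NULL,
        file     TEXT NOT NULL,
        PRIMARY KEY (video_id, file)  -- composite key: the pair must be unique
    );
""")

# Placeholder IDs (not real channels or videos); note the '-' and '_' characters.
conn.executemany("INSERT INTO channels VALUES (?, ?)",
                 [("UC-chan_A", "Channel A"), ("UC-chan_B", "Channel B")])
conn.executemany("INSERT INTO videos VALUES (?, ?)",
                 [("vid-0000001", "UC-chan_A"),
                  ("vid_0000002", "UC-chan_A"),
                  ("vid-0000003", "UC-chan_B")])
# The same video_id appears twice in files; only (video_id, file) must be unique.
conn.executemany("INSERT INTO files VALUES (?, ?)",
                 [("vid-0000001", "video.mkv"),
                  ("vid-0000001", "comments.json")])


def videos_for_channel(channel_id):
    """Return the video IDs stored for one channel, sorted by ID."""
    rows = conn.execute(
        "SELECT video_id FROM videos WHERE channel_id = ? ORDER BY video_id",
        (channel_id,),
    )
    return [row[0] for row in rows]
```

Calling `videos_for_channel("UC-chan_A")` returns only that channel's rows, which is the job a per-channel table would otherwise have done.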
17:00 ^🔗	systwiALT	What about the parents like "availability", "channelname", and "grabinfo"? (excluding the video IDs)
17:01 ^🔗	systwiALT	Like what the .json had
17:01 ^🔗	ivan_	parents?
17:01 ^🔗	systwiALT	Parent-child relationships in .json
17:01 ^🔗	systwiALT	Or object I guess is more specific
17:02 ^🔗	ivan_	is availability per-video?
17:02 ^🔗	systwiALT	Per video and per channel
17:02 ^🔗	ivan_	ah so a column on both tables
17:03 ^🔗	ivan_	what's grabinfo?
17:07 ^🔗	systwiALT	I'll take a screenshot in a second here
17:26 ^🔗		Dragnog94 has quit IRC (Read error: Operation timed out)
17:33 ^🔗		Raccoon has joined #archiveteam-ot
17:40 ^🔗		phillipsj has joined #archiveteam-ot
17:40 ^🔗		bluefoo has joined #archiveteam-ot
17:52 ^🔗	systwiALT	Sorry for the delay ivan_. I can still take a screenshot but it will be difficult since it all doesn't fit on the screen. Did you try viewing the .json I linked to in a hierarchical JSON viewer?
17:58 ^🔗	ivan_	you mean https://gist.githubusercontent.com/systwi/413add02946e3a9cb087f6b4a8922687/raw/cc1feadc56a29237e9df2ed6ec786f8d5d81a164/youtube_database_sample.json
17:59 ^🔗	ivan_	I guess you have that on a channel right now
17:59 ^🔗	ivan_	I'm not sure what the point of it is :-)
18:00 ^🔗	ivan_	do you care about tracking the grabs you did on a channel, even though grabbing videos individually would have functionally equivalent results?
18:03 ^🔗	systwiALT	Yes that file. It is just an example of one channel. The purpose is to track changes between every grab of an individual channel or that entire channel.
18:04 ^🔗	systwiALT	I have already thought everything through, and the .json works well (in the sense of organization).
18:05 ^🔗	systwiALT	I just need to convert this mess to SQL
18:05 ^🔗	ivan_	are you aware of the bugs in how YouTube returns paginated upload playlists
18:06 ^🔗	ivan_	https://ya.borg.xyz/logs/dl/UCN79wVFfg3yCeq0lEy0OzRg/2019-08-28T15_52_08.log
18:06 ^🔗	ivan_	for every "Ignoring duplicate" there's actually a video in that channel that YouTube is failing to list
18:06 ^🔗	systwiALT	I would normally just send the channel url itself, not with /videos at the end
18:06 ^🔗	ivan_	so tracking the video IDs you get on each grab will have noisy variations that don't reflect real additions or removals in the channel
18:09 ^🔗	systwiALT	Any grab to a channel, whether it be just a single video or the entire channel, would count as one grab. That number always increments by 1. The most up-to-date information will always be in the "curr" object. If that video is removed the "availability" boolean is set to false. If a video is added its video ID appears in the list under its respective channel id.
18:10 ^🔗	systwiALT	So if the first time I grab a single video from a new channel, the grabnum is 1. If later, I grab a different video from that channel that counts as grab 2. If I then grab the entire channel the grabnum is 3.
18:10 ^🔗	ivan_	but what do you do with this information
18:11 ^🔗	systwiALT	I will use it to keep track of changes to YT content.
18:11 ^🔗	systwiALT	It's for personal use
18:12 ^🔗	systwiALT	If I wanted to revert a video back to grabnum 2 (whether it be for a removed comment in the current one, different description, etc.) I could do that in my script
18:12 ^🔗	ivan_	you know the video itself can change, yes?
18:12 ^🔗	systwiALT	Yep
18:12 ^🔗	systwiALT	In the example I provided it didn't change
18:13 ^🔗	systwiALT	But I have already planned out for if/when a video changes
18:13 ^🔗	ivan_	can you just keep both versions?
18:13 ^🔗	systwiALT	The video file is treated the same as any other file (e.g. description)
18:13 ^🔗	systwiALT	It will keep both versions
18:15 ^🔗	ivan_	it seems like your mostrecentchg would get overwritten pretty quickly and you'd lose deltas for previous grabs
18:15 ^🔗	systwiALT	On the filesystem I will also have a curr and old folder. Inside of curr, as you guessed, will always have the most up-to-date data. Inside of the old folder, you will see grab1, grab2, basically any older grabs. Inside of those folders will be your older versions of the files.
18:15 ^🔗	systwiALT	The database will keep track of which files were changed at whichever grab number.
18:15 ^🔗	ivan_	ah, so this thing reflects real changes in your storage, not changes in youtube
18:16 ^🔗	systwiALT	Well, kinda. This will not be monitoring YouTube 24/7, what will happen is in my script I will download a video or channel and it will edit the database accordingly
18:17 ^🔗	ivan_	does "affectedvideos" : [ "z3aEv3EzMyQ" ] list new videos that you stored
18:17 ^🔗	ivan_	does it sometimes list something else?
18:17 ^🔗	ivan_	ah, you're redownloading info for videos that you already have?
18:18 ^🔗	systwiALT	affectedvideos lists the video ID of whichever videos had information changed in them during the most recent grab
18:19 ^🔗	systwiALT	Typically it will list several videos (comments are added, thumbnail might change, etc.)
18:19 ^🔗	ivan_	if you were storing this in SQL you could just have (video_id, retrieval_time) as the PK and SELECT * FROM videos WHERE video_id = 'whatever'; and compare the rows
18:20 ^🔗	ivan_	(you would get every version of the thing)
18:20 ^🔗	systwiALT	My plan is to redownload and store information that is different from the file I currently have. If the comments.json file is the same in grabnum 1 as it is in grabnum 5 then grabnum 1's comments.json will be considered the most current and will not be redownloaded
18:21 ^🔗	systwiALT	Once comments.json changes (let's say grabnum 6), it will move curr>grabnum 1>comments.json to "old" and curr>grabnum 6>comments.json will be in "curr"
18:22 ^🔗	ivan_	I would recommend rethinking this from scratch instead of porting your JSON ideas :-)
18:22 ^🔗	ivan_	are grabs an entity you really want to track? you _could_ but it doesn't seem necessary
18:23 ^🔗	ivan_	the changes happen on videos, who cares about the grab that was responsible for detecting the change
18:24 ^🔗	systwiALT	Trust me, I have spent months on this. I really don't want to rethink this from scratch :-/ Also, "(video_id, retrieval_time) as the PK and SELECT * FROM videos WHERE video_id = 'whatever'; and compare the rows" doesn't make sense to me
18:24 ^🔗	systwiALT	Yes, I want to track all of this information.
18:24 ^🔗	ivan_	you would get multiple rows out if you have multiple versions of the video
18:25 ^🔗	systwiALT	I probably am not explaining this very thoroughly but to me this makes sense.
18:25 ^🔗	ivan_	do you also want to track when a video disappears (and possibly reemerges)?
18:26 ^🔗	systwiALT	^ That I don't plan on tracking. For now I just have it as "availability", which if it
18:26 ^🔗	ivan_	you would have a row for each (video_id, retrieval_time) with the same video_id and different retrieval time
18:26 ^🔗	systwiALT	... which if it's offline then availableonline is false, if it's back up again it's true.
18:27 ^🔗	ivan_	is it making sense or is it not solving something
18:27 ^🔗	systwiALT	I'm sorry this isn't making sense to me :( SQL is new to me
18:28 ^🔗	systwiALT	I have only this so far:
18:28 ^🔗	systwiALT	CREATE TABLE channels (channel_id TEXT NOT NULL, PRIMARY KEY (channel_id));
18:28 ^🔗	systwiALT	And yes I have read through psql's docs and watched tutorials
18:29 ^🔗	ivan_	if you have a videos table with a PK on video_id you can only store one row with a certain video_id
18:29 ^🔗	ivan_	if you have a videos table with a PK on (video_id, retrieval_time) you can store multiple rows with the same video_id if they have a different retrieval_time
18:30 ^🔗	ivan_	only the entire key must be unique
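The composite-key point above can be sketched as follows, again with SQLite standing in for PostgreSQL. The `title` column and the `store_version` helper are assumptions added for illustration, and the timestamps are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE videos (
        video_id       TEXT NOT NULL,
        retrieval_time TEXT NOT NULL,  -- UTC timestamp stored as ISO 8601 text
        title          TEXT,           -- hypothetical metadata column
        PRIMARY KEY (video_id, retrieval_time)
    )
""")


def store_version(video_id, retrieval_time, title):
    """Insert one version; return False if that exact key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO videos VALUES (?, ?, ?)",
                         (video_id, retrieval_time, title))
        return True
    except sqlite3.IntegrityError:
        return False


# Same video_id, different retrieval_time: both rows are accepted.
first = store_version("z3aEv3EzMyQ", "2019-08-28T15:52:08Z", "old title")
second = store_version("z3aEv3EzMyQ", "2019-08-29T10:00:00Z", "new title")
# Same video_id AND same retrieval_time: the whole key collides, so it is refused.
duplicate = store_version("z3aEv3EzMyQ", "2019-08-29T10:00:00Z", "again")

# Every stored version of the video comes back, oldest first, ready to compare.
versions = conn.execute(
    "SELECT retrieval_time, title FROM videos "
    "WHERE video_id = ? ORDER BY retrieval_time",
    ("z3aEv3EzMyQ",),
).fetchall()
```

With a primary key on `video_id` alone, the second insert would have been refused as well.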
18:32 ^🔗	*	systwiALT sobs
18:33 ^🔗	ivan_	what's the confusing part
18:33 ^🔗	systwiALT	"you can only store one row with a certain video_id"
18:34 ^🔗	systwiALT	That sentence would be equivalent to: CREATE TABLE videos (video_id VARCHAR(11) NOT NULL PRIMARY KEY);
18:34 ^🔗	ivan_	a filesystem is keyed on filenames, you can't have multiple files with the same filename
18:34 ^🔗	systwiALT	Right?
18:34 ^🔗	ivan_	a SQL table is keyed on the PK, you can't have multiple rows with the same PK
18:34 ^🔗	ivan_	systwiALT: sure
18:35 ^🔗	ivan_	if you try to INSERT another row with the same PK it will refuse
18:35 ^🔗	systwiALT	Ok that makes sense, so meaning: z3aEv3EzMyQ can be used only once under the video_id column if video_id is PK?
18:35 ^🔗	ivan_	sure
18:36 ^🔗	systwiALT	But... that's kinda how I would like it. I mean, the video_id is unique anyway. No two videos share the same ID
18:36 ^🔗	ivan_	but you said you were grabbing videos multiple times and keeping different metadata
18:37 ^🔗	ivan_	or did I get that wrong
18:37 ^🔗	systwiALT	No that's correct
18:37 ^🔗	ivan_	so you actually have multiple versions of a video and need to have multiple rows with the same video_id
18:37 ^🔗	systwiALT	by retrieval_time do you mean grabnum?
18:37 ^🔗	ivan_	you could use the UTC timestamp of the time you started the grab for that video
18:38 ^🔗	systwiALT	That's stored in grab_date
18:38 ^🔗	systwiALT	I use grabnum because that way if one file takes 3 minutes to download but another takes a second in the same grab, it's easier to consider them both as grabnum 1 and store that grab time in the database
18:39 ^🔗	systwiALT	than to have folders inside of "old" with each date/time
18:40 ^🔗	ivan_	your folder structure is relatively independent of the database because you have another files table mapping (video_id) or (video_id, retrieval_time) to the filenames for that grab
18:41 ^🔗	ivan_	are you re-grabbing video data?
18:41 ^🔗	systwiALT	My script will check for changes to video.mkv (or video.mp4) and keep the file if its hash is different
18:42 ^🔗	systwiALT	than the current one
18:42 ^🔗	ivan_	that's going to keep videos that have merely been re-encoded by youtube
18:44 ^🔗	systwiALT	Heheh well then that's a bit of a problem. It's not a HUGE deal it just means there's a higher chance for me to have unnecessary duplicates
18:44 ^🔗	systwiALT	I'll take the unnecessary dupes
18:45 ^🔗	systwiALT	There's no way (afaik) to tell the differences other than hashes
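A rough sketch of the hash comparison discussed above. The chat does not name a hash function, so SHA-256 is an assumption, and both function names are hypothetical. As ivan_ notes, a file that YouTube has merely re-encoded will also hash differently, so this detects byte-level changes only.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large videos are never loaded into memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(new_path, current_path):
    """True if the freshly grabbed file differs byte-for-byte from the current one."""
    return file_sha256(new_path) != file_sha256(current_path)
```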
18:45 ^🔗	ivan_	well, it's up to you how to represent things in your database, if you want grabnums and stuff because you think grabs are a meaningful entity then go ahead
18:46 ^🔗	ivan_	you need a table per entity
18:46 ^🔗	ivan_	per entity type
18:47 ^🔗		kiskabak has quit IRC (Remote host closed the connection)
18:47 ^🔗		kiskabak has joined #archiveteam-ot
18:47 ^🔗		Fusl sets mode: +o kiskabak
18:47 ^🔗		Fusl__ sets mode: +o kiskabak
18:47 ^🔗		Fusl_ sets mode: +o kiskabak
18:47 ^🔗	systwiALT	Yes, I feel that they are. Example (to clarify): Grab #1, video.mkv is added, comments.json is added. Grab #2, video.mkv is the same (don't change), comments.json changes. This means at this point, video.mkv is the same in grab #1 as it is now. comments.json is different from what it was the first time it was grabbed.
18:48 ^🔗	systwiALT	Therefore, curr = video.mkv (grabnum=1), comments.json (grabnum=2)
18:48 ^🔗	systwiALT	old = comments.json (grabnum=1)
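One possible way to express the curr/old example above in SQL, sketched with SQLite in place of PostgreSQL. The `files` table with a `grabnum` column is an assumption about how the design might be modelled, not a schema anyone in the channel settled on, and `current_files` and `old_files` are hypothetical helpers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE files (
        video_id TEXT NOT NULL,
        file     TEXT NOT NULL,
        grabnum  INTEGER NOT NULL,  -- grab in which this version of the file was stored
        PRIMARY KEY (video_id, file, grabnum)
    )
""")
# Grab 1 stores both files; grab 2 stores only the changed comments.json.
conn.executemany("INSERT INTO files VALUES (?, ?, ?)", [
    ("z3aEv3EzMyQ", "video.mkv", 1),
    ("z3aEv3EzMyQ", "comments.json", 1),
    ("z3aEv3EzMyQ", "comments.json", 2),
])


def current_files(video_id):
    """Map each file name to the grabnum of its newest stored version ("curr")."""
    rows = conn.execute(
        "SELECT file, MAX(grabnum) FROM files WHERE video_id = ? GROUP BY file",
        (video_id,),
    )
    return dict(rows.fetchall())


def old_files(video_id):
    """List (file, grabnum) pairs superseded by a newer version ("old")."""
    rows = conn.execute(
        """
        SELECT f.file, f.grabnum
        FROM files AS f
        WHERE f.video_id = ?
          AND f.grabnum < (SELECT MAX(g.grabnum) FROM files AS g
                           WHERE g.video_id = f.video_id AND g.file = f.file)
        ORDER BY f.file, f.grabnum
        """,
        (video_id,),
    )
    return rows.fetchall()
```

Here curr and old are derived by a query instead of being stored, so an earlier grab's state is not overwritten when a later grab adds rows.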
18:54 ^🔗	systwiALT	One thing I trip up on is REFERENCES. I'm trying to follow this guide http://patshaughnessy.net/2017/12/11/trying-to-represent-a-tree-structure-using-postgres and it still doesn't click
18:57 ^🔗	ivan_	systwiALT: if you have a REFERENCES on something else, that other thing must exist, or it will refuse to let you INSERT/UPDATE to that value
18:58 ^🔗	ivan_	systwiALT: you don't need a tree structure because you don't have entities that reference the same entity as a parent
18:58 ^🔗	ivan_	same entity type
18:59 ^🔗		kiskabak has quit IRC (Remote host closed the connection)
18:59 ^🔗		kiskabak has joined #archiveteam-ot
19:00 ^🔗		Fusl sets mode: +o kiskabak
19:00 ^🔗		Fusl__ sets mode: +o kiskabak
19:00 ^🔗		Fusl_ sets mode: +o kiskabak
19:00 ^🔗	ivan_	if you have a table of grabs, you could for example in another table have: grab integer REFERENCES grabs (id)
19:00 ^🔗	ivan_	then you would not be able to insert a bogus grab id
19:02 ^🔗	systwiALT	I thought this was a tree structure: https://transfer.notkiska.pw/gTx3G/tree.tiff
19:02 ^🔗	ivan_	no, as far as I know you have normal relational database structure
19:03 ^🔗	systwiALT	So REFERENCES isn't necessary anywhere in my application?
19:03 ^🔗	ivan_	entities with attributes with some one-to-many relationships
19:04 ^🔗	ivan_	sure but I highly recommend using REFERENCES to avoid keeping around bogus data pointing to things that don't exist
19:04 ^🔗	ivan_	REFERENCES / foreign key constraints will also prevent you from e.g. DROPing a grab that videos are pointing to
19:06 ^🔗	systwiALT	Ahh I see
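The REFERENCES behaviour ivan_ describes can be tried out as below. This is a sketch using SQLite, which only enforces foreign keys after `PRAGMA foreign_keys = ON`, whereas PostgreSQL enforces them by default. The `grab_date` column comes from the chat, and `try_sql` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required in SQLite, not in PostgreSQL
conn.executescript("""
    CREATE TABLE grabs (
        id        INTEGER PRIMARY KEY,
        grab_date TEXT NOT NULL
    );
    CREATE TABLE videos (
        video_id TEXT NOT NULL,
        grab     INTEGER NOT NULL REFERENCES grabs (id),
        PRIMARY KEY (video_id, grab)
    );
""")


def try_sql(sql, params=()):
    """Run one statement; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


ok_grab = try_sql("INSERT INTO grabs VALUES (1, '2019-08-28T15:52:08Z')")
ok_video = try_sql("INSERT INTO videos VALUES (?, ?)", ("z3aEv3EzMyQ", 1))
# Grab 99 does not exist, so the foreign key refuses the row.
bogus = try_sql("INSERT INTO videos VALUES (?, ?)", ("z3aEv3EzMyQ", 99))
# Grab 1 is still referenced by a video, so deleting it is refused too.
deleted = try_sql("DELETE FROM grabs WHERE id = 1")
```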
19:08 ^🔗	systwiALT	Would I need to use an (id SERIAL PRIMARY KEY) and (parent_id INT REFERENCES channels (id)) in any of my tables like that tutorial did?
19:08 ^🔗	ivan_	(I meant DELETE not DROP)
19:08 ^🔗	ivan_	systwiALT: maybe not
19:09 ^🔗	systwiALT	Ok
19:09 ^🔗	systwiALT	Thank you so very much for your help. I'll take a lunch break and hopefully I can get somewhere with this. I'll fill you in with any updates
19:28 ^🔗	ivan_	systwiALT: the way to think of it is: you really do put every entity of the same type into one table, then use a PK or other indexed column to get the rows you need
20:15 ^🔗		DogsRNice has joined #archiveteam-ot
20:20 ^🔗		bluefoo has quit IRC (Ping timeout: 255 seconds)
20:25 ^🔗		superkuh has quit IRC (Remote host closed the connection)
20:26 ^🔗		bluefoo has joined #archiveteam-ot
20:28 ^🔗		superkuh has joined #archiveteam-ot
21:14 ^🔗	Fusl	b 171
21:17 ^🔗		Dimtree has quit IRC ()
21:24 ^🔗		Joseph_ has joined #archiveteam-ot
21:28 ^🔗		VerifiedJ has quit IRC (Read error: Operation timed out)
21:31 ^🔗		Dimtree has joined #archiveteam-ot
22:27 ^🔗		icedice has joined #archiveteam-ot
22:32 ^🔗		BlueMax has joined #archiveteam-ot
22:36 ^🔗		Joseph_ has quit IRC (Read error: Connection reset by peer)
22:43 ^🔗		kiskabak has quit IRC (Read error: Operation timed out)
22:50 ^🔗		jeekl has quit IRC (Ping timeout: 745 seconds)
23:06 ^🔗		bluefoo has quit IRC (Read error: Operation timed out)
23:26 ^🔗		Maylay has quit IRC (Quit: Pipe Terminated)
23:30 ^🔗		jeekl has joined #archiveteam-ot
23:33 ^🔗		JH881 has quit IRC (Ping timeout: 252 seconds)
23:37 ^🔗		JH881 has joined #archiveteam-ot
23:42 ^🔗		kiskabak has joined #archiveteam-ot
23:42 ^🔗		Fusl sets mode: +o kiskabak
23:42 ^🔗		Fusl__ sets mode: +o kiskabak
23:42 ^🔗		Fusl_ sets mode: +o kiskabak
23:46 ^🔗		JH8813 has joined #archiveteam-ot
23:46 ^🔗		JH881 has quit IRC (Ping timeout: 252 seconds)
