Time |
Nickname |
Message |
01:19
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
01:27
🔗
|
|
kjhota123 has joined #archiveteam-ot |
01:28
🔗
|
|
kjhota123 has quit IRC (Client Quit) |
02:42
🔗
|
|
m007a83 has quit IRC (Read error: Connection reset by peer) |
02:58
🔗
|
|
m007a83 has joined #archiveteam-ot |
03:01
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
03:01
🔗
|
|
Mateon1 has joined #archiveteam-ot |
03:20
🔗
|
|
Mateon1 has quit IRC (Remote host closed the connection) |
03:24
🔗
|
|
Mateon1 has joined #archiveteam-ot |
03:35
🔗
|
|
systwiAL_ has joined #archiveteam-ot |
03:51
🔗
|
|
systwiAL_ is now known as systwiALT |
03:54
🔗
|
systwiALT |
Thanks for the info ivan_ and JAA. I hope I can piece this together okay; I tried following a guide on storing hierarchical info in postgres but the tutorial had some confusing typos. |
03:54
🔗
|
systwiALT |
Hey, really quick, would ¬ be an allowed substitute for - in a table or column name? |
04:01
🔗
|
|
godane has quit IRC (Ping timeout: 745 seconds) |
04:13
🔗
|
|
godane has joined #archiveteam-ot |
04:30
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
05:43
🔗
|
|
godane has joined #archiveteam-ot |
06:01
🔗
|
|
Atom-- has joined #archiveteam-ot |
06:04
🔗
|
|
Atom has quit IRC (Read error: Operation timed out) |
06:26
🔗
|
|
dhyan_nat has joined #archiveteam-ot |
07:02
🔗
|
|
Terbium has quit IRC (Quit: https://quassel-irc.org - Chat comfortably. Anywhere.) |
08:47
🔗
|
|
Dragnog2 has joined #archiveteam-ot |
09:39
🔗
|
JAA |
systwiALT: I don't know what restrictions PostgreSQL imposes, but I've always tried to keep my column names to [a-z0-9_]. It might be possible to use other characters, but that's really asking for trouble one way or another, e.g. due to issues in certain client libraries. Best to just keep it to a minimal subset that's definitely supported everywhere. |
09:51
🔗
|
|
VerifiedJ has joined #archiveteam-ot |
09:51
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
10:04
🔗
|
hook54321 |
how bad of an idea is it to buy a refurbished or open box hard drive? |
10:11
🔗
|
ivan_ |
Bad |
10:13
🔗
|
hook54321 |
k, I won't consider this one then. |
10:32
🔗
|
|
dhyan_nat has quit IRC (Read error: Operation timed out) |
11:07
🔗
|
|
bluefoo has quit IRC (Remote host closed the connection) |
12:35
🔗
|
h3ndr1k |
hook54321: What is an open box harddrive?? Never heard of it |
12:41
🔗
|
h3ndr1k |
Oh wait, its box is just already opened. I blame English not being my native language. |
12:42
🔗
|
h3ndr1k |
For some reason I thought of white box harddrive, which would be interesting. |
12:44
🔗
|
h3ndr1k |
In the sense of white box pc |
12:58
🔗
|
JAA |
Drives that were refurbished by the manufacturer (and not some shady third party) can be fine, but it always depends on what you want to use them for. |
13:13
🔗
|
hook54321 |
I'm just looking for one to use in my laptop, I found some others though |
14:23
🔗
|
|
bluefoo has joined #archiveteam-ot |
15:29
🔗
|
Fusl |
ah |
15:29
🔗
|
Fusl |
whoops |
15:29
🔗
|
Fusl |
wrong tab |
15:41
🔗
|
Fusl |
http://xor.meo.ws/4817f264/b18a/47cb/839a/2f6a9b4d3efa.png why do i not believe you? |
15:46
🔗
|
|
systwiAL_ has joined #archiveteam-ot |
15:56
🔗
|
|
systwiALT has quit IRC (Read error: Operation timed out) |
15:57
🔗
|
|
bluefoo has quit IRC (Read error: Operation timed out) |
16:28
🔗
|
kiska |
I've just got my Radeon VII, $1.1k Australian spent... :( |
16:28
🔗
|
kiska |
Wish it was cheaper |
16:40
🔗
|
|
systwiAL_ is now known as systwiALT |
16:44
🔗
|
systwiALT |
JAA: I would normally just use _ and leave it, however some of the table names will (or might, idk yet) have the name of the youtube channel id. As we know, this can have both - and _, therefore I had wanted to use something like ¬ to avoid any collisions or misinformation |
16:45
🔗
|
systwiALT |
I'm still working at trying to piece even a bare-bones template together, it's not too easy :-/ |
16:46
🔗
|
JAA |
Wait, why would you create a column with the name of a channel? |
16:47
🔗
|
JAA |
That's almost certainly a terrible idea. |
16:53
🔗
|
ivan_ |
systwiALT: did the three tables I gave you make sense |
16:54
🔗
|
systwiALT |
JAA: Well, not exactly. A table. |
16:54
🔗
|
ivan_ |
you don't need per-channel tables |
16:55
🔗
|
systwiALT |
ivan_: They kinda did, but I wasn't sure how each channel ID would link to a column in a different table |
16:55
🔗
|
JAA |
It wouldn't. |
16:55
🔗
|
JAA |
You have tables with columns and rows. |
16:55
🔗
|
systwiALT |
Additionally this part: files (video_id, file) PK (video_id, file) didn't make sense. Is the primary key video_id or file? |
16:55
🔗
|
JAA |
The columns specify the structure of the data, the rows are the actual data. |
16:56
🔗
|
ivan_ |
systwiALT: it's a composite of both columns |
16:56
🔗
|
systwiALT |
Ok one sec let me try this again |
16:57
🔗
|
ivan_ |
because you have channel ids in your videos table, if you wanted videos for one channel, you can just do SELECT * FROM videos WHERE channel_id = 'whatever' |
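A minimal sketch of the single-table idea above, using Python's built-in sqlite3 as a stand-in for PostgreSQL. The `title` column and all channel/video IDs are made-up placeholders, not taken from the sample JSON. Note that `-` and `_` in an ID are harmless here because the ID is stored as data in a row, not used as a table or column name.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE channels (channel_id TEXT NOT NULL PRIMARY KEY);
    CREATE TABLE videos (
        video_id   TEXT NOT NULL PRIMARY KEY,
        channel_id TEXT NOT NULL REFERENCES channels (channel_id),
        title      TEXT  -- hypothetical attribute column
    );
""")

# Hypothetical IDs; they may contain '-' and '_' because they are plain values.
conn.executemany(
    "INSERT INTO channels VALUES (?)",
    [("UC_example-one",), ("UC_example-two",)],
)
conn.executemany(
    "INSERT INTO videos VALUES (?, ?, ?)",
    [
        ("vid-a_00001", "UC_example-one", "first"),
        ("vid-b_00002", "UC_example-one", "second"),
        ("vid-c_00003", "UC_example-two", "third"),
    ],
)

# All videos of one channel: filter rows, no per-channel table needed.
rows = conn.execute(
    "SELECT video_id FROM videos WHERE channel_id = ? ORDER BY video_id",
    ("UC_example-one",),
).fetchall()
videos_for_channel = [row[0] for row in rows]
print(videos_for_channel)
```

Adding a new channel is then just another row in `channels`, with no schema change.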
16:57
🔗
|
JAA |
So if you had a simple table called "channels" which has just two fields (id INTEGER, name VARCHAR), then you'd have one row in this table for each channel you want to store. |
16:58
🔗
|
ivan_ |
the channels table is really the least interesting part of the whole thing |
16:58
🔗
|
ivan_ |
useful if you need to correlate channel ids and users |
17:00
🔗
|
systwiALT |
What about the parents like "availability", "channelname", and "grabinfo"? (excluding the video IDs) |
17:01
🔗
|
systwiALT |
Like what the .json had |
17:01
🔗
|
ivan_ |
parents? |
17:01
🔗
|
systwiALT |
Parent-child relationships in .json |
17:01
🔗
|
systwiALT |
Or object I guess is more specific |
17:02
🔗
|
ivan_ |
is availability per-video? |
17:02
🔗
|
systwiALT |
Per video and per channel |
17:02
🔗
|
ivan_ |
ah so a column on both tables |
17:03
🔗
|
ivan_ |
what's grabinfo? |
17:07
🔗
|
systwiALT |
I'll take a screenshot in a second here |
17:26
🔗
|
|
Dragnog94 has quit IRC (Read error: Operation timed out) |
17:33
🔗
|
|
Raccoon has joined #archiveteam-ot |
17:40
🔗
|
|
phillipsj has joined #archiveteam-ot |
17:40
🔗
|
|
bluefoo has joined #archiveteam-ot |
17:52
🔗
|
systwiALT |
Sorry for the delay ivan_. I can still take a screenshot but it will be difficult since it all doesn't fit on the screen. Did you try viewing the .json I linked to in a hierarchical JSON viewer? |
17:58
🔗
|
ivan_ |
you mean https://gist.githubusercontent.com/systwi/413add02946e3a9cb087f6b4a8922687/raw/cc1feadc56a29237e9df2ed6ec786f8d5d81a164/youtube_database_sample.json |
17:59
🔗
|
ivan_ |
I guess you have that on a channel right now |
17:59
🔗
|
ivan_ |
I'm not sure what the point of it is :-) |
18:00
🔗
|
ivan_ |
do you care about tracking the grabs you did on a channel, even though grabbing videos individually would have functionally equivalent results? |
18:03
🔗
|
systwiALT |
Yes that file. It is just an example of one channel. The purpose is to track changes between every grab of an individual channel or that entire channel. |
18:04
🔗
|
systwiALT |
I have already thought everything through, and the .json works well (in the sense of organization). |
18:05
🔗
|
systwiALT |
I just need to convert this mess to SQL |
18:05
🔗
|
ivan_ |
are you aware of the bugs in how YouTube returns paginated upload playlists |
18:06
🔗
|
ivan_ |
https://ya.borg.xyz/logs/dl/UCN79wVFfg3yCeq0lEy0OzRg/2019-08-28T15_52_08.log |
18:06
🔗
|
ivan_ |
for every "Ignoring duplicate" there's actually a video in that channel that YouTube is failing to list |
18:06
🔗
|
systwiALT |
I would normally just send the channel url itself, not with /videos at the end |
18:06
🔗
|
ivan_ |
so tracking the video IDs you get on each grab will have noisy variations that don't reflect real additions or removals in the channel |
18:09
🔗
|
systwiALT |
Any grab to a channel, whether it be just a single video or the entire channel, would count as one grab. That number always increments by 1. The most up-to-date information will always be in the "curr" object. If that video is removed the "availability" boolean is set to false. If a video is added its video ID appears in the list under its respective channel id. |
18:10
🔗
|
systwiALT |
So if the first time I grab a single video from a new channel, the grabnum is 1. If later, I grab a different video from that channel that counts as grab 2. If I then grab the entire channel the grabnum is 3. |
18:10
🔗
|
ivan_ |
but what do you do with this information |
18:11
🔗
|
systwiALT |
I will use it to keep track of changes to YT content. |
18:11
🔗
|
systwiALT |
It's for personal use |
18:12
🔗
|
systwiALT |
If I wanted to revert a video back to grabnum 2 (whether it be for a removed comment in the current one, different description, etc.) I could do that in my script |
18:12
🔗
|
ivan_ |
you know the video itself can change, yes? |
18:12
🔗
|
systwiALT |
Yep |
18:12
🔗
|
systwiALT |
In the example I provided it didn't change |
18:13
🔗
|
systwiALT |
But I have already planned out for if/when a video changes |
18:13
🔗
|
ivan_ |
can you just keep both versions? |
18:13
🔗
|
systwiALT |
The video file is treated the same as any other file (e.g. description) |
18:13
🔗
|
systwiALT |
It will keep both versions |
18:15
🔗
|
ivan_ |
it seems like your mostrecentchg would get overwritten pretty quickly and you'd lose deltas for previous grabs |
18:15
🔗
|
systwiALT |
On the filesystem I will also have a curr and an old folder. curr, as you guessed, will always have the most up-to-date data. Inside the old folder, you will see grab1, grab2, basically any older grabs. Inside those folders will be your older versions of the files. |
18:15
🔗
|
systwiALT |
The database will keep track of which files were changed at whichever grab number. |
18:15
🔗
|
ivan_ |
ah, so this thing reflects real changes in your storage, not changes in youtube |
18:16
🔗
|
systwiALT |
Well, kinda. This will not be monitoring YouTube 24/7; what will happen is that my script will download a video or channel and edit the database accordingly |
18:17
🔗
|
ivan_ |
does "affectedvideos" : [ "z3aEv3EzMyQ" ] list new videos that you stored |
18:17
🔗
|
ivan_ |
does it sometimes list something else? |
18:17
🔗
|
ivan_ |
ah, you're redownloading info for videos that you already have? |
18:18
🔗
|
systwiALT |
affectedvideos lists the video ID of whichever videos had information changed in them during the most recent grab |
18:19
🔗
|
systwiALT |
Typically it will list several videos (comments are added, thumbnail might change, etc.) |
18:19
🔗
|
ivan_ |
if you were storing this in SQL you could just have (video_id, retrieval_time) as the PK and SELECT * FROM videos WHERE video_id = 'whatever'; and compare the rows |
18:20
🔗
|
ivan_ |
(you would get every version of the thing) |
18:20
🔗
|
systwiALT |
My plan is to redownload and store information that is different from the file I currently have. If the comments.json file is the same in grabnum 1 as it is in grabnum 5 then grabnum 1's comments.json will be considered the most current and will not be redownloaded |
18:21
🔗
|
systwiALT |
Once comments.json changes (let's say grabnum 6), it will move curr>grabnum 1>comments.json to "old" and curr>grabnum 6>comments.json will be in "curr" |
18:22
🔗
|
ivan_ |
I would recommend rethinking this from scratch instead of porting your JSON ideas :-) |
18:22
🔗
|
ivan_ |
are grabs an entity you really want to track? you _could_ but it doesn't seem necessary |
18:23
🔗
|
ivan_ |
the changes happen on videos, who cares about the grab that was responsible for detecting the change |
18:24
🔗
|
systwiALT |
Trust me, I have spent months on this. I really don't want to rethink this from scratch :-/ Also, "(video_id, retrieval_time) as the PK and SELECT * FROM videos WHERE video_id = 'whatever'; and compare the rows" doesn't make sense to me |
18:24
🔗
|
systwiALT |
Yes, I want to track all of this information. |
18:24
🔗
|
ivan_ |
you would get multiple rows out if you have multiple versions of the video |
18:25
🔗
|
systwiALT |
I probably am not explaining this very thoroughly but to me this makes sense. |
18:25
🔗
|
ivan_ |
do you also want to track when a video disappears (and possibly reemerges?) |
18:26
🔗
|
systwiALT |
^ That I don't plan on tracking. For now I just have it as "availability", which if it |
18:26
🔗
|
ivan_ |
you would have a row for each (video_id, retrieval_time) with the same video_id and different retrieval time |
18:26
🔗
|
systwiALT |
... which if it's offline then availableonline is false, if it's back up again it's true. |
18:27
🔗
|
ivan_ |
is it making sense or is it not solving something |
18:27
🔗
|
systwiALT |
I'm sorry this isn't making sense to me :( SQL is new to me |
18:28
🔗
|
systwiALT |
I have only this so far: |
18:28
🔗
|
systwiALT |
CREATE TABLE channels (channel_id TEXT NOT NULL, PRIMARY KEY (channel_id)); |
18:28
🔗
|
systwiALT |
And yes I have read through psql's docs and watched tutorials |
18:29
🔗
|
ivan_ |
if you have a videos table with a PK on video_id you can only store one row with a certain video_id |
18:29
🔗
|
ivan_ |
if you have a videos table with a PK on (video_id, retrieval_time) you can store multiple rows with the same video_id if they have a different retrieval_time |
18:30
🔗
|
ivan_ |
only the entire key must be unique |
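The composite-key behaviour described above can be sketched like this, again with sqlite3 standing in for PostgreSQL. The timestamps and the `title` column are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE videos (
        video_id       TEXT NOT NULL,
        retrieval_time TEXT NOT NULL,   -- UTC timestamp of the grab (assumed format)
        title          TEXT,            -- hypothetical metadata column
        PRIMARY KEY (video_id, retrieval_time)
    )
""")

# Same video_id twice is fine because retrieval_time differs.
conn.execute(
    "INSERT INTO videos VALUES (?, ?, ?)",
    ("z3aEv3EzMyQ", "2019-08-28T15:52:08Z", "old title"),
)
conn.execute(
    "INSERT INTO videos VALUES (?, ?, ?)",
    ("z3aEv3EzMyQ", "2019-08-29T10:00:00Z", "new title"),
)

# Repeating the whole key (video_id AND retrieval_time) is refused.
try:
    conn.execute(
        "INSERT INTO videos VALUES (?, ?, ?)",
        ("z3aEv3EzMyQ", "2019-08-28T15:52:08Z", "duplicate"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Every stored version of one video, oldest first.
versions = conn.execute(
    "SELECT retrieval_time, title FROM videos "
    "WHERE video_id = ? ORDER BY retrieval_time",
    ("z3aEv3EzMyQ",),
).fetchall()
print(duplicate_rejected, versions)
```

Comparing adjacent rows in `versions` is then how you would see what changed between two grabs of the same video.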
18:32
🔗
|
* |
systwiALT sobs |
18:33
🔗
|
ivan_ |
what's the confusing part |
18:33
🔗
|
systwiALT |
"you can only store one row with a certain video_id" |
18:34
🔗
|
systwiALT |
That sentence would be equivalent to: CREATE TABLE videos (video_id VARCHAR(11) NOT NULL PRIMARY KEY); |
18:34
🔗
|
ivan_ |
a filesystem is keyed on filenames, you can't have multiple files with the same filename |
18:34
🔗
|
systwiALT |
Right? |
18:34
🔗
|
ivan_ |
a SQL table is keyed on the PK, you can't have multiple rows with the same PK |
18:34
🔗
|
ivan_ |
systwiALT: sure |
18:35
🔗
|
ivan_ |
if you try to INSERT another row with the same PK it will refuse |
18:35
🔗
|
systwiALT |
Ok that makes sense, so meaning: z3aEv3EzMyQ can be used only once under the video_id column if video_id is PK? |
18:35
🔗
|
ivan_ |
sure |
18:36
🔗
|
systwiALT |
But... that's kinda how I would like it. I mean, the video_id is unique anyway. No two videos share the same ID |
18:36
🔗
|
ivan_ |
but you said you were grabbing videos multiple times and keeping different metadata |
18:37
🔗
|
ivan_ |
or did I get that wrong |
18:37
🔗
|
systwiALT |
No that's correct |
18:37
🔗
|
ivan_ |
so you actually have multiple versions of a video and need to have multiple rows with the same video_id |
18:37
🔗
|
systwiALT |
by retrieval_time do you mean grabnum? |
18:37
🔗
|
ivan_ |
you could use the UTC timestamp of the time you started the grab for that video |
18:38
🔗
|
systwiALT |
That's stored in grab_date |
18:38
🔗
|
systwiALT |
I use grabnum because that way if one file takes 3 minutes to download but another takes a second in the same grab, it's easier to consider them both as grabnum 1 and store that grab time in the database |
18:39
🔗
|
systwiALT |
than to have folders inside of "old" with each date/time |
18:40
🔗
|
ivan_ |
your folder structure is relatively independent of the database because you have another files table mapping (video_id) or (video_id, retrieval_time) to the filenames for that grab |
18:41
🔗
|
ivan_ |
are you re-grabbing video data? |
18:41
🔗
|
systwiALT |
My script will check for changes to video.mkv (or video.mp4) and keep the file if its hash is different |
18:42
🔗
|
systwiALT |
than the current one |
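A rough sketch of the hash check described above, using only the Python standard library. The helper names `file_sha256` and `has_changed` and the temporary file names are hypothetical; the actual script is not shown in the log.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(current: Path, candidate: Path) -> bool:
    """True when the freshly grabbed file differs byte-for-byte from the current one."""
    return file_sha256(current) != file_sha256(candidate)


# Tiny demonstration with throwaway files instead of real video data.
with tempfile.TemporaryDirectory() as tmp:
    current = Path(tmp) / "curr_video.mkv"
    same = Path(tmp) / "regrab_same.mkv"
    different = Path(tmp) / "regrab_different.mkv"
    current.write_bytes(b"original bytes")
    same.write_bytes(b"original bytes")
    different.write_bytes(b"re-encoded bytes")

    unchanged_result = has_changed(current, same)
    changed_result = has_changed(current, different)

print(unchanged_result, changed_result)
```

A hash only detects byte-level differences, so a file that was re-encoded without any visible change would still count as changed.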
18:42
🔗
|
ivan_ |
that's going to keep videos that have merely been re-encoded by youtube |
18:44
🔗
|
systwiALT |
Heheh well then that's a bit of a problem. It's not a HUGE deal it just means there's a higher chance for me to have unnecessary duplicates |
18:44
🔗
|
systwiALT |
I'll take the unnecessary dupes |
18:45
🔗
|
systwiALT |
There's no way (afaik) to tell the differences other than hashes |
18:45
🔗
|
ivan_ |
well, it's up to you how to represent things in your database, if you want grabnums and stuff because you think grabs are a meaningful entity then go ahead |
18:46
🔗
|
ivan_ |
you need a table per entity |
18:46
🔗
|
ivan_ |
per entity type |
18:47
🔗
|
|
kiskabak has quit IRC (Remote host closed the connection) |
18:47
🔗
|
|
kiskabak has joined #archiveteam-ot |
18:47
🔗
|
|
Fusl sets mode: +o kiskabak |
18:47
🔗
|
|
Fusl__ sets mode: +o kiskabak |
18:47
🔗
|
|
Fusl_ sets mode: +o kiskabak |
18:47
🔗
|
systwiALT |
Yes, I feel that they are. Example (to clarify): Grab #1, video.mkv is added, comments.json is added. Grab #2, video.mkv is the same (don't change), comments.json changes. This means at this point, video.mkv is the same in grab #1 as it is now. comments.json is different from what it was the first time it was grabbed. |
18:48
🔗
|
systwiALT |
Therefore, curr = video.mkv (grabnum=1), comments.json (grabnum=2) |
18:48
🔗
|
systwiALT |
old = comments.json (grabnum=1) |
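The curr/old split in this example could be derived from a single table instead of being stored separately. This is only a sketch: the table name `file_versions` and its columns are assumptions, and sqlite3 stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file_versions (
        video_id TEXT NOT NULL,
        file     TEXT NOT NULL,
        grabnum  INTEGER NOT NULL,   -- grab in which this version was stored
        PRIMARY KEY (video_id, file, grabnum)
    )
""")

# Grab 1 stores both files; grab 2 stores only the changed comments.json.
conn.executemany(
    "INSERT INTO file_versions VALUES (?, ?, ?)",
    [
        ("z3aEv3EzMyQ", "video.mkv", 1),
        ("z3aEv3EzMyQ", "comments.json", 1),
        ("z3aEv3EzMyQ", "comments.json", 2),
    ],
)

# "curr": the highest grabnum per file.
curr = conn.execute(
    "SELECT file, MAX(grabnum) FROM file_versions "
    "WHERE video_id = ? GROUP BY file ORDER BY file",
    ("z3aEv3EzMyQ",),
).fetchall()

# "old": every version that is older than the newest one of the same file.
old = conn.execute(
    "SELECT fv.file, fv.grabnum FROM file_versions fv "
    "WHERE fv.video_id = ? AND fv.grabnum < ("
    "    SELECT MAX(m.grabnum) FROM file_versions m "
    "    WHERE m.video_id = fv.video_id AND m.file = fv.file"
    ") ORDER BY fv.file, fv.grabnum",
    ("z3aEv3EzMyQ",),
).fetchall()
print(curr, old)
```

With this layout nothing has to be moved from "curr" to "old" in the database when a file changes; a new row is inserted and the queries pick it up.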
18:54
🔗
|
systwiALT |
One thing I trip up on is REFERENCES. I'm trying to follow this guide http://patshaughnessy.net/2017/12/11/trying-to-represent-a-tree-structure-using-postgres and it still doesn't click |
18:57
🔗
|
ivan_ |
systwiALT: if you have a REFERENCES on something else, that other thing must exist, or it will refuse to let you INSERT/UPDATE to that value |
18:58
🔗
|
ivan_ |
systwiALT: you don't need a tree structure because you don't have entities that reference the same entity as a parent |
18:58
🔗
|
ivan_ |
same entity type |
18:59
🔗
|
|
kiskabak has quit IRC (Remote host closed the connection) |
18:59
🔗
|
|
kiskabak has joined #archiveteam-ot |
19:00
🔗
|
|
Fusl sets mode: +o kiskabak |
19:00
🔗
|
|
Fusl__ sets mode: +o kiskabak |
19:00
🔗
|
|
Fusl_ sets mode: +o kiskabak |
19:00
🔗
|
ivan_ |
if you have a table of grabs, you could for example in another table have: grab integer REFERENCES grabs (id) |
19:00
🔗
|
ivan_ |
then you would not be able to insert a bogus grab id |
19:02
🔗
|
systwiALT |
I thought this was a tree structure: https://transfer.notkiska.pw/gTx3G/tree.tiff |
19:02
🔗
|
ivan_ |
no, as far as I know you have normal relational database structure |
19:03
🔗
|
systwiALT |
So REFERENCES isn't necessary anywhere in my application? |
19:03
🔗
|
ivan_ |
entities with attributes with some one-to-many relationships |
19:04
🔗
|
ivan_ |
sure but I highly recommend using REFERENCES to avoid keeping around bogus data pointing to things that don't exist |
19:04
🔗
|
ivan_ |
REFERENCES / foreign key constraints will also prevent you from e.g. DROPing a grab that videos are pointing to |
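A small sketch of what a REFERENCES constraint does, with sqlite3 standing in for PostgreSQL. The `grab_date` column and the sample values are assumptions. One SQLite-specific detail: foreign keys are only enforced after `PRAGMA foreign_keys = ON`, whereas PostgreSQL enforces them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only: enforcement must be switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE grabs (
        id        INTEGER PRIMARY KEY,
        grab_date TEXT NOT NULL
    );
    CREATE TABLE videos (
        video_id TEXT NOT NULL,
        grab     INTEGER NOT NULL REFERENCES grabs (id),
        PRIMARY KEY (video_id, grab)
    );
""")

conn.execute("INSERT INTO grabs VALUES (?, ?)", (1, "2019-08-28T15:52:08Z"))
conn.execute("INSERT INTO videos VALUES (?, ?)", ("z3aEv3EzMyQ", 1))

# A video pointing at a grab that does not exist is refused.
try:
    conn.execute("INSERT INTO videos VALUES (?, ?)", ("z3aEv3EzMyQ", 99))
    bogus_insert_rejected = False
except sqlite3.IntegrityError:
    bogus_insert_rejected = True

# Deleting a grab row that a video still points to is refused as well.
try:
    conn.execute("DELETE FROM grabs WHERE id = ?", (1,))
    delete_rejected = False
except sqlite3.IntegrityError:
    delete_rejected = True

remaining_grabs = conn.execute("SELECT COUNT(*) FROM grabs").fetchone()[0]
print(bogus_insert_rejected, delete_rejected, remaining_grabs)
```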
19:06
🔗
|
systwiALT |
Ahh I see |
19:08
🔗
|
systwiALT |
Would I need to use an (id SERIAL PRIMARY KEY) and (parent_id INT REFERENCES channels (id)) in any of my tables like that tutorial did? |
19:08
🔗
|
ivan_ |
(I meant DELETE not DROP) |
19:08
🔗
|
ivan_ |
systwiALT: maybe not |
19:09
🔗
|
systwiALT |
Ok |
19:09
🔗
|
systwiALT |
Thank you so very much for your help. I'll take a lunch break and hopefully I can get somewhere with this. I'll fill you in with any updates |
19:28
🔗
|
ivan_ |
systwiALT: the way to think of it is: you really do put every entity of the same type into one table, then use a PK or other indexed column to get the rows you need |
20:15
🔗
|
|
DogsRNice has joined #archiveteam-ot |
20:20
🔗
|
|
bluefoo has quit IRC (Ping timeout: 255 seconds) |
20:25
🔗
|
|
superkuh has quit IRC (Remote host closed the connection) |
20:26
🔗
|
|
bluefoo has joined #archiveteam-ot |
20:28
🔗
|
|
superkuh has joined #archiveteam-ot |
21:14
🔗
|
Fusl |
b 171 |
21:17
🔗
|
|
Dimtree has quit IRC () |
21:24
🔗
|
|
Joseph_ has joined #archiveteam-ot |
21:28
🔗
|
|
VerifiedJ has quit IRC (Read error: Operation timed out) |
21:31
🔗
|
|
Dimtree has joined #archiveteam-ot |
22:27
🔗
|
|
icedice has joined #archiveteam-ot |
22:32
🔗
|
|
BlueMax has joined #archiveteam-ot |
22:36
🔗
|
|
Joseph_ has quit IRC (Read error: Connection reset by peer) |
22:43
🔗
|
|
kiskabak has quit IRC (Read error: Operation timed out) |
22:50
🔗
|
|
jeekl has quit IRC (Ping timeout: 745 seconds) |
23:06
🔗
|
|
bluefoo has quit IRC (Read error: Operation timed out) |
23:26
🔗
|
|
Maylay has quit IRC (Quit: Pipe Terminated) |
23:30
🔗
|
|
jeekl has joined #archiveteam-ot |
23:33
🔗
|
|
JH881 has quit IRC (Ping timeout: 252 seconds) |
23:37
🔗
|
|
JH881 has joined #archiveteam-ot |
23:42
🔗
|
|
kiskabak has joined #archiveteam-ot |
23:42
🔗
|
|
Fusl sets mode: +o kiskabak |
23:42
🔗
|
|
Fusl__ sets mode: +o kiskabak |
23:42
🔗
|
|
Fusl_ sets mode: +o kiskabak |
23:46
🔗
|
|
JH8813 has joined #archiveteam-ot |
23:46
🔗
|
|
JH881 has quit IRC (Ping timeout: 252 seconds) |