00:07 *** dashcloud has quit IRC (Read error: Connection reset by peer)
00:08 *** dashcloud has joined #archiveteam-ot
00:56 *** BlueMax has joined #archiveteam-ot
01:36 *** dashcloud has quit IRC (Read error: Connection reset by peer)
01:38 *** dashcloud has joined #archiveteam-ot
02:00 *** Sanqui has quit IRC (Ping timeout: 260 seconds)
02:06 *** Sanqui has joined #archiveteam-ot
02:06 *** svchfoo1 sets mode: +o Sanqui
02:11 *** Sanqui has quit IRC (Read error: Operation timed out)
02:23 *** Sanqui has joined #archiveteam-ot
02:24 *** svchfoo1 sets mode: +o Sanqui
03:22 *** Sanqui has quit IRC (Ping timeout: 260 seconds)
03:23 *** Sanqui has joined #archiveteam-ot
03:24 *** svchfoo1 sets mode: +o Sanqui
03:43 *** wp494 has quit IRC (Ping timeout: 260 seconds)
03:44 *** wp494 has joined #archiveteam-ot
03:44 *** svchfoo1 sets mode: +o wp494
04:22 *** odemg has quit IRC (Ping timeout: 265 seconds)
04:34 *** odemg has joined #archiveteam-ot
05:11 *** adinbied has joined #archiveteam-ot
05:25 *** adinbied has quit IRC (Left Channel.)
06:24 *** adinbied has joined #archiveteam-ot
06:59 *** icedice has quit IRC (Quit: Leaving)
07:04 *** Mateon1 has quit IRC (Remote host closed the connection)
07:04 *** Mateon1 has joined #archiveteam-ot
07:08 *** jspiros has quit IRC (Remote host closed the connection)
07:08 *** swebb has quit IRC (Ping timeout: 240 seconds)
07:08 *** svchfoo1 has quit IRC (Ping timeout: 240 seconds)
07:09 *** swebb has joined #archiveteam-ot
07:10 *** nightpoo- has quit IRC (Ping timeout: 246 seconds)
07:11 *** JAA has quit IRC (Ping timeout: 246 seconds)
07:16 *** godane has quit IRC (Ping timeout: 492 seconds)
07:17 *** nightpool has joined #archiveteam-ot
07:25 *** godane has joined #archiveteam-ot
07:55 *** BlueMax has quit IRC (Read error: Connection reset by peer)
08:03 *** BlueMax has joined #archiveteam-ot
08:10 *** JAA has joined #archiveteam-ot
08:10 *** svchfoo1 has joined #archiveteam-ot
08:11 *** jspiros has joined #archiveteam-ot
08:11 *** bakJAA sets mode: +o JAA
08:11 *** svchfoo3 sets mode: +o JAA
08:12 *** svchfoo3 sets mode: +o svchfoo1
08:29 *** jspiros has quit IRC (hub.efnet.us irc.colosolutions.net)
08:29 *** svchfoo1 has quit IRC (hub.efnet.us irc.colosolutions.net)
08:29 *** JAA has quit IRC (hub.efnet.us irc.colosolutions.net)
08:40 *** jspiros has joined #archiveteam-ot
08:40 *** svchfoo1 has joined #archiveteam-ot
08:40 *** JAA has joined #archiveteam-ot
08:40 *** irc.colosolutions.net sets mode: +oo svchfoo1 JAA
08:41 *** JAA sets mode: +o bakJAA
08:41 *** schbirid has joined #archiveteam-ot
08:41 *** bakJAA sets mode: +o JAA
09:36 *** alex___ has joined #archiveteam-ot
09:40 *** LFlare43 has quit IRC (Quit: The Lounge - https://thelounge.chat)
10:35 *** BlueMax has quit IRC (Remote host closed the connection)
10:37 *** BlueMax has joined #archiveteam-ot
11:54 *** BlueMax has quit IRC (Read error: Connection reset by peer)
12:51 *** wp494 has quit IRC (Read error: Operation timed out)
12:51 *** wp494 has joined #archiveteam-ot
12:51 *** svchfoo3 sets mode: +o wp494
13:17 *** VerifiedJ has joined #archiveteam-ot
13:18 *** hook54321 has quit IRC (Quit: Connection closed for inactivity)
13:36 *** wmvhater has quit IRC (Read error: Operation timed out)
13:38 *** kiska1 has quit IRC (Ping timeout (120 seconds))
13:38 *** wmvhater has joined #archiveteam-ot
13:39 *** kiska1 has joined #archiveteam-ot
13:41 *** wmvhater has quit IRC (Client Quit)
13:42 *** wmvhater has joined #archiveteam-ot
13:43 *** kiska1 has quit IRC (Client Quit)
13:43 *** kiska1 has joined #archiveteam-ot
13:46 *** wmvhater has quit IRC (Read error: Operation timed out)
13:49 *** wmvhater has joined #archiveteam-ot
13:51 *** godane has quit IRC (Ping timeout: 265 seconds)
15:15 *** adinbied has quit IRC (Quit: Left Channel.)
15:36 *** adinbied has joined #archiveteam-ot
15:36 *** schbirid has quit IRC (Remote host closed the connection)
15:47 *** schbirid has joined #archiveteam-ot
16:33 *** alex___ has quit IRC (alex___)
16:36 *** alex___ has joined #archiveteam-ot
19:37 *** icedice has joined #archiveteam-ot
19:39 *** ola_norsk has joined #archiveteam-ot
19:40 <ola_norsk> Happy New Year!
19:45 <ola_norsk> Without harping on too much on the YouTube Annotations issue; Would anyone happen to have a good idea to get all video id's by January 15th 2019..that doesn't involve scrapy?
19:46 <ola_norsk> I bet i can pull the annotations, the hard part is figuring out all the .ID's
19:49 <ola_norsk> youtube data API seems to have some sort of points costs to it, and i'm not paying to unfuck youtubes fuckups
19:51 <Kaz> just iterate through them
19:51 <ola_norsk> 'them' who ? , i have 4Mbit connection :/
19:51 *** alex___ has quit IRC (Read error: Operation timed out)
19:52 <ola_norsk> If i had a table of all the links of every video (or video id) i could iterate
19:53 <ola_norsk> there's no way i'd be able to scrape every ID off of youtube singlehandedly in 1.25 month
19:53 <Kaz> correct
19:54 <ola_norsk> eientei95: you see that, "correct" .. it's why i'm here
19:54 * ola_norsk is no 1337 haxor
19:55 <Kaz> you're going to have to brute-force your way through if you want everything
19:55 <Kaz> otherwise just search google/facebook/twitter/reddit/whatever
19:55 <ola_norsk> Kaz: "...stop being brainless"
19:57 *** JAA has quit IRC (Read error: Operation timed out)
19:57 *** alex___ has joined #archiveteam-ot
19:57 <Kaz> happy to hear any smart ideas you've come up with for this
19:57 <Kaz> Google's not going to give you a list
19:58 <Kaz> Lots of videos are just going to be unlisted anyway, so won't show in searches etc
19:58 <ola_norsk> Kaz: i'm looking at Scrapy, that's the extent of my smart
19:58 <Kaz> So, as I said, your options are a) searching whatever you can to scrape youtube.com and youtu.be links
19:58 <Kaz> or b) bruteforcing the list
19:59 *** jspiros has quit IRC (Read error: Operation timed out)
20:00 *** adinbied has quit IRC (Read error: Operation timed out)
20:00 *** svchfoo1 has quit IRC (Ping timeout: 246 seconds)
20:01 *** adinbied has joined #archiveteam-ot
20:11 <ola_norsk> b seems to be the quicker option then
20:14 <cf> ola_norsk: I've already started scraping and grabbing annotation xmls
20:15 <cf> started with all submissions to reddit since that data is easily available
20:15 <cf> got 20M ids or so
20:15 <cf> going to do that and also grab from the top hundred thousand channels or so
20:15 <cf> maybe see if I can get a primitive spider crawling for more ids but not super invested
20:23 <ola_norsk> cf: awesome. Though instead of the top hundred thousand channels, maybe it would be better to prioritize by category? e.g News, History, Technology, first, then e.g Humour last?
20:24 <ola_norsk> cf: but yeah, it's hard to care, really..when it's about time YouTube took a dive from their monopoly pedestal
20:25 <cf> could prioritize by category, but not sure how easy it is to filter by something like that. will take a look tho
20:26 <ola_norsk> what format do you have the ids saved as?
20:27 <cf> just the 11 chars after ?v= in the url. [0-9A-Za-z_-]{11}
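A minimal sketch of the extraction cf describes: pull anything that looks like an 11-character [0-9A-Za-z_-]{11} video ID out of a text dump of links. The file names and input source here are placeholders, not cf's actual setup.

```python
# Hypothetical sketch of cf's ID extraction: scan a text dump for YouTube links
# and keep the 11-character ID after ?v= (or after youtu.be/). File names are
# illustrative only.
import re

VIDEO_ID = re.compile(r"(?:v=|youtu\.be/)([0-9A-Za-z_-]{11})")

def extract_ids(in_path="reddit_submissions.txt", out_path="all_ids.txt"):
    seen = set()
    with open(in_path, errors="replace") as src, open(out_path, "w") as dst:
        for line in src:
            for video_id in VIDEO_ID.findall(line):
                if video_id not in seen:  # keep the output list unique
                    seen.add(video_id)
                    dst.write(video_id + "\n")

if __name__ == "__main__":
    extract_ids()
```

As cf notes further down, a bare regex like this will also match strings that merely look like IDs, so some false positives are expected.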
20:27 <ola_norsk> if you can share them, feel free to do so
20:28 <cf> http://files.ulim.it/all_ids.txt
20:28 <cf> i'm sure there's things that just matched the regex despite not being a real id (the first line is an example)
20:28 <cf> but spot checking seems to show it being pretty good
20:28 <cf> and again, that's all of the ids I extracted from reddit submission data
20:30 <ola_norsk> still more awesomer than me _trying to_ reinventing the wheel, so thanks!
20:33 <ola_norsk> could git be used to commit patches to the list in the future you think?
20:34 <ola_norsk> it's quite a fuckload of textfile there :D
20:36 <ola_norsk> or mysql, to insert new id's into perhaps?
20:39 <cf> rdms's tend to choke when you're just using them to store a list of distinct items. probably still a bit better than a text file but not by much. not sure how git would fare
20:39 <cf> *rdbms's
20:40 <ola_norsk> it took me over 2 minutes just to 'cat' through the list
20:41 <ola_norsk> do you happen to have a record of the ones you've already pulled the annotations from?
20:41 *** godane has joined #archiveteam-ot
20:41 <ola_norsk> that way i could start on the ones you've not yet done
20:42 <cf> just have a script working its way down the list
20:42 <cf> probably the first 100k or so
20:42 <ola_norsk> ok
20:42 <cf> but its parallelized so its going to be out of order
20:44 <ola_norsk> if i reverse the list and work upward?
20:44 <ola_norsk> a bit of duplicate is impossible to prevent i guess, with 'tubeup' already having done a lot
20:45 <ola_norsk> MirrorTube, i mean
20:45 <cf> sure yeah if you want to grab stuff yourself, we should meet somewhere in the middle
20:47 <ola_norsk> you expect you'll be able to pull it off that would be great.
20:47 <ola_norsk> if*
20:48 <ola_norsk> if not, i'd be happy to give it a go
20:51 <ola_norsk> how do you name the annotations files btw? by just id, or?
20:51 <ola_norsk> <id>.xml ?
20:52 <cf> yeah id.xml
20:52 <ola_norsk> ok
20:52 <Kaz> (get warcs too please)
20:53 <Kaz> actually scratch that, I'll chuck the list into archivebot
20:53 <ola_norsk> wpull is JAA's territory, i'm just a grabsite n00b :D
20:53 *** Mateon1 has quit IRC (Ping timeout: 252 seconds)
20:54 *** Mateon1 has joined #archiveteam-ot
20:54 <Kaz> if you're chucking it into grab-site it'll probably generate a warc, no?
20:54 <ola_norsk> that, and then some
20:56 <ola_norsk> maybe 'youtube' ignoreset would help
20:56 <ola_norsk> idk, i had a forum generating 10+GB...
20:58 <ola_norsk> (https://archive.org/details/WARC_www_subsim_com-radioroom-2018-09-07-89abc154_01) ..and to ~7 (i think)
20:59 <ola_norsk> i used login cookie, so it might contain some software and mods..but had to manually cancel it since it never stopped
20:59 *** svchfoo1 has joined #archiveteam-ot
20:59 *** JAA has joined #archiveteam-ot
20:59 *** svchfoo3 sets mode: +o JAA
21:00 *** bakJAA sets mode: +o JAA
21:00 *** svchfoo3 sets mode: +o svchfoo1
21:03 *** jspiros has joined #archiveteam-ot
21:07 <ola_norsk> Kaz: youtube-dl can pull comments i think
21:07 <ola_norsk> as info json
21:08 <Kaz> eh
21:08 <Kaz> isn't this about annotations?
21:09 <ola_norsk> <@Kaz> (get warcs too please)
21:09 <Kaz> ..continue
21:09 <ola_norsk> you said that, i didn't :D
21:09 <Kaz> yes
21:09 <Kaz> how did comments come into this
21:10 <Kaz> if we're going about quoting things..
21:10 <Kaz> <@Kaz> isn't this about annotations?
21:10 <ola_norsk> because grabbing comments is doable, while warcing every */v/<id> i don't think is :/
21:11 <Kaz> wha
21:11 <Kaz> drop making jumps
21:11 <Kaz> you're working from https://www.youtube.com/annotations_invideo?features=1&legacy=1&video_id=<video_id>
21:11 <Kaz> correct?
21:11 <ola_norsk> warc would mean the video is included, correct?
21:12 <Kaz> no
21:12 <ola_norsk> ok
21:12 <Kaz> if you're grabbing the annotations xml.. get a warc of it
21:12 <Kaz> comments and videos themselves don't come into this
21:12 <Kaz> we're sure as shit not grabbing the whole of youtube today
21:13 <ola_norsk> aha! see, you're smarter than i look
21:13 <Kaz> i think that's a compliment
21:13 <ola_norsk> i look pretty sh*t, but yeah, it was meant to be
21:14 <ola_norsk> anywho, i'll go with your idea
21:14 *** ola_norsk has quit IRC (leaving)
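A rough sketch of the per-ID grab the conversation converges on: hit the annotations_invideo endpoint Kaz quoted for every entry in all_ids.txt and save each response as <id>.xml, which is cf's naming scheme. The use of the requests library, the pacing, and the directory layout are assumptions; WARC capture (Kaz's request) would happen separately, e.g. via ArchiveBot or grab-site.

```python
# Hypothetical sketch only: fetch the legacy annotations XML for each video ID
# and write it to <id>.xml. The URL is the one quoted in the channel; the
# rate limiting and file handling are assumptions, not anyone's actual script.
import time
import requests  # third-party: pip install requests

ANNOTATION_URL = ("https://www.youtube.com/annotations_invideo"
                  "?features=1&legacy=1&video_id={}")

def fetch_annotations(ids_file="all_ids.txt", out_dir=".", delay=1.0):
    with open(ids_file) as f:
        for line in f:
            video_id = line.strip()
            if not video_id:
                continue
            resp = requests.get(ANNOTATION_URL.format(video_id), timeout=30)
            if resp.status_code == 200:
                with open(f"{out_dir}/{video_id}.xml", "wb") as out:
                    out.write(resp.content)
            time.sleep(delay)  # go slowly; Google bans aggressive clients quickly

if __name__ == "__main__":
    fetch_annotations()
```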
21:16 <ivan> I've never seen youtube-dl include comments
21:22 *** BlueMax has joined #archiveteam-ot
21:37 *** hook54321 has joined #archiveteam-ot
21:43 *** wp494 has quit IRC (Ping timeout: 268 seconds)
21:43 *** wp494 has joined #archiveteam-ot
21:43 *** svchfoo3 sets mode: +o wp494
22:07 *** ola_norsk has joined #archiveteam-ot
22:07 <ola_norsk> cf: did you check your all_ids.txt for duplicates?
22:08 <ola_norsk> i mean, are all 200k of them unique?
22:10 <ola_norsk> 200k+
22:17 * ola_norsk sucks at grep :/ and came out with 40k+ lines when doing adding/transferring unique from-and-to a file from all_ids :/ http://paste.ubuntu.com/p/6hfm34VSDM/
22:19 <ola_norsk> "grep -Fxq" to find
22:20 <ola_norsk> used*
22:21 *** BlueMax has quit IRC (Read error: Connection reset by peer)
22:21 *** JAA has quit IRC (Read error: Operation timed out)
22:22 <ola_norsk> it's more than likely i messed up
22:23 *** BlueMax has joined #archiveteam-ot
22:24 *** svchfoo1 has quit IRC (Ping timeout: 246 seconds)
22:26 *** jspiros has quit IRC (Read error: Operation timed out)
22:38 *** BlueMax has quit IRC (Quit: Leaving)
22:39 *** BlueMax has joined #archiveteam-ot
22:44 *** godane has quit IRC (Read error: Operation timed out)
22:51 <ola_norsk> i put the stuff here https://archive.org/details/Youtube_Video_IDs_CF_2018-12-01
22:51 *** ola_norsk has quit IRC (Skål!)
23:24 *** jspiros has joined #archiveteam-ot
23:25 *** svchfoo1 has joined #archiveteam-ot
23:26 *** svchfoo3 sets mode: +o svchfoo1
23:26 *** JAA has joined #archiveteam-ot
23:26 *** svchfoo1 sets mode: +o JAA
23:27 *** bakJAA sets mode: +o JAA
23:27 *** godane has joined #archiveteam-ot
23:41 <JAA> (For log completeness, posted in -bs by mistake...) ola_norsk: Extracting duplicates from a file: awk 'seen[$0]++' <file This will print each duplicate (but not the first appearance). You could also do sort <file | uniq -d but I prefer the awk solution since it doesn't have to sort the file and is therefore *much* faster.
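For reference, the same duplicate check ola_norsk was after, written in Python and equivalent in spirit to JAA's awk one-liner above; the file name is the one cf posted, everything else is illustrative.

```python
# Sketch: count how many IDs in all_ids.txt occur more than once.
# Comparable to `awk 'seen[$0]++' all_ids.txt`, but also reports totals.
from collections import Counter

def count_duplicates(path="all_ids.txt"):
    with open(path) as f:
        counts = Counter(line.strip() for line in f if line.strip())
    dupes = {vid: n for vid, n in counts.items() if n > 1}
    print(f"{len(counts)} unique ids, {len(dupes)} of them appear more than once")
    return dupes

if __name__ == "__main__":
    count_duplicates()
```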
23:43 <JAA> I have no interest in archiving all YouTube annotations myself. This is a pretty big task, and Google isn't exactly known for liking automated access, i.e. bans come flying very quickly in my experience.
23:43 <JAA> In other words, warrior.
23:44 <JAA> Also, odemg may have a good list of video IDs from the metadata archival project.
23:44 <Flashfire> Ivan pushes a TB of videos to a google drive a day but google produces EB so it's not truly viable as such
23:45 <Flashfire> Sorry YouTube in the form of google pushes EB
23:45 <JAA> Not an EB per day though, more like an EB in total.
23:46 <ivan> it's more than an EB in total
23:46 <Flashfire> That's still not viable to grab all of. Something like 1000
23:46 <JAA> Around that order of magnitude at least.
23:46 <Flashfire> hours in a minute is uploaded
23:47 <Flashfire> Or some obscure number
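A back-of-envelope check on the figures above, taking Flashfire's "something like 1000 hours per minute" at face value and assuming roughly 1 GB per hour of stored video; the per-hour size is purely an assumption, so read the result as an order of magnitude only.

```python
# Rough order-of-magnitude estimate, not a measurement.
HOURS_PER_MINUTE = 1000   # claimed upload rate from the discussion above
GB_PER_HOUR = 1.0         # assumed average storage per hour of video

gb_per_day = HOURS_PER_MINUTE * 60 * 24 * GB_PER_HOUR   # ~1.44e6 GB, i.e. ~1.4 PB/day
eb_per_year = gb_per_day * 365 / 1e9                    # ~0.5 EB/year

print(f"{gb_per_day / 1e6:.1f} PB/day, {eb_per_year:.2f} EB/year")
```

At that rate a total archive in the exabyte range accumulates within a few years, which is consistent with the "around an EB or more in total" estimates in the discussion.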
23:47 <JAA> Nobody suggested grabbing all of YouTube anyway. That won't ever happen.
23:48 <JAA> Unless it fades away into obscurity yet somehow survives for a few more decades until an EB of storage capacity is a laughable amount.
23:49 * Flashfire chuckles in futuristic