[00:07] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[00:08] *** dashcloud has joined #archiveteam-ot
[00:56] *** BlueMax has joined #archiveteam-ot
[01:36] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[01:38] *** dashcloud has joined #archiveteam-ot
[02:00] *** Sanqui has quit IRC (Ping timeout: 260 seconds)
[02:06] *** Sanqui has joined #archiveteam-ot
[02:06] *** svchfoo1 sets mode: +o Sanqui
[02:11] *** Sanqui has quit IRC (Read error: Operation timed out)
[02:23] *** Sanqui has joined #archiveteam-ot
[02:24] *** svchfoo1 sets mode: +o Sanqui
[03:22] *** Sanqui has quit IRC (Ping timeout: 260 seconds)
[03:23] *** Sanqui has joined #archiveteam-ot
[03:24] *** svchfoo1 sets mode: +o Sanqui
[03:43] *** wp494 has quit IRC (Ping timeout: 260 seconds)
[03:44] *** wp494 has joined #archiveteam-ot
[03:44] *** svchfoo1 sets mode: +o wp494
[04:22] *** odemg has quit IRC (Ping timeout: 265 seconds)
[04:34] *** odemg has joined #archiveteam-ot
[05:11] *** adinbied has joined #archiveteam-ot
[05:25] *** adinbied has quit IRC (Left Channel.)
[06:24] *** adinbied has joined #archiveteam-ot
[06:59] *** icedice has quit IRC (Quit: Leaving)
[07:04] *** Mateon1 has quit IRC (Remote host closed the connection)
[07:04] *** Mateon1 has joined #archiveteam-ot
[07:08] *** jspiros has quit IRC (Remote host closed the connection)
[07:08] *** swebb has quit IRC (Ping timeout: 240 seconds)
[07:08] *** svchfoo1 has quit IRC (Ping timeout: 240 seconds)
[07:09] *** swebb has joined #archiveteam-ot
[07:10] *** nightpoo- has quit IRC (Ping timeout: 246 seconds)
[07:11] *** JAA has quit IRC (Ping timeout: 246 seconds)
[07:16] *** godane has quit IRC (Ping timeout: 492 seconds)
[07:17] *** nightpool has joined #archiveteam-ot
[07:25] *** godane has joined #archiveteam-ot
[07:55] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[08:03] *** BlueMax has joined #archiveteam-ot
[08:10] *** JAA has joined #archiveteam-ot
[08:10] *** svchfoo1 has joined #archiveteam-ot
[08:11] *** jspiros has joined #archiveteam-ot
[08:11] *** bakJAA sets mode: +o JAA
[08:11] *** svchfoo3 sets mode: +o JAA
[08:12] *** svchfoo3 sets mode: +o svchfoo1
[08:29] *** jspiros has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:29] *** svchfoo1 has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:29] *** JAA has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:40] *** jspiros has joined #archiveteam-ot
[08:40] *** svchfoo1 has joined #archiveteam-ot
[08:40] *** JAA has joined #archiveteam-ot
[08:40] *** irc.colosolutions.net sets mode: +oo svchfoo1 JAA
[08:41] *** JAA sets mode: +o bakJAA
[08:41] *** schbirid has joined #archiveteam-ot
[08:41] *** bakJAA sets mode: +o JAA
[09:36] *** alex___ has joined #archiveteam-ot
[09:40] *** LFlare43 has quit IRC (Quit: The Lounge - https://thelounge.chat)
[10:35] *** BlueMax has quit IRC (Remote host closed the connection)
[10:37] *** BlueMax has joined #archiveteam-ot
[11:54] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[12:51] *** wp494 has quit IRC (Read error: Operation timed out)
[12:51] *** wp494 has joined #archiveteam-ot
[12:51] *** svchfoo3 sets mode: +o wp494
[13:17] *** VerifiedJ has joined #archiveteam-ot
[13:18] *** hook54321 has quit IRC (Quit: Connection closed for inactivity)
[13:36] *** wmvhater has quit IRC (Read error: Operation timed out)
[13:38] *** kiska1 has quit IRC (Ping timeout (120 seconds))
[13:38] *** wmvhater has joined #archiveteam-ot
[13:39] *** kiska1 has joined #archiveteam-ot
[13:41] *** wmvhater has quit IRC (Client Quit)
[13:42] *** wmvhater has joined #archiveteam-ot
[13:43] *** kiska1 has quit IRC (Client Quit)
[13:43] *** kiska1 has joined #archiveteam-ot
[13:46] *** wmvhater has quit IRC (Read error: Operation timed out)
[13:49] *** wmvhater has joined #archiveteam-ot
[13:51] *** godane has quit IRC (Ping timeout: 265 seconds)
[15:15] *** adinbied has quit IRC (Quit: Left Channel.)
[15:36] *** adinbied has joined #archiveteam-ot
[15:36] *** schbirid has quit IRC (Remote host closed the connection)
[15:47] *** schbirid has joined #archiveteam-ot
[16:33] *** alex___ has quit IRC (alex___)
[16:36] *** alex___ has joined #archiveteam-ot
[19:37] *** icedice has joined #archiveteam-ot
[19:39] *** ola_norsk has joined #archiveteam-ot
[19:40] <ola_norsk> Happy New Year!
[19:45] <ola_norsk> Without harping on too much on the YouTube Annotations issue; Would anyone happen to have a good idea to get all video id's by January 15th 2019..that doesn't involve scrapy?
[19:46] <ola_norsk> I bet i can pull the annotations, the hard part is figuring out all the .ID's
[19:49] <ola_norsk> youtube data API seems to have some sort of points costs to it, and i'm not paying to unfuck youtubes fuckups
[19:51] <Kaz> just iterate through them
[19:51] <ola_norsk> 'them' who ? , i have 4Mbit connection :/
[19:51] *** alex___ has quit IRC (Read error: Operation timed out)
[19:52] <ola_norsk> If i had a table of all the links of every video (or video id) i could iterate
[19:53] <ola_norsk> there's no way i'd be able to scrape every ID off of youtube singlehandedly in 1.25 month
[19:53] <Kaz> correct
[19:54] <ola_norsk> eientei95: you see that, "correct" .. it's why i'm here
[19:54] * ola_norsk is no 1337 haxor
[19:55] <Kaz> you're going to have to brute-force your way through if you want everything
[19:55] <Kaz> otherwise just search google/facebook/twitter/reddit/whatever
[19:55] <ola_norsk> Kaz: "...stop being brainless"
[19:57] *** JAA has quit IRC (Read error: Operation timed out)
[19:57] *** alex___ has joined #archiveteam-ot
[19:57] <Kaz> happy to hear any smart ideas you've come up with for this
[19:57] <Kaz> Google's not going to give you a list
[19:58] <Kaz> Lots of videos are just going to be unlisted anyway, so won't show in searches etc
[19:58] <ola_norsk> Kaz: i'm looking at Scrapy, that's the extent of my smart
[19:58] <Kaz> As, as I said, your options are a) searching whatever you can to scrape youtube.com and youtu.be links
[19:58] <Kaz> or b) bruteforcing the list
[19:59] *** jspiros has quit IRC (Read error: Operation timed out)
[20:00] *** adinbied has quit IRC (Read error: Operation timed out)
[20:00] *** svchfoo1 has quit IRC (Ping timeout: 246 seconds)
[20:01] *** adinbied has joined #archiveteam-ot
[20:11] <ola_norsk> b seems to be the quicker option then
[20:14] <cf> ola_norsk: I've already started scraping and grabbing annotation xmls
[20:15] <cf> started with all submissions to reddit since that data is easily available
[20:15] <cf> got 20M ids or so
[20:15] <cf> going to do that and also grab from the top hundred thousand channels or so
[20:15] <cf> maybe see if I can get a primitive spider crawling for more ids but not super invested
[20:23] <ola_norsk> cf: awesome. Though instead of the top hundred thousand channels, maybe it would be better to proiritize by category? e.g News, History, Technology, first, then e.g Humour last?
[20:24] <ola_norsk> cf: but yeah, it's hard to care, really..when it's about time YouTube took a dive from their monopoly pedestal
[20:25] <cf> could prioritize by category, but not sure how easy it is to filter by something like that. will take a look tho
[20:26] <ola_norsk> what format do have the ids save as?
[20:27] <cf> just the 11 chars after ?v= in the url. [0-9A-Za-z_-]{11}
[20:27] <ola_norsk> if you can share them, feel free to do so
[20:27] <cf> http://files.ulim.it/all_ids.txt
[20:28] <cf> i'm sure there's things that just matched the regex despite not being a real id (the first line is an example)
[20:28] <cf> but spot checking seems to show it being pretty good
[20:28] <cf> and again, that's all of the ids I extracted from reddit submission data
[20:30] <ola_norsk> still more awesomer than me _trying to_  reinventing the wheel, so thanks!
[20:33] <ola_norsk> could git be used to commit patches to the list in the future you think?
[20:34] <ola_norsk> it's quite a fuckload of textfile there :D
[20:36] <ola_norsk> or mysql, to insert new id's into perhaps?
[20:39] <cf> rdms's tend to choke when you're just using them to store a list of distinct items. probably still a bit better than a text file but not by much. not sure how git would fare
[20:39] <cf> *rdbms's
[20:40] <ola_norsk> it took me over 2 minutes just to 'cat' trought the list
[20:41] <ola_norsk> do you happen to have a record of the ones you've already pulled the annotations from?
[20:41] *** godane has joined #archiveteam-ot
[20:41] <ola_norsk> that way i could start on the ones you've not yet done
[20:42] <cf> just have a script working its way down the list
[20:42] <cf> probably the first 100k or so
[20:42] <ola_norsk> ok
[20:42] <cf> but its parallelized so its going to be out of order
[20:42] <ola_norsk> if i reverse the list and work upward?
[20:44] <ola_norsk> a bit of duplicate is impossible to prevent i guess, with 'tubeup' already having done a lot
[20:45] <ola_norsk> MirrorTube, i mean
[20:45] <cf> sure yeah if you want to grab stuff yourself, we should meet somewhere in the middle
[20:47] <ola_norsk> you expect you'll be able to pull it off that would be great.
[20:47] <ola_norsk> if*
[20:48] <ola_norsk> if not, i'd be happy to give it a go
[20:51] <ola_norsk> how do you name the annotations files btw? by just id, or?
[20:51] <ola_norsk> <id>.xml ?
[20:52] <cf> yeah id.xml
[20:52] <ola_norsk> ok
[20:52] <Kaz> (get warcs too please)
[20:53] <Kaz> actually scratch that, I'll chuck the list into archivebot
[20:53] <ola_norsk> wpull is JAA's territory, i'm just a grabsite n00b :D
[20:53] *** Mateon1 has quit IRC (Ping timeout: 252 seconds)
[20:54] *** Mateon1 has joined #archiveteam-ot
[20:54] <Kaz> if you're chucking it into grab-site it'll probably generate a warc, no?
[20:54] <ola_norsk> that, and then some
[20:56] <ola_norsk> maybe 'youtube' ignoreset would help
[20:56] <ola_norsk> idk, i had a forum generating 10+GB...
[20:58] <ola_norsk> (https://archive.org/details/WARC_www_subsim_com-radioroom-2018-09-07-89abc154_01) ..and to ~7 (i think)
[20:59] <ola_norsk> i used login cookie, so it might contain some software and mods..but had to manually cancel it since it never stopped
[20:59] *** svchfoo1 has joined #archiveteam-ot
[20:59] *** JAA has joined #archiveteam-ot
[20:59] *** svchfoo3 sets mode: +o JAA
[21:00] *** bakJAA sets mode: +o JAA
[21:00] *** svchfoo3 sets mode: +o svchfoo1
[21:03] *** jspiros has joined #archiveteam-ot
[21:07] <ola_norsk> Kaz: youtube-dl can pull comments i think
[21:07] <ola_norsk> as info json
[21:08] <Kaz> eh
[21:08] <Kaz> isn't this about annotations?
[21:09] <ola_norsk> <@Kaz> (get warcs too please)
[21:09] <Kaz> ..continue
[21:09] <ola_norsk> you said that, i didnt't :D
[21:09] <Kaz> yes
[21:09] <Kaz> how did comments come into this
[21:10] <Kaz> if we're going about quoting things..
[21:10] <Kaz> <@Kaz> isn't this about annotations?
[21:10] <ola_norsk> because grabbing comments is doable, while warcing every */v/<id> i don't think is :/
[21:11] <Kaz> wha
[21:11] <Kaz> drop making jumps
[21:11] <Kaz> you're working from https://www.youtube.com/annotations_invideo?features=1&legacy=1&video_id=<video_id>
[21:11] <Kaz> correct?
[21:11] <ola_norsk> warc would mean the video is included, correct?
[21:11] <Kaz> no
[21:12] <ola_norsk> ok
[21:12] <Kaz> if you're grabbing the annotations xml.. get a warc of it
[21:12] <Kaz> comments and videos themselves don't come into this
[21:12] <Kaz> we're sure as shit not grabbing the whole of youtube today
[21:12] <ola_norsk> aha! see, you're smarter than i look
[21:13] <Kaz> i think that's a compliment
[21:13] <ola_norsk> i look pretty sh*t, but yeah, it was meant to be
[21:14] <ola_norsk> anywho, i'll go with your idea
[21:14] *** ola_norsk has quit IRC (leaving)
[21:16] <ivan> I've never seen youtube-dl include comments
[21:22] *** BlueMax has joined #archiveteam-ot
[21:37] *** hook54321 has joined #archiveteam-ot
[21:43] *** wp494 has quit IRC (Ping timeout: 268 seconds)
[21:43] *** wp494 has joined #archiveteam-ot
[21:43] *** svchfoo3 sets mode: +o wp494
[22:07] *** ola_norsk has joined #archiveteam-ot
[22:07] <ola_norsk> cf: did you you check your all_ids.txt for duplicates?
[22:08] <ola_norsk> i mean, are all 200k of them unique?
[22:10] <ola_norsk> 200k+
[22:17] * ola_norsk sucks at grep :/ and came out with 40k+ lines when doing adding/transfering unique from-and-to a file from all_ids :/ http://paste.ubuntu.com/p/6hfm34VSDM/
[22:19] <ola_norsk> "grep -Fxq" to find
[22:20] <ola_norsk> used*
[22:21] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[22:21] *** JAA has quit IRC (Read error: Operation timed out)
[22:22] <ola_norsk> it's more than likely i messed up
[22:23] *** BlueMax has joined #archiveteam-ot
[22:24] *** svchfoo1 has quit IRC (Ping timeout: 246 seconds)
[22:26] *** jspiros has quit IRC (Read error: Operation timed out)
[22:38] *** BlueMax has quit IRC (Quit: Leaving)
[22:39] *** BlueMax has joined #archiveteam-ot
[22:44] *** godane has quit IRC (Read error: Operation timed out)
[22:51] <ola_norsk> i put the stuff here https://archive.org/details/Youtube_Video_IDs_CF_2018-12-01
[22:51] *** ola_norsk has quit IRC (Skål!)
[23:24] *** jspiros has joined #archiveteam-ot
[23:25] *** svchfoo1 has joined #archiveteam-ot
[23:26] *** svchfoo3 sets mode: +o svchfoo1
[23:26] *** JAA has joined #archiveteam-ot
[23:26] *** svchfoo1 sets mode: +o JAA
[23:27] *** bakJAA sets mode: +o JAA
[23:27] *** godane has joined #archiveteam-ot
[23:41] <JAA> (For log completeness, posted in -bs by mistake...)  ola_norsk: Extracting duplicates from a file: awk 'seen[$0]++' <file   This will print each duplicate (but not the first appearance). You could also do  sort <file | uniq -d  but I prefer the awk solution since it doesn't have to sort the file and is therefore *much* faster.
[23:43] <JAA> I have no interest in archiving all YouTube annotations myself. This is a pretty big task, and Google isn't exactly known for liking automated access, i.e. bans come flying very quickly in my experience.
[23:43] <JAA> In other words, warrior.
[23:44] <JAA> Also, odemg may have a good list of video IDs from the metadata archival project.
[23:44] <Flashfire> Ivan pushes a TB of videos to a google drive a day but google produces EB so it’s not truly viable as such
[23:45] <Flashfire> Sorry YouTube in the form of google pushes EB
[23:45] <JAA> Not an EB per day though, more like an EB in total.
[23:46] <ivan> it's more than an EB in total
[23:46] <Flashfire> That’s still not viable to grab all of. Something like 1000
[23:46] <JAA> Around that order of magnitude at least.
[23:46] <Flashfire> hours in a minute is iploaded
[23:47] <Flashfire> Or some obscure number 
[23:47] <JAA> Nobody suggested grabbing all of YouTube anyway. That won't ever happen.
[23:48] <JAA> Unless it fades away into obscurity yet somehow survives for a few more decades until an EB of storage capacity is a laughable amount.
[23:49] * Flashfire chuckles in futuristic