[00:07] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[00:08] *** dashcloud has joined #archiveteam-ot
[00:56] *** BlueMax has joined #archiveteam-ot
[01:36] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[01:38] *** dashcloud has joined #archiveteam-ot
[02:00] *** Sanqui has quit IRC (Ping timeout: 260 seconds)
[02:06] *** Sanqui has joined #archiveteam-ot
[02:06] *** svchfoo1 sets mode: +o Sanqui
[02:11] *** Sanqui has quit IRC (Read error: Operation timed out)
[02:23] *** Sanqui has joined #archiveteam-ot
[02:24] *** svchfoo1 sets mode: +o Sanqui
[03:22] *** Sanqui has quit IRC (Ping timeout: 260 seconds)
[03:23] *** Sanqui has joined #archiveteam-ot
[03:24] *** svchfoo1 sets mode: +o Sanqui
[03:43] *** wp494 has quit IRC (Ping timeout: 260 seconds)
[03:44] *** wp494 has joined #archiveteam-ot
[03:44] *** svchfoo1 sets mode: +o wp494
[04:22] *** odemg has quit IRC (Ping timeout: 265 seconds)
[04:34] *** odemg has joined #archiveteam-ot
[05:11] *** adinbied has joined #archiveteam-ot
[05:25] *** adinbied has quit IRC (Left Channel.)
[06:24] *** adinbied has joined #archiveteam-ot
[06:59] *** icedice has quit IRC (Quit: Leaving)
[07:04] *** Mateon1 has quit IRC (Remote host closed the connection)
[07:04] *** Mateon1 has joined #archiveteam-ot
[07:08] *** jspiros has quit IRC (Remote host closed the connection)
[07:08] *** swebb has quit IRC (Ping timeout: 240 seconds)
[07:08] *** svchfoo1 has quit IRC (Ping timeout: 240 seconds)
[07:09] *** swebb has joined #archiveteam-ot
[07:10] *** nightpoo- has quit IRC (Ping timeout: 246 seconds)
[07:11] *** JAA has quit IRC (Ping timeout: 246 seconds)
[07:16] *** godane has quit IRC (Ping timeout: 492 seconds)
[07:17] *** nightpool has joined #archiveteam-ot
[07:25] *** godane has joined #archiveteam-ot
[07:55] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[08:03] *** BlueMax has joined #archiveteam-ot
[08:10] *** JAA has joined #archiveteam-ot
[08:10] *** svchfoo1 has joined #archiveteam-ot
[08:11] *** jspiros has joined #archiveteam-ot
[08:11] *** bakJAA sets mode: +o JAA
[08:11] *** svchfoo3 sets mode: +o JAA
[08:12] *** svchfoo3 sets mode: +o svchfoo1
[08:29] *** jspiros has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:29] *** svchfoo1 has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:29] *** JAA has quit IRC (hub.efnet.us irc.colosolutions.net)
[08:40] *** jspiros has joined #archiveteam-ot
[08:40] *** svchfoo1 has joined #archiveteam-ot
[08:40] *** JAA has joined #archiveteam-ot
[08:40] *** irc.colosolutions.net sets mode: +oo svchfoo1 JAA
[08:41] *** JAA sets mode: +o bakJAA
[08:41] *** schbirid has joined #archiveteam-ot
[08:41] *** bakJAA sets mode: +o JAA
[09:36] *** alex___ has joined #archiveteam-ot
[09:40] *** LFlare43 has quit IRC (Quit: The Lounge - https://thelounge.chat)
[10:35] *** BlueMax has quit IRC (Remote host closed the connection)
[10:37] *** BlueMax has joined #archiveteam-ot
[11:54] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[12:51] *** wp494 has quit IRC (Read error: Operation timed out)
[12:51] *** wp494 has joined #archiveteam-ot
[12:51] *** svchfoo3 sets mode: +o wp494
[13:17] *** VerifiedJ has joined #archiveteam-ot
[13:18] *** hook54321 has quit IRC (Quit: Connection closed for inactivity)
[13:36] *** wmvhater has quit IRC (Read error: Operation timed out)
[13:38] *** kiska1 has quit IRC (Ping timeout (120 seconds))
[13:38] *** wmvhater has joined #archiveteam-ot
[13:39] *** kiska1 has joined #archiveteam-ot
[13:41] *** wmvhater has quit IRC (Client Quit)
[13:42] *** wmvhater has joined #archiveteam-ot
[13:43] *** kiska1 has quit IRC (Client Quit)
[13:43] *** kiska1 has joined #archiveteam-ot
[13:46] *** wmvhater has quit IRC (Read error: Operation timed out)
[13:49] *** wmvhater has joined #archiveteam-ot
[13:51] *** godane has quit IRC (Ping timeout: 265 seconds)
[15:15] *** adinbied has quit IRC (Quit: Left Channel.)
[15:36] *** adinbied has joined #archiveteam-ot
[15:36] *** schbirid has quit IRC (Remote host closed the connection)
[15:47] *** schbirid has joined #archiveteam-ot
[16:33] *** alex___ has quit IRC (alex___)
[16:36] *** alex___ has joined #archiveteam-ot
[19:37] *** icedice has joined #archiveteam-ot
[19:39] *** ola_norsk has joined #archiveteam-ot
[19:40] Happy New Year!
[19:45] Without harping on too much about the YouTube annotations issue: would anyone happen to have a good idea for getting all video IDs by January 15th, 2019... one that doesn't involve Scrapy?
[19:46] I bet I can pull the annotations; the hard part is figuring out all the IDs
[19:49] the YouTube Data API seems to have some sort of points cost to it, and I'm not paying to unfuck YouTube's fuckups
[19:51] just iterate through them
[19:51] 'them' who? I have a 4 Mbit connection :/
[19:51] *** alex___ has quit IRC (Read error: Operation timed out)
[19:52] If I had a table of every video link (or video ID), I could iterate
[19:53] there's no way I'd be able to scrape every ID off of YouTube singlehandedly in 1.25 months
[19:53] correct
[19:54] eientei95: you see that, "correct"... it's why I'm here
[19:54] * ola_norsk is no 1337 haxor
[19:55] you're going to have to brute-force your way through if you want everything
[19:55] otherwise just search google/facebook/twitter/reddit/whatever
[19:55] Kaz: "...stop being brainless"
[19:57] *** JAA has quit IRC (Read error: Operation timed out)
[19:57] *** alex___ has joined #archiveteam-ot
[19:57] happy to hear any smart ideas you've come up with for this
[19:57] Google's not going to give you a list
[19:58] Lots of videos are just going to be unlisted anyway, so they won't show in searches etc.
[19:58] Kaz: I'm looking at Scrapy; that's the extent of my smarts
[19:58] So, as I said, your options are a) searching whatever you can to scrape youtube.com and youtu.be links
[19:58] or b) brute-forcing the list
[19:59] *** jspiros has quit IRC (Read error: Operation timed out)
[20:00] *** adinbied has quit IRC (Read error: Operation timed out)
[20:00] *** svchfoo1 has quit IRC (Ping timeout: 246 seconds)
[20:01] *** adinbied has joined #archiveteam-ot
[20:11] b seems to be the quicker option then
[20:14] ola_norsk: I've already started scraping and grabbing annotation XMLs
[20:15] started with all submissions to reddit since that data is easily available
[20:15] got 20M IDs or so
[20:15] going to do that and also grab from the top hundred thousand channels or so
[20:15] maybe see if I can get a primitive spider crawling for more IDs, but not super invested
[20:23] cf: awesome. Though instead of the top hundred thousand channels, maybe it would be better to prioritize by category? e.g. News, History, Technology first, then e.g. Humour last?
[20:24] cf: but yeah, it's hard to care, really... when it's about time YouTube took a dive from their monopoly pedestal
[20:25] could prioritize by category, but not sure how easy it is to filter by something like that. will take a look tho
[20:26] what format do you have the IDs saved as?
[20:27] just the 11 chars after ?v= in the URL: [0-9A-Za-z_-]{11}
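(A hedged sketch of the kind of extraction cf describes, in shell. submissions.txt stands in for whatever text dump is being mined, reddit submission data in cf's case, and as cf notes just below, the pattern will also catch strings that merely look like IDs.)

    # Pull watch?v= and youtu.be links out of a text dump, keep only the
    # 11-character ID, and collapse duplicates with sort -u.
    grep -oE '[?&]v=[0-9A-Za-z_-]{11}|youtu\.be/[0-9A-Za-z_-]{11}' submissions.txt \
        | grep -oE '[0-9A-Za-z_-]{11}$' \
        | sort -u > all_ids.txt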
[20:27] if you can share them, feel free to do so
[20:27] http://files.ulim.it/all_ids.txt
[20:28] I'm sure there are things that just matched the regex despite not being a real ID (the first line is an example)
[20:28] but spot-checking seems to show it being pretty good
[20:28] and again, that's all of the IDs I extracted from reddit submission data
[20:30] still more awesomer than me _trying to_ reinvent the wheel, so thanks!
[20:33] could git be used to commit patches to the list in the future, you think?
[20:34] it's quite a fuckload of text file there :D
[20:36] or MySQL, to insert new IDs into, perhaps?
[20:39] RDBMSes tend to choke when you're just using them to store a list of distinct items. probably still a bit better than a text file, but not by much. not sure how git would fare
[20:40] it took me over 2 minutes just to 'cat' through the list
[20:41] do you happen to have a record of the ones you've already pulled the annotations from?
[20:41] *** godane has joined #archiveteam-ot
[20:41] that way I could start on the ones you've not yet done
[20:42] just have a script working its way down the list
[20:42] probably the first 100k or so
[20:42] ok
[20:42] but it's parallelized, so it's going to be out of order
[20:42] what if I reverse the list and work upward?
[20:44] a bit of duplication is impossible to prevent, I guess, with 'tubeup' already having done a lot
[20:45] MirrorTube, I mean
[20:45] sure, yeah, if you want to grab stuff yourself, we should meet somewhere in the middle
[20:47] if you expect you'll be able to pull it off, that would be great
[20:48] if not, I'd be happy to give it a go
[20:51] how do you name the annotation files btw? by just ID, or?
[20:51] .xml ?
[20:52] yeah, id.xml
[20:52] ok
[20:52] (get warcs too please)
[20:53] actually scratch that, I'll chuck the list into archivebot
[20:53] wpull is JAA's territory, I'm just a grab-site n00b :D
[20:53] *** Mateon1 has quit IRC (Ping timeout: 252 seconds)
[20:54] *** Mateon1 has joined #archiveteam-ot
[20:54] if you're chucking it into grab-site, it'll probably generate a warc, no?
[20:54] that, and then some
[20:56] maybe the 'youtube' ignore set would help
[20:56] idk, I had a forum generating 10+ GB...
[20:58] (https://archive.org/details/WARC_www_subsim_com-radioroom-2018-09-07-89abc154_01) ...and to ~7 (I think)
[20:59] I used a login cookie, so it might contain some software and mods... but I had to manually cancel it since it never stopped
[20:59] *** svchfoo1 has joined #archiveteam-ot
[20:59] *** JAA has joined #archiveteam-ot
[20:59] *** svchfoo3 sets mode: +o JAA
[21:00] *** bakJAA sets mode: +o JAA
[21:00] *** svchfoo3 sets mode: +o svchfoo1
[21:03] *** jspiros has joined #archiveteam-ot
[21:07] Kaz: youtube-dl can pull comments, I think
[21:07] as info JSON
[21:08] eh
[21:08] isn't this about annotations?
[21:09] <@Kaz> (get warcs too please)
[21:09] ..continue
[21:09] you said that, I didn't :D
[21:09] yes
[21:09] how did comments come into this
[21:10] if we're going about quoting things..
[21:10] <@Kaz> isn't this about annotations?
[21:10] because grabbing comments is doable, while warcing every */v/ I don't think is :/
[21:11] wha
[21:11] stop making jumps
[21:11] you're working from https://www.youtube.com/annotations_invideo?features=1&legacy=1&video_id=
[21:11] correct?
[21:11] a warc would mean the video is included, correct?
[21:11] no
[21:12] ok
[21:12] if you're grabbing the annotations XML... get a warc of it
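(A minimal sketch of that workflow, not cf's actual script: it saves each annotation file as id.xml, the naming cf confirms above, and walks the list bottom-up with tac so two workers can meet in the middle, as ola_norsk proposes. The skip-if-present check and one-second delay are assumptions; plain curl produces no WARC, which is why Kaz chucks the list into archivebot instead.)

    # Fetch the legacy annotations XML for every ID, bottom of the list first.
    tac all_ids.txt | while read -r id; do
        [ -s "$id.xml" ] && continue    # already grabbed, skip it
        curl -s -o "$id.xml" \
            "https://www.youtube.com/annotations_invideo?features=1&legacy=1&video_id=$id"
        sleep 1                         # crude politeness delay; bans come quickly
    done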
[21:12] comments and the videos themselves don't come into this
[21:12] we're sure as shit not grabbing the whole of YouTube today
[21:12] aha! see, you're smarter than I look
[21:13] I think that's a compliment
[21:13] I look pretty sh*t, but yeah, it was meant to be
[21:14] anywho, I'll go with your idea
[21:14] *** ola_norsk has quit IRC (leaving)
[21:16] I've never seen youtube-dl include comments
[21:22] *** BlueMax has joined #archiveteam-ot
[21:37] *** hook54321 has joined #archiveteam-ot
[21:43] *** wp494 has quit IRC (Ping timeout: 268 seconds)
[21:43] *** wp494 has joined #archiveteam-ot
[21:43] *** svchfoo3 sets mode: +o wp494
[22:07] *** ola_norsk has joined #archiveteam-ot
[22:07] cf: did you check your all_ids.txt for duplicates?
[22:08] I mean, are all 200k of them unique?
[22:10] 200k+
[22:17] * ola_norsk sucks at grep :/ and came out with 40k+ lines when transferring the unique IDs from all_ids to another file :/ http://paste.ubuntu.com/p/6hfm34VSDM/
[22:19] "grep -Fxq" used
[22:21] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[22:21] *** JAA has quit IRC (Read error: Operation timed out)
[22:22] it's more than likely I messed up
[22:23] *** BlueMax has joined #archiveteam-ot
[22:24] *** svchfoo1 has quit IRC (Ping timeout: 246 seconds)
[22:26] *** jspiros has quit IRC (Read error: Operation timed out)
[22:38] *** BlueMax has quit IRC (Quit: Leaving)
[22:39] *** BlueMax has joined #archiveteam-ot
[22:44] *** godane has quit IRC (Read error: Operation timed out)
[22:51] I put the stuff here: https://archive.org/details/Youtube_Video_IDs_CF_2018-12-01
[22:51] *** ola_norsk has quit IRC (Skål!)
[23:24] *** jspiros has joined #archiveteam-ot
[23:25] *** svchfoo1 has joined #archiveteam-ot
[23:26] *** svchfoo3 sets mode: +o svchfoo1
[23:26] *** JAA has joined #archiveteam-ot
[23:26] *** svchfoo1 sets mode: +o JAA
[23:27] *** bakJAA sets mode: +o JAA
[23:27] *** godane has joined #archiveteam-ot
[23:41] (For log completeness, posted in -bs by mistake...) ola_norsk: Extracting duplicates from a file: awk 'seen[$0]++'. I have no interest in archiving all YouTube annotations myself. This is a pretty big task, and Google isn't exactly known for liking automated access, i.e. bans come flying very quickly in my experience.
[23:43] In other words, warrior.
[23:44] Also, odemg may have a good list of video IDs from the metadata archival project.
[23:44] Ivan pushes a TB of videos to a Google Drive a day, but Google produces EB, so it's not truly viable as such
[23:45] Sorry, YouTube in the form of Google pushes EB
[23:45] Not an EB per day though, more like an EB in total.
[23:46] it's more than an EB in total
[23:46] That's still not viable to grab all of. Something like 1000
[23:46] Around that order of magnitude at least.
[23:46] hours in a minute is uploaded
[23:47] Or some obscure number
[23:47] Nobody suggested grabbing all of YouTube anyway. That won't ever happen.
[23:48] Unless it fades away into obscurity yet somehow survives for a few more decades, until an EB of storage capacity is a laughable amount.
[23:49] * Flashfire chuckles in futuristic
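(Spelling out the deduplication one-liners from the exchange above; unique_ids.txt is a made-up output name.)

    awk 'seen[$0]++' all_ids.txt            # JAA's one-liner: print each line that already appeared
    sort -u all_ids.txt > unique_ids.txt    # the unique set itself
    # Looping "grep -Fxq" over the same file, as in ola_norsk's paste, rescans
    # all 200k+ lines once per ID -- quadratic, hence slow and easy to get wrong.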
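(For scale, the arithmetic behind that last exchange, taking the chat's rough figures at face value: 1000 hours uploaded per minute is about 526 million hours per year, and at an assumed ~1 GB per stored hour that is on the order of 0.5 EB of new material annually. A total well past one exabyte is therefore plausible, and mirroring all of it is indeed out of reach.)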