#archiveteam-ot 2018-12-01,Sat


Time Nickname Message
00:07 πŸ”— dashcloud has quit IRC (Read error: Connection reset by peer)
00:08 πŸ”— dashcloud has joined #archiveteam-ot
00:56 πŸ”— BlueMax has joined #archiveteam-ot
01:36 πŸ”— dashcloud has quit IRC (Read error: Connection reset by peer)
01:38 πŸ”— dashcloud has joined #archiveteam-ot
02:00 πŸ”— Sanqui has quit IRC (Ping timeout: 260 seconds)
02:06 πŸ”— Sanqui has joined #archiveteam-ot
02:06 πŸ”— svchfoo1 sets mode: +o Sanqui
02:11 πŸ”— Sanqui has quit IRC (Read error: Operation timed out)
02:23 πŸ”— Sanqui has joined #archiveteam-ot
02:24 πŸ”— svchfoo1 sets mode: +o Sanqui
03:22 πŸ”— Sanqui has quit IRC (Ping timeout: 260 seconds)
03:23 πŸ”— Sanqui has joined #archiveteam-ot
03:24 πŸ”— svchfoo1 sets mode: +o Sanqui
03:43 πŸ”— wp494 has quit IRC (Ping timeout: 260 seconds)
03:44 πŸ”— wp494 has joined #archiveteam-ot
03:44 πŸ”— svchfoo1 sets mode: +o wp494
04:22 πŸ”— odemg has quit IRC (Ping timeout: 265 seconds)
04:34 πŸ”— odemg has joined #archiveteam-ot
05:11 πŸ”— adinbied has joined #archiveteam-ot
05:25 πŸ”— adinbied has quit IRC (Left Channel.)
06:24 πŸ”— adinbied has joined #archiveteam-ot
06:59 πŸ”— icedice has quit IRC (Quit: Leaving)
07:04 πŸ”— Mateon1 has quit IRC (Remote host closed the connection)
07:04 πŸ”— Mateon1 has joined #archiveteam-ot
07:08 πŸ”— jspiros has quit IRC (Remote host closed the connection)
07:08 πŸ”— swebb has quit IRC (Ping timeout: 240 seconds)
07:08 πŸ”— svchfoo1 has quit IRC (Ping timeout: 240 seconds)
07:09 πŸ”— swebb has joined #archiveteam-ot
07:10 πŸ”— nightpoo- has quit IRC (Ping timeout: 246 seconds)
07:11 πŸ”— JAA has quit IRC (Ping timeout: 246 seconds)
07:16 πŸ”— godane has quit IRC (Ping timeout: 492 seconds)
07:17 πŸ”— nightpool has joined #archiveteam-ot
07:25 πŸ”— godane has joined #archiveteam-ot
07:55 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
08:03 πŸ”— BlueMax has joined #archiveteam-ot
08:10 πŸ”— JAA has joined #archiveteam-ot
08:10 πŸ”— svchfoo1 has joined #archiveteam-ot
08:11 πŸ”— jspiros has joined #archiveteam-ot
08:11 πŸ”— bakJAA sets mode: +o JAA
08:11 πŸ”— svchfoo3 sets mode: +o JAA
08:12 πŸ”— svchfoo3 sets mode: +o svchfoo1
08:29 πŸ”— jspiros has quit IRC (hub.efnet.us irc.colosolutions.net)
08:29 πŸ”— svchfoo1 has quit IRC (hub.efnet.us irc.colosolutions.net)
08:29 πŸ”— JAA has quit IRC (hub.efnet.us irc.colosolutions.net)
08:40 πŸ”— jspiros has joined #archiveteam-ot
08:40 πŸ”— svchfoo1 has joined #archiveteam-ot
08:40 πŸ”— JAA has joined #archiveteam-ot
08:40 πŸ”— irc.colosolutions.net sets mode: +oo svchfoo1 JAA
08:41 πŸ”— JAA sets mode: +o bakJAA
08:41 πŸ”— schbirid has joined #archiveteam-ot
08:41 πŸ”— bakJAA sets mode: +o JAA
09:36 πŸ”— alex___ has joined #archiveteam-ot
09:40 πŸ”— LFlare43 has quit IRC (Quit: The Lounge - https://thelounge.chat)
10:35 πŸ”— BlueMax has quit IRC (Remote host closed the connection)
10:37 πŸ”— BlueMax has joined #archiveteam-ot
11:54 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
12:51 πŸ”— wp494 has quit IRC (Read error: Operation timed out)
12:51 πŸ”— wp494 has joined #archiveteam-ot
12:51 πŸ”— svchfoo3 sets mode: +o wp494
13:17 πŸ”— VerifiedJ has joined #archiveteam-ot
13:18 πŸ”— hook54321 has quit IRC (Quit: Connection closed for inactivity)
13:36 πŸ”— wmvhater has quit IRC (Read error: Operation timed out)
13:38 πŸ”— kiska1 has quit IRC (Ping timeout (120 seconds))
13:38 πŸ”— wmvhater has joined #archiveteam-ot
13:39 πŸ”— kiska1 has joined #archiveteam-ot
13:41 πŸ”— wmvhater has quit IRC (Client Quit)
13:42 πŸ”— wmvhater has joined #archiveteam-ot
13:43 πŸ”— kiska1 has quit IRC (Client Quit)
13:43 πŸ”— kiska1 has joined #archiveteam-ot
13:46 πŸ”— wmvhater has quit IRC (Read error: Operation timed out)
13:49 πŸ”— wmvhater has joined #archiveteam-ot
13:51 πŸ”— godane has quit IRC (Ping timeout: 265 seconds)
15:15 πŸ”— adinbied has quit IRC (Quit: Left Channel.)
15:36 πŸ”— adinbied has joined #archiveteam-ot
15:36 πŸ”— schbirid has quit IRC (Remote host closed the connection)
15:47 πŸ”— schbirid has joined #archiveteam-ot
16:33 πŸ”— alex___ has quit IRC (alex___)
16:36 πŸ”— alex___ has joined #archiveteam-ot
19:37 πŸ”— icedice has joined #archiveteam-ot
19:39 πŸ”— ola_norsk has joined #archiveteam-ot
19:40 πŸ”— ola_norsk Happy New Year!
19:45 πŸ”— ola_norsk Without harping on too much on the YouTube Annotations issue; Would anyone happen to have a good idea to get all video id's by January 15th 2019..that doesn't involve scrapy?
19:46 πŸ”— ola_norsk I bet i can pull the annotations, the hard part is figuring out all the .ID's
19:49 πŸ”— ola_norsk youtube data API seems to have some sort of points costs to it, and i'm not paying to unfuck youtubes fuckups
19:51 πŸ”— Kaz just iterate through them
19:51 πŸ”— ola_norsk 'them' who ? , i have 4Mbit connection :/
19:51 πŸ”— alex___ has quit IRC (Read error: Operation timed out)
19:52 πŸ”— ola_norsk If i had a table of all the links of every video (or video id) i could iterate
19:53 πŸ”— ola_norsk there's no way i'd be able to scrape every ID off of youtube singlehandedly in 1.25 month
19:53 πŸ”— Kaz correct
19:54 πŸ”— ola_norsk eientei95: you see that, "correct" .. it's why i'm here
19:54 πŸ”— * ola_norsk is no 1337 haxor
19:55 πŸ”— Kaz you're going to have to brute-force your way through if you want everything
19:55 πŸ”— Kaz otherwise just search google/facebook/twitter/reddit/whatever
19:55 πŸ”— ola_norsk Kaz: "...stop being brainless"
19:57 πŸ”— JAA has quit IRC (Read error: Operation timed out)
19:57 πŸ”— alex___ has joined #archiveteam-ot
19:57 πŸ”— Kaz happy to hear any smart ideas you've come up with for this
19:57 πŸ”— Kaz Google's not going to give you a list
19:58 πŸ”— Kaz Lots of videos are just going to be unlisted anyway, so won't show in searches etc
19:58 πŸ”— ola_norsk Kaz: i'm looking at Scrapy, that's the extent of my smart
19:58 πŸ”— Kaz As I said, your options are a) searching whatever you can to scrape youtube.com and youtu.be links
19:58 πŸ”— Kaz or b) bruteforcing the list
19:59 πŸ”— jspiros has quit IRC (Read error: Operation timed out)
20:00 πŸ”— adinbied has quit IRC (Read error: Operation timed out)
20:00 πŸ”— svchfoo1 has quit IRC (Ping timeout: 246 seconds)
20:01 πŸ”— adinbied has joined #archiveteam-ot
20:11 πŸ”— ola_norsk b seems to be the quicker option then
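For a sense of scale on option b: assuming the ID format cf quotes in this log (11 characters drawn from a 64-symbol alphabet), the sketch below counts the candidate IDs and estimates how long enumerating them would take. The request rate is a hypothetical figure chosen only for illustration, and the count is an upper bound, since not every matching string is necessarily a real video ID.

```python
# Rough size of the brute-force search space, assuming IDs are
# 11 characters from the 64-symbol alphabet [0-9A-Za-z_-].
ALPHABET_SIZE = 64
ID_LENGTH = 11
SECONDS_PER_YEAR = 365 * 24 * 3600


def id_space_size(alphabet_size=ALPHABET_SIZE, length=ID_LENGTH):
    """Number of distinct strings of the given length over the alphabet."""
    return alphabet_size ** length


def years_to_enumerate(requests_per_second, total=None):
    """Years needed to try every candidate at a constant request rate."""
    if total is None:
        total = id_space_size()
    return total / requests_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(id_space_size())  # 64**11 == 2**66
    # 1000 requests per second is a made-up rate, not a measured one.
    print(years_to_enumerate(1000))
```

Even at rates far above what a 4 Mbit connection allows, the estimate comes out in billions of years, which is why collecting IDs from existing link sources is the practical route.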
20:14 πŸ”— cf ola_norsk: I've already started scraping and grabbing annotation xmls
20:15 πŸ”— cf started with all submissions to reddit since that data is easily available
20:15 πŸ”— cf got 20M ids or so
20:15 πŸ”— cf going to do that and also grab from the top hundred thousand channels or so
20:15 πŸ”— cf maybe see if I can get a primitive spider crawling for more ids but not super invested
20:23 πŸ”— ola_norsk cf: awesome. Though instead of the top hundred thousand channels, maybe it would be better to prioritize by category? e.g News, History, Technology, first, then e.g Humour last?
20:24 πŸ”— ola_norsk cf: but yeah, it's hard to care, really..when it's about time YouTube took a dive from their monopoly pedestal
20:25 πŸ”— cf could prioritize by category, but not sure how easy it is to filter by something like that. will take a look tho
20:26 πŸ”— ola_norsk what format do you have the ids saved as?
20:27 πŸ”— cf just the 11 chars after ?v= in the url. [0-9A-Za-z_-]{11}
20:27 πŸ”— ola_norsk if you can share them, feel free to do so
20:27 πŸ”— cf http://files.ulim.it/all_ids.txt
20:28 πŸ”— cf i'm sure there's things that just matched the regex despite not being a real id (the first line is an example)
20:28 πŸ”— cf but spot checking seems to show it being pretty good
20:28 πŸ”— cf and again, that's all of the ids I extracted from reddit submission data
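The extraction cf describes can be sketched roughly as below. The 11-character pattern is the one quoted in the log; the exact URL forms matched (`?v=`, `&v=` and `youtu.be/`) and the function name are assumptions for illustration, not cf's actual script. As cf notes, a pattern like this will also match strings that are not real IDs.

```python
import re

# The character class and length come from the log; the surrounding
# URL prefixes are an assumed, simplified set of link forms.
ID_RE = re.compile(r"(?:[?&]v=|youtu\.be/)([0-9A-Za-z_-]{11})")


def extract_ids(text):
    """Return candidate video IDs in order of first appearance, deduplicated."""
    seen = set()
    ids = []
    for match in ID_RE.finditer(text):
        video_id = match.group(1)
        if video_id not in seen:
            seen.add(video_id)
            ids.append(video_id)
    return ids


if __name__ == "__main__":
    sample = (
        "https://www.youtube.com/watch?v=dQw4w9WgXcQ "
        "https://youtu.be/dQw4w9WgXcQ"
    )
    print(extract_ids(sample))
```

Deduplicating while extracting keeps the output list small enough to hold in memory only if the set of distinct IDs fits there; for a list of tens of millions of IDs that is an assumption worth checking first.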
20:30 πŸ”— ola_norsk still more awesomer than me _trying to_ reinvent the wheel, so thanks!
20:33 πŸ”— ola_norsk could git be used to commit patches to the list in the future you think?
20:34 πŸ”— ola_norsk it's quite a fuckload of textfile there :D
20:36 πŸ”— ola_norsk or mysql, to insert new id's into perhaps?
20:39 πŸ”— cf rdms's tend to choke when you're just using them to store a list of distinct items. probably still a bit better than a text file but not by much. not sure how git would fare
20:39 πŸ”— cf *rdbms's
20:40 πŸ”— ola_norsk it took me over 2 minutes just to 'cat' through the list
20:41 πŸ”— ola_norsk do you happen to have a record of the ones you've already pulled the annotations from?
20:41 πŸ”— godane has joined #archiveteam-ot
20:41 πŸ”— ola_norsk that way i could start on the ones you've not yet done
20:42 πŸ”— cf just have a script working its way down the list
20:42 πŸ”— cf probably the first 100k or so
20:42 πŸ”— ola_norsk ok
20:42 πŸ”— cf but its parallelized so its going to be out of order
20:42 πŸ”— ola_norsk if i reverse the list and work upward?
20:44 πŸ”— ola_norsk a bit of duplicate is impossible to prevent i guess, with 'tubeup' already having done a lot
20:45 πŸ”— ola_norsk MirrorTube, i mean
20:45 πŸ”— cf sure yeah if you want to grab stuff yourself, we should meet somewhere in the middle
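The plan to work from opposite ends of the list could look something like this. It assumes a record of finished IDs is available; cf mentions only a script working down the list, so the `done` set and the function name here are hypothetical.

```python
def remaining_from_end(ids, done):
    """Yield IDs from the end of the list toward the start, skipping finished ones.

    ids  -- the full list of video IDs, in file order
    done -- a set of IDs that have already been fetched (hypothetical record)
    """
    for video_id in reversed(ids):
        if video_id not in done:
            yield video_id


if __name__ == "__main__":
    all_ids = ["a", "b", "c", "d"]
    print(list(remaining_from_end(all_ids, done={"d", "b"})))
```

Because the other worker is parallelized and finishes IDs out of order, checking each ID against the record is safer than stopping at the first ID that has already been done.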
20:47 πŸ”— ola_norsk you expect you'll be able to pull it off that would be great.
20:47 πŸ”— ola_norsk if*
20:48 πŸ”— ola_norsk if not, i'd be happy to give it a go
20:51 πŸ”— ola_norsk how do you name the annotations files btw? by just id, or?
20:51 πŸ”— ola_norsk <id>.xml ?
20:52 πŸ”— cf yeah id.xml
20:52 πŸ”— ola_norsk ok
20:52 πŸ”— Kaz (get warcs too please)
20:53 πŸ”— Kaz actually scratch that, I'll chuck the list into archivebot
20:53 πŸ”— ola_norsk wpull is JAA's territory, i'm just a grabsite n00b :D
20:53 πŸ”— Mateon1 has quit IRC (Ping timeout: 252 seconds)
20:54 πŸ”— Mateon1 has joined #archiveteam-ot
20:54 πŸ”— Kaz if you're chucking it into grab-site it'll probably generate a warc, no?
20:54 πŸ”— ola_norsk that, and then some
20:56 πŸ”— ola_norsk maybe 'youtube' ignoreset would help
20:56 πŸ”— ola_norsk idk, i had a forum generating 10+GB...
20:58 πŸ”— ola_norsk (https://archive.org/details/WARC_www_subsim_com-radioroom-2018-09-07-89abc154_01) ..and to ~7 (i think)
20:59 πŸ”— ola_norsk i used login cookie, so it might contain some software and mods..but had to manually cancel it since it never stopped
20:59 πŸ”— svchfoo1 has joined #archiveteam-ot
20:59 πŸ”— JAA has joined #archiveteam-ot
20:59 πŸ”— svchfoo3 sets mode: +o JAA
21:00 πŸ”— bakJAA sets mode: +o JAA
21:00 πŸ”— svchfoo3 sets mode: +o svchfoo1
21:03 πŸ”— jspiros has joined #archiveteam-ot
21:07 πŸ”— ola_norsk Kaz: youtube-dl can pull comments i think
21:07 πŸ”— ola_norsk as info json
21:08 πŸ”— Kaz eh
21:08 πŸ”— Kaz isn't this about annotations?
21:09 πŸ”— ola_norsk <@Kaz> (get warcs too please)
21:09 πŸ”— Kaz ..continue
21:09 πŸ”— ola_norsk you said that, i didn't :D
21:09 πŸ”— Kaz yes
21:09 πŸ”— Kaz how did comments come into this
21:10 πŸ”— Kaz if we're going about quoting things..
21:10 πŸ”— Kaz <@Kaz> isn't this about annotations?
21:10 πŸ”— ola_norsk because grabbing comments is doable, while warcing every */v/<id> i don't think is :/
21:11 πŸ”— Kaz wha
21:11 πŸ”— Kaz drop making jumps
21:11 πŸ”— Kaz you're working from https://www.youtube.com/annotations_invideo?features=1&legacy=1&video_id=<video_id>
21:11 πŸ”— Kaz correct?
21:11 πŸ”— ola_norsk warc would mean the video is included, correct?
21:11 πŸ”— Kaz no
21:12 πŸ”— ola_norsk ok
21:12 πŸ”— Kaz if you're grabbing the annotations xml.. get a warc of it
21:12 πŸ”— Kaz comments and videos themselves don't come into this
21:12 πŸ”— Kaz we're sure as shit not grabbing the whole of youtube today
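A minimal sketch of how the per-video request could be built, using the endpoint and query parameters Kaz quotes and the `<id>.xml` naming cf confirms. The function names are made up for illustration, and the sketch only builds strings; it does not fetch anything or write a WARC.

```python
from urllib.parse import urlencode

# Endpoint and parameters as quoted in the log.
ANNOTATIONS_ENDPOINT = "https://www.youtube.com/annotations_invideo"


def annotation_url(video_id):
    """Build the annotations XML URL for one video ID."""
    query = urlencode({"features": 1, "legacy": 1, "video_id": video_id})
    return f"{ANNOTATIONS_ENDPOINT}?{query}"


def annotation_filename(video_id):
    """File name for the saved XML, following the id.xml convention."""
    return f"{video_id}.xml"


if __name__ == "__main__":
    print(annotation_url("dQw4w9WgXcQ"))
    print(annotation_filename("dQw4w9WgXcQ"))
```

A list of URLs generated this way is the kind of input that could be handed to a WARC-writing crawler, which is what Kaz asks for above.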
21:12 πŸ”— ola_norsk aha! see, you're smarter than i look
21:13 πŸ”— Kaz i think that's a compliment
21:13 πŸ”— ola_norsk i look pretty sh*t, but yeah, it was meant to be
21:14 πŸ”— ola_norsk anywho, i'll go with your idea
21:14 πŸ”— ola_norsk has quit IRC (leaving)
21:16 πŸ”— ivan I've never seen youtube-dl include comments
21:22 πŸ”— BlueMax has joined #archiveteam-ot
21:37 πŸ”— hook54321 has joined #archiveteam-ot
21:43 πŸ”— wp494 has quit IRC (Ping timeout: 268 seconds)
21:43 πŸ”— wp494 has joined #archiveteam-ot
21:43 πŸ”— svchfoo3 sets mode: +o wp494
22:07 πŸ”— ola_norsk has joined #archiveteam-ot
22:07 πŸ”— ola_norsk cf: did you check your all_ids.txt for duplicates?
22:08 πŸ”— ola_norsk i mean, are all 200k of them unique?
22:10 πŸ”— ola_norsk 200k+
22:17 πŸ”— * ola_norsk sucks at grep :/ and came out with 40k+ lines when adding/transferring unique from-and-to a file from all_ids :/ http://paste.ubuntu.com/p/6hfm34VSDM/
22:19 πŸ”— ola_norsk "grep -Fxq" to find
22:20 πŸ”— ola_norsk used*
22:21 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
22:21 πŸ”— JAA has quit IRC (Read error: Operation timed out)
22:22 πŸ”— ola_norsk it's more than likely i messed up
22:23 πŸ”— BlueMax has joined #archiveteam-ot
22:24 πŸ”— svchfoo1 has quit IRC (Ping timeout: 246 seconds)
22:26 πŸ”— jspiros has quit IRC (Read error: Operation timed out)
22:38 πŸ”— BlueMax has quit IRC (Quit: Leaving)
22:39 πŸ”— BlueMax has joined #archiveteam-ot
22:44 πŸ”— godane has quit IRC (Read error: Operation timed out)
22:51 πŸ”— ola_norsk i put the stuff here https://archive.org/details/Youtube_Video_IDs_CF_2018-12-01
22:51 πŸ”— ola_norsk has quit IRC (Skål!)
23:24 πŸ”— jspiros has joined #archiveteam-ot
23:25 πŸ”— svchfoo1 has joined #archiveteam-ot
23:26 πŸ”— svchfoo3 sets mode: +o svchfoo1
23:26 πŸ”— JAA has joined #archiveteam-ot
23:26 πŸ”— svchfoo1 sets mode: +o JAA
23:27 πŸ”— bakJAA sets mode: +o JAA
23:27 πŸ”— godane has joined #archiveteam-ot
23:41 πŸ”— JAA (For log completeness, posted in -bs by mistake...) ola_norsk: Extracting duplicates from a file: awk 'seen[$0]++' <file This will print each duplicate (but not the first appearance). You could also do sort <file | uniq -d but I prefer the awk solution since it doesn't have to sort the file and is therefore *much* faster.
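For readers more at home in Python, the same single-pass idea as JAA's awk one-liner can be sketched like this. The function names are illustrative; `duplicates` mirrors `awk 'seen[$0]++'`, and `unique` is the complementary filter that keeps only first appearances.

```python
def duplicates(lines):
    """Yield each repeated line (every occurrence after the first)."""
    seen = set()
    for line in lines:
        if line in seen:
            yield line
        else:
            seen.add(line)


def unique(lines):
    """Yield each line only the first time it appears, preserving order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    sample = ["a", "b", "a", "c", "a", "b"]
    print(list(duplicates(sample)))
    print(list(unique(sample)))
```

Like the awk version, this avoids sorting but keeps every distinct line in memory. One difference from `sort | uniq -d` is that a line appearing three times is reported twice here, whereas `uniq -d` reports it once.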
23:43 πŸ”— JAA I have no interest in archiving all YouTube annotations myself. This is a pretty big task, and Google isn't exactly known for liking automated access, i.e. bans come flying very quickly in my experience.
23:43 πŸ”— JAA In other words, warrior.
23:44 πŸ”— JAA Also, odemg may have a good list of video IDs from the metadata archival project.
23:44 πŸ”— Flashfire Ivan pushes a TB of videos to a google drive a day but google produces EB so it’s not truly viable as such
23:45 πŸ”— Flashfire Sorry YouTube in the form of google pushes EB
23:45 πŸ”— JAA Not an EB per day though, more like an EB in total.
23:46 πŸ”— ivan it's more than an EB in total
23:46 πŸ”— Flashfire That’s still not viable to grab all of. Something like 1000
23:46 πŸ”— JAA Around that order of magnitude at least.
23:46 πŸ”— Flashfire hours in a minute is uploaded
23:47 πŸ”— Flashfire Or some obscure number
23:47 πŸ”— JAA Nobody suggested grabbing all of YouTube anyway. That won't ever happen.
23:48 πŸ”— JAA Unless it fades away into obscurity yet somehow survives for a few more decades until an EB of storage capacity is a laughable amount.
23:49 πŸ”— * Flashfire chuckles in futuristic
