00:07 *** dashcloud has quit IRC (Read error: Connection reset by peer)
00:08 *** dashcloud has joined #archiveteam-ot
00:56 *** BlueMax has joined #archiveteam-ot
01:36 *** dashcloud has quit IRC (Read error: Connection reset by peer)
01:38 *** dashcloud has joined #archiveteam-ot
02:00 *** Sanqui has quit IRC (Ping timeout: 260 seconds)
02:06 *** Sanqui has joined #archiveteam-ot
02:06 *** svchfoo1 sets mode: +o Sanqui
02:11 *** Sanqui has quit IRC (Read error: Operation timed out)
02:23 *** Sanqui has joined #archiveteam-ot
02:24 *** svchfoo1 sets mode: +o Sanqui
03:22 *** Sanqui has quit IRC (Ping timeout: 260 seconds)
03:23 *** Sanqui has joined #archiveteam-ot
03:24 *** svchfoo1 sets mode: +o Sanqui
03:43 *** wp494 has quit IRC (Ping timeout: 260 seconds)
03:44 *** wp494 has joined #archiveteam-ot
03:44 *** svchfoo1 sets mode: +o wp494
04:22 *** odemg has quit IRC (Ping timeout: 265 seconds)
04:34 *** odemg has joined #archiveteam-ot
05:11 *** adinbied has joined #archiveteam-ot
05:25 *** adinbied has quit IRC (Left Channel.)
06:24 *** adinbied has joined #archiveteam-ot
06:59 *** icedice has quit IRC (Quit: Leaving)
07:04 *** Mateon1 has quit IRC (Remote host closed the connection)
07:04 *** Mateon1 has joined #archiveteam-ot
07:08 *** jspiros has quit IRC (Remote host closed the connection)
07:08 *** swebb has quit IRC (Ping timeout: 240 seconds)
07:08 *** svchfoo1 has quit IRC (Ping timeout: 240 seconds)
07:09 *** swebb has joined #archiveteam-ot
07:10 *** nightpoo- has quit IRC (Ping timeout: 246 seconds)
07:11 *** JAA has quit IRC (Ping timeout: 246 seconds)
07:16 *** godane has quit IRC (Ping timeout: 492 seconds)
07:17 *** nightpool has joined #archiveteam-ot
07:25 *** godane has joined #archiveteam-ot
07:55 *** BlueMax has quit IRC (Read error: Connection reset by peer)
08:03 *** BlueMax has joined #archiveteam-ot
08:10 *** JAA has joined #archiveteam-ot
08:10 *** svchfoo1 has joined #archiveteam-ot
08:11 *** jspiros has joined #archiveteam-ot
08:11 *** bakJAA sets mode: +o JAA
08:11 *** svchfoo3 sets mode: +o JAA
08:12 *** svchfoo3 sets mode: +o svchfoo1
08:29 *** jspiros has quit IRC (hub.efnet.us irc.colosolutions.net)
08:29 *** svchfoo1 has quit IRC (hub.efnet.us irc.colosolutions.net)
08:29 *** JAA has quit IRC (hub.efnet.us irc.colosolutions.net)
08:40 *** jspiros has joined #archiveteam-ot
08:40 *** svchfoo1 has joined #archiveteam-ot
08:40 *** JAA has joined #archiveteam-ot
08:40 *** irc.colosolutions.net sets mode: +oo svchfoo1 JAA
08:41 *** JAA sets mode: +o bakJAA
08:41 *** schbirid has joined #archiveteam-ot
08:41 *** bakJAA sets mode: +o JAA
09:36 *** alex___ has joined #archiveteam-ot
09:40 *** LFlare43 has quit IRC (Quit: The Lounge - https://thelounge.chat)
10:35 *** BlueMax has quit IRC (Remote host closed the connection)
10:37 *** BlueMax has joined #archiveteam-ot
11:54 *** BlueMax has quit IRC (Read error: Connection reset by peer)
12:51 *** wp494 has quit IRC (Read error: Operation timed out)
12:51 *** wp494 has joined #archiveteam-ot
12:51 *** svchfoo3 sets mode: +o wp494
13:17 *** VerifiedJ has joined #archiveteam-ot
13:18 *** hook54321 has quit IRC (Quit: Connection closed for inactivity)
13:36 *** wmvhater has quit IRC (Read error: Operation timed out)
13:38 *** kiska1 has quit IRC (Ping timeout (120 seconds))
13:38 *** wmvhater has joined #archiveteam-ot
13:39 *** kiska1 has joined #archiveteam-ot
13:41 *** wmvhater has quit IRC (Client Quit)
13:42 *** wmvhater has joined #archiveteam-ot
13:43 *** kiska1 has quit IRC (Client Quit)
13:43 *** kiska1 has joined #archiveteam-ot
13:46 *** wmvhater has quit IRC (Read error: Operation timed out)
13:49 *** wmvhater has joined #archiveteam-ot
13:51 *** godane has quit IRC (Ping timeout: 265 seconds)
15:15 *** adinbied has quit IRC (Quit: Left Channel.)
15:36 *** adinbied has joined #archiveteam-ot
15:36 *** schbirid has quit IRC (Remote host closed the connection)
15:47 *** schbirid has joined #archiveteam-ot
16:33 *** alex___ has quit IRC (alex___)
16:36 *** alex___ has joined #archiveteam-ot
19:37 *** icedice has joined #archiveteam-ot
19:39 *** ola_norsk has joined #archiveteam-ot
19:40 <ola_norsk> Happy New Year!
19:45 <ola_norsk> Without harping on too much on the YouTube Annotations issue; Would anyone happen to have a good idea to get all video id's by January 15th 2019..that doesn't involve scrapy?
19:46 <ola_norsk> I bet i can pull the annotations, the hard part is figuring out all the .ID's
19:49 <ola_norsk> youtube data API seems to have some sort of points costs to it, and i'm not paying to unfuck youtubes fuckups
19:51 <Kaz> just iterate through them
19:51 <ola_norsk> 'them' who ? , i have 4Mbit connection :/
19:51 *** alex___ has quit IRC (Read error: Operation timed out)
19:52 <ola_norsk> If i had a table of all the links of every video (or video id) i could iterate
19:53 <ola_norsk> there's no way i'd be able to scrape every ID off of youtube singlehandedly in 1.25 month
19:53 <Kaz> correct
19:54 <ola_norsk> eientei95: you see that, "correct" .. it's why i'm here
19:54 * ola_norsk is no 1337 haxor
19:55 <Kaz> you're going to have to brute-force your way through if you want everything
19:55 <Kaz> otherwise just search google/facebook/twitter/reddit/whatever
19:55 <ola_norsk> Kaz: "...stop being brainless"
19:57 *** JAA has quit IRC (Read error: Operation timed out)
19:57 *** alex___ has joined #archiveteam-ot
19:57 <Kaz> happy to hear any smart ideas you've come up with for this
19:57 <Kaz> Google's not going to give you a list
19:58 <Kaz> Lots of videos are just going to be unlisted anyway, so won't show in searches etc
19:58 <ola_norsk> Kaz: i'm looking at Scrapy, that's the extent of my smart
19:58 <Kaz> So, as I said, your options are a) searching whatever you can to scrape youtube.com and youtu.be links
19:58 <Kaz> or b) bruteforcing the list
19:59 *** jspiros has quit IRC (Read error: Operation timed out)
20:00 *** adinbied has quit IRC (Read error: Operation timed out)
20:00 *** svchfoo1 has quit IRC (Ping timeout: 246 seconds)
20:01 *** adinbied has joined #archiveteam-ot
20:11 <ola_norsk> b seems to be the quicker option then
20:14 <cf> ola_norsk: I've already started scraping and grabbing annotation xmls
20:15 <cf> started with all submissions to reddit since that data is easily available
20:15 <cf> got 20M ids or so
20:15 <cf> going to do that and also grab from the top hundred thousand channels or so
20:15 <cf> maybe see if I can get a primitive spider crawling for more ids but not super invested
20:23 <ola_norsk> cf: awesome. Though instead of the top hundred thousand channels, maybe it would be better to prioritize by category? e.g News, History, Technology, first, then e.g Humour last?
20:24 <ola_norsk> cf: but yeah, it's hard to care, really..when it's about time YouTube took a dive from their monopoly pedestal
20:25 <cf> could prioritize by category, but not sure how easy it is to filter by something like that. will take a look tho
20:26 <ola_norsk> what format do you have the ids saved as?
20:27 <cf> just the 11 chars after ?v= in the url. [0-9A-Za-z_-]{11}
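A minimal sketch of the extraction cf describes: pull anything that looks like an 11-character [0-9A-Za-z_-]{11} video ID out of a text dump of links. The file names and input source here are placeholders, not cf's actual setup.

```python
# Hypothetical sketch of cf's ID extraction: scan a text dump for YouTube links
# and keep the 11-character ID after ?v= (or after youtu.be/). File names are
# illustrative only.
import re

VIDEO_ID = re.compile(r"(?:v=|youtu\.be/)([0-9A-Za-z_-]{11})")

def extract_ids(in_path="reddit_submissions.txt", out_path="all_ids.txt"):
    seen = set()
    with open(in_path, errors="replace") as src, open(out_path, "w") as dst:
        for line in src:
            for video_id in VIDEO_ID.findall(line):
                if video_id not in seen:  # keep the output list unique
                    seen.add(video_id)
                    dst.write(video_id + "\n")

if __name__ == "__main__":
    extract_ids()
```

As cf notes further down, a bare regex like this will also match strings that merely look like IDs, so some false positives are expected.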
20:27 <ola_norsk> if you can share them, feel free to do so
20:28 <cf> http://files.ulim.it/all_ids.txt
20:28 <cf> i'm sure there's things that just matched the regex despite not being a real id (the first line is an example)
20:28 <cf> but spot checking seems to show it being pretty good
20:28 <cf> and again, that's all of the ids I extracted from reddit submission data
20:30 <ola_norsk> still more awesomer than me _trying to_ reinventing the wheel, so thanks!
20:33 <ola_norsk> could git be used to commit patches to the list in the future you think?
20:34 <ola_norsk> it's quite a fuckload of textfile there :D
20:36 <ola_norsk> or mysql, to insert new id's into perhaps?
20:39 <cf> rdms's tend to choke when you're just using them to store a list of distinct items. probably still a bit better than a text file but not by much. not sure how git would fare
20:39 <cf> *rdbms's
20:40 <ola_norsk> it took me over 2 minutes just to 'cat' through the list
20:41 <ola_norsk> do you happen to have a record of the ones you've already pulled the annotations from?
20:41 *** godane has joined #archiveteam-ot
20:41 <ola_norsk> that way i could start on the ones you've not yet done
20:42 <cf> just have a script working its way down the list
20:42 <cf> probably the first 100k or so
20:42 <ola_norsk> ok
20:42 <cf> but its parallelized so its going to be out of order
20:44 <ola_norsk> if i reverse the list and work upward?
20:44 <ola_norsk> a bit of duplicate is impossible to prevent i guess, with 'tubeup' already having done a lot
20:45 <ola_norsk> MirrorTube, i mean
20:45 <cf> sure yeah if you want to grab stuff yourself, we should meet somewhere in the middle
20:47 <ola_norsk> you expect you'll be able to pull it off that would be great.
20:47 <ola_norsk> if*
20:48 <ola_norsk> if not, i'd be happy to give it a go
20:51 <ola_norsk> how do you name the annotations files btw? by just id, or?
20:51 <ola_norsk> <id>.xml ?
20:52 <cf> yeah id.xml
20:52 <ola_norsk> ok
20:52 <Kaz> (get warcs too please)
20:53 <Kaz> actually scratch that, I'll chuck the list into archivebot
20:53 <ola_norsk> wpull is JAA's territory, i'm just a grabsite n00b :D
20:53 *** Mateon1 has quit IRC (Ping timeout: 252 seconds)
20:54 *** Mateon1 has joined #archiveteam-ot
20:54 <Kaz> if you're chucking it into grab-site it'll probably generate a warc, no?
20:54 <ola_norsk> that, and then some
20:56 <ola_norsk> maybe 'youtube' ignoreset would help
20:56 <ola_norsk> idk, i had a forum generating 10+GB...
20:58 <ola_norsk> (https://archive.org/details/WARC_www_subsim_com-radioroom-2018-09-07-89abc154_01) ..and to ~7 (i think)
20:59 <ola_norsk> i used login cookie, so it might contain some software and mods..but had to manually cancel it since it never stopped
20:59 *** svchfoo1 has joined #archiveteam-ot
20:59 *** JAA has joined #archiveteam-ot
20:59 *** svchfoo3 sets mode: +o JAA
21:00 *** bakJAA sets mode: +o JAA
21:00 *** svchfoo3 sets mode: +o svchfoo1
21:03 *** jspiros has joined #archiveteam-ot
21:07 <ola_norsk> Kaz: youtube-dl can pull comments i think
21:07 <ola_norsk> as info json
21:08 <Kaz> eh
21:08 <Kaz> isn't this about annotations?
21:09 <ola_norsk> <@Kaz> (get warcs too please)
21:09 <Kaz> ..continue
21:09 <ola_norsk> you said that, i didn't :D
21:09 <Kaz> yes
21:09 <Kaz> how did comments come into this
21:10 <Kaz> if we're going about quoting things..
21:10 <Kaz> <@Kaz> isn't this about annotations?
21:10 <ola_norsk> because grabbing comments is doable, while warcing every */v/<id> i don't think is :/
21:11 <Kaz> wha
21:11 <Kaz> drop making jumps
21:11 <Kaz> you're working from https://www.youtube.com/annotations_invideo?features=1&legacy=1&video_id=<video_id>
21:11 <Kaz> correct?
21:11 <ola_norsk> warc would mean the video is included, correct?
21:12 <Kaz> no
21:12 <ola_norsk> ok
21:12 <Kaz> if you're grabbing the annotations xml.. get a warc of it
21:12 <Kaz> comments and videos themselves don't come into this
21:12 <Kaz> we're sure as shit not grabbing the whole of youtube today
21:13 <ola_norsk> aha! see, you're smarter than i look
21:13 <Kaz> i think that's a compliment
21:13 <ola_norsk> i look pretty sh*t, but yeah, it was meant to be
21:14 <ola_norsk> anywho, i'll go with your idea
21:14 *** ola_norsk has quit IRC (leaving)
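A rough sketch of the per-ID grab the conversation converges on: hit the annotations_invideo endpoint Kaz quoted for every entry in all_ids.txt and save each response as <id>.xml, which is cf's naming scheme. The use of the requests library, the pacing, and the directory layout are assumptions; WARC capture (Kaz's request) would happen separately, e.g. via ArchiveBot or grab-site.

```python
# Hypothetical sketch only: fetch the legacy annotations XML for each video ID
# and write it to <id>.xml. The URL is the one quoted in the channel; the
# rate limiting and file handling are assumptions, not anyone's actual script.
import time
import requests  # third-party: pip install requests

ANNOTATION_URL = ("https://www.youtube.com/annotations_invideo"
                  "?features=1&legacy=1&video_id={}")

def fetch_annotations(ids_file="all_ids.txt", out_dir=".", delay=1.0):
    with open(ids_file) as f:
        for line in f:
            video_id = line.strip()
            if not video_id:
                continue
            resp = requests.get(ANNOTATION_URL.format(video_id), timeout=30)
            if resp.status_code == 200:
                with open(f"{out_dir}/{video_id}.xml", "wb") as out:
                    out.write(resp.content)
            time.sleep(delay)  # go slowly; Google bans aggressive clients quickly

if __name__ == "__main__":
    fetch_annotations()
```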
21:16 <ivan> I've never seen youtube-dl include comments
21:22 *** BlueMax has joined #archiveteam-ot
21:37 *** hook54321 has joined #archiveteam-ot
21:43 *** wp494 has quit IRC (Ping timeout: 268 seconds)
21:43 *** wp494 has joined #archiveteam-ot
21:43 *** svchfoo3 sets mode: +o wp494
22:07 *** ola_norsk has joined #archiveteam-ot
22:07 <ola_norsk> cf: did you check your all_ids.txt for duplicates?
22:08 <ola_norsk> i mean, are all 200k of them unique?
22:10 <ola_norsk> 200k+
22:17 * ola_norsk sucks at grep :/ and came out with 40k+ lines when doing adding/transferring unique from-and-to a file from all_ids :/ http://paste.ubuntu.com/p/6hfm34VSDM/
22:19 <ola_norsk> "grep -Fxq" to find
22:20 <ola_norsk> used*
22:21 *** BlueMax has quit IRC (Read error: Connection reset by peer)
22:21 *** JAA has quit IRC (Read error: Operation timed out)
22:22 <ola_norsk> it's more than likely i messed up
22:23 *** BlueMax has joined #archiveteam-ot
22:24 *** svchfoo1 has quit IRC (Ping timeout: 246 seconds)
22:26 *** jspiros has quit IRC (Read error: Operation timed out)
22:38 *** BlueMax has quit IRC (Quit: Leaving)
22:39 *** BlueMax has joined #archiveteam-ot
22:44 *** godane has quit IRC (Read error: Operation timed out)
22:51 <ola_norsk> i put the stuff here https://archive.org/details/Youtube_Video_IDs_CF_2018-12-01
22:51 *** ola_norsk has quit IRC (Skål!)
23:24 *** jspiros has joined #archiveteam-ot
23:25 *** svchfoo1 has joined #archiveteam-ot
23:26 *** svchfoo3 sets mode: +o svchfoo1
23:26 *** JAA has joined #archiveteam-ot
23:26 *** svchfoo1 sets mode: +o JAA
23:27 *** bakJAA sets mode: +o JAA
23:27 *** godane has joined #archiveteam-ot
23:41 <JAA> (For log completeness, posted in -bs by mistake...) ola_norsk: Extracting duplicates from a file: awk 'seen[$0]++' <file This will print each duplicate (but not the first appearance). You could also do sort <file | uniq -d but I prefer the awk solution since it doesn't have to sort the file and is therefore *much* faster.
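For reference, the same duplicate check ola_norsk was after, written in Python and equivalent in spirit to JAA's awk one-liner above; the file name is the one cf posted, everything else is illustrative.

```python
# Sketch: count how many IDs in all_ids.txt occur more than once.
# Comparable to `awk 'seen[$0]++' all_ids.txt`, but also reports totals.
from collections import Counter

def count_duplicates(path="all_ids.txt"):
    with open(path) as f:
        counts = Counter(line.strip() for line in f if line.strip())
    dupes = {vid: n for vid, n in counts.items() if n > 1}
    print(f"{len(counts)} unique ids, {len(dupes)} of them appear more than once")
    return dupes

if __name__ == "__main__":
    count_duplicates()
```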
23:43 <JAA> I have no interest in archiving all YouTube annotations myself. This is a pretty big task, and Google isn't exactly known for liking automated access, i.e. bans come flying very quickly in my experience.
23:43 <JAA> In other words, warrior.
23:44 <JAA> Also, odemg may have a good list of video IDs from the metadata archival project.
23:44 <Flashfire> Ivan pushes a TB of videos to a google drive a day but google produces EB so it's not truly viable as such
23:45 <Flashfire> Sorry YouTube in the form of google pushes EB
23:45 <JAA> Not an EB per day though, more like an EB in total.
23:46 <ivan> it's more than an EB in total
23:46 <Flashfire> That's still not viable to grab all of. Something like 1000
23:46 <JAA> Around that order of magnitude at least.
23:46 <Flashfire> hours in a minute is uploaded
23:47 <Flashfire> Or some obscure number
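A back-of-envelope check on the figures above, taking Flashfire's "something like 1000 hours per minute" at face value and assuming roughly 1 GB per hour of stored video; the per-hour size is purely an assumption, so read the result as an order of magnitude only.

```python
# Rough order-of-magnitude estimate, not a measurement.
HOURS_PER_MINUTE = 1000   # claimed upload rate from the discussion above
GB_PER_HOUR = 1.0         # assumed average storage per hour of video

gb_per_day = HOURS_PER_MINUTE * 60 * 24 * GB_PER_HOUR   # ~1.44e6 GB, i.e. ~1.4 PB/day
eb_per_year = gb_per_day * 365 / 1e9                    # ~0.5 EB/year

print(f"{gb_per_day / 1e6:.1f} PB/day, {eb_per_year:.2f} EB/year")
```

At that rate a total archive in the exabyte range accumulates within a few years, which is consistent with the "around an EB or more in total" estimates in the discussion.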
23:47 <JAA> Nobody suggested grabbing all of YouTube anyway. That won't ever happen.
23:48 <JAA> Unless it fades away into obscurity yet somehow survives for a few more decades until an EB of storage capacity is a laughable amount.
23:49 * Flashfire chuckles in futuristic