#archiveteam-bs 2019-11-17,Sun


Time Nickname Message
00:26 πŸ”— jc86035_ has joined #archiveteam-bs
00:26 πŸ”— jc86035 has quit IRC (Quit: Leaving.)
00:26 πŸ”— jc86035_ is now known as jc86035
00:28 πŸ”— schbirid has quit IRC (Quit: Leaving)
00:42 πŸ”— OrIdow6 has joined #archiveteam-bs
00:45 πŸ”— BartoCH has quit IRC (Ping timeout: 745 seconds)
00:51 πŸ”— PilgrimSp has quit IRC (Quit: Page closed)
01:11 πŸ”— bsmith093 has quit IRC (Remote host closed the connection)
02:08 πŸ”— jake_test Has anyone backed up all of SCP?
02:14 πŸ”— jodizzle jake_test: scpfoundation.net was grabbed a few days ago, scp-wiki.net has been grabbed a couple times over the years.
02:14 πŸ”— jodizzle Are there others?
02:26 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
02:33 πŸ”— BlueMax has joined #archiveteam-bs
02:40 πŸ”— JAA jc86035 and I have been talking a bit about the upcoming YouTube like playlist disappearance and a possible archival. What does everyone think about that? Should we try?
02:40 πŸ”— jc86035 Is anyone already planning to do e.g. warrior stuff or URL compiling for YouTube's liked videos playlists? It was mentioned here they were going to go away in about three weeks
02:41 πŸ”— JAA Context for the unaware: "After December 5, 2019, your public "Liked videos" playlist will be made private, which means only you will be able to see this playlist. You can still like videos, and videos will still show the number of likes." https://support.google.com/youtube/answer/6083270
02:43 πŸ”— jc86035 The main issues that would presumably come up include
02:43 πŸ”— jc86035 • how to actually find such playlists (presumably you could use standard YouTube search, but it probably wouldn't be comprehensive; I also don't know if/how the playlist titles are localized)
02:43 πŸ”— jc86035 • whether to use the old YouTube interface (?disable_polymer=1) or the new site (in theory you can get playlist data from both, but the new interface is ten times heavier on data usage and doesn't work in some browsers when viewed through the Wayback Machine)
02:43 πŸ”— jc86035 • whether to archive everything, or just archive playlists from channels with more than x subscribers (since a lot of people might be exposing their liked videos by accident)
02:44 πŸ”— jc86035 I'm somewhat Lua-literate (I've used it a lot as a Wikipedia editor) so I could help with wget-lua and such
02:48 πŸ”— markedL is this different than a playlist, or just an auto-populated playlist? can we see an example?
02:55 πŸ”— jc86035 In the interface these just look like any other playlist. They're populated entirely based on the videos that users press the like button on, so I'm not sure whether or not these would be considered auto-populated.
02:56 πŸ”— jc86035 So, for example, the liked videos playlist for /user/ERB is https://www.youtube.com/playlist?list=LLMu5gPmKp5av0QCAajKTMhw.
02:57 πŸ”— jc86035 Their channel ID is UCMu5gPmKp5av0QCAajKTMhw, so the URLs are predictable.
02:58 πŸ”— jc86035 (i.e. LL[A-Za-z0-9_-]{22})
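The ID relationship described above can be sketched as a small helper; a minimal sketch, assuming the only difference between a channel ID and its liked-videos playlist ID is the UC/LL prefix (as in the ERB example):

```python
import re

# A channel ID is "UC" plus 22 URL-safe base64 characters; the corresponding
# liked-videos playlist ID swaps the "UC" prefix for "LL".
CHANNEL_ID_RE = re.compile(r"^UC[A-Za-z0-9_-]{22}$")

def liked_playlist_url(channel_id: str) -> str:
    """Build the liked-videos playlist URL for a channel ID like UCMu5gPmKp5av0QCAajKTMhw."""
    if not CHANNEL_ID_RE.match(channel_id):
        raise ValueError(f"not a channel ID: {channel_id}")
    return "https://www.youtube.com/playlist?list=LL" + channel_id[2:]

print(liked_playlist_url("UCMu5gPmKp5av0QCAajKTMhw"))
# -> https://www.youtube.com/playlist?list=LLMu5gPmKp5av0QCAajKTMhw
```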
02:59 πŸ”— jc86035 https://web.archive.org/web/timemap/?url=https://www.youtube.com/playlist?list=LL&matchType=prefix&collapse=urlkey&output=json&fl=original%2Cmimetype%2Ctimestamp%2Cendtimestamp%2Cgroupcount%2Cuniqcount&filter=!statuscode%3A%5B45%5D.. indicates that there are currently about 12,789 of these saved to the Internet Archive.
03:00 πŸ”— jc86035 * try this url if that one doesn't work: https://web.archive.org/web/timemap/?url=https://www.youtube.com/playlist?list=LL&matchType=prefix&collapse=urlkey&output=json&fl=original%2Cmimetype%2Ctimestamp%2Cendtimestamp%2Cgroupcount%2Cuniqcount&filter=!statuscode%3A[45]..&limit=100000
03:02 πŸ”— jc86035 Also, there are definitely some lists of the most subscribed channels floating around, so you could in theory just keep testing URLs based on massive lists of channel IDs and see which ones work.
03:08 πŸ”— jc86035 (I tried to download the list of every saved channel from the timemap API, but it looks like the query was too large and also the server really didn't like that. Splitting it into multiple queries should work.)
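A sketch of the split-query idea: the endpoint and filter string are taken from the timemap URLs above, but the `limit`/`offset` paging parameters are an assumption (they follow the CDX server's conventions and may need adjusting for the timemap endpoint).

```python
from urllib.parse import urlencode

# Build a series of smaller timemap queries instead of one huge one,
# which the server rejected per the discussion above.
BASE = "https://web.archive.org/web/timemap/"

def timemap_page_urls(prefix: str, page_size: int, pages: int):
    """Yield paged timemap query URLs for a URL prefix (paging params assumed)."""
    for page in range(pages):
        params = {
            "url": prefix,
            "matchType": "prefix",
            "collapse": "urlkey",
            "output": "json",
            "filter": "!statuscode:[45]..",
            "limit": page_size,
            "offset": page * page_size,
        }
        yield BASE + "?" + urlencode(params)

urls = list(timemap_page_urls("https://www.youtube.com/playlist?list=LL", 5000, 3))
```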
03:14 πŸ”— manjaro-u has quit IRC (Ping timeout: 252 seconds)
03:19 πŸ”— bsmith093 has joined #archiveteam-bs
03:27 πŸ”— markedL I see value in the data set. A long list's view has more pagination than I would have liked, but at least google can scale. Go for the presentation that's easier to grab or easier to playback. Is the plan to go after the videos later, because the lists are going to become less playable as time passes. We are stretched thin over the next 3 weeks
03:27 πŸ”— markedL as is, so hopefully this will be self sufficient or easy.
03:30 πŸ”— jc86035 markedL: There will be a lot of videos, although many of them will be duplicates or will already have been archived. There's no hard time limit on the videos themselves, so I would do those after the lists (if bandwidth and storage are the main limiting factors)
03:32 πŸ”— jc86035 Typically there won't be any bandwidth limits on Google's end; I've tried doing dozens of concurrent connections with no issues (for HTML, that is, not video)
03:39 πŸ”— markedL https://www.wolframalpha.com/input/?i=2billion%2F%28now+until+Dec+5%29 = 1279 per second
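The Wolfram Alpha figure can be reproduced locally; the 2 billion candidate count and the exact timestamps are assumptions from the message above, so the result only lands in the same ballpark as 1279/s:

```python
from datetime import datetime

# Requests per second needed to test ~2 billion candidate playlist URLs
# before the December 5 deadline (timestamps assumed, UTC).
now = datetime(2019, 11, 17, 3, 39)   # timestamp of the message above
deadline = datetime(2019, 12, 5)
seconds_left = (deadline - now).total_seconds()
rate = 2_000_000_000 / seconds_left
print(round(rate))  # roughly 1300 requests per second
```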
03:47 πŸ”— jc86035 markedL: A lot of channels won't have public liked playlists, presumably. If the playlist doesn't exist, the YouTube website returns an HTTP 200 with an error message on it ("The playlist does not exist."). It's somewhat valuable information, I guess, but it would make archival based on testing channel IDs harder than necessary.
03:48 πŸ”— jc86035 example: https://web.archive.org/web/*/https://www.youtube.com/playlist?list=LL9gFih9rw0zNCK3ZtoKQQyA, captures are new and old SPN respectively
03:53 πŸ”— jc86035 I didn't think we would be testing every public channel (especially given the tight timeframe and the fact that there's no public index that I'm aware of). Nevertheless, if you were to deduplicate images you could potentially do it over a few hundred connections (unless the YouTube ratelimit would prevent that).
03:56 πŸ”— markedL did you discuss with JAA whether this is going into WBM?
03:57 πŸ”— JAA It should if at all possible.
03:58 πŸ”— JAA But it'd be quite large, so we first need a size estimate to bring to IA.
03:58 πŸ”— JAA Especially if we also grab the video thumbs.
04:03 πŸ”— jc86035 According to Firefox's network inspector, on the old site each page (including images) is about 1.2MB uncompressed, and the next 100 videos are about 0.7MB.
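A back-of-the-envelope total based on those measurements (≈1.2 MB for the first page including images, ≈0.7 MB per further 100 videos, uncompressed); the playlist count and average length below are made-up illustrative numbers, not estimates from the discussion:

```python
import math

FIRST_PAGE_MB = 1.2       # first page of a playlist, including images (measured above)
CONTINUATION_MB = 0.7     # each further batch of 100 videos (measured above)

def playlist_mb(video_count: int) -> float:
    """Rough uncompressed size of one liked-videos playlist grab."""
    extra_batches = max(0, math.ceil(video_count / 100) - 1)
    return FIRST_PAGE_MB + extra_batches * CONTINUATION_MB

# Hypothetical scale: 10 million public playlists averaging 300 liked videos.
total_tb = 10_000_000 * playlist_mb(300) / 1_000_000
print(f"{total_tb:.1f} TB uncompressed")  # -> 26.0 TB uncompressed
```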
04:05 πŸ”— markedL how large is "massive lists of channel ID". my intuition says you're going to run out of IDs before the deadline
04:05 πŸ”— jc86035 I don't really know, I've never tried to find out
04:06 πŸ”— jc86035 Presumably the people who did the video annotations have a log of channel IDs that they found? They found a billion annotations, so that's probably tens of millions of channels already
04:21 πŸ”— markedL I think they're on discord, I'll message them. and candidates for a channel name?
04:23 πŸ”— markedL I'll propose, #dis-liked or #DownTheTube
04:26 πŸ”— markedL I guess people like lowercase #down-the-tube
04:26 πŸ”— odemgi_ has quit IRC (Ping timeout: 252 seconds)
04:29 πŸ”— bsmith093 has quit IRC (Quit: Leaving.)
04:29 πŸ”— bsmith093 has joined #archiveteam-bs
04:29 πŸ”— qw3rty2 has joined #archiveteam-bs
04:36 πŸ”— odemg has quit IRC (Ping timeout: 745 seconds)
04:38 πŸ”— qw3rty has quit IRC (Ping timeout: 745 seconds)
04:41 πŸ”— odemg has joined #archiveteam-bs
05:00 πŸ”— bsmith093 has quit IRC (Quit: Leaving.)
05:08 πŸ”— bsmith093 has joined #archiveteam-bs
05:08 πŸ”— bsmith093 has quit IRC (Client Quit)
05:16 πŸ”— Quirk8 has joined #archiveteam-bs
05:29 πŸ”— Smiley has joined #archiveteam-bs
05:33 πŸ”— SmileyG has quit IRC (Read error: Operation timed out)
06:18 πŸ”— odemg has quit IRC (Ping timeout: 745 seconds)
06:21 πŸ”— OrIdow6 has quit IRC (Read error: Connection reset by peer)
06:23 πŸ”— OrIdow6 has joined #archiveteam-bs
06:25 πŸ”— erkinalp has joined #archiveteam-bs
06:36 πŸ”— odemgi has joined #archiveteam-bs
06:50 πŸ”— OrIdow6 (Putting this here because there is only one person in the "project" channel(s) at present) A cheap way to discover more public channels (if you do run out of them) is to extract them from the playlists that you download
06:57 πŸ”— OrIdow6 Probably best only to consider doing this once you run out of channels - if some topical areas' users are more likely to have public playlists, doing this first would probably bias what is captured
06:58 πŸ”— Raccoon Of which project
06:59 πŸ”— OrIdow6 This prospective Youtube like-playlists thing
07:21 πŸ”— killsushi has joined #archiveteam-bs
07:46 πŸ”— jc86035 OrIdow6: In theory, yes, but it wouldn't be sufficient if we wanted playlists from users without uploads
07:47 πŸ”— jc86035 OrIdow6:
07:48 πŸ”— jc86035 OrIdow6: (on re-reading, I realize you did address that, so never mind)
07:49 πŸ”— jc86035 However, it could potentially reduce the bandwidth for the actual playlist archival, since this would require all of the video pages to be downloaded as well
07:49 πŸ”— jc86035 Though if we were to completely archive those pages as well we'd end up downloading millions of comment sections, so that might also help....
07:50 πŸ”— Raccoon I wonder if we'll ever realize a future where YouTube mass-undeletes millions of videos "on second thought."
07:50 πŸ”— jc86035 since users who comment are more likely to have liked videos playlists
07:50 πŸ”— OrIdow6 I'm not familiar enough with Youtube to tell what you mean by the "bandwidth" message
07:51 πŸ”— Raccoon I used to maintain a playlist with several dozen videos of convenience store security camera footage of armed robberies, where the shop-keep was unarmed and shot dead, or where the shop-keep was armed and prevailed.
07:52 πŸ”— Raccoon It was very effective imagery, which I imagine is why the videos were deleted for political preservation
07:52 πŸ”— jc86035 OrIdow6: Everyone's internet connection has a maximum rate of data flow (i.e. bandwidth). Even if we were to distribute the archival among thousands of different internet connections (with the warriors), it would still take quite a while to download the massive number of liked videos playlists that are about to be made private.
07:53 πŸ”— OrIdow6 I mean, what is "it" in "it could potentially reduce the bandwidth for the actual playlist archival", vs. "this", which "require[s] all of the video pages to be downloaded as well"?
07:54 πŸ”— jc86035 You might not be able to get channel IDs from playlist entries.
07:55 πŸ”— jc86035 So getting the channel IDs would either necessitate millions of downloads of video pages or millions of requests to the YouTube API.
07:55 πŸ”— jc86035 Downloading the video pages would necessitate downloading all of the images and scripts used on the pages so that they're not incomplete.
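A sketch of the channel-discovery idea under discussion: mining already-downloaded HTML (playlist, video, or comment pages) for channel IDs, assuming the IDs appear verbatim in the markup. The sample IDs are the two mentioned earlier in the log.

```python
import re

# Channel IDs ("UC" + 22 URL-safe base64 chars) appear verbatim in page HTML,
# so pages that were already grabbed can be mined for new candidate channels
# without any extra requests to YouTube.
CHANNEL_ID_RE = re.compile(r"UC[A-Za-z0-9_-]{22}")

def harvest_channel_ids(html: str) -> set:
    """Return the set of channel IDs found anywhere in a page's HTML."""
    return set(CHANNEL_ID_RE.findall(html))

sample = ('<a href="/channel/UCMu5gPmKp5av0QCAajKTMhw">ERB</a> '
          '<a href="/channel/UC9gFih9rw0zNCK3ZtoKQQyA">other</a>')
print(harvest_channel_ids(sample))
```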
08:17 πŸ”— jake_test has quit IRC (Read error: Operation timed out)
08:31 πŸ”— OrIdow6 jc86035: It does seem that some videos on playlists don't have the numeric ID listed (in a playlist that was put in this channel earlier, this applied to about half) - those that don't, do always list the human-readable name of the channel, and so downloading the video as opposed to the user video list (which is what youtube-dl uses) would not have to happen for each video (and so if e.g. a user is listed many times across many
08:31 πŸ”— OrIdow6 playlists, only one download is needed) - the problem of images etc. on the video list page is a hard one, though
08:33 πŸ”— jc86035 okay, my fault for not noticing, that was stupid
08:33 πŸ”— jc86035 still, if we were to run out of IDs it would be worthwhile to look through top comments though
08:36 πŸ”— OrIdow6 Yes, or any additional way of discovering new users (again, I haven't used Youtube too much) besides these two - you probably have a better idea than mine with the comment thing, though, in terms of which users are likely to be forthright with things
08:46 πŸ”— jc86035 honestly it's mostly an educated guess, it might not pan out. that said, I don't think there are really any good ways to find YouTube users with literally no presence except for their liked videos playlist, aside from maybe scraping other websites for channel or playlist links
09:07 πŸ”— meltir has quit IRC (Quit: leaving)
09:13 πŸ”— jc86035 from a sample of the top ~500 YouTube channels, about 45% of them have public liked videos playlists and about 40% have at least one video in those playlists.
09:25 πŸ”— ivan has quit IRC (Quit: Leaving)
09:27 πŸ”— ivan has joined #archiveteam-bs
09:27 πŸ”— Fusl__ sets mode: +o ivan
09:27 πŸ”— Fusl sets mode: +o ivan
09:27 πŸ”— Fusl_ sets mode: +o ivan
09:28 πŸ”— ivan- has joined #archiveteam-bs
09:28 πŸ”— Fusl__ sets mode: +o ivan-
09:28 πŸ”— Fusl sets mode: +o ivan-
09:28 πŸ”— Fusl_ sets mode: +o ivan-
09:40 πŸ”— ivan has quit IRC (Ping timeout: 745 seconds)
09:40 πŸ”— LowLevelM erkinalp: I have been working on Warriorbox, the raspberry pi archiver. The source is https://gitlab.com/lowlevelm/warriorbox, and I made an irc for the topic. #warriorbox
10:27 πŸ”— BartoCH has joined #archiveteam-bs
10:45 πŸ”— BlueMax has quit IRC (Quit: Leaving)
10:52 πŸ”— Nick-PC_ has joined #archiveteam-bs
11:30 πŸ”— Gfy has quit IRC (Ping timeout: 258 seconds)
11:55 πŸ”— Nick-PC__ has joined #archiveteam-bs
11:59 πŸ”— Nick-PC_ has quit IRC (Ping timeout: 252 seconds)
12:04 πŸ”— Gfy has joined #archiveteam-bs
12:12 πŸ”— erkinalp has quit IRC (Ping timeout: 260 seconds)
12:17 πŸ”— girst_ is now known as girst
12:21 πŸ”— Nick-PC_ has joined #archiveteam-bs
12:25 πŸ”— Nick-PC__ has quit IRC (Ping timeout: 252 seconds)
13:07 πŸ”— yano has quit IRC (Quit: WeeChat, The Better IRC Client, https://weechat.org/)
13:10 πŸ”— Nick-PC_ has quit IRC (Ping timeout: 252 seconds)
13:11 πŸ”— Nick-PC has joined #archiveteam-bs
13:20 πŸ”— yano has joined #archiveteam-bs
13:26 πŸ”— VoynichCr has joined #archiveteam-bs
13:52 πŸ”— britmob has quit IRC (Ping timeout: 252 seconds)
13:53 πŸ”— britmob has joined #archiveteam-bs
13:57 πŸ”— ivan- is now known as ivan
14:20 πŸ”— manjaro-u has joined #archiveteam-bs
14:24 πŸ”— manjaro-u has quit IRC (Read error: Operation timed out)
14:27 πŸ”— manjaro-u has joined #archiveteam-bs
17:27 πŸ”— tech234a has joined #archiveteam-bs
17:38 πŸ”— SmileyG has joined #archiveteam-bs
17:40 πŸ”— Smiley has quit IRC (Ping timeout: 252 seconds)
18:19 πŸ”— powerKitt has joined #archiveteam-bs
18:20 πŸ”— powerKitt With the FTC actively targeting YouTube users now with extremely vague rules and massive fines, I doubt it's going to last post-2020
18:25 πŸ”— powerKitt And if it does, most of the notable content will be set to private or outright deleted to prevent the risk of financially ruining the content creator
18:32 πŸ”— powerKitt So we've got roughly a month and a half to back as much up to IA as we can, while IA is actively setting archiveteam_youtube videos to noindex
18:32 πŸ”— powerKitt due to copyright concerns
18:34 πŸ”— DogsRNice has joined #archiveteam-bs
18:37 πŸ”— markedL which deadline is 1.5months away?
18:39 πŸ”— powerKitt January 1st, 2020 is when the FTC starts issuing $42,000 fines to users on YouTube
18:40 πŸ”— powerKitt Based on extremely vague guidelines about "child friendly content"
18:40 πŸ”— ivan #youtubearchive
18:40 πŸ”— powerKitt As a result, most if not all of the noteworthy content on the site will probably be privated or deleted by then
18:41 πŸ”— powerKitt due to users protecting themselves from being financially destroyed by $42,000 fines per video
18:41 πŸ”— ivan YouTube is gonna survive 2020 barring nukes asteroids etc
18:42 πŸ”— ivan YouTube is much much larger than US creators with potentially kid-friendly content
18:42 πŸ”— ivan FTC is not an unchangeable robot
18:43 πŸ”— powerKitt YouTube as a site will survive, but the content people will remember and want to look back on likely will not.
18:44 πŸ”— ivan we have feedback mechanisms in the US where we can try to contain egregiously bad conduct through our senators or through lawsuits or through protests/blackouts
18:45 πŸ”— ivan anyway, help me find all the good content
18:53 πŸ”— powerKitt The problem is that, right now, anything the FTC decides is content "attractive to children" can result in $42,000 fines.
18:53 πŸ”— powerKitt And that's so vague that there's no telling what all is affected, and it's going to take time to get the FTC to change
18:54 πŸ”— ivan powerKitt: you have google fiber, can you become my hosting provider for youtube archiving
18:55 πŸ”— powerKitt I don't have the hard drive space on hand, unfortunately
18:55 πŸ”— ivan powerKitt: I copy ~22TB of YouTube every day into Google Drive
18:57 πŸ”— ivan so I just need bandwidth and CPU
19:01 πŸ”— powerKitt I'll have to upgrade my CPU first, I've basically got a potato at the moment.
19:10 πŸ”— jc86035 powerKitt: do you have a source for the FTC fines? I was able to find some articles from September about the $170 million fine but nothing about individual channels
19:10 πŸ”— jc86035 so not sure if this is actually a thing
19:11 πŸ”— powerKitt https://twitter.com/Chadtronic/status/1195794036247928833
19:11 πŸ”— markedL this describes what they're talking about, https://www.youtube.com/watch?v=1b9HGNHm-aQ
19:11 πŸ”— powerKitt https://twitter.com/Chadtronic/status/1195794036247928833?s=20
19:12 πŸ”— powerKitt * https://www.youtube.com/watch?v=qu18mjLhCCM
19:15 πŸ”— jc86035 wow, there's barely any mainstream coverage
19:18 πŸ”— powerKitt https://www.ftc.gov/news-events/audio-video/video/ftc-press-conference-settlement-google-youtube Full FTC video
19:25 πŸ”— jc86035 knowing the state of the US government, it seems kind of bad considering the massive potential for things to go wrong, but I don't think we'll really know until the new enforcement comes into effect. most of the coverage this week is related to completely unrelated things for some reason
19:26 πŸ”— LowLevelM From what I have heard so far, I think this will be affecting ads on videos, and will not be removing any videos.
19:27 πŸ”— powerKitt The problem is, if YouTubers guess wrong about if their videos "attract children" or not, they'll be facing $42,000 fines per video.
19:27 πŸ”— jc86035 I feel like it would be better to wait for an actual third-party legal expert to weigh in, it looks like the consternation about the fines is mostly being led by the YouTubers themselves?
19:29 πŸ”— powerKitt The change isn't removing videos by itself, but users are already removing their videos to prevent the possibility of fines
19:29 πŸ”— jc86035 It seems possible that they would fine channels but it doesn't look like they've done a lot of that in the past, you'd expect there to be news stories about random small websites being fined tens of thousands of dollars
19:29 πŸ”— LowLevelM If they are unsure, they should just remove ads.
19:29 πŸ”— jc86035 It's not an option for all YouTube users, some people presumably still use just YouTube as their only revenue stream
19:30 πŸ”— powerKitt "It's like the saying of shooting fish in a barrel. YouTube is the barrel, and the content creators are the fish" - the FTC in the press conference I linked
19:30 πŸ”— powerKitt They are directly going for individual YouTube users.
19:31 πŸ”— jc86035 He does explicitly say "civil penalties", as well
19:32 πŸ”— jc86035 I guess it's possible that they could stay consistent with their track record of lax enforcement and not bankrupt a bunch of financially unstable people, but their current interpretation of the law does seem like it would be open to abuse
19:42 πŸ”— powerKitt has quit IRC (Ping timeout: 260 seconds)
19:54 πŸ”— asdf0101 has quit IRC (The Lounge - https://thelounge.chat)
19:54 πŸ”— markedL has quit IRC (Quit: The Lounge - https://thelounge.chat)
19:54 πŸ”— asdf0101 has joined #archiveteam-bs
19:55 πŸ”— markedL has joined #archiveteam-bs
20:47 πŸ”— erkinalp has joined #archiveteam-bs
21:14 πŸ”— LowLevelM has quit IRC (Read error: Operation timed out)
21:14 πŸ”— LowLevelM has joined #archiveteam-bs
21:27 πŸ”— erkinalp has quit IRC (Ping timeout: 260 seconds)
21:36 πŸ”— tech234a has quit IRC (Quit: Connection closed for inactivity)
21:58 πŸ”— zhongfu has quit IRC (Remote host closed the connection)
22:04 πŸ”— icedice has joined #archiveteam-bs
22:04 πŸ”— icedice has quit IRC (Connection closed)
22:05 πŸ”— icedice has joined #archiveteam-bs
22:06 πŸ”— zhongfu has joined #archiveteam-bs
22:40 πŸ”— X-Scale` has joined #archiveteam-bs
22:45 πŸ”— X-Scale has quit IRC (Read error: Operation timed out)
22:45 πŸ”— X-Scale` is now known as X-Scale
22:45 πŸ”— BlueMax has joined #archiveteam-bs
23:39 πŸ”— raeyulca has quit IRC (Ping timeout: 610 seconds)
