[00:13] *** Zerote has quit IRC (Read error: Connection reset by peer) [00:14] *** Zerote has joined #archiveteam-bs [00:21] *** omglolbah has joined #archiveteam-bs [00:34] *** katocala has quit IRC () [00:40] *** katocala has joined #archiveteam-bs [00:53] *** jason0597 has quit IRC (Read error: Operation timed out) [01:17] yeah I'm thinking of a project for the clips [01:17] does anyone have lists or estimates? [01:20] I see a ton of backlog, did I miss anything important? [01:27] arkiver: I think it's mostly wiki edit notices from purplebot for the past few days in this channel [01:27] That and JAA talking about Tigris [01:29] nice [01:29] thanks jodizzle [01:34] *** JAA sets mode: +o arkiver [01:34] Yep, still looking into that and hoping they'll fix the CVS server. [01:35] Hmm...is there anything that needs attention overall in ArchiveTeam? It's about a week until the end of 2020 June s: [01:36] I know there's that Microsoft coding website that's shutting down (yet didn't give an exact date of when in this month)... [01:36] Ryz: we could use some workers on the bitbucket project :) [01:36] I'll make an announcement tomorrow as well regarding bitbucket [01:36] *doesn't have machinery capable of such a task* oo; [01:38] no problem :P [01:38] https://developer.arm.com/docs still needs to be archived. The AB job didn't get far due to some cookie bullshit (switching to an "edit mode", which then causes every page to redirect to a login form). [01:38] I'm mainly so focused and specialized with #archivebot <#>; [01:39] I did some hunting for HTML parsers (partial and full) for JAA yesterday although it's a mixed bag... [01:39] Well, you did ask about AT in general. :-P [01:40] I mean considering I would want to see some activity happening, most of the stuff is inactive aside from a few bright spots (and one very huge bright as a sun as ArchiveBot) [01:41] Ryz: How about doing some verification that we grabbed all of TranceFix.nl? 
You know, clicking around, going to deep pagination, etc. [01:42] Aside from unfortunately the new posts probably ever since the archive had started at the time, sure I'll try to poke around [01:44] Hmm... I have an idea~ [01:45] Okay, the https://www.trancefix.nl/ job started running on 2020 May 06; I can just try and scoop up all of the new and bumped topics of that forum and run 'em in "!a <" [01:45] Or "!ao <" [01:51] Thoughts JAA? [01:54] Hmm, in the future, it could be a Python script to auto-collect topics and the respective pagination [01:59] Ryz: Yeah, that sounds like a good idea. We did grab some of those threads for various reasons, but definitely not everything. [01:59] !ao < but with a dummy in front to get rid of the session IDs. [02:00] https://www.trancefix.nl/activity.php should be useful for this, but no idea how far back it goes. [02:01] Yeah, just a month it seems. [02:01] Clicked 30 times~ [02:02] A dummy? Do I have to apply it to all of the URLs? [02:03] No, just one at the beginning and set the concurrency to 1 at the start. [02:06] The first one is https://www.trancefix.nl/showthread.php?358454-Closing-of-TranceFix-nl - so it'll be https://www.trancefix.nl/showthread.php?358454-Closing-of-TranceFix-nl?archiveteam ? [02:08] Oh, BTW, has the remaining targeted links been retried in a separate job? [02:08] No, just add a trancefix.nl/forum.php?archiveteam I think. [02:09] *** nicolas17 has quit IRC (Quit: Konversation terminated!) [02:09] Ah, okay~ [02:09] Since it'll be "!ao <" I'll have to grab the individual pages of the topic threads correct? [02:09] There were remaining onsite URLs? 
[02:09] Correct [02:10] Mm, it's kinda fortunate since it's dying down, I don't have to have my progress grind to a bump on fetching those extra pagination [02:24] *** zhongfu has quit IRC (Remote host closed the connection) [02:24] *** zhongfu has joined #archiveteam-bs [02:26] Ryz: If you have a list of threads, I can write a quick script to get the page counts and generate all the page URLs. [02:27] I.e. a list of all threads that have received a reply since 6 May. [02:27] Welp, I was almost finished, or rather, halfway finished through the biggest and thickest one being https://www.trancefix.nl/forumdisplay.php?227-Music-Related-Discussion [02:27] Lemme give you that list [02:29] Here JAA: https://transfer.notkiska.pw/AXw59/TranceFix-mine-out-the-thread-pagination [02:33] Wish you offered that earlier, because I'm more or less stretched <#>; [02:34] What does that list cover exactly? It's way too short to be everything. [02:35] I said I was almost done gathering all of the new topics >>; [02:35] This is the rest of it [02:35] Or is that just the multi-page threads you didn't already do? [02:35] Yeah, that, I did the multi-page threads myself with some tools <#>; [02:35] Right [02:35] Notepad++ is a major blessing [02:36] grep :-) [02:46] Ryz: https://transfer.notkiska.pw/jdyMA/trancefix-updated-threads-pages [02:46] *** DogsRNice has quit IRC (Read error: Connection reset by peer) [02:46] And if you want to learn this magic: tr -d '\r' (curl-ua is from my little-things.) [02:47] Huh, doesn't have the first page, time to re-add that~ [02:47] Yes, I only did the later pages. You can just combine it with your list. [02:50] Time to run the latest threads~ [03:01] JAA: On developer.arm.com: if the current run really is unfixably stuck, you may as well ignore the site itself and let it finish, so that the log gets uploaded and the rest of us can look at it [03:01] OrIdow6: I was hoping the cookie might have a short expiration time, but yeah, probably not. 
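The thread page list discussed around 02:26 to 02:47 could be generated along these lines. This is a minimal sketch, not the script JAA actually ran: it assumes the forum paginates by appending "/pageN" to the thread URL (common for vBulletin friendly URLs, but not confirmed in the log), that the page count per thread is already known, and that the dummy URL suggested at 02:08 uses https. The names clean_lines, page_urls and build_job_list are hypothetical.

```python
# Sketch only: the "/pageN" suffix and the dummy URL scheme are assumptions.
DUMMY_URL = "https://www.trancefix.nl/forum.php?archiveteam"


def clean_lines(raw):
    # Same idea as the tr -d step in the log: drop carriage returns left by
    # Windows editors, then discard blank lines.
    return [line.strip() for line in raw.replace("\r", "").split("\n") if line.strip()]


def page_urls(thread_url, page_count, include_first=True):
    # Page 1 is the bare thread URL; later pages get a "/pageN" suffix (assumed).
    first = 1 if include_first else 2
    return [
        thread_url if page == 1 else f"{thread_url}/page{page}"
        for page in range(first, page_count + 1)
    ]


def build_job_list(threads):
    # threads: iterable of (thread_url, page_count) pairs.
    # The dummy URL goes first so the session ID is picked up there.
    urls = [DUMMY_URL]
    for thread_url, page_count in threads:
        urls.extend(page_urls(thread_url, page_count))
    return urls
```

Passing include_first=False reproduces the "later pages only" list mentioned at 02:47, which can then be combined with a separate list of first pages.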
[03:02] arkiver: I'm looking at Mixer - it's sort of hard to estimate b/c users with video saved seem to be very sparse, but I'll hopefully have something soon [03:19] Are you able to get a list of users OrIdow6? tech234a mentioned rate limiting in the API. Even if they don't have clips I figure we should save the user profiles. [03:20] I'm not surprised that most people watched but never streamed. [03:25] *** godane has quit IRC (Ping timeout: 745 seconds) [03:29] *** godane has joined #archiveteam-bs [03:36] *** Stiletto has quit IRC () [03:38] lennier1: There's an API that maps sequential "channel" IDs to info about them, so that's effectively a list; there are about 20 million of them [03:38] By the way, do you know the difference between users and channels; and clips, vod, and recordings? [03:39] I found an API endpoint that lists all? "recordings" that I'll look at in a bit [03:39] *** qw3rty_ has joined #archiveteam-bs [03:41] *** Stiletto has joined #archiveteam-bs [03:41] A clip is a short segment of a broadcast that any user can save, and is kept indefinitely. A VOD is a replay of an entire stream--these are regularly purged. [03:42] This page has a good explanation of VODs and how long they're kept for various accounts. https://watchbeam.zendesk.com/hc/en-us/articles/209662033-Past-Streams-VoDs- [03:43] Not really familiar with "recording" as a term--maybe includes both clips and VODs? [03:43] *** qw3rty__ has quit IRC (Ping timeout: 265 seconds) [03:46] The Mixer API reports that there are 1035395 total saved stream recordings (not clips), though I'm not sure how useful that info is [03:46] Perhaps this could be used to enumerate channels that have streamed recently? [03:48] Right now, I'm just concerned with getting an estimate [03:54] I don't know if there's a difference between channel and user. Maybe channel just refers to a user's stream? Are user IDs and channel IDs referenced by the API the same? 
https://dev.mixer.com/rest/index.html [04:17] Very rough lower bound on Mixer: 3.6 PiB [04:18] arkiver [04:22] lennier1: It appears they are the same [04:22] The IDs [04:23] Thanks, that makes sense to me. [04:23] *** hook54321 has quit IRC (Quit: Connection closed for inactivity) [04:26] *** godane has quit IRC (Read error: Connection reset by peer) [04:32] *** lennier2 has joined #archiveteam-bs [04:36] *** lennier2_ has joined #archiveteam-bs [04:37] Yeah...... [04:39] *** lennier1 has quit IRC (Read error: Operation timed out) [04:39] *** lennier2_ is now known as lennier1 [04:41] *** lennier2 has quit IRC (Ping timeout: 272 seconds) [04:45] *** Maylay has quit IRC (Read error: Operation timed out) [04:50] *** Maylay has joined #archiveteam-bs [04:59] Next step is presumably to collect enough data to see what filtering on popularity etc. would do to that number [05:05] Yeah, something like follower count could give you an idea of how popular a streamer is (though presumably the more popular channels have more clips). [05:05] Is the 3.6 PiB for clips only? [05:09] *** Maylay has quit IRC (Ping timeout: 745 seconds) [05:13] lennier1: Just "VOD", using the /recordings endpoint [05:14] Randomly chose 30 pages, for each VOD in the page, I multiplied the lowest listed bitrate by the duration [05:19] *** bsmith093 has quit IRC (Read error: Operation timed out) [05:19] I couldn't find a way to list all clips [05:19] Though I didn't look too hard after coming up with that first total [05:21] What sort of durations are there? Clips seem to be 1m3s or less. VODs can be several hours. [05:22] GET /clips/channels/{channelId} [05:22] Returns all clips for the channel. [05:22] I know about that endpoint [05:23] But it is near useless unless you have a good sample of channels as well [05:23] Again, I'm not enumerating them now, just estimating [05:26] Right, but maybe using that endpoint for a random sample of channel IDs. 
Not sure how many you'd need to do to get a reasonable estimate. [05:27] 2 hours ago or so, I tried listing ~120 channels, and none had "VOD" recordings, so I imagine a similar ratio applies here [05:28] What a small sitemap: https://mixer.com/sitemap.xml [05:28] "users with video saved seem to be very sparse" [05:29] kiska: I like how they have a user's page listed as the last one [05:30] Huh this is what I see https://server8.kiska.pw/uploads/d67ee29a52e1890f/image.png [05:30] Oh they have an API... [05:31] *** bsmith093 has joined #archiveteam-bs [05:31] That's what I mean; https://mixer.com/Ninja is an individual user (albeit apparently a popular one) [05:32] The most popular one. [05:32] https://socialblade.com/mixer/top/500/most-followers [05:33] Most popular Mixer topic right now--Mixer closing. :) [05:34] Yeah and it's streamers trying to figure out where to take their audience [05:34] everyone over to twitch! [05:34] lol [05:34] Apparently there is this endpoint https://mixer.com/users but it doesn't load for me [05:34] yeah, though Mixer is partnering with Facebook Gaming [05:35] There might be more users with clips since those stick around. For regular users, VODs get deleted after 14 days. [05:35] kiska: /api/v1/users [05:36] "no such endpoint" [05:36] I assume I'll need to not run that in Chrome :D [05:36] Yeah I get 404 on that endpoint :D [05:36] So maybe I need to auth with it? [05:36] Most of the APIs work fine in Chrome btw [05:36] Have you seen https://dev.mixer.com/rest/index.html? [05:37] I don't think there is a bare /users endpoint [05:37] I'm trying out the API in Firefox now. [05:37] *** Maylay has joined #archiveteam-bs [05:37] *** Maylay has quit IRC (Remote host closed the connection!) 
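The lower-bound estimate described at 05:14 (lowest listed bitrate times duration, over a random sample of recordings) can be sketched as below. This is an illustration under assumptions, not OrIdow6's actual calculation: it assumes bitrates are in bits per second and durations in seconds, and that the sample mean is scaled by the total recording count of 1035395 reported at 03:46. The function names are hypothetical.

```python
TOTAL_RECORDINGS = 1_035_395  # count reported by the API, per the log at 03:46


def estimate_total_bytes(sample, total_count=TOTAL_RECORDINGS):
    # sample: list of (lowest_bitrate_bits_per_second, duration_seconds) pairs.
    # Units are an assumption; the log does not state them.
    if not sample:
        raise ValueError("sample must not be empty")
    sizes = [bitrate * duration / 8 for bitrate, duration in sample]
    mean_size = sum(sizes) / len(sizes)
    # Scale the average sampled size up to the whole population.
    return mean_size * total_count


def to_pib(num_bytes):
    # 1 PiB = 2**50 bytes.
    return num_bytes / 2**50
```

Because the lowest bitrate of each recording is used, the result is a lower bound on the stored size of one rendition, which matches how the 3.6 PiB figure is described in the log.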
[05:37] Most Mixer APIs don't require authentication (unless they have the lock icon in the docs) [05:37] tech234a: Yes, that's what I've been going off here [05:38] Get numerical ID from user name (CaptainCripp1e): https://mixer.com/api/v1/channels/CaptainCripp1e?fields=id [05:38] *** Maylay has joined #archiveteam-bs [05:38] Good to know [05:38] Then something like: https://mixer.com/api/v1/channels/5739806 [05:39] Some URLs grabbed from Common Crawl/Internet Archive https://usercontent.irccloud-cdn.com/file/cv4PgJGM/mixer.com.txt [05:40] Clips for that channel: https://mixer.com/api/v1/clips/channels/5739806 [05:40] Might not have gotten everything that's in Common Crawl because I was getting some errors, not sure if it was on my end or Common Crawl's end [05:40] I already checked the CDX server [05:41] The problem with that is that it's not a random sample in any case [05:41] Some URLs from when it was Beam, useful for finding very old users https://usercontent.irccloud-cdn.com/file/OtNcOmiz/beam.pro.txt [05:41] yeah its not very random [05:41] So getting information based on that list will be biased to whatever the IA's crawl configuration picks up [05:45] What I'll probably do now (or later) is just choose random numerical IDs and see what's at them [05:46] Some kind of live API: https://dev.mixer.com/reference/constellation/events/live [05:46] Unsure if it requires specific channels to be specified [05:53] *** HP_Archiv has joined #archiveteam-bs [06:07] Actually, I guess clips do expire after 90 days. That's different than Twitch. https://watchbeam.zendesk.com/hc/en-us/articles/360005089311-Clips-FAQ- [06:08] Length is 5 seconds to 300 seconds. 
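The random-ID approach proposed at 05:45 avoids the bias of crawl-derived lists. A minimal sketch follows: it only builds the URLs to request and does no fetching. It assumes channel IDs run from 1 up to roughly 20 million (the figure given at 03:38) and reuses the clips URL pattern shown at 05:40; MAX_CHANNEL_ID, sample_channel_ids and clips_url are hypothetical names.

```python
import random

API_BASE = "https://mixer.com/api/v1"
MAX_CHANNEL_ID = 20_000_000  # assumption based on "about 20 million" in the log


def sample_channel_ids(k, max_id=MAX_CHANNEL_ID, seed=None):
    # Draw k distinct IDs uniformly at random; a fixed seed makes the
    # sample reproducible so others can re-run the same estimate.
    rng = random.Random(seed)
    return rng.sample(range(1, max_id + 1), k)


def clips_url(channel_id):
    # URL pattern taken from the example at 05:40.
    return f"{API_BASE}/clips/channels/{channel_id}"
```

Some sampled IDs will not correspond to an existing channel, so the share of misses would need to be tracked alongside the share of channels that actually have clips.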
[06:14] *** HP_Archiv has quit IRC (Quit: Leaving) [06:32] *** hook54321 has joined #archiveteam-bs [06:33] *** svchfoo3 sets mode: +o hook54321 [06:35] *** HP_Archiv has joined #archiveteam-bs [06:40] *** godane has joined #archiveteam-bs [06:41] *** HP_Archiv has quit IRC (Quit: Leaving) [06:58] Weird [06:59] That’s very different from Twitch [07:23] *** larryv has quit IRC (larryv) [08:48] *** godane has quit IRC (Ping timeout: 256 seconds) [08:49] *** justcool3 has quit IRC (Quit: Connection closed for inactivity) [09:13] *** godane has joined #archiveteam-bs [09:13] SketchCow: good news i got a magazine for scanning called buzzworm magazine [09:14] you guys don't have it and it doesn't look like it's been digitized anywhere [09:14] it's an environmental journal magazine [09:40] *** VerifiedJ has joined #archiveteam-bs [09:46] A tip to you all: don't leave a Mixer tab open for a long time; it was eating ~5 GB of memory for me [10:46] OrIdow6: thank you! [10:46] let's make a channel [10:46] I'd say on hackint [10:51] arkiver: Thanks, though I'm not exactly the bearer of good news with that size estimate, haha [10:53] And yes, Hackint has become the norm these days [10:57] OrIdow6: I'm thinking we might only get the clips that users explicitly saved [10:58] The automatically recorded data that is removed anyway after a few days is probably not as important [10:59] Per https://watchbeam.zendesk.com/hc/en-us/articles/360005089311-Clips-FAQ- , explicitly saved clips get removed, too [11:00] ah [11:00] is there anything that would be permanent? [11:00] https://watchbeam.zendesk.com/hc/en-us/articles/209662033-Past-Streams-VoDs- - "Some channels (such as official esports and event channels) have a longer [infinite?] 
retention limit due to business reasons" [11:01] "longer" than 180 days, in any case [11:02] If there aren't too many clips, I can see it useful to save all of them, anyhow, as they're effectively the user-curated "best" of the site, albeit only the last 90 days of the site [11:05] Sort of a sample of what it was like [11:12] https://github.com/mixer - should probably be saved, too [11:14] *** Jens has quit IRC (Ping timeout: 265 seconds) [11:14] *** Jens has joined #archiveteam-bs [11:18] And yet "curl "https://mixer.com/api/v2/vods/channels/19088261" | jq .[].expirationDate" gives "expirationDate"s in 2025 [11:19] (As far as I can tell, the /v2/ API is not publicly documented) [11:20] OrIdow6: Ryz already threw the GitHub org into ArchiveBot. [11:21] Oh, thanks to Ryz then [12:09] *** dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.) [12:24] arkiver: #mixdown on hackint? [12:24] *** dashcloud has joined #archiveteam-bs [12:26] *** jason0597 has joined #archiveteam-bs [12:45] *** katocala has quit IRC (Ping timeout: 496 seconds) [12:46] *** katocala has joined #archiveteam-bs [12:51] HCross: lets do that [13:12] *** dashcloud has quit IRC (Ping timeout: 745 seconds) [13:22] *** BlueMax has quit IRC (Quit: Leaving) [13:28] *** dashcloud has joined #archiveteam-bs [13:49] @jodizzle Thanks for the ACM download links. Sadly they blocked my IP after 100 files. 
[14:04] *** dashcloud has quit IRC (Read error: Operation timed out) [14:31] *** Raccoon has quit IRC (Remote host closed the connection) [14:32] *** dashcloud has joined #archiveteam-bs [14:32] *** Raccoon has joined #archiveteam-bs [14:38] *** DogsRNice has joined #archiveteam-bs [14:39] *** Arcorann_ has quit IRC (Read error: Connection reset by peer) [14:43] *** DogsRNice has quit IRC (Ping timeout: 265 seconds) [14:46] *** DogsRNice has joined #archiveteam-bs [14:49] *** dashcloud has quit IRC (Read error: Operation timed out) [16:05] *** systwi_ has joined #archiveteam-bs [16:06] *** systwi has quit IRC (Read error: Operation timed out) [16:10] *** yano_ is now known as yano [16:37] *** dashcloud has joined #archiveteam-bs [17:18] *** dashcloud has quit IRC (Read error: Operation timed out) [17:26] qw3rty_: Yeah, figured that would happen. I think if we wanted to get all of them, the best way would be to do it with mips, but my understanding is that that machine doesn't exist anymore. [17:27] One thing to note is that you can get the abstract of each paper by replacing /pdf/ with /abs/ [17:28] You can also sometimes get HTML versions by replacing /pdf/ with /fullHtml/ (https://dl.acm.org/doi/fullHtml/10.1145/3363499) though I think that's much rarer. [17:30] But yeah, feel free to share the list with anyone in case they have the tools to deal with it. I also have a CSV that maps paper titles to links if anyone wanted to pick and choose. [18:20] *** logchfoo2 starts logging #archiveteam-bs at Tue Jun 23 18:20:31 2020 [18:20] *** logchfoo2 has joined #archiveteam-bs [18:20] *** paul2520 has joined #archiveteam-bs [18:23] *** Zerote has quit IRC (Read error: Operation timed out) [18:25] *** systwi_ has quit IRC (Read error: Operation timed out) [18:34] I occasionally read Slate Star Codex. Didn't realize his identity was pressing national news. 
Some relevant links: [18:34] https://sscpodcast.libsyn.com/ [18:34] https://twitter.com/slatestarcodex [18:35] https://www.reddit.com/r/slatestarcodex/ [18:50] *** godane has quit IRC (Read error: Operation timed out) [18:55] *** nicolas17 has joined #archiveteam-bs [19:12] *** godane has joined #archiveteam-bs [19:15] *** paul2520 has quit IRC (Read error: Operation timed out) [19:17] *** Jake has quit IRC (Read error: Operation timed out) [19:18] *** Wingy has quit IRC (Read error: Operation timed out) [19:19] *** asdf0101 has quit IRC (Read error: Operation timed out) [19:24] *** Gfy_ has joined #archiveteam-bs [19:24] *** Gfy has quit IRC (Read error: Connection reset by peer) [19:24] *** Stiletto has quit IRC (Read error: Operation timed out) [19:25] *** Stiletto has joined #archiveteam-bs [19:25] *** dxrt_ has quit IRC (Ping timeout: 622 seconds) [19:26] *** sembiance has quit IRC (Read error: Connection reset by peer) [19:26] *** systwi has quit IRC (Ping timeout: 622 seconds) [19:30] *** jason0597 has quit IRC (Read error: Operation timed out) [19:31] *** luckcolor has quit IRC (Ping timeout: 622 seconds) [19:31] *** mr_archiv has quit IRC (Ping timeout: 622 seconds) [19:32] *** mr_archiv has joined #archiveteam-bs [19:32] *** luckcolor has joined #archiveteam-bs [19:36] *** systwi has joined #archiveteam-bs [19:37] *** dxrt_ has joined #archiveteam-bs [19:38] *** sembiance has joined #archiveteam-bs [19:38] *** scorche` has joined #archiveteam-bs [19:39] *** Panasonic has joined #archiveteam-bs [19:39] *** Ravenloft has quit IRC (Read error: Connection reset by peer) [19:39] *** Jake has joined #archiveteam-bs [19:39] *** wp494 has quit IRC (Ping timeout: 255 seconds) [19:40] *** scorche has quit IRC (Read error: Operation timed out) [19:40] *** scorche` is now known as scorche [19:40] *** wp494 has joined #archiveteam-bs [19:40] *** systwi_ has joined #archiveteam-bs [19:41] *** colona_ has quit IRC (Read error: Connection reset by peer) [19:41] *** 
colona has joined #archiveteam-bs [19:42] *** paul2520 has joined #archiveteam-bs [19:42] *** scorche has quit IRC (hub.efnet.us irc.Prison.NET) [19:42] *** godane has quit IRC (hub.efnet.us irc.Prison.NET) [19:42] *** bsmith093 has quit IRC (hub.efnet.us irc.Prison.NET) [19:42] *** achip has quit IRC (hub.efnet.us irc.Prison.NET) [19:42] *** phirephly has quit IRC (hub.efnet.us irc.Prison.NET) [19:42] *** Somebody2 has quit IRC (hub.efnet.us irc.Prison.NET) [19:43] *** fuzzy802 has joined #archiveteam-bs [19:45] *** phirephl- has joined #archiveteam-bs [19:46] *** Ctrl has quit IRC (Read error: Operation timed out) [19:46] *** dxrt- has joined #archiveteam-bs [19:46] *** dxrt has quit IRC (Ping timeout: 265 seconds) [19:47] *** colona has quit IRC (Read error: Connection reset by peer) [19:49] *** systwi has quit IRC (Ping timeout: 622 seconds) [19:49] *** maxfan8 has quit IRC (Quit: WeeChat 2.8) [19:49] *** sivoais_ has quit IRC (Remote host closed the connection) [19:51] *** fuzzy8021 has quit IRC (Read error: Operation timed out) [19:52] *** colona has joined #archiveteam-bs [19:52] *** maxfan8 has joined #archiveteam-bs [19:53] *** fuzzy802 is now known as fuzzy8021 [19:59] *** Meli has quit IRC (Read error: Connection reset by peer) [20:00] *** scorche has joined #archiveteam-bs [20:00] *** godane has joined #archiveteam-bs [20:00] *** bsmith093 has joined #archiveteam-bs [20:00] *** achip has joined #archiveteam-bs [20:00] *** Somebody2 has joined #archiveteam-bs [20:03] *** wessel152 has quit IRC (Read error: Operation timed out) [20:04] *** SynMonger has joined #archiveteam-bs [20:04] *** Meli has joined #archiveteam-bs [20:09] *** sivoais has joined #archiveteam-bs [20:13] *** synm0nger has quit IRC (Read error: Operation timed out) [20:32] *** ranma has quit IRC () [20:42] *** dashcloud has joined #archiveteam-bs [20:43] *** jason0597 has joined #archiveteam-bs [20:51] *** Ctrl has joined #archiveteam-bs [21:06] *** VerifiedJ has quit IRC (Quit: Leaving) 
[21:26] so i'm at 1963k items now [21:27] i have uploaded 46k items so far this month [21:33] *** systwi has joined #archiveteam-bs [21:38] *** systwi_ has quit IRC (Read error: Operation timed out) [21:44] *** Maylay has quit IRC (Read error: Connection reset by peer) [21:46] *** Maylay has joined #archiveteam-bs [22:54] *** Arcorann_ has joined #archiveteam-bs [22:54] *** Arcorann_ has quit IRC (Read error: Connection reset by peer) [22:55] *** Arcorann_ has joined #archiveteam-bs [22:58] *** BlueMax has joined #archiveteam-bs [23:17] *** Arcorann_ has quit IRC (Read error: Connection reset by peer) [23:17] *** Somebody2 has quit IRC (Read error: Operation timed out) [23:18] *** bsmith094 has joined #archiveteam-bs [23:19] *** godane1 has joined #archiveteam-bs [23:20] *** scorche has quit IRC (hub.efnet.us irc.Prison.NET) [23:20] *** godane has quit IRC (hub.efnet.us irc.Prison.NET) [23:20] *** bsmith093 has quit IRC (hub.efnet.us irc.Prison.NET) [23:20] *** achip has quit IRC (hub.efnet.us irc.Prison.NET) [23:29] *** Arcorann has joined #archiveteam-bs [23:31] *** bsmith094 has quit IRC (Ping timeout: 745 seconds) [23:31] *** godane1 has quit IRC (Read error: Operation timed out) [23:41] *** bsmith093 has joined #archiveteam-bs [23:49] *** Somebody2 has joined #archiveteam-bs [23:49] *** achip has joined #archiveteam-bs [23:49] *** scorche has joined #archiveteam-bs