[00:05] oh my
[00:05] I have scripts for this already
[00:05] why does twitch say he's live?
[00:06] wait what, how does he have two years of VODs?
[00:09] i don't know much about twitch archiving cause i have not done it in years
[00:09] i figure youtube-dl or streamlink should work for that stuff
[00:13] he must have highlighted everything
[00:15] "Past Broadcasts" are supposed to expire after 60 days and the link you sent goes past 3 years
[00:15] so perhaps the latest ones won't expire either? idk
[00:17] started my script to get metadata, hope I don't get throttled...
[00:20] nicolas17: Some accounts are exempt from the expiration, I believe. Might be one of them?
[00:29] *** VADemon has joined #archiveteam-bs
[00:29] wow the wiki requires premoderation for new external links now?
[00:30] [17:36] Why are wiki edits being sent to moderation for review now?
[00:30] [17:39] Because you posted the secret publicly.
[00:30] lol
[00:31] *** OrIdow6 has joined #archiveteam-bs
[00:32] Yup, all edits go into a moderation queue now, and the secret word is no longer a thing.
[00:32] The idea is to whitelist trusted users, but it'll take a bit to set everything up correctly.
[00:35] There'll be a proper announcement when it's all done, I think.
[00:35] damn, the third oldest VOD is 18 hours
[00:37] *** wyatt8750 has joined #archiveteam-bs
[00:38] *** wyatt8740 has quit IRC (Ping timeout: 260 seconds)
[00:40] I think downloading these videos will involve one million files (HLS segments)
[00:43] nope bad estimate... it was 3 million
[01:23] I'm missing the video titles... bah
[01:38] *** rklane has joined #archiveteam-bs
[02:05] JAA: Ahhhhh shit
[02:08] I guess this isn't the first time the secret word got leaked?
[02:11] Wiki went with a secret word for 10 years, not bad
[02:12] Frogging: It's been changed at least twice since I've been active in AT stuff.
[02:12] how long's that?
[02:12] think I've seen your name around for a number of years, but not sure
[02:13] And that's....*checks wiki account age* 6 years or so.
[02:13] well
[02:13] once every 2 or 3 years on average seems alright
[02:13] I liked the secret word. I liked people being made to come to the IRC channel, a common point of conversation, and announce their interest.
[02:14] I also recognize that at some point, it just becomes a burden. If the traffic of Archive Team goes down, that is, it starts shrinking as a team, that sort of stuff can be fine
[02:14] But there's a lot of people and a lot of eyes. We have the staffing/volunteers for moderation, I'm fine.
[02:14] didn't it use to be "yahoosucks" as the password?
[02:14] Yeah that's the original password I remembered.
[02:15] I remember when I tried to get my account at first and got confused at the "what is your quest" question
[02:15] the quest is, well... to trivially edit the wiki?
[02:16] JAA: If you want a hand with anything on the wiki, let me know. I'm more than happy to help.
[02:37] *** rklane has quit IRC (Read error: Connection reset by peer)
[02:50] okay
[02:50] here's the metadata of every reckful Twitch VOD https://f001.backblazeb2.com/file/twitch-archive/reckful/reckful-vod-meta.tar.gz?Authorization=3_20200703024851_9d2c5dc2ab82c7acf1381f57_44a26874d2fd191343e5d08406b011edd0548ffb_001_20200710024851_0031_dnld
[02:51] to get the video content, feed every video's urllist.txt into your favorite download tool
[02:51] $ cat */urllist.txt | wc -l
[02:52] 3148824
[03:13] *** fredgido_ has joined #archiveteam-bs
[03:16] *** fredgido has quit IRC (Read error: Operation timed out)
[03:28] *** qw3rty_ has joined #archiveteam-bs
[03:35] *** qw3rty__ has quit IRC (Read error: Operation timed out)
[03:55] *** Pixi has quit IRC (Quit: Leaving)
[04:00] *** Pixi has joined #archiveteam-bs
[04:24] *** HP_Archiv has joined #archiveteam-bs
[04:37] ...
wow https://twitter.com/immunda/status/1278783894683336704
[04:51] *** rklane has joined #archiveteam-bs
[05:21] ^^ nicolas17, fascinating. wonder how that even came to be configured in that way
[05:22] it was certainly not intentional to "use it as a CDN" (as the tweet (jokingly?) implies) since the wayback machine has *terrible* latency
[05:31] I'm downloading the chats from Reckful's Twitch VODs, might take 24h
[05:31] I might be able to parallelize it, but I'm afraid that anything I do to make it faster could cause "too many requests" errors, so I'd rather wait
[05:32] looks like his streams averaged 1 message per second in the chat x_x
[06:06] *** VADemon has quit IRC (Read error: Connection reset by peer)
[06:17] *** rklane has quit IRC (Quit: This computer has gone to sleep)
[06:25] *** nicolas17 has quit IRC (Ping timeout: 765 seconds)
[06:31] *** sknebel_ has joined #archiveteam-bs
[06:31] *** sknebel has quit IRC (Read error: Connection reset by peer)
[06:36] That they experienced some sort of data loss and decided to restore their site from the WBM sounds like the most plausible scenario to me
[06:37] I like this comment, though: "From the new book. 101 ways to save on your AWS bill. This is way 100"
[06:39] Or that version control is difficult to access, they switched template engines and it's hard to recreate the output of the old one (to work with the new one) without a bunch of dependencies, etc.
[06:39] *or that they
[07:08] *** godane has quit IRC (Ping timeout: 265 seconds)
[07:29] *** BlueMax has quit IRC (Quit: Leaving)
[07:30] *** BlueMax has joined #archiveteam-bs
[10:00] *** schbirid has joined #archiveteam-bs
[10:20] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:12] *** Jake9 has joined #archiveteam-bs
[11:14] *** Jake has quit IRC (Read error: Operation timed out)
[11:14] *** Jake9 is now known as Jake
[12:06] *** fredgido_ has quit IRC (Ping timeout: 265 seconds)
[12:08] !d 6m81jii0evrnln9i2qsgedc2v 500 750
[12:09] ugh
[12:18] I noticed today that a website called "Cool Running" (http://www.coolrunning.com/) has gone offline due to data compliance issues (which IMHO is a fantastically terrible reason to shut down - "yeah, we shut down because we couldn't obey the law")
[12:19] not much we can do although it seems users can still request their data:
[12:20] "If you would like to access data previously available on Cool Running, please contact support [coolrun@activenetwork.com] with the subject line ‘Cool Running Data Request’ and provide an event name, state, and year. We will do our best to provide an archived HTML copy of the results pages that were previously hosted on Cool Running."
[12:21] should I add this to the Deathwatch?
[12:51] So my Tigris CVS web crawl finished just after 05:00. Checking the remaining errors now.
[12:59] Looks like apart from the .svn directories I already mentioned and two broken URLs, everything's ok.
[13:01] Specifically, that's http://zebra.tigris.org/source/browse/*checkout*/zebra/src/vbcode/ACGWFD/ACGWFD.exe?revision=1.8 and http://zebra.tigris.org/source/browse/*checkout*/zebra/src/vbcode/ACGWFD/ACGWFD.exe?revision=1.6
[13:01] The other revisions of the file are fine.
[13:04] By the way, what I did was to drop all cookies whenever I encountered an /ErrorPage or login redirect (and also ignore the redirect response, i.e. not write it to WARC).
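The cookie handling described at 13:04 could look roughly like the following. This is only a minimal sketch under assumptions, not the crawler that was actually used: `should_record`, `BAD_REDIRECT_MARKERS`, and the `/login` marker are hypothetical names and values chosen for illustration; only `/ErrorPage` comes from the log.

```python
from http.cookiejar import CookieJar
from typing import Optional
from urllib.parse import urlsplit

# Path fragments that mark a "bad" redirect target. "/ErrorPage" is from the
# log; "/login" is an assumed placeholder for whatever the login URL was.
BAD_REDIRECT_MARKERS = ("/ErrorPage", "/login")

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def should_record(status: int, location: Optional[str], jar: CookieJar) -> bool:
    """Decide whether a response should be written to the archive.

    If the response redirects to an error or login page, clear every cookie
    in the jar (so the next request starts a fresh session) and return False
    so the caller skips writing this redirect.
    """
    if status in REDIRECT_STATUSES and location:
        path = urlsplit(location).path
        if any(marker in path for marker in BAD_REDIRECT_MARKERS):
            jar.clear()  # drop all cookies, as described in the log
            return False
    return True
```

The caller would retry the original URL after a `False` result, since the session that produced the redirect has been discarded.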
[13:05] On the website crawl, argouml and tortoisesvn have finished, so it's only argouml-stats and subversion remaining now.
[13:07] argouml-stats is at 64k of 73k messages.
[13:08] subversion is at 90k of 369k.
[13:09] I think I'll set up something to grab the latter faster since it'll take another week or so.
[13:42] *** godane has joined #archiveteam-bs
[14:41] *** DopefishJ is now known as DFJustin
[15:19] I found another 14 projects on Tigris. I had missed the "View subprojects" checkbox on the project list before somehow, and that's where those projects I discovered via the category pages came from as well.
[15:35] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[15:37] *** DogsRNice has joined #archiveteam-bs
[16:08] *** HP_Archiv has quit IRC (Quit: Leaving)
[16:20] *** schbirid has quit IRC (Quit: Leaving)
[16:33] *** nicolas17 has joined #archiveteam-bs
[16:50] Those extra projects are done now. The subversion forums have been running for about 2 hours, with about 2.8k of 15.9k pages done.
[16:53] Komixxy is nearing completion for the posts at 1.36M of 1.5M done. ~220k user profiles to be grabbed after that, and a fucktonne of errors to investigate.
[16:55] I don't think I'll dig into those too much though. Will just check if there are major systematic problems. The site is horrible and has various bugs.
[16:56] Turiver is a bit over halfway done.
[17:16] hello
[17:17] at 38% getting Reckful VOD chats, I got a status code != 200, my stupid code didn't log the actual number though, but it wouldn't surprise me if it was API throttling
[17:17] * nicolas17 resumes
[17:18] nicolas17: What are you using to download it?
[17:25] my own script, which I wrote to archive friends' streams; here's the VOD chat part http://paste.debian.net/1155032/
[17:26] each "page" has like a hundred messages, so it takes a while to get all 90k messages from a 13h stream from this guy
[17:28] Ah, API token, right.
[17:29] I was wondering "why no WARC".
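The two problems mentioned at 17:17 and 17:26 (a non-200 status that was never logged, and many small pages per VOD) can be illustrated with a paging loop that logs the actual status code and backs off before retrying. This is a hedged sketch, not the pasted script: `iter_chat_messages`, the `fetch_page` callback, and the `"messages"` / `"next"` keys are all hypothetical, and the real API's response shape may differ.

```python
import logging
import time
from typing import Callable, Iterator, Optional, Tuple

log = logging.getLogger("vodchat")

# Hypothetical page fetcher: takes a cursor (None for the first page) and
# returns (http_status, parsed_json_or_None). The real HTTP request and the
# API token handling would live behind this callback.
FetchPage = Callable[[Optional[str]], Tuple[int, Optional[dict]]]


def iter_chat_messages(
    fetch_page: FetchPage,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield chat messages page by page, retrying failed pages.

    Every non-200 response is logged with its status code, then retried
    after an exponentially growing delay (base_delay, 2x, 4x, ...).
    """
    cursor: Optional[str] = None
    while True:
        for attempt in range(max_retries + 1):
            status, page = fetch_page(cursor)
            if status == 200 and page is not None:
                break
            log.warning("cursor=%r attempt=%d got HTTP %d", cursor, attempt, status)
            if attempt == max_retries:
                raise RuntimeError(f"giving up at cursor {cursor!r}: HTTP {status}")
            sleep(base_delay * 2 ** attempt)
        # "messages" and "next" are assumed key names for this sketch.
        yield from page.get("messages", [])
        cursor = page.get("next")
        if not cursor:
            return
```

Because the function is a generator, the caller can write each message out as it arrives instead of collecting a whole VOD's chat in memory first.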
[17:32] I now think I should be writing one message (one JSON object) per line instead of putting them into a single giant array which you then have to parse into memory, I had never dealt with chats this big before :P
[17:38] *** HP_Archiv has joined #archiveteam-bs
[17:42] *** HP_Archiv has quit IRC (Client Quit)
[17:51] yikes, there's a 10h stream with 179480 chat messages
[17:55] 140MB json
[18:10] 315712 messages, 265MB json, and my tiny server almost runs out of memory writing it... yeah I have to change my approach here
[18:20] Yep, write JSONL instead.
[18:24] I'm starting over, because it's easier than *also* writing another script to convert the chats I already got into JSONL
[18:54] is it just me or is the recentchanges page syntax broken? https://www.archiveteam.org/index.php?title=Special:RecentChanges
[18:55] Hide registered users etc?
[18:57] yeah, newlines everywhere
[18:57] same here
[18:58] Yep, started yesterday during the moderation deployment. jrwr ^
[18:58] also here https://www.archiveteam.org/index.php?title=Coronavirus&curid=8147&action=history
[19:11] *** hey has joined #archiveteam-bs
[19:11] *** hey has quit IRC (Client Quit)
[19:13] *** Nikchemny has joined #archiveteam-bs
[19:47] JAA: Who are the mods of AT wiki?
[19:49] Nikchemny: https://www.archiveteam.org/index.php?title=Special:ListUsers&group=moderator
[19:50] Ah, I didn't check the statistics. Thanks
[19:57] so now you can edit without an account?
[20:01] *** BlueMax has joined #archiveteam-bs
[20:10] *** jshoard_ has joined #archiveteam-bs
[20:12] VoynichCr: What? https://www.archiveteam.org/index.php?title=User:Nikchemny
[20:16] VoynichCr: Still need to create an account, but there's no secret word anymore.
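The JSONL approach suggested at 18:20 (one JSON object per line instead of one giant array) might be sketched as below. The helper names `append_jsonl` and `read_jsonl` are made up for this illustration and are not taken from the script discussed in the log.

```python
import json
from typing import IO, Iterable, Iterator


def append_jsonl(out: IO[str], messages: Iterable[dict]) -> int:
    """Write each message as one compact JSON object per line.

    json.dumps escapes newlines inside strings, so every object occupies
    exactly one line. Memory use stays flat because nothing is accumulated.
    Returns the number of messages written.
    """
    count = 0
    for msg in messages:
        out.write(json.dumps(msg, ensure_ascii=False, separators=(",", ":")))
        out.write("\n")
        count += 1
    return count


def read_jsonl(src: IO[str]) -> Iterator[dict]:
    """Stream messages back one at a time, skipping blank lines."""
    for line in src:
        if line.strip():
            yield json.loads(line)
```

With this layout a 265MB chat never has to be held in memory as a single array, and an interrupted download leaves a file that is still valid up to its last complete line.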
[20:18] ah ok
[20:19] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[20:19] *** jshoard_ has quit IRC (Quit: Leaving)
[20:20] *** BlueMax has joined #archiveteam-bs
[20:20] Nikchemny: cool userpage
[20:21] Yeah, I just show the most important information about me
[20:21] I like websites too. Can we be friends?
[20:22] IDK. Do you have twitter, telegram or Reddit?
[20:22] I don't, but I like websites.
[20:23] I think we can't meet IRL (
[20:25] Obviously, coronavirus is around.
[20:26] Yeah, but it seems like only me and a few other people wear masks outside in my city. Other people wear them only in shops.
[20:33] JAA: Why did AT cancel the secret word?
[20:34] Nikchemny: Because it was leaked and is generally a poor mechanism to fight spam or other unwanted edits.
[20:34] I mean, it kind of works until it becomes public and doesn't.
[20:35] Ah, that's why my last edit waited for a mod?
[20:35] Yeah
[20:37] Don't wikis have a number of edits after which a user becomes confirmed? Does the AT wiki have that?
[20:38] No, it's manual here, at least at the moment.
[20:38] This was all just deployed last night, so just about everything might change still in the next few days.
[20:40] Am I the last person who used the secret word to register?
[20:42] No, there were a few people after you.
[20:43] Hm, ok
[20:45] VoynichCr: Does this https://www.archiveteam.org/index.php?title=List_of_book_databases page contain only official databases, or some online libraries too?
[20:46] Does Library Genesis belong on this list?
[21:12] Nikchemny: you can add any book database, but i think archiveteam doesn't archive sites like that because copyright
[21:12] You mean http://elibrary.rsl.ru/ ?
[21:14] if it's official it's ok
[21:16] Google Books is not official for me but ok
[21:17] sorry, i think i didn't explain myself correctly, the elibrary site is ok, Lib Genesis can be added but i doubt we will archive it
[21:18] google books is official...
they show books after agreements with publishers, i guess
[21:19] what i mean, AT doesn't archive pirate sites (i don't like that word, but i think it is more obvious like that)
[21:22] Em, what? https://archive.fart.website/archivebot/viewer/domain/gen.lib.rus.ec
[21:25] somebody archived the database files, not the books
[21:26] Ah, ok
[21:26] But that's sad
[21:26] you know, it's a grey area, you archive the list of books, but skip the books
[21:27] Wow
[21:27] yeah... it's like archiving the index of the library of alexandria and then sitting on a rock and watching it burn some day
[21:30] copyright is the art of artificial scarcity
[21:31] Well, some old news went to Valhalla and we know only their names, some frames, posters, but not the whole movies
[21:31] *not news, films
[21:33] the second law of thermodynamics is our enemy
[21:35] Btw, IA excludes sites due to DMCA (some requests contain this), but why do they not show dead sites like shii.org?
[21:38] An example of why not only the accused people's social media stuff should be archived, but also the victims', in regards to the ongoing Super Smash Bros. drama; as for the latter, they can be unfairly suspended: https://old.reddit.com/r/smashbros/comments/hknomp/lima_put_out_a_post_regarding_zero_and_was/
[21:38] VoynichCr: RuTracker. AT saved some parts of Rutracker. Rutracker is a pirate site. So?
[21:40] Nikchemny: Just like LibGen, I believe we archived the index of torrent files, not the contents.
[21:41] I don't know. I want to laugh and I want to cry
[21:41] Assuming we archived anything, that is. I don't remember anything happening in that regard.
[21:42] I'm sure a lot of people here are not opposed to saving "pirate content" (i.e. unauthorised copies of copyrighted content). But all our data normally goes to the Internet Archive, and they need to be more careful about what they store and distribute for obvious reasons.
[21:44] Well, some news sites have copyrighted pics
[21:45] Virtually all content from the past century or so is copyrighted.
[21:46] and the texts themselves are copyrighted, but archiving news sites is more permissive, and book publishers are more aggressive on copyright grounds
[21:47] even every tweet is copyrighted i guess
[21:47] Yeah, as we all know, every unauthorised ebook download equates to about 1 million dollars in damages for the publishers.
[21:47] This message is copyrighted.
[21:48] JAA is a copyright troll, you can see it.
[21:49] Is there copyright for every word?
[21:49] As I said, pretty much everything created in the past century is under copyright currently. One notable exception is works by the US government and its agencies, which are (almost?) all in the public domain.
[21:49] Not every word, but the dictionary (as a cultural work) is copyrighted.
[21:53] It would be great if I could create a copyright for the word "nikchemny" and all Russian people who want to use the words "nikchemny", "nikchemnogo", "nikchemnomu" etc. had to pay me some rubles
[21:57] JAA: Is changing the secret word for AT wiki easy? Maybe just change it every N months/weeks/years?
[21:57] It's not great. In Russia you can't take a pic of a building and upload it to Wikipedia, because there is no freedom of panorama. Because a building has an architect, an author.
[21:58] In other countries, there is freedom of panorama, copyright is more permissive. You can photograph buildings of architects still alive.
[21:59] *** lunik13 has quit IRC (Quit: :x)
[22:00] Sorry, I am wrong. In Russia photographs of buildings are ok, but not works of art (sculptures for example)
[22:01] Eh
[22:01] It's a mess, check the map https://en.wikipedia.org/wiki/Freedom_of_panorama#France
[22:01] *** lunik13 has joined #archiveteam-bs
[22:02] No, I saw a Russian painting somewhere on WKP
[22:02] Nikchemny: No, it wasn't easy to change the secret word.
That's why it only changed two or three times over the years.
[22:03] https://ru.wikipedia.org/wiki/%D0%A1%D0%B2%D0%B0%D1%82%D0%BE%D0%B2%D1%81%D1%82%D0%B2%D0%BE_%D0%BC%D0%B0%D0%B9%D0%BE%D1%80%D0%B0
[22:03] 1848, copyright expired
[22:03] Nikchemny: It's not possible to copyright a word, but a trademark would be possible.
[22:04] Nikchemny: "other people do it" is not a valid defense
[22:05] Not for copyright, anyway. It does work for trademarks.
[22:05] JAA: So, with frequent changes it would be hard to leak the word, because it would be old by then.
[22:06] Nikchemny: Yes, obviously. But it was not easy to change it, so it did not happen often.
[22:07] Ah, I thought you wrote that it's easy, sorry
[22:08] nicolas17 : Did you mean Wikipedia's picture or what?
[22:08] Btw, a new Wikipedia is coming, it was announced today. A Wikipedia using Wikidata and "functions" to generate natural language.
[22:09] half of what you have been saying, in particular "some news sites have copyrighted pics"
[22:09] what news sites do is irrelevant to what IA can do
[22:10] VoynichCr: Abstract Wikipedia?
[22:10] yes
[22:10] Sounds interesting.
[22:12] anyone know if there is a way to get every IA item where you're authorized to write?
[22:13] https://en.wikipedia.org/wiki/Abstract_Wikipedia
[22:15] I don't think Abstract Wikipedia is going to help with listing IA permissions, sadly
[22:15] :P
[22:17] https://meduza.io/feature/2019/11/22/v-rossii-sozdadut-analog-vikipedii-proektom-za-dva-milliarda-rubley-zaymetsya-izdatel-pravoslavnoy-entsiklopedii Btw, my country is going to create our own Wikipedia but without those stupid liberal articles. It may be based on bigenc.ru
[22:18] nicolas17: "other people do it" may not be "a valid defense" in a system of morality or law, but it is in an attempt at pragmatism
[22:18] Nikchemny: cool!
[22:18] Btw, it would be great to save bigenc, because it's old and may not be a real thing in the future
[22:19] bigenc is the Old Soviet Encyclopedia?
[22:20] Em, nope. It's the Big Russian Encyclopedia. It has some new facts, but hasn't been updated since 2017.
[22:20] Btw, according to bigenc, Spider-Man is a franchise from 2002 to 2012
[22:21] "Some material from the explicitly Marxist-Leninist Great Soviet Encyclopedia has been included." https://en.wikipedia.org/wiki/Great_Russian_Encyclopedia
[22:21] According to USA Wikipedia.
[22:22] VoynichCr: I think that another friend of Putin's can buy a yacht instead of creating an ALL-new encyclopedia
[22:23] Yes, bigenc is based on the Soviet encyclopedia, but it has some new facts
[22:23] I like the idea of a Russian Wikipedia from a Russian point of view. Wikipedia is biased towards USA point of view.
[22:25] Well, there is https://ruxpert.ru/ , but it's all about politics and love for our great leader
[22:26] Stalin?
[22:26] Nope, Putin
[22:26] :p
[22:28] It's like "All liberals are stupid and have fake facts, while our propag..., khm our news and putinists are right"
[22:29] It sounds like any Russian-related article in English Wikipedia, but the opposite.
[22:30] In English Wikipedia they write articles about Russia using Western newspapers. It is like writing articles about United States using Russian newspapers.
[22:30] But they don't see their bias.
[22:33] *** robogoat has quit IRC (Remote host closed the connection)
[22:33] http://lurkmore.so/images/3/30/Liberalworld.jpg This pic is about hate and love for the same things from other countries. Yes, Wikipedia is written by people who live right now so, they love/hate USA/Russia. That's sad
[22:36] For USA, the new enemy is China.
[22:36] Yeah, the second economy
[22:37] #archivebot
[22:38] the first*
[22:38] (lol)
[22:38] Ah
[22:38] Oops
[23:01] *** Nikchemny has quit IRC (Quit: Page closed)
[23:57] latest digitized magazines : https://www.patreon.com/posts/digitize-for-07-38942085