[00:17] *** BlueMax has joined #archiveteam-bs
[00:31] *** Docti has joined #archiveteam-bs
[00:33] Hello, here I am for further discussion
[00:33] Docti: Is there a list of the forums that will be deleted? They only listed the ones that will live on, if I understand that correctly.
[00:35] You are right, they only listed what will not be deleted
[00:36] Here is the list of all the categories. If they are not in the list of what will be kept, they will be deleted https://forum.doctissimo.fr/doctissimo/liste_categorie.htm
[00:37] But, since they might have changed the name of what will be kept (because they might rename it or put it in new categories), perhaps it would be easier to save everything?
[00:37] Wow, 320 million messages in total (across all forums, not just the sexuality ones).
[00:38] I can try, but chances are their servers won't be able to handle that much in 4 days.
[00:40] Thank you
[00:40] So far I believe only the Sexuality forums are in danger, because they are not "socially acceptable"
[00:40] I believe the other categories about family, daily life, medicine, etc, are safe
[00:40] They're really going out of their way to make it hard to access the content. Some threads are not linked directly but their URLs are hidden in "cryptlinks" (which are just base64-encoded).
[00:41] Yeah, I will target only the sexuality forums.
[00:42] Perfect :) I have to admit I have very little knowledge in that domain of computer science. I am sure you will do your best
[00:43] There are still 54 million posts just in these forums. This will be a challenge.
[00:43] (For their servers)
[00:48] This is much more than what I thought! I hope the servers will be able to handle it
[01:05] Docti: I can't find "Troubles de l’érection" and "Ejaculation prématurée" on https://forum.doctissimo.fr/doctissimo/liste_categorie.htm. They're listed as remaining in the announcement.
[01:12] Indeed, I cannot find them either. Perhaps they have already been moved to their new categories?
[01:13] By the way, forget the part "Exhibition et voyeurs" or put it with the lowest priority - lots of posts and topics of very low quality
[01:14] Well, it seems that the ones that were already moved are still linked. E.g. "Contraception" is linked to the Santé part of the forums, not in Sexualité anymore.
[01:15] Only some of them, apparently. http://forum.doctissimo.fr/doctissimo/erection/liste_sujet-1.htm exists and redirects to https://forum.doctissimo.fr/sante/Troubles-de-l-erection/liste_sujet-1.htm
[01:16] Yeah, the other one was moved as well: https://forum.doctissimo.fr/sante/Ejaculation-prematuree-ou-precoce/liste_sujet-1.htm
[01:20] Indeed, nice find, they have already been moved, but not the others. Just so you know: IST = Infections sexuellement transmissibles
[01:23] Yes, some have been moved and are linking to the new location, some have been moved and are not linked anymore, and some have not been moved yet.
[01:24] Yep, figured that one out, thanks! :-)
[01:26] *** mtntmnky has quit IRC (Remote host closed the connection)
[01:27] *** mtntmnky has joined #archiveteam-bs
[01:27] There's also one which is in a different section but not listed in the announcement ("Techniques de séduction").
[01:28] It appears that it wasn't moved recently but was already in the psychology section before.
[01:29] Some of the ones that will stay were also already elsewhere.
[01:31] Great, they will not be deleted, so this is something you will not have to save
[01:33] Oh fantastic, they didn't even set up redirects for the former locations of the moved categories.
[01:45] That's nice :) If you don't mind, I will have to leave, it is a bit too late for me. I believe you can contact me on my page on the archiveteam site in the meantime. Thanks a lot for your work!
[01:46] Yeah, I should leave as well, but I'll see if I can get it running before.
[01:46] Good night!
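The "cryptlinks" mentioned at [00:40] are described only as base64-encoded URLs. A minimal sketch of decoding such a token follows, assuming the payload is plain base64 of a UTF-8 URL with nothing else wrapped around it. The function name `decode_cryptlink` and the example token are hypothetical; the token is built here from a URL quoted in the log, not taken from the site.

```python
import base64
import binascii
from typing import Optional


def decode_cryptlink(token: str) -> Optional[str]:
    """Decode a base64 token to text; return None if it is not valid base64 or UTF-8.

    Assumption: the hidden link is plain base64 of a UTF-8 URL, as the chat suggests.
    """
    # Restore any '=' padding that may have been stripped from the token.
    padded = token + "=" * (-len(token) % 4)
    try:
        return base64.b64decode(padded, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None


# Hypothetical token for illustration, produced by encoding a URL from the log.
example = base64.b64encode(
    b"https://forum.doctissimo.fr/doctissimo/liste_categorie.htm"
).decode("ascii")
print(decode_cryptlink(example))
```

Returning `None` instead of raising lets a crawler skip attributes that merely look like tokens without aborting the whole page.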
[01:47] Good night
[01:53] *** Docti has quit IRC (Ping timeout: 252 seconds)
[01:59] *** Nikchemny has joined #archiveteam-bs
[02:00] JAA: Hello, is there any news about archive.st?
[02:00] Btw, does AT use the Alexa toolbar?
[02:01] Nikchemny: No, I've been busy archiving things that disappear in the next few days.
[02:02] Which, for example?
[02:04] Bitcoin Forum, Kongregate forums, now Doctissimo forums.
[02:04] Hm, that's interesting. Are they on AB?
[02:05] No
[02:05] That's why I've been busy and still am.
[02:21] bitcoin forum really?
[02:29] Forum goes down, forum gets saved.
[02:32] I think everyone must download https://www.alexa.com/toolbar to thank Brewster Kahle.
[02:33] No thanks, especially since it hasn't been owned by Brewster in ... 20 years?
[02:34] Yes, I know, but he created it...
[02:34] So? I'd rather thank him for creating IA by donating there.
[02:35] Than install adware/spyware/whatever the hell that toolbar probably is.
[02:36] Do you think that Amazon is watching the Alexa toolbar users?
[02:36] Btw, I realized that fart.website is 17,000 in Russia https://www.alexa.com/siteinfo/fart.website
[02:36] You bet they're using it to determine your interests and serve the corresponding Amazon ads.
[02:37] In any case, this is very off-topic. -> #archiveteam-ot
[02:37] Ok
[02:37] I think my Doctissimo script is ready, just running some tests.
[02:41] Yep, seems fine. Let's set some servers on fire.
[02:46] There are 13857 pages listing 50 topics each across all forums I intend to grab now (everything in the Sexualité section). So that's 679k topics. Average post count per topic is about 80, so that's "only" around 1.4 million requests in total. Not as bad as expected.
[02:47] Er, typo, 693k topics
[02:47] JAA: Btw, what's with Telegram? Does it save well, or is some content missing (hmm, videos and audio can't be saved)?
[02:49] Nikchemny: Never looked into it in detail. Can you check some of the accounts we've archived with ArchiveBot in the WBM?
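The estimate at [02:46] and its correction at [02:47] can be re-derived from the figures quoted there. The sketch below only redoes that arithmetic. The "posts per topic-page request" value is an inference from the quoted totals, not something the log states, and the variable names are made up for illustration.

```python
# Figures quoted in the log.
LISTING_PAGES = 13857        # pages listing topics
TOPICS_PER_PAGE = 50
AVG_POSTS_PER_TOPIC = 80     # "about 80"
TOTAL_REQUESTS = 1_400_000   # "around 1.4 million"

# 13857 * 50 gives the corrected "693k topics" figure.
topics = LISTING_PAGES * TOPICS_PER_PAGE
estimated_posts = topics * AVG_POSTS_PER_TOPIC

# Assumption: every request that is not a listing page fetches one page of a topic.
topic_page_requests = TOTAL_REQUESTS - LISTING_PAGES
implied_posts_per_request = estimated_posts / topic_page_requests

print(f"topics: {topics:,}")
print(f"estimated posts: {estimated_posts:,}")
print(f"implied posts per topic-page request: {implied_posts_per_request:.1f}")
```

The estimated post count lands in the same range as the "54 million posts" mentioned at [00:43], which suggests the rounded inputs are mutually consistent.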
[02:52] Eghrm. You saved it as t.me/something? Em, that can't look good. If the start link were t.me/s/something, then there would be the last 20 posts and I could tell which post number is the last. Hm, JAA, I'll look for COVID channels, I mean for their pics
[02:55] JAA: Does it save posts like t.me/s/something/number or t.me/something/number?
[02:56] Nikchemny: /s/channel/postid are the ones of interest.
[02:56] The other thing is only an "open in Telegram" page that doesn't show anything.
[02:56] JAA: Looks like that and looks like crap http://web.archive.org/web/20200630152730/https://t.me/MINSAPCuba/73
[02:56] That's not /s/.
[02:56] In AB collection, btw
[02:56] https://web.archive.org/web/20200630154215/https://t.me/s/MINSAPCuba/73
[02:57] Yes, sorry
[02:58] JAA: Btw, it shows 10 previous and 10 next posts. So AB saves the same post 20 times
[02:58] Yep
[02:59] No way to avoid that.
[02:59] Looks like we're not saving these attachments(?): https://web.archive.org/web/20200630160216/https://t.me/s/MINSAPCuba/44
[03:00] Yes, of course. Only Telegram users can do that
[03:00] Doctissimo archival started. Retrieving 1 GB per minute, so I might run out of disk space quickly.
[03:01] But it handles ~40 req/s just fine so far.
[03:02] Looks like Telegram is not too friendly to unregistered people. Only pics and text. No files, no audio, no videos
[03:03] JAA: attachments can be saved only in the app (or in the web version, but I've never used it)
[03:04] Nikchemny: Maybe there is a way to access them, but it's just hidden very well.
[03:05] ¯\_(ツ)_/¯
[03:06] So disk space shouldn't be too much of an issue. Retrieving 1 GB per minute, but it compresses down to 120 MiB or so. :-)
[03:09] Just to be clear: I'm only fetching the category and topic pages, nothing else.
[03:10] JAA: So, what does AT think about Telegram chats? Maybe just make a bot that could join important chats and then just scroll them as long as possible? I know it can't be in WBM and (as I saw once) there is a # in the chat's link, so WBM can't show it
[03:10] *join in web version
[03:12] There is https://t.me/lurkisdead , chat of lurkmore.to
[03:13] could use the telegram API as an actual registered user, but I'm not sure if Telegram servers would get angry and rate-limit (or ban), so I'd be reluctant to try with my regular account / phone number
[03:14] I bet there's something in the terms about it, and they'd ban quickly when discovered or reported.
[03:14] I don't know if bot accounts can read past chatgroup history
[03:15] Em, why? Do you think they can tell a person scrolling a chat from a bot scrolling and saving it?
[03:16] Nope, I don't mean Telegram bots. Just use a clean number, register with it, and save as much of the chat as possible?
[03:19] Doctissimo performance improved over the past 10 minutes. Maybe they're using autoscaling. I'm now at 64 req/s, and response times reduced by a third.
[03:19] JAA PovAddict: Is this a crazy idea?
[03:20] No, but it will probably break quickly.
[03:22] Well, for the first try, I think, https://t.me/lurkisdead is good.
[03:25] JAA: Btw, anyone can download all their chats if they use the desktop app
[03:26] Maybe with attachments
[03:47] *** Raccoon has joined #archiveteam-bs
[03:58] *** qw3rty_ has quit IRC (Ping timeout: 622 seconds)
[04:13] .
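The distinction made at [02:56] between `t.me/channel/postid` (only an "open in Telegram" page) and `t.me/s/channel/postid` (the web preview worth archiving) can be sketched as a URL rewrite. This is an illustration of the pattern discussed in the chat only: the helper name `to_preview_url` is hypothetical, and other t.me path types are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def to_preview_url(url: str) -> str:
    """Rewrite https://t.me/<channel>[/<postid>] to its /s/ web-preview form.

    Assumption: only the two path shapes mentioned in the chat are relevant.
    """
    parts = urlsplit(url)
    if parts.netloc != "t.me":
        raise ValueError(f"not a t.me URL: {url}")
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError(f"no channel in URL: {url}")
    if segments[0] == "s":
        # Already a preview URL; leave it untouched.
        return url
    new_path = "/s/" + "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, new_path, parts.query, parts.fragment))


print(to_preview_url("https://t.me/MINSAPCuba/73"))
```

As noted at [02:58], each preview page also shows neighbouring posts, so fetching one URL per post ID stores the same post many times.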
[04:21] *** Nikchemny has quit IRC (https://mibbit.com Online IRC Client)
[05:10] *** wyatt8740 has quit IRC (Ping timeout: 260 seconds)
[05:12] *** wyatt8740 has joined #archiveteam-bs
[06:15] *** britmob has quit IRC (Read error: Connection reset by peer)
[06:30] *** qw3rty has joined #archiveteam-bs
[06:41] *** britmob has joined #archiveteam-bs
[07:32] *** jshoard has joined #archiveteam-bs
[07:33] *** PovAddict has quit IRC (Read error: Operation timed out)
[09:04] *** HP_Archiv has joined #archiveteam-bs
[10:11] *** lennier1 has quit IRC (Read error: Connection reset by peer)
[10:13] *** lennier1 has joined #archiveteam-bs
[10:54] Doctissimo is nearly done, just a few huge topics remaining.
[10:54] The Sexualité section of Doctissimo*
[10:56] https://github.blog/2020-07-16-github-archive-program-the-journey-of-the-worlds-open-source-code-to-the-arctic/
[10:57] "First, their well-known Wayback Machine is accessing and archiving raw GitHub data as WARCs, or Web ARChive files. As of this writing they have archived some 55TB of data. Second, they have the goal of making entire archived GitHub repositories available via “git clone,” while also keeping repo comments, issues, and other metadata easily accessible on the web. This second initiative is well underway and initial archiving is expected
[11:16] *record scratch* Yep, that's us, and you probably wonder how we got into this situation.
[11:16] #gitgud on hackint
[15:04] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[15:10] *** fredgido_ has quit IRC (Read error: Connection reset by peer)
[15:19] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[15:31] *** systwi_ has joined #archiveteam-bs
[15:36] *** systwi has quit IRC (Read error: Operation timed out)
[16:00] *** kiska has quit IRC (Remote host closed the connection)
[16:01] *** kiska has joined #archiveteam-bs
[16:05] *** britmob has quit IRC (Read error: Operation timed out)
[16:43] *** schbirid has joined #archiveteam-bs
[16:45] *** britmob has joined #archiveteam-bs
[17:04] *** VerifiedJ has joined #archiveteam-bs
[17:17] *** Nikchemny has joined #archiveteam-bs
[17:18] JAA: There is https://docplayer.ru/ and here is an example document: https://docplayer.ru/34254329-A-a-pleshakov-s-a-pleshakov-enciklopediya-puteshestviy-strany-mira-kniga-dlya-uchashchihsya-nachalnyh-klassov.html
[17:18] Can AB save PDFs or not?
[17:21] Ah, here is the link: https://docplayer.ru/storage/54/34254329/1595269256/uooJxGU_y6Zjzfubsxfreg/34254329.pdf . It can be reached only after a Google captcha
[17:32] JAA: Btw, I made https://archive.st/archive/2020/7/www.mk.ru/wae8/ , copy of https://www.mk.ru/politics/2020/07/20/kadyrov-otvetil-ssha-na-sankcii-fotografiey-s-avtomatami.html . Image ( https://archive.st/archive/2020/7/www.mk.ru/wae8/July202020508pm-58i1956pyt8zflwu0haadkhn4r7a9ecm.jpg ) looks similar, but the archived copy does not
[17:32] https://archive.st/archive/2020/7/www.mk.ru/wae8/www.mk.ru/politics/2020/07/20/kadyrov-otvetil-ssha-na-sankcii-fotografiey-s-avtomatami.html
[17:32] *** PovAddict has joined #archiveteam-bs
[17:33] *** DogsRNice has joined #archiveteam-bs
[17:33] PovAddict: Which Telegram chat are you going to save?
[17:48] I have no time for any of this
[18:05] *** prq has joined #archiveteam-bs
[18:23] *** Nikchemny has quit IRC (Quit: https://mibbit.com Online IRC Client)
[18:30] *** icedice has joined #archiveteam-bs
[18:39] -- archive team slogan
[19:54] *** lennier2 has joined #archiveteam-bs
[19:55] *** lennier1 has quit IRC (Ping timeout: 260 seconds)
[19:55] *** lennier2 is now known as lennier1
[20:00] *** Nikchemny has joined #archiveteam-bs
[20:02] katocala: Please add links from articles from this category https://ru.wikipedia.org/wiki/Категория:Организации_в_России,_самостоятельно_присуждающие_учёные_степени to https://www.archiveteam.org/index.php?title=ArchiveBot/Educational_institutions/list#Russia
[20:03] And this https://ru.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D1%8F:%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D0%B9%D1%81%D0%BA%D0%B8%D0%B5_%D0%B8%D0%BD%D1%81%D1%82%D0%B8%D1%82%D1%83%D1%82%D1%8B_%D0%B8%D1%81%D0%BA%D1%83%D1%81%D1%81%D1%82%D0%B2%D0%B0_%D0%B8_%D0%BA%D1%83%D0%BB%D1%8C%D1%82%D1%83%D1%80%D1%8B
[20:11] *** schbirid has quit IRC (Remote host closed the connection)
[20:15] *** Docti has joined #archiveteam-bs
[20:20] *** Docti has quit IRC (Ping timeout: 252 seconds)
[20:51] *** HP_Archiv has quit IRC (Quit: Leaving)
[20:52] I saw this petition being circulated, with someone who has inside connections suggesting there have been layoffs and MNopedia.org is at risk. It's a fabulous, mostly Creative-Commons resource used on Wikipedia (some articles are heavily based on MNopedia articles).
[20:52] I know at least some of the content is in the Wayback Machine... but can we get a crawl?
[20:53] I'd be happy to run one myself (my bandwidth can support it), but it doesn't seem to be MediaWiki
[20:53] oh sorry, petition link: https://docs.google.com/forms/d/e/1FAIpQLSdeqbOJ9-tmdrwXXfvbgR5BZASPh5jqZVkuyj5S-ox0VADQCQ/viewform
[20:54] ...in the meantime, I'll run a wget with mirror options today or overnight.
[21:14] *** Dallas has joined #archiveteam-bs
[21:16] paul2520: Why not just ask to have https://www.mnopedia.org/ saved in the AB channel?
[21:17] Or is it too big?
[21:18] Nikchemny: ah, I always ask in the wrong #archive* channel
[21:18] #archivebot ?
[21:18] thanks.
[21:18] done
[21:18] it's probably relatively small
[21:20] Btw, it looks like people on AB are very busy. I think (if they won't save it) you can ask for it again.
[21:20] paul2520: Btw, is it on MediaWiki or not?
[21:21] I don't think so
[21:21] any ETA, like tomorrow? or more like give it a week and see if things are quieter/slower?
[21:21] (or do you think asking at a weird time might work?)
[21:23] I think that your question could get lost in all that text. Sorry, I'm not in the AB channel, so I can't see what's happening there.
[21:26] paul2520
[21:37] *** jshoard has quit IRC (Leaving)
[21:44] *** VerifiedJ has quit IRC (Quit: Leaving)
[21:50] *** Nikchemny has quit IRC (Quit: Page closed)
[22:16] *** Dallas has quit IRC (Quit: Dallas)
[22:17] *** Dallas has joined #archiveteam-bs
[22:17] *** HP_Archiv has joined #archiveteam-bs
[22:30] *** Jake has quit IRC (Read error: Connection reset by peer)
[22:30] *** Jake5 has joined #archiveteam-bs
[22:30] *** Jake5 is now known as Jake
[22:48] *** Dallas has quit IRC (Quit: Dallas)
[22:52] *** Dallas has joined #archiveteam-bs
[23:08] *** Dallas has quit IRC (Quit: Dallas)
[23:09] *** Dallas has joined #archiveteam-bs
[23:10] *** Arcorann has joined #archiveteam-bs
[23:53] *** BlueMax has joined #archiveteam-bs
[23:55] The Doctissimo Sexualité grab finished this afternoon. I'm considering grabbing all the other forums as well. It's fast and not too big, so why not?
[23:56] Not too big in terms of WARC size, that is. The 54 million posts from Sexualité were only 60-odd GiB.
[23:57] nice
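The size figures at [23:56] and the compression remark at [03:06] can be turned into rough per-post numbers. This is a back-of-the-envelope sketch only: "60-odd GiB" is taken as exactly 60 GiB (so the per-post figure is a lower bound), GB is read as 10^9 bytes and MiB as 2^20 bytes, and the variable names are made up for illustration.

```python
# Figures quoted in the log.
POSTS = 54_000_000
WARC_BYTES = 60 * 2**30            # "60-odd GiB", taken as a lower bound

# Average compressed WARC bytes stored per post.
bytes_per_post = WARC_BYTES / POSTS

# "Retrieving 1 GB per minute, but it compresses down to 120 MiB or so."
RAW_BYTES_PER_MIN = 10**9          # assumption: decimal gigabyte
COMPRESSED_BYTES_PER_MIN = 120 * 2**20
compression_ratio = RAW_BYTES_PER_MIN / COMPRESSED_BYTES_PER_MIN

print(f"compressed bytes per post: {bytes_per_post:.0f}")
print(f"approximate compression ratio: {compression_ratio:.1f}x")
```

Both values depend on rounded figures from the chat, so they indicate orders of magnitude rather than measurements.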