[00:00] At the moment, I think the best approach for peertube.video would be to use their API to list all accounts, get all their videos, and use my yt-dl pull request (yt-dl currently has an incomplete PeerTube extractor) with TubeUp, and save all the webpages into the IA Wayback Machine to keep the public metadata. [00:00] Critique/comments wanted [00:01] *** Mateon1 has quit IRC (Ping timeout: 255 seconds) [00:01] *** Mateon1 has joined #archiveteam-bs [00:04] *** Ctrl has joined #archiveteam-bs [00:06] Reddit as a whole is archivable. Individual subreddits or users require you to first build an index of all of Reddit. The same thing applies to a user's saved, upvoted, etc. It's due to how Reddit stores the data internally. [00:08] Twitter's better in a sense because at least the search works. Finding old retweets is impossible though as far as I know. [00:08] Anyway, for Reddit: #shreddit [00:08] *** er1sian has quit IRC (Read error: Operation timed out) [00:11] er1sian: We won't archive anything from the Fediverse unless the operators of the affected instance ask us to. [00:24] *** er1sian has joined #archiveteam-bs [00:25] AFAIK the owner is AWOL, so I doubt they will ask. Out of curiosity, why does archiveteam wait for an owner's request when it comes to Fediverse sites? If they never ask, then all the user data gets lost? 
[00:26] I understand if it's to help them do a clean handover, but that doesn't look like it will happen [00:32] *** nicolas17 has quit IRC (Read error: Connection reset by peer) [00:35] *** robogoat has quit IRC (Read error: Operation timed out) [00:35] *** robogoat has joined #archiveteam-bs [00:36] *** nicolas17 has joined #archiveteam-bs [00:37] er1sian: some jackass decided to get Really Pissed Off that the custodian of their data was going to change hands and wrote a big blog post about how we're all evil SOBs [00:38] *** cppchrisc has quit IRC (Read error: Operation timed out) [00:38] clearly they missed those assemblies in school where the local police/child protection advocates/etc come in and say "if you don't want it stored forever on somebody's disk somewhere don't post it to begin with" by the way they wrote the whole thing [00:38] *** benjinsmi has joined #archiveteam-bs [00:38] eventually it hit Jason and then Jason told us to not bother unless a Fedi host/operator explicitly reaches out and says "ARCHIVE THIS THING BECAUSE I'M SHUTTING IT DOWN" [00:39] *** britmob_ has joined #archiveteam-bs [00:39] *** Datechnom has quit IRC (Ping timeout: 496 seconds) [00:40] *** Ryz has quit IRC (Read error: Operation timed out) [00:40] *** mistym has quit IRC (Read error: Operation timed out) [00:41] *** cppchrisc has joined #archiveteam-bs [00:41] *** cppchrisc has quit IRC (Connection closed) [00:41] *** cf has quit IRC (Read error: Operation timed out) [00:42] Ahh, that's a pain. I can understand why a Fediverse user would dislike data permanence, but it's on them. I assume they didn't even try to ask you all to delete their content. [00:42] I might just try to contact the most popular users and ask if they'd like my help saving/migrating their channels. 
[00:42] Thanks for explaining :) [00:42] I'll go look for the blog and get mad for a while [00:42] *** er1sian has left Leaving [00:42] *** nyany_ has quit IRC (Read error: Operation timed out) [00:42] *** Larsenv has quit IRC (Read error: Operation timed out) [00:43] *** cppchrisc has joined #archiveteam-bs [00:43] *** benjinss has quit IRC (Ping timeout: 496 seconds) [00:43] *** Lord_Nigh has quit IRC (Ping timeout: 496 seconds) [00:43] *** svchfoo3 has quit IRC (Ping timeout: 496 seconds) [00:43] *** mistym has joined #archiveteam-bs [00:43] *** Lord_Nigh has joined #archiveteam-bs [00:44] *** britmob has quit IRC (Ping timeout: 496 seconds) [00:46] *** ctrl_ has quit IRC (Read error: Operation timed out) [00:47] *** PurpleSym has quit IRC (Write error: Broken pipe) [00:50] *** PurpleSym has joined #archiveteam-bs [00:51] *** svchfoo1 sets mode: +o PurpleSym [00:51] *** Lord_Nigh has quit IRC (Read error: Operation timed out) [00:51] *** balrog has quit IRC (Read error: Operation timed out) [00:52] *** balrog has joined #archiveteam-bs [00:54] *** paul2520 has quit IRC (Read error: Operation timed out) [00:54] *** robogoat has quit IRC (Write error: Broken pipe) [00:54] *** dxrt_ has quit IRC (Read error: Operation timed out) [00:54] *** NIC007a83 has quit IRC (Remote host closed the connection) [00:54] *** Lord_Nigh has joined #archiveteam-bs [00:54] *** robogoat has joined #archiveteam-bs [00:54] *** equant has quit IRC (Read error: Operation timed out) [00:54] *** keith20 has quit IRC (Read error: Operation timed out) [00:54] *** wabu has quit IRC (Read error: Operation timed out) [00:54] *** Dj-Wawa has quit IRC (Read error: Operation timed out) [00:54] *** chaz has quit IRC (Read error: Operation timed out) [00:54] *** fredgido has joined #archiveteam-bs [00:54] *** Dj-Wawa has joined #archiveteam-bs [00:54] *** ctrl_ has joined #archiveteam-bs [00:54] *** dxrt_ has joined #archiveteam-bs [00:54] *** dxrt sets mode: +o dxrt_ [00:54] *** wabu has joined 
#archiveteam-bs [00:54] *** keith20 has joined #archiveteam-bs [00:54] *** asdf0101 has quit IRC (Read error: Connection reset by peer) [00:54] *** systwi_ has joined #archiveteam-bs [00:54] *** jake_test has quit IRC (Read error: Connection reset by peer) [00:54] *** asdf0101 has joined #archiveteam-bs [00:55] *** paul2520 has joined #archiveteam-bs [00:55] *** jake_test has joined #archiveteam-bs [00:55] *** gandalf has quit IRC (Ping timeout: 622 seconds) [00:55] *** klg_ has joined #archiveteam-bs [00:55] *** gandalf has joined #archiveteam-bs [00:55] *** NIC007a83 has joined #archiveteam-bs [00:55] *** odemgi has quit IRC (Read error: Operation timed out) [00:55] *** Larsenv has joined #archiveteam-bs [00:55] *** klg has quit IRC (Read error: Operation timed out) [00:56] *** Flashfire has quit IRC (Remote host closed the connection) [00:56] *** kiska has quit IRC (Read error: Connection reset by peer) [00:57] *** Flashfire has joined #archiveteam-bs [00:58] *** jake_test has quit IRC (Read error: Operation timed out) [00:58] *** gtwy has quit IRC (Read error: Operation timed out) [01:00] *** fredgido_ has quit IRC (Read error: Operation timed out) [01:00] *** systwi has quit IRC (Ping timeout: 622 seconds) [01:00] *** cf has joined #archiveteam-bs [01:01] *** paul2520 has quit IRC (Read error: Operation timed out) [01:03] *** logchfoo2 starts logging #archiveteam-bs at Wed Feb 12 01:03:22 2020 [01:03] *** logchfoo2 has joined #archiveteam-bs [01:03] *** Kenshin has joined #archiveteam-bs [01:04] *** Auctus has joined #archiveteam-bs [01:04] *** Ravenloft has quit IRC (Read error: Operation timed out) [01:04] *** Raccoon has quit IRC (Ping timeout: 622 seconds) [01:04] *** Raccoon` is now known as Raccoon [01:04] *** Ravenloft has joined #archiveteam-bs [01:05] *** fredgido has quit IRC (Remote host closed the connection) [01:05] *** systwi_ has quit IRC (Read error: Operation timed out) [01:05] *** Auctus_ has quit IRC (Read error: Operation timed out) [01:06] 
*** odemgi has joined #archiveteam-bs [01:06] *** chaz has joined #archiveteam-bs [01:06] *** odemgi_ has joined #archiveteam-bs [01:07] *** wp494 has quit IRC (Read error: Operation timed out) [01:07] *** fredgido has joined #archiveteam-bs [01:07] *** britmob_ has quit IRC (Read error: Connection reset by peer) [01:07] *** wp494 has joined #archiveteam-bs [01:07] *** Yurume has quit IRC (Read error: Connection reset by peer) [01:09] *** Yurume has joined #archiveteam-bs [01:09] *** equant has joined #archiveteam-bs [01:09] *** paul2520 has joined #archiveteam-bs [01:09] *** britmob_ has joined #archiveteam-bs [01:20] *** odemgi has quit IRC (Read error: Operation timed out) [01:37] *** Ryz has joined #archiveteam-bs [01:37] *** svchfoo3 has joined #archiveteam-bs [01:37] *** svchfoo1 sets mode: +o svchfoo3 [01:38] *** nyany_ has joined #archiveteam-bs [01:47] *** kiska has joined #archiveteam-bs [01:47] *** svchfoo3 sets mode: +o kiska [01:47] *** svchfoo1 sets mode: +o kiska [01:50] *** Datechnom has joined #archiveteam-bs [02:10] *** Ravenloft has quit IRC (Ping timeout: 360 seconds) [02:10] *** Ravenloft has joined #archiveteam-bs [02:11] *** bsmith093 has quit IRC (Ping timeout: 615 seconds) [02:13] *** HP_Archiv has joined #archiveteam-bs [02:38] *** asdf0101 has quit IRC (The Lounge - https://thelounge.chat) [02:40] *** asdf0101 has joined #archiveteam-bs [03:00] *** BlueMax has quit IRC (Read error: Connection reset by peer) [03:35] *** thuban2 has joined #archiveteam-bs [03:38] *** thuban1 has quit IRC (Ping timeout: 255 seconds) [03:38] *** bsmith093 has joined #archiveteam-bs [03:51] *** DogsRNice has quit IRC (Read error: Connection reset by peer) [03:56] *** bsmith093 has quit IRC (Quit: Leaving.) 
[04:02] *** Smiley has quit IRC (Ping timeout: 255 seconds) [04:14] *** BlueMax has joined #archiveteam-bs [04:18] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) [04:18] *** odemgi_ has quit IRC (Ping timeout: 246 seconds) [04:21] *** RichardG has joined #archiveteam-bs [04:28] *** bsmith093 has joined #archiveteam-bs [04:36] *** qw3rty_ has joined #archiveteam-bs [04:40] *** qw3rty has quit IRC (Ping timeout: 276 seconds) [04:45] *** Smiley has joined #archiveteam-bs [05:16] *** HP_Archiv has quit IRC (Ping timeout: 276 seconds) [05:16] *** HP_Archiv has joined #archiveteam-bs [05:49] *** d5f4a3622 has quit IRC (Read error: Connection reset by peer) [05:50] *** d5f4a3622 has joined #archiveteam-bs [06:02] *** Flashfire has quit IRC (Remote host closed the connection) [06:02] *** kiska has quit IRC (Remote host closed the connection) [06:02] *** kiska has joined #archiveteam-bs [06:02] *** svchfoo3 sets mode: +o kiska [06:03] *** svchfoo1 sets mode: +o kiska [06:03] *** Flashfire has joined #archiveteam-bs [06:08] *** ranma_ has joined #archiveteam-bs [06:20] *** ranma has quit IRC (Ping timeout: 745 seconds) [06:29] *** odemgi has joined #archiveteam-bs [06:44] *** nicolas17 has quit IRC (Ping timeout: 745 seconds) [06:58] *** thuban2 has quit IRC (Read error: Operation timed out) [06:59] *** thuban2 has joined #archiveteam-bs [07:10] *** HP_Archiv has quit IRC (Ping timeout: 276 seconds) [07:13] *** HP_Archiv has joined #archiveteam-bs [07:34] *** superkuh has quit IRC (Read error: Operation timed out) [07:35] *** superkuh has joined #archiveteam-bs [08:26] *** luckcolor has quit IRC (Read error: Operation timed out) [08:29] *** luckcolor has joined #archiveteam-bs [08:34] *** RichardG_ has joined #archiveteam-bs [08:34] *** RichardG has quit IRC (Read error: Connection reset by peer) [09:22] *** Mayonaise has quit IRC (Read error: Operation timed out) [09:44] *** BlueMax has quit IRC (Quit: Leaving) [09:54] *** Smiley has quit IRC 
(Ping timeout: 496 seconds) [10:09] *** VerifiedJ has joined #archiveteam-bs [10:22] *** Smiley has joined #archiveteam-bs [10:29] *** mtntmnky has quit IRC (Remote host closed the connection) [10:30] *** mtntmnky has joined #archiveteam-bs [10:30] *** SmileyG has joined #archiveteam-bs [10:41] *** Smiley has quit IRC (Ping timeout: 745 seconds) [11:05] *** HP_Archiv has quit IRC (Quit: Leaving) [11:25] *** bitbit has joined #archiveteam-bs [11:25] dxrt: hi :) [11:27] Hello [11:27] So we grabbed the full site and it is viewable in the wayback machine and the WARCs are also available if interested. [11:28] *** d5f4a3622 has quit IRC (Quit: https://i.imgur.com/xacQ09F.mp4) [11:30] *** mtntmnky has quit IRC (Remote host closed the connection) [11:30] cool! I couldn't find it via archive.org search for "botbot", and web.archive.org/cdx/search also returns too few results to be the full record. can you link me to the WARCs? [11:30] *** mtntmnky has joined #archiveteam-bs [11:32] https://archive.fart.website/archivebot/viewer/job/6afwa and https://archive.fart.website/archivebot/viewer/job/6egkw - the latter being more recent and without off-site links. [11:36] *** d5f4a3622 has joined #archiveteam-bs [11:36] *** britmob_ has quit IRC (Read error: Connection reset by peer) [11:36] *** britmob has joined #archiveteam-bs [11:39] dxrt: that's amazing thank you! I haven't tried to open WARC files yet even though I read about them generally. I should get all the files in this folder, yes? and then combine them + use some sort of a WARC CLI tool? [11:43] Yeah get them all. I usually just extract them with a generic unarchiving tool but something like https://github.com/chfoo/warcat will probably work better. There's a whole list of relevant software here https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem. I gotta run off, but someone else should be able to assist. 
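For readers who, like bitbit, have not opened a WARC before: the sketch below walks the records of an uncompressed WARC stream using only the Python standard library. It is an illustration of the record layout (a `WARC/1.0` version line, named header fields, a blank line, then `Content-Length` bytes of payload), not how warcat or any other tool from the channel is implemented. The helper names `iter_warc_records` and `make_record` are made up for this example, and the sample records are simplified (real records also carry fields such as `WARC-Record-ID` and `WARC-Date`).

```python
import io

def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if line.strip() == b"":
            continue                    # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                   # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

def make_record(warc_type, uri, body):
    """Build a simplified record for the demo (hypothetical helper)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"

# Two made-up records; example.org is a placeholder, not a crawled site.
sample = (
    make_record("response", "http://example.org/a", b"hello")
    + make_record("response", "http://example.org/b", b"world!")
)
records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

For a `.warc.gz` file, the same function should work on the stream returned by `gzip.open(path, "rb")`, since that object also provides `readline()` and `read()`.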
[11:44] dxrt: thanks again [12:05] *** NIC007a83 has quit IRC (Ping timeout: 745 seconds) [12:06] *** NIC007a83 has joined #archiveteam-bs [12:26] *** Dragnog2 has joined #archiveteam-bs [13:13] *** eythian has quit IRC (Ping timeout: 246 seconds) [13:31] *** eythian has joined #archiveteam-bs [13:33] *** trumad has joined #archiveteam-bs [13:33] AlsoJAA: hey ho [13:58] *** n00b151_ has joined #archiveteam-bs [14:13] *** n00b151_ has quit IRC (Ping timeout: 260 seconds) [14:26] *** thuban3 has joined #archiveteam-bs [14:29] *** thuban2 has quit IRC (Ping timeout: 255 seconds) [14:35] *** equant has quit IRC (Read error: Connection reset by peer) [14:38] *** equant has joined #archiveteam-bs [14:39] which command should I invoke with warcat to output the files inside the WARCs here? https://archive.fart.website/archivebot/viewer/job/6egkw [14:39] and on which file out of the parts should I invoke the command? maybe on botbot.me-inf-20181016-112202-6egkw-meta.warc.gz ? [14:44] *** Yurume_ has joined #archiveteam-bs [14:47] *** jake_test has quit IRC (Read error: Operation timed out) [14:47] *** NIC007a83 has quit IRC (Remote host closed the connection) [14:48] *** kiska has quit IRC (Read error: Operation timed out) [14:48] *** gtwy has quit IRC (Read error: Operation timed out) [14:49] *** gtwy has joined #archiveteam-bs [14:49] *** antomatic has joined #archiveteam-bs [14:49] *** systwi_ has joined #archiveteam-bs [14:49] *** Yurume has quit IRC (Read error: Operation timed out) [14:49] *** kiska has joined #archiveteam-bs [14:50] *** NIC007a83 has joined #archiveteam-bs [14:50] *** svchfoo1 sets mode: +o kiska [14:50] *** svchfoo3 sets mode: +o kiska [14:56] *** systwi has quit IRC (Read error: Operation timed out) [14:58] *** antomati_ has quit IRC (Read error: Operation timed out) [15:02] bitbit: The -meta.warc.gz only contains the log of the retrieval. The actual data is in the numbered ones. 
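A small sketch of the point just made: when picking files from a job's folder, keep the data WARCs and skip the `-meta.warc.gz` log. Only the `-meta` file name comes from the log; the other names in `listing` are hypothetical stand-ins for the numbered parts, and `data_warcs` is a made-up helper.

```python
def data_warcs(filenames):
    """Keep the WARC files that carry crawl data, dropping the -meta log file."""
    return sorted(
        name for name in filenames
        if name.endswith(".warc.gz") and not name.endswith("-meta.warc.gz")
    )

listing = [
    "botbot.me-inf-20181016-112202-6egkw-meta.warc.gz",   # named in the log
    "botbot.me-inf-20181016-112202-6egkw-00001.warc.gz",  # hypothetical name
    "botbot.me-inf-20181016-112202-6egkw-00000.warc.gz",  # hypothetical name
    "botbot.me-inf-20181016-112202-6egkw.json",           # hypothetical name
]
selected = data_warcs(listing)
for name in selected:
    print(name)
```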
[15:11] *** nicolas17 has joined #archiveteam-bs [15:15] AlsoJAA: thanks! I think I figured it out. first I do "cat ... > final.warc.gz" and then maybe "warcat extract final.warc.gz --output-dir ./final --progress"? [15:16] *** jake_test has joined #archiveteam-bs [15:19] can you just cat multiple .gz files together? [15:20] stackoverflow says yes [15:22] Yes, you can. [15:23] bitbit: I think you can also do `warcat extract file0.warc.gz file1.warc.gz ...`, but not entirely sure. [15:23] To avoid copying around stuff needlessly. [15:24] interesting I will give it a try next [15:34] yes! it works. so great [16:10] *** JAA has joined #archiveteam-bs [16:10] *** AlsoJAA sets mode: +o JAA [16:13] *** mtntmnky has quit IRC (Remote host closed the connection) [16:14] *** mtntmnky has joined #archiveteam-bs [16:16] *** nicolas17 has quit IRC (Quit: Konversation terminated!) [16:45] *** systwi has joined #archiveteam-bs [16:52] *** systwi_ has quit IRC (Ping timeout: 622 seconds) [17:11] *** atphoenix has quit IRC (Read error: Connection reset by peer) [17:15] *** atphoenix has joined #archiveteam-bs [17:44] *** VerifiedJ has quit IRC (Read error: Connection reset by peer) [18:46] *** Dragnog2 has quit IRC (Quit: Connection closed for inactivity) [19:55] JAA, AlsoJAA: is it possible to get the size of the resulting files before I begin the extraction? [19:57] *** vesi has joined #archiveteam-bs [19:59] bitbit: A decent estimate would be the size of the decompressed data. Something like `zcat *.warc.gz | wc -c`. [20:00] (gzip files have the decompressed size at the end of the data, but that's modulo 2 or 4 GiB.) [20:08] vesi: Do you have a reference for those forums closing? All I see is "If you like, you can still explore archived discussion in these forums." etc. [20:09] Looks like they were made read-only in April 2017. [20:12] JAA: https://i.imgur.com/ZLCdyse.png is what I see when I go there [20:13] After I turn JS on [20:13] Ah, yeah. 
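The two gzip points from the exchange above (concatenated `.gz` files are still valid gzip, and the size stored at the end of a gzip file is not a reliable total) can be checked with a short Python sketch. It uses in-memory data instead of real WARCs; `decompressed_size` and `trailer_isize` are names made up for this example. The first function counts bytes by streaming, which is what `zcat ... | wc -c` does; the second reads the 4-byte little-endian size field at the end of the data, which only describes the last gzip member and wraps at 2**32 bytes.

```python
import gzip
import io
import struct

def decompressed_size(stream, chunk_size=1 << 20):
    """Count decompressed bytes by streaming, similar to `zcat | wc -c`."""
    total = 0
    with gzip.GzipFile(fileobj=stream) as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

def trailer_isize(raw):
    """Read the size field in the last 4 bytes (last member only, modulo 2**32)."""
    return struct.unpack("<I", raw[-4:])[0]

first = b"first record\n"
second = b"second record, longer\n"

# Joining two gzip members byte-for-byte is what `cat a.gz b.gz > both.gz` does.
combined = gzip.compress(first) + gzip.compress(second)

full = decompressed_size(io.BytesIO(combined))   # covers both members
last_only = trailer_isize(combined)              # covers the last member only

print("streamed size:", full)
print("trailer field:", last_only)
```

For real files, `gzip.open(path, "rb")` in place of the `GzipFile` wrapper gives the same streaming count without holding the decompressed data in memory.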
[20:13] JAA: from what I can tell zcat just prints the uncompressed data? so it's like I'm decompressing, no? [20:13] I want to throw in #forum76 as a channel name [20:13] Because JS is totally needed for displaying an error like that. Ugh. [20:13] A week's worth of notice D: [20:14] bitbit: Correct. [20:31] oh man the glossy graphics [20:35] *** TC01 has quit IRC (Read error: Operation timed out) [20:36] *** TC01 has joined #archiveteam-bs [20:47] Response time from those forums is horrid. [20:48] Hey all, I'm here now. I got pinged on Discord by another member. Just thought to pass on the message to anyone I knew. [20:49] it's in archivebot now [20:50] Thank you. As a noob, some noob questions: [20:51] Does that mean that archivebot has queued a job to archive the forums? If so, will everything be archived, or only n-levels from root? Is one week enough time for the forums to be archived? [20:51] Average response time across ~200 requests: 1190 ms. Eww. [20:51] it's not enough time [20:51] Gross. Can I ask that we prioritize a certain subforum? Just a hunch. [20:51] A job is running for the forums in ArchiveBot now. It won't finish in time, which is why I'm looking into alternative ways. But with this performance of their servers, well... [20:53] Given the nature of the community that sent out the alert that these forums are getting taken offline, I think this subforum is the one of most interest to the most people: http://forums.bethsoft.com/forum/16-elder-scrolls-lore/ [20:54] No way to prioritise this with my method of grabbing (bruteforcing topic IDs). Possible in principle though. [20:55] JAA: there's a sitemap [20:56] Let's see what happens if I throw more connections at it. [20:57] hook54321: Ah, right, I always ignore those. :-P [20:57] is the response time by chance better over http? I've come across a couple of sites that are like that for some reason [20:58] I am using HTTP. [20:58] oh [20:58] Seems pretty similar on HTTPS. 
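On the sitemap suggestion and the request to prioritise one subforum: a minimal Python sketch of reading `<loc>` entries from a sitemap and moving URLs under a chosen prefix to the front of the queue. The XML namespace is the standard sitemap one; the sample document is made up for illustration (the log does not show what the real sitemap contains, or whether it lists forums or individual topics), and `sitemap_urls` and `prioritise` are hypothetical helper names. This is not the method actually used in the channel, which was bruteforcing topic IDs.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]

def prioritise(urls, prefix):
    """Put URLs under `prefix` first, keeping the original order otherwise."""
    wanted = [u for u in urls if u.startswith(prefix)]
    rest = [u for u in urls if not u.startswith(prefix)]
    return wanted + rest

# Made-up sample; the two forum URLs are the ones mentioned in the log.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://forums.bethsoft.com/forum/259-wolfenstein/</loc></url>
  <url><loc>http://forums.bethsoft.com/forum/16-elder-scrolls-lore/</loc></url>
</urlset>"""

urls = sitemap_urls(sample)
queue = prioritise(urls, "http://forums.bethsoft.com/forum/16-elder-scrolls-lore/")
for url in queue:
    print(url)
```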
[20:59] 1884 requests, 3153 ms avg response time :-| [20:59] Another noob question for myself and for anyone who asks: will the public be able to access the archived version? Will it be accessible via the IA's Wayback Machine, or through other means? [20:59] vesi: Everything we archive goes into the WBM. [20:59] (The exception confirms the rule.) [21:00] worth trying to contact admins and ask for enough time for archivebot to finish? [21:00] unlike most other WBM collections, our archives can also be downloaded and converted into a zip file by anyone :) [21:01] Ah that's great to hear! [21:01] a couple of the admins are listed as last active in late january (there's a contact email address but i don't know whether it's still monitored) [21:03] So I know you are brute-forcing topic ids, but question: the threads here are paginated. Will the bot be able to catch all the pages, or only the first one? [21:06] thuban3: I'm conflicted about whether or not to try to contact them, that could just make things worse. It largely depends on whether they're the kind of company that sends out legal threats willy nilly. [21:07] vesi: archive bot is a crawler; it finds pages to save by following links from the root node. i believe JAA was referring to "alternative ways" (we often use that type of brute-forcing in archive jobs where crawls don't apply well) [21:08] so yes, it is paginated [21:08] you can see the urls being archived on the dashboard at http://dashboard.at.ninjawedding.org/3 (click the button in the lower left of a job to show just that one) [21:10] Starting to see some timeouts, and the response time is increasing as well. [21:11] 4.3 s by now. :-| [21:12] Speaking as a webdev, throwing more connections at it might be making it choke. [21:12] Can happen, yes. [21:12] Thank you for the info! [21:13] It really depends on what causes it to be this slow. 
If it's somehow related to network lag, for example, but not throughput-limited, more connections can help despite horrible response times. [21:13] Doesn't seem to be the case here though. [21:14] i mean, it's only set to 6 concurrent connections. [21:14] I'm talking about my qwarc run. ArchiveBot is almost never fast enough to kill web servers. [21:14] ah [21:14] would pausing or aborting the archivebot job help? [21:15] Nope, probably won't make a difference. [21:15] I'm doing ~22 requests per second. [21:15] So 10+ times faster than AB. [21:17] I'm guessing that over the past 2-3 years, since the forums went into read-only mode, they probably moved them to smaller servers to match reduced activity. So it's probably a pretty small infrastructure that's supporting the current crawl. [21:23] Seems to have stabilised at just over 4 second response time. [21:30] vesi: some of them redirect to the new site already. http://forums.bethsoft.com/forum/259-wolfenstein/ [21:30] *** Flashfire has quit IRC (Remote host closed the connection) [21:30] *** kiska has quit IRC (Remote host closed the connection) [21:30] *** kiska has joined #archiveteam-bs [21:31] Hmm that's alright. No big loss there. [21:31] *** svchfoo1 sets mode: +o kiska [21:31] *** svchfoo3 sets mode: +o kiska [21:31] Honestly, I think the most important subforum to archive is probably http://forums.bethsoft.com/forum/16-elder-scrolls-lore/ [21:31] Some of the original writers for the games posted there + participated in forum role-play [21:32] It became a kind of cornerstone of lore for the fandom, etc. [21:56] *** HP_Archiv has joined #archiveteam-bs [22:58] *** OrIdow6 has quit IRC (Quit: Leaving.) [23:06] Response time has decreased to ~3 seconds in the last 2 hours. I should be able to comfortably grab all topics in time if it stays like that. 
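The numbers quoted above (about 22 requests per second at roughly 4 seconds per response) can be related with a back-of-the-envelope calculation. The sketch assumes each connection has one request in flight at a time and that latency stays stable, so throughput is roughly concurrency divided by response time. The connection count of 88 and the total of 1,500,000 topic IDs are hypothetical values chosen for illustration; neither appears in the log.

```python
def throughput(concurrency, avg_response_s):
    """Requests per second if every connection always has one request in flight."""
    return concurrency / avg_response_s

def hours_to_finish(total_requests, requests_per_second):
    """Wall-clock hours needed at a steady request rate."""
    return total_requests / requests_per_second / 3600

# Hypothetical inputs: 88 connections reproduces ~22 req/s at 4 s per response.
rate = throughput(concurrency=88, avg_response_s=4.0)

# Hypothetical ID range to bruteforce; the real number is not given in the log.
eta_hours = hours_to_finish(total_requests=1_500_000, requests_per_second=rate)

print(f"{rate:.1f} req/s, about {eta_hours:.1f} hours")
```

Under these assumptions a drop in response time at fixed concurrency raises the rate proportionally, which fits the remark that the grab becomes comfortable if the server stays at around 3 seconds.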
[23:13] *** OrIdow6 has joined #archiveteam-bs [23:13] *** BlueMax has joined #archiveteam-bs [23:35] *** RichardG_ is now known as RichardG [23:48] That's great to hear. Thank you for jumping on the ball for this.