[00:31] *** kiska1825 has joined #archiveteam-bs
[00:31] *** kiska1825 has quit IRC (Client Quit)
[00:32] *** kiska1825 has joined #archiveteam-bs
[00:32] *** Ryz has joined #archiveteam-bs
[01:02] *** pew has quit IRC (Ping timeout: 265 seconds)
[01:14] JAA, since archiving the forums of Kongregate is unlikely to be done with ArchiveBot, would this be an ArchiveTeam Warrior project?
[01:14] *** pew has joined #archiveteam-bs
[01:19] Ryz: That or a qwarc task.
[01:20] Isn't that website JS-filled, as per your reaction to the announcement link I pointed out? oo;
[01:21] There are a few other important DPoS projects stuck in dev for too long already, so I guess I might try with qwarc.
[01:21] Yep, it is.
[01:21] I haven't looked at it in detail yet at all.
[01:21] Ultimately, JS interactions can always be emulated. I've done that many times before with qwarc.
[01:24] Fortunately, it isn't the whole forums of Kongregate...yet
[01:32] Meh, I'll just grab the whole thing I think. Too complicated to do it differently.
[02:22] *** cerca has quit IRC (Remote host closed the connection)
[02:28] Two things that are going down, but that no one but Google Translate seems to be able to read:
[02:28] https://matome.naver.jp/ - per that message in #archiveteam, does seem to be going down at the end of September. It seems to be some Web 2.0 site, can't really tell what it is
[02:30] https://pclab.pl/ is being frozen, if not taken down; status of https://forum.pclab.pl/ is unclear
[03:29] *** qw3rty__ has joined #archiveteam-bs
[03:37] *** qw3rty_ has quit IRC (Read error: Operation timed out)
[03:59] *** icedice2 has quit IRC (Quit: Leaving)
[05:19] *** rklane has joined #archiveteam-bs
[05:52] *** mgrandi has joined #archiveteam-bs
[06:10] *** nicolas17 has quit IRC (Quit: sleep)
[08:13] So I'll ask that someone put https://pclab.pl/ and https://matome.naver.jp/ into AB; will see about the forums later
[08:17] Hello OrIdow6, do you have a source for https://pclab.pl/ ?
And https://matome.naver.jp/ besides the message posted on another chatroom?
[08:18] Ryz: No, both per those links in #archiveteam in the last few days
[08:18] https://pclab.pl/news84673.html and http://navermatome-official.blog.jp/archives/83259956.html
[08:19] Again, I'm not 100% sure, because I speak neither Polish nor Japanese, but machine translation makes it look like they're going down
[08:22] I've run https://pclab.pl/ (but not the forums yet), but https://matome.naver.jp/ might be large
[08:23] The former is processing right now via AB
[08:23] OrIdow6 ^
[08:24] Ryz: How are you determining that it's too large?
[08:24] And thank you for pclab
[08:25] It seems to be a social media website with blog capabilities
[08:26] The pagination can go up to 50 pages, but I feel there's a lot more when you dig around the individual users' profiles
[08:26] Pagination at least on the main page
[08:27] My Tigris grab is progressing nicely. It's nearly finished, just 15 projects are still retrieving things, mostly the largest ones with a lot of messages in the forums.
[08:27] Very strange that it's still online.
[08:32] Ryz: It looked more to me like it was a collection of user-created "search results", in the form of lists of links
[08:33] The Turiver grab is at 31k of 130k thread IDs.
[08:34] And Komixxy.pl is at almost a third of the posts now. There are a lot of warnings and errors to go through though because the site's pretty weird.
[08:35] Empty usernames and whatnot
[08:39] In any case, it's going to have to get archived at some time
[09:48] *** fredgido has joined #archiveteam-bs
[09:58] *** HP_Archiv has joined #archiveteam-bs
[10:16] *** BlueMax has quit IRC (Quit: Leaving)
[10:17] Tigris just started timing out a couple minutes ago. This may be the end.
[10:18] How much did you get?
[10:18] Nevermind, it's back.
[10:18] Oh
[10:19] ¯\_(ツ)_/¯
[10:19] They are based in Georgia (the US state), so maybe there's a bit more time until they wake up and take it down.
[10:20] Could be
[10:21] Looooooot :p
[10:21] I've been working on another grab of the CVS repos, which are only available through the web interface. The AB job I launched for those yesterday laturally blew up.
[10:22] naturally*
[10:37] Is there any difference between e.g. http://delphiexpert.tigris.org/source/browse/delphiexpert/src/UStringUtils.pas?view=markup&pathrev=MAIN http://delphiexpert.tigris.org/source/browse/*checkout*/delphiexpert/src/UStringUtils.pas?revision=1.2&pathrev=MAIN and http://delphiexpert.tigris.org/source/browse/*checkout*/delphiexpert/src/UStringUtils.pas?revision=1.2&content-type=text%2Fplain&pathrev=MAIN ?
[10:38] *** HP_Archiv has quit IRC (Quit: Leaving)
[10:41] JAA
[10:41] *** HP_Archiv has joined #archiveteam-bs
[10:42] OrIdow6: They all have the same file contents. Only the first file has the commit message and metadata (author, date) though.
[10:43] first link*
[10:45] Oh
[10:47] Looks like the list page has the metadata, though
[10:48] I think I've seen some that's only on the view=markup page.
[10:50] Oh
[10:57] By the way, hideattic=0 is important to catch deleted files.
[11:17] CVS web grab started for the 303 repos I'm aware of. I'm first grabbing the directory structure for all projects, then the most recent revision of every file for all projects, then the revision history.
[11:24] OrIdow6: Oh by the way, the pathrev is irrelevant for the /*checkout*/ URLs; I'm not sure why it's there, to be honest. I don't include it in my requests, which reduces the number of requests massively, especially when there are many branches that share a revision.
[11:25] Oh shit, Tigris *sometimes* serves error pages with status 200. Eww.
[11:28] http://www.tigris.org/servlets/ErrorPage
[11:30] They're 302 redirects to that URL. Fortunately, I got only a handful of these on my pages crawl (requeueing), but the CVS web interface throws them a lot under load.
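[Editor's note] The two points made just above (leaving pathrev out of /*checkout*/ requests, and treating a 302 to /servlets/ErrorPage as a failure to requeue rather than as content) can be illustrated with a minimal sketch. This is not qwarc code and not the actual grab script; the function names normalize_checkout_url and is_soft_error are hypothetical, and the set of redirect status codes checked is an assumption (the log only mentions 302).

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Error page path taken from the log: http://www.tigris.org/servlets/ErrorPage
ERROR_PATH = "/servlets/ErrorPage"


def normalize_checkout_url(url: str) -> str:
    """Drop pathrev from /*checkout*/ URLs (hypothetical helper).

    Branches that share a revision then collapse to one request.
    Other URLs (e.g. view=markup pages) are returned unchanged.
    """
    parts = urlsplit(url)
    if "/*checkout*/" not in parts.path:
        return url
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "pathrev"
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


def is_soft_error(status: int, location: Optional[str]) -> bool:
    """True if a response redirects to the generic error page (hypothetical helper).

    Following such a redirect ends in a 200, so the check has to happen
    on the redirect itself; the caller would requeue the original URL.
    """
    # Assumption: other redirect codes are treated the same way as 302.
    if status not in (301, 302, 303, 307, 308) or not location:
        return False
    return urlsplit(location).path == ERROR_PATH
```

With the example URLs from the 10:37 message, both /*checkout*/ variants lose their pathrev parameter, while the view=markup URL is left alone because it is not a checkout URL.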
[11:32] *** VerifiedJ has joined #archiveteam-bs
[11:40] *** HP_Archiv has quit IRC (Quit: Leaving)
[11:40] I also got occasional odd redirects to the login form, which oddly enough appeared on the same three projects as the /ErrorPage ones.
[11:42] The CVS web interface is buggy as well: http://silvertejp.tigris.org/source/browse/silvertejp/maven2/silvertejp-limax/src/main/java/org/tigris/silvertejp/limax/util/.svn/ redirects to the login.
[11:42] (Yeah, someone committed a full SVN repository to CVS. lol)
[11:48] Restarted the CVS web grab with a fix for those error pages and login redirects.
[11:52] Reduced my concurrency a bit and am now getting a better throughput as well. :-)
[11:52] (And no more error page redirects)
[13:06] Seems like the login redirects may be cookie-based in some cases. Seriously, fuck this site.
[13:12] *** VerifiedJ has quit IRC (Quit: Leaving)
[13:16] Sounds familiar
[13:19] Initial directory structure scan is done except for a few errors, about 117k files exist in total.
[13:20] I assume you're going to hammer it with Qwarc now
[13:23] Only a little bit. Can't go too fast because their servers are falling over immediately if it's more than a light breeze.
[13:25] Looks like the request to any directory called ".svn" in silvertejp is broken, and sometimes that then causes all further requests for other things to also fall into the login trap.
[13:31] Depending on how the website works, it could help your max speed to be parallel across different projects/subdomains, instead of within each
[13:31] I.e., depending on what resource it is that's being used up
[13:33] Errors seem to be random, and it's all in the same CVS repository anyway (each project is a "module" in CVS terms).
[13:33] Oh
[13:46] More issues. Sigh...
[13:49] *** benjins has joined #archiveteam-bs
[13:50] I was thinking of suggesting stopping the AB job to get you more resources, but at this rate, if the site shuts down soon, it may end get the most
[13:50] *** Terbium_ has joined #archiveteam-bs
[13:50] *getting
[13:51] *** maxfan8_ has joined #archiveteam-bs
[13:52] Disadvantage of a structured grab
[13:52] Yeah, but the AB grab will be incomplete as well due to those error and login redirects.
[13:52] *** Maylay_ has joined #archiveteam-bs
[13:52] *** Maylay_ has quit IRC (Remote host closed the connection!)
[13:52] My grab's stuck in a loop at the moment. Never encountered this before.
[13:52] *** Maylay_ has joined #archiveteam-bs
[13:58] Yup, that's a bug in qwarc. Heh
[14:02] *** Terbium has quit IRC (se.hub efnet.deic.eu)
[14:02] *** benjinsmi has quit IRC (se.hub efnet.deic.eu)
[14:02] *** Maylay has quit IRC (se.hub efnet.deic.eu)
[14:02] *** maxfan8 has quit IRC (se.hub efnet.deic.eu)
[14:20] *** Ryz has quit IRC (Remote host closed the connection)
[14:20] *** kiska1825 has quit IRC (Remote host closed the connection)
[14:21] *** kiska1825 has joined #archiveteam-bs
[14:21] *** Ryz has joined #archiveteam-bs
[14:23] *** Raccoon has quit IRC (Ping timeout: 272 seconds)
[14:54] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[15:08] *** lennier1 has quit IRC (Ping timeout: 265 seconds)
[15:09] *** lennier1 has joined #archiveteam-bs
[16:48] *** Raccoon has joined #archiveteam-bs
[17:11] *** mgrandi has quit IRC (Leaving)
[17:19] *** rklane has quit IRC (Quit: This computer has gone to sleep)
[17:24] *** tobbez has joined #archiveteam-bs
[18:10] *** nicolas17 has joined #archiveteam-bs
[18:25] Turiver: about one third done
[18:25] Komixxy: a bit under halfway done with posts, 170k+ users discovered so far
[18:25] Tigris: mess
[19:38] *** Ryz has quit IRC (Remote host closed the connection)
[19:38] *** kiska1825 has quit IRC (Remote host closed the connection)
[19:39] *** kiska1825 has
joined #archiveteam-bs
[19:39] *** Ryz has joined #archiveteam-bs
[19:49] Six projects are still running on my Tigris website crawl.
[19:50] *** qwebirc91 has joined #archiveteam-bs
[19:51] *** qwebirc91 has quit IRC (Client Quit)
[19:51] And my CVS web grab is nearly done with the current revision retrieval.
[19:51] *** qwebirc31 has joined #archiveteam-bs
[19:51] *** qwebirc31 has quit IRC (Client Quit)
[19:51] *** Nikchemny has joined #archiveteam-bs
[19:52] JAA: Sorry, I didn't edit wikipage today because I was busy
[19:54] Nikchemny: No worries, I've been busy with other things anyway.
[19:55] JAA: Btw, I've saved article from news.mail.ru with chromebot and archivebot. I started these with my phone
[20:22] JAA: This article: https://news.mail.ru/politics/42417368/
[20:22] Mhm
[20:31] Correction on the Tigris CVS progress: 96k files done, 21.5k remaining; it has started with the file revisions at 23k done, 73k remaining.
[21:47] *** robogoat has quit IRC (Ping timeout: 745 seconds)
[21:55] *** robogoat has joined #archiveteam-bs
[22:00] *** Nikchemny has quit IRC (Quit: Page closed)
[22:06] https://lists.openstreetmap.org/pipermail/dev/2020-July/030958.html OpenStreetMap Trac and SVN going away
[22:17] Only being made read-only for now according to that.
[22:18] ah right
[22:18] put it at the bottom of the list :)
[22:42] *** OrIdow6 has quit IRC (Ping timeout: 265 seconds)
[22:46] *** OrIdow6 has joined #archiveteam-bs
[22:55] https://www.reddit.com/r/DataHoarder/comments/hk64ye/americas_network_magazines_scans_and_other_rare/
[23:16] *** Arcorann has joined #archiveteam-bs
[23:24] *** OrIdow6 has quit IRC (Ping timeout: 265 seconds)
[23:30] Four projects remaining on the Tigris website crawl: argouml, argouml-stats, subversion, and tortoisesvn. All just recursing through the discussion forums.
[23:32] 16k files and 44k file revisions remaining on the CVS web crawl.
(Each of the files will further queue a file revisions item later, so that's really 60k file revs remaining.)
[23:38] SketchCow: FOS /2 is nearly full.
[23:39] *** kiska1825 has quit IRC (Remote host closed the connection)
[23:39] *** Ryz has quit IRC (Remote host closed the connection)
[23:39] ~8 TiB of it is AB.
[23:39] *** kiska1825 has joined #archiveteam-bs
[23:39] *** Ryz has joined #archiveteam-bs
[23:58] can anyone else here archive this : https://www.twitch.tv/reckful/videos?filter=archives&sort=time
[23:59] https://www.reddit.com/r/DataHoarder/comments/hk66tb/twitch_streamer_reckful_died_preserving_the/
[23:59] *** BlueMax has joined #archiveteam-bs