[00:10] *** MasterOfP has joined #archiveteam-bs
[00:12] Hello! I came across the "Montreal Mirror archives, 1997-2010" on archive.org, and am wondering if there is a way to index the .warc to find a certain word, instead of having to look through every page.
[00:15] *** Maylay has joined #archiveteam-bs
[00:30] *** jason0597 has quit IRC (Read error: Operation timed out)
[00:37] *** MasterOfP has quit IRC (Quit: https://mibbit.com Online IRC Client)
[00:37] *** wp494 has quit IRC (Read error: Connection reset by peer)
[00:38] *** wp494 has joined #archiveteam-bs
[00:52] *** idontknow has joined #archiveteam-bs
[00:55] *** wyatt8740 has quit IRC (Ceci n'est pas un IRC quit message.)
[00:56] *** wyatt8740 has joined #archiveteam-bs
[00:56] *** wyatt8740 has quit IRC (Remote host closed the connection)
[01:03] *** maxfan8 has quit IRC (Quit: WeeChat 2.8)
[01:03] *** idontknow has quit IRC (Ping timeout: 252 seconds)
[01:03] *** maxfan8 has joined #archiveteam-bs
[01:03] Tigris.org only has a few hundred repos
[01:05] Looks like it's time for me to learn SVN now
[01:06] svnsync
[01:17] *** katocala has quit IRC ()
[01:19] *** Arcorann has joined #archiveteam-bs
[01:35] svnrdump
[01:35] I'll grab all repos.
[01:36] At least I've always used svnrdump and saw that recommended in several places before.
[01:40] Oh, some of it is CVS. Eww
[01:40] E.g. http://xanta.tigris.org/source/browse/xanta/
[01:47] It appears that the site has 3.2 million forum posts as well.
[01:50] Yeah, this is going to be a huge mess.
[01:52] nicolas17: Any idea about svnrdump vs svnsync? I'm not really familiar with SVN, but I've been told that svnrdump preserves the most information.
[01:52] oh, that's possible yes
[01:53] I never used svnrdump
[01:53] Fun, there's also a project "www" which obviously breaks things as it collides with Tigris's main website.
[01:54] Well, not "break" but behaves differently than everything else.
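On the 00:12 question about finding a word in a .warc without opening every page, here is a minimal sketch using only the Python standard library. It walks the records of a WARC stream and reports the target URIs of response records whose payload contains the word. The function names are hypothetical, and the byte search assumes the stored HTTP bodies are neither compressed nor chunked and use an ASCII-compatible encoding, which may not hold for a given archive.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the exact size of the record block.
        body = stream.read(int(headers.get("content-length", "0")))
        yield headers, body


def find_word(stream, word):
    """Return target URIs of response records whose block contains `word`."""
    needle = word.lower().encode("utf-8")
    hits = []
    for headers, body in iter_warc_records(stream):
        if headers.get("warc-type") == "response" and needle in body.lower():
            hits.append(headers.get("warc-target-uri"))
    return hits
```

For a compressed file, something like `with gzip.open("example.warc.gz", "rb") as f: find_word(f, "word")` should work, since Python's gzip module reads concatenated gzip members as one stream; the file name here is a placeholder.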
[01:55] nicolas17: Might be interesting to run svnsync on a repo and svnrdump on both the original one and the mirror to see whether there are any differences.
[02:04] Well, seems it requires a decent amount of understanding how SVN works to create a repo to which you can svnsync, so I won't do that now.
[02:08] Currently grabbing all the source info pages to get an idea how many svn vs CVS repos there are.
[02:08] Does anyone have any idea what the correct way of archiving a CVS repo is?
[02:34] *** zyphlar_ has joined #archiveteam-bs
[02:47] *** mattl has quit IRC (Read error: Connection reset by peer)
[02:48] *** kyledrake has quit IRC (Read error: Connection reset by peer)
[02:48] *** t3 has quit IRC (Read error: Connection reset by peer)
[02:48] *** justcool3 has quit IRC (Read error: Connection reset by peer)
[02:48] *** Ctrl-S___ has quit IRC (Write error: Connection reset by peer)
[02:48] *** katocala has joined #archiveteam-bs
[02:50] *** t3 has joined #archiveteam-bs
[02:50] *** justcool3 has joined #archiveteam-bs
[02:52] *** Ctrl-S___ has joined #archiveteam-bs
[02:53] *** katocala has quit IRC (Client Quit)
[03:00] *** kyledrake has joined #archiveteam-bs
[03:02] *** BlueMaxim has joined #archiveteam-bs
[03:04] So there are 666 projects in total on Tigris, including the odd "www" one. 374 of these use SVN, and 290 use CVS. This is off by one because http://mgc.tigris.org/ has no repository anymore: http://mgc.tigris.org/source/browse/mgc/
[03:04] *** katocala has joined #archiveteam-bs
[03:06] Here are the 954 (374 + 2 * 290) commands from all the /source/browse// pages: https://transfer.notkiska.pw/160MrR/tigris-checkout-commands
[03:06] *** mattl has joined #archiveteam-bs
[03:10] On a completely unrelated note, the ARM documentation will disappear by June 30. I threw it into ArchiveBot earlier. infocenter.arm.com is running fine, but developer.arm.com/docs ran into something (no idea what) that set a cookie switching to an "edit mode" which redirects everything to login pages. They claim that the content will be migrated with redirects and all, but the new site is JS hell and
[03:10] pretty much unarchivable.
[03:11] So that requires some action as well.
[03:14] *** BlueMax has quit IRC (Ping timeout: 745 seconds)
[03:18] *** wyatt8740 has joined #archiveteam-bs
[03:22] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[03:42] *** qw3rty__ has joined #archiveteam-bs
[03:50] *** qw3rty_ has quit IRC (Read error: Operation timed out)
[04:03] *** benjins has quit IRC (Remote host closed the connection)
[04:05] *** benjins has joined #archiveteam-bs
[04:22] *** Arcorann has joined #archiveteam-bs
[04:44] *** zyphlar_ has quit IRC (Quit: Connection closed for inactivity)
[05:20] *** BlueMax has joined #archiveteam-bs
[05:20] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[05:24] *** omglolba- has joined #archiveteam-bs
[05:24] *** omglolbah has quit IRC (Write error: Connection reset by peer)
[05:32] *** TC01 has quit IRC (Ping timeout: 745 seconds)
[05:33] *** TC01 has joined #archiveteam-bs
[05:46] *** nicolas17 has quit IRC (Quit: Konversation terminated!)
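For the Tigris SVN repositories discussed above, a rough sketch of how the per-project `svnrdump dump` invocations might be generated, plus a check of the command count quoted at 03:06. `svnrdump dump URL` writes a dump stream to standard output, so the sketch redirects it to a file. The URL layout `http://<project>.tigris.org/svn/<project>/` is an assumption modelled on the ado-mock URL quoted later in this log, and the function names and the `.svndump` file suffix are made up for illustration.

```python
import shlex

SVN_PROJECTS = 374  # count given in the chat
CVS_PROJECTS = 290  # count given in the chat


def expected_command_count(svn=SVN_PROJECTS, cvs=CVS_PROJECTS):
    """One command per SVN project and two per CVS project, as in the chat."""
    return svn + 2 * cvs


def svn_url(project):
    """Assumed repository URL layout; not confirmed for every project."""
    return f"http://{project}.tigris.org/svn/{project}/"


def svnrdump_command(project):
    """Argument list for dumping one remote repository to stdout."""
    return ["svnrdump", "dump", svn_url(project)]


def shell_line(project):
    """Shell command line that saves the dump stream to <project>.svndump."""
    dump_file = shlex.quote(project + ".svndump")
    return shlex.join(svnrdump_command(project)) + " > " + dump_file


if __name__ == "__main__":
    print(expected_command_count())
    print(shell_line("ado-mock"))
```

Building argument lists first and quoting them only when a shell line is needed keeps odd project names from being misread by the shell.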
[06:18] so tubeup is failing hard right now
[06:18] end up just uploading metadata when i can't upload any files
[06:19] then it uploads the .part files sometimes
[06:20] feels like i have to watch over this program to make sure it gives IA the right data instead of junk incomplete data now
[06:53] *** tsr has quit IRC (Quit: foo)
[07:01] *** justcool3 has quit IRC (Quit: Connection closed for inactivity)
[07:01] *** godane has quit IRC (Read error: Connection reset by peer)
[07:19] *** larryv_ has joined #archiveteam-bs
[07:20] *** godane has joined #archiveteam-bs
[07:21] *** larryv has quit IRC (Read error: Operation timed out)
[07:50] *** BlueMax has quit IRC (Quit: Leaving)
[07:50] godane: .part files usually are the result of ytdl not finishing a download, not tubeup
[07:50] Also, don't bother with tubeup unless you don't mind the no-index of content
[07:57] i only use tubeup cause its japanese language and i don't know the name
[07:58] its just easier that way for now
[08:40] *** larryv_ has quit IRC (Ping timeout: 272 seconds)
[09:55] JAA: If the redirects are frequent enough, hopefully it would be possible to figure out what sets the cookie (assuming there's only one) once the log is uploaded
[11:11] *** BartoCH has joined #archiveteam-bs
[11:14] *** BlueMax has joined #archiveteam-bs
[12:02] *** larryv has joined #archiveteam-bs
[12:06] *** Maylay has quit IRC (Ping timeout: 265 seconds)
[12:07] *** larryv has quit IRC (Ping timeout: 272 seconds)
[12:18] *** Maylay has joined #archiveteam-bs
[12:44] *** larryv has joined #archiveteam-bs
[12:58] *** larryv has quit IRC (Read error: Operation timed out)
[12:58] *** jason0597 has joined #archiveteam-bs
[13:24] *** BlueMax has quit IRC (Quit: Leaving)
[13:43] OrIdow6: I looked at the log file already and didn't see anything obvious.
[14:28] *** HP_Archiv has quit IRC (Read error: Connection reset by peer)
[15:28] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[16:12] *** jason0597 has quit IRC (Read error: Operation timed out)
[16:32] *** larryv has joined #archiveteam-bs
[16:43] *** larryv has quit IRC (Quit: larryv)
[16:45] *** larryv has joined #archiveteam-bs
[17:21] *** larryv has quit IRC (Read error: Operation timed out)
[17:25] *** larryv has joined #archiveteam-bs
[17:31] *** larryv_ has joined #archiveteam-bs
[17:33] *** VerifiedJ has joined #archiveteam-bs
[17:33] *** larryv has quit IRC (Read error: Operation timed out)
[17:56] *** nicolas17 has joined #archiveteam-bs
[18:28] *** britmob_ has quit IRC (Read error: Connection reset by peer)
[18:28] *** britmob has joined #archiveteam-bs
[19:06] Hmm, looks like Tigris's CVS repository is dead. My connections are timing out.
[19:11] The SVN repositories are directly accessible as open directories, so it might be possible in principle even to archive them as WARC and have `svn checkout` from the WBM work, maybe.
[19:11] I'm not going to look into that though.
[19:21] *** wyatt8740 has quit IRC (Ping timeout: 260 seconds)
[19:21] *** wyatt8750 has joined #archiveteam-bs
[19:26] *** HP_Archiv has joined #archiveteam-bs
[20:00] what do you mean by open directories? link?
[20:07] You can browse the directory structure. E.g. http://ado-mock.tigris.org/svn/ado-mock/
[20:08] But it doesn't actually expose everything (e.g. past revisions) and doesn't really matter anyway regarding capturing SVN's HTTP requests in WARCs.
[20:08] At first I thought it would be as simple as doing a recursive crawl of that.
[20:10] 'svn checkout' over HTTP uses a special protocol, you can't serve it as plain files
[20:11] Right
[20:11] Do you know where that protocol is documented?
[20:13] it used to be WebDAV-based, I think there's now a new protocol that doesn't need a zillion requests so it doesn't kill you with latency
[21:26] *** t3 has quit IRC (Quit: Connection closed for inactivity)
[21:27] *** VerifiedJ has quit IRC (Quit: Leaving)
[21:37] *** Raccoon has quit IRC (Ping timeout: 265 seconds)
[21:52] *** larryv_ has quit IRC (Quit: larryv_)
[22:10] *** larryv has joined #archiveteam-bs
[22:24] *** DigiDigi has quit IRC (Read error: Operation timed out)
[22:41] latest scans : https://www.patreon.com/posts/digitize-for-06-38448509
[22:45] *** godane has quit IRC (Quit: Leaving.)
[22:51] *** DigiDigi has joined #archiveteam-bs
[23:16] *** wyatt8740 has joined #archiveteam-bs
[23:17] *** wyatt8750 has quit IRC (Ping timeout: 260 seconds)
[23:31] *** wyatt8750 has joined #archiveteam-bs
[23:32] *** wyatt8740 has quit IRC (Ping timeout: 260 seconds)
[23:36] *** wyatt8750 has quit IRC (Read error: Operation timed out)
[23:36] *** BlueMax has joined #archiveteam-bs
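To illustrate the 20:08 to 20:13 exchange about why a recursive crawl of the browsable /svn/ directory is not the same as capturing what `svn checkout` does: Subversion's HTTP access is built on WebDAV-style methods such as OPTIONS, PROPFIND, and REPORT rather than plain GET. The sketch below only constructs a generic WebDAV PROPFIND request with the Python standard library and does not send it. It is not claimed to match the exact requests an svn client issues, and the function name is hypothetical.

```python
import urllib.request

# Generic WebDAV request body asking for all properties of a resource.
PROPFIND_BODY = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<propfind xmlns="DAV:"><allprop/></propfind>'
)


def build_propfind(url, depth="0"):
    """Build, but do not send, a WebDAV PROPFIND request for `url`."""
    return urllib.request.Request(
        url,
        data=PROPFIND_BODY,
        method="PROPFIND",
        headers={
            "Depth": depth,  # "0" means the resource itself, not its children
            "Content-Type": "text/xml; charset=utf-8",
        },
    )


if __name__ == "__main__":
    request = build_propfind("http://ado-mock.tigris.org/svn/ado-mock/")
    print(request.get_method(), request.full_url)
```

A crawler that only issues GET requests would never record exchanges like this one, which is one reason replaying a checkout from such a capture is unlikely to work.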