#archiveteam-bs 2020-06-20,Sat


Time Nickname Message
00:10 🔗 MasterOfP has joined #archiveteam-bs
00:12 🔗 MasterOfP Hello! I came across the "Montreal Mirror archives, 1997-2010" on archive.org, and am wondering if there is a way to index the .warc to find a certain word, instead of having to look through every page.
00:15 🔗 Maylay has joined #archiveteam-bs
00:30 🔗 jason0597 has quit IRC (Read error: Operation timed out)
00:37 🔗 MasterOfP has quit IRC (Quit: https://mibbit.com Online IRC Client)
00:37 🔗 wp494 has quit IRC (Read error: Connection reset by peer)
00:38 🔗 wp494 has joined #archiveteam-bs
00:52 🔗 idontknow has joined #archiveteam-bs
00:55 🔗 wyatt8740 has quit IRC (Ceci n'est pas un IRC quit message.)
00:56 🔗 wyatt8740 has joined #archiveteam-bs
00:56 🔗 wyatt8740 has quit IRC (Remote host closed the connection)
01:03 🔗 maxfan8 has quit IRC (Quit: WeeChat 2.8)
01:03 🔗 idontknow has quit IRC (Ping timeout: 252 seconds)
01:03 🔗 maxfan8 has joined #archiveteam-bs
01:03 🔗 OrIdow6 Tigris.org only has a few hundred repos
01:05 🔗 OrIdow6 Looks like it's time for me to learn SVN now
01:06 🔗 nicolas17 svnsync
01:17 🔗 katocala has quit IRC ()
01:19 🔗 Arcorann has joined #archiveteam-bs
01:35 🔗 JAA svnrdump
01:35 🔗 JAA I'll grab all repos.
01:36 🔗 JAA At least I've always used svnrdump and saw that recommended in several places before.
01:40 🔗 JAA Oh, some of it is CVS. Eww
01:40 🔗 JAA E.g. http://xanta.tigris.org/source/browse/xanta/
01:47 🔗 JAA It appears that the site has 3.2 million forum posts as well.
01:50 🔗 JAA Yeah, this is going to be a huge mess.
01:52 🔗 JAA nicolas17: Any idea about svnrdump vs svnsync? I'm not really familiar with SVN, but I've been told that svnrdump preserves the most information.
01:52 🔗 nicolas17 oh, that's possible yes
01:53 🔗 nicolas17 I never used svnrdump
01:53 🔗 JAA Fun, there's also a project "www" which obviously breaks things as it collides with Tigris's main website.
01:54 🔗 JAA Well, not "break" but behaves differently than everything else.
01:55 🔗 JAA nicolas17: Might be interesting to run svnsync on a repo and svnrdump on both the original one and the mirror to see whether there are any differences.
02:04 🔗 JAA Well, seems it requires a decent amount of understanding how SVN works to create a repo to which you can svnsync, so I won't do that now.
02:08 🔗 JAA Currently grabbing all the source info pages to get an idea how many svn vs CVS repos there are.
02:08 🔗 JAA Does anyone have any idea what the correct way of archiving a CVS repo is?
02:34 🔗 zyphlar_ has joined #archiveteam-bs
02:47 🔗 mattl has quit IRC (Read error: Connection reset by peer)
02:48 🔗 kyledrake has quit IRC (Read error: Connection reset by peer)
02:48 🔗 t3 has quit IRC (Read error: Connection reset by peer)
02:48 🔗 justcool3 has quit IRC (Read error: Connection reset by peer)
02:48 🔗 Ctrl-S___ has quit IRC (Write error: Connection reset by peer)
02:48 🔗 katocala has joined #archiveteam-bs
02:50 🔗 t3 has joined #archiveteam-bs
02:50 🔗 justcool3 has joined #archiveteam-bs
02:52 🔗 Ctrl-S___ has joined #archiveteam-bs
02:53 🔗 katocala has quit IRC (Client Quit)
03:00 🔗 kyledrake has joined #archiveteam-bs
03:02 🔗 BlueMaxim has joined #archiveteam-bs
03:04 🔗 JAA So there are 666 projects in total on Tigris, including the odd "www" one. 374 of these use SVN, and 290 use CVS. This is off by one because http://mgc.tigris.org/ has no repository anymore: http://mgc.tigris.org/source/browse/mgc/
03:04 🔗 katocala has joined #archiveteam-bs
03:06 🔗 JAA Here are the 954 (374 + 2 * 290) commands from all the /source/browse/<project>/ pages: https://transfer.notkiska.pw/160MrR/tigris-checkout-commands
03:06 🔗 mattl has joined #archiveteam-bs
03:10 🔗 JAA On a completely unrelated note, the ARM documentation will disappear by June 30. I threw it into ArchiveBot earlier. infocenter.arm.com is running fine, but developer.arm.com/docs ran into something (no idea what) that set a cookie switching to an "edit mode" which redirects everything to login pages. They claim that the content will be migrated with redirects and all, but the new site is JS hell and
03:10 🔗 JAA pretty much unarchivable.
03:11 🔗 JAA So that requires some action as well.
03:14 🔗 BlueMax has quit IRC (Ping timeout: 745 seconds)
03:18 🔗 wyatt8740 has joined #archiveteam-bs
03:22 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
03:42 🔗 qw3rty__ has joined #archiveteam-bs
03:50 🔗 qw3rty_ has quit IRC (Read error: Operation timed out)
04:03 🔗 benjins has quit IRC (Remote host closed the connection)
04:05 🔗 benjins has joined #archiveteam-bs
04:22 🔗 Arcorann has joined #archiveteam-bs
04:44 🔗 zyphlar_ has quit IRC (Quit: Connection closed for inactivity)
05:20 🔗 BlueMax has joined #archiveteam-bs
05:20 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
05:24 🔗 omglolba- has joined #archiveteam-bs
05:24 🔗 omglolbah has quit IRC (Write error: Connection reset by peer)
05:32 🔗 TC01 has quit IRC (Ping timeout: 745 seconds)
05:33 🔗 TC01 has joined #archiveteam-bs
05:46 🔗 nicolas17 has quit IRC (Quit: Konversation terminated!)
06:18 🔗 godane so tubeup is failing hard right now
06:18 🔗 godane end up just uploading metadata when i can't upload any files
06:19 🔗 godane then it uploads the .part files sometimes
06:20 🔗 godane feels like i have to watch over this program to make sure it gives IA the right data instead of junk incomplete data now
06:53 🔗 tsr has quit IRC (Quit: foo)
07:01 🔗 justcool3 has quit IRC (Quit: Connection closed for inactivity)
07:01 🔗 godane has quit IRC (Read error: Connection reset by peer)
07:19 🔗 larryv_ has joined #archiveteam-bs
07:20 🔗 godane has joined #archiveteam-bs
07:21 🔗 larryv has quit IRC (Read error: Operation timed out)
07:50 🔗 BlueMax has quit IRC (Quit: Leaving)
07:50 🔗 HP_Archiv godane: .part files usually are the result of ytdl not finishing a download, not tubeup
07:50 🔗 HP_Archiv Also, don't bother with tubeup unless you don't mind the no-index of content
07:57 🔗 godane i only use tubeup cause its japanese language and i don't know the name
07:58 🔗 godane its just easier that way for now
08:40 🔗 larryv_ has quit IRC (Ping timeout: 272 seconds)
09:55 🔗 OrIdow6 JAA: If the redirects are frequent enough, hopefully it would be possible to figure out what sets the cookie (assuming there's only one) once the log is uploaded
11:11 🔗 BartoCH has joined #archiveteam-bs
11:14 🔗 BlueMax has joined #archiveteam-bs
12:02 🔗 larryv has joined #archiveteam-bs
12:06 🔗 Maylay has quit IRC (Ping timeout: 265 seconds)
12:07 🔗 larryv has quit IRC (Ping timeout: 272 seconds)
12:18 🔗 Maylay has joined #archiveteam-bs
12:44 🔗 larryv has joined #archiveteam-bs
12:58 🔗 larryv has quit IRC (Read error: Operation timed out)
12:58 🔗 jason0597 has joined #archiveteam-bs
13:24 🔗 BlueMax has quit IRC (Quit: Leaving)
13:43 🔗 JAA OrIdow6: I looked at the log file already and didn't see anything obvious.
14:28 🔗 HP_Archiv has quit IRC (Read error: Connection reset by peer)
15:28 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
16:12 🔗 jason0597 has quit IRC (Read error: Operation timed out)
16:32 🔗 larryv has joined #archiveteam-bs
16:43 🔗 larryv has quit IRC (Quit: larryv)
16:45 🔗 larryv has joined #archiveteam-bs
17:21 🔗 larryv has quit IRC (Read error: Operation timed out)
17:25 🔗 larryv has joined #archiveteam-bs
17:31 🔗 larryv_ has joined #archiveteam-bs
17:33 🔗 VerifiedJ has joined #archiveteam-bs
17:33 🔗 larryv has quit IRC (Read error: Operation timed out)
17:56 🔗 nicolas17 has joined #archiveteam-bs
18:28 🔗 britmob_ has quit IRC (Read error: Connection reset by peer)
18:28 🔗 britmob has joined #archiveteam-bs
19:06 🔗 JAA Hmm, looks like Tigris's CVS repository is dead. My connections are timing out.
19:11 🔗 JAA The SVN repositories are directly accessible as open directories, so it might be possible in principle even to archive them as WARC and have `svn checkout` from the WBM work, maybe.
19:11 🔗 JAA I'm not going to look into that though.
19:21 🔗 wyatt8740 has quit IRC (Ping timeout: 260 seconds)
19:21 🔗 wyatt8750 has joined #archiveteam-bs
19:26 🔗 HP_Archiv has joined #archiveteam-bs
20:00 🔗 nicolas17 what do you mean by open directories? link?
20:07 🔗 JAA You can browse the directory structure. E.g. http://ado-mock.tigris.org/svn/ado-mock/
20:08 🔗 JAA But it doesn't actually expose everything (e.g. past revisions) and doesn't really matter anyway regarding capturing SVN's HTTP requests in WARCs.
20:08 🔗 JAA At first I thought it would be as simple as doing a recursive crawl of that.
20:10 🔗 nicolas17 'svn checkout' over HTTP uses a special protocol, you can't serve it as plain files
20:11 🔗 JAA Right
20:11 🔗 JAA Do you know where that protocol is documented?
20:13 🔗 nicolas17 it used to be WebDAV-based, I think there's now a new protocol that doesn't need a zillion requests so it doesn't kill you with latency
21:26 🔗 t3 has quit IRC (Quit: Connection closed for inactivity)
21:27 🔗 VerifiedJ has quit IRC (Quit: Leaving)
21:37 🔗 Raccoon has quit IRC (Ping timeout: 265 seconds)
21:52 🔗 larryv_ has quit IRC (Quit: larryv_)
22:10 🔗 larryv has joined #archiveteam-bs
22:24 🔗 DigiDigi has quit IRC (Read error: Operation timed out)
22:41 🔗 godane latest scans : https://www.patreon.com/posts/digitize-for-06-38448509
22:45 🔗 godane has quit IRC (Quit: Leaving.)
22:51 🔗 DigiDigi has joined #archiveteam-bs
23:16 🔗 wyatt8740 has joined #archiveteam-bs
23:17 🔗 wyatt8750 has quit IRC (Ping timeout: 260 seconds)
23:31 🔗 wyatt8750 has joined #archiveteam-bs
23:32 🔗 wyatt8740 has quit IRC (Ping timeout: 260 seconds)
23:36 🔗 wyatt8750 has quit IRC (Read error: Operation timed out)
23:36 🔗 BlueMax has joined #archiveteam-bs
