#archiveteam-bs 2020-07-02,Thu

Time Nickname Message
00:31 🔗 kiska1825 has joined #archiveteam-bs
00:31 🔗 kiska1825 has quit IRC (Client Quit)
00:32 🔗 kiska1825 has joined #archiveteam-bs
00:32 🔗 Ryz has joined #archiveteam-bs
01:02 🔗 pew has quit IRC (Ping timeout: 265 seconds)
01:14 🔗 Ryz JAA, since archiving the forums of Kongregate is unlikely to be done with ArchiveBot, would this be an ArchiveTeam Warrior project?
01:14 🔗 pew has joined #archiveteam-bs
01:19 🔗 JAA Ryz: That or a qwarc task.
01:20 🔗 Ryz Isn't that website JS-filled, as per your reaction to the announcement link I pointed out? oo;
01:21 🔗 JAA There are a few other important DPoS projects stuck in dev for too long already, so I guess I might try with qwarc.
01:21 🔗 JAA Yep, it is.
01:21 🔗 JAA I haven't looked at it in detail yet at all.
01:21 🔗 JAA Ultimately, JS interactions can always be emulated. I've done that many times before with qwarc.
01:24 🔗 Ryz Fortunately, it isn't the whole forums of Kongregate...yet
01:32 🔗 JAA Meh, I'll just grab the whole thing I think. Too complicated to do it differently.
02:22 🔗 cerca has quit IRC (Remote host closed the connection)
02:28 🔗 OrIdow6 Two things that are going down, but that no one but Google Translate seems to be able to read:
02:28 🔗 OrIdow6 https://matome.naver.jp/ - per that message in #archiveteam, does seem to be going down at the end of September. It seems to be some Web 2.0 site, can't really tell what it is
02:30 🔗 OrIdow6 https://pclab.pl/ is being frozen, if not taken down; status of https://forum.pclab.pl/ is unclear
03:29 🔗 qw3rty__ has joined #archiveteam-bs
03:37 🔗 qw3rty_ has quit IRC (Read error: Operation timed out)
03:59 🔗 icedice2 has quit IRC (Quit: Leaving)
05:19 🔗 rklane has joined #archiveteam-bs
05:52 🔗 mgrandi has joined #archiveteam-bs
06:10 🔗 nicolas17 has quit IRC (Quit: sleep)
08:13 🔗 OrIdow6 So I'll ask that someone put https://pclab.pl/ and https://matome.naver.jp/ into AB; will see about the forums later
08:17 🔗 Ryz Hello OrIdow6, do you have a source for https://pclab.pl/ ? And https://matome.naver.jp/ besides the message posted on another chatroom?
08:18 🔗 OrIdow6 Ryz: No, both per those links in #archiveteam in the last few days
08:18 🔗 OrIdow6 https://pclab.pl/news84673.html and http://navermatome-official.blog.jp/archives/83259956.html
08:19 🔗 OrIdow6 Again, I'm not 100% sure, because I speak neither Polish nor Japanese, but machine translation makes it look like they're going down
08:22 🔗 Ryz I've run https://pclab.pl/ (but not the forums yet), but https://matome.naver.jp/ might be large
08:23 🔗 Ryz The former is processing right now via AB
08:23 🔗 Ryz OrIdow6 ^
08:24 🔗 OrIdow6 Ryz: How are you determining that it's too large?
08:24 🔗 OrIdow6 And thank you for pclab
08:25 🔗 Ryz It seems to be a social media website with blog capabilities
08:26 🔗 Ryz The pagination can go up to 50 pages, but I feel there's a lot more when you dig around the individual users' profiles
08:26 🔗 Ryz Pagination at least on the main page
08:27 🔗 JAA My Tigris grab is progressing nicely. It's nearly finished, just 15 projects are still retrieving things, mostly the largest ones with a lot of messages in the forums.
08:27 🔗 JAA Very strange that it's still online.
08:32 🔗 OrIdow6 Ryz: It looked more to me like it was a collection of user-created "search results", in the form of lists of links
08:33 🔗 JAA The Turiver grab is at 31k of 130k thread IDs.
08:34 🔗 JAA And Komixxy.pl is at almost a third of the posts now. There are a lot of warnings and errors to go through though because the site's pretty weird.
08:35 🔗 JAA Empty usernames and whatnot
08:39 🔗 OrIdow6 In any case, it's going to have to get archived at some time
09:48 🔗 fredgido has joined #archiveteam-bs
09:58 🔗 HP_Archiv has joined #archiveteam-bs
10:16 🔗 BlueMax has quit IRC (Quit: Leaving)
10:17 🔗 JAA Tigris just started timing out a couple minutes ago. This may be the end.
10:18 🔗 OrIdow6 How much did you get?
10:18 🔗 JAA Nevermind, it's back.
10:18 🔗 OrIdow6 Oh
10:19 🔗 JAA ¯\_(ツ)_/¯
10:19 🔗 JAA They are based in Georgia (the US state), so maybe there's a bit more time until they wake up and take it down.
10:20 🔗 OrIdow6 Could be
10:21 🔗 Ryz Looooooot :p
10:21 🔗 JAA I've been working on another grab of the CVS repos, which are only available through the web interface. The AB job I launched for those yesterday laturally blew up.
10:22 🔗 JAA naturally*
10:37 🔗 OrIdow6 Is there any difference between e.g. http://delphiexpert.tigris.org/source/browse/delphiexpert/src/UStringUtils.pas?view=markup&pathrev=MAIN http://delphiexpert.tigris.org/source/browse/*checkout*/delphiexpert/src/UStringUtils.pas?revision=1.2&pathrev=MAIN and http://delphiexpert.tigris.org/source/browse/*checkout*/delphiexpert/src/UStringUtils.pas?revision=1.2&content-type=text%2Fplain&pathrev=MAIN ?
10:38 🔗 HP_Archiv has quit IRC (Quit: Leaving)
10:41 🔗 OrIdow6 JAA
10:41 🔗 HP_Archiv has joined #archiveteam-bs
10:42 🔗 JAA OrIdow6: They all have the same file contents. Only the first file has the commit message and metadata (author, date) though.
10:43 🔗 JAA first link*
10:45 🔗 OrIdow6 Oh
10:47 🔗 OrIdow6 Looks like the list page has the metadata, though
10:48 🔗 JAA I think I've seen some that's only on the view=markup page.
10:50 🔗 OrIdow6 Oh
10:57 🔗 JAA By the way, hideattic=0 is important to catch deleted files.
11:17 🔗 JAA CVS web grab started for the 303 repos I'm aware of. I'm first grabbing the directory structure for all projects, then the most recent revision of every file for all projects, then the revision history.
11:24 🔗 JAA OrIdow6: Oh by the way, the pathrev is irrelevant for the /*checkout*/ URLs; I'm not sure why it's there, to be honest. I don't include it in my requests, which reduces the number of requests massively, especially when there are many branches that share a revision.
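The pathrev point above can be sketched roughly as follows: strip the parameter from /*checkout*/ URLs before queueing, so that branches sharing a revision collapse to a single request. This is only an illustration using the Python standard library, not the code used in the actual grab; the function name canonical_checkout_url is made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_checkout_url(url: str) -> str:
    """Drop the 'pathrev' query parameter from /*checkout*/ URLs.

    Hypothetical helper: per the discussion, pathrev does not affect the
    content served by /*checkout*/ URLs, so removing it lets URLs that
    differ only in pathrev deduplicate to one request.
    """
    parts = urlsplit(url)
    if "/*checkout*/" not in parts.path:
        # Other views (e.g. view=markup) are left untouched.
        return url
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "pathrev"
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    base = "http://delphiexpert.tigris.org/source/browse/*checkout*/delphiexpert/src/UStringUtils.pas"
    # Two URLs differing only in pathrev end up as one entry in the set.
    seen = {
        canonical_checkout_url(base + "?revision=1.2&pathrev=MAIN"),
        canonical_checkout_url(base + "?revision=1.2&pathrev=SOME_BRANCH"),
    }
    print(len(seen))
```

The branch name SOME_BRANCH is a placeholder, not a real branch of that project.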
11:25 🔗 JAA Oh shit, Tigris *sometimes* serves error pages with status 200. Eww.
11:28 🔗 JAA http://www.tigris.org/servlets/ErrorPage
11:30 🔗 JAA They're 302 redirects to that URL. Fortunately, I got only a handful of these on my pages crawl (requeueing), but the CVS web interface throws them a lot under load.
11:32 🔗 VerifiedJ has joined #archiveteam-bs
11:40 🔗 HP_Archiv has quit IRC (Quit: Leaving)
11:40 🔗 JAA I also got occasional odd redirects to the login form, which oddly enough appeared on the same three projects as the /ErrorPage ones.
11:42 🔗 JAA The CVS web interface is buggy as well: http://silvertejp.tigris.org/source/browse/silvertejp/maven2/silvertejp-limax/src/main/java/org/tigris/silvertejp/limax/util/.svn/ redirects to the login.
11:42 🔗 JAA (Yeah, someone committed a full SVN repository to CVS. lol)
11:48 🔗 JAA Restarted the CVS web grab with a fix for those error pages and login redirects.
11:52 🔗 JAA Reduced my concurrency a bit and am now getting a better throughput as well. :-)
11:52 🔗 JAA (And no more error page redirects)
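A check for the error page and login redirects described above might look something like this. It is a minimal sketch, not the actual fix in the grab: the /servlets/ErrorPage path comes from the URL quoted in the log, while the "login" substring match is an assumption, since the login form's URL is not given here. A response flagged by it would be requeued rather than stored as the real page.

```python
from typing import Optional
from urllib.parse import urlsplit

# Path of the error page that the 302 redirects point at (from the log).
ERROR_PAGE_PATH = "/servlets/ErrorPage"
# Assumption: the login form's path contains "login"; the real URL is not
# stated in the log, so adjust this to whatever the site actually uses.
LOGIN_PATH_HINT = "login"

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def is_soft_failure(status: int, location: Optional[str]) -> bool:
    """Return True if a response is a redirect to the error page or login form.

    Hypothetical helper: 'status' is the HTTP status code and 'location' the
    value of the Location header (None if absent).
    """
    if status not in REDIRECT_STATUSES or not location:
        return False
    path = urlsplit(location).path
    return path == ERROR_PAGE_PATH or LOGIN_PATH_HINT in path.lower()


if __name__ == "__main__":
    print(is_soft_failure(302, "http://www.tigris.org/servlets/ErrorPage"))
    print(is_soft_failure(200, None))
```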
13:06 🔗 JAA Seems like the login redirects may be cookie-based in some cases. Seriously, fuck this site.
13:12 🔗 VerifiedJ has quit IRC (Quit: Leaving)
13:16 🔗 OrIdow6 Sounds familiar
13:19 🔗 JAA Initial directory structure scan is done except for a few errors, about 117k files exist in total.
13:20 🔗 OrIdow6 I assume you're going to hammer it with Qwarc now
13:23 🔗 JAA Only a little bit. Can't go too fast because their servers are falling over immediately if it's more than a light breeze.
13:25 🔗 JAA Looks like the request to any directory called ".svn" in silvertejp is broken, and sometimes that then causes all further requests for other things to also fall into the login trap.
13:31 🔗 OrIdow6 Depending on how the website works, it could help your max speed to be parallel across different projects/subdomains, instead of within each
13:31 🔗 OrIdow6 I.e., depending on what resource it is that's being used up
13:33 🔗 JAA Errors seem to be random, and it's all in the same CVS repository anyway (each project is a "module" in CVS terms).
13:33 🔗 OrIdow6 Oh
13:46 🔗 JAA More issues. Sigh...
13:49 🔗 benjins has joined #archiveteam-bs
13:50 🔗 OrIdow6 I was thinking of suggesting stopping the AB job to get you more resources, but at this rate, if the site shuts down soon, it may end get the most
13:50 🔗 Terbium_ has joined #archiveteam-bs
13:50 🔗 OrIdow6 *getting
13:51 🔗 maxfan8_ has joined #archiveteam-bs
13:52 🔗 OrIdow6 Disadvantage of a structured grab
13:52 🔗 JAA Yeah, but the AB grab will be incomplete as well due to those error and login redirects.
13:52 🔗 Maylay_ has joined #archiveteam-bs
13:52 🔗 Maylay_ has quit IRC (Remote host closed the connection!)
13:52 🔗 JAA My grab's stuck in a loop at the moment. Never encountered this before.
13:52 🔗 Maylay_ has joined #archiveteam-bs
13:58 🔗 JAA Yup, that's a bug in qwarc. Heh
14:02 🔗 Terbium has quit IRC (se.hub efnet.deic.eu)
14:02 🔗 benjinsmi has quit IRC (se.hub efnet.deic.eu)
14:02 🔗 Maylay has quit IRC (se.hub efnet.deic.eu)
14:02 🔗 maxfan8 has quit IRC (se.hub efnet.deic.eu)
14:20 🔗 Ryz has quit IRC (Remote host closed the connection)
14:20 🔗 kiska1825 has quit IRC (Remote host closed the connection)
14:21 🔗 kiska1825 has joined #archiveteam-bs
14:21 🔗 Ryz has joined #archiveteam-bs
14:23 🔗 Raccoon has quit IRC (Ping timeout: 272 seconds)
14:54 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
15:08 🔗 lennier1 has quit IRC (Ping timeout: 265 seconds)
15:09 🔗 lennier1 has joined #archiveteam-bs
16:48 🔗 Raccoon has joined #archiveteam-bs
17:11 🔗 mgrandi has quit IRC (Leaving)
17:19 🔗 rklane has quit IRC (Quit: This computer has gone to sleep)
17:24 🔗 tobbez has joined #archiveteam-bs
18:10 🔗 nicolas17 has joined #archiveteam-bs
18:25 🔗 JAA Turiver: about one third done
18:25 🔗 JAA Komixxy: a bit under halfway done with posts, 170k+ users discovered so far
18:25 🔗 JAA Tigris: mess
19:38 🔗 Ryz has quit IRC (Remote host closed the connection)
19:38 🔗 kiska1825 has quit IRC (Remote host closed the connection)
19:39 🔗 kiska1825 has joined #archiveteam-bs
19:39 🔗 Ryz has joined #archiveteam-bs
19:49 🔗 JAA Six projects are still running on my Tigris website crawl.
19:50 🔗 qwebirc91 has joined #archiveteam-bs
19:51 🔗 qwebirc91 has quit IRC (Client Quit)
19:51 🔗 JAA And my CVS web grab is nearly done with the current revision retrieval.
19:51 🔗 qwebirc31 has joined #archiveteam-bs
19:51 🔗 qwebirc31 has quit IRC (Client Quit)
19:51 🔗 Nikchemny has joined #archiveteam-bs
19:52 🔗 Nikchemny JAA: Sorry, I didn't edit wikipage today because I was busy
19:54 🔗 JAA Nikchemny: No worries, I've been busy with other things anyway.
19:55 🔗 Nikchemny JAA: Btw, I've saved article from news.mail.ru with chromebot and archivebot. I started these with my phone
20:22 🔗 Nikchemny JAA: This article: https://news.mail.ru/politics/42417368/
20:22 🔗 JAA Mhm
20:31 🔗 JAA Correction on the Tigris CVS progress: 96k files done, 21.5k remaining; it has started with the file revisions at 23k done, 73k remaining.
21:47 🔗 robogoat has quit IRC (Ping timeout: 745 seconds)
21:55 🔗 robogoat has joined #archiveteam-bs
22:00 🔗 Nikchemny has quit IRC (Quit: Page closed)
22:06 🔗 nicolas17 https://lists.openstreetmap.org/pipermail/dev/2020-July/030958.html OpenStreetMap Trac and SVN going away
22:17 🔗 JAA Only being made read-only for now according to that.
22:18 🔗 nicolas17 ah right
22:18 🔗 nicolas17 put it at the bottom of the list :)
22:42 🔗 OrIdow6 has quit IRC (Ping timeout: 265 seconds)
22:46 🔗 OrIdow6 has joined #archiveteam-bs
22:55 🔗 godane https://www.reddit.com/r/DataHoarder/comments/hk64ye/americas_network_magazines_scans_and_other_rare/
23:16 🔗 Arcorann has joined #archiveteam-bs
23:24 🔗 OrIdow6 has quit IRC (Ping timeout: 265 seconds)
23:30 🔗 JAA Four projects remaining on the Tigris website crawl: argouml, argouml-stats, subversion, and tortoisesvn. All just recursing through the discussion forums.
23:32 🔗 JAA 16k files and 44k file revisions remaining on the CVS web crawl. (Each of the files will further queue a file revisions item later, so that's really 60k file revs remaining.)
23:38 🔗 JAA SketchCow: FOS /2 is nearly full.
23:39 🔗 kiska1825 has quit IRC (Remote host closed the connection)
23:39 🔗 Ryz has quit IRC (Remote host closed the connection)
23:39 🔗 JAA ~8 TiB of it is AB.
23:39 🔗 kiska1825 has joined #archiveteam-bs
23:39 🔗 Ryz has joined #archiveteam-bs
23:58 🔗 godane can anyone else here archive this : https://www.twitch.tv/reckful/videos?filter=archives&sort=time
23:59 🔗 godane https://www.reddit.com/r/DataHoarder/comments/hk66tb/twitch_streamer_reckful_died_preserving_the/
23:59 🔗 BlueMax has joined #archiveteam-bs
