[00:13] SketchCow: here is a japanese automatic cappuccino machine manual from DeLonghi: https://archive.org/details/japanese-manual-133000
[00:16] *** ivan has joined #archiveteam-bs
[00:25] *** ranma has joined #archiveteam-bs
[00:28] did you scan yourself godane ?
[00:31] Frogging: discourse is getting some crawler view improvements because ie11 is going to start getting the no-js view pretty soon
[00:32] riking_: What's the deal with ie11?
[00:32] can't use css variables and a bunch of other nice things
[00:32] *css custom properties
[00:33] and why is this going to lead to improvements to Wayback's display of Discourse?
[00:33] *** ranma_ has quit IRC (Ping timeout: 745 seconds)
[00:34] I'm not arguing, I'm just curious how they are related
[00:34] oh, the wayback view is still broken, as of now, but hopefully it can get swung so that whatever error breaks the replay triggers that code
[00:34] and so the JS yanks the noscript content into view
[00:35] Oh you mean the Discourse devs might do that to fix IE11 and then it might improve the Wayback experience as a side effect
[00:36] mhm
[00:36] ah, neat
[00:36] also, the noscript view is getting styling improvements because people with actual browsers are going to be seeing it pretty soon
[00:37] that "actual browser" being ie11
[00:38] oh hey it seems to be working now? for the homepage http://web.archive.org/web/20200514003833/https://meta.discourse.org/
[00:39] topic view still broken
[00:40] "InternalError: too much recursion"
[00:40] mm
[00:41] Wonder if one could make a browser side script to tweak it into displaying properly
[00:41] Probably would be a tedious endeavour
[00:41] Uncaught SyntaxError: Invalid regular expression: /^[^A-Za-zÀ-ÖØ-öø-ʸ̀-֐ࠀ-῿Ⰰ-﬜﷾-﹯﻽-￿]*[֑-߿יִ-﷽ﹰ-ﻼ]/: Range out of order in character class
[00:42] I wonder if WebAssembly proliferation will worsen all of this kind of thing
[00:43] Frogging: here's your browser side script, `window.unsupportedBrowser=true;`
[00:44] There are some programmers who would like to see the DOM be a thing of the past. *shudder*. Anyway, -ot
[00:44] riking_: oh, that's it? Really?
[00:44] Does that activate the JS that deactivates the JS? :P
[00:46] yes
[00:47] https://usercontent.irccloud-cdn.com/file/0rRjAgpf/image.png
[00:48] works!
[00:49] ... i have a dirty trick in mind
[00:52] Frogging: how terrible of an idea is this commit https://github.com/discourse/discourse/compare/master...riking:war-replay?expand=1
[00:55] (answer: very, it'll break for any other archiving site)
[01:01] oh, can test for window.__wbhack or window.__wm instead
[01:03] this is still a bad idea.
[01:22] i'm thinking of actually landing that patch, if anyone wants to talk me out of it you have 24 hours
[01:31] *** RichardG_ has joined #archiveteam-bs
[01:31] *** RichardG has quit IRC (Read error: Connection reset by peer)
[01:48] *** godane has quit IRC (Ping timeout: 272 seconds)
[02:25] Yeah, I don't think that's a good idea.
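For reference, the trick discussed above (detect a replay environment, then set the fallback flag) might look roughly like the sketch below. It is not the contents of the linked commit. The only names taken from the chat are `window.unsupportedBrowser`, `window.__wbhack`, and `window.__wm`; the function names are made up for illustration, and whether the flag has to be set before Discourse's own scripts run is not stated in the log.

```javascript
// Sketch only: guess whether the page is running inside a Wayback-style replay.
// `__wbhack` and `__wm` are the globals named in the chat; their contents are not assumed.
function looksLikeWaybackReplay(win) {
  return Boolean(win) && ("__wbhack" in win || "__wm" in win);
}

// Hypothetical helper: set the flag from the chat so the page falls back
// to its noscript content. Returns whether the flag ends up set.
function forceNoscriptFallback(win) {
  if (looksLikeWaybackReplay(win)) {
    win.unsupportedBrowser = true;
  }
  return win.unsupportedBrowser === true;
}
```

In a browser this would be called as `forceNoscriptFallback(window)`. As noted in the chat, checking for one archive's globals does nothing for other archiving sites.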
[03:24] *** qw3rty_ has joined #archiveteam-bs
[03:31] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[03:32] *** qw3rty has quit IRC (Read error: Operation timed out)
[03:39] *** Walk has quit IRC (Read error: Operation timed out)
[03:49] *** HP_Archiv has quit IRC (Quit: Leaving)
[04:09] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[04:34] *** DopefishJ has quit IRC (Remote host closed the connection)
[04:39] *** DFJustin has joined #archiveteam-bs
[04:41] *** duh has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[04:42] *** RichardG_ has quit IRC (Read error: Connection reset by peer)
[04:42] *** RichardG has joined #archiveteam-bs
[04:42] *** legoktm has joined #archiveteam-bs
[06:09] *** Ctrl has quit IRC (Read error: Operation timed out)
[06:16] *** lennier1 has quit IRC (Ping timeout: 615 seconds)
[06:19] *** HP_Archiv has joined #archiveteam-bs
[06:33] *** RichardG_ has joined #archiveteam-bs
[06:33] *** RichardG has quit IRC (Read error: Connection reset by peer)
[07:06] *** DopefishJ has joined #archiveteam-bs
[07:10] *** HP_Archiv has quit IRC (Ping timeout: 610 seconds)
[07:11] *** HP_Archiv has joined #archiveteam-bs
[07:17] *** DFJustin has quit IRC (Read error: Connection timed out)
[07:45] *** Lord_Nigh has quit IRC (Quit: ZNC - http://znc.in)
[07:46] *** Lord_Nigh has joined #archiveteam-bs
[07:49] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[07:51] *** Lord_Nigh has joined #archiveteam-bs
[07:53] *** qw3rty_ has quit IRC (Read error: Operation timed out)
[08:29] *** BlueMax has joined #archiveteam-bs
[11:26] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:27] *** BlueMax has joined #archiveteam-bs
[11:32] *** HP_Archiv has quit IRC (Quit: Leaving)
[11:32] *** HP_Archiv has joined #archiveteam-bs
[11:37] *** kiska has quit IRC (Remote host closed the connection)
[11:37] *** kiska has joined #archiveteam-bs
[11:42] *** MaximeleG has joined #archiveteam-bs
[12:18] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[12:27] *** dxrt has joined #archiveteam-bs
[12:27] *** Iglooop1 sets mode: +o dxrt
[12:29] *** svchfoo1 has joined #archiveteam-bs
[12:41] *** svchfoo1 has left
[12:55] *** DLoader_ has joined #archiveteam-bs
[13:07] *** DLoader has quit IRC (Ping timeout: 745 seconds)
[13:07] *** DLoader_ is now known as DLoader
[13:58] *** RichardG_ has quit IRC (Ping timeout: 260 seconds)
[14:00] *** RichardG has joined #archiveteam-bs
[14:00] *** vitzli has joined #archiveteam-bs
[14:04] *** Ryz has quit IRC (Remote host closed the connection)
[14:04] *** kiska1825 has quit IRC (Remote host closed the connection)
[14:05] *** Ryz has joined #archiveteam-bs
[14:08] *** fredgido has joined #archiveteam-bs
[14:09] *** RichardG has quit IRC (Read error: Connection reset by peer)
[14:09] *** RichardG has joined #archiveteam-bs
[14:09] *** vitzli has quit IRC (Leaving)
[14:19] *** no1dead has joined #archiveteam-bs
[14:19] *** Arcorann has joined #archiveteam-bs
[14:21] There are dumps of Citizendium linked at http://en.citizendium.org/wiki/CZ:Downloads but according to the HTTP headers, those are from 2017.
[14:22] It was dumped with the WikiTeam tools in February: https://archive.org/download/wiki-encitizendiumorg
[14:23] February is basically current given how dead the site is
[14:24] *** RichardG_ has joined #archiveteam-bs
[14:24] We could also throw it into ArchiveBot to get it into the Wayback Machine, but it's definitely quite large.
[14:24] *** RichardG has quit IRC (Read error: Operation timed out)
[14:26] http://en.citizendium.org/wiki?title=Special:RecentChanges&limit=500&days=180 <-- for reference on how dead the site is
[14:32] I've thrown Heroes Wiki into ArchiveBot and forwarded it to #wikiteam.
[14:33] Citizendium on ArchiveBot would blow up to millions of URLs unless we exclude past revisions and outlinks.
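To illustrate the kind of exclusion meant here, a rough sketch of a URL filter follows. It is not ArchiveBot's actual ignore syntax or configuration. It assumes the wiki exposes old revisions through the usual MediaWiki query parameters (`oldid`, `diff`, `action=history`); the function names and the example page title in the usage note are hypothetical.

```javascript
// Sketch only: true if the URL points at an old revision, a diff, or a history page,
// based on the standard MediaWiki query parameters.
function isPastRevisionUrl(url) {
  const params = new URL(url).searchParams;
  return params.has("oldid") || params.has("diff") || params.get("action") === "history";
}

// True if the URL leaves the site being archived.
function isOutlink(url, siteHost) {
  return new URL(url).hostname !== siteHost;
}

// Hypothetical crawl filter combining both exclusions.
function shouldSkip(url, siteHost) {
  return isOutlink(url, siteHost) || isPastRevisionUrl(url);
}
```

With `siteHost` set to `en.citizendium.org`, a URL such as `http://en.citizendium.org/wiki?title=Biology&oldid=12345` (made-up title and revision id) would be skipped, while the RecentChanges URL above would be kept.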
[14:40] no1dead: I assume that https://de.heroeswiki.com/ and https://fr.heroeswiki.com/ are also affected?
[14:42] Yeah I'm assuming the entire site's going down
[14:43] Since he doesn't sound like he's going to be keeping it up
[14:49] *** Datechnom has quit IRC (Read error: Operation timed out)
[15:03] *** abstract has quit IRC (Quit: *.banana *.split)
[15:05] *** Datechnom has joined #archiveteam-bs
[15:11] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[15:21] *** Guest has joined #archiveteam-bs
[15:47] *** godane has joined #archiveteam-bs
[16:03] *** DogsRNice has joined #archiveteam-bs
[16:56] *** RichardG_ is now known as RichardG
[16:59] *** no1dead has quit IRC (Quit: Connection closed for inactivity)
[17:10] *** lennier1 has joined #archiveteam-bs
[17:54] *** godane has quit IRC (Read error: Connection reset by peer)
[18:19] *** RichardG_ has joined #archiveteam-bs
[18:19] *** RichardG has quit IRC (Read error: Connection reset by peer)
[18:30] *** RichardG_ is now known as RichardG
[18:44] *** lennier2 has joined #archiveteam-bs
[18:46] *** lennier1 has quit IRC (Ping timeout: 255 seconds)
[18:46] *** lennier2 is now known as lennier1
[18:55] *** RichardG_ has joined #archiveteam-bs
[18:55] *** RichardG has quit IRC (Read error: Connection reset by peer)
[19:01] *** MaximeleG has quit IRC (Quit: MaximeleG)
[19:52] JAA: Can ArchiveBot handle something of the size of myfigurecollection.net?
[19:54] *** godane has joined #archiveteam-bs
[20:07] Radtoo: It can handle things up to dozens of millions of URLs, but it takes a very long time and is quite inefficient.
[20:34] JAA: Hm. Is there some better method? Currently, asking a bot to archive it in *some* way seems preferable to not doing it.
[20:36] Is it shutting down?
[20:37] JAA: No, just periodically losing content.
[20:38] JAA: Also doesn't seem to be archived yet
[20:38] Hmm
[21:19] *** RichardG has joined #archiveteam-bs
[21:19] *** RichardG_ has quit IRC (Read error: Connection reset by peer)
[21:40] JAA: Back from dinner. Well, more accurately I don't see any archived crawl that is anywhere near complete. The front page has been hit by archive.org and so on a few times, but gallery/ (mostly CC licensed so it surely wasn't purged later), user profile/, club/ and other bits seem generally not archived. What would happen if the crawl ended up being big? Does it block a lot of other crawls on the bot?
[21:42] Radtoo: Well, it would block a pipeline slot for months and eat up a lot of CPU time on the pipeline.
[21:53] Also, pagination is limited to 500 pages, so it might not discover everything. The images on the figure page (beyond the first one) are loaded over JS and wouldn't be retrieved by AB. Search params from the picture search are passed on to the picture page URL, so those would be grabbed multiple times.
[21:54] This needs a different approach.
[21:54] Hm. High CPU usage? Is it putting all into a single archive? Either way, I *generally* think it's worthwhile to crawl since this one is fairly information dense & has lots of CC-BY work plus various secondary sources of information integrated (event industry and so on news)
[21:54] I think the image filenames are static, it grabs them multiple times?
[22:12] High CPU usage because the tool behind ArchiveBot (wpull) isn't very efficient for such large crawls.
[22:14] Not the images themselves, but the pages showing them. For example, the first link on https://myfigurecollection.net/pictures.php?itemId=546829 is https://myfigurecollection.net/picture/2240529&context%5B%5D=itemId%3A546829 and when you filter for figures additionally, the same page is linked as https://myfigurecollection.net/picture/2240529&context%5B%5D=categoryId%3A1&context%5B%5D=itemId%3A546829
[22:14] etc.
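The duplication described above could be handled in a purpose-built crawler by collapsing the `context` variants to one key before queueing. The sketch below is only that, a sketch: it assumes the `&context%5B%5D=...` suffixes do not change which picture page is shown, which the log suggests but does not confirm, and the function names are hypothetical.

```javascript
// Sketch only: drop every "&context[]=..." fragment (percent-encoded or not)
// so that all variants of a picture page share one key.
function stripContext(url) {
  return url.replace(/&context(?:%5B%5D|\[\])=[^&#]*/gi, "");
}

// Hypothetical helper: reduce a list of discovered URLs to unique picture pages.
function dedupePictureUrls(urls) {
  return [...new Set(urls.map(stripContext))];
}
```

Applied to the two example links above, both reduce to `https://myfigurecollection.net/picture/2240529`, so the page would be queued once instead of once per filter combination.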
[22:15] I don't disagree with archiving such sites, but AB is probably not quite the right tool for it.
[22:18] Ah, good to know. Are there any tools that are very good with this kind of website that might produce a WARC or such?
[22:31] *** RichardG_ has joined #archiveteam-bs
[22:31] *** RichardG has quit IRC (Read error: Connection reset by peer)
[22:46] *** bsmith093 has quit IRC (Ping timeout: 265 seconds)
[22:47] *** Arcorann has joined #archiveteam-bs
[22:47] *** bsmith093 has joined #archiveteam-bs
[22:47] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[22:48] *** Arcorann has joined #archiveteam-bs
[23:19] *** lennier2 has joined #archiveteam-bs
[23:23] *** lennier1 has quit IRC (Ping timeout: 265 seconds)
[23:23] *** lennier2 is now known as lennier1