#archiveteam-bs 2020-05-14,Thu


Time Nickname Message
00:13 πŸ”— godane SketchCow: here is a japanese automatic cappuccino machine manual from DeLonghi: https://archive.org/details/japanese-manual-133000
00:16 πŸ”— ivan has joined #archiveteam-bs
00:25 πŸ”— ranma has joined #archiveteam-bs
00:28 πŸ”— Walk did you scan it yourself, godane?
00:31 πŸ”— riking_ Frogging: discourse is getting some crawler view improvements because ie11 is going to start getting the no-js view pretty soon
00:32 πŸ”— Frogging riking_: What's the deal with ie11?
00:32 πŸ”— riking_ can't use css variables and a bunch of other nice things
00:32 πŸ”— riking_ *css custom properties
00:33 πŸ”— Frogging and why is this going to lead to improvements to Wayback's display of Discourse?
00:33 πŸ”— ranma_ has quit IRC (Ping timeout: 745 seconds)
00:34 πŸ”— Frogging I'm not arguing, I'm just curious how they are related
00:34 πŸ”— riking_ oh, the wayback view is still broken, as of now, but hopefully it can get swung so that whatever error breaks the replay triggers that code
00:34 πŸ”— riking_ and so the JS yanks the noscript content into view
00:35 πŸ”— Frogging Oh you mean the Discourse devs might do that to fix IE11 and then it might improve the Wayback experience as a side effect
00:36 πŸ”— riking_ mhm
00:36 πŸ”— Frogging ah, neat
00:36 πŸ”— riking_ also, the noscript view is getting styling improvements because people with actual browsers are going to be seeing it pretty soon
00:37 πŸ”— riking_ that "actual browser" being ie11
00:38 πŸ”— riking_ oh hey it seems to be working now? for the homepage http://web.archive.org/web/20200514003833/https://meta.discourse.org/
00:39 πŸ”— riking_ topic view still broken
00:40 πŸ”— riking_ "InternalError: too much recursion"
00:40 πŸ”— Frogging mm
00:41 πŸ”— Frogging Wonder if one could make a browser side script to tweak it into displaying properly
00:41 πŸ”— Frogging Probably would be a tedious endeavour
00:41 πŸ”— riking_ Uncaught SyntaxError: Invalid regular expression: /^[^A-Za-zÀ-Γƒβ€“ΓƒΛœ-âø-ΓŠΒΈΓŒβ‚¬-֐ࠀ-Ñ¿¿Ò°€-Γ―Β¬Ε“Γ―Β·ΒΎ-﹯﻽-Γ―ΒΏΒΏ]*[Γ–β€˜-߿יִ-Γ―Β·Β½Γ―ΒΉΒ°-ﻼ]/: Range out of order in character class
00:42 πŸ”— Frogging I wonder if WebAssembly proliferation will worsen all of this kind of thing
00:43 πŸ”— riking_ Frogging: here's your browser side script, `window.unsupportedBrowser=true;`
00:44 πŸ”— Frogging There are some programmers who would like to see the DOM be a thing of the past. *shudder*. Anyway, -ot
00:44 πŸ”— Frogging riking_: oh, that's it? Really?
00:44 πŸ”— Frogging Does that activate the JS that deactivates the JS? :P
00:46 πŸ”— riking_ yes
00:47 πŸ”— riking_ https://usercontent.irccloud-cdn.com/file/0rRjAgpf/image.png
00:48 πŸ”— riking_ works!
00:49 πŸ”— riking_ ... i have a dirty trick in mind
00:52 πŸ”— riking_ Frogging: how terrible of an idea is this commit https://github.com/discourse/discourse/compare/master...riking:war-replay?expand=1
00:55 πŸ”— riking_ (answer: very, it'll break for any other archiving site)
01:01 πŸ”— riking_ oh, can test for window.__wbhack or window.__wm instead
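A minimal sketch of the idea discussed above: check for the replay globals riking_ mentions (`window.__wm` or `window.__wbhack`) and, if one is present, set `window.unsupportedBrowser` so the noscript content is shown. The function names are hypothetical, and whether these globals are reliably defined during Wayback replay is an assumption taken from the chat, not something verified here. This is not the actual Discourse patch.

```javascript
// Hypothetical helper names; only __wm, __wbhack and unsupportedBrowser
// come from the conversation above.

// Returns true if the given window-like object carries one of the
// globals that (per the chat) Wayback Machine replay injects.
function isWaybackReplay(win) {
  return Boolean(win) &&
    (typeof win.__wm !== "undefined" || typeof win.__wbhack !== "undefined");
}

// Sets the flag that, per the chat, makes Discourse fall back to the
// noscript view. Returns whether the flag was set.
function forceNoscriptViewInReplay(win) {
  if (isWaybackReplay(win)) {
    win.unsupportedBrowser = true;
    return true;
  }
  return false;
}

// In a browser, apply it to the real window; outside a browser this is a no-op.
if (typeof window !== "undefined") {
  forceNoscriptViewInReplay(window);
}
```

As riking_ notes, a check like this only helps on the Wayback Machine; other replay tools would need their own detection.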
01:03 πŸ”— riking_ this is still a bad idea.
01:22 πŸ”— riking_ i'm thinking of actually landing that patch, if anyone wants to talk me out of it you have 24 hours
01:31 πŸ”— RichardG_ has joined #archiveteam-bs
01:31 πŸ”— RichardG has quit IRC (Read error: Connection reset by peer)
01:48 πŸ”— godane has quit IRC (Ping timeout: 272 seconds)
02:25 πŸ”— JAA Yeah, I don't think that's a good idea.
03:24 πŸ”— qw3rty_ has joined #archiveteam-bs
03:31 πŸ”— DogsRNice has quit IRC (Read error: Connection reset by peer)
03:32 πŸ”— qw3rty has quit IRC (Read error: Operation timed out)
03:39 πŸ”— Walk has quit IRC (Read error: Operation timed out)
03:49 πŸ”— HP_Archiv has quit IRC (Quit: Leaving)
04:09 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
04:34 πŸ”— DopefishJ has quit IRC (Remote host closed the connection)
04:39 πŸ”— DFJustin has joined #archiveteam-bs
04:41 πŸ”— duh has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
04:42 πŸ”— RichardG_ has quit IRC (Read error: Connection reset by peer)
04:42 πŸ”— RichardG has joined #archiveteam-bs
04:42 πŸ”— legoktm has joined #archiveteam-bs
06:09 πŸ”— Ctrl has quit IRC (Read error: Operation timed out)
06:16 πŸ”— lennier1 has quit IRC (Ping timeout: 615 seconds)
06:19 πŸ”— HP_Archiv has joined #archiveteam-bs
06:33 πŸ”— RichardG_ has joined #archiveteam-bs
06:33 πŸ”— RichardG has quit IRC (Read error: Connection reset by peer)
07:06 πŸ”— DopefishJ has joined #archiveteam-bs
07:10 πŸ”— HP_Archiv has quit IRC (Ping timeout: 610 seconds)
07:11 πŸ”— HP_Archiv has joined #archiveteam-bs
07:17 πŸ”— DFJustin has quit IRC (Read error: Connection timed out)
07:45 πŸ”— Lord_Nigh has quit IRC (Quit: ZNC - http://znc.in)
07:46 πŸ”— Lord_Nigh has joined #archiveteam-bs
07:49 πŸ”— Lord_Nigh has quit IRC (Read error: Operation timed out)
07:51 πŸ”— Lord_Nigh has joined #archiveteam-bs
07:53 πŸ”— qw3rty_ has quit IRC (Read error: Operation timed out)
08:29 πŸ”— BlueMax has joined #archiveteam-bs
11:26 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
11:27 πŸ”— BlueMax has joined #archiveteam-bs
11:32 πŸ”— HP_Archiv has quit IRC (Quit: Leaving)
11:32 πŸ”— HP_Archiv has joined #archiveteam-bs
11:37 πŸ”— kiska has quit IRC (Remote host closed the connection)
11:37 πŸ”— kiska has joined #archiveteam-bs
11:42 πŸ”— MaximeleG has joined #archiveteam-bs
12:18 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
12:27 πŸ”— dxrt has joined #archiveteam-bs
12:27 πŸ”— Iglooop1 sets mode: +o dxrt
12:29 πŸ”— svchfoo1 has joined #archiveteam-bs
12:41 πŸ”— svchfoo1 has left
12:55 πŸ”— DLoader_ has joined #archiveteam-bs
13:07 πŸ”— DLoader has quit IRC (Ping timeout: 745 seconds)
13:07 πŸ”— DLoader_ is now known as DLoader
13:58 πŸ”— RichardG_ has quit IRC (Ping timeout: 260 seconds)
14:00 πŸ”— RichardG has joined #archiveteam-bs
14:00 πŸ”— vitzli has joined #archiveteam-bs
14:04 πŸ”— Ryz has quit IRC (Remote host closed the connection)
14:04 πŸ”— kiska1825 has quit IRC (Remote host closed the connection)
14:05 πŸ”— Ryz has joined #archiveteam-bs
14:08 πŸ”— fredgido has joined #archiveteam-bs
14:09 πŸ”— RichardG has quit IRC (Read error: Connection reset by peer)
14:09 πŸ”— RichardG has joined #archiveteam-bs
14:09 πŸ”— vitzli has quit IRC (Leaving)
14:19 πŸ”— no1dead has joined #archiveteam-bs
14:19 πŸ”— Arcorann has joined #archiveteam-bs
14:21 πŸ”— JAA There are dumps of Citizendium linked at http://en.citizendium.org/wiki/CZ:Downloads but according to the HTTP headers, those are from 2017.
14:22 πŸ”— JAA It was dumped with the WikiTeam tools in February: https://archive.org/download/wiki-encitizendiumorg
14:23 πŸ”— Arcorann February is basically current given how dead the site is
14:24 πŸ”— RichardG_ has joined #archiveteam-bs
14:24 πŸ”— JAA We could also throw it into ArchiveBot to get it into the Wayback Machine, but it's definitely quite large.
14:24 πŸ”— RichardG has quit IRC (Read error: Operation timed out)
14:26 πŸ”— Arcorann http://en.citizendium.org/wiki?title=Special:RecentChanges&limit=500&days=180 <-- for reference on how dead the site is
14:32 πŸ”— JAA I've thrown Heroes Wiki into ArchiveBot and forwarded it to #wikiteam.
14:33 πŸ”— JAA Citizendium on ArchiveBot would blow up to millions of URLs unless we exclude past revisions and outlinks.
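To illustrate what excluding past revisions and outlinks could look like, here is a rough sketch of a URL filter. It is not ArchiveBot's ignore syntax; `shouldCrawl` is a hypothetical helper, and it assumes past revisions are reachable through the standard MediaWiki query parameters `oldid`, `diff` and `action=history`.

```javascript
// Hypothetical filter: decide whether a discovered URL is worth queueing.
// siteHost is the hostname of the wiki being crawled.
function shouldCrawl(rawUrl, siteHost) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }

  // Outlinks: anything on a different host is skipped.
  if (url.hostname !== siteHost) return false;

  // Past revisions and diffs: standard MediaWiki query parameters.
  const params = url.searchParams;
  if (params.has("oldid") || params.has("diff")) return false;
  if (params.get("action") === "history") return false;

  return true;
}

const host = "en.citizendium.org";
console.log(shouldCrawl("http://en.citizendium.org/wiki/CZ:Downloads", host));
console.log(shouldCrawl("http://en.citizendium.org/wiki?title=Example&oldid=123", host));
```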
14:40 πŸ”— JAA no1dead: I assume that https://de.heroeswiki.com/ and https://fr.heroeswiki.com/ are also affected?
14:42 πŸ”— no1dead Yeah I'm assuming the entire site's going down
14:43 πŸ”— no1dead Since he doesn't sound like he's going to be keeping it up
14:49 πŸ”— Datechnom has quit IRC (Read error: Operation timed out)
15:03 πŸ”— abstract has quit IRC (Quit: *.banana *.split)
15:05 πŸ”— Datechnom has joined #archiveteam-bs
15:11 πŸ”— Arcorann has quit IRC (Read error: Connection reset by peer)
15:21 πŸ”— Guest has joined #archiveteam-bs
15:47 πŸ”— godane has joined #archiveteam-bs
16:03 πŸ”— DogsRNice has joined #archiveteam-bs
16:56 πŸ”— RichardG_ is now known as RichardG
16:59 πŸ”— no1dead has quit IRC (Quit: Connection closed for inactivity)
17:10 πŸ”— lennier1 has joined #archiveteam-bs
17:54 πŸ”— godane has quit IRC (Read error: Connection reset by peer)
18:19 πŸ”— RichardG_ has joined #archiveteam-bs
18:19 πŸ”— RichardG has quit IRC (Read error: Connection reset by peer)
18:30 πŸ”— RichardG_ is now known as RichardG
18:44 πŸ”— lennier2 has joined #archiveteam-bs
18:46 πŸ”— lennier1 has quit IRC (Ping timeout: 255 seconds)
18:46 πŸ”— lennier2 is now known as lennier1
18:55 πŸ”— RichardG_ has joined #archiveteam-bs
18:55 πŸ”— RichardG has quit IRC (Read error: Connection reset by peer)
19:01 πŸ”— MaximeleG has quit IRC (Quit: MaximeleG)
19:52 πŸ”— Radtoo JAA: Can ArchiveBot handle something of the size of myfigurecollection.net?
19:54 πŸ”— godane has joined #archiveteam-bs
20:07 πŸ”— JAA Radtoo: It can handle things up to dozens of millions of URLs, but it takes a very long time and is quite inefficient.
20:34 πŸ”— Radtoo JAA: Hm. Is there some better method? Currently (asking a bot) to archive it in *some* way seems preferable to not doing it.
20:36 πŸ”— JAA Is it shutting down?
20:37 πŸ”— Radtoo JAA: No, just periodically losing content.
20:38 πŸ”— Radtoo JAA: Also doesn't seem to be archived yet
20:38 πŸ”— JAA Hmm
21:19 πŸ”— RichardG has joined #archiveteam-bs
21:19 πŸ”— RichardG_ has quit IRC (Read error: Connection reset by peer)
21:40 πŸ”— Radtoo JAA: Back from dinner. Well, more accurately I don't see any archived crawl that's anywhere near complete. The front page has been hit by archive.org and so on a few times, but gallery/ (mostly CC licensed so it surely wasn't purged later), user profile/ club/ and other bits seem generally not archived. What would happen if the crawl ended up being big? Does it block a lot of other crawls on the bot?
21:42 πŸ”— JAA Radtoo: Well, it would block a pipeline slot for months and eat up a lot of CPU time on the pipeline.
21:53 πŸ”— JAA Also, pagination is limited to 500 pages, so it might not discover everything. The images on the figure page (beyond the first one) are loaded over JS and wouldn't be retrieved by AB. Search params from the picture search are passed on to the picture page URL, so those would be grabbed multiple times.
21:54 πŸ”— JAA This needs a different approach.
21:54 πŸ”— Radtoo Hm. High CPU usage? Is it putting all into a single archive? Either way, I *generally* think it's worthwhile to crawl since this one is fairly information dense & has lots of CC-BY work plus various secondary sources of information integrated (event, industry and so on news)
21:54 πŸ”— Radtoo I think the image filenames are static, it grabs them multiple times?
22:12 πŸ”— JAA High CPU usage because the tool behind ArchiveBot (wpull) isn't very efficient for such large crawls.
22:14 πŸ”— JAA Not the images themselves, but the pages showing them. For example, the first link on https://myfigurecollection.net/pictures.php?itemId=546829 is https://myfigurecollection.net/picture/2240529&context%5B%5D=itemId%3A546829 and when you filter for figures additionally, the same page is linked as https://myfigurecollection.net/picture/2240529&context%5B%5D=categoryId%3A1&context%5B%5D=itemId%3A546829
22:14 πŸ”— JAA etc.
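The duplication JAA describes could be avoided by normalizing picture-page URLs before queueing them. A minimal sketch, assuming the `context[]` segments only carry search state and do not change the page content (not verified against the site); `canonicalPictureUrl` and `dedupeUrls` are hypothetical names.

```javascript
// Strip every "&context[]=..." segment (percent-encoded or literal
// brackets) so that the same picture page maps to one canonical URL.
function canonicalPictureUrl(rawUrl) {
  return rawUrl.replace(/&context(?:%5B%5D|\[\])=[^&#]*/gi, "");
}

// Keep only the first URL seen for each canonical form.
function dedupeUrls(urls) {
  const seen = new Set();
  const unique = [];
  for (const u of urls) {
    const canonical = canonicalPictureUrl(u);
    if (!seen.has(canonical)) {
      seen.add(canonical);
      unique.push(canonical);
    }
  }
  return unique;
}

// The two example links from the message above collapse to one entry.
console.log(dedupeUrls([
  "https://myfigurecollection.net/picture/2240529&context%5B%5D=itemId%3A546829",
  "https://myfigurecollection.net/picture/2240529&context%5B%5D=categoryId%3A1&context%5B%5D=itemId%3A546829",
]));
```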
22:15 πŸ”— JAA I don't disagree with archiving such sites, but AB is probably not quite the right tool for it.
22:18 πŸ”— Radtoo Ah, good to know. Are there any tools that are very good with this kind of website that might produce a WARC or such?
22:31 πŸ”— RichardG_ has joined #archiveteam-bs
22:31 πŸ”— RichardG has quit IRC (Read error: Connection reset by peer)
22:46 πŸ”— bsmith093 has quit IRC (Ping timeout: 265 seconds)
22:47 πŸ”— Arcorann has joined #archiveteam-bs
22:47 πŸ”— bsmith093 has joined #archiveteam-bs
22:47 πŸ”— Arcorann has quit IRC (Read error: Connection reset by peer)
22:48 πŸ”— Arcorann has joined #archiveteam-bs
23:19 πŸ”— lennier2 has joined #archiveteam-bs
23:23 πŸ”— lennier1 has quit IRC (Ping timeout: 265 seconds)
23:23 πŸ”— lennier2 is now known as lennier1
