[00:12] ZiNC: #archiveteam = important announcements ("oh shit, this site is on fire" etc.), -bs = any archival discussion (unless there's a project-specific channel), -ot = anything, more or less
[00:12] Thanks.
[00:13] -bs = archival projects? Considering -ot is "general archiving".
[00:14] Yeah, AT archival projects in -bs, more general here.
[00:15] Right.
[00:18] Is Heritrix considered the common tool?
[00:19] The default go-to tool.
[00:28] At the Internet Archive, I think so. At ArchiveTeam, I don't think we've ever used it. Some played around with it in the past and reported that it's quite a PITA to set up and manage.
[00:29] That's my impression so far as well. :)
[00:29] Our distributed projects nearly always use a wget fork, and ArchiveBot and many individual archivals use wpull.
[00:29] v3
[00:30] I do most of my archivals with my own tool (qwarc) nowadays since I can tune it to my liking, but I can't recommend it for general use.
[00:30] Is there a way to have enough control over the crawl with wget (and similar)?
[00:31] Rather than following every URL, or going N levels deep.
[00:31] Plain wget: no
[00:31] wget-lua: more or less
[00:31] wpull: quite good
[00:32] So why use anything but wpull?
[00:32] It has quite a number of bugs, unfortunately.
[00:33] Disastrous, or just annoyances?
[00:33] If you just `pip install wpull`, you'll pretty much get a non-functional version (2.0.1). You'll want 1.2.3 or 2.0.3.
[00:33] You might also want to look into grab-site.
[00:35] Any worthwhile GUI-based tools/frontends?
[00:36] (I suppose grab-site is just a minimal GUI around wpull.)
[00:39] Don't think so, unless you can make one of the countless wget GUIs work with it. But it's pretty much all terminal-based here. Even grab-site is; it just also has a web interface for monitoring, but you can't control anything through it.
[00:39] grab-site is roughly the same as ArchiveBot, by the way, just without the distributed and IRC parts.
[00:42] Thinking of capturing a forum.
[00:43] Might it be better to use a few templated URLs, rather than following URLs?
[00:54] Depends a bit on the forum software and how important it is to you to also capture forum listings etc. But if you're just interested in the thread contents and the URL format allows it, yeah, I'd just do that. (Keep thread pagination in mind though.)
[00:56] Aren't there any forum-specific grabbers?
[00:56] I haven't seen any that spit out WARCs.
[00:56] With some forum-software-specific knowledge, it might be cleaner and maybe simpler.
[00:56] Also to do periodic delta captures.
[00:57] I've been thinking about that a bit recently, so maybe soon. :-)
[00:57] Might even extract just the actual data,
[00:57] something like a --discourse option for archivebot would be nice
[00:57] and use templates to recreate the pagination/subforum listings.
[00:58] Might require some manual work to fix/create a template for a specific site,
[00:58] The one I have in mind is based on qwarc and archives everything relevant for pywb/WBM playback, but also produces some machine-readable format of the actual contents. I haven't really thought about it in much detail yet though.
[00:58] or maybe even that could be automated with forum-software knowledge/presets.
[00:59] hook54321: Is that a real option of some tool?
[00:59] I don't think it's worth having generic software that can reproduce a particular forum's layout etc. The content is what matters, and for everything else, we have WARCs.
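
A minimal sketch of the templated-URL idea from [00:43] above, assuming a hypothetical phpBB-style forum where threads live at viewtopic.php?t=<id>&start=<offset>; the base URL, thread ID range, and page bounds are all made up and would have to be taken from the real site. The output can be fed to wpull via -i/--input-file, with no recursion needed.

    # Sketch: generate templated thread URLs (with pagination) for a forum grab.
    # All constants here are hypothetical and must be adapted to the real site.
    BASE = "https://forum.example.com/viewtopic.php"
    MAX_THREAD = 50000      # highest known thread ID
    MAX_PAGES = 20          # upper bound on pages per thread
    POSTS_PER_PAGE = 25     # phpBB paginates by post offset, not page number

    with open("urls.txt", "w") as f:
        for tid in range(1, MAX_THREAD + 1):
            for page in range(MAX_PAGES):
                f.write(f"{BASE}?t={tid}&start={page * POSTS_PER_PAGE}\n")

Requests past a thread's last page are wasted but usually harmless; a smarter version could fetch each thread's first page and read the actual page count before templating the rest.
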
[00:59] ZiNC: no
[01:00] JAA: I think that's what bibanon does, right? I'm not a big fan.
[01:00] Well, it could be nice to keep a forum's "feel" as well.
[01:00] hook54321: No idea what they're doing. I'm not particularly interested in imageboards.
[01:00] And why save all that repeating HTML hundreds of thousands of times?
[01:01] In the case of Discourse, yeah, no need to keep the layout. It sucks anyway. :)
[01:02] whether or not we like the layout is pretty much irrelevant
[01:03] Does wpull (or something else) have good ways to do delta captures?
[01:07] Well, generically, the best thing you can do is dedupe against previously captured responses, as with wpull's --warc-dedup. But that's just on the storage side; it doesn't save you from re-retrieving everything, and if there are small differences in the responses, it won't work at all.
[01:09] No tools with a way to define custom/per-job stop rules?
[01:09] Anything more efficient would be specific to a particular website, software, etc.
[01:09] For example, in a forum, one might follow the "newest posts" list until reaching a date that has already been captured before.
[01:09] Well yeah, you can probably do that to some degree with URL filters/ignores.
[01:11] Can you script these things?
[01:11] https://wpull.readthedocs.io/en/master/scripting.html
[01:12] It may also require keeping a global state, not just per-page/download state.
[01:15] It's sad that the Wayback Machine tends to have forums captured only very partially.
[01:15] Any idea if it's better now than it was?
[01:15] Seems like it isn't fond of following pagination.
[01:16] And other problems, like junk in the URLs (session IDs and such).
[01:16] Depends entirely on how it was captured. If it's part of IA's web-wide crawls, then those crawls stop at depth 3, I believe, so naturally it won't get very far in long threads.
[01:16] ... or in thread list pagination.
[01:16] Yep, session IDs are a common annoyance in ArchiveBot.
[01:16] And the starting point would be the homepage, or forum index?
[01:17] Why stop at depth 3?
[01:18] I think they start those from a domain list.
[01:18] Because it takes something like half a year to retrieve everything up to depth 3, and it would probably take several years to do depth 4.
[01:19] Session IDs could be filtered out. Might require software-specific code, but that's not too hard.
[01:19] wpull does filter out some common session IDs. They're still annoying because the links will be broken.
[01:19] No link fixing?
[01:19] Rule 1 of web archival: never, ever modify anything sent by the web server.
[01:20] So fix them dynamically with JS, while viewing.
[01:20] That said, the Wayback Machine does work around that problem by ignoring some session ID parameters, but again, only some common ones.
[01:20] E.g. sid=[0-9a-f]{32} would be handled, but something more custom wouldn't.
[01:21] You could detect the software, then add rules based on that.
[01:21] You could, but wpull is a generic tool and shouldn't contain code like that.
[01:22] Though generic ones might indeed cover most of it.
[01:22] Can be done with hooks though.
[01:22] Or if that fails, you can directly replace the relevant portions of wpull, since it's all Python.
[01:22] generic + domain-specific knowledge isn't a bad combination.
[01:22] Anything that normalizes or ignores parameter order?
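
As a concrete sketch of the scripting approach from [01:11]: a per-job stop rule with global state, written against what I understand to be the wpull 1.2.x hook API (a script passed with --python-script, into which wpull injects a wpull_hook object; 2.x uses a plugin interface instead, see the docs linked above). The URL patterns and the captured-threads.txt file are hypothetical.

    # delta.py -- a hedged sketch, not a drop-in script.
    # Assumed usage (wpull 1.2.x): wpull ... --python-script delta.py
    import re

    # Global state shared across the whole job, not just per download:
    # thread IDs that a previous capture already covered (hypothetical file).
    with open("captured-threads.txt") as f:
        ALREADY_CAPTURED = set(f.read().split())

    THREAD_RE = re.compile(r"[?&]t=(\d+)")
    SESSION_RE = re.compile(r"[?&]sid=[0-9a-f]{32}")

    def accept_url(url_info, record_info, verdict, reasons):
        url = url_info['url']
        # Skip session-ID variants rather than rewriting them (never modify
        # the URL itself); only sensible if the same pages are also reachable
        # without the parameter.
        if SESSION_RE.search(url):
            return False
        # Delta capture: don't re-fetch threads we already have.
        m = THREAD_RE.search(url)
        if m and m.group(1) in ALREADY_CAPTURED:
            return False
        return verdict

    # wpull_hook is provided by wpull at runtime, not defined in this file.
    wpull_hook.callbacks.accept_url = accept_url

Combined with --warc-dedup against the previous job's CDX, repeat responses that do get fetched are then stored as small revisit records rather than full copies.
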
[01:23] Hardcoding domain-specific things in generic software is bad though. It requires more maintenance, the test code now depends on some random web service, etc.
[01:23] wpull is *already* not getting enough maintenance. Such extra fluff would be outdated by years at this point.
[01:24] Not necessarily hardcoding. Could be scripts made easy to modify, with minimal content.
[01:24] Nope, parameter order is preserved because there's no generic way to tell whether it matters or not. Some web servers require a certain order, and the HTTP spec doesn't allow random reordering.
[01:25] In GET params?
[01:25] That's exactly what hook scripts are. Feel free to write one, throw it in some repo, etc. But it doesn't belong in the main wpull code.
[01:25] GET, POST, whatever.
[01:25] What software minds the order in ?a&b&c ?
[01:26] I don't know, but I'm certain it exists.
[01:26] Maybe. But that'd be odd.
[01:26] The internet is full of odd things.
[01:27] Half the art of web archival is working around that crap.
[01:27] For strict ordering I'd expect URL rewriting, or something similar.
[01:28] So if IA's depth is always 3,
[01:29] does this mean 99% of forum content was never archived?
[01:29] It isn't.
[01:29] That was just one example of why deep recursion might not happen.
[01:29] BTW, any advantage to wpull 1.2.x rather than 2.0.x?
[01:30] But there's certainly a lot of forum content out there that hasn't been archived.
[01:30] Would you say the majority/large majority?
[01:32] Mostly stability. 1.2.3 is very stable and reliable. 2.0.x has a better hook API, uses asyncio instead of trollius, and has a bunch of other nice internal changes, but it's definitely nowhere near as stable as 1.2.3.
[01:32] As long as resuming works...
[01:32] No idea, I never did an audit of what forums exist or how much IA has captured.
[01:33] If you use grab-site, it'll be fine. With the additional plugins etc., that version of wpull is quite stable despite being essentially 2.0.3.
[01:33] Same goes for ArchiveBot.
[01:34] Speaking from experience of heavily using the WBM, no, forum coverage is terrible.
[01:34] But you will eventually run into weird TLS connection hangs and things like that.
[01:34] v2-specific?
[01:35] OrIdow6: Maybe captured but not yet live? :)
[01:36] Yes, v2.x
[01:37] Alright.
[01:37] Thanks for the help so far.
[01:37] (And ludios_wpull 3.x, which is the wpull fork grab-site uses.)
[01:38] Well, I'd better call it a day.
[01:39] Adios.
[01:39] See ya
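
A footnote on the parameter-order question from [01:22]: since the fetched URL must be preserved byte for byte, normalization is only safe as a comparison key, e.g. inside a hook script deciding whether a URL was already seen, never for rewriting the request itself. A standard-library sketch (dedup_key is a made-up name):

    from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

    def dedup_key(url):
        # Sort query parameters purely for comparison; the URL actually
        # requested and recorded keeps its original parameter order.
        parts = urlsplit(url)
        query = sorted(parse_qsl(parts.query, keep_blank_values=True))
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(query), parts.fragment))

    # Both spellings collapse to one key, so the second can be skipped:
    assert dedup_key("http://example.com/?b=2&a=1") == \
           dedup_key("http://example.com/?a=1&b=2")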