#archiveteam-ot 2020-03-27,Fri


Time Nickname Message
00:12 🔗 JAA ZiNC: #archiveteam = important announcements ("oh shit this site is on fire" etc.), -bs = any archival discussion (unless there's a project-specific channel), -ot = anything more or less
00:12 🔗 ZiNC Thanks.
00:13 🔗 ZiNC bs = archival projects? Considering ot is "general archiving".
00:14 🔗 JAA Yeah, AT archival projects in -bs, more general here.
00:15 🔗 ZiNC Right.
00:18 🔗 ZiNC Is Heritrix considered the common tool?
00:19 🔗 ZiNC The default go-to tool.
00:28 🔗 JAA At the Internet Archive, I think so. At ArchiveTeam, I don't think we've ever used it. Some played around with it in the past and reported that it's quite a PITA to set up and manage.
00:29 🔗 ZiNC That's my impression so far as well. :)
00:29 🔗 JAA Our distributed projects nearly always use a wget fork, and ArchiveBot and many individual archivals use wpull.
00:29 🔗 ZiNC v3
00:30 🔗 JAA I do most of my archivals with my own tool (qwarc) nowadays since I can tune it to my liking, but I can't recommend it for general use.
00:30 🔗 ZiNC Is there a way to have enough control over the crawl with wget (and similar)?
00:31 🔗 ZiNC Rather than following every URL, or N-levels deep.
00:31 🔗 JAA Plain wget: no
00:31 🔗 JAA wget-lua: more or less
00:31 🔗 JAA wpull: quite good
00:32 🔗 ZiNC So why use anything but wpull?
00:32 🔗 JAA It has quite a number of bugs, unfortunately.
00:33 🔗 ZiNC Disastrous, or just annoyances?
00:33 🔗 JAA If you just `pip install wpull`, you'll pretty much get a non-functional version (2.0.1). You'll want 1.2.3 or 2.0.3.
00:33 🔗 JAA You might also want to look into grab-site.
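The version advice above can be pinned explicitly at install time; a sketch, assuming a standard pip setup:

```shell
# Plain `pip install wpull` pulls in 2.0.1, described above as pretty
# much non-functional. Pin one of the recommended versions instead:
pip install wpull==2.0.3
# ...or the older, very stable line:
pip install wpull==1.2.3
```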
00:35 🔗 ZiNC Any worthwhile GUI-based tools/frontends?
00:36 🔗 ZiNC (I suppose grab-site is just a minimal GUI around wpull.)
00:39 🔗 JAA Don't think so, unless you can make one of the countless wget GUIs work with it. But it's pretty much all terminal-based here. Even grab-site is, it just also has a web interface for monitoring, but you can't control anything through it.
00:39 🔗 JAA grab-site is roughly the same as ArchiveBot, by the way, just without the distributed and IRC parts.
00:42 🔗 ZiNC Thinking of capturing a forum.
00:43 🔗 ZiNC Might it be better to use a few templated URLs, rather than follow URLs?
00:54 🔗 JAA Depends a bit on the forum software and how important it is to you to also capture forum listings etc. But if you're just interested in the thread contents and the URL format allows it, yeah, I'd just do that. (Keep thread pagination in mind though.)
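The templated-URL approach described above can be sketched as follows: instead of recursive crawling, generate the thread/page URLs directly and feed them to the fetcher. The URL format and page counts here are hypothetical examples, not any real forum software's scheme.

```python
# Generate one URL per page of each thread, so pagination is covered
# explicitly rather than by link-following. All names are illustrative.
def thread_urls(base, thread_ids, pages_per_thread):
    """Yield every page URL of every thread (threads default to 1 page)."""
    for tid in thread_ids:
        for page in range(1, pages_per_thread.get(tid, 1) + 1):
            yield f"{base}/viewtopic.php?t={tid}&page={page}"

urls = list(thread_urls("https://forum.example.com",
                        [101, 102],
                        {101: 3, 102: 1}))
```

The resulting list can then be passed to wget/wpull via an input file instead of recursion.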
00:56 🔗 ZiNC Aren't there any forum-specific grabbers?
00:56 🔗 hook54321 I haven't seen any that spit out WARCs
00:56 🔗 ZiNC With some forum-software-specific knowledge it might be cleaner and maybe simpler.
00:56 🔗 ZiNC Also to do periodic delta captures.
00:57 🔗 JAA I've been thinking about that a bit recently, so maybe soon. :-)
00:57 🔗 ZiNC Might even extract just the actual data,
00:57 🔗 hook54321 something like a --discourse option for archivebot would be nice
00:57 🔗 ZiNC and use templates to recreate the pagination/subforums listings.
00:58 🔗 ZiNC Might require some manual work to fix/create a template for a specific site,
00:58 🔗 JAA The one I have in mind is based on qwarc and archives everything relevant for pywb/WBM playback but also produces some machine-readable format of the actual contents. I haven't really thought about it very much in detail yet though.
00:58 🔗 ZiNC or maybe even that could be automated with forum-software knowledge/presets.
00:59 🔗 ZiNC hook54321: Is that a real option of some tool?
00:59 🔗 JAA I don't think it's worth having a generic software that can reproduce a particular forum's layout etc. The content is what matters, and for everything else, we have WARCs.
00:59 🔗 hook54321 ZiNC: no
01:00 🔗 hook54321 JAA: I think that's what bibanon does, right? I'm not a big fan.
01:00 🔗 ZiNC Well, it could be nice to keep a forum's "feel" as well.
01:00 🔗 JAA hook54321: No idea what they're doing. I'm not particularly interested in imageboards.
01:00 🔗 ZiNC And why save all that repeating HTML 100Ks of times.
01:01 🔗 ZiNC In the case of Discourse, yeah, no need to keep the layout. It sucks anyway. :)
01:02 🔗 hook54321 whether or not we like the layout is pretty much irrelevant
01:03 🔗 ZiNC Do wpull/something else have good ways to do delta captures?
01:07 🔗 JAA Well, generically, the best thing you can do is dedupe against previously captured responses, as with wpull's --warc-dedup. But that's just on the storage side, it doesn't save you from re-retrieving everything, and if there are small differences in the responses, it won't work at all.
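The --warc-dedup workflow mentioned above can be sketched as two crawls, where the second is given the CDX index of the first. The flag names are wpull's; the URL and file names are hypothetical, and as noted, this only saves storage since every URL is still re-fetched.

```python
# Build wpull invocations for a delta capture: the second crawl loads
# the first crawl's CDX via --warc-dedup and writes revisit records
# for responses whose payload is unchanged.
import shlex

def wpull_cmd(url, warc_name, dedup_cdx=None):
    """Assemble a wpull command line; pass a previous CDX to dedupe against."""
    cmd = ["wpull", "--recursive", "--warc-file", warc_name, "--warc-cdx"]
    if dedup_cdx is not None:
        cmd += ["--warc-dedup", dedup_cdx]
    cmd.append(url)
    return shlex.join(cmd)

first = wpull_cmd("https://forum.example.com/", "crawl1")
second = wpull_cmd("https://forum.example.com/", "crawl2", dedup_cdx="crawl1.cdx")
```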
01:08 🔗 dxrt has quit IRC (Remote host closed the connection)
01:09 🔗 ZiNC No tools with a way to define custom/per-job stop rules?
01:09 🔗 JAA Anything more efficient would be specific to a particular website, software, etc.
01:09 🔗 ZiNC For example, in a forum, one might follow the "newest posts" list, until reaching a date that's already been captured before.
01:09 🔗 JAA Well yeah, you can probably do that to some degree with URL filters/ignores.
01:10 🔗 dxrt has joined #archiveteam-ot
01:11 🔗 ZiNC Can you script these things?
01:11 🔗 JAA https://wpull.readthedocs.io/en/master/scripting.html
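The stop rule ZiNC describes (follow the "newest posts" listing only until reaching a date already captured) boils down to a small decision function that a hook script could call from its URL-accept callback. Only the logic is sketched here, since the hook wiring differs between wpull 1.x and 2.x; the names and date format are hypothetical.

```python
# Per-job stop rule: stop descending into listing pages once the
# newest post on a page predates the previous crawl.
from datetime import date

# Global crawl state, as discussed: when the last capture ran.
LAST_CAPTURE = date(2020, 3, 1)

def should_follow(newest_post_date):
    """True while the page still contains posts newer than the last capture."""
    return newest_post_date > LAST_CAPTURE
```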
01:12 🔗 ZiNC It may also require keeping a global state, not just per-page/download.
01:12 🔗 Ctrl has quit IRC (Read error: Operation timed out)
01:15 🔗 ZiNC It's sad that the Wayback Machine tends to have forums captured very partially.
01:15 🔗 ZiNC Any idea if it's better now than it was?
01:15 🔗 ZiNC Seems like it isn't fond of following pagination.
01:16 🔗 ZiNC And other problems, like junk in the URLs (session IDs and such).
01:16 🔗 JAA Depends entirely on how it was captured. If it's part of IA's web-wide crawls, then those crawls stop at depth 3 I believe, so naturally it won't get very far in long threads.
01:16 🔗 JAA ... or in thread list pagination.
01:16 🔗 JAA Yep, session IDs are a common annoyance in ArchiveBot.
01:16 🔗 ZiNC And the starting point would be the homepage, or forum index?
01:17 🔗 ZiNC Why stop at depth 3?
01:18 🔗 JAA I think they start those from a domain list.
01:18 🔗 JAA Because it takes something like half a year to retrieve it up to depth 3 and would probably take several years to do depth 4.
01:19 🔗 ZiNC sessionIds could be filtered out. Might require software-specific code, but that's not too hard.
01:19 🔗 JAA wpull does filter out some common session IDs. They're still annoying because the links will be broken.
01:19 🔗 ZiNC No link fixing?
01:19 🔗 JAA Rule 1 of web archival: never, ever modify anything sent by the web server.
01:20 🔗 ZiNC So dynamically with JS, while viewing.
01:20 🔗 JAA That said, the Wayback Machine does work around that problem by ignoring some session ID parameters, but again, that's only some common ones.
01:20 🔗 JAA E.g. sid=[0-9a-f]{32} would be handled but something more custom wouldn't.
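The sid=[0-9a-f]{32} handling above can be illustrated with a small URL canonicalizer; the regex is the one from the example, the helper itself is hypothetical. Per "rule 1", this kind of normalization is only for crawl-side dedup and playback lookups; the archived response bytes are never modified.

```python
# Strip a `sid=<32 hex chars>` query parameter so two URLs that differ
# only by session ID are treated as the same capture target.
import re

SID_RE = re.compile(r'(?<=[?&])sid=[0-9a-f]{32}&?')

def canonical(url):
    """Remove the sid parameter, plus any dangling ? or & it leaves behind."""
    return SID_RE.sub('', url).rstrip('?&')
```

More custom session-ID schemes would need their own patterns, which is exactly why only the common ones get handled generically.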
01:21 🔗 ZiNC You could detect the software, then add rules based on that.
01:21 🔗 JAA You could, but wpull is a generic software and shouldn't contain code like that.
01:22 🔗 ZiNC Though generic ones might indeed cover most of it.
01:22 🔗 JAA Can be done with hooks though.
01:22 🔗 JAA Or if that fails, you can directly replace the relevant portions of wpull since it's all Python.
01:22 🔗 ZiNC generic + domain-specific knowledge isn't a bad combination.
01:22 🔗 ZiNC Anything that normalizes or ignores parameter order?
01:23 🔗 JAA Hardcoding domain-specific things in a generic software is bad though. Requires more maintenance, the test code now depends on some random web service, etc.
01:23 🔗 JAA wpull is *already* not getting enough maintenance. Such extra fluff would be outdated by years by this point.
01:24 🔗 ZiNC Not necessarily hardcoding. Could be scripts made easy to modify, with minimal content.
01:24 🔗 JAA Nope, parameter order is preserved because there's no generic way to say if it matters or not. Some web servers require a certain order, and the HTTP spec doesn't allow random reordering.
01:25 🔗 ZiNC In GET params?
01:25 🔗 JAA That's exactly what hook scripts are. Feel free to write one, throw it in some repo, etc. But it doesn't belong in the main wpull code.
01:25 🔗 JAA GET, POST, whatever.
01:25 🔗 ZiNC What software minds the order in ?a&b&c ?
01:26 🔗 JAA I don't know, but I'm certain it exists.
01:26 🔗 ZiNC Maybe. But that'd be odd.
01:26 🔗 JAA The internet is full of odd things.
01:27 🔗 JAA Half the art of web archival is working around that crap.
01:27 🔗 ZiNC For strict ordering I'd expect URL rewriting, or something similar.
01:28 🔗 ZiNC So if IA's depth is always 3,
01:29 🔗 ZiNC does this mean 99% of forum content was never archived?
01:29 🔗 JAA It isn't.
01:29 🔗 JAA That was just one example of why deep recursion might not happen.
01:29 🔗 ZiNC BTW, any advantage to wpull 1.2.x rather than 2.0.x?
01:30 🔗 JAA But there's certainly a lot of forum content out there that hasn't been archived.
01:30 🔗 ZiNC Would you say the majority/large majority?
01:32 🔗 JAA Mostly stability. 1.2.3 is very stable and reliable. 2.0.x has a better hook API, uses asyncio instead of trollius, and has a bunch of other internal nice changes, but it's definitely nowhere near as stable as 1.2.3.
01:32 🔗 ZiNC As long as resuming works...
01:32 🔗 JAA No idea, I never did an audit of what forums exist or how much IA has captured.
01:33 🔗 JAA If you use grab-site, it'll be fine. With the additional plugins etc., that version of wpull is quite stable despite being essentially 2.0.3.
01:33 🔗 JAA Same goes for ArchiveBot.
01:34 🔗 OrIdow6 Speaking from experience of heavily using the WBM, no, forum coverage is terrible
01:34 🔗 JAA But you will eventually run into weird TLS connection hangs and things like that.
01:34 🔗 ZiNC v2-specific?
01:35 🔗 ZiNC OrIdow6: Maybe captured but not yet live? :)
01:36 🔗 JAA Yes, v2.x
01:37 🔗 ZiNC Alright.
01:37 🔗 ZiNC Thanks for the help so far.
01:37 🔗 JAA (And wpull_ludios 3.x, which is the wpull fork grab-site uses.)
01:38 🔗 JAA Er, ludios_wpull*
01:38 🔗 ZiNC :)
01:38 🔗 ZiNC Well, I'd better call it a day.
01:39 🔗 ZiNC Adios.
01:39 🔗 JAA See ya
01:42 🔗 ZiNC has quit IRC ()
02:02 🔗 Ctrl has joined #archiveteam-ot
03:05 🔗 atphoenix has quit IRC (Ping timeout: 276 seconds)
03:06 🔗 atphoenix has joined #archiveteam-ot
03:11 🔗 atphoenix has quit IRC (Read error: Operation timed out)
03:18 🔗 atphoenix has joined #archiveteam-ot
03:22 🔗 atphoeni_ has joined #archiveteam-ot
03:23 🔗 atphoenix has quit IRC (Ping timeout: 276 seconds)
03:23 🔗 atphoeni_ is now known as atphoenix
03:57 🔗 kiska has quit IRC (Ping timeout (120 seconds))
03:57 🔗 SJon__ has quit IRC (Read error: Connection reset by peer)
03:57 🔗 SJon__ has joined #archiveteam-ot
03:59 🔗 kiska has joined #archiveteam-ot
04:05 🔗 qw3rty_ has joined #archiveteam-ot
04:05 🔗 ShellyRol has joined #archiveteam-ot
04:13 🔗 qw3rty has quit IRC (Read error: Operation timed out)
04:36 🔗 wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
04:43 🔗 jake_test has quit IRC (Read error: Connection reset by peer)
04:43 🔗 JAA has quit IRC (Read error: Operation timed out)
04:43 🔗 jake_test has joined #archiveteam-ot
04:45 🔗 Larsenv has quit IRC (ZNC 1.7.5 - https://znc.in)
04:46 🔗 Larsenv has joined #archiveteam-ot
04:47 🔗 JAA has joined #archiveteam-ot
04:47 🔗 AlsoJAA sets mode: +o JAA
04:47 🔗 wp494 has joined #archiveteam-ot
04:56 🔗 dhyan_nat has joined #archiveteam-ot
05:02 🔗 OrIdow6 has quit IRC (Quit: Leaving.)
05:26 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
05:29 🔗 JAA has quit IRC (Read error: Operation timed out)
05:29 🔗 JAA has joined #archiveteam-ot
05:30 🔗 AlsoJAA sets mode: +o JAA
06:16 🔗 OrIdow6 has joined #archiveteam-ot
07:06 🔗 Ctrl has quit IRC (Read error: Operation timed out)
07:12 🔗 Wingy8 has joined #archiveteam-ot
07:13 🔗 Wingy has quit IRC (Read error: Operation timed out)
07:13 🔗 Wingy8 is now known as Wingy
08:45 🔗 dhyan_nat has joined #archiveteam-ot
10:18 🔗 fuzzy802 has joined #archiveteam-ot
10:24 🔗 fuzzy8021 has quit IRC (Read error: Operation timed out)
10:27 🔗 HP_Archiv has quit IRC (Ping timeout: 276 seconds)
10:28 🔗 fuzzy802 is now known as fuzzy8021
10:39 🔗 HP_Archiv has joined #archiveteam-ot
10:57 🔗 BlueMax has quit IRC (Quit: Leaving)
11:07 🔗 qw3rty_ has quit IRC (Read error: Connection reset by peer)
11:10 🔗 qw3rty has joined #archiveteam-ot
13:43 🔗 HP_Archiv has quit IRC (Quit: Leaving)
14:59 🔗 VerifiedJ has joined #archiveteam-ot
15:22 🔗 wp494 has quit IRC (Ping timeout: 610 seconds)
15:23 🔗 wp494 has joined #archiveteam-ot
17:11 🔗 Ctrl has joined #archiveteam-ot
17:16 🔗 HP_Archiv has joined #archiveteam-ot
17:17 🔗 HP_Archiv has quit IRC (Client Quit)
17:45 🔗 jake_test has quit IRC (Read error: Operation timed out)
17:46 🔗 jake_test has joined #archiveteam-ot
18:54 🔗 HP_Archiv has joined #archiveteam-ot
19:45 🔗 jodizzle has quit IRC (Read error: Operation timed out)
19:45 🔗 JAA has quit IRC (Read error: Operation timed out)
19:45 🔗 jodizzle has joined #archiveteam-ot
19:46 🔗 JAA has joined #archiveteam-ot
19:46 🔗 AlsoJAA sets mode: +o JAA
20:04 🔗 girst has quit IRC (Read error: Operation timed out)
20:42 🔗 girst has joined #archiveteam-ot
20:51 🔗 girst_ has joined #archiveteam-ot
20:51 🔗 girst has quit IRC (Read error: Connection reset by peer)
20:51 🔗 girst_ is now known as girst
21:30 🔗 DogsRNice has joined #archiveteam-ot
21:48 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
21:55 🔗 ivan has quit IRC (Quit: Leaving)
22:11 🔗 ivan has joined #archiveteam-ot
22:12 🔗 girst has quit IRC (Read error: Operation timed out)
22:15 🔗 girst has joined #archiveteam-ot
22:40 🔗 girst has quit IRC (Ping timeout: 258 seconds)
22:50 🔗 girst has joined #archiveteam-ot
22:54 🔗 BlueMax has joined #archiveteam-ot
22:56 🔗 BlueMax has quit IRC (Read error: Connection timed out)
22:56 🔗 BlueMax has joined #archiveteam-ot
23:45 🔗 girst has quit IRC (Read error: Operation timed out)
23:52 🔗 girst has joined #archiveteam-ot
