#archiveteam-ot 2020-03-27,Fri


Time Nickname Message
00:12 🔗 JAA ZiNC: #archiveteam = important announcements ("oh shit this site is on fire" etc.), -bs = any archival discussion (unless there's a project-specific channel), -ot = anything more or less
00:12 🔗 ZiNC Thanks.
00:13 🔗 ZiNC bs = archival projects? Considering ot is "general archiving".
00:14 🔗 JAA Yeah, AT archival projects in -bs, more general here.
00:15 🔗 ZiNC Right.
00:18 🔗 ZiNC Is Heritrix considered the common tool?
00:19 🔗 ZiNC The default go-to tool.
00:28 🔗 JAA At the Internet Archive, I think so. At ArchiveTeam, I don't think we've ever used it. Some played around with it in the past and reported that it's quite a PITA to set up and manage.
00:29 🔗 ZiNC That's my impression so far as well. :)
00:29 🔗 JAA Our distributed projects nearly always use a wget fork, and ArchiveBot and many individual archivals use wpull.
00:29 🔗 ZiNC v3
00:30 🔗 JAA I do most of my archivals with my own tool (qwarc) nowadays since I can tune it to my liking, but I can't recommend it for general use.
00:30 🔗 ZiNC Is there a way to have enough control over the crawl with wget (and similar)?
00:31 🔗 ZiNC Rather than following every URL, or N-levels deep.
00:31 🔗 JAA Plain wget: no
00:31 🔗 JAA wget-lua: more or less
00:31 🔗 JAA wpull: quite good
00:32 🔗 ZiNC So why use anything but wpull?
00:32 🔗 JAA It has quite a number of bugs, unfortunately.
00:33 🔗 ZiNC Disastrous, or just annoyances?
00:33 🔗 JAA If you just `pip install wpull`, you'll pretty much get a non-functional version (2.0.1). You'll want 1.2.3 or 2.0.3.
00:33 🔗 JAA You might also want to look into grab-site.
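The version advice above can be pinned explicitly at install time; a sketch, assuming a standard pip setup:

```shell
# Plain `pip install wpull` pulls in 2.0.1, described above as pretty
# much non-functional. Pin one of the recommended versions instead:
pip install wpull==2.0.3
# ...or the older, very stable line:
pip install wpull==1.2.3
```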
00:35 🔗 ZiNC Any worthwhile GUI-based tools/frontends?
00:36 🔗 ZiNC (I suppose grab-site is just a minimal GUI around wpull.)
00:39 🔗 JAA Don't think so, unless you can make one of the countless wget GUIs work with it. But it's pretty much all terminal-based here. Even grab-site is, it just also has a web interface for monitoring, but you can't control anything through it.
00:39 🔗 JAA grab-site is roughly the same as ArchiveBot, by the way, just without the distributed and IRC parts.
00:42 🔗 ZiNC Thinking of capturing a forum.
00:43 🔗 ZiNC Might it be better to use a few templated URLs, rather than follow URLs?
00:54 🔗 JAA Depends a bit on the forum software and how important it is to you to also capture forum listings etc. But if you're just interested in the thread contents and the URL format allows it, yeah, I'd just do that. (Keep thread pagination in mind though.)
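The templated-URL approach described above can be sketched as follows: instead of recursive crawling, generate the thread/page URLs directly and feed them to the fetcher. The URL format and page counts here are hypothetical examples, not any real forum software's scheme.

```python
# Generate one URL per page of each thread, so pagination is covered
# explicitly rather than by link-following. All names are illustrative.
def thread_urls(base, thread_ids, pages_per_thread):
    """Yield every page URL of every thread (threads default to 1 page)."""
    for tid in thread_ids:
        for page in range(1, pages_per_thread.get(tid, 1) + 1):
            yield f"{base}/viewtopic.php?t={tid}&page={page}"

urls = list(thread_urls("https://forum.example.com",
                        [101, 102],
                        {101: 3, 102: 1}))
```

The resulting list can then be passed to wget/wpull via an input file instead of recursion.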
00:56 🔗 ZiNC Aren't there any forum-specific grabbers?
00:56 🔗 hook54321 I haven't seen any that spit out WARCs
00:56 🔗 ZiNC With some forum-software-specific knowledge it might be cleaner and maybe simpler.
00:56 🔗 ZiNC Also to do periodic delta captures.
00:57 🔗 JAA I've been thinking about that a bit recently, so maybe soon. :-)
00:57 🔗 ZiNC Might even extract just the actual data,
00:57 🔗 hook54321 something like a --discourse option for archivebot would be nice
00:57 🔗 ZiNC and use templates to recreate the pagination/subforums listings.
00:58 🔗 ZiNC Might require some manual work to fix/create a template for a specific site,
00:58 🔗 JAA The one I have in mind is based on qwarc and archives everything relevant for pywb/WBM playback but also produces some machine-readable format of the actual contents. I haven't really thought about it very much in detail yet though.
00:58 🔗 ZiNC or maybe even that could be automated with forum-software knowledge/presets.
00:59 🔗 ZiNC hook54321: Is that a real option of some tool?
00:59 🔗 JAA I don't think it's worth having a generic software that can reproduce a particular forum's layout etc. The content is what matters, and for everything else, we have WARCs.
00:59 🔗 hook54321 ZiNC: no
01:00 🔗 hook54321 JAA: I think that's what bibanon does, right? I'm not a big fan.
01:00 🔗 ZiNC Well, it could be nice to keep a forum's "feel" as well.
01:00 🔗 JAA hook54321: No idea what they're doing. I'm not particularly interested in imageboards.
01:00 🔗 ZiNC And why save all that repeating HTML 100Ks of times.
01:01 🔗 ZiNC In the case of Discourse, yeah, no need to keep the layout. It sucks anyway. :)
01:02 🔗 hook54321 whether or not we like the layout is pretty much irrelevant
01:03 🔗 ZiNC Do wpull/something else have good ways to do delta captures?
01:07 🔗 JAA Well, generically, the best thing you can do is dedupe against previously captured responses, as with wpull's --warc-dedup. But that's just on the storage side, it doesn't save you from re-retrieving everything, and if there are small differences in the responses, it won't work at all.
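The --warc-dedup workflow mentioned above can be sketched as two crawls, where the second is given the CDX index of the first. The flag names are wpull's; the URL and file names are hypothetical, and as noted, this only saves storage since every URL is still re-fetched.

```python
# Build wpull invocations for a delta capture: the second crawl loads
# the first crawl's CDX via --warc-dedup and writes revisit records
# for responses whose payload is unchanged.
import shlex

def wpull_cmd(url, warc_name, dedup_cdx=None):
    """Assemble a wpull command line; pass a previous CDX to dedupe against."""
    cmd = ["wpull", "--recursive", "--warc-file", warc_name, "--warc-cdx"]
    if dedup_cdx is not None:
        cmd += ["--warc-dedup", dedup_cdx]
    cmd.append(url)
    return shlex.join(cmd)

first = wpull_cmd("https://forum.example.com/", "crawl1")
second = wpull_cmd("https://forum.example.com/", "crawl2", dedup_cdx="crawl1.cdx")
```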
01:08 🔗 dxrt has quit IRC (Remote host closed the connection)
01:09 🔗 ZiNC No tools with a way to define custom/per-job stop rules?
01:09 🔗 JAA Anything more efficient would be specific to a particular website, software, etc.
01:09 🔗 ZiNC For example, in a forum, one might follow the "newest posts" list, until reaching a date that's already been captured before.
01:09 🔗 JAA Well yeah, you can probably do that to some degree with URL filters/ignores.
01:10 🔗 dxrt has joined #archiveteam-ot
01:11 🔗 ZiNC Can you script these things?
01:11 🔗 JAA https://wpull.readthedocs.io/en/master/scripting.html
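The stop rule ZiNC describes (follow the "newest posts" listing only until reaching a date already captured) boils down to a small decision function that a hook script could call from its URL-accept callback. Only the logic is sketched here, since the hook wiring differs between wpull 1.x and 2.x; the names and date format are hypothetical.

```python
# Per-job stop rule: stop descending into listing pages once the
# newest post on a page predates the previous crawl.
from datetime import date

# Global crawl state, as discussed: when the last capture ran.
LAST_CAPTURE = date(2020, 3, 1)

def should_follow(newest_post_date):
    """True while the page still contains posts newer than the last capture."""
    return newest_post_date > LAST_CAPTURE
```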
01:12 🔗 ZiNC It may also require keeping a global state, not just per-page/download.
01:12 🔗 Ctrl has quit IRC (Read error: Operation timed out)
01:15 🔗 ZiNC It's sad that the Wayback Machine tends to have forums captured very partially.
01:15 🔗 ZiNC Any idea if it's better now than it was?
01:15 🔗 ZiNC Seems like it isn't fond of following pagination.
01:16 🔗 ZiNC And other problems, like junk in the URLs (session IDs and such).
01:16 🔗 JAA Depends entirely on how it was captured. If it's part of IA's web-wide crawls, then those crawls stop at depth 3 I believe, so naturally it won't get very far in long threads.
01:16 🔗 JAA ... or in thread list pagination.
01:16 🔗 JAA Yep, session IDs are a common annoyance in ArchiveBot.
01:16 🔗 ZiNC And the starting point would be the homepage, or forum index?
01:17 🔗 ZiNC Why stop at depth 3?
01:18 🔗 JAA I think they start those from a domain list.
01:18 🔗 JAA Because it takes something like half a year to retrieve it up to depth 3 and would probably take several years to do depth 4.
01:19 🔗 ZiNC sessionIds could be filtered out. Might require software-specific code, but that's not too hard.
01:19 🔗 JAA wpull does filter out some common session IDs. They're still annoying because the links will be broken.
01:19 🔗 ZiNC No link fixing?
01:19 🔗 JAA Rule 1 of web archival: never, ever modify anything sent by the web server.
01:20 🔗 ZiNC So dynamically with JS, while viewing.
01:20 🔗 JAA That said, the Wayback Machine does work around that problem by ignoring some session ID parameters, but again, that's only some common ones.
01:20 🔗 JAA E.g. sid=[0-9a-f]{32} would be handled but something more custom wouldn't.
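The sid=[0-9a-f]{32} handling above can be illustrated with a small URL canonicalizer; the regex is the one from the example, the helper itself is hypothetical. Per "rule 1", this kind of normalization is only for crawl-side dedup and playback lookups; the archived response bytes are never modified.

```python
# Strip a `sid=<32 hex chars>` query parameter so two URLs that differ
# only by session ID are treated as the same capture target.
import re

SID_RE = re.compile(r'(?<=[?&])sid=[0-9a-f]{32}&?')

def canonical(url):
    """Remove the sid parameter, plus any dangling ? or & it leaves behind."""
    return SID_RE.sub('', url).rstrip('?&')
```

More custom session-ID schemes would need their own patterns, which is exactly why only the common ones get handled generically.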
01:21 🔗 ZiNC You could detect the software, then add rules based on that.
01:21 🔗 JAA You could, but wpull is a generic software and shouldn't contain code like that.
01:22 🔗 ZiNC Though generic ones might indeed cover most of it.
01:22 🔗 JAA Can be done with hooks though.
01:22 🔗 JAA Or if that fails, you can directly replace the relevant portions of wpull since it's all Python.
01:22 🔗 ZiNC generic + domain-specific knowledge isn't a bad combination.
01:22 🔗 ZiNC Anything that normalizes or ignores parameter order?
01:23 🔗 JAA Hardcoding domain-specific things in a generic software is bad though. Requires more maintenance, the test code now depends on some random web service, etc.
01:23 🔗 JAA wpull is *already* not getting enough maintenance. Such extra fluff would be outdated by years by this point.
01:24 🔗 ZiNC Not necessarily hardcoding. Could be scripts made easy to modify, with minimal content.
01:24 🔗 JAA Nope, parameter order is preserved because there's no generic way to say if it matters or not. Some web servers require a certain order, and the HTTP spec doesn't allow random reordering.
01:25 🔗 ZiNC In GET params?
01:25 🔗 JAA That's exactly what hook scripts are. Feel free to write one, throw it in some repo, etc. But it doesn't belong in the main wpull code.
01:25 🔗 JAA GET, POST, whatever.
01:25 🔗 ZiNC What software minds the order in ?a&b&c ?
01:26 🔗 JAA I don't know, but I'm certain it exists.
01:26 🔗 ZiNC Maybe. But that'd be odd.
01:26 🔗 JAA The internet is full of odd things.
01:27 🔗 JAA Half the art of web archival is working around that crap.
01:27 🔗 ZiNC For strict ordering I'd expect URL rewriting, or something similar.
01:28 🔗 ZiNC So if IA's depth is always 3,
01:29 🔗 ZiNC does this mean 99% of forum content was never archived?
01:29 🔗 JAA It isn't.
01:29 🔗 JAA That was just one example of why deep recursion might not happen.
01:29 🔗 ZiNC BTW, any advantage to wpull 1.2.x rather than 2.0.x?
01:30 🔗 JAA But there's certainly a lot of forum content out there that hasn't been archived.
01:30 🔗 ZiNC Would you say the majority/large majority?
01:32 🔗 JAA Mostly stability. 1.2.3 is very stable and reliable. 2.0.x has a better hook API, uses asyncio instead of trollius, and has a bunch of other internal nice changes, but it's definitely nowhere near as stable as 1.2.3.
01:32 🔗 ZiNC As long as resuming works...
01:32 🔗 JAA No idea, I never did an audit of what forums exist or how much IA has captured.
01:33 🔗 JAA If you use grab-site, it'll be fine. With the additional plugins etc., that version of wpull is quite stable despite being essentially 2.0.3.
01:33 🔗 JAA Same goes for ArchiveBot.
01:34 🔗 OrIdow6 Speaking from experience of heavily using the WBM, no, forum coverage is terrible
01:34 🔗 JAA But you will eventually run into weird TLS connection hangs and things like that.
01:34 🔗 ZiNC v2-specific?
01:35 🔗 ZiNC OrIdow6: Maybe captured but not yet live? :)
01:36 🔗 JAA Yes, v2.x
01:37 🔗 ZiNC Alright.
01:37 🔗 ZiNC Thanks for the help so far.
01:37 🔗 JAA (And wpull_ludios 3.x, which is the wpull fork grab-site uses.)
01:38 🔗 JAA Er, ludios_wpull*
01:38 🔗 ZiNC :)
01:38 🔗 ZiNC Well, I'd better call it a day.
01:39 🔗 ZiNC Adios.
01:39 🔗 JAA See ya
01:42 🔗 ZiNC has quit IRC ()
02:02 🔗 Ctrl has joined #archiveteam-ot
03:05 🔗 atphoenix has quit IRC (Ping timeout: 276 seconds)
03:06 🔗 atphoenix has joined #archiveteam-ot
03:11 🔗 atphoenix has quit IRC (Read error: Operation timed out)
03:18 🔗 atphoenix has joined #archiveteam-ot
03:22 🔗 atphoeni_ has joined #archiveteam-ot
03:23 🔗 atphoenix has quit IRC (Ping timeout: 276 seconds)
03:23 🔗 atphoeni_ is now known as atphoenix
03:57 🔗 kiska has quit IRC (Ping timeout (120 seconds))
03:57 🔗 SJon__ has quit IRC (Read error: Connection reset by peer)
03:57 🔗 SJon__ has joined #archiveteam-ot
03:59 🔗 kiska has joined #archiveteam-ot
04:05 🔗 qw3rty_ has joined #archiveteam-ot
04:05 🔗 ShellyRol has joined #archiveteam-ot
04:13 🔗 qw3rty has quit IRC (Read error: Operation timed out)
04:36 🔗 wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
04:43 🔗 jake_test has quit IRC (Read error: Connection reset by peer)
04:43 🔗 JAA has quit IRC (Read error: Operation timed out)
04:43 🔗 jake_test has joined #archiveteam-ot
04:45 🔗 Larsenv has quit IRC (ZNC 1.7.5 - https://znc.in)
04:46 🔗 Larsenv has joined #archiveteam-ot
04:47 🔗 JAA has joined #archiveteam-ot
04:47 🔗 AlsoJAA sets mode: +o JAA
04:47 🔗 wp494 has joined #archiveteam-ot
04:56 🔗 dhyan_nat has joined #archiveteam-ot
05:02 🔗 OrIdow6 has quit IRC (Quit: Leaving.)
05:26 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
05:29 🔗 JAA has quit IRC (Read error: Operation timed out)
05:29 🔗 JAA has joined #archiveteam-ot
05:30 🔗 AlsoJAA sets mode: +o JAA
06:16 🔗 OrIdow6 has joined #archiveteam-ot
07:06 🔗 Ctrl has quit IRC (Read error: Operation timed out)
07:12 🔗 Wingy8 has joined #archiveteam-ot
07:13 🔗 Wingy has quit IRC (Read error: Operation timed out)
07:13 🔗 Wingy8 is now known as Wingy
08:45 🔗 dhyan_nat has joined #archiveteam-ot
10:18 🔗 fuzzy802 has joined #archiveteam-ot
10:24 🔗 fuzzy8021 has quit IRC (Read error: Operation timed out)
10:27 🔗 HP_Archiv has quit IRC (Ping timeout: 276 seconds)
10:28 🔗 fuzzy802 is now known as fuzzy8021
10:39 🔗 HP_Archiv has joined #archiveteam-ot
10:57 🔗 BlueMax has quit IRC (Quit: Leaving)
11:07 🔗 qw3rty_ has quit IRC (Read error: Connection reset by peer)
11:10 🔗 qw3rty has joined #archiveteam-ot
13:43 🔗 HP_Archiv has quit IRC (Quit: Leaving)
14:59 🔗 VerifiedJ has joined #archiveteam-ot
15:22 🔗 wp494 has quit IRC (Ping timeout: 610 seconds)
15:23 🔗 wp494 has joined #archiveteam-ot
17:11 🔗 Ctrl has joined #archiveteam-ot
17:16 🔗 HP_Archiv has joined #archiveteam-ot
17:17 🔗 HP_Archiv has quit IRC (Client Quit)
17:45 🔗 jake_test has quit IRC (Read error: Operation timed out)
17:46 🔗 jake_test has joined #archiveteam-ot
18:54 🔗 HP_Archiv has joined #archiveteam-ot
19:45 🔗 jodizzle has quit IRC (Read error: Operation timed out)
19:45 🔗 JAA has quit IRC (Read error: Operation timed out)
19:45 🔗 jodizzle has joined #archiveteam-ot
19:46 🔗 JAA has joined #archiveteam-ot
19:46 🔗 AlsoJAA sets mode: +o JAA
20:04 🔗 girst has quit IRC (Read error: Operation timed out)
20:42 🔗 girst has joined #archiveteam-ot
20:51 🔗 girst_ has joined #archiveteam-ot
20:51 🔗 girst has quit IRC (Read error: Connection reset by peer)
20:51 🔗 girst_ is now known as girst
21:30 🔗 DogsRNice has joined #archiveteam-ot
21:48 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
21:55 🔗 ivan has quit IRC (Quit: Leaving)
22:11 🔗 ivan has joined #archiveteam-ot
22:12 🔗 girst has quit IRC (Read error: Operation timed out)
22:15 🔗 girst has joined #archiveteam-ot
22:40 🔗 girst has quit IRC (Ping timeout: 258 seconds)
22:50 🔗 girst has joined #archiveteam-ot
22:54 🔗 BlueMax has joined #archiveteam-ot
22:56 🔗 BlueMax has quit IRC (Read error: Connection timed out)
22:56 🔗 BlueMax has joined #archiveteam-ot
23:45 🔗 girst has quit IRC (Read error: Operation timed out)
23:52 🔗 girst has joined #archiveteam-ot
