#warrior 2018-07-19,Thu


Time Nickname Message
01:03 vectr0n` has quit IRC (Remote host closed the connection)
03:31 Frogging has joined #warrior
13:09 kiska But here is good as well, at least it won't get buried with Major messages JAA arkiver
13:09 arkiver https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua is a nice example
13:11 kiska I'll take a look at it in the morning
13:12 arkiver JAA: you here?
13:13 JAA Yes, I'm here.
13:13 arkiver When data is downloaded, it is passed through get_urls https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L131
13:14 JAA Right, same callback as in wpull. What's the last parameter, "iri"?
13:14 arkiver the documentation is here https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
13:15 arkiver (I never use iri)
13:15 arkiver default for almost every project is this, https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L200-L217
13:15 arkiver not some fancy HTML or script parsing, but trying to extract as many URLs as possible
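A minimal sketch of that brute-force extraction pass; the helper names and the exact patterns here are illustrative, not the wikispaces-grab code from the linked lines:

```lua
-- sketch of the generic extraction described above: run a few loose patterns
-- over the page body and hand everything URL-like to a checker
local function extract_from_html(html, check)
  -- anything that looks like an absolute URL inside double or single quotes
  for newurl in string.gmatch(html, '"(https?://[^"]+)"') do
    check(newurl)
  end
  for newurl in string.gmatch(html, "'(https?://[^']+)'") do
    check(newurl)
  end
  -- href values, which may be relative; checknewurl-style code completes them
  for newurl in string.gmatch(html, 'href="([^"]+)"') do
    check(newurl)
  end
end

extract_from_html('<a href="/page/1">one</a> <img src="https://example.com/x.png">',
                  print)
--> https://example.com/x.png
--> /page/1
```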
13:16 JAA Is % the regex escape char in Lua?
13:16 arkiver yes
13:16 JAA Ew
13:16 JAA :-)
13:16 arkiver and unlike regex you have to % a -
13:17 JAA Ah yeah, it's not real regex, is it?
13:17 arkiver no, pattern matching
13:17 JAA Right
13:17 arkiver it has no (?:www\.)?
13:17 arkiver but I use https?://[^/]*google.com/ instead, which works pretty well, but will also match something other than www.google.com
13:18 JAA Yeah, like evilgoogle.com
13:18 arkiver for example
13:18 arkiver yep
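A quick illustration of those Lua pattern points; the URLs are made up:

```lua
-- Lua patterns, not regex: % is the escape character, and an unescaped -
-- is a (lazy) quantifier, so a literal dash must be written %-
print(string.match("https://www.google.com/search", "https?://[^/]*google%.com/"))
--> https://www.google.com/

-- the loose host part also matches unrelated domains, as noted above:
print(string.match("https://evilgoogle.com/", "https?://[^/]*google%.com/"))
--> https://evilgoogle.com/

-- literal dash: "foo%-bar" matches, while the pattern "foo-bar" would not
print(string.match("foo-bar", "foo%-bar"))
--> foo-bar
```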
13:19 arkiver Every possible URL discovered in https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L200-L217 is checked and completed in checknewurl and checknewshorturl
13:19 arkiver full URLs are then passed to https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L137-L147
13:19 arkiver there the # is removed, and &amp; is replaced by &
13:20 arkiver (testtoken stuff is wikispaces only, ignore that)
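A sketch of that normalization step; the helper name is made up, and the `{url=...}` table format is the one the wget-lua wiki describes for get_urls results:

```lua
-- sketch of the normalization described above: drop the fragment, decode
-- HTML-escaped ampersands, and queue the result
local function queue_url(urls, url)
  url = string.match(url, "^([^#]*)")   -- remove the # fragment
  url = string.gsub(url, "&amp;", "&")  -- &amp; is replaced by &
  table.insert(urls, { url=url })       -- get_urls returns tables of {url=...}
end

local urls = {}
queue_url(urls, "https://example.com/page?a=1&amp;b=2#section")
print(urls[1].url)
--> https://example.com/page?a=1&b=2
```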
13:20 arkiver so for every discovered URL we check if it was already downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L141
13:20 arkiver and if we want to archive it https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L142
13:20 arkiver that happens in the `allowed` function https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L34
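The shape of that dedup-and-filter check, as a sketch; the `downloaded` set and this particular `allowed` body are assumptions for illustration:

```lua
-- sketch of the check described above: skip URLs we already have, and ask
-- `allowed` whether we want the rest (this allowed body is made up)
local downloaded = {}

local function allowed(url)
  return string.match(url, "^https?://[^/]*example%.com/") ~= nil
end

local function checknewurl(urls, url)
  if downloaded[url] ~= true and allowed(url) then
    table.insert(urls, { url=url })
    downloaded[url] = true
  end
end

local urls = {}
checknewurl(urls, "https://www.example.com/page")
checknewurl(urls, "https://www.example.com/page")  -- already seen: skipped
checknewurl(urls, "https://cdn.other.net/app.js")  -- not allowed: skipped
print(#urls)
--> 1
```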
13:23 JAA Ack
13:23 arkiver so what basically happens is we extract as many URLs as possible and select the ones we want to archive
13:23 JAA arkiver: Sorry, need to leave for a bit, I'll be back in half an hour or so probably.
13:23 arkiver note that most discovered URLs will be script or css garbage, etc., but those are then filtered out in the `allowed` function
13:23 arkiver JAA: ok
13:24 arkiver ping me when you're back please
14:18 JAA arkiver: Back
14:32 arkiver ok
14:33 arkiver besides extracting URLs with get_urls, wget also extracts URLs. Each URL extracted by wget is passed through https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L112
14:34 arkiver where it is checked for being downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L120 and if it is allowed https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L121
14:35 arkiver also note the `html == 0` in download_child_p https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L121
14:36 JAA What are the /file/history/ and /page/history/ checks doing there?
14:36 arkiver html == 0 is for example so external images are archived, even if they are not allowed by the `allowed` function
14:36 arkiver Those two history pages are here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L122-L123 and here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L182-L183
14:37 arkiver They are not blocked in the `allowed` function, because we want to archive them, but we don't want to archive new URLs found on them, so we ignore them in those places
14:38 JAA Oh, right, didn't notice that it was checking the *parent* (i.e. the already retrieved) URL.
14:38 arkiver yep
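A sketch of that download_child_p shape; the field names (urlpos["url"]["url"], link_expect_html, parent["url"]) follow the wget-lua wiki and should be treated as assumptions, and `allowed` is the illustrative helper from earlier:

```lua
-- sketch of the download_child_p logic described above
wget.callbacks.download_child_p = function(urlpos, parent, depth,
                                           start_url_parsed, iri, verdict, reason)
  local url  = urlpos["url"]["url"]
  local html = urlpos["link_expect_html"]  -- 0 when the child is not expected to be HTML

  -- the check is on the *parent* (already retrieved) URL: the history pages
  -- themselves are archived, but nothing new is queued from them
  if string.match(parent["url"], "/file/history/")
    or string.match(parent["url"], "/page/history/") then
    return false
  end

  -- html == 0 lets e.g. external images through even when `allowed` says no
  return allowed(url) or html == 0
end
```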
14:40 arkiver so as you can see pretty much any URL passes through `allowed`. So if we want to extract, for example, a list of users, we can do that in `allowed` before the URL is ignored https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L46-L49
14:40 arkiver and this is custom stuff for wikispaces https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L185-L199
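A sketch of that side-effect trick, harvesting data inside `allowed` even when the URL itself is rejected; the URL shape and pattern are made up:

```lua
-- sketch of discovery inside `allowed`, as described above: record users we
-- see before deciding whether to crawl the URL
local discovered_users = {}

local function allowed(url)
  local user = string.match(url, "^https?://[^/]*example%.com/user/([^/%?]+)")
  if user then
    discovered_users[user] = true  -- harvested even if the URL is rejected
  end
  return string.match(url, "^https?://[^/]*example%.com/") ~= nil
end

allowed("https://example.com/user/alice?tab=wikis")
for user, _ in pairs(discovered_users) do
  print(user)
end
--> alice
```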
14:42 arkiver httploop_result can check status codes and return ABORT, EXIT, CONTINUE or NOTHING
14:43 arkiver To be able to trigger an abort outside of httploop_result, the variable `abortgrab` exists, which is checked in httploop_result and finally here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L292-L297
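A sketch of that abortgrab flag; the before_exit hook and wget.exits.IO_FAIL are taken from the wget-lua wiki and common ArchiveTeam scripts, so treat them as assumptions:

```lua
-- sketch of the abortgrab mechanism described above: a flag any code can set,
-- checked on every response and once more before wget exits
local abortgrab = false

wget.callbacks.httploop_result = function(url, err, http_stat)
  if abortgrab then
    return wget.actions.ABORT
  end
  return wget.actions.NOTHING
end

wget.callbacks.before_exit = function(exit_status, exit_status_string)
  if abortgrab then
    return wget.exits.IO_FAIL  -- force a nonzero exit so the item is not marked done
  end
  return exit_status
end
```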
14:43 arkiver And we have a list of ignored URLs, https://github.com/ArchiveTeam/wikispaces-grab/blob/master/ignore-list
14:43 arkiver those are loaded and set to downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L19-L21
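Pre-seeding the `downloaded` set from that file looks roughly like this, assuming one URL per line:

```lua
-- sketch of loading the ignore-list, as described: every listed URL is
-- treated as already downloaded, so it is never queued
-- (assumes the ignore-list file ships next to the script)
local downloaded = {}

for ignore in io.open("ignore-list", "r"):lines() do
  downloaded[ignore] = true
end
```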
14:47 JAA arkiver: What's this bit in the retry code? https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L261-L265
14:49 arkiver after 5 tries https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L257 for a set of status codes https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L250-L252
14:49 arkiver and remember we can archive URLs with html == 0 here https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L121 that are, for example, just external images on a forum
14:51 arkiver so after 5 tries, if a URL is `allowed` (we want to have it / it is important) we abort the crawl (the item stays out on the tracker); if it is not an important URL (just an external forum image for example) we `wget.actions.EXIT`, i.e. skip the URL.
14:51 arkiver and those unimportant URLs were for example downloaded because of the html == 0 rule in download_child_p
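A sketch of that retry policy; the status codes, counter, backoff, and the url["url"] field are illustrative assumptions, not the exact wikispaces-grab code, and `allowed` is the helper sketched earlier:

```lua
-- sketch of the retry logic described above
local tries = 0
local maxtries = 5

wget.callbacks.httploop_result = function(url, err, http_stat)
  local status_code = http_stat["statcode"]

  if status_code >= 500 or status_code == 403 then
    tries = tries + 1
    if tries >= maxtries then
      tries = 0
      if allowed(url["url"]) then
        return wget.actions.ABORT  -- important URL: kill the crawl
      end
      return wget.actions.EXIT     -- unimportant (e.g. external image): skip it
    end
    os.execute("sleep " .. tries)  -- back off a little before retrying
    return wget.actions.CONTINUE   -- retry this URL
  end

  tries = 0
  return wget.actions.NOTHING
end
```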
14:55 JAA Oh, EXIT means "skip". Well, that's intuitive...
14:57 arkiver haha yeah
14:57 arkiver https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
14:57 arkiver wget.actions.NOTHING: follow the normal Wget procedure for this result.
14:57 arkiver wget.actions.CONTINUE: retry this URL.
14:57 arkiver wget.actions.EXIT: finish this URL (ignore any error).
14:57 arkiver wget.actions.ABORT: Wget will abort() and exit immediately.
15:00 arkiver So the whole script may seem a little big and complicated at times, but it's grown into this over years of warrior projects, and works pretty well with most projects
15:00 JAA https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L273-L277 This is useless in this case, correct? I assume that's used for rate limiting?
15:01 arkiver That is used for sleeping between archiving URLs. It's kind of something that's always been there; I didn't do much with it
15:01 arkiver set `sleep_time` to something and it will sleep for that long between URLs
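That knob is just a shell sleep between responses; a minimal sketch:

```lua
-- sketch of the sleep_time knob described above, as it would sit inside
-- httploop_result: 0 disables the delay, which is the usual setting
local sleep_time = 0  -- seconds to wait between URLs

if sleep_time > 0 then
  os.execute("sleep " .. sleep_time)
end
```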
15:01 JAA Right
15:02 arkiver This is a much simpler example https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua
15:02 JAA The list of discovered users gets uploaded as a separate file to the normal rsync target, correct?
15:02 arkiver the only project-specific things in the YTMND lua script are https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua#L36-L44 and https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua#L115-L124
15:03 arkiver ah well, normally you would use the YTMND script as the example, but it does not do discovery
15:04 arkiver but if you want to do discovery, make sure to extract data (for example in `allowed`), write it to the file at the end https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L282-L290 and make sure you create and handle the *_data.txt files in https://github.com/ArchiveTeam/wikispaces-grab/blob/master/pipeline.py (search for all lines with _data.txt)
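Writing the discovered data out at the end could look like this; the finish hook signature follows the wget-lua wiki, and the `discovered` set and the environment-variable hand-off are assumptions based on the discussion further down:

```lua
-- sketch of writing discovery results at the end of the grab; `discovered`
-- is assumed to be filled in `allowed`, env vars come from pipeline.py
local discovered = {}

wget.callbacks.finish = function(start_time, end_time, wall_time, numurls,
                                 total_downloaded_bytes, total_download_time)
  local file = io.open(os.getenv("item_dir") .. "/"
                       .. os.getenv("warc_file_base") .. "_data.txt", "w")
  for item, _ in pairs(discovered) do
    file:write(item .. "\n")
  end
  file:close()
end
```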
15:06 arkiver vidme is also a nice example https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua, has a lot of custom extraction https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L169-L225 and does discovery, but has a small `allowed` function
15:07 arkiver you'll see plenty of nice examples when you go through the lua scripts of different (recent) warrior projects
15:09 JAA Right, I'll do that.
15:09 arkiver vid.me also loads a JSON file, https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L198, in that case make sure you have the JSON.lua file and load it https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L3 and have the function to load a file https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L27-L33
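The JSON pattern is small; a sketch assuming the JSON.lua module bundled with these grab scripts (Jeffrey Friedl's) and a hypothetical file name:

```lua
-- sketch of the JSON-loading pattern described above; "response.json" is a
-- made-up file name standing in for a fetched API response
local JSON = (loadfile "JSON.lua")()

local function read_file(file)
  local f = io.open(file)
  if not f then
    return ""
  end
  local data = f:read("*all")
  f:close()
  return data
end

local data = JSON:decode(read_file("response.json"))
```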
15:10 arkiver JAA: let me know if you have any questions :) and if you need me to create a tracker for a project
15:13 JAA Don't worry, I will. :-)
15:14 arkiver I hope this was all a little clear; I now see what a big bunch of GitHub URLs that was
15:16 JAA It was pretty clear. I'm trying to understand the flow of information at the moment.
15:17 JAA The environment variables at the top of the script (item_* and warc_file_base) are set by seesaw, I assume?
15:18 JAA Ah no, found it in pipeline.py.
15:18 arkiver yeah, was just about to write that
15:18 arkiver https://github.com/ArchiveTeam/wikispaces-grab/blob/master/pipeline.py#L248-L253 for the record
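On the Lua side, picking those variables up is just os.getenv; a sketch using the names mentioned above (the exact item_* names vary by project):

```lua
-- sketch: pipeline.py exports these into wget's environment (see the link
-- above) and the Lua script reads them back at the top
local item_type      = os.getenv("item_type")
local item_name      = os.getenv("item_name")
local item_dir       = os.getenv("item_dir")
local warc_file_base = os.getenv("warc_file_base")
```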
15:19 JAA I feel like I asked you this before, but how do you test before launching a project? Do you just use the tracker directly?
15:19 arkiver you can also run the tracker locally, but I just create a project on tracker.archiveteam.org and use that to test
15:19 arkiver I find that much easier
15:19 JAA Right.
