[01:03] *** vectr0n` has quit IRC (Remote host closed the connection)
[03:31] *** Frogging has joined #warrior
[13:09] But here is good as well, at least it won't get buried with Major messages JAA arkiver
[13:09] https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua is a nice example
[13:11] I'll take a look at it in the morning
[13:12] JAA: you here?
[13:13] Yes, I'm here.
[13:13] When data is downloaded, it is passed through get_urls https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L131
[13:14] Right, same callback as in wpull. What's the last parameter, "iri"?
[13:14] the documentation is here https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
[13:15] (I never use iri)
[13:15] the default for almost every project is this, https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L200-L217
[13:15] not some fancy HTML or script parsing, but trying to extract as many URLs as possible
[13:16] Is % the regex escape char in Lua?
[13:16] yes
[13:16] Ew
[13:16] :-)
[13:16] and unlike regex you have to % a -
[13:17] Ah yeah, it's not real regex, is it?
[13:17] no, pattern matching
[13:17] Right
[13:17] it has no (?:www\.)?
[13:17] but I use https?://[^/]*google.com/ instead, which works pretty well, but will also match something other than www.google.com
[13:18] Yeah, like evilgoogle.com
[13:18] for example
[13:18] yep
[13:19] Every possible URL discovered in https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L200-L217 is checked and completed in checknewurl and checknewshorturl
[13:19] full URLs are then passed to https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L137-L147
[13:19] there the # is removed, and &amp; is replaced by &
[13:20] (testtoken stuff is wikispaces only, ignore that)
[13:20] so for every discovered URL we check if it was already downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L141
[13:20] and if we want to archive it https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L142
[13:20] that happens in the `allowed` function https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L34
[13:23] Ack
[13:23] so what basically happens is we extract as many URLs as possible and select the ones we want to archive
[13:23] arkiver: Sorry, need to leave for a bit, I'll be back in half an hour or so probably.
[13:23] note that most discovered URLs will be script or css garbage, etc., but those are then filtered out in the `allowed` function
[13:23] JAA: ok
[13:24] ping me when you're back please
[14:18] arkiver: Back
[14:32] ok
[14:33] besides extracting URLs with get_urls, wget also extracts URLs. Each URL extracted by wget is passed through https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L112
[14:34] where it is checked for having been downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L120 and for being allowed https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L121
[14:35] also note the `html == 0` in download_child_p https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L121
[14:36] What are the /file/history/ and /page/history/ checks doing there?
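
For reference, here is a minimal sketch of the flow described above, assuming the wget-lua hooks from the wiki arkiver links: a get_urls callback extracts candidate URLs from the downloaded body with Lua patterns, and only URLs that are neither already downloaded nor rejected by `allowed` get queued. The `downloaded` table, `allowed` filter and `read_file` helper follow the conventions of the wikispaces script, but example.com, the pattern, and the helper bodies are placeholders, not the real project code.

    -- Simplified sketch (not the actual wikispaces.lua): pull every plausible
    -- URL out of the downloaded body with Lua patterns, then queue the ones
    -- that are new and allowed. example.com stands in for the real project site.
    local downloaded = {}   -- URLs already queued or fetched

    local function read_file(file)
      if not file then
        return ""
      end
      local f = assert(io.open(file))
      local data = f:read("*all")
      f:close()
      return data
    end

    local function allowed(url)
      -- project-specific filter; note that Lua patterns escape with % (%.),
      -- and a literal - would have to be written %-
      return string.match(url, "^https?://[^/]*example%.com/") ~= nil
    end

    wget.callbacks.get_urls = function(file, url, is_css, iri)
      local urls = {}
      local body = read_file(file)
      -- no real HTML parsing, just grab everything that looks like a URL
      for newurl in string.gmatch(body, "(https?://[^\"'<> ]+)") do
        newurl = string.match(newurl, "^([^#]*)")     -- drop the #fragment
        newurl = string.gsub(newurl, "&amp;", "&")    -- undo HTML entity encoding
        if not downloaded[newurl] and allowed(newurl) then
          table.insert(urls, { url=newurl })
          downloaded[newurl] = true
        end
      end
      return urls
    end
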
[14:36] html == 0 is for example so external images are archived, even if they are not allowed by the `allowed` function
[14:36] Those two history pages are here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L122-L123 and here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L182-L183
[14:37] They are not blocked in the `allowed` function, because we want to archive them, but we don't want to archive new URLs found on them, so we ignore them in those places
[14:38] Oh, right, didn't notice that it was checking the *parent* (i.e. the already retrieved) URL.
[14:38] yep
[14:40] so as you can see pretty much any URL passes through `allowed`. So if we want to extract a list of for example users, we can do that in `allowed` before the URL is ignored https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L46-L49
[14:40] and this is custom stuff for wikispaces https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L185-L199
[14:42] httploop_result can check status codes and return ABORT, EXIT, CONTINUE or NOTHING
[14:43] To be able to trigger an abort outside of httploop_result, the variable `abortgrab` exists, which is checked in httploop_result and finally here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L292-L297
[14:43] And we have a list of ignored URLs, https://github.com/ArchiveTeam/wikispaces-grab/blob/master/ignore-list
[14:43] those are loaded and set to downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L19-L21
[14:47] arkiver: What's this bit in the retry code? https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L261-L265
[14:49] after 5 tries https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L257 for a set of status codes https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L250-L252
[14:49] and remember we can archive URLs with html == 0 here https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L121 that are just external images on for example a forum
[14:51] so after 5 tries, if a URL is `allowed` (we want to have it / it is important) we abort the crawl (the item stays in the tracker's out queue); if it is not an important URL (just an external forum image for example) we `wget.actions.EXIT`, i.e. skip the URL.
[14:51] and those unimportant URLs were for example downloaded because of the html == 0 rule in download_child_p
[14:55] Oh, EXIT means "skip". Well, that's intuitive...
[14:57] haha yeah
[14:57] https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
[14:57] wget.actions.NOTHING: follow the normal Wget procedure for this result.
[14:57] wget.actions.CONTINUE: retry this URL.
[14:57] wget.actions.EXIT: finish this URL (ignore any error).
[14:57] wget.actions.ABORT: Wget will abort() and exit immediately.
[15:00] So the whole script may seem a little big and complicated at times, but it's grown into this over years of warrior projects, and works pretty well with most projects
[15:00] https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L273-L277 This is useless in this case, correct? I assume that's used for rate limiting?
[15:01] That is used for sleeping between archiving URLs. It's kind of something that's always been there; I didn't do much with it
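
Again for reference, a rough sketch of the httploop_result behaviour discussed above: retry a set of status codes, after 5 tries abort the whole item if the URL is `allowed` or skip it with EXIT otherwise, honour the `abortgrab` flag, and optionally sleep between URLs. The status codes, back-off, and `allowed` stub here are illustrative assumptions; the real wikispaces code linked above differs in detail.

    -- Rough sketch of the retry/abort/sleep logic described above; the exact
    -- status codes and back-off values are illustrative, not the real ones.
    local tries = 0
    local abortgrab = false
    local sleep_time = 0   -- set this to pause between archived URLs

    -- stand-in for the project-specific `allowed` URL filter discussed earlier
    local function allowed(url)
      return string.match(url, "^https?://[^/]*example%.com/") ~= nil
    end

    wget.callbacks.httploop_result = function(url, err, http_stat)
      local status_code = http_stat["statcode"]

      if abortgrab then
        return wget.actions.ABORT
      end

      if status_code >= 500 or status_code == 403 or status_code == 429 then
        tries = tries + 1
        if tries >= 5 then
          tries = 0
          if allowed(url["url"]) then
            -- an important URL keeps failing: abort, the item gets retried later
            return wget.actions.ABORT
          else
            -- e.g. an external image pulled in via the html == 0 rule: skip it
            return wget.actions.EXIT
          end
        end
        os.execute("sleep " .. tries)   -- crude back-off before retrying
        return wget.actions.CONTINUE
      end

      if sleep_time > 0.001 then
        os.execute("sleep " .. sleep_time)   -- optional delay between URLs
      end

      tries = 0
      return wget.actions.NOTHING
    end
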
[15:01] set `sleep_time` to something and it will sleep for that long between URLs
[15:01] Right
[15:02] This is a much simpler example https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua
[15:02] The list of discovered users gets uploaded as a separate file to the normal rsync target, correct?
[15:02] the only project-specific thing in the YTMND lua script is https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua#L36-L44 and https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua#L115-L124
[15:03] ah well, normally you would use the YTMND script example, this one does not do discovery
[15:04] but if you want to do discovery, make sure to extract data (for example in `allowed`), write it to the file at the end https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L282-L290 and make sure you create and handle the *_data.txt files in https://github.com/ArchiveTeam/wikispaces-grab/blob/master/pipeline.py (search for all lines with _data.txt)
[15:06] vidme is also a nice example https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua, it has a lot of custom extraction https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L169-L225 and does discovery, but has a small `allowed` function
[15:07] you'll see plenty of nice examples when you go through the lua scripts of different (recent) warrior projects
[15:09] Right, I'll do that.
[15:09] vid.me also loads a json file, https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L198, in that case make sure you have the JSON.lua file and load it https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L3 and have the function to load a file https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L27-L33
[15:10] JAA: let me know if you have any questions :) and if you need me to create a tracker for a project
[15:13] Don't worry, I will. :-)
[15:14] I hope this was all a little clear, I now see it's quite a big bunch of github URLs
[15:16] It was pretty clear. I'm trying to understand the flow of information at the moment.
[15:17] The environment variables at the top of the script (item_* and warc_file_base) are set by seesaw, I assume?
[15:18] Ah no, found it in pipeline.py.
[15:18] yeah, was just about to write that
[15:18] https://github.com/ArchiveTeam/wikispaces-grab/blob/master/pipeline.py#L248-L253 for the record
[15:19] I feel like I asked you this before, but how do you test before launching a project? Do you just use the tracker directly?
[15:19] you can also run the tracker locally, but I just create a project on tracker.archiveteam.org and use that to test
[15:19] I find that much easier
[15:19] Right.
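
As a footnote to the vid.me JSON discussion above, a minimal sketch of what arkiver points at: loading the bundled JSON.lua module and a helper that reads a downloaded file from disk before decoding it. It mirrors the vidme script helpers only in spirit; treat the details as illustrative rather than exact.

    -- Minimal sketch of the vid.me-style JSON handling mentioned above: load
    -- the bundled JSON.lua module and slurp a downloaded file from disk.
    local JSON = (loadfile "JSON.lua")()

    local function read_file(file)
      if file then
        local f = assert(io.open(file))
        local data = f:read("*all")
        f:close()
        return data
      else
        return ""
      end
    end

    -- inside get_urls, the response body on disk can then be parsed, e.g.:
    -- local data = JSON:decode(read_file(file))
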