[01:03] *** vectr0n` has quit IRC (Remote host closed the connection)
[03:31] *** Frogging has joined #warrior
[13:09] But here is good as well, at least it won't get buried with Major messages JAA arkiver
[13:09] https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua is a nice example
[13:11] I'll take a look at it in the morning
[13:12] JAA: you here?
[13:13] Yes, I'm here.
[13:13] When data is downloaded, it is passed through get_urls https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L131
[13:14] Right, same callback as in wpull. What's the last parameter, "iri"?
[13:14] the documentation is here https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
[13:15] (I never use iri)
[13:15] the default for almost every project is this, https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L200-L217
[13:15] not some fancy HTML or script parsing, but trying to extract as many URLs as possible
[13:16] Is % the regex escape char in Lua?
[13:16] yes
[13:16] Ew
[13:16] :-)
[13:16] and unlike regex you have to % a -
[13:17] Ah yeah, it's not real regex, is it?
[13:17] no, pattern matching
[13:17] Right
[13:17] it has no (?:www\.)?
[13:17] but I use https?://[^/]*google.com/ instead, which works pretty well, but will also match something other than www.google.com
[13:18] Yeah, like evilgoogle.com
[13:18] for example
[13:18] yep
[13:19] Every possible URL discovered in https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L200-L217 is checked and completed in checknewurl and checknewshorturl
[13:19] full URLs are then passed to https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L137-L147
[13:19] there the # is removed, and &amp; is replaced by &
[13:20] (testtoken stuff is wikispaces only, ignore that)
[13:20] so for every discovered URL we check if it was already downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L141
[13:20] and if we want to archive it https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L142
[13:20] that happens in the `allowed` function https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L34
[13:23] Ack
[13:23] so what basically happens is we extract as many URLs as possible and select the ones we want to archive
[13:23] arkiver: Sorry, need to leave for a bit, I'll be back in half an hour or so probably.
[13:23] note that most discovered URLs will be script or css garbage, etc., but those are then filtered out in the `allowed` function
[13:23] JAA: ok
[13:24] ping me when you're back please
[14:18] arkiver: Back
[14:32] ok
[14:33] besides extracting URLs with get_urls, wget also extracts URLs. Each URL extracted by wget is passed through https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L112
[14:34] where it is checked for having been downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L120 and for being allowed https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L121
[14:35] also note the `html == 0` in download_child_p https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L121
[14:36] What are the /file/history/ and /page/history/ checks doing there?
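
For reference, here is a minimal sketch of the flow described above, assuming the wget-lua hooks from the wiki arkiver links: a get_urls callback extracts candidate URLs from the downloaded body with Lua patterns, and only URLs that are neither already downloaded nor rejected by `allowed` get queued. The `downloaded` table, `allowed` filter and `read_file` helper follow the conventions of the wikispaces script, but example.com, the pattern, and the helper bodies are placeholders, not the real project code.

    -- Simplified sketch (not the actual wikispaces.lua): pull every plausible
    -- URL out of the downloaded body with Lua patterns, then queue the ones
    -- that are new and allowed. example.com stands in for the real project site.
    local downloaded = {}   -- URLs already queued or fetched

    local function read_file(file)
      if not file then
        return ""
      end
      local f = assert(io.open(file))
      local data = f:read("*all")
      f:close()
      return data
    end

    local function allowed(url)
      -- project-specific filter; note that Lua patterns escape with % (%.),
      -- and a literal - would have to be written %-
      return string.match(url, "^https?://[^/]*example%.com/") ~= nil
    end

    wget.callbacks.get_urls = function(file, url, is_css, iri)
      local urls = {}
      local body = read_file(file)
      -- no real HTML parsing, just grab everything that looks like a URL
      for newurl in string.gmatch(body, "(https?://[^\"'<> ]+)") do
        newurl = string.match(newurl, "^([^#]*)")     -- drop the #fragment
        newurl = string.gsub(newurl, "&amp;", "&")    -- undo HTML entity encoding
        if not downloaded[newurl] and allowed(newurl) then
          table.insert(urls, { url=newurl })
          downloaded[newurl] = true
        end
      end
      return urls
    end
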
[14:36] html == 0 is for example so external images are archived, even if they are not allowed by the `allowed` function
[14:36] Those two history pages are here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L122-L123 and here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L182-L183
[14:37] They are not blocked in the `allowed` function, because we want to archive them, but we don't want to archive new URLs found on them, so we ignore them in those places
[14:38] Oh, right, didn't notice that it was checking the *parent* (i.e. the already retrieved) URL.
[14:38] yep
[14:40] so as you can see pretty much any URL passes through `allowed`. So if we want to extract a list of for example users, we can do that in `allowed` before the URL is ignored https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L46-L49
[14:40] and this is custom stuff for wikispaces https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L185-L199
[14:42] httploop_result can check status codes and return ABORT, EXIT, CONTINUE or NOTHING
[14:43] To be able to trigger an abort outside of httploop_result, the variable `abortgrab` exists, which is checked in httploop_result and finally here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L292-L297
[14:43] And we have a list of ignored URLs, https://github.com/ArchiveTeam/wikispaces-grab/blob/master/ignore-list
[14:43] those are loaded and set to downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L19-L21
[14:47] arkiver: What's this bit in the retry code? https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L261-L265
[14:49] after 5 tries https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L257 for a set of status codes https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L250-L252
[14:49] and remember we can archive URLs with html == 0 here https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L121 that are just external images on for example a forum
[14:51] so after 5 tries, if a URL is `allowed` (we want to have it / it is important) we abort the crawl (the item stays in the tracker's out queue); if it is not an important URL (just an external forum image for example) we `wget.actions.EXIT`, i.e. skip the URL.
[14:51] and those unimportant URLs were for example downloaded because of the html == 0 rule in download_child_p
[14:55] Oh, EXIT means "skip". Well, that's intuitive...
[14:57] haha yeah
[14:57] https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
[14:57] wget.actions.NOTHING: follow the normal Wget procedure for this result.
[14:57] wget.actions.CONTINUE: retry this URL.
[14:57] wget.actions.EXIT: finish this URL (ignore any error).
[14:57] wget.actions.ABORT: Wget will abort() and exit immediately.
[15:00] So the whole script may seem a little big and complicated at times, but it's grown into this over years of warrior projects, and works pretty well with most projects
[15:00] https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L273-L277 This is useless in this case, correct? I assume that's used for rate limiting?
[15:01] That is used for sleeping between archiving URLs. It's kind of something that's always been there; I didn't do much with it
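
Again for reference, a rough sketch of the httploop_result behaviour discussed above: retry a set of status codes, after 5 tries abort the whole item if the URL is `allowed` or skip it with EXIT otherwise, honour the `abortgrab` flag, and optionally sleep between URLs. The status codes, back-off, and `allowed` stub here are illustrative assumptions; the real wikispaces code linked above differs in detail.

    -- Rough sketch of the retry/abort/sleep logic described above; the exact
    -- status codes and back-off values are illustrative, not the real ones.
    local tries = 0
    local abortgrab = false
    local sleep_time = 0   -- set this to pause between archived URLs

    -- stand-in for the project-specific `allowed` URL filter discussed earlier
    local function allowed(url)
      return string.match(url, "^https?://[^/]*example%.com/") ~= nil
    end

    wget.callbacks.httploop_result = function(url, err, http_stat)
      local status_code = http_stat["statcode"]

      if abortgrab then
        return wget.actions.ABORT
      end

      if status_code >= 500 or status_code == 403 or status_code == 429 then
        tries = tries + 1
        if tries >= 5 then
          tries = 0
          if allowed(url["url"]) then
            -- an important URL keeps failing: abort, the item gets retried later
            return wget.actions.ABORT
          else
            -- e.g. an external image pulled in via the html == 0 rule: skip it
            return wget.actions.EXIT
          end
        end
        os.execute("sleep " .. tries)   -- crude back-off before retrying
        return wget.actions.CONTINUE
      end

      if sleep_time > 0.001 then
        os.execute("sleep " .. sleep_time)   -- optional delay between URLs
      end

      tries = 0
      return wget.actions.NOTHING
    end
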
[15:01] set `sleep_time` to something and it will sleep for that long between URLs
[15:01] Right
[15:02] This is a much simpler example https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua
[15:02] The list of discovered users gets uploaded as a separate file to the normal rsync target, correct?
[15:02] the only project-specific thing in the YTMND lua script is https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua#L36-L44 and https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua#L115-L124
[15:03] ah well, normally you would use the YTMND script example, this one does not do discovery
[15:04] but if you want to do discovery, make sure to extract data (for example in `allowed`), write it to the file at the end https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L282-L290 and make sure you create and handle the *_data.txt files in https://github.com/ArchiveTeam/wikispaces-grab/blob/master/pipeline.py (search for all lines with _data.txt)
[15:06] vidme is also a nice example https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua, it has a lot of custom extraction https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L169-L225 and does discovery, but has a small `allowed` function
[15:07] you'll see plenty of nice examples when you go through the lua scripts of different (recent) warrior projects
[15:09] Right, I'll do that.
[15:09] vid.me also loads a json file, https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L198, in that case make sure you have the JSON.lua file and load it https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L3 and have the function to load a file https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L27-L33
[15:10] JAA: let me know if you have any questions :) and if you need me to create a tracker for a project
[15:13] Don't worry, I will. :-)
[15:14] I hope this was all a little clear, I now see it's quite a big bunch of github URLs
[15:16] It was pretty clear. I'm trying to understand the flow of information at the moment.
[15:17] The environment variables at the top of the script (item_* and warc_file_base) are set by seesaw, I assume?
[15:18] Ah no, found it in pipeline.py.
[15:18] yeah, was just about to write that
[15:18] https://github.com/ArchiveTeam/wikispaces-grab/blob/master/pipeline.py#L248-L253 for the record
[15:19] I feel like I asked you this before, but how do you test before launching a project? Do you just use the tracker directly?
[15:19] you can also run the tracker locally, but I just create a project on tracker.archiveteam.org and use that to test
[15:19] I find that much easier
[15:19] Right.
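
As a footnote to the vid.me JSON discussion above, a minimal sketch of what arkiver points at: loading the bundled JSON.lua module and a helper that reads a downloaded file from disk before decoding it. It mirrors the vidme script helpers only in spirit; treat the details as illustrative rather than exact.

    -- Minimal sketch of the vid.me-style JSON handling mentioned above: load
    -- the bundled JSON.lua module and slurp a downloaded file from disk.
    local JSON = (loadfile "JSON.lua")()

    local function read_file(file)
      if file then
        local f = assert(io.open(file))
        local data = f:read("*all")
        f:close()
        return data
      else
        return ""
      end
    end

    -- inside get_urls, the response body on disk can then be parsed, e.g.:
    -- local data = JSON:decode(read_file(file))
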