#warrior 2018-07-19,Thu


Time Nickname Message
01:03 vectr0n` has quit IRC (Remote host closed the connection)
03:31 Frogging has joined #warrior
13:09 kiska But here is good as well, at least it won't get buried with Major messages JAA arkiver
13:09 arkiver https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua is a nice example
13:11 kiska I'll take a look at it in the morning
13:12 arkiver JAA: you here?
13:13 JAA Yes, I'm here.
13:13 arkiver When data is downloaded, it is passed through get_urls https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L131
13:14 JAA Right, same callback as in wpull. What's the last parameter, "iri"?
13:14 arkiver the documentation is here https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
13:15 arkiver (I never use iri)
13:15 arkiver default for almost every project is this, https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L200-L217
13:15 arkiver not some fancy HTML or script parsing, but trying to extract as many URLs as possible
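A minimal sketch of that brute-force extraction pass; the helper names and the exact patterns here are illustrative, not the wikispaces-grab code from the linked lines:

```lua
-- sketch of the generic extraction described above: run a few loose patterns
-- over the page body and hand everything URL-like to a checker
local function extract_from_html(html, check)
  -- anything that looks like an absolute URL inside double or single quotes
  for newurl in string.gmatch(html, '"(https?://[^"]+)"') do
    check(newurl)
  end
  for newurl in string.gmatch(html, "'(https?://[^']+)'") do
    check(newurl)
  end
  -- href values, which may be relative; checknewurl-style code completes them
  for newurl in string.gmatch(html, 'href="([^"]+)"') do
    check(newurl)
  end
end

extract_from_html('<a href="/page/1">one</a> <img src="https://example.com/x.png">',
                  print)
--> https://example.com/x.png
--> /page/1
```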
13:16 JAA Is % the regex escape char in Lua?
13:16 arkiver yes
13:16 JAA Ew
13:16 JAA :-)
13:16 arkiver and unlike regex you have to % a -
13:17 JAA Ah yeah, it's not real regex, is it?
13:17 arkiver no, pattern matching
13:17 JAA Right
13:17 arkiver it has no (?:www\.)?
13:17 arkiver but I use https?://[^/]*google.com/ instead, which works pretty well, but will also match something other than www.google.com
13:18 JAA Yeah, like evilgoogle.com
13:18 arkiver for example
13:18 arkiver yep
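A quick illustration of those Lua pattern points; the URLs are made up:

```lua
-- Lua patterns, not regex: % is the escape character, and an unescaped -
-- is a (lazy) quantifier, so a literal dash must be written %-
print(string.match("https://www.google.com/search", "https?://[^/]*google%.com/"))
--> https://www.google.com/

-- the loose host part also matches unrelated domains, as noted above:
print(string.match("https://evilgoogle.com/", "https?://[^/]*google%.com/"))
--> https://evilgoogle.com/

-- literal dash: "foo%-bar" matches, while the pattern "foo-bar" would not
print(string.match("foo-bar", "foo%-bar"))
--> foo-bar
```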
13:19 arkiver Every possible URL discovered in https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L200-L217 is checked and completed in checknewurl and checknewshorturl
13:19 arkiver full URLs are then passed to https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L137-L147
13:19 arkiver there the # is removed, and &amp; is replaced by &
13:20 arkiver (testtoken stuff is wikispaces only, ignore that)
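A sketch of that normalization step; the helper name is made up, and the `{url=...}` table format is the one the wget-lua wiki describes for get_urls results:

```lua
-- sketch of the normalization described above: drop the fragment, decode
-- HTML-escaped ampersands, and queue the result
local function queue_url(urls, url)
  url = string.match(url, "^([^#]*)")   -- remove the # fragment
  url = string.gsub(url, "&amp;", "&")  -- &amp; is replaced by &
  table.insert(urls, { url=url })       -- get_urls returns tables of {url=...}
end

local urls = {}
queue_url(urls, "https://example.com/page?a=1&amp;b=2#section")
print(urls[1].url)
--> https://example.com/page?a=1&b=2
```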
13:20 arkiver so for every discovered URL we check if it was already downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L141
13:20 arkiver and if we want to archive it https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L142
13:20 arkiver that happens in the `allowed` function https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L34
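The shape of that dedup-and-filter check, as a sketch; the `downloaded` set and this particular `allowed` body are assumptions for illustration:

```lua
-- sketch of the check described above: skip URLs we already have, and ask
-- `allowed` whether we want the rest (this allowed body is made up)
local downloaded = {}

local function allowed(url)
  return string.match(url, "^https?://[^/]*example%.com/") ~= nil
end

local function checknewurl(urls, url)
  if downloaded[url] ~= true and allowed(url) then
    table.insert(urls, { url=url })
    downloaded[url] = true
  end
end

local urls = {}
checknewurl(urls, "https://www.example.com/page")
checknewurl(urls, "https://www.example.com/page")  -- already seen: skipped
checknewurl(urls, "https://cdn.other.net/app.js")  -- not allowed: skipped
print(#urls)
--> 1
```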
13:23 JAA Ack
13:23 arkiver so what basically happens is we extract as many URLs as possible and select the ones we want to archive
13:23 JAA arkiver: Sorry, need to leave for a bit, I'll be back in half an hour or so probably.
13:23 arkiver note that most discovered URLs will be script or css garbage, etc., but those are then filtered out in the `allowed` function
13:23 arkiver JAA: ok
13:24 arkiver ping me when you're back please
14:18 JAA arkiver: Back
14:32 arkiver ok
14:33 arkiver besides extracting URLs with get_urls, wget also extracts URLs. Each URL extracted by wget is passed through https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L112
14:34 arkiver where it is checked for being downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L120 and if it is allowed https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L121
14:35 arkiver also note the `html == 0` in download_child_p https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L121
14:36 JAA What are the /file/history/ and /page/history/ checks doing there?
14:36 arkiver html == 0 is for example so external images are archived, even if they are not allowed by the `allowed` function
14:36 arkiver Those two history pages are here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L122-L123 and here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L182-L183
14:37 arkiver They are not blocked in the `allowed` function, because we want to archive them, but we don't want to archive new URLs found on them, so we ignore them in those places
14:38 JAA Oh, right, didn't notice that it was checking the *parent* (i.e. the already retrieved) URL.
14:38 arkiver yep
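A sketch of that download_child_p shape; the field names (urlpos["url"]["url"], link_expect_html, parent["url"]) follow the wget-lua wiki and should be treated as assumptions, and `allowed` is the illustrative helper from earlier:

```lua
-- sketch of the download_child_p logic described above
wget.callbacks.download_child_p = function(urlpos, parent, depth,
                                           start_url_parsed, iri, verdict, reason)
  local url  = urlpos["url"]["url"]
  local html = urlpos["link_expect_html"]  -- 0 when the child is not expected to be HTML

  -- the check is on the *parent* (already retrieved) URL: the history pages
  -- themselves are archived, but nothing new is queued from them
  if string.match(parent["url"], "/file/history/")
    or string.match(parent["url"], "/page/history/") then
    return false
  end

  -- html == 0 lets e.g. external images through even when `allowed` says no
  return allowed(url) or html == 0
end
```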
14:40 arkiver so as you can see pretty much any URL passes through `allowed`. So if we want to extract, for example, a list of users, we can do that in `allowed` before the URL is ignored https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L46-L49
14:40 arkiver and this is custom stuff for wikispaces https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L185-L199
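A sketch of that side-effect trick, harvesting data inside `allowed` even when the URL itself is rejected; the URL shape and pattern are made up:

```lua
-- sketch of discovery inside `allowed`, as described above: record users we
-- see before deciding whether to crawl the URL
local discovered_users = {}

local function allowed(url)
  local user = string.match(url, "^https?://[^/]*example%.com/user/([^/%?]+)")
  if user then
    discovered_users[user] = true  -- harvested even if the URL is rejected
  end
  return string.match(url, "^https?://[^/]*example%.com/") ~= nil
end

allowed("https://example.com/user/alice?tab=wikis")
for user, _ in pairs(discovered_users) do
  print(user)
end
--> alice
```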
14:42 arkiver httploop_result can check status codes and return ABORT, EXIT, CONTINUE or NOTHING
14:43 arkiver To be able to trigger an abort outside of httploop_result, the variable `abortgrab` exists, which is checked in httploop_result and finally here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L292-L297
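A sketch of that abortgrab flag; the before_exit hook and wget.exits.IO_FAIL are taken from the wget-lua wiki and common ArchiveTeam scripts, so treat them as assumptions:

```lua
-- sketch of the abortgrab mechanism described above: a flag any code can set,
-- checked on every response and once more before wget exits
local abortgrab = false

wget.callbacks.httploop_result = function(url, err, http_stat)
  if abortgrab then
    return wget.actions.ABORT
  end
  return wget.actions.NOTHING
end

wget.callbacks.before_exit = function(exit_status, exit_status_string)
  if abortgrab then
    return wget.exits.IO_FAIL  -- force a nonzero exit so the item is not marked done
  end
  return exit_status
end
```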
14:43 arkiver And we have a list of ignored URLs, https://github.com/ArchiveTeam/wikispaces-grab/blob/master/ignore-list
14:43 arkiver those are loaded and set to downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L19-L21
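Pre-seeding the `downloaded` set from that file looks roughly like this, assuming one URL per line:

```lua
-- sketch of loading the ignore-list, as described: every listed URL is
-- treated as already downloaded, so it is never queued
-- (assumes the ignore-list file ships next to the script)
local downloaded = {}

for ignore in io.open("ignore-list", "r"):lines() do
  downloaded[ignore] = true
end
```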
14:47 JAA arkiver: What's this bit in the retry code? https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L261-L265
14:49 arkiver after 5 tries https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L257 for a set of status codes https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L250-L252
14:49 arkiver and remember we can archive URLs with html == 0 here https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L121 that are, for example, just external images on a forum
14:51 arkiver so after 5 tries, if a URL is `allowed` (we want to have it / it is important) we abort the crawl (the item stays out on the tracker); if it is not an important URL (just an external forum image for example) we `wget.actions.EXIT`, i.e. skip the URL.
14:51 arkiver and those unimportant URLs were for example downloaded because of the html == 0 rule in download_child_p
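A sketch of that retry policy; the status codes, counter, backoff, and the url["url"] field are illustrative assumptions, not the exact wikispaces-grab code, and `allowed` is the helper sketched earlier:

```lua
-- sketch of the retry logic described above
local tries = 0
local maxtries = 5

wget.callbacks.httploop_result = function(url, err, http_stat)
  local status_code = http_stat["statcode"]

  if status_code >= 500 or status_code == 403 then
    tries = tries + 1
    if tries >= maxtries then
      tries = 0
      if allowed(url["url"]) then
        return wget.actions.ABORT  -- important URL: kill the crawl
      end
      return wget.actions.EXIT     -- unimportant (e.g. external image): skip it
    end
    os.execute("sleep " .. tries)  -- back off a little before retrying
    return wget.actions.CONTINUE   -- retry this URL
  end

  tries = 0
  return wget.actions.NOTHING
end
```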
14:55 JAA Oh, EXIT means "skip". Well, that's intuitive...
14:57 arkiver haha yeah
14:57 arkiver https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
14:57 arkiver wget.actions.NOTHING: follow the normal Wget procedure for this result.
14:57 arkiver wget.actions.CONTINUE: retry this URL.
14:57 arkiver wget.actions.EXIT: finish this URL (ignore any error).
14:57 arkiver wget.actions.ABORT: Wget will abort() and exit immediately.
15:00 arkiver So the whole script may seem a little big and complicated at times, but it's grown into this over years of warrior projects, and works pretty well with most projects
15:00 JAA https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L273-L277 This is useless in this case, correct? I assume that's used for rate limiting?
15:01 arkiver That is used for sleeping between archiving URLs. It's kind of something that's always been there; I didn't do much with it
15:01 arkiver set `sleep_time` to something and it will sleep for that long between URLs
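That knob is just a shell sleep between responses; a minimal sketch:

```lua
-- sketch of the sleep_time knob described above, as it would sit inside
-- httploop_result: 0 disables the delay, which is the usual setting
local sleep_time = 0  -- seconds to wait between URLs

if sleep_time > 0 then
  os.execute("sleep " .. sleep_time)
end
```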
15:01 JAA Right
15:02 arkiver This is a much simpler example https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua
15:02 JAA The list of discovered users gets uploaded as a separate file to the normal rsync target, correct?
15:02 arkiver the only project-specific things in the YTMND lua script are https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua#L36-L44 and https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua#L115-L124
15:03 arkiver ah well, normally you would use the YTMND script as the example, but it does not do discovery
15:04 arkiver but if you want to do discovery, make sure to extract data (for example in `allowed`), write it to the file at the end https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L282-L290 and make sure you create and handle the *_data.txt files in https://github.com/ArchiveTeam/wikispaces-grab/blob/master/pipeline.py (search for all lines with _data.txt)
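Writing the discovered data out at the end could look like this; the finish hook signature follows the wget-lua wiki, and the `discovered` set and the environment-variable hand-off are assumptions based on the discussion further down:

```lua
-- sketch of writing discovery results at the end of the grab; `discovered`
-- is assumed to be filled in `allowed`, env vars come from pipeline.py
local discovered = {}

wget.callbacks.finish = function(start_time, end_time, wall_time, numurls,
                                 total_downloaded_bytes, total_download_time)
  local file = io.open(os.getenv("item_dir") .. "/"
                       .. os.getenv("warc_file_base") .. "_data.txt", "w")
  for item, _ in pairs(discovered) do
    file:write(item .. "\n")
  end
  file:close()
end
```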
15:06 arkiver vidme is also a nice example https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua, has a lot of custom extraction https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L169-L225 and does discovery, but has a small `allowed` function
15:07 arkiver you'll see plenty of nice examples when you go through the lua scripts of different (recent) warrior projects
15:09 JAA Right, I'll do that.
15:09 arkiver vid.me also loads a JSON file, https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L198, in that case make sure you have the JSON.lua file and load it https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L3 and have the function to load a file https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L27-L33
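The JSON pattern is small; a sketch assuming the JSON.lua module bundled with these grab scripts (Jeffrey Friedl's) and a hypothetical file name:

```lua
-- sketch of the JSON-loading pattern described above; "response.json" is a
-- made-up file name standing in for a fetched API response
local JSON = (loadfile "JSON.lua")()

local function read_file(file)
  local f = io.open(file)
  if not f then
    return ""
  end
  local data = f:read("*all")
  f:close()
  return data
end

local data = JSON:decode(read_file("response.json"))
```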
15:10 arkiver JAA: let me know if you have any questions :) and if you need me to create a tracker for a project
15:13 JAA Don't worry, I will. :-)
15:14 arkiver I hope this was all a little clear; I now see what a big bunch of GitHub URLs that was
15:16 JAA It was pretty clear. I'm trying to understand the flow of information at the moment.
15:17 JAA The environment variables at the top of the script (item_* and warc_file_base) are set by seesaw, I assume?
15:18 JAA Ah no, found it in pipeline.py.
15:18 arkiver yeah, was just about to write that
15:18 arkiver https://github.com/ArchiveTeam/wikispaces-grab/blob/master/pipeline.py#L248-L253 for the record
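On the Lua side, picking those variables up is just os.getenv; a sketch using the names mentioned above (the exact item_* names vary by project):

```lua
-- sketch: pipeline.py exports these into wget's environment (see the link
-- above) and the Lua script reads them back at the top
local item_type      = os.getenv("item_type")
local item_name      = os.getenv("item_name")
local item_dir       = os.getenv("item_dir")
local warc_file_base = os.getenv("warc_file_base")
```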
15:19 JAA I feel like I asked you this before, but how do you test before launching a project? Do you just use the tracker directly?
15:19 arkiver you can also run the tracker locally, but I just create a project on tracker.archiveteam.org and use that to test
15:19 arkiver I find that much easier
15:19 JAA Right.
