01:03 *** vectr0n` has quit IRC (Remote host closed the connection)
03:31 *** Frogging has joined #warrior
13:09 <kiska> But here is good as well, at least it won't get buried with Major messages JAA arkiver
13:09 <arkiver> https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua is a nice example
13:11 <kiska> I'll take a look at it in the morning
13:12 <arkiver> JAA: you here?
13:13 <JAA> Yes, I'm here.
13:13 <arkiver> When data is downloaded, it is passed through get_urls https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L131
13:14 <JAA> Right, same callback as in wpull. What's the last parameter, "iri"?
13:14 <arkiver> the documentation is here: https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
13:15 <arkiver> (I never use iri)
13:15 <arkiver> the default for almost every project is this: https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L200-L217
13:15 <arkiver> not some fancy HTML or script parsing, but trying to extract as many URLs as possible
13:16 <JAA> Is % the regex escape char in Lua?
13:16 <arkiver> yes
13:16 <JAA> Ew
13:16 <JAA> :-)
13:16 <arkiver> and unlike regex you have to % a -
13:17 <JAA> Ah yeah, it's not real regex, is it?
13:17 <arkiver> no, pattern matching
13:17 <JAA> Right
13:17 <arkiver> it has no (?:www\.)?
13:17 <arkiver> but I use https?://[^/]*google.com/ instead, which works pretty well, but will also match something other than www.google.com
13:18 <JAA> Yeah, like evilgoogle.com
13:18 <arkiver> for example
13:18 <arkiver> yep
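To make the differences concrete, here is a small standalone Lua sketch (plain Lua, not taken from the grab script) showing the % escape, the literal -, and why the unescaped dot also matches other hosts:

```lua
-- Standalone Lua pattern examples (not from wikispaces.lua).
-- % is the escape character, and a literal - must be written %- :
assert(string.match("a-b", "a%-b") == "a-b")

-- An unescaped . matches any character, so this pattern also matches
-- hosts like evilgoogle.com:
local pattern = "https?://[^/]*google%.com/"
assert(string.match("https://www.google.com/search", pattern))
assert(string.match("https://evilgoogle.com/search", pattern))

-- Lua patterns have no PCRE-style groups, so there is no (?:www\.)?
-- equivalent; matching broadly and filtering later is the workaround.
```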
13:19 <arkiver> Every possible URL discovered in https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L200-L217 is checked and completed in checknewurl and checknewshorturl
13:19 <arkiver> full URLs are then passed to https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L137-L147
13:20 <arkiver> there the # fragment is removed, and &amp; is replaced by &
13:20 <arkiver> (the testtoken stuff is wikispaces-only, ignore that)
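A minimal sketch of that kind of clean-up (the function name is illustrative; the real code is in the linked lines):

```lua
-- Illustrative URL clean-up, similar in spirit to the linked queuing code:
-- drop the #fragment and decode HTML-escaped ampersands.
local function clean_url(url)
  url = string.match(url, "^([^#]+)") or url  -- strip the fragment
  url = string.gsub(url, "&amp;", "&")        -- pages embed URLs HTML-escaped
  return url
end

print(clean_url("https://example.com/p?a=1&amp;b=2#frag"))
-- prints https://example.com/p?a=1&b=2
```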
13:20 <arkiver> so for every discovered URL we check if it was already downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L141
13:20 <arkiver> and if we want to archive it https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L142
13:23 <arkiver> that happens in the `allowed` function https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L34
13:23 <JAA> Ack
13:23 <arkiver> so what basically happens is: we extract as many URLs as possible and select the ones we want to archive
13:23 <JAA> arkiver: Sorry, need to leave for a bit, I'll be back in half an hour or so probably.
13:23 <arkiver> note that most discovered URLs will be script or CSS garbage, etc., but those are then filtered out in the `allowed` function
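The greedy-extract-then-filter idea can be sketched like this (hypothetical domain and rules, not the actual wikispaces code):

```lua
-- Sketch of an `allowed`-style filter: extraction is deliberately greedy,
-- so script/CSS garbage and off-site URLs are rejected here instead.
local function allowed(url)
  if string.match(url, "%.js$") or string.match(url, "%.css$") then
    return false
  end
  return string.match(url, "^https?://[^/]*example%.com/") ~= nil
end

assert(allowed("https://www.example.com/page/Home"))
assert(not allowed("https://www.example.com/site.css"))
assert(not allowed("https://elsewhere.org/page"))
```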
13:23 <arkiver> JAA: ok
13:24 <arkiver> ping me when you're back please
14:18 <JAA> arkiver: Back
14:32 <arkiver> ok
14:33 <arkiver> besides extracting URLs with get_urls, wget also extracts URLs. Each URL extracted by wget is passed through https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L112
14:34 <arkiver> where it is checked for having been downloaded already https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L120 and whether it is allowed https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L121
14:35 <arkiver> also note the `html == 0` in download_child_p https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L121
14:36 <JAA> What are the /file/history/ and /page/history/ checks doing there?
14:36 <arkiver> html == 0 is for example so that external images are archived, even if they are not allowed by the `allowed` function
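Put together, a download_child_p hook might look like the sketch below. The callback name and parameter list follow the wget-lua hooks wiki; the urlpos field names, stubs, and filter logic are illustrative assumptions, not the actual wikispaces code.

```lua
-- Stubs so this sketch runs standalone; wget-lua provides `wget` itself.
wget = wget or { callbacks = {} }
local downloaded = {}
local function allowed(url)
  return string.match(url, "^https?://[^/]*example%.com/") ~= nil
end

-- Decide whether wget should download a URL it extracted itself.
wget.callbacks.download_child_p = function(urlpos, parent, depth,
                                           start_url_parsed, iri, verdict, reason)
  local url = urlpos["url"]["url"]
  if downloaded[url] then
    return false  -- already fetched once, don't fetch again
  end
  -- link_expect_html == 0: non-HTML children (e.g. external images)
  -- are grabbed even when `allowed` would reject them.
  return allowed(url) or urlpos["link_expect_html"] == 0
end
```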
14:37 <arkiver> Those two history pages are here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L122-L123 and here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L182-L183
14:38 <arkiver> They are not blocked in the `allowed` function, because we want to archive them, but we don't want to archive new URLs found on them, so we ignore them in those places
14:38 <JAA> Oh, right, didn't notice that it was checking the *parent* (i.e. the already retrieved) URL.
14:40 <arkiver> yep
14:40 <arkiver> so as you can see, pretty much any URL passes through `allowed`. So if we want to extract a list of, for example, users, we can do that in `allowed` before the URL is ignored https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L46-L49
14:42 <arkiver> and this is custom stuff for wikispaces https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L185-L199
14:43 <arkiver> httploop_result can check status codes and return ABORT, EXIT, CONTINUE or NOTHING
14:43 <arkiver> To be able to trigger an abort outside of httploop_result, the variable `abortgrab` exists, which is checked in httploop_result and finally here https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L292-L297
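Roughly, the shape is as follows. The hook name and the wget.actions values come from the wget-lua wiki; the status-code choices, the http_stat field name, and the stub table are illustrative assumptions:

```lua
-- Stub of the table wget-lua provides, so the sketch runs standalone.
wget = wget or { actions = { NOTHING = 0, CONTINUE = 1, EXIT = 2, ABORT = 3 },
                 callbacks = {} }

local abortgrab = false  -- can be set from anywhere, e.g. inside get_urls

wget.callbacks.httploop_result = function(url, err, http_stat)
  if abortgrab then
    return wget.actions.ABORT      -- an out-of-band abort was requested
  end
  local code = http_stat["statcode"]
  if code == 429 or code >= 500 then
    return wget.actions.CONTINUE   -- retry this URL
  end
  return wget.actions.NOTHING      -- normal wget handling
end
```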
14:43 <arkiver> And we have a list of ignored URLs, https://github.com/ArchiveTeam/wikispaces-grab/blob/master/ignore-list
14:47 <arkiver> those are loaded and set to downloaded https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L19-L21
14:49 <JAA> arkiver: What's this bit in the retry code? https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L261-L265
14:49 <arkiver> after 5 tries https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L257 for a set of status codes https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L250-L252
14:51 <arkiver> and remember we can archive URLs with html == 0 here https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L121 that are just external images on, for example, a forum
14:51 <arkiver> so after 5 tries, if a URL is `allowed` (we want to have it / it is important) we abort the crawl (the item stays in our items); if it is not an important URL (just an external forum image, for example) we `wget.actions.EXIT`, i.e. skip the URL.
14:55 <arkiver> and those unimportant URLs were for example only downloaded because of the html == 0 rule in download_child_p
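That give-up decision could be sketched as below (illustrative: the tries bookkeeping, threshold, and `allowed` rule differ per project):

```lua
-- Sketch of the decision after repeated failures: abort the whole item
-- for important (allowed) URLs, merely skip unimportant ones.
wget = wget or { actions = { NOTHING = 0, CONTINUE = 1, EXIT = 2, ABORT = 3 } }
local maxtries = 5
local function allowed(url)
  return string.match(url, "^https?://[^/]*example%.com/") ~= nil
end

local function after_failure(url, tries)
  if tries < maxtries then
    return wget.actions.CONTINUE  -- retry
  end
  if allowed(url) then
    return wget.actions.ABORT     -- important URL: fail the item so it is retried later
  end
  return wget.actions.EXIT        -- e.g. a dead external image: just skip it
end

assert(after_failure("https://www.example.com/page", 5) == wget.actions.ABORT)
assert(after_failure("https://cdn.other.org/img.png", 5) == wget.actions.EXIT)
```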
14:57 <JAA> Oh, EXIT means "skip". Well, that's intuitive...
14:57 <arkiver> haha yeah
14:57 <arkiver> https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks
14:57 <arkiver> wget.actions.NOTHING: follow the normal Wget procedure for this result.
14:57 <arkiver> wget.actions.CONTINUE: retry this URL.
14:57 <arkiver> wget.actions.EXIT: finish this URL (ignore any error).
14:57 <arkiver> wget.actions.ABORT: Wget will abort() and exit immediately.
15:00 <arkiver> So the whole script may seem a little big and complicated at times, but it's grown into this over years of warrior projects, and it works pretty well for most projects
15:01 <JAA> https://github.com/ArchiveTeam/wikispaces-grab/blob/5f66ed0a90bb29ad5de44d1b7a5795edce7dac9a/wikispaces.lua#L273-L277 This is useless in this case, correct? I assume that's used for rate limiting?
15:01 <arkiver> That is used for sleeping between archiving URLs. It's kind of something that's always been there; I didn't do much with it
15:01 <arkiver> set `sleep_time` to something and it will sleep for that long between URLs
15:02 <JAA> Right
15:02 <arkiver> This is a much simpler example: https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua
15:02 <JAA> The list of discovered users gets uploaded as a separate file to the normal rsync target, correct?
15:03 <arkiver> the only project-specific thing in the YTMND lua script is https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua#L36-L44 and https://github.com/ArchiveTeam/ytmnd-grab/blob/master/ytmnd.lua#L115-L124
15:04 <arkiver> ah well, normally you would use the YTMND script as your example, but it does not do discovery
15:06 <arkiver> but if you want to do discovery, make sure to extract the data (for example in `allowed`), write it to a file at the end https://github.com/ArchiveTeam/wikispaces-grab/blob/master/wikispaces.lua#L282-L290 and make sure you create and handle the *_data.txt files in https://github.com/ArchiveTeam/wikispaces-grab/blob/master/pipeline.py (search for all lines with _data.txt)
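The write-out step can be sketched like this (the filename and helper names are illustrative; the real scripts derive the path from item_dir/warc_file_base, and the pipeline picks the *_data.txt files up for upload):

```lua
-- Sketch: collect discovered items (e.g. usernames) while URLs pass
-- through `allowed`, then dump them to a side file at the end of the grab.
local discovered = {}

local function discover(item)
  discovered[item] = true  -- table-as-set deduplicates
end

local function write_discovered(path)
  local f = assert(io.open(path, "w"))
  for item, _ in pairs(discovered) do
    f:write(item .. "\n")
  end
  f:close()
end

discover("user:alice")
discover("user:alice")  -- duplicate, stored only once
discover("user:bob")
write_discovered("example_data.txt")
```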
15:07 <arkiver> vidme is also a nice example: https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua; it has a lot of custom extraction https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L169-L225 and does discovery, but has a small `allowed` function
15:09 <arkiver> you'll see plenty of nice examples when you go through the lua scripts of different (recent) warrior projects
15:09 <JAA> Right, I'll do that.
15:10 <arkiver> vid.me also loads a JSON file, https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L198; in that case make sure you have the JSON.lua file and load it https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L3 and have the function to load a file https://github.com/ArchiveTeam/vidme-grab/blob/master/vidme.lua#L27-L33
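The file-reading helper is the standard slurp pattern, sketched below; the JSON usage is commented out because it needs the bundled JSON.lua module, and the exact decode call is an assumption based on the linked script:

```lua
-- Read a whole file into a string (returns nil if it doesn't exist).
local function read_file(path)
  local f = io.open(path, "rb")
  if not f then
    return nil
  end
  local data = f:read("*all")  -- slurp the whole file
  f:close()
  return data
end

-- With the bundled JSON.lua module it would be used roughly like:
-- local JSON = (loadfile "JSON.lua")()
-- local obj = JSON:decode(read_file("response.json"))

assert(read_file("/nonexistent/path") == nil)
```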
15:13 <arkiver> JAA: let me know if you have any questions :) and if you need me to create a tracker for a project
15:14 <JAA> Don't worry, I will. :-)
15:16 <arkiver> I hope this was all a little clear; I now see the big bunch of GitHub URLs
15:17 <JAA> It was pretty clear. I'm trying to understand the flow of information at the moment.
15:18 <JAA> The environment variables at the top of the script (item_* and warc_file_base) are set by seesaw, I assume?
15:18 <JAA> Ah no, found it in pipeline.py.
15:18 <arkiver> yeah, was just about to write that
15:19 <arkiver> https://github.com/ArchiveTeam/wikispaces-grab/blob/master/pipeline.py#L248-L253 for the record
15:19 <JAA> I feel like I asked you this before, but how do you test before launching a project? Do you just use the tracker directly?
15:19 <arkiver> you can also run the tracker locally, but I just create a project on tracker.archiveteam.org and use that to test
15:19 <arkiver> I find that much easier
15:19 <JAA> Right.