00:30 <ivan`> how much free memory does the redis server on http://tracker.archiveteam.org/ have?
00:31 <ivan`> I'm going to have many gigabytes of feed URLs
19:32 <alard> ivan`: The tracker has a few (undocumented) features that may be useful for your project.
19:33 <alard> (Nice to see someone using the Lua extension.)
19:36 <alard> GetItemFromTracker returns a json object to the warrior. The properties of that object end up in the item dictionary. One of them, "item_name", is set by default, but you can add custom keys.
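A sketch of what such a response might look like ("item_name" is the standard key; the "batch_size" key and all values are invented for illustration):

    # Hypothetical json object returned by the tracker for a
    # GetItemFromTracker request:
    response = {
        "item_name": "batch-00042",  # set by default
        "batch_size": 500,           # an invented custom key
    }
    # These properties end up in the item dictionary, so later pipeline
    # tasks can read item["item_name"] and item["batch_size"] directly.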
19:37 <alard> The custom data is generated by a little bit of Ruby that runs on the tracker. This script gets the item name and can fill the data object with other things.
19:37 <alard> Because it's Ruby, it can also read a file.
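The real hook is a bit of Ruby on the tracker; as a minimal sketch of the same idea in Python, with an assumed file layout:

    # Given the item name, fill the data object with custom keys.
    # The file path is an assumption; "task_urls" is the field name
    # that comes up later in this log.
    def extra_parameters(item_name):
        data = {}
        with open("/data/batches/" + item_name + ".txt") as f:
            data["task_urls"] = f.read()
        return data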
19:39 <alard> In your case, I think it would be handy to put the batch IDs as the items in the Redis queue, and write the URL list for each batch to a file with a filename that corresponds to the batch ID.
19:39 <alard> The extra-parameters script can then read that file and add the URLs to the json response.
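The preparation side of that scheme might look roughly like this (the queue key, file paths, and example data are assumptions; the tracker's actual Redis layout may differ):

    import redis

    # Example input: batch ID -> list of feed URLs (invented data).
    batches = {"batch-00042": ["http://example.com/feed1",
                               "http://example.com/feed2"]}

    # Sketch: queue only the short batch IDs in Redis, and park the
    # large URL lists on disk where the extra-parameters script can
    # read them back.
    r = redis.Redis()
    for batch_id, urls in batches.items():
        with open("/data/batches/%s.txt" % batch_id, "w") as f:
            f.write("\n".join(urls) + "\n")
        r.rpush("urls:todo", batch_id)  # assumed queue key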
19:40 <alard> You can remove the interesting Custom* contraptions from your pipeline. :)
19:41 <alard> And it'll save a lot of Redis memory, since only the batch IDs are kept in memory.
19:45 <alard> Last trick: you can give URL lists to Wget via the STDIN pipe. For example: https://github.com/ArchiveTeam/yahoo-upcoming-grab/blob/master/pipeline.py#L276-L281
19:46 <alard> For the Upcoming project, the tracker would add a
19:47 <alard> "URL1\nURL2\n...URLn\n" value in the item["task_urls"] field of the json response.