#warrior 2013-05-24,Fri


00:30 🔗 ivan` how much free memory does the redis server on http://tracker.archiveteam.org/ have?
00:31 🔗 ivan` I'm going to have many gigabytes of feed URLs
19:32 🔗 alard ivan`: The tracker has a few (undocumented) features that may be useful for your project.
19:33 🔗 alard (Nice to see someone using the Lua extension.)
19:36 🔗 alard GetItemFromTracker returns a json object to the warrior. The properties of that object end up in the item dictionary. One of them is set by default, the "item_name", but you can add custom keys.
19:37 🔗 alard The custom data is generated by a little bit of Ruby that runs on the tracker. This script gets the item name and can fill the data object with other things.
19:37 🔗 alard Because it's Ruby, it can also read a file.
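A minimal sketch of what this looks like from the warrior's side. Only "item_name" is a documented default; the other keys are hypothetical examples of custom data the tracker-side Ruby script could add:

```python
import json

# Hypothetical json response from GetItemFromTracker; only "item_name"
# is confirmed above, the other keys are illustrative custom data.
response_body = json.dumps({
    "item_name": "batch_0001",
    "batch_id": "0001",          # hypothetical custom key
    "source": "feed-discovery",  # hypothetical custom key
})

# Pipeline side: every property of the json object ends up
# in the item dictionary.
item = {}
item.update(json.loads(response_body))
```
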
19:39 🔗 alard In your case, I think it would be handy to put the batch IDs as the items in the Redis queue, and write the url list for each batch to a file with a filename that corresponds to the batch ID.
19:39 🔗 alard The extra-parameters script can then read that file and add the urls to the json response.
19:40 🔗 alard You can remove the interesting Custom* contraptions from your pipeline. :)
19:41 🔗 alard And it'll save a lot of Redis memory, since only the batch IDs are kept in memory.
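The batch layout described above could be prepared along these lines (a sketch; the directory layout, file naming, and the Redis queue name are assumptions):

```python
import os

def write_batch(batch_id, urls, directory="batches"):
    """Write one URL list to a file named after its batch ID."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "%s.txt" % batch_id)
    with open(path, "w") as f:
        f.write("\n".join(urls) + "\n")
    return path

path = write_batch("batch_0001", ["http://example.com/feed1",
                                  "http://example.com/feed2"])

# Only the batch IDs then go into the Redis queue, e.g.:
#   redis-cli rpush pending batch_0001
# so Redis holds short IDs while the tracker's extra-parameters script
# reads batches/batch_0001.txt and attaches the URLs to the json response.
```
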
19:45 🔗 alard Last trick: you can give URL lists to Wget via the STDIN pipe. For example: https://github.com/ArchiveTeam/yahoo-upcoming-grab/blob/master/pipeline.py#L276-L281
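Piping a URL list to Wget over stdin might look like this (a sketch; the exact flags in the linked pipeline differ, and wget must be on PATH to actually run the command):

```python
import subprocess

urls = ["http://example.com/a", "http://example.com/b"]
stdin_data = ("\n".join(urls) + "\n").encode("utf-8")

# "--input-file=-" (short form: "-i -") tells wget to read
# its URL list from standard input.
args = ["wget", "--input-file=-", "--output-document", "/dev/null"]

# Uncomment to run for real:
# subprocess.run(args, input=stdin_data, check=True)
```
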
19:46 🔗 alard For the Upcoming project, the tracker would add a "URL1\nURL2\n...URLn\n" value in the item["task_urls"] field of the json response.
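On the pipeline side, that newline-joined value can be handed straight to Wget's stdin, or split back into a list (the item contents here are illustrative):

```python
# Illustrative item as it might arrive from the tracker.
item = {"item_name": "upcoming:123", "task_urls": "URL1\nURL2\nURL3\n"}

urls = item["task_urls"].splitlines()          # ["URL1", "URL2", "URL3"]
stdin_data = item["task_urls"].encode("utf-8")  # ready to feed wget's stdin
```
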
