00:30 <ivan`> how much free memory does the redis server on http://tracker.archiveteam.org/ have?
00:31 <ivan`> I'm going to have many gigabytes of feed URLs
19:32 <alard> ivan`: The tracker has a few (undocumented) features that may be useful for your project.
19:33 <alard> (Nice to see someone using the Lua extension.)
19:36 <alard> GetItemFromTracker returns a json object to the warrior. The properties of that object end up in the item dictionary. One of them, "item_name", is set by default, but you can add custom keys.
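A sketch of what such a response might look like ("item_name" is the standard key; the "batch_size" key and all values are invented for illustration):

    # Hypothetical json object returned by the tracker for a
    # GetItemFromTracker request:
    response = {
        "item_name": "batch-00042",  # set by default
        "batch_size": 500,           # an invented custom key
    }
    # These properties end up in the item dictionary, so later pipeline
    # tasks can read item["item_name"] and item["batch_size"] directly.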
19:37 <alard> The custom data is generated by a little bit of Ruby that runs on the tracker. This script gets the item name and can fill the data object with other things.
19:37 <alard> Because it's Ruby, it can also read a file.
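The real hook is a bit of Ruby on the tracker; as a minimal sketch of the same idea in Python, with an assumed file layout:

    # Given the item name, fill the data object with custom keys.
    # The file path is an assumption; "task_urls" is the field name
    # that comes up later in this log.
    def extra_parameters(item_name):
        data = {}
        with open("/data/batches/" + item_name + ".txt") as f:
            data["task_urls"] = f.read()
        return data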
19:39 <alard> In your case, I think it would be handy to put the batch IDs as the items in the Redis queue, and write the URL list for each batch to a file with a filename that corresponds to the batch ID.
19:39 <alard> The extra-parameters script can then read that file and add the URLs to the json response.
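The preparation side of that scheme might look roughly like this (the queue key, file paths, and example data are assumptions; the tracker's actual Redis layout may differ):

    import redis

    # Example input: batch ID -> list of feed URLs (invented data).
    batches = {"batch-00042": ["http://example.com/feed1",
                               "http://example.com/feed2"]}

    # Sketch: queue only the short batch IDs in Redis, and park the
    # large URL lists on disk where the extra-parameters script can
    # read them back.
    r = redis.Redis()
    for batch_id, urls in batches.items():
        with open("/data/batches/%s.txt" % batch_id, "w") as f:
            f.write("\n".join(urls) + "\n")
        r.rpush("urls:todo", batch_id)  # assumed queue key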
19:40 <alard> You can remove the interesting Custom* contraptions from your pipeline. :)
19:41 <alard> And it'll save a lot of Redis memory, since only the batch IDs are kept in memory.
19:45 <alard> Last trick: you can give URL lists to Wget via the STDIN pipe. For example: https://github.com/ArchiveTeam/yahoo-upcoming-grab/blob/master/pipeline.py#L276-L281
19:46 <alard> For the Upcoming project, the tracker would add a
19:47 <alard> "URL1\nURL2\n...URLn\n" value in the item["task_urls"] field of the json response.