Support multiprocessing

Registered by trbs on 2009-02-23

I would like to propose support for Python's multiprocessing module.
On my machine (a Lenovo T61 notebook) a batch process is mostly CPU-bound.

As I would imagine, batch operations are collections of the same sequence of operations repeated N times, where N is the number of images.

Therefore, (optionally) using multiprocessing on platforms where it is available and useful could speed up large batches substantially.

I would be willing to code up patches for this, time permitting.

http://docs.python.org/library/multiprocessing.html

Blueprint information

Status:
Started
Approver:
Stani
Priority:
High
Drafter:
trbs
Direction:
Needs approval
Assignee:
trbs
Definition:
Drafting
Series goal:
Accepted for 0.3
Implementation:
Slow progress
Milestone target:
None
Started by
trbs on 2009-06-11

Sprints

Whiteboard

Ido, 28-11-2010:
Regretfully work has stalled :(
But not completely forgotten :)
I'm hoping 2011 will give me some more free time to pick up this work again.
(stani: cool!)

stani, 10-6-2009:
If you want to work on multiprocessing, you'll have to refactor core/api.py and evaluate whether it is more efficient to share the cache or to give each process its own cache. In the case of a shared cache dictionary, take into account that some of the data it contains is not picklable, mainly PIL Image objects. Or maybe it would be a good idea to have two caches: a shared one, which contains e.g. a GPS time dict, and a non-shared one, which contains e.g. large images. Maybe you want to work on something else, which is fine as well.

In case you consider multiprocessing for a speed increase, I guess Phatch could use some profiling as well, as its code is not optimized for speed. My main focus is that it "just works(TM)" out of the box, stability (no bugs), and a polished UI. So some aspects of Phatch are very speed-inefficient, which, because of lack of time, I leave to anyone who likes to work on it. For example, at every start, Phatch imports all actions and scales all icons from wxPython -> PIL (antialias) -> wxPython for the smaller icons in the tree view. It would probably be better to cache the icons at the right sizes. The icon rescaling of course happens only when Phatch is launched as a GUI, but even in server mode all actions are imported, while it would probably be better to import only those which are in the action lists. Of course, to know the bottlenecks, Phatch has to be profiled.
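The two-cache idea could be sketched like this (a minimal illustration, not Phatch's actual API; all names and values here are made up): a Manager-backed dict shared across processes for small picklable metadata, and a plain per-process dict for unpicklable objects such as PIL Image instances.

```python
import multiprocessing

def worker(task_queue, shared_cache):
    # Per-process cache: lives only inside this worker, so it may
    # safely hold unpicklable objects (e.g. PIL Image instances).
    local_cache = {}
    for filename in iter(task_queue.get, None):  # None is the stop sentinel
        # stand-in for actually loading and processing the image
        local_cache[filename] = "image-object-for-" + filename
        # only small, picklable metadata goes into the shared cache
        shared_cache[filename] = {"width": 100, "height": 100}

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    shared_cache = manager.dict()  # proxied dict: picklable values only
    tasks = multiprocessing.Queue()
    for name in ("a.jpg", "b.jpg"):
        tasks.put(name)
    tasks.put(None)  # tell the worker to stop
    proc = multiprocessing.Process(target=worker, args=(tasks, shared_cache))
    proc.start()
    proc.join()
    print(sorted(shared_cache.keys()))
```

The split avoids pickling PIL images entirely: large image data never crosses a process boundary, only the metadata dict does.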

Ido, 10-6-2009:
I see; at the moment I don't think I have much time to work on it... and now it seems like something which is potentially a lot more work than 'just adding multiprocessing to the mix'. I'm personally not so worried about speed issues when they're in the startup sequence :) Though it would make sense to look at some profiling data before jumping into multiprocessing. When I wrote the proposal I thought to put multiprocessing at the level of image files only. So (at least initially) it would only use different processes for different images and process the entire action stack in the same process, to avoid all dealings with dependencies inside the action_list or sharing of big blocks of data. Just metadata and importing all modules needed for the actions should be enough for each process. Then have a queue of image files which is drained by N worker processes which execute the action stack, with some additional code doing error handling in the MP code.
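The design described above (a queue of image files drained by N worker processes, each running the whole action stack, with errors reported back to the parent) might look roughly like this; apply_action_stack is a hypothetical stand-in for running Phatch's action list on one image, not the real code:

```python
import multiprocessing

def apply_action_stack(filename):
    # stand-in for running the full Phatch action list on one image
    return filename.upper()

def worker(in_queue, out_queue):
    for filename in iter(in_queue.get, None):  # None is the stop sentinel
        try:
            out_queue.put((filename, apply_action_stack(filename), None))
        except Exception as exc:
            # report failures back to the parent instead of dying silently
            out_queue.put((filename, None, str(exc)))

def run_batch(filenames, n_workers=4):
    in_q, out_q = multiprocessing.Queue(), multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(in_q, out_q))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for name in filenames:
        in_q.put(name)
    for _ in workers:
        in_q.put(None)  # one stop sentinel per worker
    # collect exactly one (filename, result, error) triple per input
    results = [out_q.get() for _ in filenames]
    for w in workers:
        w.join()
    return results
```

Because each worker processes whole images end to end, nothing but filenames and small result/error tuples crosses a process boundary, exactly as the comment proposes.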

stani, 11-6-2009:
Maybe when someone else shows up who wants to work on it, you can help him. I thought of using something as cool as this for the progress dialog: http://xoomer.virgilio.it/infinity77/main/PeakMeterCtrl.html

Ido, 11-6-2009:
I took a stab at prototyping/hacking MP into Phatch and got a very crude version working by just modifying api.py:apply_actions. With a simple action list of 'round' + 'scale' + 'save' on 135 images, it took 0:02:43 on a dual-core machine with 4 worker threads, compared to 0:04:25 on an unmodified version of Phatch. Considering how 'easy' the hacking was, I would think that making a cleaner patch would not take too much time. Since I got away with basically only changing how apply_actions works, it would seem that multiprocessing capabilities could be a setting/switch, potentially disabled by default until proven stable. (I have not tested this with anything other than an Ubuntu/Debian Linux system.) I cannot see a place here to attach files, so I pushed the hack into a branch for others to look at. (The code/branch is not intended to be pushed to trunk; it's just my little attempt to see how much time a very basic implementation would take.)
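Since the prototype only changed api.py:apply_actions, the setting/switch could plausibly be as small as the following sketch (function names are illustrative, not the actual Phatch code):

```python
import multiprocessing

def apply_actions(image_file):
    # stand-in for Phatch's per-image action stack in api.py
    return "processed:" + image_file

def apply_actions_batch(image_files, use_multiprocessing=False):
    # keep the feature behind a switch, disabled by default
    # until it has been proven stable
    if use_multiprocessing:
        with multiprocessing.Pool() as pool:
            return pool.map(apply_actions, image_files)
    return [apply_actions(f) for f in image_files]
```

With Pool.map the worker count defaults to the number of CPUs, and the sequential branch preserves the old single-process behaviour byte for byte.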

Stani, 12-06-2009:
I haven't looked at your code yet, but this is great news. Although I have no time to work on multiprocessing, I will definitely welcome this feature in Phatch. Indeed, make everything optional, even importing the multiprocessing module. Phatch now supports Python 2.4, 2.5 and 2.6, and I prefer to keep it that way. In case the multiprocessing module also exists for older Python versions, we could ship it. Please subscribe to the phatch-dev mailing list, as it is a better place to report your progress. You will have a very interested audience ;-) Also keep your branch up to date, so when the multiprocessing work gets stable we can merge it.
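Keeping even the import optional, as requested above, could look like this guarded import with a sequential fallback (a sketch under the assumption that a single map-style entry point exists; the function name is made up):

```python
# Optional import: Phatch targets Python 2.4-2.6, and multiprocessing
# only became part of the standard library in 2.6 (2.4/2.5 would need
# the backport package).
try:
    import multiprocessing
    HAS_MULTIPROCESSING = True
except ImportError:
    multiprocessing = None
    HAS_MULTIPROCESSING = False

def map_images(func, image_files, use_multiprocessing=True):
    """Apply func to every image file, in parallel when possible."""
    if use_multiprocessing and HAS_MULTIPROCESSING:
        pool = multiprocessing.Pool()
        try:
            return pool.map(func, image_files)
        finally:
            pool.close()
            pool.join()
    # graceful fallback: plain sequential processing
    return [func(f) for f in image_files]
```

Callers never need to know whether the module was available; the behaviour just degrades to the current single-process code path.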
https://launchpad.net/~phatch-dev

Erich, 15-06-2009:
I have several thoughts on this. First off, good job; taking advantage of multiple cores/processors is a huge step for this project! Secondly, we should very carefully investigate whether falling back to threads in Phatch is the right solution. There are cases where GIL contention makes threaded code worse than non-threaded. We should make sure Phatch is not one of them. Finally, I really like the architectural decisions made here. The ImageFilesQueue object is a very nice abstraction. Perhaps a similar abstraction could be made with the shared_state dict?


Work Items
