Better handling of concurrent compilation of Theano functions
Currently, when a process compiles a Theano function, it takes a lock on the compile dir. This ensures we compile everything only once, but it leaves many processes waiting for no good reason AND prints useless warning messages.
We also take the lock while we read the contents of the compiledir. When does this happen? When we load Theano? When we compile for the first time? Both?
Here are some other directions. We can do 0) and 1) in all cases to shorten the time we hold the lock. 2) and 3) are two different structures that avoid the lock or push the problem to some other system.
0) Make sure that when we compile, we hold the lock for the least amount of time. This means finding which ops need compilation before taking the lock. I'm not sure we do this.
1) Should we simply not take the lock when we read the contents of the compiledir, to lower contention? Then, if we detect an inconsistency in some compiled op, we clean it up only if the op's compiledir is older than a new threshold (~1h).
2) Use a real database system to handle all this? SQLite?
3) Don't use a lock, but a lockless algorithm? Here is a draft of one off the top of my head. (Does it really work in all cases? Can we just accept the problem in the cases where it doesn't?)
- Read the compiledir without a lock and record the number of directories in it.
- Ignore bad compiled modules that are younger than 1h (some other concurrent process could be compiling them).
- Record the directories that we skipped.
- When a process wants to compile an op that is not in the cache:
  - Check whether the directories we skipped are now valid.
  - Check whether the number of compilation directories changed. If yes, rescan the top-level directory to find new compiled ops, without rescanning the compilation directories that we already scanned.
  - If we didn't find a compiled op, compile one.
- Record new ops that this process compiled.
- When we exit, rescan the compiledir to find potential duplicates of what we compiled. If ours is older than the duplicate, we delete it. (We keep the younger one, as it will be reused for longer.)
  - Or should we keep the older one, since we know we didn't crash and we don't know whether the younger one will?
- What do we do if some process crashed?
  - Delete duplicates with a shorter threshold time?
  - Write in the op's compilation dir the process and computer that compiled it, and delete that record when we exit. Then when we check the status of a compiled directory, we can detect whether that process is still running. We could use ssh to connect to other computers under Linux? Or delete them only if the process is on the same computer? (easier and faster)
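The unlocked scan step above could look roughly like this sketch. The function name, the use of key.pkl's presence as the completeness check, and the exact threshold are assumptions for illustration, not the current Theano code:

```python
import os
import time

FRESH_THRESHOLD = 60 * 60  # ~1h: younger incomplete entries may still be compiling

def scan_compiledir(compiledir, skipped=None):
    """Lock-free scan of the compiledir.

    Returns (valid, skipped, count): directories that look like finished
    modules, directories skipped because they look incomplete but are too
    young to judge, and the total entry count (recorded so a later pass can
    detect that new entries appeared without rescanning known ones).
    """
    skipped = set(skipped or ())
    valid = []
    now = time.time()
    entries = os.listdir(compiledir)
    for name in entries:
        path = os.path.join(compiledir, name)
        if not os.path.isdir(path):
            continue
        if os.path.exists(os.path.join(path, "key.pkl")):
            valid.append(path)                      # looks complete
        elif now - os.path.getmtime(path) < FRESH_THRESHOLD:
            skipped.add(path)                       # maybe mid-compilation
        # else: old and incomplete -- a cleanup pass could delete it
    return valid, skipped, len(entries)
```

A caller would keep `skipped` and the entry count around, recheck the skipped directories when an op misses the cache, and rescan only when the count changed.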
Another related thing: there is a limit of 32k subdirectories on FC9. We should take care not to have too many of them. Should we add a threshold on the number of directories at the same time? This is another ticket.
Though it seems a tiny bit like overkill, sqlite might indeed be really useful here, in terms of future code maintainability, minimizing race conditions, etc.
Unfortunately, it looks like out of the box sqlite support only starts at Python 2.5, so it'd mean extra installation steps for 2.4 users (cf. http://stackoverflow.com/questions/789030/how-can-i-import-the-sqlite3-module-into-python-2-4).
Here are a few thoughts on this, and a concrete proposal.
- Unfortunately, sqlite is not multiprocess safe on NFS (see http://www.sqlite.org/faq.html#q5). Combined with no out of the box Python 2.4 support, it seems like a non-starter here.
- It seems like the current setup and Fred's lockless proposal are both focused on never duplicating work: there should only ever be one Theano process compiling a given module at a given time, and if another process is compiling it, the first process should wait. While I feel the pull of this, things get complicated quickly. First, you have to poll to find out when the compilation is done, or set up some kind of IPC. Second, it's hard to handle the case in which the first process crashes. You end up having to wait for it to time out in some manner, which takes a long time, and again requires polling.
With those in mind, I propose the following lock-free system (inspired by Fred's) that might lead to multiple processes doing the same work, but in exchange is fairly simple, avoids polling except for cleanup (which is less critical), and avoids delays.
1. When compiling:
a. Hash the cmodule cache key to create a compilation directory prefix. Using this hash lets you avoid walking the entire compiledir to find an appropriate compiled module -- you only have to look at subdirectories with the appropriate prefix. It also means that 'key.pkl' is no longer needed (although we could still write it, for debugging purposes). Using the hash only as a prefix fed to mkdtemp rather than as the entire directory name avoids collisions across multiple processes simultaneously compiling the same module, without the need for locks.
b. Create a compilation directory using mkdtemp, using the above prefix.
c. Compile in that directory.
d. Finally, once everything else is done, create __init__.py. Because this is nothing more than creating the file, it is (effectively) atomic.
2. When loading a module:
a. Look one by one at all directories in the compiledir with the correct prefix, in alphabetical order. For each one:
1. Use os.utime() to update the mtime of __init__.py. Note that this will *not* create __init__.py if it doesn't already exist. The goal of updating __init__.py's mtime is to make it easy later to clear out unused cache entries, while leaving in entries that are still in active use.
2. Try importing the module. If compilation isn't completed yet, then __init__.py won't exist, and the import will fail immediately. If the import succeeds, we're done. If it fails, try the next directory that has the correct prefix.
3. If we've exhausted all the candidates (or there were none), then compile the module ourselves, per above.
3. When cleaning up (a and b are optional, strictly speaking):
a. For each compilation dir, if there's no __init__.py, and it hasn't been modified in a long time (1 day? configurable), then delete the whole directory -- it is the result of a failed compilation.
b. For each compilation dir, if it has an __init__.py but it hasn't been modified in a long time (1 week? configurable), then no process has used that module in a long time, and it can be removed. Note that if there are two compiled copies of the same module, one of them will be effectively ignored (because it is after the other alphabetically), and will eventually be cleaned up by this mechanism.
c. For each of these clean-ups, start by deleting __init__.py and __init__.pyc, so that no other process accidentally tries to load the module. Once both of those are safely removed, delete the rest of the files, and the containing directory.
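Steps 1 and 2 can be sketched with stdlib calls alone. The function names and the `key_hash` argument are illustrative assumptions, and the real import step is elided:

```python
import os
import tempfile

def start_compilation(compiledir, key_hash):
    # Steps 1a-1b: mkdtemp appends a random suffix to the hash prefix, so two
    # processes compiling the same module get distinct dirs without a lock.
    return tempfile.mkdtemp(prefix=key_hash + "_", dir=compiledir)

def finish_compilation(module_dir):
    # Step 1d: creating __init__.py last marks the module as complete.
    open(os.path.join(module_dir, "__init__.py"), "w").close()

def find_module(compiledir, key_hash):
    # Step 2a: look at candidates with the right prefix in alphabetical order.
    for name in sorted(os.listdir(compiledir)):
        if not name.startswith(key_hash + "_"):
            continue
        init = os.path.join(compiledir, name, "__init__.py")
        if os.path.exists(init):
            os.utime(init, None)  # refresh mtime for cleanup; never creates it
            return os.path.join(compiledir, name)  # real code would import here
    return None  # no finished candidate: compile it ourselves (step 2a3)
```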
Other comments on this proposal:
- The precompiled modules (cuda_ndarray and cutils_ext, and perhaps someday others -- see #403) could use their name (e.g. 'cuda_ndarray') rather than a hash of the cache key as their directory prefix. They then fit into the same general schema as the rest of the ad hoc compilation, rather than being special-cased, as they are now.
- This may increase the number of filesystem stats, because rather than reading the whole compiledir contents once and saving it, we (effectively) list the directory and select directories with the right prefix each time we load a module. On the other hand, we're not opening and unpickling a lot of key.pkl files, so that should compensate.
- To lower the probability of lots of processes repeating the same compilation work, one could try to accumulate all the compilations that need doing, and then do them in a random order, so even the same process executed multiple times simultaneously will likely take different compilation paths. This may be more trouble than it is worth.
- This also will make it easier to start multiple compilations simultaneously from the same process, if we want to do that down the road.
I'd love to get comments on this (up to and including 'you're crazy'). Making any changes to this is going to require a decent amount of work, so it'd be good to get the overall design right before starting.
I suppose that if we compile a module (because no corresponding module was present), then we should load it directly, not go through alphabetical order another time.
I'm not sure how likely it is to happen, but how would we handle collisions between hashes? Maybe we would still have to use a key.pkl to check the key itself.
> I suppose that if we compile a module (because no corresponding module was present), then we should load it directly, not go through alphabetical order another time.
Agreed -- that makes sense.
> I'm not sure how likely it is to happen, but how would we handle collisions between hashes? Maybe we would still have to use a key.pkl to check the key itself.
I generally don't worry about collisions between hashes, but we could definitely keep key.pkl as a sanity check to make double sure that it doesn't happen. Good idea.
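The key.pkl sanity check could be as simple as this sketch (the function name is hypothetical; the key is whatever cache key Theano pickles):

```python
import os
import pickle

def key_matches(module_dir, key):
    # Collision guard: the directory name only encodes a hash of the key,
    # so compare the full key stored in key.pkl before trusting a candidate.
    try:
        with open(os.path.join(module_dir, "key.pkl"), "rb") as f:
            return pickle.load(f) == key
    except (OSError, IOError, pickle.UnpicklingError, EOFError):
        return False  # missing or corrupt key.pkl: treat as a mismatch
```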
We should not make the problem of too much compiled stuff (#638) worse by having multiple copies of the same code. The full buildbot run on Theano creates ~13k different compiled modules, and there is a limit of 32k files/subdirectories on many filesystems. So we must take care to have a good cleanup strategy.
I like the idea of a hash of the key in the dir name. That will also help ticket #436, as we won't need to preload all key.pkl files when starting Theano.
Comments/modifications proposed for your proposal:
1d (comment): file creation is not atomic on NFS; only directory creation is atomic. But as each process has its own directory, it is OK.
1d-e (modification): I would move the current 1d to 1e and create this new 1d:
After compiling, try to load other compiled directories with the good key that are higher in alphabetical order than the one we compiled. If that succeeds, remove the one we just compiled. This will remove many duplicate compilations, and we won't need to wait 7 days before deleting them.
2a (modification): We should cache the last listing of the compile directory. When everything is already compiled, this removes accesses to the file server. This makes compilation faster and lowers the load on the file server.
2a1 (comment): Many NFS file servers don't update file access times, for efficiency. So we MUST use the modification time as you said, not the access time. Currently we use the access time, so this will remove the deletion/recompilation that happens now. I don't think this is costly, as we do it only once per program invocation even if we use the module many times.
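One way the proposed new 1d could look is the sketch below. I assume here that the loader always picks the first finished directory in sorted order, so only copies sorting after that one delete themselves; this keeps the surviving copy stable even if several processes run the check concurrently. The function name and arguments are illustrative:

```python
import os
import shutil

def dedupe_after_compile(compiledir, key_hash, my_dir):
    # Proposed step 1d: after compiling, look for another finished directory
    # with the same hash prefix that sorts before ours. If one exists, drop
    # our copy and use that one, so concurrent duplicates disappear right
    # away instead of waiting out the 7-day cleanup threshold.
    mine = os.path.basename(my_dir)
    for name in sorted(os.listdir(compiledir)):
        if name >= mine:
            break  # only directories sorting before ours matter
        if not name.startswith(key_hash + "_"):
            continue
        path = os.path.join(compiledir, name)
        if os.path.exists(os.path.join(path, "__init__.py")):
            shutil.rmtree(my_dir)  # a finished copy precedes ours: drop ours
            return path
    return my_dir  # no earlier duplicate: keep what we compiled
```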
If we want to go lockless, we must be sure the lock is not used for anything else. I found it used only in these files: theano/gof/cmodule.py, theano/gof/cc.py, and theano/gof/cutils.py, and it always seems to be taken for compilation.
So there should be no problem in that case, to my understanding.
I made 2 changes that should lower the contention on the lock.
1) We take the lock on the compiledir not when we start linking a Theano function, but when we compile the first C code. So when everything is already compiled, we don't take the lock at all.
2) When we load Theano, we compile some utility functions. In the past we always took the lock before trying to load them. Now we try to load them before taking the lock. If the load fails, we take the lock AND retry the load, in case of concurrent compilation.
We still take the lock when we read the contents of the compilation directory when we compile the first Theano function, and when we refresh and clean up old compiled files as Theano exits.
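Change 2) boils down to an optimistic load-then-lock pattern; a minimal sketch, with hypothetical `try_load`/`compile_fn` callables standing in for the real loading and compilation code:

```python
def load_or_compile(try_load, compile_fn, lock):
    # Optimistically try the load before locking; only on failure take the
    # lock, retry the load (another process may have finished compiling in
    # the meantime), and compile only if the module is still missing.
    try:
        return try_load()
    except ImportError:
        pass
    with lock:
        try:
            return try_load()  # concurrent compilation may have finished
        except ImportError:
            return compile_fn()
```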