Remove Elemwise and Dot on the inputs of scan
The more that is done outside scan, the better. This optimization targets the case where Elemwise operations are applied to sequences or non_sequences before anything else in the inner function. Those Elemwise operations can be applied to these inputs before feeding them to scan, which would make things faster. There are several things to consider (this is by no means an easy task to solve):
When taking a slice of a sequence `x[i]` out of scan, we replace it by
`x[mintap-i:x.shape[0]-mintap+i]`, which yields another optimization problem. Assume that in the inner function of scan we use `x[k1] + 3` and `x[k2] + 3`. Applying this optimization will make the outer graph contain the nodes
`x[range1] + 3` and `x[range2] + 3`, which I highly doubt will be reduced to `(x+3)[range1]` and `(x+3)[range2]` so that Theano does not compute the same thing twice. Implementing such an optimization is tricky, because pushing a Subtensor outside an Elemwise operation is not always a good idea. Computing `(x+3)[range]` is more expensive than computing `x[range] + 3`, and as long as we cannot guarantee that this will lead to merging some subgraphs together, one should not do this step. As a bonus, if we manage to push the Elemwise inside the Subtensor, we can reduce the Subtensor (when provided as input to scan) into taps.
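A minimal NumPy sketch of the point above, with hypothetical ranges standing in for the slices the two taps would induce: both forms compute the same values, but `x[rangeN] + 3` repeats the elemwise over the overlap of the two slices, while a single `x + 3` would let both slices share the work.

```python
import numpy as np

# Hypothetical sequence and two overlapping slice ranges, as the
# taps k1 and k2 might produce once hoisted out of scan (assumed).
x = np.arange(10.0)
range1 = slice(0, 8)
range2 = slice(2, 10)

# The two forms compute the same values ...
assert np.array_equal(x[range1] + 3, (x + 3)[range1])
assert np.array_equal(x[range2] + 3, (x + 3)[range2])

# ... but the un-merged graph applies `+ 3` twice over the overlap
# [2, 8), whereas computing `x + 3` once lets both slices reuse it.
y = x + 3
shared1, shared2 = y[range1], y[range2]
```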
Another issue related to this is memory consumption. Assume that inside scan we do something of the form `TT.dot(x[k], W)`. If we take this outside scan and do `TT.dot(x, W)`, we increase the memory consumption, which might lead to slower code, especially if `x.shape[1]` is small while `W.shape[1]` is very large.
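The memory tradeoff can be made concrete with a NumPy sketch (the shapes are assumed for illustration): per step, the inner dot only materializes one output row, while the hoisted dot materializes the full result at once.

```python
import numpy as np

# Assumed shapes: many steps, short input rows, very wide output.
n_steps, d_in, d_out = 1000, 4, 4096
x = np.zeros((n_steps, d_in))
W = np.zeros((d_in, d_out))

# Inside scan: one row at a time, only a (d_out,) temporary per step.
per_step = np.dot(x[0], W)

# Hoisted out of scan: n_steps * d_out values live simultaneously.
hoisted = np.dot(x, W)

assert per_step.shape == (d_out,)
assert hoisted.shape == (n_steps, d_out)
```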
This optimization should be controlled by flags that allow deciding what to do and what not to do.
- Elemwise on non_sequences can always be taken out
- Any Theano operation whose inputs are all non_sequences (i.e. scan does not iterate over them) can be taken out of the inner function
- TT.dot between a sequence and a non_sequence can be taken out of the inner function
- Certain reshapes, shape inferences, and certain operations related to random numbers can be taken out of the inner function
- more?
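The first case in the list, an Elemwise on a non_sequence, can be sketched with a plain Python loop standing in for scan (the inner function and shapes are hypothetical):

```python
import numpy as np

x = np.arange(12.0).reshape(4, 3)   # sequence, iterated over
b = np.ones(3)                      # non_sequence, constant per step

# Before: the elemwise exp(b) is recomputed at every step.
inside = np.array([xi * np.exp(b) for xi in x])

# After the optimization: exp(b) is computed once, outside the loop,
# and the precomputed value is passed in as the non_sequence instead.
eb = np.exp(b)
outside = np.array([xi * eb for xi in x])

assert np.allclose(inside, outside)
```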