Create an op ElemwiseReduce that combines an elemwise operation followed by a reduce operation.
As an optimization, we could create an op ElemwiseReduce that performs the elemwise and the reduce at the same time. Here is one direction for doing it on the CPU:
1) Initialize the output to 0 (or use an "if" when we write, if the write is not in an inner loop).
2) In the loop over a dim that we reduce, reset the output ptr at each iteration to where it was at the first iteration of that loop. Then we apply the reduce op before we write to that location.
2a) If the summed dims are the innermost dims: move the output write to the loop dimensions before the ones we sum.
2b) If we sum over the outer dims...
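The CPU steps above can be sketched roughly as follows. This is only an illustration in pure Python for a 2D input (non-empty, rectangular list of lists); the function name `elemwise_sum`, its parameters, and the choice of sum as the reduce op are assumptions made for the example, not an existing API. The `axis == 0` branch shows step 2 in its general form (output position reset at each iteration of the reduced loop), and the `axis == 1` branch shows the hoisted write of step 2a.

```python
def elemwise_sum(x, f, axis):
    """Apply f to every entry of a 2D list and sum over `axis` in one pass.

    Hypothetical helper: no intermediate f(x) array is materialised.
    """
    n_rows, n_cols = len(x), len(x[0])
    if axis == 1:
        # Step 2a: the reduced dim is the innermost loop, so accumulate in a
        # local variable and write once from the enclosing loop.
        out = []
        for i in range(n_rows):
            acc = 0.0
            for j in range(n_cols):
                acc += f(x[i][j])
            out.append(acc)
        return out
    if axis == 0:
        # Step 1: initialise the output to 0.
        out = [0.0] * n_cols
        for i in range(n_rows):          # loop over the reduced dim
            out_pos = 0                  # Step 2: reset the output position
            for j in range(n_cols):
                out[out_pos] += f(x[i][j])   # reduce before we write
                out_pos += 1
        return out
    raise ValueError("axis must be 0 or 1 for a 2D input")


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5, 6]]
    print(elemwise_sum(data, lambda v: v * v, axis=1))
    print(elemwise_sum(data, lambda v: v * v, axis=0))
```

In a real C implementation the `out_pos` index would be the output pointer, and the initialisation in step 1 could be replaced by the "if" on first write mentioned above.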
How to do this on the GPU?
1) We could use the reduce algo and add the elemwise. Good performance?
2) We could use the elemwise algo and reduce before we write. May need a second kernel for the final reduction. Good performance?
3) Choose dynamically between the two depending on the size of the inputs... If small, use the Reduce block size; if big, use the Elemwise block size with a second kernel; if medium?
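To make option 2 more concrete, here is a small CPU-side simulation of the two-stage idea, written in pure Python rather than as real GPU code. Each "block" applies the elemwise function to its slice and reduces it to one partial result, and a second pass stands in for the final-reduction kernel. All names (`block_partials`, `final_reduce`, `elemwise_sum_two_stage`) and the default `block_size` are hypothetical and chosen only for illustration; they say nothing about which block size would actually perform well.

```python
def block_partials(x, f, block_size):
    """Stage 1 (stands in for the elemwise kernel): one partial sum per block."""
    partials = []
    for start in range(0, len(x), block_size):
        acc = 0.0
        for v in x[start:start + block_size]:
            acc += f(v)              # elemwise op, reduced before the write
        partials.append(acc)         # one write per block
    return partials


def final_reduce(partials):
    """Stage 2 (stands in for the second kernel): combine the partial sums."""
    total = 0.0
    for p in partials:
        total += p
    return total


def elemwise_sum_two_stage(x, f, block_size=4):
    """Fused f(x) followed by a full sum, done in two stages."""
    return final_reduce(block_partials(x, f, block_size))


if __name__ == "__main__":
    values = list(range(10))
    print(block_partials(values, lambda v: v * v, 4))
    print(elemwise_sum_two_stage(values, lambda v: v * v))
```

When the input fits in a single block, stage 1 already yields the final value and the second pass is trivial, which is one reason the choice in option 3 could depend on input size.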