Optimize reduce for long reductions
current reduce implementation is suitable for many small reductions. These
are done in parallel.
To reduce efficiently in the case of longer sums / products/ mins /maxes,
it would be faster to use block-level parallelism within each sum.
Code a version of reduce that assigns all the threads in a block to the
same reduction, but starting at different places. After thread has added
up its little bit, combine the partial sums in O(log(N)) time.
In this algo, each block would handle one reduction at a time. Then it
would leap ahead by n_blocks to deal with the next one, until all
reductions are complete.
are done in parallel.
To reduce efficiently in the case of longer sums / products/ mins /maxes,
it would be faster to use block-level parallelism within each sum.
Code a version of reduce that assigns all the threads in a block to the
same reduction, but starting at different places. After thread has added
up its little bit, combine the partial sums in O(log(N)) time.
In this algo, each block would handle one reduction at a time. Then it
would leap ahead by n_blocks to deal with the next one, until all
reductions are complete.
Leave a comment