Should we unroll the full patch multiplication when the image and kernel fit in shared memory?
Maybe we can get a good speedup by unrolling the full patch
multiplication instead of only the multiplication by a single row.
Would this be faster? What would be the impact on the number of registers used?
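
To make the question concrete, here is a minimal sketch of the two variants. It is not the project's actual kernel code: the names `patch_dot_row_unrolled` and `patch_dot_full_unrolled`, the row-major layout, and the `HD`/`UNROLL` macros are all assumptions for illustration. The functions compile as plain C++ on the host and as `__host__ __device__` functions under `nvcc`, where `img` and `ker` would point into shared memory.

```cpp
// Hypothetical helpers, not taken from the code base.
#ifdef __CUDACC__
#define HD __host__ __device__
#define UNROLL _Pragma("unroll")
#else
#define HD
#define UNROLL
#endif

// Row-only variant: the loop over one kernel row has a compile-time trip
// count (KW) and can be unrolled; the loop over rows stays a runtime loop.
template <int KW>
HD float patch_dot_row_unrolled(const float* img, int img_stride,
                                const float* ker, int kh) {
    float acc = 0.0f;
    for (int r = 0; r < kh; ++r) {
        UNROLL
        for (int c = 0; c < KW; ++c)
            acc += img[r * img_stride + c] * ker[r * KW + c];
    }
    return acc;
}

// Full-patch variant: both trip counts (KH, KW) are compile-time constants,
// so the compiler can unroll the whole KH x KW multiply-accumulate.
template <int KH, int KW>
HD float patch_dot_full_unrolled(const float* img, int img_stride,
                                 const float* ker) {
    float acc = 0.0f;
    UNROLL
    for (int r = 0; r < KH; ++r) {
        UNROLL
        for (int c = 0; c < KW; ++c)
            acc += img[r * img_stride + c] * ker[r * KW + c];
    }
    return acc;
}
```

Both functions compute the same sum, so the difference is only in the generated code. Full unrolling removes the row loop's overhead, but it may raise register pressure if the compiler keeps more operands live at once. Compiling each variant with `nvcc -Xptxas -v` prints the registers used per thread, which should answer the register question directly.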