Should we unroll the full patch multiplication when the image and kernel fit in shared memory?