WIP: CUDA single-precision transforms
This adds some support for single-precision transforms in the CUDA kernels. It currently has merge/rebase conflicts because it was written against the Bitbucket master rather than cuda-devel. I can resolve those soon.