If anyone's interested...
I finally had some time to run tests on the CUDA convolution kernel I'm working on.
The figures from one test run look promising, except for retrieving
the data back from the card:
block size: 256
memory move time: 0.016909 (ms)
FFT processing time: 0.069471 (ms)
complex multiply: 0.023685 (ms) (0.021075 ms even with a bigger 177152 block size)
iFFT processing time: 0.061804 (ms)
move data back: 1.78119 (ms)
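
For anyone unfamiliar with the pipeline being timed here, the stages are a forward FFT, a per-bin complex multiply, and an inverse FFT. Below is a rough CPU-only Python sketch of that idea, not the CUDA kernel itself. The function names (fft, fft_convolve, direct_convolve) and the tiny test signal are made up for illustration, and the recursive FFT is just the simplest one that fits in a post.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_convolve(signal, impulse):
    """Linear convolution via FFT -> complex multiply -> inverse FFT."""
    out_len = len(signal) + len(impulse) - 1
    size = 1
    while size < out_len:          # zero-pad to a power of two
        size *= 2
    a = fft(list(signal) + [0.0] * (size - len(signal)))
    b = fft(list(impulse) + [0.0] * (size - len(impulse)))
    product = [x * y for x, y in zip(a, b)]   # the "complex multiply" stage
    result = fft(product, inverse=True)
    # The inverse above is unnormalized, so divide by the transform size.
    return [v.real / size for v in result[:out_len]]

def direct_convolve(signal, impulse):
    """Plain time-domain convolution, used only as a reference."""
    out = [0.0] * (len(signal) + len(impulse) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse):
            out[i + j] += s * h
    return out

# Hypothetical toy data, just to compare the two approaches.
signal = [1.0, 2.0, 3.0, 4.0]
impulse = [0.5, 0.25]
fast = fft_convolve(signal, impulse)
slow = direct_convolve(signal, impulse)
print(fast)
print(slow)
```

Both lists should agree up to floating-point rounding. On the card the same three stages run as separate kernels, which is why they show up as separate timings.
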
I tried to simulate a situation where I would have a 256-sample block size (the latency I currently use in my own studio)
and would only need to move that amount of data back and forth for every block. I have no idea how VSTs work internally with regard to latency, or whether the times presented here will be low enough for it to work, but I'm going to try it anyway.
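
As a rough sanity check on whether those times could fit, here is a small Python sketch that adds up the stage timings from the list above and compares the total with the duration of one 256-sample block. The 44.1 kHz sample rate is my assumption (it is not part of the test), and so is the idea that the host calls the plug-in once per block and all the work must finish before the next block is due.

```python
# Stage timings in milliseconds, copied from the test run above.
stage_ms = {
    "memory move": 0.016909,
    "FFT": 0.069471,
    "complex multiply": 0.023685,
    "iFFT": 0.061804,
    "move data back": 1.78119,
}

def block_period_ms(block_size, sample_rate_hz):
    """Duration of one audio block in milliseconds."""
    return 1000.0 * block_size / sample_rate_hz

BLOCK_SIZE = 256          # from the test
SAMPLE_RATE_HZ = 44100    # assumption, not stated in the test

total_ms = sum(stage_ms.values())
budget_ms = block_period_ms(BLOCK_SIZE, SAMPLE_RATE_HZ)
readback_share = stage_ms["move data back"] / total_ms

print("total GPU round trip: %.3f ms" % total_ms)
print("time per block:       %.3f ms" % budget_ms)
print("budget used:          %.0f%%" % (100.0 * total_ms / budget_ms))
print("read-back share:      %.0f%%" % (100.0 * readback_share))
```

Under those assumptions the round trip is about 1.95 ms against roughly 5.8 ms per block, so a single pass would fit, with the read-back accounting for most of the total. This only covers one pass over one block; it says nothing about host overhead or longer impulse responses.
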

With bigger block sizes the memory move times increase linearly, and when building in debug mode all kernels seem to freeze up (the time they take to run increases almost 100x), so keep both points in mind if you are evaluating the CUDA platform for a project.
The tests were done on a 2.8 GHz Pentium D with a 96-core 8800 GTS card (G80) and CUDA 2.0.