I was running some experiments and noticed what looks like a performance issue with CUDA. I am not sure whether odlcuda is causing it or whether this is a general CUDA phenomenon. Below is the code I ran, together with its output. I don't think it is important to know exactly what the code does, but I am happy to post more details if needed.
With CUDA, the run time scales roughly linearly with the number of subsets, even though each run executes approximately the same number of flops. For numpy, the wall times are nearly constant. None of these timings includes copying to or from the GPU. I am aware that launching kernels incurs some overhead, but I would not have expected it to be this dramatic, in particular since even the smallest subset still contains 262 x 65000, i.e. about 17 million, elements.
Do you think this is related to the way ODL is written, or is it a general CUDA thing?
for impl in ['cuda', 'numpy']:
    # Same total size, split into nsubsets equally sized parts
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape,
                                               impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        %time fx = f(x)

    # Reference: the full-size space without a ProductSpace wrapper
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    %time fx = f(x)
impl:cuda, shape:[4200, 65000], nsubsets:1
CPU times: user 88.2 ms, sys: 24 ms, total: 112 ms
Wall time: 114 ms
impl:cuda, shape:[1050, 65000], nsubsets:4
CPU times: user 325 ms, sys: 28.7 ms, total: 354 ms
Wall time: 361 ms
impl:cuda, shape:[262, 65000], nsubsets:16
CPU times: user 1.43 s, sys: 31.7 ms, total: 1.46 s
Wall time: 1.49 s
impl:cuda, shape:[4200, 65000]
CPU times: user 93.6 ms, sys: 20.2 ms, total: 114 ms
Wall time: 116 ms
impl:numpy, shape:[4200, 65000], nsubsets:1
CPU times: user 10.7 s, sys: 352 ms, total: 11.1 s
Wall time: 6.88 s
impl:numpy, shape:[1050, 65000], nsubsets:4
CPU times: user 13.4 s, sys: 562 ms, total: 14 s
Wall time: 6.9 s
impl:numpy, shape:[262, 65000], nsubsets:16
CPU times: user 24.7 s, sys: 1 s, total: 25.7 s
Wall time: 7.04 s
impl:numpy, shape:[4200, 65000]
CPU times: user 10.8 s, sys: 380 ms, total: 11.1 s
Wall time: 6.92 s
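As a quick sanity check on the numbers above, the wall times can be divided by the number of subsets. This is only a small sketch that re-uses the printed wall times (converted to milliseconds); the dictionary and function names are made up for illustration and are not part of ODL.

```python
# Wall times in milliseconds, copied from the output above, keyed by nsubsets.
cuda_wall_ms = {1: 114, 4: 361, 16: 1490}
numpy_wall_ms = {1: 6880, 4: 6900, 16: 7040}


def per_subset_ms(wall_ms):
    """Return the wall time per subset (ms) for each number of subsets."""
    return {n: t / n for n, t in wall_ms.items()}


cuda_per_subset = per_subset_ms(cuda_wall_ms)
numpy_per_subset = per_subset_ms(numpy_wall_ms)

for n in sorted(cuda_wall_ms):
    print('nsubsets:{}, cuda:{:.1f} ms/subset, numpy:{:.1f} ms/subset'.format(
        n, cuda_per_subset[n], numpy_per_subset[n]))
```

For CUDA this gives roughly 90 to 115 ms per subset regardless of the subset size, whereas for numpy the per-subset time drops from 6880 ms to 440 ms, in proportion to the subset size.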
for impl in ['cuda', 'numpy']:
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape, impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        out = Y.element()
        t = 0
        for i in range(len(Y)):
            f_prox = f[i].convex_conj.proximal(x[i])
            # Time only the evaluation of the proximal, not its construction
            src.tic()
            f_prox(x[i], out=out[i])
            t += src.toc()
        print('time:{}, average:{}'.format(t, t / len(Y)))

    # Reference: the full-size space without a ProductSpace wrapper
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    out = Y.element()
    f_prox = f.convex_conj.proximal(x)
    %time f_prox(x, out=out)
impl:cuda, shape:[4200, 65000], nsubsets:1
time:0.607445955276, average:0.607445955276
impl:cuda, shape:[1050, 65000], nsubsets:4
time:0.405075311661, average:0.101268827915
impl:cuda, shape:[262, 65000], nsubsets:16
time:2.36511826515, average:0.147819891572
impl:cuda, shape:[4200, 65000]
CPU times: user 146 ms, sys: 127 µs, total: 146 ms
Wall time: 150 ms
impl:numpy, shape:[4200, 65000], nsubsets:1
time:3.18624901772, average:3.18624901772
impl:numpy, shape:[1050, 65000], nsubsets:4
time:3.12681221962, average:0.781703054905
impl:numpy, shape:[262, 65000], nsubsets:16
time:3.25435972214, average:0.203397482634
impl:numpy, shape:[4200, 65000]
CPU times: user 13.8 s, sys: 1.47 s, total: 15.2 s
Wall time: 3.19 s
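One way to separate "ODL overhead" from "overhead of splitting the work into chunks" would be a library-independent micro-benchmark along the following lines. This is only a sketch under several assumptions: it uses plain NumPy on a smaller array so that it runs anywhere, `kl_like` is a stand-in reduction and not ODL's `KullbackLeibler`, and `time_chunked` and its `sync` hook are hypothetical names. For a GPU array library the hook would need to wait for the device before the clock is stopped, since CUDA kernel launches are in general asynchronous.

```python
import time

import numpy as np


def kl_like(x):
    """Stand-in for a KL-type functional: sum(x - 1 - log(x)). Not ODL code."""
    return float(np.sum(x - 1.0 - np.log(x)))


def time_chunked(total_rows, ncols, nchunks, func, repeats=3, sync=None):
    """Best wall time (s) over `repeats` runs of `func` on `nchunks` row blocks."""
    rows = total_rows // nchunks
    chunks = [2.0 * np.ones((rows, ncols)) for _ in range(nchunks)]
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        for chunk in chunks:
            func(chunk)
        if sync is not None:
            sync()  # hypothetical hook: wait for pending device work
        best = min(best, time.perf_counter() - start)
    return best


# Same splitting pattern as in the experiments above, at a tenth of the size.
timings = {n: time_chunked(420, 6500, n, kl_like) for n in [1, 4, 16]}
for n, t in timings.items():
    print('nchunks:{}, time:{:.4f} s, per chunk:{:.4f} s'.format(n, t, t / n))
```

If the total time stays flat here but grows with the number of chunks in the corresponding GPU version, that would point towards per-call overhead rather than towards the functional itself.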