performance of odlcuda

I was running some experiments and there seems to be a performance issue with CUDA. However, I am not sure whether odlcuda is causing it or whether this is just a general CUDA phenomenon. Below is the code that I ran, together with its output. I don't think it is very important to know what this code is doing, but if need be, I am happy to post that.

With CUDA, the run time seems to scale linearly with the number of subsets, even though approximately the same number of flops is executed in each run. With numpy, the timings are pretty much constant. None of these timings include copying from or to the GPU. I am aware that executing kernels generates an overhead, but I would not have thought it to be this dramatic, in particular since even the smallest subset still contains 262 x 65000 ≈ 17 million elements.

Do you think that this is related to the way ODL is written or do you think this is a general CUDA thing?

```
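# Run in IPython/Jupyter (uses the %time magic).
# src_odl is a local module that is not shown here.
import odl

for impl in ['cuda', 'numpy']: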
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape,
                                               impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        %time fx = f(x)
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    %time fx = f(x)
```

> impl:cuda, shape:[4200, 65000], nsubsets:1
> CPU times: user 88.2 ms, sys: 24 ms, total: 112 ms
> Wall time: 114 ms
> 
> impl:cuda, shape:[1050, 65000], nsubsets:4
> CPU times: user 325 ms, sys: 28.7 ms, total: 354 ms
> Wall time: 361 ms
> 
> impl:cuda, shape:[262, 65000], nsubsets:16
> CPU times: user 1.43 s, sys: 31.7 ms, total: 1.46 s
> Wall time: 1.49 s
> 
> impl:cuda, shape:[4200, 65000]
> CPU times: user 93.6 ms, sys: 20.2 ms, total: 114 ms
> Wall time: 116 ms
> 
> impl:numpy, shape:[4200, 65000], nsubsets:1
> CPU times: user 10.7 s, sys: 352 ms, total: 11.1 s
> Wall time: 6.88 s
> 
> impl:numpy, shape:[1050, 65000], nsubsets:4
> CPU times: user 13.4 s, sys: 562 ms, total: 14 s
> Wall time: 6.9 s
> 
> impl:numpy, shape:[262, 65000], nsubsets:16
> CPU times: user 24.7 s, sys: 1 s, total: 25.7 s
> Wall time: 7.04 s
> 
> impl:numpy, shape:[4200, 65000]
> CPU times: user 10.8 s, sys: 380 ms, total: 11.1 s
> Wall time: 6.92 s
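
To put a rough number on the scaling, here is a minimal sketch that fits a straight line to the wall times above. The helper `fit_line` is written just for this illustration and is not part of ODL or odlcuda, and three data points per backend are of course only enough for a sanity check.

```python
# Wall times in seconds, copied from the output above (nsubsets -> seconds).
wall = {
    'cuda': {1: 0.114, 4: 0.361, 16: 1.49},
    'numpy': {1: 6.88, 4: 6.9, 16: 7.04},
}


def fit_line(points):
    """Ordinary least-squares fit of t = intercept + slope * n."""
    xs = [n for n, _ in points]
    ys = [t for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


fits = {}
for impl, times in wall.items():
    fits[impl] = fit_line(sorted(times.items()))
    intercept, slope = fits[impl]
    print('{}: intercept {:.3f} s, slope {:.3f} s per subset'.format(
        impl, intercept, slope))
```

With these numbers the CUDA slope comes out at roughly 90 ms per additional subset with an intercept close to zero, while the numpy slope is only about 10 ms on top of a base of almost 7 s. That would be consistent with a roughly fixed cost per subset on the CUDA side, independent of the subset size.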


```
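# Same setup, but timing the proximal operator of the convex conjugate
# separately for each subset.
# src.tic() / src.toc() are local timing helpers that are not shown here.
import odl

for impl in ['cuda', 'numpy']: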
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape, impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        out = Y.element()
        
        t = 0        
        for i in range(len(Y)):
            f_prox = f[i].convex_conj.proximal(x[i])
            src.tic()
            f_prox(x[i], out=out[i])
            t += src.toc()
        print('time:{}, average:{}'.format(t, t / len(Y)))
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    out = Y.element()
    f_prox = f.convex_conj.proximal(x)
    %time f_prox(x, out=out)
```


> impl:cuda, shape:[4200, 65000], nsubsets:1
> time:0.607445955276, average:0.607445955276
> 
> impl:cuda, shape:[1050, 65000], nsubsets:4
> time:0.405075311661, average:0.101268827915
> 
> impl:cuda, shape:[262, 65000], nsubsets:16
> time:2.36511826515, average:0.147819891572
> 
> impl:cuda, shape:[4200, 65000]
> CPU times: user 146 ms, sys: 127 µs, total: 146 ms
> Wall time: 150 ms
> 
> impl:numpy, shape:[4200, 65000], nsubsets:1
> time:3.18624901772, average:3.18624901772
> 
> impl:numpy, shape:[1050, 65000], nsubsets:4
> time:3.12681221962, average:0.781703054905
> 
> impl:numpy, shape:[262, 65000], nsubsets:16
> time:3.25435972214, average:0.203397482634
> 
> impl:numpy, shape:[4200, 65000]
> CPU times: user 13.8 s, sys: 1.47 s, total: 15.2 s
> Wall time: 3.19 s
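
As a worked example of what these numbers mean per element, the sketch below converts the per-call times above into throughput (elements per second). It compares one call on a single subset of the 16-subset run (using the reported average) with the call on the full array (using the `%time` wall time). The names `runs` and `throughput` are only for this illustration.

```python
# Per-call times in seconds, copied from the output above:
# case -> (number of rows, seconds for one proximal call).
NCOLS = 65000
runs = {
    'cuda': {'subset': (262, 0.147819891572), 'full': (4200, 0.150)},
    'numpy': {'subset': (262, 0.203397482634), 'full': (4200, 3.19)},
}


def throughput(rows, seconds, cols=NCOLS):
    """Elements processed per second in a single call."""
    return rows * cols / seconds


ratios = {}
for impl, cases in runs.items():
    small = throughput(*cases['subset'])
    full = throughput(*cases['full'])
    # Ratio of full-array throughput to single-subset throughput.
    ratios[impl] = full / small
    print('{}: subset {:.3g} el/s, full {:.3g} el/s, ratio {:.1f}'.format(
        impl, small, full, ratios[impl]))
```

For numpy the throughput is essentially the same for the small and the full array. For CUDA, a single call on one sixteenth of the data takes about as long as a call on all of it, so the per-element throughput differs by a factor of roughly 16.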


