
performance of odlcuda #29

Description

@mehrhardt

I was running some experiments and ran into what looks like a performance issue with CUDA. I am not sure whether odlcuda is causing it or whether it is a general CUDA phenomenon. Below is the code that I ran, with its output. I don't think it matters much what the code is doing, but I am happy to post more details if needed.

With CUDA, the run time scales roughly linearly with the number of subsets, even though each run executes approximately the same number of flops. With numpy, the timings are nearly constant. None of these timings include copying to or from the GPU. I am aware that launching kernels adds overhead, but I would not have expected it to be this dramatic, particularly since even the smallest subset still contains 262 x 65000 = 17 million elements.

Do you think this is related to the way ODL is written, or is it a general CUDA thing?

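import odl
# src_odl is a local module (not part of ODL) that provides KullbackLeibler.
# The %time lines are IPython magics, so this runs in IPython/Jupyter.

for impl in ['cuda', 'numpy']: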
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape,
                                               impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        %time fx = f(x)
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    %time fx = f(x)

impl:cuda, shape:[4200, 65000], nsubsets:1
CPU times: user 88.2 ms, sys: 24 ms, total: 112 ms
Wall time: 114 ms

impl:cuda, shape:[1050, 65000], nsubsets:4
CPU times: user 325 ms, sys: 28.7 ms, total: 354 ms
Wall time: 361 ms

impl:cuda, shape:[262, 65000], nsubsets:16
CPU times: user 1.43 s, sys: 31.7 ms, total: 1.46 s
Wall time: 1.49 s

impl:cuda, shape:[4200, 65000]
CPU times: user 93.6 ms, sys: 20.2 ms, total: 114 ms
Wall time: 116 ms

impl:numpy, shape:[4200, 65000], nsubsets:1
CPU times: user 10.7 s, sys: 352 ms, total: 11.1 s
Wall time: 6.88 s

impl:numpy, shape:[1050, 65000], nsubsets:4
CPU times: user 13.4 s, sys: 562 ms, total: 14 s
Wall time: 6.9 s

impl:numpy, shape:[262, 65000], nsubsets:16
CPU times: user 24.7 s, sys: 1 s, total: 25.7 s
Wall time: 7.04 s

impl:numpy, shape:[4200, 65000]
CPU times: user 10.8 s, sys: 380 ms, total: 11.1 s
Wall time: 6.92 s
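A rough way to quantify the scaling is to fit a straight line through the three CUDA wall times above. This is only a sketch: it uses three single-run measurements copied from the output, and the helper `fit_line` is a name introduced here for illustration, not something from ODL or odlcuda.

```python
# Wall times (seconds) for impl='cuda', copied from the %time output above.
nsubsets = [1, 4, 16]
wall_s = [0.114, 0.361, 1.49]


def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ intercept + slope * xs (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(nsubsets, wall_s)
print('cost per subset: {:.1f} ms'.format(1e3 * slope))
print('remainder independent of nsubsets: {:.1f} ms'.format(1e3 * intercept))
```

The fitted slope comes out at roughly 90 ms per subset with only a small remainder. Since the subsets shrink as their number grows, this would be consistent with a cost per component that hardly depends on how many elements the component holds, although three data points cannot establish that on their own.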

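# src.tic() / src.toc() are local timing helpers (not part of ODL);
# toc() is used here as returning the seconds elapsed since the last tic().
for impl in ['cuda', 'numpy']: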
    for nsubsets in [1, 4, 16]:
        shape = [4200 // nsubsets, 65000]
        print('impl:{}, shape:{}, nsubsets:{}'.format(impl, shape, nsubsets))
        Y = odl.ProductSpace(odl.uniform_discr([0, 0], shape, shape, impl=impl), nsubsets)
        data = Y.one()
        background = Y.one()
        f = src_odl.KullbackLeibler(Y, data, background)
        x = 2 * Y.one()
        out = Y.element()
        
        t = 0  # accumulated time over all subsets
        for i in range(len(Y)):
            f_prox = f[i].convex_conj.proximal(x[i])
            src.tic()
            f_prox(x[i], out=out[i])
            t += src.toc()
        print('time:{}, average:{}'.format(t, t / len(Y)))
    
    shape = [4200, 65000]
    print('impl:{}, shape:{}'.format(impl, shape))
    Y = odl.uniform_discr([0, 0], shape, shape, impl=impl)
    data = Y.one()
    background = Y.one()
    f = src_odl.KullbackLeibler(Y, data, background)
    x = 2 * Y.one()
    out = Y.element()
    f_prox = f.convex_conj.proximal(x)
    %time f_prox(x, out=out)

impl:cuda, shape:[4200, 65000], nsubsets:1
time:0.607445955276, average:0.607445955276

impl:cuda, shape:[1050, 65000], nsubsets:4
time:0.405075311661, average:0.101268827915

impl:cuda, shape:[262, 65000], nsubsets:16
time:2.36511826515, average:0.147819891572

impl:cuda, shape:[4200, 65000]
CPU times: user 146 ms, sys: 127 µs, total: 146 ms
Wall time: 150 ms

impl:numpy, shape:[4200, 65000], nsubsets:1
time:3.18624901772, average:3.18624901772

impl:numpy, shape:[1050, 65000], nsubsets:4
time:3.12681221962, average:0.781703054905

impl:numpy, shape:[262, 65000], nsubsets:16
time:3.25435972214, average:0.203397482634

impl:numpy, shape:[4200, 65000]
CPU times: user 13.8 s, sys: 1.47 s, total: 15.2 s
Wall time: 3.19 s
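To compare the two backends on equal footing, the per-subset averages above can be converted to a time per element. The sketch below only redoes arithmetic on the numbers printed above; `ns_per_element` is a helper name introduced here, and each figure comes from a single run, so the first CUDA call may well include one-off warm-up cost (an assumption, not something measured).

```python
# Average time per subset (seconds), copied from the tic/toc output above.
ncols = 65000
rows = [4200, 1050, 262]  # rows per subset for nsubsets = 1, 4, 16
avg_s = {
    'cuda': [0.607445955276, 0.101268827915, 0.147819891572],
    'numpy': [3.18624901772, 0.781703054905, 0.203397482634],
}


def ns_per_element(seconds, nrows, ncols):
    """Nanoseconds spent per array element (hypothetical helper)."""
    return 1e9 * seconds / (nrows * ncols)


per_element = {
    impl: [ns_per_element(t, r, ncols) for t, r in zip(times, rows)]
    for impl, times in avg_s.items()
}

for impl, values in per_element.items():
    for r, v in zip(rows, values):
        print('impl:{}, shape:[{}, {}], {:.2f} ns/element'.format(impl, r, ncols, v))
```

On these numbers, numpy stays within a few percent of the same cost per element for all three subset sizes, whereas the CUDA cost per element varies several-fold and is highest for the smallest subsets. That pattern fits a cost per call that does not shrink with the array size, but it does not say whether the cost originates in ODL, odlcuda, or CUDA itself.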
