GPU, GPGPU, OpenCL, CUDA notes

From Helpful
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

You may also be looking for

Compute: OpenCL
Graphics: OpenGL
Sound: OpenAL · OpenSL




GPGPU or not?

Nvidia/CUDA or not?

OpenCL is an open standard (rather than open source as such), is a shared project between various vendors, and runs on hardware from most of them (GPUs, CPUs, and more).

CUDA is proprietary, and runs only on NVIDIA hardware.

CUDA used to be a good deal faster, mostly because it came earlier and was already better optimised. There is now little difference.

That said, while AMD (through OpenCL) can give you more speed for your money in the mid-range, the fancy high-end hardware is mostly nVidia.

This seems to be the reason various people are sticking with CUDA.

That and laziness.

That said, there is currently an active connection between nvidia (the company, also support-wise) and e.g. academic research (more than there is for AMD), which sometimes results in hardware tweaks.


Choices within Nvidia hardware

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

On microarchitectures:

  • Tesla (~2006)
note: no direct relation to the Tesla GPGPU brand
  • Fermi (~2009)
used in gaming, Quadro, and some Tesla cards
  • Kepler (~2012)
used in higher-end gaming, Quadro
  • Maxwell (~2014)
Used in GeForce 700 series, 800 series, 900 series, some Quadro
  • Pascal (~2016)
targets both processing and gaming
  • Volta (~2017)
mostly compute-oriented (e.g. V100); introduced tensor cores
  • Turing (~2018)
available both with tensor and ray tracing cores (20 series), and without (16 series)
  • Ampere (~2020)

On series: (note: not clean splits between microarchitectures, but a decent indication):

100, 200, 300: Tesla
400 and 500: Fermi (mostly)
600: Fermi and Kepler
700: Kepler (mostly, a few Maxwell)
800: Maxwell
900: Maxwell
10: Pascal
16: Turing
20: Turing

Other terms:

  • Ti stuck on the end usually means "Same thing, more shaders", and regularly also more memory
  • Founders Edition basically means reference design
  • NVS basically means "prioritizing low-power, and good for a bunch of large monitors, but don't count on gaming" i.e. good for business use.

See also:

On single-precision versus double-precision floating point:

  • Tesla is made to be good at both, in that DP speed will be no worse than half of SP speed
  • Quadro also does double precision well
...though when you only use SP, a good GTX (Titan or no) is much cheaper for the same power

  • GeForce (GT, GTX) seems to have its DP rate capped. As in, the metal could probably do a little better, but the driver limits it (verify). This makes them relatively better at SP.
Titan notes:
GTX Titan (700-series, GK110, 2013), DP
GTX Titan Z (700-series, dual GK110, 2014), DP
GTX Titan X (900-series, GM200, 2015), SP (more performance per core, though a Z beats it overall)
GTX 1080 (first Pascal card, already challenging the Titans cost-efficiency-wise)
  • Grid K1 and Grid K2 are essentially four or two (respectively) Kepler GPUs in one card (each roughly ~= GT 640)
these seem to be meant for GPU virtualisation (and heat efficiency?), rather than cost-efficient crunching

Some of these distinctions are sensible splits (e.g. a semi-compute, semi-gamer card would not serve either customer well), but there is also some crippling/marketing going on, which means there are some really weird differences. For example, some Quadros are much more expensive than equivalent-speed (and sometimes noticeably faster) GeForces.


  • Compute cards are typically better cooled and generally more reliable than cut-throat gaming designs, though not always by much
  • Bit-flips are more likely on gamer cards; in games you may not even see the errors, while for professional use you would care
The difference seems to be quite real, but not necessarily large.


  • Memory bandwidth varies, but not hugely
  • Tesla and Quadro cards tend to have more of it
whether it helps depends entirely on task (and coding)
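Whether extra bandwidth helps depends on the task's arithmetic intensity. A rough CPU-side illustration of the same idea in plain numpy (the sizes here are arbitrary): an elementwise add does little arithmetic per byte moved and is typically memory-bound, while a matrix product does much more arithmetic per byte and is typically compute-bound. The same distinction determines whether a GPU's memory bandwidth is the bottleneck.

```python
import time
import numpy as np

def best_time(f, repeats=3):
    """Fastest wall-clock time over a few runs of f()."""
    times = []
    for _ in range(repeats):
        t0 = time.time()
        f()
        times.append(time.time() - t0)
    return min(times)

a = np.random.rand(2_000_000).astype(np.float32)
b = np.random.rand(2_000_000).astype(np.float32)
m = np.random.rand(600, 600).astype(np.float32)

# elementwise add: one FLOP per ~12 bytes moved, so limited by memory speed
t_add = best_time(lambda: a + b)
# matrix product: many FLOPs per byte, so limited by arithmetic throughput
t_mm = best_time(lambda: m.dot(m))

# effective bandwidth of the add (read a, read b, write result)
gb_per_s = 3 * a.nbytes / t_add / 1e9
print("add: %.4fs (~%.1f GB/s), matmul: %.4fs" % (t_add, gb_per_s, t_mm))
```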

Theano notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Settings file


In ~/.theanorc (%USERPROFILE%\.theanorc on windows) put something like:

[global]
device = gpu
floatX = float32

[nvcc]
flags = --use-local-env  --cl-version=2008

(the [nvcc] flags shown here are mainly relevant on windows with MSVC)
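The same settings can also be given per run via the THEANO_FLAGS environment variable (a standard Theano mechanism), which overrides the settings file. The script name here is a placeholder:

```shell
THEANO_FLAGS='device=gpu,floatX=float32' python myscript.py
```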

Testing whether it uses the GPU

Just seeing that it's much faster is a damn good hint. The following from here will do a dot product in numpy (CPU) and in theano (CPU or GPU, you'll see).

import numpy as np
import time
import theano

A = np.random.rand(1000,10000).astype(theano.config.floatX)
B = np.random.rand(10000,1000).astype(theano.config.floatX)

# numpy dot product (CPU)
np_start = time.time()
AB = A.dot(B)
np_end = time.time()

# the same product via a theano function (CPU or GPU, depending on config)
X, Y = theano.tensor.matrices('X', 'Y')
mf = theano.function([X, Y], X.dot(Y))
t_start = time.time()
tAB = mf(A, B)
t_end = time.time()

# times should be close when run on CPU
print("NP time: %f[s], theano time: %f[s]" % (np_end-np_start, t_end-t_start))
print("Result difference: %f" % (np.abs(AB-tAB).max(),))

On CPU, the output is:

NP time: 14.204086[s], theano time: 13.161064[s]
Result difference: 0.000000

On GPU it's:

Using gpu device 0: GeForce GTX 580
NP time: 14.876987[s], theano time: 0.052242[s]
Result difference: 0.000977

(Note that with 64-bit precision, the GPU may be CPU-speed)

Or the test from here:

from theano import function, config, shared, tensor, sandbox
import numpy
import time
vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

(for me ~4.5sec on CPU, ~0.75sec on GPU)

On speed

(Some of this also applies in general)


OpenCL is short for Open Computing Language.


Getting a graphics card

Setting up OpenCL



CUDA notes


"‘memcpy’ was not declared in this scope", in the context of nvcc


/usr/include/string.h: In function 'void* __mempcpy_inline(void*, const void*, size_t)':
/usr/include/string.h:652:42: error: 'memcpy' was not declared in this scope
     return (char *) memcpy (__dest, __src, __n) + __n;

Seems to be due to a change in gcc 5.


  • Add -D_FORCE_INLINES to the nvcc compile flags. Details of doing so vary per build system.
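For example (with a placeholder file name), directly on the command line, or as an extra flag in a Makefile:

```shell
# straight nvcc invocation:
nvcc -D_FORCE_INLINES -o myprog myprog.cu

# or, in a typical Makefile, something like:
#   NVCCFLAGS += -D_FORCE_INLINES
```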

See also:

CuFFT Invalid Plan

While this can be caused by coding bugs, the likelier reason is often that you ran out of GPU memory (nvidia-smi shows current memory use per card).

Compute capability

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Compute capability describes the feature set, which increased over time, and is primarily tied to microarchitecture development:

Tesla   1.x  (Tesla microarchitecture, not Tesla-brand cards)
Fermi   2.x
Kepler  3.x
Maxwell 5.x
Pascal  6.x
Volta   7.0
Turing  7.5
Ampere  8.x

As such, it relates to age much more than to price.

For a detailed summary, see the programming guide (linked below)
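The table above amounts to a small lookup, sketched here in plain Python (the names are just the table's data, not any official API):

```python
# Rough map from compute-capability major version to microarchitecture name,
# taken from the table above.
CC_MAJOR_TO_ARCH = {
    1: "Tesla",
    2: "Fermi",
    3: "Kepler",
    5: "Maxwell",
    6: "Pascal",
    7: "Volta",   # 7.0; Turing is 7.5 (special-cased below)
    8: "Ampere",
}

def arch_for_cc(major, minor=0):
    """Name the microarchitecture for a compute capability like (6, 1)."""
    if major == 7 and minor >= 5:
        return "Turing"
    return CC_MAJOR_TO_ARCH.get(major, "unknown")

print(arch_for_cc(3, 5))   # Kepler
print(arch_for_cc(7, 5))   # Turing
```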

See also:


CULA is a set of GPU-accelerated linear algebra libraries