Optimized number crunching


Some fragmented programming-related notes, not meant as introduction or tutorial


Number crunching library notes


FFTW is the Fastest Fourier Transform library in the West.

...where 'fastest' should probably read "fastest library that's also generic, flexible, and portable". There are ways to get faster transforms, but they tend to be less flexible, or cost money - e.g. low-level architecture-specific optimization can occasionally make Intel MKL faster (on recent Intel chips). Also see ASICs, well-used GPGPU, etc. (assuming you're not dwarfed by some sort of overhead, and other such generally applicable caveats)

tl;dr: FFTW is pretty good out of the box.

A little tweaking is useful if you're going to do a lot of repetitive number crunching.

Some acronyms

Generic, double-precision (double, 64-bit):

  • include <fftw3.h> (same in all cases)
  • use fftw_ prefix on functions
  • link using -lfftw3

To force single-precision (float, 32-bit):

  • include <fftw3.h> (same in all cases)
  • use fftwf_ prefix on functions
  • link using -lfftw3f

To force long-double precision (long double; typically 80-bit, though on some platforms it is 64-bit, or not supported at all):

  • include <fftw3.h> (same in all cases)
  • use fftwl_ prefix on functions
  • link using -lfftw3l

There is also quad precision (128-bit):

  • include <fftw3.h> (same in all cases)
  • use fftwq_ prefix on functions
  • link using -lfftw3q -lm -lquadmath


  • As these are separate libraries with distinctly named functions, you can load several at the same time and use them without conflict.

  • If you want threaded calculation (shared-memory parallelism):
additionally link using -lfftw3_threads / -lfftw3f_threads / -lfftw3l_threads (whichever applies, see above)
keep in mind that you may only see speed gains for large FFTs (because of the moderate overhead of this kind of parallelism)
  • To use from MPI (distributed-memory parallelism, also shared-memory parallelism):
link using -lfftw3_mpi (functions use the fftw_mpi_ prefix)

  • Relatedly, fftw3ff gives access to ffmpeg's FFT functions
yes, you could use it directly, but that can be more bothersome.
this way you get some of FFTW's clever planning. (and its API, which can save a little time/code if you use/allow both)
  • Relatedly, fftwn can be useful for Neon-enabled ARM (Note that in function names, fftwn can also refer to n-dimensional FFT)
  • Relatedly, fftwni - Neon-enabled processor, intrinsics

On plans and wisdom

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Since there are a lot of specific ways to call FFTW (forward or reverse transform, size, data type, data properties, in-place or not), the possible (micro-)optimizations vary in effectiveness.

So FFTW has a planner that tries to pick the best.

At runtime, you can ask it to estimate which is best for specific input parameters, to some degree of exhaustiveness. FFTW_MEASURE is thorough and slow (FFTW_PATIENT and FFTW_EXHAUSTIVE are more thorough still); FFTW_ESTIMATE is faster prep but somewhat less optimal.

The intended case for this is a program that plans once for a lot of identical transforms, to the point where the planning time is negligible in comparison, and worth it.

A plan is specific for its given parameters.

Wisdom is accumulated plans (which can also speed up later planning).

Wisdom is cumulative, and is stored and managed by FFTW itself, though you control whether it is used and saved (also note that FFTW_ESTIMATE can use wisdom but will not create it).

(You can also store and load it yourself, but be careful: you could accidentally load wisdom generated for the wrong architecture. If you know what you're doing you can avoid that problem and, say, share wisdom within a homogeneous cluster.)

If you do a lot of number crunching, it is recommended to ensure your hosts have wisdom - a bootstrapping set of plans for common sizes.

It's easy enough to just make your programs do planning so that you know a plan/wisdom always applies.

...but there still is value in creating some system wisdom using the fftw-wisdom command-line tools (fftw-wisdom, fftwf-wisdom, fftwl-wisdom; see their man pages).

The quick and dirty version is:

fftw-wisdom -v -c -o wisdom

...and put that wisdom file in /etc/fftw/wisdom. Note that PLANNING PROBLEM in its output is not an error; it actually means "I am currently planning for this particular problem case".

See also:

Linear algebra library notes


BLAS (Basic Linear Algebra Subprograms)

is a specification (not implementation) for numerical linear algebra - matrix and vector stuff
abstract, so allows optimization on specific platforms
...and BLAS-using code doesn't have to care which implementation ends up doing the calculations
(origins lie in a Fortran library)

There are quite a few implementations of BLAS.

Some of the better known implementations:

  • ATLAS
AT for 'Automatically Tuned'. Basically, it compiles many variants and finds which is fastest for the host it's compiling on
Doesn't make much sense as a binary package
the tuning makes it portable - a works-everywhere, reasonably-fast-everywhere implementation
apparently has improved recently, now closer to OpenBLAS/MKL
  • OpenBLAS
specifically tuned for a set of modern processors
(note: also covers the most common LAPACK calls(verify))
also quite portable -- sometimes easier to deal with than ATLAS
  • GotoBLAS
Specific to ~2002-2008 processors. Very good at the time, since merged into OpenBLAS?

Wider things:

  • MKL, Intel Math Kernel Library[1]
(covers BLAS, LAPACK, and some other things that are sometimes very convenient to have)
Best of this list on some operations when run on specific ranges of Intel CPUs. For other cases there is less difference.
  • ACML, AMD Core Math Library[2]
Comparable to MKL, but for AMD CPUs
and free
(apparently does not scale up to multicore as well as MKL?)

  • Apple Accelerate Framework
includes BLAS, LAPACK, and various other things
easy choice when programming only for Apple, because it's already installed(verify)


Functionally: LAPACK solves some more complex, higher-level problems than BLAS does.

LAPACK is also a fairly specific implementation, not an abstract spec.

It contains some rather clever algorithms (in some cases the only open-source implementation of said algorithm).

In other words, for some applications you are happy with just BLAS, in some cases you want to add LAPACK (or similar).

Speed-wise: LAPACK is a modern, cache-aware rewrite (it replaces LINPACK and EISPACK), so it is typically faster than those.


Pragmatically: FFTPACK is a slightly-slower and slightly-more-portable alternative to FFTW and others

As far as I can tell FFTPACK does not use AVX, which means that in some conditions (mostly larger transforms), FFTW (≥3.3), MKL, and such can be more than a little faster.

See also:

And perhaps:

On processor specialization

GPU, GPGPU, OpenCL, CUDA notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

You may also be looking for

Compute: OpenCL
Graphics: OpenGL · OpenVG
Sound: OpenAL · OpenSL



GPGPU or not?

Nvidia/CUDA or not?

OpenCL is an open standard, runs on hardware from many vendors, and is a shared project between various vendors.

CUDA is proprietary, and runs only on NVidia hardware.

CUDA used to be quite a bit faster, mostly because it came earlier and was already better optimised. There is now little difference.

That said, while AMD (through OpenCL) can give you more speed for your money in the mid-range, the fancy hardware is mostly nVidia.

This seems to be the reason various people are sticking with CUDA.

That and laziness.

That said, there is currently an active connection between nvidia (the company, also support-wise) and e.g. academic research (more than there is for AMD), which sometimes results in hardware tweaks.


Choices within Nvidia hardware

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

On microarchitectures:

  • Tesla (~2006)
note: no direct relation to the Tesla GPGPU brand
  • Fermi (~2009)
used in gaming, Quadro, and some Tesla cards
  • Kepler (~2012)
used in higher-end gaming, Quadro
  • Maxwell (~2014)
Used in GeForce 700 series, 800 series, 900 series, some Quadro
  • Pascal (~2016)
targets both processing and gaming
  • Volta (~2018)
  • Turing (~2019)
available both with tensor and ray tracing (20 series), and without (16 series)
  • Ampere (~2020)
  • Lovelace (~2022)

On series: (note: not clean splits between microarchitectures, but a decent indication):

100, 200, 300: Tesla
400 and 500: Fermi (mostly)
600: Fermi and Kepler
700: Kepler (mostly, a few Maxwell)
800: Maxwell
900: Maxwell, Pascal
10: Pascal
16: Turing
20: Turing
30: Ampere
40: Lovelace

Other terms:

  • Ti stuck on the end usually means "Same thing, more shaders", and regularly also more memory
  • Founders Edition basically means reference design
  • NVS basically means "prioritizing low-power, and good for a bunch of large monitors, but don't count on gaming" i.e. good for business use.

See also:

On single-precision versus double-precision floating point:

  • Tesla is made to be good at both, in that DP speed will be no worse than half of SP speed
  • Quadro also does double precision well
...though when you only use SP, a good GTX (Titan or no) is much cheaper for the same power

  • GeForce (GT, GTX) seems capped at DP. As in, the metal could probably do a little better, but the driver limits it (verify). This makes them relatively better at SP.
Titan notes:
GTX Titan (700-series, GK110, 2013), DP
GTX Titan Z (700-series, dual GK110, 2014), DP
GTX Titan X (900-series, GM200, 2015), SP (more performance per core, though a Z beats it overall)
GTX 1080 (first Pascal card, already challenging the Titans cost-efficiency-wise)
  • Grid K1, Grid K2 are essentially two or four Kepler cores on one card (~= GT 640)
these seem to be meant for GPU virtualisation (and heat efficiency?), rather than cost-efficient crunching

Some are sensible splits (e.g. semi-compute, semi-gamer would not serve either customer well), though there is also some crippling/marketing going on, which means there are some really weird differences. For example, some Quadros are much more expensive than equivalent-speed (and sometimes noticeably faster) GeForces.


  • Compute cards are typically better cooled and generally more reliable than cut-throat gaming designs, though not always by much
  • Bit-flips are more likely on gamer cards - in games you may not even see the errors, while for professional use you would care
The difference seems to be quite real, but not necessarily large.


  • Memory bandwidth varies, but not hugely
  • Tesla and Quadro cards tend to have more
whether it helps depends entirely on task (and coding)

CUDA notes

CUDA related errors

"‘memcpy’ was not declared in this scope", in the context of nvcc


/usr/include/string.h: In function 'void* __mempcpy_inline(void*, const void*, size_t)':
/usr/include/string.h:652:42: error: 'memcpy' was not declared in this scope
     return (char *) memcpy (__dest, __src, __n) + __n;

Seems to be due to a change in gcc 5.


  • Add -D_FORCE_INLINES to the nvcc compile flags. Details of doing so vary per build system.

See also:

CuFFT Invalid Plan

While this can be caused by coding bugs, the likelier reason is often that you ran out of GPU memory.

Compute capability

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

...is the feature set, which increased over time, and is primarily tied to microarchitecture development (and as such, compute relates to age much more than to price):

Tesla             1.x    (Tesla microarchitecture, not Tesla-brand cards)
Fermi             2.x
Kepler            3.x
Maxwell           5.x
Pascal            6.x
Volta, Turing     7.x
Ampere            8.x
Lovelace          8.9
Hopper            9.0

See also:


CULA is a set of GPU-accelerated linear algebra libraries

OpenCL notes


OpenCL stands for Open Computing Language.


Getting a graphics card

Setting up OpenCL



On CUDA, OpenCL, SYCL, SPIR, and others

GPU programming notes

Why does it need so much memory?

Broad comparison

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Keras notes

TensorFlow notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


darknet notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Deep learning




Theano notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


ImportError: cannot import name 'multi_gpu_model'

Keras version muck.

That function was deprecated, and removed in 2020[3].

Use tf.distribute.MirroredStrategy instead?

AttributeError: module 'keras' has no attribute 'api'

(no module named keras.api seems to be the same error?)

This may be a mismatch between tensorflow and keras version.

They should match and (I think) be new enough.

I didn't figure this one out, I just trial-and-errored it until it stopped whining.


pip3 show keras tensorflow

may be a good start to tell whether there's a version mismatch.


pip3 install keras==

...is basically intentionally getting the version wrong, so that pip tells you which versions exist

Another metric with the same name already exists.

That seems to mean a mismatch of versions between keras and tensorflow, because while they are more entangled now, they also live in separate packages, and dependency management is for people who don't want job security.

You may have more luck _downgrading_ tensorflow to get it to match.

ImportError: cannot import name 'LayerNormalization'

You're probably importing keras yourself.

Tensorflow doesn't like it when you do that.

Try to do it its way.

ModuleNotFoundError: No module named 'tensorflow.contrib'

Version muck.

Probably means the code was written for TF 1; tf.contrib doesn't exist in 2.

The solution is usually to do a web search for the specific import (minus the tensorflow.contrib part) to see where it lives in TF 2.

And then hope the API didn't change.

"This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations"

Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory


TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly