Optimized number crunching
Number crunching library notes
FFTW
FFTW is the Fastest Fourier Transform library in the West.
...where 'fastest' should probably read "fastest library that's also generic, flexible, and portable" - there are ways to get faster transforms, but they are often less flexible, or cost money -- e.g. low-level architecture-specific optimization can occasionally make Intel MKL faster (on recent Intel chips). Also see ASICs, well-used GPGPU, etc. (assuming you're not dwarfed by some sort of overhead, and other such generally applicable caveats).
tl;dr: FFTW is pretty good out of the box.
A little tweaking is useful if you're going to do a lot of repetitive number crunching.
Precision variants
Generic, double-precision (double, 64-bit):
- include <fftw3.h> (same in all cases)
- use fftw_ prefix on functions
- link using -lfftw3
To force single-precision (float, 32-bit):
- include <fftw3.h> (same in all cases)
- use fftwf_ prefix on functions
- link using -lfftw3f
To force long-double precision (long double; 80-bit on x86, though on some platforms it is 64-bit, or not supported at all):
- include <fftw3.h> (same in all cases)
- use fftwl_ prefix on functions
- link using -lfftw3l
There is also quad precision (128-bit):
- include <fftw3.h> (same in all cases)
- use fftwq_ prefix on functions
- link using -lfftw3q -lm -lquadmath
Notes
- as these are separate libraries with distinctly named functions, you can load multiple at the same time, and use them without conflict.
- If you want threaded calculation (shared-memory parallelism):
- additionally link using -lfftw3_threads / -lfftw3f_threads / -lfftw3l_threads (whichever applies, see above); a minimal setup sketch follows after these notes
- keep in mind that you may only see speed gains for large FFTs (because of moderate overhead of this kind of parallelism)
- To use from MPI (distributed-memory parallelism, also shared-memory parallelism):
- include <fftw3-mpi.h> and link using -lfftw3_mpi
- Relatedly, fftw3ff gives access to ffmpeg's FFT functions
- yes, you could use it directly, but that can be more bothersome.
- this way you get some of FFTW's clever planning. (and its API, which can save a little time/code if you use/allow both)
- Relatedly, fftwn can be useful for Neon-enabled ARM (Note that in function names, fftwn can also refer to n-dimensional FFT)
- Relatedly, fftwni - Neon-enabled processor, intrinsics
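For the threaded case mentioned above, the setup is small. A minimal sketch in double precision (link with -lfftw3_threads -lfftw3 -lpthread -lm):

 #include <fftw3.h>

 int main(void) {
     fftw_init_threads();             /* call once, before creating any plans */
     fftw_plan_with_nthreads(4);      /* plans created after this point use 4 threads */

     int n = 1 << 20;                 /* large enough that threading can pay off */
     fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * n);
     for (int i = 0; i < n; i++) { data[i][0] = i % 7; data[i][1] = 0; }

     fftw_plan p = fftw_plan_dft_1d(n, data, data, FFTW_FORWARD, FFTW_ESTIMATE);
     fftw_execute(p);

     fftw_destroy_plan(p);
     fftw_free(data);
     fftw_cleanup_threads();          /* optional cleanup at the very end */
     return 0;
 }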
On plans and wisdom
Since there are a lot of specific ways to call FFTW (forward or reverse transform, size, data type, data properties, in-place or not), the possible (micro-)optimizations vary in effectiveness.
So FFTW has a planner that tries to pick the best.
At runtime, you ask it to plan for specific input parameters, to some degree of exhaustiveness: FFTW_ESTIMATE is fast to prepare but produces somewhat less optimal plans, while FFTW_MEASURE (and the even more thorough FFTW_PATIENT and FFTW_EXHAUSTIVE) actually benchmarks candidates, which is slower but tends to give faster plans.
The intended case for this is a program that plans once for a lot of identical transforms, to the point where the startup time is irrelevant and worth it.
A plan is specific for its given parameters.
Wisdom is accumulated plans (which can also speed up later planning).
Wisdom is cumulative, and is stored and managed by FFTW itself, though you control whether it is used and whether it is saved (also note that FFTW_ESTIMATE can use existing wisdom but does not generate it).
(You can also store and load it yourself, but be careful: you could accidentally load things for the wrong architecture. If you know what you're doing you can sometimes avoid that very problem in, say, clusters)
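A minimal sketch of doing that yourself (filenames here are just examples; fftw_import_system_wisdom() reads /etc/fftw/wisdom if present):

 #include <fftw3.h>

 int main(void) {
     fftw_import_system_wisdom();                    /* system-wide wisdom, if any */
     fftw_import_wisdom_from_filename("my.wisdom");  /* wisdom we saved earlier, if any */

     int n = 4096;
     double *in = fftw_malloc(sizeof(double) * n);
     fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (n/2 + 1));

     /* With applicable wisdom this planning step is near-instant;
        without it, FFTW_MEASURE spends time benchmarking candidates. */
     fftw_plan p = fftw_plan_dft_r2c_1d(n, in, out, FFTW_MEASURE);

     /* ...fill 'in' and call fftw_execute(p), typically many times... */

     fftw_export_wisdom_to_filename("my.wisdom");    /* accumulate for the next run */
     fftw_destroy_plan(p);
     fftw_free(in);
     fftw_free(out);
     return 0;
 }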
If you do a lot of number crunching, it is recommended to ensure your hosts have wisdom - a bootstrapping set of plans for common sizes.
It's easy enough to just make your programs do planning so that you know a plan/wisdom always applies.
...but there still is value in creating some system wisdom using:
fftw-wisdom, fftwf-wisdom, fftwl-wisdom
See their man pages
The quick and dirty version is:
fftw-wisdom -v -c -o wisdom
...and put that wisdom in /etc/fftw/wisdom. Note that "PLANNING PROBLEM" in its output is not an error; it means "I am currently planning for this particular problem case".
Linear algebra library notes
BLAS
BLAS (Basic Linear Algebra Subprograms)
- is a specification (not implementation) for numerical linear algebra - matrix and vector stuff
- abstract, so allows optimization on specific platforms
- ...and BLAS-using code doesn't have to care which implementation ends up doing the calculations (see the sketch just after this list)
- (origins lie in a Fortran library)
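As an illustration of that last point, a minimal sketch via the C interface (CBLAS) - the calling code stays the same no matter whether you link against OpenBLAS, ATLAS, MKL, or the reference implementation:

 #include <stdio.h>
 #include <cblas.h>

 int main(void) {
     /* C = A * B, with A 2x3 and B 3x2, stored row-major */
     double A[6] = { 1, 2, 3,
                     4, 5, 6 };
     double B[6] = { 7,  8,
                     9, 10,
                    11, 12 };
     double C[4] = { 0 };

     cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                 2, 2, 3,        /* M, N, K       */
                 1.0, A, 3,      /* alpha, A, lda */
                 B, 2,           /* B, ldb        */
                 0.0, C, 2);     /* beta, C, ldc  */

     printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
     return 0;
 }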
There are quite a few implementations of BLAS.
Some of the better known implementations:
- Netlib reference implementation, not optimized for speed
- ATLAS, where the AT is for 'Automatically Tuned'. Basically, it compiles many variants and finds which is fastest for the host it's compiling on
  - doesn't make much sense as a binary package
  - makes it portable - a works-everywhere, reasonably-fast-everywhere implementation
  - apparently has improved recently, now closer to OpenBLAS/MKL
- OpenBLAS, specifically tuned for a set of modern processors
  - (note: also covers the most common LAPACK calls(verify))
  - also quite portable -- sometimes easier to deal with than ATLAS
- GotoBLAS, specific to ~2002-2008 processors. Very good at the time; since merged into OpenBLAS?
Wider things:
- MKL, Intel Math Kernel Library[1]
- (covers BLAS, LAPACK, and some other things that are sometimes very convenient to have)
- paid-for
- Best of this list on some operations when run on specific ranges of Intel CPUs. For other cases there is less difference.
- ACML, AMD Core Math Library[2]
- Comparable to MKL, but for AMD CPUs
- and free
- (apparently does not scale up to multicore as well as MKL?)
- Apple Accelerate Framework
- includes BLAS, LAPACK, and various other things
- easy choice when programming only for Apple platforms, because it's already installed (verify)
LAPACK
Functionally: LAPACK solves some more complex, higher-level problems than BLAS does.
LAPACK is also a fairly specific implementation, not an abstract spec.
It contains some rather clever algorithms (in some cases the only open-source implementation of said algorithm).
In other words, for some applications you are happy with just BLAS, in some cases you want to add LAPACK (or similar).
Speed-wise:
LAPACK is a modern, cache-aware rewrite that replaces LINPACK and EISPACK, so it is typically faster than them.
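As a sketch of the 'higher-level problems' point: solving a small linear system through the C interface (LAPACKE), assuming your LAPACK install ships it (link with something like -llapacke -llapack -lblas -lm):

 #include <stdio.h>
 #include <lapacke.h>

 int main(void) {
     /* Solve A x = b for a 3x3 system (row-major storage). */
     double A[9] = { 3, 1, 2,
                     6, 3, 4,
                     3, 1, 5 };
     double b[3] = { 0, 1, 3 };
     lapack_int ipiv[3];

     lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
     if (info != 0) {
         printf("dgesv failed: %d\n", (int)info);
         return 1;
     }
     printf("x = %g %g %g\n", b[0], b[1], b[2]);   /* b is overwritten with the solution */
     return 0;
 }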
FFTPACK
Pragmatically: a slightly-slower and slightly-more-portable alternative to FFTW or others
As far as I can tell FFTPACK does not use AVX, which means that in some conditions (mostly larger transforms), FFTW (≥3.3), MKL, and such can be more than a little faster.
On processor specialization
GPU, GPGPU, OpenCL, CUDA notes
GPU
GPGPU
GPGPU or not?
Nvidia/CUDA or not?
OpenCL is an open standard, runs on hardware from many vendors, and is a shared project between those vendors.
CUDA is proprietary, and runs only on Nvidia hardware.
CUDA used to be a good deal faster, mostly because it came earlier and was already better optimised. There is now little difference.
That said, while AMD (through OpenCL) can give you more speed for your money in the mid-range, the fancy hardware is mostly Nvidia.
This seems to be the reason various people are sticking with CUDA.
That and laziness.
That said, there is currently a more active connection between Nvidia (the company, also support-wise) and e.g. academic research than there is for AMD, which sometimes results in hardware tweaks.
Optimization
Choices within Nvidia hardware
On microarchitectures:
- Tesla (~2006)
- note: no direct relation to the Tesla GPGPU brand
- Fermi (~2009)
- used in gaming, Quadro, and some Tesla cards
- Kepler (~2012)
- used in higher-end gaming, Quadro
- Maxwell (~2014)
- Used in GeForce 700 series, 800 series, 900 series, some Quadro
- Pascal (~2016)
- targets both processing and gaming
- Volta (~2017)
- Turing (~2018)
- available both with tensor cores and ray tracing (20 series), and without (16 series)
- Ampere (~2020)
- Lovelace (~2022)
On series: (note: not clean splits between microarchitectures, but a decent indication):
- 100, 200, 300: Tesla
- 400 and 500: Fermi (mostly)
- 600: Fermi and Kepler
- 700: Kepler (mostly, a few Maxwell)
- 800: Maxwell
- 900: Maxwell, Pascal
- 10: Pascal
- 16: Turing
- 20: Turing
- 30: Ampere
- 40: Lovelace
Other terms:
- Ti stuck on the end usually means "Same thing, more shaders", and regularly also more memory
- Founders Edition basically means reference design
- NVS basically means "prioritizing low-power, and good for a bunch of large monitors, but don't count on gaming" i.e. good for business use.
On single-precision versus double-precision floating point:
- Tesla is made to be good at both, in that DP speed will be no worse than half of SP speed
- Quadro also does double precision well
- ...though when you only use SP, a good GTX (Titan or no) is much cheaper for the same power
- GeForce (GT, GTX) seems capped on DP performance. As in, the metal could probably do a little better, but the driver limits it (verify). This makes them relatively better at SP.
- Titan notes:
- GTX Titan (700-series, GK110, 2013), DP
- GTX Titan Z (700-series, dual GK110, 2014), DP
- GTX Titan X (900-series, GM200, 2015), SP (more performance per core, though a Z beats it overall)
- GTX 1080 (first Pascal card, already challenging the Titans cost-efficiency-wise)
- Grid K1 and Grid K2 are essentially two or four Kepler GPUs on one card (~= GT 640)
- these seem to be meant for GPU virtualisation (and heat efficiency?), rather than cost-efficient crunching
Some of these distinctions are sensible splits (e.g. a semi-compute, semi-gamer card would not serve either customer well), though there is also some crippling/marketing going on, which means there are some really weird differences. For example, there are some Quadros that are much more expensive than equivalent-speed (and sometimes noticeably faster) GeForces.
Reliability
- Compute cards are typically better cooled and generally more reliable than cut-throat gaming designs, though not always by much
- Bit-flips are more likely on gamer cards - and while in games you may not even notice the errors, for professional use you would care
- The difference seems to be quite real, but not necessarily large.
Memory:
- Memory bandwidth varies, but not hugely
- Tesla and Quadro cards tend to have more
- whether it helps depends entirely on task (and coding)
CUDA notes
"‘memcpy’ was not declared in this scope", in the context of nvcc
Specifically:
/usr/include/string.h: In function 'void* __mempcpy_inline(void*, const void*, size_t)':
/usr/include/string.h:652:42: error: 'memcpy' was not declared in this scope
   return (char *) memcpy (__dest, __src, __n) + __n;
Seems to be due to a change in gcc 5.
Solution:
- Add -D_FORCE_INLINES to the nvcc compile flags. Details of doing so vary per build system.
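For example, for a direct nvcc invocation (the filename here is just a placeholder):
 nvcc -D_FORCE_INLINES -c mykernel.cu
In CMake or other build systems, add it wherever nvcc flags are collected.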
CuFFT Invalid Plan
While this can be caused by coding bugs, a likelier reason is that you ran out of GPU memory.
Compute capability
...is the feature set, which increased over time, and is primarily tied to microarchitecture development (and as such, compute capability relates to age much more than to price):
- Tesla (the microarchitecture, not Tesla-brand cards): 1.x
- Fermi: 2.x
- Kepler: 3.x
- Maxwell: 5.x
- Pascal: 6.x
- Volta: 7.0, Turing: 7.5
- Ampere: 8.0/8.6, Lovelace: 8.9
- Hopper: 9.0
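If you want to check what a given machine has at runtime, the CUDA runtime API will tell you (a minimal sketch; compile with nvcc):

 #include <stdio.h>
 #include <cuda_runtime.h>

 int main(void) {
     int count = 0;
     cudaGetDeviceCount(&count);
     for (int i = 0; i < count; i++) {
         cudaDeviceProp prop;
         cudaGetDeviceProperties(&prop, i);
         printf("Device %d: %s, compute capability %d.%d\n",
                i, prop.name, prop.major, prop.minor);
     }
     return 0;
 }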
See also:
- the appendix of the CUDA C programming guide
Unsorted
CULA is a set of GPU-accelerated linear algebra libraries
AMD ROCm notes
Intel oneAPI notes
OpenCL notes
OpenCL is short for Open Computing Language.
Versions
Getting a graphics card
Setting up OpenCL
Libraries
Boilerplate
On CUDA, OpenCL, SYCL, SPIR, and others
GPU programming notes
Why does it need so much memory?
Broad comparison
Keras notes
TensorFlow notes
Installation
darknet notes
Deep learning
cupy
thinc
prodigy
Theano notes
Errors
ModuleNotFoundError: No module named 'torch._prims_common'
ImportError: from torch import sym_float, sym_int, sym_max
ImportError: cannot import name 'multi_gpu_model'
Keras version muck.
That function was deprecated, and removed in 2020[3].
Use tf.distribute.MirroredStrategy instead?
AttributeError: module 'keras' has no attribute 'api'
(no module named keras.api seems to be the same error?)
This may be a mismatch between tensorflow and keras version.
They should match and (I think) be new enough.
I didn't figure this one out, I just trial-and-errored it until it stopped whining.
...but
pip3 show keras tensorflow
may be a good start to tell whether there's a version mismatch.
And
pip3 install keras==
...is basically intentionally getting the version wrong so that pip tells you which versions exist
Another metric with the same name already exists.
That seems to mean a mismatch of versions between keras and tensorflow, because while they are more entangled now, they also live in separate packages, and dependency management is for people who don't want job security.
You may have more luck _downgrading_ tensorflow to get it to match.
ImportError: cannot import name 'LayerNormalization'
You're probably importing keras yourself.
Tensorflow doesn't like it when you do that.
Try to do it its way.
ModuleNotFoundError: No module named 'tensorflow.contrib'
Version muck.
Probably means the code was written for TF 1; tf.contrib doesn't exist in 2.
The solution is usually to do a web search for the specific import (minus the tensorflow.contrib part) to see where it lives in TF 2.
And then hope the API didn't change.
="This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations"
Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(tensorflow)
(tensorflow)
TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly
(tensorflow)