Optimized number crunching


Some fragmented programming-related notes, not meant as introduction or tutorial


Number crunching library notes


FFTW is the Fastest Fourier Transform library in the West.

...where 'fastest' should probably read "fastest library that's also generic, flexible, and portable". There are ways to get faster transforms, but they tend to be less flexible, or cost money - e.g. low-level architecture-specific optimization can occasionally make Intel MKL faster (on recent Intel chips). Also see ASICs, well-used GPGPU, etc. (assuming you're not dwarfed by some sort of overhead, and other such generally applicable caveats)

tl;dr: FFTW is pretty good out of the box.

A little tweaking is useful if you're going to do a lot of repetitive number crunching.

Some acronyms

Generic, double-precision (double, 64-bit):

  • include <fftw3.h> (same in all cases)
  • use fftw_ prefix on functions
  • link using -lfftw3

To force single-precision (float, 32-bit):

  • include <fftw3.h> (same in all cases)
  • use fftwf_ prefix on functions
  • link using -lfftw3f

To force long-double precision (long double; typically 80-bit, though on some platforms it is 64-bit, or not supported at all):

  • include <fftw3.h> (same in all cases)
  • use fftwl_ prefix on functions
  • link using -lfftw3l

There is also quad precision (128-bit):

  • include <fftw3.h> (same in all cases)
  • use fftwq_ prefix on functions
  • link using -lfftw3q -lm -lquadmath


  • As these are separate libraries with distinctly named functions, you can load several at the same time and use them without conflict.

  • If you want threaded calculation (shared-memory parallelism):
additionally link using -lfftw3_threads / -lfftw3f_threads / -lfftw3l_threads (whichever applies, see above)
keep in mind that you may only see speed gains for large FFTs (because of the moderate overhead of this kind of parallelism)
  • To use from MPI (distributed-memory parallelism, also shared-memory parallelism):
link using -lfftw3_mpi (functions use the fftw_mpi_ prefix)

  • Relatedly, fftw3ff gives access to ffmpeg's FFT functions
yes, you could use it directly, but that can be more bothersome.
this way you get some of FFTW's clever planning. (and its API, which can save a little time/code if you use/allow both)
  • Relatedly, fftwn can be useful for Neon-enabled ARM (Note that in function names, fftwn can also refer to n-dimensional FFT)
  • Relatedly, fftwni - Neon-enabled processor, intrinsics

On plans and wisdom

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Since there are a lot of specific ways to call FFTW (forward or reverse transform, size, data type, data properties, in-place or not), the possible (micro-)optimizations vary in effectiveness.

So FFTW has a planner that tries to pick the best.

At runtime, you can ask it to estimate which is best for specific input parameters, to some degree of exhaustiveness. FFTW_MEASURE is thorough and slow (FFTW_PATIENT and FFTW_EXHAUSTIVE are more thorough still); FFTW_ESTIMATE is faster prep but somewhat less optimal.

The intended case for this is a program that plans once for a lot of identical transforms, to the point where the planning time is negligible in comparison, and worth it.

A plan is specific for its given parameters.

Wisdom is accumulated plans (which can also speed up later planning).

Wisdom is cumulative, and is stored and managed by FFTW itself, though you control whether it is used and saved (also note that FFTW_ESTIMATE can use wisdom but will not create it).

(You can also store and load it yourself, but be careful: you could accidentally load wisdom generated for the wrong architecture. If you know what you're doing you can avoid that problem and, say, share wisdom within a homogeneous cluster.)

If you do a lot of number crunching, it is recommended to ensure your hosts have wisdom - a bootstrapping set of plans for common sizes.

It's easy enough to just make your programs do planning so that you know a plan/wisdom always applies.

...but there still is value in creating some system wisdom using the fftw-wisdom command-line tools (fftw-wisdom, fftwf-wisdom, fftwl-wisdom; see their man pages).

The quick and dirty version is:

fftw-wisdom -v -c -o wisdom

...and put that wisdom file in /etc/fftw/wisdom. Note that PLANNING PROBLEM in its output is not an error; it actually means "I am currently planning for this particular problem case".

See also:

Linear algebra library notes


BLAS (Basic Linear Algebra Subprograms)

is a specification (not implementation) for numerical linear algebra - matrix and vector stuff
abstract, so allows optimization on specific platforms
...and BLAS-using code doesn't have to care which implementation ends up doing the calculations
(origins lie in a Fortran library)

There are quite a few implementations of BLAS.

Some of the better known implementations:

  • ATLAS
AT for 'Automatically Tuned'. Basically, it compiles many variants and finds which is fastest for the host it's compiling on
Doesn't make much sense as a binary package
the tuning makes it portable - a works-everywhere, reasonably-fast-everywhere implementation
apparently has improved recently, now closer to OpenBLAS/MKL
  • OpenBLAS
specifically tuned for a set of modern processors
(note: also covers the most common LAPACK calls(verify))
also quite portable -- sometimes easier to deal with than ATLAS
  • GotoBLAS
Specific to ~2002-2008 processors. Very good at the time, since merged into OpenBLAS?

Wider things:

  • MKL, Intel Math Kernel Library[1]
(covers BLAS, LAPACK, and some other things that are sometimes very convenient to have)
Best of this list on some operations when run on specific ranges of Intel CPUs. For other cases there is less difference.
  • ACML, AMD Core Math Library[2]
Comparable to MKL, but for AMD CPUs
and free
(apparently does not scale up to multicore as well as MKL?)

  • Apple Accelerate Framework
includes BLAS, LAPACK, and various other things
easy choice when programming only for Apple, because it's already installed(verify)


Functionally: LAPACK solves some more complex, higher-level problems than BLAS does.

LAPACK is also a fairly specific implementation, not an abstract spec.

It contains some rather clever algorithms (in some cases the only open-source implementation of said algorithm).

In other words, for some applications you are happy with just BLAS, in some cases you want to add LAPACK (or similar).

Speed-wise: LAPACK is a modern, cache-aware rewrite (it replaces LINPACK and EISPACK), so it is typically faster than those.


Pragmatically: FFTPACK is a slightly-slower and slightly-more-portable alternative to FFTW and others

As far as I can tell FFTPACK does not use AVX, which means that in some conditions (mostly larger transforms), FFTW (≥3.3), MKL, and such can be more than a little faster.

See also:

And perhaps:

On processor specialization

GPU, GPGPU, OpenCL, CUDA notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

You may also be looking for

Compute: OpenCL
Graphics: OpenGL · OpenVG
Sound: OpenAL · OpenSL



GPGPU or not?

Nvidia/CUDA or not?

OpenCL is an open standard, runs on hardware from many vendors, and is a shared project between various vendors.

CUDA is proprietary, and runs only on NVidia hardware.

CUDA used to be quite a bit faster, mostly because it came earlier and was already better optimised. There is now little difference.

That said, while AMD (through OpenCL) can give you more speed for your money in the mid-range, the fancy hardware is mostly nVidia.

This seems to be the reason various people are sticking with CUDA.

That and laziness.

That said, there is currently an active connection between nvidia (the company, also support-wise) and e.g. academic research (more than there is for AMD), which sometimes results in hardware tweaks.


Choices within Nvidia hardware

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

On microarchitectures:

  • Tesla (~2006)
note: no direct relation to the Tesla GPGPU brand
  • Fermi (~2009)
used in gaming, Quadro, and some Tesla cards
  • Kepler (~2012)
used in higher-end gaming, Quadro
  • Maxwell (~2014)
Used in GeForce 700 series, 800 series, 900 series, some Quadro
  • Pascal (~2016)
targets both processing and gaming
  • Volta (~2018)
  • Turing (~2019)
available both with tensor and ray tracing (20 series), and without (16 series)
  • Ampere (~2020)
  • Lovelace (~2022)

On series: (note: not clean splits between microarchitectures, but a decent indication):

100, 200, 300: Tesla
400 and 500: Fermi (mostly)
600: Fermi and Kepler
700: Kepler (mostly, a few Maxwell)
800: Maxwell
900: Maxwell, Pascal
10: Pascal
16: Turing
20: Turing
30: Ampere
40: Lovelace

Other terms:

  • Ti stuck on the end usually means "Same thing, more shaders", and regularly also more memory
  • Founders Edition basically means reference design
  • NVS basically means "prioritizing low-power, and good for a bunch of large monitors, but don't count on gaming" i.e. good for business use.

See also:

On single-precision versus double-precision floating point:

  • Tesla is made to be good at both, in that DP speed will be no worse than half of SP speed
  • Quadro also does double precision well
...though when you only use SP, a good GTX (Titan or no) is much cheaper for the same power

  • GeForce (GT, GTX) seems capped at DP. As in, the metal could probably do a little better, but the driver limits it (verify). This makes them relatively better at SP.
Titan notes:
GTX Titan (700-series, GK110, 2013), DP
GTX Titan Z (700-series, dual GK110, 2014), DP
GTX Titan X (900-series, GM200, 2015), SP (more performance per core, though a Z beats it overall)
GTX 1080 (first Pascal card, already challenging the Titans cost-efficiency-wise)
  • Grid K1, Grid K2 are essentially two or four Kepler cores on one card (~= GT 640)
these seem to be meant for GPU virtualisation (and heat efficiency?), rather than cost-efficient crunching

Some are sensible splits (e.g. semi-compute, semi-gamer would not serve either customer well), though there is also some crippling/marketing going on, which means there are some really weird differences. For example, some Quadros are much more expensive than equivalent-speed (and sometimes noticeably faster) GeForces.


  • Compute cards are typically better cooled and generally more reliable than cut-throat gaming designs, though not always by much
  • Bit-flips are more likely on gamer cards - in games you may not even see the errors, while for professional use you would care
The difference seems to be quite real, but not necessarily large.


  • Memory bandwidth varies, but not hugely
  • Tesla and Quadro cards tend to have more
whether it helps depends entirely on task (and coding)

CUDA notes

CUDA related errors

"‘memcpy’ was not declared in this scope", in the context of nvcc


/usr/include/string.h: In function 'void* __mempcpy_inline(void*, const void*, size_t)':
/usr/include/string.h:652:42: error: 'memcpy' was not declared in this scope
     return (char *) memcpy (__dest, __src, __n) + __n;

Seems to be due to a change in gcc 5.


  • Add -D_FORCE_INLINES to the nvcc compile flags. Details of doing so vary per build system.

See also:

CuFFT Invalid Plan

While this can be caused by coding bugs, the likelier reason is often that you ran out of GPU memory.

Compute capability

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

...is the feature set, which increased over time, and is primarily tied to microarchitecture development (and as such, compute relates to age much more than to price):

Tesla             1.x    (Tesla microarchitecture, not Tesla-brand cards)
Fermi             2.x
Kepler            3.x
Maxwell           5.x
Pascal            6.x
Volta, Turing     7.x
Ampere            8.x
Lovelace          8.9
Hopper            9.0

See also:


CULA is a set of GPU-accelerated linear algebra libraries

OpenCL notes


OpenCL stands for Open Computing Language.


Getting a graphics card

Setting up OpenCL



On CUDA, OpenCL, SYCL, SPIR, and others

GPU programming notes

Why does it need so much memory?

Broad comparison

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Keras notes

TensorFlow notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


darknet notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Deep learning




Theano notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


ImportError: cannot import name 'multi_gpu_model'

Keras version muck.

That function was deprecated, and removed in 2020[3].

Use tf.distribute.MirroredStrategy instead?

AttributeError: module 'keras' has no attribute 'api'

(no module named keras.api seems to be the same error?)

This may be a mismatch between tensorflow and keras version.

They should match and (I think) be new enough.

I didn't figure this one out, I just trial-and-errored it until it stopped whining.


pip3 show keras tensorflow

may be a good start to tell whether there's a version mismatch.


pip3 install keras==

...is basically intentionally getting the version wrong, so that pip tells you which versions exist

Another metric with the same name already exists.

That seems to mean a mismatch of versions between keras and tensorflow, because while they are more entangled now, they also live in separate packages, and dependency management is for people who don't want job security.

You may have more luck _downgrading_ tensorflow to get it to match.

ImportError: cannot import name 'LayerNormalization'

You're probably importing keras yourself.

Tensorflow doesn't like it when you do that.

Try to do it its way.

ModuleNotFoundError: No module named 'tensorflow.contrib'

Version muck.

Probably means the code was written for TF 1; tf.contrib doesn't exist in 2.

The solution is usually to do a web search for the specific import (minus the tensorflow.contrib part) to see where it lives in TF 2.

And then hope the API didn't change.

"This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations"

Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory


TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly