Intrinsics

This article/section is a stub — probably a pile of half-sorted notes, not well checked, so it may have incorrect bits. (Feel free to ignore, fix, or tell me.)

Intrinsics are pieces of code that a compiler handles as a special case, for optimization purposes.


Intrinsic functions

An intrinsic function is a function call that the compiler may replace with a platform-specific and/or inline implementation.


For example, the C standard library has strcpy() and memset(): both very simple, very well defined, and very commonly called.

Without intrinsics, these would be regular function calls into libc.

Treated as intrinsics, the compiler can inline them (saving the stack pushes and jumps implied by a call), and can often substitute an implementation that is a little more efficient on the CPU you are compiling for.
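
As a concrete sketch (assuming GCC or Clang): with optimization on, the memset() below is typically expanded to a few inline stores rather than compiled as a call into libc; the -fno-builtin flag disables the intrinsic treatment, so you can compare the generated code.

    #include <string.h>

    /* At -O2 most compilers treat memset() as an intrinsic and expand
       this call inline; compile with -fno-builtin to force the plain
       libc call and compare the assembly. */
    void clear(char *buf, size_t n) {
        memset(buf, 0, n);
    }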


Intrinsic functions can also refer to:

  • a programmer trying their best to get a specific opcode emitted, e.g. NOPs to implement tiny wait times
  • using functions that are somewhat hardware-specific, like __enable_interrupt(), or using such a common name instead of the actual instructions behind it
  • using instructions the compiler by default wouldn't generate, such as SIMD instructions from the MMX, SSE, FMA, or AVX extensions (see the sketch after this list)
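
As a sketch of that last point, explicit SSE intrinsics in C look like this (GCC, Clang, and MSVC all provide xmmintrin.h; the function name add4 is just illustrative):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Adds four floats with one SSE instruction (addps). Without
       intrinsics or auto-vectorization you would get four scalar adds. */
    void add4(float *dst, const float *a, const float *b) {
        __m128 va = _mm_loadu_ps(a);              /* load 4 floats, unaligned */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(dst, _mm_add_ps(va, vb));   /* add lanewise, store */
    }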


SIMD and such

Most general-purpose languages describe number crunching imperatively rather than declaratively, which means compilers can't easily or conclusively analyse when SIMD applies, when it is faster, or when the data reordering it requires would help or even hinder speed.


One way to deal with this is to work almost the other way around: you use a library that forces you to write the code as SIMD (library functions aimed at MMX, SSE, FMA, AVX), but that does not commit to a particular implementation yet.

The compiler can then decide, based on the target CPU selected at compile time, whether to compile that as SIMD opcodes, or as generic loops that you probably would've written yourself (in theory you could also compile both and decide at runtime).
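
A minimal sketch of that compile-time decision, assuming GCC or Clang: the same function compiles to AVX opcodes when the target supports them (e.g. built with -mavx) and to a plain loop otherwise. The function name scale is illustrative.

    #include <stddef.h>
    #ifdef __AVX__
    #include <immintrin.h>   /* AVX intrinsics */
    #endif

    void scale(float *x, float s, size_t n) {
    #ifdef __AVX__
        __m256 vs = _mm256_set1_ps(s);            /* broadcast s to 8 lanes */
        size_t i = 0;
        for (; i + 8 <= n; i += 8)                /* 8 floats per AVX register */
            _mm256_storeu_ps(x + i,
                _mm256_mul_ps(_mm256_loadu_ps(x + i), vs));
        for (; i < n; i++)                        /* leftover elements */
            x[i] *= s;
    #else
        for (size_t i = 0; i < n; i++)            /* generic loop elsewhere */
            x[i] *= s;
    #endif
    }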


(Similar compile-time tricks are relevant to other optimisations, e.g. FFTW and BLAS, which as APIs present FFT and linear algebra to you, but carry a bunch of different implementations and run the one that seems to run fastest on your machine. All of this can potentially apply to static compilation, JIT / AoT compilation, and also, significantly, to vector processors, multiprocessing platforms, etc.)
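
And a minimal sketch of the runtime variant, assuming GCC or Clang on x86: __builtin_cpu_supports() checks the CPU the program actually runs on, so one binary can carry both paths. scale_avx() and scale_generic() are hypothetical variants you would supply, e.g. compiled in separate files with different -m flags.

    #include <stddef.h>

    void scale_avx(float *x, float s, size_t n);      /* hypothetical: built with -mavx */
    void scale_generic(float *x, float s, size_t n);  /* hypothetical: plain loop */

    void scale_dispatch(float *x, float s, size_t n) {
        if (__builtin_cpu_supports("avx"))   /* GCC/Clang builtin, x86 only */
            scale_avx(x, s, n);
        else
            scale_generic(x, s, n);
    }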