Notes on numbers in computers

From Helpful
Jump to: navigation, search
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Floating point numbers

Floating point numbers are approximations of real numbers.

Intuitively, they are similar to scientific notation -- but in base 2 instead of base 10.

Take 20400. Scientific notation would as something like 2.04 * 104.

Floating point numbers store that same number as something like 1.2451172 * 214

It uses base-2 largely because it's slightly more efficient use of the same amount of silicon.

(Floating point can also be contrasted with fixed point numbers, which were approximations of stored within integers (before floating point operations were ubiquitous in processors) and which are somewhat simpler to understand, and sometimes an exercise given in computer science, but are more bothersome to deal with and (mostly) slower than an FPU, which is why they fell out of style except in CPUs without FPUs - mostly embedded.)

On the limited precision

You should never assume floating point numbers or calculations are fully precise.

Assume that (counting from the largest digit) numbers will be stored 32-bit float is precise up to the first six digits and 64-bit precise up to the first sixteen or so.

This inaccuracy is close enough for a lot of practice, yet in some cases you want to consider it in designs.

...but that's not the full story.

Some will be more precise, a few exact.

More importantly few digits in decimal representation doesn't mean few digits in float representation.

For example:

  • 0.5 is stored as precisely that in a float
  • 0.1 and 0.2 are not stored precisely that in a float
  • while 20400.75 is stored precisely, 20400.8 is not
  • with 32-bit floats, all integers up to 16777215 are exact

Operations make things a little more interesting.

Storing numbers less-than-fully-precisely breaks some properties you initially don't expect, such as commutativity. For example

0.1+(0.2+0.3) != (0.1+0.2)+0.3. 

Sometimes more pressingly,

0.3-(0.2+0.1) != 0.0 you often want zero tests to actually be something like

Combining numbers of vastly different scales (with any operation) is an issue.

Consider again scientific notation, e.g. adding 1E2 + 1E5 Assuming you need to work within this representation you would probably solve this by considering that 1E2 is equal to 0.001E5, so now you can do 0.001E5 + 1E5 = 1.001E5

When you're scaling up the smaller number's exponent to fit the larger number in floating point, though, you have the issue that the number before the E has limited precision. The larger the magnitude change, the more that that this scaling requires you to lose digits.

If there is a large enough scale difference, the smaller number falls away completely. Consider that 1.00000001E5, when you can only only a few digits, would become just 1E5.

And this scale stuff introduces more interesting cases. Remember a bit above where

0.1+(0.2+0.3) != (0.1+0.2)+0.3


0.1+(0.2+100) == (0.1+100)+0.2 

which you probably didn't expect, and have to stare at for a while to figure out why.

Another implied detail is that if you keep on updating a number with operations, you should expect rounding errors to happen each calculation, and accumulate. For example, if you do

view = rotate(current_view, 10 degrees)

36 times you will not be precisely back where you started.

In games that may not matter. But in other cases it does, and you want to know the alleviations.

Some mathematical operations do not map easily to floating point, and when specs don't strictly define required accuracy, there may be a possible tradeoffs that implementations do between faster and more precise. One of the better examples being exp, expf. This is one of a few reasons different libraries and different hardware should not be expected to all give bit-identical results.

In particular GPUs sometimes play a little loose with floating point specs (more so in the past).

It's usually well quantified how large the error may be, e.g. via ulp.

It's often easy to throw 64-bit at a problem, but at the cost of speed.

Note that most x86 CPUs do most calculations in 80-bit. Intuitively, this because (the internal steps within what looks like) a single FPU operation imply losing a few digits, so if you work with more, then the result is actually accurate to 64 bit. (It's overkill for 32-bit, but doesn't hurt).

It does not mean subsequent operations get near-80-bit precision, because the intermediate 'fetch this out to memory' is typically 64-bit. Even if your language exposes 80-bit floats in their typing system (e.g. C's long double), and even if it can loads/saves them into the FPU registers, there are various footnotes to this.

Note also that things like SSE does not do this 80-bit stuff.[1].

This also means that some speed optimizations alter the precision of the overall calculation, so again, not all calculation is bit-identical (between compilations of the same code, in this case).

Note that because certain combinations of operations are common (like multiply and add), there may be opcodes that do the two in one, so that you avoid some rounding in the middle, for lower overall rounding error.

For this example, see Fused-Multiply-and-Add.

It gives slightly more precise results, although the floating point spec considers the separated one correct as well.

Bit division and layout

IEEE 754 floats consist divide their bits into sign, mantissa, and exponent.

The standard float types in most programming languages will be (IEEE 754) 32-bit and/or 64-bit floats because most any processor can can handle those natively and therefore quickly.

In a (IEEE 754) 32-bit float, there is 1 bit for the sign, 23 for the mantissa and 8 for the exponent. The bit layout:


For (IEEE 754) 64-bit 'double-precision' floats (a.k.a. doubles), you use 1 bit for base, 52 for the mantissa and 11 for the exponent. The bit layout:


There are other-sized variations (80-bit extended precision, 128-bit quadruple, 16-bit half, and also 36-bit, 48-bit, and 60-bit, and I've heard mention of 120-bit).

You can always emulate lower-precision floats, e.g. use 80-bit to emulate 64-bit, by throwing away the excess precision only when the result is fetched out.

Some calculations show lower error when done this way, processors may do this basically always (apparently all x86 since the 8087), meaning some operations may be slightly more accurate than you expected. (your language may allow some control over this[2])

(This is also one of a few reasons you should not assume that floating point calculations are bit-identical between computers, compilers, etc.)

Representation details

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

You may want to read this thinking about how this affects inaccuracy, rather than trying to learn all the details. For in-depth details, there are good resources out there already. This is intended as a summary.

The value a float represents is calculated like:

sign * (1+mantissa) * 2exponent

Sign is coded as 1 meaning negative, 0 meaning positive, though in the formula above, think of it as -1 and 1.

Exponent is an integer, in 32-bit floats representing values within -126..127 range, zeroed around value 127

  • so e.g.
coded value 1 represents -126
coded value 127 represents 0
coded value 254 represents 127
  • values 0 and 255 (which would represent -127 and 128) are used for special cases:
    • Zero (which can't be represented directly because of the 1+ in the formula) is represented by exponent=00000000 and mantissa=0
    • Infinity is represented by exponent=11111111 and mantissa=0.
    • Note that the last two don't mention sign -- both infinity and zero can be positive and negative
    • NaN ('Not a Number') is represented by exponent=11111111 and non-zero mantissa
      • technically, there is a further distinction into quiet NaNs for indeterminate operations, and signaling NaNs for invalid operations

Mantissa bits represent 2-1 (=0.5), 2-2 (=0.25), and so on. For a 32-bit float there are 23 mantissa bits, so before the exponent gets involved they represent:


For 64-bit floats there are another 29 of these.

So yes, the mantissa part is representing fractions. some explanations introduce the whole thing this way.

Example from IEEE bits to represented number:

  • 1 01111111 10000000000000000000000
  • Sign is 1, representing -1
  • exponent is 127 (decimal int), representing 0
  • mantissa is 10000000000000000000000, representing 0.5

Filling in the values: (-1*(1+0.5)*20), is -1.5

Example from number to IEEE bits

Encoding a number into float form can be intuitively understood as:

  1. start with base-2 scientific notation (with exponent 0)
  2. incrementing the exponent while halving the part that goes into the mantissa (which keep the represented number the same to within error).
  3. repeat step 2 until value-that-goes-into-the-mantissa is between 0 and 1
(Note that mantissa values between 0 and 1 represents 1+that, so in the examples below we divide until it's between 1 and 2)

For example: 7.0

  • is 7.0*20
  • is 3.5*21
  • is 1.75*22 (mantissa stores (that-1), 0.75, so we're now done with the shifting)

We have sign=positive, exponent=2, and mantissa=0.75. In bit coding:

  • sign bit: 0
  • exponent: we want represent 2, so store 129, i.e. 10000001
  • mantissa: we want to represent 0.75, which is 2-1+2-2, (0.5+0.25) so 11000000000000000000000

So 7.0 is 0 10000001 11000000000000000000000

The 20400 mentioned above works out as 1.2451171875*214.

  • Sign: 0
  • Exponent: 10001101 (141, representing 14)
  • Mantissa: 00111110110000000000000
Using the table above, that mantissa represents 0.125+0.0625+0.03125+0.015625+0.0078125+0.001953125+0.0009765625, which equals 0.2451171875.

So can you tell whether floating point is approximating, and if so, how?

Not easily.

As [3] mentions, the reason for inaccuracy is akin to the reason we can't write 1/3 in decimal. While 0.333333 is pretty close, stop at any finite number of threes and it's inaccurate. In floats it's anything you cannot express as a base 2 fraction, after the exponent logic rescaled the mantissa to the 1..2 scale it needs to be in.

Some things can trade between mantissa and exponent without loss, which is much of the reason e.g. integers up a point are accurate. And fractions tend to do decently up to that order for a similar reason, but this intuition is iffy, and falls apart.

You may like tools like for further inspection.

See also:

Accuracy issues

Denormal numbers

Denormal numbers, a.k.a. subnormal numbers, are ones that are so small (near zero) that the exponent has already bottomed out, so we have to have leading zeroes in the mantissa (we normally avoid losing this precision, by choosing the exponent so that the mantissa is 0..1).

This implies you have fewer bits of precision than you would normally have.

In many cases these are so close to zero you can just treat them as zero, for practical reasons, and sometimes for performance reasons as well.

See also:

Significance loss

Storing integers

A good range of integers can be stored exactly, namely those for which the binary representation needs at most the amount of bits the mantissa has (plus one implicit bit).

32-bit floats have 23 bits for the mantissa, so can store integers up to 223+1-1, i.e. up to 16777215 (and down to negative that, because the sign bit is a separate thing)

64-bit floats have 52 mantissa bits, so it can store -9007199254740991..9007199254740991.

(technically you can store up to 2mantissabits+1 rather than mantissabits+1-1, but languages tend to define safe values as those for which n and n+1 are exactly representable, or perhaps more to the point, as those that are not also approximations for other numbers. So it ends one lower.)

After this point it will only be able to store every second, then every fourth integer, etc. (and before that you could also store halves, before that quarter)

If you've ever played with fixed-point integers (using integers to imitate floats) this may be somewhat intuitive.

Floating point compression


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Older GPUs were more specialized devices, and deviated from standard IEEE, see e.g. [4].

Time-wise this is roughly pre-CUDA, and CUDA 1.x still had some issues (e.g. dealing with denormal numbers).

Modern GPUs (since Compute Capability 2) provide something close to IEEE-754 floating point.

But there are still some footnotes that mean they are not equivalent to FPUs.

Also, while FPUs in the x86 line do calculations in 80-bit (basically since always), on GPU 32-bit is actually 32-bit, and 64-bit is actually 64-bit (closer to e.g. SSE), meaning they can collect errors faster. In some case you can easily work around that, in some cases it's a limiting factor.

You may be able to make GPUs use float16, though don't expect ease of use, and while faster probably won't be twice the speed.

Speed of 16, 32, and 64-bit FP in GPU isn't a direct halving/doubling, because of various implementation details, some of which also vary between architectures. But also on that most cards focus on SP - because they're assigning more die to things useful in gaming, for which SP is enough.

When lower precision is okay, even integer / fixed-point may help speed on GPU (verify)

float64 is less of a performance hit on most FPUs(verify) than GPUs.

If it matters, you may want to check every GPU (and every GPU/driver change)

On repeatability

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Differences between CPU and GPU results are expected

See above for the reasons.

The operations will typically still be within IEEE 754 specs, which is not as strict as you may think for many operations.

Differences between different GPUs and drivers will happen

The precision for an operation in a range is usually well characterizable, and typically within IEEE spec, but may differ between hardware and with different drivers.

Also, the GPU is effectively a JIT optimizer, so it might rearrange operations with minor side effects(verify)

See e.g.

Some code is not deterministic for speed reasons

e.g CUDA atomics: the order in which concurrent atomic updates are performed is not defined, so this can have rounding/associativity-related side effects.

You can avoid this with different code. But in some applications you may want the speed increase (see e.g. tensorflow).

Differences between runs on the same GPU+driver should not happen

...yet sometimes do.

It seems that GPUs pushed too hard make mistakes (presumably in memory more than in calculation?). You wouldn't notice this in gaming, or necessarily in NN stuff, but you would care in scientific calculation.

Sometimes this can be fixed with drivers that use the hardware a little more carefully.

Sometimes it's a risk-speed tradeoff, one you can tweak in settings, and one that may change with hardware age.

On FPU-less CPUs

See also

Interesting note relevant to audio coding:

Relevant standards:

  • IEEE 754 [5] for a long time referred specifically to IEEE 754-1985, The "IEEE Standard for Binary Floating-Point Arithmetic" and is the most common reference.
  • recently, IEEE 754-2008 was published, which is mostly just the combination of IEEE 754-1985 and IEEE 854 (radix-independent). Before it was released, it was known as IEEE 754r[6], ('revision')
  • IEEE 854, specifically IEEE 854-1987, is a radix-independent variation [7]
  • IEC 60559:1989, "Binary floating-point arithmetic for microprocessor systems", is the same as 754-1985


There are other, more specific floating-point implementations


Signed and unsigned integers

An unsigned integer is one that can store only positive numbers.

Usually chosen when you don't want to lose part of the usable range to (negative) values that you won't ever use anyway.

A signed integers can store negative values.

There are a few different ways of representing such numbers, and each implies how working with these values works.

The most commonly used is two's complement, partly because they are some of the easiest and most efficient to implement in hardware - most operations for unsigned numbers actually do the correct thing on two's-complement signed as well. (Not true for one's complement)

Arbitrary-precision ints

Fixed point numbers


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Fixed point numbers are, in most implementations, a library that abuses integers to represent more fractional numbers.

This may be easiest to explain via an example.

Say you have an entirely regular 8-bit int. That means:

  • bit 1 represents 1
  • bit 2 represents 2
  • bit 3 represents 4
  • Bit 4 represents 8
  • Bit 5 represents 16
  • Bit 6 represents 32
  • Bit 7 represents 64
  • Bit 8 represents 128
  • ...and so on.

In which case:

  • 00000100 represents 4
  • 00000111 represents 7 (4 + 2 + 1)

Say that you decide instead to do:

  • bit 1 represents 1/4
  • bit 2 represents 1/2
  • bit 3 represents 1
  • Bit 4 represents 2
  • Bit 5 represents 4
  • Bit 6 represents 8
  • Bit 7 represents 16
  • Bit 8 represents 32
  • ...etc

In which case:

  • 000001 00 represents 1.00
  • 000001 11 represents 1.75 (4 + 1/2 + 1/4)

It's called "fixed point" because (as the spacing there suggests), you just pretend that you shifted the decimal point two bits to the left, and sits there always.

Put another way, you're still counting, but you decide you're counting units of 1/4.

Which also means various operations on these numbers make sense as-is.

Some other operations not so much, so you do need some extra bookkeeping, and often want to detect bordercases like overflow (stuff which in floating point are handled by the specs / hardware flags) - see e.g. saturation.

So in the end this isn't the best method, or the fastest.


More detail

See also


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Binary Coded Digits are a way of coding integers that is different from your typical two's complement integers.

BCD codes each digits in four bits. For example, 12 would be 0001 0010

One side effect that the hexadecimal representation of the coded value will look like the number: that's 0x12.

However, you don't use all your bits, some operations are harder, and in a processor would require a little more silicon. So why would you do this?

The main benefit is easy conversion to decimal, and it makes sense when displaying a limited amount of numbers - clocks, calculators, 7-segment drivers, particularly doing so without a processor - because this is fairly easy to express in simple digital electronics.

It is also seen where you want simple electronics, and/or want to avoid floating-point errors (you may see fixed point in the same place).

Pocket calculators will tend to show BCD -- and also do their operations in BCD.

On rounding

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Word, octet, nibble


In computer hardware and programming, a word is the fundamental storage size in an architecture, reflected by the processor and usually also the main bus.

A 16-bit computer has 16-bit words, a 64-bit computer has 64-bit words, etc. On all but ancient computers, they're multiples of eight.


  • In programming, 'word' often means 'an architecture-sized integer.'
  • Some use it in the implicit context of the architecture they are used to, such as indicating 16 bits in the days of 16-bit CPUs. In the context of 32-bit processors, a halfword is 16 bits, a doubleword 64 bits, and a quadword 128 bits. These terms may stick for a while even in 64-bit processor era.
  • 'Long word' can also be used, but is ambiguous

Basically, this is a confusing practice.

This refers specifically to integer sizes because floating point numbers are available in a few standardized sizes (usually IEEE 754).

In languages like C, int carries the meaning of a word; it may differ in size depending on what architecture it is compiled for. Some coders don't use specific-sized integers when they should, which is quite sloppy and can lead to bugs - or be perfectly fine when it just needs to store some not-too-large numbers. It may interfere with signed integers (Two's complement negative numbers), particularly if you count on certain behaviour, such as that -1 would be equal to 0xffffffff, which is only true for 32-bit signed ints, not for signed ints in general.


Fancy word for a byte - eight bits large.

Seen in some standard definitions, largely because some older (mostly ancient and/or unusual) computers used sizes more creatively, which also implied that 'byte' sometimes meant sizes other than 8 bits, and because 'byte' carries more of a connotation of being already binary coded - and octet more that of just the concept of grouping eight bits as a unit.

(For more details, see foldoc on the byte)

A few countries, like france, use megaoctet / Mo instead of the megabyte. (Though for french it seems this is also because it sounds potentially rude - though this applies to bit, not byte[8])


(Also seen spelled nybble, even nyble)

Half a byte: four bits.

Most commonly used in code comments to describe the fact you are storing two things in a byte, in two logical parts, the high and low nibble.

Few architectures or languages have operations at nibble level, probably largely because bitmask operations are simple enough to do, and more flexible.

Has seen other definitions in ancient/strange computers.