Notes on numbers in computers

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Floating point numbers

Floating point numbers store approximations of real numbers.

Things like rational, irrational, and transcendental numbers are hard for computers to store and work with exactly, let alone efficiently.

When exactness is necessary, as in e.g. computer algebra and symbolic math, such software has its own solutions, and accepts that these are sometimes slow.

But for a lot of things, you don't need all that precision. Floating point is a pragmatic tradeoff: limited but fairly predictable precision, at consistent speed. That speed comes in part from dedicated CPU silicon; simulating floating point calculations in software is factors slower (so in cases where you have no floating point hardware, e.g. simple microcontrollers, there are sometimes faster integer-based solutions, though the precision difference isn't trivial to quantify).

From a practical view, most math involving decimal points, fractions, very large numbers, very small numbers, or transcendental numbers like pi or e, is precise enough to do in floating point, particularly if you know its limitations.

Within computers, floating point numbers are mainly contrasted with

  • fixed-point numbers, which could be seen as a makeshift precursor to floating point (and are still useful on platforms without floating point hardware)
  • arbitrary-precision arithmetic
    • e.g. BigInt

Intuitively, floating point can be understood as an idea similar to scientific notation.

Consider 20400.

In scientific notation you might write that as 2.04 * 10^4.

Floating point would actually store that same number as something like 1.2451172 * 2^14. (Exactly why and how isn't hugely important, e.g. its use of base-2 is mostly details about slightly more efficient use of the same amount of silicon.)
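If you want to see that on an actual machine, here is a minimal sketch in Python (whose floats are IEEE 754 64-bit); float.hex() shows the mantissa in hexadecimal and the base-2 exponent directly:

# Python's float is an IEEE 754 double; hex() shows mantissa and power-of-two exponent
x = 20400.0
print(x.hex())                 # '0x1.3ec0000000000p+14'
print(1 + 0x3ec / 0x1000)      # 1.2451171875, i.e. the 1.2451172 * 2^14 mentioned above
print(1.2451171875 * 2**14)    # 20400.0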

Bit division and layout

To give some idea how it's stored: IEEE 754 floats divide their allotted bits into sign, mantissa, and exponent.

The standard float types in most programming languages will be IEEE 754 32-bit and/or IEEE 754 64-bit, because most CPUs can handle those natively and therefore quickly.

An IEEE 754 32-bit float uses 1 bit for the sign, 23 for the mantissa, and 8 for the exponent.


An IEEE 754 64-bit float (a.k.a. 'double-precision', or doubles) uses 1 bit for the sign, 52 for the mantissa, and 11 for the exponent.


There are other-sized variations, like 80-bit extended precision, 128-bit quadruple, 256-bit octuple, 16-bit half, and also 36-bit, 48-bit, and 60-bit, and I've heard mention of 120-bit and one or two others.

You can always emulate lower-precision floats, e.g. use 80-bit to emulate 64-bit, by throwing away the excess precision only when the result is fetched out.


  • Apparently all x86 FPUs since the 8087 are internally 80-bit, making some operations slightly more accurate than you might expect - but this is done out of necessity, because some operations are by nature less precise than the precision their intermediates are stored in.
  • This is also a reason you should not assume that floating point calculations are bit-identical between computers, compilers, etc.
  • Your programming language may allow some use or control over extended precision[1], but this is a topic with footnotes of its own.

Representation details

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

You may want to read the following thinking only about how this affects inaccuracy - trying to learn the details is probably not very useful for much else. For in-depth details, there are good resources out there already; this is intended as a summary.

The value a float represents is calculated something like:

sign * (1+mantissa) * 2^exponent

Sign is coded as 1 meaning negative, 0 meaning positive, though in the formula above, think of it as -1 and 1.

Exponent is an integer which (in 32-bit floats) represents values within the -126..127 range

  • so e.g.
    • coded value 1 represents -126
    • coded value 127 represents 0
    • coded value 254 represents 127
  • values 0 and 255 (which would represent -127 and 128) are used for special cases:
    • Zero (which can't be represented directly because of the 1+ in the formula) is represented by exponent=00000000 and mantissa=0
    • Infinity is represented by exponent=11111111 and mantissa=0.
    • Note that the last two don't mention sign -- both infinity and zero can be positive and negative
    • NaN ('Not a Number') is represented by exponent=11111111 and non-zero mantissa
      • there is a further distinction into quiet NaNs for indeterminate operations, and signaling NaNs for invalid operations, but you often wouldn't care

Mantissa bits represent 2^-1 (=0.5), 2^-2 (=0.25), and so on. For a 32-bit float there are 23 mantissa bits, so before the exponent gets involved they represent:

  bit  1: 2^-1  = 0.5
  bit  2: 2^-2  = 0.25
  bit  3: 2^-3  = 0.125
  bit  4: 2^-4  = 0.0625
  bit  5: 2^-5  = 0.03125
  bit  6: 2^-6  = 0.015625
  bit  7: 2^-7  = 0.0078125
  bit  8: 2^-8  = 0.00390625
  bit  9: 2^-9  = 0.001953125
  bit 10: 2^-10 = 0.0009765625
  ...
  bit 23: 2^-23 = 0.00000011920928955078125

For 64-bit floats there are another 29 of these.

So yes, the mantissa part is representing fractions - some explanations introduce the whole thing this way - but it is only directly understandable that way when you ignore the exponent.

Example from IEEE bits to represented number:

  • 1 01111111 10000000000000000000000
  • Sign is 1, representing -1
  • exponent is 127 (decimal int), representing 0
  • mantissa is 10000000000000000000000, representing 0.5

Filling the values into (sign * (1+mantissa) * 2^exponent) makes for (-1 * (1+0.5) * 2^0), which is -1.5
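A minimal check of that worked example in Python, assuming the standard library's struct module (the bit pattern 1 01111111 10000000000000000000000 written as hex is 0xBFC00000):

import struct

# the 32 bits above, written as 8 hex digits
bits = bytes.fromhex('bfc00000')
value = struct.unpack('>f', bits)[0]   # interpret those bytes as a big-endian float32
print(value)                           # -1.5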

Example from number to IEEE bits

Encoding a number into float form can be understood as:

  1. start with base-2 scientific notation (with exponent 0)
  2. increment the exponent while halving the part that goes in front of it (which keeps the represented number the same, to within error)
  3. repeat step 2 until that front part is between 1 and 2
(The stored mantissa is that front part minus 1 - because of the 1+ in the formula - so it ends up between 0 and 1.)

For example: 7.0

  • is 7.0 * 2^0
  • is 3.5 * 2^1
  • is 1.75 * 2^2 (mantissa stores (that-1), i.e. 0.75, so we're now done with the shifting)

We have sign=positive, exponent=2, and mantissa=0.75. In bit coding:

  • sign bit: 0
  • exponent: we want to represent 2, so store 129 (2 plus the 127 bias), i.e. 10000001
  • mantissa: we want to represent 0.75, which is 2^-1 + 2^-2 (0.5+0.25), so 11000000000000000000000

So 7.0 is 0 10000001 11000000000000000000000

The 20400 mentioned above works out as 1.2451171875 * 2^14.

  • Sign: 0
  • Exponent: 10001101 (141, representing 14)
  • Mantissa: 00111110110000000000000
Using the table above, that mantissa represents 0.125+0.0625+0.03125+0.015625+0.0078125+0.001953125+0.0009765625 (which equals 0.2451171875)
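To check these worked examples against what actually gets stored, here is a small Python sketch (assuming the standard struct module; float32_fields is just an illustrative helper name) that pulls the sign, exponent, and mantissa fields out of a 32-bit float:

import struct

def float32_fields(x):
    # pack as a big-endian float32, then split the raw bits into the three fields
    raw = struct.unpack('>I', struct.pack('>f', x))[0]
    sign     = raw >> 31
    exponent = (raw >> 23) & 0xFF
    mantissa = raw & 0x7FFFFF
    return sign, exponent, mantissa

print(float32_fields(7.0))      # (0, 129, 6291456)   i.e. exponent 2, mantissa 0.75 * 2^23
print(float32_fields(20400.0))  # (0, 141, 2056192)   i.e. exponent 14, mantissa 0.2451171875 * 2^23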

If you want to play with this, see e.g.

See also:

On the limited precision

Never assume floating point numbers or calculations are fully precise

As [2] mentions, one reason for inaccuracy is akin to the reason we can't write 1/3 in decimal either.

Writing 0.333333333 is imprecise no matter how many threes you add.

But at the same time, after some amount it's good enough for almost all uses.

Floating point is a similar tradeoff: if you accept our 'good enough', we can give you something that is predictably fast, and much faster than doing it fully correctly.

The (in)accuracy of 32-bit floats is good enough for a lot of practical use, and where that's cutting it close it's often easy to throw 64-bit at the problem to push down inaccuracies at the cost of some speed.

Introductions often mention that (counting from the largest digit)

  • 32-bit floats are precise up to the first six digits or so
  • 64-bit floats are precise up to the first sixteen digits or so

There are some practical footnotes to that, but it's broadly right.

That "stop anywhere and it's inexact" example isn't the best intuition, though, because some numbers will be less imprecise than others, and a few will be exact.

Consider that:

  • 0.5 is stored precisely in a float
  • 0.1 and 0.2 are not stored precisely
  • while 20400.75 is stored precisely, 20400.8 is not
note that needing few digits in the decimal representation is unrelated to needing few digits in the float representation
  • with 32-bit floats, all integers within -16777215..16777215 are exact
(and with 64-bit floats, all integers in -9007199254740991..9007199254740991)

Even if you say 'fair enough' about that, operations make things worse, intuitively because each result may have to become whatever representable number is nearest.

Okay, but you may not have considered that this breaks some properties you would initially expect to hold, such as associativity.


0.1+(0.2+0.3) != (0.1+0.2)+0.3

and, sometimes more pressingly,

0.3-(0.2+0.1) != 0.0

This is in part because what we write down precisely in base 10 isn't necessarily precise in float representation. And then, operations can mess with that further, as each individual result may have this issue - more on that below.
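A quick Python demonstration of the above (64-bit floats); printing the results makes the difference visible:

print(0.1 + (0.2 + 0.3))       # 0.6
print((0.1 + 0.2) + 0.3)       # 0.6000000000000001
print(0.1 + (0.2 + 0.3) == (0.1 + 0.2) + 0.3)   # False
print(0.3 - (0.2 + 0.1))       # -5.551115123125783e-17, not 0.0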

There frankly isn't a good mental model of how and when these inaccuracies happen.

You're best off assuming everything is always off within rounding error.

One basic implication is that tests for zero should often instead be tests for 'close enough to zero'.
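For example, a minimal sketch in Python (the 1e-9 threshold is an arbitrary illustration; a sensible value depends on the magnitudes and the accumulated error in your particular calculation):

def near_zero(x, epsilon=1e-9):
    # instead of x == 0.0, test whether x is within some tolerance of zero
    return abs(x) < epsilon

print(0.3 - (0.2 + 0.1) == 0.0)       # False
print(near_zero(0.3 - (0.2 + 0.1)))   # True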

You should also assume these errors may accumulate over time, meaning there is no perfect choice for that 'almost zero' value.

Operations make things worse, part 2.

Combining numbers of (vastly) different scales creates another issue.

For an intuition why, consider scientific notation again, e.g. adding 1E2 + 1E5

If you stick to this notation, you would probably do that by considering that 1E2 is equal to 0.001E5, so now you can do 0.001E5 + 1E5 = 1.001E5

That same "scaling up the smaller number's exponent to fit the larger number" in float means increasing the exponent while dividing the mantissa.

The larger the magnitude change, the more this scaling requires you to lose digits (and that's ignoring the rounding errors).

If there is a large enough scale difference, the smaller number falls away completely. Say, 1 + 1E-5 is 1.00001, but 1 + 1E-23 is 1.0 even in 64-bit floats.

It's not that it can't store numbers that small, it's that the precision limit means you can't combine numbers of a certain magnitude difference.
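A quick Python illustration of both points (64-bit floats):

print(1 + 1e-5)            # 1.00001
print(1 + 1e-23)           # 1.0 - the small term fell away completely
print(1e-23)               # 1e-23 - yet the small number itself is stored fine on its own
print(1 + 1e-23 == 1.0)    # True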

Operations make things worse, part 3.

Scale differences introduce more interesting cases.

Remember a bit above where

0.1+(0.2+0.3) != (0.1+0.2)+0.3


while at the same time

0.1+(0.2+100) == (0.1+100)+0.2 

You may not expect that, and it takes more than a little staring to figure out why.
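Checking both claims in Python (64-bit floats):

print(0.1 + (0.2 + 100) == (0.1 + 100) + 0.2)   # True
print(0.1 + (0.2 + 0.3) == (0.1 + 0.2) + 0.3)   # False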

Operations make things worse, part 4.

Rounding errors accumulate.

If you keep on updating a number with operations, you should expect rounding errors in each calculation, and therefore accumulation. For example, if you do

view = rotate(current_view, 10 degrees)

36 times you will not be precisely back where you started. Do that another couple thousand times, with weird angles, and who knows where you are exactly?
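A minimal sketch of that in Python, rotating a 2D point by 10 degrees 36 times (this rotate is just an illustrative stand-in for whatever your graphics code does):

import math

def rotate(point, degrees):
    # standard 2D rotation; the sin/cos results and the multiplications all get rounded
    x, y = point
    a = math.radians(degrees)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

view = (1.0, 0.0)
for _ in range(36):
    view = rotate(view, 10)

print(view)   # something like (0.9999999999999993, -2.8e-16), not exactly (1.0, 0.0)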

In games that may not matter. Usually. Except when it does.

In applications it may matter more to start with, and you want to know how to work around this.

Operations make things worse, part 5.

Some mathematical operations do not map easily to floating point, and when specs don't strictly define the required accuracy, there are tradeoffs that implementations may be making, trading a little precision for more speed.

One of the better known examples is exp / expf (exponentiation is the typical example). This is one of a few reasons different libraries and different hardware should not be expected to all give bit-identical results.

Note that x87 (x86 FPU) operations do most calculations in 80-bit internally, and have for a long time(verify). Intuitively this is because the internal steps within what looks like a single FPU operation imply losing a few digits, so if you work with more bits, the result is still accurate to 64 bits. (It's overkill for 32-bit, but doesn't hurt.)


  • It does not mean subsequent operations get near-80-bit precision, because the intermediate 'fetch this out to memory' is typically 64-bit.
Even if your language exposes 80-bit floats in its type system (e.g. C's long double), and even if it can load/save them into the FPU registers, there are various footnotes to this.
  • SIMD units like SSE do not do this 80-bit stuff[3].

  • GPUs also do not.
Actually, GPUs sometimes play loose with floating point specs in other ways, and have done so more in the past

It's usually well quantified how large the error may be, e.g. expressed in ULPs (units in the last place).
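Python can show the ULP around a given value directly (math.ulp, available since Python 3.9):

import math

print(math.ulp(1.0))        # 2.220446049250313e-16 - the step to the next 64-bit float above 1.0
print(math.ulp(1000000.0))  # 1.1641532182693481e-10 - larger values have larger steps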

Denormal numbers

Denormal numbers, a.k.a. subnormal numbers, are ones so small (so near zero) that the exponent has already bottomed out, so the mantissa has to have leading zero bits (for larger numbers we avoid losing this precision by choosing the exponent so that the stored mantissa fraction stays in 0..1).

This implies you have fewer bits of precision than you would normally have.

In many cases these are so close to zero you can just treat them as zero, for practical reasons, and occasionally for performance reasons as well.

See also:

Significance loss

Storing integers

A good range of integers can be stored exactly, namely those for which the binary representation needs at most the amount of bits the mantissa has (plus one implicit bit).

32-bit floats have 23 bits for the mantissa, so can store integers up to 2^(23+1)-1, i.e. up to 16777215 (and down to negative that, because the sign bit is a separate thing)

64-bit floats have 52 mantissa bits, so it can store -9007199254740991..9007199254740991 as integers.

(technically you can store up to 2^(mantissabits+1) rather than 2^(mantissabits+1)-1, but languages tend to define safe values as those for which n and n+1 are both exactly representable, or perhaps more to the point, as those that are not also approximations of other integers. So the range ends one lower.)

After this point it can only store every second integer, then every fourth, etc. (and below that limit you could also store halves, and below that quarters)
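A quick check in Python; float32 behaviour is simulated by round-tripping through struct (Python's own float is 64-bit), with as_float32 being an illustrative helper name:

import struct

def as_float32(x):
    # round-trip through a 32-bit float to get float32 rounding behaviour
    return struct.unpack('>f', struct.pack('>f', float(x)))[0]

print(as_float32(16777215))    # 16777215.0 - still exact
print(as_float32(16777217))    # 16777216.0 - rounded to the nearest representable integer
print(9007199254740993.0)      # 9007199254740992.0 - the same effect in 64-bit floats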

If you've ever played with fixed-point integers (using integers to imitate floats) this may be somewhat intuitive.

Floating point compression

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The way floating point numbers are stored in bits doesn't lend itself to compressing very well with generic lossless compression such as the LZ family or most others.

There is some research and work aimed at lossless compression specifically of floating-point data, but it doesn't help much except in some specific, controllable cases.

Say, time series may be very predictable. For example, influxdb has some interesting notes on what it does to its data.

Lossy floating point compression is easier, and amounts to decimating/rounding to some degree.

Intuitively, if you don't care about the last few digits in each number(/bits in a mantissa), you can chop off a little.

It's like storing a float64 in a float32, or a float32 in float16, except you can do this in finer-grained steps.


Floating point on GPUs

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Older GPUs were more specialized devices, and deviated from standard IEEE, see e.g. [4]. Old as in roughly pre-CUDA, and CUDA 1.x still had some issues (e.g. dealing with denormal numbers).

Modern GPUs (since Compute Capability 2) provide something much closer to IEEE-754 floating point.

Still with a few footnotes that mean they are not equivalent to FPUs.

Also, while FPUs in the x86 line do calculations in 80-bit (basically since always), on GPU 32-bit is actually 32-bit, and 64-bit is actually 64-bit (closer to e.g. SSE), meaning they can accumulate errors faster. In some cases you can easily work around that, in others it's a limiting factor.

You may be able to make GPUs use float16, though don't expect ease of use, and while it may be faster, it probably won't be twice the speed, so it may not be worth it.

Speed of 16, 32, and 64-bit FP on GPU isn't a direct halving/doubling, because of various implementation details, some of which also vary between architectures. Note also that most cards currently focus on 32-bit, because they assign most silicon to things most useful in gaming, for which single precision is enough.

When lower precision is okay, even integer / fixed-point may help speed on GPU (verify)

float64 is less of a performance hit on most FPUs(verify) than GPUs.

If it matters, you may want to check every GPU (and every GPU/driver change)

On repeatability

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Differences between CPUs and GPUs are expected

See above for the reasons.

The operations will typically still be within IEEE 754 specs, which is not as strict as you may think for many operations.

Differences between different GPUs and drivers will happen

The precision for an operation in a range is usually well characterizable, and typically within IEEE spec, but may differ between hardware and with different drivers.

Also, the GPU and driver effectively act as a JIT optimizer, so they might rearrange operations, with minor side effects(verify)

See e.g.

Some code is not deterministic for speed reasons

e.g. CUDA atomics: the order in which concurrent atomic updates are performed is not defined, so this can have rounding/associativity-related side effects.

You can avoid this with different code. But in some applications you may want the speed increase (see e.g. tensorflow).

Differences between runs on the same GPU+driver should not happen

...yet sometimes do.

It seems that GPUs pushed too hard make mistakes (presumably in memory more than in calculation?). You wouldn't notice this in gaming, or necessarily in NN stuff, but you would care in scientific calculation.

Sometimes this can be fixed with drivers that use the hardware a little more carefully.

Sometimes it's a risk-speed tradeoff, one you can tweak in settings, and one that may change with hardware age.

On FPU-less CPUs

See also (floats)

Interesting note relevant to audio coding:

Relevant standards:

  • IEEE 754 [5] for a long time referred specifically to IEEE 754-1985, the "IEEE Standard for Binary Floating-Point Arithmetic", and is the most common reference.
  • More recently, IEEE 754-2008 was published, which is mostly the combination of IEEE 754-1985 and IEEE 854 (radix-independent). Before it was released, it was known as IEEE 754r[6] ('revision').
  • IEEE 854, specifically IEEE 854-1987, is a radix-independent variation [7]
  • IEC 60559:1989, "Binary floating-point arithmetic for microprocessor systems", is the same as 754-1985


There are other, more specific floating-point implementations


Signed and unsigned integers

An unsigned integer is one that can store only positive numbers.

Usually chosen when you don't want to lose part of the usable range to (negative) values that you won't ever use anyway.

A signed integer can also store negative values.

There are a few different ways of representing such numbers, and each implies how working with these values works.

The most commonly used representation is two's complement, partly because it is among the easiest and most efficient to implement in hardware - most operations for unsigned numbers actually do the correct thing on two's-complement signed numbers as well. (This is not true for one's complement.)
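A small Python sketch of two's complement (Python ints are arbitrary precision, so the width is made explicit with a mask; the helper names are just for illustration):

def to_twos_complement(value, bits=8):
    # the bit pattern of a number in a fixed width
    return value & ((1 << bits) - 1)

def from_twos_complement(pattern, bits=8):
    # interpret a fixed-width bit pattern as a signed number
    if pattern & (1 << (bits - 1)):      # top bit set means negative
        return pattern - (1 << bits)
    return pattern

print(bin(to_twos_complement(-1)))        # 0b11111111
print(bin(to_twos_complement(-2)))        # 0b11111110
print(from_twos_complement(0b11111110))   # -2

# unsigned addition on the bit patterns does the right thing for signed values too:
a, b = to_twos_complement(-2), to_twos_complement(5)
print(from_twos_complement((a + b) & 0xFF))   # 3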

Arbitrary-precision ints

Fixed point numbers


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Fixed point numbers are, in most implementations, a library trick that abuses integers to also represent fractional numbers.

Introduction by example

Consider that a regular integer usually works like:

  • bit 1 represents 1
  • bit 2 represents 2
  • bit 3 represents 4
  • bit 4 represents 8
  • bit 5 represents 16
  • bit 6 represents 32
  • bit 7 represents 64
  • bit 8 represents 128
  • ...and so on.

In which case:

  • 00000100 represents 4
  • 00000111 represents 7 (4 + 2 + 1)

Say that you instead say:

  • bit 1 represents 1/4
  • bit 2 represents 1/2
  • bit 3 represents 1
  • bit 4 represents 2
  • bit 5 represents 4
  • bit 6 represents 8
  • bit 7 represents 16
  • bit 8 represents 32
  • ...etc

In which case:

  • 000001 00 represents 1.00
  • 000001 11 represents 1.75 (1 + 1/2 + 1/4)

Put another way, you're still just counting, but you decide you're counting units of 1/4.

It's called "fixed point" because (as the spacing in the bits there suggests), you just pretend that you shifted the decimal point two bits to the left, and sits there always.

Which also means various operations on these numbers make sense as-is. You don't even need a library, as long as you're aware of the rough edges.

Some other operations not so much, so you do need some extra bookkeeping, and often want to detect border cases like overflow (things which in floating point are handled by the specs / hardware flags) - see e.g. saturation.
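A minimal sketch of the above in Python, with 2 fraction bits like the example (so everything counts in units of 1/4; the names are just for illustration):

FRACTION_BITS = 2
SCALE = 1 << FRACTION_BITS   # 4, i.e. we count in units of 1/4

def to_fixed(x):
    return int(round(x * SCALE))

def to_float(f):
    return f / SCALE

a = to_fixed(1.75)   # 7, i.e. bits 00000111
b = to_fixed(2.5)    # 10

# addition and subtraction work on the raw integers as-is
print(to_float(a + b))                      # 4.25

# multiplication squares the scale factor, so shift it back out afterwards
print(to_float((a * b) >> FRACTION_BITS))   # 4.25 (true answer 4.375, truncated to a multiple of 1/4)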


The fixed point trick was useful when Floating Point Units were not yet a ubiquitous part of CPUs, because it means you can use regular integers to calculate with fractional digits, in a way that's faster than emulating an IEEE float.

(In a few specific well-tuned cases it could even beat shabby FPUs)

Fixed point is still used, e.g.:

  • on platforms without FPU (e.g. microcontrollers, some embedded CPUs, and such)
in libraries that may have to run on such
  • to make rounding fully predictable and bit-perfect across systems, in calculation and in data storage.
e.g. SQL defines fixed-point data, so most databases can store these.
  • for quick and dirty high-precision calculation (since you can extend this into more bits, sort of like a specialized bigint)
    • mostly because from scratch, it's easier to implement fixed-point logic with more bits than floating point logic with more bits (though these days there are great high/arbitrary-precision float libraries)

More detail

See also


Binary Coded Decimal

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Binary Coded Decimal (BCD) is a way of coding integers that is different from your typical two's complement integers.

BCD codes each decimal digit in four bits. For example, 12 would be 0001 0010.

One side effect is that the hexadecimal representation of the coded value will look like the number: that's 0x12.
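A small Python sketch of that packing (the function name is just for illustration):

def to_bcd(n):
    # pack each decimal digit into its own four bits
    result = 0
    for shift, digit in enumerate(reversed(str(n))):
        result |= int(digit) << (4 * shift)
    return result

print(hex(to_bcd(12)))     # 0x12 - the hex form reads like the decimal number
print(hex(to_bcd(2024)))   # 0x2024
print(bin(to_bcd(12)))     # 0b10010, i.e. 0001 0010 with the leading zeros dropped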

However, you don't use all your bits, some operations are harder on this representation, and in a processor it would require a little more silicon. So why would you do this?

The main benefit is easy conversion to decimal, and it makes sense when displaying a limited amount of numbers - clocks, calculators, 7-segment drivers, particularly doing so without a processor - because this is fairly easy to express in simple digital electronics.

It is also seen where you want simple electronics, and/or want something more akin to fixed-point than floating-point calculations.

Pocket calculators will tend to work in BCD(verify).

On rounding

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Word, octet, nibble


In computer hardware and programming, a word is the fundamental storage size in an architecture, reflected by the processor and usually also the main bus.

A 16-bit computer has 16-bit words, a 64-bit computer has 64-bit words, etc. On all but ancient computers, they're multiples of eight.


  • In programming, 'word' often means 'an architecture-sized integer.'
  • Some use it in the implicit context of the architecture they are used to, such as indicating 16 bits in the days of 16-bit CPUs. In the context of 32-bit processors, a halfword is 16 bits, a doubleword 64 bits, and a quadword 128 bits. These terms may stick around for a while even in the 64-bit processor era.
  • 'Long word' can also be used, but is ambiguous

Basically, this is a confusing practice.

This refers specifically to integer sizes because floating point numbers are available in a few standardized sizes (usually IEEE 754).

In languages like C, int carries the meaning of a word; it may differ in size depending on what architecture it is compiled for. Some coders don't use specific-sized integers when they should, which is quite sloppy and can lead to bugs - or be perfectly fine when the variable just needs to store some not-too-large numbers. It can also interfere with assumptions about signed integers (two's complement negative numbers), particularly if you count on specific behaviour, such as -1 having the bit pattern 0xffffffff - which is only true for 32-bit signed ints, not for signed ints in general.
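A quick Python illustration of that width dependence (masking mimics fixed-width storage here; ctypes could show the same):

# the two's complement bit pattern of -1 depends entirely on the assumed width
print(hex(-1 & 0xFFFF))               # 0xffff              (as a 16-bit int)
print(hex(-1 & 0xFFFFFFFF))           # 0xffffffff          (as a 32-bit int)
print(hex(-1 & 0xFFFFFFFFFFFFFFFF))   # 0xffffffffffffffff  (as a 64-bit int)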


Octet

Fancy word for a byte: eight bits large.

Seen in some standard definitions, largely because some older (mostly ancient and/or unusual) computers used sizes more creatively, which also implied that 'byte' sometimes meant sizes other than 8 bits, and because 'byte' carries more of a connotation of being already binary coded - and octet more that of just the concept of grouping eight bits as a unit.

(For more details, see foldoc on the byte)

A few countries, like France, use megaoctet / Mo instead of the megabyte. (Though for French it seems this is also because it sounds potentially rude - though this applies to bit, not byte[8])


Nibble

(Also seen spelled nybble, or even nyble.)

Half a byte: four bits.

Most commonly used in code comments to describe the fact you are storing two things in a byte, in two logical parts, the high and low nibble.

Few architectures or languages have operations at nibble level, probably largely because bitmask operations are simple enough to do, and more flexible.

Has seen other definitions in ancient/strange computers.

On architecture bit sizes