Notes on numbers in computers

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)

Floating point numbers

Floating point numbers store approximations of real numbers.

Real numbers in general (irrational and transcendental numbers, and even many rational numbers) are hard for computers to store and work with exactly, let alone efficiently.

When that is necessary, as in e.g. varied computer algebra and symbolic math, such software has its own solutions, and accepts it is sometimes slow.

But for a lot of things, you don't need all that precision. Floating point is a pragmatic tradeoff, with limited but fairly predictable precision, and consistent speed. That speed comes in part from dedicated CPU silicon; simulating floating point calculations in software is many times slower. (So where you have no hardware floating point, e.g. on simple microcontrollers, integer-based solutions are sometimes faster, though the precision difference isn't trivial to quantify.)

From a practical view, most math involving decimal points, fractions, very large numbers, very small numbers, or transcendental numbers like pi or e is precise enough when done in floating point, particularly if you know its limitations.

Within computers, floating point is mainly contrasted with

  • fixed-point numbers, which could be seen as a makeshift precursor to floating point (and are still useful on platforms without floating point calculations)
  • arbitrary-precision arithmetic,
    • BigInt

Intuitively, floating point can be understood as an idea similar to scientific notation.

Consider 20400.

In scientific notation you might write that as 2.04 * 10^4.

Floating point would actually store that same number as something like 1.2451172 * 2^14. (Exactly why and how isn't hugely important; e.g. its use of base 2 is mostly a detail about making slightly more efficient use of the same amount of silicon.)

Bit division and layout


To give some idea how it's stored: IEEE 754 floats divide their allotted bits into sign, mantissa, and exponent.

The standard float types in most programming languages will be IEEE 754 32-bit and/or IEEE 754 64-bit

  • largely because most CPUs can handle those natively and therefore quickly
  • also because we're used to them - e.g. Arduinos have to simulate floats, with a bunch of work for each floating point operation, but it's hacked in because it's very convenient

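An IEEE 754 32-bit float uses 1 bit for the sign, 23 for the mantissa and 8 for the exponent. The bit layout, from most to least significant bit: sign (1 bit), exponent (8 bits), mantissa (23 bits).


An IEEE 754 64-bit float (a.k.a. 'double-precision', double) uses 1 bit for the sign, 52 for the mantissa and 11 for the exponent. The bit layout, from most to least significant bit: sign (1 bit), exponent (11 bits), mantissa (52 bits).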


There are other-sized variations, like 80-bit extended precision, 128-bit quadruple, 256-bit octuple, 16-bit half, and also 36-bit, 48-bit, and 60-bit, and I've heard mention of 120-bit and one or two others.

You can always emulate lower-precision floats, e.g. use 80-bit to emulate 64-bit, by throwing away the excess precision only when the result is fetched out.

This is apparently what x86 FPUs have done since the 8087 - they are internally 80-bit, making some operations slightly more accurate than you might expect. This is done out of necessity: some operations by nature lose a few bits of precision in their intermediate steps, so you need some extra bits for the results of those operations to still be accurate at the precision you asked for.

Your programming language may allow some use of, or control over, extended precision[1], but this is a topic with footnotes of its own.

Representation details


You may want to read the following thinking only about how this affects inaccuracy - trying to learn the details is probably not very useful for much else. For in-depth details, there are good resources out there already. This is intended as a summary.

The value a float represents is calculated something like:

sign * (1+mantissa) * 2^exponent

Sign is coded as 1 meaning negative, 0 meaning positive, though in the formula above, think of it as -1 and 1.

Exponent is an integer which (in 32-bit floats) represents values within -126..127 range

  • so e.g.
    • coded value 1 represents -126
    • coded value 127 represents 0
    • coded value 254 represents 127
  • values 0 and 255 (which would represent -127 and 128) are used for special cases:
    • Zero (which can't be represented directly because of the 1+ in the formula) is represented by exponent=00000000 and mantissa=0
    • Infinity is represented by exponent=11111111 and mantissa=0.
    • Note that the last two don't mention sign -- both infinity and zero can be positive and negative
    • NaN ('Not a Number') is represented by exponent=11111111 and non-zero mantissa
      • there is a further distinction into quiet NaNs for indeterminate operations, and signaling NaNs for invalid operations, but you often wouldn't care
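As a rough illustration of the special cases listed above, the following sketch uses Python's standard struct module to look at the sign, exponent and mantissa fields of a few special values stored as 32-bit floats. The helper name float32_fields is made up for this example.

```python
import math
import struct

def float32_fields(x):
    """Return the (sign, exponent, mantissa) bit fields of x stored as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

for value in (0.0, -0.0, math.inf, -math.inf, math.nan):
    sign, exponent, mantissa = float32_fields(value)
    print(f"{value!r:>5}: sign={sign} exponent={exponent:08b} mantissa={mantissa:023b}")
```

Zero and infinity show up with an all-zero mantissa and either sign bit; NaN has an all-ones exponent and some non-zero mantissa (which mantissa bits are set is not something to rely on).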

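Mantissa bits represent 2^-1 (=0.5), 2^-2 (=0.25), and so on. For a 32-bit float there are 23 mantissa bits, so before the exponent gets involved they represent:

  • bit 1: 2^-1 = 0.5
  • bit 2: 2^-2 = 0.25
  • bit 3: 2^-3 = 0.125
  • bit 4: 2^-4 = 0.0625
  • bit 5: 2^-5 = 0.03125
  • bit 6: 2^-6 = 0.015625
  • bit 7: 2^-7 = 0.0078125
  • bit 8: 2^-8 = 0.00390625
  • bit 9: 2^-9 = 0.001953125
  • bit 10: 2^-10 = 0.0009765625
  • ...and so on, down to bit 23: 2^-23 (approximately 0.000000119)

For 64-bit floats there are another 29 of these.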

So yes, the mantissa part is representing fractions (some explanations introduce the whole thing this way), but only in a directly understandable way when you ignore the exponent.

Example from IEEE bits to represented number:

  • 1 01111111 10000000000000000000000
  • Sign is 1, representing -1
  • exponent is 127 (decimal int), representing 0
  • mantissa is 10000000000000000000000, representing 0.5

Filling in the values into (sign * (1+mantissa) * 2^exponent) makes for (-1 * (1+0.5) * 2^0), which is -1.5
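The same decoding can be sketched in a few lines of Python. This is a minimal illustration that only handles normal numbers (not zero, infinity, NaN or denormals); decode_float32 is a name invented for this example.

```python
def decode_float32(bit_string):
    """Decode a normal 32-bit float given as a string of 0s and 1s (spaces allowed)."""
    bits = bit_string.replace(" ", "")
    if len(bits) != 32:
        raise ValueError("expected exactly 32 bits")
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127    # coded value minus the bias of 127
    mantissa = int(bits[9:], 2) / 2**23   # the 23 bits read as a fraction in 0..1
    return sign * (1 + mantissa) * 2.0**exponent

print(decode_float32("1 01111111 10000000000000000000000"))  # -1.5
```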

Example from number to IEEE bits

Encoding a number into float form can be understood as:

  1. start with base-2 scientific notation (with exponent 0)
  2. increment the exponent while halving the part that goes into the mantissa (which keeps the represented number the same, to within error).
  3. repeat step 2 until value-that-goes-into-the-mantissa is between 0 and 1
(Note that mantissa values between 0 and 1 represent 1+that, so in the examples below we divide until it's between 1 and 2)

For example: 7.0

  • is 7.0 * 2^0
  • is 3.5 * 2^1
  • is 1.75 * 2^2 (mantissa stores (that-1), 0.75, so we're now done with the shifting)

We have sign=positive, exponent=2, and mantissa=0.75. In bit coding:

  • sign bit: 0
  • exponent: we want to represent 2, so store 129, i.e. 10000001
  • mantissa: we want to represent 0.75, which is 2^-1 + 2^-2 (0.5+0.25), so 11000000000000000000000

So 7.0 is 0 10000001 11000000000000000000000

The 20400 mentioned above works out as 1.2451171875 * 2^14.

  • Sign: 0
  • Exponent: 10001101 (141, representing 14)
  • Mantissa: 00111110110000000000000
Using the table above, that mantissa represents 0.125+0.0625+0.03125+0.015625+0.0078125+0.001953125+0.0009765625 (which equals 0.2451171875)
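If you want to check such encodings yourself, one option (assuming Python is at hand) is to let the standard struct module do the packing and print the bits in the same sign / exponent / mantissa grouping used here. The function name float32_bits is made up for this sketch.

```python
import struct

def float32_bits(x):
    """Return the IEEE 754 32-bit pattern of x as 'sign exponent mantissa'."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

print(float32_bits(7.0))      # 0 10000001 11000000000000000000000
print(float32_bits(20400.0))  # 0 10001101 00111110110000000000000
```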


On the limited precision

Never assume floating point numbers or calculations are fully precise

As [2] mentions, one reason for inaccuracy is akin to the reason we can't write 1/3 in decimal either.

Writing 0.333333333 is imprecise no matter how many threes you add.

At the same time, for many uses (arguably most except maybe space program and medical) a handful of digits is quite good enough.

Floating point is a similar tradeoff: if you know when and why the imprecision is 'good enough', we can give you calculations that are faster, and predictably faster, than doing it fully correctly.

The (in)accuracy of 32-bit floats is good enough for a lot of practical use, and where that's cutting it close it's often easy to throw 64-bit at the problem to push down inaccuracies at the cost of some speed.

Introductions often mention that (counting from the largest digit)

  • 32-bit float is precise up to the first six digits or so
  • 64-bit is precise up to the first sixteen digits or so

There's some practical footnotes to that but it's broadly right.

That "stop anywhere and it's inexact" example isn't the best intuition, though, because some numbers will be less imprecise than others, and a few will be exact.

Consider that:

  • 0.5 is stored precisely in a float
  • 0.1 and 0.2 are not stored precisely
  • while 20400.75 is stored precisely, 20400.8 is not
    • note that having few digits in decimal representation is (mostly) unrelated to having few digits in float representation
  • with 32-bit floats, all integers within -16777215..16777215 are exact
  • with 64-bit floats, all integers in -9007199254740991..9007199254740991 are exact
    • this is why some programming languages opt to skip integer types entirely (e.g. JavaScript, and Lua before 5.3)
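One way to check which decimal values are stored exactly is to compare the stored float against the exact fraction you wrote down. The sketch below does that for 64-bit floats with Python's fractions and decimal modules; stored_exactly is a helper name invented here.

```python
from decimal import Decimal
from fractions import Fraction

def stored_exactly(literal):
    """True if the decimal string survives conversion to a 64-bit float unchanged."""
    return Fraction(float(literal)) == Fraction(literal)

for literal in ("0.5", "0.1", "0.2", "20400.75", "20400.8"):
    print(literal, stored_exactly(literal))

# The value actually stored for 0.1, written out in full:
print(Decimal(0.1))
```

The values that come out exact are the ones that are a sum of a modest number of powers of two, which has little to do with how short they look in decimal.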

Even if you say 'fair enough' about that, operations make things worse, intuitively because each result may have to become whatever representable number is nearest.

Okay, but you may not have considered that this breaks some properties you might expect to hold, such as associativity (the grouping of operations not mattering).


0.1+(0.2+0.3) != (0.1+0.2)+0.3

and, sometimes more pressingly,

0.3-(0.2+0.1) != 0.0
(tests for zero should often be something like abs(x) < some_small_value
instead, but it's not always clear how close you should require it to be)
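The two comparisons above, plus a tolerance-based test, can be tried in Python (which uses 64-bit floats). The tolerance 1e-9 is an arbitrary choice for this sketch, not a recommendation.

```python
import math

a = 0.1 + (0.2 + 0.3)
b = (0.1 + 0.2) + 0.3
print(a == b)                # False: the grouping changes the rounding

residue = 0.3 - (0.2 + 0.1)
print(residue == 0.0)        # False: a tiny non-zero value is left over

# Compare against zero with an absolute tolerance instead of ==
print(math.isclose(residue, 0.0, abs_tol=1e-9))   # True
```

Note that math.isclose needs abs_tol when comparing against zero, since its default relative tolerance scales with the values being compared.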

In part it's because what we write down precisely in base 10 isn't necessarily precise in float representation.

In part because operations can mess things up further.

There frankly isn't a good mental model of how or when these inaccuracies happen.

You're best off assuming everything may be incorrect to within rounding error, and that every operation might contribute to that.

You should also assume these errors may accumulate over time, meaning there is no perfect choice for that 'almost zero' value.

Operations make things worse, part 2.

Combining numbers of (vastly) different scales creates another issue.

For an intuition why, consider scientific notation again, e.g. adding 1E2 + 1E5

If you stick to this notation, you would probably do this by considering that 1E2 is equal to 0.001E5, so now you can do 0.001E5 + 1E5 = 1.001E5

That same "scaling up the smaller number's exponent to fit the larger number" in float means increasing the exponent, while dividing the mantissa.

The larger this magnitude change, the more that digits fall out of the mantissa (and that's ignoring the rounding errors).

If there is a large enough scale difference, the smaller number falls away completely.

1 + 1E-5
is 1.00001, but
1 + 1E-23
is 1.0 even in 64-bit floats.

It's not that it can't store numbers that small, it's that the precision limit means you can't combine numbers of a certain magnitude difference.
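A small Python check of this (64-bit floats), using sys.float_info.epsilon, which is the gap between 1.0 and the next larger representable number:

```python
import sys

eps = sys.float_info.epsilon      # gap between 1.0 and the next larger 64-bit float

print(1 + 1e-5)
print(1 + 1e-23 == 1.0)           # True: the small term is lost entirely
print(1e-23 > 0.0)                # True: 1e-23 by itself is stored fine
print(1.0 + eps > 1.0)            # True: a whole gap is enough to register
print(1.0 + eps / 2 == 1.0)       # True: half a gap is rounded away again
```

So whether a small term survives an addition depends on its size relative to the other term, not on its absolute size.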

Operations make things worse, part 3.

Scale stuff introduces more interesting cases.

Remember a bit above where

0.1+(0.2+0.3) != (0.1+0.2)+0.3

Yet with a much larger number in the mix:

0.1+(0.2+100) == (0.1+100)+0.2 

You may not expect that, and it takes some staring to figure out why.

Operations make things worse, part 4.

Rounding errors accumulate.

For example, if you keep updating a number with operations, you should expect rounding errors to happen each calculation, and therefore accumulate. For example, if you do

view = rotate(current_view, 10 degrees)

and do it 36 times you would expect to be precisely back where you started, but you would not expect that of floats. Do that another couple thousand times, with weird angles, and who knows where you are exactly?

In games that may not matter. Usually. Except when it does.

There are ways to avoid such issues, in general and in games, but this takes some specialized knowledge.
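A simpler stand-in for the rotation example is repeated addition: adding 0.1 ten times does not land exactly on 1.0. The sketch below also shows one of the general-purpose workarounds, Python's math.fsum, which tracks the lost low-order bits while summing.

```python
import math

steps = [0.1] * 10

running = 0.0
for step in steps:
    running += step              # each addition rounds, and the errors add up

print(running)
print(running == 1.0)            # False
print(math.fsum(steps) == 1.0)   # True: fsum compensates for the intermediate rounding
```

This does not carry over directly to something like repeated rotation; there the usual approach is to recompute from the original state and a total angle rather than to keep updating the previous result.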

Operations make things worse, part 5.

Some mathematical operations do not map easily to floating point, and when specs don't strictly define the required accuracy, implementations may trade away some precision for more speed.

One of the better known examples is exp / expf (exponentiation). This is one of a few reasons different libraries and different hardware should not be expected to all give bit-identical results.

Note that x87 (x86 FPU) operations do most calculations in 80-bit internally (and have for a much longer time than you'd think). Intuitively, this is because (the internal steps within what looks like) a single FPU operation imply losing a few digits (depending on the operation), so you need to work with more to have your result stay actually accurate to within your 64-bit numbers. (It's overkill for 32-bit, but doesn't hurt.)


  • It does not mean subsequent operations get near-80-bit precision, because the intermediate 'fetch this out to memory' is typically 64-bit.
    • Even if your language exposes 80-bit floats in its type system (e.g. C's long double), and even if it can load/save them into the FPU registers, there are various footnotes.
  • SIMD things like SSE do not do this 80-bit stuff.[3]
  • GPUs also do not
    • Actually, GPUs sometimes play looser with floating point specs in other ways, though less so now than they did in the early days
  • because the x87 is a coprocessor (with its own register stack), it's clunkier and harder to optimize, and CPU makers have apparently been trying to get rid of it in favour of the SIMD variants. There are cases where this is better, there are cases where it is not better.

It's usually well quantified how large the error may be on any specific hardware, e.g. via ulp.

Though note that compiler optimizations can still change things from being bit identical.

Denormal numbers

Denormal numbers, a.k.a. subnormal numbers, are ones that are so small (near zero) that the exponent has already bottomed out, so we have to have leading zeroes in the mantissa (for larger numbers we can avoid losing this precision by choosing the exponent so that the mantissa is 0..1).

This implies you have fewer bits of precision than you would normally have.

In many cases these are so close to zero you can just treat them as zero, for practical reasons, and occasionally for performance reasons as well.
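To get a feel for where this happens with 64-bit floats, the sketch below looks at the smallest normal number (sys.float_info.min) and at the smallest denormal. It assumes the default behaviour where denormals are not flushed to zero, which is what a stock Python interpreter gives you.

```python
import sys

smallest_normal = sys.float_info.min   # smallest positive normal 64-bit float
smallest_subnormal = 5e-324            # smallest positive denormal, 2**-1074

print(smallest_normal)
print(smallest_normal / 4)             # still non-zero, but now a denormal
print(smallest_subnormal / 2)          # nothing smaller exists, so this becomes 0.0
```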


Significance loss

Storing integers

A good range of integers can be stored exactly in floating point.

Basically it's those for which the binary representation needs at most the amount of bits the mantissa actually has (plus one implicit bit).

For example, 32-bit floats have 23 bits for the mantissa, so can store integers up to 2^(23+1) - 1, so practically integers within -16777215 .. 16777215

64-bit floats have 52 mantissa bits, so can store integers within -9007199254740991..9007199254740991


  • technically you can store up to 2^(mantissabits+1) rather than 2^(mantissabits+1) - 1, but languages tend to define safe values as those for which n and n+1 are exactly representable, or perhaps more to the point, as those that are not also approximations for other numbers. So it ends one lower.

  • After the abovementioned limit, you effectively get every second integer, after that every fourth, etc.
(and before that you could also store halves, before that quarter)
If you've ever played with fixed-point integers (using integers to imitate floats) this may be somewhat intuitive to you
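These limits are easy to poke at from Python. Python's own floats are 64-bit, so for the 32-bit case this sketch round-trips values through struct's 32-bit float format; as_float32 is a helper name made up for the example.

```python
import struct

def as_float32(n):
    """Round-trip a number through 32-bit float storage."""
    return struct.unpack(">f", struct.pack(">f", float(n)))[0]

print(as_float32(2**24 - 1) == 2**24 - 1)   # True: 16777215 is exact
print(as_float32(2**24 + 1) == 2**24 + 1)   # False: beyond 2^24 only every second integer exists
print(as_float32(2**24 + 2) == 2**24 + 2)   # True

print(float(2**53 + 1) == 2**53 + 1)        # False: the same effect in 64-bit
print(float(2**53 + 2) == 2**53 + 2)        # True
```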

Floating point compression


The way floating point numbers are stored in bits means they don't compress very well with generic lossless compression such as the LZ family or most others.

There is some research and work aimed at lossless compression specifically of floating-point data, but it doesn't help much except in some specific, controllable cases.

Say, time series may be very predictable. For example, influxdb has some interesting notes on what it does to its data.

Lossy floating point compression is easier, and in effect amounts to decimating/rounding to some degree.

Intuitively, if you don't care about the last few digits in each number(/bits in a mantissa), you can chop off a little.

It's like storing a float64 in a float32, or a float32 in float16, except you can do this in finer-grained steps.
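As a minimal sketch of that "chop off a little" idea (not any particular library's method), the function below zeroes the lowest bits of a 64-bit float's mantissa. The long runs of zero bits this leaves behind are what a later generic compressor can then exploit. The name truncate_mantissa is invented for this example.

```python
import struct

def truncate_mantissa(x, drop_bits):
    """Zero the lowest drop_bits of a float64's 52-bit mantissa (a lossy step)."""
    if not 0 <= drop_bits <= 52:
        raise ValueError("drop_bits must be within 0..52")
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    mask = ~((1 << drop_bits) - 1) & 0xFFFFFFFFFFFFFFFF
    (result,) = struct.unpack(">d", struct.pack(">Q", raw & mask))
    return result

for drop in (0, 16, 29, 40):
    print(drop, truncate_mantissa(0.1, drop))
```

Dropping 29 bits leaves 23 mantissa bits, i.e. roughly float32 precision while keeping the float64 exponent range. Note that this truncates rather than rounds to nearest, which is slightly less accurate but simpler.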


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)

Older GPUs were more specialized devices, and deviated from standard IEEE floats. Old as in roughly pre-CUDA, and CUDA 1.x still had some issues (e.g. dealing with denormal numbers). See e.g. [4].

Modern GPUs (since Compute Capability 2) provide something much closer to IEEE-754 floating point.

Still with a few footnotes that mean they are not equivalent to FPUs.

Also, while FPUs in the x86 line do calculations in 80-bit (basically since always), on GPU 32-bit is actually 32-bit, and 64-bit is actually 64-bit (closer to e.g. SSE), meaning they can accumulate errors faster. In some cases you can easily work around that, in some cases it's a limiting factor.

You may be able to make GPUs use float16, though don't expect ease of use, and while it may be faster, it probably won't be twice the speed, so it may not be worth it.

Speed of 16, 32, and 64-bit FP in GPU isn't a direct halving/doubling, because of various implementation details, some of which also vary between architectures. It is also because most cards currently focus on 32-bit: they assign most silicon to the things most useful in gaming, for which singles are enough.

When lower precision is okay, even integer / fixed-point may help speed on GPU (verify)

float64 is less of a performance hit on most FPUs(verify) than GPUs.

If it matters, you may want to check every GPU (and every GPU/driver change)

On repeatability


Differences between CPUs and GPUs are expected

See above for the reasons.

The operations will typically still be within IEEE 754 specs. ...which for many operations are not quite as strict as you may think.

Differences between different GPUs and drivers will happen

The precision for an operation in a range is usually well characterized, and typically within IEEE spec, but may differ between hardware and with different drivers.

Also, the GPU and driver is effectively a JIT optimizer, so it might rearrange operations with minor side effects(verify)


Some code is not deterministic for speed reasons

e.g. CUDA atomics: the order in which concurrent atomic updates are performed is not defined, so this can have rounding/associativity-related side effects.

You can avoid this with different code. But in some applications you may want the speed increase (see e.g. tensorflow).
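The effect of an undefined update order can be imitated on a CPU by simply adding the same values in every possible order. This is plain Python rather than CUDA, and the values are picked to make the effect extreme, so treat it as an analogy only.

```python
from itertools import permutations

def add_in_order(values):
    """Plain left-to-right float addition, one rounding per step."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [1e16, 1.0, -1e16]
results = {order: add_in_order(order) for order in permutations(values)}
for order, total in results.items():
    print(order, "->", total)
```

Depending on the order, the 1.0 either survives or is absorbed by the large term before that term is cancelled out again, so the same inputs give 0.0 or 1.0.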

Differences between runs on the same GPU+driver should not happen

...yet sometimes do.

It seems that GPUs pushed too hard make mistakes (presumably in memory more than in calculation?). You wouldn't notice this in gaming, or necessarily in NN stuff, but you would care in scientific calculation.

Sometimes this can be fixed with drivers that use the hardware a little more carefully.

Sometimes it's a risk-speed tradeoff, one you can tweak in settings, and one that may change with hardware age.

On FPU-less CPUs

See also (floats)

Interesting note relevant to audio coding:

Relevant standards:

  • IEEE 754 [5] for a long time referred specifically to IEEE 754-1985, the "IEEE Standard for Binary Floating-Point Arithmetic", which is the most common reference.
  • IEEE 754-2008 is a later revision, which is mostly just the combination of IEEE 754-1985 and IEEE 854 (radix-independent). Before it was released, it was known as IEEE 754r[6] ('revision')
  • IEEE 854, specifically IEEE 854-1987, is a radix-independent variation [7]
  • IEC 60559:1989, "Binary floating-point arithmetic for microprocessor systems", is the same as 754-1985


There are other, more specific floating-point implementations


Signed and unsigned integers

A signed integer can store negative values; an unsigned integer is one that can store only non-negative numbers.

Presumably it's just referring to whether a minus sign gets involved (I've never checked the origins of the terms).

Unsigned integers are often chosen to be able to count a little further, because signed integers spend half of their range on negative values.

Say, a signed 16-bit int counts from −32768 to 32767 (...assuming two's complement style storage, see below), an unsigned 16-bit int counts from 0 to 65535.

How we could store negative numbers

There are a few different ways of representing negative numbers.

The most commonly used is two's complement, partly because it is one of the easiest and most efficient to implement in hardware - most operations for unsigned numbers actually do the correct thing on two's-complement signed numbers as well. (Not true for one's complement.)
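To illustrate the "unsigned operations do the right thing" point, here is a sketch that imitates 16-bit registers in Python (whose own integers are unbounded, hence the explicit masking). The helper names are made up for this example.

```python
MASK16 = 0xFFFF

def to_unsigned16(n):
    """The 16-bit two's complement bit pattern of n, as an unsigned number."""
    return n & MASK16

def to_signed16(u):
    """Read a 16-bit pattern back as a two's complement signed number."""
    return u - 0x10000 if u & 0x8000 else u

# Add -3 and 5 using nothing but unsigned 16-bit addition:
pattern = (to_unsigned16(-3) + to_unsigned16(5)) & MASK16
print(hex(to_unsigned16(-3)), hex(pattern), to_signed16(pattern))

print(to_signed16(0x7FFF), to_signed16(0x8000))   # the ends of the signed range
```

The addition itself never looks at the sign; only the final interpretation of the bit pattern differs between signed and unsigned.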

Arbitrary-precision integers


Arbitrary-precision integers, also frequently named something shorter like 'bignum' or 'bigint', are integers that can represent any large value -- until they no longer fit in available storage, anyway.

A toy implementation of storing these, and some basic operations on them, is surprisingly reasonable to make.

Say, one approach that may not be very efficient yet is easy to understand (and program) is to keep a list of int32s, each representing three decimal digits at a time. For example, 1034123 would be stored as the list
[1, 34,123]

Many operations can be done with some simple piece-by-piece logic with the underlying integer operations, followed by a pass of 'if an element is >999, move the excess into the next highest element' (like carrying when doing addition or multiplication on paper), done right-to-left in case that pushes the next one over 999 as well. For example:

  • 1034123+1980
    = [1,34,123] + [1,980]
    = [1,35,1103]
    = [1,36,103] (= 1036103)
  • 1980*5002
    = [1,980] * [5,2]
    = [2,1960] + [5,4900,0]
    = [3,960] + [9,900,0]
    = [9,903,960] (= 9903960)


  • Serious bignum implementations do something similar, but are a bunch cleverer in terms of speed, space, and operations.
  • The choice above for 1000 in int32 is not very space-efficient at all. It was chosen for the toy implementation because
    • a single add or multiply can't overflow the value (1000*1000 is well within the range) and
    • it's pretty trivial to print (and to see the meaning of the stored data during debugging)
  • The choice of when to do the carry test doesn't matter so much in this toy implementation (it is an avoidable intermediate step in the multiplication example), but matters in real implementations
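The toy scheme above can be sketched in Python as follows. Python's own integers are already arbitrary-precision, so this is purely illustrative; the function names are made up, and unlike an int32 version it gets away with carrying only once at the end because Python elements cannot overflow.

```python
BASE = 1000  # three decimal digits per element, as in the text

def carry(pieces):
    """Right-to-left pass that moves anything above 999 into the next higher element."""
    rev = pieces[::-1]                    # work least significant first
    i = 0
    while i < len(rev):
        excess, rev[i] = divmod(rev[i], BASE)
        if excess:
            if i + 1 == len(rev):
                rev.append(0)             # grow when the top element overflows
            rev[i + 1] += excess
        i += 1
    return rev[::-1]

def add(a, b):
    ra, rb = a[::-1], b[::-1]
    n = max(len(ra), len(rb))
    ra += [0] * (n - len(ra))
    rb += [0] * (n - len(rb))
    return carry([x + y for x, y in zip(ra, rb)][::-1])

def multiply(a, b):
    ra, rb = a[::-1], b[::-1]
    rev = [0] * (len(ra) + len(rb))
    for i, x in enumerate(ra):
        for j, y in enumerate(rb):
            rev[i + j] += x * y           # piece-by-piece products, carried afterwards
    result = carry(rev[::-1])
    while len(result) > 1 and result[0] == 0:
        result.pop(0)                     # strip unused leading elements
    return result

print(add([1, 34, 123], [1, 980]))        # [1, 36, 103]
print(multiply([1, 980], [5, 2]))         # [9, 903, 960]
```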


Fixed point numbers



Fixed point numbers are, in most implementations, a library that abuses integers to represent more fractional numbers.

Introduction by example

Consider that a regular integer works something like:

  • bit 1 represents 1
  • bit 2 represents 2
  • bit 3 represents 4
  • bit 4 represents 8
  • bit 5 represents 16
  • bit 6 represents 32
  • bit 7 represents 64
  • bit 8 represents 128
  • ...and so on.

In which case:

  • 00000100 represents 4
  • 00000111 represents 7 (4 + 2 + 1)

Now say that instead:

  • bit 1 represents 1/4
  • bit 2 represents 1/2
  • bit 3 represents 1
  • bit 4 represents 2
  • bit 5 represents 4
  • bit 6 represents 8
  • bit 7 represents 16
  • bit 8 represents 32
  • ...etc

In which case:

  • 000001 00 represents 1.00
  • 000001 11 represents 1.75 (1 + 1/2 + 1/4)

Put another way, you're still just counting, but you decided you're counting units of 1/4.

It's called "fixed point" because (as the spacing in the bits there suggests), you just pretend that you shifted the binary point (the base-2 counterpart of the decimal point) two bits to the left, and that it always sits there.

Which also means various operations on these numbers make sense as-is. You don't even need a library, as long as you're aware of the rough edges.

...but some other operations not so much, so you do need some extra bookkeeping, and often want to detect border cases like overflow (stuff which in floating point is handled by the specs / hardware flags) - see e.g. saturation.
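A minimal sketch of the two-fractional-bit format from the example, in Python. Addition works on the raw integers as-is, while multiplication is one of those operations that needs extra bookkeeping (a shift to put the point back). Names are invented for the example, and overflow and saturation are ignored entirely.

```python
FRAC_BITS = 2                 # two bits after the point, so we count in units of 1/4
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Convert a number to the nearest fixed-point integer."""
    return round(x * SCALE)

def to_float(f):
    """Interpret a fixed-point integer as a regular number."""
    return f / SCALE

def fixed_mul(a, b):
    """Multiply two fixed-point values; the raw product has 2*FRAC_BITS fractional bits."""
    return (a * b) >> FRAC_BITS

print(to_float(0b00000111))                                  # 1.75
print(to_float(to_fixed(1.5) + to_fixed(2.25)))              # 3.75: plain integer addition
print(to_float(fixed_mul(to_fixed(1.5), to_fixed(2.5))))     # 3.75
```

The right shift in fixed_mul simply drops bits, so results that don't fit in quarters are truncated rather than rounded.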


The fixed point trick was useful when Floating Point Units were not yet a ubiquitous part of CPUs, because it means you can use regular integers to calculate things with more digits, in a way that's faster than emulating an IEEE float.

(In a few specific well-tuned cases it could even beat shabby FPUs)

Fixed point is still used, e.g.:

  • on platforms without floating point calculation, e.g. many microcontrollers, some embedded CPUs, and such
  • in libraries that may have to run on such platforms
  • to make rounding fully predictable and bit-perfect across systems, in calculation and in data storage.
e.g. SQL defines fixed-point data, so most SQL databases can store and calculate with what is functionally roughly the same (implementation may be more focused on being fractions, but it amounts to much the same).
  • for quick and dirty high-precision calculation (since you can extend this into more bits, sort of like a specialized bigint)
    • mostly because from scratch, it's easier to implement fixed-point logic with more bits than floating point logic with more bits (though these days there are great high/arbitrary-precision float libraries)

More detail

See also

Storing fractions



Binary-coded decimal (BCD) is a way of coding integers that is different from your typical two's complement integers.

BCD codes each decimal digit in four bits. For example, (decimal) 12 would be 0001 0010

One side effect is that the hexadecimal representation of the coded value will look like the number: decimal 12 is coded like 0x12.
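A small sketch of packed BCD coding in Python, for non-negative integers only; to_bcd and from_bcd are names made up for the example.

```python
def to_bcd(n):
    """Code a non-negative integer as packed BCD, four bits per decimal digit."""
    result, shift = 0, 0
    while True:
        n, digit = divmod(n, 10)
        result |= digit << shift
        shift += 4
        if n == 0:
            return result

def from_bcd(coded):
    """Decode packed BCD back into a regular integer."""
    n, place = 0, 1
    while coded:
        n += (coded & 0xF) * place
        coded >>= 4
        place *= 10
    return n

print(hex(to_bcd(12)), f"{to_bcd(12):08b}")   # 0x12 00010010
print(from_bcd(0x1980))                       # 1980
```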

However, you don't use all your bits, some operations are harder on this representation, and in a processor would require a little more silicon. So why would you do this?

The main benefit is easy conversion to something you show as decimal, which makes sense when displaying a limited amount of numbers - clocks, calculators, 7-segment drivers, particularly when doing so without a processor.

It is also seen where you want simple electronics, and/or want something more akin to fixed-point than floating-point calculations.

Pocket calculators will (still(verify)) tend to work in BCD(verify).

On rounding

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)

Word, octet, nibble


In computer hardware and programming, a word is the fundamental storage size in an architecture, reflected by the processor and usually also the main bus.

A 16-bit computer has 16-bit words, a 64-bit computer has 64-bit words, etc. On all but ancient computers, they're multiples of eight.


  • In programming, 'word' often means 'an architecture-sized integer.'
  • Some use it in the implicit context of the architecture they are used to, such as indicating 16 bits in the days of 16-bit CPUs. In the context of 32-bit processors, a halfword is 16 bits, a doubleword 64 bits, and a quadword 128 bits. These terms may stick for a while even in 64-bit processor era.
  • 'Long word' can also be used, but is ambiguous

Basically, this is a confusing practice.

This refers specifically to integer sizes because floating point numbers are available in a few standardized sizes (usually IEEE 754).

In languages like C, int carries the meaning of a word; it may differ in size depending on what architecture it is compiled for. Some coders don't use specific-sized integers when they should, which is quite sloppy and can lead to bugs - or be perfectly fine when it just needs to store some not-too-large numbers. It may interfere with signed integers (Two's complement negative numbers), particularly if you count on certain behaviour, such as that -1 would be equal to 0xffffffff, which is only true for 32-bit signed ints, not for signed ints in general.


An octet is a fancy word for a byte - eight bits large.

Seen in some standard definitions, largely because some older (mostly ancient and/or unusual) computers used sizes more creatively, which also implied that 'byte' sometimes meant sizes other than 8 bits, and because 'byte' carries more of a connotation of being already binary coded - and octet more that of just the concept of grouping eight bits as a unit.

(For more details, see foldoc on the byte)

A few countries, like France, use megaoctet / Mo instead of the megabyte. (Though for French it seems this is also because it sounds potentially rude - though this applies to bit, not byte[8])


A nibble (also seen spelled nybble, even nyble) is half a byte: four bits.

Most commonly used in code comments to describe the fact you are storing two things in a byte, in two logical parts, the high and low nibble.
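The bitmask operations involved are short enough to show in full. A sketch in Python, with made-up function names:

```python
def pack_nibbles(high, low):
    """Store two 4-bit values (0..15) in one byte."""
    return ((high & 0xF) << 4) | (low & 0xF)

def unpack_nibbles(byte):
    """Split a byte into its (high, low) nibbles."""
    return (byte >> 4) & 0xF, byte & 0xF

packed = pack_nibbles(0xA, 0x3)
print(hex(packed))               # 0xa3
print(unpack_nibbles(packed))    # (10, 3)
```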

Few architectures or languages have operations at nibble level, probably largely because bitmask operations are simple enough to do, and more flexible.

Has seen other definitions in ancient/strange computers.

On architecture bit sizes