{{#addbodyclass:tag_tech}}
{{stub}}
 
<!--
=tl;dr=
 
* whole numbers are typically stored in integers, either
: signed integers, which can go both positive and negative
:: usually [https://nl.wikipedia.org/wiki/Two%27s_complement two's complement] coding, for practical reasons
: unsigned integers, which do not allow negative numbers, and for the same amount of bits let you count twice as high
 
* real numbers are typically approximated by floating point numbers
: typically IEEE 754 coding
:: which you probably don't ''want'' to understand all the details of
:: (but you may get a sense of the inaccuracies it does)
 
-->
 
=Floating point numbers=
 
Floating point numbers store approximations of [http://en.wikipedia.org/wiki/Real_number real numbers],
and are the most common way of doing so in computers.
 
 
When (more) precise real numbers are necessary, we look to [https://en.wikipedia.org/wiki/Computer_algebra#Numbers computer algebra] and symbolic math; such software has its own solutions, and accepts that it is sometimes slow. {{comment|(Those also deal more easily and sensibly with [https://en.wikipedia.org/wiki/Rational_number rational numbers], [https://en.wikipedia.org/wiki/Transcendental_number transcendental] and [https://en.wikipedia.org/wiki/Irrational_number irrational numbers], which are harder for computers to store and use, both at all and efficiently.)}}
 
 
Yet from a practical view, ''most'' math involving decimal points, fractions, very large numbers, very small numbers, or transcendental numbers like pi or e, is precise ''enough'' to do in floating point, particularly if you know its limitations.
 
So floating point is a pragmatic tradeoff, with limited and fairly-predictable precision, and consistent speed. 
 
That speed comes in part from CPU silicon dedicated to floating point calculations. Without that hardware you could simulate it, but it would be factors slower; on simple microcontrollers this is still the case, which is why there are considerations of doing part of your calculations in integer, or in [[fixed point]] ways.
 
 
 
Within computers, floating point is mainly contrasted with
* integer math
* integers used to do [[#Fixed_point_numbers|fixed point numbers]]
:: which could be seen as a makeshift precursor to floating point (and are still useful on platforms without floating point calculations)
<!--
approximations of stored within integers (used with some regularity before floating point operations became standard in processors) and which are somewhat simpler to understand, and sometimes an exercise given in computer science -- but are bothersome to deal with in practice and (mostly) slower than an FPU, which is why they fell out of style except in CPUs without FPUs, e.g. some embedded/microcontroller cases.)}}
-->
* arbitrary-precision arithmetic,
** BigInt
 
 
 
Intuitively, floating point can be understood as an idea ''similar'' to scientific notation.
 
Consider 20400.
 
In scientific notation you might write that as 2.04 * 10<sup>4</sup>.
 
Floating point would actually store that same number as something more like 1.2451172 * 2<sup>14</sup>. {{comment|(Exactly why and how isn't hugely important, e.g. its use of base-2 is mostly details about slightly more efficient use of the same amount of silicon.)}}
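
If you want to see that decomposition from a program: Python's standard {{inlinecode|math.frexp}} splits a float into a fraction and a power of two (a small sketch; frexp normalizes the fraction to 0.5..1, which is the same thing shifted one bit compared to the 1.xxx form just mentioned):

 import math
 m, e = math.frexp(20400.0)     # frexp normalizes to 0.5 <= m < 1
 print(m, e)                    # 0.62255859375 15
 print(math.ldexp(m, e))        # 20400.0  (ldexp is the inverse)
 print(m * 2, e - 1)            # 1.2451171875 14  -- the 1.xxx * 2**14 form mentioned above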
 
 
 
 
==Bit division and layout==
 
You don't need to know much of this, but to give some idea how it's stored: IEEE 754 floats divide their allotted bits into '''sign''', '''mantissa''', and '''exponent'''.
 
The standard float types in most programming languages will be IEEE 754 32-bit and/or IEEE 754 64-bit
: largely because most CPUs can handle those natively and therefore quickly
: also because we're used to them - e.g. Arduinos have to simulate floats with a ''bunch'' of work for each floating point operation, but it's hacked in anyway because it's ''very convenient''
 
 
An IEEE 754 32-bit float uses 1 bit for the sign, 23 for the mantissa and 8 for the exponent. The bit layout:
seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm
 
An IEEE 754 64-bit float (a.k.a. 'double-precision', a ''double'') uses 1 bit for the sign, 52 for the mantissa and 11 for the exponent. The bit layout:
seeeeeeeeeeemmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
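
To see those fields for a particular number, you can pull the bits apart yourself, e.g. with Python's {{inlinecode|struct}} module (a small sketch for the 32-bit case; the helper name is just for illustration):

 import struct
 def float32_bits(x):
     # pack as IEEE 754 single precision (big-endian), then show sign | exponent | mantissa
     (i,) = struct.unpack('>I', struct.pack('>f', x))
     b = f'{i:032b}'
     return f'{b[0]} {b[1:9]} {b[9:]}'
 print(float32_bits(1.0))    # 0 01111111 00000000000000000000000
 print(float32_bits(-2.0))   # 1 10000000 00000000000000000000000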
 
 
 
There are other-sized variations, like 80-bit extended precision, 128-bit quadruple, 256-bit octuple, 16-bit half, and also 36-bit, 48-bit, and 60-bit, and I've heard mention of 120-bit and one or two others.
 
 
You can always emulate lower-precision floats, e.g. use 80-bit to emulate 64-bit, by throwing away the excess precision only when the result is fetched out.
 
This is apparently what x86 FPUs have done since the 8087 - they are ''internally'' 80-bit, out of necessity, because ''some'' operations by nature lose a little precision relative to the precision their intermediates are stored in, so you need some extra bits for those few operations not to degrade the result.
 
 
Most programming languages have 32-bit and/or 64-bit IEEE floats as a built-in type. A few might allow ''some'' use or control over extended precision[https://en.wikipedia.org/wiki/Extended_precision#Introduction_to_use], but this is a topic with footnotes of its own.
 
 
 
===Representation details===
{{stub}}
 
{{info|You may want to read the following thinking '''only''' about how this affects inaccuracy, because most of these internals are not very useful to know.
<br/>And for in-depth details, there are better resources out there already. This is intended as a summary.}}
 
 
 
The value a float represents is calculated something like:
sign * (1+mantissa) * 2<sup>exponent</sup>
 
 
'''Sign''' is ''coded'' as 1 meaning negative, 0 meaning positive, though in the formula above, think of it as -1 and 1.
 
'''Exponent''' is an integer which (in 32-bit floats) represents values within the -126..127 range
* so e.g.
: coded value 1 represents -126
: coded value 127 represents 0
: coded value 254 represents 127
* values 0 and 255 (which would represent -127 and 128) are used for '''special cases''':
** Zero (which can't be represented directly because of the 1+ in the formula) is represented by exponent=00000000 and mantissa=0
** Infinity is represented by exponent=11111111 and mantissa=0.
** Note that the last two don't mention sign -- both infinity and zero can be positive and negative
** NaN ('Not a Number') is represented by exponent=11111111 and ''non''-zero mantissa
*** there is a further distinction into quiet NaNs for indeterminate operations, and signaling NaNs for invalid operations, but you often wouldn't care
 
'''Mantissa''' bits represent 2<sup>-1</sup> (=0.5), 2<sup>-2</sup> (=0.25), and so on. For a 32-bit float there are 23 mantissa bits, so '''''before'' the exponent gets involved''' they represent:
0.5
0.25
0.125
0.0625
0.03125
0.015625
0.0078125
0.00390625
0.001953125
0.0009765625
0.00048828125
0.000244140625
0.0001220703125
0.00006103515625
0.000030517578125
0.0000152587890625
0.00000762939453125
0.000003814697265625
0.0000019073486328125
0.00000095367431640625
0.000000476837158203125
0.0000002384185791015625
0.00000011920928955078125
For 64-bit floats there are another 29 of these.
 
So yes, the mantissa part is representing fractions - [https://docs.python.org/2/tutorial/floatingpoint.html some explanations] introduce the whole thing this way - but it's only directly understandable that way when you ignore the exponent.
 
 
'''Example from IEEE bits to represented number''':
* 1 01111111 10000000000000000000000
* Sign is 1, representing -1
* exponent is 127 (decimal int), representing 0
* mantissa is 10000000000000000000000, representing 0.5
 
Filling in the values into ''(sign * (1+mantissa) * 2<sup>exponent</sup>)'' gives ''(-1*(1+0.5)*2<sup>0</sup>)'', which is -1.5
 
 
 
'''Example from number to IEEE bits'''
 
Encoding a number into float form can be understood as:
# start with base-2 scientific notation (with exponent 0)
# increment the exponent while halving the part that goes into the mantissa (which keeps the represented number the same, to within error)
# repeat step 2 until the value-that-goes-into-the-mantissa is between 1 and 2
: (The mantissa bits then store that value minus 1 - a fraction between 0 and 1 - because the coding has an implicit leading 1, as in the formula above.)
 
 
For example: 7.0
* is 7.0*2<sup>0</sup>
* is 3.5*2<sup>1</sup>
* is 1.75*2<sup>2</sup> {{comment|(mantissa stores ''(that-1)'', 0.75, so we're now done with the shifting)}}
 
We have sign=positive, exponent=2, and mantissa=0.75. In bit coding:
* sign bit: 0
* exponent: we want to represent 2, so store 2+127 = 129, i.e. 10000001
* mantissa: we want to represent 0.75, which is 2<sup>-1</sup>+2<sup>-2</sup>, (0.5+0.25)  so 11000000000000000000000
 
So 7.0 is <tt>0 10000001 11000000000000000000000</tt>
 
 
The 20400 mentioned above works out as 1.2451171875*2<sup>14</sup>.
* Sign: 0
* Exponent: 10001101 (141, representing 14)
* Mantissa: 00111110110000000000000
: Using the table above, that mantissa represents 0.125+0.0625+0.03125+0.015625+0.0078125+0.001953125+0.0009765625 (which equals 0.2451171875)
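
These worked examples are easy to double-check from a program, e.g. in Python (a sketch; it packs each value as a 32-bit float and prints the bit fields):

 import struct
 for x in (-1.5, 7.0, 20400.0):
     (i,) = struct.unpack('>I', struct.pack('>f', x))
     b = f'{i:032b}'
     print(x, b[0], b[1:9], b[9:])
 # prints:
 # -1.5 1 01111111 10000000000000000000000
 # 7.0 0 10000001 11000000000000000000000
 # 20400.0 0 10001101 00111110110000000000000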
 
 
If you want to play with this, see e.g. https://www.h-schmidt.net/FloatConverter/IEEE754.html
 
 
 
See also:
* http://en.wikipedia.org/wiki/IEEE_754-2008
* http://steve.hollasch.net/cgindex/coding/ieeefloat.html
* http://academicearth.org/lectures/c-data-types-interpretations
* https://docs.python.org/3/tutorial/floatingpoint.html
 
==So how is that precision distributed?==
<!--
https://jvns.ca/blog/2023/06/23/new-zine--how-integers-and-floats-work/
-->
 
==On the limited precision==
 
===Never assume floating point numbers or calculations are fully precise===
 
As [https://docs.python.org/3/tutorial/floatingpoint.html] mentions,
one reason for inaccuracy is akin to the reason we can't write 1/3 in decimal either - writing 0.333333333 is imprecise no matter how many threes you add.
 
Sure, other numbers are stored precisely, but when we're talking about limitations, we want to know the worst case.
<!--
At the same time, for a lot of uses a handful of digits is quite good enough, particularly if you design knowing these limitations.
Floating point is a similar tradeoff: if you know when and why the imprecision is 'good enough',
we can give you calculations that are faster, and ''predictably'' faster, than doing it fully correctly.
 
The (in)accuracy of 32-bit floats is good enough for a lot of practical use,
and where that's cutting it close it's often easy to throw 64-bit at the problem to push down inaccuracies at the cost of some speed.
-->
 
 
 
Introductions often mention that (counting from the largest digit)
: 32-bit float is precise up to the first six digits or so
: 64-bit precise up to the first sixteen digits or so
There are some practical footnotes to that, but it's broadly right.
 
 
 
 
That "stop anywhere and it's inexact" example isn't the ''best'' intuition,
though, because some numbers will be less imprecise than others, and a few will be exact.
 
Consider that:
* 0.5 is stored precisely in a float
 
* 0.1 and 0.2 are '''not''' stored precisely
 
* while 20400.75 is stored precisely, 20400.8 is not
:: note that having few digits in decimal representation is (mostly) unrelated to having few digits in float representation
 
* with 32-bit floats, all integers within -16777215..16777215 are exact
: with 64-bit floats, all integers in -9007199254740991..9007199254740991 are exact
: this is why some programming languages opt to not have an integer type at all, for example JavaScript and Lua
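
You can check this sort of thing from Python, whose regular floats are 64-bit; {{inlinecode|decimal.Decimal}} shows the exact value a float really stores, and a {{inlinecode|struct}} round-trip imitates a 32-bit float (a sketch):

 import struct
 from decimal import Decimal
 print(Decimal(0.5))        # 0.5 -- exact
 print(Decimal(0.1))        # 0.1000000000000000055511151231257827021181583404541015625
 print(Decimal(20400.75))   # 20400.75 -- exact
 print(Decimal(20400.8))    # close to, but not exactly, 20400.8
 print(float(2**53 - 1) == 2**53 - 1)    # True:  still exactly representable in 64-bit
 print(float(2**53 + 1) == 2**53 + 1)    # False: rounded to the nearest representable float
 f32 = lambda x: struct.unpack('>f', struct.pack('>f', x))[0]   # round-trip via 32-bit
 print(f32(16777215.0))     # 16777215.0 -- exact
 print(f32(16777217.0))     # 16777216.0 -- already not exact in 32-bit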
 
 
 
Even if you say 'fair enough' about that, '''operations make things worse''',
intuitively because each intermediate result becomes the nearest representable number.
 
 
Okay, but you may not have considered that this '''breaks some properties you might take for granted''', such as associativity.
 
Consider:
0.1+(0.2+0.3) != (0.1+0.2)+0.3
and, sometimes more pressingly,
0.3-(0.2+0.1) != 0.0
 
{{comment|(practically, tests for zero should often be something like {{inlinecode|abs(v)<0.0001}} instead, but it's not always clear ''how'' close you should require it to be)}}
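
In Python, for example (a sketch):

 import math
 print(0.1 + (0.2 + 0.3) == (0.1 + 0.2) + 0.3)   # False
 print(0.3 - (0.2 + 0.1))                        # -5.551115123125783e-17, not 0.0
 x = 0.3 - (0.2 + 0.1)
 print(abs(x) < 0.0001)                          # True -- the 'close enough to zero' test
 print(math.isclose(0.3, 0.2 + 0.1))             # True -- relative tolerance, often a better fit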
 
 
'''Why do things go wrong?'''
 
In part it's because what we write down precisely in base 10 isn't necessarily precise in float representation.
 
In part because operations can mess things up further.
 
 
There frankly isn't a single best mental model of how or when these inaccuracies happen.
 
You're best off assuming ''everything'' may be incorrect to within rounding error,
and that every operation might contribute to that.
 
You should also assume these errors may accumulate over time, meaning there is no perfect choice for that 'almost zero' value.
 
 
 
'''Operations make things worse, part 2'''.
 
Combining numbers of (vastly) different scales creates another issue.
 
For an intuition why, consider scientific notation again, e.g. adding 1E2 + 1E5
 
If you stick to this notation, you would probably do that by considering that 1E2 is equal to 0.001E5, so now you can do 0.001E5 + 1E5 = 1.001E5
 
That same "scaling up the smaller number's exponent to fit the larger number"
in float means increasing the exponent, while dividing the mantissa.
 
The larger this magnitude change, the more that digits fall out of the mantissa
(and that's ignoring the rounding errors).
 
If there is a large enough scale difference, the smaller number falls away completely.
Say, {{inlinecode|1 + 1E-5}} is 1.00001, but {{inlinecode|1 + 1E-23}} is 1.0 even in 64-bit floats.
 
It's not that it can't store numbers that small, it's that the precision limit means you can't ''combine'' numbers of a certain magnitude difference.
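
In Python (64-bit floats), a sketch:

 print(1.0 + 1e-5)            # 1.00001
 print(1.0 + 1e-23)           # 1.0 -- the smaller number is absorbed entirely
 print(1e-23)                 # 1e-23 -- storing the small value itself is not the problem
 print(1e18 + 1.0 == 1e18)    # True -- the same thing at a larger scale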
 
 
 
'''Operations make things worse, part 3'''.
 
Scale stuff introduces more interesting cases.
 
Remember a bit above where
0.1+(0.2+0.3) != (0.1+0.2)+0.3
Well,
0.1+(0.2+100) == (0.1+100)+0.2
 
You may not expect that, and it takes some staring to figure out why.
 
 
 
'''Operations make things worse, part 4'''.
 
Errors accumulate.
 
For example, if you keep updating a number with operations, you should expect rounding errors in each calculation, and therefore expect them to accumulate. Say, if you do
 view = rotate(current_view, 10 degrees)
and do it 36 times, you would expect to be precisely back where you started - but by now you probably would not expect that of floats.
Do that another couple thousand times, and who knows where you are exactly?
 
In games that ''may'' not matter. Usually. Except when it does.
 
There are ways to avoid such issues, in general and in games,
and it's not even hard, but it is specific knowledge.
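
A sketch of that rotation example in Python; the exact drift varies, but the point is that it will generally not land back on exactly (1, 0):

 import math
 c, s = math.cos(math.radians(10)), math.sin(math.radians(10))
 x, y = 1.0, 0.0
 for _ in range(36):                       # 36 * 10 degrees = one full circle
     x, y = x * c - y * s, x * s + y * c   # rotate the vector by 10 degrees
 print(x, y)                               # close to (1.0, 0.0), but typically not exactly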
 
 
 
'''Operations make things worse, part 5'''.
 
Some mathematical operations do not map easily to floating point,
and when specs don't strictly define the required accuracy,
there are tradeoffs that implementations may be making, for more speed and less precision.
 
One of the better known examples is exp/expf (the [https://en.wikipedia.org/wiki/Exponential_function exponential function]).
This is also one of a few reasons '''different libraries and different hardware should not be expected to all give bit-identical results'''.
 
 
 
Note that x87 (x86 FPU) operations do most calculations in 80-bit internally (and have for a much longer time than you'd think).
Intuitively, this is because the internal steps within what looks like a single FPU operation imply losing a few digits (depending on the operation), so you ''need'' to work with more bits to have your result stay actually accurate to within your 64-bit numbers. {{comment|(It's overkill for 32-bit, but doesn't hurt)}}.
 
However:
* It does ''not'' mean subsequent operations get near-80-bit precision, because the intermediate 'fetch this out to memory' is typically 64-bit.
: Even if your language exposes 80-bit floats in its type system (e.g. C's long double), and even if it can load/save them into the FPU registers, there are various footnotes.
 
* SIMD things like SSE do ''not'' do this 80-bit stuff.[https://stackoverflow.com/questions/3206101/extended-80-bit-double-floating-point-in-x87-not-sse2-we-dont-miss-it].
 
* GPUs also do not
: Actually, GPUs sometimes play looser with floating point specs in other ways, though less so now than they did in the early days
 
* because the x87 is a coprocessor (with its own register ''stack''), it's clunkier and harder to optimize and CPU makers have apparently been trying to get rid of it in favour of the SIMD variants. There are cases where this is better, there are cases where it is ''not'' better.
 
 
It's usually well quantified how large the error may be on any specific hardware, e.g. via [https://en.wikipedia.org/wiki/Unit_in_the_last_place ulp].
 
 
Though note that compiler optimizations can still keep results from being bit-identical.
 
 
 
<!--
This also means that some speed optimizations alter the precision of the overall calculation, so again, not all calculation is bit-identical (between compilations of the same code, in this case).
 
 
Note that because certain combinations of operations are common (like multiply and add),
there may be opcodes that do the two in one, which avoids some rounding inbetween, for lower overall rounding error.
 
For this example, see [https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation#Fused_multiply–add Fused-Multiply-and-Add].
 
It gives slightly more precise results, although the floating point spec considers the separated one correct as well.
 
-->
 
 
 
====Denormal numbers====
 
'''Denormal''' numbers, a.k.a. '''subnormal''' numbers, are ones that are so small (near zero) that the exponent has already bottomed out, so we have to have leading zeroes in the mantissa.
 
This implies you have fewer bits of precision than you would normally have.
 
 
In ''many'' cases these are so close to zero you can just treat them as zero, for practical reasons, and occasionally for performance reasons as well.
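
For 64-bit floats in Python, you can see where normal numbers end and subnormals begin (a sketch):

 import sys
 print(sys.float_info.min)        # 2.2250738585072014e-308, the smallest *normal* double
 print(sys.float_info.min / 2)    # 1.1125369292536007e-308, already subnormal
 print(5e-324)                    # 5e-324, the smallest positive subnormal double
 print(5e-324 / 2)                # 0.0 -- below that it underflows to zero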
 
<!--  It may also be a reason to apply the log trick, or do other case-specific tweaks.  -->
 
See also:
* https://en.wikipedia.org/wiki/Denormal_number
 
====Significance loss====
<!--
You can make some qualifying statements about storage, e.g. that float32 stores roughly six digits accurately.
 
 
Things get more interesting when doing operations.
Most issues that arise come from large changes in exponent.
 
 
For example, consider addition. 
 
Let's do this in the scientific-notation analogy.
 
Say we add ten to ten million, i.e. 1E1 + 1E7.
 
The larger number settles the exponent of the result, which means the smaller number gets rescaled to that exponent,
so actually the operation becomes .000001E7 + 1E7, is 1.000001e+07
 
In this analogy, the accuracy limit is us writing down only so many digits.
If it (arbitrarily in the analogy) is seven, then the result is 1.000001e+07,  if it's five it's 1E7.
 
In other words, adding sufficiently small numbers doesn't do anything.
 
 
And far before that lies rounding.
If we have seven digits, 1.1234567E1 + 1E7 is still 1.000001e+07
The rest of that is lost.
 
 
This also shows up when adding ''many'' things.
For example, counting in floating point by adding 1 will eventually stop increasing for this reason, for example (1E18+1)==1E18,
 
In the accuracy of the value is good for output, in that in practice you rarely if ever care whether it's a quintillion or a quintillion and one,
but it's a potential but that it stops counting somewhere, and it's not always as easy to characterize ''where''.
 
You can often alleviate this by keeping numbers in the same scale somehow. That many items suggests a distributed system anyway,
and it's often easy to sum for a largeish chunk at a time, then sum those sums - a map-reduce type thing.
The loss of the -and-one is postponed as long as possible, which is more accurate.
 
 
 
Consider subtraction of two very similar numbers, say 1.1234567 and 1.1234566.
The result is 0.0000001, or rather 1e-07
 
Often enough those digits were removed when the number was converted to a float,
but in some cases the detailed digits are important, and doing the math in a different
order removes fewer digits.
 
...though the easier workaround is to use higher-precision floats, e.g. a double instead of a single.
If enough digits are ''really'' important, use a arbitrary-precision library (note that these are simulated, so they will be considerably slower).
 
 
 
Multiplication and division will often lose some precision, as they by nature deal with numbers of different magnitude. Pragmatically this is often less serious.
 
It is sometimes a good reason to
: multiply before dividing
 
 
-->
 
 
 
<!--
'''Can you tell whether floating point is approximate or accurate, and if so, how?'''
 
Not easily.
 
 
As [https://docs.python.org/3/tutorial/floatingpoint.html] mentions, the reason for inaccuracy is akin to the reason we can't write 1/3 in decimal. While 0.333333 is pretty close, stop at any finite number of threes and it's inaccurate.
 
For floats you can describe it as "anything you cannot express as a base 2 fraction, after the exponent logic rescaled the mantissa to the 1..2 scale it needs to be in", but this is as accurate as it is useless to us humans.
 
 
Some things can trade between mantissa and exponent without loss,
which is much of the reason e.g. integers up a point are accurate.
 
And some fractions do decently up to that order for a similar reason,
but this intuition is iffy, and falls apart.
 
You may like tools like https://www.h-schmidt.net/FloatConverter/IEEE754.html for further inspection.
 
-->
 
<!--
catastrophic cancellation
-->
 
====Storing integers====
 
'''A good range of integers ''can'' be stored exactly in floating point'''.
 
 
Basically it's those for which the binary representation needs at most the amount of bits the mantissa actually has (plus one implicit bit).
 
For example, 32-bit floats have 23 bits for the mantissa, so can store integers up to 2<sup>23+1</sup>-1, so practically integers within -16777215 .. 16777215
 
64-bit floats have 52 mantissa bits, so can store integers within -9007199254740991..9007199254740991
 
 
 
Notes:
* technically you can '''store''' up to 2<sup>mantissabits+1</sup> rather than 2<sup>mantissabits+1</sup>-1, but languages tend to define safe values as those for which n and n+1 are exactly representable, or perhaps more to the point, as those that are not ''also'' approximations for other numbers. So it ends one lower.
 
 
* After the abovementioned limit, you effectively get every ''second'' integer, after that every ''fourth'', etc.
: (and below it, you could also store halves; below half of it, quarters; and so on)
: If you've ever played with [[fixed-point integers]] (using integers to imitate fractional numbers) this may be somewhat intuitive to you
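
A sketch of that spacing in Python (64-bit floats, so the limit is 2<sup>53</sup>; {{inlinecode|math.nextafter}} needs Python 3.9 or later):

 import math
 limit = 2 ** 53                        # 9007199254740992
 print(float(limit + 1) == limit + 1)   # False: rounds back to the limit itself
 print(float(limit + 2) == limit + 2)   # True:  every second integer is still exact here
 print(math.nextafter(float(limit), math.inf) - limit)    # 2.0 -- gap between adjacent floats
 print(math.nextafter(2.0 ** 54, math.inf) - 2.0 ** 54)   # 4.0 -- every fourth integer
 print(math.nextafter(1.0, math.inf) - 1.0)               # 2.220446049250313e-16 near 1.0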
 
 
<!--
There are still accuracy implications when you do operations,
of course, and they are less trivial to understand than integer types,
but it is sometimes very
 
 
This is why some languages can get awy with having one numeric type rather than a lot of different ones.
 
 
-->
 
===ulps and wobble===
<!--
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
 
https://stackoverflow.com/questions/19649178/relative-error-and-wobble
-->
 
==Floating point compression==
{{stub}}
 
The way floating point numbers are stored in bits means their binary form doesn't compress very well using generic lossless compression.
 
 
'''Lossless floating point compression''' has seen some research, but the short answer is that only some relatively specific cases get more than a little compression.
{{comment|(Say, time series may be very predictable. For example, [https://docs.influxdata.com/influxdb/v1.8/concepts/storage_engine/#compression influxdb has some interesting notes on what it does to its data])}}
 
 
'''Lossy floating point compression''' is easier, and amounts to decimating/rounding to some degree.

Intuitively, if you don't care about the last few digits in each number (/bits in the mantissa), you can chop off a little.
It's like storing a float64 in a float32, except you can do this in finer-grained steps.
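
A toy sketch of that 'chop off mantissa bits' idea for 64-bit floats; real compressors are more careful (proper rounding, packing the remaining bits), this only shows the decimation step:

 import math, struct
 def drop_mantissa_bits(x, n):
     # zero the lowest n of the 52 mantissa bits (plain truncation, not proper rounding)
     (i,) = struct.unpack('>Q', struct.pack('>d', x))
     i &= ~((1 << n) - 1)
     return struct.unpack('>d', struct.pack('>Q', i))[0]
 print(math.pi)                          # 3.141592653589793
 print(drop_mantissa_bits(math.pi, 29))  # 3.1415925025939941 -- only 23 mantissa bits left,
                                         #   i.e. roughly float32 precision
 # the truncated values are full of trailing zero bits, which generic compressors handle much better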
 
==On GPUs==
{{stub}}
 
Older GPUs were more specialized devices, and deviated from standard IEEE floats.
 
Old as in roughly pre-CUDA, and CUDA 1.x still had some issues (e.g. dealing with [[denormal numbers]]).
See e.g. [https://web.archive.org/web/20130620140300/http://www.gpgpu.org/static/sc2006/workshop/Apple_GPUintrinsics.pdf].
 
 
Modern GPUs (since Compute Capability 2) provide something much closer to IEEE-754 floating point.
 
Still with a few footnotes that mean they are not equivalent to FPUs.
 
Also, while FPUs in the x86 line do calculations in 80-bit (basically since always), on GPU 32-bit is actually 32-bit, and 64-bit is actually 64-bit (closer to e.g. SSE), meaning they can accumulate errors faster.
In some cases you can easily work around that, in some cases it's a limiting factor.
 
 
You ''may'' be able to make GPUs use float16, though don't expect ease of use, and while it may be faster, it probably ''won't'' be twice the speed, so it may not be worth it.
 
Speed of 16, 32, and 64-bit FP on GPUs isn't a direct halving/doubling, because of various implementation details, some of which also vary between architectures.
Note also that most cards currently focus on 32-bit - because they're assigning most silicon to things most useful in gaming, for which singles are enough.
 
 
 
When lower precision is okay, even integer / fixed-point may help speed on GPU {{verify}}
 
 
float64 is less of a performance hit on most FPUs{{verify}} than on GPUs.
 
 
If it matters, you may want to check every GPU (and every GPU/driver ''change'')
http://graphics.stanford.edu/projects/gpubench/test_precision.html
 
http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
 
 
 
 
===On repeatability===
{{stub}}
 
'''Differences between CPUs and GPUs are expected'''
 
See above for the reasons.
 
The operations will typically still be within IEEE 754 specs.
...which for many operations are not quite as strict as you may think.
 
 
 
'''Differences between different GPUs and drivers will happen'''
 
The precision for an operation in a given range is usually well characterized,
and typically within IEEE spec, but may differ between hardware and with different drivers.
 
Also, the GPU and driver is effectively a JIT optimizer, so it might rearrange operations with minor side effects{{verify}}
 
See e.g.
: http://graphics.stanford.edu/projects/gpubench/test_precision.html
: https://www-pequan.lip6.fr/~jezequel/ARTICLES/article_CANA2015.pdf
 
 
 
'''Some code is not deterministic for speed reasons'''
 
e.g. CUDA atomics: the order in which concurrent atomic updates are performed is not defined,
so this can have rounding/associativity-related side effects.
 
You can avoid this with different code.
But in some applications you may want the speed increase (see e.g. [http://jkschin.com/2017/06/30/non-determinism.html tensorflow]).
 
 
 
'''Differences between runs on the same GPU+driver should not happen'''
 
...yet ''sometimes'' do.
 
It seems that GPUs pushed too hard make mistakes (presumably in memory more than in calculation?).
You wouldn't notice this in gaming, or necessarily in NN stuff,
but you would care in scientific calculation.
 
Sometimes this can be fixed with drivers that use the hardware
a little more carefully.
 
Sometimes it's a risk-speed tradeoff,
one you can tweak in settings,
and one that may change with hardware age.
 
==On FPU-less CPUs==
<!--
{{stub}}
 
FPUs were once separate, and x86 has had an FPU built in since roughly the 486 (~1990),
but in particular minimal microcontrollers still may not, because simpler uses may never need it and it saves silicon space and thereby cost.
 
 
Note that the compiler may be emulating floating point for you, so that you can use the language's existing types on a platform that doesn't natively support them.  This is much like using a library yourself, but less awkward in that you get to use the language's regular syntax.
 
For example, AVRs do not have floating point operations {{comment|(or integers larger than 8 bit, for that matter)}}, yet if you use avr-libc (you probably do) you get floating point operations, apparently not just for basic operations but also various higher-level math functions.
{{comment|(While fairly optimized assembly it's never going to be fast, but it's probably as decent as it'll get on the platform)}}
 
 
 
It is hard to answer "how much slower is it" because that really depends on 'compared to what'?
 
If you can do your task with integers, then why are you using float, ever?
 
But also you run into platform-specific things, like that AVRs (or PICs{{verify}}) don't have integer division in hardware{{verify}}, so some part of speed is whether you write your code aware of that (e.g. multiply by reciprocal instead, or bitshift where possible{{verify}}).
But then compilers are aware of this so in ''some'' do clever things for you[https://stackoverflow.com/questions/34136339/how-does-division-by-constant-work-in-avr-gcc].
 
 
Any specific simple code (consider e.g. an EWMA lowpass for sensor smoothing), might be done in fixed-point, but you may want to special-case it so you can replace divisions with bitshifts[http://bleaklow.com/2012/06/20/sensor_smoothing_and_optimised_maths_on_the_arduino.html] because that's likely to be faster on an AVR.
 
If you are comparing with fixed point, then it varies with things
: how much precision you still need.
: what operations you mostly do - e.g. fixed point addition is likely a good factor faster, but division with its extra work may be
 
And what features. You can optimize floating point libraries, you can optimize fixed point libraries, and you can get some speed with a compliance/feature versus "I know what I'm doing" version, including your own code.
 
 
 
Division is slow.
: Fixed point lets you do specific-cased divisions with bitshifts
: If the numbers involved don't really allow bitshifts, then it's typically better to multiply by the reciprocal
: if you ''have'' to use division, then floating point may be no slower - and easier.
 
-->
 
==See also (floats)==
* http://en.wikipedia.org/wiki/IEEE_754-1985
* http://en.wikipedia.org/wiki/IEEE_754-2008
* http://en.wikipedia.org/wiki/Floating_point
* http://en.wikipedia.org/wiki/Single_precision (often implies IEEE 754)
* http://en.wikipedia.org/wiki/Double_precision (often implies IEEE 754)
 
 
* http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html
* http://www-1.ibm.com/support/docview.wss?uid=swg21194436
* http://stevehollasch.com/cgindex/coding/ieeefloat.html
* http://scholar.hw.ac.uk/site/computing/topic21.asp
* http://cch.loria.fr/documentation/IEEE754/ACM/goldberg.pdf
 
Interesting note relevant to audio coding:
* http://www.dsprelated.com/showmessage/48307/1.php
 
 
Relevant standards:
 
* IEEE 754  [http://en.wikipedia.org/wiki/IEEE_754] for a long time referred specifically to IEEE 754-1985, The ''"IEEE Standard for Binary Floating-Point Arithmetic"'' and is the most common reference.
 
* recently, IEEE 754-2008 was published, which is mostly just the combination of IEEE 754-1985 and IEEE 854 (radix-independent). Before it was released, it was known as IEEE 754r[http://en.wikipedia.org/wiki/IEEE_754_revision], ('revision')
 
* IEEE 854, specifically IEEE 854-1987, is a radix-independent variation [http://www.savrola.com/resources/IEEE854.html]
 
* IEC 60559:1989, ''"Binary floating-point arithmetic for microprocessor systems"'', is the same as 754-1985
 
 
Unsorted:
* http://stackoverflow.com/questions/6118231/why-do-i-need-17-significant-digits-and-not-16-to-represent-a-double/
 
* http://aggregate.org/MAGIC/
 
 
There are other, more specific floating-point implementations
 
=Integers=
 
==Signed and unsigned integers==
 
A '''signed integer''' can store negative values; an '''unsigned integer''' is one that can store only non-negative numbers. 
 
Presumably it's just referring to whether a minus sign gets involved (I've never checked the origins of the terms).
 
 
Unsigned integers are often chosen to be able to count a little further, because ''signed'' integers cut the positive range in half.
 
Say, a signed 16-bit int counts from −32768 to 32767 {{comment|(...assuming two's complement style storage, see below)}}, an unsigned 16-bit int counts from 0 to 65535.<!-- (2<sup>16</sup>) -->
 
 
===How we could store negative numbers===
 
There are a few different ways of representing negative numbers.
 
The most commonly used is two's complement, partly because it is one of the easiest and most efficient to implement in hardware - most operations for unsigned numbers actually do the correct thing on two's-complement signed numbers as well (which is not true for one's complement).
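
A small Python sketch of what two's-complement bit patterns look like, for 16-bit values (the helper is ad hoc, just for illustration):

 import struct
 def bits16(value, signed):
     fmt = '>h' if signed else '>H'                  # 16-bit signed / unsigned
     (i,) = struct.unpack('>H', struct.pack(fmt, value))
     return f'{i:016b}'
 print(bits16(1, signed=True))        # 0000000000000001
 print(bits16(-1, signed=True))       # 1111111111111111
 print(bits16(-32768, signed=True))   # 1000000000000000
 print(bits16(65535, signed=False))   # 1111111111111111 -- the same bits as signed -1
 print((0xFFFF + 1) & 0xFFFF)         # 0 -- i.e. -1 + 1 == 0 with the same adder circuit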
 
 
<!--
* '''Two's complement''' (most common)
: So named because
: Few special cases. You can e.g. add and subtract positive and negative numbers  as if they are unsigned ints.
: Works out as the easiest to work with, and easy on a hardware level, so is very common
: https://en.wikipedia.org/wiki/Ones%27_complement
 
* '''One's complement'''
: -val is calculated by flipping all bits (fast)
: however
:: 0 != -0
:: various other operations are slightly more complex - than two's complement, and than they need to be
 
* '''Excess-N'''
: somewhat like two's complement, but biased around an arbitrary value.
: Used when there is a reason for such a bias (e.g. in floating point numbers)
 
* '''Sign-and-magnitude'''
: Use one bit for sign, leave the rest for the magnitude
: Means +0 is not the same value as -0
: Various operations (e.g. addition of two of these values) isn't a simple bitwise operation
 
 
Notes:
* in two's complement and in many implementations of sign-and-magnitude you can tell whether it's negative by the most significant bit, but you usually shouldn't alter it unless you know what you're doing.
 
* The meaning of, and distinction between, the names "one's complement" and "two's complement" are not actually so obvious (until you've dived into the details).
 
-->
 
==Arbitrary-precision integers==
{{stub}}
 
Arbitrary-precision integers, also frequently named something shorter like 'bignum' or 'bigint', are integers that can represent arbitrarily large values -- until they no longer fit in available storage, anyway.
 
 
 
A toy implementation of storing these, and some basic operations on them, is surprisingly reasonable to make.
 
Say, one approach that is not very efficient yet is easy to understand (and program) is to keep a list of int32s, each representing three decimal digits at a time. For example, 1034123 would be stored as the list {{inlinecode|[1, 34, 123]}}.
 
 
Many operations can be done with some simple piece-by-piece logic using the underlying integer operations, followed by a carry pass: if an element is >999, move the excess to the next-higher element (like carrying when doing arithmetic on paper), working right-to-left in case that pushes the next one over 999 as well. (See the sketch after the examples below.)
For example:
* 1034123+1980
:: = [1,34,123] + [1,980]
:: = [1,35,1103]
:: = [1,36,103] (= 1036103)
* 1980*5002
:: = [1,980] * [5,2]
:: = [2,1960] + [5,4900,0]
:: = [3,960] + [9,900,0]
:: = [9,903,960] (= 9903960)
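
A minimal Python sketch of that idea, here only the base-1000 addition with its carry pass (multiplication is left out for brevity; the function name is just for illustration):

 def big_add(a, b):
     # a and b are base-1000 digit lists, most significant chunk first, e.g. [1, 34, 123] == 1034123
     n = max(len(a), len(b))
     a = [0] * (n - len(a)) + a
     b = [0] * (n - len(b)) + b
     out = [x + y for x, y in zip(a, b)]
     for i in range(n - 1, 0, -1):        # carry pass, right to left
         if out[i] > 999:
             out[i - 1] += out[i] // 1000
             out[i] %= 1000
     if out[0] > 999:                     # grow the list if the top chunk overflows
         out = [out[0] // 1000, out[0] % 1000] + out[1:]
     return out
 print(big_add([1, 34, 123], [1, 980]))   # [1, 36, 103], i.e. 1036103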
 
 
Notes:
* Serious bignum implementations do something similar, but are a bunch cleverer in terms of speed, space efficiency, and edge cases of various operations.
 
* The choice above of base 1000 in an int32 is not very space-efficient at all. It was chosen for the toy implementation because
:: add or multiply can't overflow the element (1000*1000 is well within the range) ''and''
:: it's pretty trivial to print (and to see the meaning of the stored data during debugging)
 
* The choice of when to do the carry test doesn't matter so much in this toy implementation (it is an avoidable intermediate step in the multiplication example), but matters in real implementations
 
 
See also:
* http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
 
 
<!--
==See also==
* http://en.wikipedia.org/wiki/Signed_number_representations
 
* http://en.wikipedia.org/wiki/Computer_numbering_formats
 
* http://en.wikipedia.org/wiki/Two%27s_complement
-->
 
=Fixed point numbers=
==What==
{{stub}}
 
Fixed point numbers are, in most implementations, a library of functions that abuses integers to represent fractional numbers.
 
 
'''Introduction by example'''
 
Consider that a regular integer works something like:
* bit 1 represents 1
* bit 2 represents 2
* bit 3 represents 4
* Bit 4 represents 8
* Bit 5 represents 16
* Bit 6 represents 32
* Bit 7 represents 64
* Bit 8 represents 128
* ...and so on.
 
In which case:
* 00000100 represents 4
* 00000111 represents 7 (4 + 2 + 1)
 
 
Now you say that that instead:
* bit 1 represents 1/4
* bit 2 represents 1/2
* bit 3 represents 1
* Bit 4 represents 2
* Bit 5 represents 4
* Bit 6 represents 8
* Bit 7 represents 16
* Bit 8 represents 32
* ...etc
 
In which case:
* 000001&#x2009;00 represents 1.00
* 000001&#x2009;11 represents 1.75 (1 + 1/2 + 1/4)
 
 
Put another way, you're still just counting, but you decided you're counting units of 1/4.
 
 
It's called "fixed point" because {{comment|(as the spacing in the bits there suggests)}},
you just pretend that you shifted the decimal point two bits to the left, and sits there always.
 
Which also means various operations on these numbers make sense as-is.
For adding you don't even ''need'' a library, as long as you're aware of the rough edges.
 
More complex operations make things messier, so you ''do'' need some extra bookkeeping, and often want to detect border cases like overflow {{comment|(stuff which in floating point is handled by the specs / hardware flags)}} - see e.g. [https://en.wikipedia.org/wiki/Fixed-point_arithmetic#Precision_loss_and_overflow saturation].
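
A small Python sketch of the 'counting in units of 1/4' example above, including the extra rescaling that multiplication needs (the helpers are ad hoc, just for illustration):

 FRAC_BITS = 2                  # the low two bits are the fraction, as in the example above
 SCALE = 1 << FRAC_BITS         # so we are counting in units of 1/4
 def to_fixed(x): return round(x * SCALE)
 def to_float(f): return f / SCALE
 a = to_fixed(1.75)             # 7, i.e. 00000111 -- matches the bit example above
 b = to_fixed(2.25)             # 9, i.e. 00001001
 print(to_float(a + b))                   # 4.0  -- addition works as-is
 print(to_float((a * b) >> FRAC_BITS))    # 3.75 -- the product must be rescaled, and plain
                                          #   truncation already loses a bit (true answer 3.9375)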
 
==Why==
 
The fixed point trick was useful when Floating Point Units were not yet a ubiquitous part of CPUs,
because it means you can use regular integers to calculate things with fractional digits,
in a way that's faster than emulating an IEEE float.
 
{{comment|(In a few specific well-tuned cases it could even beat shabby FPUs)}}
 
 
Fixed point is still used, e.g.:
* on platforms without floating point calculation, e.g. many microcontrollers, some embedded CPUs, and such
 
* in libraries that ''may'' have to run on such platforms
 
* to make rounding fully predictable and bit-perfect across systems, in calculation and in data storage.
: e.g. SQL defines fixed-point data, so most SQL databases can store and calculate with what is functionally roughly the same (implementation may be more focused on being fractions, but it amounts to much the same).
 
* for quick and dirty high-precision calculation (since you can extend this into more bits, sort of like a specialized [[bigint]])
** mostly because from scratch, it's easier to implement fixed-point logic with more bits than floating point logic with more bits (though these days there are great high/arbitrary-precision float libraries)
 
==More detail==
<!--
There are two types of fixed-point numbers, binary and decimal, referring to the base used for the digits. Decimal made displaying easier, binary used the bits more efficiently.
 
 
For example, in binary fixed point numbers you could split a 16-bit int into 8 and 8, usually denoted 8.8 or Q8.8.
 
This means you get integer range of 256 (unsigned or signed, depending on how you interpret it), and 256 levels for the fraction. The value 0x0308 would mean 3 + 8/256, i.e. 3.03125
 
Q3.29 would spend 3 bits on the integer (8 levels), 29 on the fraction (536870912 levels).
 
 
Signedness in fixed-point numbers basically has to use an extra bit{{verify}}.
-->
 
 
==See also==
* http://en.wikipedia.org/wiki/Fixed-point_arithmetic
 
 
 
 
<!--
=More complex value types, imitations=
 
 
==Complex numbers==
It isn't very hard to define a complex number type and operations on them.
 
Seen e.g. in Python
 
 
==Rational numbers==
It is also relatively easy to define a [http://en.wikipedia.org/wiki/Rational_numbers rational-number] data type (fractions of two integers) and the operations on them.
 
This lets you, for example, evaluate 1/3 * 2/7, and without any rounding until you ask for conversion to a float, int, or such.
 
 
It often makes sense to use bignums under the cover, as overflows are likelier given when you'll often look for a least common multiple.
 
Seen e.g. in Common LISP.
 
 
Symbolic math evaluation (see e.g. Mathematica) can often to do this and more.
-->
 
=Storing fractions=
<!--
Some ways of doing this can resemble fixed point integers.
 
But if you write this yourself, you more often just store numerator and denominator separately,
and give people the value as a float when they ask nicely.
 
Most operations are fairly simple, but there are some hairy detail to not unnecessarily losing precision when converting.
 
 
-->
 
=BCD=
{{stub}}
 
Binary-Coded Decimal (BCD) is a way of coding integers that is different from your typical two's complement integers.
 
 
BCD codes each decimal digit in four bits. For example, (decimal) 12 would be 0001 0010
 
One side effect is that the hexadecimal representation of the coded value will look like the decimal number: decimal 12 is coded like 0x12.
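
A small Python sketch of packing an integer into BCD, one decimal digit per nibble (the function name is just for illustration):

 def to_bcd(n):
     # pack a non-negative integer into BCD, one decimal digit per 4 bits
     result, shift = 0, 0
     while True:
         result |= (n % 10) << shift
         n //= 10
         shift += 4
         if n == 0:
             return result
 print(hex(to_bcd(12)))      # 0x12 -- the hex form reads like the decimal number
 print(hex(to_bcd(1959)))    # 0x1959
 print(bin(to_bcd(12)))      # 0b10010, i.e. 0001 0010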
 
 
However, you don't use all your bits, some operations are harder on this representation, and in a processor would require a little more silicon.
So why would you do this?
 
 
The main benefit is easy conversion to something you show as decimal, which makes sense when displaying a limited amount of numbers - clocks, calculators, 7-segment drivers, particularly when doing so without a processor.
 
It is also seen where you want simple electronics, and/or want something more akin to fixed-point than floating-point calculations.
 
 
Pocket calculators will (still{{verify}}) tend to work in BCD{{verify}}.
 
=On rounding=
{{stub}}
<!--
Even mathematically, rounding ties (numbers that are exactly halfway) are interesting,
because any choice is arbitrary.
 
(note that in floating point, ties are less common than you may think ''because'' of limited precision)
 
 
It seems the most common(ly taught?) convention may be to '''round away from zero''':
0.5 to whole number goes to 1.0,
and -0.5 goes to -1.0.
 
The thing is that for some purposes, that's a noticeable bias upwards.
There are other methods that are just as arbitrary but do not have that bias -- at least for reasonably distributed values.
 
 
Options:
* Away from zero
 
* ties to even number, a.k.a. Banker's algorithm - for example 0.5 to 0, 1.5 to 2, 2.5 to 2, etc.
 
* ties to odd
 
* random choice (stochastically)
 
* alternate (deterministically)
** small bias due to order, and whether you have an odd number of values or not
 
 
 
IEEE754 floating point operations round to even.
 
While your programming language may do its own thing instead, recent languages have paid more attention to this issue.
 
 
More generally, rounding may be more relevant to decimal types, e.g. those used for money in databases.
 
 
-->
 
=Word, octet, nibble=
==Word==
In computer hardware and programming, a '''word''' is the fundamental storage size in an architecture, reflected by the processor and usually also the main bus.
 
A 16-bit computer has 16-bit words, a 64-bit computer has 64-bit words, etc. On all but ancient computers, they're multiples of eight.
 
 
Words:
* In programming, 'word' often means 'an architecture-sized integer.'
* Some use it in the implicit context of the architecture they are used to, such as indicating 16 bits in the days of 16-bit CPUs. In the context of 32-bit processors, a halfword is 16 bits, a doubleword 64 bits, and a quadword 128 bits. These terms may stick around for a while even in the 64-bit processor era.
* 'Long word' can also be used, but is ambiguous
Basically, this is a confusing practice.
 
 
This refers specifically to integer sizes because [[floating point numbers]] are available in a few standardized sizes {{comment|(usually IEEE 754)}}.
 
In languages like C, <tt>int</tt> carries the meaning of a word; it may differ in size depending on what architecture it is compiled for. Some coders don't use specific-sized integers when they should, which is quite sloppy and can lead to bugs - or be perfectly fine when it just needs to store some not-too-large numbers. It may interfere with signed integers ([http://en.wikipedia.org/wiki/Two's_complement Two's complement] negative numbers), particularly if you count on certain behaviour, such as that -1 would be equal to <tt>0xffffffff</tt>, which is only true for 32-bit signed ints, not for signed ints in general.
 
==Octet==
Fancy word for a byte - eight bits large.
 
Seen in some standard definitions, largely because some older (mostly ancient and/or unusual) computers used sizes more creatively, which also implied that 'byte' sometimes meant sizes other than 8 bits, and because 'byte' carries more of a connotation of being already binary coded - and octet more that of just the concept of grouping eight bits as a unit.
 
(For more details, see [http://foldoc.org/byte foldoc on the byte])
 
 
A few countries, like France, use megaoctet / Mo instead of the megabyte.
{{comment|(Though for french it seems this is also because it sounds potentially rude - though this applies to bit, not byte[https://www.quora.com/Do-French-speakers-rather-say-m%C3%A9gaoctet-or-megabyte])}}
 
==Nibble==
(Also seen spelled nybble, even nyble)
 
 
Half a byte: four bits. 
 
Most commonly used in code comments to describe the fact you are storing two things in a byte, in two logical parts, the high and low nibble.
 
Few architectures or languages have operations at nibble level, probably largely because bitmask operations are simple enough to do, and more flexible.
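
In code that usually looks like a mask and a shift, e.g. in Python:

 value = 0xA7                   # one byte holding two 4-bit fields
 high = (value >> 4) & 0x0F     # 0xA, the high nibble
 low = value & 0x0F             # 0x7, the low nibble
 packed = (high << 4) | low     # back to 0xA7
 print(hex(high), hex(low), hex(packed))   # 0xa 0x7 0xa7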
 
Has seen other definitions in ancient/strange computers.
 
 
 
=Semi-sorted=
 
==atan and atan2==
{{stub}}
 
 
Quick review of the geometric properties of a right angled triangle, via SOHCAHTOA:
tan(angle) = opposite_length / adjacent_length
 
And if you know the sizes of the two straight edges, you can use the arctangent (a.k.a. arctan, atan) to go the other way:
atan( opposite_length / adjacent_length ) = angle
 
 
So if you know the angle, you know the relative magnitudes of these two edges, e.g.
tan(30&deg;) = 0.577
 
So arctan takes these relative magnitudes and tells you the angle that implies
arctan(0.577) = 29.98&deg;    (because of rounding for brevity)
 
 
Computer operations work in radians so, equivalently:
tan(0.523) = 0.577
and
arctan(0.577) = 0.524
 
 
 
When you have atan, why is there also an atan2?
 
Well, atan assumes you care just about the slope, so gives answers between -pi/2 and pi/2 (-90..90 degrees, vertical downwards to vertical upwards, but always pointing to the right).
 
There are two common issues with that.
 
 
One is that when you calculate {{inlinecode|atan( y/x )}}, you are doing that division before handing a value over. And for vertical lines, that's a division by zero.
So as-is, you have to special-case vertical lines in your code.
 
 
The other is that when you work not with graph-style slopes but with vectors, you may care about slope, but also direction, what quadrant it's pointing in.
 
When you do y/x, the result is already ambiguous about the quadrant, so instead of:
atan( y/x )
you do
atan2( y, x )
which gives you an answer in the full circle's range, -pi to pi
 
 
For example, when
: x=2 and y=1  (shallowly pointing to the right and up)
 atan( y/x )    is 0.464 radians
 atan2( y,x )  is 0.464 radians
 
: x=0 and y=2  (vertical, up)
atan( y/x )    is a zero division error
atan2( y,x )  is 1.57 radians (0.5pi)
 
 
: x=0 and y=-2  (vertical, down)
atan( y/x )    is a zero division error
atan2( y,x )  is -1.57 radians (-0.5pi)
 
 
: x=-2 and y=1  (shallowly pointing to the left and up)
 atan( y/x )    is -0.464 radians
atan2( y,x )  is  2.677 radians (0.85pi)
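
The same examples in Python ({{inlinecode|math.atan}} and {{inlinecode|math.atan2}} both work in radians):

 import math
 print(math.atan(1 / 2))     # 0.4636...
 print(math.atan2(1, 2))     # 0.4636... -- same thing, first quadrant
 print(math.atan2(2, 0))     # 1.5707... (pi/2); atan(2 / 0) would raise ZeroDivisionError
 print(math.atan2(-2, 0))    # -1.5707... (-pi/2)
 print(math.atan(1 / -2))    # -0.4636... -- the slope alone can't tell this apart from x=2, y=-1
 print(math.atan2(1, -2))    # 2.6779... -- atan2 knows this points into the second quadrant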
 
 
See also:
* https://en.wikipedia.org/wiki/Inverse_trigonometric_functions
* https://en.wikipedia.org/wiki/Atan2
 
 
 
 
=On architecture bit sizes=
<!--
 
 
 
The amount of bits has little ''direct'' relation to the graphics. Or the sound.
 
 
 
The amount of bits in a system is how large a number the typical basic CPU operation works with.
 
Which tends to be the same as the memory bus width and a few other things, because if they weren't the same size you need translation for every one of your most basic operations.
 
You can generally work with ''smaller'' ones well, but ''larger'' ones will need to be expressed in at least a handful of CPU instructions. {{comment|(For an example today, consider Arduinos, because AVRs are 8-bit RISC, meaning every operation on a 16-bit, 32-bit, or 64-bit data type will more instructions. The compiler does this for you, but anything you can do in pure 8-bit types will be faster.)}}
 
It'll be slower. Not necessarily ''terribly'' slow, but when at all possible, you design any system with all common components matching in this.
 
 
The increase in bits over time is
* partially because for many uses, the ability to count to no more than 256 (8-bit) or 65536 (16-bit) turns out to be pretty limiting
 
* partially because of ease of working with larger amounts of memory
 
 
 
 
Yes, this CPU-related size ''does'' correlate with the time, and thereby the graphics hardware of that time.
 
 
When we say 64-bit graphics, we're probably just pointing at the N64 era.
 
 
The graphics we see are almost always 8-bit (per color), the sound we hear is almost always 16-bit.
 
Often enough, more bits are involved in calculating those may involve more, for various detail-related reasons.
 
 
 
 
 
-->
 
 
 
[[Category:Math]]
[[Category:Programming]]
[[Category:Explanation]]
[[Category:Glossary]]

Latest revision as of 17:04, 22 April 2024

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Floating point numbers

Floating point numbers store approximations of real numbers, and are the most common way of doing so in computers.


When (more) precise real numbers are necessary, we look to computer algebra and symbolic math, such software has its own solutions, and accepts it is sometimes slow. (those also deal more easily and sensibly with rational numbers and transcendental and irrational numbers are harder for computers to store and use, both at all and efficiently)


Yet from a practical view, most math involving decimal points, fractions, very large numbers, very small numbers, or transcendental numbers like pi or e, are precise enough to do in floating point, particularly if you know its limitations.

So floating point is a pragmatic tradeoff, with limited and fairly-predictable precision, and consistent speed.

Speed that comes in part from some CPU silicon dedicated to floating point calculations. If you didn't you could simulate it, but it would be factors slower, and on simple microcontrollers this is still the case, and there are some considerations of doing part of your calculations in integer, or in fixed point ways.


Within computers, floating point are mainly contrasted with

which could be seen as a makeshift precursor to floating point (and are still useful on platforms without floating point calculations)
  • arbitrary-precision arithmetic,
    • BigInt


Intuitively, floating point can be understood as an idea similar to scientific notation.

Consider 20400.

In scientific notation you might write that as 2.04 * 104.

Floating point would actually store that same number as something more like 1.2451172 * 214. (Exactly why and how isn't hugely important, e.g. its use of base-2 is mostly details about slightly more efficient use of the same amount of silicon.)



Bit division and layout

You don't need to know much of this, but to give some idea how it's stored: IEEE 754 floats divide their allotted bits into sign, mantissa, and exponent.

The standard float types in most programming languages will be IEEE 754 32-bit and/or IEEE 754 64-bit

largely because most CPUs can can handle those natively and therefore quickly
also we're used to them - e.g. arduinos have to simulate floats with a bunch of work for each floating point operation, but it's hacked in because it's very convenient


An IEEE 754 32-bit float uses 1 bit for the sign, 23 for the mantissa and 8 for the exponent. The bit layout:

seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm

An IEEE 754 64-bit floats (a.k.a. 'double-precision', doubles) uses 1 bit for sign, 52 for the mantissa and 11 for the exponent. The bit layout:

seeeeeeeeeeemmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm


There are other-sized variations, like 80-bit extended precision, 128-bit quadruple, 256-bit octuple, 16-bit half, and also 36-bit, 48-bit, and 60-bit, and I've heard mention of 120-bit and one or two others.


You can always emulate lower-precision floats, e.g. use 80-bit to emulate 64-bit, by throwing away the excess precision only when the result is fetched out.

This is apparently what x86 FPUs have done since the 8087 - they are internally 80-bit, out of necessity, because some operations are by nature less precise than the precision its intermediates are stored in, so you need some extra bits for those few operations not to work poorly.


Most programming have 32-bit and/or 64-bit IEEE floats as a built-in type. A few might allow some use or control over extended precision[1]), but this is a topic with footnotes of its own.


Representation details

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
🛈 You may want to read the following thinking only about how this affects inaccuracy, because most of these internals are not very useful to know.
And for in-depth details, there are better resources out there already. This is intended as a summary.


The value a float represents is calculated something like:

sign * (1+mantissa) * 2exponent


Sign is coded as 1 meaning negative, 0 meaning positive, though in the formula above, think of it as -1 and 1.

Exponent is an integer which (in 32-bit floats) represents values within -126..127 range

  • so e.g.
coded value 1 represents -126
coded value 127 represents 0
coded value 254 represents 127
  • values 0 and 255 (which would represent -127 and 128) are used for special cases:
    • Zero (which can't be represented directly because of the 1+ in the formula) is represented by exponent=00000000 and mantissa=0
    • Infinity is represented by exponent=11111111 and mantissa=0.
    • Note that the last two don't mention sign -- both infinity and zero can be positive and negative
    • NaN ('Not a Number') is represented by exponent=11111111 and non-zero mantissa
      • there is a further distinction into quiet NaNs for indeterminate operations, and signaling NaNs for invalid operations, but you often wouldn't care

Mantissa bits represent 2-1 (=0.5), 2-2 (=0.25), and so on. For a 32-bit float there are 23 mantissa bits, so before the exponent gets involved they represent:

0.5
0.25
0.125
0.0625
0.03125
0.015625
0.0078125
0.00390625
0.001953125
0.0009765625
0.00048828125
0.000244140625
0.0001220703125
0.00006103515625
0.000030517578125
0.0000152587890625
0.00000762939453125
0.000003814697265625
0.0000019073486328125
0.00000095367431640625
0.000000476837158203125
0.0000002384185791015625
0.00000011920928955078125

For 64-bit floats there are another 29 of these.

So yes, the mantissa part is representing fractions - some explanations introduce the whole thing this way - but only in a directly understandable way when you ignore the exponent).


Example from IEEE bits to represented number:

  • 1 01111111 10000000000000000000000
  • Sign is 1, representing -1
  • exponent is 127 (decimal int), representing 0
  • mantissa is 10000000000000000000000, representing 0.5

Filling in the values into (sign * (1+mantissa) * 2exponent) makes for (-1*(1+0.5)*20), is -1.5


Example from number to IEEE bits

Encoding a number into float form can be understood as:

  1. start with base-2 scientific notation (with exponent 0)
  2. increment the exponent while halving the part that goes into the mantissa (which keep the represented number the same, to within error).
  3. repeat step 2 until value-that-goes-into-the-mantissa is between 0 and 1
(Note that mantissa values between 0 and 1 represents 1+that, so in the examples below we divide until it's between 1 and 2)


For example: 7.0

  • is 7.0*20
  • is 3.5*21
  • is 1.75*22 (mantissa stores (that-1), 0.75, so we're now done with the shifting)

We have sign=positive, exponent=2, and mantissa=0.75. In bit coding:

  • sign bit: 0
  • exponent: we want to represent 2, so store 129, i.e. 10000001
  • mantissa: we want to represent 0.75, which is 2^-1 + 2^-2 (0.5+0.25), so 11000000000000000000000

So 7.0 is 0 10000001 11000000000000000000000


The 20400 mentioned above works out as 1.2451171875*2^14.

  • Sign: 0
  • Exponent: 10001101 (141, representing 14)
  • Mantissa: 00111110110000000000000
Using the table above, that mantissa represents 0.125+0.0625+0.03125+0.015625+0.0078125+0.001953125+0.0009765625 (which equals 0.2451171875)


If you want to play with this, see e.g. https://www.h-schmidt.net/FloatConverter/IEEE754.html
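
The same inspection can also be done programmatically. A small sketch in Python (assuming the standard struct module; float_fields is an illustrative helper name, not a library function):

import struct

def float_fields(x):
    bits = int.from_bytes(struct.pack('>f', x), 'big')    # the 32 raw bits of the float
    sign     = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127                # coded value minus the bias
    mantissa = bits & 0x7FFFFF
    return sign, exponent, format(mantissa, '023b')

print(float_fields(7.0))        # (0, 2, '11000000000000000000000')
print(float_fields(20400.0))    # (0, 14, '00111110110000000000000')
print(float_fields(-1.5))       # (1, 0, '10000000000000000000000')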


See also:

So how is that precision distributed?

On the limited precision

Never assume floating point numbers or calculations are fully precise

As [2] mentions, one reason for inaccuracy is akin to the reason we can't write 1/3 in decimal either - writing 0.333333333 is imprecise no matter how many threes you add.

Sure, other numbers are stored precisely, but when we're talking about limitations, we care about the worst case.


Introductions often mention that (counting from the largest digit)

32-bit float is precise up to the first six digits or so
64-bit precise up to the first sixteen digits or so

There are some practical footnotes to that, but it's broadly right.



That "stop anywhere and it's inexact" example isn't the best intuition, though, because some numbers will be less imprecise than others, and a few will be exact.

Consider that:

  • 0.5 is stored precisely in a float
  • 0.1 and 0.2 are not stored precisely
  • while 20400.75 is stored precisely, 20400.8 is not
note that having few digits in decimal representation is (mostly) unrelated to having few digits in float representation
  • with 32-bit floats, all integers within -16777215..16777215 are exact
with 64-bit floats, all integers in -9007199254740991..9007199254740991 are exact
this is why some programming languages opt not to have separate integer types at all, for example JavaScript and Lua


Even if you say 'fair enough' about that, operations make things worse, intuitively because each intermediate result becomes the nearest representable number.


Okay, but you may not have considered that this breaks some properties you might take for granted, such as associativity.

Consider:

0.1+(0.2+0.3) != (0.1+0.2)+0.3

and, sometimes more pressingly,

0.3-(0.2+0.1) != 0.0

(practically, tests for zero should often be something like abs(v)<0.0001 instead, but it's not always clear how close you should require it to be)
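
A few of the points above, checked directly in Python (whose plain float is a 64-bit IEEE double); math.isclose is one way to do the 'close enough to zero' style of comparison:

import math

print(0.5 == 0.25 + 0.25)                         # True  - all three are exactly representable
print(0.1 + 0.2 == 0.3)                           # False - none of these three are
print(0.1 + (0.2 + 0.3) == (0.1 + 0.2) + 0.3)     # False - addition is not associative here
print(0.3 - (0.2 + 0.1) == 0.0)                   # False - off by roughly 5.5e-17
print(math.isclose(0.3 - (0.2 + 0.1), 0.0, abs_tol=1e-9))   # True - compare with a tolerance instead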


Why do things go wrong?

In part it's because what we write down precisely in base 10 isn't necessarily precise in float representation.

In part because operations can mess things up further.


There frankly isn't a single best mental model of how or when these inaccuracies happen.

You're best off assuming everything may be incorrect to within rounding error, and that every operation might contribute to that.

You should also assume these errors may accumulate over time, meaning there is no perfect choice for that 'almost zero' value.


Operations make things worse, part 2.

Combining numbers of (vastly) different scales creates another issue.

For an intuition why, consider scientific notation again, e.g. adding 1E2 + 1E5

If you stick to this notation, you would probably do this by considering that 1E2 is equal to 0.001E5, so now you can do 0.001E5 + 1E5 = 1.001E5

That same "scaling up the smaller number's exponent to fit the larger number" in float means increasing the exponent, while dividing the mantissa.

The larger this magnitude change, the more that digits fall out of the mantissa (and that's ignoring the rounding errors).

If there is a large enough scale difference, the smaller number falls away completely. Say, 1 + 1E-5 is 1.00001, but 1 + 1E-23 is 1.0 even in 64-bit floats.

It's not that it can't store numbers that small, it's that the precision limit means you can't combine numbers of a certain magnitude difference.
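
In Python (64-bit doubles), for example:

print(1 + 1e-5)           # 1.00001 - the small term still fits in the mantissa
print(1 + 1e-16 == 1.0)   # True    - smaller than half an ulp of 1.0, so it falls away entirely
print(1 + 1e-23 == 1.0)   # True    - even though 1e-23 itself is stored just fine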


Operations make things worse, part 3.

Scale stuff introduces more interesting cases.

Remember a bit above where

0.1+(0.2+0.3) != (0.1+0.2)+0.3

Well,

0.1+(0.2+100) == (0.1+100)+0.2 

You may not expect that, and it takes some staring to figure out why.


Operations make things worse, part 4.

Errors accumulate.

If you keep updating a number with repeated operations, you should expect a rounding error in each calculation, and therefore expect errors to accumulate. For example, if you do

view = rotate(current_view, 10 degrees)

and do it 36 times you would expect to be precisely back where you started - but by now you probably would not expect that of floats. Do that another couple thousand times, and who knows where you are exactly?

In games that may not matter. Usually. Except when it does.

There are ways to avoid such issues, in general and in games, and it's not even hard, but it is specific knowledge.
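
A minimal sketch of that accumulation in Python, rotating a 2D point by 10 degrees 36 times (the exact result will vary by platform and library, but it will generally not land back on exactly (1, 0)):

import math

angle = math.radians(10)
c, s = math.cos(angle), math.sin(angle)

x, y = 1.0, 0.0
for _ in range(36):                     # 36 * 10 degrees = one full turn, "back where we started"
    x, y = x * c - y * s, x * s + y * c

print(x, y)                             # very close to (1.0, 0.0), but typically not exactly
print(x == 1.0 and y == 0.0)            # usually False - the per-step rounding has accumulated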


Operations make things worse, part 5.

Some mathematical operations do not map easily to floating point, and when specs don't strictly define the required accuracy, implementations may trade a little precision for more speed.

One of the better-known examples is exp / expf (exponentiation). This is also one of a few reasons different libraries and different hardware should not be expected to all give bit-identical results.


Note that x87 (x86 FPU) operations do most calculations in 80-bit internally (and have for much longer than you'd think). Intuitively, this is because the internal steps within what looks like a single FPU operation can lose a few digits (depending on the operation), so you need to work with more bits to keep the result accurate to within your 64-bit numbers. (It's overkill for 32-bit, but doesn't hurt.)

However:

  • It does not mean subsequent operations get near-80-bit precision, because the intermediate 'fetch this out to memory' step is typically 64-bit.
Even if your language exposes 80-bit floats in its type system (e.g. C's long double), and even if it can load/save them into the FPU registers, there are various footnotes.
  • SIMD things like SSE do not do this 80-bit stuff.[3].
  • GPUs also do not
Actually, GPUs sometimes play looser with floating point specs in other ways, though less so now than they did early days
  • because the x87 is a coprocessor (with its own register stack), it's clunkier and harder to optimize, and CPU makers have apparently been trying to get rid of it in favour of the SIMD variants. There are cases where this is better, and cases where it is not.


It's usually well quantified how large the error may be on any specific hardware, e.g. via ulp.


Though note that compiler optimizations can still change things from being bit identical.




Denormal numbers

Denormal numbers, a.k.a. subnormal numbers, are ones that are so small (near zero) that the exponent has already bottomed out, so we have to have leading zeroes in the mantissa.

This implies you have fewer bits of precision than you would normally have.


In many cases these are so close to zero you can just treat them as zero, for practical reasons, and occasionally for performance reasons as well.
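
For example, in Python (64-bit doubles), halving the smallest normal number lands you in subnormal territory, and keeping going eventually underflows to zero:

import sys

smallest_normal = sys.float_info.min    # about 2.2250738585072014e-308 for 64-bit doubles
subnormal = smallest_normal / 2         # exponent has bottomed out: fewer bits of precision left
print(subnormal > 0.0)                  # True - still representable
print(smallest_normal / 2**60)          # 0.0  - past the smallest subnormal, it underflows to zero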


See also:

Significance loss

Storing integers

A good range of integers can be stored exactly in floating point.


Basically it's those for which the binary representation needs at most the amount of bits the mantissa actually has (plus one implicit bit).

For example, 32-bit floats have 23 bits for the mantissa, so can store integers up to 2^(23+1)-1, so practically integers within -16777215 .. 16777215

64-bit floats have 52 mantissa bits, so can store integers within -9007199254740991..9007199254740991


Notes:

  • technically you can store up to 2^(mantissabits+1) rather than 2^(mantissabits+1)-1, but languages tend to define safe values as those for which n and n+1 are exactly representable, or perhaps more to the point, as those that are not also approximations for other numbers. So it ends one lower.


  • Beyond the abovementioned limit, you effectively get every second integer, after that every fourth, etc.
(and below it you could also store halves, below that quarters)
If you've ever played with fixed-point numbers (using integers to imitate fractions) this may be somewhat intuitive to you
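
For example, in Python (whose plain float is a 64-bit double, so 2^53 is the edge):

print(float(2**53) == float(2**53 + 1))   # True  - 2**53 + 1 rounds to the same float
print(float(2**53) == float(2**53 + 2))   # False - past the limit, every second integer is still exact
print(9007199254740993.0)                 # prints 9007199254740992.0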


ulps and wobble

Floating point compression

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The way floating point numbers are stored in bits means their binary form doesn't compress very well with generic lossless compression.


Lossless floating point compression has seen some research, but the short answer is that only some relatively specific cases get more than a little compression. (Say, time series may be very predictable. For example, influxdb has some interesting notes on what it does to its data)


Lossy floating point compression is easier, and amounts to decimating/rounding to some degree.

Intuitively, if you don't care about the last few digits in each number(/bits in a mantissa), you can chop off a little. It's like storing a float64 in a float32, except you can do this in finer-grained steps.

On GPUs

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Older GPUs were more specialized devices, and deviated from standard IEEE floats.

Old as in roughly pre-CUDA, and CUDA 1.x still had some issues (e.g. dealing with denormal numbers). See e.g. [4].


Modern GPUs (since Compute Capability 2) provide something much closer to IEEE-754 floating point.

Still with a few footnotes that mean they are not equivalent to FPUs.

Also, while FPUs in the x86 line do calculations in 80-bit (basically since always), on GPUs 32-bit is actually 32-bit and 64-bit is actually 64-bit (closer to e.g. SSE), meaning they can accumulate errors faster. In some cases you can easily work around that, in some cases it's a limiting factor.


You may be able to make GPUs use float16, though don't expect ease of use, and while it may be faster, it probably won't be twice the speed, so it may not be worth it.

The speed of 16, 32, and 64-bit FP on GPUs isn't a direct halving/doubling, because of various implementation details, some of which also vary between architectures. Note also that most cards currently focus on 32-bit - they assign most silicon to the things most useful in gaming, for which singles are enough.


When lower precision is okay, even integer / fixed-point may help speed on GPU (verify)


float64 is less of a performance hit on most FPUs(verify) than on GPUs.


If it matters, you may want to check every GPU (and every GPU/driver change) http://graphics.stanford.edu/projects/gpubench/test_precision.html

http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf



On repeatability

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Differences between CPUs and GPUs are expected

See above for the reasons.

The operations will typically still be within IEEE 754 specs. ...which for many operations are not quite as strict as you may think.


Differences between different GPUs and drivers will happen

The precision for an operation in a range is usually well characterized, and typically within IEEE spec, but may differ between hardware and with different drivers.

Also, the GPU and driver is effectively a JIT optimizer, so it might rearrange operations with minor side effects(verify)

See e.g.

http://graphics.stanford.edu/projects/gpubench/test_precision.html
https://www-pequan.lip6.fr/~jezequel/ARTICLES/article_CANA2015.pdf


Some code is not deterministic for speed reasons

e.g. CUDA atomics: the order in which concurrent atomic updates are performed is not defined, so this can have rounding/associativity-related side effects.

You can avoid this with different code. But in some applications you may want the speed increase (see e.g. tensorflow).


Differences between runs on the same GPU+driver should not happen

...yet sometimes do.

It seems that GPUs pushed too hard make mistakes (presumably in memory more than in calculation?). You wouldn't notice this in gaming, or necessarily in NN work, but you would care in scientific calculation.

Sometimes this can be fixed with drivers that use the hardware a little more carefully.

Sometimes it's a risk-speed tradeoff, one you can tweak in settings, and one that may change with hardware age.

On FPU-less CPUs

See also (floats)


Interesting note relevant to audio coding:


Relevant standards:

  • IEEE 754 [5] for a long time referred specifically to IEEE 754-1985, the "IEEE Standard for Binary Floating-Point Arithmetic", and is the most common reference.
  • IEEE 754-2008 is mostly the combination of IEEE 754-1985 and IEEE 854 (radix-independent). Before it was released, it was known as IEEE 754r[6] ('revision')
  • IEEE 854, specifically IEEE 854-1987, is a radix-independent variation [7]
  • IEC 60559:1989, "Binary floating-point arithmetic for microprocessor systems", is the same as 754-1985


Unsorted:


There are other, more specific floating-point implementations

Integers

Signed and unsigned integers

A signed integer can store negative values; an unsigned integer can store only non-negative numbers.

Presumably it's just referring to whether a minus sign gets involved (I've never checked the origins of the terms).


Unsigned integers are often chosen to be able to count a little further, because signed integers cut the positive range in half.

Say, a signed 16-bit int counts from −32768 to 32767 (...assuming two's complement style storage, see below), an unsigned 16-bit int counts from 0 to 65535.


How we could store negative numbers

There are a few different ways of representing negative numbers.

The most commonly used is two's complement, partly because it is among the easiest and most efficient to implement in hardware - most operations for unsigned numbers actually do the correct thing on two's-complement signed numbers as well (not true for one's complement).
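
A minimal sketch of two's-complement coding in Python (to_twos_complement and from_twos_complement are illustrative helper names, not library functions):

def to_twos_complement(value, bits=16):
    """Return the unsigned bit pattern that encodes value in two's complement."""
    return value & ((1 << bits) - 1)

def from_twos_complement(pattern, bits=16):
    """Interpret an unsigned bit pattern as a two's-complement signed value."""
    if pattern & (1 << (bits - 1)):           # top bit set means negative
        return pattern - (1 << bits)
    return pattern

print(hex(to_twos_complement(-1)))            # 0xffff
print(from_twos_complement(0x8000))           # -32768
# Plain unsigned addition of the patterns gives the right signed answer (mod 2**bits):
print(from_twos_complement((to_twos_complement(-5) + to_twos_complement(3)) & 0xFFFF))   # -2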


Arbitrary-precision integers

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Arbitrary-precision integers, also frequently named something shorter like 'bignum' or 'bigint', are integers that can represent arbitrarily large values -- until they no longer fit in available storage, anyway.


A toy implementation of storing these, and some basic operations on them, is surprisingly reasonable to make.

Say, one approach that is not very efficient, yet is easy to understand (and program), is to keep a list of int32s, each representing three decimal digits at a time. For example, 1034123 would be stored as the list [1, 34, 123].


Many operations can be done with simple piece-by-piece logic using the underlying integer operations, followed by a carry pass of 'if an element is >999, move the excess into the next-highest element' (like carrying when doing arithmetic on paper), going right-to-left in case that pushes the next one over 999 as well. For example:

  • 1034123+1980
= [1,34,123] + [1,980]
= [1,35,1103]
= [1,36,103] (= 1036103)
  • 1980*5002
= [1,980] * [5,2]
= [2,1960] + [5,4900,0]
= [3,960] + [9,900,0]
= [9,903,960] (= 9903960)
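
In code, the addition part of that toy approach might look something like the following Python sketch (normalize and big_add are illustrative names; chunks are stored most significant first, as in the examples above):

def normalize(chunks):
    """Carry right-to-left so every element ends up in 0..999."""
    out = list(chunks)
    for i in range(len(out) - 1, 0, -1):
        carry, out[i] = divmod(out[i], 1000)
        out[i - 1] += carry
    while out[0] > 999:                       # grow the list if the top chunk overflows
        carry, out[0] = divmod(out[0], 1000)
        out.insert(0, carry)
    return out

def big_add(a, b):
    a = [0] * (len(b) - len(a)) + a           # pad the shorter number with leading zeros
    b = [0] * (len(a) - len(b)) + b
    return normalize([x + y for x, y in zip(a, b)])

print(big_add([1, 34, 123], [1, 980]))        # [1, 36, 103], i.e. 1036103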


Notes:

  • Serious bignum implementations do something similar, but are a bunch cleverer in terms of speed, space efficiency, and the edge cases of various operations.
  • The choice above for 1000 in an int32 is not very space-efficient at all. It was chosen for the toy implementation because
add or multiply can't overflow the value (1000*1000 is well within the range) and
it's pretty trivial to print (and to see the meaning of the stored data during debugging)
  • The choice of when to do the carry test doesn't matter so much in this toy implementation (it's an avoidable intermediate step in the multiplication example), but matters in real implementations


See also:


Fixed point numbers

What

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Fixed point numbers are, in most implementations, a library of functions that (ab)uses integers to represent fractional numbers.


Introduction by example

Consider that a regular integer works something like:

  • bit 1 represents 1
  • bit 2 represents 2
  • bit 3 represents 4
  • bit 4 represents 8
  • bit 5 represents 16
  • bit 6 represents 32
  • bit 7 represents 64
  • bit 8 represents 128
  • ...and so on.

In which case:

  • 00000100 represents 4
  • 00000111 represents 7 (4 + 2 + 1)


Now you say that that instead:

  • bit 1 represents 1/4
  • bit 2 represents 1/2
  • bit 3 represents 1
  • bit 4 represents 2
  • bit 5 represents 4
  • bit 6 represents 8
  • bit 7 represents 16
  • bit 8 represents 32
  • ...etc

In which case:

  • 000001 00 represents 1.00
  • 000001 11 represents 1.75 (1 + 1/2 + 1/4)


Put another way, you're still just counting, but you decided you're counting units of 1/4.


It's called "fixed point" because (as the spacing in the bits there suggests), you just pretend that you shifted the decimal point two bits to the left, and sits there always.

Which also means various operations on these numbers make sense as-is. For adding you don't even need a library, as long as you're aware of the rough edges.

More complex operations make things messier, so you do need some extra bookkeeping, and often want to detect border cases like overflow (things which in floating point are handled by the specs / hardware flags) - see e.g. saturation.
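
A minimal sketch in Python of the 'counting units of 1/4' example above (the helper names are made up for illustration); note how addition works directly on the underlying integers, while multiplication needs a rescale and quietly truncates:

FRAC_BITS = 2
SCALE = 1 << FRAC_BITS                       # counting in units of 1/4

def to_fixed(x):
    return round(x * SCALE)                  # real number -> underlying integer

def to_real(f):
    return f / SCALE                         # underlying integer -> real number

a = to_fixed(1.75)                           # 7 units of 1/4, i.e. 000001 11
b = to_fixed(2.5)                            # 10 units of 1/4

print(to_real(a + b))                        # 4.25 - addition works on the raw integers as-is
print(to_real((a * b) >> FRAC_BITS))         # 4.25 - true answer 4.375, truncated to a unit of 1/4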

Why

The fixed point trick was useful when Floating Point Units were not yet a ubiquitous part of CPUs, because it means you can use regular integers to calculate with fractional digits, in a way that's faster than emulating an IEEE float.

(In a few specific well-tuned cases it could even beat shabby FPUs)


Fixed point is still used, e.g.:

  • on platforms without floating point calculation, e.g. many microcontrollers, some embedded CPUs, and such
  • in libraries that may have to run on such platforms
  • to make rounding fully predictable and bit-perfect across systems, in calculation and in data storage.
e.g. SQL defines fixed-point data, so most SQL databases can store and calculate with what is functionally roughly the same (implementation may be more focused on being fractions, but it amounts to much the same).
  • for quick and dirty high-precision calculation (since you can extend this into more bits, sort of like a specialized bigint)
    • mostly because from scratch, it's easier to implement fixed-point logic with more bits than floating point logic with more bits (though these days there are great high/arbitrary-precision float libraries)

More detail

See also



Storing fractions

BCD

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Binary Coded Decimal (BCD) is a way of coding integers that is different from your typical two's complement integers.


BCD codes each decimal digit in four bits. For example, (decimal) 12 would be 0001 0010

One side effect is that the hexadecimal representation of the coded value looks like the decimal number: decimal 12 is coded as 0x12.
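
A quick sketch of packed BCD encoding in Python (to_bcd is an illustrative helper, not a library function):

def to_bcd(n):
    """Pack each decimal digit of n into its own nibble."""
    result, shift = 0, 0
    while True:
        n, digit = divmod(n, 10)
        result |= digit << shift
        shift += 4
        if n == 0:
            return result

print(hex(to_bcd(12)))     # 0x12   - the hex form reads like the decimal number
print(hex(to_bcd(1995)))   # 0x1995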


However, you don't use all your bits, some operations are harder on this representation, and in a processor would require a little more silicon. So why would you do this?


The main benefit is easy conversion to something you show as decimal, which makes sense when displaying a limited amount of numbers - clocks, calculators, 7-segment drivers, particularly when doing so without a processor.

It is also seen where you want simple electronics, and/or want something more akin to fixed-point than floating-point calculations.


Pocket calculators will (still(verify)) tend to work in BCD(verify).

On rounding

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Word, octet, nibble

Word

In computer hardware and programming, a word is the fundamental storage size in an architecture, reflected by the processor and usually also the main bus.

A 16-bit computer has 16-bit words, a 64-bit computer has 64-bit words, etc. On all but ancient computers, they're multiples of eight.


Words:

  • In programming, 'word' often means 'an architecture-sized integer.'
  • Some use it in the implicit context of the architecture they are used to, such as indicating 16 bits in the days of 16-bit CPUs. In the context of 32-bit processors, a halfword is 16 bits, a doubleword 64 bits, and a quadword 128 bits. These terms may stick around for a while even in the 64-bit processor era.
  • 'Long word' can also be used, but is ambiguous

Basically, this is a confusing practice.


This refers specifically to integer sizes because floating point numbers are available in a few standardized sizes (usually IEEE 754).

In languages like C, int carries the meaning of a word; its size may differ depending on the architecture it is compiled for. Some coders don't use specific-sized integers when they should, which is sloppy and can lead to bugs - or be perfectly fine when the variable just needs to store some not-too-large numbers. It can also interfere with assumptions about signed integers (two's complement negative numbers), such as counting on -1 being equal to 0xffffffff, which is only true for 32-bit signed ints, not for signed ints in general.

Octet

Fancy word for a byte - eight bits large.

Seen in some standards documents, largely because some older (mostly ancient and/or unusual) computers used sizes more creatively, which meant that 'byte' sometimes referred to sizes other than 8 bits. Also, 'byte' carries more of a connotation of already being binary coded, while octet refers more to just the concept of grouping eight bits as a unit.

(For more details, see foldoc on the byte)


A few countries, like France, use megaoctet / Mo instead of megabyte. (For French it seems this is also because it sounds potentially rude - though that applies to bit, not byte[8])

Nibble

(Also seen spelled nybble, even nyble)


Half a byte: four bits.

Most commonly used in code comments to describe the fact you are storing two things in a byte, in two logical parts, the high and low nibble.

Few architectures or languages have operations at nibble level, probably largely because bitmask operations are simple enough to do, and more flexible.

Has seen other definitions in ancient/strange computers.


Semi-sorted

atan and atan2

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Quick review of the geometric properties of a right angled triangle, via SOHCAHTOA:

tan(angle) = opposite_length / adjacent_length

And if you know the sizes of the two straight edges, you can use the arctangent (a.k.a. arctan, atan) to go the other way:

atan( opposite_length / adjacent_length ) = angle


So if you know the angle, you know the relative magnitudes of these two edges, e.g.

tan(30°) = 0.577

So arctan takes these relative magnitudes and tells you the angle that implies

arctan(0.577) = 29.98°    (not exactly 30° because 0.577 was rounded for brevity)


Computer operations work in radians so, equivalently:

tan(0.523) = 0.577

and

arctan(0.577) = 0.524


When you have atan, why is there also an atan2?

Well, atan assumes you care just about the slope, so gives answers between -pi/2 and pi/2 (-90..90 degrees, vertical downwards to vertical upwards, but always pointing to the right).

There's two common issues with that.


One is that when you calculate atan( y/x ), you are doing that division before handing a value over. And for vertical lines, that's a division by zero. So as-is, you have to special-case vertical lines in your code.


The other is that when you work not with graph-style slopes but with vectors, you may care about slope, but also direction, what quadrant it's pointing in.

When you compute y/x you have already thrown away the quadrant information, so instead of:

atan( y/x )

you do

atan2( y, x ) 

which gives you an answer in the full circle's range, -pi to pi


For example, when

x=2 and y=1 (shallowly pointing to the right and up)
atan( y/x )    is 0.46 radians
atan2( y,x )   is 0.46 radians
x=0 and y=2 (vertical, up)
atan( y/x )    is a zero division error
atan2( y,x )   is 1.57 radians (0.5pi)


x=0 and y=-2 (vertical, down)
atan( y/x )    is a zero division error
atan2( y,x )   is -1.57 radians (-0.5pi)


x=-2 and y=1 (shallowly pointing to the left and up)
atan( y/x )    is -0.46 radians (indistinguishable from pointing right and down)
atan2( y,x )   is  2.68 radians (0.85pi)
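
The same examples in Python (math.atan and math.atan2 are the standard library functions):

import math

print(math.atan(1 / 2), math.atan2(1, 2))      # 0.4636...  and 0.4636...  (same, quadrant I)
print(math.atan2(2, 0))                        # 1.5707... (straight up; atan(2/0) would be a ZeroDivisionError)
print(math.atan(1 / -2), math.atan2(1, -2))    # -0.4636... and 2.6779...  (atan can't tell which quadrant)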


See also:

On architecture bit sizes