Notes on numbers in computers

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Floating point numbers

Floating point numbers store approximations of real numbers, and are the most common way of doing so in computers.


When (more) precise real numbers are necessary, we look to computer algebra and symbolic math. Such software has its own solutions, and accepts that it is sometimes slow. (It also deals more easily and sensibly with rational numbers. Transcendental and irrational numbers are harder for computers to store and use, both at all and efficiently.)


Yet from a practical view, most math involving decimal points, fractions, very large numbers, very small numbers, or transcendental numbers like pi or e, is precise enough when done in floating point, particularly if you know its limitations.

So floating point is a pragmatic tradeoff, with limited and fairly-predictable precision, and consistent speed.

That speed comes in part from some CPU silicon dedicated to floating point calculations. Without it you could simulate floating point in software, but it would be many times slower. On simple microcontrollers this is still the case, and there it can make sense to do part of your calculations in integer, or in fixed point ways.


Within computers, floating point numbers are mainly contrasted with

  • fixed-point numbers, which could be seen as a makeshift precursor to floating point (and are still useful on platforms without floating point calculations)
  • arbitrary-precision arithmetic,
    • BigInt


Intuitively, floating point can be understood as an idea similar to scientific notation.

Consider 20400.

In scientific notation you might write that as 2.04 * 10^4.

Floating point would actually store that same number as something more like 1.2451172 * 2^14. (Exactly why and how isn't hugely important, e.g. its use of base-2 is mostly details about slightly more efficient use of the same amount of silicon.)
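
To check that decomposition yourself, here is a minimal sketch in Python, assuming its float is an IEEE 754 64-bit double (true on common platforms). math.frexp returns a fraction in the range 0.5..1, so the sketch rescales it to the 1.x form used here. The helper name base2_scientific is made up for this illustration.

```python
import math

def base2_scientific(x):
    """Return (significand, exponent) with x == significand * 2**exponent
    and 1 <= |significand| < 2 (for finite, non-zero x)."""
    fraction, exponent = math.frexp(x)   # x == fraction * 2**exponent, 0.5 <= |fraction| < 1
    return fraction * 2, exponent - 1    # rescale to the 1.x convention

significand, exponent = base2_scientific(20400.0)
print(significand, exponent)
```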



Bit division and layout

You don't need to know much of this, but to give some idea how it's stored: IEEE 754 floats divide their allotted bits into sign, mantissa, and exponent.

The standard float types in most programming languages will be IEEE 754 32-bit and/or IEEE 754 64-bit

largely because most CPUs can handle those natively and therefore quickly
also we're used to them - e.g. arduinos have to simulate floats with a bunch of work for each floating point operation, but it's hacked in because it's very convenient


An IEEE 754 32-bit float uses 1 bit for the sign, 23 for the mantissa and 8 for the exponent. The bit layout:

seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm

An IEEE 754 64-bit float (a.k.a. 'double-precision', double) uses 1 bit for sign, 52 for the mantissa and 11 for the exponent. The bit layout:

seeeeeeeeeeemmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
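
If you want to look at these fields for real values, the following is a small sketch using Python's struct module to reinterpret the four bytes of a 32-bit float as an integer. The function name float32_fields is an assumption of this example, not a standard API.

```python
import struct

def float32_fields(x):
    """Return (sign, exponent, mantissa) bit strings of x stored as IEEE 754 32-bit."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # same 4 bytes, read as an unsigned int
    text = format(bits, "032b")
    return text[0], text[1:9], text[9:]

print(float32_fields(1.0))
print(float32_fields(-1.5))
```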


There are other-sized variations, like 80-bit extended precision, 128-bit quadruple, 256-bit octuple, 16-bit half, and also 36-bit, 48-bit, and 60-bit, and I've heard mention of 120-bit and one or two others.


You can always emulate lower-precision floats, e.g. use 80-bit to emulate 64-bit, by throwing away the excess precision only when the result is fetched out.

This is apparently what x86 FPUs have done since the 8087 - they are internally 80-bit, out of necessity, because some operations are by nature less precise than the precision its intermediates are stored in, so you need some extra bits for those few operations not to work poorly.


Most programming languages have 32-bit and/or 64-bit IEEE floats as a built-in type. A few might allow some use of, or control over, extended precision[1], but this is a topic with footnotes of its own.


Representation details

🛈 You may want to read the following thinking only about how this affects inaccuracy, because most of these internals are not very useful to know.
And for in-depth details, there are better resources out there already. This is intended as a summary.


The value a float represents is calculated something like:

sign * (1+mantissa) * 2^exponent


Sign is coded as 1 meaning negative, 0 meaning positive, though in the formula above, think of it as -1 and 1.

Exponent is an integer which (in 32-bit floats) represents values within -126..127 range

  • so e.g.
coded value 1 represents -126
coded value 127 represents 0
coded value 254 represents 127
  • values 0 and 255 (which would represent -127 and 128) are used for special cases:
    • Zero (which can't be represented directly because of the 1+ in the formula) is represented by exponent=00000000 and mantissa=0
    • Infinity is represented by exponent=11111111 and mantissa=0.
    • Note that the last two don't mention sign -- both infinity and zero can be positive and negative
    • NaN ('Not a Number') is represented by exponent=11111111 and non-zero mantissa
      • there is a further distinction into quiet NaNs for indeterminate operations, and signaling NaNs for invalid operations, but you often wouldn't care
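
The special encodings listed above can be checked by building the bit patterns by hand. This is a sketch that assumes Python's struct module and a platform with IEEE 754 floats. float32_from_bits is a hypothetical helper written for this example.

```python
import math
import struct

def float32_from_bits(bit_string):
    """Interpret 32 '0'/'1' characters (spaces allowed) as an IEEE 754 32-bit float."""
    bits = int(bit_string.replace(" ", ""), 2)
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

positive_zero = float32_from_bits("0 00000000 00000000000000000000000")
negative_zero = float32_from_bits("1 00000000 00000000000000000000000")
positive_inf = float32_from_bits("0 11111111 00000000000000000000000")
negative_inf = float32_from_bits("1 11111111 00000000000000000000000")
a_nan = float32_from_bits("0 11111111 10000000000000000000000")

print(positive_zero, negative_zero, positive_inf, negative_inf, a_nan)
print(a_nan == a_nan)   # NaN compares unequal to everything, including itself
```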

Mantissa bits represent 2^-1 (=0.5), 2^-2 (=0.25), and so on. For a 32-bit float there are 23 mantissa bits, so before the exponent gets involved they represent:

0.5
0.25
0.125
0.0625
0.03125
0.015625
0.0078125
0.00390625
0.001953125
0.0009765625
0.00048828125
0.000244140625
0.0001220703125
0.00006103515625
0.000030517578125
0.0000152587890625
0.00000762939453125
0.000003814697265625
0.0000019073486328125
0.00000095367431640625
0.000000476837158203125
0.0000002384185791015625
0.00000011920928955078125

For 64-bit floats there are another 29 of these.

So yes, the mantissa part is representing fractions - some explanations introduce the whole thing this way - but only in a directly understandable way when you ignore the exponent.


Example from IEEE bits to represented number:

  • 1 01111111 10000000000000000000000
  • Sign is 1, representing -1
  • exponent is 127 (decimal int), representing 0
  • mantissa is 10000000000000000000000, representing 0.5

Filling in the values into (sign * (1+mantissa) * 2^exponent) makes for (-1 * (1+0.5) * 2^0), which is -1.5
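
The same decoding can be written out as a few lines of Python. This is only a sketch of the formula for normal numbers. It deliberately ignores the special exponent values 0 and 255, and decode_float32 is a name invented here.

```python
def decode_float32(sign_bit, exponent_bits, mantissa_bits):
    """Decode a normal (non-special) 32-bit float from its three bit strings."""
    sign = -1 if sign_bit == "1" else 1
    exponent = int(exponent_bits, 2) - 127          # coded value 127 represents 0
    # mantissa bit number i (counting from 1) is worth 2**-i
    mantissa = sum(2.0 ** -(i + 1)
                   for i, bit in enumerate(mantissa_bits) if bit == "1")
    return sign * (1 + mantissa) * 2.0 ** exponent

print(decode_float32("1", "01111111", "10000000000000000000000"))
```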


Example from number to IEEE bits

Encoding a number into float form can be understood as:

  1. start with base-2 scientific notation (with exponent 0)
  2. increment the exponent while halving the part that goes into the mantissa (which keeps the represented number the same, to within error).
  3. repeat step 2 until value-that-goes-into-the-mantissa is between 0 and 1
(Note that mantissa values between 0 and 1 represents 1+that, so in the examples below we divide until it's between 1 and 2)


For example: 7.0

  • is 7.0*2^0
  • is 3.5*2^1
  • is 1.75*2^2 (mantissa stores (that-1), 0.75, so we're now done with the shifting)

We have sign=positive, exponent=2, and mantissa=0.75. In bit coding:

  • sign bit: 0
  • exponent: we want to represent 2, so store 129, i.e. 10000001
  • mantissa: we want to represent 0.75, which is 2^-1 + 2^-2 (0.5+0.25), so 11000000000000000000000

So 7.0 is 0 10000001 11000000000000000000000


The 20400 mentioned above works out as 1.2451171875*2^14.

  • Sign: 0
  • Exponent: 10001101 (141, representing 14)
  • Mantissa: 00111110110000000000000
Using the table above, that mantissa represents 0.125+0.0625+0.03125+0.015625+0.0078125+0.001953125+0.0009765625 (which equals 0.2451171875)
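
The halving procedure can be sketched in Python as follows. It is a simplified illustration, not how hardware does it: it only handles positive numbers in the normal range, and it truncates excess mantissa bits where real implementations round. encode_float32 is a made-up name.

```python
def encode_float32(x):
    """Encode a positive, normal-range number as (sign, exponent, mantissa) bit strings.
    Simplification: excess mantissa bits are truncated, not rounded."""
    value, exponent = float(x), 0
    while value >= 2:          # halve while incrementing the exponent
        value /= 2
        exponent += 1
    while value < 1:           # numbers below 1 need the opposite shift
        value *= 2
        exponent -= 1
    fraction = value - 1       # the leading 1 is implicit and not stored
    mantissa = ""
    for _ in range(23):
        fraction *= 2
        bit = int(fraction)    # 1 if this power of two is present
        mantissa += str(bit)
        fraction -= bit
    return "0", format(exponent + 127, "08b"), mantissa

print(encode_float32(7.0))
print(encode_float32(20400))
```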


If you want to play with this, see e.g. https://www.h-schmidt.net/FloatConverter/IEEE754.html


See also:

So how is that precision distributed?

On the limited precision

Never assume floating point numbers or calculations are fully precise

As [2] mentions, one reason for inaccuracy is akin to the reason we can't write 1/3 in decimal either - writing 0.333333333 is imprecise no matter how many threes you add.

Sure, others are precise, but when we're talking about limitations, we want to know the worst case.


Introductions often mention that (counting from the largest digit)

32-bit float is precise up to the first six or seven digits
64-bit float is precise up to the first fifteen or sixteen digits

There's some practical footnotes to that but it's broadly right.



That "stop anywhere and it's inexact" example isn't the best intuition, though, because some numbers will be less imprecise than others, and a few will be exact.

Consider that:

  • 0.5 is stored precisely in a float
  • 0.1 and 0.2 are not stored precisely
  • while 20400.75 is stored precisely, 20400.8 is not
note that having few digits in decimal representation is (mostly) unrelated to having few digits in float representation
  • with 32-bit floats, all integers within -16777215..16777215 are exact
with 64-bit floats, all integers in -9007199254740991..9007199254740991 are exact
this is why some programming languages opt to not have a separate integer type for everyday numbers, for example Javascript (its regular Number type) and Lua (before version 5.3)
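
One way to test whether a specific decimal number is stored exactly is to compare exact rational values. The sketch below assumes Python's fractions module, whose Fraction accepts both a float (giving the exact stored value) and a decimal string (giving the exact written value). is_stored_exactly is a hypothetical helper.

```python
from fractions import Fraction

def is_stored_exactly(decimal_text):
    """True if the number written in decimal_text survives conversion to a 64-bit float unchanged."""
    return Fraction(float(decimal_text)) == Fraction(decimal_text)

for text in ("0.5", "0.1", "0.2", "20400.75", "20400.8"):
    print(text, is_stored_exactly(text))

print(Fraction(0.1))   # the exact value that is actually stored for 0.1
```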


Even if you say 'fair enough' about that, operations make things worse, intuitively because each intermediate result becomes the nearest representable number.


Okay, but you may not have considered that this breaks some properties you might expect to hold, such as associativity.

Consider:

0.1+(0.2+0.3) != (0.1+0.2)+0.3

and, sometimes more pressingly,

0.3-(0.2+0.1) != 0.0

(practically, tests for zero should often be something like abs(v)<0.0001 instead, but it's not always clear how close you should require it to be)
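
These two comparisons are easy to try in Python, whose float is a 64-bit double on common platforms. The tolerance of 1e-9 below is an arbitrary value picked for illustration. As noted, the right value depends on your problem.

```python
import math

left = 0.1 + (0.2 + 0.3)
right = (0.1 + 0.2) + 0.3
print(left, right, left == right)      # only the grouping differs

residue = 0.3 - (0.2 + 0.1)
print(residue, residue == 0.0)

# Compare against a tolerance instead of testing for exact equality.
print(math.isclose(left, right, rel_tol=1e-9))
print(abs(residue) < 1e-9)
```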


Why do things go wrong?

In part it's because what we write down precisely in base 10 isn't necessarily precise in float representation.

In part because operations can mess things up further.


There frankly isn't a single best mental model of how or when these inaccuracies happen.

You're best off assuming everything may be incorrect to within rounding error, and that every operation might contribute to that.

You should also assume these errors may accumulate over time, meaning there is no perfect choice for that 'almost zero' value.


Operations make things worse, part 2.

Combining numbers of (vastly) different scales creates another issue.

For an intuition why, consider scientific notation again, e.g. adding 1E2 + 1E5

If you stick to this notation, you would probably do this by considering that 1E2 is equal to 0.001E5, so now you can do 0.001E5 + 1E5 = 1.001E5

That same "scaling up the smaller number's exponent to fit the larger number" in float means increasing the exponent, while dividing the mantissa.

The larger this magnitude change, the more that digits fall out of the mantissa (and that's ignoring the rounding errors).

If there is a large enough scale difference, the smaller number falls away completely. Say, 1 + 1E-5 is 1.00001, but 1 + 1E-23 is 1.0 even in 64-bit floats.

It's not that it can't store numbers that small, it's that the precision limit means you can't combine numbers of a certain magnitude difference.
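
A short Python check of that absorption effect, again assuming 64-bit doubles. sys.float_info.epsilon is the gap between 1.0 and the next larger double, which gives a feel for how small an addend can be before it disappears next to 1.0.

```python
import sys

kept = 1.0 + 1e-5
absorbed = 1.0 + 1e-23

print(kept)                      # the small term survives
print(absorbed == 1.0)           # the small term is lost entirely
print(1e-23 > 0.0)               # although 1e-23 on its own is stored fine
print(sys.float_info.epsilon)    # spacing of doubles just above 1.0
```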


Operations make things worse, part 3.

Scale stuff introduces more interesting cases.

Remember a bit above where

0.1+(0.2+0.3) != (0.1+0.2)+0.3

Well,

0.1+(0.2+100) == (0.1+100)+0.2 

You may not expect that, and it takes some staring to figure out why.


Operations make things worse, part 4.

Errors accumulate.

If you keep updating a number with operations, you should expect rounding errors to happen in each calculation, and therefore to accumulate. For example, if you do

view = rotate(current_view, 10 degrees)

and do it 36 times you would expect to be precisely back where you started - but by now you probably would not expect that of floats. Do that another couple thousand times, and who knows where you are exactly?

In games that may not matter. Usually. Except when it does.

There are ways to avoid such issues, in general and in games, and it's not even hard, but it is specific knowledge.
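
As a sketch of one such way (an assumption of this example, not the only approach): keep the step count as an integer and derive the view from the original each time, instead of repeatedly rotating the previous result. The rotate function here is a plain 2D rotation written for illustration.

```python
import math

def rotate(point, degrees):
    """Rotate a 2D point around the origin."""
    angle = math.radians(degrees)
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

start = (1.0, 0.0)

# Accumulating: every step rounds, and the next step builds on the rounded result.
accumulated = start
for _ in range(36):
    accumulated = rotate(accumulated, 10)

# Recomputing: the integer step count is exact, so errors cannot pile up.
steps = 36
recomputed = rotate(start, (steps * 10) % 360)

print(accumulated)
print(recomputed)
```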


Operations make things worse, part 5.

Some mathematical operations do not map easily to floating point, and when specs don't strictly define the required accuracy, implementations may trade some precision for more speed.

One of the better known examples is exp, expf (exponentiation). This is also one of a few reasons different libraries and different hardware should not be expected to all give bit-identical results.


Note that x87 (x86 FPU) operations do most calculations in 80-bit internally (and have for a much longer time than you'd think). Intuitively this is because (the internal steps within what looks like) a single FPU operation imply losing a few digits (depending on the operation), so you need to work with more to have your result stay actually accurate to within your 64-bit numbers. (It's overkill for 32-bit, but doesn't hurt.)

However:

  • It does not mean subsequent operations get near-80-bit precision, because the intermediate 'fetch this out to memory' is typically 64-bit.
Even if your language exposes 80-bit floats in its type system (e.g. C's long double), and even if it can load/save them into the FPU registers, there are various footnotes.
  • SIMD things like SSE do not do this 80-bit stuff.[3]
  • GPUs also do not
Actually, GPUs sometimes play looser with floating point specs in other ways, though less so now than they did early days
  • because the x87 is a coprocessor (with its own register stack), it's clunkier and harder to optimize and CPU makers have apparently been trying to get rid of it in favour of the SIMD variants. There are cases where this is better, there are cases where it is not better.


It's usually well quantified how large the error may be on any specific hardware, e.g. via ulp.
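
To get a feel for the size of one ulp (unit in the last place) at different magnitudes, here is a small sketch. It assumes Python 3.9 or later, where math.ulp and math.nextafter are available.

```python
import math

for value in (1.0, 1000.0, 1e16):
    # math.ulp gives the spacing between value and the next representable double
    print(value, math.ulp(value))

next_after_one = math.nextafter(1.0, 2.0)   # the smallest double larger than 1.0
print(next_after_one - 1.0)
```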


Though note that compiler optimizations can still change things from being bit identical.




Denormal numbers

Denormal numbers, a.k.a. subnormal numbers, are ones that are so small (near zero) that the exponent has already bottomed out, so we have to have leading zeroes in the mantissa.

This implies you have fewer bits of precision than you would normally have.


In many cases these are so close to zero you can just treat them as zero, for practical reasons, and occasionally for performance reasons as well.
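
A quick look at where the subnormal range sits for 64-bit doubles, as a sketch in Python. sys.float_info.min is the smallest positive normal double, and 5e-324 is how the smallest positive subnormal is usually written.

```python
import sys

smallest_normal = sys.float_info.min      # smallest positive normal double
smallest_subnormal = 5e-324               # smallest positive double at all

print(smallest_normal, smallest_subnormal)
print(smallest_subnormal / 2)             # nothing smaller exists, so this rounds to 0.0

# Subnormals are spaced evenly, so relative precision gets worse towards zero.
print(3 * smallest_subnormal - 2 * smallest_subnormal == smallest_subnormal)
```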


See also:

Significance loss

Storing integers

A good range of integers can be stored exactly in floating point.


Basically it's those for which the binary representation needs at most the amount of bits the mantissa actually has (plus one implicit bit).

For example, 32-bit floats have 23 bits for the mantissa, so can store integers up to 2^(23+1)-1, so practically integers within -16777215 .. 16777215

64-bit floats have 52 mantissa bits, so can store integers within -9007199254740991..9007199254740991


Notes:

  • technically you can store up to 2^(mantissabits+1) rather than 2^(mantissabits+1)-1, but languages tend to define safe values as those for which n and n+1 are exactly representable, or perhaps more to the point, as those that are not also approximations for other numbers. So it ends one lower.


  • After the abovementioned limit, you effectively get every second integer, after that every fourth, etc.
(and before that you could also store halves, before that quarter)
If you've ever played with fixed-point integers (using integers to imitate floats) this may be somewhat intuitive to you
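
The point where integers stop being exact can be seen directly. This sketch assumes Python, with struct used to round a value to 32-bit float precision. to_float32 is a helper name made up here.

```python
import struct

def to_float32(x):
    """Round x to the nearest IEEE 754 32-bit float (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 32-bit: exact up to 2**24, after that only every second integer exists
print(to_float32(16777215.0), to_float32(16777216.0), to_float32(16777217.0))

# 64-bit: the same thing happens around 2**53
print(float(2**53 - 1), float(2**53), float(2**53 + 1))
```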


ulps and wobble

Floating point compression


The way floating point numbers are stored in bits doesn't lend their binary form to be compressed very well using generic lossless compression.


Lossless floating point compression has seen some research, but the short answer is that only some relatively specific cases get more than a little compression. (Say, time series may be very predictable. For example, influxdb has some interesting notes on what it does to its data)


Lossy floating point compression is easier, and amounts to decimating/rounding to some degree.

Intuitively, if you don't care about the last few digits in each number(/bits in a mantissa), you can chop off a little. It's like storing a float64 in a float32, except you can do this in finer-grained steps.
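
A minimal sketch of that chopping, assuming Python and 64-bit doubles. It simply zeroes the lowest mantissa bits (truncating towards zero), which makes the byte stream more repetitive for a generic compressor. Real lossy float compressors are more careful, and drop_mantissa_bits is a name invented for this example.

```python
import struct

def drop_mantissa_bits(x, bits_to_drop):
    """Zero the lowest bits_to_drop mantissa bits (0..52) of a 64-bit float.
    This truncates towards zero; a real compressor might round instead."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    mask = ~((1 << bits_to_drop) - 1) & 0xFFFFFFFFFFFFFFFF
    (result,) = struct.unpack("<d", struct.pack("<Q", raw & mask))
    return result

pi = 3.141592653589793
for bits in (0, 16, 29, 40):
    print(bits, drop_mantissa_bits(pi, bits))
```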

On GPUs


Older GPUs were more specialized devices, and deviated from standard IEEE floats.

Old as in roughly pre-CUDA, and CUDA 1.x still had some issues (e.g. dealing with denormal numbers). See e.g. [4].


Modern GPUs (since Compute Capability 2) provide something much closer to IEEE-754 floating point.

Still with a few footnotes that mean they are not equivalent to FPUs.

Also, while FPUs in the x86 line do calculations in 80-bit (basically since always), on GPU 32-bit is actually 32-bit, and 64-bit is actually 64-bit (closer to e.g. SSE), meaning they can accumulate errors faster. In some cases you can easily work around that, in some cases it's a limiting factor.


You may be able to make GPUs use float16, though don't expect ease of use, and while it may be faster, it probably won't be twice the speed, so it may not be worth it.

Speed of 16, 32, and 64-bit FP in GPU isn't a direct halving/doubling, because of various implementation details, some of which also vary between architectures. It is also because most cards currently focus on 32-bit, since they assign most silicon to the things most useful in gaming, for which singles are enough.


When lower precision is okay, even integer / fixed-point may help speed on GPU (verify)


float64 is less of a performance hit on most FPUs(verify) than GPUs.


If it matters, you may want to check every GPU (and every GPU/driver change) http://graphics.stanford.edu/projects/gpubench/test_precision.html

http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf



On repeatability


Differences between CPUs and GPUs are expected

See above for the reasons.

The operations will typically still be within IEEE 754 specs. ...which for many operations are not quite as strict as you may think.


Differences between different GPUs and drivers will happen

The precision for an operation in a range is usually well characterized, and typically within IEEE spec, but may differ between hardware and with different drivers.

Also, the GPU and driver are effectively a JIT optimizer, so they might rearrange operations with minor side effects(verify)

See e.g.

http://graphics.stanford.edu/projects/gpubench/test_precision.html
https://www-pequan.lip6.fr/~jezequel/ARTICLES/article_CANA2015.pdf


Some code is not deterministic for speed reasons

e.g. CUDA atomics: the order in which concurrent atomic updates are performed is not defined, so this can have rounding/associativity-related side effects.

You can avoid this with different code. But in some applications you may want the speed increase (see e.g. tensorflow).
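
Why the order of updates matters can be shown without any GPU. The sketch below is plain Python with 64-bit doubles and hand-picked values chosen to exaggerate the effect. It only illustrates order-dependent rounding, not CUDA itself.

```python
values = [1e16, 1.0, -1e16]

# The 1.0 is added to 1e16 first, where it is too small to register.
in_order = (values[0] + values[1]) + values[2]

# The two large terms cancel first, so the 1.0 survives.
reordered = (values[0] + values[2]) + values[1]

print(in_order, reordered)
```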


Differences between runs on the same GPU+driver should not happen

...yet sometimes do.

It seems that GPUs pushed too hard make mistakes (presumably in memory more than in calculation?). You wouldn't notice this in gaming, or necessarily in NN stuff, but you would care in scientific calculation.

Sometimes this can be fixed with drivers that use the hardware a little more carefully.

Sometimes it's a risk-speed tradeoff, one you can tweak in settings, and one that may change with hardware age.

On FPU-less CPUs

See also (floats)


Interesting note relevant to audio coding:


Relevant standards:

  • IEEE 754 [5] for a long time referred specifically to IEEE 754-1985, The "IEEE Standard for Binary Floating-Point Arithmetic" and is the most common reference.
  • recently, IEEE 754-2008 was published, which is mostly just the combination of IEEE 754-1985 and IEEE 854 (radix-independent). Before it was released, it was known as IEEE 754r[6], ('revision')
  • IEEE 854, specifically IEEE 854-1987, is a radix-independent variation [7]
  • IEC 60559:1989, "Binary floating-point arithmetic for microprocessor systems", is the same as 754-1985


Unsorted:


There are other, more specific floating-point implementations

Integers

Signed and unsigned integers

A signed integer can store negative values; an unsigned integer can store only non-negative numbers (zero and up).

Presumably it's just referring to whether a minus sign gets involved (I've never checked the origins of the terms).


Unsigned integers are often chosen to be able to count a little further, because signed integers spend half their range on negative values.

Say, a signed 16-bit int counts from −32768 to 32767 (...assuming two's complement style storage, see below), an unsigned 16-bit int counts from 0 to 65535.


How we could store negative numbers

There are a few different ways of representing negative numbers.

The most commonly used is two's complement, partly because it is one of the easiest and most efficient to implement in hardware - most operations for unsigned numbers actually do the correct thing on two's-complement signed numbers as well. (Not true for one's complement)
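
A sketch of two's complement as a reinterpretation of the same bits, in Python. Python's own integers are unbounded, so the fixed width is simulated with masks here. to_signed and to_unsigned are hypothetical helper names.

```python
def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's-complement signed integer."""
    if value >= 1 << (bits - 1):      # top bit set means negative
        value -= 1 << bits
    return value

def to_unsigned(value, bits):
    """Bit pattern that a two's-complement signed integer is stored as."""
    return value & ((1 << bits) - 1)

print(to_signed(0xFFFF, 16), to_signed(0x8000, 16), to_signed(0x7FFF, 16))

# Unsigned addition on the bit patterns gives the right signed answer: -5 + 3
pattern_sum = (to_unsigned(-5, 16) + to_unsigned(3, 16)) & 0xFFFF
print(to_signed(pattern_sum, 16))
```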


Arbitrary-precision integers


Arbitrary-precision integers, also frequently named something shorter like 'bignum' or 'bigint', are integers that can represent any large value -- until it no longer fits in available storage, anyway.


A toy implementation of storing these, and some basic operations on them, is surprisingly reasonable to make.

Say, one approach that may not be very efficient yet is easy to understand (and program) is to keep a list of int32s, each representing three decimal digits at a time. For example, 1034123 would be stored as the list [1, 34, 123].


Many operations can be done with some simple piece-by-piece logic with the underlying integer operations, followed by a pass of 'if an element is >999, move the excess into the next highest element' (like carry when doing multiplication on paper) (right-to-left, in case that pushes the next one over 999 as well). For example:

  • 1034123+1980
= [1,34,123] + [1,980]
= [1,35,1103]
= [1,36,103] (= 1036103)
  • 1980*5002
= [1,980] * [5,2]
= [2,1960] + [5,4900,0]
= [3,960] + [9,900,0]
= [9,903,960] (= 9903960)
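
The toy scheme above could be sketched in Python like this. All names are made up for the example. Note that Python's integers never overflow, so unlike real int32 elements, the intermediate sums in multiply are not at risk here.

```python
BASE = 1000  # three decimal digits per element

def to_limbs(n):
    """Split a non-negative int into base-1000 elements, most significant first."""
    limbs = []
    while True:
        n, low = divmod(n, BASE)
        limbs.insert(0, low)
        if n == 0:
            return limbs

def carry(limbs):
    """Right-to-left pass that moves anything above 999 into the next element."""
    limbs = list(limbs)
    for i in range(len(limbs) - 1, 0, -1):
        extra, limbs[i] = divmod(limbs[i], BASE)
        limbs[i - 1] += extra
    while limbs[0] >= BASE:                 # grow the list if the top element overflows
        extra, limbs[0] = divmod(limbs[0], BASE)
        limbs.insert(0, extra)
    return limbs

def add(a, b):
    width = max(len(a), len(b))
    a = [0] * (width - len(a)) + a          # left-pad to equal length
    b = [0] * (width - len(b)) + b
    return carry([x + y for x, y in zip(a, b)])

def multiply(a, b):
    result = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            result[i + j + 1] += x * y      # positions add up, like digits on paper
    result = carry(result)
    while len(result) > 1 and result[0] == 0:
        result.pop(0)                       # strip leading zero elements
    return result

print(add(to_limbs(1034123), to_limbs(1980)))
print(multiply(to_limbs(1980), to_limbs(5002)))
```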


Notes:

  • Serious bignum implementations do something similar, but are a bunch cleverer in terms of speed, space efficiency, and edge cases of various operations.
  • The choice above for 1000 in int32 is not very space-efficient at all. It was chosen for the toy implementation because
add or multiply can't overflow the value (1000*1000 is well within the range) and
it's pretty trivial to print (and to see the meaning of the stored data during debugging)
  • The choice of when to do the carry test doesn't matter so much in this toy implementation (is an avoidable intermediate step in the multiplication example), but matters in real implementations


See also:


Fixed point numbers

What


Fixed point numbers are, in most implementations, a library of functions that abuses integers to represent more fractional numbers.


Introduction by example

Consider that a regular integer works something like:

  • bit 1 represents 1
  • bit 2 represents 2
  • bit 3 represents 4
  • bit 4 represents 8
  • bit 5 represents 16
  • bit 6 represents 32
  • bit 7 represents 64
  • bit 8 represents 128
  • ...and so on.

In which case:

  • 00000100 represents 4
  • 00000111 represents 7 (4 + 2 + 1)


Now say that instead:

  • bit 1 represents 1/4
  • bit 2 represents 1/2
  • bit 3 represents 1
  • bit 4 represents 2
  • bit 5 represents 4
  • bit 6 represents 8
  • bit 7 represents 16
  • bit 8 represents 32
  • ...etc

In which case:

  • 000001 00 represents 1.00
  • 000001 11 represents 1.75 (1 + 1/2 + 1/4)


Put another way, you're still just counting, but you decided you're counting units of 1/4.


It's called "fixed point" because (as the spacing in the bits there suggests), you just pretend that you shifted the binary point two bits to the left, and it always sits there.

Which also means various operations on these numbers make sense as-is. For adding you don't even need a library, as long as you're aware of the rough edges.

More complex operations make things messier, so you do need some extra bookkeeping, and often want to detect edge cases like overflow (stuff which in floating point is handled by the specs / hardware flags) - see e.g. saturation.
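
As a sketch of the bookkeeping involved, here is the two-fraction-bit format from the example in Python. The names are invented for illustration, and overflow handling is left out, since Python integers do not overflow. In C on fixed-width integers you would have to think about it.

```python
FRACTION_BITS = 2                 # counting in units of 1/4, as in the example
SCALE = 1 << FRACTION_BITS

def to_fixed(x):
    return round(x * SCALE)

def to_float(fixed):
    return fixed / SCALE

def fixed_add(a, b):
    return a + b                  # plain integer addition already works

def fixed_multiply(a, b):
    # The raw product counts units of 1/16, so shift back to units of 1/4.
    # The shift truncates, which is one of the rough edges.
    return (a * b) >> FRACTION_BITS

print(format(to_fixed(1.75), "08b"))
print(to_float(fixed_add(to_fixed(1.75), to_fixed(1.0))))
print(to_float(fixed_multiply(to_fixed(1.5), to_fixed(2.5))))
```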

Why

The fixed point trick was useful when Floating Point Units were not yet a ubiquitous part of CPUs, because it means you can use regular integers to calculate things with more digits, in a way that's faster than emulating an IEEE float.

(In a few specific well-tuned cases it could even beat shabby FPUs)


Fixed point is still used, e.g.:

  • on platforms without floating point calculation, e.g. many microcontrollers, some embedded CPUs, and such
  • in libraries that may have to run on such platforms
  • to make rounding fully predictable and bit-perfect across systems, in calculation and in data storage.
e.g. SQL defines fixed-point data, so most SQL databases can store and calculate with what is functionally roughly the same (implementation may be more focused on being fractions, but it amounts to much the same).
  • for quick and dirty high-precision calculation (since you can extend this into more bits, sort of like a specialized bigint)
    • mostly because from scratch, it's easier to implement fixed-point logic with more bits than floating point logic with more bits (though these days there are great high/arbitrary-precision float libraries)

More detail

See also



Storing fractions

BCD


Binary-Coded Decimal (BCD) is a way of coding integers that is different from your typical two's complement integers.


BCD codes each decimal digit in four bits. For example, (decimal) 12 would be 0001 0010

One side effect is that the hexadecimal representation of the coded value will look like the number: decimal 12 is coded like 0x12.
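
A minimal sketch of packing and unpacking BCD in Python, for non-negative integers only. to_bcd and from_bcd are names chosen for this example.

```python
def to_bcd(n):
    """Pack a non-negative integer into BCD, one decimal digit per four bits."""
    result = 0
    for shift, digit in enumerate(reversed(str(n))):
        result |= int(digit) << (4 * shift)
    return result

def from_bcd(bcd):
    """Inverse of to_bcd; relies on the hex digits matching the decimal digits."""
    return int(format(bcd, "x"))

print(format(to_bcd(12), "08b"), hex(to_bcd(12)))
```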


However, you don't use all your bits, some operations are harder on this representation, and in a processor would require a little more silicon. So why would you do this?


The main benefit is easy conversion to something you show as decimal, which makes sense when displaying a limited amount of numbers - clocks, calculators, 7-segment drivers, particularly when doing so without a processor.

It is also seen where you want simple electronics, and/or want something more akin to fixed-point than floating-point calculations.


Pocket calculators will (still(verify)) tend to work in BCD(verify).

On rounding


Word, octet, nibble

Word

In computer hardware and programming, a word is the fundamental storage size in an architecture, reflected by the processor and usually also the main bus.

A 16-bit computer has 16-bit words, a 64-bit computer has 64-bit words, etc. On all but ancient computers, they're multiples of eight.


Words:

  • In programming, 'word' often means 'an architecture-sized integer.'
  • Some use it in the implicit context of the architecture they are used to, such as indicating 16 bits in the days of 16-bit CPUs. In the context of 32-bit processors, a halfword is 16 bits, a doubleword 64 bits, and a quadword 128 bits. These terms may stick for a while even in 64-bit processor era.
  • 'Long word' can also be used, but is ambiguous

Basically, this is a confusing practice.


This refers specifically to integer sizes because floating point numbers are available in a few standardized sizes (usually IEEE 754).

In languages like C, int carries the meaning of a word; it may differ in size depending on what architecture it is compiled for. Some coders don't use specific-sized integers when they should, which is quite sloppy and can lead to bugs - or be perfectly fine when it just needs to store some not-too-large numbers. It may interfere with signed integers (Two's complement negative numbers), particularly if you count on certain behaviour, such as that -1 would be equal to 0xffffffff, which is only true for 32-bit signed ints, not for signed ints in general.

Octet

Fancy word for a byte - eight bits large.

Seen in some standard definitions, largely because some older (mostly ancient and/or unusual) computers used sizes more creatively, which also implied that 'byte' sometimes meant sizes other than 8 bits, and because 'byte' carries more of a connotation of being already binary coded - and octet more that of just the concept of grouping eight bits as a unit.

(For more details, see foldoc on the byte)


A few countries, like France, use megaoctet / Mo instead of the megabyte. (Though for French it seems this is also because it sounds potentially rude - though this applies to bit, not byte[8])

Nibble

(Also seen spelled nybble, even nyble)


Half a byte: four bits.

Most commonly used in code comments to describe the fact you are storing two things in a byte, in two logical parts, the high and low nibble.

Few architectures or languages have operations at nibble level, probably largely because bitmask operations are simple enough to do, and more flexible.
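
Those bitmask operations look something like the following sketch in Python. The function names are made up for the example.

```python
def split_nibbles(byte):
    """Return the (high, low) four-bit halves of a byte."""
    return (byte >> 4) & 0x0F, byte & 0x0F

def join_nibbles(high, low):
    """Store two four-bit values in one byte."""
    return ((high & 0x0F) << 4) | (low & 0x0F)

high, low = split_nibbles(0xA7)
print(high, low, hex(join_nibbles(high, low)))
```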

Has seen other definitions in ancient/strange computers.


Semi-sorted

atan and atan2



Quick review of the geometric properties of a right angled triangle, via SOHCAHTOA:

tan(angle) = opposite_length / adjacent_length

And if you know the sizes of the two straight edges, you can use the arctangent (a.k.a. arctan, atan) to go the other way:

atan( opposite_length / adjacent_length ) = angle


So if you know the angle, you know the relative magnitudes of these two edges, e.g.

tan(30°) = 0.577

So arctan takes these relative magnitudes and tells you the angle that implies

arctan(0.577) = 29.98°    (because of rounding for brevity)


Computer operations work in radians so, equivalently:

tan(0.523) = 0.577

and

arctan(0.577) = 0.523


When you have atan, why is there also an atan2?

Well, atan assumes you care just about the slope, so gives answers between -pi/2 and pi/2 (-90..90 degrees, vertical downwards to vertical upwards, but always pointing to the right).

There's two common issues with that.


One is that when you calculate atan( y/x ), you are doing that division before handing a value over. And for vertical lines, that's a division by zero. So as-is, you have to special-case vertical lines in your code.


The other is that when you work not with graph-style slopes but with vectors, you may care about slope, but also direction, what quadrant it's pointing in.

When you did y/x you are already ambiguous about the quadrant, so instead of:

atan( y/x )

you do

atan2( y, x ) 

which gives you an answer in the full circle's range, -pi to pi


For example, when

x=2 and y=1 (shallowly pointing to the right and up)
atan( y/x )    is 0.464 radians 
atan2( y,x )   is 0.464 radians 
x=0 and y=2 (vertical, up)
atan( y/x )    is a zero division error
atan2( y,x )   is 1.57 radians (0.5pi)


x=0 and y=-2 (vertical, down)
atan( y/x )    is a zero division error
atan2( y,x )   is -1.57 radians (-0.5pi)


x=-2 and y=1 (shallowly pointing to the left and up)
atan( y/x )    is -0.464 radians
atan2( y,x )   is  2.678 radians (0.85pi)
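
The four cases can be reproduced with Python's math module, as a sketch. Note that atan2 takes y first. In Python, dividing by zero raises ZeroDivisionError, so the sketch skips atan for vertical lines. Other languages may instead give infinity for the float division.

```python
import math

for x, y in [(2, 1), (0, 2), (0, -2), (-2, 1)]:
    full_circle = math.atan2(y, x)                      # argument order is (y, x)
    slope_only = math.atan(y / x) if x != 0 else None   # y / x fails when x == 0
    print(x, y, slope_only, full_circle)
```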


See also:

On architecture bit sizes