Floating Point Representation

Published Apr 24, 2018

Floating point numbers are rarely a programmer's favorite topic. They tend to bring a lot of tension and uncertainty along with them.

Nothing brings fear to my heart more than a floating point number. — Gerald Jay Sussman

To get a better understanding of floating point numbers, it is essential to understand how they are stored. Before IEEE-754, every computer manufacturer devised its own convention for representing them.
The IEEE format is based on a certain set of rules and principles which makes it pretty easy to understand. The best way to understand them is to compare them with decimals.


Let us see how decimals are represented:
d(m) d(m-1) ... d(1) d(0) . d(-1) d(-2) ... d(-n)
Each digit d(i) is weighted by 10^i. For example:
12.34 = 1*10 + 2*1 + 3*(1/10) + 4*(1/100)
By analogy, consider this binary notation:
b(m) b(m-1) ... b(1) b(0) . b(-1) b(-2) ... b(-n)
Each bit b(i) is weighted by 2^i. For example:
10.11 = 1*2 + 0*1 + 1*(1/2) + 1*(1/4) = 2.75
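
The positional rule above can be sketched in a few lines of Python. The helper name `binary_to_fraction` is made up for this illustration, and it assumes the input is a well-formed string of 0s and 1s with at most one point.

```python
from fractions import Fraction

def binary_to_fraction(text):
    """Evaluate a binary string such as '10.11' as an exact fraction."""
    whole, _, frac = text.partition(".")
    value = Fraction(0)
    # Bits left of the point weigh 2^0, 2^1, ... reading right to left.
    for power, bit in enumerate(reversed(whole)):
        value += int(bit) * Fraction(2) ** power
    # Bits right of the point weigh 2^-1, 2^-2, ... reading left to right.
    for power, bit in enumerate(frac, start=1):
        value += int(bit) * Fraction(1, 2 ** power)
    return value

print(float(binary_to_fraction("10.11")))  # 2.75
```

Using `Fraction` keeps the arithmetic exact, so the result is not itself affected by floating point rounding.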

With the above notation, we can exactly represent only numbers that can be written as (x * 2^y), where x and y are integers. Numbers like (1/5) and (1/3) cannot be represented exactly; adding more bits only gives a closer approximation.
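
A quick way to see this is to ask Python for the exact value it stores. This sketch assumes Python's `float` is an IEEE-754 double, which is the case on common platforms; `Fraction(x)` recovers the stored value of a float exactly.

```python
from fractions import Fraction

# The exact value stored for the literal 0.2.
stored = Fraction(0.2)

print(stored == Fraction(1, 5))           # False: 1/5 is only approximated
print(Fraction(0.75) == Fraction(3, 4))   # True: 3/4 = 3 * 2^-2 is exact

# Whatever is stored always has a power-of-two denominator.
d = stored.denominator
print(d & (d - 1) == 0)                   # True
```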

The above notation would not be efficient for very large numbers. Consider the case where we want to represent (5 * 2¹⁰⁰). It would be represented by 101 followed by 100 zeroes. Instead, we can use the form (x * 2^y) and store only x and y. The IEEE format represents a value V as:
V = (-1)^s * M * 2^E

  1. s (sign) determines if the number is positive or negative.
  2. M (significand) is a fractional binary number that ranges either between 1 and 2 - ε or between 0 and 1 - ε, depending on the case (both cases are covered below).
  3. E (exponent) weights the value by a power of 2.

This is similar to (x * 2^y), but it lets us store numbers in memory more efficiently. Let us see how that happens!

The bit representation of a floating point number is divided into three fields:
s: single sign bit.
k-bit exponent field exp: encodes E.
n-bit fraction field frac: encodes M.

For single precision numbers (floats), the fields are sized as follows: 1 bit for s, k = 8, n = 23. For double precision numbers (doubles): 1 bit for s, k = 11, n = 52.
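
As a rough illustration, the three fields of a single precision number can be pulled apart with Python's `struct` module. The function name `float32_fields` is invented for this sketch; the input is first rounded to single precision by `struct.pack`.

```python
import struct

def float32_fields(x):
    """Return (s, exp, frac) for x rounded to IEEE-754 single precision."""
    # Pack as a 32-bit float, then reinterpret the same 4 bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31               # top bit
    exp = (bits >> 23) & 0xFF    # next k = 8 bits
    frac = bits & 0x7FFFFF       # low n = 23 bits
    return s, exp, frac

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-0.75))  # (1, 126, 4194304)
```

Here exp and frac are printed as raw unsigned integers; how they turn into E and M depends on which of the cases described later applies.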

Floating point numbers cover a wide range of magnitudes, from tiny fractions (1/2¹⁴⁹) to very large values (just under 2¹²⁸). (This is one of the reasons we have floating point numbers, isn’t it?)
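
These limits can be checked by building single precision values directly from their bit patterns. This is a small sketch; `bits_to_float32` is a hypothetical helper, and the two hexadecimal constants are the smallest positive and largest finite single precision patterns.

```python
import struct

def bits_to_float32(bits):
    """Reinterpret a 32-bit unsigned integer as a single precision float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

smallest = bits_to_float32(0x00000001)   # only the lowest frac bit set
largest = bits_to_float32(0x7F7FFFFF)    # largest exp that is not all ones

print(smallest == 2.0 ** -149)                     # True
print(largest == (2 - 2.0 ** -23) * 2.0 ** 127)    # True
```

The largest finite value is (2 - 2^(-23)) * 2^127, so its exponent E is 127 while its magnitude sits just below 2^128.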

To keep things structured, the encoding divides values into three sets, based on the exp field:
{ Normalized values (exp is neither all zeroes nor all ones), Denormalized values (exp is all zeroes: zero and numbers very close to zero), Special values (exp is all ones: Infinity and NaN, e.g. the result of 0/0) }
Each of these three sets is decoded differently. Let’s have a look at each of them. (Please keep in mind s (sign bit), exp (exponent), frac (fraction).)

  1. Normalized Values:
    This is the most common case. For floats, it covers magnitudes from 2^(-126) up to just under 2^128.
    s: can be 0 or 1.
    exp: anything except all ones or all zeroes (this is how the sets are divided).
    frac: a binary fraction f with 0 ≤ f < 1, written 0.f(n-1)...f(1)f(0)
    E = e - Bias (e is the unsigned value of exp)
    Bias = 2^(k-1) - 1 (k is the number of bits in exp: 8 for floats, 11 for doubles), so Bias is 127 for floats and 1023 for doubles. The bias lets E be negative as well as positive, so we cover magnitudes both below and above 1.
    M = 1 + f (this is also known as the implied leading 1 representation).
    Let’s go through an example to understand this better. Let us assume we have a 5-bit number with 1 bit for s, 2 bits for exp (k = 2), and 2 bits for frac (n = 2).
    (Remember exp cannot be all zeroes or all ones in the normalized case.)
    In this format, Bias = 2^(2-1) - 1 = 1. For example, the bit pattern 0 10 01 has e = 2, so E = 2 - 1 = 1; f = 1/4, so M = 1.25; and V = 1.25 * 2^1 = 2.5.

  2. Denormalized Values:
    This is the set of numbers very close to zero, including zero itself. For floats, these are magnitudes below 2^(-126).
    s: can be 0 or 1.
    exp: all zeroes
    frac: a binary fraction f with 0 ≤ f < 1, written 0.f(n-1)...f(1)f(0)
    E = 1 - Bias (a fixed value, since exp is always all zeroes here)
    Bias = 2^(k-1) - 1 (k is the number of bits in exp: 8 for floats, 11 for doubles)
    M = f (there is no implied leading 1)
    Both M and E are calculated differently than in the normalized case.
    For example, in the same 5-bit format as above, the bit pattern 0 00 01 has E = 1 - 1 = 0 and M = f = 1/4, so V = 0.25.
    Note: There are two ways to represent 0 (+0 and -0), because frac can be all zeroes with either sign bit.

  3. Special Values:
    exp: all ones
    If f = 0, the value is infinity: +Infinity if s is 0, -Infinity if s is 1.
    If f != 0, the value is NaN, "Not a Number" (now you know where this comes from).
    For example, in the 5-bit format above, 0 11 00 is +Infinity, 1 11 00 is -Infinity, and 0 11 01 is NaN.
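
The three cases can be sketched as one small decoder. This is an illustrative sketch rather than a reference implementation: the function name `decode` and its parameters are made up here, and the defaults match the 5-bit toy format (k = 2, n = 2) used in the examples.

```python
import math

def decode(bits, k=2, n=2):
    """Decode a (1 + k + n)-bit pattern using the three cases above."""
    s = bits >> (k + n)                  # sign bit
    exp = (bits >> n) & (2 ** k - 1)     # k-bit exponent field
    frac = bits & (2 ** n - 1)           # n-bit fraction field
    bias = 2 ** (k - 1) - 1
    f = frac / 2 ** n                    # fraction value, 0 <= f < 1
    sign = -1.0 if s else 1.0
    if exp == 2 ** k - 1:                # special values: exp is all ones
        return sign * math.inf if frac == 0 else math.nan
    if exp == 0:                         # denormalized: exp is all zeroes
        return sign * f * 2.0 ** (1 - bias)
    return sign * (1 + f) * 2.0 ** (exp - bias)   # normalized

# Print every pattern of the 5-bit toy format.
for bits in range(2 ** 5):
    print(f"{bits:05b} -> {decode(bits)}")
```

For the positive patterns, this prints 0.0, 0.25, 0.5, 0.75 (denormalized), then 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5 (normalized), then inf and three NaNs. Note how the values step evenly from 0.75 to 1.0 across the boundary: that smooth transition is why the denormalized case uses E = 1 - Bias. Passing k=8, n=23 decodes single precision patterns the same way.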

I hope you now have a better understanding of floating point numbers.

Discover and read more posts from Prabhsimran Singh Baweja