1. Introduction
This goal of this essay is as a reminder to those already familiar with IEEE 754 binary representation.
This will involve the comparison of two numbers given in this specific format. There are four
different outcomes of such a comparison: (1) the first number could be smaller (x < y ), (2) the
second number could be smaller (x > y ), (3) the two numbers could be equal (x= y ), (4) the two
numbers cannot be compared ( x and y cannot be compared).
In the first part of this essay, we explain how rational numbers are represented in binary using the
IEEE 754 binary representation. This will follow with a discussion on what needs to be considered
when implementing the comparison of two rational numbers. This means all 4 relevant cases must
be considered. An algorithmic description will be provided on how to do this. It will be detailed
enough that you could implement this into Java. We will then further discuss how the algorithm can
be translated into a digital circuit and what differences there are between this and implementing it
into Java.
Finally, there will be a summary and conclusion provided. This will explain what has been learnt and
what was difficult. There will also be a brief assessment of the essay based on the marking criteria
provided.
2. Representing rational numbers in IEEE 754 binary representation
We will start with how rational numbers are represented in IEEE 754 representation by going
through reasonings and terms so that it fits together as you read on.
We have binary representations for natural numbers ( x ∈ N ), where N is the set of all natural
numbers and the ∈ symbol means ‘belongs in’, i.e., x belongs in N . Furthermore, we can also have
this for integers (x ∈ Z ), where Z represents the set of all integers.
For real numbers (x ∈ R), this isn’t feasible as real numbers can be infinite, and computers only
perform finite representation. If there were a pair of real numbers, there would always be another
number between them. This is an issue as to represent each unique and individual real number is
excessive for finite computers.
On the other hand, it is possible to represent rational numbers (x ∈ Q) in binary, where Q
represents the set of all rational numbers. Rational numbers can be represented as a fraction, i.e,
q=a /b, where q ∈ Q , a ,b ∈ Z .
There are two ways of approaching how to represent rational numbers. The first one being fixed
point representations, where you limit the number of digits after the decimal point. This solution is
very simple to understand but it’s only use would be in very specific situations. For example, in
scenarios relating to money, where there would only need to be two decimal values, e.g., 3.20.
Therefore, this isn’t an ideal way to represent the rational numbers.
x ∈ Q has a unique representation known as floating point representation as x=(−1 ) s ∙ m∙ 10e ,
where s ∈{0 ,1 }, where 0 represents the value being positive and 1 being negative, m∈ Q is the
significand which satisfies the following inequality 1 ≤m<10 known as the normalised floating point
representation, e ∈ Z is the exponent. For example, when you represent 12.34 using this method:
s=0 as it’s positive, m=1.234 since the value of the significand can only vary between 1 and 10,
and e=1 because the value has been divided once. Due to how we have defined this, 0 cannot be
, represented since the significand can only satisfy 1 ≤m<10. So, the method would fail as m=0 is
not possible. Since 1 ≤m<10, we can represent it as follows:
∞
m=∑ d i ∙ 10−i , withd i ∈ { 0 , 1 ,2 , 3 , 4 , 5 , 6 ,7 ,8 ,9 } .
i=0
This is called decimal representation, the Σ represents the summation from i=0 to infinity. 10−i is
the fractional part of the representation. For example, if i=1, then we are representing 0.1 (tenths)
making i=2 represent 0.01 (hundredths), this is because we are working in base 10. The summation
of infinity indicates that you can add an infinite number of terms, allowing for infinitely long
sequences of binary or decimal digits in the significand of the expressions. This is an issue as digital
computers can only work in a finite manner, therefore limiting the number of digits that can be used
to represent binary. So instead of infinity, we use a limit l , where l can take values from 0 to the
designated limit (which is the maximum length of the number - 1).
From above, we have represented m in base 10. However, binary uses base 2, we resolve this by
simply switching the value of 10 to 2 and changing d i ∈{0 ,1 }. Therefore making m satisfy the
inequality 1 ≤m<2. We have the following summation:
l
m=∑ d i ∙ 2 , with d i ∈ { 0 , 1 } .
−i
i=0
It is clear from this summation that the value of m is now defined by summations in terms of
1 ,1 /2 , 1/ 4 ,1/8 , …, this is essentially the powers of 2 from 0 to −l .
Rational numbers can’t always represented accurately
There is a caveat to having a limit as it has an impact on binary floating point representation not
being able to accurately represent rational numbers. We show a proof of this using contradiction,
where we will prove the true statement to be right by comparing it to the complete opposite.
Proof:
The decimal value 0.2 cannot be accurately represented. By contradiction, we assume that this isn’t
the case. So,
l
1
0.2= =m=∑ d i ∙ 2−i , with d i ∈ {0 , 1 } .
5 i=0
Under the assumption that the above is true we can then write,
l l
1
=m=∑ d i ∙2 =2 ∑ d i ∙2
−i −l l−i
5 i=0 i=0
We have pulled out the term 2−l from all terms in the summation and then corrected this. We
correct this by changing 2−i to 2l−i to make up for the fact that the whole expression has been
divided by 2l. Next, we substitute j=(l−i) into the expression,
l
m=2−l ∑ dl − j ∙ 2 j
j=0