2019; Fused Floating Point Dot Product Unit (EE311 VLSI Lab Project)

Guining Pertin
Nov 20, 2019
Updated: Oct 7, 2025

Introduction

This work was done for the EE311 course project under Dr. Trivedi, EEE Dept., IITG.

The project was an implementation of the paper titled “Fused Floating Point Arithmetic for Discrete Wavelet Transform” by Temesghen Tekeste, Hani Saleh, Baker Mohammad and Mohammed Ismail on the ZedBoard Zynq-7000 FPGA development board.

Link to the paper: https://ieeexplore.ieee.org/document/7870140

The project was mentored by Meenali Janveja, EEE Department, and the members were:

Guining Pertin – BTech, ECE
Bhaskar Baiplawat – BTech, ECE

NOTE: Don't quote me on this work. I am writing this blog years later and I don't remember much about this specific work.

Explanation of the paper

IEEE 754 floating point numbers are widely used in computing because of the wider dynamic range and precision they provide compared to fixed point numbers. Many DSP applications, such as FFT and convolution, rely on algorithms that perform large amounts of floating point arithmetic. Multiplication combined with addition (essentially the computation of dot products and matrix multiplications) is used extensively in such computations. Consider four numbers A, B, C and D. To add (or subtract) the products AB and CD with a conventional floating point multiplier and adder, the hardware needs 2 multipliers and 1 adder. Here we want to implement a single unit that computes the sum (or difference) of the products AB and CD directly. The paper uses this unit to compute the DWT (discrete wavelet transform), in which a sequence is passed through a series of low pass and high pass filters to extract time-frequency components.

Proposed Fused Floating Point Unit:

Implementation

The unit takes five inputs: four numbers A, B, C, D and one operation select bit, where 0 selects the addition AB + CD and 1 selects the subtraction AB - CD. It returns a single output.
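As a rough behavioural reference for this interface, here is a small Python sketch. It is not the hardware datapath: it computes AB ± CD exactly with rationals and rounds once to single precision, then compares that with the conventional route of rounding each product first. The names to_f32, fused_dot and unfused_dot are made up for this illustration, and passing through a Python double before the final rounding is a simplification.

```python
import struct
from fractions import Fraction

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def fused_dot(a, b, c, d, op):
    """A*B + C*D (op = 0) or A*B - C*D (op = 1), computed exactly, rounded once."""
    cd = Fraction(c) * Fraction(d)
    exact = Fraction(a) * Fraction(b) + (-cd if op else cd)
    return to_f32(float(exact))

def unfused_dot(a, b, c, d, op):
    """Same operation with each product rounded to single precision first."""
    ab = to_f32(a * b)
    cd = to_f32(c * d)
    return to_f32(ab - cd if op else ab + cd)

# Inputs chosen so that the low bits of A*B only survive in the fused version.
a = to_f32(1 + 2**-23)
c = to_f32(1 + 2**-22)
print(fused_dot(a, a, c, 1.0, 1))    # 2**-46, about 1.42e-14
print(unfused_dot(a, a, c, 1.0, 1))  # 0.0
```

The example inputs are contrived: A*B equals 1 + 2^-22 + 2^-46, and the last term is lost as soon as the product is rounded on its own.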

Exp Compare:

To multiply two numbers, their exponents are added. Each exponent carries the bias, so the summed exponent contains the bias twice; the bias is therefore subtracted once from each of the AB and CD exponents. The sum or difference of AB and CD is expressed in terms of the larger of the two computed exponents, so the larger one is determined. The larger exponent is passed to the exponent adjust module, whereas the difference of the two is passed to the alignment module.
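The exponent compare step can be sketched in a few lines of Python. This assumes single precision (bias 127) and works on the raw biased exponent fields; the function name, the returned flag and the tie-break when both exponents are equal are my own choices for the illustration.

```python
BIAS = 127  # single precision exponent bias (assumed format)

def exp_compare(exp_a, exp_b, exp_c, exp_d):
    """Return (larger exponent, exponent difference, AB-is-larger flag)."""
    exp_ab = exp_a + exp_b - BIAS   # remove the doubled bias once
    exp_cd = exp_c + exp_d - BIAS
    ab_is_larger = exp_ab >= exp_cd  # tie goes to AB in this sketch
    larger = exp_ab if ab_is_larger else exp_cd
    diff = abs(exp_ab - exp_cd)
    return larger, diff, ab_is_larger

# A = 2.0 (field 128), B = 4.0 (field 129), C = D = 1.0 (field 127)
print(exp_compare(128, 129, 127, 127))  # (130, 3, True)
```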

Multiplier Tree:

The multiplier tree uses the Wallace tree multiplier architecture, realized as a series of carry save adders (CSAs) over the partial products.

Ref: Keshaveni N., High Speed Area Efficient 32 Bit Wallace Tree Multiplier

For each of the AB and CD products, two 47-bit vectors are generated, known as the sum and carry vectors. The carry vector is actually 46 bits long, but because of the structure used, the MSBs of the sum and carry vectors are aligned with each other, so the carry is shifted left by one bit, giving a length of 47 bits.
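To make the sum and carry vectors concrete, here is a Python sketch of reducing partial products with carry save adders, grouped three at a time per level in the Wallace style. It models the idea on unbounded integers rather than the exact bit widths above, and the function names and the default 24-bit mantissa width are assumptions for the illustration.

```python
def csa(x, y, z):
    """3:2 carry save adder: three words in, sum and carry words out."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1  # carries move one position left
    return s, c

def multiplier_tree(a, b, width=24):
    """Reduce the partial products of a * b to a (sum, carry) pair (width >= 2)."""
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows) - 2, 3):
            nxt.extend(csa(rows[i], rows[i + 1], rows[i + 2]))
        nxt.extend(rows[len(rows) - len(rows) % 3:])  # rows left over this level
        rows = nxt
    return rows[0], rows[1]

s, c = multiplier_tree(0xC00000, 0xA00000)
print(s + c == 0xC00000 * 0xA00000)  # True
```

No carry is propagated until the final addition of the two vectors, which is why the tree can hand its output to the alignment stage in sum and carry form.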

Alignment:

The inputs are the sum and carry vectors of both the AB and CD products. The shift amount is the exponent difference, and which pair is shifted depends on whose exponent is larger. Ultimately all sum and carry vectors are converted to 71-bit strings. If the exponent of AB is larger, the sum and carry of CD are shifted right by the required amount, and zeros are appended at the end to make them 71 bits. The AB sum and carry are then made 71 bits by appending 24 zero bits at the end. If the exponent of CD is larger, the roles are swapped: the sum and carry of AB are shifted right by the required amount and padded with zeros to 71 bits, and the CD sum and carry are extended to 71 bits by appending 24 zero bits.
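The alignment can be sketched in Python as follows. Appending 24 zeros is modelled as a left shift by 24, and the pair with the smaller exponent is then shifted right by the exponent difference. The function and parameter names are hypothetical, and bits shifted out past the 71-bit window are simply dropped here, which is a simplification.

```python
PAD = 24        # zero bits appended at the end: 47 + 24 = 71
OUT_WIDTH = 71

def align(sum_ab, carry_ab, sum_cd, carry_cd, diff, ab_is_larger):
    """Return the four vectors widened to 71 bits, the smaller pair shifted right."""
    mask = (1 << OUT_WIDTH) - 1

    def widen(v, shift):
        return ((v << PAD) >> shift) & mask

    if ab_is_larger:
        return (widen(sum_ab, 0), widen(carry_ab, 0),
                widen(sum_cd, diff), widen(carry_cd, diff))
    return (widen(sum_ab, diff), widen(carry_ab, diff),
            widen(sum_cd, 0), widen(carry_cd, 0))

# CD has the smaller exponent by 3, so its vectors move right by 3 positions.
print(align(0b1, 0, 0b1000, 0, 3, True))  # (16777216, 0, 16777216, 0)
```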

2s Complement Logic:

The 2's complement is taken only when a subtraction is performed. Based on the operation select, the operand is either replaced by its 2's complement or passed through unchanged.

Operation Select:

Based on the signs of the AB product and the CD product and on the original operation, the operation actually performed inside the unit is determined. The signs of AB and CD are the XOR of the sign bits of A, B and of C, D respectively. The final logic for the internally performed operation is:

OPSEL = XOR(sign AB, sign CD, original operation)
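A quick Python sketch of the operation select logic, treating every sign and the operation as a single bit (0 or 1). The function name is made up for the example.

```python
def op_select(sign_a, sign_b, sign_c, sign_d, op):
    """Effective internal operation: 0 = add magnitudes, 1 = subtract magnitudes."""
    sign_ab = sign_a ^ sign_b
    sign_cd = sign_c ^ sign_d
    return sign_ab ^ sign_cd ^ op

# AB positive, CD negative, requested operation is subtraction:
# AB - (-|CD|) is effectively an addition of magnitudes.
print(op_select(0, 0, 1, 0, 1))  # 0
```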

4:2 CSA:

It adds the aligned sum and carry of the AB product to give the 72-bit final AB result. Similarly, the aligned sum and carry of the CD product are added to give the 72-bit final CD result.

Leading Zero Anticipator (LZA):

The LZA takes the final AB result and the final CD result as inputs. It works in parallel with the adder to determine how many zeros will precede the first one in the adder result. The 72-bit sequence is encoded and the leading zeros are counted. Encoding is

While y(i) is zero, the normalization shift amount (the number of leading zeros) is incremented.
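The counting step can be sketched in Python as below. Note the limits of this sketch: it counts leading zeros of a word y that is already known, while a real LZA predicts the count from the adder inputs, and the encoding that produces y is not reproduced here. The function name and the 72-bit default are assumptions.

```python
def count_leading_zeros(y, width=72):
    """Scan from the MSB and count zeros until the first 1 is seen."""
    shift = 0
    for i in range(width - 1, -1, -1):
        if (y >> i) & 1:
            break
        shift += 1  # y(i) is zero, so the shift amount grows
    return shift

print(count_leading_zeros(0b1011))  # 68
```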

Parallel Adder:

It adds the 72-bit strings of AB and the (conditionally complemented) CD to give the required 73-bit result. The first bit is the generated carry. The string considered for normalization depends on the carry bit and the operation select. When the carry bit is 0 and the operation select is add, the last 72 bits are considered. When the carry bit is 1 and the operation select is add, bits 73 down to 2 are considered. When the carry bit is 1 and opsel is sub, the last 72 bits are considered. When the carry bit is 0 and opsel is sub, the 2's complement of the last 72 bits is taken, and this 2's complement is used for shifting.
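The four cases can be written out as a small Python model. This is a sketch under my own assumptions: the 2's complement of CD is formed as invert-plus-one inside the addition, and the function name and return shape are made up for the illustration.

```python
WIDTH = 72
MASK = (1 << WIDTH) - 1

def parallel_add(ab, cd, opsel):
    """Return (72-bit magnitude used for normalization, carry out)."""
    if opsel:
        raw = ab + (~cd & MASK) + 1   # add the 2's complement of CD
    else:
        raw = ab + cd
    cout = raw >> WIDTH               # the 73rd bit
    low = raw & MASK                  # the last 72 bits
    if opsel == 0:
        magnitude = (raw >> 1) if cout else low   # bits 73..2 on carry
    else:
        magnitude = low if cout else (-low) & MASK  # negative result: complement back
    return magnitude, cout

print(parallel_add(5, 3, 1))  # (2, 1)
print(parallel_add(3, 5, 1))  # (2, 0)
```

In the subtract case the carry tells whether AB was at least as large as CD, which is the same bit the sign logic uses later.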

These 72 bits are shifted left by the LZA shift amount, so that the MSB of the shifted result is 1. The 27 most significant bits of the shifted result are then taken as the normalized result.
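In Python the normalization shift looks roughly like this. The function name is hypothetical, and the bits below the kept 27 are simply discarded in this sketch rather than folded into the sticky bit.

```python
def normalize(value, shift, width=72, keep=27):
    """Shift left by the LZA amount and keep the top `keep` bits."""
    shifted = (value << shift) & ((1 << width) - 1)
    return shifted >> (width - keep)

# 0b1011 has 68 leading zeros in a 72-bit word, so its leading 1 lands on the MSB.
print(bin(normalize(0b1011, 68)))  # 0b101100000000000000000000000
```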

Rounding and Post Normalization:

In rounding and post normalization, the last three of the 27 bits are the rounding bit, guard bit and sticky bit. Based on these three bits, 1 is added to the first 24 bits of the 27-bit normalized result for rounding.

Logic to add 1: round & (guard | sticky)

If adding 1 to the 24 bits generates a carry, it is treated as an exponent overflow. The addition gives a 25-bit result. If the 25th bit is 1, bits 24 down to 2 are taken as the final 23-bit mantissa. If the 25th bit is 0, the last 23 bits are taken as the final result.
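Here is a Python sketch of the rounding and post normalization described above, using the bit order and the increment logic exactly as given in this post (round, guard, sticky as the last three bits). The function name and the returned tuple are assumptions for the example.

```python
def round_and_post_normalize(norm27):
    """Return (23-bit fraction, overflow flag) from the 27-bit normalized result."""
    mant24 = norm27 >> 3               # first 24 bits
    rnd = (norm27 >> 2) & 1
    guard = (norm27 >> 1) & 1
    sticky = norm27 & 1
    total = mant24 + (rnd & (guard | sticky))  # up to 25 bits
    overflow = total >> 24             # the 25th bit
    if overflow:
        fraction = (total >> 1) & 0x7FFFFF     # bits 24 down to 2
    else:
        fraction = total & 0x7FFFFF            # last 23 bits
    return fraction, overflow

print(round_and_post_normalize((0x800000 << 3) | 0b110))  # (1, 0)
print(round_and_post_normalize((0xFFFFFF << 3) | 0b101))  # (0, 1)
```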

Exponent Adjust:

It is the result of exponent compare plus the rounding overflow, plus the 73rd bit of the adder if opsel is add. This gives the final exponent of the AB and CD add/sub.

Final sign of the AB and CD add/sub: it is determined from opsel, the 73rd bit of the adder (called cout) and the sign of AB. The logic is

Sign = (cout & sign AB) | (~opsel & sign AB) | (opsel & ~cout & ~sign AB)

Sign 0 is positive, sign 1 is negative.
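The sign logic translates directly to Python when each signal is a single bit (0 or 1). The function name is made up, and bit inversion is written as XOR with 1 so the values stay in the 0/1 range.

```python
def final_sign(opsel, cout, sign_ab):
    """Sign of AB +/- CD: 0 is positive, 1 is negative."""
    not_opsel, not_cout, not_sign_ab = opsel ^ 1, cout ^ 1, sign_ab ^ 1
    return ((cout & sign_ab)
            | (not_opsel & sign_ab)
            | (opsel & not_cout & not_sign_ab))

# Effective subtraction with AB positive and no carry out: CD was larger,
# so the result is negative.
print(final_sign(1, 0, 0))  # 1
```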