An Efficient Implementation of an Exponential Random Number Generator in a Field Programmable Gate Array (FPGA)

Smitha Gautham
Virginia Commonwealth University

Follow this and additional works at: https://scholarscompass.vcu.edu/etd
Part of the Electrical and Computer Engineering Commons

© The Author

Downloaded from
https://scholarscompass.vcu.edu/etd/2173

This Thesis is brought to you for free and open access by the Graduate School at VCU Scholars Compass. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of VCU Scholars Compass. For more information, please contact libcompass@vcu.edu.
An Efficient Implementation of an Exponential Random Number Generator in a Field Programmable Gate Array (FPGA)

Smitha Gautham

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science at Virginia Commonwealth University

Virginia Commonwealth University, 2010

Director – Dr. James M. McCollum
Assistant Professor of Electrical & Computer Engineering
Dedication

In loving memory of my late grandmother
Nagarathnamma
Acknowlegements

First and foremost, I would like to thank my advisor Dr. James M. McCollum for his constant support and encouragement. I am grateful for his patient guidance and valuable suggestions throughout my research. I would like to thank Dr. Robert H. Klenke, Dr. Alen Docef and Dr. Vojislav Kecman for serving on my thesis committee. I want to thank Dr. Robert H. Klenke for sharing his group’s random number generator code to interface with my exponential distribution module. I would also like to thank Dr. Roslyn Hobson for her support.

It was great pleasure to be a part of the VCU Computer Systems Laboratory. I would like to thank all my lab members who made working in the lab always enjoyable. I would like to specially acknowledge Jake Berlier for his help in understanding the Xilinx tools.

I would like to thank my friends Sai Priya, Sravanthy and Meena who made my stay at VCU very memorable.

I would like to thank my parents Latha and Muralidhar for their love and unwavering support. Their encouragement and blessing helped me reach where I am today. I fondly remember my late grand-mother who inspired me to study well and take up a professional career, waking me early in the morning with a hot cup of coffee during my exams. My grandfather was also a constant source of support. I would like to thank my in laws Anupama and Jayasimha for always being there for me. Their love and encouragement helped me a lot. I would like to thank my sister Sapna and brother Sharath with whom I can share all my aspirations and feelings. I would like to thank my sister in law Nirupama for her good wishes.

I am fortunate to have a husband like Atul, whose love and support motivates me to achieve anything I aspire for. I would like to thank him for being so special.
Abstract

Many physical, biological, ecological and behavioral events occur at times and rates that are exponentially distributed. Modeling these systems requires simulators that can accurately generate a large quantity of exponentially distributed random numbers, which is a computationally intensive task. To improve the performance of these simulators, one approach is to move portions of the computationally inefficient simulation tasks from software to custom hardware implemented in Field Programmable Gate Arrays (FPGAs).

In this work, we study efficient FPGA implementations of exponentially distributed random number generators to improve simulator performance. Our approach is to generate uniformly distributed random numbers using standard techniques and scale them using the inverse cumulative distribution function (CDF). Scaling is implemented by curve fitting piecewise linear, quadratic, cubic, and higher order functions to solve for the inverse CDF.

As the complexity of the scaling function increases (in terms of order and the number of pieces), number accuracy increases and additional FPGA resources (logic cells and block RAMs) are consumed. We analyze these tradeoffs and show how a designer with particular accuracy requirements and FPGA resource constraints can implement an accurate and efficient exponentially distributed random number generator.
# Table of Contents

<table>
<thead>
<tr>
<th>Section</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>Acknowledgements</td>
<td>3</td>
</tr>
<tr>
<td>Table of contents</td>
<td>5</td>
</tr>
<tr>
<td>List of Tables</td>
<td>7</td>
</tr>
<tr>
<td>List of Figures</td>
<td>8</td>
</tr>
<tr>
<td>1. Introduction</td>
<td>9</td>
</tr>
<tr>
<td>1.1 Motivation</td>
<td>9</td>
</tr>
<tr>
<td>1.2 Contributions</td>
<td>10</td>
</tr>
<tr>
<td>1.3 Outline</td>
<td>10</td>
</tr>
<tr>
<td>2. Background</td>
<td>12</td>
</tr>
<tr>
<td>2.1 Probability Theory</td>
<td>12</td>
</tr>
<tr>
<td>2.2 Floating point numbers and operations on floating point numbers</td>
<td>18</td>
</tr>
<tr>
<td>2.3 Curve fitting</td>
<td>21</td>
</tr>
<tr>
<td>2.4 Field Programmable Gate Arrays</td>
<td>23</td>
</tr>
<tr>
<td>2.5 Previous work</td>
<td>24</td>
</tr>
<tr>
<td>3. Implementation</td>
<td>27</td>
</tr>
<tr>
<td>3.1 Matlab implementation</td>
<td>27</td>
</tr>
<tr>
<td>3.2 Hardware implementation</td>
<td>35</td>
</tr>
<tr>
<td>4. Results</td>
<td>41</td>
</tr>
<tr>
<td>4.1 Results and verification</td>
<td>41</td>
</tr>
<tr>
<td>4.2 Resource Utilization</td>
<td>42</td>
</tr>
<tr>
<td>4.3 True random number interface</td>
<td>49</td>
</tr>
<tr>
<td>5. Conclusion</td>
<td>52</td>
</tr>
</tbody>
</table>
6. References..................................................................................................................55

7. Appendix....................................................................................................................57
## List of Tables

<table>
<thead>
<tr>
<th>Table</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>3.1</td>
<td>Increase in LUT size as the number of intervals increases</td>
<td>31</td>
</tr>
<tr>
<td>3.2</td>
<td>Scaling of LUT entries with number of intervals and order of polynomial fit</td>
<td>32</td>
</tr>
<tr>
<td>4.1</td>
<td>Resource utilization for various LUT sizes</td>
<td>43</td>
</tr>
<tr>
<td>4.2</td>
<td>Predicted BRAM usage for different LUT sizes for different Polynomials</td>
<td>45</td>
</tr>
<tr>
<td>4.3</td>
<td>Mean Square Error for different LUT sizes for different polynomials</td>
<td>46</td>
</tr>
<tr>
<td>4.4</td>
<td>Device utilization for different DLUT sizes and polynomials</td>
<td>49</td>
</tr>
<tr>
<td>4.5</td>
<td>Resource utilization and frequency of operation of Pipelined Coregen floating point unit vs. floating point library</td>
<td>51</td>
</tr>
</tbody>
</table>
# List of Figures

<table>
<thead>
<tr>
<th>Figure</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>2.1</td>
<td>Probability density function for rolling of a dice</td>
<td>13</td>
</tr>
<tr>
<td>2.2</td>
<td>PDF of a normal distribution function</td>
<td>14</td>
</tr>
<tr>
<td>2.3</td>
<td>Probability of x falling between a and b, given PDF=f(x)</td>
<td>15</td>
</tr>
<tr>
<td>2.4</td>
<td>Cumulative distribution function (CDF)</td>
<td>16</td>
</tr>
<tr>
<td>2.5</td>
<td>The PDF for exponential distribution for different values of λ</td>
<td>17</td>
</tr>
<tr>
<td>2.6</td>
<td>CDF of exponential distribution for different values of λ</td>
<td>18</td>
</tr>
<tr>
<td>2.7</td>
<td>Floating point representation</td>
<td>19</td>
</tr>
<tr>
<td>2.8</td>
<td>Polynomial curve fit for linear, quadratic, cubic and quartic polynomials</td>
<td>22</td>
</tr>
<tr>
<td>3.1</td>
<td>Inverse CDF curve fitted by linear polynomial</td>
<td>29</td>
</tr>
<tr>
<td>3.2</td>
<td>Inverse CDF function divided into 2 intervals and curve fitted with linear</td>
<td>30</td>
</tr>
<tr>
<td></td>
<td>and quadratic polynomials</td>
<td></td>
</tr>
<tr>
<td>3.3</td>
<td>Schematic to show how polynomial fit in an interval introduces an error</td>
<td>34</td>
</tr>
<tr>
<td>3.4</td>
<td>MSE vs. number of intervals (2^n) for linear, quadratic and cubic polynomials</td>
<td>35</td>
</tr>
<tr>
<td>3.5</td>
<td>Block diagram representation linear interpolation</td>
<td>39</td>
</tr>
<tr>
<td>3.6</td>
<td>Block diagram representation quadratic interpolation</td>
<td>39</td>
</tr>
<tr>
<td>3.7</td>
<td>Block diagram representation cubic interpolation</td>
<td>40</td>
</tr>
<tr>
<td>4.1</td>
<td>Graphical representation of utilization of BRAM by linear, quadratic and</td>
<td>44</td>
</tr>
<tr>
<td></td>
<td>cubic polynomials</td>
<td></td>
</tr>
<tr>
<td>4.2</td>
<td>Variation of Mean square error for different BRAM sizes</td>
<td>47</td>
</tr>
<tr>
<td>4.3</td>
<td>Variation of Mean square error for different interval lengths and different polynomial fits</td>
<td>48</td>
</tr>
</tbody>
</table>
1. Introduction

This chapter discusses the motivation for accelerating the performance of exponentially distributed random number generation using FPGAs, key contributions made by this work, and provides an outline for the thesis.

1.1 Motivation

Many natural systems have properties, rates, or events that are exponentially distributed. The arrival of cars at traffic lights are exponentially distributed [1]. The rates at which phone numbers are called in a telemarketing system are exponentially distributed [1]. The inter-arrival time of packets in a network router and processing tasks on a computer are exponentially distributed [3]. Engineers model these systems to optimize the capacity and efficiency of their design, in turn using simulators that rely heavily on exponentially distributed numbers.

Scientists use the exponential distribution to model the radioactive decay time of a nuclear material [4]. Also, the decrease in the intensity of electromagnetic radiation in a medium follows an exponential distribution [4]. Models can then be used to calculate the thickness of a shield used in a nuclear reactor.

Similarly, the rate of a chemical reaction can be modeled by the exponential distribution. In pharmacology, for instance, the time it takes for a chemical to be metabolized also usually follows an exponential decay [5]. This has implications on the absorption of a medicine by the body as well as how it is spatially distributed to different parts of the body over time.

Because exponentially distributed random numbers are used to model such a wide variety of physical, biological, ecological, and behavioral systems, accelerating the performance of exponentially distributed random number generation stands to benefit many areas of scientific inquiry and engineering design.
Field Programmable Gate Arrays (FPGAs), which have been used to accelerate a wide variety of computationally intensive tasks, are an obvious platform for developing accelerated simulators that are either partially or fully implemented in hardware. The first step in constructing FPGA-accelerated simulators is to develop re-usable hardware elements that can be assembled into working solutions. Here we study a fundamental issue in the development of hardware accelerated simulators by developing an optimized and accurate exponential random number generator.

1.2 Contribution

McCollum et al [6] first designed a system capable of generating exponentially distributed random numbers using a piecewise linear approximation of the inverse CDF using a 256 entry data lookup table. This work is significantly expanded here by exploring the tradeoffs in FPGA resources and number accuracy by varying the data lookup table size and increasing the order of the approximation function. The study shows how a designer can generate the most accurate exponential numbers given a particular limitation on FPGA resources. The study also shows how a designer can use the least amount of FPGA resources to achieve a specific accuracy value. An exponential distribution generator is then implemented on a Virtex 5 FPGA capable of generating 100,000,000 exponentially distributed random numbers per second.

1.3 Outline

Introductions to probability theory, curve fitting, FPGAs, and prior work are presented in the Section 2. Details of how the random number generator was constructed and how performance and error analysis was conducted are given in the Section 3. Section 4 describes the results of the research and gives a detailed description of FPGA resource utilization for different order polynomials and different data look up table sizes. Section 5
concludes this thesis by summarizing the key results and discussing the future areas of research.
2. Background

Information needed to understand this thesis and description of previous work performed in the proposed research area is described in this section. First, probability density functions [7] and their importance in generation of random numbers that follow a particular distribution are described. The emphasis is on the exponential distribution. Second, floating point numbers [8] that are key to implementing any such distribution on hardware and operations involving them are discussed. Third, the use of curve fitting to approximate these distribution functions in various intervals is described. Finally, the work performed by other researchers is discussed.

2.1 Probability

The concept of probability can be explained with a simple example involving tossing of coins or rolling of dice. When a coin is tossed, there is equal chance of getting a head or tail. Thus the probability of getting a head or tail is 0.5. When a dice is rolled, there is one in six chance of getting a specific number between 1 and 6, the probability of any of these events is 0.1333.

The probability of an event occurring is thus the number of ways the event can occur divided by all of the possible ways of getting the outcomes. The probability of an event occurring is always less than or equal to 1. A probability of 1 implies that there is one hundred percent chance of occurrence of the event. A probability of zero implies that there is no chance of occurrence of the event.

For example, when a pair of dice is flipped, we can make the following observations.

- Number of ways of getting a sum of 2 or 12 is one in twelve
- Number of ways of getting a sum of 3 or 11 is two in twelve
- Number of ways of getting a sum of 4 or 10 is three in twelve
• Number of ways of getting a sum of 5 or 9 is four in twelve
• Number of ways of getting a sum of 6 or 8 is five in twelve
• Number of ways of getting a sum of 7 is six in twelve

It can be observed that all of these outcomes do not have equal probabilities. Suppose the sum is plotted on the x-axis and the probability of obtaining this sum is plotted on the y-axis, it gives a graph of the probability of each and every event (in this case the sum) as shown in Figure 2.1. A similar graph can be obtained when we toss a coin a hundred times and record the number of consecutive heads or tails. These are graphs of the probability density function (PDF) of the event.

![Figure 2.1: Probability density function for rolling of a dice.](image)
2.1.1 Probability distributions

Many systems can be modeled using probability distributions. There are different standard forms of probability distribution functions, such as the exponential, normal, weibull, and uniform, that commonly mimic the behavior of real-world events. These distributions can be observed in various applications in our daily activities.

As discussed in the introduction section, radio-active decay time of a material and waiting time at a traffic light follow an exponential distribution, which will be discussed in later sections.

Marks scored by students in a class and performance of employees in a company follow the normal distribution, an example of which is shown in Figure 2.2.

![Figure 2.2. PDF of a normal distribution function.](image)

2.1.2 Probability density function

The probability density function (PDF) represents the likelihood of a random number falling in a given sample space [9]. For example, when a coin is tossed a hundred times and the number of consecutive heads or tails are plotted on X axis and the probability of
getting the outcome is plotted on the Y axis, then we get a graph of the Probability density function of the event.

The PDF is described by the following equation

\[ P(a \leq x \leq b) = \int_a^b f(x) \, dx \]  

(2.1).

For example, if the marks scored by students in a class is plotted and it gives us a probability distribution function as shown in Figure 2.3, then the probability of the marks falling between a and b where a and b are marks of the students can be calculated by using equation (2.1) where \( f(x) \) is the probability distribution function defining the marks scored by the students.

![Figure 2.3. Probability of x falling between a and b, given PDF=f(x).](image)

### 2.1.3 Cumulative distribution function

The cumulative distribution function is the integral of the probability density function. CDF \( (x) \) represents the probability of a number being less than or equal to \( x \). Since, the total probability can never exceed one, the CDF can never take a value greater than one.
It usually asymptotes to 1 as \( x \) tend to infinity as shown in Figure 2.4. The cumulative distribution function of a number \( x \) can be described by the equation

\[
F(x) = \int f(x) \, dx
\]  

(2.2)

where \( F(x) \) is the probability density function.

2.1.4 Inverse cumulative density function

The Inverse CDF involves finding the value “\( x \)”, so that the CDF \( (x) = p \), where \( 0 \leq p \leq 1 \). In other words given a certain probability \( p \) between 0 and 1, the inverse CDF \( (p) \) gives the number \( x \) described above.

It can be seen that if \( p \) is a uniformly distributed random number between 0 and 1, and the exponential CDF is chosen, then taking the inverse CDF yields exponentially distributed random numbers. If we choose the CDF to be Weibull, Normal, etc we would get Weibull or normally distributed random numbers by taking the inverse CDF of these functions respectively. This makes inverse CDF a very useful method to generate numbers with different distributions.
2.1.5 The Exponential distribution

The exponential distribution is used significantly in modeling statistical data and simulation. The probability distribution function (PDF) of exponential distribution is given by [7]

\[ f(t) = \lambda e^{-\lambda t}, \quad t \geq 0 \tag{2.3} \]

where \( \lambda \) is a positive parameter. A plot of the probability density function for different values of \( \lambda \) is shown in Figure 2.5.

![PDF for exponential distribution for different values of \( \lambda \).](image)

**Figure 2.5.** The PDF for exponential distribution for different values of \( \lambda \).

The cumulative distribution function (CDF) of exponential distribution, which is the integral of the PDF, is given by [7]

\[ F(t) = 1 - \lambda e^{-\lambda t}, \quad t \geq 0 \tag{2.4} \]
where $\lambda$ is a positive parameter. A plot of the cumulative distribution function is shown in Figure 2.6.

![CDF of exponential distribution for different values of \(\lambda\).](image)

**Figure 2.6.** CDF of exponential distribution for different values of $\lambda$.

The inverse CDF can be calculated as $t = -\ln(1 - F(t))$ where $F(t)$ is the CDF of the exponential distribution function.

The approach used in this thesis will be to generate uniformly distributed random numbers using standard FPGA methods, then scale the uniform random numbers to the exponential distribution using the inverse CDF. The challenge in implementing this approach in hardware will be to accurately and efficiently calculate the inverse CDF function in hardware.

### 2.2 Floating-Point Number Representation

This section describes floating-point number representation, floating-point addition and floating-point multiplication. There are data look up tables (DLUT) used in the design,
which have numbers in the integer format. These numbers are converted to floating point form to perform further operations. The description below describes real to float conversion. Floating point addition and multiplication are also performed in the design when we do curve fitting using different polynomials. Hence, these operations are also described in this section.

2.2.1 Floating point Numbers

Floating point numbers are used to represent very large and very small numbers. They can be represented in single precision or double precision format. In single precision format the number is represented as a 32 bit binary number. In double precision format the number is represented as a 64 bit binary number. In our research implementation, single precision floating point operations are performed.

A 32 bit floating point number has three parts. It has one sign bit, eight bits for the exponent and 23 bits for the mantissa. The sign bit is used to indicate if the number is positive or negative. Sign bit is zero for a positive and one for negative number. The exponent ranges from -126 to +127. All bits are zeros in the exponent for representing zero and all bits are ones for representing infinity or NaN. The mantissa part has 23 bits. There is an implicit 1 at the beginning of the mantissa. Therefore the mantissa has a precision of 24 bits [8].

Figure 2.7 shows a single bit floating point representation.

![Figure 2.7. Floating point representation.](image)
The following steps describe the procedure to convert a real number to a 32 bit floating point format [10]:

1. The integral and fraction part of the given number is converted to binary. The fraction part in converted to binary by repeated multiplication by 2.

2. $2^0$ is appended at the end of the number. This does not change the value of the number as it is same as multiplying by 1.

3. The number is then normalized. Normalization of a number means, the decimal point is shifted such that there is only one number to the left of the decimal point. The exponent is adjusted such that the value of the number is not changed.

4. A bias is added to the exponent and the new exponent is placed in the exponent field of the floating point number. The bias is given by $2^{k-1} - 1$ where $k$ is the number of bits in the exponent field of the floating point number. For a 32 bit floating point number, the exponent field has 8 bits. Therefore the bias is $2^{8-1} - 1 = 127$.

5. The sign bit is set for the given number. If the number is positive, sign bit is zero, otherwise the sign bit is one.

Ex: Convert 3.25 to floating point

- Convert to binary
  
  3 in binary is $11$
  
  0.25 in binary
  
  $0.25 \times 2 = 0.50 \quad 0$
  
  $0.50 \times 2 = 1.0 \quad 1$

- 3.25 in binary is 11.01
- Insert $2^0$ 11.01 * $2^0$
- Shift decimal point and adjust exponent:
  
  $1.101 \times 2^1$

- Normalize: $1 + 127 = 128 \quad 128 = 10000000$
• Floating point representation of number 3.25 is
  0 10000000 10100000000000000000000

2.2.2 Floating point Multiplication

Single precision floating point multiplication consists of multiplying two 32 bit numbers with an 8 bit exponent and a 23 bit mantissa. The exponents are added and the mantissas are multiplied. Mantissa multiplication can be done by repeated additions. The result is then normalized.

2.2.3 Floating Point addition

Single precision floating point addition involves addition of two 32 bit floating point numbers. The two numbers must have the same exponent value to do the addition operation, or they must be shifted numerous times until they have the same exponent. The mantissas are then added.

2.3 Curve fitting

The exponential distribution implemented on hardware in this research involves curve fitting with different polynomials. This section gives a brief description of curve fitting.

2.3.1. Polynomial fit

A pool of data points can be defined by a mathematical equation or well defined curve by the method of curve fitting. Curve fitting can be done in many ways. One of the methods of curve fitting is to fit polynomial equations to the data set. The polynomial which best fits the data set is used to define the data points. Best fit depends on the nature of the data points. In this thesis curve fitting was done by polynomial curve fit. A set of data points are fit by different polynomials like linear, quadratic, cubic etc.
The equation for a linear polynomial is given by

\[ y = ax + b \]  

(2.5)

where \( x \) is a set of input data points. The co-efficient \( a \) and \( b \) are calculated so that \( ax + b \) is close to the output \( y \).

The expression for higher order polynomials is given by

\[ Y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \ldots + a_{m-1} x^{m-1} + a_m x^m \]  

(2.6)

where \( m \) is the order of the polynomial.

Figure 2.8. Polynomial curve fit for linear, quadratic, cubic and quartic polynomials.
Figure 2.8 is curve fit using linear, quadratic, cubic and quartic polynomials. The coefficients of the polynomials can be calculated numerically. This is easy for linear polynomials. But for higher order polynomials, numeric calculations become tedious. There are tools available for doing curve fit. The curve fitting toolbox available in Matlab was used in this research. “Polyfit” is used to do the curve fit in Matlab. “Polyval” is used to evaluate the curve fit polynomial and compare against input data and get an error estimate between the given curve and the polynomial curve fit.

2.3.2 Mean square error

The error between the given curve and the polynomial fit is squared and the mean of this square of the errors between the data points is taken. This gives the error estimate in the polynomial fit. The lower the MSE, the polynomial is considered a better fit.

2.4 Field Programmable Gate Arrays

Field programmable gate arrays (FPGA) are an array of logical blocks which can be configured based on the design requirements. The interconnections between the cells are field programmable and hence the name field programmable gate arrays. Each logical block consists of logic gates, flip flops, and multiplexers which can be programmed to perform the required function [11].

FPGAs have a lot of advantages over Application Specific Integrated Circuits (ASIC). They have more design flexibility since there is no need to draw the layout and fabricate the IC for each design [12]. The designs can be tested instantaneously on hardware without the need to wait for a fabricated IC to test the design. Alterations on the original design can be made easily. ASICs on the other hand take more time to manufacture and thereby take more time for the design to be tested. They are not flexible. Once an ASIC is fabricated, design changes cannot be made.
In this research, a Xilinx Virtex 5 FPGA was used. Virtex 5 FPGA consists of about 330,000 logic blocks and 16.4 Mbits of Block RAM [13]. Block RAM is large memory blocks configured to store data continuously. In Virtex 5 FPGA, BRAMS can be used as one 36Kb or two 18 Kb RAM. Each logical block in a Virtex 5 FPGA consists of 4 LUTs and 4 flip flops. It also has an IBM Power PC RISC processor core.

2.5 Previous work by other researchers

Random numbers have many applications in gaming, biological simulations etc. The exponential distribution module that is designed can be used with a random number generator to get exponentially distributed random numbers. This section gives a brief description on random numbers generation. A summary of work of others researchers on generating true random numbers and random numbers that follow different distributions is presented in this section.

2.5.1 True random numbers

Random numbers can be truly random or pseudo random. In case of pseudorandom numbers, the sequence of random numbers repeat after certain set of data. The time for the first repeat of data depends on the initial seed. In case of true random numbers, the next number that is generated cannot be predicted. Random numbers can be generated by hardware or software. Software random numbers generally have a code written which is initiated when random number generation is set. They are usually pseudo random. Pseudorandom numbers are preferable sometimes to true random numbers because you can repeat the simulations for a certain set of data and repeat the simulation for the same set of random data and verify the results [7]. Just the initial seed has to be changed to get a new set of random data.

True random numbers are used in cryptography so that secrecy is maintained [14]. They are also used in gaming [14]. Thermal noise, photo electric effect is used as some of randomness in a hardware random number generator [14].
A hardware random number generator generated by Mitchum [14] was interfaced to the exponential distribution module to get a hardware exponential true random number generator. In Mitchum [14] a true random number generator is designed such that after generation of each number there are multiple paths that can be followed to generate the next successive number. It is difficult to predict the outcome of the generator as there are many ways that may be followed to get the number. Many digital components like ring oscillator, counters, shift registers, multiplexers and transposers were used in the design [14]. The ring oscillator has series of inverters connected back to back and the output of the last inverter is fed back to the first one. The period of the output wave of the ring oscillator is unpredictable and is the source of randomness in this paper. Divergent paths were introduced by using Linear feedback shift registers (LFSRs) of different lengths with ring oscillators to clock them. Thus true random numbers are generated and the randomness is tested using different methods.

2.5.2 Random numbers that follow different distributions

Wallace [15] method uses a conventional random number generator to generate a pool of random numbers. This is multiplied by an orthogonal matrix to get a new pool of random numbers. This is based on closure property where the new pool generated will also belong to the same random distribution as the old pool. Mixing of the numbers in the new pool is performed so that each number in the old pool can affect every element in the new pool after many successive transformations. This method may be used for different distributions but has been shown only for the exponential case. Here statistical errors get built up due to rounding off with successive transformations.

Ziggurat [16] is an efficient method for normal distributions but can be used for any distribution that increases monotonically to a peak and then decreases. It is based on the fact that most points lie in the mid-region of an exponential distribution and very few points lie in the tail of the curve. The points which fall in the mid region can be calculated easily and require less computation but points in the tail region require more
computation. A pipelined approach is used so that computation of the tail and the mid region can be done in parallel by hardware to generate a lot of mid-region points and few tail points. These are mixed together at the final step. The main disadvantage of this method is it cannot be used to for all kinds of distributions.

Thomas and Luk [17,18] develop a very general random number generator that is capable of generating random numbers with any kind of distribution. This uses a good mix of both hardware and software methods. The LUTs for different distributions are generated by software. A combination of different distributions that approximates the target distribution well is also chosen by software. The hardware generates a uniform random number and then uses the combination of distributions that has already been selected to generate random numbers that follow the desired distribution. The hardware also uses another random variable to pick one of the component distributions at random for calculating the transformation of each of the uniformly distributed random point. This is quick as it is only selecting a component distribution randomly and not using a combination of tables for calculating a given point. This method is applicable to different distributions. The method described in [17] uses equal weights for all selected tables, but the weight or probability of use of selected tables can be changed in method described in [18]. This allows it to approximate target distributions better. The drawback of this method is approximation of the target distribution by software is time consuming and complex. If the target distribution is known in advance then this is not an efficient method to generate random numbers.

McCollum [6] uses a LUT based method. The inverse CDF of a distribution is divided into many intervals. Each interval is curve fitted with linear polynomial and the coefficients are stored in a Data Look Up Tables (DLUTs). Thus random numbers can be generated by interpolation. This method can be used for all kinds of distributions.

In this research a similar approach to McCollum [6] is used. Here each interval is curve fitted with higher order polynomials like quadratic and cubic. It was found that the accuracy improves. By reducing the number of intervals the size of the DLUT can be
decreased while accuracy can be maintained by using higher order polynomial fits. The
next chapter describes the detailed implementation of this scheme.
3. Implementation

The design and implementation of the exponential distribution on an FPGA based on using the inverse CDF was performed in two steps as described below:

1. In the first step Matlab is used to generate Data Look Up Tables (DLUT) and optimize the interval size. This is purely implemented in software, where the effect of the size of intervals on the mean square error as well as the size of the data look-up tables required is studied. This is used to select an optimum interval size for which the data lookup tables are generated for linear, quadratic and cubic curve fits.

2. In the second step, a hardware implementation of the inverse CDF, using the data look-up tables generated for the optimum interval size is performed. This is implemented on a Virtex 5 XC5VF70T-2FFG1136 FPGA.

These two steps are explained in detail in the “Matlab implementation” and “Hardware implementation” sections.

3.1 Matlab implementation

The aim of this research project is to generate the exponential distribution in hardware. To achieve this we have to implement the inverse CDF of the distribution. A 32 bit uniform number between 0 and 1 input to this inverse CDF will generate an exponentially distributed number as output.

The inverse CDF of exponential distribution function is given below [7].

\[ t = - \ln (1 - F(t)) \]  

(3.1)
This involves implementing the natural log in hardware, which is difficult. Hence, a DLUT (Data look up table) based method is used to implement the natural log in hardware. The values of the output for all possible inputs can be stored in a DLUT. Then, for a 32 bit input there are $2^{32}$ i.e. 4,294,967,296 possible output combinations to be stored in the DLUT. This would require a huge amount of memory for implementation. Hence, a more efficient way is proposed. Instead of storing all the output values in DLUT, the inputs are divided into intervals. The inverse CDF of the numbers in each interval is calculated and curve fitted with polynomials. The coefficients of the polynomials are stored in LUT. By choosing the coefficients for the correct interval from the DLUT and interpolating, the natural log of a number can be calculated.

Figure 3.1 shows the inverse CDF, curve fitted by a linear polynomial $Ax+B$ where $A$ and $B$ are polynomial coefficients. Input numbers from 0 to 100 were curve fitted by linear polynomial and coefficients $A= 3.951$ and $B= -0.808$ were obtained.

![Figure 3.1. Inverse CDF curve fitted by linear polynomial.](image)
Figure 3.2 shows the inverse CDF divided into two intervals and then curve fitted by a linear polynomial $Ax+B$ and quadratic polynomial $Ax^2+Bx+C$ where $A,B$ and $C$ are polynomial coefficients. Polynomial coefficients obtained in each interval are shown below.

For the first interval:
Linear fit : $A = 1.365$  $B= -0.034$. The linear equation is $(1.365 \times x) +(-0.034 )$
Quadratic fit $A= 0.935$  $B= 0.897$ and $C= 0.004$. The quadratic equation is $(0.935 \times x^2) +(0.897 \times x) +(0.004)$

For the second interval:
Linear fit: $A= 7.838$  $B= -4.024$. Therefore linear equation is $(7.838 \times x) +(-4.024)$
Quadratic fit: $A= 37.451$  $B= -48.339$ and $C= 16.245$. The quadratic equation is $(37.451 \times x^2) +(-48.339 \times x) +16.245$
The coefficients are stored in Data Look Up Tables (DLUTs). Thus for an input \( x \), the inverse exponential CDF is calculated by fetching the coefficient values from the LUTs and interpolating using the polynomials.

Table-3.1 shows that the DLUT entries increases exponentially with the bit size corresponding to the number of intervals. For example, if there are 8-bits that are allocated to the number of intervals, then we need \( 2^8 = 256 \) DLUT entries. In fact, we will need a multiple of this number depending on the order of the interpolation polynomial used.

<table>
<thead>
<tr>
<th>Bit-size allocated to Number of Intervals</th>
<th>Number of entries in LUT</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>16</td>
</tr>
<tr>
<td>8</td>
<td>256</td>
</tr>
<tr>
<td>12</td>
<td>4096</td>
</tr>
<tr>
<td>16</td>
<td>65536</td>
</tr>
<tr>
<td>20</td>
<td>1,048,576</td>
</tr>
<tr>
<td>24</td>
<td>16,777,216</td>
</tr>
<tr>
<td>28</td>
<td>268,435,456</td>
</tr>
<tr>
<td>32</td>
<td>4,294,967,296</td>
</tr>
</tbody>
</table>

Table3.1. Increase in LUT size as the number of intervals increases.

The intervals are curve fitted with linear, quadratic and cubic polynomials. Curve fitting was done using polyfit () and the interpolations are performed using polyval () function in Matlab.
Table 3.2 shows the DLUT size for linear, quadratic and cubic polynomials for different interval lengths the inverse CDF is divided into.

<table>
<thead>
<tr>
<th>Intervals</th>
<th>Linear</th>
<th>Quadratic</th>
<th>Cubic</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>32</td>
<td>48</td>
<td>64</td>
</tr>
<tr>
<td>8</td>
<td>512</td>
<td>768</td>
<td>1024</td>
</tr>
<tr>
<td>12</td>
<td>8,192</td>
<td>12,288</td>
<td>16,384</td>
</tr>
<tr>
<td>16</td>
<td>131,072</td>
<td>196,608</td>
<td>262,144</td>
</tr>
<tr>
<td>20</td>
<td>2,097,152</td>
<td>3,145,728</td>
<td>4,194,304</td>
</tr>
<tr>
<td>24</td>
<td>33,554,432</td>
<td>50,331,648</td>
<td>67,108,864</td>
</tr>
<tr>
<td>28</td>
<td>536,870,912</td>
<td>805,306,368</td>
<td>1,073,741,824</td>
</tr>
</tbody>
</table>

**Table 3.2.** Scaling of DLUT entries with number of intervals and order of polynomial fit.

From Table 3.2 we can see that as the order of the polynomial increases the DLUT entries also increases as higher order polynomials will have more coefficients to be stored in the DLUT. As the number of intervals increase, there is a large increase in DLUT size.

Linear, quadratic and cubic polynomials used are:

\[ A_1x+B_1, \quad (3.2) \]
\[ A_2x^2+B_2x+C_2 \quad (3.3) \]
\[ A_3x^3+B_3x^2+C_3x+D_3 \quad (3.4) \]

Thus, as we use higher order polynomials to curve fit, the number of coefficients increase. Thus LUT size also increases. The number of intervals that the inverse CDF is divided into is also important. From Table 3.2 we can see that as the interval size increases, there will be bigger LUTs for coefficients in each interval.
The Matlab code which implements the design is given below.

```matlab
total_points = 2^32;
no_intervals = 2^8;
points_interval = 2^24;
for j = 1: no_intervals;
    x = linspace ((j-1)/no_intervals, (j/no_intervals)-(1/total_points), points_interval);
    y = -log (1 - x); % Exponential distribution
    poly1 = polyfit(x, y, 1);
    poly2 = polyfit(x, y, 2);
    poly3 = polyfit(x, y, 3);
end
```

- The input is a 32 bit uniform number. Therefore there are $2^{32}$ points. We chose to divide them into $2^8$ intervals (one particular case). Therefore there are $2^{24}$ points in each interval.
- The for loop iterates from 1st to the 28th interval.
- “x” holds the starting point and end point for each interval. Matlab function linspace generates an array of values from the start point to the end point.
- For all points in each interval, the inverse CDF is calculated. “y” holds the inverse CDF values for each interval.
- Points in each interval are curve fitted by the polyfit () function in Matlab. Linear, quadratic and cubic polynomials are used to fit the intervals.

It should be noted that in dealing with natural log function, ln (0) →∞. In case of the inverse CDF of exponential distribution the log (1-x) term tends to infinity when x tends to one. To avoid this situation the code was modified to bypass the last bit by subtracting $1/2^{32}$ from the each interval so “1” is not included in the last interval.
Another important consideration is the Mean Square error (MSE). This is used to measure the accuracy of the curve fitted polynomial.

Figure 3.3 shows the inverse CDF fitted by a linear polynomial. We can see that there is a difference between actual curve and the curve fit, which results in an error. Mean square error is calculated by the mean of square of the difference between actual and curve fit.

![Figure 3.3](image)

**Figure 3.3.** Schematic to show how polynomial fit in an interval introduces an error.

Our objective is to generate an optimum DLUT(Data look up table) size, that minimizes mean square error. When the number of intervals is increased the mean square error decreases but the DLUT size increases. When the number of intervals is decreased the error increases while DLUT size decreases. A trade off has to be achieved between the MSE and the DLUT size. Using higher order polynomials can improve the accuracy but would require more entries in the DLUT.

Figure 3.4 shows the MSE for linear, quadratic, cubic, quartic (4th order) and quintic (5th order) polynomials.
Figure 3.4. MSE vs. number of intervals ($2^n$) for linear, quadratic and cubic polynomials for inverse CDF of exponential distribution.

We observe that the MSE decreases as the number of intervals increases. As we fit higher order polynomials the mean square error does not decrease significantly. We can see from the plot that after certain intervals, the MSE of quadratic and cubic polynomials are very close. There is no significant decrease in MSE as we move to higher order polynomials like quartic and quintic for larger interval lengths. From this we can conclude that we cannot see considerable increase in accuracy as we fit higher order polynomials. It may be better to fit a quadratic instead of cubic for certain interval lengths as the MSE are close and quadratic has less coefficients to store in LUTs.

For our design interval length of 8 was chosen. This gives a DLUT size of 256 entries. It also gave a sufficient accuracy. For example if we want to get a MSE of about $0.5\times10^{-3}$ then we can choose a linear fit with interval size of $2^{10}$. From the Figure 3.4 we can see that even a cubic fit of interval length $2^8$ gives the same accuracy. This would reduce the
DLUT size for the same accuracy. Hence, a cubic is a better fit for this case. There may be other cases where higher order polynomials may not provide the best fit.

A similar approach can be used to implement any distribution function on hardware. We have to just change the DLUTs for the distribution in the Matlab code that generates the coefficients.

### 3.2 Hardware implementation

In this section the implementation of inverse CDF of exponential distribution on an FPGA is discussed. The design was implemented on a Virtex 5 XC5VF70T-2FFG1136 FPGA and verified in Modelsim simulator. The exponential distribution module was interfaced with a true random number generator to produce an exponentially distributed random number generator.

#### 3.2.1 Hardware implementation of inverse CDF for generating exponential numbers

The DLUTs generated in Matlab are in real format. These variables of type “real” can be simulated in Modelsim but is not synthesizable on Xilinx ISE. They have to be converted to integer or floating point form for synthesis. A VHDL code was written to convert all the real LUTs to integer LUTs that can be used in Xilinx.

Here is a part of the VHDL code written to convert the LUT contents of type “real” to type “integer”. Here linA has the coefficient of linear fit from Matlab. These are converted to floating point by the to_float function. The floating point numbers are converted to “std_logic_vector” type and then to type “integer” as there is no direct conversion function from “float” to “integer”. The integer coefficients are stored in linAINT. These can be stored in DLUTs and are synthesizable. The DLUT (Data Look Up Table) values are stored in integer format in the design. They can also be stored in floating point format. Conversion of the DLUT numbers to floating point using the to_float function can be avoided if the DLUT numbers are already stored in float format.
The design was implemented using floating point DLUTs and there was no significant decrease in the logic utilization by the design. Here is a part of the code which implements the above design:

```
process(clk)
begin
  if clk='1' and clk' event then
    for index in 0 to 255 loop
      linAfloat(index)<=To_float(linA(index)); -- converts from real to float
      linASTD(index)<=to_slv(linAfloat (index)); -- converts from float to std_logic_vector
      linAINT(index)<=conv_integer(linASTD (index)); std_logic_vector to integer
    end loop;
  end if;
end process;
```

Implementation of exponential distribution on hardware involves floating point operations. A floating point library created by David Bishop[19,20] was used to perform the floating point addition and multiplication operations. The to_float function used in the code above which converts a number of any type to a single precision floating point number is also part of the floating point library.

The design was implemented in VHDL. The algorithm of the design is as follows. The input is a 32 bit uniformly distributed number less than 1. The higher 8 bits of the input is used as an index to get the polynomial coefficients from the DLUT. This would give the interval that the input number falls into. The coefficients that correspond to this particular interval are used to perform the interpolation to get the inverse CDF. Thus the inverse CDF to convert a number between 0 and 1 to an exponentially distributed number is realized.

Interpolation was performed using linear, quadratic and cubic polynomials. Figures 3.5, 3.6 and 3.7 describe the design operations. The upper 8 bits of the 32 bit input number is used to get the coefficients from the data look up table. Depending on the polynomial
chosen for interpolation, the DLUT coefficients are selected. Interpolation using linear polynomial will have two DLUTs, quadratic will have three DLUTs, and cubic will have four DLUTs and so on. Interpolation is performed after the coefficient values are fetched from the DLUTs. Cubic interpolation will require the most logic followed by quadratic and then linear. Thus the higher the order of the interpolation polynomial, the more logic is needed. For example, the cubic polynomial can be represented as \[ A_3x^3 + B_3x^2 + C_3x + D_3 \] and the linear polynomial can be represented as \[ A_1x + B_1 \] where \( A_3, B_3, C_3 \) and \( D_3 \) are the cubic polynomial coefficients, \( A_1 \) and \( B_1 \) are linear polynomial coefficients and \( x \) is the 32 bit input number. All of the numbers are in floating point format.

Cubic interpolation involves finding the cube and square of the input floating point number, which requires three floating point multiplications and three floating point additions. Whereas, linear interpolation involves one floating point multiplication and one floating point addition. Thus, the higher the order of the polynomial used for curve fitting, the more logic would to be implemented and more DLUTs would be required. Thus depending on the memory available to store the DLUTs and to implement the logic, a suitable polynomial has to be selected. The accuracy needed must also be taken into consideration while selecting the polynomial order. As shown in the Matlab implementation section, there may not be significant improvement in accuracy for certain interval sizes as we move to higher order polynomials for example choosing 4\(^{th}\) order instead of a 3\(^{rd}\) order may not reduce the interpolation error significantly.

Block diagrams for the linear, quadratic, and cubic implementations are given in Figures 3.5, 3.6 and 3.7.
Figure 3.5. Block diagram representation linear interpolation.

Figure 3.6. Block diagram representation quadratic interpolation.
3.2.2 True Random Number Generator

The exponential distribution that is implemented on hardware is interfaced to a true random number generator to get an exponentially distributed random number. The true random number generator is designed by Mitchum and Klenke [14]. The input to the exponential distribution module is taken from the true random number generator output.
4. Results

The previous sections discussed selection of the optimum interval size for generating data look up tables (DLUTs) for linear, quadratic and cubic fits. To achieve this, the effect of the interval size on mean square error (MSE) for different polynomial fits was determined. Then the effect of the interval size on memory required for DLUTs for different polynomials was also studied. The tradeoff between the accuracy and memory required for DLUTs resulted in the selection of an optimum interval size and polynomial order. The hardware implementation of the exponential distribution was also discussed.

In this section, the results of the implementation of the design in the previous section (summarized above) on Virtex 5 XC5VF70T-2FFG1136 FPGA are shown. The speed achieved by the design and memory utilized for design for different interval lengths are described.

4.1 Results and verification

The inverse exponential distribution of a number is given by

\[ Y = -\ln(1-x); \]  \hspace{1cm} (4.1)

where \( x \) is the input 32 bit number and \( y \) is the 32 bit exponential output. An example of the execution of this function by the actual FPGA is shown to confirm that the implementation was correct. The FPGA output is shown below.

<table>
<thead>
<tr>
<th>Input value</th>
<th>FPGA Output</th>
<th>Actual value</th>
</tr>
</thead>
<tbody>
<tr>
<td>80000000</td>
<td>3F317346</td>
<td>3F317218</td>
</tr>
<tr>
<td>567854</td>
<td>3AAD0FC3</td>
<td>3AAD0DE4</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
The input is in the hexadecimal (hex) format. This input hex number is converted to 32 bit binary number and a decimal point is appended in the beginning of the number. Thus the input is always less than 1. The FPGA output is found by implementing the equation in hardware. The actual value gives the correct value of the output. We can see that the FPGA output is almost equal to the actual output. The difference between the FPGA output and the actual output is due to the interpolation error. The results shown above were obtained by performing linear polynomial curve fit with an interval length of 8.

E.g.: For input 800000000

- 32 Binary conversion gives 10000000000000000000000000000000
- After appending decimal point, the number becomes
  0.10000000000000000000000000000000  i.e. 0.5 in decimal
- Output = -ln(1-0.5)=0.6931471806
- Converting output to floating point, we get 3F317218. The output was converted to floating point for verification using IEEE floating point conversion [21].
- The FPGA output is slightly different (3F317346) due to small error caused by the curve fit.

4.2 Resource Utilization

The Virtex 5 XC5VFX70T is a Xilinx Power PC FPGA. It consists of about 330,000 logical cells, 1200 user I/Os and 18 Mb of block RAM [22]. FPGA resource utilization for linear, quadratic and cubic polynomial fits was studied. The design is mapped on the logical blocks and the DLUTs used in the design are mapped on to the BRAMs on the FPGA. The logic block utilization is almost the same for a given polynomial. However, the number of block RAMs used increases with the interval size. When the DLUT sizes are varied, only the BRAM utilization changes, but the logic usage remains the same.

Table 4.1 shows the FPGA device utilization for linear, quadratic and cubic polynomial for different interval sizes. Figure 4.1 shows the graphical representation for the same.
<table>
<thead>
<tr>
<th>Log (LUT size)</th>
<th>Linear</th>
<th>Quadratic</th>
<th>Cubic</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>slice registers used</td>
<td>slice LUTs used</td>
<td>Block RAMs used</td>
</tr>
<tr>
<td>6</td>
<td>265</td>
<td>4606</td>
<td>2</td>
</tr>
<tr>
<td>8</td>
<td>261</td>
<td>4708</td>
<td>2</td>
</tr>
<tr>
<td>10</td>
<td>267</td>
<td>4809</td>
<td>2</td>
</tr>
<tr>
<td>12</td>
<td>270</td>
<td>4820</td>
<td>8</td>
</tr>
<tr>
<td>14</td>
<td>273</td>
<td>4607</td>
<td>32</td>
</tr>
<tr>
<td>16</td>
<td>271</td>
<td>4811</td>
<td>122</td>
</tr>
</tbody>
</table>

**Table 4.1.** Resource utilization for various LUT sizes.
From Figure 4.1 it can be seen that the device utilization increases as we go from linear to quadratic to higher order polynomials. The block RAM size increases as we move to higher order polynomials and as the DLUT size increases. The BRAM size reaches the maximum limit of the Virtex 5 FPGA for LUT size of $2^{16}$ entries in case of quadratic and cubic fit. Thus as we fit higher order polynomials, the maximum DLUT size that can be used in the design decreases. Similarly, for a smaller DLUT size, higher order polynomial fits can be used to fit the given data.

The logic blocks utilized by the design for the exponential distribution module are significantly less than the available logic resources on the FPGA. However, as mentioned earlier the DLUT limits are reached very fast (For $2^{16}$ entries). The design is easily fit in the FPGA for all the linear, quadratic and cubic polynomials. The cubic fit occupies only 24% of the available logic. Thus higher order polynomials can be used if necessary to get better accuracy as there are lot of logic available unused. However, we will have to use...
smaller DLUTs to stay within the BRAM memory limit. Virtex 5 FPGA has 148 BRAMs of 36Kb each. They can be used as one 36KB block or two 18 KB blocks. The maximum limit is reached for BRAM utilization as the data look up table (DLUT) size increases. The BRAM can hold a total of 5,328Kb of data. Thus, other FPGAs like Virtex6 can be used if more BRAM is needed. The BRAMs needed for a DLUT size of $2^{16}$ for quadratic and cubic is mapped to be 192 and 242 respectively, which exceeds the available BRAM size of 148. This results in an “error” message as shown in the last row of Table 4.1. Thus another FPGA with higher BRAM memory can be used for these implementations.

From the data obtained in Table 4.1 by running the exponential distribution VHDL code for different interval sizes and different polynomial fits, the BRAM utilization for higher order polynomials for different interval lengths was predicted. These values are shown in Table 4.2.

Mean Square Error (MSE) was calculated in Matlab for different DLUT sizes and for various polynomial fits from linear to 8th order polynomial fit. This is shown in Table 4.3.

<table>
<thead>
<tr>
<th>Poly Order</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
</tr>
</thead>
<tbody>
<tr>
<td>Log (LUT size)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
</tr>
<tr>
<td>12</td>
<td>8</td>
<td>11</td>
<td>15</td>
<td>19</td>
<td>22</td>
<td>26</td>
<td>30</td>
<td>33</td>
</tr>
<tr>
<td>14</td>
<td>32</td>
<td>48</td>
<td>64</td>
<td>81</td>
<td>96</td>
<td>110</td>
<td>126</td>
<td>141</td>
</tr>
<tr>
<td>16</td>
<td>122</td>
<td>192</td>
<td>242</td>
<td>300</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

*Table 4.2*. Predicted BRAM usage for different LUT sizes for different polynomials

BRAMS usage shown bold was obtained by implementing the design on Virtex 5 FPGA. BRAMS usage shown in italics was predicted based on trends.
<table>
<thead>
<tr>
<th>Log (LUT size)</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mean square Error</td>
<td>Mean square Error</td>
<td>Mean square Error</td>
<td>Mean square Error</td>
<td>Mean square Error</td>
<td>Mean square Error</td>
<td>Mean square Error</td>
<td>Mean square Error</td>
<td></td>
</tr>
<tr>
<td>10</td>
<td>2.4447 e-004</td>
<td>1.0849 e-004</td>
<td>6.1054 e-005</td>
<td>6.1010 e-005</td>
<td>3.9081 e-005</td>
<td>3.9049 e-005</td>
<td>3.9046 e-005</td>
<td>3.9030 e-005</td>
</tr>
<tr>
<td>12</td>
<td>6.1103 e-005</td>
<td>2.7110 e-005</td>
<td>2.0215 e-005</td>
<td>1.5243 e-005</td>
<td>1.5240 e-005</td>
<td>1.5164 e-005</td>
<td>1.4669 e-005</td>
<td>1.4133 e-005</td>
</tr>
</tbody>
</table>

**Table 4.3**. Mean Square Error for different LUT sizes for different polynomials.

The results obtained in the design report of FPGA implementation agree with the Matlab simulations results described in the implementation section.

From Table 4.3 we can see that the MSE decreases as we move to higher order polynomials. But after a certain point, there is no significant decrease in error. For example, for a LUT size of $2^{12}$, there is a large decrease in MSE as we move from linear fit to quadratic and then cubic. But the MSE remains almost the same after quartic (4th order) polynomial fit. Instead of using a higher order polynomial fit like 7th or 8th order fit for a LUT size of $2^{12}$, we can use 4th order polynomial fit which gives almost the same MSE.
Figure 4.2 shows the variation of Mean Square Error (MSE) for different BRAM sizes for linear, quadratic and cubic polynomials. We can see that as the BRAM size increases, the MSE decreases drastically.

Figure 4.3 shows the variation of mean square error for different interval lengths from $2^0$ to $2^{16}$ and for polynomial fits from 1st order to 12th order.
From Figure 4.3 it can be seen that as the interval length increases, many higher order polynomial fits after cubic have almost similar MSEs. The MSEs after 10th order are almost the same. Thus as the order of the polynomial fit is increased there is no significant improvement in MSE and it is thus better to use a lower order polynomial fit with a similar MSE. This way the number of logical blocks utilized will be smaller making the implementation more efficient. The MSE decreases drastically as we use bigger DLUTs, but the BRAMs size increases. From Figure 4.3 we can see that even with small DLUTs we can get better accuracy by using higher order polynomials like 3rd or 4th order for curve fitting. Thus depending on the availability of BRAMs and logic and also depending on the MSE requirement, an appropriate DLUT size and polynomial fit has to be chosen.

Figure 4.3. Variation of Mean square error for different interval lengths and different polynomial fits
Table 4.4. Device utilization for different DLUT sizes and polynomials.

In Table 4.4 it can be seen that the device utilization is similar for all DLUT sizes less than $2^{10}$. Thus DLUT sizes less than $2^{10}$ is eliminated, as better accuracy can be achieved with $2^{10}$ DLUT than with smaller DLUTs, for the same device utilization. For a DLUT size of $2^{14}$ the MSE of quadratic is almost similar to cubic. Thus it is better to use quadratic as has less device utilization. The quadratic and cubic polynomials for a DLUT size of $2^{16}$ cannot be used as the BRAM utilization for these designs exceed the available BRAM on the FPGA. Thus depending on the resources available and the MSE needed, the DLUT size and the polynomial fit can be chosen for the design from the available choices.

### 4.2.1 Timing analysis

The exponential distribution was successfully implemented on the FPGA and the results were verified. The speed the design had to be calculated to know how many exponentially distributed numbers can be generated each second. A timer was initiated in Xilinx Platform Studio (XPS) to count the number clock pulses taken to get the output exponential number. The software part of the design was done in Xilinx SDK. A C code was written to start the timer when input is ready and stop the timer when the output is generated.
A FIFO was created using Xilinx coregen which holds input numbers. The exponent of the number that was input was calculated and displayed as output on the Hyper terminal. The clock pulse count was stored in timer register which was read and thus the timing was verified.

Number of clock pulses taken to find the exponent of numbers was monitored. The FIFO size was $2^{16}$. Therefore, $2^{16}$ numbers were given as input to the exponential module. The timer register displayed a value of 65,722. A system clock of 100 MHz was used. The following calculations were performed to determine the number of exponential numbers that are generated every second.

$$((100\text{MHz})^{-1} \times 65,722)^{-1} \times 2^{16} = 99,716,989 \text{ numbers per second}$$

The results shows that one exponential number is generated every clock pulse. This implies that almost 100 million exponential numbers can be generated per second.

The static timing report in the Xilinx ISE showed the maximum frequency of operation of 50-60 MHz for linear, quadratic and cubic polynomial fits. But the exponential module was run at a 100MHz clock successfully. The static timing report takes the longest path to calculate the maximum frequency of operation and this path may be unused in the design. Therefore, the exponential distribution generator was able to run at 100MHz. The timer register values also showed that 100 million random numbers were generated every second.

The floating point library that is used in the design is not pipelined. Using a more efficient pipelined floating point unit would increase the frequency in the timing report. The Xilinx Core generator’s floating point adder and multiplier was used in the design instead of the floating point library. The floating point library was only used to convert the input uniform number to float before performing addition or multiplication operations. It was found that the frequency of operation in the Xilinx ISE timing report went up to 91M Hz for the linear fit. The logic utilization also increased for the pipelined
Coregen floating point unit. Thus by using a pipelined floating point adder and multiplier and an efficient conversion of the input uniform random number to float, the design can be run at higher frequencies. The design implemented in this thesis uses the floating point library for all its operations.

Figure 4.5 shows the resource utilization and the frequency of operation obtained in the timing report for linear and cubic curve fits. In the first case, pipelined coregen floating point adder and multiplier were used to perform floating point addition and multiplication and float library was used to convert the input uniform number to floating point number. The DLUTs (Data Look Up Tables) were stored in float format. Hence the conversion of the polynomial coefficients in DLUTs to floating point is not needed. The design is pipelined and hence both the linear and cubic get the same frequency. In the second case, floating point library was used to convert the input to floating point and also to do addition and multiplication operations. The DLUTs were stored in integer formats. Thus the conversions of coefficients in the DLUTs have to be converted before performing the interpolation.

<table>
<thead>
<tr>
<th>Floating point unit</th>
<th>Linear</th>
<th>Cubic</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>SR</td>
<td>SLUT</td>
</tr>
<tr>
<td>Pipelined Coregen</td>
<td>2,000</td>
<td>2,066</td>
</tr>
<tr>
<td>Float library</td>
<td>238</td>
<td>3,262</td>
</tr>
</tbody>
</table>

Table 4.5. Resource utilization and frequency of operation of Pipelined Coregen floating point unit vs. floating point library

SR: Slice register
SLUT: Slice LUT
Freq: maximum Frequency of operation in Xilinx timing report
4.3 True random number interface

A true random number generator [14] was interfaced with the exponential module and the results were verified. 32 bit exponentially generated random numbers were generated by connecting the output the true random number generator to the input of the exponential module. The design was performed for linear, quadratic and cubic polynomial fits with a DLUT size of $2^{10}$. 
5. Conclusion

An efficient exponential distribution generator was implemented on hardware by dividing the inverse CDF into intervals and curve fitting each interval with linear, quadratic and cubic polynomials. This method of implementing exponential distribution on hardware can be extended to other probability distribution functions with a well defined inverse CDF function. The design has the following attributes:

*Speed/Number of exponentially distributed numbers generated per second*

The exponential module that is designed is time efficient as it produces one exponential number every clock cycle, i.e. 100 million numbers/second with a 100 MHz clock.

*Efficiency*

The design is also very efficient. In case of exponential distribution, the mean square error decreases as we fit higher order polynomials up to cubic polynomials into each interval, although this improvement may not be significant as we fit higher order polynomial than cubic.

*Tradeoff between number of intervals and polynomial order: optimization*

A thorough study of fitting different polynomials to the intervals and getting an optimum size data look up table was performed. Thus depending on the resources available and accuracy needed, the DLUT size and the polynomial fit is decided. The design was implemented on a Virtex 5 FPGA and the resources of the FPGA were well utilized.

5.1 Future work

The research performed in this thesis gives a very efficient exponential distribution generator. But there is still a lot of scope for improvement. More efficient methods to handle the data look up tables can be implemented. Instead of storing the DLUT values in the memory, different methods like changing the values of the data look up tables at run time can be investigated. This way, we can change the probability distribution we want to
generate at run time and thus get a universal probability distribution generator which is more general and widely applicable. For example certain data that behaves like an exponential distribution initially and normal distribution later can be handled by hardware using the above method of changing the DLUT values during runtime.

We can try to implement the design in other FPGAs like Virtex 6, which have more block RAM and can thus store big DLUTs. A better floating point unit that performs floating point addition and multiplication operations efficiently and with less resource utilization can be used instead of the floating point library used in this research.
6. References


**Appendix**

**Linear**

--Function: Calculates exponential of a number by curve fitting using linear polynomial
-- Input: clk - std_logic,urn -32 bit std_logic_vector
-- Output: exprand -32 bit std_logic_vector

LIBRARY IEEE;
library ieee_proposed;
use ieee_proposed.float_pkg.all;
use IEEE.Std_Logic_1164.all;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use ieee_proposed.fixed_pkg.all;

entity linear is
  port(
    clk :in std_logic;
    urn : in std_logic_vector(31 downto 0);
    exprand : out std_logic_vector (31 downto 0)
  );
end linear;

architecture design of linear is
signal ind:std_logic_vector(5 downto 0):="000000";
signal index,Xi:integer:=0;
signal exp : float32;
signal Alin,Blin,X,Xd:float32;
signal expl:float32;
constant f:float32:="00110000000000000000000101010001";
signal stda_linear,stdb_linear:std_logic_vector(31 downto 0);
signal Al:float32;
type LUT is array (0 to 63) of integer;
constant LINEARa1 :LUT := (1051438193, 1051572411, 1051706629,1051844202,
1051981775, 1052126059,1052270343,1052414627,1051438193,1051572411,
1051706629, 1051844202,1051981775,1052126059,1052270343,1052414627,
1051438193,1051572411,1051706629,1051844202,1051981775,1052126059,
1052270343,1052414627,1051438193,1051572411,1051706629,1051844202,
1051981775,1052126059,1052270343,1052414627,1051438193,1051572411,
1051706629, 1051844202,1051981775,1052126059,1052270343,1052414627,
1051438193,1051572411,1051706629,1051844202,1051981775,1052126059,
1052270343,1052414627,1051438193,1051572411,1051706629,1051844202,
1051981775,1052126059,1052270343,1052414627,1051438193,1051572411,
1051706629, 1051844202,1051981775,1052126059,1052270343,1052414627);
constant LINEARb1 :LUT :=(1051438193, 1051572411, 1051706629,1051844202,
1051981775, 1052126059,1052270343,1052414627,1051438193,1051572411,
1051706629, 1051844202,1051981775,1052126059,1052270343,1052414627,
1051438193,1051572411,1051706629,1051844202,1051981775,1052126059,
1052270343,1052414627,1051438193,1051572411,1051706629,1051844202,
1051981775,1052126059,1052270343,1052414627,1051438193,1051572411,
1051706629, 1051844202,1051981775,1052126059,1052270343,1052414627);

begin

process(clk)
begin
  if clk='1' and clk' event then
    Ind<=urn(31 downto 26);
    Index<=conv_integer(Ind);
    Xi<=conv_integer(urn(31 downto 0));
    Xd<=to_float(Xi);
    X<= Xd * f;
    stda_linear<=conv_std_logic_vector(LINEARa1(Index),32);
    stdb_linear<=conv_std_logic_vector(LINEARb1(Index),32);
    Exp<=expl;
    exprand<=to_slv(Exp);
  end if;
end process;

--Converts a and b to floating point
toflinear:process(clk)
begin
  if clk='1' and clk' event then
    Alin<=To_float(stda_linear);
    Blin<=To_float(stdb_linear);
  end if;
end process toflinear;

--Linear polynomial fit ax+b
Lin: process(clk)
begin
  if clk='1' and clk' event then
    Al<= Alin * X;
  end if;
end process Lin;

explin:process(clk)
begin
  if clk='1' and clk' event then
    expl<= Al + Blin;
  end if;
end process explin;
end design;
-- Function: Calculates exponential of a number by curve fitting using quadratic polynomial
-- Input: clk - std_logic, urn - 32 bit std_logic_vector
-- Output: exprand - 32 bit std_logic_vector

LIBRARY IEEE;
library ieee_proposed;
use ieee_proposed.float_pkg.all;
USE IEEE.Std_Logic_1164.all;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use ieee_proposed.fixed_pkg.all;

entity quad is
port(clk :in std_logic;
urn : in std_logic_vector(31 downto 0);
exprand : out std_logic_vector (31 downto 0)
);
end quad;

architecture design of quad is
signal ind:std_logic_vector(5 downto 0):="000000";
signal index,Xi:integer:=0;
signal exp : float32;
signal Aqua,Bqua,Cqua,X,Xd,Xsq,ABq: float32;
signal expq:float32;
constant f:float32:="00110000000000000000000101010001";
signal stda_quad :std_logic_vector(31downto 0);
signal stdb_quad :std_logic_vector(31downto 0);
signal stdc_quad :std_logic_vector(31downto 0);
signal Aq,Bq :float32;
signal stdexp:std_logic_vector(31 downto 0);
type LUT is array (0 to 63) of integer;
constant quada2 :LUT :=(379663292, 217424109, 557450834, 984198983, 870656109,
521483550, 517574259, 941418453, 706227913, 319521545, 412345498, 986391343,
334652128, 833511935, 435739226, 962474219, 813280190, 696487943, 464112399, 306213272,
719818415, 239194206, 346640482, 71970826, 608060172, 103701628,
771529494, 799659653, 984077195, 920412516, 268415954, 539803676, 771747174,
623179087, 825648194, 55913812, 619520027, 830043195, 208033776, 845628895,
671874810, 603660393, 710539471, 314444421, 475378989, 80127875, 133820893,
120324693, 76010575, 478285937, 278050042, 834087114, 617109346, 728459085,
873597457, 389275420, 587303300, 594957368, 769990562, 85133768, 339352445,
412813635, 479459311, 860062059);
constant quadb2 :LUT := (121415088 , 876323916, 937253381, 787972678, 49196488,
491548451, 211578144, 549758566, 235474273, 893442683, 396331566, 351517566, 578149530,
378562418, 715679974, 732341334, 479861249, 855946707, 218110383, 472371576,
661612092, 834545553, 687540855, 335799729 , 928497794, 255354729, 983216576, 855785575,
60755367, 701539471, 314444421, 475378989, 80127875, 133820893,
120324693, 76010575, 478285937, 278050042, 834087114, 617109346, 728459085,
873597457, 389275420, 587303300, 594957368, 769990562, 85133768, 339352445,
412813635, 479459311, 860062059);
constant quadc2 :LUT := (116533356, 685389924, 588238150, 917946112, 802764422, 486309328,
171037299, 109562825, 298733345, 134901071, 515036947,223590236, 33624974, 274192888,
557952513, 390516554, 313987946, 944639288, 186290869, 92085636, 48972236,
79665943, 250775239, 96080636, 63773320, 264533359, 782670867, 313673103, 712652608,
453599047, 463071040, 217878626, 393015413, 97091877, 772021279, 905223561,
368020715, 304002596, 759476717, 114782895, 738470734, 921290860, 448683070, 347114952, 673705094, 949198404, 949284024,
begin

process(clk)
begin
  if clk='1' and clk' event then
    ind<=urn(31 downto 26);
    index<=conv_integer(ind);
    Xi<=conv_integer(urn(31 downto 1));
    Xd<=to_float(Xi);
    X<= Xd *f;

    stda_quad<=conv_std_logic_vector(quada2(index),32);
    stdb_quad<=conv_std_logic_vector(quadb2(index),32);
    stdc_quad<=conv_std_logic_vector(quadc2(index),32);
    exp<=expq;

    exprand<=to_slv(exp);
  end if;
end process;

----Converts a,b and c to floating point

tofquad: process(clk)
begin
  if clk='1' and clk' event then
    Aqua<=To_float(stda_quad);
    Bqua<=To_float(stdb_quad);
    Cqua<=To_float(stdc_quad);
  end if;
en
end process tofquad;

-- A*x^x

Quad1: process(clk)
begin
  if clk='1' and clk' event then
    Aq<= Aqua*Xsq;
  end if;
end process Quad1 ;

--B*x

Quad2: process(clk)
begin
  if clk='1' and clk' event then
    Bq<=Bqua*X;
  end if;
end process Quad2 ;

--A*x^x+B*x

Aqadd: process(clk)
begin
  if clk='1' and clk' event then
    ABq<=Aq+Bq;
  end if;
end process Aqadd;

--Ax*x+B*x+C

expquad: process(clk)
begin
  if clk='1' and clk' event then
    expq<= ABq+Cqua;
  end if;
end process expquad;

--x^x

Xsquare: process(clk)
begin
  if clk='1' and clk' event then
    Xsq<=X*X;
  end if;
end process Xsquare;

end

end if;
end process Xsquare;

end design;
Cubic

LIBRARY IEEE;
library ieee_proposed;
use ieee_proposed.float_pkg.all;
USE IEEE.Std_Logic_1164.all;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use ieee_proposed.fixed_pkg.all;

entity cubic is
port(
  clk :in std_logic;
  urn : in std_logic_vector(31 downto 0);
  exprand : out std_logic_vector (31 downto 0)
);
end cubic;

architecture design of cubic is
signal ind:std_logic_vector(5 downto 0):="000000";
signal index,Xi:integer:=0;
signal exp : float32;
signal Acu,Bcu,Ccu,Dcu: float32;
signal  X,Xd,Xsq,Xcube,ABq,ABc,ABCc: float32;
signal expl,expq,expc:float32;
constant f:float32:="00110000000000000000000101010001";
signal stda_cubic,stdb_cubic,stdc_cubic,stitial_cubic:std_logic_vector(31 downto 0);
signal Al,Aq,Bq,Ac,Bc,Cc :float32;
signal stdexp:std_logic_vector(31 downto 0);
type LUT is array (0 to 63) of integer;
constant cubica3 :LUT := (379663292, 217424109, 557450834, 984198893, 870656109,
521483550, 517574259, 941418453, 706227913, 319552454, 102354598, 986391343,
334652128, 833511935, 435739226, 96487943, 464112399,
306213272, 719814815, 239194206, 346640482, 71970826, 508006172, 103701628,
771747174, 263179087, 825648194, 55913812, 31441421, 475378998, 80127875,
133820893, 120324693, 776010575, 478285937, 608006172, 703303000, 994957368,
769990562, 85133768, 339352445, 412813635, 479459311, 860062059);
constant cubicb3 :LUT := (121415088,876323916, 937253381, 787972678, 491496488,
491548451, 211578144, 549758553, 687540855, 335799729, 928947794, 255354729,
895321656, 807557367, 230535176, 887455705, 603568373, 6654410,
315853674, 397014267, 20010657, 968826781, 81286366, 147223167, 64566974,
81860266, 397014267, 204749175, 30451201, 777464834, 487484221, 45571940,
405885702, 781029560, 476883415, 752281163, 475985463, 479095154, 290316292,
839241273, 828608954, 631612334, 48649999, 506442288, 40292210, 323987124,
957742691, 54475232, 929801280);
constant cubicc3 :LUT := (116533356, 685389924, 588238105, 917946112, 802764422,
486309328, 71037299, 109506285, 298733345, 134091071, 535036947, 223590236,
33662498, 274919288, 557952511, 601695545, 313887946, 944638288, 186290869,
920806536, 48972236, 796665943, 250775239, 96080636, 63733320, 264533359, 782670867,
331673103, 712652608, 453590047, 46071040, 271878626, 390315413, 790971877, 772021279,
90522361, 368027015, 30488417, 759467167, 114782895,
738470734, 921290860, 448636307, 347114952, 673705094, 941998404, 949284024,
89586837, 365771981, 271051071, 620423888, 961468376, 161029073, 46284332,
340057507, 863919883, 295979126, 259954604, 88450728, 60233253, 4968773,
121532074, 281923731, 85720864);
constant cubicd3 :LUT := (130710016, 993703648, 621858027, 225919470, 832077741,
186164517, 525982250, 770532095, 939741118, 132138646, 315853674, 923454197,
begin

process(clk)
begin
  if clk='1' and clk' event then
    ind<=urn(31 downto 26);
    index<=conv_integer(ind);
    Xi<=conv_integer(urn(31 downto 1));
    Xd<=to_float(Xi);
    X<= Xd *f;
    stda_cubic<=conv_std_logic_vector(cubica3(index),32);
    stdb_cubic<=conv_std_logic_vector(cubicb3(index),32);
    stdc_cubic<=conv_std_logic_vector(cubicc3(index),32);
    stdd_cubic<=conv_std_logic_vector(cubicd3(index),32);
    exp<=expc;
    exprand<=to_slv(exp);
  end if;
end process;

--a,b,c and d to floating point
tofcubic:process(clk)
begin
  if clk='1' and clk' event then
    Acu<=To_float(stda_cubic);
    Bcu<=To_float(stdb_cubic);
    Ccu<=To_float(stdc_cubic);
    Dcu<=To_float(stdd_cubic);
  end if;
end process tofcubic;

--a\cdot x^3 + b\cdot x^2
Acadd: process(clk)
begin
  if clk='1' and clk' event then
    ABc<=Ac+Bc;
  end if;
end process Acadd;
-- a*x^3 +b*x^2 + c*x
ABcadd: process(clk)
begin
  if clk='1' and clk' event then
    ABCc<=ABc+Cc;
  end if;
end process ABcadd;

-- a*x^3 +b*x^2 + c*x +d
expcubic:process(clk)
begin
  if clk='1' and clk' event then
    expc<= ABCc+Dcu;
  end if;
end process expcubic;

-- x*x
Xsquare: process(clk)
begin
  if clk='1' and clk' event then
    Xsq<=X*X;
  end if;
end process Xsquare;

-- x*x*x
Xcu: process(clk)
begin
  if clk='1' and clk' event then
    Xcube<= X*Xsq;
  end if;
end process Xcu;

end design;
Linear interpolation using Coregen Floating point adder and multiplier
(uses float library to convert input to floating point)

LIBRARY IEEE;
library ieee_proposed;
use ieee_proposed.float_pkg.all;
USE IEEE.Std_Logic_1164.all;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use ieee_proposed.fixed_pkg.all;
Library XilinxCoreLib;

entity linear is
port(clk :in std_logic;
urn : in std_logic_vector(31 downto 0);
exprand : out std_logic_vector (31 downto 0));
end linear;

architecture design of linear is
signal ind:std_logic_vector(7 downto 0):="00000000";
signal index,Xi:integer:=0;
signal exp : float32;
signal Alin,Blin,X,Xd:float32:="00000000000000000000000000000000";
constant f1:std_logic_vector:="00110000000000000000000101010001";
constant LINEARa1 :LUT := (x"3F804189", x"3F80C155", x"3F814467", x"3F81C77A",
 x"3F824A8C", x"3F82D0E5", x"3F83573F", x"3F83DE98", x"3F8463F1",
 x"3F84ED91", x"3F857A78", x"3F860419", x"3F869100", x"3F87212D", x"3F87AE14",
 x"3F883E42", x"3F88DADB", x"3F88F8A1", x"3F89F85C", x"3F8B2DD1", x"3F8BB9C3",
 x"3F8C56D6", x"3F8CF0D8", x"3F8E2824", x"3F8E8CB4", x"3F8F6944",
 x"3F9009D5", x"3F90A965", x"3F915183", x"3F91F55A", x"3F92C786",
 x"3F9346DC", x"3F93F141", x"3F949B6A", x"3F954952", x"3F95F5FD",
 x"3F966770", x"3F9758E2", x"3F9809D1", x"3F98C49C", x"3F997C1C",
 x"3F9A339C", x"3F9AE663", x"3F9BAC71", x"3F9C6A7F", x"3F9D288D",
 x"3F9DE9E2", x"3F9F97319", x"3FA03A98", x"3FA0D14E", x"3FA29FBE",
 x"3FA36E2F", x"3FA43F6E", x"3FA515E4", x"3FA5E9E2", x"3FA62C27",
 x"3FA79DB2", x"3FA879E", x"3FA98510", x"3FAA5A2A", x"3FAB1C43",
 x"3FAC01A3", x"3FACEA4B", x"3FADD639", x"3FAEC227", x"3FAEBA42",
 x"3FB0A71E", x"3FB1999A", x"3FB292A3", x"3FB38BAC", x"3FB48B44",
 x"3FB58ADB", x"3FB68DB9", x"3FB793DE", x"3FB89D49", x"3FB9C9EF",
 x"3FBC0D2F", x"3FBDF6FD", x"3FBF10CB", x"3FC03127", x"3FC15183",
 x"3FC2786C", x"3FC3A29C", x"3FC4CCCD", x"3FC5FDB8", x"3FC73190",
 x"3FC86C22", x"3FC9A6B5", x"3FCDB0A4", x"3FCED999", x"3FDD0D1B",
 x"3FDD16B5", x"3FDB29F5", x"3FDF41893", x"3FDF57A78", x"3FDF6DA4",
 x"3FDF9B2EC", x"3FDB2FEC", x"3FDC98C1", x"3FDE2824",
 x"3FDFAC5C", x"3FE1374C", x"3FE2C3C3", x"3FE45A1D", x"3FE5F3B6",
 x"3FE793DE", x"3FE99A93", x"3FEAE48F", x"3FEC985F", x"3FEE4F76",
 x"3FEE0162", x"3FF1D495", x"3FF74F0E", x"3FF930BE", x"3FFB18FC",
 x"3FFD07C8", x"3FFF0069", x"40007FC7", x"4001844D", x"40028C15",
 x"40039C8E", x"4004AA65", x"4005F48", x"4006D917", x"4007F7CF",
 x"40091871", x"400A425B", x"400B6FD2", x"400CA234", x"400DDB23",
 x"400F7559", x"4010A596", x"401A36E", x"4012F1AA", x"40144674",
 x"4015A027", x"4017020C", x"401868DC", x"4019D7DC", x"401BB4DA6",
 x"401CC96E", x"401E4C30", x"401FD8F1", x"4021679D", x"402306E7",
 x"4024AA65", x"40265604", x"40280B78", x"4029C91D", x"402BBEF3",
 x"402BD6042", 65
begin

--Coregen multiply
coremul2:ENTITY floating_point_v3_1(floating_point_v3_1_a)
port map(
    a => Xds,
    b => f1,
    clk => clk,
    result => Xs
);

coremul1:ENTITY floating_point_v3_1(floating_point_v3_1_a)
port map(
    a => Alins,
    b => Xs,
    clk => clk,
    result => Als
);

--Coregen add
coreadd:ENTITY floating_point_v3_0(floating_point_v3_0_a)
port map(
    a => Als,
    b => Blins,
    clk => clk,
    result => exprand
);

process(clk)
begin
    if clk='1' and clk' event then
        ind<=urn(31 downto 24);
        index<=conv_integer(ind);
        Xi<=conv_integer(urn(31 downto 0));
        Xd<=>to_float(Xi);
        Xds<=>to_slv(Xd);
    end if;
end process;

--Converts a and b to floating point
toflinear:process(clk)
begin
    if clk='1' and clk' event then
        Alins<=>LINEARa1(index);
        Blins<=>LINEARb1(index);
    end if;
end process tooflinear;

end design;
Modelsim code for converting real to std_logic;
(The code below converts one of the linear polynomial co-efficient from real to float integer format)

LIBRARY IEEE;
USE work.all;
library ieee_proposed;
use ieee_proposed.float_pkg.all;
USE IEEE.Std_Logic_1164.all;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use ieee_proposed.math_utility_pkg.all;
use ieee_proposed.fixed_pkg.all;
USE ieee_proposed.float_pkg.all;
USE ieee.MATH_REAL.ALL;

entity rand is
port(clk :in std_logic;
    urn : in std_logic_vector(31 downto 0); -- input 32 bit random number from LFSR
    fit :in integer range 0 to 2; -- integer to indicate linear,quadtratic or cubic
    ans :out std_logic_vector(31 downto 0)); -- exponentially generated random number
end rand;

architecture design of rand is

signal ind:std_logic_vector(7 downto 0):="00000000";
signal index,Xi,i:integer:=0;
signal upper,A,B,C,D:real;
signal A1,B1,exp,X:float32;

constant LINEARa1 :LUT := ( 1.0020, 1.0059, 1.0099, 1.0139, 1.0179, 1.0220, 1.0261,
1.0302, 1.0343,1.0385, 1.0428, 1.0470, 1.0513, 1.0557, 1.0600, 1.0644, 1.0689, 1.0734,
1.0779, 1.0825, 1.0870, 1.0917, 1.0964, 1.1011, 1.1058, 1.1106, 1.1155, 1.1204, 1.1253,
1.1302, 1.1353, 1.1403, 1.1454, 1.1506, 1.1558, 1.1610,1.1663, 1.1716, 1.1770, 1.1824,
1.1879, 1.1935, 1.1991, 1.2047, 1.2104,1.2162, 1.2220, 1.2278, 1.2337, 1.2397, 1.2457,
1.2518, 1.2580, 1.2642,1.2705, 1.2768, 1.2832, 1.2897, 1.2962,1.3028, 1.3095, 1.3162,
1.3230,1.3299, 1.3368, 1.3438, 1.3509, 1.3581, 1.3653, 1.3727, 1.3801, 1.3875,1.3951,
1.4027, 1.4105, 1.4183, 1.4262, 1.4342, 1.4423, 1.4504, 1.4587,1.4671, 1.4755, 1.4841,
1.4927,1.5013, 1.5103, 1.5193, 1.5284, 1.5375,1.5468, 1.5562, 1.5658, 1.5754, 1.5851,
1.5950, 1.6050, 1.6151, 1.6254,1.6358, 1.6463, 1.6570, 1.6678, 1.6787, 1.6898, 1.7010,
1.7124, 1.7239,1.7356, 1.7474, 1.7595, 1.7716, 1.7840, 1.7965, 1.8092, 1.8221, 1.8351,
1.8484, 1.8618, 1.8755, 1.8893, 1.9034, 1.9176, 1.9321, 1.9468, 1.9617,1.9768, 1.9922,
2.0078, 2.0237, 2.0398, 2.0562, 2.0729, 2.0898, 2.1070,2.1245, 2.1423, 2.1603, 2.1787,
2.1974, 2.2165, 2.2358, 2.2555, 2.2756,2.2960, 2.3168, 2.3379, 2.3595, 2.3814, 2.4038,
2.4266, 2.4498, 2.4734, 2.4976, 2.5222, 2.5473, 2.5729, 2.5990, 2.6257, 2.6529, 2.6806,
begin
process(clk)
begin
if clk='1' and clk' event then
for index in 0 to 255 loop
LINEARb1f(index)<=To_float(LINEARb1(index));
LINEARb1std(index)<=to_slv(LINEARb1f(index));
LINEARb1int(index)<=conv_integer(LINEARb1std(index));
LINEARb1revstd(index)<=conv_std_logic_vector(LINEARb1int(index),32);
LINEARb1rev(index)<=To_float(LINEARb1revstd(index));
end loop;
end if;
end process;
end design;

Matlab

Matlab code used to get polynomial co-efficients

```
n_int_bits=8;
n_int=2^n_int_bits; % number of intervals
n_points_bits=32;
n_points=2^n_points_bits; % Number of bits in the given number
factor=1/2^1;
n_points_intv=factor*n_points/n_int; % Each interval has 2^32/2^8 points

for j=1:n_int;
    x=linspace ((j-1)/n_int,(j/n_int)-(1/n_points_intv),n_points_intv);
    lamda=1;
    y=-log (1 - x) /lamda; %inverse of exponent cdf

    poly1=polyfit(x,y,1); %linear fit
    A1(j)=poly1(1);
    B1(j)=poly1(2);

    poly2=polyfit(x,y,2); %quadratic fit
    A2(j)=poly2(1);
    B2(j)=poly2(2);
    C2(j)=poly2(3);

    poly3=polyfit(x,y,3); %quadratic fit
    A3(j)=poly3(1);
    B3(j)=poly3(2);
    C3(j)=poly3(3);
    D3(j)=poly3(4);
end
```

70
Matlab code used to calculate Mean Square error

```matlab
% polynomial co-efficients

N = 20000000;
x = 0:1:N-1;
x = x./N;
y = -log(1-x);

n_order = 12;
n_intervals = 16;
mse = zeros(n_order,n_intervals+1);

order = 8
interval_pow = 12
n_int = 2^interval_pow;

testy = zeros(1,N);
for j = 0:n_int-1
    lowerindex = floor(N*(j)/n_int+1);
    upperindex = floor(N*(j+1)/n_int);
    poly1 = polyfit(x(lowerindex:upperindex),y(lowerindex:upperindex),order);
    testy(lowerindex:upperindex) = polyval(poly1,x(lowerindex:upperindex));
end

mse(order,interval_pow+1) = sum((y-testy).^2) / N;
```