This is a repository copy of *A Novel Multi-objective Optimisation Algorithm for Routability and Timing Driven Circuit Clustering on FPGAs*.

White Rose Research Online URL for this paper:
http://eprints.whiterose.ac.uk/138402/

Version: Accepted Version

**Article:**
Wang, Yuan, Trefzer, Martin Albrecht orcid.org/0000-0002-6196-6832, Bale, Simon Jonathan et al. (2 more authors) (2018) A Novel Multi-objective Optimisation Algorithm for Routability and Timing Driven Circuit Clustering on FPGAs. IET Computers and Digital Techniques. CDT-2018-5115. ISSN 1751-861X

---

A Novel Multi-objective Optimisation Algorithm for Routability and Timing Driven Circuit Clustering on FPGAs

Yuan Wang, Martin A. Trefzer, Simon J. Bale, James A. Walker, Andy M. Tyrrell

Abstract:
Circuit clustering algorithms fit synthesised circuits into FPGA configurable logic blocks (CLBs) efficiently. This fundamental process in the FPGA CAD flow directly impacts both the effort required and the performance achievable in subsequent place-and-route processes. Circuit clustering is limited by hardware constraints of specific target architectures. Hence, better circuit clustering approaches are essential for improving device utilisation whilst at the same time optimising circuit performance parameters such as power and delay. In this paper, we present a method based on a multi-objective genetic algorithm (MOGA) to facilitate circuit clustering. We address a number of challenges including CLB input bandwidth constraints, improvement of CLB utilisation and minimisation of interconnects between CLBs. Our new approach has been validated using the "Golden 20" MCNC benchmark circuits that are regularly used in FPGA-related literature. The results show that the method proposed in this paper achieves improvements of up to 50% in clustering, routability and timing when compared to state-of-the-art approaches including VPack, T-VPack, RPack, DPack, HDPack, MOPack and iRAC. The key contribution of this work is a flexible EDA flow that can incorporate numerous objectives required to successfully tackle real-world circuit design on FPGAs, providing device utilisation at increased design performance.

1 Introduction

Field Programmable Gate Arrays (FPGAs) have developed significantly over the past 20+ years and are the number one choice for prototyping complex digital designs. Their flexibility moves them into many application areas such as reconfigurable computing, evolvable hardware and fault-tolerant systems. However, this flexibility puts limitations on the maximum achievable design speed and area on an FPGA. The resource utilisation is thereby not constrained solely by the hardware structure, but also depends significantly on the application mapping process. The computer-aided design (CAD) flow for implementing digital circuits on an FPGA consists of synthesising a design into gate-level netlists whose components are mapped onto basic logic elements of the fabric. These are then clustered into higher-level configurable logic blocks (CLBs). Hence, this process is known as circuit clustering [1–3]. Separate placement and routing stages then attempt to map clustered functions onto the fabric and connect these blocks to form the circuit for a given application. Clustering is therefore a fundamental process in the CAD flow that is linked to the architecture of a specific FPGA and is therefore possibly limited by a variety of hardware constraints. Hence, better circuit clustering approaches are essential to the successful and effective use of FPGAs.

Clustering a large netlist, the synthesised circuit, into groups is a non-trivial task when considering all clustered circuit properties. The problem becomes more complex as the inherent hardware constraints are considered as well. When a group of logic elements has been selected and clustered into a logic block, the circuit properties that have to be optimised are often conflicting, and it is usually a non-trivial problem to optimally balance multiple clustering objectives. This cannot be efficiently addressed by simply weighting and accumulating them into a single performance metric or figure of merit. Circuit clustering is by its very nature a complex multi-objective optimisation problem. In this paper, we therefore propose a multi-objective circuit clustering technique to solve and optimise the circuit-clustering stage in CAD for FPGA design.

The method proposed in this paper is divided into two steps, each using multi-objective genetic algorithms (MOGAs): the first generates initial solutions using predictive metrics, and the second generates specific optimised solutions. Both steps consider five optimisation objectives using non-dominated sorting and crowding-distance selection based on NSGA-II [4]. The MOGAs described in this paper produce multiple unique solutions, which are based on Pareto optimality. This method gives maximum flexibility and makes it possible to add additional clustering metrics later without changing the core algorithm or invalidating solutions found already. The MCNC-20 benchmark suite is used to test and validate the proposed method. We compare the experimental results to other state-of-the-art FPGA circuit clustering methods: VPack [1], T-VPack [5], RPack [6], DPack, HDPack [7], MOPack [8] and iRAC [3]. The improvements achieved with the proposed method are up to 4.33% in reducing the number of logic blocks and up to 14.24% in reducing interconnect when compared to iRAC, the best-performing circuit clustering method which also enhances both FPGA area usage and routability. The most important result of our method is that optimised solutions achieve a speed-up of mapped circuits by up to 27.62% compared to T-VPack and outperform other well-known methods.

2 Circuit Clustering in FPGA Design Flow

This section provides an overview of the FPGA architecture and state-of-the-art FPGA design flow. In order to implement a complete design onto FPGA, many design objectives have to be considered by automated CAD tools. Fig. 1(a) shows a cluster-based FPGA CAD flow.

2.1 FPGA Model

A cluster-based island-style FPGA is widely used, and its logic and routing blocks are arranged in a 2D mesh [9]. To lower the routing difficulties and improve application performance, a logic block, also known as configurable logic block (CLB), usually contains $N$ basic logic elements (BLEs) and internal routing resources [10]. The BLE is the smallest configurable logic element and includes a $k$-input look-up table (LUT) and a reconfigurable flip-flop (FF). An important feature present in many Altera FPGAs [11, 12] is the input-bandwidth constraint [13]. This means that the total number of inputs, $I$, of a CLB is less than $N \times k$, due to resource limitations. Though this constraint does not exist in all modern FPGAs, it represents an additional issue and has been the subject of research in circuit clustering. Additionally, the FPGA CAD research tool VPR [14] used here to facilitate design mapping is also based on an input-bandwidth-constraint CLB model. In most FPGA circuit clustering literature, $N$, $I$, and $k$ are normally set to 8, 18, and 4 respectively, and CLBs are assumed to have a unique clock [15]. These assumptions are also used in this paper.

The routing architecture is based on an $X$ row by $Y$ column array of CLBs. Each CLB is connected to the wire segments of the routing channel via the input and output connection blocks [16]. The connectivity of the input and output connection blocks is defined by two parameters, $F_{in}$ and $F_{out}$: the fraction of wire segments in the channel (of pre-defined channel width $W$) to which each input or output of the CLB can connect [9]. The switch block contains a set of programmable routing switches, and is positioned between CLBs. Switch block flexibility is defined by the parameter $F_s$, representing the number of possible connections that a wire segment can make to other wire segments.

2.2 Circuit Clustering for FPGAs

Circuit clustering, also known as circuit packing, is a process to partition a synthesised circuit into sub circuits to enable mapping them to FPGA without breaking any CLB hardware constraints. Circuit clustering also indicates how to best group (pair) the BLEs based on their LUT and FF connections. Fig. 1(b) illustrates this process.

Circuit clustering is a fundamental process in the CAD flow, and the quality of clustering can significantly impact subsequent placement and routing processes which then directly affect the performance of the circuit. It can be significantly more problematic—or even impossible—to optimise a circuit’s performance in subsequent steps of the CAD flow if clustering is ineffective. When clustering BLEs into CLBs, even if two solutions have the same number of clustered CLBs, their BLE combinations within the CLBs can be different. Therefore, circuit clustering is a complex grouping problem similar to multi-objective bin packing—a well-known NP-hard problem [17], but with additional constraints and requirements.

2.3 Multi-objective Problem Formulation

The basic requirements of routability-driven circuit clustering are twofold. Firstly, all BLEs must be clustered into CLB resources while minimising the number of CLBs. Secondly, external CLB interconnects must be reduced by including as many connections as possible within the CLBs. Fig. 1(c) illustrates this and explains how fewer external CLB interconnects facilitate routing [6, 15].

Under the assumption that $I$ is the CLB input number, $N$ specifies the BLE number within the CLB and each has one clock, the circuit clustering problem can be formulated as a set of BLEs: $B = \{b_1, b_2, ..., b_m\}$ representing a synthesised circuit, and a set of empty CLBs representing the FPGA: $C = \{c_1, c_2, ..., c_m\}$. When clustering BLEs into CLBs, the following conditions have to be met:

$$\text{INPUT}(c_i) \leq I \quad i = 1, 2, \ldots, m \quad (1)$$

$$\text{BLE}(c_i) \leq N \quad i = 1, 2, \ldots, m \quad (2)$$

$$c_i \cap c_j = \emptyset \quad i, j = 1, 2, \ldots, m, \ i \neq j \quad (3)$$

$$\sum_{i=1}^{m} \text{BLE}(c_i) = |B| \quad (4)$$
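The four conditions above can be checked mechanically for any candidate clustering. The following is a minimal sketch, not the authors' implementation: the `BLE` record, the net-name representation and the rule that a CLB input is any net read inside the CLB but not driven inside it are assumptions made for illustration, and the clock is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BLE:
    """Hypothetical BLE record: the nets it reads and the net it drives."""
    name: str
    inputs: frozenset
    output: str


def clb_inputs(clb):
    """Distinct nets entering a CLB: read by some BLE inside, driven by none."""
    driven = {b.output for b in clb}
    used = set().union(*(b.inputs for b in clb))
    return used - driven


def is_valid_clustering(bles, clbs, N=8, I=18):
    """Check the input, size, disjointness and coverage conditions."""
    seen = set()
    for clb in clbs:
        if len(clb) > N:                 # at most N BLEs per CLB
            return False
        if len(clb_inputs(clb)) > I:     # at most I distinct CLB inputs
            return False
        names = {b.name for b in clb}
        if seen & names:                 # CLBs must not share a BLE
            return False
        seen |= names
    return seen == {b.name for b in bles}  # every BLE is clustered
```

For example, two BLEs chained through an internal net need only three CLB inputs, because the connecting net is absorbed inside the CLB.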

**Fig. 1**: (c) Mapping results under different CLB interconnects. When a clustered circuit has fewer CLB interconnects, the routed circuit can have fewer tracks [15], so a narrower channel width can be used on the FPGA.

Hence, a routability-driven circuit clustering method usually optimises two aspects of a clustered circuit, which are defined in Equations (5)-(6). Equation (5) represents a circuit’s absolute area on an FPGA. As shown in Fig. 1(c), the CLB interconnect (net) number also has to be as small as possible; this condition is represented by Equation (6):

$$\text{BLE}(c_i) \rightarrow N \ (\text{minimise } |C|) \quad i = 1, 2, \ldots, m \quad (5)$$

$$\sum_{i=1}^{m} \text{Net}(c_i) \rightarrow 0 \quad i = 1, 2, \ldots, m \quad (6)$$

As a result, these two parameters are usually considered a “golden rule” to evaluate the quality of the clustered circuit. In addition to improving routability, a circuit clustering method can also reduce the delay of the mapped circuit, thereby improving its speed, by optimising the critical path. Other goals include power-driven circuit clustering; however, we focus on routability- and timing-driven circuit clustering in this paper.

### 3 Conventional Circuit Clustering Methods

This section reviews a number of well-known circuit clustering methods (algorithms). Most methods target CLB-input-bandwidth-constraint island-style FPGAs. In contrast, there are also methods for the input-bandwidth-free CLB FPGA, where $I = k \times N$, which represents no input constraint.

These clustering techniques can be classified as bottom-up and top-down. Bottom-up refers to clustering a circuit by moving BLEs into CLBs sequentially based on a greedy algorithm, giving a locally optimal perspective, and CLBs are constructed one by one. Top-down methods typically use graph partitioning methods, which view a circuit from a more global perspective, and separate the circuit by recursively partitioning it until each part of the circuit is able to fit into CLBs. There are also hybrid methods and post-routing-assisted methods. “Hybrid” methods combine bottom-up and top-down methods, and “post-routing-assisted” indicates that the methods incorporate the CAD flow in their techniques.

#### 3.1 Bottom-up Methods

In bottom-up methods, a seed BLE has to be selected via a specified method, for example the number of BLE inputs and outputs. The seed is then directly moved into an empty CLB. To cluster more suitable BLEs in the CLB, these algorithms usually use an attraction function to determine which is the best candidate BLE that can be moved next. The attraction function is weighted by a number of clustering objectives. The value of the function is known as the “gain”. The highest gain BLE is selected for each clustering iteration.
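The seed-and-attract loop described above can be sketched generically as follows. This is an illustration only and does not reproduce the attraction function of any cited tool: the dictionary layout of a BLE, the seed rule (most inputs) and the gain (number of nets shared with the CLB built so far) are assumptions, and the CLB input limit $I$ is ignored for brevity.

```python
def nets_of(ble):
    """All nets a BLE touches: its input nets plus its output net."""
    return set(ble["inputs"]) | {ble["output"]}


def gain(candidate, clb):
    """Hypothetical attraction function: nets shared with the CLB so far."""
    clb_nets = set().union(*(nets_of(b) for b in clb))
    return len(nets_of(candidate) & clb_nets)


def greedy_pack(bles, N=8):
    """Build CLBs one by one: pick a seed, then add the highest-gain BLE."""
    remaining = list(bles)
    clbs = []
    while remaining:
        # Seed rule (assumption): the unclustered BLE with the most inputs.
        seed = max(remaining, key=lambda b: len(b["inputs"]))
        remaining.remove(seed)
        clb = [seed]
        while remaining and len(clb) < N:
            best = max(remaining, key=lambda b: gain(b, clb))
            if gain(best, clb) == 0:
                break  # nothing is attracted to this CLB; open a new one
            remaining.remove(best)
            clb.append(best)
        clbs.append(clb)
    return clbs
```

Because each choice is made only against the CLB currently being filled, the result depends heavily on the seed and gain definitions, which is the local-optimality limitation of this family of methods.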

Typical bottom-up clustering methods include VPack [1], T-VPack [5], RPack [6], T-RPack [18], iRAC [3] and MOPack [8]. These methods depend on the seed BLE selection and the attraction function, and because it is uncertain whether these two functions supply the best BLE, the solution is usually only locally optimal. Weighting objectives in a single attraction function can meet multi-objective optimisation needs, but simple weighting can also destroy the proportionality between objectives.

#### 3.2 Top-down Methods

To facilitate top-down circuit clustering, most methods in this category are based on graph partitioning approaches treating the circuit as a hypergraph. The top-down circuit clustering methods usually deploy the hMETIS [19] hypergraph partitioning algorithm, and these methods can be viewed as various extensions of hMETIS, where hMETIS is a standalone tool performing the $k$-way graph partitioning based on a multi-level paradigm.

The first notable top-down FPGA circuit clustering method was introduced in [20], and designed for input-bandwidth-constraint CLB FPGAs. This method initially uses hMETIS to partition a circuit coarsely, involving a second iRAC-method-based step to further optimise clustered sub circuits in order to fit into CLBs. Since there is an extra step, it degrades the quality of the hMETIS results. A more recent top-down method has been proposed, PPack (also T-PPack, short for timing-driven PPack) [13], with promising results. Although these methods can produce better solutions than the bottom-up method, using graphs to cluster a circuit can make it difficult to involve clustering constraints, or clustering metrics.

#### 3.3 Other Methods

HDPack and DPack [7] are typical examples of hybrid methods: they use circuit partitioning to first cluster the synthesised circuit into sub circuits. These sub circuits are then optimised using bottom-up methods. Moreover, HDPack also incorporates the placement process in the CAD flow (DPack does not use a CAD flow), which can approximately determine which regions are more congested based on an FPGA model, so that extra adjustments can be made to the clustered circuit. Un/DoPack [21] and T-NDPack [22] involve the entire mapping process. In addition, Un/DoPack and T-NDPack introduce the concept of depopulation [21] in their methods. Even though these methods combine both top-down and bottom-up and use the top-down method as the first step, the quality of the results is usually decreased massively during the (second) bottom-up step.

### 4 Proposing new MOGA-based FPGA Circuit Clustering Methods

The concept of evolutionary computing (EC) has been developed over many years [23, 24]. Genetic algorithms (GAs) [25] are an important variant of evolutionary algorithms developed within the EC community. The outstanding advantages of GAs are that they can produce excellent solutions for a targeted problem without significant amounts of domain knowledge introduced at the start of the process and, if designed correctly, do not fall into local minima. In addition, multi-objective genetic algorithms (MOGAs) further enhance the problem solving ability for conflicting multi-objective problems, allowing MOGAs to address real search and optimisation problems.

In this paper, we propose a novel multi-objective genetic algorithm [4, 26, 27] based circuit clustering method. The proposed method contains two customised MOGAs: DBPack and HYPack, where both use Pareto optimality to incorporate multiple optimisation objectives. The former is used to generate initial circuit clustering solutions. As the GA produces stochastic results, DBPack can be executed many times to accumulate different solutions. HYPack then uses these solutions as input and performs further optimisation. This work is conducted from a global perspective (top-down). Subsequently, HYPack is connected to a CAD flow, and optimises solutions via the real mappings.

5 DBPack: Producing Initial Clustering Solutions Using MOGA

Rather than incrementally adding BLEs to a CLB, DBPack ("DB" being short for database) utilises a MOGA to search for groups of BLEs for a CLB. In this work, each GA run produces a CLB containing one or more BLEs. DBPack and partial experimental results were initially published in [28]. In this section, the DBPack implementation is introduced. It includes the GA chromosome representation, genetic operations, fitness functions, solution selection and experimental results.

5.1 Overview Algorithm

The DBPack execution flow is illustrated in Fig. 2. DBPack clusters BLEs by using a number of GAs; the number of GAs depends on the number of un-clustered BLEs. Experiments show that clustering circuits using such an approach can reduce the GA search space compared to searching for a solution from a global perspective, so that a useful solution can be produced efficiently. In each GA run, the initial population is randomly generated based on the un-clustered BLEs. The individuals are then evolved under multiple clustering objectives.

5.2 Representation

A binary string has been used to encode the DBPack GA’s chromosome. The chromosome consists of a number of genes, and these genes represent the BLEs selected for a CLB. The number of genes in the chromosome is determined by the number of un-clustered BLEs, hence the chromosome length is variable (it shrinks as un-clustered BLEs become clustered). Each gene encodes one BLE index. A gene with the binary value “1” means that the BLE index corresponding to that gene position has been selected for a CLB. Otherwise, the gene has the binary value “0”, indicating that the corresponding BLE is not yet selected. The detailed GA representation can be found in [28].
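As a minimal sketch of this encoding (the function names and the list-based layout are illustrative assumptions; the detailed representation is given in [28]), a chromosome can be decoded against the current list of un-clustered BLE indices, and the list shrinks once a CLB has been committed:

```python
def decode(chromosome, unclustered):
    """Return the BLE indices selected for the CLB (genes equal to 1)."""
    if len(chromosome) != len(unclustered):
        raise ValueError("one gene is required per un-clustered BLE")
    return [ble for gene, ble in zip(chromosome, unclustered) if gene == 1]


def commit(unclustered, selected):
    """Remove the BLEs of a finished CLB; later chromosomes are shorter."""
    chosen = set(selected)
    return [ble for ble in unclustered if ble not in chosen]


unclustered = [3, 7, 8, 12, 15]          # BLE indices still to be clustered
clb = decode([1, 0, 0, 1, 1], unclustered)
unclustered = commit(unclustered, clb)   # next GA run uses 2 genes
```

Here the chromosome `[1, 0, 0, 1, 1]` selects BLEs 3, 12 and 15, leaving BLEs 7 and 8 for the next GA run.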

5.3 Reproduction

Each GA has both crossover and mutation implemented as genetic operations in order to generate new individuals. The crossover is a one-point binary crossover, and the crossover operation is controlled by a crossover rate. In a GA generation, two individuals are randomly selected from the population to perform crossover, and the crossover point of their chromosomes is randomly determined. These two individuals then produce two new individuals. The “flipping a bit” mutation operation is utilised in DBPack. This mutation operates after crossover, and is performed on all offspring. For each offspring, the DBPack mutation operation randomly flips one, and only one, gene in the individual’s chromosome.
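The two operators can be sketched as below. This is a generic illustration of one-point crossover and single-bit mutation on list-encoded chromosomes; the function names and the `rng` parameter are assumptions, and the default rate of 0.6 is the crossover probability reported later in the paper.

```python
import random


def one_point_crossover(p1, p2, rate=0.6, rng=random):
    """Swap the tails of two equal-length parents at a random cut point."""
    if len(p1) < 2 or rng.random() >= rate:
        return p1[:], p2[:]              # no crossover: children are copies
    cut = rng.randint(1, len(p1) - 1)    # cut strictly inside the string
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def flip_one_bit(chromosome, rng=random):
    """Return a copy with exactly one randomly chosen gene inverted."""
    child = chromosome[:]
    pos = rng.randrange(len(child))
    child[pos] ^= 1
    return child
```

Flipping exactly one gene per offspring means a mutation adds or removes a single BLE from the candidate CLB, which keeps each step small even when the chromosome is long.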

5.4 Fitness Functions

At each CLB construction, DBPack involves five fitness functions to guide the GA evolution to search for suitable BLEs. Each objective requires its own fitness function which evaluates a candidate solution and assigns a quality metric in the form of a fitness value to guide evolutionary search. In this case, these fitness functions not only describe which objectives need to be optimised, but also handle clustering constraints. In DBPack, the MO selection is based on the NSGA-II [4], which selects the best individuals using the fast-non-dominated sort and crowding distance.
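The two ingredients of NSGA-II selection named above can be sketched for minimised objectives as follows. This is a simplified illustration, not the DBPack code: it extracts the first front with a direct pairwise comparison rather than the bookkeeping of the fast non-dominated sort, and the function names are assumptions.

```python
def dominates(a, b):
    """a dominates b: no worse in every objective, strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def first_front(fitnesses):
    """Indices of the fitness vectors that no other vector dominates."""
    return [i for i, f in enumerate(fitnesses)
            if not any(dominates(g, f)
                       for j, g in enumerate(fitnesses) if j != i)]


def crowding_distance(front):
    """Crowding distance of each vector in a front (larger = less crowded)."""
    n = len(front)
    dist = [0.0] * n
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        # Boundary solutions are always kept.
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            dist[order[k]] += gap / (hi - lo)
    return dist
```

Individuals are preferred first by front rank and then, within a front, by larger crowding distance, which favours solutions in sparsely populated regions of the objective space.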

$$f_{\text{BLE}}(x) = (\# \text{ of BLEs})^{-1} \quad (7)$$

$$f_{\text{inter. cons.}}(x) = \begin{cases} 
2, & (\# \text{ of inter. cons.} = 0) \\
(\# \text{ of inter. cons.})^{-1}, & (\# \text{ of inter. cons.} > 0)
\end{cases} \quad (8)$$

$$f_{\text{increased cons.}}(x) = \# \text{ of increased CLB nets} \quad (9)$$

$$f_{\text{input}}(x) = \# \text{ of inputs} \quad (10)$$

$$f_{\text{output}}(x) = \# \text{ of outputs} \quad (11)$$

Clustering objectives are described in Equations (7)-(11). Each function represents one objective of the searched BLEs, and all functions are defined to return smaller values when the objective they represent is improved. Equation (7) represents the number of BLEs for a CLB. Equation (8) shows how many circuit connections a CLB contains. It covers two situations: when the BLEs have no included connection, it returns a large penalty; otherwise, it returns the reciprocal of the number of connections included in the CLB. Equation (9) sets up a global optimisation for CLB interconnects. If there are already clustered CLBs, the interconnects of the currently clustered CLBs are known, so when a new CLB is added, the number of new interconnects it introduces can be calculated. Equations (10)-(11) control the CLB input and output numbers, and these are inspired by Rent’s rule [29].

$$f_{\text{BLE penalty}}(x) = \begin{cases} 
0, & (\# \text{ of BLEs} \leq N) \\
\# \text{ of BLEs} \div A, & (\# \text{ of BLEs} > N)
\end{cases} \quad (12)$$

$$f_{\text{input penalty}}(x) = \begin{cases} 
0, & (\# \text{ of inputs} \leq I) \\
\# \text{ of inputs} \div B, & (\# \text{ of inputs} > I)
\end{cases} \quad (13)$$

In addition to the objective functions that are defined, penalties are implemented to handle constraints. Equations (12)-(13) are the defined penalty functions. These functions return penalty values when BLE combinations are invalid for the targeted CLB type. Equation (12) represents the BLE number constraint, and Equation (13) controls the input number of the BLEs. $A$ and $B$ are two proportional coefficients that adjust the penalty levels. Experiments show that $A = 7$ and $B = 2$ are efficient settings; the penalties have to be small enough to avoid degrading the diversity of the GA population.

With the objective functions and penalty functions set up, the DBPack fitness functions can be defined. According to [27, 30, 31], the penalty can be added to all objective functions to handle constraints in MOGAs. The DBPack fitness functions are defined in Equation (14), where Equation (15) gives the sum of the penalties. Index \( j \) indicates the five objectives listed in Equations (7)-(11).

$$f_{\text{fitness}_j}(x) = f_{\text{obj}_j}(x) + f_{\text{penalty}}(x) \quad j = 1, 2, \ldots, 5 \quad (14)$$

$$f_{\text{penalty}}(x) = f_{\text{BLE penalty}}(x) + f_{\text{input penalty}}(x) \quad (15)$$

5.5 Solution Selection

In DBPack, each CLB construction uses a MOGA, and the GA executes for a fixed number of generations. At the end of this execution all individuals can be considered as possible solutions for a CLB. In order to identify the best individual based upon its MO characteristics, the selection process checks all final-generation individuals.

Individuals that are on the first Pareto front, have \( n \) (\( n = N \)) BLEs and have at most \( I \) inputs are temporarily stored. In practice, the GA might not find any solution which has \( n = N \) BLEs, so \( n \) is reduced until individuals are found. The key to this process is to find all maximum-BLE solutions. Subsequently, these temporarily stored individuals are ranked based on their internal connections. The individual that has the most internal connections is selected as a CLB.
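The constraint penalties of Equations (12)-(13) and the selection steps above can be sketched together as follows. This is a hedged illustration rather than the authors' code: the `Candidate` record and its field names are hypothetical, the first Pareto front is assumed to be supplied by the caller, and the input penalty is assumed to mirror the BLE penalty (number of inputs divided by $B$ when the limit $I$ is exceeded).

```python
from dataclasses import dataclass

N, I = 8, 18   # CLB size and CLB input limit used in the paper
A, B = 7, 2    # penalty coefficients reported in the text


@dataclass
class Candidate:
    """Hypothetical summary of one final-generation individual."""
    n_bles: int
    n_inputs: int
    n_internal: int   # connections absorbed inside the CLB


def penalty(c):
    """Sum of the BLE-count and input-count penalties (0 when feasible)."""
    ble_pen = c.n_bles / A if c.n_bles > N else 0.0
    # Assumed form: analogous to the BLE penalty, scaled by B.
    input_pen = c.n_inputs / B if c.n_inputs > I else 0.0
    return ble_pen + input_pen


def select_clb(first_front):
    """Pick the fullest feasible candidate, ties broken by internal nets."""
    for n in range(N, 0, -1):            # relax n = N downwards
        pool = [c for c in first_front
                if c.n_bles == n and c.n_inputs <= I]
        if pool:
            return max(pool, key=lambda c: c.n_internal)
    return None                          # no feasible candidate at all
```

For instance, a candidate with 10 BLEs and 20 inputs violates both limits and receives a penalty of $10/7 + 20/2 \approx 11.43$, which is added to each of its five objective values.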

5.6 Results Comparison

The largest benchmark "clma" in MCNC-20 is used for adjusting the DBPack GA parameters as it represents the largest search space. The calibrated DBPack GA parameters are summarised as follows:

1. Population size: 200
2. Crossover probability: 0.6
3. Mutation probability: one gene per individual
4. Maximum generation number: 15000

Because DBPack outputs fully clustered circuits, its results can be compared directly with those of other methods. As in these other methods, the CLB size \( N \), CLB input number \( I \) and LUT size \( k \) are set to 8, 18 and 4, and each CLB includes one clock. Table 1 lists the best DBPack results compared to previously published best results, including CLB number and CLB interconnects. The results for each benchmark are based on 100 DBPack runs. As can be seen, DBPack can maximise CLB utilisation.

### Table 1 Results comparison between DBPack, VPack, T-VPack, RPack and iRAC for a subset of MCNC benchmarks. *Interc.* represents the number of CLB interconnects, lower values are better

<table>
<thead>
<tr>
<th rowspan="2">Benchmark</th>
<th colspan="2">Circuit Property</th>
<th colspan="2">DBPack</th>
<th colspan="2">VPack</th>
</tr>
<tr>
<th>LUTs</th>
<th>FFs</th>
<th>CLBs</th>
<th>Interc.</th>
<th>CLBs</th>
<th>Interc.</th>
</tr>
</thead>
<tbody>
<tr><td>alu4</td><td>1522</td><td>0</td><td>192</td><td>543</td><td>198</td><td>1296</td></tr>
<tr><td>apex2</td><td>1878</td><td>0</td><td>238</td><td>841</td><td>241</td><td>1626</td></tr>
<tr><td>apex4</td><td>1262</td><td>0</td><td>161</td><td>637</td><td>163</td><td>1037</td></tr>
<tr><td>bigkey</td><td>1707</td><td>224</td><td>214</td><td>716</td><td>214</td><td>1622</td></tr>
<tr><td>clma</td><td>8381</td><td>33</td><td>1058</td><td>3636</td><td>1056</td><td>7139</td></tr>
<tr><td>des</td><td>1591</td><td>0</td><td>199</td><td>909</td><td>204</td><td>1601</td></tr>
<tr><td>diffeq</td><td>1494</td><td>377</td><td>188</td><td>590</td><td>188</td><td>1280</td></tr>
<tr><td>dsip</td><td>1370</td><td>224</td><td>172</td><td>713</td><td>198</td><td>1590</td></tr>
<tr><td>elliptic</td><td>3602</td><td>1122</td><td>453</td><td>1278</td><td>453</td><td>3244</td></tr>
<tr><td>ex1010</td><td>1598</td><td>0</td><td>587</td><td>2264</td><td>595</td><td>3799</td></tr>
<tr><td>ex5p</td><td>1064</td><td>0</td><td>136</td><td>595</td><td>136</td><td>950</td></tr>
<tr><td>frisc</td><td>3539</td><td>886</td><td>447</td><td>1266</td><td>448</td><td>2958</td></tr>
<tr><td>misex3</td><td>1397</td><td>0</td><td>177</td><td>586</td><td>178</td><td>1101</td></tr>
<tr><td>pdc</td><td>4575</td><td>0</td><td>580</td><td>1934</td><td>593</td><td>3813</td></tr>
<tr><td>s298</td><td>1930</td><td>8</td><td>242</td><td>480</td><td>246</td><td>1711</td></tr>
<tr><td>s38417</td><td>6096</td><td>1463</td><td>804</td><td>2912</td><td>803</td><td>4921</td></tr>
<tr><td>s38584</td><td>6281</td><td>1260</td><td>806</td><td>2537</td><td>806</td><td>4649</td></tr>
<tr><td>seq</td><td>1750</td><td>0</td><td>222</td><td>753</td><td>223</td><td>1496</td></tr>
<tr><td>spla</td><td>3690</td><td>0</td><td>467</td><td>1464</td><td>476</td><td>3031</td></tr>
<tr><td>tseng</td><td>1046</td><td>385</td><td>132</td><td>464</td><td>132</td><td>979</td></tr>
</tbody>
</table>

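<table>
<thead>
<tr><th></th><th>DBPack</th><th>VPack</th><th>T-VPack</th><th>RPack</th><th>iRAC</th></tr>
</thead>
<tbody>
<tr><td>Sum of CLBs</td><td>7475</td><td>7551</td><td>7498</td><td>7550</td><td>7971</td></tr>
<tr><td>DBPack # CLB improvement</td><td>0.00%</td><td>1.01%</td><td>0.31%</td><td>0.99%</td><td>6.22%</td></tr>
<tr><td>Sum of Interc.</td><td>25118</td><td>49840</td><td>37239</td><td>38300</td><td>28077</td></tr>
<tr><td>DBPack Interc. improvement</td><td>0.00%</td><td>49.60%</td><td>32.55%</td><td>34.42%</td><td>10.54%</td></tr>
</tbody>
</table>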
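
The improvement figures are consistent with a relative reduction measured against each competing method. The formula below is inferred from the tabulated sums rather than stated in the text; $S$ denotes a column sum, and the two worked values use the DBPack and VPack sums:

$$\text{improvement} = \frac{S_{\text{other}} - S_{\text{DBPack}}}{S_{\text{other}}} \times 100\%$$

$$\frac{7551 - 7475}{7551} \times 100\% \approx 1.01\% \ (\text{CLBs}), \qquad \frac{49840 - 25118}{49840} \times 100\% \approx 49.60\% \ (\text{Interc.})$$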

6 HYPack: Optimising Clustering Solutions from A Global Perspective Using MOGA

HYPack ("HY" short for hybrid) attempts to continuously optimise clustered circuits from a global perspective, and also incorporates DBPack as a method to re-cluster BLEs. The HYPack implementation is introduced in this section, including the GA chromosome representation and genetic operations. The HYPack experimental results can be found in [28]. Subsequently, HYPack has been extended to a timing-driven circuit clustering method, T-HYPack, in which CAD-flow-based fitness is involved. The fitness functions and further experimental results of T-HYPack are detailed in the next section.

6.1 Overview Algorithm

The HYPack execution flow is illustrated in Fig. 3. HYPack optimises intermediate solutions, in this case from DBPack. Assuming DBPack has been executed many times and has generated enough solutions, these solutions are converted into the HYPack initial population. The GA then further optimises the solutions under an MO selection mechanism.

**Fig. 3**: HYPack circuit clustering flow. HYPack reads already clustered circuit solutions, and uses these as the GA's initial population. The GA then further optimises the solutions under an MO selection mechanism.

The crossover operation is intended to exchange BLE combinations between CLBs. Two individuals are selected randomly from the population, and their BLEs are exchanged, as shown in Fig. 4(a), to produce two new individuals.

The mutation operation, as shown in Fig. 4(b), is designed to randomly eliminate two CLBs in one individual, which releases previously allocated BLEs in the individual. The released BLEs are then reinserted into the individual before the mutation.

**Fig. 4**: HYPack crossover operation. This operation is designed to swap BLE combinations between two possible solutions (individuals).

### 6.2 Representation

In HYPack, an integer string has been selected to encode the chromosome. Each gene position encodes one BLE index, and the integer value of the gene is the index of the CLB to which that BLE is allocated. Fig. 5(a) illustrates this representation. The number of genes in the chromosome is equal to the BLE number, hence the chromosome length depends on the number of BLEs in the circuit.
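A minimal sketch of decoding such a chromosome into CLB groups is shown below; the function name and the example values are illustrative assumptions.

```python
from collections import defaultdict


def decode(chromosome):
    """Group BLE indices (gene positions) by CLB index (gene values)."""
    clbs = defaultdict(list)
    for ble_index, clb_index in enumerate(chromosome):
        clbs[clb_index].append(ble_index)
    return dict(clbs)


# Six BLEs allocated to three CLBs.
groups = decode([0, 0, 1, 2, 1, 0])
```

In this example, BLEs 0, 1 and 5 share CLB 0, BLEs 2 and 4 share CLB 1, and BLE 3 occupies CLB 2 alone. Because every gene holds exactly one CLB index, a BLE can never be allocated to two CLBs at once.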

### 6.3 Reproduction

Both crossover and mutation are implemented in the HYPack GA to create new individuals, and these operations are inspired by [32]. These genetic operations have two functions:

1. Select two individuals randomly from the population, and copy them to produce two new individuals.
2. Randomly determine which CLBs between the copied individuals exchange BLEs (crossover).

After both crossover and mutation operations are completed, a number of BLEs are likely to have been released. These released BLEs, which are retained in the individual, need to be reclustered, and this is done using the DBPack method. In this reclustering process, DBPack only returns new CLBs with new indexes to the HYPack GA individuals. This implies that DBPack does not reinsert a single BLE into the existing CLBs of an individual. Consequently, there is no need to handle clustering constraints in HYPack, as no invalid solution is generated.
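The release-and-recluster step of the mutation can be sketched as follows. This is only an illustration under stated assumptions: the function `mutate` and the parameter `clb_capacity` are hypothetical, and the simple in-order packing stands in for DBPack, which in the paper is itself a GA and also respects CLB input bandwidth. The sketch only preserves the property described above, namely that released BLEs go into new CLBs with new indexes and never into surviving CLBs.

```python
import random


def mutate(chromosome, clb_capacity, rng):
    """Eliminate two random CLBs and recluster their BLEs into new CLBs."""
    child = list(chromosome)
    used = sorted(set(child))
    # Randomly pick the two CLBs to eliminate.
    removed = set(rng.sample(used, k=min(2, len(used))))
    released = [ble for ble, clb in enumerate(child) if clb in removed]

    # Stand-in for DBPack (assumption): fill brand-new CLBs in BLE order,
    # up to clb_capacity BLEs each. Surviving CLBs are left untouched.
    next_index = max(used) + 1
    for start in range(0, len(released), clb_capacity):
        for ble in released[start:start + clb_capacity]:
            child[ble] = next_index
        next_index += 1
    return child, released


parent = [0, 0, 1, 2, 1, 0, 2, 1, 3, 3]
child, released = mutate(parent, clb_capacity=4, rng=random.Random(1))
print(child, released)
```

Because only new CLBs are created, every BLE outside the eliminated CLBs keeps its original allocation.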

**Fig. 5**: HYPack genotype and mutation operation.

## 7 T-HYPack: Involving CAD Flow in GA for Further Improving Clustering Solutions

The major difference from HYPack is that T-HYPack uses VPR [14] to provide CAD-assisted circuit timing optimisation: the VPR placement and routing processes are run inside the HYPack GA loop, and the additional FPGA mapping information they produce is used as fitness. This timing optimisation differs from conventional methods such as T-VPack, which clusters target circuits based on a circuit timing analysis that incorporates connection criticality, and then arranges the most critical connections inside CLBs. The rationale for that approach is that propagation delays of connections inside an FPGA CLB are lower than those of CLB interconnects.

### 7.1 Extracting Mapping Parameters as Fitness Functions

When T-HYPack evolves solutions, or receives initial solutions from DBPack, the solutions are evaluated not only on the basic circuit clustering requirements, but also on their mapping performance.

In the solution evaluation process, an individual is first converted into a clustered circuit. The basic clustering fitness values are then assigned, as shown in Equations (16)-(18), where (16) gives the number of CLBs, (17) the number of CLB interconnects and (18) the inverse of the number of connections absorbed inside CLBs. The circuit represented by each individual is then translated into a VPR-readable netlist. Subsequently, the VPR output is used to compose new fitness criteria, which are represented in Equations (19)-(20). T-HYPack therefore has five fitness functions, all of which return smaller values when a better solution is found.

\[
\begin{align*}
    f_{\text{hyobj1}}(x) &= \# \text{ of CLBs} \tag{16} \\
    f_{\text{hyobj2}}(x) &= \# \text{ of global nets} \tag{17} \\
    f_{\text{hyobj3}}(x) &= (\# \text{ of CLB absorbed nets})^{-1} \tag{18} \\
    f_{\text{hyobj4}}(x) &= \text{Critical path delay} \tag{19} \\
    f_{\text{hyobj5}}(x) &= \text{Routing wire length} \tag{20}
\end{align*}
\]
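To make the five objectives concrete, the sketch below assembles a fitness vector for one individual. It is a minimal illustration, not the T-HYPack implementation: the function `fitness_vector` and the net model (each net as a set of BLE indices) are assumptions, and the critical path delay and routing wire length are passed in as plain numbers that would, in the real flow, be taken from the VPR output. No VPR call is made here.

```python
import math


def fitness_vector(chromosome, nets, critical_path_delay, wire_length):
    """Return the five objective values; smaller is better for each."""
    num_clbs = len(set(chromosome))
    global_nets = 0
    absorbed_nets = 0
    for net in nets:
        # CLBs touched by the BLEs on this net.
        clbs_touched = {chromosome[ble] for ble in net}
        if len(clbs_touched) > 1:
            global_nets += 1      # net crosses CLB boundaries
        else:
            absorbed_nets += 1    # net is fully inside one CLB
    # Inverse of the absorbed-net count; infinite if nothing is absorbed.
    inv_absorbed = 1.0 / absorbed_nets if absorbed_nets else math.inf
    return (num_clbs, global_nets, inv_absorbed,
            critical_path_delay, wire_length)


# Hypothetical 8-BLE individual and 6 nets; delay and wire length are
# placeholder numbers standing in for values reported by VPR.
chromosome = [0, 0, 1, 2, 1, 0, 2, 1]
nets = [{0, 1}, {1, 5}, {2, 4}, {3, 6}, {5, 7}, {0, 3}]
print(fitness_vector(chromosome, nets, 7.5, 5000))
# (3, 2, 0.25, 7.5, 5000)
```

Taking the inverse of the absorbed-net count turns a quantity that should be maximised into one that is minimised, so all five objectives share the same direction.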

T-HYPack is executed for a fixed number of generations. In the final generation, the best individual, i.e. the solution with the smallest delay and the fewest CLBs, is chosen from the first Pareto front. This individual is then converted into a netlist and regarded as the final clustering solution of the targeted circuit.
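The final selection can be sketched as follows, assuming each fitness tuple lists the five objectives in the order (CLBs, global nets, inverse absorbed nets, delay, wire length) with smaller values being better. The helper names are hypothetical, and ranking by delay first and CLB count second is an assumption about how the two criteria are combined; the text does not specify a tie-breaking order.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def first_pareto_front(fitnesses):
    """Indices of individuals not dominated by any other individual."""
    return [i for i, f in enumerate(fitnesses)
            if not any(dominates(g, f)
                       for j, g in enumerate(fitnesses) if j != i)]


def select_final(fitnesses):
    """Pick the front member with smallest delay, then fewest CLBs."""
    front = first_pareto_front(fitnesses)
    # Index 3 = critical path delay, index 0 = number of CLBs.
    return min(front, key=lambda i: (fitnesses[i][3], fitnesses[i][0]))


# Hypothetical final-generation population of four individuals.
population_fitness = [
    (10, 5, 0.25, 7.0, 500),
    (9, 6, 0.20, 6.5, 520),
    (10, 6, 0.25, 7.2, 530),   # dominated by the first individual
    (11, 4, 0.30, 6.5, 510),
]
print(first_pareto_front(population_fitness))  # [0, 1, 3]
print(select_final(population_fitness))        # 1
```

In this example two front members tie on delay, and the one with fewer CLBs is selected.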

### 7.2 Results Comparison

The GA parameters of T-HYPack were determined according to the DBPack GA parameter calibration and settings. As the released BLEs are limited in number, reduced generation counts and population sizes are applied to the BLE reclustering process. The T-HYPack and reclustering DBPack GA parameters are summarised as follows:

1. Initial solution size: 100 (initial solution number)

Because T-HYPack invokes VPR, and VPR mapping is time-consuming for large circuits, the evolution time increases significantly. T-HYPack is therefore tested on only ten small MCNC-20 benchmarks, where “small” refers to a synthesised benchmark that has 1,000-1,500 BLEs. The FPGA architecture parameters, shown in Table 2, are based on the T-VPack testing environment.

Each selected MCNC-20 benchmark was optimised ten times using T-HYPack, and the best results are compared to T-VPack. Table 3 shows the comparison, which covers CLB count, CLB interconnects, channel width, routing wire length and delay (timing). Note that T-HYPack uses the same FPGA area as T-VPack. The T-HYPack solutions demonstrate a number of improvements: T-HYPack can speed up a circuit by up to 27.62% compared to T-VPack. As nearly all published methods are compared with T-VPack, Table 4 gives a general comparison with these methods based on the literature. The table indicates that T-HYPack is the best timing-driven circuit clustering method of those compared. In addition, Fig. 6 shows the routed “tseng” benchmark on the FPGA using the T-VPack and T-HYPack circuit clustering methods: T-VPack (left) uses 28 tracks in the FPGA routing channel, whereas T-HYPack (right) uses 20.

## 8 Conclusion

In this paper, we present a new method to perform circuit clustering based on MOGAs for cluster-based FPGAs. Directly forming CLBs from BLEs makes the clustering efficient and robust, and the use of Pareto optimality allows multiple objectives to be evolved simultaneously without degrading the solution quality on any particular objective. The CAD-assisted method also shows significant potential to improve circuit clustering. The experimental results indicate that our method offers a significant improvement on clustered circuits compared to other methods, especially for circuit timing performance. However, there is also a cost: the main concern is currently execution time, which is longer than that of conventional clustering algorithms because the proposed method uses a combination of GA, nested GAs and CAD flow. Nevertheless, the structure of our method and algorithms is inherently parallel, and execution time can be reduced by using parallel execution on a multi-core system or HPC infrastructure. A larger number of GA generations and benchmarks can then be tested, so that the method can be further evaluated. The key contribution of the proposed methodology is a flexible design automation method that can incorporate multiple objectives and constraints required to successfully tackle real-world circuit design on FPGA, resulting in better FPGA device utilisation at increased circuit performance. Additional clustering objectives can be incorporated without the need to change the core algorithm.

## Acknowledgment

This work is part of the PAnDA project that is funded by EPSRC (EP/I005838/1).

## 9 References


Table 3: T-HYPack results with the best timing performance, compared to T-VPack. (Algo.: algorithm, Interc.: CLB interconnects, CH: channel width, Wire-Len: wire length, T-V: T-VPack, T-HY: T-HYPack. Lower is better.)

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Algo.</th>
<th>CLBs</th>
<th>Interc.</th>
<th>CH</th>
<th>Wire-Len</th>
<th>Delay (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>alu4</td>
<td>T-V</td>
<td>192</td>
<td>804</td>
<td>34</td>
<td>9410</td>
<td>8.33</td>
</tr>
<tr>
<td></td>
<td>T-HY</td>
<td>192</td>
<td>528</td>
<td>34</td>
<td>1277</td>
<td>7.15</td>
</tr>
<tr>
<td>apex2</td>
<td>T-V</td>
<td>240</td>
<td>1244</td>
<td>44</td>
<td>15681</td>
<td>11.05</td>
</tr>
<tr>
<td></td>
<td>T-HY</td>
<td>238</td>
<td>834</td>
<td>44</td>
<td>15824</td>
<td>8.75</td>
</tr>
<tr>
<td>apex4</td>
<td>T-V</td>
<td>165</td>
<td>863</td>
<td>52</td>
<td>12072</td>
<td>7.94</td>
</tr>
<tr>
<td></td>
<td>T-HY</td>
<td>162</td>
<td>645</td>
<td>46</td>
<td>10588</td>
<td>7.60</td>
</tr>
<tr>
<td>bigkey</td>
<td>T-V</td>
<td>214</td>
<td>1040</td>
<td>26</td>
<td>15619</td>
<td>6.44</td>
</tr>
<tr>
<td></td>
<td>T-HY</td>
<td>214</td>
<td>669</td>
<td>14</td>
<td>13640</td>
<td>4.49</td>
</tr>
<tr>
<td>diffeq</td>
<td>T-V</td>
<td>189</td>
<td>1033</td>
<td>28</td>
<td>7686</td>
<td>6.22</td>
</tr>
<tr>
<td></td>
<td>T-HY</td>
<td>188</td>
<td>565</td>
<td>26</td>
<td>6817</td>
<td>7.14</td>
</tr>
<tr>
<td>dsip</td>
<td>T-V</td>
<td>172</td>
<td>762</td>
<td>18</td>
<td>14368</td>
<td>6.21</td>
</tr>
<tr>
<td></td>
<td>T-HY</td>
<td>172</td>
<td>704</td>
<td>22</td>
<td>15157</td>
<td>4.58</td>
</tr>
<tr>
<td>ex5p</td>
<td>T-V</td>
<td>139</td>
<td>767</td>
<td>46</td>
<td>9780</td>
<td>10.01</td>
</tr>
<tr>
<td></td>
<td>T-HY</td>
<td>136</td>
<td>592</td>
<td>46</td>
<td>9618</td>
<td>7.08</td>
</tr>
<tr>
<td>misex3</td>
<td>T-V</td>
<td>178</td>
<td>840</td>
<td>38</td>
<td>10429</td>
<td>8.93</td>
</tr>
<tr>
<td></td>
<td>T-HY</td>
<td>178</td>
<td>579</td>
<td>42</td>
<td>9925</td>
<td>7.08</td>
</tr>
<tr>
<td>seq</td>
<td>T-V</td>
<td>221</td>
<td>1055</td>
<td>42</td>
<td>14480</td>
<td>8.93</td>
</tr>
<tr>
<td></td>
<td>T-HY</td>
<td>225</td>
<td>758</td>
<td>44</td>
<td>13800</td>
<td>7.22</td>
</tr>
<tr>
<td>tseng</td>
<td>T-V</td>
<td>133</td>
<td>804</td>
<td>21</td>
<td>6632</td>
<td>7.80</td>
</tr>
<tr>
<td></td>
<td>T-HY</td>
<td>132</td>
<td>456</td>
<td>20</td>
<td>4545</td>
<td>6.15</td>
</tr>
</tbody>
</table>

Table 4: A comparison for FPGA circuit clustering methods. (CLB input-bandwidth-free circuit clustering method, N.A.: not available, Interc.: CLB interconnects, CH: Channel, Higher is better)

<table>
<thead>
<tr>
<th>Method</th>
<th>CLB (%)</th>
<th>CLB interc. (%)</th>
<th>CH width (%)</th>
<th>Wire length (%)</th>
<th>Delay (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>T-VPack</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
<td>0.00</td>
</tr>
<tr>
<td>T-RPack</td>
<td>N.A.</td>
<td>7.01</td>
<td>2.66</td>
<td>N.A.</td>
<td>5.00</td>
</tr>
<tr>
<td>iRAC</td>
<td>-6.24</td>
<td>25.86</td>
<td>16.10</td>
<td>25.00</td>
<td>-4.35</td>
</tr>
<tr>
<td>DPack</td>
<td>N.A.</td>
<td>9.20</td>
<td>N.A.</td>
<td>17.70</td>
<td>7.80</td>
</tr>
<tr>
<td>HDPack</td>
<td>N.A.</td>
<td>12.70</td>
<td>N.A.</td>
<td>23.20</td>
<td>6.10</td>
</tr>
<tr>
<td>MO-Pack</td>
<td>N.A.</td>
<td>10.73</td>
<td>11.44</td>
<td>12.60</td>
<td>-1.44</td>
</tr>
<tr>
<td>PPack</td>
<td>N.A.</td>
<td>19.80</td>
<td>19.80</td>
<td>17.20</td>
<td>-4.30</td>
</tr>
<tr>
<td>T-PPack</td>
<td>N.A.</td>
<td>17.00</td>
<td>17.00</td>
<td>15.10</td>
<td>3.60</td>
</tr>
<tr>
<td>DBPack</td>
<td>0.59</td>
<td>31.21</td>
<td>13.00</td>
<td>8.70</td>
<td>20.85</td>
</tr>
<tr>
<td>T-HYPack</td>
<td>0.54</td>
<td>32.34</td>
<td>6.25</td>
<td>8.52</td>
<td>27.62</td>
</tr>
</tbody>
</table>


