This is a repository copy of *End-to-End Schedulability Tests for Multiprocessor Embedded Systems based on Networks-on-Chip with Priority-Preemptive Arbitration*.

White Rose Research Online URL for this paper:
http://eprints.whiterose.ac.uk/93367/

Version: Submitted Version

**Article:**

https://doi.org/10.1016/j.sysarc.2014.05.002

---

End-to-End Schedulability Tests for Multiprocessor Embedded Systems based on Networks-on-Chip with Priority-Preemptive Arbitration

Leandro Soares Indrusiak
Real-Time Systems Group - Department of Computer Science
University of York - York, United Kingdom
lsi@cs.york.ac.uk

ABSTRACT
Simulation-based techniques can be used to evaluate whether a particular NoC-based platform configuration is able to meet the timing constraints of an application, but they can only evaluate a finite set of scenarios. In safety-critical applications with hard real-time constraints, this is clearly not sufficient because there is an expectation that the application should be schedulable on that platform in all possible scenarios. This paper presents a particular NoC-based multiprocessor architecture, as well as a number of analytical methods that can be derived from that architecture, aiming to allow designers to check, for a given platform configuration, whether all application tasks and communication messages always meet their hard real-time constraints in every possible scenario. Experiments are presented, showing the use of the proposed methods when evaluating different task mapping and platform topologies.

Categories and Subject Descriptors
C.3 [Special-purpose and Application-based Systems]: Real-time and embedded systems.

General Terms
Performance, Design.

1. INTRODUCTION
Embedded systems typically have to fulfill timing constraints that are related to their application domain and usage scenarios. Constraints are usually specified as the deadline to perform specific functions. For example, a high-definition video recorder must be able to capture, compress and store 25 video frames per second. In safety-critical applications, such constraints are said to be hard real-time constraints, as there is an expectation that they have to be met by the system in all possible scenarios. Therefore, embedded systems designers must be able to evaluate which design alternatives can fulfill those constraints and, for safety-critical applications, guarantee real-time behaviour.

In this paper, we present analytical methods to evaluate whether a multicore embedded system based on a Network-on-Chip (NoC) can fulfill all its timing constraints. A NoC-based system can have tens to hundreds of processing cores interconnected by an on-chip packet-switching network that allows data to be transferred between the local caches of each core and from/to external memory. Section 2 of the paper provides more detail on this type of system architecture. It will then become clear that the performance of the NoC interconnect is as critical as the performance of the processing cores when it comes to meeting timing constraints.

Throughout this paper, we will use the terms end-to-end timing constraint or end-to-end deadline of an application task-chain. Those terms denote constraints derived from the application domain (e.g. every video frame must be processed in 40 ms or less) that must be met by specific components of the application (i.e. chains of communicating tasks). Our goal is to establish whether all task-chains of an application have their end-to-end deadlines met by a particular NoC-based platform configuration, and this problem is referred to in this paper as the end-to-end schedulability test. Such a test must consider the end-to-end latency of each task of a task-chain: the time it takes for a processing core to execute that task (computation latency) plus the time it takes for the NoC to transfer all data produced by that task to the next one on the chain (communication latency). In Section 3, precise definitions of all those concepts will be given, followed in Section 4 by formulations of end-to-end schedulability tests that are tailored to NoC-based multicores with priority arbitration.

Some of the schedulability tests presented in this paper are based on classic Response Time Analysis (RTA) [1] and on NoC traffic flow schedulability analysis [2]. Individually, those analyses cannot be used to evaluate and improve the schedulability of a NoC system. For example, the traffic flow schedulability analysis from [2] has been used in [3] to produce fully schedulable task mappings, but the authors had to artificially limit the number of tasks mapped to each core, as the analysis does not directly consider the different interference patterns resulting from mapping the source of the traffic flows to different cores. Without a limitation on the maximum number of tasks per core, the mapping optimisation process would lead to a solution with all tasks mapped to the same core (so all communications are local, instantaneous and therefore schedulable). Likewise, the evaluation of NoC schedulability using only RTA would be oblivious to the delays caused by network contention. Therefore, in this paper we discuss how to compose those two analytical methods to achieve correct upper bounds to the end-to-end latency, and show that the resulting analytical model is useful as a test to evaluate whether a specific task mapping is schedulable.

Schedulability tests are not always used in industry and academia. Often, system designers address the schedulability problem by simulating the system under different scenarios and checking if the obtained figures for computation and communication latencies meet the constraints. There are two main limitations to that approach. Firstly, for a complex multicore embedded system, the simulation of a few seconds of an application’s execution may take hours or days [4], limiting the number of design alternatives that can be evaluated and the portion of the application lifetime that can be considered. Secondly, simulation can only verify whether constraints are met within the scenarios that are explicitly simulated. In complex embedded systems, the set of possible scenarios is too vast to be exhaustively covered, so it is not possible to check whether constraints are always met. For example, if application tasks can suffer release jitter, it would be necessary to simulate each and every possible value of jitter for each task in order to make sure that the timing constraints are met in every case. In Section 5, we use a number of benchmarks to evaluate the proposed schedulability tests, compare the obtained figures with those obtained with simulation, and propose a design flow that benefits from the joint use of both techniques.

2. NOC-BASED MULTICORES

NoCs are a common architectural template for processors with dozens of cores, and they have the potential to scale as the core count increases to hundreds or thousands. Figure 1 shows a simplified representation of a NoC architecture. It has 16 cores, each of them represented together with its own local cache as a white rectangle. Cores are directly connected to NoC switches (grey rectangles), which route data packets towards a destination (which may be another core, an interface to off-chip memory, a custom hardware accelerator, etc.).

Many components of the NoC template can be parameterized to better meet design goals, such as the number and type of cores, buffering, routing and arbitration policies, among others. In this paper, our choice of a specific subset within such a large number of alternatives was based on three criteria: (i) adopt architectural features that are widely used in industry and academia, (ii) use on-chip resources efficiently, and (iii) privilege techniques that are amenable to the type of schedulability tests we are investigating.

Following criterion (i), we concentrate on the widely used 2D mesh topology [5][6][7][8]. Criterion (ii) motivates the use of wormhole switching, as its buffer overhead is much smaller than store-and-forward (SAF) approaches, and its link allocation is more efficient than circuit switching approaches: there is no need to reserve the complete path of a packet, and NoC links are only allocated on the segments of the path where there is data ready to be transferred. Finally, criterion (iii) requires some level of predictability on resource sharing policies, so we limit our approach to NoCs with non-adaptive routing and priority arbitration such as QNoC [7] or Hermes [9]. The most common implementation of priority arbitration is based on virtual channels (VCs) [10], which allow packets with higher priority to preempt the transmission of low priority ones, making it easier to predict the outcome of network contention scenarios. Figure 1 shows a detailed view of a NoC switch with priority-arbitrated VCs: in each input port, a different FIFO buffer stores data words (flits) of packets arriving through different VCs (one for each priority level). The routing component assigns an output port for each incoming packet according to its destination. A credit-based approach [10] guarantees that data is only forwarded from a router to the next when there’s enough buffer space to hold it at the right VC. At any time, a flit of a given packet will be sent through its respective output port if it has the highest priority among the packets being sent out through that port, and if it has credits (that is, buffer space on the respective buffer of the neighbouring node connected to that output port). If the highest priority packet can’t send data because it is blocked elsewhere in the network, the next highest priority packet can access the output link.

3. SYSTEM MODEL AND NOTATION

In this paper, we investigate ways to determine whether application tasks executing and communicating over a specific NoC-based multicore can meet all application-specific timing constraints. Therefore, we need a system model that covers the application as well as the NoC-based platform and its configurations.

For the application model, we recall the sporadic task model and define an application as a taskset \( \Gamma = \{ \tau_1, \tau_2, \ldots, \tau_k \} \) where each task \( \tau_i \) is a 6-tuple \( \{ C_i, T_i, D_i, J_i, P_i, \varphi_i \} \). The first five elements indicate respectively its worst case computation time, period (i.e. minimum inter-release time interval), deadline, release jitter and priority. The sixth element of the tuple is the only proposed addition to the sporadic task model, and represents a communication message sent by \( \tau_i \). Our initial assumption is that each task produces a single message \( \varphi_i \) which is sent immediately after it finishes its computation. The message is defined as a 3-tuple \( \{ \tau_d, Z_i, K_i \} \) representing its destination task, size and maximum release jitter. A task-chain \( X = \{ \tau_1, \tau_2, \ldots, \tau_k \} \) is an ordered subset of \( \Gamma \) where each task sends a message to the subsequent task in \( X \), and all of them have the same period \( T_x \). We assume that all task-chains in a particular application \( \Gamma \) are disjoint subsets of \( \Gamma \), and that loops are not allowed (i.e. the sixth element of the tuple of the final task \( \tau_k \) of every task-chain is the empty set \( \emptyset \)).
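The task and message tuples above can be written down as plain data structures. The following is a minimal sketch in Python; the class names, field names and the `is_task_chain` helper are illustrative assumptions rather than part of the original model, and the check only covers the three chain properties stated above (common period, each task sending to its successor, no message from the final task).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    dest: str             # name of the destination task (tau_d)
    size: int             # message size Z, in bits
    jitter: float = 0.0   # maximum release jitter K


@dataclass
class Task:
    name: str
    c: float              # worst case computation time C
    t: float              # period / minimum inter-release interval T
    d: float              # deadline D
    j: float              # release jitter J
    p: int                # priority P
    msg: Optional[Message] = None   # sixth tuple element; None models the empty set


def is_task_chain(chain: List[Task]) -> bool:
    """Check the task-chain properties stated in the text (hypothetical helper)."""
    if not chain:
        return False
    same_period = all(task.t == chain[0].t for task in chain)
    # every task except the last must send its message to its successor
    linked = all(a.msg is not None and a.msg.dest == b.name
                 for a, b in zip(chain, chain[1:]))
    # loops are not allowed: the final task sends no message
    terminates = chain[-1].msg is None
    return same_period and linked and terminates
```

For instance, two tasks with a period of 500 where the first sends a message to the second form a valid chain, whereas the same two tasks in reverse order do not.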

The model of the NoC platform is composed of a set of processing cores \( \Pi = \{ \pi_1, \pi_2, \ldots, \pi_n \} \), a set of switches \( \Xi = \{ \xi_1, \xi_2, \ldots, \xi_m \} \), and a set of unidirectional links \( \Lambda = \{ \lambda_{1a}, \lambda_{1b}, \lambda_{21}, \lambda_{22}, \ldots, \lambda_{nm}, \lambda_{mn} \} \). Links can connect cores to switches, or switches with each other, allowing for all possible direct and indirect NoC topologies. For example, the architecture shown in Fig. 1 has 16 cores \( \pi_1 \ldots \pi_n \), each of them connected to one of the 16 switches \( \xi_1 \ldots \xi_m \) via two unidirectional links (e.g. \( \lambda_{1a} \) and \( \lambda_{4b} \)). The switches, in turn, are connected to each neighbouring switch by two links (e.g. \( \lambda_{21}, \lambda_{12}, \lambda_{23}, \lambda_{32}, \lambda_{26}, \lambda_{62} \) are the links attached to switch \( \xi_2 \)).

![Figure 1. NoC architecture with detail of the router structure](image-url)

NoCs forward packets from source to destination according to a routing algorithm. We define a function \( \text{route}(\pi_s, \pi_d) = (\lambda_{a_1}, \lambda_{a_2}, \ldots, \lambda_{a_{|\Lambda|}}) \) denoting the subset of \( \Lambda \) used to transfer packets from core \( \pi_s \) to core \( \pi_d \). A route will include links connecting the source and destination cores to their respective switches, and all the links between switches along the way. The cardinality of a route is defined as \( |\text{route}(\pi_s, \pi_d)| \) and will be informally referred to as its hop count. For the example in Fig. 1, \( |\text{route}(\pi_1, \pi_3)| = 4 \) and \( |\text{route}(\pi_1, \pi_6)| = 4 \) for most commonly used routing algorithms.
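As an illustration of the route function, the sketch below enumerates the links of a route under XY dimension-order routing on a 2D mesh. It assumes cores and switches are numbered from 1 in row-major order (which matches the neighbours of \( \xi_2 \) listed above), that a message between tasks on the same core uses no NoC link, and it uses hypothetical node labels such as `core1` and `sw1`.

```python
from typing import List, Tuple

Link = Tuple[str, str]  # unidirectional link: (source node, destination node)


def xy_route(src: int, dst: int, width: int = 4) -> List[Link]:
    """Links used from core `src` to core `dst` under XY routing (illustrative)."""
    if src == dst:
        return []  # assumption: local communication does not use the NoC
    route: List[Link] = [(f"core{src}", f"sw{src}")]  # core-to-switch link
    x, y = (src - 1) % width, (src - 1) // width
    dst_x, dst_y = (dst - 1) % width, (dst - 1) // width
    current = src
    while x != dst_x:  # travel along the X dimension first
        x += 1 if dst_x > x else -1
        nxt = y * width + x + 1
        route.append((f"sw{current}", f"sw{nxt}"))
        current = nxt
    while y != dst_y:  # then along the Y dimension
        y += 1 if dst_y > y else -1
        nxt = y * width + x + 1
        route.append((f"sw{current}", f"sw{nxt}"))
        current = nxt
    route.append((f"sw{dst}", f"core{dst}"))  # switch-to-core link
    return route
```

With these assumptions, `len(xy_route(1, 3))` and `len(xy_route(1, 6))` both evaluate to 4, in line with the hop counts given above.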

Task mapping is a critical part of the design of multicore systems. It defines which application tasks should be mapped onto which processing core (i.e. on which core each task will execute). Many different approaches to task mapping have been proposed, taking into account the time when the mapping occurs, whether tasks can be remapped (or migrated) during execution, and which metrics should be considered when making a mapping decision. We therefore define a surjective function \( \text{map}(\tau) = \pi_\tau \) to denote the core onto which a task is mapped. Its inverse is defined as \( \text{map}^{-1}(\pi_\tau) = \{\tau_1, \ldots, \tau_n\} \) and represents the tasks mapped to a given core. Likewise, the mapping of a message \( \text{map}^m(\phi_i) = \text{route}(\text{map}(\tau_i), \text{map}(\tau_j)) \), where \( \tau_i \) and \( \tau_j \) are its source and destination tasks, denotes the route of its packets over the NoC, and the inverse \( (\text{map}^m)^{-1}(\lambda) = \{\phi_1, \ldots, \phi_n\} \) represents the messages mapped over a given link.

Once the mapping of all tasks of \( \Gamma \) is defined, it is possible to calculate the basic communication latency \( L_i \) of every message \( \phi_i \). It represents the time taken by the message to be completely transferred from its source to its destination, assuming no contention over the NoC links (i.e. as if the message is the only one using the NoC). The actual value of \( L_i \) will depend on implementation-specific characteristics of the NoC (e.g. link width, time required for a packet header to cross a router, and for a flit to cross a link). A common formulation is the following:

\[
L_i = \left| \text{map}(\phi_i) \right| \cdot \lambda_{\text{hop}} + \left( |\text{map}(\phi_i)| - 1 \right) \cdot \lambda_{\text{router}} + (Z_i / \text{width}) \cdot \lambda_{\text{hop}},
\]

where the first term represents the time it takes for the packet header to traverse all the NoC links, expressed as the product of the message hop count and the latency \( \lambda_{\text{hop}} \) for the header to traverse a single link; the second term represents the time it takes for the packet header to traverse all NoC routers, and is expressed as the product of the number of routers along the path (which is usually the number of hops minus one in most direct networks) and the latency \( \lambda_{\text{router}} \) for the header to traverse a router; the third term represents the time taken by the packet payload to follow the header in a wormhole fashion all the way to the destination, expressed by the message length \( Z_i \) (in bits) divided by the link width (which results in the number of payload flits), multiplied by the single link latency \( \lambda_{\text{hop}} \).
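The three terms of the basic latency can be computed directly. The sketch below is one possible reading of the formulation above; the function and parameter names are illustrative, and rounding the payload up to a whole number of flits is an added assumption (the formula itself divides \( Z_i \) by the link width without rounding).

```python
import math


def basic_latency(hops: int, size_bits: int, link_width_bits: int,
                  hop_latency: float, router_latency: float) -> float:
    """Contention-free latency of a message (illustrative sketch)."""
    header_links = hops * hop_latency                # header crossing every link
    header_routers = (hops - 1) * router_latency     # header crossing every router
    # assumption: a partially filled flit still occupies a full flit slot
    payload_flits = math.ceil(size_bits / link_width_bits)
    payload = payload_flits * hop_latency            # payload following the header
    return header_links + header_routers + payload
```

For example, a 2048-byte message (16384 bits) over a 4-hop route with 32-bit links, a link latency of 1 cycle and a router latency of 3 cycles gives 4 + 9 + 512 = 525 cycles. The platform parameters in this example are made up for illustration.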

4. END-TO-END SCHEDULABILITY TESTS FOR NOC-BASED MULTICORES

A schedulability test is able to discern system configurations that are schedulable, that is, able to meet their timing constraints even in the worst case scenario. In this paper, we assume that a system is schedulable iff all its task chains meet their end-to-end deadlines. To check this property, we first revisit a number of existing techniques that can be used as necessary schedulability tests.

4.1 Schedulability of tasks over a processing core

A processor utilisation test can be used to check whether all tasks mapped to a particular core \( \pi_a \) do not exceed its capacity:

\[
\sum_{\tau_i \in \text{map}^{-1}(\pi_a)} \frac{C_i}{T_i} \leq 1 \tag{1}
\]

This test is necessary but not sufficient: even though the core \( \pi_a \) may be capable of running all the tasks, it may not be able to run all of them within their deadlines. Response Time Analysis (RTA) [1] is the standard technique to evaluate how much the interference from higher priority tasks can delay the completion time of task \( \tau_i \):

\[
R_i = C_i + \sum_{\forall \tau_j \in \text{hp}(\tau_i)} \left\lceil \frac{R_i}{T_j} \right\rceil C_j \tag{2}
\]

where the function \( \text{hp}(\tau_i) \) denotes the set of all tasks that can preempt \( \tau_i \): those mapped to the same core and that have a higher priority. Formally, \( \text{hp}(\tau_i) \) includes every task \( \tau_j \in \Gamma \) where \( \text{map}(\tau_i) = \text{map}(\tau_j) \) and \( P_i < P_j \). With the help of Eq. 2, it is possible to calculate the longest possible time interval between the release of \( \tau_i \) and its termination. This is done by adding \( \tau_i \)'s computation time \( C_i \) and the computation times \( C_j \) of all releases of tasks \( \tau_j \) that could preempt it. The result of that sum is referred to as \( \tau_i \)'s worst case response time and is represented by \( R_i \). As \( R_i \) appears on both sides of Eq. 2, an iterative solution was proposed in [1]. RTA has been widely used to test schedulability of uniprocessor and statically mapped multiprocessor systems with fixed priorities.
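The utilisation test and the iterative solution of the response time equation can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the function names are assumptions, tasks are reduced to `(C, T)` pairs, and the iteration starts at \( C_i \) and stops as soon as the candidate response time exceeds the deadline.

```python
import math
from typing import List, Optional, Tuple

CT = Tuple[float, float]  # (worst case computation time C, period T)


def utilisation_ok(tasks_on_core: List[CT]) -> bool:
    """Necessary test: total utilisation of the core must not exceed 1."""
    return sum(c / t for c, t in tasks_on_core) <= 1.0


def response_time(c_i: float, hp: List[CT], deadline: float) -> Optional[float]:
    """Fixed-point iteration of the RTA recurrence; None if the deadline is missed."""
    r = c_i
    while r <= deadline:
        # interference: every release of each higher priority task within r
        nxt = c_i + sum(math.ceil(r / t_j) * c_j for c_j, t_j in hp)
        if nxt == r:
            return r  # converged: r is the worst case response time
        r = nxt
    return None
```

As a worked example, a task with \( C = 3 \) that shares a core with higher priority tasks \( (C, T) = (1, 4) \) and \( (2, 6) \) iterates through 3, 6, 7, 9 and converges at 10, so it meets a deadline of 13 but not a deadline of 8.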

More advanced tests have been reviewed in [11], considering more advanced task models that support task migration (global scheduling), dynamic priorities and different constraints on deadlines. However, the tests described and referenced above do not explicitly consider inter-task communication. Instead, most assume that all communication latencies can be combined with the worst case computation time \( C_i \) of the respective tasks. For uniprocessor systems with uniform memory access times, such an assumption can be acceptable as the communication overhead can be predictable and usually small compared with the computation time. In NoC-based systems, however, the communication latency introduced by the NoC when tasks access memory or exchange messages depends heavily on the task mapping, the application communication patterns and resulting network congestion (which is particularly hard to predict in the case of wormhole switching NoCs). This leads to high variability in communication latencies, which can be of the same order of magnitude as the computation time \( C_i \) of the tasks (or even higher). Therefore, we make a case to explicitly consider communication times when analysing schedulability of NoC-based systems.

4.2 Schedulability of packets over a NoC and end-to-end schedulability of communicating tasks

To address the schedulability of packets transmitted over a NoC, we rely on the work proposed by Shi and Burns [2], which in turn builds on RTA. Their work assumes that packets are released into the NoC sporadically, i.e. a series of packets (referred to in [2] as a traffic flow) has a minimum inter-release interval which is known at design time. The maximum size of each packet is also known a priori. On the platform side, the main assumption is that the NoC routers perform deterministic routing, and that the link arbiters can preempt packets when higher-priority packets request the output link they are using. Such an assumption is valid for the type of NoC architectures described in Section 2. The worst case latency \( S_i \) of a packet transmitted over such a NoC can be found using Eq. 3, which has been rewritten from the original in [2] to follow the notation presented in Section 3. To simplify the notation, we assume that there is a one-to-one relationship between application messages and packets sent over the NoC, and therefore use the same symbol \( \varphi \) for both.

\[
S_i = \sum_{\forall \varphi_j \in \text{interf}(\varphi_i)} \left\lceil \frac{S_i + K_j + K^j_i}{T_j} \right\rceil L_j + L_i \tag{3}
\]

The function \( \text{interf}(\varphi_i) \) denotes the direct interference set of \( \varphi_i \), which is the set of all packets that can preempt \( \varphi_i \): those whose routes have at least one NoC link in common with \( \varphi_i \)’s route and that have higher priority. Formally, \( \text{interf}(\varphi_i) \) includes every packet \( \varphi_j \) where \( \text{map}(\varphi_i) \cap \text{map}(\varphi_j) \neq \emptyset \) and \( P_i < P_j \). The intuition behind Eq. 3 is similar to what was presented for Eq. 2. The value of \( S_i \) can be found by adding \( \varphi_i \)’s basic latency \( L_i \) and the latencies \( L_j \) of all releases of packets \( \varphi_j \) that could preempt it. The same iterative solution proposed in [1] can be used here.

It is worth noticing that the release jitter of \( \varphi_j \) can influence how many times it can preempt \( \varphi_i \). In Eq. 3, we consider two types of release jitter: \( K_j \), which is caused by the execution of the task \( \tau_j \) that releases \( \varphi_j \), and \( K^j_i \), which is caused by indirect interference (i.e. packets that can preempt \( \varphi_j \) but cannot interfere with \( \varphi_i \) because they don’t share any links, see [2] for a detailed definition).

Since the value of \( K_j \) must be the maximum amount of time elapsed between the start of \( \varphi_j \)’s period and its actual release, and since we have defined that a packet is released immediately after its respective task has finished computation, we can clearly state that \( K_j = R_j \). Finally, from [2] we have that \( K^j_i = S_j - L_j \).

Thus, the worst case end-to-end response time of a task \( \tau_j \) is given by \( \text{EER}_j = R_j + S_j \), which composes its worst-case computation response time and its worst case communication latency (Figure 2). Its end-to-end schedulability can be tested by checking whether \( \text{EER}_j \leq D_j \).
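The packet latency recurrence and its composition with the task response time can be sketched in the same iterative style as RTA. The code below is an illustration under stated assumptions, not the paper's implementation: each interfering packet is given as a tuple `(L_j, T_j, K_j, KI_j)`, where `KI_j` stands for the indirect interference jitter \( S_j - L_j \), and all function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

# (basic latency L_j, period T_j, release jitter K_j, indirect jitter S_j - L_j)
Interferer = Tuple[float, float, float, float]


def packet_latency(l_i: float, interf: List[Interferer],
                   deadline: float) -> Optional[float]:
    """Fixed-point iteration for the worst case packet latency S_i."""
    s = l_i
    while s <= deadline:
        nxt = l_i + sum(math.ceil((s + k_j + ki_j) / t_j) * l_j
                        for l_j, t_j, k_j, ki_j in interf)
        if nxt == s:
            return s  # converged
        s = nxt
    return None  # stop once the latency alone exceeds the deadline


def end_to_end_ok(r_i: Optional[float], s_i: Optional[float],
                  deadline: float) -> bool:
    """EER = R + S must not exceed the deadline D."""
    if r_i is None or s_i is None:
        return False
    return r_i + s_i <= deadline
```

As a worked example, a packet with \( L_i = 4 \) and a single interferer with \( L_j = 2 \), \( T_j = 10 \) and no indirect jitter has a worst case latency of 6 when \( K_j = 3 \), and of 8 when \( K_j = 5 \), because the larger jitter allows a second release of the interferer to hit the packet.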

4.3 End-to-end schedulability of task chains

To test the schedulability of a task chain \( X \), we need to consider the individual end-to-end response times of all tasks \( \tau_i \in X \). Before we can do that, we must discriminate three modes of execution for task chains over multiple processing elements: sequential, parallel and pipelined.

In a sequential execution, a task chain will be executed completely, in one or more processors, before it can be executed again. In other words, only a single task \( \tau_i \in X \) can be executing at a given point in time.

In a parallel execution over multiple processors, there are no constraints over the execution of task chains, and arbitrarily many jobs of a task chain can be executing at the same time.

![Figure 2. End-to-end response time of a communicating task](image)

A pipelined execution is a special kind of parallel execution which allows multiple jobs of the same task chain to be executed simultaneously over different processors, but disallows the simultaneous execution of more than one job of the same task. A common pattern for pipelined execution is to have a number of jobs of a task chain \( X \) running concurrently, each of them released \( T_x \) time units after the preceding one, in a phase-shifted way. We refer to this pattern as a synchronous pipeline. Figure 3 shows an example of a task chain executing as a synchronous pipeline. It includes three tasks \( \tau_1, \tau_2 \) and \( \tau_3 \) running on separate cores (each of them represented on a separate timeline), their respective communications over NoC links (also shown over separate timelines), occasionally suffering interference from higher priority tasks and packets (not shown in the figure). Curved arrows show the functional dependencies between the computation and communication components of one chain, making it easier to see that those dependencies will always be satisfied as long as each task meets its end-to-end deadline constraint.

![Figure 3. Example of a 3-task chain executed in a synchronous pipeline over 3 processors](image)
\( D_k \) equally among all tasks of the chain, so the end-to-end deadline \( D_k \) of each individual task is equal to \( T_x \). This enables a task chain \( X \) with \( x \) tasks to produce its output \( x \) periods after its release, but once the pipeline is filled each of its jobs will produce an output at every period.

The schedulability test for this particular case is a simple check of whether \( \forall \tau_j \in X, EER_j \leq T_x \). This assumes that there will be acceptable deadline misses for the first \( x \) jobs of the task chain \( X \) while the pipeline is being filled, and guarantees that the system will never miss a deadline after that. The intuition behind this approach is that each task of the chain will be triggered every \( T_x \) time units, and has to finish computing and communicating with the next task of the chain before the end of the period, so that the following task will have all the data it needs before it can run at the next periodic tick.
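The synchronous pipeline test reduces to a comparison of each task's worst case end-to-end response time against the chain period. A minimal sketch, assuming the EER values have already been computed and that an unknown value (`None`) denotes a task whose analysis did not converge within its deadline:

```python
from typing import Optional, Sequence


def pipeline_schedulable(eers: Sequence[Optional[float]], period: float) -> bool:
    """True if every task of the chain has EER <= T_x (illustrative helper)."""
    return all(eer is not None and eer <= period for eer in eers)
```

For a three-task chain with a period of 40 and EER values of 12, 35 and 40, the check succeeds; replacing any of them with a value above 40 makes the chain unschedulable under this test.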

4.4 Link utilisation tests

Similarly to Eq. 1, a utilisation test can be applied to each of the NoC links, aiming to check whether the messages mapped to each of them will not exceed their bandwidth. For every link \( \lambda \in \Lambda \):

\[
\sum_{\phi_j \in \text{map}^{-1}(\lambda)} \frac{L_j}{T_j} \leq 1 \tag{4}
\]

Again, this test is necessary but not sufficient because even though the link \( \lambda \) may be capable of transmitting all messages mapped to it without starvation, they might not meet their deadlines.

By considering the multi-hop nature of NoCs, we identify another utilisation test that addresses the direct interference set \( \text{interf}(\phi_i) \) of a message \( \phi_i \):

\[
\sum_{\phi_j \in \text{interf}(\phi_i)} \frac{L_j}{T_j} \leq 1 \tag{5}
\]

The intuition behind this test is the following: if a message \( \phi_j \) can interfere and hinder the progress of another message \( \phi_i \) over the NoC, this happens regardless of the link where the contention happens. In other words, the complete route of a message can be seen as a single resource with exclusive access, and if a higher priority message needs to use any part of that route the whole transmission of \( \phi_i \) will be halted. For example, if \( \phi_i \) is routed over \( n \) different links and it suffers interference from \( \phi_j \), which also uses one or more of those links and has a higher priority, the time \( \phi_i \) waits for the shared link(s) will be the same whether they share link \( \lambda_1, \lambda_2, \ldots, \lambda_n \) or any combination of them, as in every case \( \phi_i \) will not be able to progress in a pipelined fashion towards its destination.

The same intuition can be extended to other higher priority messages that share any possible combination of links with \( \phi_i \). Therefore, we conclude that the direct interference set \( \text{interf}(\phi_i) \) determines all contenders for the route of a message \( \phi_i \), and the overall utilisation of that route has to be less than 1 due to the exclusive access.

The proof that this test is tighter than the test in Eq. 4 lies in the fact that the direct interference set \( \text{interf}(\phi_i) \) is a superset of each of the sets including the messages that share any of the links \( \phi_i \) is mapped to, and that can interfere with it: \( \forall \lambda \in \text{map}(\phi_i), \text{hp}(\text{map}^{-1}(\lambda)) \subseteq \text{interf}(\phi_i) \), where \( \text{hp}(\{\phi_i \ldots \phi_n\}) \) denotes the subset of messages that have higher priority than \( \phi_i \). From the definition given in Section 4.2, it is easy to see that the direct interference set is the union of all those sets: \( \text{interf}(\phi_i) = \bigcup_{\lambda \in \text{map}(\phi_i)} \text{hp}(\text{map}^{-1}(\lambda)) \). Thus, the utilisation test given in Eq. 5 will cover, when applied to the lowest priority message of each link, the test given in Eq. 4.

In any case, both utilisation tests identified in this subsection are necessary, but not sufficient. While they are useful to discriminate unschedulable mappings, they cannot guarantee schedulability. They are nonetheless useful to prune large mapping spaces, as they are less computationally expensive than the tests described in subsections 4.2 and 4.3.
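Both utilisation tests can be sketched over a simple message description. The code below is an illustration only: the `Message` class, its field names and the two test functions are assumptions, routes are reduced to sets of link labels, and a larger priority value denotes a higher priority, following the \( P_i < P_j \) convention used in Section 4.2.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Message:
    name: str
    route: FrozenSet[str]   # labels of the links the message is mapped to
    priority: int           # larger value = higher priority
    latency: float          # basic latency L
    period: float           # period T


def interf(m: Message, msgs: List[Message]) -> List[Message]:
    """Direct interference set: higher priority messages sharing a link with m."""
    return [o for o in msgs
            if o is not m and o.priority > m.priority and m.route & o.route]


def link_test(msgs: List[Message]) -> bool:
    """Per-link utilisation test (Eq. 4)."""
    load: Dict[str, float] = {}
    for m in msgs:
        for link in m.route:
            load[link] = load.get(link, 0.0) + m.latency / m.period
    return all(u <= 1.0 for u in load.values())


def route_test(msgs: List[Message]) -> bool:
    """Per-route utilisation test over the direct interference set (Eq. 5)."""
    return all(sum(o.latency / o.period for o in interf(m, msgs)) <= 1.0
               for m in msgs)
```

Consider a low priority message routed over links `a` and `b`, and two higher priority messages, one on `a` and one on `b`, each with \( L/T = 0.6 \). Every link stays below full utilisation, so the per-link test passes, but the interference set of the low priority message adds up to 1.2, so the per-route test correctly flags the mapping as unschedulable.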

5. EXPERIMENTAL WORK

To evaluate the correctness and usefulness of the schedulability tests described in the previous section, we devised two types of experiment. In subsection 5.1, we will compare the figures for computation and communication response times found using the proposed schedulability tests with figures obtained through simulation of predefined configurations of a NoC-enabled embedded system. In subsection 5.2, we will then show that the proposed tests can be used as a fitness function within a search-based optimisation heuristic.

5.1 Joint end-to-end schedulability analysis and simulation

In this series of experiments, we analyse the schedulability of a benchmark application over a specific NoC-based embedded platform.

The platform follows the architecture described in Section 2, with homogeneous cores running priority-preemptive task schedulers, distributed memory, 2D-mesh NoC interconnect with XY dimension routing, credit-based flow control, 8 virtual channels with 3-flit input buffers per port and priority-preemptive link arbitration. It is worth noticing that the schedulability tests proposed in Section 4 would support alternatives on most of those architectural choices, but priority-preemptive arbitration at the cores and NoC links is a requirement.

The chosen benchmark application is based on the autonomous vehicle (AV) introduced in [12], including 39 communicating tasks performing functionality such as navigation control, vibration control and obstacle detection through stereo photogrammetry. Task periods vary between 0.04 and 1 second, and communication volumes vary between 1 and 76 kbytes.

To model the benchmark as task chains, a number of tasks of the original application had to be partitioned (i.e. to break tree-like structures when a task receives data from multiple sources). Furthermore, we had to introduce the notion of "sink tasks" to model DMA transfers to the local memory of the core to which specific tasks are mapped. In those cases, the destination task does not require any computation overhead (e.g. the last task of a chain writes to a memory-mapped actuator, so it does not load the destination core, but uses the bandwidth of the NoC links all the way to the destination interface).

To maintain the realism of the benchmark, we constrained all mappings used in the paper in such a way that all partitions of a task, as well as its respective sink, are mapped to the same core (so that only possible mappings of the application were considered). Table 1 shows the complete set of tasks, showing which chain they belong to (chains have lengths between 2 and 5 tasks), their names (first four letters indicate the original task name from [12], with an appendix if the task is a partition or a sink of one of the original tasks), destination task, computation time, period, priority and communication volume.

We selected one platform configuration (a 4x4 mesh) and three different task allocations, and applied equations 2 and 3 to find the worst-case end-to-end response time of each of the 39 communicating tasks under each mapping. Figures 4a, 4b and 4c show the results for mappings M1, M2 and M3 respectively.

The worst-case end-to-end response time of each task is plotted with a brown cross, and its individual deadline is shown as a red horizontal line. Mappings M2 and M3 are fully schedulable, as all EER values are below the respective deadlines. M1, however, has a number of unschedulable tasks, denoted by the brown crosses plotted at the upper margin of Figure 4a (the actual worst-case response times in those cases were not found, as our implementation stops iterating towards a solution once a deadline is missed).
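
The early-termination behaviour mentioned above can be illustrated with a minimal sketch. It does not reproduce equations 2 and 3; it assumes the classic single-core response-time recurrence, in which the response time equals the task's own computation time plus the interference of higher-priority tasks, and the function name and task parameters are hypothetical. The point is only that the iteration returns as soon as the running value exceeds the deadline, so no actual worst-case value is available for unschedulable tasks.

```python
import math

def response_time(c, deadline, higher_priority):
    """Classic fixed-point response-time iteration (illustrative only).

    c               -- worst-case computation time of the task under analysis
    deadline        -- its relative deadline
    higher_priority -- list of (c_j, t_j) pairs for the interfering tasks
    Returns the converged response time, or None if the deadline is
    exceeded before convergence (the true worst case is then unknown).
    """
    r = c
    while True:
        interference = sum(math.ceil(r / t_j) * c_j for c_j, t_j in higher_priority)
        r_next = c + interference
        if r_next > deadline:
            return None   # stop iterating: the deadline is already missed
        if r_next == r:
            return r      # fixed point reached
        r = r_next

# Hypothetical task sets, all times in ms
print(response_time(20, 100, [(10, 40), (5, 50)]))   # 35
print(response_time(30, 40, [(10, 40), (20, 40)]))   # None
```

In the second call the first iteration already yields 60 ms against a 40 ms deadline, so the function gives up, which corresponds to the crosses drawn at the upper margin of the plot.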

We then used the tool flow presented in [13] and the simulation models presented in [14] to obtain latency figures for the execution of the benchmark application over the platform under all three mappings. We simulated each scenario for a target time of 200 seconds, which allows for good coverage of the application lifetime (the shortest period is that of the video processing tasks, which must execute every 0.04 seconds to achieve 25 VGA frames per second, and the longest period is 1 second, for the tyre pressure control task). Figure 4 also shows, for each mapping, the best, worst and average end-to-end latency observed during simulation for each of the 39 communicating tasks. The plots show that the worst-case response times found by our schedulability tests are effectively an upper bound on all results found with simulation. They also show that while M2 and M3 are fully schedulable mappings, M2 has higher worst-case and maximum observed latencies (especially in communications 4, 7 and 37). Taking that into account, mapping M3 would be preferable, as its results allow for larger safety margins.

### Table 1 – Autonomous Vehicle benchmark

<table>
<thead>
<tr><th>task</th><th>chain</th><th>name</th><th>dest task</th><th>comp (ms)</th><th>period (ms)</th><th>pri</th><th>comm (bytes)</th></tr>
</thead>
<tbody>
<tr><td>1</td><td>A</td><td>POSI-A</td><td>2</td><td>5</td><td>500</td><td>31</td><td>2048</td></tr>
<tr><td>2</td><td>A</td><td>NAVC-A</td><td>3</td><td>10</td><td>500</td><td>32</td><td>4096</td></tr>
<tr><td>3</td><td>A</td><td>OBDB-A</td><td>42</td><td>150</td><td>500</td><td>33</td><td>32768</td></tr>
<tr><td>4</td><td>B</td><td>OBDB-B</td><td>33</td><td>150</td><td>1000</td><td>34</td><td>65536</td></tr>
<tr><td>5</td><td>C</td><td>NAVC-C</td><td>40</td><td>20</td><td>100</td><td>24</td><td>1024</td></tr>
<tr><td>6</td><td>C</td><td>SPES-C</td><td>5</td><td>5</td><td>100</td><td>25</td><td>1024</td></tr>
<tr><td>7</td><td>D</td><td>NAVC-D</td><td>40</td><td>10</td><td>100</td><td>26</td><td>2048</td></tr>
<tr><td>8</td><td>E</td><td>FBUS-E</td><td>47</td><td>10</td><td>40</td><td>1</td><td>76800</td></tr>
<tr><td>9</td><td>F</td><td>FBUS-F</td><td>48</td><td>10</td><td>40</td><td>2</td><td>76800</td></tr>
<tr><td>10</td><td>G</td><td>VOD2</td><td>42</td><td>20</td><td>500</td><td>3</td><td>1024</td></tr>
<tr><td>11</td><td>H</td><td>VOD2</td><td>42</td><td>20</td><td>500</td><td>4</td><td>1024</td></tr>
<tr><td>12</td><td>I</td><td>FBUS</td><td>20</td><td>10</td><td>40</td><td>5</td><td>76800</td></tr>
<tr><td>13</td><td>J</td><td>FBUS</td><td>21</td><td>10</td><td>40</td><td>6</td><td>76800</td></tr>
<tr><td>14</td><td>K</td><td>FBUS</td><td>22</td><td>10</td><td>40</td><td>7</td><td>76800</td></tr>
<tr><td>15</td><td>L</td><td>FBUS</td><td>23</td><td>10</td><td>40</td><td>8</td><td>76800</td></tr>
<tr><td>16</td><td>M</td><td>FBUS</td><td>24</td><td>10</td><td>40</td><td>9</td><td>76800</td></tr>
<tr><td>17</td><td>N</td><td>FBUS</td><td>25</td><td>10</td><td>40</td><td>10</td><td>76800</td></tr>
<tr><td>18</td><td>O</td><td>FBUS</td><td>26</td><td>10</td><td>40</td><td>11</td><td>76800</td></tr>
<tr><td>19</td><td>P</td><td>FBUS</td><td>27</td><td>10</td><td>40</td><td>12</td><td>76800</td></tr>
<tr><td>20</td><td>I</td><td>BFE1</td><td>28</td><td>20</td><td>40</td><td>13</td><td>4096</td></tr>
<tr><td>21</td><td>J</td><td>BFE1</td><td>43</td><td>20</td><td>40</td><td>14</td><td>4096</td></tr>
<tr><td>22</td><td>K</td><td>BFE1</td><td>43</td><td>20</td><td>40</td><td>15</td><td>4096</td></tr>
<tr><td>23</td><td>L</td><td>BFE1</td><td>43</td><td>20</td><td>40</td><td>16</td><td>4096</td></tr>
<tr><td>24</td><td>M</td><td>BFE1</td><td>29</td><td>20</td><td>40</td><td>17</td><td>4096</td></tr>
<tr><td>25</td><td>N</td><td>BFE1</td><td>44</td><td>20</td><td>40</td><td>18</td><td>4096</td></tr>
<tr><td>26</td><td>O</td><td>BFE1</td><td>44</td><td>20</td><td>40</td><td>19</td><td>4096</td></tr>
<tr><td>27</td><td>P</td><td>BFE1</td><td>44</td><td>20</td><td>40</td><td>20</td><td>4096</td></tr>
<tr><td>28</td><td>Q</td><td>FDF1</td><td>30</td><td>10</td><td>40</td><td>21</td><td>16384</td></tr>
<tr><td>29</td><td>M</td><td>FDF1</td><td>51</td><td>10</td><td>40</td><td>22</td><td>16384</td></tr>
<tr><td>30</td><td>I</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>23</td><td>8192</td></tr>
<tr><td>31</td><td>Q</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>24</td><td>8192</td></tr>
<tr><td>32</td><td>R</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>25</td><td>8192</td></tr>
<tr><td>33</td><td>B</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>26</td><td>8192</td></tr>
<tr><td>34</td><td>S</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>27</td><td>8192</td></tr>
<tr><td>35</td><td>T</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>28</td><td>8192</td></tr>
<tr><td>36</td><td>U</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>29</td><td>8192</td></tr>
<tr><td>37</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>30</td><td>8192</td></tr>
<tr><td>38</td><td>U</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>31</td><td>8192</td></tr>
<tr><td>39</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>32</td><td>8192</td></tr>
<tr><td>40</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>33</td><td>8192</td></tr>
<tr><td>41</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>34</td><td>8192</td></tr>
<tr><td>42</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>35</td><td>8192</td></tr>
<tr><td>43</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>36</td><td>8192</td></tr>
<tr><td>44</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>37</td><td>8192</td></tr>
<tr><td>45</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>38</td><td>8192</td></tr>
<tr><td>46</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>39</td><td>8192</td></tr>
<tr><td>47</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>40</td><td>8192</td></tr>
<tr><td>48</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>41</td><td>8192</td></tr>
<tr><td>49</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>42</td><td>8192</td></tr>
<tr><td>50</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>43</td><td>8192</td></tr>
<tr><td>51</td><td>V</td><td>STPH</td><td>43</td><td>30</td><td>40</td><td>44</td><td>8192</td></tr>
</tbody>
</table>

5.2 Using end-to-end schedulability tests as fitness within search-based optimisation

As shown in Figure 4, the end-to-end timeliness of applications is affected by the way tasks and flows are mapped onto the NoC platform. Finding optimal mappings, or even acceptable ones, is a challenge in most NoC systems, and a significant amount of research has been dedicated to this topic [15]. Heuristic search-based mapping is one of the mapping techniques reviewed in [15], and its distinctive feature is a fitness function that evaluates solutions over a given search space, aiming to converge towards solutions with increasing fitness. A common practice in such cases is to simulate the NoC platform with a given mapping for a specific amount of time, and use some aggregate of the latencies of all packets obtained through the simulation as the fitness of that mapping [16]. This process is then repeated for many different mappings across the search space until a mapping is found that fulfils the requirements (e.g. average latency of all packets below a given threshold).

In this subsection, we show that the schedulability tests described in Section 4 can be used to solve two problems found in search-based mappers that use simulators as the fitness function, specifically when optimising hard real-time systems. As discussed in Section 1, simulators cannot easily find worst-case packet latencies, and the time they take to run can be very high when evaluating complex NoCs. In search-based mappers, the second problem is particularly severe, because the search heuristic may have to simulate hundreds or thousands of different mappings before an acceptable solution can be found.

To solve those problems, we have implemented a search-based algorithm that uses the approach described in subsection 4.3 to find, given a particular mapping, how many of an application’s tasks are end-to-end schedulable. As in [16], our search-based heuristic follows an evolutionary approach, modelling a particular mapping as a chromosome in which each gene represents the processing core where the corresponding task should be mapped (Figure 5.a). The evolution is performed across generations of a population of 100 individuals, each represented by one such chromosome. The initial population can be randomly generated, but subsequent generations are produced by applying crossover and mutation operations over the fittest chromosomes of the preceding one (Figure 5.b). In our implementation, crossovers were implemented by creating a new chromosome from the first and second halves of two existing chromosomes. Similarly, mutations created new chromosomes by swapping the contents of any two genes of an existing chromosome. The chromosomes chosen for crossover and mutation were those that, when evaluated using the technique described in 4.3, had the lowest number of unschedulable tasks.

Ideally, after a number of generations the population will contain at least one individual with a chromosome representing a mapping that meets our constraints, i.e. has zero tasks that are end-to-end unschedulable. A detailed study on how different chromosome formats, population sizes, mutation and crossover styles and rates affect the convergence of the genetic algorithm towards a fully schedulable solution can be found in [17].

\[
\begin{array}{cccccc}
\tau_1 & \tau_2 & \tau_3 & \tau_4 & \ldots & \tau_n \\
\pi_c & \pi_d & \pi_k & \pi_l & \ldots & \pi_b \\
\end{array}
\]

Figure 5. Evolutionary mapping: (a) chromosome format and (b) evolutionary search process.
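
The crossover and mutation operators described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the function names, the number of parents kept per generation, the even split between crossover and mutation, and the placeholder fitness are all hypothetical. A chromosome is a list whose index is the task and whose value is the core that task is mapped to.

```python
import random

def crossover(parent_a, parent_b):
    """New chromosome: first half of parent_a followed by second half of parent_b."""
    half = len(parent_a) // 2
    return parent_a[:half] + parent_b[half:]

def mutate(chromosome, rng):
    """New chromosome with the contents of two randomly chosen genes swapped."""
    i, j = rng.sample(range(len(chromosome)), 2)
    child = list(chromosome)
    child[i], child[j] = child[j], child[i]
    return child

def next_generation(population, fitness, rng, n_parents=10):
    """Breed a population of the same size from the fittest individuals.

    fitness(chromosome) is assumed to return the number of end-to-end
    unschedulable tasks (lower is better); it stands in for the test of
    subsection 4.3. n_parents and the 50/50 operator split are assumptions.
    """
    parents = sorted(population, key=fitness)[:n_parents]
    children = []
    while len(children) < len(population):
        if rng.random() < 0.5:
            a, b = rng.sample(parents, 2)
            children.append(crossover(a, b))
        else:
            children.append(mutate(rng.choice(parents), rng))
    return children

def toy_fitness(chromosome):
    """Placeholder only: counts tasks that share a core, NOT a schedulability test."""
    return len(chromosome) - len(set(chromosome))

rng = random.Random(0)
n_cores, n_tasks = 16, 8   # toy sizes
population = [[rng.randrange(n_cores) for _ in range(n_tasks)] for _ in range(100)]
population = next_generation(population, toy_fitness, rng)
print(len(population))     # 100
```

Replacing `toy_fitness` with a function that counts end-to-end unschedulable tasks would turn this loop into the kind of search described in this subsection; the search stops once some individual reaches a fitness of zero.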

We have used this evolutionary mapping algorithm to search for schedulable mappings of the AV application over 3x3, 4x4 and 5x5 mesh NoC platforms. Figure 6 shows the number of end-to-end unschedulable tasks of the best mapping of each generation. Mappings for the 5x5 platform can be found very easily (as there are more resources and therefore less interference), reaching a fully schedulable mapping in 8 generations. For the 4x4 platform, the situation is slightly more difficult, but the evolutionary mapper is capable of finding a fully schedulable mapping in 11 generations. Finally, for the 3x3 platform, the evolutionary mapper cannot find a fully schedulable mapping within 50 generations (because the utilisation of the application exceeds the available capacity of the 3x3 platform), but it clearly shows improvement over generations, reaching a minimum of 12 end-to-end unschedulable tasks.

5.2.1 Performance comparison

The performance of the proposed schedulability test, when used as the fitness function of a search-based heuristic, shows a significant improvement over simulation-based fitness functions such as those used by [16]. A simple Java-based implementation of the proposed test takes 0.13 seconds to evaluate the schedulability of a single mapping of the AV application over a 4x4 NoC. This is at least one order of magnitude faster than simulation, as reported in [14] and consistently confirmed by our experiments reported in subsection 5.1. The time it takes to simulate 2 seconds of a single mapping of the AV application is 7.89 seconds, for a fast simulator operating at TLM (Transaction Level Modelling) level. For a cycle-accurate simulation of the same scenario, the time elapsed is 2895.69 seconds.

Such numbers show that even if it were feasible to identify the worst-case release scenario for all tasks and packets, a state-of-the-art NoC simulator would take about 60 times longer (or more than 20000 times longer, if full accuracy must be achieved) to evaluate the fitness of one specific mapping. Recalling that in a typical search-based mapping heuristic one must check the fitness of thousands of mappings, we can clearly see the advantage of the proposed approach (e.g. in the experiments described above we needed 1100 applications of the fitness function to find a fully schedulable mapping for a 4x4 platform, i.e. 11 generations of a population of 100 individuals).
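
The arithmetic behind this comparison can be checked with a few lines. The sketch below uses only the timings quoted above; the variable and function names are hypothetical, and it optimistically assumes that the 2 simulated seconds would suffice for each fitness evaluation.

```python
ANALYSIS_S = 0.13         # proposed schedulability test, one mapping
TLM_SIM_S = 7.89          # TLM simulation of 2 s of one mapping
CYCLE_SIM_S = 2895.69     # cycle-accurate simulation of the same scenario
EVALUATIONS = 11 * 100    # 11 generations x 100 individuals

def slowdown(per_evaluation_s):
    """How many times slower than the analytical test one evaluation is."""
    return per_evaluation_s / ANALYSIS_S

def search_time(per_evaluation_s):
    """Total fitness-evaluation time, in seconds, for the 4x4 search."""
    return per_evaluation_s * EVALUATIONS

for label, t in [("analysis", ANALYSIS_S),
                 ("TLM simulation", TLM_SIM_S),
                 ("cycle-accurate simulation", CYCLE_SIM_S)]:
    print(f"{label}: {slowdown(t):.0f}x, {search_time(t):.0f} s in total")
```

Under these assumptions the 1100 evaluations take about 143 seconds with the analytical test, about 8679 seconds (roughly 2.4 hours) with the TLM simulator, and about 3.2 million seconds (roughly 37 days) with the cycle-accurate simulator.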

6. RELATED WORK

Besides RTA and its derivatives, other analytical models have also been used to evaluate schedulability in NoCs.

Beekooij et al. [18] have proposed an extension to dataflow analysis (originally proposed by Lee and Messerschmitt [19]) that can model the behaviour of a homogeneous synchronous dataflow (HSDF) application performing computation and communication over a specific type of NoC (i.e. statically scheduled time-division multiplexing of links). They assume that the worst-case computation time of each application task is known (as in this paper, where it is referred to as $C_t$ in Section 3). However, due to the nature of their underlying NoC architecture, they can assume that there is no contention over NoC links, and thus the delay introduced by the NoC to each data transfer can be established independently for each task chain. Therefore, the worst-case end-to-end latency of a task chain can be found by dataflow analysis, which can calculate the latest arrival time of the data token at the output of the last task of the chain.

Qian et al. [20] proposed the use of network calculus [21] to calculate worst-case packet latency bounds in wormhole NoCs, as long as all traffic can be modelled as an arrival curve and all NoC routers can be modelled by a service curve. Such curves abstract the actual behaviour of the application and the NoC by the bandwidth required or provided, respectively, at each point in time. The calculation of latency bounds is done through algebraic operations over all arrival curves at a given router, as well as the router’s service curve. The main challenge of this approach is to represent the behaviour of a sequence of specific routers (with their particular buffering and arbitration schemes) as a service curve. The modelling of the application traffic as arrival curves is also challenging, especially if variations in the source task’s execution are taken into account (e.g. execution time variability or interference from tasks running on the same core), and this is currently an open problem preventing the use of network calculus in the evaluation of NoC end-to-end schedulability.
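
As an illustration of the kind of algebra involved, consider the textbook single-node case of network calculus (not the specific model of [20]): a flow constrained by a token-bucket arrival curve $\alpha$ with burst $b$ and rate $r$ crosses a router offering a rate-latency service curve $\beta$ with rate $R \ge r$ and latency $T$. The symbols $b$, $r$, $R$ and $T$ are introduced here for illustration only. The worst-case delay $D_{\max}$ is the maximum horizontal distance between the two curves:

\[
\alpha(t) = b + r\,t, \qquad \beta(t) = R\,\max(0,\; t - T), \qquad D_{\max} = T + \frac{b}{R}
\]

The difficulty discussed above lies in obtaining trustworthy values for such parameters, both for a chain of priority-arbitrated routers and for traffic released by tasks that themselves suffer interference.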

Other approaches to evaluate NoC schedulability are surveyed in [22], all of them based on dataflow analysis, network calculus or RTA. The survey also notes the difficulty of comparing analytical methods based on distinct formalisms, as they have fundamentally different assumptions. Still, its authors provide a summary of the strengths and weaknesses of each type of analysis. Their assessment of dataflow and network calculus models is similar to the one we provided above, emphasising the restrictions that must be imposed on the application behaviour and the NoC resource sharing disciplines. Their assessment of RTA and its derivatives, however, states that the main weakness is the inability to represent dependencies between flows, which is an issue that we have directly addressed in this paper and solved for the restricted case of synchronous pipelines.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we investigated ways to determine whether application tasks executing and communicating over a specific NoC-based multicore can meet all application-specific timing constraints. We have identified a number of schedulability tests, and have shown their utility within distinct steps of an embedded system design flow. By combining them with simulation, designers can obtain a more detailed understanding of the overheads that are needed to guarantee performance in the worst case, as opposed to the average case. And by using them as fitness in search-based optimisation, we enabled a faster coverage of the typically large design spaces given by the multiple design alternatives in this kind of system.

For the sake of simplicity, we assumed an application model where tasks require all data to be available locally before they execute, and can send a single message only after they finish their computation. While restrictive, this model supports the widely used Actor model (i.e. read-execute-write) and can represent applications based on task chains. A more general formulation that allows tasks to send an arbitrary number of messages can be easily derived, but was left to future work; it would enable the representation of tree-like structures. Even in the case of simple task chains, we have only addressed the restricted case of synchronous pipelines. Extensions to the presented tests to address general pipeline and sequential execution are currently under investigation, and so is the use of deadline decomposition approaches (such as in [23] and [24]) and schedulability tests supporting release offsets (such as [25]).

Additional future work can take advantage of the utilisation tests presented in subsections 4.1 and 4.4 to accelerate the design space exploration by quickly pruning away mappings with over-utilised cores or links. Such an approach could further improve the performance reported in subsection 5.2.1, where the substantially heavier schedulability test presented in subsection 4.3 was used throughout the whole optimisation.
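
A minimal sketch of such a pruning step is shown below. It does not reproduce the tests of subsections 4.1 and 4.4; it only assumes the usual necessary condition that the sum of computation time over period on each core must not exceed 1, and the function name and data layout are hypothetical. A link-utilisation check would follow the same pattern with packet transmission times in place of computation times.

```python
from collections import defaultdict

def overutilised_cores(tasks, mapping):
    """Return the set of cores whose total utilisation sum(C/T) exceeds 1.

    tasks   -- dict: task id -> (computation time, period), same time unit
    mapping -- dict: task id -> core id
    """
    load = defaultdict(float)
    for task_id, (c, t) in tasks.items():
        load[mapping[task_id]] += c / t
    return {core for core, u in load.items() if u > 1.0}

# Computation time and period (ms) of tasks 1, 2, 30 and 31 as listed in Table 1
tasks = {1: (5, 500), 2: (10, 500), 30: (30, 40), 31: (30, 40)}

# Tasks 30 and 31 on the same core: 0.75 + 0.75 > 1, so the mapping is pruned
bad = overutilised_cores(tasks, {1: 0, 2: 0, 30: 1, 31: 1})
# Spreading them over two cores passes this (necessary, not sufficient) check
ok = overutilised_cores(tasks, {1: 0, 2: 0, 30: 1, 31: 2})
print(bad, ok)   # {1} set()
```

A mapping rejected here never needs to reach the heavier end-to-end test, while a mapping that passes must still be checked by it.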

Finally, the proposed platform model assumes homogeneous cores, switches and links. Interesting avenues of research can also be opened by lifting such restrictions.

8. ACKNOWLEDGEMENTS

The author would like to thank Zheng Shi, Alan Burns, Osmar Marchi dos Santos and Borislav Nikolic for the discussions on the tests presented in Section 4; and Paris Mesidis, Adrian Racu and Norazizi Sayuti for the discussions and help with the experimental work supporting subsection 5.2.

9. REFERENCES


