Optimizing Direct Convolutions on ARM Multi-Cores

Convolution kernels are widely seen in deep learning workloads and are often responsible for performance bottlenecks. Recent research has demonstrated that a direct convolution approach can outperform the traditional convolution implementation based on tensor-to-matrix conversions. However, existing approaches for direct convolution still have room for performance improvement. We present nDirect, a new direct convolution approach that targets ARM-based multi-core CPUs commonly found in smartphones and HPC systems. nDirect is designed to be compatible with the data layout formats used by mainstream deep learning frameworks but offers new optimizations for the computational kernel, data packing, and parallelization. We evaluate nDirect by applying it to representative convolution kernels and demonstrating its performance on four distinct ARM multi-core CPU platforms. We compare nDirect against state-of-the-art convolution optimization techniques. Experimental results show that nDirect gives the best overall performance across evaluation scenarios and platforms.


INTRODUCTION
Convolutional Neural Networks (CNNs) are one of the most popular deep neural network architectures and are found to be successful in a wide range of tasks, including image classification [38,60], object detection [46,55,56], natural language processing [43], and semantic segmentation [58,68]. The core component of a CNN is the convolutional (CONV) operation [23,35,51,63], which is often responsible for the performance bottleneck of a CNN, accounting for over 90% of the CNN execution time [28]. As such, there has been considerable interest in optimizing convolution implementations to accelerate CNNs [1,10,21].
Traditionally, CONV kernels were implemented as general matrix multiplications (GEMM) [1,3,9,10]. This approach maps the input tensor into a row- or column-major matrix through a format conversion (also known as im2col) to translate CONV operations into GEMM kernels [18,52]. With this matrix format, the convolution can be performed as a single matrix multiplication, taking advantage of highly optimized GEMM kernels provided by heavily tuned BLAS (Basic Linear Algebra Subprograms) libraries [3,7,11].
However, im2col can increase the memory footprint, and the tensor-to-matrix conversion can result in an irregular-shaped matrix with sub-optimal performance [31,53,66]. As such, more recent approaches attempt to optimize CONV operations without converting the input tensors into matrices. This strategy is known as direct convolution [16,24,27,30-32,45,49,52,54,65,67,69]. It works by sliding a CONV kernel over the input tensor and computing the dot product between the kernel and a small patch of the input at each position. This operation is repeated for every input position to produce a feature map. Direct convolution has two advantages over im2col. Firstly, direct convolution has lower memory requirements as it operates directly on the input tensor, avoiding transforming the input tensor into a larger matrix. This can improve cache locality and reduce memory usage. Secondly, direct convolution can exploit the sparsity of the convolution kernel and avoid unnecessary computations [54], leading to faster computation times.
LIBXSMM is the state-of-the-art library-based solution for implementing direct convolutions on CPUs [5,6,31,33]. It uses a specialized data layout and the Batch-Reduce GEMM (BRGEMM) as the computational kernel [32,33], giving improved performance over im2col+GEMM on x86 and ARM CPUs. While promising, LIBXSMM has two fundamental drawbacks. First, its data layout design is incompatible with the common data layouts (i.e., NCHW = [Batch Size, Input Channels, Input Height, Input Width] or NHWC = [Batch Size, Input Height, Input Width, Input Channels]) used in mainstream deep learning (DL) frameworks [8,13,30]. Therefore, integrating the BRGEMM routines into DL frameworks requires either refactoring the underlying DL framework or introducing a format conversion stage in the user code when entering and exiting each CONV operator. The latter requires changes to the standard user model code and incurs additional overhead. Second, LIBXSMM still uses a conventional GEMM-based micro-kernel, which fails to leverage potential data reuse opportunities in convolutions [9,52] to improve performance. Other work on direct convolutions [27,33,52,65,67,69] also fails to address these two limitations within a single framework. These approaches either use a different data layout with integration issues [33,52,67] or sacrifice performance to maintain compatibility with standard data layouts [27,65,69].
This paper presents nDirect, a new direct convolution solution with a focus on providing high performance, high data reusability, and DL framework compatibility. Our work explicitly targets ARM multi-core CPUs widely seen in smartphones and HPC systems, which are also commonly used for CNN model inference. nDirect implements new strategies for micro-kernel computation, data packing and parallelization. nDirect is designed to be compatible with mainstream DL frameworks and does not require code refactoring of the underlying CONV implementations or the user model code. Instead of transforming data between different data layouts [31,67], nDirect adheres to the conventional NCHW and NHWC tensor layouts. The code and data for this paper are publicly available at: https://github.com/nDIRECT/nDIRECT. We demonstrate the benefit of nDirect by applying it to three HPC multi-cores and one embedded CPU of the ARMv8 architecture. We evaluate nDirect by measuring its performance on individual convolution layers and the end-to-end inference time of representative CNN models. We compare nDirect against four state-of-the-art convolution approaches [18,27,31,70]. We show that nDirect consistently delivers better performance across hardware platforms. We showcase that, despite being a low-level library-based method lacking high-level optimizations like operator fusion, nDirect is competitive with Ansor, an automated search framework within TVM, for end-to-end inference optimization.
This paper makes the following contributions:
• It presents a new direct convolution algorithm that preserves the conventional data layouts used by mainstream DL frameworks;
• It proposes a new way to implement convolution computation kernels, which outperforms existing solutions;
• It provides a set of analytical models to derive the optimal algorithmic parameters.

BACKGROUND AND PROBLEM SCOPE
Table 1 summarizes the CONV notations used throughout the paper.

Prior Convolution Implementations
Algorithm 1 gives a straightforward, unoptimized implementation of CONV, which has seven nested loops around a multiply-and-accumulate statement. The algorithm uses the stride to determine how the filter tensor slides over the spatial dimensions of the input tensor to generate the output tensor. As there are no dependencies across the loop iterations, the computation can be permuted and tiled to improve performance [45]. Algorithm 1 can typically be improved using four strategies [48], including the direct convolution targeted in this work, the im2col+GEMM approach, FFT (Fast Fourier Transform) and Winograd [44]. While FFT and Winograd can reduce the computational complexity, they have limited applications [41,50], because both methods can increase memory pressure and reduce prediction accuracy [42]. Since our work focuses on optimizing CONV without compromising prediction accuracy, direct convolution and im2col+GEMM are the most relevant methods.
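To make the loop structure concrete, the following is a minimal C sketch of such a naive implementation, assuming an NCHW layout, unit stride and no padding; the tensor and variable names are illustrative rather than the paper's notation.

// Naive direct convolution (NCHW, stride 1, no padding).
// in: [N][C][H][W], flt: [K][C][R][S], out: [N][K][OH][OW].
void conv_naive(int N, int C, int H, int W, int K, int R, int S,
                const float *in, const float *flt, float *out) {
    int OH = H - R + 1, OW = W - S + 1;
    for (int n = 0; n < N; n++)                      /* batch */
      for (int k = 0; k < K; k++)                    /* output channels */
        for (int oh = 0; oh < OH; oh++)              /* output height */
          for (int ow = 0; ow < OW; ow++) {          /* output width */
            float acc = 0.0f;
            for (int c = 0; c < C; c++)              /* input channels (reduction) */
              for (int r = 0; r < R; r++)            /* filter height (reduction) */
                for (int s = 0; s < S; s++)          /* filter width (reduction) */
                  acc += in[((n*C + c)*H + oh + r)*W + ow + s] *
                         flt[((k*C + c)*R + r)*S + s];
            out[((n*K + k)*OH + oh)*OW + ow] = acc;
          }
}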

Im2col+GEMM Approach
The process of lowering the CONV kernel to GEMM is known as im2col. A GEMM computation generally involves three dimensions, referred to as M′, N′ and K′ in this paper. Given a convolution size, im2col flattens each input patch into a column and arranges these columns into a concatenated matrix. The convolution kernels are stored in matrix format ahead of time and passed to a GEMM routine to execute the convolution. While using optimized GEMMs can speed up convolutions, this method has additional memory overhead, and the available hardware memory bandwidth can restrict its performance. This is a particular problem for the parallel execution of CONV on multi-core CPUs with large batch sizes, because the available bandwidth per core may be insufficient to achieve optimal GEMM performance.
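As an illustration, the following C sketch expands one input image into the patch matrix consumed by the GEMM call; stride 1 and no padding are assumed, and the buffer and variable names are only for this example. Note that the resulting matrix is roughly R × S times larger than the input, which is the extra memory footprint discussed above.

// im2col: expand in[C][H][W] into col[(C*R*S) x (OH*OW)], so that
// multiplying the (K x C*R*S) filter matrix by col yields the output.
void im2col(int C, int H, int W, int R, int S,
            const float *in, float *col) {
    int OH = H - R + 1, OW = W - S + 1;
    for (int c = 0; c < C; c++)
      for (int r = 0; r < R; r++)
        for (int s = 0; s < S; s++) {
          int row = (c*R + r)*S + s;              /* row of the patch matrix */
          for (int oh = 0; oh < OH; oh++)
            for (int ow = 0; ow < OW; ow++)
              col[(row*OH + oh)*OW + ow] =
                  in[(c*H + oh + r)*W + ow + s];  /* duplicated input element */
        }
}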

Direct Convolution
There have been several attempts to optimize direct convolutions with varying degrees of success. For example, the ARM Compute Library (ACL) [1] supports a direct convolution implementation, but it gives poor performance on our evaluation platform and only works for very limited configurations. LIBXSMM [5,31] is the work most closely related to nDirect, but it uses a new storage format to enhance data locality and exploit SIMD instructions. Additionally, it tiles loops to accommodate small matrix multiplications as the innermost micro-kernel, which is generated by just-in-time (JIT) compilation [39]. In this process, the filter, stored in the conventional data layout, is converted into a blocked tensor layout.

Search-based Code Optimization
There is also a body of work on auto-tuning the DNN back-end code generation [15,19,20,47,64,71]. Ansor [70], part of the TVM DL compiler [19], employs evolutionary search with a predictive model to find an optimized code schedule by exploring optimizations like loop tiling and instruction scheduling. The code schedule is then passed to the back-end code generator (e.g., the LLVM compiler) to emit machine instructions. Ansor supports auto-tuning for direct convolution with conventional data layouts. Ansor also leverages the operator-fusion technique from Relay [57], grouping operators into computational subgraphs for code optimization.

Positioning Our Work
Table 2 summarizes prior work in terms of format conversion, memory footprint, and performance. LIBXSMM requires format conversion, which entails either code refactoring or run-time conversion overhead. Im2col+GEMM can increase memory pressure. All prior methods also leave much room for performance improvement. nDirect therefore aims to fill this gap by preserving the mainstream DL formats while achieving high performance.
To illustrate these points, consider optimizing the CONV operators on Phytium 2000+, a 64-core ARMv8 multi-core CPU [29]. We employ ResNet-50, a popular CNN for image recognition [38], set the batch size to match the number of available physical cores [31], and execute various CONV layers of different sizes from ResNet-50 (see Table 4).

Motivation Results
Figure 1 shows the convolution performance of representative implementations; we normalize the throughput (GFLOPS) relative to the theoretical peak performance of Phytium 2000+ with 64 cores.
Breakdown of overhead. Figure 1a gives a breakdown of the runtime overheads for each part of the im2col+GEMM and LIBXSMM direct convolution approaches. In the case of im2col+GEMM, the runtime overheads arise from data packing, the im2col transformation and micro-kernel calls. Convolutions with filter sizes larger than 1 × 1 require the im2col transformation, which incurs expensive data duplication. Additionally, the overhead of data packing cannot be ignored, accounting for up to 40% of the total cost for CONV layer 17. For LIBXSMM, assuming the conventional data format is adopted, the runtime overheads originate from the data format transformation and micro-kernel calls. As presented in Figure 1a, the cost of data format transformation accounts for the majority of the overall overhead, up to 90% of the total execution time for CONV layer 5.
Parallel execution. Figure 1b displays the performance of individual CONV layers from ResNet-50 when executed with a batch size of 64 on 64 cores. It is worth noting that we only measure the performance of LIBXSMM's micro-kernels to observe the benefits of using a cache-friendly data format. Despite performing the best, LIBXSMM only delivers on average 50% of the theoretical CPU peak performance. In addition, we observe that im2col+GEMM achieves 40% of the peak performance. For convolutions that do not need the im2col transformation, such as CONV layers 19 and 20, the GEMM method achieves close to 50% of the peak performance.

Opportunities for Improvement
After closely examining the results and the implementation of prior work, we have identified three opportunities for improvement, described as follows.
First of all, compatibility with the mainstream data layouts used in DL frameworks is important for the adoption of these approaches. While LIBXSMM has achieved promising convolution performance, it introduces new data layouts designed to improve cache locality and exploit vectorization. However, incorporating such new data layouts into mainstream DL applications would require significant redevelopment of existing frameworks and entail substantial engineering effort. This is challenging for processors like ARM CPUs, which often lack DL software support compared to x86 CPUs and GPUs. Alternatively, without changing the underlying DL framework, data format conversion has to be performed by the user code before and after calling each CONV operator. This not only requires user code refactoring, but the expensive format conversion overhead can also outweigh the benefit of the optimized layout. For instance, the conversion time for CONV layer 1 in Figure 1a is around 4× the actual computation time. Therefore, a better scheme should minimize the disruption to existing DL software systems, making it easier to integrate into existing DL frameworks without significant redevelopment effort.
Algorithm 2: nDirect Convolution (two-level tiled loop nest with on-the-fly filter transformation and an input-packing micro-kernel; described in Sections 4 and 5).
Secondly, we identified opportunities to enhance the performance of GEMM-based convolution methods. Although LIBXSMM uses optimized micro-kernels and a cache-friendly data format to achieve fast direct convolution, we found that its loop tile sizes are too small to fully utilize the multi-level caches and fused multiply-accumulate (FMA) units available in modern ARMv8 multi-cores. Moreover, the sequential load instructions generated by LIBXSMM's JIT compiler can cause pipeline stall hazards.
Moreover, we noticed that the im2col transformation and sequential data packing used in the im2col+GEMM approach can also hamper performance by generating a large number of memory load and store operations. This can result in slowdowns when multiple threads compete for memory bandwidth. Therefore, an ideal micro-kernel for convolution should deliver high performance without additional memory access overhead.
Thirdly, we observed that existing parallelization strategies are coarse-grained, contributing to the poor convolution performance on ARM multi-cores. For example, ACL's direct convolution achieves only 5% of the multi-core peak performance on Phytium 2000+. This is because the strategy naïvely parallelizes a single dimension without considering the convolution workload characteristics, such as the batch size and the input spatial shape. As a result, the computations are performed sequentially over multiple batches, resulting in linear cost accumulation. Further optimization is needed to overcome this problem.
To summarize, these findings indicate that there is still considerable potential for performance improvement when optimizing convolution operations on ARM multi-core CPUs.

Overview
nDirect exploits the opportunities identified in Section 3.2. We achieve this by redesigning direct convolution with compatible data layouts, new micro-kernels and suitable parallelization strategies optimized for multi-core CPUs.
Data layout. To be compatible with mainstream DL frameworks (e.g., TensorFlow [13] and MXNet [8]), nDirect preserves the conventional NCHW and NHWC data layouts. In this paper, we explain nDirect using the NCHW data layout as an example.
Algorithm implementations. Algorithm 2 outlines the nDirect convolution for the NCHW data format, inspired by the blocked GEMM algorithm [34]. We tile the filter and input tensors at two levels to improve spatial data locality. The first tiling level targets the caches (lines 2-4) and determines the tile sizes based on the capacity of each cache level, as described in Section 4. The second tiling level uses vector registers (lines 6-9) with a tile size that maximizes the floating-point arithmetic intensity, as detailed in Section 5. We use the outer-product method to update the output tensor because its arithmetic intensity is higher than that of the inner-product method, and it allows us to access elements of the filter tensor more contiguously. This is also the reason for focusing on the format conversion of the filter tensor (line 5). Figure 2 illustrates that the input tensor's spatial data locality is poor, and the processor can only access a few contiguous elements at each iteration. To address this, we map its elements to a contiguous buffer (line 8).
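Since the listing of Algorithm 2 is not reproduced here, the following C sketch conveys its overall structure under simplifying assumptions (NCHW layout, unit stride, no padding, a single image, and the full channel range packed at once). The tile sizes H_BLK, K_BLK and W_BLK and all variable names are illustrative, not the paper's notation; the paper derives the actual tile sizes analytically (Section 4), and the innermost computation is replaced by the NEON micro-kernel of Section 5 with the packing cost hidden as described in Section 5.3.

#include <stdlib.h>
#include <string.h>

#define H_BLK 8    /* output rows per cache-level tile (illustrative) */
#define K_BLK 32   /* output channels per packed filter block         */
#define W_BLK 12   /* output columns per micro-kernel call            */

/* Simplified nDirect-style loop nest: in[C][H][W], flt[K][C][R][S], out[K][OH][OW]. */
void conv_tiled(int C, int H, int W, int K, int R, int S,
                const float *in, const float *flt, float *out) {
    int OH = H - R + 1, OW = W - S + 1;
    float *fbuf = malloc((size_t)K_BLK * C * R * S * sizeof(float));       /* packed filter block */
    float *ibuf = malloc((size_t)C * R * (W_BLK + S - 1) * sizeof(float)); /* packed input patch  */
    for (int h0 = 0; h0 < OH; h0 += H_BLK) {                /* tile output rows            */
        int hb = OH - h0 < H_BLK ? OH - h0 : H_BLK;
        for (int k0 = 0; k0 < K; k0 += K_BLK) {             /* tile output channels        */
            int kb = K - k0 < K_BLK ? K - k0 : K_BLK;
            /* on-the-fly filter transform: copy the block into contiguous storage */
            memcpy(fbuf, flt + (size_t)k0 * C * R * S, (size_t)kb * C * R * S * sizeof(float));
            for (int oh = h0; oh < h0 + hb; oh++)
                for (int w0 = 0; w0 < OW; w0 += W_BLK) {    /* tile output columns         */
                    int wb = OW - w0 < W_BLK ? OW - w0 : W_BLK;
                    /* gather the strided input rows into a linear buffer */
                    for (int c = 0; c < C; c++)
                        for (int r = 0; r < R; r++)
                            memcpy(ibuf + ((size_t)c * R + r) * (W_BLK + S - 1),
                                   in + ((size_t)c * H + oh + r) * W + w0,
                                   (size_t)(wb + S - 1) * sizeof(float));
                    /* "micro-kernel": compute a kb x wb output block from the packed buffers */
                    for (int k = 0; k < kb; k++)
                        for (int ow = 0; ow < wb; ow++) {
                            float acc = 0.0f;
                            for (int c = 0; c < C; c++)
                                for (int r = 0; r < R; r++)
                                    for (int s = 0; s < S; s++)
                                        acc += ibuf[((size_t)c * R + r) * (W_BLK + S - 1) + ow + s]
                                             * fbuf[(((size_t)k * C + c) * R + r) * S + s];
                            out[((size_t)(k0 + k) * OH + oh) * OW + w0 + ow] = acc;
                        }
                }
        }
    }
    free(fbuf);
    free(ibuf);
}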
Road map. In the upcoming sections, we delve into the three essential optimizations implemented in nDirect: loop ordering and tiling to minimize data movement (Section 4), a novel micro-kernel optimized for convolution (Section 5), and our parallelization strategy (Section 6). Our current implementation supports single-precision floating-point (FP32), as this is the most commonly used data type for CNNs, but our techniques can be applied to other data types, including FP16, FP64 and INT16.

NDIRECT DESIGN
Algorithm 2 shows the main computation kernel of nDirect, described in the following subsections. nDirect follows the design principle of the classical Goto algorithm for matrix multiplications [34,62].

Loop Ordering
Since a CONV operator can be considered a high-dimensional GEMM, we map the CONV dimensions to Goto's loop ordering as outlined in Algorithm 2. Specifically, we map the dimensions of the input tensor, the filter tensor, and the output tensor onto the three GEMM computation dimensions M′, N′ and K′: the output-channel dimension maps to M′, the batch and output spatial dimensions map to N′, and the input-channel and filter spatial dimensions map to the reduction dimension K′. The specific mapping method and data flow scenarios are as follows.
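Written out with commonly used symbols (which may differ from those in Table 1), and assuming N is the batch size, C and K the input and output channel counts, OH × OW the output spatial size and R × S the filter size, this im2col-style correspondence is:

\[ M' = K, \qquad N' = N \cdot OH \cdot OW, \qquad K' = C \cdot R \cdot S . \]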
In Algorithm 2, we use loops 2 and 3 to partition the input tensor into sub-blocks that can fit into the last-level cache (LLC). Unlike the Goto algorithm, we choose not to pack these elements between the two loop levels. This is because CNN tensors are often irregular-shaped, i.e., one tensor dimension can be much smaller than the others. Prior studies [67] have shown that, for irregularly shaped GEMMs, data packing can introduce additional memory operations whose cost cannot be amortized by the improved performance [66].
Loops 3 and 4 partition the filter into a series of sub-blocks that can fit into the L2 cache. Here, we choose to transform the filter elements into contiguous memory space on the fly, because the filter is typically much smaller than the input tensor. During the packing step, the processor accesses the filter in a pipelined manner, so the packing overhead can be hidden. Moving to loops 5 and 6, we further divide the sub-blocks of the input into slices that fit into the L1 data cache.

Determine the Tiling Size
Loop tiling is key to improving cache data locality. In this subsection, we explain how to determine the three cache-level tile sizes used in Algorithm 2. Note that we discuss the block sizes of the micro-kernel in Section 5. Our design aims to take advantage of the vector FMA units while leveraging the memory hierarchy of caches and vector registers.
To optimize the L1 data cache utilization, the input slice accessed in each iteration of loop 7 should be kept in the L1 cache. Furthermore, the L1 cache should also hold two of the filter slices used at this loop level. Therefore, the corresponding tile size must satisfy the capacity constraints of Equation 1. Section 5.2.3 shows that the optimal values of the two micro-kernel block sizes are 8 and 12, respectively, on our evaluation platforms; the remaining tile size can then be obtained from Equation 1.
Similarly, we would like the L2 cache to keep one filter block during each iteration of loop 6, together with two of the input slices accessed at this loop level. Because the L2 cache on ARM CPUs often holds both data and instructions simultaneously, it needs to reserve some space for the instructions being executed and for elements of the output. Therefore, the remaining tile sizes must satisfy the capacity constraints of Equation 2. With Equations 1 and 2, we can obtain these tile sizes. Likewise, we can derive the row tile size in a similar way by considering the capacity constraint of the L3 cache (should one be available on the underlying hardware).
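As an illustration of what such a capacity constraint looks like (the paper's Equations 1 and 2 are not reproduced here, and the symbols below are assumptions rather than the paper's notation): with a channel tile C_b, a width tile W_b, an R × S filter, a micro-kernel channel block K_b and an L1 data cache of Size_L1 bytes, an L1 condition of the kind described above would take the form

\[
\underbrace{C_b \cdot R \cdot (W_b + S - 1)}_{\text{input slice}}
\;+\;
2 \cdot \underbrace{K_b \cdot C_b \cdot R \cdot S}_{\text{filter slices}}
\;\le\;
\frac{\mathrm{Size}_{L1}}{\mathrm{sizeof(float)}} .
\]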

MICRO-KERNEL DESIGN
nDirect incorporates two micro-kernels that are specifically customized to maximize arithmetic intensity and minimize data access latency. The first micro-kernel is designed to accelerate convolutions, corresponding to line 10 of Algorithm 2. The second micro-kernel is responsible for packing the input tensor and performing calculations in the first iteration of loop 7 (line 8 of Algorithm 2).

Design Overview
nDirect aims to improve the data reuse of direct convolution and leverage the ARM NEON SIMD extensions to boost instruction-level parallelism. Specifically, we utilize the 32 128-bit vector registers (v0-v31) and the fused multiply-accumulate (FMA) units available on ARMv8 CPUs. The challenge is to select suitable micro-kernel block sizes that maximize register reuse and arithmetic intensity. To this end, we use analytical methods to guide our optimization. As a working example, we use the FP32 tensor data type and a 3 × 3 convolution kernel to explain our approach in this section, but our techniques can be applied to other data types and convolution kernel sizes by adjusting the parameters of the analytical models.

Main Micro-kernel
Since each 128-bit vector register can hold four FP32 elements, the vectorized block size is set to a multiple of 4 to fully utilize the vector units.

Optimization goal.
Algorithm 3 outlines the micro-kernel implementation in nDirect. Here, we unroll the loop over the filter width for a given convolution kernel size (lines 5-14). Our objective is to maximize the arithmetic intensity in one iteration of loop 9. To illustrate the algorithm workflow, we use a 3 × 3 convolution kernel as an example.
During each iteration of loop 9, we initially load a row of input elements (the output block width plus the filter width minus one) and the corresponding filter elements into vector registers. We then use scalar-vector multiplication with FMA instructions to compute the block of output elements, performing two floating-point operations per output element in each step; note that each FMA instruction includes both a multiplication and an addition. Figure 2(a) illustrates the first round of the calculation. After completing the first round, we update the vector registers that store the filter elements; at the same time, the input data involved in the convolution is shifted by an offset of one element within the vector registers. We perform similar operations at the end of the second round of calculation (Figure 2(b)). Finally, we formulate the average arithmetic intensity in one iteration of loop 8 as Equation 4.
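For concreteness, here is a tiny NEON sketch of the scalar-vector FMA pattern for one filter row of width 3 and a 4-element output block for two output channels. It is only illustrative: the real micro-kernel uses far larger register blocks, and it shifts the input within the vector registers across rounds rather than reloading it as done here; all names are hypothetical.

#include <arm_neon.h>

/* Toy micro-kernel step: 2 output channels x 4 output columns, filter width 3.
   in:   6 contiguous input elements of one row (4 + 3 - 1).
   f0/f1: 3 filter taps for output channels 0 and 1.
   out0/out1: 4 partial sums per channel, updated in place. */
static void micro_2x4_s3(const float *in, const float *f0, const float *f1,
                         float *out0, float *out1) {
    float32x4_t acc0 = vld1q_f32(out0);
    float32x4_t acc1 = vld1q_f32(out1);
    float32x4_t x0 = vld1q_f32(in);        /* in[0..3]                */
    float32x4_t x1 = vld1q_f32(in + 1);    /* in[1..4], offset by one */
    float32x4_t x2 = vld1q_f32(in + 2);    /* in[2..5]                */
    acc0 = vfmaq_n_f32(acc0, x0, f0[0]);   /* scalar tap x input vector */
    acc1 = vfmaq_n_f32(acc1, x0, f1[0]);
    acc0 = vfmaq_n_f32(acc0, x1, f0[1]);
    acc1 = vfmaq_n_f32(acc1, x1, f1[1]);
    acc0 = vfmaq_n_f32(acc0, x2, f0[2]);
    acc1 = vfmaq_n_f32(acc1, x2, f1[2]);
    vst1q_f32(out0, acc0);
    vst1q_f32(out1, acc1);
}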

Solving equations.
To optimize nDirect, we consider the constraints defined in Equation 3 and the goal defined in Equation 4. To maximize the arithmetic intensity, we adopt the method of Lagrange multipliers [36] to find the optimal micro-kernel block sizes for the CPU architectures used in our evaluation.

Micro-kernel for Packing
Conventional im2col+GEMM uses a sequential packing strategy, mapping the discontinuous input matrix elements into a linear buffer before performing the computation. This strategy can reduce memory access latency during computation but introduces additional overhead, as can be seen from Figure 1a. nDirect also packs discontinuous input tensor elements into a linear buffer, which is smaller than the L1 cache, but it tries to hide the packing latency. Note that the input tensor elements used are identical in each iteration of loop 7 in Algorithm 2; nDirect therefore performs data packing only in the first iteration (line 8 of Algorithm 2).
Figure 3 shows how nDirect packs tensor elements. Generally, the input elements accessed in loop 7 of Algorithm 2 are initially distributed across a number of contiguous channels of the input tensor. In the first iteration of loop 7, nDirect calls Pack_Micro-kernel to pack these discontinuous input elements into a linear buffer named B (line 8 in Algorithm 2). Since sequential write operations with data dependencies can incur pipeline stall hazards, nDirect places store (st) instructions immediately after FMA instructions to hide the packing overhead by exploiting the out-of-order instruction execution of modern CPUs. In each subsequent iteration, the input data are fetched from the linear buffer B, which improves the L1 data cache hit rate.
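The interleaving can be pictured with the following simplified NEON sketch: in the first pass over a slice, each loaded input vector is used for its FMA and immediately stored into the linear buffer, so the copy overlaps with computation (function and buffer names are illustrative).

#include <arm_neon.h>

/* First-iteration packing: compute with the freshly loaded input and, right
   after each FMA, store the same vector into the linear buffer B so that
   later iterations read B instead of the strided input tensor. */
static void pack_and_compute(const float *in, float w, float *B,
                             float *out, int len /* multiple of 4 */) {
    for (int i = 0; i < len; i += 4) {
        float32x4_t x   = vld1q_f32(in + i);
        float32x4_t acc = vld1q_f32(out + i);
        acc = vfmaq_n_f32(acc, x, w);   /* use the loaded data first          */
        vst1q_f32(B + i, x);            /* then copy it into B; the store     */
        vst1q_f32(out + i, acc);        /* overlaps with later FMAs (OoO CPU) */
    }
}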

PARALLELIZATION STRATEGIES
We use OpenMP with static work partitioning to parallelize CONV operations on shared-memory multi-core CPUs. To utilize the hardware parallelism, we use all available CPU cores, spawning one parallel thread per core. Ideally, all the cores would start and finish their work simultaneously, so that no core idles at any point in time. However, this is not always possible due to memory access latency and application workloads. As such, we need to carefully determine how many threads are used to parallelize each of the parallel dimensions.

Model Thread Mapping
nDirect parallelizes the batch, output-channel and output spatial dimensions in Algorithm 2. We do not parallelize the reduction dimensions (the input channels and the filter height and width), because doing so can result in write conflicts, since all participating threads would write to the same locations in the output tensor. While these conflicts can be eliminated using locks or additional memory buffers, the associated runtime overhead can be high [45].
To map threads onto the computation dimensions, we use one group of threads to parallelize the batch dimension and another group to parallelize the remaining output dimensions, where the product of the two group sizes equals the total number of threads. At runtime, each thread performs an equal share of the computation. The memory accesses to the filter within each parallel thread follow a streaming pattern, meaning that the accesses are performed on contiguous addresses. The memory accesses to the input tensor required by each thread, in contrast, follow a non-streaming pattern. To model the difference in access latency between streaming and non-streaming accesses, we introduce a coefficient applied to the accesses to the input tensor. The resulting per-thread arithmetic intensity is given by Equation 6; our objective is to maximize it, which means minimizing the modeled memory accesses per thread.
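A minimal OpenMP sketch of this two-level thread mapping is shown below; Tn and Thw denote the two group sizes (Tn * Thw equals the total thread count), conv_block stands for the per-tile work, and all names are illustrative.

#include <omp.h>

/* Two-level static thread mapping: Tn thread groups across the batch
   dimension, Thw threads per group across the output rows of an image. */
void parallel_conv(int N, int OH, int Tn, int Thw,
                   void (*conv_block)(int n, int h_lo, int h_hi)) {
    #pragma omp parallel num_threads(Tn * Thw)
    {
        int tid   = omp_get_thread_num();
        int n_id  = tid / Thw;     /* which batch group this thread belongs to */
        int hw_id = tid % Thw;     /* which slice of output rows it handles    */
        for (int n = n_id; n < N; n += Tn) {          /* batches of this group */
            int rows = (OH + Thw - 1) / Thw;          /* rows per thread       */
            int h_lo = hw_id * rows;
            int h_hi = h_lo + rows < OH ? h_lo + rows : OH;
            if (h_lo < OH)
                conv_block(n, h_lo, h_hi);
        }
    }
}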

Solving the Equation
By applying the inequality of arithmetic and geometric means to Equation 6, we obtain Equation 7, where both sides are equal exactly when the per-thread arithmetic intensity reaches its maximum value. Since the micro-kernel for packing (Section 5.3) has little overhead, we take the upper-bound value for the number of threads assigned to the batch dimension. For the remaining dimensions, parallelization follows a fixed priority order; if threads are still available after parallelizing the highest-priority dimension, they are used to parallelize the next one. For the target hardware platform, we use micro-benchmarks to determine the coefficient by accessing the memory space in a streaming and a non-streaming manner. Since this value is determined offline and is a one-off cost, it does not affect the runtime performance.
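Such a micro-benchmark can be as simple as timing a sequential and a strided traversal of a buffer larger than the last-level cache and taking their ratio; the following C sketch, with an illustrative buffer size and stride, is one way the coefficient could be obtained offline.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Ratio of non-streaming (strided) to streaming (sequential) traversal time
   over a buffer much larger than the last-level cache. */
double measure_coeff(size_t n, size_t stride) {
    float *buf = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; i++) buf[i] = 1.0f;
    volatile float sink = 0.0f;

    double t0 = seconds();
    for (size_t i = 0; i < n; i++) sink += buf[i];            /* streaming */
    double t_stream = seconds() - t0;

    t0 = seconds();
    for (size_t s = 0; s < stride; s++)                        /* non-streaming */
        for (size_t i = s; i < n; i += stride) sink += buf[i];
    double t_nonstream = seconds() - t0;

    free(buf);
    return t_nonstream / t_stream;
}

int main(void) {
    printf("coefficient = %.2f\n", measure_coeff((size_t)1 << 26, 4096));
    return 0;
}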

EXPERIMENTAL SETUP
We evaluate nDirect by comparing it against four existing convolution implementations described in Section 7.3. Our evaluation includes layer-wise performance comparison and end-to-end inference of the entire CNN network.

Hardware Platforms
Our experiments were performed on three HPC systems and one embedded system with ARM multi-cores. Our evaluation platforms include Phytium 2000+ [62], Kunpeng 920 (KP920) [4], ThunderX2 [37], and a Raspberry Pi 4 (RPi 4) [12]. Table 3 provides an overview of the specifications of these platforms. It is worth noting that the L2 cache on Phytium 2000+ is shared between a cluster of four cores, while it is private to a processor core on KP920 and ThunderX2.

Convolution Workloads
We use convolution layers from two representative CNNs: ResNet-50 [38] and VggNet-16 [60]. They are widely used for large-scale image recognition. Table 4 gives the experimental parameters used for each layer. We set the batch size to match the number of physical cores to evaluate the performance of multi-batch CONV operations and end-to-end CNN inference.

Baseline Implementations
We compare nDirect against the following baselines.
im2col+GEMM. We use the im2col implementation from MXNet and the OpenBLAS GEMM routine [11], with OpenMP for multi-threaded parallelization. We also use MXNet with im2col+GEMM as the baseline when evaluating end-to-end inference. We use MXNet 1.6.0.
LIBXSMM. The direct convolution provided by LIBXSMM utilizes small GEMM-based micro-kernels generated by JIT compilation. It requires converting the input tensor into a specific format; we excluded this transformation time from the execution time for a fair comparison. We use LIBXSMM 1.17.
XNNPACK. Google's XNNPACK is a highly optimized solution for neural network inference and is frequently used in mobile systems. It provides the indirect convolution algorithm, a modification of GEMM-based convolution algorithms with a smaller memory footprint that eliminates the im2col transformation cost.
Ansor. This optimizer [70] is part of the TVM DL compilation framework [19]. To generate high-performance tensor programs, Ansor searches a large space of schedules for each computational subgraph. We use Ansor in TVM version 0.12.0 and deploy it to tune convolution layers and CNN models. We use Ansor's default number of executed trials; specifically, we set the number of executed trials to 1,000, 15,000 and 20,000 when tuning a single layer, the VggNet variants and the ResNet variants, respectively. We exclude the tuning overhead from our measurements.
For the layer-wise evaluation, we compare nDirect against all four schemes: im2col+GEMM, LIBXSMM, XNNPACK and Ansor. We also integrated nDirect with MXNet and evaluated the end-to-end performance of CNN models by comparing our approach with the im2col+GEMM used by MXNet and with CNN models tuned by Ansor and the TVM back-end code generator.

Evaluation Methodology
To ensure a fair comparison, we adopt the same experimental setups used in the source publications or utilize the default settings of the baseline methods. Specifically, we use the data formats required by XNNPACK's indirect convolution and LIBXSMM's blocked format for its direct convolution; for the other methods, we use NCHW for input tensors and the corresponding conventional layout for filters. Additionally, we include all the layout transformation overhead of nDirect when measuring its performance. We run each experiment 20 times and report the geometric mean GFLOPS. We found the variance across different runs to be minor, less than 5%.

Multi-core Convolutions
Figure 4 reports the multi-core convolution throughput (measured in GFLOPS) on each of the evaluation platforms. The x-axis corresponds to the layer ids given in Table 4. The line chart shows nDirect's performance with respect to the hardware's theoretical peak performance (see the y-axis on the right).
Compared with the best-performing baseline, nDirect improves the throughput by 1.32×, 1.34× and 1.07× on average on Phytium 2000+, KP920 and ThunderX2, respectively, which highlights the effectiveness of our new convolution computation scheme. For most layers with a stride of 1 (Section 2.1), nDirect delivers 70%-80% of the CPU peak performance. For example, on layers with 3 × 3 filters, nDirect achieves up to ≈ 80% of the peak performance, exceeding the layers with 1 × 1 filters because it can utilize more vector registers to achieve a higher arithmetic intensity according to Equation 3. With a stride of 2, each time the micro-kernel is called, the amount of data fetched into the vector registers is the same as for a stride of 1, but the amount of computation is halved, resulting in a decrease in arithmetic intensity and hence a partial performance penalty. Nonetheless, nDirect performs best overall and consistently outperforms the baseline methods across CONV layers and platforms.
Figure 5 quantifies the contribution of our packing optimization to the overall performance improvement using five convolution layers from VggNet. The technique demonstrates different levels of performance benefit on different architectures. This is because the cache-replacement policy on Phytium 2000+ is pseudo-random, differing from the other two platforms, which use the Least Recently Used (LRU) replacement policy.

Direct Convolution Tuned by Ansor
In this experiment, we take the throughput of individual convolutional layers tuned by Ansor as the baseline and report the performance improvement of nDirect over Ansor. The results are given in Figure 6. We found that the Ansor auto-tuning for each convolution layer can converge within 1,000 execution trials, suggesting that we have given Ansor a sufficient search budget. nDirect outperforms Ansor-tuned direct convolution on individual layers across the evaluation platforms, giving an average performance improvement of 1.92×, 1.82×, and 1.51× on Phytium 2000+, KP920 and ThunderX2, respectively. On some layers, like layer 10, Ansor delivers comparable performance to nDirect. However, nDirect still outperforms Ansor on all individual layers by offering better data packing and parallelization strategies.

End-to-end Inference Time
We evaluate the end-to-end inference performance of nDirect under different ResNet and VggNet variants on Phytium 2000+ and ThunderX2. We choose to compare with Ansor because LIBXSMM and XNNPACK are not compatible with MXNet for running the entire network. As shown in Figure 7, we normalize the inference performance to that of Ansor. nDirect, as a library-based approach, can deliver comparable performance to Ansor, but without Ansor's expensive search overhead. Specifically, on Phytium 2000+, nDirect delivers a speedup of 1.19× to 1.45× over Ansor. On ThunderX2, nDirect delivers slightly lower end-to-end inference performance than Ansor, with a speedup of 0.88× to 0.98×. Ansor's better performance on the whole CNN is due to its ability to optimize across CNN layers through operator fusion [67,72]. This technique can eliminate the write-back of intermediate results and the corresponding fetch operations in the CNN pipeline, further reducing memory access latency and bandwidth pressure to improve end-to-end CNN performance. Because ThunderX2 has a lower bandwidth than Phytium 2000+, this optimization becomes more important there. As nDirect is designed to optimize individual CONV operators, it does not support operator fusion. Our future work will look into integrating nDirect into TVM to take advantage of the higher-level operator fusion optimization. Nonetheless, nDirect delivers comparable performance to Ansor despite lacking operator fusion optimizations.

Embedded Platform
We now evaluate nDirect on an embedded system with lower computation capabilities than the HPC systems. Figure 8 reports the results of nDirect and the alternative implementations on RPi 4. nDirect outperforms the alternatives in both single-threaded and multi-threaded scenarios. Specifically, the best-performing baselines are XNNPACK for single-core execution and LIBXSMM for multi-core execution. nDirect delivers a geometric mean speedup of 1.15× and 1.19× over XNNPACK and LIBXSMM, respectively, confirming the effectiveness of our optimization.

Impact of Hyper-threading
Our evaluation so far was conducted with hardware hyper-threading (HT) turned off. In this experiment, we enable HT on ThunderX2 to exploit HT hardware parallelism. Here, we run 4 threads per core and set the batch size to match the number of logical cores. The results are given in Figure 9. nDirect outperforms XNNPACK, the best-performing baseline, by delivering a geometric mean speedup of 1.28×.

RELATED WORK
Specialized data layouts. Many prior works have sought to optimize convolution operations by introducing specialized data formats that allow for contiguous memory accesses and direct use of SIMD instructions and FMA units [24,31,61,67,69]. These approaches have demonstrated promising convolution performance by enabling stride-1 memory access and hardware-specific optimizations. However, one significant drawback is that they often require new, specialized data formats that cannot be easily integrated into mainstream DL frameworks that use conventional formats. This limitation requires either changing the underlying DL frameworks or changing the user code, which can also incur additional computation overhead for format conversion when invoking the standard CONV operators. nDirect is designed to avoid this pitfall by operating on the standard data layouts used by mainstream DL frameworks.

DISCUSSION
Our work specifically targets ARMv8 multi-cores. In this section, we discuss how to extend our techniques to other architectures and convolution kernels.

Architecture Portability
Our approach is generally applicable and can be easily migrated to other architectures. All our discussions so far target the ARMv8 architecture with 128-bit vector registers. The ARM Scalable Vector Extension (SVE) [2] provides a variable vector length, which can be any multiple of 128 bits between 128 and 2048 bits. Our techniques can be applied to this extension by adjusting the micro-kernel block sizes according to the available vector length and the number of vector registers.
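As a sketch of what this looks like, the scalar-vector FMA step of the micro-kernel can be written in a vector-length-agnostic way with the SVE ACLE intrinsics; the function below is illustrative only, and the block sizes would be re-derived from the run-time vector length reported by svcntw().

#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic scalar-vector FMA: out[i] += w * in[i].
   Compile with, e.g., -march=armv8-a+sve. */
void fma_row_sve(int64_t n, float w, const float *in, float *out) {
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);      /* predicate covers the tail */
        svfloat32_t x = svld1_f32(pg, in + i);
        svfloat32_t y = svld1_f32(pg, out + i);
        y = svmla_n_f32_x(pg, y, x, w);         /* y += x * w */
        svst1_f32(pg, out + i, y);
    }
}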
In addition to ARM-based CPUs, our techniques are also applicable to other modern CPU architectures with SIMD extensions, such as Intel AVX-512. Porting our techniques to other hardware architectures requires modifying the micro-kernels according to the constraints defined in Equation 3; these constraints can vary depending on the data type and the vector register width of the target architecture. Furthermore, our approach can be combined with auto-tuning to search for tile sizes and permutation orders that match different cache hierarchies.

Integrating with Other Kernels
Our techniques can be directly applied to standard convolution kernels commonly used in mainstream applications without requiring any modifications to the user code. Here, we discuss how our approach can be integrated with Depthwise Separable Convolution (DSC) [59] and 3D Convolution. DSC, consisting of a Depthwise Convolution and a Pointwise Convolution, is the building block of two representative CNN models, Xception [40] and MobileNet [22]. nDirect can be directly called to compute the Pointwise Convolution, since it is simply a 1 × 1 convolution kernel with a directly vectorizable dimension. To support Depthwise Convolution, we only need to remove the reduction over the channel dimension in the micro-kernels. Since 3D Convolution can be seen as 2D Convolution with additional reduction dimensions, we can directly use the micro-kernels of nDirect for acceleration and further optimize the outer loops for better cache locality.

CONCLUSIONS
We have presented nDirect, a new direct convolution solution that provides high performance, high data reusability, and deep learning (DL) framework compatibility on ARM multi-core CPUs. nDirect complies with the conventional data formats used by mainstream DL frameworks but offers new optimizations for micro-kernel design, data packing and parallelization. We evaluate nDirect by testing its performance on individual convolution layers and the end-to-end inference time of representative CNN models. We conduct our evaluation on four platforms: three HPC multi-cores and one embedded CPU of the ARMv8 architecture. We also compare nDirect against state-of-the-art convolution libraries and a DL tuning framework. Experimental results show that nDirect outperforms the competing baselines on most test cases, achieving the best overall performance across all hardware platforms.

ARTIFACT IDENTIFICATION
Abstract
This artifact illustrates the procedure to reproduce the experimental results presented in the "Optimizing Direct Convolutions on ARM Multi-Cores" paper. The paper presents nDirect, a new direct convolution solution with a focus on providing high performance, high data reusability, and Deep Learning (DL) framework compatibility.
We provide instructions on how to deploy and run nDirect on ARMv8-based multi-core CPUs.

REPRODUCIBILITY OF EXPERIMENTS
Evaluation Platforms
We conduct our evaluation on four platforms: three HPC multi-cores (Phytium 2000+, ThunderX2, KP920) and one embedded CPU (Raspberry Pi 4) of the ARMv8 architecture. Our evaluation platforms run Linux kernel version 4.19.46. We compile the benchmarks using gcc version 8.2.1 with the "-O3" compiler option and use OpenMP for parallel execution. For parallelization, we utilize GNU libgomp (version 201511).

Benchmarks
For layer-wise performance evaluation, we use the convolution layers in ResNet-50 and VggNet-16. For end-to-end CNN inference, we use the ResNet-18, ResNet-50, VggNet-16 and VggNet-19 models. We set the batch size to match the number of physical cores to evaluate the performance of multi-batch convolution operations and end-to-end CNN inference.

Evaluation and Expected Results
The environment variable OMP_NUM_THREADS is used to assign the parallel thread count. Another variable, GOMP_CPU_AFFINITY, is set to ensure that each thread is bound to the corresponding core. We set the batch size to match the number of physical cores. Additionally, we include all the layout transformation overhead of nDirect when measuring its performance. For the layer-wise evaluation, we run each convolution layer 5 times to warm the cache and then repeat it 20 times to report the geometric mean GFLOPS. We use the data formats required by XNNPACK's indirect convolution and by LIBXSMM's direct convolution, respectively; for the other methods, we use NCHW for input tensors and the corresponding conventional layout for filters. We found the variances across different runs to be minor, less than 5%. For end-to-end CNN inference, we run each benchmark 5 times for warm-up and then repeat it 20 times to report the mean time. The average GFLOPS and mean time should be similar to the values reported in the main paper.

Overall Results
For the layer-wise evaluation, we show that nDirect outperforms the competing baselines on most test cases, achieving better overall performance (FP32 GFLOPS) across all hardware platforms. For end-to-end CNN inference, nDirect outperforms im2col+GEMM and delivers comparable performance to Ansor.

ARTIFACT INSTALLATION AND DEPLOYMENT PROCESS
Install the OpenBLAS, XNNPACK and LIBXSMM libraries and the LLVM compiler following the official instructions. Compile TVM v0.10.0 using LLVM-10 by following the official instructions. The source code of nDirect, benchmarks and scripts associated with this work can be downloaded from: https://github.com/nDIRECT/nDIRECT. See the README.md file in our repository for guidance on building and running nDirect.

Figure 2: The nDirect convolution workflow in one iteration of loop 9 in Algorithm 3 (lines 3 to 14). The input, output and non-transparent filter blocks are held in vector registers. Arrows from input to output represent FMA operations.

Figure 6: Performance comparison for convolution operators with respect to Ansor.

Table 1: Summary of notations used in the paper.

Table 3: Hardware platforms used in evaluation.