System-Level Communication Performance Estimation for DMA-Controlled Accelerators

Kim, Sunwoo; Park, Sungkyung; Park, Chester Sungchung

doi:10.1109/access.2021.3119516

Cited by 4 publications

(3 citation statements)

References 35 publications

(64 reference statements)

“…Another popular method of reducing memory access row conflicts in convolutional operation is to adjust the number of columns during loop tiling [33] [34] [35] [36]. However, several state-of-the-art solutions empirically determine the number of columns in the loop tiling to reduce memory row conflicts, or directly set the number of columns in the loop tiling to the number of columns of the output feature map when sufficient on-chip buffer is available.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, several state-of-the-art solutions empirically determine the number of columns in the loop tiling to reduce memory row conflicts, or directly set the number of columns in the loop tiling to the number of columns of the output feature map when sufficient on-chip buffer is available. For example, many related works [33] [34] [35] [36] can only provide empirical parameters such as T m, T n, T r, T c, etc. The sizes of convolutional layers for many DNN models such as YOLOv2 significantly differ from each other.…”

Section: Related Work (mentioning)

confidence: 99%

“…It is assumed that DDR memory used by a convolutional accelerator IP works in a row-column-bank logical address mapping manner. Figure 4 illustrates a data layout in DDR4 for the input or output feature maps with row-major order in the loop tiling shown in Figure 1 [33]. Taking the output In Figure 4, if the size of an input or output feature map is fixed, the increment of T c will enlarge the scale of tiles and reduce the number of tiles for a given convolution layer.…”

Section: A Loop Tiling Technique and Data Layout (mentioning)

confidence: 99%
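
The last statement above says that, for a fixed feature-map size, increasing T c enlarges each tile and reduces the number of tiles. A minimal sketch of that relationship follows. It assumes tiles of T r rows by T c columns that cover the map with ceiling division; the function name `tile_count` and the 52 x 52 map size are hypothetical and not taken from the cited papers.

```python
import math


def tile_count(rows: int, cols: int, t_r: int, t_c: int) -> int:
    """Number of t_r x t_c tiles needed to cover a rows x cols feature map.

    Assumption: partial tiles at the right and bottom edges count as full tiles.
    """
    return math.ceil(rows / t_r) * math.ceil(cols / t_c)


# Hypothetical 52 x 52 output feature map with the tile height fixed at 13 rows.
for t_c in (13, 26, 52):
    print(f"T_c={t_c}: {tile_count(52, 52, 13, t_c)} tiles")
```

With these assumed sizes, doubling T c halves the tile count, which is the trade-off the quoted passage describes.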

OptTc: A method of optimizating memory access latency for convolutional accelerators

Zhao, Wang, Zhang et al. 2023

Preprint

Modeling and Simulation of System Bus and Memory Collisions in Heterogeneous SoCs

et al. 2022

A system simulator is proposed and developed to help optimize design parameters and thereby minimize the number of collisions. To search for the design parameter combination that meets the user requirement, the simulator exposes three knobs: partitioning between software and hardware, scheduling of the operations in the system, and memory merging. All three can be adjusted to predict collisions and search for the optimal architecture. Design parameters can also be adjusted sequentially to cover all design options and estimate the performance of each option. The simulator is evaluated with an example signal processing algorithm, the orthogonal matching pursuit (OMP) algorithm. The performance of four cases of the OMP algorithm is predicted by the simulator and compared with the actual performance on a ZedBoard. The simulator predicts the performance of heterogeneous systems on chips (SoCs) with under 5% error for all the candidate OMP architectures while taking system bus and memory conflicts into account. Moreover, the optimized heterogeneous SoC architecture for the OMP algorithm improves performance by up to 32% compared with the conventional CAG-based approach. The performance estimation algorithm is also verified to be generally applicable to heterogeneous SoC architectures. For example, the estimation error is measured to be no more than 5.9% for the convolutional layers of CNNs and no more than 5.6% for LDPC-coded MIMO-OFDM. In addition, the optimized heterogeneous SoC architecture improves performance by up to 48% for the third convolutional layer of AlexNet and by up to 56% for LDPC-coded MIMO-OFDM. Lastly, compared with conventional simulation-based approaches, the proposed estimation algorithm provides a speedup of one to two orders of magnitude.
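
The abstract describes adjusting design parameters sequentially to cover all design options and estimating performance for each. A minimal sketch of such an exhaustive sweep follows. It is not the authors' simulator: the `sweep` helper, the knob names, and the toy cost numbers in `toy_estimate` are all hypothetical stand-ins for a real performance estimator.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def sweep(knobs: Dict[str, Sequence],
          estimate: Callable[[dict], float]) -> Tuple[dict, float]:
    """Evaluate every knob combination and return the one with the lowest estimate."""
    names = list(knobs)
    best_cfg, best_cost = None, float("inf")
    for values in product(*(knobs[name] for name in names)):
        cfg = dict(zip(names, values))
        cost = estimate(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


def toy_estimate(cfg: dict) -> float:
    """Toy cost model with made-up cycle counts, for illustration only."""
    cycles = 1000.0 if cfg["partition"] == "sw" else 400.0
    if cfg["merge_memory"]:
        cycles += 50.0  # assumed extra contention on a merged memory
    return cycles


knobs = {"partition": ["sw", "hw"], "merge_memory": [False, True]}
best, cost = sweep(knobs, toy_estimate)
print(best, cost)
```

An exhaustive sweep of this kind is only practical when each estimate is cheap, which is consistent with the abstract's emphasis on estimation speed over conventional simulation.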

Fast and Accurate Virtual Prototyping of an NPU with Analytical Memory Modeling

Park, Moh, Lee et al. 2023

Proceedings of the 34th International Workshop on Rapid System Prototyping
