2019 International Conference on Field-Programmable Technology (ICFPT) 2019
DOI: 10.1109/icfpt47387.2019.00029

Unexpected Diversity: Quantitative Memory Analysis for Zynq UltraScale+ Systems

Abstract: Memory throughput is one of the major bottlenecks for accelerator performance. Now that Zynq UltraScale+ systems are being deployed at exascale to edge, it is important to understand its limitations and optimizations possible for developers. In this paper, we extensively evaluate the memory performance and behaviour for various AXI ports combinations, burst sizes, access patterns, and the number of accelerators per AXI port. Our results on ZCU102 and Ultra 96 boards show that 1) effective throughput of these s…

Cited by 26 publications (15 citation statements)
References 7 publications

“…V. RELATED WORK Manev et al. [4] have characterized DRAM access from the PL side on the ZU9EG. The highest throughput from DRAM they could achieve was 75% of the theoretical maximum using three out of four AXI ports, avoiding the usage of ports 1 and 2 simultaneously.…”
Section: Characterization Outcome
Citation type: mentioning
confidence: 99%
“…While embedded platforms provide limited bandwidth [58]-[60], e.g. less than 4.5 GB/s for Ultra96 and ZC706, sustaining peak bandwidth even on larger devices, such as ZCU104, is nontrivial [59]. This is aggravated as multiple applications are collocated on a single device [60]-[62]; and ii) underutilised PEs due to the mismatch of diverse layer shapes [15]-[19].…”
Section: Challenges Of FPGA-based CNN Inference Engines
Citation type: mentioning
confidence: 99%
“…The first selected configuration was a sorter with k = 128, w = 4, 128-bit datapath and an operating frequency of 250 MHz. This frequency can saturate the achievable bandwidth of that port at 128 bits per cycle [28] and was achieved with the throughput optimisation (internally with w = 8). As shown in figure 8, the sorter achieves a speedup of up to around 36 times for data that fit in a single pass (k^2 = 16384) and above around 15 times for all other input sizes.…”
Section: B Use Case 1: Sorting
Citation type: mentioning
confidence: 99%
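
The sorting statement above quotes a 128-bit datapath at 250 MHz and a single-pass limit of k^2 elements for k = 128. As a rough check of that arithmetic, the sketch below computes the raw figure such a port would move if it transferred one full word every cycle. The function name `raw_port_bandwidth_gb_s` is hypothetical, and the result is an upper bound under that idealized assumption (no protocol overhead, no stalls), not a measured value from the paper or the citing work.

```python
def raw_port_bandwidth_gb_s(width_bits: int, freq_mhz: float) -> float:
    """Raw bandwidth in GB/s (10^9 bytes/s) for a port that moves
    width_bits on every clock cycle at freq_mhz.

    Idealized assumption: one full-width transfer per cycle, no overhead.
    """
    bytes_per_cycle = width_bits / 8
    return bytes_per_cycle * freq_mhz * 1e6 / 1e9


# Figures quoted in the citation statement: 128-bit datapath at 250 MHz.
bandwidth = raw_port_bandwidth_gb_s(128, 250)
print(f"raw port bandwidth: {bandwidth} GB/s")

# Single-pass input size quoted in the statement: k^2 with k = 128.
k = 128
print(f"single-pass elements: {k * k}")
```

Under these assumptions the port moves 16 bytes per cycle, so the raw figure scales linearly with both datapath width and clock frequency.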