Pirmin Vogel scite author profile

Bartolini

Benini

2014

We present an efficient FPGA architecture suitable for a medical 3D ultrasound beamformer. We tackle the delay calculation bottleneck, which is the heart and the most critical part of the beamformer, by proposing a computationally efficient design that is able to perform volumetric real-time beamforming on a single-chip FPGA. The design has been demonstrated for a 32×32-channel receive probe, and we extrapolated the requirements of the architecture for 80×80 channels. I. MOTIVATION Medical ultrasound (US) imaging is well established, being used in a wide range of applications including detecting static structures, such as tumors, and studying dynamic phenomena like blood flow and valve functionality. US imaging comprises three main processes: insonification, beamforming (BF), and visualization. Insonification is the process of emitting radio-frequency (RF) acoustic waves from a piezoelectric transducer, called a probe, through a body region. The waves are reflected from inhomogeneous tissue interfaces that act as scatterers due to acoustic impedance mismatches. The returned echoes are digitized and processed through an algorithm called beamforming (BF). Finally, a post-processing step is performed, which includes mapping the beamformed signals into screen image pixels. Recently, 3D US imaging has become available. A key advantage is that, since whole volumes are acquired at once, it is possible to remove the traditional dependence on a trained sonographer operating the probe to locate minute anatomical structures by fine adjustments of the position and orientation of the transducer. This enables telesonography, where even an unskilled operator can upload scans to a hospital where trained radiologists will issue a diagnosis. Unfortunately, present-day 3D imagers are bulky and expensive, suitable only for clinics and hospitals.
A portable US platform with cheap, battery-operated electronics would be a breakthrough, enabling telesonography in rescue environments, in rural areas, and in developing countries, with major societal benefits. To this end, we undertake to implement 3D beamforming on a single FPGA. II. PROBLEM DEFINITION AND PREVIOUS WORK Beamforming is the core of any US imaging machine. It is the process of mapping the echoes to their origins by summing them along a delay profile that represents the two-way time-of-flight of the acoustic wave from the origin to each scatterer, and back to all the piezoelectric elements. BF also includes apodization, the weighting of the delayed echoes by a factor that compensates for antenna directivity effects. In volumetric US imaging, a software-based implementation of the beamformer is not optimal if we target a battery-powered platform, whereas a hardware design offers major potential energy savings. One of the critical challenges of 3D US imaging is the number of receiving channels of high-end transducers, up to 100×100 elements, and the correspondingly massive computations required for image reconstruction. Different state-of-t...
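The delay-and-sum operation described in this abstract (summing echoes along a two-way time-of-flight delay profile, with apodization weights) can be illustrated with a minimal Python sketch. It is not the FPGA design from the paper: the function names, the fixed speed of sound, nearest-sample rounding, and the single transmit origin are all assumptions made for illustration.

```python
import math

# Assumed speed of sound in soft tissue (m/s); a common textbook value,
# not a figure taken from the abstract.
SPEED_OF_SOUND = 1540.0


def two_way_delay(origin, focal_point, element, c=SPEED_OF_SOUND):
    """Time of flight from the transmit origin to the focal point and
    back to one receive element, in seconds (hypothetical helper)."""
    tx_path = math.dist(origin, focal_point)
    rx_path = math.dist(focal_point, element)
    return (tx_path + rx_path) / c


def delay_and_sum(echoes, elements, focal_point, origin, fs,
                  weights=None, c=SPEED_OF_SOUND):
    """Beamform one focal point.

    echoes   : one list of samples per receive element
    elements : (x, y, z) position of each element, in metres
    fs       : sampling frequency in Hz
    weights  : optional apodization weight per element (defaults to 1.0)
    """
    if weights is None:
        weights = [1.0] * len(elements)
    total = 0.0
    for samples, element, w in zip(echoes, elements, weights):
        # Convert the delay to the nearest sample index (assumption:
        # no interpolation between samples).
        idx = round(two_way_delay(origin, focal_point, element, c) * fs)
        if 0 <= idx < len(samples):
            total += w * samples[idx]
    return total
```

In this sketch one delay is computed per element and per focal point, which is the step the abstract identifies as the bottleneck when the probe has thousands of elements.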

Efficient Virtual Memory Sharing via On-Accelerator Page Table Walking in Heterogeneous Embedded SoCs

ACM Trans. Embed. Comput. Syst.

Kurth

Weinbuch

et al. 2017

Shared virtual memory is key in heterogeneous systems on chip (SoCs) that combine a general-purpose host processor with a many-core accelerator, both for programmability and performance. In contrast to the full-blown, hardware-only solutions predominant in modern high-end systems, lightweight hardware-software co-designs are better suited to more power- and area-constrained embedded systems and provide additional benefits in terms of flexibility and predictability. As a downside, the latter solutions require the host to handle, in software, both the synchronization on page misses and the miss handling itself. This may incur considerable run-time overheads. In this work, we present a novel hardware-software virtual memory management approach for many-core accelerators in heterogeneous embedded SoCs. It exploits an accelerator-side helper thread concept that enables the accelerator to manage its virtual memory hardware autonomously while operating cache-coherently on the page tables of the user-space processes of the host. This greatly reduces overhead with respect to host-side solutions while retaining flexibility. We have validated the design with a set of parameterizable benchmarks and real-world applications covering various application domains. For purely memory-bound kernels, the accelerator performance improves by a factor of 3.8 compared with host-based management and lies within 50% of a lower-bound ideal memory management unit. CCS Concepts: • Software and its engineering → Virtual memory; Main memory; • Computer systems organization → Heterogeneous (hybrid) systems; System on a chip; Embedded software;
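The general idea of software-managed address translation mentioned in this abstract (a translation buffer whose misses are resolved in software by walking a page table) can be illustrated with a toy Python model. Everything here is hypothetical: the class name, the flat single-level page table, the 4 KiB page size, and the FIFO replacement policy are assumptions, and the sketch does not model the accelerator-side helper thread or the cache coherence the paper describes.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when a virtual page has no mapping in the page table."""


class SoftwareManagedTLB:
    """Toy translation buffer refilled in software on a miss."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # virtual page number -> frame number
        self.capacity = capacity       # number of TLB entries
        self.entries = OrderedDict()   # insertion order gives FIFO eviction
        self.misses = 0

    def _handle_miss(self, vpn):
        # Software miss handler: walk the (flat, assumed) page table and
        # install the translation, evicting the oldest entry if full.
        self.misses += 1
        if vpn not in self.page_table:
            raise PageFault(f"no mapping for virtual page {vpn}")
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[vpn] = self.page_table[vpn]

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.entries:
            self._handle_miss(vpn)
        return self.entries[vpn] * PAGE_SIZE + offset
```

A repeated access to the same page hits the buffer and skips the walk, which is why the cost of the miss handler dominates for memory-bound kernels with poor locality.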

Lightweight Virtual Memory Support for Zero-Copy Sharing of Pointer-Rich Data Structures in Heterogeneous Embedded SoCs

IEEE Trans. Parallel Distrib. Syst.

Marongiu

Benini

2017

While high-end heterogeneous systems are increasingly supporting heterogeneous uniform memory access (hUMA), their low-power counterparts still lack basic features like virtual memory support for accelerators. Instead of simply passing pointers, explicit data management involving copies is needed, which hampers programmability and performance. In this work, we evaluate a mixed hardware/software solution for lightweight virtual memory support for many-core accelerators in heterogeneous embedded systems-on-chip. Based on an input/output translation lookaside buffer managed by a host kernel-level driver, and compiler extensions protecting the accelerator's accesses to shared data, our solution is non-intrusive to the architecture of the accelerator cores, and enables zero-copy sharing of pointer-rich data structures.

Assessing the area/power/performance tradeoffs for an integrated fully-digital, large-scale 3D-ultrasound beamformer

Hager

Bartolini

et al. 2014

High-frame-rate and high-resolution 3D medical ultrasound imaging imposes high requirements on the involved processing hardware. Several thousands of analog signals need to be processed in many steps to obtain a final image. Fully digital beamforming makes it possible to achieve high image quality coupled with extreme flexibility. Unfortunately, digital beamforming imposes staggering requirements on main memory bandwidth caused by the loading of off-chip stored beamforming delays. In this paper we present the first fully-digital integrated beamformer that is able to compute 269.3 M focal points (FP) per second from 10 000 receive channels, and which does not require off-chip main memory. This is enabled by our novel delay approximation circuit that exploits temporal correlation between subsequent computations and thereby allows the delays for beamforming to be computed online. To estimate the area and power requirements, the complete system was designed and the beamformer core was evaluated for a 130 nm CMOS technology. The estimated complexity per channel is 37.2 kGE and the corresponding power dissipation was estimated at 48 mW.

Lightweight virtual memory support for many-core accelerators in heterogeneous embedded SoCs

Vogel

Marongiu

Benini

2015