We present an efficient FPGA architecture suitable for a medical 3D ultrasound beamformer. We tackle the delay-calculation bottleneck, which is the heart and most critical part of the beamformer, by proposing a computationally efficient design that performs volumetric real-time beamforming on a single-chip FPGA. The design has been demonstrated for a 32×32-channel receive probe, and we extrapolated the requirements of the architecture to 80×80 channels.

I. MOTIVATION

Medical ultrasound (US) imaging is well established and is used in a wide range of applications, from detecting static structures such as tumors to studying dynamic phenomena such as blood flow and valve function. US imaging comprises three main processes: insonification, beamforming (BF), and visualization. Insonification is the emission of radio-frequency (RF) acoustic waves from a piezoelectric transducer, called the probe, into a region of the body. The waves are reflected at the interfaces between inhomogeneous tissues, which act as scatterers because of acoustic impedance mismatches. The returning echoes are digitized and processed by the beamforming algorithm. Finally, a post-processing step maps the beamformed signals to screen pixels.

Recently, 3D US imaging has become available. A key advantage is that, since whole volumes are acquired at once, it removes the traditional dependence on a trained sonographer who must finely adjust the position and orientation of the probe to locate minute anatomical structures. This enables telesonography, where even an unskilled operator can upload scans to a hospital where trained radiologists issue a diagnosis. Unfortunately, present-day 3D imagers are bulky and expensive, and are suitable only for clinics and hospitals.
A portable US platform with cheap, battery-operated electronics would be a breakthrough, enabling telesonography in rescue environments, in rural areas, and in developing countries, with major societal benefits. To this end, we undertake to implement 3D beamforming on a single FPGA.

II. PROBLEM DEFINITION AND PREVIOUS WORK

Beamforming is the core of any US imaging machine. It maps the echoes back to their origins by summing them along a delay profile that represents the two-way time of flight of the acoustic wave from the origin to each scatterer and back to all the piezoelectric elements. BF also includes apodization, the weighting of the delayed echoes by a factor that compensates for antenna directivity effects. In volumetric US imaging, a software implementation of the beamformer is not a good fit for a battery-powered platform, whereas a hardware design offers major potential energy savings. One of the critical challenges of 3D US imaging is the number of receive channels of high-end transducers, up to 100×100 elements, and the correspondingly massive computation required for image reconstruction. Different state-of-t...
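The delay-and-sum operation described above can be sketched for a single focal point as follows. This is a minimal illustration, not the beamformer architecture of this work. It assumes a speed of sound of 1540 m/s, a single emission origin, and nearest-sample delays without interpolation, and it uses hypothetical function names (`two_way_delay`, `delay_and_sum`).

```python
import math

SPEED_OF_SOUND = 1540.0  # m/s; typical soft-tissue value (assumption)


def two_way_delay(origin, focal_point, element):
    """Time of flight (s) from the emission origin to the focal point and back to one element."""
    transmit = math.dist(origin, focal_point)
    receive = math.dist(focal_point, element)
    return (transmit + receive) / SPEED_OF_SOUND


def delay_and_sum(echoes, elements, origin, focal_point, fs, apodization=None):
    """Beamform one focal point (hypothetical helper).

    echoes[i] is the digitized RF trace of elements[i], sampled at fs (Hz).
    Each trace is read at its own two-way delay (rounded to the nearest sample),
    weighted by its apodization coefficient, and accumulated.
    """
    if apodization is None:
        apodization = [1.0] * len(elements)  # rectangular apodization
    total = 0.0
    for trace, element, weight in zip(echoes, elements, apodization):
        idx = round(two_way_delay(origin, focal_point, element) * fs)
        if 0 <= idx < len(trace):  # ignore echoes outside the recorded window
            total += weight * trace[idx]
    return total


# Usage: two elements 1 mm either side of the axis, focal point at 30 mm depth.
elements = [(-0.001, 0.0, 0.0), (0.001, 0.0, 0.0)]
origin, focal_point, fs = (0.0, 0.0, 0.0), (0.0, 0.0, 0.03), 40e6
echoes = []
for element in elements:
    trace = [0.0] * 2048
    trace[round(two_way_delay(origin, focal_point, element) * fs)] = 1.0  # synthetic point echo
    echoes.append(trace)
print(delay_and_sum(echoes, elements, origin, focal_point, fs))  # 2.0
```

Echoes that line up with the delay profile add coherently. The per-element delay evaluation inside the loop is the part that becomes the bottleneck when it has to be repeated for every one of thousands of channels and every focal point.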
Shared virtual memory is key in heterogeneous systems on chip (SoCs) that combine a general-purpose host processor with a many-core accelerator, both for programmability and for performance. In contrast to the full-blown, hardware-only solutions predominant in modern high-end systems, lightweight hardware-software co-designs are better suited to the more power- and area-constrained context of embedded systems and provide additional benefits in terms of flexibility and predictability. As a downside, the latter solutions require the host to handle, in software, both the synchronization on page misses and the miss handling itself. This may incur considerable run-time overheads. In this work, we present a novel hardware-software virtual memory management approach for many-core accelerators in heterogeneous embedded SoCs. It exploits an accelerator-side helper thread concept that enables the accelerator to manage its virtual memory hardware autonomously while operating cache-coherently on the page tables of the host's user-space processes. This greatly reduces overhead compared with host-side solutions while retaining flexibility. We have validated the design with a set of parameterizable benchmarks and real-world applications covering various application domains. For purely memory-bound kernels, accelerator performance improves by a factor of 3.8 compared with host-based management and lies within 50% of a lower-bound ideal memory management unit.
CCS Concepts: • Software and its engineering → Virtual memory; Main memory; • Computer systems organization → Heterogeneous (hybrid) systems; System on a chip; Embedded software;
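To make the miss-handling path concrete, the following sketch models a small software-refilled TLB in front of a two-level page table. It is a generic illustration with hypothetical names (`SoftTLB`, `walk`, `PageFault`) and assumed parameters: 4 KiB pages, a 10-bit second-level index, four TLB entries, and FIFO replacement. It is not the design evaluated in this work. The comment in `translate` marks where, in the approach described above, the accelerator-side helper thread would perform the walk instead of the host.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # 4 KiB pages (assumption)
L2_BITS = 10     # second-level index width of a 32-bit two-level table (assumption)


class PageFault(Exception):
    """Raised when the page table has no mapping for a virtual address."""


def walk(page_table, vpn):
    """Two-level walk: page_table maps a first-level index to {second-level index: frame number}."""
    l1, l2 = vpn >> L2_BITS, vpn & ((1 << L2_BITS) - 1)
    try:
        return page_table[l1][l2]
    except KeyError:
        raise PageFault(hex(vpn << PAGE_SHIFT)) from None


class SoftTLB:
    """Small fully associative TLB whose misses are resolved in software (FIFO replacement)."""

    def __init__(self, page_table, entries=4):
        self.page_table = page_table
        self.entries = entries
        self.cache = OrderedDict()  # virtual page number -> physical frame number
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.cache:
            # Miss path: in the scheme above, an accelerator-side helper thread
            # would walk the host process's page table here instead of interrupting the host.
            self.misses += 1
            frame = walk(self.page_table, vpn)
            if len(self.cache) >= self.entries:
                self.cache.popitem(last=False)  # evict the oldest entry
            self.cache[vpn] = frame
        return (self.cache[vpn] << PAGE_SHIFT) | offset


tlb = SoftTLB({0: {1: 0x80, 2: 0x81}})
print(hex(tlb.translate(0x1034)), tlb.misses)  # 0x80034 1
print(hex(tlb.translate(0x1038)), tlb.misses)  # 0x80038 1
```

Only the first access to a page pays for the walk. The overhead the abstract refers to comes from who executes the miss path and how it is synchronized.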
While high-end heterogeneous systems increasingly support heterogeneous uniform memory access (hUMA), their low-power counterparts still lack basic features such as virtual memory support for accelerators. Instead of simply passing pointers, explicit data management involving copies is needed, which hampers programmability and performance. In this work, we evaluate a mixed hardware/software solution for lightweight virtual memory support for many-core accelerators in heterogeneous embedded systems-on-chip. Based on an input/output translation lookaside buffer managed by a host kernel-level driver, and on compiler extensions that protect the accelerator's accesses to shared data, our solution is non-intrusive to the architecture of the accelerator cores and enables zero-copy sharing of pointer-rich data structures.
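The cost of explicit data management for pointer-rich structures can be illustrated with a linked list. In this hedged sketch, Python stands in for both host and accelerator code and all names are hypothetical. Without shared virtual memory, one possible way to offload the list is to copy it into a flat buffer and rewrite every pointer as a buffer index. With zero-copy sharing, the kernel could instead traverse the host's pointers as they are.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    value: int
    next: Optional["Node"] = None


def marshal(head: Optional[Node]) -> List[Tuple[int, int]]:
    """Copy a linked list into a flat buffer, rewriting each `next` pointer as an index (-1 = end)."""
    nodes = []
    node = head
    while node is not None:
        nodes.append(node)
        node = node.next
    position = {id(n): i for i, n in enumerate(nodes)}
    return [(n.value, position[id(n.next)] if n.next is not None else -1) for n in nodes]


def sum_flat(buffer: List[Tuple[int, int]]) -> int:
    """Kernel running on the copy: it must follow buffer indices, not the host's pointers."""
    total = 0
    i = 0 if buffer else -1
    while i != -1:
        value, i = buffer[i]
        total += value
    return total


def sum_shared(head: Optional[Node]) -> int:
    """With shared virtual memory, the kernel can traverse the original pointers directly."""
    total = 0
    while head is not None:
        total += head.value
        head = head.next
    return total


head = Node(1, Node(2, Node(3)))
print(marshal(head))                              # [(1, 1), (2, 2), (3, -1)]
print(sum_flat(marshal(head)), sum_shared(head))  # 6 6
```

Both kernels compute the same result. The copy-based path, however, needs an extra traversal, an extra buffer, and a kernel written against a different data layout, which is the programmability cost that zero-copy sharing avoids.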
High-frame-rate and high-resolution 3D medical ultrasound imaging places heavy demands on the processing hardware. Several thousand analog signals must be processed in many steps to obtain the final image. Fully digital beamforming makes it possible to achieve high image quality coupled with extreme flexibility. Unfortunately, digital beamforming imposes staggering main-memory bandwidth requirements, because the beamforming delays must be loaded from off-chip storage. In this paper we present the first fully digital integrated beamformer that is able to compute 269.3 M focal points (FP) per second from 10 000 receive channels and that does not require off-chip main memory. This is enabled by our novel delay approximation circuit, which exploits the temporal correlation between subsequent computations and thereby allows the beamforming delays to be computed online. To estimate the area and power requirements, the complete system was designed and the beamformer core was evaluated in a 130 nm CMOS technology. The estimated complexity per channel is 37.2 kGE, and the corresponding power dissipation is estimated at 48 mW.
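The following sketch illustrates, in simplified form, why the temporal correlation between successive delay computations can remove the need for stored delays or square roots. It is not the delay approximation circuit of this work. It assumes a single element at lateral offset `x`, focal points spaced one unit apart along an axial scan line, and distances measured in the units a wave travels per sample period. Because the receive path length grows by at most one unit between consecutive focal points, its ceiling can be tracked with one comparison and a few additions per step. The function names are hypothetical.

```python
import math


def receive_delays(x, n_points):
    """Return ceil(sqrt(x**2 + n**2)) for n = 0 .. n_points - 1 using only adds and compares."""
    target = x * x   # running value of x^2 + n^2, starting at n = 0
    k = abs(x)       # ceil(sqrt(x^2)) = |x|
    k_sq = k * k     # running value of k^2
    delays = []
    for n in range(n_points):
        # sqrt(x^2 + n^2) grows by at most 1 per step, so one correction suffices.
        if k_sq < target:
            k_sq += 2 * k + 1   # (k + 1)^2 = k^2 + 2k + 1
            k += 1
        delays.append(k)
        target += 2 * n + 1     # x^2 + (n + 1)^2 = x^2 + n^2 + 2n + 1
    return delays


def ceil_sqrt(v):
    """Exact reference: smallest integer k with k*k >= v."""
    r = math.isqrt(v)
    return r if r * r == v else r + 1


print(receive_delays(7, 6))                                        # [7, 8, 8, 8, 9, 9]
print(receive_delays(7, 6) == [ceil_sqrt(49 + n * n) for n in range(6)])  # True
```

In these units the transmit part of the two-way delay along the axis is simply `n`, so the full delay for focal point `n` is `n + k`. Real designs must also handle off-axis scan lines and finer delay resolution, which is where approximation schemes such as the one proposed here come in.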