Using soft-core processors on FPGAs offers the opportunity to customize the system design in order to accelerate the application. While hardware designers have always been able to do this manually, it requires specialized knowledge of design methods and of the microarchitecture of the soft core. In this paper we show that a mature compiler such as GCC can be used to generate processor customizations automatically, directly from the C code of the application. To this end, we have extended GCC to automatically select candidate sequences from the whole application and transform them into hardware extensions.
Due to the continuously decreasing cost of FPGAs, they have become a valid implementation platform for SoCs. Typically, a soft-core processor implementation is used to execute the software parts of the SoC. As each system is individually designed for a particular application, it is natural to support compute-intensive parts of the code through customized hardware acceleration. Two different architectural variants have been proposed for this purpose in SoCs: either an instruction set extension with a specialized pipeline implementation, or a peripheral component that is programmed through memory mapping. In this contribution we analyze the efficiency (speedup relative to the number of LUTs used) of those two variants.
Embedded systems are often designed as complex architectures with numerous processing elements. Effectively programming such systems requires parallel programming models, e.g. task-based or dataflow-based models. With these types of models, the mapping of the abstract application model to the existing hardware architecture plays a decisive role and is usually optimized to achieve an ideal resource footprint or a near-minimal execution time. However, when mapping several independent programs to the same platform, resource conflicts can arise. This can be circumvented by remapping some of the tasks of an application, which in turn affect its timing behavior, possibly leading to constraint violations. In this work we present a novel method to compute mappings that are robust against local task remapping. The underlying method is based on the bio-inspired design centering algorithm of Lp-Adaptation. We evaluate this with several benchmarks on different platforms and show that mappings obtained with our algorithm are indeed robust. In all experiments, our robust mappings tolerated significantly more run-time perturbations without violating constraints than mappings devised with optimization heuristics.
Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25× more energy efficient than expert-crafted Intel CPU implementations.
Modern FPGAs have become so affordable that they can be used to substitute ASICs in mass-produced devices. A key component of such a configurable system on a chip (CSoC) is the processor core. Available and usable cores are either 32 or 8 bits wide. Thus, there is a gap between these two extremes, which we want to fill with our SoC kit. In this contribution we elaborate on our SoC kit and its components and compare it to other SoC design environments.