Bit matrix compression is a highly relevant operation in computer arithmetic. Being essentially a multi-operand addition, it is the key operation behind fast multiplication and many higher-level operations such as multiply-accumulate, the computation of dot products, or the implementation of FIR filters. Compressor implementations have been constantly evolving for greater efficiency, both in general and in the context of concrete applications or specific implementation technologies. This paper builds on this history and describes a generic implementation of a bit matrix compressor for Xilinx FPGAs that does not require a generator tool. It contributes FPGA-oriented metrics for the evaluation of elementary parallel bit counters, a systematic analysis and partial decomposition of previously proposed counters, and a fully implemented construction heuristic with a flexible compression target matching the device capabilities. The generic implementation is agnostic of the aspect ratio of the input matrix and can be used for multiplication in the same way as for single-column population count operations.
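To illustrate the multi-operand addition at the heart of bit matrix compression, the following minimal C sketch models a 3:2 compressor (a bitwise full adder), the elementary building block behind hardware compressor trees. It is an illustrative software model only, not the paper's FPGA implementation; the operand values are arbitrary examples.

```c
#include <stdint.h>
#include <stdio.h>

/* 3:2 carry-save compressor: reduces three operands to a sum word and a
 * carry word without propagating carries between columns. Each bit column
 * acts as an independent full adder, mirroring the elementary parallel
 * bit counters discussed in the abstract. */
static void compress_3to2(uint32_t a, uint32_t b, uint32_t c,
                          uint32_t *sum, uint32_t *carry)
{
    *sum   = a ^ b ^ c;                            /* per-column sum bit */
    *carry = ((a & b) | (a & c) | (b & c)) << 1;   /* carry into next column */
}

int main(void)
{
    uint32_t ops[4] = {13, 7, 22, 5};  /* rows of a toy bit matrix */
    uint32_t s, c;

    /* Reduce four operands to two via two 3:2 compression stages, then
     * finish with a single carry-propagate addition. */
    compress_3to2(ops[0], ops[1], ops[2], &s, &c);
    compress_3to2(s, c, ops[3], &s, &c);
    printf("%u\n", (unsigned)(s + c));  /* prints 47 = 13+7+22+5 */
    return 0;
}
```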
Neural networks have established themselves as a generic and powerful means to approach challenging problems such as image classification, object detection, or decision making. Their successful employment rests on an enormous demand for compute. The quantization of network parameters and of the processed data has proven a valuable measure to reduce the challenges of network inference, so effectively that the feasible scope of applications extends even into the embedded domain. This paper describes the making of a real-time object detector for a live video stream processed on an embedded all-programmable device. The presented case illustrates how the required processing is tamed and parallelized across both the CPU cores and the programmable logic, and how the most suitable resources and powerful extensions, such as NEON vectorization, are leveraged for the individual processing steps. The crafted result is an extended Darknet framework implementing a fully integrated, end-to-end solution from video capture through object annotation to video output, applying neural network inference at different quantization levels and running at 16 frames per second on an embedded Zynq UltraScale+ (XCZU3EG) platform.
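As a rough idea of what parameter quantization involves, the sketch below shows a symmetric 8-bit weight quantization in C. This is a generic textbook scheme assumed for illustration; the abstract does not specify the quantization levels or scheme actually used in the extended Darknet framework.

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

/* Symmetric int8 quantization: map a float weight to an 8-bit integer
 * using a single scale factor derived from the largest weight magnitude.
 * Illustrative only; not the paper's actual quantization scheme. */
static int8_t quantize(float w, float scale)
{
    long q = lroundf(w / scale);
    if (q >  127) q =  127;   /* saturate to the int8 range */
    if (q < -128) q = -128;
    return (int8_t)q;
}

int main(void)
{
    float weights[] = {0.42f, -1.30f, 0.07f, 0.88f};  /* made-up weights */
    float max_abs = 1.30f;                 /* largest magnitude          */
    float scale = max_abs / 127.0f;        /* symmetric scale factor     */

    for (int i = 0; i < 4; i++) {
        int8_t q = quantize(weights[i], scale);
        /* print original, quantized, and dequantized values */
        printf("w=% .2f -> q=%4d -> w'=% .4f\n",
               weights[i], q, q * scale);
    }
    return 0;
}
```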
The mapping of reads, i.e., short DNA base-pair strings, to large genome databases has become a critical operation for genetic analysis and diagnosis. The underlying alignment operation is essentially a string search that tolerates some character mismatches and possibly character deletions or insertions with respect to a reference genome. Its output comprises the locations within the reference that are likely to correspond to the mapped DNA snippet. This paper describes PoC-Align, an alignment infrastructure using FPGA accelerators. It extends our preceding FPGA aligner [1], which has been enhanced to tolerate alignment gaps (insertions and deletions) and to be more customizable through generic parameters. In addition to describing the implementation of these extensions, we also present the mainly software-carried enhancements, such as support for mapping paired-end reads, that are implemented on top of the FPGA accelerator. By providing a thorough overview of the complete infrastructure, we aim to publicize the open release of our solution's sources and hope to encourage other groups to use and extend this platform.
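To make the mismatch-tolerant string search concrete, here is a toy C model that scans a reference and reports every position where a read aligns within a fixed Hamming-distance budget. It covers substitutions only, whereas PoC-Align additionally tolerates gaps; the sequences and the MAX_MM threshold are illustrative assumptions, not parameters of the actual accelerator.

```c
#include <stdio.h>
#include <string.h>

#define MAX_MM 2  /* assumed mismatch budget for this sketch */

/* Slide the read across the reference and count per-position character
 * mismatches; report every alignment within the budget. */
static void map_read(const char *ref, const char *read)
{
    size_t rlen = strlen(read), n = strlen(ref);

    for (size_t pos = 0; pos + rlen <= n; pos++) {
        int mm = 0;
        for (size_t i = 0; i < rlen && mm <= MAX_MM; i++)
            if (ref[pos + i] != read[i]) mm++;
        if (mm <= MAX_MM)
            printf("hit at %zu with %d mismatch(es)\n", pos, mm);
    }
}

int main(void)
{
    /* Expect an exact hit at position 4 and a 2-mismatch hit at 0. */
    map_read("ACGTACGTTAGC", "ACGTTA");
    return 0;
}
```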