Syed Asad Alam scite author profile

2014

VLSI Design

The logarithmic number system (LNS) is an attractive alternative for realizing finite-length impulse response (FIR) filters because multiplication in the linear domain becomes addition in the logarithmic domain. In the literature, linear coefficients are directly replaced by their logarithmic equivalents. In this paper, an approach to directly optimize the finite word length coefficients in the LNS domain is proposed. This branch and bound algorithm is implemented based on LNS integers, and several different branching strategies are proposed and evaluated. Optimal coefficients in the minimax sense are obtained and compared with the traditional finite word length representation in the linear domain as well as with rounding. Results show that the proposed method naturally provides smaller approximation error compared to rounding. Furthermore, they provide insights into finite word length properties of FIR filter coefficients in the LNS domain and show that LNS FIR filters typically provide a better approximation error compared to a standard FIR filter.
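As a rough illustration of the baseline the abstract mentions (replacing a linear coefficient by its rounded logarithmic equivalent), the sketch below stores a nonzero value as a sign and a fixed-point exponent, and multiplies by adding exponents. The function names, the `frac_bits` parameter and the sign/exponent layout are assumptions made for this example; it is not the paper's branch and bound optimization.

```python
import math

def to_lns(x, frac_bits, base=2.0):
    """Round a nonzero real to (sign, integer exponent code).

    The code is log_base(|x|) scaled by 2**frac_bits and rounded, i.e. a
    fixed-point exponent. Hypothetical layout, for illustration only.
    """
    if x == 0:
        raise ValueError("zero has no logarithm and needs a special code")
    sign = -1 if x < 0 else 1
    code = round(math.log(abs(x), base) * (1 << frac_bits))
    return sign, code

def from_lns(value, frac_bits, base=2.0):
    """Convert (sign, code) back to a linear-domain float."""
    sign, code = value
    return sign * base ** (code / (1 << frac_bits))

def lns_mul(a, b):
    """Multiply two LNS values: signs multiply, exponent codes add."""
    return a[0] * b[0], a[1] + b[1]

# Rounding a coefficient to 4 fractional exponent bits introduces an error.
h = 0.3
h_lns = to_lns(h, frac_bits=4)
rounding_error = abs(from_lns(h_lns, frac_bits=4) - h)

# A product in the linear domain is a sum of exponent codes in LNS.
product = lns_mul(to_lns(0.5, 4), to_lns(-0.25, 4))
```

With this layout, the rounding error of each coefficient depends on where it falls between two representable exponents, which is why optimizing directly over the LNS integer codes can do better than rounding.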

Low-precision Logarithmic Number Systems

ACM Trans. Archit. Code Optim.

Garland

Gregg

2021

Logarithmic number systems (LNS) are used to represent real numbers in many applications using a constant base raised to a fixed-point exponent, making the distribution of representable values exponential. This greatly simplifies hardware multiply, divide, and square root. LNS with base-2 is most common, but in this article, we show that for low-precision LNS the choice of base has a significant impact. We make four main contributions. First, LNS is not closed under addition and subtraction, so the result is approximate. We show that choosing a suitable base can manipulate the distribution to reduce the average error. Second, we show that low-precision LNS addition and subtraction can be implemented efficiently in logic rather than commonly used ROM lookup tables, the complexity of which can be reduced by an appropriate choice of base. A similar effect is shown where the result of arithmetic has greater precision than the input. Third, where input data from external sources is not expected to be in LNS, we can reduce the conversion error by selecting an LNS base to match the expected distribution of the input. Thus, there is no one base that gives the global optimum, and base selection is a trade-off between different factors. Fourth, we show that circuits realized in LNS require lower area and power consumption for short word lengths.

Improved Particle Filter Resampling Architectures

2019

J Sign Process Syst

The most challenging aspect of particle filtering hardware implementation is the resampling step. Its latency is high because it can be only partially executed in parallel with the other steps of particle filtering and has no inherent parallelism. To reduce the latency, an improved resampling architecture is proposed which pre-fetches from the weight memory in parallel with the fetching of a value from a random function generator, along with architectures for realizing the pre-fetch technique. This enables a particle filter using M particles with otherwise streaming operation to accept new inputs more often than every 2M cycles, which is what the previously best approach achieves. Results show that a pre-fetch buffer of five values achieves the best area-latency reduction trade-off while on average achieving an 85% reduction in latency for the resampling step, leading to a sample time reduction of more than 40%. We also propose a generic division-free architecture for the resampling step. It also removes the need to explicitly order the random values for efficient multinomial resampling implementation. In addition, on-the-fly computation of the cumulative sum of weights is proposed, which helps reduce the word length of the particle weight memory. FPGA implementation results show that the memory size is reduced by up to 50%.
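For readers unfamiliar with multinomial resampling, a minimal software sketch follows. It is not the hardware architecture from the paper: it only illustrates one common way to avoid dividing the weights by their sum (scale each uniform random value by the sum instead) and to handle unordered random values (search the cumulative sum per value). The function name and argument layout are assumptions made for this example.

```python
from bisect import bisect_right
from itertools import accumulate

def resample_multinomial(weights, uniforms):
    """Pick one particle index per uniform value in [0, 1).

    Weights may be unnormalized: each uniform value is multiplied by the
    weight sum, so no division is needed. The uniforms may arrive in any
    order because each one is located by binary search.
    """
    cumsum = list(accumulate(weights))  # cumulative sum of the weights
    total = cumsum[-1]
    last = len(weights) - 1
    # Particle i covers the interval [cumsum[i-1], cumsum[i]).
    # min() guards against a threshold landing on the total after rounding.
    return [min(bisect_right(cumsum, u * total), last) for u in uniforms]

# Four particles with unnormalized weights; uniforms deliberately unsorted.
indices = resample_multinomial([1, 3, 4, 2], [0.95, 0.05, 0.5, 0.25])
```

Here the four uniform values fall into the intervals of particles 3, 0, 2 and 1 respectively, so heavier particles are selected over a proportionally wider range of random values.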

Generalized division-free architecture and compact memory structure for resampling in particle filters

2015

Implementation of time-multiplexed sparse periodic FIR filters for FRM on FPGAs

2011

Frequency-response masking (FRM) is a set of techniques for lowering the computational complexity of narrow transition band FIR filters. FRM filters use a combination of sparse periodic filters and non-sparse filters. In this work we consider the implementation of these filters in a time-multiplexed manner on FPGAs. It is shown that the proposed architectures produce lower complexity realizations compared to the vendor-provided IP blocks, which do not take the sparseness into consideration. The designs are implemented on a Virtex-6 device utilizing the built-in DSP blocks.
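The sparseness referred to above can be illustrated in software. A periodic filter with period L has a nonzero tap only at every L-th position, so a direct implementation needs one multiplication per model-filter coefficient rather than one per tap. The sketch below is a plain Python illustration under that assumption, with hypothetical function names; it does not model the time-multiplexed FPGA architectures from the paper.

```python
def periodic_taps(model, L):
    """Impulse response of the periodic filter: L - 1 zeros between model taps."""
    taps = [0.0] * ((len(model) - 1) * L + 1)
    taps[::L] = model
    return taps

def dense_fir(x, taps):
    """Reference FIR that multiplies by every tap, zeros included."""
    return [sum(h * x[n - k] for k, h in enumerate(taps) if n - k >= 0)
            for n in range(len(x))]

def sparse_periodic_fir(x, model, L):
    """Same output, but only len(model) multiplications per output sample."""
    return [sum(h * x[n - k * L] for k, h in enumerate(model) if n - k * L >= 0)
            for n in range(len(x))]

model = [1.0, 2.0, 3.0]   # hypothetical model filter coefficients
x = [1.0] * 5             # step input
y_sparse = sparse_periodic_fir(x, model, L=2)
y_dense = dense_fir(x, periodic_taps(model, 2))
```

Both functions produce the same output; the sparse version simply never visits the zero-valued taps, which is the property a generic FIR block cannot exploit.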

Techniques for Efficient Implementation of FIR and Particle Filtering

2016

Finite-length impulse response (FIR) filters occupy a central place in many signal processing applications which alter the shape, frequency or sampling frequency of the signal. FIR filters are used because of their stability and the possibility of linear phase, but they require a higher filter order than infinite impulse response (IIR) filters to meet the same magnitude specifications. Depending on the size of the required transition bandwidth, the filter order can range from tens to hundreds to even thousands. Since the implementation of the filters in the digital domain requires multipliers and adders, high filter orders translate to a large number of these arithmetic units. Research towards reducing the complexity of FIR filters has been going on for decades, and the techniques used can be roughly divided into two categories: reduction in the number of multipliers and simplification of the multiplier implementation. One technique to reduce the number of multipliers is to use cascaded sub-filters with lower complexity to achieve the desired specification, known as frequency-response masking (FRM). One of the sub-filters is an upsampled model filter whose band edges are an integer multiple, termed the period L, of the target filter's band edges. Other sub-filters may include complement and masking filters which filter different parts of the spectrum to achieve the desired response. From an implementation point of view, time-multiplexing is beneficial because the maximum clock frequency supported by current state-of-the-art semiconductor technology generally does not correspond to the application-bound sample rate. A combination of these two techniques plays a significant role in the efficient implementation of FIR filters.
Part of the work presented in this dissertation is architectures for time-multiplexed FRM filters that benefit from the inherent sparsity of the periodic model filters. These time-multiplexed FRM filters not only reduce the number of multipliers but also lower the memory usage. Although the FRM technique requires a higher number of delay elements, it results in fewer memories and more energy-efficient memory schemes when time-multiplexed. Different memory arrangements and memory access schemes have also been discussed and compared in terms of their efficiency when using both single and dual-port memories. An efficient pipelining scheme has been proposed which reduces the number of pipelining registers while achieving similar clock frequencies. The single optimal point where the number of multiplications is minimum for non-time-multiplexed FRM filters is shown to become a function of both the period, L, and the time-multiplexing factor, M. This means that the minimum number of multipliers does not always correspond to the minimum number of multiplications, which also increases the flexibility of implementation. These filters are shown to achieve power reduction between 23% and 68% for the considered examples. To simplify the multiplier, alt...

On the RTL Implementation of FINN Matrix Vector Unit

ACM Trans. Embed. Comput. Syst.

Gregg

Gambardella

et al. 2023

FPGA-based accelerators are becoming increasingly popular for deep neural network inference due to their ability to scale performance with increasing degree of specialization with dataflow architectures or custom data type precision. In order to reduce the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (HLS) have been introduced. They provide higher abstraction compared to register-transfer level (RTL)-based design. HLS offers faster development time, better maintainability and more flexibility in code exploration when evaluating several options for multi-dimensional tensors, convolutional layers or different degrees of parallelism. For this reason, HLS has been adopted by DNN accelerator generation frameworks such as FINN and hls4ml. In this paper, we present an alternative backend library for FINN, leveraging RTL. We investigate and evaluate, across a spectrum of design dimensions, the pros and cons of an RTL-based implementation versus the original HLS variant. We show that for smaller design parameters, RTL produces significantly smaller circuits as compared to HLS. For larger circuits, however, the look-up table (LUT) count of the RTL-based design is slightly higher, up to around 15%. On the other hand, HLS consistently requires more flip-flops (FFs) (with an orders-of-magnitude difference for smaller designs) and block RAMs (BRAMs) (2× more). This also impacts the critical path delay, with RTL producing significantly faster circuits, up to around 80%. Furthermore, RTL also benefits from at least a 10× reduction in synthesis time. Finally, the results were validated in practice using two real-world use cases, one of a multi-layer perceptron (MLP) used in network intrusion detection and the other a convolutional network called ResNet used in image recognition.
Overall, since HLS frameworks code-generate the hardware design, the benefit of ease of design entry is less important. As such, the gained benefits in synthesis time, together with some design-dependent resource benefits, make the RTL abstraction an attractive alternative.

On the Implementation of Time-Multiplexed Frequency-Response Masking Filters

IEEE Trans. Signal Process.

2016