2016
DOI: 10.1145/2890498

Power, Area, and Performance Optimization of Standard Cell Memory Arrays Through Controlled Placement

Abstract: Embedded memory remains a major bottleneck in current integrated circuit design in terms of silicon area, power dissipation, and performance; however, static random access memories (SRAMs) are almost exclusively supplied by a small number of vendors through memory generators, targeted at rather generic design specifications. As an alternative, standard cell memories (SCMs) can be defined, synthesized, and placed and routed as an integral part of a given digital system, providing complete design flexibility, go…

Cited by 50 publications (49 citation statements)
References 18 publications (46 reference statements)
“…GCC 4.9 and LLVM 3.7 toolchains are available for the cores, while OpenMP 3.0 is supported on top of the bare-metal parallel runtime. The cores share a single instruction cache of 4 kB of Standard Cell Memory (SCM) [55] that can increase energy efficiency by up to 30% compared to an SRAM-based private instruction cache on parallel workloads [56]. The ISA extensions of the core include general-purpose enhancements (automatically inferred by the compiler), such as zero-overhead hardware loops and load and store operations embedding pointer arithmetic, and other DSP extensions that can be explicitly included by means of intrinsic calls.…”
Section: SoC Architecture
confidence: 99%
“…Instruction caches can also be implemented with SCMs. The usage of SCMs for the implementation of frequently-accessed memory banks significantly improves energy efficiency, since energy/access of SCM is significantly lower than that of SRAMs for the relatively small cuts needed in L1 instruction and data memories [32]. Depending on the availability of low-voltage memories in the targeted implementation technology, different ratios of SCM and SRAM memory can be instantiated at design time.…”
Section: Cluster Architecture
confidence: 99%
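The excerpt above describes a design-time choice: banks that are small and frequently accessed are implemented as SCMs because their energy per access beats SRAM at small sizes, while larger cuts go to SRAM macros. The following is a minimal illustrative sketch of that selection rule; all energy coefficients and the resulting crossover size are hypothetical placeholders, not figures from the cited works.

```python
# Illustrative sketch: choosing SCM vs. SRAM per bank at design time,
# based on modeled energy per access. All numbers below are hypothetical
# assumptions for illustration only.

def scm_energy_pj(size_kb: float) -> float:
    # Latch-based SCM: low fixed cost, energy grows quickly with bank size.
    return 0.5 + 0.40 * size_kb

def sram_energy_pj(size_kb: float) -> float:
    # SRAM macro: higher fixed overhead, flatter growth for larger cuts.
    return 2.0 + 0.05 * size_kb

def pick_memory(size_kb: float) -> str:
    """Return the lower-energy implementation for a bank of this size."""
    return "SCM" if scm_energy_pj(size_kb) < sram_energy_pj(size_kb) else "SRAM"

if __name__ == "__main__":
    for size in (1, 2, 4, 8, 16):
        print(f"{size:>2} kB bank -> {pick_memory(size)}")
```

With these placeholder coefficients the small cuts land on SCM and the larger ones on SRAM, mirroring the qualitative tradeoff the excerpt describes; in a real flow the coefficients would come from characterized macros in the target technology.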
“…We demonstrate that this approach improves the energy efficiency of the digital core of the accelerator by 5.1×, and the throughput by 1.3×, with respect to a baseline architecture based on 12-bit MAC units operating at a nominal supply voltage of 1.2 V. To extend the performance scalability of the device, we implement a latch-based standard cell memory (SCM) architecture for on-chip data storage. Although SCMs are more expensive than SRAMs in terms of area, they provide better voltage scalability and energy efficiency [26], extending the operating range of the device in the low-voltage region. This further improves the energy efficiency of the engine by 6× at 0.6 V, with respect to the nominal operating voltage of 1.2 V, and leads to an improvement in energy efficiency by 11.6× with respect to a fixed-point implementation with SRAMs at its best energy point of 0.8 V. To improve the flexibility of the convolutional engine we implement support for several kernel sizes (1×1 -7×7), and support for per-channel scaling and biasing, making it suitable for implementing a large variety of CNNs.…”
Section: Introduction
confidence: 99%