With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are moving to these devices. However, executing Deep Neural Network (DNN) inference is still challenging given the high computation and storage demands, especially when real-time performance with high accuracy is needed. Weight pruning of DNNs has been proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained and accurate, but not hardware friendly; structured pruning is coarse-grained and hardware-efficient, but incurs higher accuracy loss. In this paper, we advance the state-of-the-art by introducing a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds, and is desirable across theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework to efficiently execute DNNs on mobile devices with the help of a novel model compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler/code-generation-based optimizations, i.e., filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning. Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups of up to 44.5×, 11.4×, and 7.1×, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved on mobile devices.
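The core idea of pattern-based pruning can be illustrated with a minimal sketch: every 3×3 kernel keeps the same small set of weight positions drawn from a fixed pattern library, giving fine-grained sparsity that is nonetheless regular enough for a compiler to exploit. The four-entry patterns and the magnitude-based selection rule below are illustrative assumptions, not PatDNN's actual pattern set or ADMM procedure.

```python
import numpy as np

# Hypothetical 4-entry pattern library for 3x3 kernels (illustrative only).
PATTERNS = [
    np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]]),
    np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]]),
    np.array([[0, 0, 0], [1, 1, 0], [1, 1, 0]]),
    np.array([[0, 0, 0], [0, 1, 1], [0, 1, 1]]),
]

def prune_kernel(kernel):
    """Apply the best-fitting pattern: keep the mask that preserves the
    most weight magnitude, and zero out the remaining positions."""
    scores = [np.abs(kernel * m).sum() for m in PATTERNS]
    best = PATTERNS[int(np.argmax(scores))]
    return kernel * best, best

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))
pruned, mask = prune_kernel(kernel)
# At most 4 of the 9 weights survive, always at pattern positions.
print(int((pruned != 0).sum()))
```

Because every kernel in a layer uses one of a handful of known masks, the compiler can specialize code per pattern instead of handling arbitrary sparsity.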
Aims: Cystathionine β-synthase (CBS) catalyzes the first and rate-limiting step in the two-step trans-sulfuration pathway that converts homocysteine to cysteine. It is also one of three major enzymes responsible for the biogenesis of H2S, a signaling molecule. We have previously demonstrated that CBS is activated in cells challenged by oxidative stress, but the underlying molecular mechanism of this regulation has remained unclear. Results: Here, we demonstrate that S-glutathionylation of CBS enhances its activity ~2-fold in vitro. Loss of this post-translational modification in the presence of dithiothreitol results in reversal to basal activity. Cys346 was identified as the site of S-glutathionylation by a combination of mass spectrometric, mutagenesis, and activity analyses. To test the physiological relevance of S-glutathionylation-dependent regulation of CBS, HEK293 cells were oxidatively challenged with peroxide, which is known to enhance trans-sulfuration flux. Under these conditions, CBS glutathionylation levels increased and were correlated with a ~3-fold increase in CBS activity. Innovation: Collectively, our results reveal a novel post-translational modification of CBS, that is, glutathionylation, which functions as an allosteric activator under oxidative stress conditions, permitting enhanced synthesis of both cysteine and H2S. Conclusions: Our study elucidates a molecular mechanism for increased cysteine, and therefore glutathione, synthesis via glutathionylation of CBS. It also demonstrates the potential for increased H2S production under oxidative stress conditions, particularly in tissues where CBS is a major source of H2S. Antioxid. Redox Signal. 22, 350–361.
Model compression techniques for Deep Neural Networks (DNNs) have been widely acknowledged as an effective way to achieve acceleration on a variety of platforms, and DNN weight pruning is a straightforward and effective method. There are currently two mainstream pruning methods, representing two extremes of pruning regularity: non-structured, fine-grained pruning can achieve high sparsity and accuracy, but is not hardware friendly; structured, coarse-grained pruning exploits hardware-efficient structures, but suffers an accuracy drop when the pruning rate is high. In this paper, we introduce PCONV, comprising a new sparsity dimension: fine-grained pruning patterns inside the coarse-grained structures. PCONV comprises two types of sparsity: Sparse Convolution Patterns (SCP), generated from intra-convolution-kernel pruning, and connectivity sparsity, generated from inter-convolution-kernel pruning. Essentially, SCP enhances accuracy due to its special vision properties, and connectivity sparsity increases the pruning rate while maintaining a balanced workload on filter computation. To deploy PCONV, we develop a novel compiler-assisted DNN inference framework and execute PCONV models in real time without accuracy compromise, which cannot be achieved in prior work. Our experimental results show that PCONV outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups of up to 39.2×, 11.4×, and 6.3×, respectively, with no accuracy loss. Mobile devices can achieve real-time inference on large-scale DNNs.
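The connectivity sparsity described above can be sketched as follows: entire convolution kernels (filter-to-input-channel connections) are removed by magnitude, while every filter keeps the same number of surviving kernels so the per-filter workload stays balanced. The shapes, the L1-norm criterion, and the keep ratio below are illustrative assumptions, not PCONV's actual pruning procedure.

```python
import numpy as np

def connectivity_prune(weights, keep_per_filter):
    """Keep only the strongest `keep_per_filter` kernels in each filter
    (by L1 norm), zeroing whole kernels so workload stays balanced.
    weights shape: (out_channels, in_channels, kH, kW)."""
    out_c = weights.shape[0]
    pruned = np.zeros_like(weights)
    for f in range(out_c):
        norms = np.abs(weights[f]).sum(axis=(1, 2))   # L1 norm per kernel
        keep = np.argsort(norms)[-keep_per_filter:]   # strongest kernels
        pruned[f, keep] = weights[f, keep]
    return pruned

rng = np.random.default_rng(1)
w = rng.standard_normal((8, 16, 3, 3))
p = connectivity_prune(w, keep_per_filter=4)
# Every filter keeps exactly 4 of its 16 kernels.
print([(np.abs(p[f]).sum(axis=(1, 2)) > 0).sum() for f in range(8)])
```

The balanced kernel count per filter is what keeps parallel filter computation free of load imbalance after pruning.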
Nitrite was recognized as a potent vasodilator >130 years ago and has more recently emerged as an endogenous signaling molecule and modulator of gene expression. Understanding the molecular mechanisms that regulate nitrite metabolism is essential for its use as a potential diagnostic marker as well as a therapeutic agent for cardiovascular diseases. In this study, we have identified human cystathionine β-synthase (CBS) as a new player in nitrite reduction, with implications for the nitrite-dependent control of H2S production. This novel activity of CBS exploits the catalytic property of its unusual heme cofactor to reduce nitrite and generate NO. Evidence for the possible physiological relevance of this reaction is provided by the formation of ferrous-nitrosyl (FeII-NO) CBS in the presence of NADPH, the human diflavin methionine synthase reductase (MSR), and nitrite. Formation of FeII-NO CBS via its nitrite reductase activity inhibits CBS, providing an avenue for regulating the biogenesis of H2S and cysteine, the limiting reagent for synthesis of glutathione, a major antioxidant. Our results also suggest a possible role for CBS in intracellular NO biogenesis, particularly under hypoxic conditions. The participation of the regulatory heme cofactor of CBS in nitrite reduction is unexpected and expands the repertoire of proteins that can liberate NO from the intracellular nitrite pool. Our results reveal a potential molecular mechanism for cross-talk between nitrite, NO, and H2S biology.
In the radar cross section (RCS) prediction of complex targets, an intensive computational burden arises when calculating multiple-scattering effects. To overcome this cost, we present a program executing on graphics processing units (GPUs). In this paper, we analyze the scattering properties of a satellite, whose antennas are modeled as cubes and columns, by employing a GPU-based combination of geometrical optics (GO) and physical optics (PO) together with the kd-tree technique. Owing to this treatment, the improved method yields superior performance at high frequencies. The agreement of the results in this paper with experimental and other exact results demonstrates the accuracy and efficiency of this technique.
Index Terms: Compute unified device architecture (CUDA), kd-tree, complex satellite
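As a point of reference for the PO side of the combined GO/PO method, the classical physical-optics result for the peak monostatic RCS of a flat plate at normal incidence is σ = 4πA²/λ². A minimal sketch, with illustrative plate dimensions and frequency not taken from the paper:

```python
import math

def po_flat_plate_rcs(a_m, b_m, freq_hz):
    """Peak monostatic RCS (m^2) of an a x b flat plate at normal
    incidence in the physical-optics limit: sigma = 4*pi*A^2 / lambda^2."""
    c = 299_792_458.0          # speed of light, m/s
    lam = c / freq_hz          # wavelength, m
    area = a_m * b_m
    return 4 * math.pi * area**2 / lam**2

# 10 cm x 10 cm plate at X-band (10 GHz); values are illustrative.
sigma = po_flat_plate_rcs(0.1, 0.1, 10e9)
print(round(sigma, 3))                       # RCS in m^2
print(round(10 * math.log10(sigma), 2))      # RCS in dBsm
```

Exact closed forms like this are the standard sanity checks for high-frequency solvers before tackling multiple scattering on complex geometry.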
The rapid development and wide adoption of object detection techniques have drawn attention to both the accuracy and speed of object detectors. However, current state-of-the-art object detection works are either accuracy-oriented, using a large model at the cost of high latency, or speed-oriented, using a lightweight model at the cost of accuracy. In this work, we propose YOLObile, a framework for real-time object detection on mobile devices via compression-compilation co-design. A novel block-punched pruning scheme is proposed that applies to any kernel size. To improve computational efficiency on mobile devices, a GPU-CPU collaborative scheme is adopted along with advanced compiler-assisted optimizations. Experimental results indicate that our pruning scheme achieves a 14× compression rate on YOLOv4 with 49.0 mAP. Under our YOLObile framework, we achieve 17 FPS inference speed using the GPU on a Samsung Galaxy S20. By incorporating our proposed GPU-CPU collaborative scheme, the inference speed increases to 19.1 FPS, a 5× speedup over the original YOLOv4.
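The block-punched idea can be sketched on a flattened weight matrix: the rows (filters) are divided into equal blocks, and within each block the same column positions are "punched out" across all rows, chosen by column magnitude. Because it operates on the flattened matrix, the scheme is independent of kernel size. The block size, prune ratio, and L1 criterion below are illustrative assumptions, not YOLObile's exact configuration.

```python
import numpy as np

def block_punched_prune(w, block_rows, prune_ratio):
    """Within each block of `block_rows` rows, zero the same
    `prune_ratio` fraction of columns (weakest by block L1 norm)."""
    pruned = w.copy()
    n_prune = int(w.shape[1] * prune_ratio)
    for start in range(0, w.shape[0], block_rows):
        block = pruned[start:start + block_rows]
        col_norms = np.abs(block).sum(axis=0)
        punch = np.argsort(col_norms)[:n_prune]  # weakest columns in block
        block[:, punch] = 0.0
    return pruned

rng = np.random.default_rng(2)
w = rng.standard_normal((16, 36))     # e.g. 16 filters, flattened 4x3x3 kernels
p = block_punched_prune(w, block_rows=4, prune_ratio=0.5)
print(round(1 - np.count_nonzero(p) / p.size, 2))  # overall sparsity: 0.5
```

Punching identical positions within a block keeps memory access regular enough for efficient mobile code generation while still pruning at fine granularity.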
Deep Neural Networks (DNNs) have emerged as the core enabler of many major applications on mobile devices. To achieve high accuracy, DNN models have become increasingly deep, with hundreds or even thousands of operator layers, leading to high memory and computational requirements for inference. Operator fusion (or kernel/layer fusion) is a key optimization in many state-of-the-art DNN execution frameworks, such as TensorFlow, TVM, and MNN, that aims to improve the efficiency of DNN inference. However, these frameworks usually adopt fusion approaches based on certain patterns that are too restrictive to cover the diversity of operators and layer connections, especially those seen in many extremely deep models. Polyhedral-based loop fusion techniques, on the other hand, work on a low-level view of the computation without operator-level information, and can also miss potential fusion opportunities. To address this challenge, this paper proposes a novel and extensive loop fusion framework called DNNFusion. The basic idea of this work is to operate at an operator view of DNNs, but expand fusion opportunities by developing a classification of both individual operators and their combinations. In addition, DNNFusion includes 1) a novel mathematical-property-based graph rewriting framework to reduce evaluation costs and facilitate subsequent operator fusion, 2) integrated fusion plan generation that leverages high-level analysis and accurate lightweight profiling, and 3) additional optimizations during fusion code generation. DNNFusion is extensively evaluated on 15 DNN models with varied types
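The payoff of operator fusion can be illustrated with a minimal sketch: a chain of elementwise ("one-to-one") operators is composed into a single callable, so the data is walked once instead of materializing an intermediate tensor after each operator. The operator names and the composition helper below are illustrative; this is the general idea, not DNNFusion's classification or code generator.

```python
import numpy as np

# Illustrative elementwise operators (each would normally produce
# a full intermediate tensor if executed separately).
def add_bias(x, b): return x + b
def relu(x): return np.maximum(x, 0.0)
def scale(x, s): return x * s

def fuse(ops):
    """Compose a list of (fn, extra_args) into one fused callable,
    avoiding intermediate tensors between the operators."""
    def fused(x):
        for fn, args in ops:
            x = fn(x, *args)
        return x
    return fused

x = np.array([-2.0, -0.5, 1.0, 3.0])
pipeline = fuse([(add_bias, (1.0,)), (relu, ()), (scale, (2.0,))])
print(pipeline(x).tolist())  # [0.0, 1.0, 4.0, 8.0]
```

Elementwise chains like this are the easiest fusion case; the harder problem the paper targets is deciding which of the many operator *combinations* (reductions, reshapes, convolutions) can be fused profitably.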