Abstract: Whilst computer vision models built using self-supervised approaches are now commonplace, some important questions remain. Do self-supervised models learn highly redundant channel features? What if a self-supervised network could dynamically select the important channels and get rid of the unnecessary ones? Currently, convnets pre-trained with self-supervision obtain performance on downstream tasks comparable to their supervised counterparts in computer vision. However, there are drawbacks…
“…To make training feasible, the Gumbel-Softmax trick (Jang et al., 2016) is adopted. The Gumbel trick has been widely used as a reparameterisation technique for the task of dynamic channel selection (Krishna et al., 2022; Li et al., 2021; Herrmann et al., 2020; Veit & Belongie, 2018). For more clarity, refer to Figure 4 in Appendix A.3.…”
Section: Methods (mentioning)
confidence: 99%
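To make the excerpt above concrete, the following is a minimal sketch of how the Gumbel-Softmax trick can yield (approximately) binary, per-input channel gates that remain differentiable during training. The module name, the pooled-MLP logit source, and the temperature are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelChannelGate(nn.Module):
    """Samples a hard on/off decision per channel via the Gumbel-Softmax trick.

    Sketch only: logits are produced from a global-average-pooled input feature
    map passed through a small linear layer; all names/sizes are assumptions.
    """
    def __init__(self, in_channels, gated_channels, tau=1.0):
        super().__init__()
        self.tau = tau
        self.gated_channels = gated_channels
        # Two logits per gated channel: [keep, drop]
        self.fc = nn.Linear(in_channels, gated_channels * 2)

    def forward(self, x):
        # x: (B, C_in, H, W) -> per-channel keep/drop logits
        pooled = F.adaptive_avg_pool2d(x, 1).flatten(1)          # (B, C_in)
        logits = self.fc(pooled).view(-1, self.gated_channels, 2)
        # Straight-through Gumbel-Softmax: hard one-hot decisions in the
        # forward pass, soft (differentiable) gradients in the backward pass.
        g = F.gumbel_softmax(logits, tau=self.tau, hard=True)    # (B, C_out, 2)
        return g[..., 0]                                          # binary keep-mask, (B, C_out)
```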
“…Most works on dynamic computation have been confined to supervised learning. Recently, (Krishna et al., 2022) used SimSiam (Chen & He, 2021) as a self-supervised objective combined with a dynamic channel gating (DGNet) (Li et al., 2021) mechanism trained from scratch, and showed that comparable performance can be achieved under channel budget constraints. Likewise, (Meng et al., 2022) used channel gating-based dynamic pruning (CGNet) (Hua et al., 2019) augmented with contrastive learning to achieve inference speed-ups without substantial loss of performance.…”
Section: Self-supervised Dynamic Computation and Beyond (mentioning)
confidence: 99%
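The following sketch illustrates the general DGNet/CGNet-style idea referenced above: a convolutional block whose output channels are switched on or off per input. It reuses the hypothetical GumbelChannelGate from the previous sketch; the gate placement and residual-free wiring are simplifying assumptions, not a specific reference implementation.

```python
class GatedConvBlock(nn.Module):
    """Conv block with input-dependent channel gating (DGNet/CGNet-style sketch)."""
    def __init__(self, in_channels, out_channels, tau=1.0):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.gate = GumbelChannelGate(in_channels, out_channels, tau=tau)

    def forward(self, x):
        mask = self.gate(x)                     # (B, C_out), approximately binary
        out = F.relu(self.bn1(self.conv1(x)))
        out = out * mask[:, :, None, None]      # zero out channels the gate switched off
        out = F.relu(self.bn2(self.conv2(out)))
        return out, mask                        # the mask is reused later for the budget loss
```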
“…A common practice to reduce this computational burden is to extract a lightweight sub-network from an off-the-shelf pre-trained model, or to pre-train the model as part of a multi-step training process and further compress it by applying techniques such as knowledge distillation (KD) (Hinton et al., 2015), pruning (Frankle & Carbin, 2018), dynamic computation (DC) (Veit & Belongie, 2018), etc. SSL-based pre-training combined with KD (Tian et al., 2019; Abbasi Koohpayegani et al., 2020; Fang et al., 2021), DC (Krishna et al., 2022; Meng et al., 2022), or pruning (Caron et al., 2020; …) also serves as an effective way to obtain a lightweight sub-network for a given downstream task. This sequential learning procedure often involves fine-tuning a pre-trained self-supervised model on a downstream task along with the corresponding training objective of KD, DC, or pruning with cross-entropy (CE) loss.…”
Section: Introduction (mentioning)
confidence: 99%
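As a concrete illustration of the sequential fine-tuning procedure mentioned in this excerpt, below is a minimal sketch of a standard KD objective (Hinton et al., 2015) combined with cross-entropy. The temperature T and mixing weight alpha are illustrative assumptions, not values reported in the cited works.

```python
def kd_finetune_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Cross-entropy on labels plus a temperature-scaled distillation term."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```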
“…large language models (LLMs)) via fine-tuning makes the overall process computationally more expensive and cumbersome. Furthermore, downstream tasks are diverse and vary widely; therefore, any change in the downstream task usually requires repeating the entire procedure multiple times. [Figure 1 caption fragment:] …the setting of (Krishna et al., 2022), but modified as per our use case (i.e., instead of SimSiam (Chen & He, 2021) we use the VICReg objective). c. This work: we learn a dense encoder and a set of gates based on a budget constraint t_d.…”
Section: Introduction (mentioning)
confidence: 99%
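One common way to tie the learned gates to a target budget t_d is to penalise the deviation of the average fraction of active channels from t_d and add this penalty to the self-supervised (e.g., VICReg) loss. The sketch below is an assumption about how such a constraint can be written, not the paper's exact formulation; lambda_budget and the squared-error form are hypothetical choices.

```python
import torch

def budget_loss(masks, t_d):
    """Penalise deviation of the realised channel budget from the target t_d.

    `masks` is a list of (B, C) gate tensors collected from the gated blocks;
    a single global target and a squared error are simplifying assumptions.
    """
    used = torch.cat([m.mean(dim=1, keepdim=True) for m in masks], dim=1)  # (B, num_blocks)
    return (used.mean() - t_d) ** 2

# Hypothetical joint objective: self-supervised loss on the gated forward pass
# plus the budget penalty.
# total_loss = vicreg_loss(z_a, z_b) + lambda_budget * budget_loss(masks, t_d)
```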
“…To obtain W, we follow dynamic channel selection (DCS) (Veit & Belongie, 2018; Li et al., 2021) to induce sparsity while maintaining the network topology. Figure 1a depicts the traditional setting of alternating between pre-training and fine-tuning, while Figure 1b depicts the setting recently introduced in (Krishna et al., 2022) using dynamic channel selection along with self-supervision. From now on, lightweight network, sub-network, and gated network refer to the same thing.…”
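A short sketch of the point about topology: the per-layer gate decisions W are collected during an ordinary forward pass, and unselected channels are merely zeroed rather than removed, so the dense architecture is left intact. The encoder structure (a list of the hypothetical GatedConvBlock modules from the earlier sketch) is an assumption.

```python
def gated_forward(blocks, x):
    """Run the gated encoder once, collecting per-layer gate decisions W."""
    masks = []
    for block in blocks:       # blocks: list of GatedConvBlock instances
        x, mask = block(x)
        masks.append(mask)     # W: one (B, C) gate matrix per gated layer
    return x, masks
```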
Self-supervised learning (SSL) approaches have made major strides forward by emulating the performance of their supervised counterparts on several computer vision benchmarks. This, however, comes at the cost of substantially larger model sizes and computationally expensive training strategies, which eventually lead to larger inference times, making them impractical for resource-constrained industrial settings. Techniques like knowledge distillation (KD), dynamic computation (DC), and pruning are often used to obtain a lightweight sub-network, which usually involves multiple epochs of fine-tuning of a large pre-trained model, making the overall process computationally expensive. In this work we propose a novel perspective on the interplay between the SSL and DC paradigms that can be leveraged to simultaneously learn a dense and a gated (sparse/lightweight) sub-network from scratch, offering a good accuracy-efficiency trade-off and therefore yielding a generic, multipurpose architecture for application-specific industrial settings. Our study overall conveys a constructive message: exhaustive experiments on several image classification benchmarks (CIFAR-10, STL-10, CIFAR-100, and ImageNet-100) demonstrate that the proposed training strategy provides a dense and a corresponding sparse sub-network that achieve performance on par with the vanilla self-supervised setting, but at a significant reduction in computation in terms of FLOPs under a range of target budgets.
Convolutional neural networks have made significant strides in solving computer vision tasks at the expense of high computational demands. This complexity hinders efficient processing, particularly on devices with limited computational resources such as edge devices. One way to overcome this limitation is conditional computing, which optimizes inference by selectively utilizing parts of the network depending on the characteristics of the input. A recent conditional execution method is Conditional Information Gain Trellis (CIGT), which routes samples based on an information gain-based router mechanism. The original CIGT model was designed to route a single sample along a single path in a trellis structure. In this study, advanced inference strategies that allow inputs to traverse multiple paths are proposed to improve the performance of the vanilla CIGT model. These strategies aim to find a middle ground between improved model performance and increased computational demands. For this purpose, two techniques were proposed: a Cross-Entropy Search-based threshold optimization algorithm and a Reinforcement Learning-based routing strategy. The first method treats multi-path routing in CIGT as a black-box optimization problem, and the second interprets it as a Markov Decision Process, with a Q-Learning-based supervised regression algorithm designed as the solution. Both of these methods provide significant performance improvements compared to the original CIGT model, with an adjustable increase in computation. Experiments were conducted on two image datasets, with additional statistical tests and analyses to inspect the behavior of the proposed algorithms. The novel multi-path routing methods designed in this study show potential for both the original CIGT model and similar conditional computation approaches that use sample-dependent routing mechanisms to select parts of the network.
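The Cross-Entropy Search mentioned above treats the routing thresholds as a black-box optimization problem. Below is a generic Cross-Entropy Method sketch for such a search; the evaluate callback (assumed to run the multi-path model on a validation set and return an accuracy/compute trade-off score), the Gaussian sampling distribution, the [0, 1] threshold range, and all hyper-parameters are illustrative assumptions, not details taken from the CIGT paper.

```python
import numpy as np

def cross_entropy_threshold_search(evaluate, num_thresholds, iters=20,
                                   pop_size=64, elite_frac=0.1, seed=0):
    """Cross-Entropy Method over routing thresholds (black-box optimization sketch)."""
    rng = np.random.default_rng(seed)
    mu = np.full(num_thresholds, 0.5)        # thresholds assumed to live in [0, 1]
    sigma = np.full(num_thresholds, 0.25)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(iters):
        # Sample candidate threshold vectors and score each one.
        samples = np.clip(rng.normal(mu, sigma, size=(pop_size, num_thresholds)), 0.0, 1.0)
        scores = np.array([evaluate(s) for s in samples])
        # Refit the sampling distribution to the best-scoring candidates.
        elites = samples[np.argsort(scores)[-n_elite:]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu
```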