2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)
DOI: 10.1109/isvlsi.2016.117

Reducing the Model Order of Deep Neural Networks Using Information Theory

Abstract: Deep neural networks are typically represented by a much larger number of parameters than shallow models, making them prohibitive for small-footprint devices. Recent research shows that there is considerable redundancy in the parameter space of deep neural networks. In this paper, we propose a method to compress deep neural networks by using the Fisher Information metric, which we estimate through a stochastic optimization method that keeps track of second-order information in the network. We first rem…
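The abstract is truncated, but its core ingredient, a Fisher Information estimate used to identify parameters that carry little information, can be sketched. The snippet below is a minimal PyTorch sketch, not the paper's implementation: it assumes a diagonal empirical Fisher accumulated from squared log-likelihood gradients, and the function names (`estimate_diagonal_fisher`, `prune_by_fisher`) and the quantile-based pruning rule are illustrative choices.

```python
import torch
import torch.nn.functional as F

def estimate_diagonal_fisher(model, data_loader, n_batches=100):
    """Accumulate a diagonal empirical Fisher estimate: the average of
    squared log-likelihood gradients for every parameter."""
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    model.eval()
    batches_seen = 0
    for i, (x, y) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        log_probs = F.log_softmax(model(x), dim=-1)
        F.nll_loss(log_probs, y).backward()  # negative log-likelihood of the labels
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2
        batches_seen += 1
    return {name: f / max(batches_seen, 1) for name, f in fisher.items()}

def prune_by_fisher(model, fisher, keep_ratio=0.5):
    """Zero out the weights whose Fisher scores fall below a global
    quantile threshold (a crude stand-in for model-order reduction)."""
    scores = torch.cat([f.flatten() for f in fisher.values()])
    threshold = torch.quantile(scores, 1.0 - keep_ratio)
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_((fisher[name] >= threshold).float())
```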


Citations: cited by 8 publications (4 citation statements)
References: 28 publications
“…Here we move beyond considering the intrinsic dimensionality of datasets to study the dimension of deep representations of these datasets that networks use to classify them. Following the autoencoder example, we posit that our results may provide a foundation for future work to determine the most efficient sizes of networks that learn classification tasks [25,40,41]. For instance, if the maximum dimensionality achieved by a network is 50 in a middle layer, we conjecture that this will inform the layer size of a deep neural network that solves the task with high performance, either via standard training procedures or those that add pruning or compression steps.…”
Section: Discussion (mentioning)
confidence: 94%
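The quoted discussion hinges on measuring the "maximum dimensionality" a layer's representation reaches, but the estimator is not specified in the excerpt. The sketch below uses one common proxy, the participation ratio of the activation covariance spectrum, purely as an illustration of how such a layer-width guide could be computed; the cited work may use a different measure.

```python
import numpy as np

def participation_ratio(activations):
    """Effective dimensionality of a (samples x units) activation matrix:
    (sum of covariance eigenvalues)^2 / sum of squared eigenvalues.
    Equals the unit count for isotropic data and 1 for rank-1 data."""
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (centered.shape[0] - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negatives
    return eig.sum() ** 2 / (eig ** 2).sum()

# Example: if a middle layer's participation ratio plateaus around 50,
# that value could guide the width chosen for a compressed network.
```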
“…The core mechanism of X-SNS lies in the derivation of sub-networks customized for individual languages and the computation of similarity between a pair of sub-networks. As a crucial component in the construction of our targeted sub-network, we introduce the Fisher information (Fisher, 1922), which provides a means of quantifying the amount of information contained in parameters within a neural network (Tu et al., 2016; Achille et al., 2019). Concretely, we derive the (empirical) Fisher information of a language model's parameters as follows.…”
Section: Proposed Method: X-SNS (mentioning)
confidence: 99%
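The quotation is cut off before the equation it announces. For reference, the diagonal empirical Fisher information that such sub-network methods typically compute takes the standard form below; the exact notation used by X-SNS may differ.

```latex
% Diagonal (empirical) Fisher information of parameter \theta_i,
% estimated over a dataset D of input--label pairs (x, y):
F_i = \frac{1}{|D|} \sum_{(x, y) \in D}
      \left( \frac{\partial \log p_{\theta}(y \mid x)}{\partial \theta_i} \right)^{2}
```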
“…For the Laplace prior, we find that up to 30% of the weights in the fully-connected layers can be pruned without a significant drop in performance. However, following the work of [34], we also explore a pruning approach that uses the Fisher Information Matrix (FIM) of the weights. As also observed by [34], pruning the weights based on the Fisher information alone does not allow for a large number of parameters to be pruned effectively because many values in the FIM diagonal are close to zero; however, combining Fisher-based pruning and magnitude-based pruning allows for a larger number of weights to be pruned (up to 60% in this case).…”
Section: Uncertainty Calibration (mentioning)
confidence: 99%
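The passage describes combining Fisher-based and magnitude-based pruning without giving the combination rule. One plausible realization, sketched below under that assumption, ranks each weight by the saliency F_i · w_i² (a diagonal second-order estimate of the loss increase from zeroing w_i) and prunes the lowest-ranked fraction; the cited work's actual rule may differ.

```python
import torch

def combined_prune_mask(weight, fisher_diag, prune_frac=0.6):
    """Blend the two criteria by scoring each weight with the saliency
    fisher_diag * weight**2, then keep only weights whose saliency lies
    above the prune_frac quantile."""
    saliency = fisher_diag * weight.pow(2)
    threshold = torch.quantile(saliency.flatten(), prune_frac)
    return (saliency >= threshold).float()

# Usage sketch for a fully-connected layer (names are illustrative):
# mask = combined_prune_mask(layer.weight.data, fisher["fc.weight"])
# layer.weight.data.mul_(mask)
```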