2019
DOI: 10.48550/arxiv.1904.04733
Preprint
Seq2Biseq: Bidirectional Output-wise Recurrent Neural Networks for Sequence Modelling

Abstract: Over the last couple of years, Recurrent Neural Networks (RNN) have reached state-of-the-art performance on most sequence modelling problems. In particular, the sequence-to-sequence model and the neural CRF have proved to be very effective in this domain. In this article, we propose a new RNN architecture for sequence labelling, leveraging gated recurrent layers to take arbitrarily long contexts into account, and using two decoders operating forward and backward. We compare several variants of the pr…
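The abstract only summarises the architecture, so here is a minimal PyTorch sketch of the general idea: a bidirectional encoder feeding two output-wise decoders, one running left-to-right and one right-to-left, whose states are combined for labelling. Hidden sizes, the absence of label-embedding feedback into the decoders, and the concatenation used to merge the two decoder states are illustrative assumptions, not the paper's exact Seq2Biseq model.

import torch
import torch.nn as nn


class BiOutputTagger(nn.Module):
    """Sketch of a sequence labeller with two output-wise decoders."""

    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional GRU encoder over the input tokens.
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Two unidirectional GRU decoders over the encoder states:
        # one processes the sequence forward, the other backward.
        self.dec_fwd = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        self.dec_bwd = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        # Label scores are predicted from the concatenated decoder states.
        self.out = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        enc, _ = self.encoder(self.embed(tokens))   # (batch, seq_len, 2*hidden)
        fwd, _ = self.dec_fwd(enc)                  # left-to-right pass
        bwd_rev, _ = self.dec_bwd(enc.flip(1))      # right-to-left pass
        bwd = bwd_rev.flip(1)                       # realign with token positions
        return self.out(torch.cat([fwd, bwd], dim=-1))  # (batch, seq_len, labels)


if __name__ == "__main__":
    model = BiOutputTagger(vocab_size=1000, num_labels=17)
    scores = model(torch.randint(0, 1000, (2, 12)))
    print(scores.shape)  # torch.Size([2, 12, 17])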

Cited by 3 publications (3 citation statements)
References 38 publications
“…To evaluate our method on large language models, we conducted experiments on all linear layers of OPT-6.7B with a full sequence length of 2048, including the last lm_head layer. As shown in Table 3, our method achieved an acceleration rate of approximately 4.64× and produced the most nearly lossless results on wikitext2 (Merity et al 2016), ptb (Dinarelli and Grobol 2019), and c4 (Raffel et al 2019). Our acceleration rate is better than current state-of-the-art methods such as GPTQ (Frantar et al 2022) and sparseGPT (Frantar and Alistarh 2023), while our method does not have an advantage in parameter storage.…”
Section: Experiments on OPT
confidence: 91%
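For context, "evaluating on wikitext2 with a full sequence length of 2048" usually means perplexity over non-overlapping 2048-token windows of the test split. The sketch below (Hugging Face transformers/datasets) shows that protocol; it is an assumption about the cited evaluation setup, not their script, and the quantization step being evaluated is not shown.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"   # swap in a smaller OPT checkpoint for a quick test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto").eval()

# Concatenate the raw test split and cut it into non-overlapping 2048-token windows.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
seq_len = 2048
nlls = []
with torch.no_grad():
    for start in range(0, ids.size(1) - seq_len + 1, seq_len):
        chunk = ids[:, start:start + seq_len].to(model.device)
        # With labels == input_ids the model returns the mean token NLL for the chunk.
        nlls.append(model(chunk, labels=chunk).loss.float())
perplexity = torch.exp(torch.stack(nlls).mean())
print(f"wikitext2 perplexity @ {seq_len}: {perplexity.item():.2f}")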
“…PTB. The Penn Treebank (Marcus et al, 1993), in particular the sections of the corpus corresponding to Wall Street Journal (WSJ) articles, is a standard dataset for language modeling (Mikolov et al, 2012) and sequence labeling (Dinarelli and Grobol, 2019). Following the setting in Shen et al (2021), we use the preprocessing method proposed in Mikolov et al (2012).…”
Section: D.1 Masked Language Modeling
confidence: 99%
“…These gating mechanisms allow such RNN variants to be trained to keep necessary and relevant information from previous states for longer periods, or to discard less important information from previous states [5,6]. Recent works have shown the importance of RNNs with gating mechanisms in achieving improved results for classification and generation tasks with sequence modelling [2,10,35]. Recurrent neural networks' inherent capability of adequately modelling sequential data, supplemented by the advantages of gated features in GRUs, enables them to effectively model tasks that use short-term or long-term video sequences.…”
Section: Standard GRU
confidence: 99%
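As a point of reference for the gating behaviour described in that statement, the standard GRU cell (the common Cho et al. formulation; general background, not a construction specific to the cited works) computes:

\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)}\\
\tilde h_t &= \tanh\!\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) && \text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t && \text{(keep vs. overwrite)}
\end{aligned}

When z_t is near 0 the cell carries h_{t-1} forward essentially unchanged, which is what lets these variants retain relevant information over long spans; when r_t is near 0 the candidate state ignores the previous state, which is how less important information gets discarded.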