2014
DOI: 10.48550/arxiv.1412.3555
Preprint

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Junyoung Chung,
Caglar Gulcehre,
KyungHyun Cho
et al.

Abstract: In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). Especially, we focus on more sophisticated units that implement a gating mechanism, such as a long short-term memory (LSTM) unit and a recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and speech signal modeling. Our experiments revealed that these advanced recurrent units are indeed better than more traditional recurrent units such as tanh units.
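The comparison described in the abstract can be reproduced in miniature by putting the three unit types behind a common interface. The sketch below is not the paper's code: the PyTorch modules, the 128-unit hidden size, the 88-dimensional inputs, and the MSE next-step objective are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's code): tanh RNN, LSTM, and
# GRU trained on the same generic next-step sequence-modeling objective.
import torch
import torch.nn as nn

class SequenceModel(nn.Module):
    def __init__(self, unit="gru", input_dim=88, hidden_dim=128):
        super().__init__()
        builders = {
            "tanh": lambda: nn.RNN(input_dim, hidden_dim, nonlinearity="tanh", batch_first=True),
            "lstm": lambda: nn.LSTM(input_dim, hidden_dim, batch_first=True),
            "gru": lambda: nn.GRU(input_dim, hidden_dim, batch_first=True),
        }
        self.rnn = builders[unit]()
        self.readout = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        h, _ = self.rnn(x)         # (batch, time, hidden)
        return self.readout(h)     # one prediction per time step

# Train each variant with the same loss and compare held-out performance.
for unit in ("tanh", "lstm", "gru"):
    model = SequenceModel(unit)
    x = torch.randn(4, 50, 88)                        # dummy sequences
    pred = model(x[:, :-1])                           # predict x_{t+1} from x_{<=t}
    loss = nn.functional.mse_loss(pred, x[:, 1:])
    print(unit, float(loss))
```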

Cited by 2,270 publications (2,788 citation statements)
References 9 publications
“…VQ is applied using a codebook of 512 vectors of dimensionality 128, with the commitment loss defined as in (14). The aggregator g(·) is implemented as a two-layer gated recurrent neural network (GRU) [35] with 128 hidden channels. Hence, in our experiments, K = E. The InfoNCE loss is computed using 10 negative samples and k = 12 steps.…”
Section: B. Parameters of Proposed Methods
confidence: 99%
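A hedged sketch of the configuration this citation statement describes: a 512-entry codebook of 128-dimensional codes with a commitment term, a two-layer GRU aggregator with 128 hidden channels, and an InfoNCE loss over 10 negatives and 12 prediction steps. The class names, the straight-through estimator, and the exact wiring are assumptions for illustration, not the cited paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """512-entry codebook of 128-dim vectors with a commitment term (assumed form)."""
    def __init__(self, num_codes=512, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                   # z: (B, T, dim)
        d = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        q = self.codebook(d.argmin(-1))                     # nearest codes
        commit = F.mse_loss(z, q.detach()) + self.beta * F.mse_loss(q, z.detach())
        return z + (q - z).detach(), commit                 # straight-through

class Aggregator(nn.Module):
    """Two-layer GRU with 128 hidden channels, plus one predictor per future step."""
    def __init__(self, dim=128, steps=12):
        super().__init__()
        self.gru = nn.GRU(dim, dim, num_layers=2, batch_first=True)
        self.heads = nn.ModuleList([nn.Linear(dim, dim) for _ in range(steps)])

    def forward(self, q):
        c, _ = self.gru(q)
        return c

def info_nce(c, z, heads, n_neg=10):
    """Score the true future frame z_{t+k} against n_neg randomly drawn negatives."""
    B, T, D = z.shape
    flat = z.reshape(B * T, D)
    loss = 0.0
    for k, head in enumerate(heads, start=1):
        pred = head(c[:, :T - k])                           # (B, T-k, D)
        pos = (pred * z[:, k:]).sum(-1, keepdim=True)       # positive scores
        idx = torch.randint(0, B * T, (B, T - k, n_neg))
        neg = (pred.unsqueeze(2) * flat[idx]).sum(-1)       # negative scores
        logits = torch.cat([pos, neg], dim=-1)
        target = torch.zeros(B, T - k, dtype=torch.long)
        loss = loss + F.cross_entropy(logits.reshape(-1, n_neg + 1), target.reshape(-1))
    return loss / len(heads)

# With z = (B, T, 128) frame features from some encoder (not shown):
# q, commit = VectorQuantizer()(z); agg = Aggregator(); c = agg(q)
# total = info_nce(c, q, agg.heads) + commit
```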
“…Previous methods always use an encoder-fusion-decoder paradigm, which first adopts two uni-modal encoders (e.g. ResNet [11] and GRU [4]) to extract image features E_I and language features E_L separately, and then designs a modality fusion module to fuse representations from different modalities to obtain the fused features F. In the end, F is fed into a decoder to generate the final segmentation prediction P. This paradigm can be formulated as three steps:…”
Section: Encoder-Decoder Pipeline
confidence: 99%
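The three-step paradigm this statement describes can be sketched as follows: a ResNet visual encoder and a GRU language encoder (step 1), a fusion module (step 2), and a decoder producing a mask (step 3). The channel sizes, concatenation-based fusion, and decoder head below are illustrative assumptions, not the cited paper's design.

```python
import torch
import torch.nn as nn
import torchvision

class EncoderFusionDecoder(nn.Module):
    def __init__(self, vocab_size=10000, text_dim=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.visual = nn.Sequential(*list(resnet.children())[:-2])   # image encoder
        self.embed = nn.Embedding(vocab_size, text_dim)
        self.language = nn.GRU(text_dim, text_dim, batch_first=True) # language encoder
        self.fuse = nn.Conv2d(2048 + text_dim, 256, kernel_size=1)   # step 2: fusion
        self.decoder = nn.Sequential(                                # step 3: prediction P
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 1, 1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False))

    def forward(self, image, tokens):
        e_i = self.visual(image)                         # step 1a: image features E_I
        _, h = self.language(self.embed(tokens))         # step 1b: language features E_L
        e_l = h[-1][:, :, None, None].expand(-1, -1, *e_i.shape[2:])
        f = self.fuse(torch.cat([e_i, e_l], dim=1))      # fused features F
        return self.decoder(f)                           # segmentation logits P

# mask_logits = EncoderFusionDecoder()(torch.randn(2, 3, 320, 320),
#                                      torch.randint(0, 10000, (2, 12)))
```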
“…LSTM [47] captures the contextual representations of words with a short memory and has additional "forget" gates, thereby overcoming both the vanishing and exploding gradient problems. GRU [48] comprises a reset gate and an update gate, and handles the information flow like LSTM without a separate memory unit. TextCNN [49] obtains feature representations through 1-dim convolution.…”
Section: B. Time Sequence Modeling
confidence: 99%
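To make the contrast concrete, here is a minimal GRU cell with its reset and update gates written out and no separate memory cell (unlike the LSTM). The class name and tensor shapes are illustrative, but the update follows the standard GRU formulation used in the paper.

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W_r = nn.Linear(input_dim + hidden_dim, hidden_dim)  # reset gate
        self.W_z = nn.Linear(input_dim + hidden_dim, hidden_dim)  # update gate
        self.W_h = nn.Linear(input_dim + hidden_dim, hidden_dim)  # candidate state

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)
        r = torch.sigmoid(self.W_r(xh))                    # how much past to reset
        z = torch.sigmoid(self.W_z(xh))                    # how much to update
        h_tilde = torch.tanh(self.W_h(torch.cat([x_t, r * h_prev], dim=-1)))
        return (1 - z) * h_prev + z * h_tilde              # interpolate old and new

# h = GRUCellSketch(32, 64)(torch.randn(8, 32), torch.zeros(8, 64))
```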