Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018
DOI: 10.18653/v1/d18-1503

The Importance of Being Recurrent for Modeling Hierarchical Structure

Abstract: Recent work has shown that recurrent neural networks (RNNs) can implicitly capture and exploit hierarchical information when trained to solve common natural language processing tasks (Blevins et al., 2018) such as language modeling (Linzen et al., 2016; Gulordava et al., 2018) and neural machine translation (Shi et al., 2016). In contrast, the ability to model structured data with non-recurrent neural networks has received little attention despite their success in many NLP tasks (Gehring et al., 2017; Vaswani et…
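
As a rough illustration of the comparison the paper sets up, the sketch below contrasts a recurrent (LSTM) encoder with a non-recurrent self-attention (Transformer) encoder on a toy sequence-classification task of the kind used to probe hierarchical structure (e.g., subject-verb agreement). All names, dimensions, and the random toy batch are assumptions for illustration only; this is not the authors' code, and positional encoding is omitted from the Transformer for brevity.

```python
# Illustrative sketch only: recurrent vs. non-recurrent encoders on a toy
# sequence-classification setup. Sizes and the random batch are placeholders.
import torch
import torch.nn as nn

VOCAB, DIM, CLASSES = 1000, 64, 2  # hypothetical sizes

class LSTMEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.LSTM(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, CLASSES)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))   # (batch, seq, DIM)
        return self.out(h[:, -1])      # classify from the final hidden state

class SelfAttentionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(DIM, CLASSES)

    def forward(self, x):
        # Positional encoding omitted for brevity in this sketch.
        h = self.enc(self.emb(x))      # (batch, seq, DIM)
        return self.out(h.mean(dim=1)) # classify from mean-pooled states

tokens = torch.randint(0, VOCAB, (8, 20))  # toy batch of token ids
print(LSTMEncoder()(tokens).shape, SelfAttentionEncoder()(tokens).shape)
```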

Cited by 123 publications (104 citation statements)
References 13 publications

Citation statements:

“…For ELMo there is still a discernible difference for dependencies longer than 5, but for BERT the two curves are almost indistinguishable throughout the whole range. This could be related to the aforementioned intuition that a Transformer captures long dependencies more effectively than a BiLSTM (see Tran et al (2018) for contrary observations, albeit for different tasks). The overall trends for both baseline and enhanced models are quite consistent across languages, although with large variations in accuracy levels.…”
Section: Dependency Length
confidence: 99%
“…Jawahar et al (2019) extended this work to using multiple layers and tasks, supporting the claim that BERT's intermediate layers capture rich linguistic information. On the other hand, Tran et al (2018) concluded that LSTMs generalize to longer sequences better, and are more robust with respect to agreement distractors, compared to Transformers. Liu et al (2019) investigated the transferability of contextualized word representations to a number of probing tasks requiring linguistic knowledge.…”
Section: Related Work
confidence: 99%
“…The primary reason for adopting recurrent architecture for sentence-encoder is because recurrent neural networks have been shown to be essential for capturing the underlying hierarchical structure of sequential data [14]. By adopting this approach sentence-encoder is able to encode how sentences are structured in a document.…”
Section: Lexical Embedding
confidence: 99%
“…For sentence-level encoder, we employ an attention-based recurrent neural network to capture the structural patterns of sentences in the document. The primary reason for adopting recurrent architecture for sentence-encoder is because recurrent neural networks have been shown to be essential for capturing the underlying hierarchical structure of sequential data [14]. Hence, sentence-encoder in the proposed model is expected to capture the structural information of documents.…”
Section: Introduction
confidence: 99%
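
To make the cited design concrete, here is a minimal sketch of an attention-based recurrent sentence encoder of the kind this citing work describes: a bidirectional LSTM whose hidden states are attention-pooled into a single sentence vector. The class name, dimensions, and pooling scheme below are hypothetical and are not taken from the cited paper.

```python
# Minimal sketch (assumed design, not the cited paper's code): a BiLSTM sentence
# encoder with additive-style attention pooling over its hidden states.
import torch
import torch.nn as nn

class AttentiveSentenceEncoder(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * dim, 1)  # scores each hidden state for pooling

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))              # (batch, seq, 2*dim)
        weights = torch.softmax(self.score(h), dim=1)  # attention over positions
        return (weights * h).sum(dim=1)                # weighted sum -> sentence vector

sentences = torch.randint(0, 1000, (4, 12))            # toy batch of token ids
print(AttentiveSentenceEncoder()(sentences).shape)     # torch.Size([4, 128])
```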