Hierarchical Clause Annotation: Building a Clause-Level Corpus for Semantic Parsing with Complex Sentences
Yunlong Fan, Bin Li, Yikemaiti Sataer, et al.
Abstract: Most natural-language-processing (NLP) tasks, such as semantic parsing, syntactic parsing, machine translation, and text summarization, suffer performance degradation when encountering long complex sentences. Previous works addressed the issue with the intuition of decomposing complex sentences and linking simple ones, e.g., rhetorical-structure-theory (RST)-style discourse parsing, split-and-rephrase (SPRP), text simplification (TS), and simple sentence decomposition (SSD). However, these works are not appl…
“…In addition to RST, Fan et al. [24] also propose a novel clausal feature, HCA, which represents a complex sentence as a tree consisting of clause nodes and inter-clause relation edges. The HCA framework is based on English grammar [21], where clauses are elementary grammar units that center around a verb, and inter-clause relations can be classified into two categories:…”
Section: Hierarchical Clause Annotation
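To make the HCA structure concrete, below is a minimal Python sketch of such a tree: clause nodes linked by labeled inter-clause relation edges. The node fields, the `add_child` helper, and the relation labels ("Condition", "Reason") are illustrative assumptions, not the released HCA corpus schema or tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ClauseNode:
    text: str                       # the clause's token span as plain text
    relation: Optional[str] = None  # inter-clause relation to the parent (hypothetical labels)
    children: List["ClauseNode"] = field(default_factory=list)

    def add_child(self, child: "ClauseNode", relation: str) -> "ClauseNode":
        child.relation = relation
        self.children.append(child)
        return child

# Example: "If it rains, we stay home because the roads flood."
root = ClauseNode("we stay home")
root.add_child(ClauseNode("If it rains"), relation="Condition")
root.add_child(ClauseNode("because the roads flood"), relation="Reason")

for c in root.children:
    print(f"({c.relation}) {c.text}")
```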
“…where the parameter matrix $W^O \in \mathbb{R}^{hN \times d}$, and $h$ is the number of attention heads, mapped to the 16 inter-clause relations in HCA [24].…”
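As a hedged illustration of the quoted formula, the sketch below concatenates $h = 16$ per-head outputs of dimension $N$ and projects them back to the model dimension $d$ with a single parameter matrix playing the role of $W^O \in \mathbb{R}^{hN \times d}$. All shapes and variable names are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn

h, N, d = 16, 64, 1024          # heads (one per inter-clause relation), per-head dim, model dim (illustrative)
seq_len, batch = 30, 2

head_outputs = torch.randn(batch, seq_len, h, N)   # stand-in for per-head attention results
W_O = nn.Linear(h * N, d, bias=False)              # plays the role of W^O (nn.Linear stores the transpose)

concat = head_outputs.reshape(batch, seq_len, h * N)  # concatenate the h heads
out = W_O(concat)                                     # (batch, seq_len, d)
print(out.shape)
```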
“…For the HCA tree of each sentence, we use the manually annotated HCA trees for AMR 2.0 provided by [24], and auto-annotated HCA trees for the remaining datasets, all generated by the HCA Segmenter and HCA Parser proposed in [24]. The training, development, and test sets in both datasets are randomly split, and we therefore take them as ID datasets, as in previous works [15,17,18,20,45].…”
Section: Datasets
“…Rhetorical structure theory (RST) [23] provides a general way to describe the coherence relations among clauses and some phrases, i.e., elementary discourse units, and postulates a hierarchical discourse structure called a discourse tree. Besides RST, a novel clausal feature, hierarchical clause annotation (HCA) [24], also captures a tree structure of a complex sentence, where the leaves are segmented clauses and the edges are the inter-clause relations.…”
“…Due to the better parsing performance of the clausal structure [24], we select and integrate the HCA trees of complex sentences to mitigate LDDs in AMR parsing. Specifically, we propose two HCA-based approaches, HCA-based self-attention (HCA-SA) and HCA-based curriculum learning (HCA-CL), to integrate the HCA trees as clausal features into the popular AMR parsing codebase, SPRING [15].…”
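A minimal sketch of the HCA-SA idea, under the assumption that it can be read as a clause-aware attention mask: tokens attend within their own clause, and clauses connected by an inter-clause relation edge in the HCA tree remain mutually visible. The function name and mask convention below are hypothetical; the actual integration into SPRING may differ.

```python
import torch

def hca_attention_mask(clause_ids, related_pairs):
    """clause_ids: (seq_len,) clause index per token.
    related_pairs: set of (i, j) clause-index pairs linked by an
    inter-clause relation edge (treated as symmetric)."""
    same = clause_ids.unsqueeze(0) == clause_ids.unsqueeze(1)  # same-clause visibility
    mask = same.clone()
    for i, j in related_pairs:                                 # add cross-clause visibility
        a = clause_ids == i
        b = clause_ids == j
        mask |= a.unsqueeze(1) & b.unsqueeze(0)
        mask |= b.unsqueeze(1) & a.unsqueeze(0)
    return mask                                                # True = attention allowed

# "If it rains , we stay home" -> clause 0: tokens 0-3, clause 1: tokens 4-6
clause_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1])
print(hca_attention_mask(clause_ids, {(0, 1)}).int())
```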
Most natural language processing (NLP) tasks operationalize an input sentence as a sequence with token-level embeddings and features, despite its clausal structure. Taking abstract meaning representation (AMR) parsing as an example, recent parsers are empowered by transformers and pre-trained language models, but long-distance dependencies (LDDs) introduced by long sequences remain an open problem. We argue that LDDs stem not from sequence length itself but essentially from the internal clause hierarchy. Typically, non-verb words in a clause cannot depend on words outside of it, and verbs from different but related clauses have much longer dependencies than those in the same clause. With this intuition, we introduce a type of clausal feature, hierarchical clause annotation (HCA), into AMR parsing and propose two HCA-based approaches, HCA-based self-attention (HCA-SA) and HCA-based curriculum learning (HCA-CL), to integrate HCA trees of complex sentences for addressing LDDs. We conduct extensive experiments on two in-distribution (ID) AMR datasets (AMR 2.0 and AMR 3.0) and three out-of-distribution (OOD) ones (TLP, New3, and Bio). Experimental results show that our HCA-based approaches achieve significant and explainable improvements over the baseline model (0.7 Smatch score on both ID datasets; 2.3, 0.7, and 2.6 on the three OOD datasets, respectively) and outperform the state-of-the-art (SOTA) model by 0.7 Smatch score on the OOD dataset Bio when encountering sentences with complex clausal structures, which introduce most LDD cases.
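The HCA-CL component mentioned above can be pictured as an easy-to-hard training schedule. The sketch below orders training examples by the number of clauses in their HCA trees and gradually admits harder ones; the staging and shuffling policy are assumptions, not necessarily the paper's exact curriculum.

```python
import random

def curriculum_batches(examples, clause_counts, batch_size=32, stages=3):
    """Yield batches from easy (few clauses) to hard (many clauses)."""
    order = sorted(range(len(examples)), key=lambda i: clause_counts[i])
    stage_size = (len(order) + stages - 1) // stages
    seen = []
    for s in range(stages):
        seen.extend(order[s * stage_size:(s + 1) * stage_size])  # admit the next difficulty band
        random.shuffle(seen)                                     # re-sample all data seen so far
        for k in range(0, len(seen), batch_size):
            yield [examples[i] for i in seen[k:k + batch_size]]
```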