A sentence embedding vector can be obtained by attaching global average pooling (GAP) to a pre-trained language model. The problem with such a GAP-based sentence embedding is that every word in the sentence contributes with the same weight. We propose a novel sentence-embedding model, Token Attention-SentenceBERT (TA-SBERT), to address this problem. The rationale of TA-SBERT is to enhance sentence-embedding performance through three strategies. First, we convert words to their base forms while preprocessing the input sentence to reduce ambiguity. Second, we propose a novel Token Attention (TA) technique that emphasizes important words to produce more informative sentence vectors. Third, we increase the stability of fine-tuning and avoid catastrophic forgetting by adding a reconstruction loss on the word embedding vectors. Extensive ablation studies demonstrate that TA-SBERT outperforms the original SentenceBERT (SBERT) on sentence-vector evaluations using semantic textual similarity (STS) tasks and the SentEval toolkit.
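The contrast between GAP and attention-weighted pooling can be sketched as follows. This is a minimal illustration, not the paper's actual TA mechanism: the learned query vector `query` and the softmax scoring are assumptions standing in for whatever parameterization TA-SBERT uses.

```python
import numpy as np

def mean_pool(token_vecs):
    # GAP: every token contributes with the same weight 1/n.
    return token_vecs.mean(axis=0)

def attention_pool(token_vecs, query):
    # Illustrative token attention: score each token embedding against
    # a (hypothetical) learned query vector, softmax the scores, and
    # take the weighted average, so informative tokens dominate.
    scores = token_vecs @ query
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    return weights @ token_vecs
```

With non-uniform scores, the two pooled vectors differ, which is exactly the degree of freedom GAP lacks.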
A well-known limitation of existing rule-based text augmentation is that it cannot be transferred to other languages because it depends on language-specific grammatical and structural characteristics. Moreover, most text Generative Adversarial Networks (GANs) are unstable during training due to inefficient generator optimization and rely on maximum-likelihood pre-training. This paper addresses these problems by proposing a novel augmentation method, Iterative Translation-based Data Augmentation (ITDA), built from a Sentence Generator (SG) and a Sentence Discriminator (SD). This paper makes three original contributions. First, the ITDA SG provides universal multi-language support by generating diverse augmented sentences through serial and parallel iterations of an existing translator, such as Google Translate. Second, because the quality of the generated sentences varies with the translation combination and the sentence type, the ITDA uses a discriminator, built on a text classifier, to select high-quality augmented data. Third, the ITDA can perform sentence augmentation for 109 languages using discriminators based on text classifiers trained for a specific language or data set. Extensive experiments evaluate the efficacy of the ITDA using a Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), CNN-BiLSTM, and self-attention. The results demonstrate that applying the ITDA to 480 sentence classification tasks increases average accuracy by 4.24%.
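The generate-then-filter loop described above can be sketched as follows. This is a hedged outline, not the paper's implementation: `translate(text, src, dst)` is a hypothetical stand-in for an external translator such as Google Translate, and `discriminator` stands in for the trained text classifier used by the SD.

```python
def iterative_translate(sentence, translate, chains, src_lang="en"):
    """SG sketch: run the sentence through several pivot-language chains
    (parallel iterations), each chain stepping through one or more pivot
    languages (serial iterations) before translating back to the source."""
    augmented = []
    for chain in chains:
        text, src = sentence, src_lang
        for pivot in chain:
            text = translate(text, src, pivot)  # serial translation step
            src = pivot
        augmented.append(translate(text, src, src_lang))  # back to source
    return augmented

def filter_augmented(candidates, discriminator, threshold=0.5):
    """SD sketch: keep only candidates the classifier scores at or above
    the threshold, discarding low-quality translations."""
    return [s for s in candidates if discriminator(s) >= threshold]
```

Different chains (e.g. `en→fr→en` vs `en→de→ja→en`) yield different paraphrases of varying quality, which is why the discriminator pass matters.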