BTS: Back TranScription for Speech-to-Text Post-Processor using Text-to-Speech-to-Text

Park, Chanjun; Seo, Jaehyung; Lee, Seolhwa; Lee, Chanhee; Moon, Hyeonseok; Eo, Sugyeong; Lim, Heuiseok

doi:10.18653/v1/2021.wat-1.10

Cited by 21 publications

(18 citation statements)

References 31 publications


“…The reason for choosing the commercialization system for comparison is that it is a certified system used by several researchers, and the latest deep learning-based grammatical correction methodology is applied; hence, it is the most objective and reliable system for accurate analysis. The performance of each corrector is measured by using the error sentences of K-NCT as input for three commercialization systems and performing quantitative analysis using the BLEU score [20] and GLEU score [21], which are used in various deep learning-based grammatical correction studies as evaluation indicators [1,22,23]. The experimental results are shown in Table 4.…”

Section: Experiments and Results (mentioning)

confidence: 99%

K-NCT: Korean Neural Grammatical Error Correction Gold-Standard Test Set Using Novel Error Type Classification Criteria

et al. 2022

Self Cite


Recently, active research has been conducted on Korean grammatical error correction on machine translation (MT) and automatic noise generation. However, there is no gold-standard test set for objective and official comparative analysis. A significant limitation is that measured performance is ill-defined because the error types in the training set are also included in the test set. Additionally, the types of errors for qualitative analysis are defined differently with no explicit guidelines. This study proposes a gold-standard test set called the Korean Neural Grammatical Correction Test set (K-NCT) for Korean grammatical error correction using a new error type classification guideline. To ensure the factuality and reliability of the proposal, we conduct a quantitative analysis using a commercialization system and human evaluation. Experimental results demonstrate that the proposed grammatical error correction test set has a well-balanced, diverse, and precise guideline. Our dataset is available at https://github.com/seonminkoo/K-NCT



“…It is because there are numerous negative effects in terms of data imbalance. This suggests in which direction we should build data and informs us that the performances can be improved through data cleaning such as PCF (Koehn et al, 2020a;Park et al, 2021c).…”

Section: Results (mentioning)

confidence: 99%

Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC

Park, Shim, Eo et al. 2021

Preprint

Self Cite


A machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora concerning Korean are relatively scarce compared to those associated with high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features as a dictionary base. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings suggest directions for further research toward obtaining higher-quality parallel corpora, based on our correlation analysis of LIWC and NMT performance.


“…Currently, data augmentation is performed only on TOEIC Part 5 to improve performance, so it cannot be applied to other parts of the TOEIC yet. Therefore, in the future, it will be developed to cover all parts of TOEIC, and experiments will be conducted to examine whether the proposed method can improve performance even in other domain tasks [29,30].…”

Section: Discussion (mentioning)

confidence: 99%

BERTOEIC: Solving TOEIC Problems Using Simple and Efficient Data Augmentation Techniques with Pretrained Transformer Encoders

et al. 2022

Self Cite


Recent studies have attempted to understand natural language and infer answers. Machine reading comprehension is one representative task, and several related datasets have been released. However, there are few official open datasets for the Test of English for International Communication (TOEIC), which is widely used for evaluating people’s English proficiency, and research for further advancement is not being actively conducted. We consider that deep learning research for TOEIC is difficult because of the data scarcity problem, so we propose two data augmentation methods to improve the model in a low-resource environment. Considering the attributes of the semantic and grammar problem types in TOEIC, the proposed methods can augment the data to resemble real TOEIC problems by using POS tagging and lemmatization. In addition, we confirmed the importance of understanding semantics and grammar in TOEIC through experiments on each proposed methodology and experiments according to the amount of data. The proposed methods address the data shortage problem of TOEIC and enable an acceptable human-level performance.

