Korean Historical Documents Analysis with Improved Dynamic Word Embedding

Jin, Kyohoon; Wi, JeongA; Kang, Kyeongpil; Kim, Youngbin

doi:10.3390/app10217939

Cited by 3 publications

(4 citation statements)

References 36 publications

“…We note that the text pre-processing of our works is only based on preliminary syllable analysis. Recent techniques such as tokenization, word-embedding, multitask learning, and Bidirectional Encoder Representations from Transformers (BERT) [31][32][33][34] have not been prepared for old Korean characters yet. If the text tokenization or embedding is available for old Korean, one could remove input nouns such as the name of character and places, etc., and perform an intensive study based on corpus.…”

Section: Discussion (mentioning)

confidence: 99%

Created era estimation of old Korean documents via deep neural network

Yoo

Kim

2022

Herit Sci


In general, the era in which a literary work was created is significant information for understanding the background and the literary interpretation of the work. However, for literary works of old Korea, especially works written in Hangul, the era of creation is known for only a few. In this paper, the created era of old Korean documents was estimated using artificial intelligence. Hangul, a Korean writing system in which one syllable is one character, has more than 10,000 character combinations, so changes in the structure or grammar of Hangul can be predicted by analyzing character frequencies. Accordingly, a deep neural network model was constructed based on the term frequency of each Hangul character. The model was trained on 496 documents with known publication years. Over the entire prediction range from 1447 to 1934, the mean absolute error was 13.77 years for test sets and 15.8 years for validation sets, an error ratio of less than 3.25% of the total year range. In addition, for works whose creation time had only been inferred approximately, the predicted results fell within the inferred range, and the predicted creation years for different divisions of the same novel were similar. These results show that the deep neural network model based on character term frequency predicted the creation era of old Korean documents properly. This study is expected to support the literary history of Korea from the 15th to the 19th century by predicting the period of creation or enjoyment of a work. In addition, the method and algorithm using syllable term frequency are believed to have the potential to be applied to documents in other languages.
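As a rough illustration of the two quantities this abstract relies on, the Python sketch below computes a per-character term frequency and the error ratio of a mean absolute error against the year range. It is not the authors' code. The function names are hypothetical, and restricting the count to the precomposed Hangul Syllables block (U+AC00 to U+D7A3) is a simplifying assumption: archaic syllables in old documents may be encoded as conjoining jamo sequences, which this sketch ignores.

```python
from collections import Counter

# Assumption: count only precomposed Hangul syllables (Unicode block U+AC00..U+D7A3).
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def char_term_frequency(text: str) -> dict[str, float]:
    """Return the relative frequency of each Hangul syllable in `text`."""
    syllables = [ch for ch in text if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST]
    if not syllables:
        return {}
    counts = Counter(syllables)
    total = len(syllables)
    return {ch: n / total for ch, n in counts.items()}


def error_ratio(mae_years: float, first_year: int, last_year: int) -> float:
    """Express a mean absolute error as a fraction of the prediction range."""
    return mae_years / (last_year - first_year)


# Toy input: six syllables, two of which repeat.
tf = char_term_frequency("나라 말이 나라")

# Figures taken from the abstract: MAE in years over the range 1447 to 1934.
test_ratio = error_ratio(13.77, 1447, 1934)
validation_ratio = error_ratio(15.8, 1447, 1934)
```

With the reported figures, both ratios stay below the 3.25% bound stated in the abstract, since the range spans 487 years.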


“…Text categorization algorithms have been successfully applied to Korean/French/Arabic/Tigrinya/Chinese languages for document/tweets classification (Kozlowski et al 2020), (Jin et al 2020). CNN with the CBOW model achieves an accuracy of 93.41% for classifying text in the Tigrinya language (Fesseha et al 2021).…”

Section: Review On Text Analytics Word Embedding Application and Deep... (mentioning)

confidence: 99%

“… Pan et al (2019a) Improve text classification by transforming knowledge from one domain to another Netease and Cnews are two public Chinese text classification datasets, English text datasets Yahoo dataset SVM, LSTM TF-IDF, BOW, Word2Vec LSTM + Word2Vec achieves an accuracy of 90.07%
23. Jin et al (2020) Korean historical documents analysis Korean historical documents Dynamic word embedding approach BERT NER task achieves an F1-score of 68%
24. Fesseha et al (2021) Low-Resource Languages: Tigrinya Tigrinya news datasets CNN fastText Word2Vec (CBOW, Skip-Gram) CNN + CBOW achieves an accuracy of 93.41%
25.…”

Section: Appendix A (mentioning)

confidence: 99%

Impact of word embedding models on text analytics in deep learning environment: a review

2023


The selection of word embedding and deep learning models for better outcomes is vital. Word embeddings are an n-dimensional distributed representation of a text that attempts to capture the meanings of the words. Deep learning models utilize multiple computing layers to learn hierarchical representations of data. The word embedding technique represented by deep learning has received much attention. It is used in various natural language processing (NLP) applications, such as text classification, sentiment analysis, named entity recognition, topic modeling, etc. This paper reviews the representative methods of the most prominent word embedding and deep learning models. It presents an overview of recent research trends in NLP and a detailed understanding of how to use these models to achieve efficient results on text analytics tasks. The review summarizes, contrasts, and compares numerous word embedding and deep learning models and includes a list of prominent datasets, tools, APIs, and popular publications. A reference for selecting a suitable word embedding and deep learning approach is presented based on a comparative analysis of different techniques to perform text analytics tasks. This paper can serve as a quick reference for learning the basics, benefits, and challenges of various word representation approaches and deep learning models, with their application to text analytics and a future outlook on research. It can be concluded from the findings of this study that domain-specific word embedding and the long short term memory model can be employed to improve overall text analytics task performance.


“…However, there has been no research attempting to propose language models in Hanja, which is a dead language in Korea but absolutely necessary to explore Korean history. Most of the studies with Hanja only shed light on translating historical Hanja documents and use AJD as their corpus (Park et al, 2020; Jin et al, 2020; Kang et al, 2021).…”

Section: Related Work (mentioning)

confidence: 99%

HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea

Yoo, Jiho, Son et al. 2022

Findings of the Association for Computational Linguistics: NAACL 2022


Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters and not understood by modern Korean or Chinese speakers. Historians with expertise in this time period have been analyzing the documents, but that process is very difficult and time-consuming, and language models would significantly speed it up. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models with continued pretraining on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show there are significant improvements gained by training on the two corpora. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has not been studied much by historians, and not at all by the NLP community.
