LDA-Based Word Image Representation for Keyword Spotting on Historical Mongolian Documents

Wei, Hongxi; Gao, Guanglai; Su, Xiangdong

doi:10.1007/978-3-319-46681-1_52

Cited by 13 publications

(3 citation statements)

References 15 publications


“…Unsupervised algorithms mainly sort the keyword weights through some specified indicators and select keywords on the basis of the sorting results. Among them are representative TF-IDF based on statistical features [1,2], TextRank based on word graph model [3,4], and Latent Dirichlet Allocation (LDA) based on topic model [6]. To optimize the effect of algorithm extraction, Luo et al [7] derived the calculation formula for the number of words of the same frequency in the text through Zipf's law and then determined the proportion of each frequency word in the text by using the calculation formula for the number of words of the same frequency.…”

Section: Related Work (mentioning)

confidence: 99%
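The statement above describes unsupervised extraction as scoring words by some indicator and keeping the top of the sorted list. A minimal sketch of that idea with TF-IDF as the indicator follows. The function name `tfidf_keywords`, the smoothed IDF variant, and the toy corpus are assumptions made for illustration; none of them come from the cited papers.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Return the top_k words of doc_tokens ranked by a TF-IDF score.

    corpus is a list of token lists and should include doc_tokens itself.
    The smoothed IDF used here is one common variant, chosen for this sketch.
    """
    n_docs = len(corpus)
    term_counts = Counter(doc_tokens)
    scores = {}
    for word, count in term_counts.items():
        # Document frequency: number of documents containing the word.
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = (count / len(doc_tokens)) * idf
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


corpus = [
    ["the", "mongolian", "script"],
    ["the", "keyword", "spotting", "keyword"],
    ["the", "topic", "model"],
]
top_words = tfidf_keywords(corpus[1], corpus, top_k=2)
```

In this toy corpus, "the" appears in every document, so its score falls below that of the words specific to the second document.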

“…The importance of words with no representativeness is reduced in high-frequency words to the text, and then the accuracy of keyword extraction is improved. TextRank [3,4], which is based on network graph, is a classic unsupervised keyword extraction method. This method decomposes the content of a single document into a network graph model by word segmentation and extracts keywords by considering the structural features and word frequency features of the document.…”

Section: Introduction (mentioning)

confidence: 99%


News keyword extraction algorithm based on semantic clustering and word graph model

Xiong, Liu, Tian, et al. 2021

Tsinghua Sci. Technol.


The internet is an abundant source of news every day. Thus, efficient algorithms to extract keywords from the text are important to obtain information quickly. However, the precision and recall of mature keyword extraction algorithms need improvement. TextRank, which is derived from the PageRank algorithm, uses word graphs to spread the weight of words. The keyword weight propagation in TextRank focuses only on word frequency. To improve the performance of the algorithm, we propose Semantic Clustering TextRank (SCTR), a semantic clustering news keyword extraction algorithm based on TextRank. Firstly, the word vectors generated by the Bidirectional Encoder Representation from Transformers (BERT) model are used to perform k-means clustering to represent semantic clustering. Then, the clustering results are used to construct a TextRank weight transfer probability matrix. Finally, iterative calculation of word graphs and extraction of keywords are performed. The test target of this experiment is a Chinese news library. The results of the experiment conducted on this text set show that the SCTR algorithm has greater precision, recall, and F1 value than the traditional TextRank and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms.
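The abstract outlines three steps: cluster word vectors, use the clusters to shape the TextRank transition weights, then iterate over the word graph. The sketch below illustrates only the last two steps and makes several assumptions: cluster ids are supplied by hand rather than produced by k-means over BERT vectors, and the `same_cluster_boost` factor is a hypothetical weighting, since the abstract does not give the paper's actual transition formula. The function name `cluster_textrank` is also illustrative.

```python
from collections import defaultdict


def cluster_textrank(tokens, clusters, window=1, same_cluster_boost=2.0,
                     damping=0.85, iterations=50):
    """Score words with a TextRank-style power iteration.

    clusters maps a word to a cluster id (assumed to be given).
    Edges between words in the same cluster are multiplied by
    same_cluster_boost, a hypothetical stand-in for the paper's weighting.
    """
    # Build an undirected, weighted co-occurrence graph over a sliding window.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other == word:
                continue
            same = word in clusters and clusters.get(word) == clusters.get(other)
            w = same_cluster_boost if same else 1.0
            weights[word][other] += w
            weights[other][word] += w

    words = sorted(weights)
    if not words:
        return {}
    scores = {w: 1.0 / len(words) for w in words}
    out_sum = {w: sum(weights[w].values()) for w in words}

    for _ in range(iterations):
        updated = {}
        for w in words:
            # Each neighbour passes on its score in proportion to edge weight.
            incoming = sum(scores[v] * weights[v][w] / out_sum[v]
                           for v in weights[w])
            updated[w] = (1 - damping) / len(words) + damping * incoming
        scores = updated
    return scores


tokens = ["a", "hub", "b", "hub", "c", "hub", "d"]
plain = cluster_textrank(tokens, clusters={})
boosted = cluster_textrank(tokens, clusters={"hub": 0, "a": 0})
```

With no cluster information the function reduces to ordinary weighted TextRank; placing "a" in the same cluster as "hub" shifts more of the hub's weight toward "a".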


“…Topic modeling originates from early latent semantic analysis (LSA), which aims to discover meaningful semantic structures in the corpus [18], with a focus on keyword extraction. The representative approaches are through the use of TF-IDF, which is based on statistical features [19,20], TextRank, based on word graph models [21,22], and Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), based on topic models [23]. PLSA and LDA are the most widely used probabilistic techniques in topic modeling [24].…”

Section: Introduction (mentioning)

confidence: 99%

Bert-Based Latent Semantic Analysis (Bert-LSA): A Case Study on Geospatial Data Technology and Application Trend Analysis

et al. 2021


Geospatial data is an indispensable data resource for research and applications in many fields. The technologies and applications related to geospatial data are constantly advancing and updating, so identifying the technologies and applications among them will help foster and fund further innovation. Through topic analysis, new research hotspots can be discovered by understanding the whole development process of a topic. At present, the main methods to determine topics are peer review and bibliometrics; however, they only review relevant literature or perform simple frequency analysis. This paper proposes a new topic discovery method, which combines a word embedding method, based on a pre-trained model, Bert, and a spherical k-means clustering algorithm, and applies the similarity between literature and topics to assign literature to different topics. The proposed method was applied to 266 pieces of literature related to geospatial data over the past five years. First, according to the number of publications, the trend analysis of technologies and applications related to geospatial data in several leading countries was conducted. Then, the consistency of the proposed method and the existing method PLSA (Probabilistic Latent Semantic Analysis) was evaluated by using two similar consistency evaluation indicators (i.e., U-Mass and NPMI). The results show that the method proposed in this paper can well reveal text content, determine development trends, and produce more coherent topics, and that the overall performance of Bert-LSA is better than PLSA using NPMI and U-Mass. This method is not limited to trend analysis using the data in this paper; it can also be used for the topic analysis of other types of texts.
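The abstract names NPMI as one of its topic coherence indicators without defining it. As a rough illustration, the sketch below computes the standard normalized pointwise mutual information for one word pair from document-level co-occurrence. The function name `npmi`, the document-level counting, and the boundary conventions (-1 for words that never co-occur, 1 when both appear in every document) are assumptions of this sketch and not necessarily the evaluation setup used in the paper.

```python
import math


def npmi(word_a, word_b, documents):
    """Normalized PMI of two words, estimated from document co-occurrence.

    documents is a list of token lists. Returns a value in [-1, 1].
    """
    doc_sets = [set(doc) for doc in documents]
    n = len(doc_sets)
    p_a = sum(word_a in d for d in doc_sets) / n
    p_b = sum(word_b in d for d in doc_sets) / n
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n
    if p_ab == 0:
        return -1.0  # convention chosen here: the words never co-occur
    if p_ab == 1:
        return 1.0   # convention chosen here: avoids dividing by -log(1) = 0
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


docs = [
    ["gis", "map"],
    ["gis", "map"],
    ["remote", "sensing"],
    ["remote", "image"],
]
always_together = npmi("gis", "map", docs)
never_together = npmi("gis", "remote", docs)
```

A topic-level coherence score is usually obtained by averaging such pair scores over the top words of a topic; that aggregation is left out here.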


A Hybrid Representation of Word Images for Keyword Spotting

Wei, Zhang, Liu 2020

Communications in Computer and Information Science

