Mining Web data for Chinese segmentation

Wang, Fu Lee; Yang, Christopher C.

doi:10.1002/asi.20629

Cited by 8 publications

(4 citation statements)

References 34 publications


“…After adjusting for the length of words, the combination of words with highest adjusted frequency are chosen as the segmentation result. Experiments showed that this segmentation algorithm outperformed existing state-of-art segmentation methods and were robust to text collections from different geographical areas [42].…”

Section: Chinese Text Segmentation (mentioning)

confidence: 92%

“…One way to alleviate the problem of mismatched training and testing dataset is to make use of a large-enough corpus. The web mining-based segmentation algorithm makes use of the n-gram collected by submitting corresponding queries to search engines such as Google and Yahoo [42]. After adjusting for the length of words, the combination of words with highest adjusted frequency are chosen as the segmentation result.…”

Section: Chinese Text Segmentation (mentioning)

confidence: 99%
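The selection step described in the excerpt above can be illustrated with a minimal sketch. The excerpt does not give the exact length adjustment, so everything specific here is an assumption made for illustration: the toy counts stand in for search-engine hit counts, `adjusted_frequency` scales a raw count by the word length raised to a hypothetical exponent `alpha`, and a candidate segmentation is scored by summing the adjusted frequencies of its words. It is not the authors' published formula.

```python
from functools import lru_cache

# Hypothetical n-gram counts standing in for search-engine hit counts.
TOY_COUNTS = {
    "中": 900, "国": 800, "人": 950, "民": 400,
    "中国": 700, "人民": 600, "国人": 50, "中国人": 300,
}


def adjusted_frequency(word, counts, alpha=2.0):
    """Assumed adjustment: raw count scaled by len(word) ** alpha.

    Without some adjustment, short n-grams would always win because
    they are far more frequent than longer ones.
    """
    return counts.get(word, 0) * len(word) ** alpha


def segment(text, counts, max_len=4, alpha=2.0):
    """Return the word sequence with the highest total adjusted frequency."""

    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, words) for the suffix text[start:].
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            tail_score, tail_words = best(end)
            score = adjusted_frequency(word, counts, alpha) + tail_score
            candidates.append((score, (word,) + tail_words))
        return max(candidates, key=lambda c: c[0])

    return list(best(0)[1])


if __name__ == "__main__":
    print(segment("中国人民", TOY_COUNTS))
```

With these toy counts, the two-word split 中国 / 人民 scores higher than either the four single characters or the split 中国人 / 民, so it is chosen. In a real system the counts would come from queries to a search engine rather than a hard-coded table.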

“…Note that the translation in this stage only depends on the 470 key phrases extracted using Extended Mutual Information. Compared to the number of commonly used Chinese characters (about 6000; see for example [42]), this key phrase list is fairly small. As shown in Section 6, this key phrase-based translation approach led to good overall syndromic classification performance.…”

Section: Stage 03: Chinese Phrase Translation (mentioning)

confidence: 99%


Multilingual chief complaint classification for syndromic surveillance: An experiment with Chinese chief complaints

Chen, Zeng, et al. 2009

International Journal of Medical Informatics


“…Since most non‐spatial attributes are composed of short texts, we process them into computable numerical indicators. In natural language processing of Chinese, because there is no natural separation between words, the performance of word segmentation tools is usually not ideal (Wang & Yang, 2007), and sometimes the result of word segmentation can be ambiguous.…”

Section: Methods (mentioning)

confidence: 99%

A points of interest matching method using a multivariate weighting function with gradient descent optimization

Yang, Wang, Zhang, et al. 2020

Transactions in GIS


Volunteered geographic information contains abundant valuable data, which can be applied to various spatiotemporal geographical analyses. While the useful information may be distributed in different, low‐quality data sources, this issue can be solved by data integration. Generally, the primary task of integration is data matching. Unfortunately, due to the complexity and irregularities of multi‐source data, existing studies have found it difficult to efficiently establish the correspondence between different sources. Therefore, we present a multi‐stage method to match multi‐source data using points of interest. A spatial filter is constructed to obtain candidate sets for geographical entities. The weights of non‐spatial characteristics are examined by a machine learning‐related algorithm with artificially labeled random samples. A case study on Fuzhou reveals that an average of 95% of instances are accurately matched. Thus, our study provides a novel solution for researchers who are engaged in data mining and related work to accurately match multi‐source data via knowledge obtained by the idea and methods of machine learning.


Algorithm of Webpage Update Detection Based on Body Text

Chen, Zhang 2012

Lecture Notes in Electrical Engineering

