2007
DOI: 10.1002/asi.20629

Mining Web data for Chinese segmentation

Abstract: Modern information retrieval systems use keywords within documents as indexing terms for search of relevant documents. As Chinese is an ideographic character-based language, the words in the texts are not delimited by white spaces. Indexing of Chinese documents is impossible without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed in the past. Traditional segmentation algorithms cannot operate without a large dictionary or a large corpus of training data. Nowadays, the We…

Citations: cited by 8 publications (4 citation statements)
References: 34 publications
“…After adjusting for the length of words, the combination of words with highest adjusted frequency are chosen as the segmentation result. Experiments showed that this segmentation algorithm outperformed existing state-of-art segmentation methods and were robust to text collections from different geographical areas [42].…”
Section: Chinese Text Segmentation (citation type: mentioning)
confidence: 92%
“…One way to alleviate the problem of mismatched training and testing dataset is to make use of a large-enough corpus. The web mining-based segmentation algorithm makes use of the n-gram collected by submitting corresponding queries to search engines such as Google and Yahoo [42]. After adjusting for the length of words, the combination of words with highest adjusted frequency are chosen as the segmentation result.…”
Section: Chinese Text Segmentation (citation type: mentioning)
confidence: 99%
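
The procedure summarized in the statement above (collect n-gram frequencies, adjust them for word length, keep the word combination with the highest adjusted frequency) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the hit counts in HIT_COUNTS are made up and stand in for search-engine results, the length adjustment (count multiplied by word length) is an assumed placeholder because the statement does not give the actual formula, and the names adjusted_frequency and segment are hypothetical.

```python
from functools import lru_cache

# Hypothetical n-gram hit counts. In the described approach these would come
# from search-engine queries; the numbers here are invented for illustration.
HIT_COUNTS = {
    "研究": 900, "生命": 800, "研究生": 500, "起源": 400,
    "研": 200, "究": 150, "生": 600, "命": 300, "起": 250, "源": 100,
}
MAX_WORD_LEN = 4  # assumed upper bound on candidate word length


def adjusted_frequency(word, counts):
    """Assumed adjustment: raw count weighted by word length."""
    return counts.get(word, 0) * len(word)


def segment(text, counts=HIT_COUNTS, max_len=MAX_WORD_LEN):
    """Return the word sequence whose summed adjusted frequency is highest."""

    @lru_cache(maxsize=None)
    def best(start):
        # Best (score, words) for the suffix text[start:].
        if start == len(text):
            return 0, ()
        options = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            tail_score, tail_words = best(end)
            score = adjusted_frequency(word, counts) + tail_score
            options.append((score, (word,) + tail_words))
        return max(options, key=lambda option: option[0])

    return list(best(0)[1])


print(segment("研究生命起源"))  # ['研究', '生命', '起源']
```

With these invented counts, the split 研究 / 生命 / 起源 scores 4200, ahead of 研究生 / 命 / 起源 at 2600, so the length-weighted frequencies resolve the ambiguity without a dictionary. A different adjustment formula or different counts could change the outcome.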
“…Since most non‐spatial attributes are composed of short texts, we process them into computable numerical indicators. In natural language processing of Chinese, because there is no natural separation between words, the performance of word segmentation tools is usually not ideal (Wang & Yang, 2007), and sometimes the result of word segmentation can be ambiguous.…”
Section: Methods (citation type: mentioning)
confidence: 99%