Digital libraries store materials in electronic format. Research and development in digital libraries includes content creation, conversion, indexing, organization, and dissemination. The key technological issues are how to search and display desired selections from and across large collections effectively [Schatz & Chen, 1996]. Digital library research projects (DLI-1) sponsored by NSF/ DARPA/NASA have a common theme of bringing search to the net, which is the flagship research effort for the National Information Infrastructure (NII) in the United States. A repository is an indexed collection of objects.Indexing is an important task for searching. The better the indexing, the better the searching result. Developing a universal digital library has been the dream of many researchers, however, there are still many problems to be solved before such a vision is fulfilled. The most critical is to support a cross-lingual retrieval or multilingual digital library. Much work has been done on English information retrieval, however, there is relatively less work on Chinese information retrieval. In this article, we focus on Chinese indexing, which is the foundation of Chinese and cross-lingual information retrieval. The smallest indexing units in Chinese digital libraries are words, while the smallest units in a Chinese sentence are characters. However, Chinese text has no delimiter to mark word boundaries as it is in English text. In English or other languages using Roman or Greek-based orthographies, often, spacing reliably indicates word boundaries. In Chinese, a number of characters are placed together without any delimiters indicating the boundaries between consecutive characters. In this article, we investigate the combination and boundary detection approaches based on mutual information for segmentation. The combination approach combines ngrams to form words with more number of characters. In the combination approach Algorithm 1 does not allow overlapping of n-grams while Algorithm 2 does. The boundary detection approach detects the segmentation points on a sentence based on the values and the change of values of the mutual information. Experiments are conducted to evaluate their performances.
The information available in languages other than English in the World Wide Web is increasing significantly. According to a report from Computer Economics in 1999, 54% of Internet users are English speakers (“English Will Dominate Web for Only Three More Years,” Computer Economics, July 9, 1999, http://www.computereconomics.com/new4/pr/pr990610.html). However, it is predicted that there will be only 60% increase in Internet users among English speakers verses a 150% growth among non‐English speakers for the next five years. By 2005, 57% of Internet users will be non‐English speakers. A report by CNN.com in 2000 showed that the number of Internet users in China had been increased from 8.9 million to 16.9 million from January to June in 2000 (“Report: China Internet users double to 17 million,” CNN.com, July, 2000, http://cnn.org/2000/TECH/computing/07/27/china.internet.reut/index.html). According to Nielsen/NetRatings, there was a dramatic leap from 22.5 millions to 56.6 millions Internet users from 2001 to 2002. China had become the second largest global at‐home Internet population in 2002 (US's Internet population was 166 millions) (Robyn Greenspan, “China Pulls Ahead of Japan,” Internet.com, April 22, 2002, http://cyberatlas.internet.com/big_picture/geographics/article/0,,5911_1013841,00.html). All of the evidences reveal the importance of cross‐lingual research to satisfy the needs in the near future. Digital library research has been focusing in structural and semantic interoperability in the past. Searching and retrieving objects across variations in protocols, formats and disciplines are widely explored (Schatz, B., & Chen, H. (1999). Digital libraries: technological advances and social impacts. IEEE Computer, Special Issue on Digital Libraries, February, 32(2), 45–50.; Chen, H., Yen, J., & Yang, C.C. (1999). International activities: development of Asian digital libraries. IEEE Computer, Special Issue on Digital Libraries, 32(2), 48–49.). However, research in crossing language boundaries, especially across European languages and Oriental languages, is still in the initial stage. In this proposal, we put our focus on cross‐lingual semantic interoperability by developing automatic generation of a cross‐lingual thesaurus based on English/Chinese parallel corpus. When the searchers encounter retrieval problems, professional librarians usually consult the thesaurus to identify other relevant vocabularies. In the problem of searching across language boundaries, a cross‐lingual thesaurus, which is generated by co‐occurrence analysis and Hopfield network, can be used to generate additional semantically relevant terms that cannot be obtained from dictionary. In particular, the automatically generated cross‐lingual thesaurus is able to capture the unknown words that do not exist in a dictionary, such as names of persons, organizations, and events. Due to Hong Kong's unique history background, both English and Chinese are used as official languages in all legal documents. Therefore, English/Chinese cross‐lin...
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.