As the demand for global information grows, multilingual corpora have become a valuable linguistic resource for applications in cross-lingual information retrieval and natural language processing. Dictionaries are the most typical tools for crossing the boundaries between languages. However, a general-purpose dictionary is less sensitive to both genre and domain, and it is impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not share these limitations, provide a statistical translation model with which to cross the language boundary. Many domain-specific parallel or comparable corpora are employed in machine translation and cross-lingual information retrieval, but most of them pair Indo-European languages, such as English/French and English/Spanish. Asian/Indo-European corpora, especially English/Chinese corpora, are relatively scarce. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. In this paper, an alignment method based on dynamic programming is presented to identify one-to-one Chinese and English title pairs. The method includes alignment at the title level, word level and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. Because one word in a language may be translated into two or more words repeatedly in another language, the deletion edit operation is used to resolve redundancy. A score function is then proposed to determine the optimal title pairs. Experiments were conducted to investigate the performance of the proposed method, using the daily press release articles of the Hong Kong SAR government as the test bed. The precision of the result is 0.998 and the recall is 0.806.
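The longest common subsequence step mentioned above can be illustrated with the textbook dynamic-programming formulation. This is only a minimal sketch of standard LCS, not the authors' implementation: the function name `lcs` and the sample strings are hypothetical, and the paper's score function and deletion handling are not reproduced here because the abstract does not specify them.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b as a list.

    Standard dynamic programming: table[i][j] holds the LCS length of
    a[:i] and b[:j]. This is the generic algorithm, shown for illustration.
    """
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Backtrack from the bottom-right cell to recover one subsequence.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    result.reverse()
    return result


# Hypothetical example: characters of a candidate dictionary translation
# compared against the characters of a Chinese title.
candidate = "行政長官"
title = "長官出席典禮"
shared = lcs(candidate, title)
```

In a setting like the one described, a longer shared subsequence between a dictionary translation and the Chinese title would suggest a more reliable match for the English word; how that length feeds into the final title-pair score is left open here.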
As the World Wide Web illustrates, the volume of information in languages other than English has grown significantly in recent years, which highlights the importance of multilingual corpora. Much effort has been devoted to compiling multilingual corpora for cross-lingual information retrieval and machine translation. Existing parallel corpora mostly involve European languages, such as English-French and English-Spanish, and parallel corpora between European and Asian languages are still lacking. In the authors' previous work, an alignment method that identifies one-to-one Chinese and English title pairs was developed to construct an English-Chinese parallel corpus automatically from the World Wide Web; it obtained 100% precision and 87% recall. Careful analysis of these results has helped the authors to understand how the alignment method can be improved. A conceptual analysis was conducted, covering conceptual equivalence and conceptual information alternation in the aligned and nonaligned English-Chinese title pairs produced by the alignment method. The result of the analysis not only reflects the characteristics of parallel corpora, but also gives insight into the strengths and weaknesses of the alignment method. In particular, conceptual alternation, such as omission and addition, is found to have a significant impact on the performance of the alignment method.
based approach to align English/Chinese Hong Kong Police press release documents from the Web is first presented. We also introduce an algorithmic approach to generate a robust knowledge base, based on statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output consists of a thesaurus-like semantic network knowledge base, which can aid semantics-based cross-lingual information management and retrieval.