Abstract: This paper proposes a new method for learning bilingual collocations from sentence-aligned parallel corpora. Our method comprises two steps: (1) extracting useful word chunks (n-grams) by word-level sorting and (2) constructing bilingual collocations by combining the word chunks acquired in stage (1). We apply the method to a very challenging text pair: a stock market bulletin in Japanese and its abstract in English. Domain-specific collocations are well captured even if they were not co…
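Step (1) of the abstract, extracting recurrent word chunks (n-grams) from a corpus, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name `frequent_ngrams`, the thresholds, and the toy corpus are all assumptions; a Counter pass gives the same result as the sorted-suffix technique the paper names for a small sketch.

```python
from collections import Counter

def frequent_ngrams(sentences, max_n=4, min_freq=2):
    """Count word n-grams (up to max_n words) and keep those that
    recur at least min_freq times in the corpus. Recurring n-grams
    serve as the candidate word chunks of stage (1)."""
    counts = Counter()
    for sent in sentences:
        words = sent.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {ng: c for ng, c in counts.items() if c >= min_freq}

# Toy corpus (invented for illustration)
corpus = ["the stock market rose sharply",
          "the stock market fell sharply"]
chunks = frequent_ngrams(corpus)
# the chunk ("the", "stock", "market") recurs and is kept
```

In the paper's formulation the candidate chunks are found by sorting the corpus suffixes so that repeated prefixes become adjacent; the frequency-counting sketch above captures the same selection criterion (recurrence) without that optimization.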
“…There has been a growing interest in corpus-based approaches which retrieve collocations from large corpora (Nagao and Mori, 1994), (Kupiec, 1993), (Fung, 1995), (Kitamura and Matsumoto, 1996), (Smadja, 1993), (Smadja et al, 1996), (Haruno et al, 1996). Although these approaches achieved good results for the task considered, most of them aim to extract fixed collocations, mainly noun phrases, and require the information which is dependent on each language such as dictionaries and parts of speech.…”
In this paper, we describe a method for automatically retrieving collocations from large text corpora. This method retrieves collocations in the following stages: (1) extracting strings of characters as units of collocations; (2) extracting recurrent combinations of strings, in accordance with their word order in the corpus, as collocations. Through this method, a wide range of collocations, especially domain-specific collocations, is retrieved. The method is practical because it uses plain text without any language-dependent information such as lexical knowledge and parts of speech.
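Stage (2) above, combining recurrent strings while preserving their word order in the corpus, might be sketched like this. The helper name `recurrent_combinations`, the substring-containment test, and the threshold are hypothetical simplifications of the paper's procedure.

```python
from collections import Counter
from itertools import combinations

def recurrent_combinations(sentences, chunks, min_freq=2):
    """Count ordered pairs of known chunk strings that co-occur in a
    sentence, keeping pairs that recur at least min_freq times.
    `chunks` is assumed to come from a stage-(1) string extractor."""
    pair_counts = Counter()
    for sent in sentences:
        present = [c for c in chunks if c in sent]
        # preserve corpus word order: sort chunks by their position
        present.sort(key=sent.index)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_freq}

# Toy data (invented for illustration)
sents = ["stock prices rose on the tokyo market",
         "stock prices fell on the tokyo market"]
pairs = recurrent_combinations(sents, ["stock prices", "tokyo market"])
```

Because the pair key is ordered by sentence position, ("stock prices", "tokyo market") and its reversal are counted as distinct combinations, which is what "in accordance with their word order" requires.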
“…This method retrieves fixed collocations with high accuracy but may ignore collocations of exceptional types. Haruno et al (1996) constructed collocations by iteratively combining pairs of strings of high mutual information. But the mutual information is estimated inadequately low when the cohesiveness of the two strings differs greatly.…”
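The weakness noted in the excerpt can be seen directly from the pointwise mutual information formula, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). A minimal sketch (the counts below are invented for illustration):

```python
import math

def pmi(count_xy, count_x, count_y, n_total):
    """Pointwise mutual information of two strings estimated from
    corpus counts. When one string is far more frequent than the
    other (their cohesiveness differs greatly), the denominator
    p(x)*p(y) is inflated and the PMI estimate comes out low."""
    p_xy = count_xy / n_total
    p_x = count_x / n_total
    p_y = count_y / n_total
    return math.log2(p_xy / (p_x * p_y))

# Both pairs co-occur 10 times in 10,000 tokens, but the pair whose
# second string is 100x more frequent scores much lower.
balanced = pmi(10, 10, 10, 10000)      # rare string + rare string
skewed = pmi(10, 10, 1000, 10000)      # rare string + frequent string
```

This is why iteratively combining strings by mutual information alone can miss collocations whose components have very different frequencies.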
Section: Related Work
“…There has been growing interest in corpus-based approaches that retrieve collocations from large corpora (Nagao, Makoto, and Mori 1994;Ikehara, Shirai, and Uchino 1996;Kupiec 1993;Fung 1995;Kitamura and Matsumoto 1996;Smadja 1993;Smadja, McKeown, and Hatzivassiloglou 1996;and Haruno, Ikehara, and Yamazaki 1996). As collocations have a large variety of forms, these approaches focus on fixed collocations depending on their points of view.…”
In this paper, we describe a method for automatically retrieving collocations from large text corpora. This method comprises the following stages: (1) extracting strings of characters as units of collocations, and (2) extracting recurrent combinations of strings as collocations. Through this method, various types of domain-specific collocations can be retrieved simultaneously. This method is practical because it uses plain text with no language-dependent information, such as lexical knowledge and parts of speech. Experimental results using English and Japanese text corpora show that the method is equally applicable to both languages.
“…Since the number of Japanese articles is far greater than that of English articles, this rate with Japanese index terms becomes lower for the similarity lower bounds L_d ≤ 0.4. It is also very important to note that the results of this paper can be easily improved by employing more sophisticated techniques of estimating bilingual compound term correspondences from parallel corpora (e.g., [2]), especially in the performance of selecting appropriate monolingual compound terms in each language.…”
Abstract. To overcome the resource scarcity bottleneck in corpus-based translation knowledge acquisition research, this paper takes the approach of semi-automatically acquiring domain-specific translation knowledge from collections of bilingual news articles on WWW news sites. It presents the results of applying standard co-occurrence-frequency-based techniques for estimating bilingual term correspondences from parallel corpora to relevant article pairs automatically collected from WWW news sites. The experimental evaluation results are very encouraging: many useful bilingual term correspondences can be discovered efficiently, with little human intervention, from relevant article pairs on WWW news sites.
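One standard co-occurrence-frequency-based score of the kind the abstract refers to is the Dice coefficient, 2·f(s, t) / (f(s) + f(t)), counted over aligned article pairs. The sketch below is an assumption about the general technique, not this paper's exact estimator; the function name and the toy data are invented.

```python
from collections import Counter

def dice_scores(aligned_pairs, min_pair_freq=2):
    """Estimate bilingual term correspondences from aligned
    (source_terms, target_terms) article pairs via the Dice
    coefficient, keeping pairs that co-occur min_pair_freq times."""
    f_src, f_tgt, f_pair = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in aligned_pairs:
        for s in set(src_terms):
            f_src[s] += 1
        for t in set(tgt_terms):
            f_tgt[t] += 1
        for s in set(src_terms):
            for t in set(tgt_terms):
                f_pair[(s, t)] += 1
    return {(s, t): 2 * c / (f_src[s] + f_tgt[t])
            for (s, t), c in f_pair.items() if c >= min_pair_freq}

# Toy aligned article pairs (invented for illustration)
aligned = [(["kabushiki"], ["stock"]),
           (["kabushiki"], ["stock"]),
           (["kawase"], ["exchange"])]
scores = dice_scores(aligned)
```

The frequency floor plays the role of the "little human intervention" filter: pairs seen only once are dropped before any candidate list is shown to a human.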