Vectors derived from ordinary co-occurrence statistics over large text corpora are compared with vectors derived by measuring inter-word distances in dictionary definitions. Word sense disambiguation was more precise with co-occurrence vectors from the 1987 Wall Street Journal (20M total words) than with distance vectors from the Collins English Dictionary (60K head words + 1.6M definition words). However, other experimental results suggest that distance vectors carry semantic information that co-occurrence vectors do not.
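As a rough illustration of the corpus-based side of this comparison, the sketch below builds sparse co-occurrence vectors from a tokenized corpus and scores them with cosine similarity, the usual way such vectors feed word sense disambiguation. The window size, tokenization, and toy corpus are assumptions, not details from the paper.

```python
from collections import Counter, defaultdict
from math import sqrt

def cooccurrence_vectors(tokens, window=2):
    """One sparse vector per word: counts of neighbours seen
    within +/- `window` positions across the corpus."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy usage: compare the accumulated contexts of two words.
tokens = "the bank raised interest rates near the river bank".split()
vectors = cooccurrence_vectors(tokens)
print(cosine(vectors["bank"], vectors["interest"]))
```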
The paper presents a new approach to text segmentation, which concerns dividing a text into coherent discourse units. The approach builds on the theory of discourse segments (Nomoto and Nitta, 1993), incorporating ideas from research on information retrieval (Salton, 1988). A discourse segment has to do with the structure of Japanese discourse; it can be thought of as a linguistic unit demarcated by wa, a Japanese topic particle, which may extend over several sentences. The segmentation works with discourse segments and makes use of a coherence measure based on tf·idf, a standard information retrieval measurement (Salton, 1988; Hearst, 1993). Experiments have been done with a Japanese newspaper corpus. It has been found that the present approach is quite successful in recovering articles from the unstructured corpus.
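The abstract names tf·idf as the basis of the coherence measure but gives no formula, so the sketch below shows one plausible instantiation: tf·idf-weighted vectors for candidate blocks and cosine similarity between neighbours, with a low score suggesting a segment boundary. The block granularity and the boundary criterion are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(blocks):
    """blocks: list of token lists (candidate discourse segments).
    Returns one tf.idf-weighted sparse vector per block."""
    df = Counter()
    for block in blocks:
        df.update(set(block))          # document frequency per word
    n = len(blocks)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(b).items()}
            for b in blocks]

def coherence(u, v):
    """Cosine similarity as a coherence score; a dip between
    adjacent blocks suggests a segment boundary."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```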
Practical machine translation must be considered from a heuristic point of view rather than from a purely rigid analytical linguistic method. An English-into-Japanese translation system named ATHENE, based on a Heuristic Parsing Model (HPM), has been developed. The experiment shows some advantageous points, such as simplification of the transforming and generating phases, semi-localization of multiple-meaning resolution, and extendability for future grammatical refinement. The HPM-based parsing process, parsed trees, grammatical data representation, and translation results are also described.

1. INTRODUCTION

Is it true that the recipe for realizing a successful machine translation is precise and rigid language parsing? So far many studies have been done on rigid and detailed natural language parsing, some of which are powerful enough to detect ungrammatical sentences [1, 2, 3, 4]. Notwithstanding, it seems that detailed parsing is not always connected with practically satisfying machine translation. On the other hand, actual humans, even foreign-language learners, can translate fairly difficult English sentences without going into the details of parsing. They use only elementary grammatical knowledge and dictionaries. Thus, we have paid attention to the heuristic methods of language learners and have devised a rather non-standard linguistic model named HPM (Heuristic Parsing Model). Here, "non-standard" implies that sentential constituents in HPM are different from those in widely accepted modern English grammars [5] or in phrase structure grammars [6]. In order to prove the reasonability of HPM, we have developed an English-into-Japanese translation system named ATHENE (Automatic Translation of Hitachi from English into Nihongo with Editing Support) (cf. Fig. 1).

The essential features of heuristic translation are summarized in the following three points (a toy sketch follows below):
(1) Segment an input sentence into new elements named Phrasal Elements (PEs) and Clausal Elements (CEs).
(2) Assign syntactic roles to the PEs and CEs, and restructure the segmented elements into tree forms by the inclusion relation and into list forms by the modification relation.
(3) Permute the segmented elements, and assign appropriate Japanese equivalents with the necessary case suffixes and postpositions.

The next section presents an overview of HPM, which is followed in Sec. 3 by a rough explication of the machine translation process in ATHENE. Sec. 4 discusses the experimental results. Sec. 5 presents concluding remarks and current plans.
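The toy sketch below renders the three phases on a single S-V-O clause; every rule, role label, and lexicon entry is hypothetical and vastly simpler than ATHENE's actual machinery.

```python
# Toy rendering of the three heuristic phases on one S-V-O clause.
# The lexicon, role rules, and particles below are hypothetical.
LEXICON = {"the boy": ("shounen", "ga"),   # subject marker ga
           "reads":   ("yomu",    None),
           "a book":  ("hon",     "wo")}   # object marker wo

def translate(phrasal_elements):
    """Phase 1 is assumed done: the input is already segmented into
    Phrasal Elements.  Phase 2 assigns roles by position (S-V-O);
    phase 3 permutes to Japanese S-O-V order and attaches
    equivalents with their case particles."""
    subj, verb, obj = phrasal_elements        # role assignment
    out = []
    for pe in (subj, obj, verb):              # permute to S-O-V
        word, particle = LEXICON[pe]
        out.append(word + (" " + particle if particle else ""))
    return " ".join(out)

print(translate(["the boy", "reads", "a book"]))
# -> "shounen ga hon wo yomu"
```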
This paper aims to analyze the word dependency structure of compound nouns appearing in Japanese newspaper articles. The analysis is a difficult problem because such compound nouns can be quite long, have no word boundaries between the nouns they contain, and often contain unregistered words such as abbreviations. The non-segmentation property and the unregistered words cause initial segmentation errors, which result in erroneous analysis. This paper presents a corpus-based approach that scans a corpus with a set of pattern matchers and gathers co-occurrence examples to analyze compound nouns. It employs a bootstrapping search to cope with unregistered words: if an unregistered word is found while searching for examples, it is recorded and invokes additional searches to gather the examples containing it.
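A minimal sketch of that bootstrapping loop, assuming a single hypothetical "XのY" pattern matcher; the paper's actual pattern set and segmentation handling are richer.

```python
import re

def gather_examples(corpus, dictionary):
    """Collect (word, neighbour) co-occurrence examples by scanning
    `corpus` with a pattern matcher.  If a match exposes a word not
    yet registered (e.g. an abbreviation), register it and queue an
    additional search for examples containing it: the bootstrapping
    step.  The single 'Xの(Y)' pattern is an assumed stand-in for
    the paper's pattern set."""
    examples, known, queue = [], set(dictionary), list(dictionary)
    while queue:
        word = queue.pop()
        for m in re.finditer(re.escape(word) + r"の(\w+)", corpus):
            other = m.group(1)
            examples.append((word, other))
            if other not in known:   # unregistered word found
                known.add(other)
                queue.append(other)  # trigger a further search
    return examples
```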