A two-tier model for the description of morphological, syntactic and semantic variations of multi-word terms is presented. It is applied to term normalization of French and English corpora in the medical and agricultural domains. Five different sources of morphological and semantic knowledge are exploited (MULTEXT, CELEX, AGROVOC, WordNet 1.6, and the Microsoft Word97 thesaurus).
Abstract. Recent developments in computational terminology call for the design of multiple and complementary tools for the acquisition, the structuring and the exploitation of terminological data. This paper proposes to bridge the gap between term acquisition and thesaurus construction by offering a framework for the automatic structuring of multi-word candidate terms with the help of corpus-based links between single-word terms. Hypernym links acquired through an information extraction procedure are projected on multi-word terms through the recognition of semantic variations. The induced hierarchy is incomplete but provides an automatic generalization of single-word term relations to the multi-word terms that are pervasive in technical thesauri and corpora.
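The projection step described above can be illustrated with a small sketch on hypothetical data (this is an interpretation of the abstract, not the authors' implementation): a hypernym link between two single words is lifted to a pair of multi-word terms that differ only by that word pair in the same position.

```python
# Sketch: project single-word hypernym links onto multi-word terms.
# Hypothetical data; illustrates the induction idea only.

single_hypernyms = {("apple", "fruit"), ("wheat", "cereal")}  # (hyponym, hypernym)

terms = ["apple juice", "fruit juice", "wheat flour", "cereal flour", "apple tree"]

def induced_links(terms, single_hypernyms):
    """Induce hypernym links between multi-word terms that differ in
    exactly one position by a known (hyponym, hypernym) word pair."""
    links = set()
    split = [t.split() for t in terms]
    for a in split:
        for b in split:
            if len(a) != len(b):
                continue
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1 and diffs[0] in single_hypernyms:
                links.add((" ".join(a), " ".join(b)))
    return links

print(sorted(induced_links(terms, single_hypernyms)))
# → [('apple juice', 'fruit juice'), ('wheat flour', 'cereal flour')]
```

Note that "apple tree" induces no link: no candidate term substitutes a known hypernym of "apple" or "tree" in the same position, which is why the induced hierarchy remains incomplete.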
Terms are often supposed not to be prone to variation. Empirical observation of terms in various corpora (telecommunication, physics, medicine) shows, on the contrary, the quantitative and qualitative importance of term variation. We give a precise linguistic description of the rules relating controlled terms to observed variants, and of the constraints on these rules. This description leads to novel means of enriching terminologies, either by generating possible term variants or by simplifying nominal parse trees in order to discover potential variants.
A system for the automatic production of controlled index terms is presented using linguistically-motivated techniques. This includes a finite-state part-of-speech tagger, a derivational morphological processor for analysis and generation, and a unification-based shallow-level parser using transformational rules over syntactic patterns. The contribution of this research is the successful combination of parsing over a seed term list coupled with derivational morphology to achieve greater coverage of multi-word terms for indexing and retrieval. Final results are evaluated for precision and recall, and implications for indexing and retrieval are discussed.
Array-based comparative genomic hybridization (aCGH) is a powerful tool to detect genomic imbalances in the human genome. The analysis of aCGH data sets has revealed the existence of a widespread technical artifact termed 'waves', characterized by an undulating data profile along the chromosome. Here, we describe the development of a novel noise-reduction algorithm, the waves aCGH correction algorithm (WACA), based on GC content and fragment size correction. WACA efficiently removes the wave artifact, thereby greatly improving the accuracy of aCGH data analysis. We describe the application of WACA to both real and simulated aCGH data sets, and demonstrate that our algorithm, by systematically correcting for all known sources of bias, is a significant improvement on existing aCGH noise reduction algorithms. WACA and associated files are freely available as Supplementary Data.
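The correction principle — model the probe log2 ratios as a function of GC content, then subtract the fitted trend — can be sketched as follows. This is an illustrative least-squares stand-in on simulated data, not the published WACA code (which also corrects for fragment size and other biases):

```python
# Illustrative GC-content wave correction: regress log2 ratios on probe
# GC content and subtract the fitted trend. A simple least-squares
# stand-in for the published WACA algorithm, not its actual code.
import random

def correct_gc_wave(log2_ratios, gc_content):
    n = len(log2_ratios)
    mean_gc = sum(gc_content) / n
    mean_r = sum(log2_ratios) / n
    cov = sum((g - mean_gc) * (r - mean_r) for g, r in zip(gc_content, log2_ratios))
    var = sum((g - mean_gc) ** 2 for g in gc_content)
    slope = cov / var
    intercept = mean_r - slope * mean_gc
    return [r - (intercept + slope * g) for r, g in zip(log2_ratios, gc_content)]

def std(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# Simulated probes: a GC-driven "wave" plus measurement noise
random.seed(0)
gc = [random.uniform(0.3, 0.7) for _ in range(1000)]
ratios = [2.0 * (g - 0.5) + random.gauss(0, 0.05) for g in gc]

corrected = correct_gc_wave(ratios, gc)
print(std(ratios) > 0.2, std(corrected) < 0.1)  # wave variance removed
```

After correction, only the simulated measurement noise remains, which is the intended effect of removing the undulating GC-driven profile.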
Motivation

Terms are known to be excellent descriptors of the informational content of textual documents (Srinivasan, 1996), but they are subject to numerous linguistic variations. Terms cannot be retrieved properly with coarse text-simplification techniques (e.g. stemming); their identification requires precise and efficient NLP techniques. We have developed a domain-independent system for automatic term recognition from unrestricted text. The system presented in this paper takes as input a list of controlled terms and a corpus; it detects and marks occurrences of term variants within the corpus. The system takes as input a precompiled (automatically or manually) term list and transforms it dynamically into a more complete term list by adding automatically generated variants.

Acknowledgments: We would like to thank the NLP Group of Columbia University, Bell Laboratories - Lucent Technologies, and the Institut Universitaire de Technologie de Nantes for their support of the exchange visitor program for the first author. We also thank the Institut de l'Information Scientifique et Technique (INIST-CNRS) for providing us with the agricultural corpus and the associated term list, and Didier Bourigault for providing us with terms extracted from the newspaper corpus through LEXTER (Bourigault, 1993).
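The variant-detection step described above — marking corpus occurrences of a controlled term's variants — can be sketched with a regular expression that tolerates inserted modifiers between a term's content words. This is an illustrative simplification only; the actual system uses a part-of-speech tagger, a morphological processor and a shallow parser rather than regexes:

```python
# Sketch: match insertion variants of a controlled multi-word term,
# e.g. "blood cell" also matching "blood mononuclear cell(s)".
# Illustrative only; not the paper's tagger/parser-based system.
import re

def variant_pattern(term):
    """Build a regex matching the term's words in order, each with an
    optional plural 's', allowing up to two intervening words."""
    parts = [re.escape(w) + r"s?" for w in term.split()]
    gap = r"\s+(?:\w+\s+){0,2}"
    return re.compile(r"\b" + gap.join(parts) + r"\b", re.I)

pattern = variant_pattern("blood cell")
text = "Peripheral blood cells and blood mononuclear cells were isolated."
print([m.group(0) for m in pattern.finditer(text)])
# → ['blood cells', 'blood mononuclear cells']
```

Such a surface pattern over-generates (any two intervening words are accepted); the linguistically-motivated pipeline constrains insertions to valid modifiers and handles derivational variants (e.g. "cellular") that no plural-tolerant regex can capture.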
This method extends the limits of term extraction as currently practiced in the IR community: it takes into account the multiple morphological and syntactic ways linguistic concepts are expressed within language. Our approach is a unique hybrid in allowing the use of manually produced precompiled data as input, combined with fully automatic computational methods for generating term expansions. Our results indicate that we can expand term variations by at least 30% within a scientific corpus.

Background and Introduction

NLP techniques have been applied to the extraction of information from corpora for tasks such as free indexing (extraction of descriptors from corpora) (Metzler and Haas, 1989; Schwarz, 1990; Sheridan and Smeaton, 1992; Strzalkowski, 1996), term acquisition (Smadja and McKeown, 1991; Bourigault, 1993; Justeson and Katz, 1995; Daille, 1996), and the extraction of linguistic information, e.g. support verbs (Grefenstette and Teufel, 1995) and the event structure of verbs (Klavans and Chodorow, 1992). Although useful, these approaches suffer from two weaknesses, which we address. First is the issue...