Information may be defined as the characteristics of the output of a process, these being informative about the process and the input. This discipline-independent definition may be applied to all domains, from physics to epistemology. Hierarchies of processes, linked together, provide a communication channel between each of the corresponding functions and layers in the hierarchies. Models of communication (Shannon), perception, observation, belief, and knowledge are suggested that are consistent with this conceptual framework of information as the value of the output of any process in a hierarchy of processes. Misinformation and errors are considered.
the more general definition encompasses all the phenomena of interest to the field that is covered by the field-specific definition and is consistent with the field-specific definition. A more general definition allows frameworks, theories, and results to be transferred across disciplinary boundaries, and provides for dialogue across these boundaries, while at the same time allowing individual disciplines to focus on the specific information phenomena of their discipline. Unfortunately, people in different fields and professions differ on what information is or how to
A probabilistic document-retrieval system may be seen as a sequential learning process, in which the system learns the characteristics of relevant documents, or more formally, it learns the parameters of probability distributions describing the frequencies of feature occurrences in relevant and nonrelevant documents. Probability distributions that may be used to describe the distribution of features include binary and Poisson distributions. Techniques for estimating the parameters of distributions are suggested. We have tested a proposal that parameters of distributions describing the distribution of features in nonrelevant documents be estimated from the parameters of the corresponding distributions of the entire database; the confidence parameter of such an estimate resulting in the highest average precision is given. Tests of several methods for estimating the parameters of distributions describing the distribution of features in relevant documents suggest that small values of the confidence parameter be used in our initial estimates of parameters for relevant documents.
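The role of a confidence parameter in such estimates can be illustrated with a minimal sketch. It assumes a binary feature and a common pseudo-count style estimator in which the prior (for example, the database-wide feature frequency) is weighted as if it had been observed in `confidence` documents. The function name, argument names, and the exact formula are assumptions made for illustration and are not taken from the paper.

```python
def estimate_feature_probability(prior_p, confidence, occurrences, documents):
    """Blend a prior probability with observed counts for a binary feature.

    prior_p     -- prior estimate, e.g. the feature's frequency in the whole database
    confidence  -- weight given to the prior, expressed in pseudo-documents (assumed form)
    occurrences -- number of judged documents that contain the feature
    documents   -- number of judged documents seen so far
    """
    if not 0.0 <= prior_p <= 1.0:
        raise ValueError("prior_p must lie in [0, 1]")
    if confidence < 0 or documents < 0 or not 0 <= occurrences <= documents:
        raise ValueError("counts and confidence must be non-negative and consistent")
    if confidence + documents == 0:
        raise ValueError("need a prior weight or at least one observed document")
    # Weighted average of the prior and the observed relative frequency.
    return (confidence * prior_p + occurrences) / (confidence + documents)


if __name__ == "__main__":
    # With no feedback yet, the estimate equals the prior.
    print(estimate_feature_probability(0.02, 5, 0, 0))
    # As judged documents accumulate, the observed frequency dominates.
    print(estimate_feature_probability(0.02, 5, 3, 10))
```

Under this assumed form, a small confidence value lets the estimate move quickly toward the observed frequency as relevance feedback arrives, while a large value keeps it close to the prior.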
The use of terms from natural and social scientific titles and abstracts is studied from the perspective of sublanguages and their specialized dictionaries. Different notions of sublanguage distinctiveness are explored. Objective methods for separating hard and soft sciences are suggested based on measures of sublanguage use, dictionary characteristics, and sublanguage distinctiveness. Abstracts were automatically classified with a high degree of accuracy by using a formula that considers the degree of uniqueness of terms in each sublanguage. This may prove useful for text filtering or information retrieval systems. © 1995 John Wiley & Sons, Inc.
Relationships between terms and features are an essential component of thesauri, ontologies, and a range of controlled vocabularies. In this article, we describe ways to identify important concepts in documents using the relationships in a thesaurus or other vocabulary structures. We introduce a methodology for the analysis and modeling of the indexing process based on a weighted random walk algorithm. We also introduce a thesaurus-centric matching algorithm intended to improve the quality of candidate concepts. In all cases, the weighted random walk improves automatic indexing performance over matching alone with an increase in average precision (AP) of 9% for HEP, 11% for MeSH, 35% for NALT, and 37% for AGROVOC. The results of the analysis support our hypothesis that subject indexing is in part a browsing process, and that using the vocabulary and its structure in a thesaurus contributes to the indexing process. The amount that the vocabulary structure contributes was found to differ among the 4 thesauri, possibly due to the vocabulary used in the corresponding thesauri and the structural relationships between the terms. Each of the thesauri and the manual indexing associated with it is characterized using the methods developed here.
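The general idea of a weighted random walk over a vocabulary structure can be sketched as follows. This is a generic random walk with restart, not the article's algorithm: the toy thesaurus, the edge weights, the `restart` value, and the function name are all hypothetical. Concepts matched in a document act as seeds, and the walk spreads their weight along thesaurus relationships.

```python
def weighted_random_walk(graph, seeds, restart=0.15, iterations=50):
    """Score concepts by a random walk with restart over a weighted graph.

    graph -- {concept: {related_concept: edge_weight}}
    seeds -- {concept: weight} for concepts matched in the document
    """
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs} | set(seeds)
    total = sum(seeds.values())
    start = {n: seeds.get(n, 0.0) / total for n in nodes}
    scores = dict(start)
    for _ in range(iterations):
        # Each step, a fraction of the probability mass jumps back to the seeds.
        nxt = {n: restart * start[n] for n in nodes}
        for node in nodes:
            nbrs = graph.get(node, {})
            out = sum(nbrs.values())
            if out == 0:
                # Concept without outgoing links: return its mass to the seeds.
                for n in nodes:
                    nxt[n] += (1 - restart) * scores[node] * start[n]
                continue
            # Otherwise move along each relationship in proportion to its weight.
            for nbr, weight in nbrs.items():
                nxt[nbr] += (1 - restart) * scores[node] * weight / out
        scores = nxt
    return scores


# Hypothetical toy thesaurus; broader-term links get a larger weight.
THESAURUS = {
    "maize": {"cereals": 2.0, "crops": 1.0},
    "wheat": {"cereals": 2.0, "crops": 1.0},
    "cereals": {"maize": 1.0, "wheat": 1.0, "crops": 1.0},
    "crops": {"cereals": 1.0},
    "tractors": {"machinery": 1.0},
    "machinery": {"tractors": 1.0},
}

if __name__ == "__main__":
    ranking = weighted_random_walk(THESAURUS, {"maize": 1.0, "wheat": 1.0})
    for concept, score in sorted(ranking.items(), key=lambda item: -item[1]):
        print(f"{concept}: {score:.3f}")
```

In this toy graph, a concept that is never matched in the text ("cereals") can still receive a high score because both seed terms point to it, which mirrors the browsing intuition described in the abstract.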
The grammars of natural languages may be learned by using genetic algorithms that reproduce and mutate grammatical rules and part-of-speech tags, improving the quality of later generations of grammatical components. Syntactic rules are randomly generated and then evolve; those rules resulting in improved parsing and occasionally improved retrieval and filtering performance are allowed to further propagate. The LUST system learns the characteristics of the language or sublanguage used in document abstracts by learning from the document rankings obtained from the parsed abstracts. Unlike the application of traditional linguistic rules to retrieval and filtering applications, LUST develops grammatical structures and tags without the prior imposition of some common grammatical assumptions (e.g., part-of-speech assumptions), producing grammars that are empirically based and are optimized for this particular application.
The author wishes to thank Stephanie Haas for discussions during the course of this research.
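The reproduce-and-mutate loop described above can be sketched with a toy genetic algorithm over part-of-speech tag assignments. This is not the LUST system: the tag set, the selection scheme, and in particular the fitness function (agreement with a hypothetical target tagging, standing in for ranking quality) are assumptions made only to show the mechanics.

```python
import random

TAGS = ["noun", "verb", "adj"]  # hypothetical tag set


def evolve(fitness, genome_length, population_size=20, generations=40,
           mutation_rate=0.1, seed=0):
    """Evolve a list of tags (one per word) that maximizes `fitness`."""
    rng = random.Random(seed)
    population = [[rng.choice(TAGS) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: the better half survives unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            # Reproduction: one-point crossover between two survivors.
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Mutation: occasionally replace a tag with a random one.
            child = [rng.choice(TAGS) if rng.random() < mutation_rate else tag
                     for tag in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


# Toy stand-in for retrieval performance: agreement with a hypothetical tagging.
TARGET = ["adj", "noun", "verb", "adj", "noun"]


def toy_fitness(genome):
    return sum(1 for tag, wanted in zip(genome, TARGET) if tag == wanted)


if __name__ == "__main__":
    best = evolve(toy_fitness, genome_length=len(TARGET))
    print(best, toy_fitness(best))
```

In a retrieval setting, the fitness would instead be computed from the document rankings produced with the candidate tags and rules, which is the part this sketch leaves out.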