The term position feature is widely and successfully used in IR and Web search engines to enhance retrieval effectiveness. This feature essentially serves two purposes: capturing query term proximity, or boosting the weight of terms appearing in certain parts of a document. In this paper, we are interested in the second category. We propose two novel query‐independent techniques based on absolute term positions in a document, whose goal is to boost the weight of terms appearing at the beginning of a document. The first considers only the earliest occurrence of a term in a document; the second takes into account all term positions in a document. We formalize each of these two techniques as a document model based on term position, and then incorporate it into a basic language model (LM). Two smoothing techniques, Dirichlet and Jelinek‐Mercer, are considered in the basic LM. Experiments conducted on three TREC test collections show that our model, especially the version based on all term positions, achieves significant improvements over the baseline LMs, and also often performs better than two state‐of‐the‐art baseline models, the chronological term rank model and the Markov random field model.
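The first technique above can be sketched in code. The following is a minimal illustration, not the paper's actual formulation: it scores a query under a Dirichlet-smoothed unigram LM while boosting each query term's count according to its earliest position in the document. The decay function `position_boost` and its parameter `alpha` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def position_boost(first_pos, alpha=1.0):
    # Hypothetical decay: terms occurring near the start of a document
    # receive a larger weight; this 1/(1+pos) form is illustrative only.
    return 1.0 + alpha / (1.0 + first_pos)

def lm_dirichlet_score(query, doc_terms, coll_freq, coll_len, mu=2000):
    # Query log-likelihood under a Dirichlet-smoothed unigram LM, with
    # each term's count boosted by its earliest position in the document.
    counts = Counter(doc_terms)
    first = {}
    for i, t in enumerate(doc_terms):
        first.setdefault(t, i)  # record only the earliest occurrence
    doc_len = len(doc_terms)
    score = 0.0
    for q in query:
        p_coll = coll_freq.get(q, 0.5) / coll_len  # background model
        boosted = counts.get(q, 0) * position_boost(first.get(q, doc_len))
        p = (boosted + mu * p_coll) / (doc_len + mu)
        score += math.log(p)
    return score
```

Under this sketch, two documents with identical term counts are ranked differently when the query term appears earlier in one of them, which is the intended effect of the position boost.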
Nowadays, the resources available on the web are growing considerably. In this immense data warehouse, current information retrieval systems are unable to return to users the documents that exactly meet the needs expressed by a query over a document collection. This is largely due to the indexing techniques used (keywords, thesauri, etc.). To improve the relevance of information retrieval, we propose in this paper an approach based on the use of a domain ontology for indexing a document base, and on the use of semantic links between documents or document fragments of the collection to enable the inference of all relevant documents. This approach is tested on the domain of e-learning for computer science in the context of the semantic web. Some of the results obtained are also presented.
Most existing Information Retrieval models, including probabilistic and vector space models, are based on the term independence hypothesis. To go beyond this assumption and thereby capture the semantics of documents and queries more accurately, several works have incorporated phrases or other syntactic information into IR; such attempts have shown slight benefit at best. In language modeling approaches in particular, this extension is achieved through the use of bigram or n-gram models. However, in these models all bigrams/n-grams are considered and weighted uniformly. In this paper we introduce a new approach to select and weight the relevant n-grams associated with a document. Experimental results on three TREC test collections show an improvement over three strong state-of-the-art baselines: the original unigram language model, the Markov random field model, and the positional language model.
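To make the idea of selecting and weighting n-grams concrete, here is a minimal sketch using pointwise mutual information (PMI) as the selection and weighting criterion for bigrams. PMI is a standard association measure; the paper's actual selection and weighting scheme may differ, and the threshold `min_pmi` is an assumption for illustration.

```python
import math
from collections import Counter

def select_weight_bigrams(tokens, min_pmi=0.0):
    # Illustrative selection of "relevant" bigrams: keep a bigram (a, b)
    # only if its PMI, log(P(a,b) / (P(a) * P(b))), exceeds a threshold,
    # and use that PMI value as the bigram's weight.
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    weighted = {}
    for (a, b), c in bi.items():
        pmi = math.log((c / (n - 1)) / ((uni[a] / n) * (uni[b] / n)))
        if pmi >= min_pmi:
            weighted[(a, b)] = pmi
    return weighted
```

A bigram whose components co-occur more often than their individual frequencies predict (e.g. a collocation such as "new york") receives a positive weight, while incidental word pairs fall below the threshold; this contrasts with the uniform weighting the abstract criticizes.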
Nowadays, the resources available on the web are growing considerably. In this immense data warehouse, current information retrieval systems do not allow users to obtain results that exactly meet the needs expressed in their requests. This is mainly due to the indexing techniques used (keywords, thesauri). In order to improve the relevance of information retrieval, an ontology-based approach called OBIREX is proposed in this paper. OBIREX is based on the use of a domain ontology for indexing a collection of documents, and on the use of semantic links between documents to allow the inference of all relevant documents. This approach is tested on the domain of e-learning for computer science in the context of the semantic web. Some results obtained are also presented.
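The inference step described above can be sketched as a simple graph traversal. This is an assumed illustration of the general idea, not OBIREX's actual algorithm: documents indexed under the query concept seed the result set, and semantic links between documents are then followed transitively to infer further relevant documents. The data-structure names (`concept_index`, `semantic_links`) are hypothetical.

```python
def retrieve_with_inference(concept, concept_index, semantic_links):
    # Start from documents indexed under the query concept, then follow
    # semantic links between documents to infer additional relevant ones.
    found = set(concept_index.get(concept, []))
    frontier = list(found)
    while frontier:
        d = frontier.pop()
        for linked in semantic_links.get(d, []):
            if linked not in found:
                found.add(linked)
                frontier.append(linked)
    return found
```

In this sketch, a document reachable through a chain of semantic links is returned even though it is not directly indexed under the query concept, which is the "inference of all relevant documents" the abstract refers to.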