Purpose
A number of approaches and algorithms have been proposed over the years as a basis for automatic indexing. Many of these approaches suffer from poor precision at low recall. The choice of indexing units has a great impact on the effectiveness of a search system. The authors go beyond single-term indexing to propose a framework for filtering and indexing multi-word terms (MWT).
Design/methodology/approach
In this paper, the authors rank MWT and filter them, keeping only the most effective ones for the indexing process. The proposed model filters MWT according to their ability to capture the topic of a document and to distinguish between different documents from the same collection. The authors rely on the hypothesis that the best MWT are those that achieve the greatest degree of association. The experiments are carried out on English and French data sets.
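To make the ranking-and-filtering idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not name the association measures it tests, so pointwise mutual information (PMI) over adjacent word pairs is assumed here purely for illustration. The function names `rank_bigrams_by_pmi` and `filter_top_mwt` and the parameters `min_count` and `keep` are hypothetical.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by PMI, highest first.

    PMI is an assumed association measure; the source does not say
    which measures the authors actually use.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scored = []
    for (w1, w2), count in bigrams.items():
        # Rare pairs give unreliable PMI estimates, so skip them.
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))

    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


def filter_top_mwt(tokens, keep=2, min_count=2):
    """Keep only the `keep` best-ranked candidate pairs for indexing."""
    ranked = rank_bigrams_by_pmi(tokens, min_count)
    return [pair for pair, _ in ranked[:keep]]


tokens = "new york is big new york is far is it is".split()
print(filter_top_mwt(tokens, keep=1))
```

In this toy input, the pair ("new", "york") outranks ("york", "is") because "is" also occurs frequently outside that pair, which lowers its association score. A real pipeline would add linguistic candidate extraction and collection-level statistics, neither of which is described in enough detail here to reproduce.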
Findings
The results indicate that this approach improves precision at low recall and performs better than more advanced models based on term dependencies.
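For readers unfamiliar with the metric, "precision at low recall" is commonly reported as interpolated precision at small recall levels. The sketch below shows one standard way to compute it for a single query; the abstract does not state which evaluation procedure the authors used, so this is an assumption. The function name `interpolated_precision` and the document identifiers are hypothetical.

```python
def interpolated_precision(ranked_ids, relevant_ids, recall_level):
    """Highest precision observed at any rank whose recall >= recall_level.

    Assumed evaluation procedure; the source does not specify one.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0

    hits = 0
    best = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        # Precision can only peak at a relevant document, so it is
        # enough to evaluate at those ranks.
        if doc_id in relevant:
            hits += 1
            recall = hits / len(relevant)
            precision = hits / rank
            if recall >= recall_level:
                best = max(best, precision)
    return best


ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5"}
print(interpolated_precision(ranking, relevant, 0.1))
```

A system that places relevant documents at the very top of the ranking scores highly at low recall levels such as 0.1, which is the regime the findings refer to.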
Originality/value
The paper uses and tests different association measures to select the MWT that best describe the documents, with the aim of enhancing precision among the first retrieved documents.