Reflective random indexing for semi-automatic indexing of the biomedical literature

Vasuki, Vidya; Cohen, Trevor

doi:10.1016/j.jbi.2010.04.001

Cited by 22 publications

(24 citation statements)

References 9 publications


“…However automatic systems also present problems because of the complexity of natural language processing (Sinkkilä et al, 2011). Consequently the semi-automatic indexing approach is a good solution, because in addition to obviating the problems of the automatic indexing system it facilitates the task of indexers by providing suitable term suggestions (Vasuki and Cohen, 2010).…”

Section: Research Objectives (mentioning)

confidence: 99%

A semi-automatic indexing system based on embedded information in HTML documents

et al. 2015


Purpose - This paper describes and evaluates the tool DigiDoc MetaEdit, which allows the semi-automatic indexing of HTML documents. The tool works by identifying and suggesting keywords from a thesaurus according to the embedded information in HTML documents. This enables the parameterization of keyword assignment based on how frequently the terms appear in the document, the relevance of their position, and the combination of both.

Design/methodology/approach - In order to evaluate the efficiency of the indexing tool, the descriptors/keywords suggested by the indexing tool are compared to the keywords which have been indexed manually by human experts. To make this comparison, a corpus of HTML documents is randomly selected from a journal devoted to Library and Information Science.

Findings - The results of the evaluation show that: (1) there is close to a 50% match or overlap between the two indexing systems; however, if related terms and narrower terms are taken into consideration, the matches can reach 73%; and (2) the first terms identified by the tool are the most relevant.

Originality/value - The tool presented identifies the most important keywords in an HTML document based on the embedded information in HTML documents. Nowadays, representing the contents of documents with keywords is an essential practice in areas such as information retrieval and e-commerce.


“…Vasuki and Cohen [10] use an interesting approach that employs reflective random indexing to find the nearest neighbors in the training dataset and use the indexing based similarity scores to rank the terms from the neighboring citations. A recent effort by Jimeno-Yepes et al [11] uses a large dataset and uses meta-learning to train custom binary classifiers for each MeSH term and index the best performing model for each term for usage on new testing citations; we request the reader to refer to their work for a recent review of machine learning approaches used for MeSH term assignment.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…We experiment with two public datasets used by Huang et al [1]. The NLM2007 dataset has 200 test citations and is used by other recent studies on this subject [10]. The L1000 dataset is curated by Huang et al by random selection for the purposes of their work to test their methods on a larger dataset that spanned a large number of years.…”

Section: Datasets and Evaluation Metrics (mentioning)

confidence: 99%

“…For a detailed analysis of other RI variants and a thorough introduction, please see [6]. We note that Vasuki and Cohen [10] use TRRI to obtain the nearest neighbors of a testing citation and rank the neighbors’ terms using the citation similarity score sums as discussed in Section 5.3.1. Using this approach they obtain results better than the MTI method.…”

Section: Supervised Prediction With Co-occurrences and Latent Asso… (mentioning)

confidence: 99%
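The neighbor-based ranking described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation: it assumes citation vectors have already been built (for example by a random indexing variant), uses plain cosine similarity, and the names `cosine`, `rank_terms`, and `toy_citations` as well as the toy vectors and terms are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_terms(query_vec, indexed_citations, k=2):
    """Rank candidate terms by the summed similarity of the k nearest
    already-indexed citations that carry each term.

    indexed_citations: list of (vector, list_of_terms) pairs (assumed input).
    """
    # Score every indexed citation against the query and keep the k nearest.
    nearest = sorted(
        ((cosine(query_vec, vec), terms) for vec, terms in indexed_citations),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Each neighbor votes for its terms with a weight equal to its similarity.
    totals = defaultdict(float)
    for sim, terms in nearest:
        for term in terms:
            totals[term] += sim

    # Highest summed score first; ties broken alphabetically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical vectors and terms, for illustration only).
toy_citations = [
    ([1.0, 0.0], ["Humans", "Neoplasms"]),
    ([0.8, 0.6], ["Humans", "Genomics"]),
    ([0.0, 1.0], ["Mice"]),
]
ranking = rank_terms([1.0, 0.0], toy_citations, k=2)
```

In this toy run, a term shared by both nearest neighbors accumulates both similarity scores and so ranks above terms that appear in only one neighbor.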


Leveraging output term co-occurrence frequencies and latent associations in predicting medical subject headings

Kavuluru, Lü 2014

Data & Knowledge Engineering


Trained indexers at the National Library of Medicine (NLM) manually tag each biomedical abstract with the most suitable terms from the Medical Subject Headings (MeSH) terminology to be indexed by their PubMed information system. MeSH has over 26,000 terms and indexers look at each article’s full text while assigning the terms. Recent automated attempts focused on using the article title and abstract text to identify MeSH terms for the corresponding article. Most of these approaches used supervised machine learning techniques that use already indexed articles and the corresponding MeSH terms. In this paper, we present a new indexing approach that leverages term co-occurrence frequencies and latent term associations computed using MeSH term sets corresponding to a set of nearly 18 million articles already indexed with MeSH terms by indexers at NLM. The main goal of our study is to gauge the potential of output label co-occurrences, latent associations, and relationships extracted from free text in both unsupervised and supervised indexing approaches. In this paper, using a novel and purely unsupervised approach, we achieve a micro-F-score that is comparable to those obtained using supervised machine learning techniques. By incorporating term co-occurrence and latent association features into a supervised learning framework, we also improve over the best results published on two public datasets.


“…Work on MeSH in English has used the probabilistic model [4] and machine learning techniques such as Bayesian networks [5] and k-nearest neighbors for document classification [2, 6, 7, 8]. Aronson et al [2] also make use of the MetaMap tool [9] and the trigram method (which determines the similarity between two sentences) to extract Unified Medical Language System (UMLS) concepts, which are then restricted to MeSH concepts.…”

Section: Introduction (unclassified)

Indexation automatique de documents en santé : évaluation et analyse de sources d’erreurs

Chebil, Soualmia, Dahamna et al. 2012

IRBM

