Abstract: We present an automated system for assigning protein, gene, or mRNA class labels to biological terms in free text. Three machine learning algorithms and several extended definitions of contextual features for disambiguation are examined, and a fully unsupervised method for obtaining training examples is proposed. We train and evaluate our system on a collection of 9 million words of molecular biology journal articles, obtaining accuracy rates of up to 85%.
“…Hatzivassiloglou and Duboué [19] used three supervised learning techniques, C4.5 decision trees, naïve Bayes, and inductive learning. They tested different features with an automatically created gold standard to distinguish between genes, proteins, and mRNA.…”
Section: Information Sources (mentioning)
confidence: 99%
“…Alas, compiling such gold standards is time-consuming and difficult. Some researchers have built gold standards automatically [16,19,21] to sidestep the difficulty of finding experts to create them. These standards are an excellent approach to comparing different algorithms.…”
Section: Research Question (mentioning)
confidence: 99%
“…However, because they are systematically built, they deviate from the standard human experts would establish. This is illustrated by Hatzivassiloglou and Duboué [19], who asked human experts to assign labels to the same terms as in the artificial gold standard (the disambiguating terms were deleted). The pair-wise agreement of the experts was 78%.…”
Summary Current approaches to word sense disambiguation use (and often combine) various machine learning techniques. Most refer to characteristics of the ambiguous word and its surrounding words and are based on thousands of examples. Unfortunately, developing large training sets is burdensome, and in response to this challenge, we investigate the use of symbolic knowledge for small datasets. A naïve Bayes classifier was trained for 15 words with 100 examples each. Unified Medical Language System (UMLS) semantic types assigned to concepts found in the sentence, together with relationships between these semantic types, form the knowledge base. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in nine experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from an 8% deterioration to a 29% improvement. To investigate this large variance, we performed several follow-up evaluations, testing additional algorithms (decision tree and neural network) and gold standards (per expert), but the results did not differ significantly. However, we noted a trend that the best disambiguation was achieved for words that were the least troublesome to the human evaluators. We conclude that neither the algorithm nor individual human behavior causes these large differences, but that the structure of the UMLS Metathesaurus (used to represent senses of ambiguous words) contributes to inaccuracies in the gold standard, leading to varied performance of word sense disambiguation techniques.
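The classifier described in this summary — naïve Bayes over the UMLS semantic types of the words in a sentence — can be sketched in a few lines. The semantic-type labels, the ambiguous word, and the toy examples below are invented for illustration; they are not the study's actual data or its UMLS feature extraction pipeline.

```python
import math
from collections import Counter, defaultdict

# Toy training set: each example pairs the set of UMLS-style semantic types
# found in a sentence (hypothetical labels) with the intended sense of an
# ambiguous word such as "cold" (disorder vs. temperature).
EXAMPLES = [
    ({"Disease or Syndrome", "Sign or Symptom"}, "disorder"),
    ({"Sign or Symptom", "Pharmacologic Substance"}, "disorder"),
    ({"Natural Phenomenon or Process", "Quantitative Concept"}, "temperature"),
    ({"Quantitative Concept", "Temporal Concept"}, "temperature"),
]

def train_nb(examples):
    """Count sense priors and per-sense semantic-type frequencies."""
    sense_counts = Counter(sense for _, sense in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        for f in feats:
            feat_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feat_counts, vocab, len(examples)

def classify(model, feats):
    """Pick the sense maximizing log P(sense) + sum of log P(type | sense),
    with Laplace smoothing so unseen semantic types do not zero out a sense."""
    sense_counts, feat_counts, vocab, n = model
    best_sense, best_lp = None, float("-inf")
    for sense, count in sense_counts.items():
        lp = math.log(count / n)
        total = sum(feat_counts[sense].values())
        for f in feats:
            lp += math.log((feat_counts[sense][f] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best_sense, best_lp = sense, lp
    return best_sense

model = train_nb(EXAMPLES)
print(classify(model, {"Sign or Symptom"}))       # disorder-flavored context
print(classify(model, {"Quantitative Concept"}))  # temperature-flavored context
```

In the study's setup, each of the 15 words would get its own such model trained on its 100 examples, with accuracy estimated by 10-fold cross-validation rather than the single train/predict pass shown here.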
“…A promising approach to handle the resulting information overload is to automate the process of knowledge extraction using data mining techniques, thereby extracting novel information and relationships between biological features (Fielding 1999; Hatzivassiloglou et al. 2001). Machine learning techniques permit the building of models for a given classification task.…”
In this study we analyzed the possible context-specific and individual-specific features of dog barks using a new machine-learning algorithm. A pool containing more than 6,000 barks, recorded in six different communicative situations, was used as the sound sample. The algorithm's task was to learn which acoustic features of the barks, recorded in different contexts and from different individuals, could be used to distinguish them from one another. The program conducted this task by analyzing barks emitted in previously identified contexts by identified dogs. After the best feature set had been obtained (the one achieving the highest identification rate), the efficiency of the algorithm was tested in a classification task in which unknown barks were analyzed. The recognition rates we found were well above chance level: the algorithm could categorize the barks according to their recorded situation with an efficiency of 43%, and identify the barking individuals with an efficiency of 52%. These findings suggest that dog barks have context-specific and individual-specific acoustic features. In our opinion, this machine learning method may provide an efficient tool for analyzing acoustic data in various behavioral studies.
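The classification task in this abstract — assigning a bark to a context or individual based on its acoustic features — can be illustrated with a minimal nearest-centroid classifier. The two features (fundamental frequency, inter-bark interval), the context labels, and all numeric values below are invented for illustration; the study's actual algorithm and feature set differ.

```python
import math
from collections import defaultdict

# Toy bark recordings: (feature vector, context label). Feature order:
# [fundamental frequency in Hz, inter-bark interval in s] — hypothetical.
BARKS = [
    ([800.0, 0.30], "stranger"),
    ([820.0, 0.25], "stranger"),
    ([1200.0, 0.60], "play"),
    ([1150.0, 0.55], "play"),
]

def centroids(samples):
    """Average the feature vectors of each class into one centroid per class."""
    sums, counts = {}, defaultdict(int)
    for vec, label in samples:
        acc = sums.get(label, [0.0] * len(vec))
        sums[label] = [s + v for s, v in zip(acc, vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify_bark(cents, vec):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(cents, key=lambda label: math.dist(cents[label], vec))

cents = centroids(BARKS)
print(classify_bark(cents, [790.0, 0.28]))
```

With features on very different scales (hundreds of Hz vs. fractions of a second), a real system would normalize each feature before computing distances; the sketch skips that step for brevity.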
“…Although this is often probably true, it is not guaranteed and may lead to misinterpretations. In a test, experts agreed in only 78% of cases if a sentence was about DNA, mRNA, or protein [11]. This kind of distinction should be crystal clear.…”
Muddled genetic terms miss and mess the message
Mauno Vihinen, Department of Experimental Medical Science, Lund University, BMC D10, SE-22184 Lund, Sweden
A critical aspect of science is the clear communication of complicated matters. However, language is often ambiguous, and the message can get lost in the telling. In particular, genetic terms can have different meanings for different people. Here, I discuss this problem and suggest remedies to clarify the message.