Abstract: We present an automated system for assigning protein, gene, or mRNA class labels to biological terms in free text. Three machine learning algorithms and several extended definitions of contextual features for disambiguation are examined, and a fully unsupervised method for obtaining training examples is proposed. We train and evaluate our system on a collection of 9 million words of molecular biology journal articles, obtaining accuracy rates of up to 85%.
“…Hatzivassiloglou and Duboué [19] used three supervised learning techniques, C4.5 decision trees, naïve Bayes, and inductive learning. They tested different features with an automatically created gold standard to distinguish between genes, proteins, and mRNA.…”
Section: Information Sources (mentioning)
confidence: 99%
“…Alas, compiling such gold standards is time-consuming and difficult. Some researchers have built gold standards automatically [16,19,21] to sidestep the difficulty of finding experts to create them. These standards are an excellent approach to comparing different algorithms.…”
Section: Research Question (mentioning)
confidence: 99%
“…However, because they are systematically built, they deviate from the standard human experts would establish. This is illustrated by Hatzivassiloglou and Duboué [19], who asked human experts to assign labels to the same terms as in the artificial gold standard (the disambiguating terms were deleted). The pair-wise agreement of the experts was 78%.…”
Summary Current approaches to word sense disambiguation use (and often combine) various machine learning techniques. Most refer to characteristics of the ambiguous word and its surrounding words and are based on thousands of examples. Unfortunately, developing large training sets is burdensome, and in response to this challenge, we investigate the use of symbolic knowledge for small datasets. A naïve Bayes classifier was trained for 15 words with 100 examples each. Unified Medical Language System (UMLS) semantic types assigned to concepts found in the sentence, together with relationships between these semantic types, form the knowledge base. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in nine experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from an 8% deterioration to a 29% improvement. To investigate this large variance, we performed several follow-up evaluations, testing additional algorithms (decision tree and neural network) and gold standards (per expert), but the results did not differ significantly. However, we noted a trend that the best disambiguation was achieved for words that were the least troublesome to the human evaluators. We conclude that neither the algorithm nor individual human behavior causes these large differences, but that the structure of the UMLS Metathesaurus (used to represent senses of ambiguous words) contributes to inaccuracies in the gold standard, leading to varied performance of word sense disambiguation techniques.
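The classifier described in this summary — naïve Bayes over the UMLS semantic types of the words in a sentence — can be sketched in a few lines. The semantic-type labels, the ambiguous word, and the toy examples below are invented for illustration; they are not the study's actual data or its UMLS feature extraction pipeline.

```python
import math
from collections import Counter, defaultdict

# Toy training set: each example pairs the set of UMLS-style semantic types
# found in a sentence (hypothetical labels) with the intended sense of an
# ambiguous word such as "cold" (disorder vs. temperature).
EXAMPLES = [
    ({"Disease or Syndrome", "Sign or Symptom"}, "disorder"),
    ({"Sign or Symptom", "Pharmacologic Substance"}, "disorder"),
    ({"Natural Phenomenon or Process", "Quantitative Concept"}, "temperature"),
    ({"Quantitative Concept", "Temporal Concept"}, "temperature"),
]

def train_nb(examples):
    """Count sense priors and per-sense semantic-type frequencies."""
    sense_counts = Counter(sense for _, sense in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        for f in feats:
            feat_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feat_counts, vocab, len(examples)

def classify(model, feats):
    """Pick the sense maximizing log P(sense) + sum of log P(type | sense),
    with Laplace smoothing so unseen semantic types do not zero out a sense."""
    sense_counts, feat_counts, vocab, n = model
    best_sense, best_lp = None, float("-inf")
    for sense, count in sense_counts.items():
        lp = math.log(count / n)
        total = sum(feat_counts[sense].values())
        for f in feats:
            lp += math.log((feat_counts[sense][f] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best_sense, best_lp = sense, lp
    return best_sense

model = train_nb(EXAMPLES)
print(classify(model, {"Sign or Symptom"}))       # disorder-flavored context
print(classify(model, {"Quantitative Concept"}))  # temperature-flavored context
```

In the study's setup, each of the 15 words would get its own such model trained on its 100 examples, with accuracy estimated by 10-fold cross-validation rather than the single train/predict pass shown here.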
“…A promising approach to handle the resulting information overload is to automate the process of knowledge extraction using data mining techniques, thereby extracting novel information and relationships between biological features (Fielding 1999; Hatzivassiloglou et al. 2001). Machine learning techniques permit the building of models for a given classification task.…”
In this study we analyzed the possible context-specific and individual-specific features of dog barks using a new machine-learning algorithm. A pool containing more than 6,000 barks, recorded in six different communicative situations, was used as the sound sample. The algorithm's task was to learn which acoustic features of the barks, recorded in different contexts and from different individuals, could be used to distinguish them from one another. The program conducted this task by analyzing barks emitted in previously identified contexts by identified dogs. After the best feature set had been obtained (the one achieving the highest identification rate), the efficiency of the algorithm was tested in a classification task in which unknown barks were analyzed. The recognition rates we found were well above chance level: the algorithm could categorize the barks according to their recorded situation with an efficiency of 43%, and identify the barking individuals with an efficiency of 52%. These findings suggest that dog barks have context-specific and individual-specific acoustic features. In our opinion, this machine learning method may provide an efficient tool for analyzing acoustic data in various behavioral studies.
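The classification task in this abstract — assigning a bark to a context or individual based on its acoustic features — can be illustrated with a minimal nearest-centroid classifier. The two features (fundamental frequency, inter-bark interval), the context labels, and all numeric values below are invented for illustration; the study's actual algorithm and feature set differ.

```python
import math
from collections import defaultdict

# Toy bark recordings: (feature vector, context label). Feature order:
# [fundamental frequency in Hz, inter-bark interval in s] — hypothetical.
BARKS = [
    ([800.0, 0.30], "stranger"),
    ([820.0, 0.25], "stranger"),
    ([1200.0, 0.60], "play"),
    ([1150.0, 0.55], "play"),
]

def centroids(samples):
    """Average the feature vectors of each class into one centroid per class."""
    sums, counts = {}, defaultdict(int)
    for vec, label in samples:
        acc = sums.get(label, [0.0] * len(vec))
        sums[label] = [s + v for s, v in zip(acc, vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify_bark(cents, vec):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(cents, key=lambda label: math.dist(cents[label], vec))

cents = centroids(BARKS)
print(classify_bark(cents, [790.0, 0.28]))
```

With features on very different scales (hundreds of Hz vs. fractions of a second), a real system would normalize each feature before computing distances; the sketch skips that step for brevity.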
“…Although this is often probably true, it is not guaranteed and may lead to misinterpretations. In a test, experts agreed in only 78% of cases if a sentence was about DNA, mRNA, or protein [11]. This kind of distinction should be crystal clear.…”
Muddled genetic terms miss and mess the message
Mauno Vihinen, Department of Experimental Medical Science, Lund University, BMC D10, SE-22184 Lund, Sweden
A critical aspect of science is the clear communication of complicated matters. However, language is often ambiguous, and the message can get lost in the telling. In particular, genetic terms can have different meanings for different people. Here, I discuss this problem and suggest remedies to clarify the message.