This paper proposes a text mining methodology built on the classical knowledge discovery loop, with a number of adaptations. Texts are first indexed and prepared for a levelwise search for frequent itemsets. Association rules are then extracted and interpreted under the control of an analyst, with respect to a set of quality measures and domain knowledge. The article includes an experiment on a real-world corpus of molecular biology texts.
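As an illustration of the pipeline described above, the following is a minimal sketch of a levelwise (Apriori-style) frequent itemset search followed by rule extraction. It assumes each text has already been indexed as a set of terms. The function names, the thresholds `min_support` and `min_confidence`, and the toy terms are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support):
    """Levelwise search: docs is a list of sets of index terms.

    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    n = len(docs)
    # Level 1 candidates: every single term seen in the corpus.
    candidates = {frozenset([t]) for d in docs for t in d}
    frequent = {}
    k = 1
    while candidates:
        # Count the documents containing each candidate itemset.
        counts = {c: sum(1 for d in docs if c <= d) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Build level k+1 candidates by joining frequent k-itemsets, keeping
        # only those whose k-subsets are all frequent (downward closure).
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


def association_rules(frequent, min_confidence):
    """Derive rules lhs -> rhs from frequent itemsets, filtered by confidence."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already stored.
                conf = supp / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, supp, conf))
    return rules
```

Calling `frequent_itemsets(docs, 0.5)` on a list of term sets and passing the result to `association_rules(..., 0.9)` yields `(lhs, rhs, support, confidence)` tuples that an analyst could then inspect.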
Text mining with association rules generates a very large number of rules. According to domain experts, most of these rules merely convey common knowledge; that is, they associate terms that experts would readily relate to each other. To focus the interpretation of results on the discovery of new knowledge units, criteria are needed for classifying the extracted rules. Most rule classification methods are based on numerical quality measures. In this chapter, the authors introduce two classification methods: the first follows the classical numerical approach, using quality measures, and the second is based on domain knowledge. This second, original approach classifies association rules according to qualitative criteria, using a domain model as background knowledge. It thereby extends the classical numerical approach, combining data mining and semantic techniques for the post-mining and selection of association rules. The authors mined a corpus of molecular biology texts, present and compare the results of both approaches, and discuss the benefits of taking a knowledge model of the domain into account.
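The two kinds of classification can be contrasted with a small sketch. Lift is used here only as one well-known numerical quality measure; the abstract does not say which measures the authors use. The domain model is reduced to a hypothetical set of term pairs that experts already consider related, which is a strong simplification of a real domain model. All names below are illustrative assumptions.

```python
def lift(supp_rule, supp_lhs, supp_rhs):
    """Lift of a rule lhs -> rhs: values above 1 indicate positive dependence."""
    return supp_rule / (supp_lhs * supp_rhs)


def classify_by_domain(lhs, rhs, related_pairs):
    """Qualitative classification against a (hypothetical) domain model.

    related_pairs is a set of frozensets {a, b} of terms the model already links.
    A rule is 'known' if every (lhs term, rhs term) pair is linked in the model,
    and a 'candidate' for new knowledge otherwise.
    """
    known = all(frozenset((a, b)) in related_pairs for a in lhs for b in rhs)
    return "known" if known else "candidate"
```

Under these assumptions, a rule with high lift that is also classified as `"candidate"` would be one that an analyst might examine first, since it is statistically notable and not already covered by the domain model.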