Many different types of similarity coefficients have been described in the literature. Since different coefficients take into account different characteristics when assessing the degree of similarity between molecules, it is reasonable to combine them to further optimize the measurement of intermolecular similarity. This paper describes experiments in which data fusion is used to combine several binary similarity coefficients to obtain an overall estimate of similarity for searching databases of bioactive molecules. The results show that search performance can be improved by combining coefficients at little extra computational cost. However, no single combination gives a consistently high performance for all search types.
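The rank-based fusion described above can be sketched as follows. This is an illustrative example, not the paper's exact protocol: the choice of coefficients (Tanimoto and cosine), the rank-sum fusion rule, and the fingerprints are all invented for demonstration.

```python
# Illustrative sketch: fusing rankings from two binary similarity
# coefficients by summing per-coefficient ranks (one simple fusion rule).

def tanimoto(a, b):
    """Tanimoto coefficient for binary fingerprints held as sets of on-bits."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def cosine(a, b):
    """Cosine coefficient for binary fingerprints."""
    return len(a & b) / (len(a) * len(b)) ** 0.5

def fused_ranking(query, database, coefficients):
    """Rank database molecules by the sum of their ranks under each coefficient."""
    rank_sums = {name: 0 for name in database}
    for coeff in coefficients:
        ordered = sorted(database, key=lambda m: coeff(query, database[m]), reverse=True)
        for rank, name in enumerate(ordered):
            rank_sums[name] += rank
    return sorted(database, key=lambda m: rank_sums[m])

# Hypothetical fingerprints: sets of "on" bit positions.
db = {"mol1": {1, 2, 3, 4}, "mol2": {2, 3, 9}, "mol3": {7, 8, 9}}
query = {1, 2, 3}
print(fused_ranking(query, db, [tanimoto, cosine]))  # → ['mol1', 'mol2', 'mol3']
```

Because each coefficient is computed from the same bit-count intersections, adding a second coefficient to the fusion costs almost nothing beyond the extra sort.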
We discuss the size bias inherent in several chemical similarity coefficients when they are used for similarity searching or diversity selection in compound collections. Limits to the upper bounds of 14 standard similarity coefficients are investigated, and the results are used to identify some exceptional characteristics of a few of the coefficients. An additional numerical contribution to the known size bias in the Tanimoto coefficient is identified. Graphical plots with respect to relative bit density are introduced to further assess the coefficients. Our methods reveal the asymmetries inherent in most similarity coefficients that lead to bias in selection, most notably with the Forbes and Russell-Rao coefficients. Conversely, when applied to the recently introduced Modified Tanimoto coefficient, our methods support the view that it is less biased toward molecular size than most. In this work we focus our discussion on fragment-based bit strings, but we demonstrate how our approach can be generalized to continuous representations.
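One source of the size bias discussed above follows from a simple upper bound: for binary fingerprints with a and b bits set sharing c bits, T = c / (a + b − c) can never exceed min(a, b) / max(a, b), whatever the shared bits are. A minimal sketch (the fingerprints are invented for illustration):

```python
# Tanimoto coefficient T = c / (a + b - c), where a and b are the numbers
# of bits set in each fingerprint and c is the number of bits they share.
def tanimoto(x, y):
    inter = len(x & y)
    return inter / (len(x) + len(y) - inter)

# Even when the smaller fingerprint is fully contained in the larger one
# (c = min(a, b)), the score is capped at min(a, b) / max(a, b):
small = set(range(10))   # 10 bits set
large = set(range(50))   # 50 bits set, containing all of `small`
print(tanimoto(small, large))  # → 0.2, i.e. 10/50
```

So a small molecule can never score highly against a much larger one under Tanimoto, regardless of how much of it is matched, which is the kind of asymmetry the graphical analysis above makes visible.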
Plagiarism can take many forms, ranging from copying texts to adopting ideas without giving credit to their originator. This paper presents a new taxonomy of plagiarism that highlights the differences between literal plagiarism and intelligent plagiarism from the plagiarist's behavioral point of view. The taxonomy supports a deep understanding of the different linguistic patterns involved in committing plagiarism, for example, rewriting texts into semantically equivalent forms with different words and organization, shortening texts through concept generalization and specification, and adopting the ideas and important contributions of others. Different textual features that characterize different plagiarism types are discussed. Systematic frameworks and methods of monolingual, extrinsic, intrinsic, and cross-lingual plagiarism detection are surveyed and correlated with the plagiarism types listed in the taxonomy. We conduct an extensive study of state-of-the-art techniques for plagiarism detection, including character n-gram-based (CNG), vector-based (VEC), syntax-based (SYN), semantic-based (SEM), fuzzy-based (FUZZY), structural-based (STRUC), stylometric-based (STYLE), and cross-lingual (CROSS) techniques. Our study corroborates that existing systems for plagiarism detection focus on copied text but fail to detect intelligent plagiarism when ideas are presented in different words.
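The character n-gram (CNG) family surveyed above can be illustrated with a minimal sketch: two texts are compared via the Jaccard overlap of their character trigram sets. The normalization, trigram size, and sample texts are illustrative choices, not the specific configuration of any surveyed system.

```python
# Minimal character n-gram (CNG) comparison: near-copies share most of
# their character trigrams, so literal plagiarism scores very high, while
# an intelligent paraphrase with different wording would not.

def char_ngrams(text, n=3):
    text = " ".join(text.lower().split())   # normalize case and whitespace
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def cng_similarity(doc_a, doc_b, n=3):
    a, b = char_ngrams(doc_a, n), char_ngrams(doc_b, n)
    return len(a & b) / len(a | b) if a | b else 0.0

original = "The coefficient measures similarity between molecules."
copied   = "The coefficient measures similarity between the molecules."
print(cng_similarity(original, copied))  # close to 1 for a near-copy
```

This also illustrates the survey's conclusion: a surface measure like CNG flags copied text reliably, but a semantically equivalent rewrite shares few trigrams and slips through.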
The rapid increase in the flow of published digital information across all disciplines has created a pressing need for techniques that simplify the use of this information. The chemistry literature is very rich in information about chemical entities. Extracting molecules and their related properties and activities from the scientific literature, then text mining the extracted data to determine contextual relationships, helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is the recognition of chemical entities mentioned in the text. In this review, the authors briefly introduce the fundamental concepts of chemical literature mining, the textual contents of chemical documents, and the methods used to name chemicals in documents. We sketch out dictionary-based, rule-based, machine-learning, and hybrid chemical named entity recognition approaches, together with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities they extract.
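The dictionary-based approach mentioned above can be sketched in a toy form: scan the text for the longest token spans that match a chemical-name dictionary. The dictionary, sentence, and two-token span limit are invented for illustration; real systems use large curated lexicons and more careful tokenization.

```python
import re

# Toy dictionary-based chemical named entity recognition:
# longest-match lookup of token spans against a name dictionary.
CHEM_DICT = {"acetic acid", "ethanol", "sodium chloride"}

def dictionary_ner(text):
    tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer
    found, i = [], 0
    while i < len(tokens):
        # try the longest candidate span first (here, up to 2 tokens)
        for span in (2, 1):
            candidate = " ".join(tokens[i:i + span])
            if candidate in CHEM_DICT:
                found.append(candidate)
                i += span
                break
        else:
            i += 1
    return found

print(dictionary_ner("The sample contained ethanol and sodium chloride."))
# → ['ethanol', 'sodium chloride']
```

The sketch also shows the approach's main con discussed in the review: anything outside the dictionary, such as a novel systematic name, is simply missed, which is what motivates the rule-based, machine-learning, and hybrid alternatives.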
A Bayesian inference network (BIN) provides an interesting alternative to existing tools for similarity-based virtual screening. The BIN is particularly effective when the active molecules being sought have a high degree of structural homogeneity but has been found to perform less well with structurally heterogeneous sets of actives. In this paper, we introduce an alternative network model, called a Bayesian belief network (BBN), that seeks to overcome this limitation of the BIN approach. Simulated virtual screening experiments with the MDDR, WOMBAT and MUV data sets show that the BIN and BBN methods allow effective screening searches to be carried out. However, the results obtained are not obviously superior to those obtained using a much simpler approach that is based on the use of the Tanimoto coefficient and of the square roots of fragment occurrence frequencies.
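The simpler baseline mentioned above can be sketched under the assumption that it applies the continuous form of the Tanimoto coefficient to square-rooted fragment occurrence counts; the exact weighting details and the count vectors here are illustrative.

```python
from math import sqrt

# Continuous Tanimoto: T(x, y) = x.y / (x.x + y.y - x.y)
def continuous_tanimoto(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

def sqrt_weights(counts):
    """Damp raw fragment occurrence counts by taking square roots."""
    return [sqrt(c) for c in counts]

query  = sqrt_weights([4, 1, 0, 9])   # hypothetical fragment occurrence counts
target = sqrt_weights([1, 1, 1, 4])
print(continuous_tanimoto(query, target))  # → 0.75
```

The square root damps the influence of very frequent fragments, so the comparison is not dominated by a few high-count features.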
In plagiarism detection (PD) systems, two important problems should be considered: retrieving candidate documents that are globally similar to a document q under investigation, and comparing q side by side with its candidates to pinpoint plagiarized fragments in detail. In this article, the authors investigate the use of the structural information of scientific publications in both problems, and the consideration of citation evidence in the second problem. Three statistical measures, namely Inverse Generic Class Frequency, Spread, and Depth, are introduced to assign a degree of importance (i.e., a weight) to structural components in scientific articles. A term-weighting scheme is adjusted to incorporate component-weight factors, which is used to improve the retrieval of potential sources of plagiarism. A plagiarism screening process is applied based on a measure of resemblance, in which component-weight factors are exploited to ignore less significant or insignificant plagiarism cases. Using the notion of citation evidence, parts with proper citation evidence are excluded, and the remaining cases are treated as suspicious and used to calculate the similarity index. The authors compare their approach to two flat-based baselines: TF-IDF weighting with a cosine coefficient, and shingling with a Jaccard coefficient. In both baselines, they use different comparison units with overlap measures for plagiarism screening. They conducted extensive experiments using a dataset of 15,412 documents divided into 8,657 source publications and 6,755 suspicious queries, which included 18,147 plagiarism cases inserted automatically. Component-weight factors are assessed using precision, recall, and F-measure averaged over a 10-fold cross-validation and compared using the ANOVA statistical test.
Results from structural-based candidate retrieval and plagiarism detection are evaluated statistically against the flat baselines using paired t-tests on 10-fold cross-validation runs, which demonstrate the efficacy of the proposed framework. An empirical study of the system's response shows that, unlike existing plagiarism detectors, the use of structural information helps to flag significant plagiarism cases, improve the similarity index, and provide human-like plagiarism screening results.
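One of the flat baselines above, shingling with a Jaccard coefficient, can be sketched minimally: each document is reduced to its set of overlapping word k-shingles and resemblance is the Jaccard overlap of those sets. The shingle size and the sample sentences are illustrative, not the article's actual comparison units.

```python
# Flat shingling baseline: word k-shingles compared with Jaccard resemblance.

def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

source  = "structural information improves candidate document retrieval"
suspect = "structural information improves candidate retrieval a lot"
print(jaccard(shingles(source), shingles(suspect)))  # → 2/7 ≈ 0.286
```

Being "flat", this treats every shingle alike regardless of where it occurs in the document, which is precisely the limitation the component-weight factors above are designed to address.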