Multi-label learning has become an increasingly active area in the machine learning community because a wide variety of real-world problems are naturally multi-labeled. However, it is not uncommon to find disparities in the number of samples per class, which poses an additional challenge for the learning algorithm. SMOTE is an oversampling technique that has been successfully applied to balance single-labeled data sets, but it has not been used in multi-label frameworks so far. In this work, several strategies for generating synthetic samples are proposed and compared, with the aim of balancing data sets in the training of multi-label algorithms. Results show that a correct selection of seed samples for oversampling improves the classification performance of multi-label algorithms. The uniform generation oversampling strategy provides an efficient methodology for a wide scope of real-world problems.
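As an illustration of the interpolation step that SMOTE-style oversampling relies on, the following is a minimal sketch and not the strategies proposed in the work above. The function name `smote_for_label`, the seed-selection rule (every sample that carries a given label is a seed), and the rule that a synthetic sample inherits the label set of its seed are all assumptions made for this example.

```python
import math
import random


def smote_for_label(X, Y, label, n_new, k=3, rng=None):
    """Generate n_new synthetic samples from seeds that carry `label`.

    X is a list of feature vectors, Y a list of label sets (one per sample).
    Hypothetical helper: seed selection and label assignment are assumptions.
    """
    rng = rng or random.Random(0)
    # Assumption: every sample annotated with `label` may act as a seed.
    seeds = [i for i, labels in enumerate(Y) if label in labels]
    if len(seeds) < 2:
        raise ValueError("need at least two seed samples to interpolate")

    new_X, new_Y = [], []
    for _ in range(n_new):
        i = rng.choice(seeds)
        # The k seeds nearest to sample i (Euclidean distance), excluding i.
        neighbours = sorted(
            (j for j in seeds if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between seed and neighbour
        new_X.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        # Assumption: the synthetic sample inherits the seed's label set.
        new_Y.append(set(Y[i]))
    return new_X, new_Y
```

Because each synthetic point lies on a segment between two seeds, it stays inside the region spanned by the minority samples. How seeds are chosen and which labels a synthetic sample receives are exactly the design choices that differ between multi-label strategies, so the two assumptions above should be treated as placeholders.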
Over the past decade, advances in sensing devices and computer systems have allowed for the proliferation of high-throughput plant phenotyping systems (Das Choudhury et al., 2019). These systems are designed to acquire and analyze a large number of plant traits (Han et al., 2014; Krieger, 2014), including measurements of small structures such as the venation network of leaves (Endler, 1998; Green et al., 2014). However, the characterization of plant roots is more challenging because they are "hidden" in the soil (Atkinson et al., 2019), which limits the types of sensors and techniques that can be applied. Several types of methods have been used to analyze root traits. Non-imaging-based in situ methods estimate
Predicting the sub-cellular localization of a protein can provide useful information for uncovering its molecular functions. To this end, numerous prediction techniques have been developed, most of which rely on global information about the protein or on sequence alignments. However, several studies have shown that the functional nature of proteins is governed by conserved sub-sequence patterns known as domains. In this paper, an alternative methodology (PfamFeat) for gram-positive bacterial sub-cellular localization is developed. PfamFeat is based on information provided by the Pfam database, which stores a series of HMM profiles describing common protein domains. The likelihood that a sequence was generated by a given HMM profile can be used to characterize sequences so that pattern recognition techniques can be applied. Success rates obtained with a simple one-nearest-neighbor classifier demonstrate that this method is competitive with popular sub-cellular prediction algorithms and that it constitutes a promising research direction.
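The classification step described above can be sketched as follows, assuming each protein has already been scored against a fixed set of HMM profiles so that it is represented by one score per profile. This is a minimal illustration rather than the PfamFeat implementation: the function name, the Euclidean distance, the score values, and the location labels are all hypothetical, and computing the scores themselves (for example with a profile-HMM search tool) is not shown.

```python
import math


def nearest_neighbour_predict(train_features, train_labels, query):
    """Return the label of the training vector closest to `query` (1-NN)."""
    best = min(
        range(len(train_features)),
        key=lambda i: math.dist(train_features[i], query),
    )
    return train_labels[best]


# Hypothetical per-profile scores: one row per protein, one column per profile.
train_features = [
    [12.1, -3.0, 0.5],
    [-2.0, 9.4, 1.1],
    [11.0, -1.5, 0.2],
]
# Hypothetical localization labels for the training proteins.
train_labels = ["cytoplasm", "cell membrane", "cytoplasm"]

query = [10.5, -2.0, 0.0]
print(nearest_neighbour_predict(train_features, train_labels, query))  # prints "cytoplasm"
```

The query is closest to the third training vector, so it receives that protein's label. Euclidean distance is only one possible choice here; the abstract does not state which metric was used.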