Background: The number of porcine single nucleotide polymorphisms (SNPs) used in genetic association studies is very large, which suits statistical testing. For breed classification, however, one needs a much smaller set of porcine-classifying SNPs (PCSNPs) that can still accurately classify pigs into breeds. This study sought such a PCSNP set by testing several combinations of feature selection and classification methods. We combined feature selection methods, including information gain, conventional and modified genetic algorithms, and our own frequency feature selection method, with a common classifier, the Support Vector Machine, and evaluated the performance of each combination. Experiments were conducted on a comprehensive data set containing SNPs from native pigs from America, Europe, Africa, and Asia, including Chinese breeds, Vietnamese breeds, and hybrid breeds from Thailand. Results: The best combination of feature selection methods (a hybrid of information gain, modified genetic algorithm, and frequency feature selection) reduced the number of candidate PCSNPs to only 1.62% (164 PCSNPs) of the total number of SNPs (10,210 SNPs) while maintaining a high classification accuracy (95.12%). Moreover, the performance of this PCSNP set was nearly identical to that of larger SNP sets and even of the entire data set. In addition, most PCSNPs were well matched to a set of 94 genes in the PANTHER pathway, conforming to a suggestion by the Porcine Genomic Sequencing Initiative. Conclusions: The best hybrid method provided a sufficiently small number of porcine SNPs that accurately classified swine breeds.
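To make the information gain filter step concrete, the following minimal sketch ranks SNPs by how much knowing a genotype reduces the entropy of the breed label. It is an illustration only, not the authors' implementation: the function names, the 0/1/2 genotype coding, and the toy data are all assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(genotypes, breeds):
    """Reduction in breed-label entropy obtained by knowing one SNP's genotype."""
    total = len(breeds)
    groups = {}
    for genotype, breed in zip(genotypes, breeds):
        groups.setdefault(genotype, []).append(breed)
    conditional = sum(
        len(members) / total * entropy(members) for members in groups.values()
    )
    return entropy(breeds) - conditional


def rank_snps(snp_matrix, breeds, top_k):
    """Return the names of the top_k SNPs by information gain.

    snp_matrix maps a (hypothetical) SNP name to its genotype codes, one per animal.
    """
    scores = {name: information_gain(col, breeds) for name, col in snp_matrix.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data: four animals, two breeds, genotypes coded 0/1/2 (assumed coding).
breeds = ["A", "A", "B", "B"]
snps = {
    "snp_perfect": [0, 0, 2, 2],  # separates the breeds completely
    "snp_noise": [0, 1, 0, 1],    # carries no breed information
    "snp_partial": [0, 0, 0, 2],  # partially informative
}
top = rank_snps(snps, breeds, top_k=2)
```

In a pipeline of the kind described, a ranking like this would serve only as a first filter; the retained SNPs would then be passed to the wrapper stage (the genetic algorithm with a classifier in the loop).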
A large panel of common single nucleotide polymorphisms (SNPs) distributed across the entire porcine genome has been widely used to represent the genetic variability of pigs. With the advent of SNP-array technology, a genome-wide genetic profile of a specimen can be obtained easily. Within this large number of variants, there exists a much smaller subset of the SNP panel that can be used equally well to identify the corresponding breed. This work presents a SNP selection heuristic that yields such a subset while remaining effective for breed classification. The features were selected by combining a filter method and a wrapper method (information gain and a genetic algorithm, respectively) plus a feature frequency selection step, while classification used a support vector machine. We were able to reduce the number of significant SNPs to 0.86% of the total number of SNPs in a swine dataset with 94.80% classification accuracy.
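The abstract does not define the feature frequency selection step. One plausible reading, offered here purely as an assumption, is that a stochastic selector such as a genetic algorithm is run several times and only the SNPs chosen in a sufficiently large fraction of runs are kept. The sketch below illustrates that idea; `frequency_select`, `min_fraction`, and the toy runs are hypothetical.

```python
from collections import Counter


def frequency_select(selected_sets, min_fraction):
    """Keep features that appear in at least min_fraction of the selection runs.

    selected_sets: one collection of feature names per run (assumed input format).
    """
    # Count each feature at most once per run.
    counts = Counter(f for run in selected_sets for f in set(run))
    threshold = min_fraction * len(selected_sets)
    return sorted(f for f, c in counts.items() if c >= threshold)


# Toy example: SNPs picked by four independent (hypothetical) selection runs.
runs = [
    ["snp1", "snp2", "snp3"],
    ["snp1", "snp3", "snp4"],
    ["snp1", "snp2", "snp5"],
    ["snp1", "snp3", "snp6"],
]
stable = frequency_select(runs, min_fraction=0.75)
```

Under this reading, the threshold trades panel size against stability: a higher `min_fraction` gives a smaller panel of SNPs that were selected more consistently.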