Classifying samples into categories is a common problem in analytical chemistry and other fields. Classification is usually based on only one method, but numerous classifiers are available, some complex, such as neural networks, and others simple, such as k nearest neighbors. Regardless, most classification schemes require optimization of one or more tuning parameters for best classification accuracy, sensitivity, and specificity. A process not requiring exact selection of tuning parameter values would be useful. To improve classification, several ensemble approaches have been used in past work to combine classification results from multiple optimized single classifiers. The collection of classifications for a particular sample is then combined by a fusion process such as majority vote to form the final classification. Presented in this Article is a method to classify a sample by combining multiple classification methods without specifically classifying the sample by each method, that is, the classification methods are not optimized. The approach is demonstrated on three analytical data sets. The first is a beer authentication set with samples measured on five instruments, allowing fusion of multiple instruments in three ways. The second data set is composed of textile samples from three classes based on Raman spectra. This data set is used to demonstrate the ability to classify simultaneously with different data preprocessing strategies, thereby reducing the need to determine the ideal preprocessing method, a common prerequisite for accurate classification. The third data set contains three wine cultivars for three classes characterized by 13 unique chemical and physical variables. In all cases, fusion of nonoptimized classifiers improves classification. Also presented are atypical uses of Procrustes analysis and extended inverted signal correction (EISC) for distinguishing sample similarities to respective classes.
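The majority-vote fusion step described above can be sketched in a few lines. This is a minimal illustration, not the Article's implementation; the labels below are invented for demonstration.

```python
import numpy as np

def majority_vote(predictions):
    """Fuse class labels from multiple classifiers by majority vote.

    predictions: 2D array-like, rows = classifiers, columns = samples.
    Returns the most frequent label per sample (ties broken by the
    lowest label, which is np.unique's sort order).
    """
    predictions = np.asarray(predictions)
    fused = []
    for col in predictions.T:                     # one column per sample
        labels, counts = np.unique(col, return_counts=True)
        fused.append(labels[np.argmax(counts)])   # most common label wins
    return np.array(fused)

# Hypothetical labels from three classifiers for four samples
votes = [[0, 1, 1, 2],
         [0, 1, 0, 2],
         [1, 1, 0, 2]]
print(majority_vote(votes))  # -> [0 1 0 2]
```

In the ensemble setting the Article improves on, each row would come from a separately optimized classifier; the Article's contribution is to skip that per-classifier optimization.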
A significant and common problem in analytical chemistry is determining if a sample belongs to a specific class, e.g., establishing if a food product is genuine or counterfeit or a tissue sample is benign or malignant. This problem is termed one-class classification (class modeling). A complication of class modeling is deciding which one-class classifier to use, followed by the challenge of optimizing the chosen classifier (identifying the best tuning parameter value(s)). With spectroscopic data, two other conundrums arise: which data preprocessing method(s) and spectral region(s) to use. Presented in this paper is a hybrid fusion process that can combine nonoptimized classifiers across multiple instruments, preprocessing methods, and measurements. Instead of optimizing classifiers, a window of tuning parameters is used for each classifier. The flexible fusion method of sum of ranking differences (SRD) is applied to combine all assessment values. Defining the best SRD ranking value (threshold) for determining class membership is the one tuning parameter value needed. However, this SRD ranking value is automatically optimized by using a receiver operating characteristic (ROC) curve. The approach is demonstrated on two analytical data sets. The first is a beer authentication sample set measured on five instruments: near-infrared, mid-infrared (MIR), ultraviolet, visible, and thermogravimetric. Three different fusion protocols of all five instruments are demonstrated. The second data set is MIR spectra of strawberry puree with two categories: strawberry puree and nonstrawberry puree. Fusing nonoptimized classifiers provides reliable classifications relative to accuracy, sensitivity, and specificity.
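The core SRD computation is simple: rank the objects by each candidate's assessment values, rank them by a reference (commonly the row-wise consensus such as the average), and sum the absolute rank differences. A minimal sketch, with invented assessment values; the paper's full pipeline adds validation and the ROC-optimized threshold:

```python
import numpy as np
from scipy.stats import rankdata

def srd(values, reference):
    """Sum of ranking differences: compare the rank ordering implied by
    one candidate's values against the rank ordering of a reference
    (here, the consensus average). Lower SRD = closer to consensus."""
    return np.abs(rankdata(values) - rankdata(reference)).sum()

# Hypothetical assessment values from three classifiers over five samples
scores = np.array([[0.9, 0.2, 0.7, 0.4, 0.6],
                   [0.8, 0.3, 0.6, 0.5, 0.7],
                   [0.4, 0.9, 0.2, 0.8, 0.3]])
reference = scores.mean(axis=0)  # consensus reference
for i, row in enumerate(scores):
    print(f"classifier {i}: SRD = {srd(row, reference)}")
```

The third classifier disagrees most with the consensus ordering and receives the largest SRD, which is how SRD flags outlying assessments during fusion.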
Developing spectroscopic calibration models requires calibration samples that mimic new sample compositions, as well as measurement conditions, as closely as possible. This requirement is known as matrix matching calibration samples to new samples, that is, samples are matrix matched chemically, physically, and instrumentally. To accomplish this task, calibration sets have large sample numbers to span the expected sample matrix variations. This large range of calibration variability can result in poor performance. Preferred is a calibration set distinctly matched to the new samples. However, assessing whether each sample in a particular calibration set is appropriately matched to new samples relative to the specific analyte content and all other constituents is not an easy task. It is well documented that even though calibration samples are spectral matches to new sample spectra (have similar measured spectra), the calibration set is usually not fully matrix matched to new sample compositions. For example, using a spectral similarity measure such as Euclidean distance, the same calibration samples are deemed spectral matches to new samples regardless of the analyte of interest. This work presents a process to assess underlying sample matrix interactions between calibration model regression vectors and new sample spectra, allowing fully matrix matched samples to be identified. The process is general and applicable to other situations such as matching historical batch processing data where reference values are not known for new samples (unlabeled). Two data sets are used to demonstrate the functionality of the process. One consists of nuclear magnetic resonance spectra for mixtures of three alcohols and the other is near-infrared corn spectra with four prediction properties measured on three instruments. General trends are reported for a few of the possible data situations.
Calibration samples identified as matrix matched to new samples are shown to predict the new samples with the lowest prediction errors.
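The analyte-agnostic limitation of plain spectral matching is easy to see in code: ranking calibration spectra by Euclidean distance to a new spectrum uses only the spectra, so the same samples are selected no matter which property is being predicted. A minimal sketch with simulated spectra (the data and dimensions are invented, and this is the baseline the work improves on, not the proposed regression-vector process):

```python
import numpy as np

def nearest_calibration_samples(X_cal, x_new, k=5):
    """Rank calibration spectra by Euclidean distance to a new spectrum.
    Note the ranking depends only on the spectra, never on the analyte,
    which is why spectral matching alone cannot guarantee matrix matching."""
    d = np.linalg.norm(X_cal - x_new, axis=1)  # distance to each calibration row
    order = np.argsort(d)
    return order[:k], d[order[:k]]

rng = np.random.default_rng(0)
X_cal = rng.normal(size=(20, 100))   # 20 simulated calibration spectra, 100 wavelengths
x_new = rng.normal(size=100)         # one simulated new sample spectrum
idx, dist = nearest_calibration_samples(X_cal, x_new, k=3)
print(idx, dist)                     # closest 3 calibration samples and distances
```

The proposed process instead examines interactions between model regression vectors (which are analyte-specific) and the new spectra, so different calibration subsets can be flagged as matched for different analytes.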
Leveraging Multiple Linear Regression for Wavelength Selection

Abstract: In multivariate calibration, wavelength selection is often used to lower prediction errors of sample properties. As a result, many methods have been created to select wavelengths. Several of these methods involve many tuning parameters that are typically complex or difficult to work with. The purpose of this poster is to show a simple way to select wavelengths using few tuning parameters. The proposed method uses multiple linear regression (MLR) as an indicator of which wavelengths should be used to create a model. From a collection of random MLR models, those with an acceptable bias/variance balance are evaluated to determine the wavelengths most frequently used. Portions of the most frequently selected wavelengths are chosen as the final MLR-selected wavelengths. These selected wavelengths are then used to produce a calibration model by the method of partial least squares (PLS). The proposed wavelength selection method is compared to PLS models containing all wavelengths using several near-infrared data sets. The PLS models with the selected wavelengths show an improvement in prediction error, suggesting this method as a simple way to select wavelengths.

Objectives:
• Models are formed using MLR
• Wavelengths of filtered models are collected
• Partial least squares (PLS) models are formed using the selected wavelengths

Conclusions:
• MLR wavelength selection helps form improved calibration models
• Generally does better than all-wavelength PLS
• Most data sets chose banded wavelengths; gasoline did not (larger L2 norm)
• Tuning parameters: the goal was to limit their number; of the five, only two need to be changed
• Gasoline needs adjustment to improve
• The proposed method is successful and can be used for wavelength selection
A rising problem in the food industry is identifying the origin and/or purity of food products, and classification is often used for the identification. The process involves determining whether a food sample is similar to a collection of samples characterizing a known food product (is the food product a class member?). However, classification requires the analyst to face a myriad of complex decisions. With modern instrumentation, classification based on conventional data fusion processes increases the assortment of decisions. Typically, testing food products without data fusion involves (1) selection of an instrument or measuring device, (2) determining suitable measurement variables, for example, sensors or wavelengths for spectral data, (3) data preprocessing optimization, and (4) selection of a classification method followed by identifying the best tuning parameter setup for that classifier. If data fusion is desired, optimization of combined variations of (1)-(4) is required. This paper overviews a recent approach that simplifies data fusion decisions for food authentication and adulteration classification purposes. This unique self-optimized fusion method avoids the confounding decisions by simultaneously evaluating any number of permutations of (1)-(3) in combination with a collection of non-optimized classifiers based on respective tuning parameter windows for (4). Self-optimization is obtained by using the sum of ranking differences (SRD) to form the final fused class membership decision for each new food test sample, where the SRD is automatically optimized using a receiver operating characteristic (ROC) curve. The simple automatic SRD hybrid fusion/ROC curve process removes the myriad of subjective decisions imposed on the analyst and increases food classification reliability. Results are demonstrated with three datasets.
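The ROC-based automatic optimization can be illustrated with a standard threshold sweep. A minimal sketch: Youden's J statistic is a common ROC criterion used here as a stand-in for the paper's optimization, the scores and labels are invented, and lower fused ranking values are assumed to indicate class membership.

```python
import numpy as np

def best_threshold(scores, labels):
    """Choose a decision threshold on fused ranking values by maximizing
    Youden's J = sensitivity + specificity - 1 over an ROC sweep.
    labels: 1 = class member, 0 = non-member.
    Lower scores are assumed to indicate class membership."""
    best_j, best_t = -1.0, None
    for t in np.unique(scores):            # candidate thresholds, ascending
        pred = scores <= t                 # classify as member if at/below threshold
        sens = np.sum(pred & (labels == 1)) / np.sum(labels == 1)
        spec = np.sum(~pred & (labels == 0)) / np.sum(labels == 0)
        j = sens + spec - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Hypothetical fused ranking values for three members and three non-members
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
labels = np.array([1, 1, 1, 0, 0, 0])
print(best_threshold(scores, labels))  # perfectly separable toy case
```

Sweeping every observed score as a candidate threshold is exactly how an empirical ROC curve is traced, so the chosen threshold is the ROC operating point closest to ideal sensitivity and specificity.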