New social and economic activities massively exploit big data and machine learning algorithms to make inferences about people's lives. Applications include automatic curricula evaluation, wage determination, and risk assessment for credits and loans. Recently, many governments and institutions have raised concerns about the lack of fairness, equity and ethics in machine learning applied to these problems. It has been shown that excluding sensitive features that bias fairness, such as gender or race, is not enough to mitigate discrimination when other related features are included. Instead, including fairness in the objective function has been shown to be more efficient. We present novel fair regression and dimensionality reduction methods built on a previously proposed fair classification framework. Both methods use the Hilbert-Schmidt independence criterion as the fairness term. Unlike previous approaches, this allows us to simplify the problem and to use multiple sensitive variables simultaneously. Replacing the linear formulation with kernel functions allows the methods to deal with nonlinear problems. For both linear and nonlinear formulations, the solution reduces to simple matrix inversions or generalized eigenvalue problems. This simplifies the evaluation of the solutions for different trade-off values between the predictive error and fairness terms. We illustrate the usefulness of the proposed methods on toy examples, and evaluate their performance on real-world datasets: income prediction with gender and/or race as sensitive variables, and contraceptive method prediction under demographic and socio-economic sensitive descriptors.
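The Hilbert-Schmidt independence criterion mentioned above can be illustrated with its standard biased empirical estimator, trace(KHLH)/(n-1)^2, where K and L are kernel matrices and H is the centering matrix. The following is a minimal sketch, not the authors' implementation: the Gaussian kernel, the bandwidths, the synthetic data, and the function names `rbf_kernel` and `hsic` are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    """Gaussian kernel matrix for samples in x (1-D or n x d array)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def hsic(x, s, sigma_x=1.0, sigma_s=1.0):
    """Biased empirical HSIC between x and s: trace(K H L H) / (n - 1)^2."""
    n = len(x)
    K = rbf_kernel(x, sigma_x)
    L = rbf_kernel(s, sigma_s)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Hypothetical data: s plays the role of a sensitive variable.
rng = np.random.default_rng(0)
s = rng.normal(size=200)
dependent = s + 0.1 * rng.normal(size=200)   # strongly tied to s
independent = rng.normal(size=200)           # unrelated to s

hsic_dep = hsic(dependent, s)
hsic_ind = hsic(independent, s)
print(hsic_dep, hsic_ind)
```

A value near zero suggests that the two variables are close to independent, which is why a term of this kind can act as a penalty on dependence between predictions and sensitive variables.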
Current remote sensing image classification problems have to deal with an unprecedented amount of heterogeneous and complex data sources. Upcoming missions will soon provide large data streams that will make land cover/use classification difficult. Machine learning classifiers can help with this, and many methods are currently available. A popular kernel classifier is the Gaussian process classifier (GPC), since it approaches the classification problem with a solid probabilistic treatment, thus yielding confidence intervals for the predictions as well as results that are very competitive with state-of-the-art neural networks and support vector machines. However, its computational cost is prohibitive for large-scale applications, and constitutes the main obstacle to wide adoption. This paper tackles this problem by introducing two novel efficient methodologies for Gaussian process (GP) classification. We first incorporate the standard random Fourier features approximation into GPC, which greatly decreases its computational cost and permits large-scale remote sensing image classification. In addition, we propose a model that avoids randomly sampling a number of Fourier frequencies and instead learns the optimal ones within a variational Bayes approach. The performance of the proposed methods is illustrated on complex problems of cloud detection from multispectral imagery and infrared sounding data. Excellent empirical results support the proposal in terms of both computational cost and accuracy.
Developing accurate models of crop stress, phenology and productivity is of paramount importance, given the increasing demand for food. Earth observation (EO) remote sensing data provide a unique source of information to monitor crops in a temporally resolved and spatially explicit way. In this study, we propose the combination of multisensor (optical and microwave) remote sensing data for crop yield estimation and forecasting using two novel approaches. We first propose the lag between the Enhanced Vegetation Index (EVI) derived from MODIS and the Vegetation Optical Depth (VOD) derived from SMAP as a new joint metric that combines the information from the two satellite sensors in a single feature or descriptor. Our second approach avoids summary statistics and uses machine learning to combine the full time series of EVI and VOD. This study considers two statistical methods, a regularized linear regression and its nonlinear extension, kernel ridge regression, to directly estimate the county-level surveyed total production, as well as individual yields of the major crops grown in the region: corn, soybean and wheat. The study area includes the US Corn Belt, and we use agricultural survey data from the National Agricultural Statistics Service (USDA-NASS) for the year 2015 for quantitative assessment.
Results show that (1) the proposed EVI-VOD lag metric correlates well with crop yield and outperforms common single-sensor metrics for crop yield estimation; (2) the statistical (machine learning) models working directly with the time series largely improve results compared to previously reported estimations; (3) the combined exploitation of information from the optical and microwave data leads to improved predictions over single-sensor approaches, with coefficient of determination R² ≥ 0.76; (4) when models are used for within-season forecasting with limited time information, crop yield prediction is feasible up to four months before harvest (models reach a plateau in accuracy); and (5) the robustness of the approach is confirmed in a multi-year setting, reaching performance similar to that obtained with single-year data. In conclusion, results confirm the value of using both EVI and VOD at the same time, and the advantage of using automatic machine learning models for crop yield/production estimation.
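Kernel ridge regression, one of the two statistical methods named in this abstract, has a well-known closed-form solution: the dual weights solve (K + λI)α = y. The sketch below illustrates that formula only. It is an assumption-laden toy, not the study's pipeline: the Gaussian kernel, the values of `lam` and `sigma`, the synthetic one-dimensional data, and the function names are hypothetical and do not involve EVI or VOD time series.

```python
import numpy as np

def rbf(a, b, sigma):
    """Gaussian kernel between the rows of a (n x d) and b (m x d)."""
    d2 = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def krr_fit(X, y, lam=1e-2, sigma=1.0):
    """Solve (K + lam * I) alpha = y for the dual weights alpha."""
    K = rbf(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    """Predict as a kernel-weighted combination of training points."""
    return rbf(X_new, X_train, sigma) @ alpha

# Hypothetical toy problem: recover a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=100)

alpha = krr_fit(X, y, lam=1e-2, sigma=1.0)
X_test = np.linspace(-2.5, 2.5, 50)[:, None]
pred = krr_predict(X, alpha, X_test, sigma=1.0)
rmse = np.sqrt(np.mean((pred - np.sin(X_test[:, 0])) ** 2))
print(rmse)
```

Replacing the Gaussian kernel with a linear one, K = X Xᵀ, gives the dual form of regularized linear (ridge) regression, which is the sense in which the kernel version is its nonlinear extension.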
In preparation for new-generation imaging spectrometer missions and the accompanying unprecedented inflow of hyperspectral data, optimized models are needed to generate vegetation traits routinely. Hybrid models, combining radiative transfer models with machine learning algorithms, are preferred; however, dealing with spectral collinearity imposes an additional challenge. In this study, we analyzed two spectral dimensionality reduction methods, principal component analysis (PCA) and band ranking (BR), embedded in a hybrid workflow for the retrieval of specific leaf area (SLA), leaf area index (LAI), canopy water content (CWC), canopy chlorophyll content (CCC), the fraction of absorbed photosynthetically active radiation (FAPAR), and fractional vegetation cover (FVC). The SCOPE model was used to simulate training data sets, which were optimized with active learning. Gaussian process regression (GPR) algorithms were trained over the simulations to obtain trait-specific models. The inclusion of PCA and BR with 20 features led to the so-called GPR-20PCA and GPR-20BR models. The GPR-20PCA models encompassed over 99.95% cumulative variance of the full spectral data, while the GPR-20BR models were based on the 20 most sensitive bands. Validation against in situ data yielded moderate to optimal results, with normalized root mean squared error (NRMSE) from 13.9% (CWC) to 22.3% (CCC) for GPR-20PCA models, and NRMSE from 19.6% (CWC) to 29.1% (SLA) for GPR-20BR models. Overall, the GPR-20PCA models slightly outperformed the GPR-20BR models for all six variables. To demonstrate mapping capabilities, both models were tested on a PRecursore IperSpettrale della Missione Applicativa (PRISMA) scene, spectrally resampled to the Copernicus Hyperspectral Imaging Mission for the Environment (CHIME), over an agricultural test site (Jolanda di Savoia, Italy).
The two strategies obtained plausible spatial patterns, and consistency between the two models was highest for FVC and LAI (R² = 0.91 and R² = 0.86, respectively) and lowest for SLA mapping (R² = 0.53). From these findings, we recommend implementing GPR-20PCA models as the most efficient strategy for the retrieval of multiple crop traits from hyperspectral data streams. Hence, this workflow will support and facilitate the preparation of trait retrieval models for the next-generation operational CHIME mission.
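The PCA step described in this abstract compresses collinear spectral bands into a few components that retain nearly all of the variance. The sketch below shows one way to do this with an SVD. Note the assumptions: the study fixes 20 components (which covered over 99.95% of the variance), whereas this sketch picks the smallest number of components reaching a 99.95% threshold; the synthetic "spectra", the function name `pca_fit`, and all sizes are hypothetical. The resulting scores would then serve as inputs to a regressor such as GPR, which is not shown here.

```python
import numpy as np

def pca_fit(X, var_threshold=0.9995):
    """Return the mean, the leading principal axes, and the cumulative
    explained-variance ratio for the smallest k reaching var_threshold."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal axes; S**2 is proportional to the variances.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    cum = np.cumsum(S**2 / np.sum(S**2))
    k = min(int(np.searchsorted(cum, var_threshold)) + 1, len(S))
    return mean, Vt[:k], cum[:k]

# Hypothetical collinear "spectra": 100 bands driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 100))
X = latent @ loadings + 0.01 * rng.normal(size=(200, 100))

mean, components, cum = pca_fit(X)
scores = (X - mean) @ components.T  # low-dimensional features
print(components.shape[0], cum[-1])
```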
Current remote sensing applications of biophysical parameter estimation and image classification have to deal with an unprecedentedly large amount of heterogeneous and complex data sources. A growing number of new satellite sensors with improved temporal, spatial and spectral resolutions give rise to challenging computational problems. Standard physical inversion techniques cannot cope efficiently with this new scenario. Land cover classification of the new image sources has also turned out to be a complex problem requiring large amounts of memory and processing time. To cope with these problems, statistical learning has greatly helped in recent years to develop statistical retrieval and classification models that can ingest large amounts of Earth observation data. Kernel methods constitute a family of powerful machine learning algorithms, which have found wide use in remote sensing and geosciences. However, kernel methods are still not widely adopted for large-scale problems because of their high computational cost, as in the inversion of radiative transfer models or the classification of high spatial-spectral-temporal resolution data. This paper introduces an efficient kernel method for fast statistical retrieval of bio-geo-physical parameters and image classification problems. The method approximates a kernel matrix with a set of projections on random bases sampled from the Fourier domain. The method is simple, computationally very efficient in both memory and processing costs, and easily parallelizable. We show that kernel regression and classification are now possible for datasets with millions of examples and high dimensionality.
Examples of atmospheric parameter retrieval from hyperspectral infrared sounders like IASI/Metop; large-scale emulation and inversion of the familiar PROSAIL radiative transfer model on Sentinel-2 data; and the identification of clouds over landmarks in time series of MSG/SEVIRI images show the efficiency and effectiveness of the proposed technique.
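The idea of approximating a kernel matrix with projections on random bases sampled from the Fourier domain can be sketched with the standard random Fourier features construction for a Gaussian kernel: frequencies are drawn from a normal distribution with standard deviation 1/σ, and the feature map is sqrt(2/D)·cos(XW + b). This is a minimal illustration under stated assumptions, not the paper's code: the Gaussian kernel, the bandwidth, the number of features, the synthetic data and the function names are all hypothetical.

```python
import numpy as np

def rff_features(X, n_features, sigma, rng):
    """Random Fourier features whose inner products approximate the
    Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)       # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def rbf_kernel(X, sigma):
    """Exact Gaussian kernel matrix, used here only for comparison."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
sigma = 2.0

Z = rff_features(X, n_features=5000, sigma=sigma, rng=rng)
K_approx = Z @ Z.T          # approximate kernel matrix
K_exact = rbf_kernel(X, sigma)
mean_abs_err = np.mean(np.abs(K_approx - K_exact))
print(mean_abs_err)
```

Because a linear model can be trained directly on `Z`, the n × n kernel matrix never has to be formed or inverted, and the cost grows linearly with the number of samples for a fixed number of features.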