Summary: A common problem in multi-environment trials arises when some genotype-by-environment combinations are missing. In Arciniegas-Alarcón et al. (2010) we outlined a method of data imputation to estimate the missing values, the computational algorithm for which was a mixture of regression and lower-rank approximation of a matrix based on its singular value decomposition (SVD). In the present paper we provide two extensions to this methodology, by including weights chosen by cross-validation and allowing multiple as well as simple imputation. The three methods are assessed and compared in a simulation study, using a complete set of real data in which values are deleted randomly at different rates. The quality of the imputations is evaluated using three measures: the Procrustes statistic, the squared correlation between matrices and the normalised root mean squared error between these estimates and the true observed values. None of the methods makes any distributional or structural assumptions, and all of them can be used for any pattern or mechanism of the missing values.
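The idea of imputing through a lower-rank SVD approximation can be illustrated with a minimal sketch. This is a generic iterative low-rank fill-in, not the regression/SVD mixture or the weighted and multiple-imputation extensions described in the summary; the function name `svd_impute`, the default rank, the starting values (column means) and the stopping rule are all assumptions made for illustration. It also assumes every column has at least one observed value.

```python
import numpy as np

def svd_impute(X, rank=1, max_iter=500, tol=1e-8):
    """Fill NaN cells of X by iterating a rank-`rank` SVD approximation.

    Hypothetical helper: a generic low-rank fill-in, not the published method.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Starting values: mean of the observed entries in each column.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Lower-rank approximation built from the leading singular triplets.
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Observed cells are kept as they are; only missing cells are updated.
        new_filled = np.where(missing, approx, X)
        change = np.max(np.abs(new_filled - filled)) if missing.any() else 0.0
        filled = new_filled
        if change < tol:
            break
    return filled
```

A call such as `svd_impute(Y_with_nans, rank=2)` returns a completed copy of the genotype-by-environment table; the rank would in practice be chosen by some criterion such as cross-validation.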
A common problem in the analysis of data from multi‐environment trials is imbalance caused by missing observations. To get around this problem, Yan proposed a method for imputing the missing values based on the singular‐value decomposition (SVD) of a matrix. However, this SVD can be affected by outliers and produce low‐quality imputations. In this article, we propose four extensions of the Yan method that are resistant to outliers, replacing the standard SVD method with four robust SVD extensions. We evaluate these methods using exclusively numerical criteria, in a simulation study and in a cross‐validation study based on real data. We conclude that in the presence of outliers, the standard SVD method should not be used; instead, the best alternatives are the robust SVD methods based on sub‐sampling when the percentage of contamination is less than 2% and the missing data follow a completely random mechanism. In any other case, methods that either minimize the L2 norm or that involve L1 regressions are preferable.
ABSTRACT. Climate data imputation using the singular value decomposition: an empirical comparison. A common problem in climate data is missing information. Recently, four imputation methods have been developed that are based on the singular value decomposition (SVD) of a matrix. The aim of this paper is to evaluate these new developments by comparing them in a simulation study based on two complete matrices of real data. One matrix corresponds to the historical precipitation of Piracicaba/SP, Brazil, and the other to multivariate meteorological characteristics in the same city from 1997 to 2012. In the study, values were deleted randomly at different percentages and then imputed, and the methodologies were compared by three criteria: the normalized root mean squared error, the Procrustes similarity statistic and the nonparametric Spearman correlation coefficient. It was concluded that the SVD should be used only when multivariate matrices are analyzed; for precipitation matrices, imputation by the monthly mean outperforms the SVD-based methods. Keywords: imputation, SVD, missing values.
[...] due to various reasons, such as failures of measuring instruments, extreme weather conditions and data-entry errors. A very common way of analyzing data from studies with missing information is to impute the missing observations and then apply classical procedures to the completed data (observed + imputed). A method widely used in the literature is to impute with the mean.
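The simulation design described here (random deletion at a given percentage, imputation, then comparison with the true values) can be sketched as below. The helper names `delete_at_random`, `impute_column_mean` and `nrmse` are hypothetical, the columns are assumed to play the role of months, and the NRMSE is taken here as the root mean squared error over the deleted cells divided by the standard deviation of their true values, which is one common definition and may differ from the one used in the paper.

```python
import numpy as np

def delete_at_random(X, fraction, rng):
    """Return a copy of X with about `fraction` of its cells set to NaN."""
    X = np.asarray(X, dtype=float)
    n_delete = int(round(fraction * X.size))
    idx = rng.choice(X.size, size=n_delete, replace=False)
    mask = np.zeros(X.size, dtype=bool)
    mask[idx] = True
    out = X.copy()
    out[mask.reshape(X.shape)] = np.nan
    return out

def impute_column_mean(X):
    """Replace each NaN by the mean of the observed values in its column.

    Assumes every column keeps at least one observed value.
    """
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def nrmse(imputed, truth, missing):
    """Assumed definition: RMSE over deleted cells / SD of their true values."""
    err = imputed[missing] - truth[missing]
    return float(np.sqrt(np.mean(err ** 2)) / np.std(truth[missing]))

# Illustrative data only: a made-up 12 x 6 matrix, not the Piracicaba series.
rng = np.random.default_rng(0)
truth = rng.normal(loc=100.0, scale=20.0, size=(12, 6))
incomplete = delete_at_random(truth, 0.10, rng)
completed = impute_column_mean(incomplete)
score = nrmse(completed, truth, np.isnan(incomplete))
```

Repeating the deletion step many times for each percentage and averaging `score` gives the kind of comparison the abstract reports; any other imputation routine can be swapped in for `impute_column_mean`.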
This paper proposes five new imputation methods for unbalanced experiments with genotype-by-environment interaction (G × E). The methods use cross-validation by eigenvector, based on an iterative scheme with the singular value decomposition (SVD) of a matrix. To test the methods, we performed a simulation study using three complete matrices of real data, obtained from G × E interaction trials of peas, cotton, and beans, and introducing lack of balance by randomly deleting in turn 10%, 20%, and 40% of the values in each matrix. The quality of the imputations was evaluated with the additive main effects and multiplicative interaction model (AMMI), using the root mean squared predictive difference (RMSPD) between the genotypes and environmental parameters of the original data set and the set completed by imputation. The proposed methodology does not make any distributional or structural assumptions and does not have any restrictions regarding the pattern or mechanism of missing values.
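The AMMI-based evaluation can be sketched as follows, assuming a complete genotype (rows) by environment (columns) table of means. The AMMI fit shown is the standard one: additive main effects from row and column means, then an SVD of the doubly centred residuals. The names `ammi_fit`, `ammi_predict` and `rmspd` are hypothetical, and the RMSPD is taken here simply as the root mean squared difference between two parameter vectors, which may not match the paper's exact definition.

```python
import numpy as np

def ammi_fit(Y, n_terms=2):
    """Fit an AMMI model with `n_terms` multiplicative interaction terms."""
    Y = np.asarray(Y, dtype=float)
    mu = Y.mean()                      # grand mean
    g = Y.mean(axis=1) - mu            # genotype main effects
    e = Y.mean(axis=0) - mu            # environment main effects
    # Interaction residuals after removing the additive part.
    resid = Y - mu - g[:, None] - e[None, :]
    U, s, Vt = np.linalg.svd(resid, full_matrices=False)
    return mu, g, e, U[:, :n_terms], s[:n_terms], Vt[:n_terms].T

def ammi_predict(mu, g, e, U, s, V):
    """Rebuild the table from the additive and multiplicative parts."""
    return mu + g[:, None] + e[None, :] + (U * s) @ V.T

def rmspd(a, b):
    """Assumed definition: root mean squared difference of two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

In a study like the one described, `ammi_fit` would be applied both to the original matrix and to the matrix completed by imputation, and `rmspd` to the matching parameter vectors (for example the genotype effects `g`). Singular vectors are only defined up to sign, so their signs need to be aligned before the interaction scores are compared.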