A real data-driven simulation strategy to select an imputation method for mixed-type trait data

May, Jacqueline A.; Feng, Zeny; Adamowicz, Sarah J.

doi:10.1101/2022.05.03.490388

Cited by 3 publications (3 citation statements)

References 80 publications (162 reference statements)


“…Therefore, while our results agree with others that random forest models (as implemented by the missForest R function) are an accurate imputation method for trait data (Johnson et al, 2021), care should be taken to ensure use of imputation is appropriate. Our findings regarding the utility of imputation are only applicable to continuous trait imputation, as the efficacy of categorical traits imputation was not explored (although see May et al, 2023), and to large trait data sets on the scale of hundreds or thousands of species rather than tens. The utility of imputation in tackling missing and biased data has been shown to depend on the correlation between traits, and extent of phylogenetic autocorrelation (Clavel et al, 2015).…”

Section: Discussion (mentioning)

confidence: 99%

“…We tested two ways of dealing with the generated incomplete data sets: (1) removal of species with missing data (complete case analysis) and (2) filling data gaps through imputation. We used missForest imputation, implemented through the missForest (Stekhoven & Bühlmann, 2012), due to its demonstrated accuracy (Hong & Lynn, 2020; May et al, 2023; Penone et al, 2014), and fast computation times. Accounting for phylogenetic relatedness between species can improve imputation accuracy (May et al, 2023; Penone et al, 2014).…”

Section: Methods (mentioning)

confidence: 99%

“…We used missForest imputation, implemented through the missForest (Stekhoven & Bühlmann, 2012), due to its demonstrated accuracy (Hong & Lynn, 2020; May et al, 2023; Penone et al, 2014), and fast computation times. Accounting for phylogenetic relatedness between species can improve imputation accuracy (May et al, 2023; Penone et al, 2014). One way of including phylogenetic data is through eigenvectors (Debastiani et al, 2021).…”

Section: Methods (mentioning)

confidence: 99%


Functional diversity metrics can perform well with highly incomplete data sets

Stewart, Carmona, Clements et al. 2023. Methods Ecol Evol


Characterising changes in functional diversity at large spatial scales provides insight into the impact of human activity on ecosystem structure and function. However, the approach is often based on trait data sets that are incomplete and unrepresentative, with uncertain impacts on functional diversity estimates. To address this knowledge gap, we simulated random and biased removal of data from three empirical trait data sets: an avian data set (9579 species), a plant data set (2185 species) and a crocodilian data set (25 species). For these data sets, we assessed whether functional diversity metrics were robust to data incompleteness with and without using imputation to fill data gaps. We compared two metrics each calculated with two methods: functional richness (calculated with convex hulls and trait probability densities) and functional divergence (calculated with distance‐based Rao and trait probability densities). Without imputation, estimates of functional diversity (richness and divergence) for birds and plants were robust when 20%–70% of species had missing data for four out of 11 and two out of six continuous traits, respectively, depending on the severity of bias and method used. However, when missing traits were imputed, functional diversity metrics consistently remained representative of the true value when 70% of bird species were missing data for four out of 11 traits and when 50% of plant species were missing data for two out of six traits. Trait probability densities and distance‐based Rao were particularly robust to missingness and bias when combined with imputation. Convex hull‐based estimations of functional richness were less reliable. When applied to a smaller data set (crocodilians, 25 species), all functional diversity metrics were much more sensitive to missing data. Expanding global morphometric data sets to represent more taxa and traits, and to quantify intraspecific variation, remains a priority. In the meantime, our results show that widely used methods can successfully quantify large‐scale functional diversity even when data are missing for half of species, provided that missing traits are estimated using imputation. We recommend the use of trait probability densities or distance‐based Rao when working with large incomplete data sets and filling data gaps with imputation.




The impact of misclassifications and outliers on imputation methods

Templ, Ulmer 2024. Journal of Applied Statistics


A real data-driven simulation strategy to select an imputation method for mixed-type trait data

May, Feng, Adamowicz 2022. Preprint (self cite)


Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Imputation offers an alternative to removing cases with missing values from datasets. Imputation techniques that incorporate phylogenetic information into their estimations have demonstrated improved accuracy over standard techniques. However, previous studies of phylogenetic imputation tools are largely limited to simulations of numerical trait data, with categorical data not evaluated. It also remains to be explored whether the type of genetic data used affects imputation accuracy. We conducted a real data-based simulation study to compare the performance of imputation methods using a mixed-type trait dataset (lizards and amphisbaenians; order: Squamata). Selected methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Known values were removed from a complete-case dataset to simulate different missingness scenarios: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Each method (with and without phylogenetic information derived from mitochondrial and nuclear gene trees) was used to impute the removed values. The performances of the methods were evaluated for each trait and in each missingness scenario. A random forest method supplemented with a nuclear-derived phylogeny performed best overall, and this method was used to impute missing values in the original squamate dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to the complete-case data. However, phylogeny did not always improve performance for every trait and in every missingness scenario, and caution should be taken when imputing trait data, particularly in cases of extreme bias. Ultimately, these results support the use of a real data-based simulation procedure to select a suitable imputation strategy for a given mixed-type trait dataset. Moreover, they highlight the potential biases that complete-case usage may introduce into analyses.

Author summary

The issue of missing data is problematic in trait datasets as observations for rare or threatened species are often missing disproportionately. When only complete cases are used in an analysis, derived results may be biased. Imputation is an alternative to complete-case analysis and entails filling in the missing values using known observations. It has been demonstrated that including phylogenetic information in the imputation process improves accuracy of predicted values. However, most previous evaluations of imputation methods for trait datasets are limited to numerical, simulated data, with categorical traits not considered. Using a reptile dataset comprising both numerical and categorical trait data, we employed a real data-based simulation strategy to select an optimal imputation method for the dataset. We evaluated the performance of four different imputation methods across different missingness scenarios (e.g. missing completely at random, values missing disproportionately for smaller species). Results indicate that imputed data better reflected the original dataset characteristics compared to complete-case data; however, the optimal imputation strategy for a given scenario was contingent on missingness scenario and trait type. As imputation performance varies depending on the properties of a given dataset, a real data-based simulation strategy can be used to provide guidance on best imputation practices.
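The general idea of the simulation procedure described in the abstract (hide known values from a complete-case dataset under a chosen missingness mechanism, impute them, then score the imputed values against the truth) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy trait values, and the covariate are all hypothetical, the MAR and MNAR rules are a deliberately extreme deterministic simplification (always hiding the entries with the smallest driver values), and plain mean imputation stands in for the methods the study actually compared.

```python
import math
import random
import statistics


def choose_missing(values, covariate, mechanism, rate, rng):
    """Pick the indices whose trait value will be hidden (hypothetical helper).

    MCAR: indices drawn uniformly at random.
    MAR:  indices with the smallest values of an observed covariate.
    MNAR: indices with the smallest values of the trait itself.
    """
    n = len(values)
    k = round(rate * n)
    indices = list(range(n))
    if mechanism == "MCAR":
        return set(rng.sample(indices, k))
    if mechanism == "MAR":
        driver = covariate
    elif mechanism == "MNAR":
        driver = values
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return set(sorted(indices, key=lambda i: driver[i])[:k])


def mean_impute(values, missing):
    """Fill hidden entries with the mean of the entries that remain observed."""
    observed = [v for i, v in enumerate(values) if i not in missing]
    fill = statistics.fmean(observed)
    return [fill if i in missing else v for i, v in enumerate(values)]


def rmse(truth, imputed, missing):
    """Root mean squared error over the hidden entries only."""
    squared = [(truth[i] - imputed[i]) ** 2 for i in missing]
    return math.sqrt(statistics.fmean(squared))


# Toy complete-case data (invented for illustration only).
body_mass = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
clutch_size = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

rng = random.Random(42)
for mechanism in ("MCAR", "MAR", "MNAR"):
    hidden = choose_missing(body_mass, clutch_size, mechanism, 0.3, rng)
    imputed = mean_impute(body_mass, hidden)
    print(mechanism, sorted(hidden), round(rmse(body_mass, imputed, hidden), 3))
```

Because the true values are known before they are hidden, the error can be computed separately for each trait and each missingness scenario, which is what allows candidate methods to be ranked for a specific dataset. A categorical trait would need a different fill rule (such as the mode) and a different error measure (such as the proportion of misclassified values).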
