
ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data

Piccolo, Lee, Suh, et al. 2020


Background: Classification algorithms assign observations to groups based on patterns in data. The machine-learning community has developed myriad classification algorithms, which are used in diverse life science research domains. Algorithm choice can affect classification accuracy dramatically, so it is crucial that researchers optimize the choice of which algorithm(s) to apply in a given research domain on the basis of empirical evidence. In benchmark studies, multiple algorithms are applied to multiple datasets, and the researcher examines overall trends. In addition, the researcher may evaluate multiple hyperparameter combinations for each algorithm and use feature selection to reduce data dimensionality. Although software implementations of classification algorithms are widely available, robust benchmark comparisons are difficult to perform when researchers wish to compare algorithms that span multiple software packages. Programming interfaces, data formats, and evaluation procedures differ across software packages, and dependency conflicts may arise during installation.

Findings: To address these challenges, we created ShinyLearner, an open-source project for integrating machine-learning packages into software containers. ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons. In addition, ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross-validation; it tracks all nested operations and generates output files that make these steps transparent. ShinyLearner includes a Web interface to help users more easily construct the commands necessary to perform benchmark comparisons. ShinyLearner is freely available at https://github.com/srp33/ShinyLearner.

Conclusions: This software is a resource for researchers who wish to benchmark multiple classification or feature-selection algorithms on a given dataset. We hope it will serve as an example of combining the benefits of software containerization with a user-friendly approach.
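The abstract above mentions optimizing hyperparameters via nested cross-validation. The following is a minimal, standard-library sketch of that general idea, not ShinyLearner's implementation or interface. The toy one-dimensional k-nearest-neighbors classifier, the function names, and the toy dataset are all assumptions made for illustration.

```python
def k_folds(n_items, n_folds):
    """Assign item i to fold i % n_folds and return one index list per fold."""
    return [list(range(fold, n_items, n_folds)) for fold in range(n_folds)]


def split(data, test_indices):
    """Return (train, test) lists given the indices held out for testing."""
    held_out = set(test_indices)
    train = [item for i, item in enumerate(data) if i not in held_out]
    test = [data[i] for i in test_indices]
    return train, test


def knn_predict(train, x, k):
    """Toy classifier (assumption): majority label among the k nearest values."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(sorted(set(labels)), key=labels.count)


def accuracy(train, test, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def nested_cv(data, candidate_ks, outer_folds=3, inner_folds=2):
    """Select k on inner folds only, then score it on the untouched outer fold."""
    results = []
    for outer_test_idx in k_folds(len(data), outer_folds):
        outer_train, outer_test = split(data, outer_test_idx)

        def inner_score(k):
            # The inner loop sees only outer_train, never outer_test.
            scores = []
            for inner_test_idx in k_folds(len(outer_train), inner_folds):
                inner_train, inner_test = split(outer_train, inner_test_idx)
                scores.append(accuracy(inner_train, inner_test, k))
            return sum(scores) / len(scores)

        best_k = max(candidate_ks, key=inner_score)
        results.append({
            "best_k": best_k,
            "outer_accuracy": accuracy(outer_train, outer_test, best_k),
        })
    return results


def build_toy_data():
    """Two well-separated clusters with alternating labels (hypothetical data)."""
    data = []
    for i in range(6):
        data.append((i / 10, "a"))
        data.append((10 + i / 10, "b"))
    return data


if __name__ == "__main__":
    for fold_result in nested_cv(build_toy_data(), candidate_ks=[1, 3]):
        print(fold_result)
```

Keeping hyperparameter selection inside the inner loop means the outer-fold accuracy is measured on data that played no part in choosing the hyperparameter, which is the property that makes the outer estimate trustworthy in a benchmark.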


Remote sensing tree classification with a multilayer perceptron

Sumsion, Bradshaw, Hill, et al. 2019


To accelerate scientific progress on remote tree classification, as well as biodiversity and ecology sampling, the National Institute of Standards and Technology created a community-based competition where scientists were invited to contribute informatics methods for classifying tree species and genus using crown-level images of trees. We classified tree species and genus at the pixel level using hyperspectral and LiDAR observations. We compared three algorithms that have been implemented extensively across a broad range of research applications: support vector machines, random forests, and multilayer perceptron. At the pixel level, the multilayer perceptron algorithm classified species or genus with high accuracy (92.7% and 95.9%, respectively) on the training data and performed better than the other two algorithms (85.8–93.5%). This indicates promise for the use of the multilayer perceptron (MLP) algorithm for tree-species classification based on hyperspectral and LiDAR observations and coincides with a growing body of research in which neural network-based algorithms outperform other types of classification algorithms for machine vision. To aggregate patterns across the images, we used an ensemble approach that averages the pixel-level outputs of the MLP algorithm to classify species at the crown level. The average accuracy of these classifications on the test set was 68.8% for the nine species.
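The crown-level step described above, averaging pixel-level outputs and then picking a label per crown, can be sketched as follows. This is an illustrative sketch only: the input layout (one probability dictionary per pixel), the function name, and the labels `species_a` and `species_b` are assumptions, not details taken from the paper.

```python
from collections import defaultdict


def crown_level_predictions(pixel_rows):
    """Average per-pixel class probabilities within each crown.

    pixel_rows: iterable of (crown_id, {label: probability}) pairs, one per
    pixel (hypothetical layout). Returns {crown_id: label with highest mean}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for crown_id, probabilities in pixel_rows:
        counts[crown_id] += 1
        for label, probability in probabilities.items():
            sums[crown_id][label] += probability

    predictions = {}
    for crown_id, label_sums in sums.items():
        n_pixels = counts[crown_id]
        means = {label: total / n_pixels for label, total in label_sums.items()}
        predictions[crown_id] = max(means, key=means.get)
    return predictions


if __name__ == "__main__":
    # Made-up pixel outputs for two crowns.
    pixels = [
        ("crown_1", {"species_a": 0.90, "species_b": 0.10}),
        ("crown_1", {"species_a": 0.40, "species_b": 0.60}),
        ("crown_1", {"species_a": 0.45, "species_b": 0.55}),
        ("crown_2", {"species_a": 0.20, "species_b": 0.80}),
    ]
    print(crown_level_predictions(pixels))
```

In the toy input, two of the three `crown_1` pixels individually favor `species_b`, but the averaged probabilities favor `species_a`. Averaging therefore weights confident pixels more heavily than a simple majority vote would.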


Coordinate-based mapping of tabular data enables fast and scalable queries

Piccolo, Hill, et al. 2019

Preprint


Motivation: Biologists commonly store data in tabular form with observations as rows, attributes as columns, and measurements as values. Due to advances in high-throughput technologies, the sizes of tabular datasets are increasing. Some datasets contain millions of rows or columns. To work effectively with such data, researchers must be able to efficiently extract subsets of the data (using filters to select specific rows and retrieving specific columns). However, existing methodologies for querying tabular data do not scale adequately to large datasets or require specialized tools for processing. We sought a methodology that would overcome these challenges and that could be applied to an existing, text-based format.

Results: In a systematic benchmark, we tested 10 techniques for querying simulated, tabular datasets. These techniques included a delimiter-splitting method, the Python pandas module, regular expressions, object serialization, the awk utility, and string-based indexing. We found that storing the data in fixed-width formats provided excellent performance for extracting data subsets. Because columns have the same width on every row, we could pre-calculate column and row coordinates and quickly extract relevant data from the files. Memory mapping led to additional performance gains. A limitation of fixed-width files is the increased storage requirement of buffer characters. Compression algorithms help to mitigate this limitation at a cost of reduced query speeds. Lastly, we used this methodology to transpose tabular files that were hundreds of gigabytes in size, without creating temporary files. We propose coordinate-based, fixed-width storage as a fast, scalable methodology for querying tabular biological data.

Biologists often generate data suitable for representation in an attribute-value system [1], also known as an information system [2], simple frame [3], object-predicate table [4], or flat file. In this representation, an object might be a biological organism, an attribute might be a characteristic of that organism, and a value might be a datum for that object and attribute. For example, a researcher might observe 200 cancer patients (objects) and collect transcriptomic measurements for 20,000 genes (attributes); each value would indicate the relative number of transcripts present in tumor cells for each patient/gene combination [5]. In this example, the data values would have been summarized previously using preprocessing tools, such as a reference aligner and a transcript-quantification algorithm [6-9]. For convenience and compactness, researchers typically store attribute-value data in 2-dimensional, tabular formats. Commonly, in such tables, each row contains data for a given object, and each column contains data for a given attribute [10]; but in some cases, the table is transposed (objects as columns, attributes as rows). Researchers use tabular data to perform analytical tasks, such as executing statistic...
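The coordinate idea in the Results paragraph above (fixed column widths let row and column offsets be pre-calculated, and memory mapping avoids reading the whole file) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the authors' code: the file layout (space-padded ASCII cells, one newline per row), the helper names, and the sample table are all hypothetical.

```python
import mmap
import os
import tempfile


def write_fixed_width(path, rows):
    """Write rows (lists of strings) as space-padded fixed-width text.

    Returns the width of each column. Padding to the widest value per
    column is an assumption of this sketch.
    """
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    with open(path, "wb") as handle:
        for row in rows:
            line = "".join(value.ljust(width) for value, width in zip(row, widths))
            handle.write(line.encode("ascii") + b"\n")
    return widths


def column_starts(widths):
    """Pre-calculate the byte offset at which each column begins in a row."""
    starts, offset = [], 0
    for width in widths:
        starts.append(offset)
        offset += width
    return starts


def read_cell(mapped, widths, starts, row_index, col_index):
    """Jump straight to one cell using its (row, column) coordinates."""
    line_length = sum(widths) + 1  # +1 for the newline byte
    begin = row_index * line_length + starts[col_index]
    return mapped[begin:begin + widths[col_index]].decode("ascii").rstrip()


def demo():
    rows = [
        ["id", "gene_a", "gene_b"],
        ["p1", "12.5", "3"],
        ["p2", "7", "104.25"],
    ]
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "table.fwf")
        widths = write_fixed_width(path, rows)
        starts = column_starts(widths)
        with open(path, "rb") as handle:
            with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                return (
                    read_cell(mapped, widths, starts, 2, 2),
                    read_cell(mapped, widths, starts, 1, 1),
                )


if __name__ == "__main__":
    print(demo())
```

Because every row has the same byte length, locating a cell is a multiplication and an addition, with no parsing of the preceding rows. The padding spaces written by `ljust` are the "buffer characters" whose storage cost the abstract notes.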


ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data

Piccolo, Lee, Suh, et al. 2019

Preprint


Abstract: Classification algorithms assign observations to groups based on patterns in data. The machine-learning community has developed myriad classification algorithms, which are employed in diverse life-science research domains. When applying such algorithms, researchers face the challenge of deciding which algorithm(s) to apply in a given research domain. Algorithm choice can affect classification accuracy dramatically, so it is crucial that researchers optimize these choices based on empirical evidence rather than hearsay or anecdotal experience. In benchmark studies, multiple algorithms are applied to multiple datasets, and the researcher examines overall trends. In addition, the researcher may evaluate multiple hyperparameter combinations for each algorithm and use feature selection to reduce data dimensionality. Although software implementations of classification algorithms are widely available, robust benchmark comparisons are difficult to perform when researchers wish to compare algorithms that span multiple software packages. Programming interfaces, data formats, and evaluation procedures differ across software packages, and dependency conflicts may arise during installation. To address these challenges, we created ShinyLearner, an open-source project for integrating machine-learning packages into software containers. ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons. In addition, ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross-validation; it tracks all nested operations and generates output files that make these steps transparent. ShinyLearner includes a Web interface to help users more easily construct the commands necessary to perform benchmark comparisons. ShinyLearner is freely available at https://github.com/srp33/ShinyLearner.



Kimball T Hill
