2022
DOI: 10.7717/peerj.13837

Accurate image-based identification of macroinvertebrate specimens using deep learning—How much training data is needed?

Abstract: Image-based methods for species identification offer cost-efficient solutions for biomonitoring. This is particularly relevant for invertebrate studies, where bulk samples often represent insurmountable workloads for sorting, identifying, and counting individual specimens. On the other hand, image-based classification using deep learning tools has strict requirements for the amount of training data, which is often a limiting factor. Here, we examine how classification accuracy increases with the amount of tra…

Cited by 7 publications (7 citation statements)
References 33 publications
“…Automated monitoring systems must be as cost-efficient as possible to achieve the needed spatial coverage and resolution (Hahn et al, 2022). For AI implementation, training data needs to be collected and labelled in sufficient amounts, which can be expensive, especially under in situ conditions (Høye et al, 2022).…”
Section: Introduction
mentioning
confidence: 99%
“…Automated monitoring systems must be as cost‐efficient as possible to achieve the needed spatial coverage and resolution (Hahn et al, 2022). For AI implementation, training data needs to be collected and labelled in sufficient amounts, which can be expensive, especially under in situ conditions (Høye et al, 2022). Moreover, the development from prototype to fully functional monitoring tool requires an adaptive process whose applicability and functionality in the real world is constantly re‐evaluated and adjusted accordingly (Hahn et al, 2022).…”
Section: Introduction
mentioning
confidence: 99%
“…Machine-learning-based identification methods for benthic invertebrates are well established (Ärje et al, 2020; Høye et al, 2022; Lytle et al, 2010; Raitoharju et al, 2018). However, they rely on preserved organisms to allow detailed image generation in laboratory conditions, and hence depend on labour-intensive, net-based sampling.…”
Section: Introduction
mentioning
confidence: 99%
“…Recent advances in camera and machine‐learning technologies have enabled novel image‐based methods, which provide vast amounts of data and enable ecologists to study organisms cost‐effectively and non‐invasively (Høye et al, 2021; Lürig et al, 2021). Machine‐learning‐based identification methods for benthic invertebrates are well established (Ärje et al, 2020; Høye et al, 2022; Lytle et al, 2010; Raitoharju et al, 2018). However, they rely on preserved organisms to allow detailed image generation in laboratory conditions, and hence depend on labour‐intensive, net‐based sampling.…”
Section: Introduction
mentioning
confidence: 99%
“…The exact sources of quasi-replication will vary greatly depending on the model's use case, but separating the training and test datasets by observer, location and/or time might be sensible (see section 3.2 of Van Horn et al, 2018 for a useful illustration of this concept). In general, the accuracy of a model will increase with the volume of data used to train the model. While the minimum required amount of training data will vary depending on the algorithm being used and the quality of the image data, a general 'rule of thumb' is ~50 specimens per class when there are few classes (Høye et al, 2022). However, as the total number of classes, data domains and species morphs increases, more training data are necessary. Additionally, it is ideal to have relatively equal representation of specimens per class in the dataset, as unevenness can introduce bias to the model (e.g. Miao, 2021; Schneider et al, 2020)…”
mentioning
confidence: 99%
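
The last citation statement above makes two practical suggestions: check that each class has roughly 50 or more specimens, and keep training and test images apart by observer, location and/or time. The sketch below is one possible way to express both checks in plain Python; it is not taken from any of the cited papers. The function names, the record layout (dictionaries with `label` and `site` keys) and the 20% test fraction are all assumptions made for illustration, and the threshold of 50 is the quoted rule of thumb rather than a hard limit.

```python
import random
from collections import Counter

# Rule-of-thumb value quoted in the citation statement; treat it as a guide only.
MIN_PER_CLASS = 50


def underrepresented_classes(labels, min_per_class=MIN_PER_CLASS):
    """Return {class: count} for classes with fewer than min_per_class specimens."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < min_per_class}


def split_by_group(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group (e.g. a sampling site) lands in both sets.

    test_fraction applies to the number of groups, not the number of records,
    so the resulting record counts can differ from the nominal fraction.
    """
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


if __name__ == "__main__":
    # Hypothetical specimen records: two taxa photographed at five sites.
    records = [
        {"label": "taxon_a" if i % 3 else "taxon_b", "site": site}
        for site in "ABCDE"
        for i in range(30)
    ]
    print(underrepresented_classes(r["label"] for r in records))
    train, test = split_by_group(records, "site")
    print(len(train), len(test))
```

Splitting on a grouping key rather than on individual images is what keeps near-duplicate photographs from the same site or observer out of both sets at once; the same function could be called with a hypothetical `observer` or `date` key instead of `site`.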