The rise of data science is leading to new paradigms in data-driven
materials discovery. These paradigms rest on the essential notion that large data
sources containing chemical structure and property information can
be mined in a fashion that detects and exploits structure–property
relationships, such that chemicals can be predicted to suit a given
material application. The success of material predictions is predicated
on these large data sources of chemical structure and property information
being suited to a target application. Microscopy is commonly used
to characterize chemical structure, especially in fields such as nanotechnology
where material properties are highly dependent on the size and shape
of nanoparticles. Large data sources of nanoparticle information stemming
from microscopy images would thus be highly beneficial. Millions of
microscopy images exist, but they lie fragmented across the literature,
typically presented individually within a paper and usually
only in a qualitative fashion, even though they harbor a wealth
of numeric information. We present the ImageDataExtractor toolkit
that autoidentifies and autoextracts microscopy images from scientific
documents, whereupon it autonomously analyzes each image to produce
quantitative particle size and shape information about its subject
material. Each image is quantified by decoding its scale bar information
using optical character recognition, with help from super-resolution
convolutional neural networks where required. Individual particles
are detected and profiled using various thresholding, segmentation,
polygon fitting, and edge correction routines. The high-throughput
operational capability of ImageDataExtractor means that it can be
used to generate large data sources of particle information for data-driven
materials discovery. The evaluation metrics, precision and recall, are
greater than 80% for the majority of the image processing steps, and
precision is above 80% for all critical steps. The ImageDataExtractor
tool is released under the MIT license and is available to download
from .
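
The quantification and evaluation steps summarized above can be illustrated with a minimal Python sketch. It is not the ImageDataExtractor API: the function names (`nm_per_pixel`, `equivalent_diameter_nm`, `precision_recall`), the use of a circle-equivalent diameter as the size measure, and all numeric values are assumptions made purely for illustration. The sketch shows how a decoded scale bar could turn a pixel-based particle area into a physical size, and how precision and recall follow from counts of correct, spurious, and missed detections.

```python
import math


def nm_per_pixel(scale_bar_value_nm: float, scale_bar_length_px: float) -> float:
    """Physical length represented by one pixel, from a decoded scale bar.

    scale_bar_value_nm: number read from the scale bar label (e.g. via OCR).
    scale_bar_length_px: measured length of the scale bar in pixels.
    """
    if scale_bar_length_px <= 0:
        raise ValueError("scale bar length must be positive")
    return scale_bar_value_nm / scale_bar_length_px


def equivalent_diameter_nm(area_px: float, scale_nm_per_px: float) -> float:
    """Diameter of the circle with the same area as the detected particle."""
    area_nm2 = area_px * scale_nm_per_px ** 2  # areas scale with the square
    return 2.0 * math.sqrt(area_nm2 / math.pi)


def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical example values, not taken from the paper.
scale = nm_per_pixel(scale_bar_value_nm=100.0, scale_bar_length_px=200.0)
diameter = equivalent_diameter_nm(area_px=math.pi * 10.0 ** 2, scale_nm_per_px=scale)
precision, recall = precision_recall(true_pos=8, false_pos=2, false_neg=2)
```

Here a 100 nm scale bar spanning 200 pixels gives 0.5 nm per pixel, so a particle covering the area of a 10-pixel-radius circle has an equivalent diameter of 10 nm.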