Big data approaches have greatly improved scientific decision-making,
but they are highly dependent on the availability of data, impeding
their use in data-poor scenarios. Beyond data abundance, enhancing
data diversity is another way to access knowledge. Herein, we propose
a data-driven method for toxicity endpoint selection when directly
relevant data are deficient, using shale gas exploitation sites
as an example scenario. From the 1173 substances in the U.S.
Environmental Protection Agency’s HFList, the most concerning
endpoints in zebrafish embryo toxicity tests (FET) were inferred using
a newly developed relational database (RDB) strategy that integrated
chemical, high-throughput screening (HTS) bioactivity, genome, and
FET endpoint information. This RDB strategy, based on text mining and
data fusion, integrated 255 bioactive contaminants,
955 HTS bioassays with known modes of action (MoAs), 214 gene ontologies,
65 pathways, and 27 phenotypic data, and it predicted measurement endpoints
within 10 MoAs for shale gas pollution. This data-driven approach
was further validated using zebrafish FET and transcriptomic sequencing
with field-collected samples, achieving 89% and 97% accuracy for
the predicted ontologies and pathways, respectively. These results highlight
the applicability of RDB-based data-driven strategies for predicting
toxicity endpoints from a priori knowledge of contaminants by improving
data diversity.
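
As an illustration only, the relational linkage described above (contaminant to HTS bioassay with a known MoA, then to candidate FET endpoints) can be sketched with an in-memory SQLite database. This is a minimal sketch under stated assumptions, not the study's implementation: the table names, column names, and the `rank_endpoints` helper are hypothetical, the gene, ontology, and pathway layers are collapsed into a single MoA-to-endpoint table, and every row is a toy placeholder rather than data from the study.

```python
import sqlite3


def build_toy_rdb():
    """Create a toy relational database (hypothetical schema, placeholder rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE chemical (chem_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE assay (assay_id INTEGER PRIMARY KEY, moa TEXT);
        CREATE TABLE activity (chem_id INTEGER, assay_id INTEGER, active INTEGER);
        CREATE TABLE moa_endpoint (moa TEXT, endpoint TEXT);
        """
    )
    # Placeholder contaminants, not substances from the HFList.
    conn.executemany(
        "INSERT INTO chemical VALUES (?, ?)",
        [(1, "chem_A"), (2, "chem_B"), (3, "chem_C")],
    )
    # Each assay is annotated with one placeholder mode of action.
    conn.executemany(
        "INSERT INTO assay VALUES (?, ?)",
        [(10, "moa_X"), (11, "moa_Y")],
    )
    # active = 1 marks a chemical as bioactive in that assay.
    conn.executemany(
        "INSERT INTO activity VALUES (?, ?, ?)",
        [(1, 10, 1), (2, 10, 1), (3, 10, 0), (3, 11, 1)],
    )
    # Simplified link from MoA to candidate endpoints (gene layer omitted).
    conn.executemany(
        "INSERT INTO moa_endpoint VALUES (?, ?)",
        [("moa_X", "endpoint_1"), ("moa_X", "endpoint_2"), ("moa_Y", "endpoint_2")],
    )
    conn.commit()
    return conn


def rank_endpoints(conn):
    """Rank endpoints by the number of distinct bioactive chemicals linked to them."""
    query = """
        SELECT me.endpoint, COUNT(DISTINCT a.chem_id) AS n_chem
        FROM activity AS a
        JOIN assay AS s ON s.assay_id = a.assay_id
        JOIN moa_endpoint AS me ON me.moa = s.moa
        WHERE a.active = 1
        GROUP BY me.endpoint
        ORDER BY n_chem DESC, me.endpoint ASC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for endpoint, n_chem in rank_endpoints(build_toy_rdb()):
        print(endpoint, n_chem)
```

Counting distinct bioactive chemicals per endpoint is one simple way to rank candidate endpoints; it is an assumption of this sketch and may differ from the scoring used in the study.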