“…While available data includes trusted curated sets, experimental data provided by various depositors, as well as literature and biomedical publications that are annotated manually by indexers (MEDLINE, 2021), an abundance of data can be extracted from unstructured text using named-entity recognition software (Ratinov, 2009). Current named-entity recognition approaches include dictionary matching, use of rules to recognize specialized terminology, and context analysis using statistical and neural language models (Sayle et al., 2011; Vazquez et al., 2011; Jessop et al., 2012; Rocktäschel et al., 2012; Gurulingappa et al., 2013; Lowe and Sayle, 2015; Pletscher-Frankild et al., 2015; Song et al., 2018; Devlin et al., 2019; Lee et al., 2020; Tian et al., 2020). To produce data for the PubChem literature knowledge panels, entities are annotated in a PubMed record using third-party named-entity recognition software, LeadMine (Lowe and Sayle, 2015), and matched to chemical synonyms in the PubChem Compound database and to gene, protein, and disease names, as described in Materials and Methods.…”
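
As an illustration of the simplest approach named in the passage, dictionary matching, the following is a minimal Python sketch. It is not LeadMine's algorithm and does not use any PubChem API; the synonym table, the identifiers in it, and the function names `build_pattern` and `annotate` are all hypothetical and chosen only for this example.

```python
import re

# Hypothetical synonym dictionary: lowercase surface form -> (entity type, identifier).
# The identifiers are placeholders invented for this sketch, not real database records.
SYNONYMS = {
    "aspirin": ("chemical", "CHEM:0001"),
    "acetylsalicylic acid": ("chemical", "CHEM:0001"),
    "tp53": ("gene", "GENE:0001"),
    "breast cancer": ("disease", "DIS:0001"),
}


def build_pattern(synonyms):
    """Compile one case-insensitive regex that matches any dictionary name."""
    # Longest names first, so a multi-word term is preferred over a shorter overlap.
    names = sorted(synonyms, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    # The lookarounds reject matches embedded inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def annotate(text, synonyms=SYNONYMS):
    """Return (start, end, matched text, entity type, identifier) for each hit."""
    pattern = build_pattern(synonyms)
    hits = []
    for match in pattern.finditer(text):
        entity_type, identifier = synonyms[match.group(0).lower()]
        hits.append((match.start(), match.end(), match.group(0), entity_type, identifier))
    return hits


if __name__ == "__main__":
    sentence = "Aspirin (acetylsalicylic acid) lowers TP53 expression in breast cancer."
    for hit in annotate(sentence):
        print(hit)
```

Two synonyms of the same chemical resolve to the same placeholder identifier here, which mirrors the idea of matching recognized names to entries in a synonym list. A sketch like this ignores spelling variants, ambiguity, and context, which is where the rule-based and statistical or neural approaches cited in the passage come in.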