2018
DOI: 10.1038/s41598-018-25440-6

Linguistic measures of chemical diversity and the “keywords” of molecular collections

Abstract: Computerized linguistic analyses have proven of immense value in comparing and searching through large text collections (“corpora”), including those deposited on the Internet – indeed, it would nowadays be hard to imagine browsing the Web without, for instance, search algorithms extracting most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry based on characteristic “chemical words” that span more than traditional functional groups and, …

Cited by 25 publications (17 citation statements)
References 30 publications
“…We experimented with k ranging between the values of 7-12 and there was no statistically significant difference in the prediction performance, therefore we chose 8 to be the character length similar to our previous work [30]. A recent study by [25] also showed that most of the maximum common substructures (MCS) in drugs had between 8 and 12 characters.…”
Section: Ligand Smiles (Ls)
Citation type: mentioning
confidence: 99%
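
The statement above describes cutting ligand SMILES strings into fixed-length character "words" with k = 8. A minimal sketch of one way this could be done follows. The function name `smiles_words`, the use of overlapping windows, and the handling of strings shorter than k are assumptions made for illustration; the quoted passage does not specify them.

```python
def smiles_words(smiles: str, k: int = 8) -> list[str]:
    """Split a SMILES string into overlapping k-character substrings.

    Assumption: windows overlap with stride 1. A string shorter than k
    is returned as a single word (also an assumption, not from the source).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(smiles) < k:
        return [smiles] if smiles else []
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


# Aspirin, written as a 24-character SMILES string.
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
words = smiles_words(aspirin, k=8)
print(len(words), words[0])
```

A string of length n yields n - k + 1 words under this scheme, so the 24-character example produces 17 words of 8 characters each.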
“…We utilize the PROSITE database [24] to extract motifs and profiles that are associated with a biologically significant function and domain. Then, we benefit from a recent study that showed that maximum common substructures (MCS) of ligands constitute the actual words in the chemical space [25]. Approximately 100K MCS were used to extract a new set of words from the ligands.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…WideDTA [26] is an extension of DeepDTA [25] where drugs and proteins are represented as words, instead of characters as in DeepDTA. In particular, drugs are described via most common substructures, denoted as Ligand Maximum Common Substructures (LMCS) [39]; and proteins are represented through most conserved subsequences, which are Protein Domain profiles or Motifs (PDM), retrieved from PROSITE database [30].…”
Section: Deep Learning (Deepdta and Widedta)
Citation type: mentioning
confidence: 99%
“…The same authors extended DeepDTA to WideDTA [152]. This time, instead of only considering the SMILES label encoding for the ligand, substructure information is also included where a list of the 100,000 most frequent maximum common substructures defined by Woźniak et al [159] are used. For the protein description, approximately 500 motifs and domains are extracted from the PROSITE database [160] and label encoded.…”
Section: Recent Developments
Citation type: mentioning
confidence: 99%
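
The statement above mentions that substructure "words" are label encoded. As a rough illustration of what label encoding means here, the sketch below maps each distinct word to an integer and pads sequences to a fixed length. The helper names `build_vocab` and `encode`, the reserved index 0 for padding and unknown words, and the toy word list are all assumptions; none of them come from the cited works.

```python
def build_vocab(words: list[str]) -> dict[str, int]:
    """Assign each distinct word an integer label, starting at 1.

    Index 0 is reserved for padding and unknown words (an assumption).
    """
    vocab: dict[str, int] = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab) + 1
    return vocab


def encode(words: list[str], vocab: dict[str, int], max_len: int) -> list[int]:
    """Map words to labels, truncate to max_len, and right-pad with zeros."""
    ids = [vocab.get(word, 0) for word in words][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical substructure vocabulary built from a toy word list.
vocab = build_vocab(["C(=O)O", "c1ccccc1", "C(=O)O"])
encoded = encode(["c1ccccc1", "N", "C(=O)O"], vocab, max_len=5)
print(vocab, encoded)
```

Reserving 0 for both padding and unknown words keeps the sketch short; a real pipeline might prefer separate indices so a model can tell the two apart.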