Determining the aqueous solubility of molecules is a vital step in many pharmaceutical, environmental, and energy storage applications. Despite efforts made over decades, there are still challenges associated with developing a solubility prediction model with satisfactory accuracy for many of these applications. The goals of this study are to assess current deep learning methods for solubility prediction, develop a general model capable of predicting the solubility of a broad range of organic molecules, and to understand the impact of data properties, molecular representation, and modeling architecture on predictive performance. Using the largest currently available solubility data set, we implement deep learning-based models to predict solubility from the molecular structure and explore several different molecular representations including molecular descriptors, simplified molecular-input line-entry system strings, molecular graphs, and three-dimensional atomic coordinates using four different neural network architectures—fully connected neural networks, recurrent neural networks, graph neural networks (GNNs), and SchNet. We find that models using molecular descriptors achieve the best performance, with GNN models also achieving good performance. We perform extensive error analysis to understand the molecular properties that influence model performance, perform feature analysis to understand which information about the molecular structure is most valuable for prediction, and perform a transfer learning and data size study to understand the impact of data availability on model performance.
Aqueous organic redox flow batteries offer an environmentally benign, tunable, and safe route to large-scale energy storage. The energy density is one of the key performance parameters of organic redox flow batteries, which critically depends on the solubility of the redox-active molecule in water. Prediction of aqueous solubility remains a challenge in chemistry. Recently, machine learning models have been developed for molecular properties prediction in chemistry and material science. The fidelity of a machine learning model critically depends on the diversity, accuracy, and abundancy of the training datasets. We build a comprehensive open access organic molecular database “Solubility of Organic Molecules in Aqueous Solution” (SOMAS) containing about 12,000 molecules that covers wider chemical and solubility regimes suitable for aqueous organic redox flow battery development efforts. In addition to experimental solubility, we also provide eight distinctive quantum descriptors including optimized geometry derived from high-throughput density functional theory calculations along with six molecular descriptors for each molecule. SOMAS builds a critical foundation for future efforts in artificial intelligence-based solubility prediction models.
The awareness about software vulnerabilities is crucial to ensure effective cybersecurity practices, the development of high-quality software, and, ultimately, national security. This awareness can be better understood by studying the spread, structure and evolution of software vulnerability discussions across online communities. This work is the first to evaluate and contrast how discussions about software vulnerabilities spread on three social platforms-Twitter, GitHub, and Reddit. Moreover, we measure how user-level e.g., bot or not, and content-level characteristics e.g., vulnerability severity, post subjectivity, targeted operating systems as well as social network topology influence the rate of vulnerability discussion spread. To lay the groundwork, we present a novel fundamental framework for measuring information spread in multiple social platforms that identifies spread mechanisms and observables, units of information, and groups of measurements. We then contrast topologies for three social networks and analyze the effect of the network structure on the way discussions about vulnerabilities spread. We measure the scale and speed of the discussion spread to understand how far and how wide they go, how many users participate, and the duration of their spread. To demonstrate the awareness of more impactful vulnerabilities, a subset of our analysis focuses on vulnerabilities targeted during recent major cyber-attacks and those exploited by advanced persistent threat groups. One of our major findings is that most discussions start on GitHub not only before Twitter and Reddit, but even before a vulnerability is officially published. The severity of a vulnerability contributes to how much it spreads, especially on Twitter. Highly severe vulnerabilities have significantly deeper, broader and more viral discussion threads. When analyzing vulnerabilities in software products we found that different flavors of Linux received the highest discussion volume. We also observe that Twitter discussions started by humans have larger size, breadth, depth, adoption rate, lifetime, and structural virality compared to those started by bots. On Reddit, discussion threads of positive posts are larger, wider, and deeper than negative or neutral posts. We also found that all three networks have high modularity that encourages spread. However, the spread on GitHub is different from other networks, because GitHub is
No abstract
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.