During March 2020, most European countries implemented lockdowns to restrict the transmission of SARS-CoV-2, the virus that causes COVID-19, through their populations. These restrictions had positive impacts on air quality due to a dramatic reduction in economic activity and emissions. In this work, a machine learning approach was designed and implemented to analyze local air quality improvements during the COVID-19 lockdown in Graz, Austria. The machine learning approach was used as a robust alternative to simple historical measurement comparisons for various individual pollutants. Concentrations of NO2 (nitrogen dioxide), PM10, O3 (ozone) and Ox (total oxidant) were selected from five measurement sites in Graz and were set as target variables for random forest regression models to predict their expected values during the city's lockdown period. The true vs. expected difference is presented here as an indicator of true pollution during the lockdown. The machine learning models showed a high level of generalization for predicting the concentrations. Therefore, the approach was suitable for analyzing reductions in pollution concentrations. Results on the validation set showed very good performance for Ox and NO2 when compared to PM10 and O3. The analysis indicated that the city's average concentration reductions for the lockdown period were -36.9 to -41.6% for NO2 and -6.6 to -14.2% for PM10. However, an increase of 11.6 to 33.8% for O3 was estimated. The reduction in pollutant concentrations, especially NO2, can be explained by significant drops in traffic flows during the lockdown period (-51.6 to -43.9%). The results presented give a real-world example of what pollutant concentration reductions can be achieved by reducing traffic flows and other economic activities.
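The "true vs. expected" comparison in the abstract above can be illustrated with a small calculation. This is a minimal sketch, assuming the percentage change is taken between the mean observed concentration and the mean model-expected concentration over the lockdown period; the abstract does not state the exact aggregation, and the function name and sample values are hypothetical, not data from the study.

```python
from statistics import mean


def relative_change_percent(observed, expected):
    """Percentage change of the mean observed value relative to the mean expected value."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    mean_expected = mean(expected)
    if mean_expected == 0:
        raise ValueError("mean expected value must be non-zero")
    return 100.0 * (mean(observed) - mean_expected) / mean_expected


# Hypothetical daily NO2 concentrations (µg/m³); NOT measurements from Graz.
observed_no2 = [18.0, 20.0, 22.0]   # what the monitors recorded
expected_no2 = [30.0, 32.0, 34.0]   # what a model predicted without a lockdown

change = relative_change_percent(observed_no2, expected_no2)
print(f"NO2 change vs. expected: {change:.1f}%")
```

A negative value indicates that the observed concentration fell below the level expected under business-as-usual conditions.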
We present a collection of publicly available intrinsic aqueous solubility data of 829 drug-like compounds. Four different machine learning algorithms (random forests [RF], LightGBM, partial least squares, and least absolute shrinkage and selection operator [LASSO]) coupled with multistage permutation importance for feature selection and Bayesian hyperparameter optimization were used for the prediction of solubility based on chemical structural information. Our results show that LASSO yielded the best predictive ability on an external test set with a root mean square error, RMSE(test), of 0.70 log points, an R²(test) of 0.80, and 105 features. Taking into account the number of descriptors as well, an RF model achieves the best balance between complexity and predictive ability with an RMSE(test) of 0.72 log points, an R²(test) of 0.78, and with only 17 features. On a more aggressive test set (principal component analysis [PCA]-based split), better generalization was observed for the RF model. We propose a ranking score for choosing the best model, as test set performance is only one of the factors in creating an applicable model. The ranking score is a weighted combination of generalization, number of features, and test performance. Out of the two best learners, a consensus model was built exhibiting the best predictive ability and generalization with an RMSE(test) of 0.67 log points and an R²(test) of 0.81.
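The RMSE and R² figures quoted above, and the idea of a consensus model, can be sketched as follows. This is an illustration under stated assumptions: R² is taken as 1 − SS_res/SS_tot, and the consensus is assumed to be a simple average of two models' predictions (the abstract does not say how the consensus was formed). All values are hypothetical log-solubility numbers, not data from the study.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def consensus(pred_a, pred_b):
    """Assumed consensus rule: unweighted mean of two models' predictions."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


# Hypothetical measured values and predictions from two models.
measured = [-1.0, -2.0, -3.0, -4.0]
model_a = [-1.4, -2.2, -2.6, -4.4]
model_b = [-0.8, -1.6, -3.2, -3.8]
combined = consensus(model_a, model_b)

for name, pred in [("A", model_a), ("B", model_b), ("consensus", combined)]:
    print(name, round(rmse(measured, pred), 3), round(r_squared(measured, pred), 3))
```

In this toy case the two models err in partly opposite directions, so averaging reduces the error; that is not guaranteed in general.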
The authors present an implementation of the cheminformatics toolkit RDKit in a distributed computing environment, Apache Hadoop. Together with the Apache Spark analytics engine, wrapped by PySpark, resources from scalable commodity hardware can be employed for cheminformatics calculations and query operations with basic knowledge of Python programming and an understanding of resilient distributed datasets (RDDs). Three use cases of cheminformatics computing in Spark on the Hadoop cluster are presented: querying substructures, calculating fingerprint similarity, and calculating molecular descriptors. The source code for the PySpark‐RDKit implementation is provided. The use cases showed that Spark provides reasonable scalability, depending on the use case, and can be a suitable choice for datasets too big to be processed with current low‐end workstations.
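The fingerprint similarity use case mentioned above can be illustrated without a cluster. The following is a minimal pure-Python sketch, not the authors' PySpark‐RDKit code: it assumes the Tanimoto coefficient as the similarity measure (the abstract does not name one), represents each fingerprint as a set of on-bit indices, and uses hypothetical molecule names and bits. The per-molecule scoring step is the kind of independent operation that could be distributed as a map over an RDD.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


# Hypothetical fingerprints; real ones would come from a toolkit such as RDKit.
query = frozenset({1, 4, 7, 9})
library = {
    "mol_a": frozenset({1, 4, 7, 9}),
    "mol_b": frozenset({1, 4, 8}),
    "mol_c": frozenset({2, 3}),
}

# Each molecule is scored independently of the others.
scores = {name: tanimoto(query, bits) for name, bits in library.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```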
Shortcomings of the correlation coefficient (Pearson's) as a measure for estimating the accuracy of properties predicted by a model are analysed. Here we discuss two cases that often occur when a model is applied to predict properties of a new external set of compounds. The first problem in using the correlation coefficient is its insensitivity to the systematic error that must be expected in predicting properties of a novel external set of compounds, which is not a random sample selected from the training set. The second problem is that an external set can be arbitrarily large or small and have an arbitrary and uneven distribution of the measured values of the target variable, which are not known in advance. Under these conditions, the correlation coefficient can be an overoptimistic measure of agreement between predicted values and the corresponding experimental values and can lead to an overly optimistic conclusion about the predictive ability of the model. Due to these shortcomings of the correlation coefficient, the use of the standard error (root-mean-square error) of prediction is suggested as a better quality measure of the predictive capabilities of a model. In the case of classification models, the difference between the real accuracy and the most probable random accuracy of the model shows very good characteristics in ranking different models according to predictive quality, while having an obvious interpretation.
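The first problem described above, insensitivity of the correlation coefficient to a systematic error, can be shown with a toy calculation. This is a minimal sketch with hypothetical values: every prediction is shifted by the same constant offset, which leaves Pearson's r at 1 while the root-mean-square error equals the offset.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rmse(y_true, y_pred):
    """Root-mean-square error of prediction."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical experimental values and predictions with a constant bias of +2.
experimental = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [value + 2.0 for value in experimental]

r = pearson_r(experimental, predicted)
error = rmse(experimental, predicted)
print(f"r = {r:.3f}, RMSE = {error:.3f}")
```

Judged by r alone, the biased predictions look perfect; the RMSE exposes the bias.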
The expanding antibiotic resistance crisis calls for a more in-depth understanding of the importance of antimicrobial resistance genes (ARGs) in pristine environments. We therefore studied the microbiome associated with Sphagnum moss, which forms the main vegetation in undomesticated, evolutionarily old bog ecosystems. In our complementary analysis of culture collections, metagenomic data and a fosmid library from different geographic sites in Europe, we identified a low-abundance but highly diverse pool of resistance determinants, which targets an unexpectedly broad range of 29 antibiotics including natural and synthetic compounds. This derives both from the extraordinarily high abundance of efflux pumps (up to 96%) and from the unexpectedly versatile set of ARGs underlying all major resistance mechanisms. Multi-resistance was frequently observed among bacterial isolates, e.g. in Serratia, Rouxiella, Pandoraea, Paraburkholderia and Pseudomonas. In a search for novel ARGs, we identified the new class A β-lactamase Mm3. The native Sphagnum resistome, comprising a highly diversified and partially novel set of ARGs, contributes to the bog ecosystem's plasticity. Our results reinforce the ecological link between natural and clinically relevant resistomes and thereby shed light on this link from the aspect of pristine plants. Moreover, they underline that diverse resistomes are an intrinsic characteristic of plant-associated microbial communities, which naturally harbour many resistances, including genes with potential clinical relevance.
Maternal nutrition and lifestyle in pregnancy are important modifiable factors for both maternal and offspring health. Although the Mediterranean diet has beneficial effects on health, recent studies have shown low adherence in Europe. This study aimed to assess Mediterranean diet adherence in 266 pregnant women from Dalmatia, Croatia and to investigate their lifestyle habits and regional differences. Adherence to the Mediterranean diet was assessed through two Mediterranean diet scores. Differences in maternal characteristics (diet, education, income, parity, smoking, pre-pregnancy body mass index (BMI), physical activity, contraception) with regard to location and dietary habits were analyzed using the non-parametric Mann–Whitney U test. A machine learning approach was used to reveal other potential non-linear relationships. The results showed that adherence to the Mediterranean diet was low to moderate among the pregnant women in this study, with no significant mainland–island differences. The highest adherence was observed among wealthier women with generally healthier lifestyle choices. The most significant mainland–island differences were observed for lifestyle and socioeconomic factors (income, education, physical activity). The machine learning approach confirmed the findings of the conventional statistical method. We can conclude that adverse socioeconomic and lifestyle conditions were more pronounced in the island population, which, together with the observed non-Mediterranean dietary pattern, calls for more effective intervention strategies.
Numerous industrial applications of machine learning feature critical issues that need to be addressed. This work proposes a framework to deal with such issues, namely competing objectives and class imbalance, in designing a machine vision system for the in-line detection of surface defects on glass substrates of thin-film transistor liquid crystal displays (TFT-LCDs). The developed inspection system consists of (i) feature engineering: extraction of only the defect-relevant features from images using two-dimensional wavelet decomposition and (ii) training ensemble classifiers (proof of concept with a C5.0 ensemble, random forests (RF), and adaptive boosting (AdaBoost)). The focus is on cost sensitivity, increased generalization, and robustness to handle class imbalance and address multiple competing manufacturing objectives. Comprehensive performance evaluation was conducted in terms of accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC) by calculating their 12,000 bootstrapped estimates. Results revealed significant differences (p < 0.05) between the three developed diagnostic algorithms. RF (accuracy of 83.37%, sensitivity of 60.62%, specificity of 89.72%, and MCC of 0.51) outperformed both AdaBoost (accuracy of 81.14%, sensitivity of 69.23%, specificity of 84.48%, and MCC of 0.50) and the C5.0 ensemble (accuracy of 78.35%, sensitivity of 65.35%, specificity of 82.03%, and MCC of 0.44) in all the metrics except sensitivity. AdaBoost exhibited stronger performance in detecting defective TFT-LCD glass substrates. These promising results demonstrated that the proposed ensemble approach is a viable alternative to manual inspections when applied to an industrial case study with issues such as competing objectives and class imbalance.
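The four metrics reported above all follow from a binary confusion matrix. The sketch below shows the standard definitions, treating "defective" as the positive class (an assumption consistent with the abstract's remark on sensitivity). The counts are hypothetical and chosen only to mimic an imbalanced problem; they are not the study's data, and the bootstrapping step is not reproduced.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # share of defective items that were detected
    specificity = tn / (tn + fp)   # share of good items that were passed
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts: 100 defective and 400 non-defective substrates.
metrics = classification_metrics(tp=60, fn=40, tn=360, fp=40)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With imbalanced classes, accuracy can stay high even when many defects are missed, which is why sensitivity and MCC are reported alongside it.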