As a result of the popularity of social networks, hate speech has increased significantly in recent years. Due to its harmful effect on minority groups as well as on large communities, there is a pressing need for hate speech detection and filtering. However, automatic approaches must not jeopardize free speech, so they should accompany their decisions with explanations and an assessment of uncertainty. There is thus a need for predictive machine learning models that not only detect hate speech but also help users understand when texts cross the line and become unacceptable. The reliability of predictions is rarely addressed in text classification. We fill this gap by adapting deep neural networks so that they can efficiently estimate prediction uncertainty. To reliably detect hate speech, we use Monte Carlo dropout regularization, which mimics Bayesian inference within neural networks. We evaluate our approach using different text embedding methods and visualize the reliability of results with a novel technique that aids in understanding classification reliability and errors.
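The Monte Carlo dropout idea mentioned above can be illustrated with a minimal NumPy sketch (not the paper's implementation; the toy network, weights, and dropout rate are invented for illustration): dropout is kept active at prediction time, and averaging several stochastic forward passes yields both a mean prediction and a per-class uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer classifier with fixed (pretend "trained") weights.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 2))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, drop_p=0.5):
    h = np.maximum(x @ W1, 0.0)              # ReLU hidden layer
    # Dropout stays ON at prediction time (the Monte Carlo part).
    mask = rng.random(h.shape) < (1 - drop_p)
    h = h * mask / (1 - drop_p)              # inverted dropout scaling
    return softmax(h @ W2)

def mc_dropout_predict(x, n_samples=100):
    """Run several stochastic forward passes; the mean is the
    prediction, the std is a per-class uncertainty estimate."""
    probs = np.stack([forward(x) for _ in range(n_samples)])
    return probs.mean(axis=0), probs.std(axis=0)

x = rng.normal(size=4)
mean_p, std_p = mc_dropout_predict(x)
```

Inputs with high predictive std can then be flagged as unreliable rather than silently filtered.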
The use of background knowledge remains largely unexploited in many text classification tasks. In this work, we explore word taxonomies as a means for constructing new semantic features, which may improve the performance and robustness of the learned classifiers. We propose tax2vec, a parallel algorithm for constructing taxonomy-based features, and demonstrate its use on six short-text classification problems, including gender, age and personality type prediction, drug effectiveness and side effect prediction, and news topic prediction. The experimental results indicate that the interpretable features constructed using tax2vec can notably improve the performance of classifiers; the constructed features, in combination with fast, linear classifiers tested against strong baselines such as hierarchical attention neural networks, achieved comparable or better classification results on short documents. Further, tax2vec can also be used to extract corpus-specific keywords. Finally, we investigate the semantic space of potential features and observe a similarity with the well-known Zipf's law.
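The core tax2vec idea — mapping document words to taxonomy hypernyms and selecting the most useful hypernyms as new corpus-level features — can be sketched as follows. This is a simplified illustration: the hand-written `hypernyms` map stands in for WordNet hypernym paths, and frequency-based selection stands in for the feature-selection heuristics the actual algorithm offers.

```python
from collections import Counter

# Toy hypernym map; tax2vec walks WordNet hypernym paths instead
# (entries here are invented so the sketch is self-contained).
hypernyms = {
    "aspirin": ["drug"], "ibuprofen": ["drug"],
    "football": ["sport"], "chess": ["game"],
}

def tax2vec_features(corpus_tokens, top_k=2):
    """Select the top_k most frequent hypernyms across the corpus
    and encode each document by its hypernym counts."""
    all_counts = Counter(h for doc in corpus_tokens
                         for tok in doc for h in hypernyms.get(tok, []))
    selected = [h for h, _ in all_counts.most_common(top_k)]
    vectors = [[sum(h in hypernyms.get(tok, []) for tok in doc)
                for h in selected] for doc in corpus_tokens]
    return selected, vectors

docs = [["aspirin", "ibuprofen"], ["football", "chess"]]
feats, X = tax2vec_features(docs)
```

The resulting vectors can be concatenated with ordinary bag-of-words features before training a linear classifier; the selected hypernyms double as interpretable corpus keywords.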
We have developed a new system, ProBiS-Dock, which can be used to determine different types of protein binding sites for small ligands. The binding sites identified this way are then used to construct a new binding site database, the ProBiS-Dock Database, which allows binding sites to be ranked according to their utility for drug development. The newly constructed database currently contains more than 1.4 million binding sites and offers the possibility to investigate potential drug targets originating from different biological species. The interactive ProBiS-Dock Database, a web server and repository that covers all small-molecule ligand binding sites in all of the protein structures in the Protein Data Bank, is freely available at http://probis-dock-database.insilab.org. The ProBiS-Dock Database will be regularly updated to keep pace with the growth of the Protein Data Bank, and we anticipate that it will be useful in drug discovery.
Discovery of potentially deleterious sequence variants is important and has wide implications for research and for the generation of new hypotheses in human and veterinary medicine, and in drug discovery. The GenProBiS web server maps sequence variants to protein structures from the Protein Data Bank (PDB), and further to protein–protein, protein–nucleic acid, protein–compound, and protein–metal ion binding sites. The concept of a protein–compound binding site is understood in the broadest sense, which includes glycosylation and other post-translational modification sites. Binding sites were defined by local structural comparisons of whole protein structures using the Protein Binding Sites (ProBiS) algorithm and by transposition of ligands from the similar binding sites found to the query protein using the ProBiS-ligands approach, with new improvements introduced in GenProBiS. Binding site surfaces were generated as three-dimensional grids encompassing the space occupied by predicted ligands. The server allows intuitive visual exploration of comprehensively mapped variants, such as human somatic missense mutations related to cancer and non-synonymous single nucleotide polymorphisms from 21 species, within the predicted binding site regions of about 80,000 PDB protein structures using fast WebGL graphics. The GenProBiS web server is open and free to all users at http://genprobis.insilab.org.
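The three-dimensional binding site grids mentioned above can be illustrated with a small sketch: given predicted ligand atom coordinates, every grid point within a chosen radius of any atom is marked as occupied. The coordinates, spacing, and radius below are invented for illustration and do not reflect GenProBiS internals.

```python
import numpy as np

# Toy ligand atom coordinates (in angstroms) - hypothetical values.
atoms = np.array([[1.0, 1.2, 0.5],
                  [2.1, 0.9, 0.7],
                  [1.5, 2.0, 1.1]])

def occupancy_grid(coords, spacing=0.5, radius=1.0):
    """Build a regular 3-D grid around the ligand and mark every grid
    point within `radius` of any atom as occupied."""
    lo = coords.min(axis=0) - radius
    hi = coords.max(axis=0) + radius
    axes = [np.arange(l, h + spacing, spacing) for l, h in zip(lo, hi)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([gx, gy, gz], axis=-1)                    # (nx, ny, nz, 3)
    dist = np.linalg.norm(pts[..., None, :] - coords, axis=-1)  # dist to each atom
    return dist.min(axis=-1) <= radius                       # boolean occupancy

grid = occupancy_grid(atoms)
```

The boolean volume approximates the space a predicted ligand occupies; rendering its surface gives the kind of binding site region the server visualizes.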
COVID-19 is a new, potentially life-threatening illness caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pathogen. In 2021, new variants of the virus with multiple key mutations emerged, such as B.1.1.7, B.1.351, P.1 and B.1.617, and are threatening to render available vaccines or potential drugs ineffective. In this regard, we highlight 3CLpro, the main viral protease, as a valuable therapeutic target that carries no mutations in the described pandemically relevant variants. 3CLpro could therefore provide trans-variant effectiveness; this is supported by structural studies, and biological evaluation experiments for it are readily available. With this in mind, we performed a high-throughput virtual screening experiment using CmDock and the “In-Stock” chemical library to prepare prioritisation lists of compounds for further studies. We coupled the virtual screening experiment with a machine learning-supported classification and activity regression study to bring maximal enrichment and available structural data on known 3CLpro inhibitors to the prepared focused libraries. All virtual screening hits are classified as 3CLpro inhibitors, viral cysteine protease inhibitors, or the remaining chemical space, based on a calculated set of 208 chemical descriptors. Last but not least, we analysed whether the current set of 3CLpro inhibitors could be used for activity prediction and observed that the field of 3CLpro inhibitors is drastically under-represented compared to the chemical space of viral cysteine protease inhibitors. We postulate that this methodology of 3CLpro inhibitor library preparation and compound prioritisation far surpasses the selection of compounds from available commercial “corona focused libraries”.
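Descriptor-based assignment of screening hits to the three groups described above can be sketched generically. The sketch below uses a nearest-centroid rule over tiny invented 4-dimensional descriptor vectors purely for illustration; the actual study computed 208 chemical descriptors and used its own classification models.

```python
import numpy as np

# Hypothetical descriptor vectors for labelled reference compounds
# (all class names kept, but values and dimensionality are invented).
refs = {
    "3CLpro_inhibitor": np.array([[0.9, 0.1, 0.5, 0.2],
                                  [0.8, 0.2, 0.6, 0.1]]),
    "viral_cysteine_protease": np.array([[0.2, 0.8, 0.4, 0.7]]),
    "remaining_chemical_space": np.array([[0.1, 0.1, 0.1, 0.1]]),
}
# One centroid per class: the mean descriptor vector of its references.
centroids = {k: v.mean(axis=0) for k, v in refs.items()}

def classify_hit(descriptor_vector):
    """Assign a screening hit to the class with the nearest centroid
    in descriptor space (Euclidean distance)."""
    return min(centroids,
               key=lambda k: np.linalg.norm(centroids[k] - descriptor_vector))

label = classify_hit(np.array([0.85, 0.15, 0.55, 0.15]))
```

Hits landing near known 3CLpro inhibitors in descriptor space would be prioritised for the focused library.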
Keyword extraction is used to summarize the content of a document and supports efficient document retrieval, and is as such an indispensable part of modern text-based systems. We explore how load centrality, a graph-theoretic measure applied to graphs derived from a given text, can be used to efficiently identify and rank keywords. By introducing meta vertices (aggregates of existing vertices) and systematic redundancy filters, the proposed method performs on par with the state of the art on the keyword extraction task across 14 diverse datasets. The proposed method is unsupervised, interpretable, and can also be used for document visualization.

Keywords: keyword extraction · graph applications · vertex ranking · load centrality · information retrieval

1 Introduction and related work

Keywords are terms (i.e. expressions) that best describe the subject of a document [2]. A good keyword effectively summarizes the content of the document and allows it to be efficiently retrieved when needed. Traditionally, keyword assignment was a manual task, but with the emergence of large amounts of textual data, automatic keyword extraction methods have become indispensable. Despite considerable effort from the research community, state-of-the-art keyword extraction algorithms leave much to be desired, and their performance is still lower than on many other core NLP tasks [13]. The first keyword extraction methods mostly followed a supervised approach [14,24,31]: they first extract keyword features and then train a classifier on a gold standard dataset. For example, KEA [31], a state-of-the-art supervised keyword extraction algorithm, is based on the Naive Bayes machine learning algorithm. While these methods offer quite good performance, they rely on an annotated gold standard dataset and require a (relatively) long training process. In contrast, unsupervised approaches need no training and can be applied directly without relying on a gold standard document collection.
They can be further divided into statistical and graph-based approaches. (Škrlj, Repar and Pollak, arXiv:1907.06458v1 [cs.CL], 15 Jul 2019)
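The graph-based approach described above can be illustrated with a minimal sketch, assuming the `networkx` library is available. It omits the paper's meta vertices and redundancy filters: it simply builds a word co-occurrence graph from the text and ranks vertices by load centrality.

```python
import networkx as nx

def rank_keywords(text, window=2, top_k=3):
    """Build a word co-occurrence graph (edges between words appearing
    within `window` tokens of each other) and return the top_k vertices
    by load centrality as keyword candidates."""
    tokens = [t.lower().strip(".,") for t in text.split()]
    g = nx.Graph()
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if tok != tokens[j]:
                g.add_edge(tok, tokens[j])
    scores = nx.load_centrality(g)       # fraction of shortest paths via vertex
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

keywords = rank_keywords(
    "graph based keyword extraction ranks words in a word graph by centrality")
```

Real systems would additionally filter candidates by part of speech and merge multi-word phrases; here every token is kept, an assumption made for brevity.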
Data preprocessing is an important component of machine learning pipelines, and one that demands considerable time and resources. An integral part of preprocessing is data transformation into the format required by a given learning algorithm. This paper outlines some of the modern data processing techniques used in relational learning that enable data fusion from different input data types and formats into a single-table data representation, focusing on the propositionalization and embedding data transformation approaches. While both approaches aim at transforming data into a tabular format, they use different terminology and task definitions, are perceived to address different goals, and are used in different contexts. This paper contributes a unifying framework that allows for improved understanding of these two data transformation techniques by presenting their unified definitions, and by explaining the similarities and differences between the two approaches as variants of a unified complex data transformation task. In addition to the unifying framework, the novelty of this paper is a unifying methodology combining propositionalization and embeddings, which benefits from the advantages of both in solving complex data transformation and learning tasks. We present two efficient implementations of the unifying methodology: an instance-based PropDRM approach, and a feature-based PropStar approach to data transformation and learning, together with their empirical evaluation on several relational problems. The results show that the new algorithms can outperform existing relational learners and can solve much larger problems.
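Propositionalization, as discussed above, can be illustrated with a minimal sketch: a one-to-many relation (customers and their orders, invented here for illustration) is flattened into a single table by aggregating the linked rows into new columns.

```python
# Toy relational data: a one-to-many relation customers -> orders.
customers = [{"id": 1, "country": "SI"}, {"id": 2, "country": "DE"}]
orders = [
    {"customer": 1, "amount": 10.0},
    {"customer": 1, "amount": 30.0},
    {"customer": 2, "amount": 5.0},
]

def propositionalize(customers, orders):
    """Produce one row per customer, turning the linked orders into
    aggregate features (count, sum, max) of a single table."""
    rows = []
    for c in customers:
        amounts = [o["amount"] for o in orders if o["customer"] == c["id"]]
        rows.append({
            "id": c["id"],
            "country": c["country"],
            "n_orders": len(amounts),
            "total_amount": sum(amounts),
            "max_amount": max(amounts) if amounts else 0.0,
        })
    return rows

table = propositionalize(customers, orders)
```

The resulting single table can be fed to any standard (e.g. linear) learner, which is precisely what makes the transformation attractive in relational learning.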