2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
DOI: 10.1109/ccgrid.2015.50

Scaling Machine Learning for Target Prediction in Drug Discovery using Apache Spark

Abstract: In the context of drug discovery, a key problem is the identification of candidate molecules that affect proteins associated with diseases. Inside Janssen Pharmaceutica, the Chemogenomics project aims to derive new candidates from existing experiments through a set of machine learning predictor programs, written in single-node C++. These programs take a long time to run and are inherently parallel, but do not use multiple nodes. We show how we reimplemented the pipeline using Apache Spark, which enabled us to …
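
The abstract describes an embarrassingly parallel, per-compound prediction step that was moved from single-node C++ to Spark. The sketch below only illustrates that general idea and is not the authors' implementation: the predictTargets function, the toy descriptors, and the logistic scoring rule are assumptions introduced for the example. If the existing C++ binaries were reused as-is, RDD.pipe could stream serialized compounds through them instead of calling a Scala function.

```scala
// Hypothetical sketch (assumed names and toy data), showing how an
// independent per-compound prediction step can be distributed with Spark.
import org.apache.spark.sql.SparkSession

object TargetPredictionSketch {
  // Stand-in for a single-node scoring routine; the real pipeline uses
  // existing C++ predictor programs, which this toy function replaces.
  def predictTargets(descriptor: Array[Double]): Array[Double] =
    descriptor.map(x => 1.0 / (1.0 + math.exp(-x))) // toy logistic scores

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("target-prediction-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy input: one descriptor vector per candidate molecule.
    val compounds = sc.parallelize(Seq(
      ("mol-1", Array(0.3, -1.2, 0.7)),
      ("mol-2", Array(1.1, 0.4, -0.5))
    ))

    // Each molecule is scored independently, so a plain map spreads the
    // work across the cluster with no inter-task communication.
    val predictions = compounds.mapValues(predictTargets)

    predictions.collect().foreach { case (id, scores) =>
      println(s"$id -> ${scores.mkString(", ")}")
    }
    spark.stop()
  }
}
```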

Cited by 20 publications (17 citation statements)
References 20 publications (17 reference statements)
“…Given the particular focus of their study, Xu et al. do not propose a general methodology for distributing the ML pipeline, and in particular for parallelizing the prediction task of possibly more sophisticated applications, and leave it to the developer to address such a problem. Similarly, Harnie et al. [24] use the Apache Spark technology [7] to achieve the desired scalability in chemoinformatics applications, which is also a choice we made in our work. However, as their main target is obtaining performance enhancements through the parallelization of the scientific application at hand, they do not provide a general methodology for parallelizing prediction services.…”
Section: Related Work (mentioning, confidence: 99%)
“…The steps of these pipelines recall the MapReduce programming paradigm. […] combine the information coming from the different classifiers [24]. In computer vision, video segmentation is typically performed on frame groups, resulting from a first processing stage [34].…”
Section: Prediction as a Service (mentioning, confidence: 99%)
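The quoted passage frames the combination of per-classifier outputs as a MapReduce-style step. The following sketch is a hedged illustration of that idea only; the (compound, target) keying and the averaging rule are assumptions, not what either cited paper actually implements.

```scala
// Hypothetical sketch: merging the outputs of several independent
// classifiers with a keyed reduction, the MapReduce-style step the
// quoted passage alludes to. Key layout and averaging are assumptions.
import org.apache.spark.rdd.RDD

object CombineClassifierScores {
  // Each classifier emits ((compoundId, targetId), score) pairs.
  def combine(perClassifier: Seq[RDD[((String, String), Double)]]): RDD[((String, String), Double)] = {
    val sc = perClassifier.head.sparkContext
    sc.union(perClassifier)                                          // gather all classifier outputs
      .mapValues(score => (score, 1L))                               // (running sum, count)
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }                        // average score per (compound, target)
  }
}
```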
“…These research outcomes are continuously integrated into Spark through its machine learning library, MLlib [12], [11]. As proposals of cutting-edge techniques grow, the application field of machine learning on Spark is spreading to industrial areas including electric power [34], telecommunication [35], and drug discovery [36]. In this movement, the importance of Spark as a data science platform for the KDD community is also growing, as several tutorials have been held at recent KDD conferences [37], [38], [39].…”
Section: Related Work (mentioning, confidence: 99%)
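For readers unfamiliar with the MLlib library named in this passage, the fragment below is a minimal, self-contained usage sketch following the standard Spark ML pattern; the toy data and the choice of logistic regression are placeholders and are unrelated to the cited papers.

```scala
// Minimal MLlib sketch (toy data, placeholder model) to show the kind of
// API the quoted passage refers to as Spark's machine learning library.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MllibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mllib-sketch").getOrCreate()

    // Toy training set: (label, feature vector) rows.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Fit a regularized logistic regression model; MLlib distributes the
    // training itself across the cluster.
    val model = new LogisticRegression().setMaxIter(10).setRegParam(0.01).fit(training)
    println(s"Coefficients: ${model.coefficients}")
    spark.stop()
  }
}
```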
“…It is based on the Bulk Synchronous Parallel (BSP) computation model. Recently, research studies have proposed parallel machine learning algorithms on Spark. They indicated that these algorithms can take advantage of big data tools.…”
Section: Introduction (mentioning, confidence: 99%)