2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
DOI: 10.1109/ccgrid.2015.50

Scaling Machine Learning for Target Prediction in Drug Discovery using Apache Spark

Abstract: In the context of drug discovery, a key problem is the identification of candidate molecules that affect proteins associated with diseases. Inside Janssen Pharmaceutica, the Chemogenomics project aims to derive new candidates from existing experiments through a set of machine learning predictor programs, written in single-node C++. These programs take a long time to run and are inherently parallel, but do not use multiple nodes. We show how we reimplemented the pipeline using Apache Spark, which enabled us to …
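
The abstract describes an embarrassingly parallel, per-compound prediction step that was moved from single-node C++ to Spark. The sketch below only illustrates that general idea and is not the authors' implementation: the predictTargets function, the toy descriptors, and the logistic scoring rule are assumptions introduced for the example. If the existing C++ binaries were reused as-is, RDD.pipe could stream serialized compounds through them instead of calling a Scala function.

```scala
// Hypothetical sketch (assumed names and toy data), showing how an
// independent per-compound prediction step can be distributed with Spark.
import org.apache.spark.sql.SparkSession

object TargetPredictionSketch {
  // Stand-in for a single-node scoring routine; the real pipeline uses
  // existing C++ predictor programs, which this toy function replaces.
  def predictTargets(descriptor: Array[Double]): Array[Double] =
    descriptor.map(x => 1.0 / (1.0 + math.exp(-x))) // toy logistic scores

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("target-prediction-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Toy input: one descriptor vector per candidate molecule.
    val compounds = sc.parallelize(Seq(
      ("mol-1", Array(0.3, -1.2, 0.7)),
      ("mol-2", Array(1.1, 0.4, -0.5))
    ))

    // Each molecule is scored independently, so a plain map spreads the
    // work across the cluster with no inter-task communication.
    val predictions = compounds.mapValues(predictTargets)

    predictions.collect().foreach { case (id, scores) =>
      println(s"$id -> ${scores.mkString(", ")}")
    }
    spark.stop()
  }
}
```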

Cited by 20 publications (17 citation statements)
References 20 publications (17 reference statements)
“…Given the particular focus of their study, Xu et al. do not propose a general methodology for distributing the ML pipeline, and in particular for parallelizing the prediction task of possibly more sophisticated applications, and leave it to the developer to address such a problem. Similarly, Harnie et al. [24] use the Apache Spark technology [7] to achieve the desired scalability in chemoinformatics applications, which is also a choice we made in our work. However, as their main target is obtaining performance enhancements through the parallelization of the scientific application at hand, they do not provide a general methodology for parallelizing prediction services.…”
Section: Related Work (mentioning, confidence: 99%)
“…The steps of these pipelines recall the MapReduce programming paradigm. […] combine the information coming from the different classifiers [24]. In computer vision, video segmentation is typically performed on frame groups, resulting from a first processing stage [34].…”
Section: Prediction as a Service (mentioning, confidence: 99%)
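The quoted passage frames the combination of per-classifier outputs as a MapReduce-style step. The following sketch is a hedged illustration of that idea only; the (compound, target) keying and the averaging rule are assumptions, not what either cited paper actually implements.

```scala
// Hypothetical sketch: merging the outputs of several independent
// classifiers with a keyed reduction, the MapReduce-style step the
// quoted passage alludes to. Key layout and averaging are assumptions.
import org.apache.spark.rdd.RDD

object CombineClassifierScores {
  // Each classifier emits ((compoundId, targetId), score) pairs.
  def combine(perClassifier: Seq[RDD[((String, String), Double)]]): RDD[((String, String), Double)] = {
    val sc = perClassifier.head.sparkContext
    sc.union(perClassifier)                                          // gather all classifier outputs
      .mapValues(score => (score, 1L))                               // (running sum, count)
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }                        // average score per (compound, target)
  }
}
```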
“…These research outcomes are continuously integrated into Spark through its machine learning library, MLlib [12], [11]. As proposals of cutting-edge techniques grow, the application field of machine learning on Spark is spreading to industrial areas including electric power [34], telecommunication [35], and drug discovery [36]. In this movement, the importance of Spark as a data science platform for the KDD community is also growing, as several tutorials have been held at recent KDD conferences [37], [38], [39].…”
Section: Related Work (mentioning, confidence: 99%)
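For readers unfamiliar with the MLlib library named in this passage, the fragment below is a minimal, self-contained usage sketch following the standard Spark ML pattern; the toy data and the choice of logistic regression are placeholders and are unrelated to the cited papers.

```scala
// Minimal MLlib sketch (toy data, placeholder model) to show the kind of
// API the quoted passage refers to as Spark's machine learning library.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MllibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mllib-sketch").getOrCreate()

    // Toy training set: (label, feature vector) rows.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Fit a regularized logistic regression model; MLlib distributes the
    // training itself across the cluster.
    val model = new LogisticRegression().setMaxIter(10).setRegParam(0.01).fit(training)
    println(s"Coefficients: ${model.coefficients}")
    spark.stop()
  }
}
```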
“…It is based on the Bulk Synchronous Parallel (BSP) computation model. Recently, research studies have proposed parallel machine learning algorithms on Spark. They indicated that these algorithms can take advantage of big data tools.…”
Section: Introduction (mentioning, confidence: 99%)