Discretized streams

Zaharia, Matei; Das, Tathagata; Li, Haoyuan; Hunter, Timothy; Shenker, Scott; Stoica, Ion

doi:10.1145/2517349.2522737

Cited by 755 publications

(53 citation statements)

References 25 publications

(40 reference statements)

“…We also summarize some contributions and case studies from the industry. [7,33,61,83,89,90,93,94,95]. For example, the development of Spark's MLlib began from MLbase 6 project, and then, other projects started to contribute (e.g., KeystoneML 7 ).…”

Section: Overview Of Apache Spark (mentioning)

confidence: 99%

“…Apache Spark system consists of several main components including Spark core [90,93,94] and upper-level libraries: Spark's MLlib for machine learning [61], GraphX [33,83,85] for graph analysis, Spark Streaming [95] for stream processing and Spark SQL [7] for structured data processing. It is evolving rapidly with changes to its core APIs and addition of upper-level libraries.…”

Section: Main Components and Features (mentioning)

confidence: 99%

“…Streaming [95] for streaming analysis and Spark SQL [7] for structured data processing. Improvements in Spark core lead to corresponding improvements in the upper-level libraries as these libraries are built on top of Spark core.…”

Section: Upper-level Libraries (mentioning)

confidence: 99%

“…Stream processing Discretized Streams (DStreams) [95], an RDD extension for streaming processing in Spark Streaming.…”

Section: Transformations and Actions (mentioning)

confidence: 99%

Big data analytics on Apache Spark

Salloum, Dautov, Chen, et al. 2016. Int J Data Sci Anal

Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. As a rapidly evolving open source project, with an increasing number of contributors from both academia and industry, it is difficult for researchers to comprehend the full body of development and research behind Apache Spark, especially those who are beginners in this area. In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics.

“…Mario also uses HBase for data provenance and single-pass reservoir sampling. The iterative processing in Mario is similar to Spark Streaming [38]. Mario splits the data randomly into many small parts and distributes these on the cluster nodes.…”

Section: Mario (mentioning)

confidence: 99%

Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines

Bongo, Pedersen, Ernstsen. 2015. Computational Intelligence Methods for Bioinformatics and Biostatistics

Abstract. Biological data analysis is typically implemented using a deep pipeline that combines a wide array of tools and databases. These pipelines must scale to very large datasets, and consequently require parallel and distributed computing. It is therefore important to choose a hardware platform and underlying data management and processing systems well suited for processing large datasets. There are many infrastructure systems for such data-intensive computing. However, in our experience, most biological data analysis pipelines do not leverage these systems. We give an overview of data-intensive computing infrastructure systems, and describe how we have leveraged these for: (i) scalable fault-tolerant computing for large-scale biological data; (ii) incremental updates to reduce the resource usage required to update a large-scale compendium; and (iii) interactive data analysis and exploration. We provide lessons learned and describe problems we have encountered during development and deployment. We also provide a literature survey on the use of data-intensive computing systems for biological data processing. Our results show how unmodified biological data analysis tools can benefit from infrastructure systems for data-intensive computing.

Streaming Data and Data Streams

Kolajo, Daramola, Adebiyi. 2021. Wiley StatsRef: Statistics Reference Online

Recent advances in computer networking, smart cities, smart grids, remote sensing, surveillance, telecommunication, and social media have led to a high volume of streaming data. More data has been generated in the past two years than in all of prior human history. This high‐volume, high‐traffic, short‐lived data needs online analysis and intelligent processing to uncover the useful information it contains. To expand existing knowledge in data science, this article discusses broad areas of streaming data and data streams: data stream mining issues, streaming data tools and technologies, streaming data pre‐processing, streaming data algorithms, and strategies for processing data streams. The article also recommends best practices for managing data streams and suggests the way forward.
