Twister2: Design of a big data toolkit

Kamburugamuve, Supun; Govindarajan, K.; Wickramasinghe, Pulasthi; Abeykoon, Vibhatha; Fox, Geoffrey

doi:10.1002/cpe.5189

Cited by 16 publications

(19 citation statements)

References 69 publications


“…The integration presented here¹ provides APIs in Python and Java. We chose those two languages because Java is the native language for COMPSs and HDFS, while Python is popular in Data Science scenarios and is required for the Lemonade environment, which will be described later.…”

Section: Data Abstraction (mentioning)

confidence: 99%

“…Using HDFS, each fragment can be read in parallel by the multiple instances of the task (task1). From there, the next steps are similar to the existing solutions in COMPSs programming, that [footnote 1: https://github.com/eubr-bigsea/compss-hdfs] [caption: Algorithm 1: COMPSs HDFS API usage example.]…”

Section: Data Abstraction (mentioning)

confidence: 99%

“…HPC applications are those that explore high-level parallelism and high-performance hardware, including low latency networks, to process mostly structured data with scientific algorithms. On the other hand, Big Data scenarios involve the processing of massive data volumes (usually unstructured), leveraging the use of conventional hardware and exploiting data parallelism. In this case, data could be processed as multiple individual streams and analyzed collectively in stream or in batch, for the discovery of knowledge.…”

Section: Introduction (mentioning)

confidence: 99%

“…In this case, data could be processed as multiple individual streams and analyzed collectively in stream or in batch, for the discovery of knowledge. In such scenarios, data mining in big data has become one of the key tasks in many fields of Science [1].…”

Section: Introduction (mentioning)

confidence: 99%


Upgrading a high performance computing environment for massive data processing

Ponce, Santos, Meira, et al. 2019

J Internet Serv Appl


High-performance computing (HPC) and massive data processing (Big Data) are two trends that are beginning to converge. In that process, aspects of hardware architectures, systems support and programming paradigms are being revisited from both perspectives. This paper presents our experience on this path of convergence with the proposal of a framework that addresses some of the programming issues derived from such integration. Our contribution is the development of an integrated environment that combines (i) COMPSs, a programming framework for the development and execution of parallel applications for distributed infrastructures; (ii) Lemonade, a data mining and analysis tool; and (iii) HDFS, the most widely used distributed file system for Big Data systems. To validate our framework, we used Lemonade to create COMPSs applications that access data through HDFS, and compared them with equivalent applications built with Spark, a popular Big Data framework. The results show that the HDFS integration benefits COMPSs by simplifying data access and by rearranging data transfer, reducing execution time. The integration with Lemonade facilitates the use of COMPSs and may help its popularization in the Data Science community, by providing efficient algorithm implementations for experts from the data domain who want to develop applications with a higher level abstraction.


“…In scenarios conventionally called Big Data, by contrast, which seek to process large volumes of data of diverse types, usually unstructured, conventional hardware is used and data-parallelism techniques are relied on heavily. In this case, the data can be processed as individual streams and analyzed collectively in stream or in batch for the discovery of knowledge, with mining in Big Data being one of the key tasks in many domains of science [Kamburugamuve et al 2017].…”

Section: Introduction (unclassified)

Extensão de um ambiente de computação de alto desempenho para o processamento de dados massivos [Extension of a high-performance computing environment for massive data processing]

Santos, Meira, Guedes 2018

Anais Do XXXVI Simpósio Brasileiro De Redes De Computadores E Sistemas Distribuídos (SBRC 2018)


High-performance computing (HPC) and massive data processing (Big Data) are two trends in computing systems that are beginning to converge. This work presents our experience on this path of convergence, extending COMP Superscalar (COMPSs), a parallel and distributed programming model already well known in the HPC world, to massive data processing. To that end, it was integrated with HDFS, the most widely used distributed file system for Big Data, and with Lemonade, a data analysis and mining tool developed at UFMG. The results show that the integration with HDFS benefits COMPSs through the data abstraction it provides, and that the integration with Lemonade facilitates its use and popularization in the field of Data Science.
