2014
DOI: 10.5120/18960-0288
Big Data Analytics using Hadoop

Abstract: This paper is an effort to present a basic understanding of what BIG DATA is and its usefulness to an organization from the performance perspective. Along with the introduction of BIG DATA, the important parameters and attributes that make this emerging concept attractive to organizations are highlighted. The paper also evaluates the differences in the challenges faced by a small organization as compared to a medium- or large-scale operation, and therefore the differences in their approach and treatment of BIG…

Cited by 13 publications (5 citation statements)
References 19 publications
“…Hadoop – high-availability distributed object-oriented platform – is a group of classes written in Java, open-sourced by the Apache Foundation to meet the needs of big data. It contains four basic components (Alam and Ahmed, 2014; Dhyani and Barthwal, 2014). The first, the Hadoop Distributed File System (HDFS), is a distributed, expandable and portable file system inspired by the Google File System (GFS).…”
Section: Related Work
confidence: 99%
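The citation above names Hadoop's core components without showing how its processing model works. Below is a minimal, single-process sketch of the MapReduce pattern that Hadoop distributes across a cluster; Hadoop itself runs mappers and reducers as Java classes over HDFS blocks, whereas here both phases are plain Python functions so the control flow is visible. The function names are illustrative, not Hadoop's API.

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word, as Hadoop's Mapper does."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reducer: sum the counts for each key, as happens after the shuffle/sort step."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data needs big storage", "hadoop stores big data"]
print(reduce_phase(map_phase(lines)))  # word counts over both lines
```

In a real Hadoop job, the input lines would be read from HDFS blocks and the map and reduce phases would run in parallel on different nodes, with the framework handling the shuffle between them.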
“…With the rapid development and application of the Internet of Things, cloud computing and big data technology, the access control scheme of a cloud computing platform in a big data environment must be highly scalable, flexible and efficient. However, the access control currently adopted by traditional big data platforms such as Hadoop [1][2][3] is based on static policies specified per user/user group; it cannot grant permissions according to multiple attribute tags of users, let alone change permissions dynamically as users' attributes change, which makes it suitable only for rights management of a small number of users. Data in a big data environment is large and dynamic, so the access control list is large and difficult to maintain, over-authorization and under-authorization become increasingly serious, and rights management is complex and difficult [4].…”
Section: Introduction
confidence: 99%
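The contrast the citing authors draw can be made concrete with a hypothetical sketch: a static user/group ACL fixes permissions up front, while an attribute-based check is re-evaluated against the user's current attributes, so access changes when the attributes do. All names here (`User`, `static_can_read`, the policy rule) are illustrative, not any real platform's API.

```python
from dataclasses import dataclass, field

# Static model: permissions enumerated per user/user group in advance.
STATIC_ACL = {"dataset1": {"alice", "analysts"}}

def static_can_read(user, groups, resource):
    allowed = STATIC_ACL.get(resource, set())
    return user in allowed or bool(allowed & groups)

# Attribute-based model: a policy evaluated over current user attributes.
@dataclass
class User:
    name: str
    attributes: dict = field(default_factory=dict)

def abac_can_read(user, resource):
    # Illustrative policy: clearance >= 2 AND department == "research".
    a = user.attributes
    return a.get("clearance", 0) >= 2 and a.get("department") == "research"

u = User("bob", {"clearance": 3, "department": "research"})
print(abac_can_read(u, "dataset1"))   # access follows current attributes
u.attributes["clearance"] = 1         # an attribute change revokes access
print(abac_can_read(u, "dataset1"))   # no ACL edit was needed
```

This is the maintenance difference the passage describes: in the static model every change requires editing the ACL itself, while the attribute-based model re-derives permissions from attributes at check time.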
“…These tools run interactively or via a batch job. Similar to Hadoop, all of this software is available through the module environments package [54,61]. Since the investigation was for user interactivity and usability, the following was elaborated on:…”
Section: Implementation of Apache Spark Technology System
confidence: 99%
“…Its speed comes largely from its judicious use of memory to cache data: Spark's main transformation operators, and the actions/filters derived from them, are applied to an immutable RDD [32][33]. Each transformation produces an RDD that can be cached in memory and/or persisted to disk, depending on the user's choice [32,61]. In Spark, a transformation is a lazy operator; a directed acyclic graph (DAG) of the RDD transformations is built and optimized, and only executed when an action is applied [32].…”
Section: Implementation of Apache Spark Technology System
confidence: 99%
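The lazy-evaluation behaviour described in the quote can be sketched without Spark itself: each transformation merely records a step in a plan (a linear stand-in for Spark's DAG), and nothing runs until an action such as `collect()` is called. The class and method names are illustrative, not Spark's API, though `map`, `filter` and `collect` mirror its RDD operations.

```python
class LazyDataset:
    """A toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []  # the recorded, not-yet-executed steps

    def map(self, fn):           # transformation: lazy, returns a new dataset
        return LazyDataset(self._data, self._plan + [("map", fn)])

    def filter(self, pred):      # transformation: lazy, returns a new dataset
        return LazyDataset(self._data, self._plan + [("filter", pred)])

    def collect(self):           # action: executes the whole recorded plan
        out = list(self._data)
        for kind, fn in self._plan:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * x).filter(lambda x: x > 4)
print(ds.collect())  # the plan runs only now, when the action is applied
```

Because each transformation returns a new immutable dataset and only the action triggers execution, an engine like Spark is free to inspect and optimize the whole plan (e.g. fuse the map and filter into one pass) before running it.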