Recently, a huge amount of data has been generated all over the world; these data are massive in volume, produced at extreme speed, and varied in type. To extract value from these data and make sense of them, many frameworks and tools need to be developed for analyzing them, and many have already been created to capture, store, analyze, and visualize them. In this study, we categorize the existing frameworks used for processing big data into three groups, namely batch processing, stream analytics, and interactive analytics; we discuss each of them in detail and compare them with one another.
Many critical applications need greater accuracy and speed in the decision-making process. Data mining scholars have developed a set of automated tools to enhance such decisions depending on the type of application. Phishing is one of the most critical applications requiring high accuracy and speed in decision making, as a malicious webpage impersonates a legitimate webpage to acquire secret information from the user. In this paper, we propose a new Association Classification (AC) algorithm as an automated tool to increase the accuracy of the classification process that aims to discover malicious webpages. The Intelligent Association Classification (IAC) algorithm developed in this article employs the Harmonic Mean measure instead of the support and confidence measures, both to address the estimation problems of those measures and to discover hidden patterns not generated by existing AC algorithms. Our algorithm is compared with four well-known AC algorithms in terms of accuracy, F1, precision, recall, and execution time. The experiments and the visualization process show that the IAC algorithm outperforms the others in all cases and emphasize the importance of both general and specific rules in the classification process.
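The abstract does not give the exact IAC formulation, but the core idea it names can be sketched: score each association rule by the harmonic mean of its support and confidence, rather than filtering on the two measures separately. The rules and numbers below are illustrative assumptions, not data from the paper.

```python
# Hypothetical sketch: rank association rules by the harmonic mean of
# support and confidence, instead of applying separate thresholds.
# Rule labels and values below are made up for illustration only.

def harmonic_mean(support: float, confidence: float) -> float:
    """Harmonic mean of two measures; defined as 0 if either is 0."""
    if support == 0 or confidence == 0:
        return 0.0
    return 2 * support * confidence / (support + confidence)

def rank_rules(rules):
    """rules: list of (label, support, confidence); best score first."""
    return sorted(rules, key=lambda r: harmonic_mean(r[1], r[2]),
                  reverse=True)

rules = [
    ("url_has_ip -> phishing", 0.10, 0.95),   # rare but very reliable
    ("https_token -> phishing", 0.40, 0.55),  # frequent but weak
    ("long_url -> phishing", 0.25, 0.80),     # balanced
]
for label, s, c in rank_rules(rules):
    print(f"{label}: {harmonic_mean(s, c):.3f}")
```

The harmonic mean penalizes rules where either measure is low, so a rule needs both reasonable support and reasonable confidence to rank highly, which matches the stated goal of surfacing patterns that threshold-based AC algorithms miss.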
Data is the fastest-growing asset of the 21st century, and extracting insights from it is becoming essential, as traditional ecosystems are incapable of processing the resulting volumes, handling their varying structural levels, and keeping pace with how rapidly they are produced. In this paradigm, the need to process mostly real-time data, among other factors, highlights the need for optimized job scheduling algorithms, which are the interest of this paper. Job scheduling is one of the most important aspects of guaranteeing an efficient processing ecosystem with minimal execution time, exploiting the available resources while granting all users a fair share of them. Through this work, we lay the needed background on the Hadoop MapReduce framework. We run a comparative analysis of different algorithms classified on different criteria; light is shed on four classifications: cluster environment, job allocation strategy, optimization strategy, and metrics of quality. We also construct use cases to showcase the characteristics of selected job scheduling algorithms, and then present a comparative display featuring the details of those use cases.
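The fairness goal mentioned above can be illustrated with a toy allocator (this is a sketch of the general fair-share idea, not any specific Hadoop scheduler): each free task slot is handed to the user who currently holds the smallest share of the cluster.

```python
# Toy fair-share allocation: repeatedly give the next free slot to the
# user holding the fewest slots so far. Illustrative only; real Hadoop
# schedulers (e.g. the Fair Scheduler) also weigh pools, preemption,
# and data locality.
import heapq

def fair_assign(users, total_slots):
    """Distribute total_slots among users as evenly as possible."""
    heap = [(0, u) for u in users]          # (slots held, user)
    heapq.heapify(heap)
    allocation = {u: 0 for u in users}
    for _ in range(total_slots):
        held, user = heapq.heappop(heap)    # least-served user first
        allocation[user] += 1
        heapq.heappush(heap, (held + 1, user))
    return allocation

print(fair_assign(["alice", "bob", "carol"], 10))
```

Contrast this with a pure FIFO discipline, where one long job can monopolize every slot; the comparative analyses surveyed in the paper weigh exactly this kind of trade-off between throughput and fairness.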
Autism Spectrum Disorder (ASD) is a psychiatric disorder that constrains the ability to use cognitive, linguistic, communicative, and social skills. Recently, many data mining techniques have been employed in this domain to determine the main features of the condition and the correlations between them. In this article, we investigate the Association Classification (AC) technique as a data mining technique for predicting whether an individual has autism. Accordingly, seven well-known algorithms are selected to analyze and evaluate the performance of the AC technique in identifying correlations between the features, helping to decide early whether an individual has autism; this is particularly significant for children. The behavior and performance of the AC algorithms in the prediction tasks were evaluated on the common metrics of precision, accuracy, F-measure, and recall. Finally, a comparative performance analysis among the algorithms serves as the final result of the study. The results show better performance for the WCBA algorithm in most test scenarios, with an accuracy of 97%, although the majority of algorithms exhibited excellent accuracy when applied in this domain.
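The four metrics named above are standard and can be computed from a binary confusion matrix. The sketch below uses made-up labels (1 = autism, 0 = no autism), not data from the study.

```python
# Minimal sketch of the four evaluation metrics (precision, recall,
# F-measure, accuracy) from binary labels. Example data is invented.

def evaluate(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # ground-truth diagnoses
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # classifier output
print(evaluate(y_true, y_pred))
```

Reporting all four together matters here because screening data is often class-imbalanced: a classifier can score high accuracy while missing positive cases, which recall and F-measure expose.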
Data curation on data streams is effective in operating and reducing the costs of big data analytics. Basically, analytic preparation requires curating the heterogeneous data sets available in big data clusters, and the process becomes harder when curation must be conducted on data in motion in order to arrive at actionable insights and valuable analytics in real time, including further machine learning processing. In this paper, we identify and survey the issues and challenges in the different areas related to big data. We also investigate the most common techniques and methods followed in implementations, including stream curation, the machine learning algorithms used in such implementations, and the feature engineering techniques that can be considered a curation pre-processing paradigm for data stream analytics. Furthermore, the paper surveys the application areas where the data curation concept plays a critical role. Finally, we map the techniques and methods related to the data curation field to emphasize its critical role in business, retail, culture, arts, health, medicine, social media, wireless sensor networks, natural language processing (NLP), and automated feature engineering (FE). We likewise identify the issues and challenges in specific areas, including IoT and media stream curation, to guide scholars in this field.
Development of big data applications has become very important in the last few years; many organizations and industries are aware that data analysis is becoming an important factor in staying competitive and discovering new trends and insights. The data ingestion and preparation step is the starting point for developing any big data project. This paper reviews some of the most widely used big data ingestion and preparation tools, discussing the main features, advantages, and usage of each. The purpose of this paper is to help users select the right ingestion and preparation tool according to their needs and their applications' requirements.