In recent years, rapid technological advancement has given rise to an abundance of software applications, social media, and smart devices such as smartphones and sensors. More extensive use of these applications and tools in various industrial domains has led to a data deluge, which has created enormous challenges and opportunities. However, it is not only the volume of the data but also its speed, variety, and uncertainty that pose a massive challenge for traditional technologies such as data warehouses. These diverse and unprecedented characteristics have engendered the notion of "Big Data." Data-intensive industries have been experiencing a wide variety of challenges in processing, managing, and analyzing data. For instance, the healthcare sector is confronting difficulties with the integration or fusion of diverse medical data stemming from multiple heterogeneous sources. Data integration is critically important within the healthcare sector because it enriches data, enhances its value, and, more importantly, lays a solid foundation for efficient and effective healthcare analytics such as predicting diseases or outbreaks. Several data integration technologies and tools have been developed over the last two decades. This paper studies data integration technologies, tools, and applications within the healthcare domain. Furthermore, it discusses future research directions in the integration of big healthcare data.
INDEX TERMS: Big data, data integration, healthcare data.
With the rapid growth of collected data and the variety of its content, efficient integration at Big Data scale becomes crucial. Semantic technologies, as a means of integrating and coordinating heterogeneous systems, may help big data applications manage terminology and relationships in order to link data from different sources. However, because some datasets are difficult to integrate and analyze with high precision, automated processes cannot reach a high level of accuracy without human cognitive ability. Crowdsourcing platforms have the potential to integrate (entity matching, entity resolution) and analyze (sentiment analysis, image recognition) heterogeneous data sources in cases where these tasks prove problematic for computers. In this survey, we explore and compare empirical research studies that rely on merging semantic and crowdsourcing technologies. In light of this comparison, we propose a high-level integration workflow that shows how merging these technologies can enhance the big data integration process and tackle data analysis challenges.
In the big data domain, data quality assessment operations are often complex and must be implementable in a distributed and timely manner. This paper generalizes quality assessment operations by providing a new ISO-based declarative data quality assessment framework (BIGQA). BIGQA is a flexible solution that supports data quality assessment in different domains and contexts. It facilitates the planning and execution of big data quality assessment operations for data domain experts and data management specialists at any phase in the data life cycle. This work implements BIGQA to demonstrate its ability to produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans using straightforward operators designed to handle big data and guarantee a high degree of parallelism when executed. Moreover, it allows incremental data quality assessment, which avoids reading the whole data set each time a quality assessment is required. The results were validated using radiation wireless sensor data and Stack Overflow users' data to show that the framework can be implemented within different contexts. The experiments show a 71% performance improvement over a 1 GB flat file on a single processing machine compared with a non-parallel application, and 75% over a 25 GB flat file within a distributed environment compared with a non-distributed application.
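The idea of incremental assessment mentioned above can be illustrated with a minimal Python sketch. It is not BIGQA's implementation: the names `CompletenessState` and `assess_batch`, the `dose` column, and the choice of completeness as the metric are all assumptions made for illustration. The sketch keeps a small mergeable summary per batch, so a newly arrived batch can be assessed on its own and combined with earlier results without rereading old data; the same merge step would also let partitions be assessed in parallel.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class CompletenessState:
    """Mergeable running counts for one column (hypothetical helper)."""
    total: int = 0      # rows seen so far
    non_null: int = 0   # rows where the column had a value

    def merge(self, other: "CompletenessState") -> "CompletenessState":
        # Counts are additive, so batches can be combined in any order.
        return CompletenessState(self.total + other.total,
                                 self.non_null + other.non_null)

    @property
    def ratio(self) -> Optional[float]:
        # Completeness = share of rows with a value; undefined for no rows.
        return self.non_null / self.total if self.total else None


def assess_batch(rows: Iterable[Mapping[str, object]],
                 column: str) -> CompletenessState:
    """Scan only the given batch and summarize it."""
    total = 0
    non_null = 0
    for row in rows:
        total += 1
        if row.get(column) is not None:
            non_null += 1
    return CompletenessState(total, non_null)


# Illustrative data only: an initial batch, then a later increment.
first = assess_batch([{"dose": 0.12}, {"dose": None}, {"dose": 0.30}], "dose")
later = assess_batch([{"dose": 0.25}], "dose")

# Only the new batch was scanned; the earlier summary is reused as-is.
combined = first.merge(later)
print(combined.ratio)
```

Metrics whose summaries are not simply additive (for example, uniqueness across batches) would need a richer state than two counters, which this sketch does not attempt.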
The term data quality refers to measuring the fitness of data for its intended usage. Poor data quality leads to inadequate, inconsistent, and erroneous decisions that can escalate computational cost, reduce profits, and cause customer churn. Thus, data quality is crucial for researchers and industry practitioners. Different factors drive the assessment of data quality. Data context is deemed one of the key factors due to the contextual diversity of real-world use cases of various entities such as people and organizations. Data that is fit for use in one context (e.g., an organization policy) may not be equally fit for another. Hence, implementing a data quality assessment solution in different contexts is challenging. Traditional technologies for data quality assessment have reached maturity, and existing solutions can solve most quality issues. The data context in these solutions is defined as validation rules applied within the ETL (extract, transform, load) process, i.e., the data warehousing process. In contrast to traditional data quality management, it is impossible to specify all the data semantics beforehand for big data. We need context-aware data quality rules to detect semantic errors in a massive amount of heterogeneous data generated at high speed. While many researchers tackle the quality issues of big data, they define the data context from a specific standpoint. Although data quality is a longstanding research issue in academia and industry, it remains an open issue, especially with the advent of big data, which has made data quality assessment more challenging than ever. This paper provides a scoping review of existing context-aware data quality assessment solutions, starting with big data quality solutions in general and then covering context-aware solutions. The strengths and weaknesses of such solutions are outlined and discussed.
The survey showed that none of the existing data quality assessment solutions could guarantee context awareness with the ability to handle big data. Notably, each solution dealt only with a partial view of the context. We compared the existing quality models and solutions to reach a comprehensive view covering the aspects of context awareness when assessing data quality. This led us to a set of recommendations framed in a methodological framework shaping the design and implementation of any context-aware data quality service for big data. Open challenges are then identified and discussed.