Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline

Feldman, Keith; Faust, Louis; Wu, Xian; Huang, Chao; Chawla, Nitesh V.

doi:10.1007/978-3-319-69775-8_9

Cited by 15 publications

(18 citation statements)

References 56 publications


“…It has been identified that text ambiguity, lack of resources, complex nested entities, identification of contextual information, noise in the form of homonyms, language variability and missing data are important challenges in entity recognition from unstructured big data [11,16,105]. It is also found that the volume of unstructured big data changed the technological paradigm from traditional rule-based or learning-based techniques to [9,10].…”

Section: Named Entity Recognition (NER) | mentioning

confidence: 99%

“…These techniques extract entity mentions from the text, clusters the similar entities and identify relations [120]. In this case, intensive data preprocessing will be required for big data because unstructured big data sets have missing values, noise and other errors [16] that produce uninformative as well as incoherent extractions. Semi-supervised techniques use both labeled and unlabeled corpus with small degree of supervision [121].…”

Section: Rule-based Approaches, Learning-based Approaches | mentioning

confidence: 99%

“…Over-fitting can be resolved with self-training [18] and to overcome the limitation of large annotated dataset availability, reinforcement learning or distant supervision can be used because these techniques use small labeled dataset [26], [126]. Timeliness of distribution of data [126], balance of informativeness, representativeness, and diversity [127], data modeling performance for heterogeneous, dimensional, sparse and imbalance data [16] and structuring the unstructured data [10] are open challenges for IE using unstructured big data sets.…”

Section: Rule-based Approaches, Learning-based Approaches | mentioning

confidence: 99%

“…With huge volume and complexity of unstructured big data, natural language free text data implies various issues for the users to extract the most relevant and required information. Noisy and low-quality data is one of the major challenges in IE from big data [16,31,128,129]. It causes difficulties in identifying semantic relatedness among entities and terms [130], improving the effectiveness and performance of IE systems [128], extracting contextually relevant information [31], data modeling [16] and structuring the data [10].…”

Section: Unstructured Big Data Barriers for IE | mentioning

confidence: 99%


An analytical study of information extraction from unstructured and multidimensional big data

2019


Adnan and Akbar, J Big Data (2019) 6:91

Abstract. Information extraction (IE) is the process of extracting useful information from unstructured or semi-structured data. Big data raises new challenges for IE techniques with the rapid growth of multifaceted, also called multidimensional, unstructured data. Traditional IE systems are ill-equipped to deal with this deluge of unstructured big data. The volume and variety of big data demand improved computational capabilities from these IE systems. It is necessary to understand the competency and limitations of the existing IE techniques related to data pre-processing, data extraction and transformation, and representations for huge volumes of multidimensional unstructured data. Numerous studies have been conducted on IE, addressing the challenges and issues for different data types such as text, image, audio and video. Very little consolidated research has investigated the task-dependent and task-independent limitations of IE covering all data types in a single study. This research work addresses this limitation and presents a systematic literature review of state-of-the-art techniques for a variety of big data, consolidating all data types. Recent challenges of IE are also identified and summarized. Potential solutions are proposed, giving future research directions in big data IE. The research is significant in terms of recent trends and challenges related to big data analytics. The outcome of the research and its recommendations will help to improve big data analytics by making it more productive.

Introduction. The information extraction (IE) process extracts useful structured information from unstructured data in the form of entities, relations, objects, events and many other types. The extracted information is used to prepare data for analysis. Therefore, efficient and accurate transformation of unstructured data in the IE process improves the data analysis. Numerous techniques have been introduced for different data types, i.e. text, image, audio, and video. Advances in technology have promoted the rapid growth of data volume in recent years. The volume, variety (structured, unstructured, and semi-structured data) and velocity of big data have also changed the paradigm of computational capabilities of the systems. IBM estimated that more than 2.5 quintillion bytes of data are generated every day. Among these statistics, it was also predicted that unstructured data from diverse sources will grow up to 90% in a few years. IDC estimated that unstructured data will be 95% of the global data in 2020, with an estimated 65% annual growth rate [1]. The common characteristics of unstructured data are: (i) it comes in multiple formats...

“…The question of the quality of medical record and of the data extracted from there is still understudied [81,10], let alone in regard to machine learning projects [27].…”

Section: Between Gold Standards and Ghost Standards | mentioning

confidence: 99%

A Giant with Feet of Clay: On the Validity of the Data that Feed Machine Learning in Medicine

Cabitza; Ciucci; Rasoini

2018

Organizing for the Digital World


This paper considers the use of Machine Learning (ML) in medicine by focusing on the main problem that this computational approach has been aimed at solving or at least minimizing: uncertainty. To this aim, we point out how uncertainty is so ingrained in medicine that it biases also the representation of clinical phenomena, that is the very input of ML models, thus undermining the clinical significance of their output. Recognizing this can motivate both medical doctors, in taking more responsibility in the development and use of these decision aids, and the researchers, in pursuing different ways to assess the value of these systems. In so doing, both designers and users could take this intrinsic characteristic of medicine more seriously and consider alternative approaches that do not "sweep uncertainty under the rug" within an objectivist fiction, which everyone can come up by believing as true.


Three–Way Classification: Ambiguity and Abstention in Machine Learning

2019
