Mining of Massive Datasets

Leskovec, Jure; Rajaraman, Anand; Ullman, Jeffrey D.

doi:10.1017/cbo9781139924801

Cited by 904 publications

(508 citation statements)

References 0 publications

Supporting

Mentioning

442

Contrasting

Unclassified

Order By: Relevance

“…Still, the challenges for efficiently gathering and exploiting such statistics metadata for optimizing data-intensive flows remain due to the required close to zero overhead of an optimization process and the "right-time" data delivery demands in the next generation BI settings (i.e., ETO). To this end, the existing algorithms proposed for efficiently capturing the approximate summaries out of massive data streams [60], should be reconsidered here and adopted for gathering approximate statistics for data-intensive flows over large input datasets.…”

Section: Discussionmentioning

confidence: 99%

A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey

Jovanovic

Romero

Abelló

2016

Lecture Notes in Computer Science

View full text Add to dashboard Cite

Abstract. Data-intensive flows are central processes in today's business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysisready formats. To meet complex requirements of next generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus must have a clear understanding of the foundations of dataintensive flows and the challenges of moving towards next generation BI environments. In this paper we present a survey of today's research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that still are to be addressed, and how the current solutions can be applied for addressing these challenges.

show abstract

Section: Discussionmentioning

confidence: 99%

A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey

Jovanovic

Romero

Abelló

2016

Lecture Notes in Computer Science

View full text Add to dashboard Cite

show abstract

“…Basically, the larger the cosine similarity is, the smaller the cosine distance is, and the two words are more related [19]. Here we are building a query index that is actually a kNN pre-trained data classification model for a given query set.…”

Section: Index Constructionmentioning

confidence: 99%

Approximate Semantic Matching over Linked Data Streams

Qin

Yao

Sheng

2016

Lecture Notes in Computer Science

View full text Add to dashboard Cite

Abstract. In the Internet of Things (IoT), data can be generated by all kinds of smart things. In such context, enabling machines to process and understand such data is critical. Semantic Web technologies, such as Linked Data, provide an effective and machine-understandable way to represent IoT data for further processing. It is a challenging issue to match Linked Data streams semantically based on text similarity as text similarity computation is time consuming. In this paper, we present a hashing-based approximate approach to efficiently match Linked Data streams with users' needs. We use the Resource Description Framework (RDF) to represent IoT data and adopt triple patterns as user queries to describe users' data needs. We then apply locality-sensitive hashing techniques to transform semantic data into numerical values to support efficient matching between data and user queries. We design a modified k nearest neighbors (kNN) algorithm to speedup the matching process. The experimental results show that our approach is up to five times faster than the traditional methods and can achieve high precisions and recalls.

show abstract

“…The words with higher TF-IDF scores are often the words that best characterize the topic of the document. [28] Intuitively, if a word is less frequent in the whole training set but appears often in one single sentence, then it means this word is of high probability to be significant to the theme of this sentence. Thus this word should be given more weight.…”

Section: Our Approach 31 Extension Of Compositional Distributional Smentioning

confidence: 99%

Information-based methods for evaluating the semantics of automatically generated test items

Latifi

Gierl

Ren

et al. 2016

AIR

View full text Add to dashboard Cite

Multiple-choice questions are the popular type of test items that are used for testing the knowledge of health-science students in north America and elsewhere. The motivation of this article is to present the recent advances in the automatic item generation (AIG) and to propose a novel unsupervised approach that extends the information-based Compositional Distributional Semantic Model (CDSM) to measure the semantic relatedness among the pool of automatically generated items. We have used operational item bank from the medical science domain for developing the CDSM and demonstrated our approach using the concepts from AIG research. We illustrated our approach using eleven item models from the medical education domain, and discussed the possible applications to advance the AIG research.

show abstract

Mining of Massive Datasets

Cited by 904 publications

References 0 publications

A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey

A Unified View of Data-Intensive Flows in Business Intelligence Systems: A Survey

Approximate Semantic Matching over Linked Data Streams

Information-based methods for evaluating the semantics of automatically generated test items

Contact Info

Product

Resources

About