2014
DOI: 10.1017/cbo9781139924801
Mining of Massive Datasets

Abstract: Written by leading authorities in database and Web technologies, this book is essential reading for students and practitioners alike. The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and can be applied successfully to even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for …

Cited by 904 publications (508 citation statements). References 0 publications.
“…Still, the challenges of efficiently gathering and exploiting such statistics metadata for optimizing data-intensive flows remain, owing to the near-zero overhead required of an optimization process and the "right-time" data-delivery demands of next-generation BI settings (i.e., ETO). To this end, the existing algorithms proposed for efficiently capturing approximate summaries of massive data streams [60] should be reconsidered here and adapted to gathering approximate statistics for data-intensive flows over large input datasets.…”
Section: Discussion
confidence: 99%
“…Basically, the larger the cosine similarity, the smaller the cosine distance, and the more closely related the two words are [19]. Here we build a query index that is in fact a kNN pre-trained data classification model for a given query set.…”
Section: Index Construction
confidence: 99%
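The relationship stated in the citing passage above — larger cosine similarity means smaller cosine distance — can be sketched in a few lines. This is a generic illustration, not code from the cited work; the function names are ours.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance is 1 - cosine similarity, so as similarity
    grows toward 1, distance shrinks toward 0."""
    return 1.0 - cosine_similarity(u, v)
```

A kNN query index of the kind the passage describes would rank candidate items for a query by this distance and keep the k smallest.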
“…The words with higher TF-IDF scores are often the words that best characterize the topic of the document [28]. Intuitively, if a word is infrequent in the whole training set but appears often in a single sentence, it is likely significant to the theme of that sentence, and should therefore be given more weight.…”
Section: Our Approach 3.1 Extension Of Compositional Distributional S…
confidence: 99%
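The TF-IDF intuition in the passage above — rare in the corpus, frequent in one document, therefore heavily weighted — can be sketched as follows. This is a minimal illustration using the plain tf · log(N/df) variant; the function name and weighting variant are ours, not taken from the cited paper.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a corpus.

    docs: list of documents, each a list of tokens.
    Returns one {term: weight} dict per document, where
    weight = (term count / doc length) * log(N / document frequency).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights
```

A term appearing in every document gets idf = log(N/N) = 0 and thus zero weight, while a term concentrated in one document scores high there, matching the intuition above.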