Sequential pattern mining has attracted great interest in the data mining research community in recent years. However, to the best of our knowledge, no existing work studies the problem of mining frequent sequence generators. In this paper we present a novel algorithm, FEAT (abbr. Frequent sEquence generATor miner), to perform this task. Experimental results show that FEAT is more efficient than traditional sequential pattern mining algorithms while generating a more concise result set, and that it is very effective for classifying Web product reviews.
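To make the notion of a frequent sequence generator concrete, the following is a minimal brute-force sketch; it is not the FEAT algorithm, whose details the abstract does not give. It assumes the usual definition that a frequent sequence is a generator when no proper subsequence has the same support, and it further assumes that sequences consist of single items and that the empty sequence is excluded from the comparison. The function names and the toy database are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of database sequences that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def frequent_patterns(database, min_support):
    """Enumerate all frequent patterns by naive prefix growth (no pruning tricks)."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    items = sorted({x for seq in database for x in seq})
    frequent = {}
    frontier = [()]
    while frontier:
        next_frontier = []
        for prefix in frontier:
            for item in items:
                candidate = prefix + (item,)
                sup = support(candidate, database)
                if sup >= min_support:
                    frequent[candidate] = sup
                    next_frontier.append(candidate)
        frontier = next_frontier
    return frequent


def frequent_generators(database, min_support):
    """Keep frequent patterns that have no non-empty proper subsequence of equal support.

    Support never increases when a pattern grows, so it is enough to compare
    each pattern with the subsequences obtained by deleting exactly one item.
    """
    frequent = frequent_patterns(database, min_support)
    generators = {}
    for pattern, sup in frequent.items():
        one_shorter = {pattern[:i] + pattern[i + 1:] for i in range(len(pattern))}
        one_shorter.discard(())  # assumption: the empty sequence is not compared
        if all(frequent[sub] != sup for sub in one_shorter):
            generators[pattern] = sup
    return generators


# Hypothetical toy database of four sequences.
toy_db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c"), ("b", "c")]
generators = frequent_generators(toy_db, min_support=2)
```

On this toy database, seven patterns are frequent at a minimum support of 2, but only four of them are generators, which illustrates why a generator-based result set can be more concise than the full set of sequential patterns.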
Box score statistics are the baseline measures of performance for National Collegiate Athletic Association (NCAA) basketball. Between the 2011-2012 and 2015-2016 seasons, NCAA teams performed better at home than on the road in nearly all box score statistics across both genders and all three divisions. Using box score data from over 100,000 games spanning the three divisions for both women and men, we examine the factors underlying this discrepancy. The prevalence of neutral location games in the NCAA provides an additional angle through which to examine the gaps in box score statistic performance, an angle we believe has been underutilized in existing literature. We also estimate a regression model to quantify the home court advantages for box score statistics after controlling for other factors such as number of possessions and team strength. Additionally, we examine the biases of scorekeepers and referees. We present evidence that scorekeepers tend to have greater home team biases when observing men compared to women, higher divisions compared to lower divisions, and stronger teams compared to weaker teams. Finally, we present statistically significant results indicating that referee decisions are impacted by attendance, with larger crowds resulting in greater bias in favor of the home team.
In this paper we consider the problem of mining frequently occurring interesting phrases in large document collections in an ad-hoc fashion. Ad-hoc refers to the ability to perform such analyses over text corpora that can be an arbitrary subset of a global set of documents. Most of the time, the identification of these ad-hoc document collections is driven by a user- or application-defined query with the aim of gathering statistics describing the sub-collection, as a starting point for further data analysis tasks. Our approach to mining the top-k most interesting phrases consists of a novel indexing technique, called Sequence Pattern Indexing (SeqPattIndex), that benefits from the observation that phrases often overlap sequentially. We devise a forest-based index for phrases and a further improved version with additional redundancy elimination power. The actual top-k phrase mining algorithm operating on these indices combines a simple merge join with ideas inspired by the pattern-growth framework from the data mining community, and uses early termination and search space pruning techniques to improve runtime performance. Overall, our approach has, on average, lower index space consumption as well as lower runtime for the top-k phrase mining task, as we demonstrate in the experimental evaluation using real-world data.
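The task itself, as opposed to the SeqPattIndex structure, can be illustrated with a naive baseline that rescans the documents for every query. The sketch below is only an illustration of what an index-based method would have to speed up. The interestingness score (document frequency inside the ad-hoc subset divided by document frequency in the whole corpus), the `min_local` threshold, the whitespace tokenization, and all function names are assumptions made here, not details taken from the abstract.

```python
from collections import Counter


def phrases(tokens, max_len):
    """Yield every contiguous word n-gram of length 1..max_len as a tuple."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def doc_frequencies(docs, max_len):
    """Count, for each phrase, the number of documents that contain it."""
    counts = Counter()
    for text in docs:
        counts.update(set(phrases(text.lower().split(), max_len)))
    return counts


def top_k_phrases(corpus, subset_ids, k, max_len=3, min_local=2):
    """Rank phrases of an ad-hoc subset by local / global document frequency.

    `corpus` maps document ids to text; `subset_ids` selects the ad-hoc
    sub-collection. Ties are broken by local frequency, then alphabetically.
    """
    global_df = doc_frequencies(corpus.values(), max_len)
    local_df = doc_frequencies((corpus[i] for i in subset_ids), max_len)
    scored = [
        (local_df[p] / global_df[p], local_df[p], p)
        for p in local_df
        if local_df[p] >= min_local
    ]
    scored.sort(key=lambda t: (-t[0], -t[1], t[2]))
    return [(" ".join(p), score) for score, _, p in scored[:k]]


# Hypothetical four-document corpus; documents 1 and 2 form the ad-hoc subset.
toy_corpus = {
    1: "data mining finds frequent patterns",
    2: "frequent patterns in data streams",
    3: "basketball box score data",
    4: "home court advantage in basketball",
}
top = top_k_phrases(toy_corpus, subset_ids=[1, 2], k=3, max_len=2)
```

Because this baseline recomputes phrase statistics for the whole corpus on every query, its cost grows with corpus size rather than with the size of the answer, which is the kind of overhead that precomputed phrase indices and early termination aim to avoid.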