Clustering is one of the most widely studied topics in data mining, and k-means is one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on approximate versions of k-means that require only one or a small number of passes over the entire dataset. In this paper, we present a new algorithm, called Fast and Exact K-means Clustering (FEKM), which typically requires only one or a small number of passes over the entire dataset and provably produces the same cluster centers as the original k-means algorithm. The algorithm uses sampling to create initial cluster centers and then takes one or more passes over the entire dataset to adjust these cluster centers. We provide theoretical analysis to show that the cluster centers thus reported are the same as the ones computed by the original k-means algorithm. Experimental results on a number of real and synthetic datasets show speedups between a factor of 2 and 4.5 compared to k-means. This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analyzing data that is distributed across loosely coupled machines. Unlike previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM clearly outperforms the two other options for exact clustering on distributed data: downloading all data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
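The two-stage idea described above (cluster a sample to get initial centers, then pass over the full dataset to adjust them) can be illustrated with a minimal sketch. This is not FEKM itself: the abstract does not give the bookkeeping that lets FEKM guarantee exact results in few passes, so the second stage here is simply standard k-means (Lloyd's algorithm) started from the sample's centers. The function names and the `sample_size` and `seed` parameters are hypothetical, chosen for illustration.

```python
import random

def assign(points, centers):
    """For each point, return the index of its nearest center (squared Euclidean distance)."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels

def update(points, labels, centers):
    """Recompute each center as the mean of its points; an empty cluster keeps its old center."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p, j in zip(points, labels):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else tuple(centers[j])
        for j in range(len(centers))
    ]

def lloyd(points, centers, max_passes=100):
    """Standard k-means. Returns (centers, number of full passes made over `points`)."""
    centers = [tuple(c) for c in centers]
    for n_pass in range(1, max_passes + 1):
        new_centers = update(points, assign(points, centers), centers)
        if new_centers == centers:  # no center moved: converged
            return centers, n_pass
        centers = new_centers
    return centers, max_passes

def sample_seeded_kmeans(points, k, sample_size, seed=0):
    """Illustrative only (not FEKM): cluster a random sample, then refine on all points."""
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)
    # Stage 1: k-means on the sample, seeded with the first k sampled points.
    sample_centers, _ = lloyd(sample, sample[:k])
    # Stage 2: full passes over the dataset, starting from the sample's centers.
    return lloyd(points, sample_centers)
```

A good sample tends to place the initial centers close to their final positions, so the second stage usually needs fewer full passes than k-means started from arbitrary centers; the pass count returned by `lloyd` makes that easy to compare.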
Product search engines face unique challenges that differ from those of web page search. The goal of a product search engine is to rank relevant items that the user may be interested in purchasing. Clicks provide a strong signal of a user's interest in an item. Traditional click prediction models include many features, such as document text, price, and user information. In this paper, we propose adding information extracted from the thumbnail image of the item as additional features for click prediction. Specifically, we use two types of image features: photographic features and object features. Our experiments reveal that both types of features can be highly useful in click prediction. We measure performance in terms of both prediction accuracy and NDCG. Overall, our experiments show that augmenting a standard click prediction model with image features provides a significant improvement in precision and recall and boosts NDCG.
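For readers unfamiliar with the ranking metric mentioned above, NDCG (normalized discounted cumulative gain) can be sketched as follows. This assumes one common formulation, with linear gain and a log2 position discount; the abstract does not say which variant or cutoff the authors use, so the function names and the formula choice are assumptions.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order (best rank first)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal ordering; defined as 0.0 when nothing is relevant."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

With click data, `relevances` could be 1 for a clicked item and 0 otherwise, listed in the order the model ranked the items. A ranking that places the clicked items first scores 1.0.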
In recent times we have witnessed the emergence of large online markets with two-sided preferences that are responsible for businesses worth billions of dollars. Recommendation systems are critical components of such markets. Matching in such a market depends on the preferences of both sides; consequently, constructing a recommendation system for such a market requires considering the preferences of both sides. The online dating market and the online freelancer market are examples of markets with two-sided preferences. Recommendation systems for such markets are fundamentally different from typical rating-based product recommendations. We pose this problem as a bipartite ranking problem. There has been extensive research on bipartite ranking algorithms. Generalized linear regression models are popular methods for constructing such rankings because they can be learned easily from big data and are computationally simple on engineering platforms. However, we show that for markets with two-sided preferences, one can improve the AUC (area under the receiver operating characteristic curve) score by considering separate models for the preferences of each side and constructing a two-layer architecture for ranking. We call this a two-level model algorithm. For both synthetic and real data, we show that the two-level model algorithm achieves a better AUC than the direct application of a generalized linear model such as L1 logistic regression or an ensemble method such as the random forest algorithm. We provide a theoretical justification of the AUC optimality of the two-level model and pose a theoretical problem for a more general result.
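The AUC score used as the evaluation metric above has a direct ranking interpretation: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. A minimal sketch of that definition follows. The function name and the 0/1 label encoding are assumptions, and the quadratic pairwise loop is chosen for clarity, not efficiency.

```python
def auc(scores, labels):
    """Pairwise AUC: probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count 1 for each correctly ordered pair and 0.5 for each tied pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])` orders three of the four positive-negative pairs correctly and so evaluates to 0.75.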