To overcome the shortcomings of K-means clustering, including the need to specify the number of clusters, sensitivity to the initial cluster centers (seeds), and convergence to local optima, this article proposes an improved genetic algorithm (GA) with a novel Lagrange-based fitness function and an initial population technique (called the NicheClust algorithm). NicheClust determines the best chromosomes and then feeds them into K-means as initial seeds, achieving higher-quality clustering results by allowing the initial seeds to readjust according to clustering demands. The GA is used to search for a globally optimal solution. The initial population method automatically captures the appropriate number of clusters and finds the initial seeds. The Lagrange-based approach prevents the fitness function from converging prematurely and steers the K-means clustering results toward a global optimum. Experimental results on six taxi Global Positioning System (GPS) datasets verify the higher performance of NicheClust compared with other clustering methods, and a statistical analysis validates its effectiveness.

KEYWORDS
improved genetic algorithm, initialization population technology, K-means clustering, Lagrange-based fitness function

1 INTRODUCTION

Data clustering is considered a difficult and challenging problem in unsupervised machine learning. 1-6 There are many clustering algorithms, 7,8 of which K-means is among the most widely used and important because of its effectiveness and simplicity. However, K-means has a number of well-known drawbacks, including sensitivity to the initial cluster centers, 8-10 convergence to a local optimum, and the difficulty of determining the number of clusters. A variety of clustering algorithms have been proposed to overcome these shortcomings. Several existing techniques aim to find higher-quality initial seeds than the random initial seeds K-means chooses.
1,5,11-13 For example, the work in Reference 5 presented an efficient K-means clustering filtering algorithm using density-based initial seeds, and fast density clustering strategies based on K-means were presented in Reference 1. In addition, K-means++ 14 is typically used to address the sensitivity of K-means to the choice of initial seeds. However, K-means++ does not take the distribution of the data points into account, which results in unevenly distributed seeds and repeated calculations. Meanwhile, K-means clustering has difficulty obtaining a globally optimal solution because of the quality of the initial seeds. 10,15,16 Therefore, to improve the performance and efficiency of K-means clustering, several genetic algorithm (GA)-based K-means techniques 17-20 have been developed in recent years. These techniques produce better clustering results than simple K-means or basic GA-based clustering. Combining a GA with K-means also helps to avoid the local minima issues of K-means. 10,17,18,20 Typically, a GA-based clustering technique does not require user input regarding the ...