In the Big Data era, scalability is an essential characteristic of machine learning algorithms. Most data discovery algorithms apply a feature selection (FS) method as a crucial preprocessing step. The main objective of FS is to select a subset of informative features in such a way that the discriminating power of the data is preserved. Unfortunately, most traditional feature selection algorithms are not scalable, which is a significant weakness when coping with big datasets. This paper proposes a distributed and Scalable Global Mutual Information-based feature selection framework, called SGMI, to deal with large-scale datasets. The framework first generates a similarity matrix that represents the dependency among all features. To this end, the joint-value histograms of paired feature columns are generated in a scalable way and in a single pass. Next, based on these histograms, the elements of the dependency criterion, namely the individual and joint entropies, are extracted independently. Finally, the SGMI framework applies an optimization method to rank features based on the similarity matrix. In this paper, three popular optimization methods, Quadratic Programming (QP), Spectral Relaxation (SR), and Truncated Power (TP), are plugged into the proposed framework, producing three scalable FS methods: SGMI-QP, SGMI-SR, and SGMI-TP. Experimental studies are performed on four balanced and imbalanced large-scale datasets, and the empirical outcomes are compared with a distributed feature selection method, DiRelief, and with the original versions of the produced methods. The experimental results illustrate that (i) all produced methods are scalable and have a lower execution time than both their traditional versions and the DiRelief method; (ii) SGMI-QP has a lower execution time than the other two; (iii) there is no significant difference among the produced methods' outcomes on the balanced big datasets; and (iv) in general, SGMI-SR copes with big datasets better than SGMI-QP, SGMI-TP, and DiRelief.
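To make the pipeline concrete, the following is a minimal single-machine sketch in Python with NumPy of the three steps the abstract outlines: joint-value histograms of paired feature columns, mutual information derived from the individual and joint entropies, and a ranking step based on the leading eigenvector of the similarity matrix, a standard form of spectral relaxation. The function names (mutual_information, mi_similarity_matrix, spectral_relaxation_ranking), the bin count, and the toy data are illustrative assumptions; the actual SGMI framework is distributed, and this sketch is not the authors' implementation.

import numpy as np

def mutual_information(xi, xj, bins=16):
    # Joint-value histogram of one pair of feature columns (single pass).
    joint, _, _ = np.histogram2d(xi, xj, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)           # marginal distributions
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))  # Shannon entropy
    return h(px) + h(py) - h(pxy)                       # I = H(X) + H(Y) - H(X, Y)

def mi_similarity_matrix(X, bins=16):
    # Pairwise mutual information among all feature columns of X.
    d = X.shape[1]
    S = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            S[i, j] = S[j, i] = mutual_information(X[:, i], X[:, j], bins)
    return S

def spectral_relaxation_ranking(S):
    # Rank features by the leading eigenvector of the similarity matrix,
    # a standard spectral relaxation of the quadratic ranking objective.
    _, vecs = np.linalg.eigh(S)
    return np.argsort(np.abs(vecs[:, -1]))[::-1]        # highest score first

X = np.random.rand(1000, 8)                             # toy data: 1000 rows, 8 features
print(spectral_relaxation_ranking(mi_similarity_matrix(X)))

In the distributed setting the abstract describes, the histogram and entropy computations for different feature pairs are independent, so they can be assigned to separate workers; only the ranking step needs the assembled similarity matrix.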