Phones are critical components in various fields of computational linguistics; for example, phone distributions can be helpful in speech recognition and speech synthesis. Traditional approaches to estimating phone distributions typically rely on grapheme-to-phoneme (G2P) systems that are either manually designed by linguists or trained on large datasets. These prohibitive requirements make research on low-resource languages extremely challenging. In this work, we propose a novel approach to estimating phone distributions that requires only raw audio datasets. We first estimate the phone ranks by combining language-independent recognition results with Learning to Rank results. Next, we approximate the distribution with Expectation-Maximization by fitting a Yule distribution. Results on seven languages show that the joint model achieves better performance in both the ranking estimation and distribution estimation tasks.
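To make the distribution-fitting step concrete, the following is a minimal sketch of one way a Yule (Yule-Simon) distribution could be fitted to phone ranks with Expectation-Maximization. It is an illustration under stated assumptions, not the procedure used in this work: it assumes each observation is the rank (1 = most frequent) of one phone occurrence, it uses the standard representation of the Yule-Simon distribution as an exponential mixture of geometric distributions, and the names `digamma`, `yule_log_pmf`, `fit_yule_em`, and the toy `ranks` data are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift x upward so the asymptotic series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def yule_log_pmf(k, rho):
    """log P(K = k) = log(rho * B(k, rho + 1)) for rank k = 1, 2, ..."""
    return (math.log(rho) + math.lgamma(k) + math.lgamma(rho + 1.0)
            - math.lgamma(k + rho + 1.0))


def fit_yule_em(ranks, rho=1.0, tol=1e-10, max_iter=10000):
    """Estimate the shape parameter rho by EM.

    Latent-variable view (an assumption of this sketch):
    w ~ Exponential(rho), k | w ~ Geometric(exp(-w)).
    E-step: E[w | k] = digamma(k + rho + 1) - digamma(rho + 1).
    M-step: rho = n / sum of the expected w values.
    """
    n = len(ranks)
    for _ in range(max_iter):
        expected_w = sum(digamma(k + rho + 1.0) - digamma(rho + 1.0)
                         for k in ranks)
        new_rho = n / expected_w
        if abs(new_rho - rho) < tol:
            return new_rho
        rho = new_rho
    return rho


# Toy data: ranks of 100 hypothetical phone occurrences (not real results).
ranks = [1] * 60 + [2] * 20 + [3] * 10 + [4] * 5 + [5] * 3 + [8] * 2
rho_hat = fit_yule_em(ranks)
log_likelihood = sum(yule_log_pmf(k, rho_hat) for k in ranks)
```

The fixed point of this update coincides with the stationary point of the log-likelihood, so on data like the toy sample the returned value is the maximum-likelihood estimate of the shape parameter. If every observed rank is 1, the likelihood has no finite maximizer and the loop simply stops at `max_iter`.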