I INTRODUCTION
Convolutional neural networks (CNNs) have been used extensively to solve complicated machine learning problems such as sentiment analysis, feature extraction, genre classification and prediction. Hybrid models of CNNs and recurrent neural networks (RNNs) have recently been applied to temporal data such as audio signals and word sequences. Convolutional Recurrent Neural Networks (CRNNs) are neural networks formed by combining CNN and RNN components. The CRNN architecture can be viewed as a modified CNN with an RNN structure placed on top of it. This architecture is a robust structure in which the CNN layers extract local features and the RNN performs temporal summarization. CNNs have been very popular in music recognition in diverse areas such as automatic tagging, hybrid music recommendation and feature learning. The key elements of a CNN are the type of input signal, the learning rate, the activation function, the batch size and the architecture. The mel-spectrogram is the preferred input type for music information retrieval. Mel-spectrograms are widely used features for tagging, boundary and onset detection, and latent feature learning, and the mel scale has been shown to approximate the human auditory system. To obtain a mel-spectrogram, the STFT (short-time Fourier transform) and a log-amplitude spectrogram are required as preprocessing steps. Music feature learning with deep networks improved when ReLU was used as the activation function. This function was later replaced with ELU (Exponential Linear Unit) to obtain faster and more accurate learning. Recurrent neural networks also improved significantly when gated recurrent units were applied. Gated RNNs have gating units that control the flow of information through them, allowing them to capture critical information at different time scales.
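The preprocessing chain described above (STFT, mel-scale mapping, log amplitude) can be illustrated with a minimal pure-Python sketch. It is not the pipeline used in this work: the function names, the HTK-style mel formula, the triangular filterbank, the naive DFT, and all parameter values (n_fft, hop, n_mels, the 1 kHz test tone) are assumptions chosen only to keep the example small and self-contained. In practice an audio library would normally be used instead.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def stft_power(signal, n_fft, hop):
    """Power spectrum of each Hann-windowed frame (naive DFT, for clarity)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft)
              for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        spectrum = []
        for k in range(n_fft // 2 + 1):
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                        for n in range(n_fft))
            spectrum.append(abs(coeff) ** 2)
        frames.append(spectrum)
    return frames


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1))
                for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def log_mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8):
    """STFT power -> mel filterbank -> log amplitude (dB)."""
    power = stft_power(signal, n_fft, hop)
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    result = []
    for spectrum in power:
        mel_energy = [sum(w * p for w, p in zip(row, spectrum))
                      for row in bank]
        # Small offset avoids log(0) in empty bands.
        result.append([10.0 * math.log10(e + 1e-10) for e in mel_energy])
    return result  # indexed as [frame][mel_band]


# Toy input: a 1 kHz sine sampled at 8 kHz (hypothetical test signal).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
log_mel = log_mel_spectrogram(tone, sample_rate)
```

With these toy settings the result has 7 frames of 8 mel bands each. A time-by-frequency matrix of this kind is the sort of input that the CNN layers of a CRNN would consume.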
II LITERATURE SURVEY