Zalán Bodó scite author profile

Music information retrieval has lately become an important field of information retrieval, because profound analysis of music pieces can yield important information: genre labels, mood prediction, and artist identification, to name a few. The lack of large-scale music datasets containing audio features and metadata has led to the construction and publication of the Million Song Dataset (MSD) and its satellite datasets. Nonetheless, mainly because of licensing limitations, no freely available lyrics datasets have been published for research. In this paper we describe the construction of an English lyrics dataset based on the Last.fm Dataset, connected to LyricWiki’s database and MusicBrainz’s encyclopedia. To avoid copyright issues, only the URLs to the lyrics are stored in the database. To demonstrate the suitability of the compiled dataset, in the second part of the paper we present genre classification experiments with lyrics-based features, including bag-of-n-grams, as well as higher-level features such as rhyme-based and statistical text features. We obtained results similar to the experimental outcomes presented in other works, showing that more sophisticated textual features can improve genre classification performance, and indicating the superiority of the binary weighting scheme compared to tf–idf.

On the impact of domain-specific knowledge in evolutionary music composition

Sulyok

Harte

Bodó

2019

A note on label propagation for semi-supervised learning

Bodó

Csató

2015

Abstract. Semi-supervised learning has become an important and thoroughly studied subdomain of machine learning in the past few years, because gathering large amounts of unlabeled data is almost costless, and the costly human labeling process can be minimized by semi-supervision. Label propagation is a transductive semi-supervised learning method that operates on the data graph, which is undirected most of the time. It was introduced in [8], and many variants have been proposed since. However, the base algorithm has two variants: the first variant presented in [8] and its slightly modified version used afterwards, e.g. in [7]. This paper presents and compares the two algorithms, both theoretically and experimentally, and also tries to make a recommendation on which variant to use.

A CiteSeerX-Based Dataset for Record Linkage and Metadata Extraction

Bodó

2018

Zalán Bodó

Wikipedia-Based Kernels for Text Categorization

Connecting the Last.fm Dataset to LyricWiki and MusicBrainz. Lyrics-based experiments in genre classification
