“…
- EndRhyme [24], which counts the matching vowel phonemes at the end of candidate line c_i and s_κ;
- rhyme2vec, our novel rhyme embedding method, as described in Section 3.2;
- NN5 [24], a character-level neural network for rap line encoding, which takes the five previous lines as the query (i.e., {s_{κ−i}}_{i=0}^{4});
- doc2vec [30], a popular sentence embedding method, which handles {s_κ; c_i} as a unified paragraph;
- DopeLearning [24], a state-of-the-art rap lyric representation learning method, which concatenates a series of statistical characteristics, including the features of EndRhyme, EndRhyme-1 (number of matching vowel phonemes at the end of c_i and s_{κ−1}), Other-Rhyme (average number of matching vowel phonemes per word), LineLength (line similarity of c_i and s_κ), BOW (Jaccard similarity between the corresponding bags of words of c_i and s_κ), BOW5 (Jaccard similarity between the corresponding bags of words of five previous lines and s_κ), LSA (latent semantic analysis similarity of c_i and s_κ), and NN5 (confidence value generated from the last softmax layer);
- early fusion [6], a widely used multi-modal aggregation method, which concatenates all of the features into a unified representation (i.e., v_t := [v_r, v_s]);
- EF-AE, a variant of HAVAE, which adopts the same learning operations as HAVAE, but bypasses the sampling strategy and takes [v_r, v_s] as the input of the network; and
- EF-VAE, another variant of HAVAE, which takes [v_r, v_s] as the input of the VAE network instead of the INPUT stage.…”
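
To make three of the simpler baseline features concrete, the following is a minimal sketch in Python, not the implementation used in [24] or [6]. It assumes that the vowel phonemes of each line have already been extracted into a list (the phoneme extraction step is not shown), that EndRhyme counts a contiguous run of matching vowels from the end of both lines, and that the BOW feature treats each line as a set of lowercased, whitespace-separated words. The function names `end_rhyme`, `jaccard_bow`, and `early_fusion` are hypothetical.

```python
from typing import List, Sequence


def end_rhyme(vowels_c: Sequence[str], vowels_s: Sequence[str]) -> int:
    """Count matching vowel phonemes at the end of two lines.

    Assumption: both inputs are pre-extracted vowel phoneme sequences,
    and matching stops at the first mismatch when scanning from the end.
    """
    count = 0
    for x, y in zip(reversed(vowels_c), reversed(vowels_s)):
        if x != y:
            break
        count += 1
    return count


def jaccard_bow(line_c: str, line_s: str) -> float:
    """Jaccard similarity between the word sets of two lines.

    Assumption: words are lowercased and split on whitespace.
    """
    a = set(line_c.lower().split())
    b = set(line_s.lower().split())
    if not a and not b:
        return 0.0  # convention chosen here for two empty lines
    return len(a & b) / len(a | b)


def early_fusion(v_r: Sequence[float], v_s: Sequence[float]) -> List[float]:
    """Concatenate the two feature vectors: v_t := [v_r, v_s]."""
    return list(v_r) + list(v_s)
```

For example, `end_rhyme(["AH", "IY", "AY"], ["OW", "IY", "AY"])` compares the two sequences from their last vowel backwards, and `early_fusion(v_r, v_s)` simply returns a vector whose length is the sum of the two input lengths.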