The availability of sizeable RNA structure databases and powerful deep learning (DL) frameworks has spurred the recent development of DL models for RNA secondary structure prediction. Taking RNA sequences as the only input, the class of de novo DL models has demonstrated far superior performance to traditional algorithms. However, key questions remain over the statistical underpinnings of such DL models, which make no use of co-evolutionary information or the physical laws of RNA folding. Here we present a quantitative study of the capacity and generalizability of a series of de novo DL models, built on a minimal two-module architecture with no post-processing, under varied sequence distributions of seen and unseen datasets. Excellent performance is observed from our models with as few as 16K parameters, affirming their remarkable learning capacity. Our DL models generalize well over non-identical unseen sequences, but generalizability degrades rapidly as the sequence distributions of the seen and unseen datasets grow dissimilar. Examinations of family-specific behaviors reveal not only disparate model performances across RNA families but also substantial generalization gaps within the same family. We further quantify, via pairwise sequence alignment, how model generalization deteriorates with decreasing sequence similarity, providing quantitative insight into the limits of statistical learning. Model generalizability thus poses a major hurdle for the practical use of current single-sequence-based DL models, and we discuss avenues for future advances of such de novo DL models for RNA secondary structure prediction.
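The abstract measures generalization against sequence similarity computed by pairwise sequence alignment. As a minimal, self-contained sketch (not the authors' pipeline), the Python snippet below computes the fraction of identical positions in an optimal global (Needleman-Wunsch) alignment of two RNA sequences; the match/mismatch/gap scores and the helper name `global_align_identity` are illustrative assumptions, not parameters taken from the paper.

```python
# Illustrative sketch: percent identity from a global (Needleman-Wunsch)
# alignment. Scoring parameters are assumptions, not the paper's settings.

def global_align_identity(a: str, b: str,
                          match: int = 1, mismatch: int = -1,
                          gap: int = -2) -> float:
    """Return the fraction of identical positions (0.0-1.0) in an
    optimal global alignment of sequences a and b."""
    n, m = len(a), len(b)
    # DP tables: alignment scores and traceback pointers.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
        back[i][0] = "up"
    for j in range(1, m + 1):
        score[0][j] = j * gap
        back[0][j] = "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            best = max(diag, up, left)
            score[i][j] = best
            back[i][j] = "diag" if best == diag else ("up" if best == up else "left")
    # Trace back, counting identical aligned positions and alignment length.
    i, j, identical, length = n, m, 0, 0
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
        length += 1
    return identical / length

# Example: two short RNA fragments differing by one deletion
# align with 9 identities over 10 columns, i.e. identity ~0.90.
print(f"{global_align_identity('GGGAAACUCC', 'GGGAAAUCC'):.2f}")
```

In a study like this one, such an identity score would be computed between each unseen test sequence and its nearest neighbor in the training (seen) set, and model accuracy would then be binned by that similarity to expose the generalization trend the abstract describes.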