We propose a two-level stochastic context-free grammar (SCFG) architecture
for parametrized stochastic modeling of a family of RNA sequences, including
their secondary structure. A stochastic model of this type can be used for
maximum a posteriori estimation of the secondary structure of any new sequence
in the family. The proposed SCFG architecture models RNA subsequences
comprising paired bases as stochastically weighted Dyck-language words, i.e.,
as weighted balanced-parenthesis expressions. The length of each run of
unpaired bases, forming a loop or a bulge, is taken to have a phase-type
distribution: that of the hitting time in a finite-state Markov chain. Without
loss of generality, each such Markov chain can be taken to have a bounded
complexity. The scheme yields an overall family SCFG with a manageable number
of parameters.

Comment: 5 pages, submitted to the 2007 Information Theory and Applications
Workshop (ITA 2007)
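
To make the phase-type assumption on loop and bulge lengths concrete, the following is a minimal sketch, not the authors' implementation. It computes the distribution of the hitting (absorption) time N of a small finite-state Markov chain, where N is read as the number of unpaired bases in a run. The function name phase_type_pmf, the vector alpha (initial distribution over transient states), the matrix T (transition probabilities among transient states), and all numerical values are hypothetical choices made for illustration; none of them come from the paper.

```python
def phase_type_pmf(alpha, T, max_len):
    """Return [P(N = 1), ..., P(N = max_len)] for the absorption time N.

    alpha: initial distribution over the transient states (hypothetical input).
    T:     sub-stochastic matrix of transitions among transient states; the
           mass missing from row i is the probability of absorbing from i.
    """
    m = len(alpha)
    # Probability of leaving the transient states (ending the run) from each state.
    exit_prob = [1.0 - sum(T[i]) for i in range(m)]
    # Mass still in each transient state before the next step.
    state = list(alpha)
    pmf = []
    for _ in range(max_len):
        # Absorb on this step: the run ends with the current length.
        pmf.append(sum(state[i] * exit_prob[i] for i in range(m)))
        # Otherwise move among the transient states: state <- state * T.
        state = [sum(state[i] * T[i][j] for i in range(m)) for j in range(m)]
    return pmf


# Illustrative two-state chain (values are assumptions, not fitted to any data).
alpha = [1.0, 0.0]
T = [[0.5, 0.3],
     [0.0, 0.6]]

if __name__ == "__main__":
    for length, p in enumerate(phase_type_pmf(alpha, T, 10), start=1):
        print(length, round(p, 4))
```

With a single transient state the sketch reduces to a geometric run-length distribution, which is the special case a plain SCFG loop rule would give; adding states lets the run-length distribution take other shapes while the number of parameters stays fixed by the size of the chain.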