Motivation
The secondary structure of RNA is important to its function. Over the last few years, several papers have attempted to use machine learning to improve de novo RNA secondary structure prediction. Many of these papers report impressive results for intra-family predictions but seldom address the much more difficult (and practical) inter-family problem.
Results
We demonstrate that it is nearly trivial with convolutional neural networks to generate pseudo-free energy changes, modeled after structure mapping data, that improve the accuracy of structure prediction for intra-family cases. We propose a more rigorous method for inter-family cross-validation that can be used to assess the performance of learning-based models. Using this method, we further demonstrate that intra-family performance is insufficient proof of generalisation despite the widespread assumption in the literature, and provide strong evidence that many existing learning-based models have not generalised inter-family.
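The idea behind inter-family cross-validation can be sketched as a split in which no RNA family appears in both the training and the test set. The abstract does not spell out the exact procedure, so the snippet below is only a minimal leave-one-family-out sketch; the function name `leave_one_family_out` and the `(family, sequence)` record layout are assumptions made for illustration.

```python
from collections import defaultdict


def leave_one_family_out(records):
    """Yield (held_out_family, train, test) folds.

    `records` is a list of (family, sequence) pairs (hypothetical layout).
    Each fold holds out every sequence of one family for testing, so the
    model is never evaluated on a family it was trained on.
    """
    by_family = defaultdict(list)
    for family, sequence in records:
        by_family[family].append(sequence)

    families = sorted(by_family)
    for held_out in families:
        test = list(by_family[held_out])
        train = [
            sequence
            for family in families
            if family != held_out
            for sequence in by_family[family]
        ]
        yield held_out, train, test


if __name__ == "__main__":
    toy = [("tRNA", "GCGC"), ("tRNA", "AUAU"), ("5S", "GGCC"), ("SRP", "ACGU")]
    for family, train, test in leave_one_family_out(toy):
        print(family, len(train), len(test))
```

By contrast, a random split over sequences would typically place members of the same family on both sides, which measures intra-family performance only.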
Availability
Source code and data are available at https://github.com/marcellszi/dl-rna.
Supplementary information
Supplementary data are available at Bioinformatics online.
Algorithmic prediction of RNA secondary structure has been an area of active inquiry since the 1970s. Despite many innovations since then, our best techniques are not yet perfect. The workhorses of the RNA secondary structure prediction engine are recursions first described by Zuker and Stiegler in 1981. These have well-understood caveats; a notable flaw is the ad hoc treatment of multi-loops, also called helical junctions, that persists today. While several advanced models for multi-loops have been proposed, it seems to have been assumed that incorporating them into the recursions would lead to intractability, and so no algorithms for these models exist. These models include the classical model based on Jacobson–Stockmayer polymer theory and another by Aalberts and Nandagopal that incorporates two-length-scale polymer physics. We have realized practical, tractable algorithms for each of these models. However, after implementing these algorithms, we found that no advanced model was better than the original, ad hoc model used for multi-loops. While this is unexpected, it supports the continued use of the current model.
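For orientation, the ad hoc multi-loop model is commonly written as an affine function of loop composition, whereas Jacobson–Stockmayer theory predicts a loop penalty that grows logarithmically with loop size. The forms below are a sketch only: the symbols $a$, $b$, $b'$, $c$, $u$ (unpaired nucleotides in the loop) and $h$ (branching helices) are assumed notation, and the abstract does not give the exact parameterisation of either model.

```latex
\Delta G_{\text{multi}}^{\text{linear}} = a + b\,u + c\,h
\qquad\text{versus}\qquad
\Delta G_{\text{multi}}^{\text{log}} = a + b'\,\ln(u) + c\,h
```

The linear form decomposes additively over loop elements, which is what makes it convenient inside dynamic programming recursions.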
Ribonucleic acid (RNA) is an essential molecule in a wide range of biological functions. In 1990, McCaskill introduced a dynamic programming algorithm for computing the partition function of an RNA sequence. This forward model is widely used for understanding the thermodynamic properties of a given RNA. In this work, we introduce a generalization of McCaskill's algorithm that is well-defined over continuous inputs and is differentiable. This allows us to tackle the inverse folding problem (designing a sequence with desired equilibrium thermodynamic properties) directly using gradient optimization. This has applications to creating RNA-based drugs such as mRNA vaccines. Furthermore, it allows McCaskill's foundational algorithm to be incorporated into machine learning pipelines directly since we have made it end-to-end differentiable. This work highlights how principles from differentiable programming can be translated to existing physical models to develop powerful tools for machine learning. We provide a concrete example by implementing an effective and interpretable RNA design algorithm.
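To make the forward model concrete, here is a minimal sketch of a McCaskill-style partition function recursion for a discrete sequence. It is not the differentiable generalization described above, and it replaces the nearest-neighbour energy model with a toy assumption: every allowed base pair contributes the same energy `pair_energy`. The function name, the parameters, and the default values are all illustrative assumptions.

```python
import math

# Watson-Crick and wobble pairs allowed in this toy model.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def partition_function(seq, pair_energy=-1.0, rt=0.6163, min_loop=3):
    """Sum of Boltzmann weights over all nested secondary structures of `seq`.

    Toy energy model (assumption): each base pair contributes `pair_energy`
    kcal/mol; `rt` is RT in kcal/mol at roughly 37 degrees Celsius;
    `min_loop` is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    weight = math.exp(-pair_energy / rt)
    # q[i][j] is the partition function of the half-open slice seq[i:j].
    # Slices too short to contain a pair keep the initial value 1.0.
    q = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            last = j - 1
            # Case 1: the last nucleotide of the slice is unpaired.
            total = q[i][j - 1]
            # Case 2: the last nucleotide pairs with position k.
            for k in range(i, last - min_loop):
                if (seq[k], seq[last]) in PAIRS:
                    total += q[i][k] * weight * q[k + 1][last]
            q[i][j] = total
    return q[0][n]


if __name__ == "__main__":
    print(partition_function("GGGAAAUCCC"))
```

Setting `pair_energy=0.0` makes every structure weigh 1, so the function then counts the admissible structures, which is a convenient sanity check on the recursion.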
An RNA design algorithm takes a target RNA structure and finds a sequence that folds into that structure. This is fundamentally important for engineering therapeutics using RNA. Computational RNA design algorithms are guided by fitness functions, but not much research has been done on the merits of these functions. We survey current RNA design approaches with a particular focus on the fitness functions used. We experimentally compare the most widely used fitness functions in RNA design algorithms on both synthetic and natural sequences. It has been almost 20 years since the last comparison was published, and we find similar results with a major new result: maximizing probability outperforms minimizing ensemble defect. The probability is the likelihood of a structure at equilibrium and the ensemble defect is the weighted average number of incorrect positions in the ensemble. We find that maximizing probability leads to better results on synthetic RNA design puzzles and agrees more often than other fitness functions with natural sequences and structures, which were designed by evolution. Also, we observe that many recently published approaches minimize structure distance to the minimum free energy prediction, which we find to be a poor fitness function.
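The two fitness functions compared above can be written out as follows. The notation is an assumption chosen for illustration: $\phi$ is the candidate sequence, $s^{*}$ the target structure, $\Delta G(\phi, s)$ the free energy of structure $s$ on $\phi$, $Q(\phi)$ the partition function, and $d(s, s^{*})$ the number of nucleotides whose pairing status differs between $s$ and $s^{*}$.

```latex
p(s^{*} \mid \phi) = \frac{e^{-\Delta G(\phi, s^{*})/RT}}{Q(\phi)},
\qquad
Q(\phi) = \sum_{s} e^{-\Delta G(\phi, s)/RT}

n(\phi, s^{*}) = \sum_{s} p(s \mid \phi)\, d(s, s^{*})
```

A design algorithm would maximize the first quantity or minimize the second; both are functions of the full equilibrium ensemble rather than of the minimum free energy structure alone.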
Callus-like tissues, isolated from protonemal cultures of two species of mosses, grow vigorously and without marked differentiation on media containing sucrose, casamino acids, and coconut milk. On mineral agar and on media containing sorbitol the tissue from Polytrichum (found diploid) reverts to the growth pattern of apparently normal moss plants.