With advances in machine learning (ML)-assisted protein
engineering,
models based on data, biophysics, and natural evolution are being
used to propose informed libraries of protein variants to explore.
Synthesizing these libraries for experimental screens is a major bottleneck,
as the cost of obtaining large numbers of exact gene sequences is
often prohibitive. Degenerate codon (DC) libraries are a cost-effective
alternative for generating combinatorial mutagenesis libraries where
mutations are targeted to a handful of amino acid sites. However,
existing computational methods for optimizing DC libraries to include
desired protein variants are not well suited to designing libraries for
ML-assisted protein engineering. To address these drawbacks, we present
DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized
method that directly optimizes DC libraries to be useful for protein
engineering: to sample protein variants that are likely to have both
high fitness and high diversity in the sequence search space. Using
computational simulations and wet-lab experiments, we demonstrate
that DeCOIL is effective across two specific case studies, with the
potential to be applied to many other use cases. DeCOIL offers several
advantages over existing methods, as it is direct, easy to use, generalizable,
and scalable. With the accompanying software, DeCOIL can be readily implemented to generate desired informed
libraries.
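
To make the notion of a degenerate codon concrete, the following minimal sketch expands a single DC into the codons it encodes and tallies the resulting amino acids. It assumes the standard IUPAC nucleotide ambiguity codes and the standard genetic code; the function names (expand_codon, amino_acid_counts) are hypothetical helpers chosen for illustration and are not part of the DeCOIL software.

```python
from collections import Counter
from itertools import product

# IUPAC nucleotide ambiguity codes (standard definitions).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expand_codon(degenerate_codon):
    """Return every exact codon encoded by a three-letter degenerate codon."""
    if len(degenerate_codon) != 3:
        raise ValueError("a codon must have exactly three positions")
    choices = [IUPAC[base] for base in degenerate_codon.upper()]
    return ["".join(bases) for bases in product(*choices)]


def amino_acid_counts(degenerate_codon):
    """Count how many encoded codons translate to each amino acid ('*' = stop)."""
    return Counter(CODON_TABLE[codon] for codon in expand_codon(degenerate_codon))


if __name__ == "__main__":
    counts = amino_acid_counts("NNK")
    print(len(expand_codon("NNK")), "codons")
    for aa, n in sorted(counts.items()):
        print(aa, n)
```

Because each site is sampled independently, the number of DNA sequences in a multi-site library is the product of the per-site codon counts, and the unequal counts per amino acid show why the choice of DC at each site shapes which protein variants a library tends to sample.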