Many of life's most fascinating phenomena emerge from interactions among many elements: many amino acids determine the structure of a single protein, many genes determine the fate of a cell, many neurons are involved in shaping our thoughts and memories. Physicists have long hoped that these collective behaviors could be described using the ideas and methods of statistical mechanics. In the past few years, new, larger-scale experiments have made it possible to construct statistical mechanics models of biological systems directly from real data. We review the surprising successes of this "inverse" approach, using examples from families of proteins, networks of neurons, and flocks of birds. Remarkably, in all these cases the models that emerge from the data are poised at a very special point in their parameter space: a critical point. This suggests there may be some deeper theoretical principle behind the behavior of these diverse systems.
Flocking is a typical example of emergent collective behavior, where interactions between individuals produce collective patterns on the large scale. Here we show how a quantitative microscopic theory for directional ordering in a flock can be derived directly from field data. We construct the minimally structured (maximum entropy) model consistent with experimental correlations in large flocks of starlings. The maximum entropy model shows that local, pairwise interactions between birds are sufficient to correctly predict the propagation of order throughout entire flocks of starlings, with no free parameters. We also find that the number of interacting neighbors is independent of flock density, confirming that interactions are ruled by topological rather than metric distance. Finally, by comparing flocks of different sizes, the model correctly accounts for the observed scale invariance of long-range correlations among the fluctuations in flight direction.

animal groups | statistical inference

The collective behavior of large groups of animals is an imposing natural phenomenon, very hard to cast into a systematic theory (1). Physicists have long hoped that such collective behaviors in biological systems could be understood in the same way as we understand collective behavior in physics, where statistical mechanics provides a bridge between microscopic rules and macroscopic phenomena (2, 3). A natural test case for this approach is the emergence of order in a flock of birds: Out of a network of distributed interactions among the individuals, the entire flock spontaneously chooses a unique direction in which to fly (4), much as local interactions among individual spins in a ferromagnet lead to a spontaneous magnetization of the system as a whole (5). Despite detailed development of these ideas (6-9), there still is a gap between theory and experiment.
Here we show how to bridge this gap by constructing a maximum entropy model (10) based on field data of large flocks of starlings (11-13). We use this framework to show that the effective interactions among birds are local and that the number of interacting neighbors is independent of flock density, confirming that interactions are ruled by topological rather than metric distance (14). The statistical mechanics models that we derive in this way provide an essentially complete, parameter-free theory for the propagation of directional order throughout the flock.

We consider flocks of European starlings, Sturnus vulgaris, as in Fig. 1A. At any given instant of time, following refs. 11-13, we can attach to each bird $i$ a vector velocity $\vec{v}_i$ and define the normalized velocity $\vec{s}_i = \vec{v}_i / |\vec{v}_i|$ (Fig. 1B). On the hypothesis that flocks have statistically stationary states, we can think of all these normalized velocities as being drawn (jointly) from a probability distribution $P(\{\vec{s}_i\})$. It is not possible to infer this full distribution directly from experiments, because the space of states specified by $\{\vec{s}_i\}$ is too large. However, what we can measure from field data is th...
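The normalization step defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be a list of 3D velocity tuples, and the list of neighbor index pairs is assumed to be supplied by the caller. The polarization and pair-correlation helpers are common summary statistics for such data and are not taken from the excerpt.

```python
import math

def normalize(v):
    """Return the unit vector s = v / |v| for a 3D velocity v."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero velocity")
    return tuple(c / norm for c in v)

def polarization(velocities):
    """Magnitude of the mean normalized velocity (between 0 and 1).

    Assumption: used here only as an illustrative order parameter.
    """
    units = [normalize(v) for v in velocities]
    n = len(units)
    mean = [sum(s[k] for s in units) / n for k in range(3)]
    return math.sqrt(sum(c * c for c in mean))

def mean_pair_correlation(velocities, pairs):
    """Average of the dot product s_i . s_j over the given (i, j) pairs.

    Assumption: `pairs` is a hypothetical, caller-supplied neighbor list.
    """
    units = [normalize(v) for v in velocities]
    dots = [sum(a * b for a, b in zip(units[i], units[j])) for i, j in pairs]
    return sum(dots) / len(dots)

if __name__ == "__main__":
    flock = [(2.0, 0.1, 0.0), (1.9, -0.1, 0.0), (2.1, 0.0, 0.2)]
    print(polarization(flock))
    print(mean_pair_correlation(flock, [(0, 1), (1, 2)]))
```

Working with unit vectors separates the direction of flight from the speed, which is why the distribution above is written over the $\vec{s}_i$ rather than the raw velocities.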
Stochastic rearrangement of germline V-, D-, and J-genes to create variable coding sequence for certain cell surface receptors is at the origin of immune system diversity. This process, known as "VDJ recombination", is implemented via a series of stochastic molecular events involving gene choices and random nucleotide insertions between, and deletions from, genes. We use large sequence repertoires of the variable CDR3 region of human CD4+ T-cell receptor beta chains to infer the statistical properties of these basic biochemical events. Because any given CDR3 sequence can be produced in multiple ways, the probability distribution of hidden recombination events cannot be inferred directly from the observed sequences; we therefore develop a maximum likelihood inference method to achieve this end. To separate the properties of the molecular rearrangement mechanism from the effects of selection, we focus on nonproductive CDR3 sequences in T-cell DNA. We infer the joint distribution of the various generative events that occur when a new T-cell receptor gene is created. We find a rich picture of correlation (and absence thereof), providing insight into the molecular mechanisms involved. The generative event statistics are consistent between individuals, suggesting a universal biochemical process. Our probabilistic model predicts the generation probability of any specific CDR3 sequence by the primitive recombination process, allowing us to quantify the potential diversity of the T-cell repertoire and to understand why some sequences are shared between individuals. We argue that the use of formal statistical inference methods, of the kind presented in this paper, will be essential for quantitative understanding of the generation and evolution of diversity in the adaptive immune system. 
convergent recombination | expectation maximization | palindromic nucleotides | insertion/deletion profiles

Receptor proteins on the surfaces of B and T cells in the immune system interact with pathogens, recognize them and initiate an immune response. The diversity of these receptors is the outcome of a remarkable process in which germline DNA is edited to produce a repertoire of (T or B) cells with varied antigen receptor genes (1). The process is called "VDJ recombination" because the germline contains multiple versions of so-called V-, D-, and J-genes, particular instances of which are quasi-randomly selected, stochastically edited, and joined together to produce a new surface receptor gene each time a new immune system cell is generated.

The statistical distribution of these biochemical events (and the resulting receptor coding sequences) in a population of newly created receptors is an important quantity: It contains information about the in vivo functioning of the biochemical editing mechanism and provides the baseline for a quantitative assessment of the downstream workings of selection in the adaptive immune system. Here, we address the problem of inferring this distribution from the large T-cell sequence repertoires that are becom...
High-throughput immune repertoire sequencing promises to lead to new statistical diagnostic tools for medicine and biology. Successful implementations of these methods require a correct characterization, analysis, and interpretation of these data sets. We present IGoR (Inference and Generation Of Repertoires), a comprehensive tool that takes B or T cell receptor sequence reads and quantitatively characterizes the statistics of receptor generation from both cDNA and gDNA. It probabilistically annotates sequences, and its modular structure can be used to investigate models of increasing biological complexity for different organisms. For B cells, IGoR returns the hypermutation statistics, which we use to reveal co-localization of hypermutations along the sequence. We demonstrate that IGoR outperforms existing tools in accuracy and estimate the sample sizes needed for reliable repertoire characterization.
Recognition of pathogens relies on families of proteins showing great diversity. Here we construct maximum entropy models of the sequence repertoire, building on recent experiments that provide a nearly exhaustive sampling of the IgM sequences in zebrafish. These models are based solely on pairwise correlations between residue positions but correctly capture the higher order statistical properties of the repertoire. By exploiting the interpretation of these models as statistical physics problems, we make several predictions for the collective properties of the sequence ensemble: The distribution of sequences obeys Zipf's law, the repertoire decomposes into several clusters, and there is a massive restriction of diversity because of the correlations. These predictions are completely inconsistent with models in which amino acid substitutions are made independently at each site and are in good agreement with the data. Our results suggest that antibody diversity is not limited by the sequences encoded in the genome and may reflect rapid adaptation to antigenic challenges. This approach should be applicable to the study of the global properties of other protein families.

D regions | immune receptor proteins | statistical models

The number of possible amino acid sequences exceeds the number of individual protein molecules that have ever been synthesized. As a result, the limited set of sequences that we see today carries a signature of evolutionary history (1). But not all of the limitations are historical: randomly chosen sequences will not fold into stable, compact structures (2, 3), and carrying out specific functions places yet more requirements on the sequence. Regardless of the balance between historical and functional constraints, the stochastic nature of evolutionary change means that the sequences we observe should be thought of as being drawn out of a probability distribution.
The goal of this paper is to construct an approximation to this distribution, by using a limited but biologically important example, the problem of antibody diversity.

The ensemble of all proteins is daunting, so most work focuses on particular families of proteins. The most tractable examples are those in which the relevant segments of the proteins are short, and experiments provide many independent samples of sequences from the family. For a family of small proteins that mediate protein-protein interactions, methods were developed to generate artificial sequences that are consistent with the patterns of single site substitutions and correlations between substitutions at pairs of sites; remarkably, most of these artificial sequences fold into functional structures (4, 5). Although this work did not lead to an explicit construction of the underlying probability distribution, the implicit model is equivalent to a maximum entropy model that captures pairwise correlations but ignores higher order interactions (6) and thus connects to other efforts to describe biological networks with simplified models (7-12). Maximum entropy methods have si...
The activity of a neural network is defined by patterns of spiking and silence from the individual neurons. Because spikes are (relatively) sparse, patterns of activity with increasing numbers of spikes are less probable, but, with more spikes, the number of possible patterns increases. This tradeoff between probability and numerosity is mathematically equivalent to the relationship between entropy and energy in statistical physics. We construct this relationship for populations of up to N = 160 neurons in a small patch of the vertebrate retina, using a combination of direct and model-based analyses of experiments on the response of this network to naturalistic movies. We see signs of a thermodynamic limit, where the entropy per neuron approaches a smooth function of the energy per neuron as N increases. The form of this function corresponds to the distribution of activity being poised near an unusual kind of critical point. We suggest further tests of criticality, and give a brief discussion of its functional significance.

entropy | information | neural networks | Monte Carlo | correlation
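The tradeoff between probability and numerosity described above can be made concrete with a toy calculation. The sketch below assumes N independent neurons that each spike with the same probability; this independence is a simplifying assumption made only for illustration, not the model analyzed in the excerpt, and all function names are hypothetical. In this toy, the "energy" of a pattern with K spikes is minus its log probability, and the "entropy" is the log of the number of patterns with K spikes.

```python
import math

def log_pattern_probability(n_neurons, k_spikes, p_spike):
    """Log probability of ONE specific pattern with k spikes.

    Assumption: neurons are independent and share the same spike probability.
    """
    return (k_spikes * math.log(p_spike)
            + (n_neurons - k_spikes) * math.log(1.0 - p_spike))

def log_pattern_count(n_neurons, k_spikes):
    """Log of the number of distinct patterns with k spikes (the toy entropy)."""
    return math.log(math.comb(n_neurons, k_spikes))

def spike_count_distribution(n_neurons, p_spike):
    """P(K) = exp(entropy - energy): pattern count times per-pattern probability."""
    return [
        math.exp(log_pattern_count(n_neurons, k)
                 + log_pattern_probability(n_neurons, k, p_spike))
        for k in range(n_neurons + 1)
    ]

if __name__ == "__main__":
    n, p = 160, 0.05  # n matches the largest population size quoted above; p is arbitrary
    for k in (0, 4, 8, 16):
        energy = -log_pattern_probability(n, k, p)
        entropy = log_pattern_count(n, k)
        print(k, energy / n, entropy / n)
```

For sparse spiking (p below one half), the energy grows with K while the entropy also grows, so the most likely spike count sits where the two balance.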
Flocks of birds exhibit a remarkable degree of coordination and collective response. It is not just that thousands of individuals fly, on average, in the same direction and at the same speed, but that even the fluctuations around the mean velocity are correlated over long distances. Quantitative measurements on flocks of starlings, in particular, show that these fluctuations are scale-free, with effective correlation lengths proportional to the linear size of the flock. Here we construct models for the joint distribution of velocities in the flock that reproduce the observed local correlations between individuals and their neighbors, as well as the variance of flight speeds across individuals, but otherwise have as little structure as possible. These minimally structured or maximum entropy models provide quantitative, parameter-free predictions for the spread of correlations throughout the flock, and these are in excellent agreement with the data. These models are mathematically equivalent to statistical physics models for ordering in magnets, and the correct prediction of scale-free correlations arises because the parameters, completely determined by the data, are in the critical regime. In biological terms, criticality allows the flock to achieve maximal correlation across long distances with limited speed fluctuations.

collective behavior | statistical mechanics

In a flock of birds, thousands of individuals will fly in the same direction and at the same speed, for long periods of time. However, this average behavior is not enough for flocking to be advantageous. The entire flock must respond to dangers that may be visible only to a small fraction of individuals, requiring information to propagate over long distances.
Although it is difficult to measure this information flow directly (1), we know that attacks by predators on a flock have very low success rates (2-4), and that the evasion of predators by starling flocks is associated with the triggering and propagation of waves through the flock (5). Even in the absence of predators, we can see deviations of individual behavior from the average behavior of the flock, and correlations in these fluctuations provide a signature of information flow through the flock. Strikingly, observations on flocks of starlings show that these correlations extend over very long distances, comparable to the size of the flock itself (6).

It is generally believed that the interactions among birds in a flock are local: each bird aligns its flight direction and speed to those of its near neighbors (7). If this is correct, then we have to understand how local interactions can generate correlations over much longer distances. In physics, we have two very different mechanisms for local interactions to produce long-ranged correlations. If the system spontaneously breaks a continuous symmetry, for example when all of the spins in a magnet select a particular direction in space along which the macroscopic magnetization will point, then the fluctuations in the system are dominated by Goldstone m...
Motivation: High-throughput sequencing of large immune repertoires has enabled the development of methods to predict the probability of generation by V(D)J recombination of T- and B-cell receptors of any specific nucleotide sequence. These generation probabilities are very non-homogeneous, ranging over 20 orders of magnitude in real repertoires. Since the function of a receptor really depends on its protein sequence, it is important to be able to predict this probability of generation at the amino acid level. However, brute-force summation over all the nucleotide sequences with the correct amino acid translation is computationally intractable. The purpose of this paper is to present a solution to this problem.

Results: We use dynamic programming to construct an efficient and flexible algorithm, called OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid sequences), for calculating the probability of generating a given CDR3 amino acid sequence or motif, with or without V/J restriction, as a result of V(D)J recombination in B or T cells. We apply it to databases of epitope-specific T-cell receptors to evaluate the probability that a typical human subject will possess T cells responsive to specific disease-associated epitopes. The model prediction shows an excellent agreement with published data. We suggest that OLGA may be a useful tool to guide vaccine design.

Availability and implementation: Source code is available at https://github.com/zsethna/OLGA.

Supplementary information: Supplementary data are available at Bioinformatics online.
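To illustrate why dynamic programming helps with the summation described above, here is a toy sketch. It is not OLGA's algorithm and does not model V(D)J recombination: it assumes, purely for illustration, that nucleotides are generated by a first-order Markov chain with hypothetical parameters `init` and `trans`, and it uses the standard genetic code. The dynamic program tracks only the last nucleotide emitted, so its cost grows linearly with sequence length, whereas the brute-force sum enumerates every combination of synonymous codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))

def amino_acid_probability(aa_seq, init, trans):
    """Sum the toy Markov probability over all nucleotide sequences
    that translate to aa_seq, using dynamic programming."""
    if not aa_seq:
        raise ValueError("aa_seq must be non-empty")
    # weights[b]: total probability of all nucleotide prefixes that encode
    # the amino acids processed so far and end in base b
    weights = None
    for aa in aa_seq:
        new_weights = dict.fromkeys(BASES, 0.0)
        for codon in CODONS_FOR[aa]:
            if weights is None:
                w = init[codon[0]]
            else:
                w = sum(weights[prev] * trans[prev][codon[0]] for prev in BASES)
            w *= trans[codon[0]][codon[1]] * trans[codon[1]][codon[2]]
            new_weights[codon[2]] += w
        weights = new_weights
    return sum(weights.values())

def brute_force_probability(aa_seq, init, trans):
    """Same quantity by explicit enumeration; exponential in len(aa_seq)."""
    if not aa_seq:
        raise ValueError("aa_seq must be non-empty")
    total = 0.0
    for codons in product(*(CODONS_FOR[aa] for aa in aa_seq)):
        nt = "".join(codons)
        p = init[nt[0]]
        for a, b in zip(nt, nt[1:]):
            p *= trans[a][b]
        total += p
    return total

if __name__ == "__main__":
    init = {b: 0.25 for b in BASES}
    # Hypothetical transition matrix that favors repeating the previous base
    trans = {a: {b: (0.4 if a == b else 0.2) for b in BASES} for a in BASES}
    print(amino_acid_probability("CASSL", init, trans))
    print(brute_force_probability("CASSL", init, trans))
```

The two functions should agree on short sequences; only the dynamic program remains practical as the sequence grows, since the number of synonymous nucleotide sequences multiplies with each added residue.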