1993
DOI: 10.1093/llc/8.1.20
Characteristics of Sentence Length in Running Text

Cited by 9 publications (6 citation statements); references 0 publications.
“…Shot lengths were analyzed using partial autocorrelation and power analyses, which allowed us to look for local patterns (shot-to-shot relations) and global patterns (whole-film editing profiles), respectively. Schils and de Haan (1993) performed a similar local analysis on sentence lengths in texts, and Salt (2006, p. 396) provided some piecemeal, local analyses of a number of films. In addition, Richards, Wilson, and Sommer (1994, Experiment 4) analyzed portions of four films in a manner related to our global analysis.…”
Section: Film Choice, Shot Parsing, and Analysis
confidence: 99%
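The local analysis described above estimates partial autocorrelations of a length series (shot lengths, or sentence lengths as in Schils and de Haan). A minimal sketch of that idea, not the cited authors' code, using a Yule-Walker estimate on a synthetic length sequence:

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a given positive lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

def pacf_yule_walker(x, max_lag):
    """Partial autocorrelations up to max_lag via the Yule-Walker equations."""
    r = [autocorr(x, k) for k in range(1, max_lag + 1)]
    phis = []
    for k in range(1, max_lag + 1):
        # Toeplitz matrix of autocorrelations: R[i][j] = rho(|i - j|), rho(0) = 1.
        R = np.array([[1.0 if i == j else r[abs(i - j) - 1]
                       for j in range(k)] for i in range(k)])
        phi = np.linalg.solve(R, np.array(r[:k]))
        # The last AR(k) coefficient is the partial autocorrelation at lag k.
        phis.append(float(phi[-1]))
    return phis

# Toy stand-in for a sequence of sentence (or shot) lengths; with
# independent lengths, all partial autocorrelations should sit near zero.
rng = np.random.default_rng(0)
lengths = rng.poisson(lam=20, size=500)

pac = pacf_yule_walker(lengths, max_lag=5)
```

A local pattern (e.g. long sentences tending to follow short ones) would show up as a partial autocorrelation clearly different from zero at small lags.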
“…Three main challenges arose. (1) Proteins range from about 30 to 33,000 residues, a much larger range than the 15–30 words of an average English sentence [44], and more extreme even than notable literary exceptions such as James Joyce’s Ulysses (1922), which contains a sentence of almost 4,000 words. Longer proteins require more GPU memory, and the underlying models (so-called LSTMs: Long Short-Term Memory networks [45]) have only a limited capability to remember long-range dependencies.…”
Section: Introduction
confidence: 99%
“…In this study, we introduced basic algorithms and reviewed the recent literature concerning representation learning applications in sequence analysis. Heinzinger et al. highlighted three difficulties in biological sequence modeling with NLP [68]: (i) proteins range from approximately 30 to 33,000 residues, markedly longer than the average English sentence of 15 to 30 words [106]; (ii) proteins use only 20 amino acids in most cases; if we consider one amino acid as a word, the word repertoire is 1/100,000 that of English, and if we consider a 3-mer as a word, the repertoire is 1/10 to 1/100 that of English; (iii) UniProt [90] is 10 times larger than Wikipedia in terms of data repository size, and extracting information from a very large biological database may require a commensurately large model. Embedding of biological sequences using NLP overcomes these difficulties and outperforms existing methods in several tasks, such as function, structure, localization, and disorder prediction (Table 1).…”
Section: Discussion
confidence: 99%
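The repertoire comparison in (ii) follows from simple counting: the certain part is the alphabet arithmetic (20 single residues, 20³ possible 3-mers); the ratios to English depend on the citation's own vocabulary estimates, which are not pinned down here. A quick sketch:

```python
# Tokenizing a protein as single residues vs. overlapping 3-mers.
AMINO_ACIDS = 20

# One residue per "word": a vocabulary of just 20 tokens.
single_residue_vocab = AMINO_ACIDS

# One 3-mer per "word": 20 * 20 * 20 = 8000 possible tokens,
# closer to (but still well below) an English-scale vocabulary.
kmer_vocab = AMINO_ACIDS ** 3
```

This is why 3-mer tokenization is a common compromise in protein language modeling: it enlarges the vocabulary by a factor of 400 without exploding sequence length.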