1 Abstract

Natural language descriptions of plant phenotypes are a rich source of information for genetics and genomics research. We computationally translated descriptions of plant phenotypes into structured representations that can be analyzed to identify biologically meaningful associations. These representations include the EQ (Entity-Quality) formalism, which uses terms from biological ontologies to represent phenotypes in a standardized, semantically rich format, as well as numerical vector representations generated using Natural Language Processing (NLP) methods (such as the bag-of-words approach and document embedding). We compared the resulting phenotype similarity measures to those derived from manually curated data to determine the performance of each method. Computationally derived EQ and vector representations recapitulated biological truth with success comparable to that of representations created through manual EQ statement curation. Moreover, NLP methods for generating vector representations of phenotypes are scalable to large quantities of text because they require no human input. These results indicate that it is now possible to computationally and automatically produce and populate large-scale information resources that enable researchers to query phenotypic descriptions directly.

2 Background

Phenotypes encompass a wealth of important and useful information about plants, potentially including states related to fitness, disease, and agricultural value. They comprise the material on which natural and artificial selection act to increase fitness or to achieve desired traits, respectively. Determining which genes are associated with traits of interest and understanding the nature of these relationships is crucial for manipulating phenotypes.
When causal alleles for phenotypes of interest are identified, they can be selected for in populations, targeted for deletion, or employed as transgenes to introduce desirable traits within and across species. The process of identifying candidate genes and specific alleles associated with a trait of interest is called candidate gene prediction.

Genes with similar sequences often share biological functions and therefore can create similar phenotypes. This is one reason sequence similarity search algorithms like BLAST are so useful for candidate gene prediction (Altschul et al., 1990). However, similar phenotypes can also be attributed to the function of genes that have no sequence similarity. This is how protein-coding genes that are involved in different steps of the same metabolic pathway or transcription factors involved in regulating gene expression contribute to shared phenotypes. For example, knocking out any one of the many genes involved in the maize anthocyanin pathway can result in pigment changes (reviewed in Sharma et al., 2011). This concept is modelled in Figure 1, where, notably, the sequence-based search with Gene 1 as a query can only return gen...