The European Nucleotide Archive in 2019

Amid, Clara; Alako, Blaise T. F.; Kadhirvelu, Vishnukumar Balavenkataraman; Burdett, Tony; Burgin, Josephine; Fan, Jun; Harrison, Peter W.; Holt, Sam; Hussein, Abdulrahman; Ivanov, Eugene; Jayathilaka, Suran; Kay, Simon; Keane, Thomas; Leinonen, Rasko; Xin, Liu; Martínez-Villacorta, Josué; Milano, Annalisa; Pakseresht, Amir; Rahman, Nadim; Rajan, Jeena; Reddy, Kethi; Richards, E. G.; Smirnov, Alexander; Sokolov, Alexey; Vijayaraja, Senthilnathan; Cochrane, Guy

doi:10.1093/nar/gkz1063

Cited by 106 publications

(114 citation statements)

References 17 publications

Supporting

Mentioning

111

Contrasting

Unclassified


“…EMBL-EBI hosts the European Nucleotide Archive [1], which has a broader scope, accepting submissions of nucleotide sequencing information, including raw sequencing data, sequence assembly information and functional annotations.…”

Section: Current Scenario (mentioning)

confidence: 99%

Empowering Virus Sequences Research through Conceptual Modeling

Bernasconi, Canakoglu, Pinoli et al. 2020

Preprint


The pandemic outbreak of the coronavirus disease has attracted attention towards the genetic mechanisms of viruses. We hereby present the Viral Conceptual Model (VCM), centered on the virus sequence and described from four perspectives: biological (virus type and hosts/sample), analytical (annotations and variants), organizational (sequencing project) and technical (experimental technology). VCM is inspired by GCM, our previously developed Genomic Conceptual Model, but it introduces many novel concepts, as viral sequences significantly differ from human genomes. When applied to SARS-CoV2 virus, complex conceptual queries upon VCM are able to replicate the search results of recent articles, hence demonstrating huge potential in supporting virology research. In addition to VCM, we also illustrate the data dictionary for patient's phenotype used by the COVID-19 Host Genetic Initiative. Our effort is part of a broad vision: availability of conceptual models for both human genomics and viruses will provide important opportunities for research, especially if interconnected by the same human being, playing the role of virus host as well as provider of genomic and phenotype information.

“…Then, for each minimizer file, super-k-mers are broken into their constituent k-mers. The k-mers and their count-vectors are inserted into a hash table 1 . When a k-mer is first inserted, it has a count-vector that records the abundance of its originating super-k-mer in the corresponding dataset.…”

Section: Construction Of the Monotigs (mentioning)

confidence: 99%
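
The insertion step quoted above can be illustrated with a minimal Python sketch. It is not REINDEER's implementation: the input layout (one list of `(super_kmer, abundance)` pairs per dataset), the function names, and the rule for a k-mer that is seen again (overwrite that dataset's slot) are all assumptions made for illustration, and canonical k-mers and minimizer partitioning are ignored.

```python
from typing import Dict, Iterator, List, Tuple


def kmers_of(seq: str, k: int) -> Iterator[str]:
    """Yield every substring of length k of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_count_vectors(
    datasets: List[List[Tuple[str, int]]], k: int
) -> Dict[str, List[int]]:
    """Map each k-mer to a count-vector with one slot per dataset.

    datasets[d] is a hypothetical list of (super_kmer, abundance) pairs
    for dataset d; this layout is assumed, not taken from the source.
    """
    n = len(datasets)
    table: Dict[str, List[int]] = {}
    for d, super_kmers in enumerate(datasets):
        for seq, abundance in super_kmers:
            for kmer in kmers_of(seq, k):
                # First insertion creates an all-zero vector; the slot for
                # dataset d then records the super-k-mer's abundance.
                vec = table.setdefault(kmer, [0] * n)
                vec[d] = abundance
    return table


example = build_count_vectors([[("ACGTA", 3)], [("CGTAC", 5)]], k=4)
```

Here the k-mer `CGTA` occurs in both toy datasets, so its count-vector has a non-zero entry in each slot, while `ACGT` and `GTAC` each appear in only one.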

“…When a count-vector is written to the disk, its monotig identifier given by BLight is also recorded next to it. Then, reading each file separately, we select a single representative of each vector by inserting it into an efficient dynamic hash table 1 . Once a partition is processed, we write the set of de-duplicated count vectors to disk, and record the mapping between monotig indices and their positions in the de-duplicated count-vector matrix.…”

Section: Low-memory De-duplication Of Rows In the Matrix (mentioning)

confidence: 99%
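
The de-duplication idea in the statement above (keep one representative per distinct count-vector and remember where each monotig's vector ended up) can be sketched as follows. This is an in-memory simplification under stated assumptions: the function name and the `(monotig_id, count_vector)` input format are hypothetical, a plain Python `dict` stands in for the dynamic hash table, and the partitioning and on-disk reads and writes described in the source are left out.

```python
from typing import Dict, Iterable, List, Tuple


def deduplicate_rows(
    rows: Iterable[Tuple[int, List[int]]]
) -> Tuple[List[List[int]], Dict[int, int]]:
    """Collapse identical count-vectors into a single matrix row.

    Returns the de-duplicated matrix and a mapping from each monotig
    identifier to the row position of its count-vector in that matrix.
    """
    seen: Dict[Tuple[int, ...], int] = {}  # vector -> row position
    matrix: List[List[int]] = []
    position: Dict[int, int] = {}
    for monotig_id, vec in rows:
        key = tuple(vec)  # lists are not hashable, tuples are
        if key not in seen:
            seen[key] = len(matrix)
            matrix.append(list(vec))
        position[monotig_id] = seen[key]
    return matrix, position


matrix, position = deduplicate_rows([(0, [1, 2]), (1, [0, 5]), (2, [1, 2])])
```

In this toy input, monotigs 0 and 2 share the same count-vector, so the matrix keeps two rows and both identifiers map to row 0.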


REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

Marchet, Iqbal, Gautheret et al. 2020

Preprint


Motivation: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets.

Results: We used REINDEER to index the abundances of sequences within 2,585 human RNA-seq experiments in 45 hours using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2,585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then conceptually merges those DBGs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances.

Availability: https://github.com/kamimrcht/REINDEER

We also highlight that merely adapting existing data structures by transforming the 1-bit presence/absence information into a (e.g. 16-bit) counter is unlikely to be a viable strategy. For instance, consider the HowDeSBT data structure [9], a recent technique for indexing the presence/absence of k-mers across dataset collections. It saves space by using a single memory location to encode the presence of a k-mer across multiple datasets. Yet this scheme cannot be adapted to record abundances, as a k-mer may be present in multiple datasets at different abundances, which cannot all be recorded by a single memory location. Likewise, BIGSI [12] uses Bloom filters with 25% false positive rate to encode presence/absence of k-mers; extending Bloom filters to support abundance queries (e.g. using Count-Min sketches) at a comparable false positive rate would possibly introduce significant abundance estimation errors.

Here we introduce REINDEER (REad Index for abuNDancE quERy), a novel computational method that performs indexing of k-mers and records their counts across a collection of datasets. REINDEER uses a combination of several concepts. The first novelty is to associate k-mers to their counts within datasets, instead of only recording the presence/absence of k-mers as is nearly universally done in previous works. To achieve this, a second novelty is the introduction of monotigs, which allows space-efficient grouping of k-mers having similar count profiles across datasets. An additional contribution is a set of techniques to further save space: discretization and compression of counts, on-disk row de-duplication algorithm of the count matrix. As a proof of concept, in this article we apply REINDEER to index a de facto benchmark collection of 2,585 human RNA-seq datasets, and provide relevant performance metrics. We further illustrate its utility by showing the results of queries on four oncogenes and three tumor suppressor genes within this collection.


“…Per species, extensive literature research was performed to validate their aerobicity (Data S5). 1628 Genomes of facultative anaerobic and strict anaerobic strains from the Pseudomonas genus were obtained from the European Nucleotide Archive repository in March 2015 [27]. All genomes were de-novo annotated in SAPP [28] using Prodigal for gene prediction (version 2.6) [29] and InterProScan version 5.4-47.0 [30] for functional annotation using Pfam [31].…”

Section: Genome Annotation (mentioning)

confidence: 99%

A Rational Design of Pseudomonas putida KT2440 capable of Anaerobic Respiration

Kampers, Koehorst, Heck et al. 2020

Preprint


Pseudomonas putida KT2440 is a metabolically versatile, HV1-certified, genetically accessible, and thus interesting microbial chassis for biotechnological applications. However, its obligate aerobic nature hampers production of oxygen sensitive products and drives up costs in large scale fermentation. The inability to perform anaerobic fermentation has been attributed to insufficient ATP production and an inability to produce pyrimidines under these conditions. Addressing these bottlenecks enabled growth under micro-oxic conditions, but does not lead to growth or survival under anoxic conditions.

Here, a data-driven approach was used to develop a rational design for a P. putida KT2440 derivative strain capable of anaerobic respiration. To come to the design, data derived from a genome comparison of 1628 Pseudomonas strains was combined with genome-scale metabolic modelling simulations and a transcriptome dataset of 47 samples representing 14 environmental conditions from the facultative anaerobe Pseudomonas aeruginosa.

The results indicate that the implementation of anaerobic respiration in P. putida KT2440 would require at least 61 additional genes of known function, at least 8 genes encoding proteins of unknown function, and 3 externally added vitamins.
