Maël Kerbiriou scite author profile

Maël Kerbiriou

6 Publications

64 Citation Statements Received

119 Citation Statements Given

Affiliations

University of Lille, Inria research centre Lille - Nord Europe, Laboratoire de Mathématiques

Publications

BLight: efficient exact associative structure for k-mers

Marchet, Kerbiriou, Limasset (2021)

Motivation: A plethora of methods and applications share the fundamental need to associate information to words for high-throughput sequence analysis. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to address this challenge. Yet existing data structures are either unable to associate information to k-mers or are not lightweight enough. Results: We present BLight, a static and exact data structure able to associate unique identifiers to k-mers and determine their membership in a set without false positives, which scales to huge k-mer sets with a low memory cost. This index combines an extremely compact representation with very fast queries. Besides, its construction is efficient and needs no additional memory. Our implementation indexes the k-mers from the human genome using 8 GB of RAM (23 bits per k-mer) within 10 minutes and the k-mers from the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 minutes. Furthermore, while being memory efficient, the index provides a very high throughput: 1.4 million queries per second on a single CPU or 16.1 million using 12 cores. Finally, we also present how BLight can practically represent metagenomic and transcriptomic sequencing data to highlight its wide applicative range. Availability: We wrote the BLight index as an open source C++ library under the AGPL3 license, available at github.com/Malfoy/BLight. It is designed as a user-friendly library and comes with code usage samples.
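The memory and throughput figures in this abstract can be cross-checked with a little arithmetic. The sketch below is not taken from the paper: it assumes 1 GB means 10**9 bytes (the abstract does not say), and the function and variable names are illustrative only.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Assumption: 1 GB = 10**9 bytes (the abstract does not state the unit convention).
BYTES_PER_GB = 10**9


def implied_kmers(memory_gb, bits_per_kmer):
    """Number of k-mers that fit in memory_gb at bits_per_kmer bits each."""
    return memory_gb * BYTES_PER_GB * 8 / bits_per_kmer


# Figures quoted in the abstract.
human_kmers = implied_kmers(8, 23)      # human genome: 8 GB, 23 bits per k-mer
axolotl_kmers = implied_kmers(63, 27)   # axolotl genome: 63 GB, 27 bits per k-mer

single_core_qps = 1.4e6    # queries per second on one CPU
twelve_core_qps = 16.1e6   # queries per second on 12 cores
speedup = twelve_core_qps / single_core_qps
efficiency = speedup / 12  # fraction of ideal linear scaling

print(f"human:   {human_kmers / 1e9:.2f} billion k-mers")
print(f"axolotl: {axolotl_kmers / 1e9:.2f} billion k-mers")
print(f"speedup on 12 cores: {speedup:.1f}x ({efficiency:.0%} of linear)")
```

Under that unit assumption, the quoted numbers imply roughly 2.8 billion indexed k-mers for the human genome and roughly 18.7 billion for the axolotl, and the 12-core throughput corresponds to near-linear scaling.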

A clustering package for nucleotide sequences using Laplacian Eigenmaps and Gaussian Mixture Model

Bruneau, Mottet, Moulin, et al. (2018)

Computers in Biology and Medicine

In this article, a new Python package for nucleotide sequence clustering is proposed. This package, freely available online, implements a Laplacian eigenmap embedding and a Gaussian Mixture Model for DNA clustering. It takes nucleotide sequences as input, and produces the optimal number of clusters along with a relevant visualization. Although we did not optimise the computational speed, our method still performs reasonably well in practice. Our focus was mainly on data analytics and accuracy and, as a result, our approach outperforms the state of the art, even in the case of divergent sequences. Furthermore, a priori knowledge of the number of clusters is not required. For the sake of illustration, this method is applied to a set of 100 DNA sequences taken from the mitochondrially encoded NADH dehydrogenase 3 (ND3) gene, extracted from a collection of Platyhelminthes and Nematoda species. The resulting clusters are tightly consistent with the phylogenetic tree computed using a maximum likelihood approach on gene alignment. They are also consistent with the NCBI taxonomy. Further test results based on synthesized data are then provided, showing that the proposed approach is better able to recover the clusters than the most widely used software, namely Cd-hit-est and BLASTClust.

Parallel Decompression of Gzip-Compressed Files and Random Access to DNA Sequences

Kerbiriou, Chikhi (2019)

Decompressing a file made by the gzip program at an arbitrary location is in principle impossible, due to the nature of the DEFLATE compression algorithm. Consequently, no existing program can take advantage of parallelism to rapidly decompress large gzip-compressed files. This is an unsatisfactory bottleneck, especially for the analysis of large sequencing data experiments. Here we propose a parallel algorithm and an implementation, pugz, that performs fast and exact decompression of any text file. We show that pugz is an order of magnitude faster than gunzip, and 5x faster than a highly-optimized sequential implementation (libdeflate). We also study the related problem of random access to compressed data. We give simple models and experimental results that shed light on the structure of gzip-compressed files containing DNA sequences. Preliminary results show that random access to sequences within a gzip-compressed FASTQ file is almost always feasible at low compression levels, yet is approximate at higher compression levels.
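The obstacle named in this abstract comes from how DEFLATE works: repeated strings are stored as back-references into a sliding window of up to 32 KiB of earlier output, so bytes in the middle of a stream may only make sense once the preceding data has been decoded. The sketch below uses Python's standard zlib module to make that dependency visible. It is an illustration only and is unrelated to how pugz is implemented; the synthetic DNA-like block and all variable names are assumptions.

```python
import random
import zlib

# Build a synthetic DNA-like block (hypothetical data, fixed seed for repeatability).
rng = random.Random(0)
block = "".join(rng.choice("ACGT") for _ in range(4000)).encode("ascii")

# Compress the block once, then the same block repeated twice.
once = zlib.compress(block, 9)
twice = zlib.compress(block + block, 9)

# The second copy costs very little extra space because it is stored mostly as
# back-references pointing 4000 bytes back into the first copy.
extra = len(twice) - len(once)

print(f"one copy:    {len(once)} compressed bytes")
print(f"two copies:  {len(twice)} compressed bytes")
print(f"second copy: {extra} additional bytes")
```

Because the second copy is encoded almost entirely as pointers to the first, a decoder that starts at the second copy without the earlier 4000 bytes has nothing to resolve those pointers against. This is the dependency that makes decompression from an arbitrary position hard.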

Efficient exact associative structure for sequencing data

Marchet, Kerbiriou, Limasset (2019)

Preprint

Motivation: A plethora of methods and applications share the fundamental need to associate information to words for high-throughput sequence analysis. Indexing billions of k-mers quickly becomes a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of the properties of k-mer sets to address this challenge. They exploit the overlaps shared among k-mers by using a de Bruijn graph as a compact k-mer set to provide lightweight structures. Results: We present Blight, a static and exact index structure able to associate unique identifiers to indexed k-mers and to reject alien k-mers, which scales to the largest k-mer sets with a low memory cost. The proposed index combines an extremely compact representation with very high throughput. Besides, its construction from the de Bruijn graph sequences is efficient and does not need supplementary memory. The implementation indexes the k-mers from the human genome with 8 GB within 10 minutes and can scale up to the large axolotl genome with 63 GB within 76 minutes. Furthermore, while being memory efficient, the index allows more than a million queries per second on a single CPU in our experiments, and the use of multiple cores raises its throughput. Finally, we also present how the index can practically represent metagenomic and transcriptomic sequencing data to highlight its wide applicative range. Availability: The index is implemented as a C++ library, is open source under the AGPL3 license, and is available at github.com/Malfoy/Blight. It is designed as a user-friendly library and comes with sample usage code.
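The contract described here, a unique identifier for every indexed k-mer and rejection of alien k-mers, can be illustrated with a toy example. The sketch below uses a plain Python dictionary, which is the kind of memory-expensive exact index the abstract sets out to improve on. It is not Blight's data structure or API, and the function names are hypothetical.

```python
def kmers(seq, k):
    """Yield every substring of length k from seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(sequences, k):
    """Assign a unique integer identifier to each distinct k-mer (toy version)."""
    index = {}
    for seq in sequences:
        for kmer in kmers(seq, k):
            if kmer not in index:
                index[kmer] = len(index)  # next unused identifier
    return index


def query(index, kmer):
    """Return the identifier of an indexed k-mer, or None for an alien k-mer."""
    return index.get(kmer)


# Two overlapping toy sequences share most of their 3-mers.
index = build_index(["ACGTAC", "GTACG"], k=3)

print(len(index))           # number of distinct 3-mers indexed
print(query(index, "GTA"))  # identifier of an indexed k-mer
print(query(index, "TTT"))  # None: alien k-mer, no false positive
```

A dictionary stores every k-mer as a full string key, which is what drives the memory cost at the scale of billions of k-mers. The abstract's point is that the overlaps between k-mers, captured by the de Bruijn graph, allow the same exact behaviour with far less memory.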

A clustering tool for nucleotide sequences using Laplacian Eigenmaps and Gaussian Mixture Models

Bruneau, Mottet, Moulin, et al. (2016)

Preprint

Parallel decompression of gzip-compressed files and random access to DNA sequences

Kerbiriou, Chikhi (2019)

Preprint
