John G. Cleary scite author profile

Arithmetic coding for data compression

Arithmetic coding provides an effective mechanism for removing redundancy in the encoding of data. We show how arithmetic coding works and describe an efficient implementation that uses table lookup as a fast alternative to arithmetic operations. The reduced-precision arithmetic has a provably negligible effect on the amount of compression achieved. We can speed up the implementation further by use of parallel processing. We discuss the role of probability models and how they provide probability information to the arithmetic coder. We conclude with perspectives on the comparative advantages and disadvantages of arithmetic coding.

Data Compression Using Adaptive Coding and Partial String Matching

Cleary

Witten

1984

IEEE Trans. Commun.

965

549

The recently developed technique of arithmetic coding, in conjunction with a Markov model of the source, is a powerful method of data compression in situations where a linear treatment is inappropriate. Adaptive coding allows the model to be constructed dynamically by both encoder and decoder during the course of the transmission, and has been shown to incur a smaller coding overhead than explicit transmission of the model's statistics. But there is a basic conflict between the desire to use high-order Markov models and the need to have them formed quickly as the initial part of the message is sent. This paper describes how the conflict can be resolved with partial string matching, and reports experimental results which show that mixed-case English text can be coded in as little as 2.2 bits/character with no prior knowledge of the source.

Modeling for text compression

1989

The best schemes for text compression use large models to help them predict which characters will come next. The actual next characters are coded with respect to the prediction, resulting in compression of information. Models are best formed adaptively, based on the text seen so far. This paper surveys successful strategies for adaptive modeling that are suitable for use in practical text compression systems. The strategies fall into three main classes: finite-context modeling, in which the last few characters are used to condition the probability distribution for the next one; finite-state modeling, in which the distribution is conditioned by the current state (and which subsumes finite-context modeling as an important special case); and dictionary modeling, in which strings of characters are replaced by pointers into an evolving dictionary. A comparison of different methods on the same sample texts is included, along with an analysis of future research directions.

K*: An Instance-based Learner Using an Entropic Distance Measure

1995

Unbounded Length Contexts for PPM

Cleary

Teahan

1997

The Computer Journal

209

184

Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines

Cleary

Braithwaite

Gaastra

et al. 2015

Preprint

190

176

To evaluate and compare the performance of variant calling methods and their confidence scores, comparisons between a test call set and a "gold standard" need to be carried out. Unfortunately, these comparisons are not straightforward with the current Variant Call Files (VCF), which are the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by the different representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, resulting in misleading results. A variant caller is inherently a classification method designed to score putative variants with confidence scores that could permit controlling the rate of false positives (FP) or false negatives (FN) for a given application. Receiver operator curves (ROC) and the area under the ROC (AUC) are efficient metrics to evaluate a test call set versus a gold standard. However, in the case of VCF data this also requires a special accounting to deal with discrepant representations. We developed a novel algorithm for comparing variant call sets that deals with complex call representation discrepancies and, through a dynamic programming method, minimizes false positives and negatives globally across the entire call sets for accurate performance evaluation of VCFs.

Optimized filtering reduces the error rate in detecting genomic variants by short-read sequencing

et al. 2011

Mutations in the voltage-gated potassium channel gene KCNH1 cause Temple-Baraitser syndrome and epilepsy

et al. 2014

Temple-Baraitser syndrome (TBS) is a multisystem developmental disorder characterized by intellectual disability, epilepsy, and hypoplasia or aplasia of the nails of the thumb and great toe. Here we report damaging de novo mutations in KCNH1 (encoding a protein called ether à go-go, EAG1 or KV10.1), a voltage-gated potassium channel that is predominantly expressed in the central nervous system (CNS), in six individuals with TBS. Characterization of the mutant channels in both Xenopus laevis oocytes and human HEK293T cells showed a decreased threshold of activation and delayed deactivation, demonstrating that TBS-associated KCNH1 mutations lead to deleterious gain of function. Consistent with this result, we find that two mothers of children with TBS, who have epilepsy but are otherwise healthy, are low-level (10% and 27%) mosaic carriers of pathogenic KCNH1 mutations. Consistent with recent reports, this finding demonstrates that the etiology of many unresolved CNS disorders, including epilepsies, might be explained by pathogenic mosaic mutations.
