A model selection approach for multiple sequence segmentation and dimensionality reduction

Castro, Bruno M.; Lemes, Renan Barbosa; Cesar, Jonatas; Hünemeier, Tábita; Leonardi, Florencia

doi:10.1016/j.jmva.2018.05.006

Cited by 6 publications

(4 citation statements)

References 27 publications

“…Classically, change-point detection refers to the problem of determining the times at which sequential observed data undergoes an abrupt change. In that type of setting, a change-point may refer to changes in mean (Page 1954, Tsay 1988, Keshavarz et al 2018), variance (Chen and Gupta 1997, Hawkins and Zamba 2005), regression slope (Chow 1960, Qu and Perron 2007), general distributions forms (Matteson and James 2014), or other types of change (Castro et al 2018, Leonardi et al 2021). Many of these methods have been applied to a wide range of problems such as stream anomaly detection in industry (Li et al 2018), monitoring of sleep stages using EEG/EMG (Agudelo-España et al 2020), identification of cyberattacks on networks (Tartakovsky et al 2006), between many other interesting applications.…”

Section: Introduction (mentioning)

confidence: 99%

Population-based change-point detection for the identification of homozygosity islands

et al. 2023

Self Cite

Motivation: This work is motivated by the problem of identifying homozygosity islands on the genome of individuals in a population. Our method directly tackles the identification of homozygosity islands at the population level, without the need to analyse single individuals and then combine the results, as is done in current state-of-the-art approaches. Results: We propose regularised offline change-point methods to detect changes in the parameters of a multidimensional distribution when we have several aligned, independent samples of fixed resolution. We present a penalised maximum likelihood approach that can be efficiently computed by a dynamic programming algorithm or approximated by a fast binary segmentation algorithm. Both estimators are shown to converge almost surely to the set of change-points without the need to specify a priori the number of change-points. In simulations we observed similar performance from the exact and greedy estimators. Moreover, we provide a new methodology for the selection of the regularisation constant which has the advantage of being automatic, consistent and less prone to subjective analysis. Availability: The data used in the application is from the Human Genome Diversity Project (HGDP) and is publicly available. Algorithms were implemented using the R software (R Core Team, 2020) in the R package blockcpd, found at https://github.com/Lucas-Prates/blockcpd. Supplementary information: Supplementary material is available online at Bioinformatics.

“…These are some of the many existing references that use hypothesis testing to discover or study independence. However, to the best of our knowledge, the estimation of points of independence, as proposed in this work, has not received much attention, aside from the work presented in Castro et al (2018). In the latter, the authors consider this problem to detect recombination hotspots in single nucleotide polymorphisms data, assuming that the random vector takes values in A^d, where A is a finite alphabet and the observations are independent.…”

Section: Introduction (mentioning)

confidence: 99%

Independent block identification in multivariate time series

Leonardi, Lopez‐Rosenfeld, Rodríguez et al. 2020

Journal of Time Series Analysis

Self Cite

In this work we propose a model selection criterion to estimate the points of independence of a random vector, producing a decomposition of the vector distribution function into independent blocks. The method, based on a general estimator of the distribution function, can be applied for discrete or continuous random vectors, and for i.i.d. data or dependent time series. We prove the consistency of the approach under general conditions on the estimator of the distribution function and we show that the consistency holds for i.i.d. data and discrete time series with mixing conditions. We also propose an efficient algorithm to approximate the estimator and show the performance of the method on simulated data. We apply the method in a real dataset to estimate the distribution of the flow over several locations on a river, observed at different time points.

“…We allow for multiple change points without assuming an a priori fixed, known number. The penalized maximum likelihood approach has also been considered recently in Castro et al (2018); Leonardi et al (2021), but on a different type of change-point problem. There, the approach was introduced for non-parametric discrete distributions in order to detect points of independence on a multidimensional random vector, under independent or non-independent sampling.…”

mentioning

confidence: 99%

Population based change-point detection for the identification of homozygosity islands

Prates, Lemes, Hünemeier et al. 2021

Preprint

Self Cite

In this paper, we propose a new method for offline change-point detection on some parameters of the distribution of a random vector. We introduce a penalized maximum likelihood approach that can be efficiently computed by a dynamic programming algorithm or approximated by a fast greedy binary splitting algorithm. We prove that both algorithms converge almost surely to the set of change-points under very general assumptions on the distribution and independent sampling of the random vector. In particular, we show that the assumptions leading to the consistency of the algorithms are satisfied by categorical and Gaussian random variables. This new approach is motivated by the problem of identifying homozygosity islands on the genome of individuals in a population. Our method directly tackles the identification of homozygosity islands at the population level, without the need to analyze single individuals and then combine the results, as is done in current state-of-the-art approaches.
