The starting point of this article is the question "How to retrieve fingerprints of rhythm in written texts?" We address this problem in the case of Brazilian and European Portuguese. These two dialects of Modern Portuguese share the same lexicon and most of the sentences they produce are superficially identical. Yet they are conjectured, on linguistic grounds, to implement different rhythms. We show that this linguistic question can be formulated as a problem of model selection in the class of variable length Markov chains. To carry out this approach, we compare texts from European and Brazilian Portuguese. These texts are first encoded according to some basic rhythmic features of the sentences which can be automatically retrieved. This is an entirely new approach from the linguistic point of view. Our statistical contribution is the introduction of the smallest maximizer criterion, which is a constant-free procedure for model selection. As a by-product, this provides a solution to the problem of the optimal choice of the penalty constant when using the BIC to select a variable length Markov chain. Besides proving the consistency of the smallest maximizer criterion when the sample size diverges, we also carry out a simulation study comparing our approach with both the standard BIC selection and the Peres-Shields order estimation. Applied to the linguistic sample constituted for our case study, the smallest maximizer criterion assigns different context-tree models to the two dialects of Portuguese. The features of the selected models are compatible with current conjectures discussed in the linguistic literature.
Abstract: The Partition Markov Model characterizes the process by a partition L of the state space, where the elements in each part of L share the same transition probability to an arbitrary element in the alphabet. This model aims to answer the following questions: what is the minimal number of parameters needed to specify a Markov chain, and how can these parameters be estimated? To answer these questions, we build a consistent strategy for model selection, which consists of the following: given a realization of size n of the process, find a model within the Partition Markov class with a minimal number of parts that represents the law of the process. From the strategy, we derive a measure that establishes a metric on the state space. In addition, we show that if the law of the process is Markovian, then, eventually, as n goes to infinity, L will be retrieved. We show an application to modeling internet navigation patterns.
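As an illustration of the partition idea described in this abstract, the following minimal sketch groups states whose transition rows coincide and counts the free parameters of the resulting model. It assumes the transition probabilities are already known; the function names `partition_states` and `n_parameters`, the rounding tolerance, and the toy order-2 binary chain are hypothetical and are not taken from the paper, which estimates the partition from data.

```python
from collections import defaultdict


def partition_states(transitions, ndigits=6):
    """Group states that share the same transition probabilities.

    `transitions` maps each state to a dict {symbol: probability}.
    Rows are compared after rounding to `ndigits` decimals (an assumption
    made here only to avoid spurious floating-point differences).
    """
    parts = defaultdict(list)
    for state, row in transitions.items():
        key = tuple(round(row[a], ndigits) for a in sorted(row))
        parts[key].append(state)
    return sorted(sorted(p) for p in parts.values())


def n_parameters(n_parts, alphabet_size):
    """Free parameters: one probability vector per part, minus one
    coordinate because each vector sums to 1."""
    return n_parts * (alphabet_size - 1)


# Toy order-2 chain on the alphabet {0, 1}: four states, two distinct rows.
P = {
    "00": {"0": 0.9, "1": 0.1},
    "01": {"0": 0.5, "1": 0.5},
    "10": {"0": 0.9, "1": 0.1},
    "11": {"0": 0.5, "1": 0.5},
}

parts = partition_states(P)
full_model = n_parameters(len(P), 2)        # one row per state
partition_model = n_parameters(len(parts), 2)  # one row per part
print(parts, full_model, partition_model)
```

In this toy case the partition has two parts, so the partition model needs half as many parameters as the full order-2 chain.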
In this paper, we propose a procedure for selecting samples from a set of samples coming from Markovian processes of finite order and finite alphabet. Under the assumption that there exists a law that prevails in at least q% of the samples of the collection, we show that the procedure makes it possible to identify the samples governed by the predominant law. The approach is based on a local metric between samples, which tends to zero when we compare samples of identical law and tends to infinity when we compare samples with different laws. The local metric makes it possible to define a criterion that takes arbitrarily large values when the assumption about the existence of a predominant law does not hold. By means of this procedure, we map similarities and dissimilarities in the dynamics of the daily trading volume of some Brazilian stocks.
In this paper, we analyze the model proposed in García and Londoño [1], in which a set of p independent sequences of discrete time Markov chains is considered, over a finite alphabet A and with finite order o. The model is obtained by identifying the states of the state space A^o where two or more sequences share the same transition probabilities (see also García and González-López [2]). This identification establishes a partition of {1,…,p}×A^o, that is, of the set of sequences paired with the state space. We show that by means of the Bayesian information criterion (BIC), the partition can be estimated eventually almost surely. Also, in García and Londoño [1], a notion of divergence derived from the BIC is given, which serves to identify the proximity/discrepancy between elements of {1,…,p}×A^o (see also García et al. [3]). In the present article, we prove that this notion is a metric on the space where the model is built and that it is statistically consistent for determining proximity/discrepancy between the elements of the space {1,…,p}×A^o. We apply the notions discussed here to the construction of a parsimonious model that represents the common stochastic structure of 153 complete genomic Zika sequences, coming from tropical and subtropical regions.
In this paper, we address the problem of deciding whether two independent samples coming from discrete Markovian processes are governed by the same stochastic law. We establish a local metric between samples based on the Bayesian information criterion. In addition, we derive the bound that must be used with this metric to make the decision. When it is decided that the laws are not the same, the metric makes it possible to detect the specific elements of the state space where the discrepancies are manifested. We prove that the metric is statistically consistent for detecting whether the samples follow the same law, tending to zero when the sample sizes increase. Moreover, we show that the metric assumes arbitrarily large values when the sample sizes increase and the stochastic laws are different. This concept is applied to analyze two lines of production of alcohol fuel, described by five variables each. We identify the variables that most contribute to the discrepancy and, using the local nature of the metric, we list the realizations in which the processes behave differently. Keywords: Bayesian information criterion, Markov processes, proximity between processes, relative entropy
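The abstract above does not state the formula of the local metric, so the following is only a rough sketch of a quantity with the behavior it describes, under stated assumptions. For each state, it compares the log-likelihood of fitting the two samples separately with that of fitting them pooled, and divides by a log-sample-size factor in the spirit of a BIC penalty. The function names, the normalization `(alphabet_size - 1) * log(n1 + n2)`, and the toy sequences are assumptions made for illustration, not the authors' definition.

```python
import math
from collections import Counter


def transition_counts(seq, order):
    """Count, for each state (string of length `order`), the symbols that follow it."""
    counts = {}
    for i in range(order, len(seq)):
        state = seq[i - order:i]
        counts.setdefault(state, Counter())[seq[i]] += 1
    return counts


def loglik(counter):
    """Maximized multinomial log-likelihood of the counts observed after one state."""
    total = sum(counter.values())
    return sum(c * math.log(c / total) for c in counter.values() if c > 0)


def local_discrepancy(seq1, seq2, order, alphabet_size):
    """Per-state discrepancy: (separate fit - pooled fit) / log-size scale.

    The scale below is an assumed BIC-style normalization, not the bound
    or constant derived in the paper.
    """
    c1 = transition_counts(seq1, order)
    c2 = transition_counts(seq2, order)
    scale = (alphabet_size - 1) * math.log(len(seq1) + len(seq2))
    result = {}
    for state in set(c1) | set(c2):
        a = c1.get(state, Counter())
        b = c2.get(state, Counter())
        pooled = a + b
        result[state] = (loglik(a) + loglik(b) - loglik(pooled)) / scale
    return result


# Toy binary sequences (hypothetical data).
x = "01" * 50      # after "0" the next symbol is always "1"
y = "0011" * 25    # after "0" the next symbol is "0" or "1" equally often

same = local_discrepancy(x, x, order=1, alphabet_size=2)
diff = local_discrepancy(x, y, order=1, alphabet_size=2)
print(same)
print(diff)
```

Because the result is indexed by state, large entries point to the states where the two samples disagree, which mirrors the local nature of the metric mentioned in the abstract.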
In this paper, we investigate a specific structure within the theoretical framework of Partition Markov Models (PMM) [see García Jesús and González-López, Entropy 19, 160 (2017)]. The structure of interest lies in the formulation of the underlying partition that defines the process: in addition to a finite memory o associated with the process, a parameter G is introduced, allowing an extra dependence on the past that complements the dependence given by the usual memory o. We show, by simulations, how algorithms designed for the classic version of the PMM can have difficulties in recovering the structure investigated here. This specific structure is efficient for modeling a complete genome sequence of the newly decoded Coronavirus Covid-19 in humans [see Wu et al., Nature 579, 265–269 (2020)]. The sequence profile is represented by 13 units (parts of the partition of the state space). For each of the 13 units, the respective transition probabilities are computed for every element of the genetic alphabet. Also, the structure proposed here allows us to develop a comparison study with other genomic sequences of Coronavirus, collected in the last 25 years, through which we conclude that Covid-19 lies close to SARS-like Coronaviruses (SL-CoVs) from bat specimens in Zhoushan [see Hu et al., Emerg Microb Infect 7, 1–10 (2018)].