Data with multiple responses is ubiquitous in modern applications. However, few tools are available for regression analysis of multivariate counts. The most popular model, the multinomial-logit model, has a very restrictive mean-variance structure, which limits its applicability to many data sets. This article introduces an R package MGLM, short for multivariate response generalized linear models, that expands the current tools for regression analysis of polytomous data. Distribution fitting, random number generation, regression, and sparse regression are treated in a unifying framework. The algorithm, usage, and implementation details are discussed.

Introduction

Multivariate categorical data arises in many fields, including genomics, image analysis, text mining, and sports statistics. The multinomial-logit model (Agresti, 2002, Chapter 7) has been the most popular tool for analyzing such data. However, it is limited by its specific mean-variance structure and by the strong assumption that the counts are negatively correlated. Models that address over-dispersion relative to a multinomial distribution and incorporate positive and/or negative correlation structures would offer greater flexibility for analysis of polytomous data.

In this article, we introduce an R package MGLM, short for multivariate response generalized linear models. The MGLM package provides a unified framework for random number generation, distribution fitting, regression, hypothesis testing, and variable selection for multivariate response generalized linear models, particularly the four models listed in Table 1. These models considerably broaden the class of generalized linear models (GLM) for analysis of multivariate categorical data.

MGLM overlaps little with existing packages in R and other software. The standard multinomial-logit model is implemented in several R packages (Venables and Ripley, 2002), with VGAM (Yee, 2010, 2015, 2017) being the most comprehensive.
We include the classical multinomial-logit regression model in MGLM not only for completeness, but also to complement it with various penalty methods for variable selection and regularization. When used with the group penalty, MGLM performs variable selection at the predictor level, which makes the fitted model easier to interpret. This differs from the elastic net penalized multinomial-logit model implemented in glmnet (Friedman et al., 2010), which performs selection at the matrix entry level. Although MGLM focuses on regression, it also provides distribution fitting and random number generation for the models listed in Table 1. The VGAM and dirmult (Tvedebrink, 2010) packages can estimate the parameters of the Dirichlet-multinomial (DM) distribution using Fisher's scoring and Newton's method, respectively. As indicated in the manual (Yee, 2017), the convergence of Fisher's scoring method may be slow due to the difficulty in evaluating the expected information matrix. Furthermore, Newton's method is unstable because the log-likelihood function may be non-concave. As explained later, MGLM achieves both stability an...
The growing prevalence of big and streaming data requires a new generation of tools. Data often has infinite size in the sense that new observations are continually arriving daily, hourly, etc. In recent years, several new technologies such as Kafka (Apache Software Foundation, n.d.-a) and Spark Streaming (Apache Software Foundation, n.d.-b) have been introduced for processing streaming data. Statistical tools for data streams, however, are under-developed and offer only basic functionality. The majority of statistical software can only operate on finite batches and requires re-loading possibly large datasets for seemingly simple tasks such as incorporating a few more observations into an analysis.

OnlineStats is a Julia (Bezanson, Edelman, Karpinski, & Shah, 2017) package for high-performance online algorithms. The OnlineStats framework is easily extensible, includes a large catalog of algorithms, provides primitives for parallel computing, and offers a weighting mechanism that allows new observations to have a higher relative influence over the value of the statistic/model/visualization.

Interface

Each algorithm is associated with its own type (e.g. Mean, Variance, etc.). The OnlineStats interface is built on several key functions from the OnlineStatsBase package. A new type must provide implementations of these functions in order to use the rest of the OnlineStats framework.
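To make the idea of an online algorithm with a weighting mechanism concrete, here is a minimal Python sketch of a running mean and variance that is updated one observation at a time. It is an illustration only and not the OnlineStats API: the class `OnlineMeanVariance`, its `fit` method, and the weight functions `equal_weight` and `exponential_weight` are hypothetical names chosen for this example. The weight function maps the observation count n to a step size in (0, 1]; a step size of 1/n reproduces the ordinary batch statistics, while a step size bounded below by a constant gives recent observations more influence.

```python
from typing import Callable, Iterable


def equal_weight(n: int) -> float:
    """Step size 1/n: every observation has the same influence."""
    return 1.0 / n


def exponential_weight(rate: float) -> Callable[[int], float]:
    """Step size max(rate, 1/n): recent observations dominate once 1/n < rate."""
    def weight(n: int) -> float:
        return max(rate, 1.0 / n)
    return weight


class OnlineMeanVariance:
    """Hypothetical single-pass mean/variance tracker (not the OnlineStats API)."""

    def __init__(self, weight: Callable[[int], float] = equal_weight) -> None:
        self.weight = weight
        self.n = 0              # number of observations seen so far
        self.mean = 0.0         # current (weighted) mean
        self.biased_var = 0.0   # current (weighted) variance, no Bessel correction

    def fit(self, x: float) -> "OnlineMeanVariance":
        """Incorporate one observation without revisiting earlier data."""
        self.n += 1
        gamma = self.weight(self.n)
        old_mean = self.mean
        # Move the mean a fraction gamma of the way toward the new observation.
        self.mean = old_mean + gamma * (x - old_mean)
        # Welford-style update, written as a smoothing step with the same gamma.
        deviation = (x - old_mean) * (x - self.mean)
        self.biased_var += gamma * (deviation - self.biased_var)
        return self

    def fit_many(self, xs: Iterable[float]) -> "OnlineMeanVariance":
        for x in xs:
            self.fit(x)
        return self


# Equal weights: matches the batch mean and population variance of the data.
stat = OnlineMeanVariance().fit_many([2, 4, 4, 4, 5, 5, 7, 9])

# Exponential weights: later observations pull the mean harder.
recent = OnlineMeanVariance(exponential_weight(0.5)).fit_many([0, 10, 10])
```

With equal weights the tracker agrees with the batch computation over the same data; with the exponential weight the mean sits closer to the most recent values than the plain average would. Because each update touches only the current state and the new observation, a few more observations can be incorporated without re-loading the earlier data.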
Abstract. In many real-life situations, e.g., in medicine, it is necessary to process data while preserving the patients' confidentiality. One of the most efficient methods of preserving privacy is to replace the exact values with intervals that contain these values. For example, instead of an exact age, a privacy-protected database only contains the information that the age is, e.g., between 10 and 20, or between 20 and 30, etc. Based on this data, it is important to compute correlation and covariance between different quantities. For privacy-protected data, different values from the intervals lead, in general, to different estimates for the desired statistical characteristic. Our objective is then to compute the range of possible values of these estimates. Algorithms for effectively computing such ranges have been developed for situations when intervals come from the original surveys, e.g., when a person fills in whether his or her age is between 10 and 20, between 20 and 30, etc. These intervals, however, do not always lead to an optimal privacy protection; it turns out that more complex, computer-generated "intervalization" can lead to better privacy under the same accuracy, or, alternatively, to more accurate estimates of statistical characteristics under the same privacy constraints. In this paper, we extend the existing efficient algorithms for computing covariance and correlation based on privacy-protected data to this more general case of interval data.

Formulation of the Problem

Need for processing data in statistical databases. Often, we collect data for the purpose of finding possible dependencies between different quantities. For example, we collect all possible information about the medical patients with the hope of finding out which factors affect different illnesses and which factors affect the success of different cures. The resulting collection of records $r_i = (r_{i1}, \ldots, r_{ip})$, $1 \le i \le n$, is known as a statistical database since typically, statistical methods are used to look for possible dependencies; see, e.g., [7]. These statistical methods are usually based on computing statistical characteristics such as mean
This paper addresses the fundamental research question: “How can we determine the sequential decision-making process inside a decision maker’s mind?” We construct a dynamic Markov Decision Process using a Double Transition Model (DTM). The DTM is a cognitive model decomposing the decision-making process into episodic tasks that are extracted from a stream of incoming information. In a DTM, each state reflects a stage en route to a decision, and each action reflects a possible move from collecting data to hypothesizing and inferencing. The reward reflects how close a stage is to the final decision. We demonstrate this process through a proof-of-concept DTM using a hypothetical scenario for Typhoon Haiyan in the Philippines (2013). The DTM constructed from this scenario enables a Commander to reason about damaged areas, death tolls, and assistance methods while allowing his actions to be captured and used to explain why and how each decision is made.