For massive data, subsampling algorithms are a popular way to reduce the data volume and the computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where statistical leverage scores are often used to define subsampling probabilities. In this paper, we propose fast subsampling algorithms to efficiently approximate the maximum likelihood estimate in logistic regression. We first establish consistency and asymptotic normality of the estimator from a general subsampling algorithm, and then derive optimal subsampling probabilities that minimize the asymptotic mean squared error of the resultant estimator. An alternative minimization criterion is also proposed to further reduce the computational cost. The optimal subsampling probabilities depend on the full data estimate, so we develop a two-step algorithm to approximate the optimal subsampling procedure. This algorithm is computationally efficient and significantly reduces computing time compared to the full data approach. Consistency and asymptotic normality of the estimator from the two-step algorithm are also established. Synthetic and real data sets are used to evaluate the practical performance of the proposed method.
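The two-step idea can be sketched in a few lines. The abstract does not state the form of the subsampling probabilities, so the sketch below assumes probabilities proportional to |y_i - p_i| * ||x_i||, which is one common reading of the cheaper alternative criterion. The model is restricted to an intercept and one covariate so that the Newton step has a closed form. The function names (`weighted_logistic_fit`, `two_step_subsample_fit`, `simulate`) are hypothetical and serve illustration only.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def weighted_logistic_fit(zs, ys, ws, iters=25):
    """Weighted MLE for P(y=1) = sigmoid(b0 + b1*z) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for z, y, w in zip(zs, ys, ws):
            p = sigmoid(b0 + b1 * z)
            resid = w * (y - p)        # weighted score contribution
            curv = w * p * (1.0 - p)   # weighted information contribution
            g0 += resid
            g1 += resid * z
            h00 += curv
            h01 += curv * z
            h11 += curv * z * z
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def two_step_subsample_fit(zs, ys, r0, r, rng):
    """Pilot fit on a uniform subsample, then a fit on a targeted subsample."""
    n = len(zs)
    # Step 1: uniform pilot subsample (with replacement) and pilot estimate.
    pilot = [rng.randrange(n) for _ in range(r0)]
    pb0, pb1 = weighted_logistic_fit(
        [zs[i] for i in pilot], [ys[i] for i in pilot], [1.0] * r0)
    # Step 2: approximate the optimal probabilities with the pilot estimate.
    # Assumed form: pi_i proportional to |y_i - p_i| * ||x_i||, x_i = (1, z_i).
    scores = [abs(y - sigmoid(pb0 + pb1 * z)) * math.hypot(1.0, z)
              for z, y in zip(zs, ys)]
    total = sum(scores)
    second = rng.choices(range(n), weights=scores, k=r)
    # Inverse-probability weights rescaled by n: uniform pilot draws get
    # weight 1, second-step draws get 1 / (n * pi_i).
    idx = pilot + second
    ws = [1.0] * r0 + [total / (n * scores[i]) for i in second]
    return weighted_logistic_fit(
        [zs[i] for i in idx], [ys[i] for i in idx], ws)


def simulate(n, b0, b1, rng):
    """Synthetic logistic data with a standard normal covariate."""
    zs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1.0 if rng.random() < sigmoid(b0 + b1 * z) else 0.0 for z in zs]
    return zs, ys
```

A call such as `two_step_subsample_fit(zs, ys, 200, 800, random.Random(0))` touches all n points once to compute the scores and fits the model on only 1,000 of them. The inverse-probability weights are what keep the targeted subsample from biasing the estimate.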
Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable to extraordinarily large data sets due to computational limitations. A critical step in big data analysis is data reduction. Existing investigations in the context of linear regression focus on subsampling-based methods. However, not only is this approach prone to sampling errors, it also leads to a covariance matrix of the estimators that is typically bounded from below by a term that is of the order of the inverse of the subdata size. We propose a novel approach, termed information-based optimal subdata selection (IBOSS). Compared to leading existing subdata methods, the IBOSS approach has the following advantages: (i) it is significantly faster; (ii) it is suitable for distributed parallel computing; (iii) the variances of the slope parameter estimators converge to 0 as the full data size increases even if the subdata size is fixed, i.e., the convergence rate depends on the full data size; (iv) data analysis for IBOSS subdata is straightforward and the sampling distribution of an IBOSS estimator is easy to assess. Theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to subsampling-based methods, sometimes by orders of magnitude. The advantages of the new approach are also illustrated through analysis of real data.
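The abstract does not spell out the selection rule, so the following is a minimal sketch under an assumption: that subdata are chosen deterministically by taking, for each covariate in turn, the points with the most extreme values among those not yet selected. The function name `iboss_select` and the equal split of k across covariates are illustrative choices, not details confirmed by the text.

```python
def iboss_select(X, k):
    """Pick k rows of X by covariate extremes (assumed IBOSS-style rule).

    X is a list of rows, each a list of p covariate values. For each
    covariate, the r = k // (2 * p) smallest and r largest values among the
    rows not yet chosen are added to the subdata.
    """
    n, p = len(X), len(X[0])
    r = k // (2 * p)
    if r < 1 or 2 * r * p > n:
        raise ValueError("need 2 * p <= k and 2 * (k // (2 * p)) * p <= n")
    chosen = set()
    for j in range(p):
        # Rows still available, ordered by the j-th covariate.
        remaining = sorted((i for i in range(n) if i not in chosen),
                           key=lambda i: X[i][j])
        chosen.update(remaining[:r])    # r smallest values of covariate j
        chosen.update(remaining[-r:])   # r largest values of covariate j
    return sorted(chosen)
```

The selection involves no random draw, so no sampling weights are needed afterwards and ordinary least squares can be run on the selected rows directly. A full sort is used here for brevity. A partial selection such as `heapq.nsmallest` would avoid the O(n log n) cost per covariate.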
Metal–organic frameworks (MOFs) that respond to external stimuli such as guest molecules, temperature, or redox conditions are highly desirable. Herein, we coupled redox-switchable properties with breathing behavior induced by guest molecules in a single framework. Guided by topology, two flexible isomeric MOFs, compounds 1 and 2, with a formula of In(Me2NH2)(TTFTB), were constructed via a combination of [In(COO)4]− metal nodes and tetratopic tetrathiafulvalene-based linkers (TTFTB). The two compounds show different breathing behaviors upon the introduction of N2. Single-crystal X-ray diffraction, accompanied by molecular simulations, reveals that the breathing mechanism of 1 involves the bending of metal–ligand bonds and the sliding of interpenetrated frameworks, while 2 undergoes simple distortion of linkers. Reversible oxidation and reduction of TTF moieties changes the linker flexibility, which in turn switches the breathing behavior of 2. The redox-switchable breathing behavior can potentially be applied to the design of stimuli-responsive MOFs.
Transformation-mediated mutagenesis in both targeted and random manners has been widely applied to decipher gene function in diverse fungi. However, a transformation system has not yet been established for lichen fungi, severely limiting our ability to study their biology and the mechanisms underpinning symbiosis via gene manipulation. Here, we report the first successful transformation of the lichen fungus, Umbilicaria muehlenbergii, via the use of Agrobacterium tumefaciens. We generated a total of 918 transformants employing a binary vector that carries the hygromycin B phosphotransferase gene as a selection marker and the enhanced green fluorescent protein gene for labeling transformants. Randomly selected transformants appeared mitotically stable, based on their maintenance of hygromycin B resistance after five generations of growth without selection. Genomic Southern blot showed that 88% of 784 transformants contained a single T-DNA insert in their genome. A number of putative mutants affected in colony color, size, and/or morphology were found among these transformants, supporting the utility of Agrobacterium tumefaciens-mediated transformation (ATMT) for random insertional mutagenesis of U. muehlenbergii. This ATMT approach potentially offers a systematic gene functional study with genome sequences of U. muehlenbergii that is currently underway.
Summary
We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling probabilities. One version minimizes the trace of the asymptotic variance-covariance matrix for a linearly transformed parameter estimator and the other minimizes that of the original parameter estimator. The former does not depend on the densities of the responses given covariates and is easy to implement. Algorithms based on optimal subsampling probabilities are proposed, and the asymptotic distributions and asymptotic optimality of the resulting estimators are established. Furthermore, we propose an iterative subsampling procedure, based on the optimal subsampling probabilities for the linearly transformed parameter estimator, which scales well with the available computational resources. In addition, this procedure yields standard errors for parameter estimators without estimating the densities of the responses given the covariates. We provide numerical examples based on both simulated and real data to illustrate the proposed method.
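The density-free version of the probabilities can be illustrated as follows. The abstract gives no formula, so this sketch assumes probabilities proportional to |tau - I(y_i - x_i'beta < 0)| * ||x_i||, evaluated at a pilot estimate `beta`. All function names are hypothetical. The minimization of the weighted check loss, which would produce the subsample estimate, is left to any quantile regression solver.

```python
import math
import random


def lopt_probabilities(X, y, beta, tau):
    """Subsampling probabilities that need no conditional density estimate.

    Assumed form: pi_i proportional to |tau - I(resid_i < 0)| * ||x_i||,
    with residuals taken at a pilot estimate beta.
    """
    scores = []
    for xi, yi in zip(X, y):
        resid = yi - sum(b * v for b, v in zip(beta, xi))
        below = 1.0 if resid < 0 else 0.0
        norm = math.sqrt(sum(v * v for v in xi))
        scores.append(abs(tau - below) * norm)
    total = sum(scores)
    return [s / total for s in scores]


def draw_subsample(probs, r, rng):
    """Draw r indices with replacement and return inverse-probability weights."""
    n = len(probs)
    idx = rng.choices(range(n), weights=probs, k=r)
    weights = [1.0 / (n * probs[i]) for i in idx]
    return idx, weights


def weighted_check_loss(X, y, beta, tau, idx, weights):
    """Weighted quantile check loss on the subsample; minimize over beta."""
    loss = 0.0
    for i, w in zip(idx, weights):
        resid = y[i] - sum(b * v for b, v in zip(beta, X[i]))
        loss += w * resid * (tau - (1.0 if resid < 0 else 0.0))
    return loss / len(idx)
```

Repeating `draw_subsample` several times and fitting each draw separately is one way to read the iterative procedure: the spread of the resulting estimates can then stand in for standard errors, in line with the abstract's claim that no density estimation is required.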