Tensor decomposition is a fundamental unsupervised machine learning method in data science, with applications including network analysis and sensor data processing. This work develops a generalized canonical polyadic (GCP) low-rank tensor decomposition that allows other loss functions besides squared error. For instance, we can use logistic loss or Kullback-Leibler divergence, enabling tensor decomposition for binary or count data. We present a variety of statistically motivated loss functions for various scenarios. We provide a generalized framework for computing gradients and handling missing data that enables the use of standard optimization methods for fitting the model. We demonstrate the flexibility of GCP on several real-world examples including interactions in a social network, neural activity in a mouse, and monthly rainfall measurements in India.
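The loss functions mentioned above can be viewed as elementwise functions of a data entry x and a model entry m, summed over the tensor. A minimal sketch of that idea, assuming illustrative function names and an interface not taken from any particular GCP library:

```python
import numpy as np

def gcp_loss(X, M, kind="gaussian", eps=1e-10):
    """Total elementwise GCP-style loss between data X and model M.

    gaussian : squared error,   f(x, m) = (x - m)^2
    bernoulli: logistic loss,   f(x, m) = log(1 + e^m) - x*m  (m = log-odds)
    poisson  : count-data loss, f(x, m) = m - x*log(m), KL up to constants
    """
    if kind == "gaussian":
        F = (X - M) ** 2
    elif kind == "bernoulli":
        F = np.log1p(np.exp(M)) - X * M
    elif kind == "poisson":
        F = M - X * np.log(M + eps)
    else:
        raise ValueError(f"unknown loss: {kind}")
    return float(F.sum())

# Binary data scored with the logistic (Bernoulli) loss,
# where M holds log-odds rather than probabilities
X = np.array([[1.0, 0.0], [0.0, 1.0]])
M = np.array([[2.0, -2.0], [-1.0, 1.0]])
print(gcp_loss(X, M, kind="bernoulli"))
```

Swapping the `kind` argument is the only change needed to move between continuous, binary, and count data, which is the flexibility the GCP framework provides.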
The ongoing pandemic of SARS-CoV-2, a novel coronavirus, caused over 3 million reported cases of coronavirus disease 2019 (COVID-19) and 200,000 reported deaths between December 2019 and April 2020 [1]. Cases and deaths will increase as the virus continues its global march outward. In the absence of effective pharmaceutical interventions or a vaccine, widespread virological screening is required to inform where restrictive isolation measures should be targeted and when they can be lifted [2-6]. However, limitations on testing capacity have restricted the ability of governments and institutions to identify individual clinical cases, appropriately measure community prevalence, and mitigate transmission. Group testing offers a way to increase efficiency by combining samples and testing a small number of pools [7-9]. Here, we evaluate the effectiveness of group testing designs for individual identification or prevalence estimation of SARS-CoV-2 infection when testing capacity is limited. To do this, we developed mathematical models for epidemic spread, incorporating empirically measured individual-level viral kinetics to simulate changing viral loads in a large population over the course of an epidemic. We used these to construct representative populations and assess pooling strategies for community screening, accounting for variability in viral load samples, dilution effects, changing prevalence, and resource constraints. We confirmed our group testing framework through pooled tests on de-identified human nasopharyngeal specimens with viral loads representative of the larger population. We show that group testing designs can both accurately estimate overall prevalence using a small number of measurements and substantially increase the identification rate of infected individuals in resource-limited settings.
In the classical version of the identification problem [7], samples from multiple individuals are combined and tested as a single pool (Fig. 1a). If the test is negative (which is likely when the prevalence is low and the pool is not too large), then each of the individuals is assumed to have been negative. If the test is positive, it is assumed that at least one individual in the pool was positive; each of the pooled samples is then tested individually. This strategy leverages the low frequency of cases, as the majority of pools will test negative when prevalence is low, avoiding the substantial inefficiency of testing every sample. The simple pooling method can be expanded to combinatorial pooling (each sample represented in multiple pools) for direct sample identification [8,9] (Fig. 1b) and to pooled testing for prevalence estimation [10,11] (Fig. 1c). To deploy group testing in the current pandemic, we need designs that can account for (i) the prevalence of infection within the population, (ii) the position along the epidemic curve, and (iii) within-host viral kinetics.
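The two-stage scheme described above is straightforward to simulate. A minimal sketch, with illustrative pool size and prevalence values (not figures from the study):

```python
import numpy as np

def dorfman_tests_used(statuses, pool_size):
    """Count tests needed to classify every sample with two-stage pooling:
    one test per pool, plus individual retests for each positive pool."""
    tests = 0
    for i in range(0, len(statuses), pool_size):
        pool = statuses[i:i + pool_size]
        tests += 1                     # test the combined pool once
        if any(pool):                  # positive pool: retest each member
            tests += len(pool)
    return tests

# Illustrative population: 1,000 samples at 1% prevalence, pools of 10
rng = np.random.default_rng(0)
statuses = rng.random(1000) < 0.01
print(dorfman_tests_used(statuses, pool_size=10), "tests instead of 1000")
```

At low prevalence most pools test negative, so the total test count is usually far below one test per sample; as prevalence rises, more pools trigger retests and the savings shrink.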
Virological testing is central to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) containment, but many settings face severe limitations on testing. Group testing offers a way to increase throughput by testing pools of combined samples; however, most proposed designs have not yet addressed key concerns over sensitivity loss and implementation feasibility. Here, we combined a mathematical model of epidemic spread and empirically derived viral kinetics for SARS-CoV-2 infections to identify pooling designs that are robust to changes in prevalence, and to ratify sensitivity losses against the time course of individual infections. We show that prevalence can be accurately estimated across a broad range, from 0.02% to 20%, using only a few dozen pooled tests, and using up to 400 times fewer tests than would be needed for individual identification. We then exhaustively evaluated the ability of different pooling designs to maximize the number of detected infections under various resource constraints, finding that simple pooling designs can identify up to 20 times as many true positives as individual testing with a given budget. We illustrate how pooling affects sensitivity and overall detection capacity during an epidemic and on each day post infection, finding that only 3% of false negative tests occurred when individuals were sampled during the first week of infection following peak viral load, and that sensitivity loss is mainly attributable to individuals sampled at the end of infection, when detection has minimal benefit for limiting transmission. Crucially, we confirmed that our theoretical results can be translated into practice using pooled human nasopharyngeal specimens by accurately estimating a 1% prevalence among 2,304 samples using only 48 tests, and through pooled sample identification in a panel of 960 samples.
Our results show that accounting for variation in sampled viral loads provides a nuanced picture of how pooling affects sensitivity to detect infections. Simple, practical group testing designs can vastly increase surveillance capabilities in resource-limited settings.
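For the prevalence-estimation side, a simple closed-form estimator conveys the core idea: if each of n pools combines b samples and each sample is independently positive with probability p, a pool tests positive with probability 1 - (1 - p)^b, which can be inverted at the observed positive fraction. This back-of-the-envelope sketch assumes perfect test sensitivity and ignores the dilution and viral-load effects the study explicitly models:

```python
def estimate_prevalence(n_pools, n_positive_pools, pool_size):
    """MLE of per-sample prevalence from pooled results: a pool of b
    samples is positive with probability 1 - (1 - p)^b, so
    p_hat = 1 - (1 - k/n)^(1/b) for k positive pools out of n."""
    frac_positive = n_positive_pools / n_pools
    return 1.0 - (1.0 - frac_positive) ** (1.0 / pool_size)

# Scale matching the validation above: 48 pools of 48 samples each
# (2,304 samples total); the count of 18 positive pools is illustrative.
print(estimate_prevalence(n_pools=48, n_positive_pools=18, pool_size=48))
```

With these illustrative numbers the estimate lands near 1%, showing how a few dozen pooled tests can pin down prevalence across thousands of samples.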
Cytoplasmic sequestration of the p53 tumor suppressor protein has been proposed as a mechanism involved in abolishing p53 function. However, the mechanisms regulating p53 subcellular localization remain unclear. In this report, we analyzed the possible existence of cis-acting sequences involved in intracellular trafficking of the p53 protein. To study p53 trafficking, the jellyfish green fluorescent protein (GFP) was fused to the wild-type or mutated p53 proteins for fast and sensitive analysis of protein localization in human MCF-7 breast cancer, RKO colon cancer, and SAOS-2 sarcoma cells. The wild-type p53/GFP fusion protein was localized in the cytoplasm, the nucleus, or both compartments in a subset of the cells. Mutagenesis analysis demonstrated that a single amino acid mutation of Lys-305 (mt p53) caused cytoplasmic sequestration of the p53 protein in the MCF-7 and RKO cells, whereas the fusion protein was distributed in both the cytoplasm and the nucleus of SAOS-2 cells. In SAOS-2 cells, the mutant p53 was a less efficient inducer of p21/CIP1/WAF1 expression. Cytoplasmic sequestration of the mt p53 was dependent upon the C-terminal region (residues 326-355) of the protein. These results indicated the involvement of cis-acting sequences in the regulation of p53 subcellular localization. Lys-305 is needed for nuclear import of p53 protein, and amino acid residues 326-355 can sequester mt p53 in the cytoplasm.
Principal Component Analysis (PCA) is a classical method for reducing the dimensionality of data by projecting them onto a subspace that captures most of their variation. Effective use of PCA in modern applications requires understanding its performance for data that are both high-dimensional and heteroscedastic. This paper analyzes the statistical performance of PCA in this setting, i.e., for high-dimensional data drawn from a low-dimensional subspace and degraded by heteroscedastic noise. We provide simplified expressions for the asymptotic PCA recovery of the underlying subspace, subspace amplitudes and subspace coefficients; the expressions enable both easy and efficient calculation and reasoning about the performance of PCA. We exploit the structure of these expressions to show that, for a fixed average noise variance, the asymptotic recovery of PCA for heteroscedastic data is always worse than that for homoscedastic data (i.e., for noise variances that are equal across samples). Hence, while average noise variance is often a practically convenient measure for the overall quality of data, it gives an overly optimistic estimate of the performance of PCA for heteroscedastic data.
Introducing SDN into an existing network causes both deployment and operational issues. A systematic incremental deployment methodology as well as a hybrid operation model is needed. We present such a system for incremental deployment of hybrid SDN networks consisting of both legacy forwarding devices (i.e., traditional IP routers) and programmable SDN switches. We design the system on a production SDN controller to answer the following questions: which legacy devices to upgrade to SDN, and how legacy and SDN devices can interoperate in a hybrid environment to satisfy a variety of traffic engineering (TE) goals such as load balancing and fast failure recovery. Evaluation on real ISP and enterprise topologies shows that with only 20% of devices upgraded to SDN, our system reduces the maximum link usage by an average of 32% compared with pure-legacy networks (shortest path routing), while only requiring an average of 41% of flow table capacity compared with pure-SDN networks.
Tensor decomposition is a well-known tool for multiway data analysis. This work proposes using stochastic gradients for efficient generalized canonical polyadic (GCP) tensor decomposition of large-scale tensors. GCP tensor decomposition is a recently proposed version of tensor decomposition that allows for a variety of loss functions such as Bernoulli loss for binary data or Huber loss for robust estimation. The stochastic gradient is formed from randomly sampled elements of the tensor and is efficient because it can be computed using the sparse matricized-tensor times Khatri-Rao product tensor kernel. For dense tensors, we simply use uniform sampling. For sparse tensors, we propose two types of stratified sampling that give precedence to sampling nonzeros. Numerical results demonstrate the advantages of the proposed approach and its scalability to large-scale problems.
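The sampled-element stochastic gradient can be sketched for a 3-way CP model with squared-error loss. This toy version accumulates the gradient with plain loops rather than the sparse MTTKRP kernel the paper relies on, and the function name and interface are illustrative assumptions:

```python
import numpy as np

def sampled_cp_gradient(X, factors, idx):
    """Stochastic gradient of 0.5 * sum over sampled entries (x - m)^2
    with respect to CP factor matrices A, B, C (each dim_k x R)."""
    A, B, C = factors
    gA, gB, gC = np.zeros_like(A), np.zeros_like(B), np.zeros_like(C)
    for i, j, k in idx:
        m = (A[i] * B[j] * C[k]).sum()   # model entry m_ijk of the CP model
        dy = m - X[i, j, k]              # loss derivative at this entry
        gA[i] += dy * (B[j] * C[k])
        gB[j] += dy * (A[i] * C[k])
        gC[k] += dy * (A[i] * B[j])
    return gA, gB, gC

# Sanity check: factors that reproduce X exactly give a zero gradient.
A = np.array([[1.0], [2.0]])
B = np.array([[1.0], [3.0]])
C = np.array([[2.0], [1.0]])
X = np.einsum("ir,jr,kr->ijk", A, B, C)       # exact rank-1 tensor
idx = [(i, j, k) for i in range(2) for j in range(2) for k in range(2)]
gA, gB, gC = sampled_cp_gradient(X, [A, B, C], idx)
print(np.abs(gA).max(), np.abs(gB).max(), np.abs(gC).max())
```

Because only the sampled rows of each factor are touched, the per-step cost scales with the number of sampled entries rather than the full tensor size, which is what makes the stochastic approach viable at scale.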
Principal Component Analysis (PCA) is a method for estimating a subspace given noisy samples. It is useful in a variety of problems ranging from dimensionality reduction to anomaly detection and the visualization of high dimensional data. PCA performs well in the presence of moderate noise and even with missing data, but is also sensitive to outliers. PCA is also known to have a phase transition when noise is independent and identically distributed; recovery of the subspace sharply declines at a threshold noise variance. Effective use of PCA requires a rigorous understanding of these behaviors. This paper provides a step towards an analysis of PCA for samples with heteroscedastic noise, that is, samples that have nonuniform noise variances and so are no longer identically distributed. In particular, we provide a simple asymptotic prediction of the recovery of a one-dimensional subspace from noisy heteroscedastic samples. The prediction enables: a) easy and efficient calculation of the asymptotic performance, and b) qualitative reasoning to understand how PCA is impacted by heteroscedasticity (such as outliers).
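The one-dimensional setting analyzed above is easy to simulate. The sketch below (all parameter values are illustrative assumptions, not from the analysis) draws samples on a known one-dimensional subspace, adds noise whose variance is either uniform across samples or varies with the same average, and measures how well the top principal component recovers the subspace:

```python
import numpy as np

def pca_recovery(sample_vars, theta=1.0, d=100, seed=0):
    """Return |<u_hat, u>|^2, the squared alignment between PCA's top
    component u_hat and the true subspace basis vector u."""
    rng = np.random.default_rng(seed)
    sample_vars = np.asarray(sample_vars, dtype=float)
    n = len(sample_vars)
    u = np.zeros(d)
    u[0] = 1.0                               # true one-dimensional subspace
    z = theta * rng.standard_normal(n)       # subspace coefficients
    noise = rng.standard_normal((n, d)) * np.sqrt(sample_vars)[:, None]
    X = np.outer(z, u) + noise               # n noisy samples in R^d
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # PCA via SVD
    return float(Vt[0] @ u) ** 2

n = 1000
homo = pca_recovery([1.0] * n)                                # equal variances
hetero = pca_recovery([0.02] * (n // 2) + [1.98] * (n // 2))  # same average
print(f"homoscedastic recovery:   {homo:.3f}")
print(f"heteroscedastic recovery: {hetero:.3f}")
```

Both configurations have average noise variance 1; per the analysis above, the heteroscedastic configuration is expected to recover the subspace less accurately in the asymptotic regime, though at this finite sample size the gap can be modest.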