Casey Battaglino scite author profile

The CANDECOMP/PARAFAC (CP) decomposition is a leading method for the analysis of multiway data. The standard alternating least squares algorithm for the CP decomposition (CP-ALS) involves a series of highly overdetermined linear least squares problems. We extend randomized least squares methods to tensors and show the workload of CP-ALS can be drastically reduced without a sacrifice in quality. We introduce techniques for efficiently preprocessing, sampling, and computing randomized least squares on a dense tensor of arbitrary order, as well as an efficient sampling-based technique for checking the stopping condition. We also show more generally that the Khatri-Rao product (used within the CP-ALS iteration) produces conditions favorable for direct sampling. In numerical results, we see improvements in speed, reductions in memory requirements, and robustness with respect to initialization.
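As a rough illustration of why the Khatri-Rao structure lends itself to direct sampling, the sketch below draws rows of a Khatri-Rao product without ever forming the full product. Each row is the elementwise product of one row from each factor matrix. The function names, the uniform sampling scheme, and the list-of-lists matrix layout are assumptions made for this example and are not taken from the paper.

```python
import random


def khatri_rao_row(factors, idx):
    """Return one row of the Khatri-Rao product of `factors`.

    `factors` is a list of matrices (lists of rows) that share a column
    count; `idx` holds one row index per factor. The result is the
    elementwise product of the selected rows.
    """
    rank = len(factors[0][0])
    row = [1.0] * rank
    for mat, i in zip(factors, idx):
        row = [r * mat[i][c] for c, r in enumerate(row)]
    return row


def sample_khatri_rao_rows(factors, n_samples, rng):
    """Uniformly sample rows (with replacement) of the Khatri-Rao product.

    Only the sampled rows are computed, so the cost depends on n_samples
    rather than on the product of the factor row counts.
    """
    idxs = [tuple(rng.randrange(len(m)) for m in factors)
            for _ in range(n_samples)]
    rows = [khatri_rao_row(factors, idx) for idx in idxs]
    return idxs, rows


def full_khatri_rao(B, C):
    """Explicit two-factor product keyed by (j, k); for checking only."""
    return {(j, k): [b * c for b, c in zip(B[j], C[k])]
            for j in range(len(B)) for k in range(len(C))}


if __name__ == "__main__":
    B = [[1.0, 2.0], [3.0, 4.0]]
    C = [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    idxs, rows = sample_khatri_rao_rows([B, C], 4, random.Random(0))
    for idx, row in zip(idxs, rows):
        print(idx, row)
```

In a sampled least squares step, the same index tuples would also select the matching rows of the right-hand side, so the reduced problem stays consistent with the full one.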


An input-adaptive and in-place approach to dense tensor-times-matrix multiply

Battaglino, Perros, et al. 2015


On the communication complexity of 3D FFTs and its implications for Exascale

Czechowski, Battaglino, McClanahan, et al. 2012


This paper revisits the communication complexity of large-scale 3D fast Fourier transforms (FFTs) and asks what impact trends in current architectures will have on FFT performance at exascale. We analyze both memory hierarchy traffic and network communication to derive suitable analytical models, which we calibrate against current software implementations; we then evaluate models to make predictions about potential scaling outcomes at exascale, based on extrapolating current technology trends. Of particular interest is the performance impact of choosing high-density processors, typified today by graphics co-processors (GPUs), as the base processor for an exascale system. Among various observations, a key prediction is that although inter-node all-to-all communication is expected to be the bottleneck of distributed FFTs, intra-node communication (expressed precisely in terms of the relative balance among compute capacity, memory bandwidth, and network bandwidth) will play a critical role.


GraSP: distributed streaming graph partitioning

Battaglino, Pienta, Vuduc 2015


This paper presents a distributed, streaming graph partitioner, Graph Streaming Partitioner (GraSP), which makes partition decisions as each vertex is read from memory, simulating an online algorithm that must process nodes as they arrive. GraSP is a lightweight high-performance computing (HPC) library implemented in MPI, designed to be easily substituted for existing HPC partitioners such as ParMETIS. It is the first MPI implementation for streaming partitioning of which we are aware, and is empirically orders of magnitude faster than existing partitioners while providing comparable partitioning quality. We demonstrate the scalability of GraSP on up to 1024 compute nodes of NERSC's Edison supercomputer. Given a minute of run-time, GraSP can partition a graph three orders of magnitude larger than ParMETIS can.
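To make the idea of deciding a partition as each vertex arrives more concrete, here is a minimal single-process sketch of a streaming partitioner. It uses a greedy rule of the kind found in the streaming partitioning literature: a vertex goes to the partition that already holds the most of its neighbors, weighted down as that partition fills up. The abstract does not say which heuristic GraSP uses, so the scoring rule, the function names, and the dictionary-based graph format here are all assumptions for illustration. This is not GraSP's MPI implementation.

```python
def stream_partition(adjacency, k, capacity):
    """Assign each vertex to one of k partitions in a single pass.

    `adjacency` maps vertex -> list of neighbors and is read in
    insertion order, standing in for the vertex stream. Each vertex is
    placed using only the assignments made so far (assumed greedy rule).
    """
    assignment = {}
    sizes = [0] * k
    for v, nbrs in adjacency.items():
        # Count already-placed neighbors in each partition.
        counts = [0] * k
        for u in nbrs:
            p = assignment.get(u)
            if p is not None:
                counts[p] += 1
        # Score = neighbor count, discounted by how full the partition
        # is; ties go to the smaller partition, then the lower index.
        best = max(
            range(k),
            key=lambda p: (counts[p] * (1 - sizes[p] / capacity), -sizes[p]),
        )
        assignment[v] = best
        sizes[best] += 1
    return assignment


def edge_cut(adjacency, assignment):
    """Number of undirected edges whose endpoints are in different partitions."""
    cut = 0
    for v, nbrs in adjacency.items():
        for u in nbrs:
            if v < u and assignment[v] != assignment[u]:
                cut += 1
    return cut


if __name__ == "__main__":
    # Two triangles joined by the single edge 2-3.
    graph = {
        0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
        3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
    }
    parts = stream_partition(graph, k=2, capacity=3)
    print(parts, "edge cut:", edge_cut(graph, parts))
```

Because each decision reads only the current vertex's neighbor list and the running partition sizes, the pass is linear in the number of edges, which is the property that makes streaming approaches attractive at scale.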

