Xiaosong Ma scite author profile

Performance prediction across platforms is increasingly important as developers can choose from a wide range of execution platforms. The main challenge remains to perform accurate predictions at a low cost across different architectures. In this paper, we derive an affordable method approaching cross-platform performance translation based on relative performance between two platforms. We argue that relative performance can be observed without running a parallel application in full. We show that it suffices to observe very short partial executions of an application, since most parallel codes are iterative and behave in a predictable manner after a minimal startup period. This novel prediction approach is observation-based. It does not require program modeling, code analysis, or architectural simulation. Our performance results using real platforms and production codes demonstrate that prediction derived from partial executions can yield high accuracy at a low cost. We also assess the limitations of our model and identify future research directions on observation-based performance prediction.
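The idea of translating performance through a ratio observed on short partial runs can be illustrated with a minimal sketch. This is not the authors' method as published; the function name, its parameters, and the timings are hypothetical, and it assumes the partial executions cover the same iterations on both platforms.

```python
def predict_runtime(ref_full_s, ref_partial_s, target_partial_s):
    """Estimate the full-run time on a target platform (hypothetical helper).

    ref_full_s:       measured full-run time on the reference platform
    ref_partial_s:    time of a short partial execution on the reference platform
    target_partial_s: time of the same partial execution on the target platform
    """
    if ref_partial_s <= 0:
        raise ValueError("reference partial time must be positive")
    # Relative performance of the two platforms, observed on the partial runs only.
    relative_performance = target_partial_s / ref_partial_s
    # Assumption: the ratio seen in the partial run holds for the whole run.
    return ref_full_s * relative_performance


# Made-up timings for illustration: the target is 1.5x slower on the partial run.
predicted = predict_runtime(ref_full_s=3600.0, ref_partial_s=30.0, target_partial_s=45.0)
print(predicted)  # 5400.0
```

The sketch only holds to the extent that the application behaves steadily after startup, which is the premise the abstract states for iterative parallel codes.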

KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores

El-Sayed, Mukkara, Tsai, et al. 2018

Cache partitioning is now available in commercial hardware. In theory, software can leverage cache partitioning to use the last-level cache better and improve performance. In practice, however, current systems implement way-partitioning, which offers a limited number of partitions and often hurts performance. These limitations squander the performance potential of smart cache management. We present KPart, a hybrid cache partitioning-sharing technique that sidesteps the limitations of way-partitioning and unlocks significant performance on current systems. KPart first groups applications into clusters, then partitions the cache among these clusters. To build clusters, KPart relies on a novel technique to estimate the performance loss an application suffers when sharing a partition. KPart automatically chooses the number of clusters, balancing the isolation benefits of way-partitioning with its potential performance impact. KPart uses detailed profiling information to make these decisions. This information can be gathered either offline, or online at low overhead using a novel profiling mechanism. We evaluate KPart in a real system and in simulation. KPart improves throughput by 24% on average (up to 79%) on an Intel Broadwell-D system, whereas prior per-application partitioning policies improve throughput by just 1.7% on average and hurt 30% of workloads. Simulation results show that KPart achieves most of the performance of more advanced partitioning techniques that are not yet available in hardware.

A human behavior integrated hierarchical model of airborne disease transmission in a large city

Zhang, Huang, Su, et al. 2018

Building and Environment

Epidemics of infectious diseases such as SARS, H1N1, and MERS threaten public health, particularly in large cities such as Hong Kong. We constructed a human behavior integrated hierarchical (HiHi) model based on the SIR (Susceptible, Infectious, and Recovered) model, the Wells-Riley equation, and population movement considering both spatial and temporal dimensions. The model considers more than 7 million people, 3 million indoor environments, and 2566 public transport routes in Hong Kong. Smallpox, which could be spread through airborne routes, is studied as an example. The simulation is based on people's daily commutes and indoor human behaviors, which were summarized by mathematical patterns. We found that 59.6%, 18.1%, and 13.4% of patients become infected in their homes, offices, and schools, respectively. If both work stoppage and school closure measures are taken when the number of infected people is greater than 1000, an infectious disease will be effectively controlled after 2 months. The peak number of infected people will be reduced by 25% compared to taking no action, and the time of peak infections will be delayed by about 40 days if 90% of the infected people go to hospital during the infectious period. When ventilation rates in indoor environments increase to five times their default settings, smallpox will be naturally controlled. Residents of Kowloon and the north part of Hong Kong Island have a high risk of infection from airborne infectious diseases. Our HiHi model reduces the calculation time for infection rates to an acceptable level while preserving accuracy.
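For readers unfamiliar with the Wells-Riley equation named above, a minimal sketch follows. It uses the standard form P = 1 - exp(-I q p t / Q); how the HiHi model couples it with SIR and population movement is not shown here, and all parameter values are hypothetical rather than taken from the paper.

```python
import math


def wells_riley(infectors, quanta_per_hour, breathing_m3_per_hour,
                hours, ventilation_m3_per_hour):
    """Infection probability for one susceptible person in a well-mixed room.

    Standard Wells-Riley form: P = 1 - exp(-I * q * p * t / Q).
    """
    if ventilation_m3_per_hour <= 0:
        raise ValueError("ventilation rate must be positive")
    dose = (infectors * quanta_per_hour * breathing_m3_per_hour * hours
            / ventilation_m3_per_hour)
    return 1.0 - math.exp(-dose)


# Hypothetical room: one infector, 8 hours of exposure (values are illustrative only).
baseline = wells_riley(1, 10.0, 0.5, 8.0, 100.0)
# Same room with the ventilation rate raised to five times the baseline.
boosted = wells_riley(1, 10.0, 0.5, 8.0, 500.0)
print(f"baseline risk: {baseline:.3f}, with 5x ventilation: {boosted:.3f}")
```

Because the ventilation rate sits in the denominator of the exponent, raising it lowers the per-room infection probability, which is consistent with the abstract's finding about fivefold ventilation.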

FreeLoader: Scavenging Desktop Storage Resources for Scientific Data

Vazhkudai, Freeh, et al.

High-end computing is suffering a data deluge from experiments, simulations, and apparatus that creates overwhelming application dataset sizes. End-user workstations, despite more processing power than ever before, are ill-equipped to cope with such data demands due to insufficient secondary storage space and I/O rates. Meanwhile, a large portion of desktop storage is unused. We present the FreeLoader framework, which aggregates unused desktop storage space and I/O bandwidth into a shared cache/scratch space, for hosting large, immutable datasets and exploiting data access locality. Our experiments show that FreeLoader is an appealing low-cost solution to storing massive datasets, by delivering higher data access rates than traditional storage facilities. In particular, we present novel data striping techniques that allow FreeLoader to efficiently aggregate a workstation's network communication bandwidth and local I/O bandwidth. In addition, the performance impact on the native workload of donor machines is small and can be effectively controlled.

Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling

Mukkara, Beckmann, Abeydeera, et al. 2018

Graph processing is increasingly bottlenecked by main memory accesses. On-chip caches are of little help because the irregular structure of graphs causes seemingly random memory references. However, most real-world graphs offer significant potential locality; it is just hard to predict ahead of time. In practice, graphs have well-connected regions where relatively few vertices share edges with many common neighbors. If these vertices were processed together, graph processing would enjoy significant data reuse. Hence, a graph's traversal schedule largely determines its locality. This paper explores online traversal scheduling strategies that exploit the community structure of real-world graphs to improve locality. Software graph processing frameworks use simple, locality-oblivious scheduling because, on general-purpose cores, the benefits of locality-aware scheduling are outweighed by its overheads. Software frameworks rely on offline preprocessing to improve locality. Unfortunately, preprocessing is so expensive that its costs often negate any benefits from improved locality. Recent graph processing accelerators have inherited this design. Our insight is that this misses an opportunity: Hardware acceleration allows for more sophisticated, online locality-aware scheduling than can be realized in software, letting systems significantly improve locality without any preprocessing. To exploit this insight, we present bounded depth-first scheduling (BDFS), a simple online locality-aware scheduling strategy. BDFS restricts each core to explore one small, connected region of the graph at a time, improving locality on graphs with good community structure. We then present HATS, a hardware-accelerated traversal scheduler that adds just 0.4% area and 0.2% power over general-purpose cores. We evaluate BDFS and HATS on several algorithms using large real-world graphs. On a simulated 16-core system, BDFS reduces main memory accesses by up to 2.4× and by 30% on average. However, BDFS is too expensive in software and degrades performance by 21% on average. HATS eliminates these overheads, allowing BDFS to improve performance by 83% on average (up to 3.1×) over a locality-oblivious software implementation and by 31% on average (up to 2.1×) over specialized prefetchers.
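The depth-bounded traversal idea behind BDFS can be sketched in software as follows. This is a single-threaded illustration under assumptions, not the paper's algorithm or the HATS hardware design: the function name, the adjacency-list representation, and the example graph are all hypothetical.

```python
def bdfs_order(adj, max_depth):
    """Return a vertex processing order from a depth-bounded DFS (illustrative).

    adj:       adjacency list, adj[v] is the list of neighbors of vertex v
    max_depth: how far from each root the exploration may go
    """
    visited = [False] * len(adj)
    order = []
    for root in range(len(adj)):          # fall back to vertex order for new roots
        if visited[root]:
            continue
        stack = [(root, 0)]
        while stack:
            v, depth = stack.pop()
            if visited[v]:
                continue
            visited[v] = True
            order.append(v)
            if depth < max_depth:         # the bound keeps exploration local
                for n in reversed(adj[v]):
                    if not visited[n]:
                        stack.append((n, depth + 1))
    return order


# Two triangles, {0, 3, 4} and {1, 2, 5}, joined by the edge 4-1.
adj = [[3, 4], [2, 4, 5], [1, 5], [0, 4], [0, 1, 3], [1, 2]]
print(bdfs_order(adj, 0))  # [0, 1, 2, 3, 4, 5]  plain vertex order, communities interleaved
print(bdfs_order(adj, 2))  # [0, 3, 4, 1, 2, 5]  each community processed together
```

With a depth bound of 0 the schedule degenerates to vertex order, while a small positive bound groups neighboring vertices, which is the kind of reuse the abstract attributes to community structure.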

Using Shared Memory to Accelerate MapReduce on Graphics Processing Units

2011

Coordinating Computation and I/O in Massively Parallel Sequence Search

Lin, Feng, et al. 2011

IEEE Trans. Parallel Distrib. Syst.

With the explosive growth of genomic information, the searching of sequence databases has emerged as one of the most computation- and data-intensive scientific applications. Our previous studies suggested that parallel genomic sequence-search possesses highly irregular computation and I/O patterns. Effectively addressing these run-time irregularities is thus the key to designing scalable sequence-search tools on massively parallel computers. While the computation scheduling for irregular scientific applications and the optimization of noncontiguous file accesses have been well studied independently, little attention has been paid to the interplay between the two. In this paper, we systematically investigate the computation and I/O scheduling for data-intensive, irregular scientific applications within the context of genomic sequence search. Our study reveals that the lack of coordination between computation scheduling and I/O optimization could result in severe performance issues. We then propose an integrated scheduling approach that effectively improves sequence-search throughput by gracefully coordinating the dynamic load-balancing of computation and high-performance noncontiguous I/O.

Scalable I/O tracing and analysis

Vijayakumar, Mueller, et al. 2009

As supercomputer performance approached and then surpassed the petaflop level, I/O performance has become a major performance bottleneck for many scientific applications. Several tools exist to collect I/O traces to assist in the analysis of I/O performance problems. However, these tools either produce extremely large trace files that complicate performance analysis, or sacrifice accuracy to collect high-level statistical information. We propose a multi-level trace generator tool, ScalaIOTrace, that collects traces at several levels in the HPC I/O stack. ScalaIOTrace features aggressive trace compression that generates trace files of near constant size for regular I/O patterns and orders of magnitude smaller for less regular ones. This enables the collection of I/O and communication traces of applications running on thousands of processors. Our contributions also include automated trace analysis to collect selected statistical information of I/O calls by parsing the compressed trace on-the-fly and time-accurate replay of communication events with MPI-IO calls. We evaluated our approach with

Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution