Kai Tian scite author profile

Kai Tian


17 Publications

366 Citation Statements Received

334 Citation Statements Given


Affiliations

Tencent (China), William & Mary, Williams (United States)

Publications


Using Latent Dirichlet Allocation for automatic categorization of software

2009

Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?

2010

On Chip Multiprocessors (CMP), multiple cores commonly share certain levels of cache. The sharing increases contention for cache and for memory-to-chip bandwidth, further highlighting the importance of data locality analysis. As a rigorous and hardware-independent locality metric, reuse distance has served a variety of locality analyses, program transformations, and performance predictions. However, previous studies have concentrated on sequential programs running on unicore processors. On CMP, accesses by different threads (or jobs) interact in the shared cache. How reuse distance applies to the new architecture remains an open question: in particular, how the interactions in shared cache affect the collection and application of reuse distance, and how reuse-distance-based locality analysis should adapt to such architecture changes. This paper presents our explorations towards answering those questions. It first introduces the concept of concurrent reuse distance, a direct extension of the traditional concept of reuse distance that considers data references by all co-running threads (or jobs). It then discusses the properties of concurrent reuse distance, revealing the special challenges facing its collection and application on CMP platforms. Finally, it presents solutions to those challenges for a class of multithreading applications. The solutions center on a probabilistic model that connects concurrent reuse distance with the data locality of each individual thread. Experiments demonstrate the effectiveness of the proposed techniques in facilitating the use of concurrent reuse distance for CMP computing.
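To make the two notions in the abstract concrete, the following is a minimal sketch, not the paper's method. It assumes the common definition of reuse distance (the number of distinct other data items accessed between two consecutive accesses to the same item) and models co-running threads with a hypothetical round-robin interleaving. The function names and the example traces are illustrative assumptions; the paper's probabilistic model is not reproduced here.

```python
from typing import Hashable, List, Optional, Sequence


def reuse_distances(trace: Sequence[Hashable]) -> List[Optional[int]]:
    """Return the reuse distance of every access in the trace.

    The distance is the number of distinct other items accessed since the
    previous access to the same item; None marks a first-time access.
    This is a simple quadratic version meant for illustration only.
    """
    last_seen = {}
    distances: List[Optional[int]] = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


def interleave(a: Sequence[Hashable], b: Sequence[Hashable]) -> List[Hashable]:
    """Merge two per-thread traces round-robin (an assumed schedule)."""
    merged = []
    for i in range(max(len(a), len(b))):
        if i < len(a):
            merged.append(a[i])
        if i < len(b):
            merged.append(b[i])
    return merged


# Hypothetical traces: thread A reuses two items, thread B streams new data.
thread_a = ["x", "y", "x", "y"]
thread_b = ["p", "q", "r", "s"]

standalone = reuse_distances(thread_a)
concurrent = reuse_distances(interleave(thread_a, thread_b))

print(standalone)   # distances seen by thread A running alone
print(concurrent)   # distances over the shared-cache access stream
```

In this toy trace, thread A's reuses have distance 1 when it runs alone but distance 3 in the interleaved stream, because the co-runner's accesses fall between A's reuses. That stretching effect is the kind of interaction the abstract refers to, and the result depends on the interleaving chosen.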

Exploiting statistical correlations for proactive prediction of program behaviors

2010

Combining Locality Analysis with Online Proactive Job Co-scheduling in Chip Multiprocessors

2010

A study on optimally co-scheduling jobs of different lengths on chip multiprocessors

2009

On-the-fly elimination of dynamic irregularities for GPU computing

2011, SIGARCH Comput. Archit. News

Power-efficient, massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown large performance gains when these irregularities are removed, but how to achieve those gains through software approaches on modern GPUs remains an open question. This paper presents a systematic exploration of how to tackle dynamic irregularities in both control flows and memory references. It reveals some properties of these irregularities, their interactions, and their relations with program data and threads. It describes several heuristics-based algorithms and runtime adaptation techniques for effectively removing dynamic irregularities through data reordering and job swapping. It presents a framework, G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution and works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among optimizations. Its optimization overhead is largely transparent to GPU kernel executions and does not jeopardize the basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Experiments show that G-Streamline is effective in reducing dynamic irregularities in GPU computing, producing speedups between 1.07 and 2.5 for a variety of applications.
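The idea of removing irregular memory references through data reordering can be illustrated with a simplified model. The sketch below is not G-Streamline's algorithm. It assumes a warp of 32 threads, a memory segment holding 32 elements, a cost of one memory transaction per distinct segment touched by a warp, and that each element is accessed by exactly one thread. All names and constants are hypothetical.

```python
from typing import List, Sequence, Tuple

WARP_SIZE = 32       # threads per warp (assumed)
SEGMENT_ELEMS = 32   # elements served by one memory transaction (assumed)


def transactions(indices: Sequence[int]) -> int:
    """Count transactions: one per distinct segment touched by each warp."""
    total = 0
    for start in range(0, len(indices), WARP_SIZE):
        warp = indices[start:start + WARP_SIZE]
        total += len({i // SEGMENT_ELEMS for i in warp})
    return total


def reorder(data: Sequence[float],
            indices: Sequence[int]) -> Tuple[List[float], List[int]]:
    """Copy data into thread order so that thread t reads new_data[t]."""
    new_data = [data[i] for i in indices]
    new_indices = list(range(len(indices)))
    return new_data, new_indices


# Hypothetical irregular pattern: each of 64 threads hits a different segment.
n_threads = 64
data = [float(i) for i in range(n_threads * SEGMENT_ELEMS)]
indices = [t * SEGMENT_ELEMS for t in range(n_threads)]

before = transactions(indices)
new_data, new_indices = reorder(data, indices)
after = transactions(new_indices)

print(before, after)  # transaction counts before and after reordering
```

Under these assumptions the strided pattern costs 64 transactions and the reordered layout costs 2, while every thread still reads the same value as before. The copy itself has a cost, which this sketch ignores.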

An input-centric paradigm for program dynamic optimizations

2010

The Complexity of Optimal Job Co-Scheduling on Chip Multiprocessors and Heuristics-Based Solutions

2011, IEEE Trans. Parallel Distrib. Syst.
