Alain Ketterlin scite author profile

International audienceThis paper describes an algorithm that takes a trace (i.e., a sequence of numbers or vectors of numbers) as input, and from that produces a sequence of loop nests that, when run, produces exactly the original sequence. The input format is suitable for any kind of program execution trace, and the output conforms to standard models of loop nests. The first, most obvious, use of such an algorithm is for program behavior modeling for any measured quantity (memory accesses, number of cache misses, etc.). Finding loops amounts to detecting periodic behavior and provides an explanatory model. The second application is trace compression, i.e., storing the loop nests instead of the original trace. Decompression consists of running the loops, which is easy and fast. A third application is value prediction. Since the algorithm forms loops while reading input, it is able to extrapolate the loop under construction to predict further incoming values. Throughout the paper, we provide examples that explain our algorithms. Moreover, we evaluate trace compression and value prediction on a subset of the SPEC2000 benchmarks

show abstract

Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization

Ketterlin

Clauss

2012

View full text Add to dashboard Cite

Polyhedral parallelization of binary code

Pradelle

Ketterlin

Clauss

2012

ACM Trans. Archit. Code Optim.

View full text Add to dashboard Cite

International audienceMany automatic software parallelization systems have been proposed in the past decades, but most of them are dedicated to source-to-source transformations. This paper shows that parallelizing executable programs is feasible, even if they require complex transformations, and in effect decouples parallelization from compilation, for example, for closed-source or legacy software, where binary code is the only available representation. We propose an automatic parallelizer, which is able to perform advanced parallelization on binary code. It first parses the binary code and extracts high-level information. From this information, a C program is generated. This program captures only a subset of the program semantics, namely, loops and memory accesses. This C program is then parallelized using existing, state-of-the-art parallelizers, including advanced polyhedral parallelizers. The original program semantics is then re-injected, and the transformed parallel loop nests are recompiled by a standard C compiler. We show on the PolyBench benchmark suite that our system successfully detects and parallelizes almost all the loop nests from the binary code, using a recent polyhedral loop parallelizer as a backend. The paper ends by elaborating a strategy to parallelize more complex programs, such as those containing non-linear accesses to memory, and provides a few example case-studies

show abstract

Hierarchical Clustering of Composite Objects with a Variable Number of Components

Ketterlin

Gançarski

Korczak

1996

View full text Add to dashboard Cite

Sequence Similarity and Multi-Date Image Segmentation

Ketterlin¹,

Gançarski

2007

View full text Add to dashboard Cite

Improving the Performance of X10 Programs by Clock Removal

Feautrier

Violard

Ketterlin

2014

View full text Add to dashboard Cite

Abstract. X10 is a promising recent parallel language designed specifically to address the challenges of productively programming a wide variety of target platforms. The sequential core of X10 is an object-oriented language in the Java family. This core is augmented by a few parallel constructs that create activities as a generalization of the well known fork/join model. Clocks are a generalization of the familiar barriers. Synchronization on a clock is specified by the Clock.advanceAll() method call. Activities that execute advances stall until all existent activities have done the same, and then are released at the same (logical) time.This naturally raises the following question: are clocks strictly necessary for X10 programs? Surprisingly enough, the answer is no, at least for sufficiently regular programs. One assigns a date to each operation, denoting the number of advances that the activity has executed before the operation. Operations with the same date constitute a front, fronts are executed sequentially in order of increasing dates, while operations in a front are executed in parallel if possible. Depending on the nature of the program, this may entail some overhead, which can be reduced to zero for polyhedral programs. We show by experiments that, at least for the current X10 runtime, this transformation usually improves the performance of our benchmarks. Besides its theoretical interest, this transformation may be of interest for simplifying a compiler or runtime library.

show abstract

A Space and Bandwidth Efficient Multicore Algorithm for the Particle-in-Cell Method

Barsamian

Charguéraud

Ketterlin

2018

View full text Add to dashboard Cite

Abstract. The Particle-in-Cell (PIC) method allows solving partial differential equation through simulations, with important applications in plasma physics. To simulate thousands of billions of particles on clusters of multicore machines, prior work has proposed hybrid algorithms that combine domain decomposition and particle decomposition with carefully optimized algorithms for handling particles processed on each multicore socket. Regarding the multicore processing, existing algorithms either suffer from suboptimal execution time, due to sorting operations or use of atomic instructions, or suffer from suboptimal space usage. In this paper, we propose a novel parallel algorithm for two-dimensional PIC simulations on multicore hardware that features asymptotically-optimal memory consumption, and does not perform unnecessary accesses to the main memory. In practice, our algorithm reaches 65% of the maximum bandwidth, and shows excellent scalability on the classical Landau damping and two-stream instability test cases.

show abstract

12 3

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Alain Ketterlin

A global averaging method for dynamic time warping, with applications to clustering

Prediction and trace compression of data access addresses through nested loop recognition

Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization

Polyhedral parallelization of binary code

Hierarchical Clustering of Composite Objects with a Variable Number of Components

Sequence Similarity and Multi-Date Image Segmentation

Improving the Performance of X10 Programs by Clock Removal

A Space and Bandwidth Efficient Multicore Algorithm for the Particle-in-Cell Method

Contact Info

Product

Resources

About