Guillaume Mercier scite author profile

The increasing number of cores, shared caches, and memory nodes within machines introduces a complex hardware topology. High-performance computing applications now have to carefully adapt their placement and behavior according to the underlying hierarchy of hardware resources and their software affinities. We introduce the Hardware Locality (hwloc) software, which gathers hardware information about processors, caches, memory nodes, and more, and exposes it to applications and runtime systems in an abstracted and portable hierarchical manner. hwloc may significantly help performance by having runtime systems place their tasks or adapt their communication strategies depending on hardware affinities. We show that hwloc can already be used by popular high-performance OpenMP or MPI software. Indeed, scheduling OpenMP threads according to their affinities or placing MPI processes according to their communication patterns shows interesting performance improvements thanks to hwloc. An optimized MPI communication strategy may also be dynamically chosen according to the location of the communicating processes in the machine and its hardware characteristics.

Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques

Jeannot

Mercier

Tessier

2014

IEEE Trans. Parallel Distrib. Syst.

Current generations of NUMA node clusters feature multicore or manycore processors. Programming such architectures efficiently is a challenge because numerous hardware characteristics have to be taken into account, especially the memory hierarchy. One appealing idea to improve the performance of parallel applications is to decrease their communication costs by matching the communication pattern to the underlying hardware architecture. In this report, we detail the algorithm and techniques proposed to achieve such a result: first, we gather both the communication pattern information and the hardware details. Then we compute a relevant reordering of the various process ranks of the application. Finally, those new ranks are used to reduce the communication costs of the application.

Population Genetics of Factor V Leiden in Europe

Lucotte

Mercier

2001

Blood Cells, Molecules, and Diseases

Distribution of the CCR5 Gene 32-bp Deletion in Europe

Lucotte

Mercier

1998

Journal of Acquired Immune Deficiency Syndromes and Human Retro

The chemokine receptor CCR5 constitutes the major coreceptor for the macrophage-tropic strains of HIV-1. A mutant allele of the CCR5 gene called delta32 was shown to provide strong resistance to homozygotes against infection by HIV. The frequency of the delta32 allele was investigated in 2522 noninfected unrelated individuals from 16 different European populations. The delta32 allele was found in all populations studied, with a mean frequency of about 9.1%. A north-to-south gradient correlating latitude with delta32 allelic frequencies was found (r = 0.726), with highest allele frequencies in Denmark and Northern France, and the lowest allele frequencies in Corsica.

Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis

Buntinas

Goglin

Goodell

et al. 2009

The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.

Near-Optimal Placement of MPI Processes on Hierarchical NUMA Architectures

Jeannot

Mercier

2010

Towards an Efficient Process Placement Policy for MPI Applications in Multicore Environments

Mercier

Clet-Ortega

2009

This paper presents a method to efficiently place MPI processes on multicore machines. Since MPI implementations often feature efficient support for both shared-memory and network communication, an adequate placement policy is a crucial step to improve application performance. As a case study, we show the results obtained for several NAS computing kernels and explain how the policy influences overall performance. In particular, we found that a policy merely increasing the intranode communication ratio is not enough and that cache utilization is also an influential factor. A more sophisticated policy (e.g., one taking into account the architecture's memory structure) is required to observe performance improvements.

Transcriptional disruptions in Down syndrome: a case study in the Ts1Cje mouse cerebellum during post‐natal development

Potier

Rivals

Mercier

et al. 2006

Journal of Neurochemistry

To understand the aetiology and the phenotypic severity of Down syndrome, we searched for transcriptional signatures in a substructure of the brain (cerebellum) during post-natal development in a segmental trisomy 16 model, the Ts1Cje mouse. The goal of this study was to investigate the effects of trisomy on changes in gene expression across development time. The primary gene-dosage effect on triplicated genes (1.5) was observed at birth [post-natal day 0 (P0)], at P15 and P30. About 5% of the non-triplicated genes were significantly differentially expressed between trisomic and control cerebellum, while 25% of the transcriptome was modified during post-natal development of the cerebellum. Indeed, only 165, 171 and 115 genes were dysregulated in trisomic cerebellum at P0, P15 and P30, respectively. Surprisingly, there were only three genes dysregulated in development and in trisomic animals in a similar or opposite direction. These three genes (Dscr1, Son and Hmg14) were, quite unexpectedly, triplicated in the Ts1Cje model and should be candidate genes for understanding the aetiology of the phenotype observed in the cerebellum.
