INTRODUCTION

Recently, the Message Passing Interface (MPI) [16] has been proposed as an industrial standard for writing "portable" message-passing parallel programs. The MPI standardization effort involved about 60 people from 40 organizations, including universities, national laboratories, and most MPP vendors. Version 1 of MPI was released in May 1994. MPI adopts most, if not all, common practices from existing communication libraries. One of the key components of MPI is the collective communication subset, which allows users to conveniently call library routines for various "global" communication operations, such as broadcast, scatter, and gather. All MPI collective communication routines are implicitly defined with respect to a process group [3], which specifies an ordered set of processes within which the collective communication is performed. For example, a multicast is specified as a broadcast to a particular process group. The performance of a parallel program depends on an efficient implementation of collective as well as point-to-point communication.

In existing parallel programming environments for Local Area Networks (LANs), such as PVM, EXPRESS, and IBM's MPL [2,12,20], collective communication routines are implemented on top of point-to-point communication. As a result, these environments suffer from poor collective communication performance. For example, a broadcast implemented over TCP, or over point-to-point UDP, on a LAN is clearly inefficient because it does not exploit the fact that most LANs are based on a broadcast medium.

In this paper, we present an efficient design and implementation of the Collective Communication Library in MPI (MPI-CCL) that is optimized for clusters of workstations. In particular, we demonstrate the implementation on a traditional 10-Mbit Ethernet-based LAN.
We note here that the ideas presented in this paper extend readily to any Network of Workstations (NOW) [21] that provides an unreliable broadcast transport protocol (for example, an ATM network whose switches have the broadcast capability offered by many vendors today).

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 40, 19-34 (1997)

Our system is integrated with the operating system via an efficient kernel extension mechanism that we developed. The kernel extension significantly improves the performance of our implementation because it handles part of the communication overhead without involving user space. We have implemented our system on a collection of IBM RS/6000 workstations connected via a 10-Mbit Ethernet LAN. Our performance measurements are taken from typical scientific programs run in parallel by means of MPI. The hypothesis behind our design is that the system's performance is bounded by interactions between the kernel and user space rather than by the bandwidth delivered by the LAN Data-Link Layer. Our results indicate that the performance of our MPI Broadcast (on top of Ethernet) is about twice as fast as a recently published software implementation of broadcas...