Fine-tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present two profiling techniques for the fine-grained parallel programming language Split-C, which provides a simple global address space memory model. One profiler provides a detailed analysis of a program's execution. The other profiler collects cumulative information. As our experience shows, it is quite challenging to profile programs that make use of efficient, low-overhead communication. We incorporated techniques that minimize profiling effects on the running program, and we quantified the profiling overhead. We present several Split-C applications showing that the profiler is useful in determining performance bottlenecks.

Most Split-C implementations are based on Active Messages [14], a fast communication mechanism. Each Active Message has an associated handler function, which is executed on the destination processor when the message arrives. Under the Active Message model, messages travel from user space (the send instruction) directly to user space (the message handler), avoiding the buffer management and synchronization usually encountered in traditional send & receive. As a result, Active Messages achieve an order of magnitude performance improvement over more traditional communication mechanisms. Therefore, Split-C applications can be much more fine-grained than typical PVM or MPI programs.

Many existing tools have been geared towards coarser-grained problems, and are well suited for this purpose. Unfortunately, it is much more difficult to profile applications that make use of efficient, low-overhead communication. The time spent in profiling overhead is now significant compared to the communication time.
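To make the Active Message model and the cost of tracing it concrete, the following is a minimal sketch in Python, not the Split-C runtime or the profiler described in this paper. All names (`Processor`, `am_send`, `store_handler`, `trace`) are hypothetical and chosen for illustration only. It models a message that carries its own handler, which runs on the destination when the message arrives, and it appends one trace record per message to show why tracing cost grows with the number of messages.

```python
from collections import deque


class Processor:
    """Hypothetical stand-in for one node; not the Split-C runtime."""

    def __init__(self, pid):
        self.pid = pid
        self.memory = {}      # this node's part of the address space
        self.inbox = deque()  # models messages arriving from the network

    def poll(self):
        """Run the handler attached to each arrived message."""
        handled = 0
        while self.inbox:
            handler, args = self.inbox.popleft()
            handler(self, *args)  # handler executes on the destination
            handled += 1
        return handled


# Assumed trace format: one (event, source, destination) record per message.
trace = []


def am_send(src, dest, handler, *args):
    """Send an active message: the handler travels with the data."""
    trace.append(("send", src.pid, dest.pid))  # per-message tracing cost
    dest.inbox.append((handler, args))


def store_handler(proc, addr, value):
    """Handler run on the destination: write a value into its memory."""
    proc.memory[addr] = value


p0, p1 = Processor(0), Processor(1)
for i in range(3):
    am_send(p0, p1, store_handler, i, i * i)
handled = p1.poll()
```

In this sketch every message does very little work (a single store), so the extra trace record is a large fraction of the per-message cost. This mirrors the observation above that profiling overhead becomes significant when communication itself is cheap.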
Additionally, the space needed to store the profiling data on disk can be much larger for programs that use fine-grained communication, and writing the trace data out to disk is very intrusive. We address these limitations and propose a number of potential solutions. One such solution involves a new technique, flush-on-barrier, to reduce perturbation when flushing trace data.

We examine and address these limitations in the context of a tracing offline profiler we developed for the parallel language Split-C running on a high-end MPP (a 64-processor Meiko CS-2). As the basis of our tracing profiler we used PICL/ParaGraph [1,2], which was originally designed for a send & receive protocol (MPI). We adapted this profiler to the Active Message model underlying Split-C. We experimentally quantify the tracing overheads of our implementations and study the merits and limitations of tracing techniques for systems with fine-grained communication. We find that, although our tracing profiler is a useful tool, the large size of the trace files is a major problem. Because of this, the tracing profiler is not well suited for large problem instances, which...