Otto J. Anshus scite author profile

Thread scheduling for cache locality

This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache misses. This technique may be particularly valuable when compiler-directed tiling is not feasible. Experiments with several application programs, on two systems with different cache structures, show that our thread scheduling method can improve program performance by reducing second-level cache misses.

Tools and Applications for Large-Scale Display Walls

Wallace, Anshus, et al. 2005. IEEE Comput. Grap. Appl.

Operating system support for multimedia systems

Plagemann, Goebel, Halvorsen, et al. 2000. Computer Communications.

The Synchronization Power of Coalesced Memory Accesses

Tsigas, Anshus. 2008.

Abstract. Multicore processor architectures have established themselves as the new generation of processor architectures. As part of the one core to many cores evolution, memory access mechanisms have advanced rapidly. Several new memory access mechanisms have been implemented in many modern commodity multicore processors. Memory access mechanisms, by devising how processing cores access the shared memory, directly influence the synchronization capabilities of the multicore processors. Therefore, it is crucial to investigate the synchronization power of these new memory access mechanisms. This paper investigates the synchronization power of coalesced memory accesses, a family of memory access mechanisms introduced in recent large multicore architectures like the CUDA graphics processors. We first design three memory access models to capture the fundamental features of the new memory access mechanisms. Subsequently, we prove the exact synchronization power of these models in terms of their consensus numbers. These tight results show that the coalesced memory access mechanisms can facilitate strong synchronization between the threads of multicore processors, without the need of synchronization primitives other than reads and writes. In the case of the contemporary CUDA processors, our results imply that the coalesced memory access mechanisms have consensus numbers up to sixteen.

NB-FEB: A Universal Scalable Easy-to-Use Synchronization Primitive for Manycore Architectures

Tsigas, Anshus. 2009.

Abstract. This paper addresses the problem of universal synchronization primitives that can support scalable thread synchronization for large-scale manycore architectures. The universal synchronization primitives that have been deployed widely in conventional architectures are the compare-and-swap (CAS) and load-linked/store-conditional (LL/SC) primitives. However, such synchronization primitives are expected to reach their scalability limits in the evolution to manycore architectures with thousands of cores. We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on manycore architectures. We show that the NB-FEB primitive is universal, scalable, feasible and easy to use. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable, namely its memory requests to the same memory location can be combined into only one memory request, which consequently makes NB-FEB scalable (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility). We construct, on top of NB-FEB, a non-blocking software transactional memory system called NBFEB-STM, which can be used as an abstraction to handle concurrent threads easily. NBFEB-STM is space efficient: the space complexity of each object updated by N concurrent threads/transactions is Θ(N), which is optimal.

Improving the performance of VNC for high-resolution display walls

Liu, Anshus. 2009.

Using Overdecomposition to Overlap Communication Latencies with Computation and Take Advantage of SMT Processors

Bongo, Vinter, Anshus, et al.

EventSpace – Exposing and Observing Communication Behavior of Parallel Cluster Applications

Bongo, Anshus, Bjørndalen. 2003.
