Wessam Hassanein scite author profile

The sort operation is a core part of many critical applications (e.g., database management systems). Despite the large efforts to parallelize it, the fact that it suffers from high data-dependencies vastly limits its performance. Multithreaded architectures are emerging as the most demanding technology in leading-edge processors. These architectures include simultaneous multithreading, chip multiprocessors, and machines combining different multithreading technologies. In this paper, we analyze the memory behavior and improve the performance of the most recent parallel radix and quick integer sort algorithms on modern multithreaded architectures. We achieve speedups up to 4.69× for radix sort and up to 4.17× for quicksort on a machine with 4 multithreaded processors compared to single threaded versions, respectively. We find that since radix sort is CPU-intensive, it exhibits better results on chip multiprocessors where multiple CPUs are available. While quicksort is accomplishing speedups on all types of multithreading processers due to its ability to overlap memory miss latencies with other useful processing.

show abstract

Data forwarding through in-memory precomputation threads

Hassanein

Fortes

Eigenmann

2004

View full text Add to dashboard Cite

1In modern architectures, memory access latency is an increasingly performance-limiting factor. To reduce this latency, we propose concepts and implementation of a new technique that uses an inmemory processor to precompute future, critical load addresses and forward the computed values to the main processor. The acronym for this technique is IMPT for In-Memory Precomputation-based forwarding Threads. IMPT combines the advantages of precomputation-based techniques with the low memory access latency of processing-in-memory. To evaluate IMPT, we use a cycle-accurate simulation of an aggressive out-oforder processor with accurate simulation of bus and memory contention. The results show a performance gain of up to 1.47 (1.21 on average) over an aggressive superscalar processor. The average load access latency decreases by up to 55% (32% on average).

show abstract

Analyzing the Effects of Hyperthreading on the Performance of Data Management Systems

Hassanein

Rashid

Hammad

2008

Int J Parallel Prog

View full text Add to dashboard Cite

As information processing applications take greater roles in our everyday life, database management systems (DBMSs) are growing in importance. DBMSs have traditionally exhibited poor cache performance and large memory footprints, therefore performing only at a fraction of their ideal execution and exhibiting low processor utilization. Previous research has studied the memory system of DBMSs on researchbased simultaneous multithreading (SMT) processors. Recently, several differences have been noted between the real hyper-threaded architecture implemented by the Intel Pentium 4 and the earlier SMT research architectures. This paper characterizes the performance of a prototype open-source DBMS running TPC-equivalent benchmark queries on an Intel Pentium 4 Hyper-Threading processor. We use hardware counters provided by the Pentium 4 to evaluate the micro-architecture and study the memory system behavior of each query running on the DBMS. Our results show a performance improvement of up to 1.16 in TPC-C-equivalent and 1.26 in TPC-H-equivalent queries due to hyperthreading.

show abstract

Understanding and improving JVM GC work stealing at the data center scale

Hassanein

2016

SIGPLAN Not.

View full text Add to dashboard Cite

Garbage collection (GC) is a critical part of performance in managed run-time systems such as the OpenJDK Java Virtual Machine (JVM). With a large number of latency sensitive applications written in Java the performance of the JVM is essential. Java application servers run in data centers on a large number of multi-core servers, thus load balancing in multithreaded GC phases is critical. Dynamic load balancing in the JVM GC is achieved through work stealing, a well known and effective method to balance tasks across threads. This paper analyzes the JVM work stealing behaviour, and introduces a novel work stealing technique that improves performance, GC CPU utilization, scalability, and reduces the cost of jobs running on Google's data-centers. We analyze both the DaCapo benchmark suite as well as Google's data-center jobs. Our results show that the Gmail front-end server shows a 15-20% GC CPU reduction, and a 5% CPU performance improvement. Our analysis of a sample of ~59K jobs shows that GC CPU utilization improves by 38% geomean, 12% weighted geomean. GC pause time improves by 16% geomean, 20% weighted geomean. Full GC pause time improves by 34% geomean, 12% weighted geomean.

show abstract

1D performance analysis and tracing of technical and Java applications on the Itanium2 processor

Hassanein

Astfalk

Eigenmann

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.