Thomas Moscibroda scite author profile

In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts but they can also destroy other threads' DRAM-bank-level parallelism. Requests whose latencies would otherwise have been overlapped could effectively become serialized. As a result, both fairness and system throughput degrade, and some threads can starve for long time periods. This paper proposes a fundamentally new approach to designing a shared DRAM controller that provides quality of service to threads, while also improving system throughput. Our parallelism-aware batch scheduler (PAR-BS) design is based on two key ideas. First, PAR-BS processes DRAM requests in batches to provide fairness and to avoid starvation of requests. Second, to optimize system throughput, PAR-BS employs a parallelism-aware DRAM scheduling policy that aims to process requests from a thread in parallel in the DRAM banks, thereby reducing the memory-related stall-time experienced by the thread. PAR-BS seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities. We evaluate the design trade-offs involved in PAR-BS and compare it to four previously proposed DRAM scheduler designs on 4-, 8-, and 16-core systems. Our evaluations show that, averaged over 100 4-core workloads, PAR-BS improves fairness by 1.11X and system throughput by 8.3% compared to the best previous scheduling technique, Stall-Time Fair Memory (STFM) scheduling. Based on simple request prioritization rules, PAR-BS is also simpler to implement than STFM.
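The two ideas in the abstract above (batching for fairness, parallelism-aware ordering within a batch) can be illustrated with a small Python sketch. The abstract does not give the concrete rules, so the per-thread, per-bank marking cap, the ranking by maximum per-bank load, and the four-level priority order below are assumptions for illustration only. The names `Request`, `form_batch`, `rank_threads`, and `pick_next` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    thread: int
    bank: int
    row: int
    arrival: int
    marked: bool = False


def form_batch(queue, marking_cap):
    """Mark up to marking_cap oldest requests per (thread, bank) pair (assumed rule)."""
    counts = defaultdict(int)
    for req in sorted(queue, key=lambda r: r.arrival):
        key = (req.thread, req.bank)
        if counts[key] < marking_cap:
            req.marked = True
            counts[key] += 1


def rank_threads(queue):
    """Rank threads shortest-job-first over marked requests.

    Assumed rule: lower maximum per-bank load ranks first, ties broken by
    lower total load. Rank 0 is the highest priority.
    """
    per_bank = defaultdict(lambda: defaultdict(int))
    for req in queue:
        if req.marked:
            per_bank[req.thread][req.bank] += 1

    def load(thread):
        loads = per_bank[thread].values()
        return (max(loads), sum(loads))

    return {t: i for i, t in enumerate(sorted(per_bank, key=load))}


def pick_next(queue, bank, open_row, rank):
    """Pick the next request for one bank.

    Assumed priority order: marked requests, then row-buffer hits,
    then higher-ranked threads, then oldest first.
    """
    candidates = [r for r in queue if r.bank == bank]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (not r.marked, r.row != open_row,
                       rank.get(r.thread, len(rank)), r.arrival),
    )


queue = [Request(0, 0, 1, 0), Request(0, 0, 2, 1),
         Request(1, 0, 3, 2), Request(1, 1, 4, 3)]
form_batch(queue, marking_cap=5)
rank = rank_threads(queue)
chosen = pick_next(queue, bank=0, open_row=None, rank=rank)
```

In this toy queue, thread 1 spreads its requests over two banks and so has the lower maximum per-bank load. It is ranked first, and its bank-0 request is served ahead of thread 0's older requests, which keeps thread 1's two requests overlapped across banks.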


Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors

Mutlu

Moscibroda

2007

132

312


DRAM memory is a major resource shared among cores in a chip multiprocessor (CMP) system. Memory requests from different threads can interfere with each other. Existing memory access scheduling techniques try to optimize the overall data throughput obtained from the DRAM and thus do not take into account inter-thread interference. Therefore, different threads running together on the same chip can experience extremely different memory system performance: one thread can experience a severe slowdown or starvation while another is unfairly prioritized by the memory scheduler. This paper proposes a new memory access scheduler, called the Stall-Time Fair Memory scheduler (STFM), that provides quality of service to different threads sharing the DRAM memory system. The goal of the proposed scheduler is to "equalize" the DRAM-related slowdown experienced by each thread due to interference from other threads, without hurting overall system performance. As such, STFM takes into account inherent memory characteristics of each thread and does not unfairly penalize threads that use the DRAM system without interfering with other threads. We show that STFM significantly reduces the unfairness in the DRAM system while also improving system throughput (i.e., weighted speedup of threads) on a wide variety of workloads and systems. For example, averaged over 32 different workloads running on an 8-core CMP, the ratio between the highest DRAM-related slowdown and the lowest DRAM-related slowdown reduces from 5.26X to 1.4X, while the average system throughput improves by 7.6%. We qualitatively and quantitatively compare STFM to one new and three previously proposed memory access scheduling algorithms, including network fair queueing. Our results show that STFM provides the best fairness, system throughput, and scalability.

[1] The cores have private L2 caches, but they share the memory controller and the DRAM memory. Our methodology is described in detail in Section 6.

[2] libquantum is a memory-intensive streaming application that has a very high row-buffer locality (98.4% row-buffer hit rate). Other applications have significantly lower row-buffer hit rates. Since libquantum can generate its row-buffer-hit memory requests fast enough, its accesses are almost always unfairly prioritized over other threads' accesses by the FR-FCFS scheduling algorithm.

40th IEEE/ACM International Symposium on Microarchitecture
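The fairness metric quoted in the STFM abstract, the ratio between the highest and lowest DRAM-related slowdown, can be computed with a few lines of Python. This is a minimal sketch under stated assumptions: a thread's slowdown is taken to be its memory stall time when sharing DRAM divided by its stall time when running alone, and the threshold `alpha` and the function names are hypothetical, not taken from the paper.

```python
def slowdowns(stall_shared, stall_alone):
    """Per-thread DRAM-related slowdown: shared stall time / alone stall time."""
    return {t: stall_shared[t] / stall_alone[t] for t in stall_shared}


def unfairness(slow):
    """Ratio between the highest and the lowest slowdown."""
    return max(slow.values()) / min(slow.values())


def thread_to_prioritize(stall_shared, stall_alone, alpha=1.10):
    """Return the most-slowed-down thread if unfairness exceeds alpha, else None.

    alpha is a hypothetical tolerance; None means the scheduler is free to
    use its ordinary throughput-oriented policy.
    """
    slow = slowdowns(stall_shared, stall_alone)
    if unfairness(slow) > alpha:
        return max(slow, key=slow.get)
    return None


# Thread 0 stalls 5.26x longer than it would alone; thread 1 is unaffected.
shared = {0: 526, 1: 100}
alone = {0: 100, 1: 100}
victim = thread_to_prioritize(shared, alone)
```

Here `victim` is thread 0, since the unfairness (5.26) exceeds the tolerance. With equal slowdowns the function returns `None`, so threads that do not interfere with others are not penalized.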


White space networking with wi-fi like connectivity

et al. 2009


Networking over UHF white spaces is fundamentally different from conventional Wi-Fi along three axes: spatial variation, temporal variation, and fragmentation of the UHF spectrum. Each of these differences gives rise to new challenges for implementing a wireless network in this band. We present the design and implementation of WhiteFi, the first Wi-Fi like system constructed on top of UHF white spaces. WhiteFi incorporates a new adaptive spectrum assignment algorithm to handle spectrum variation and fragmentation, and proposes a low overhead protocol to handle temporal variation. WhiteFi builds on a simple technique, called SIFT, that reduces the time to detect transmissions in variable channel width systems by analyzing raw signals in the time domain. We provide an extensive system evaluation in terms of a prototype implementation and detailed experimental and simulation results.


The Complexity of Connectivity in Wireless Networks

2006


We define and study the scheduling complexity in wireless networks, which expresses the theoretically achievable efficiency of MAC layer protocols. Given a set of communication requests in arbitrary networks, the scheduling complexity describes the amount of time required to successfully schedule all requests. The most basic and important network structure in wireless networks being connectivity, we study the scheduling complexity of connectivity, i.e., the minimal amount of time required until a connected structure can be scheduled. In this paper, we prove that the scheduling complexity of connectivity grows only polylogarithmically in the number of nodes. Specifically, we present a novel scheduling algorithm that successfully schedules a strongly connected set of links in time O(log^4 n) even in arbitrary worst-case networks. On the other hand, we prove that standard MAC layer or scheduling protocols can perform much worse. In particular, any protocol that employs either uniform or linear (a node's transmit power is proportional to the minimum power required to reach its intended receiver) power assignment has an Ω(n) scheduling complexity in the worst case, even for simple communication requests. In contrast, our polylogarithmic scheduling algorithm allows many concurrent transmissions by using an explicitly formulated non-linear power assignment scheme. Our results show that even in large-scale worst-case networks, there is no theoretical scalability problem when it comes to scheduling transmission requests, thus giving an interesting complement to the more pessimistic bounds for the capacity in wireless networks. All results are based on the physical model of communication, which takes into account that the signal-to-noise plus interference ratio (SINR) at a receiver must be above a certain threshold if the transmission is to be received correctly.
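The physical (SINR) model mentioned at the end of this abstract can be made concrete with a short feasibility check: a set of links may share a time slot only if every receiver's SINR meets the threshold. The sketch below is illustrative and is not the paper's scheduling algorithm. The path-loss exponent `alpha`, the `noise` level, the threshold `beta`, and all function names are assumptions, and sender and receiver positions are assumed to be distinct points in the plane.

```python
import math


def uniform_power(links, p=1.0):
    """Every sender transmits at the same power."""
    return [p for _ in links]


def linear_power(links, alpha=3.0, c=1.0):
    """Power proportional to the minimum needed to reach the receiver (d ** alpha)."""
    return [c * math.dist(s, r) ** alpha for s, r in links]


def sinr_feasible(links, powers, alpha=3.0, noise=1e-9, beta=1.0):
    """Check whether all links can transmit in the same slot.

    links: list of (sender_xy, receiver_xy) pairs.
    A link succeeds if received signal / (noise + interference) >= beta.
    """
    for i, (sender, recv) in enumerate(links):
        signal = powers[i] / math.dist(sender, recv) ** alpha
        interference = sum(
            powers[j] / math.dist(links[j][0], recv) ** alpha
            for j in range(len(links)) if j != i
        )
        if signal / (noise + interference) < beta:
            return False
    return True


far_apart = [((0, 0), (1, 0)), ((100, 0), (101, 0))]
crowded = [((0, 0), (10, 0)), ((11, 0), (12, 0))]

ok_far = sinr_feasible(far_apart, uniform_power(far_apart))
ok_crowded = sinr_feasible(crowded, uniform_power(crowded))
```

With uniform power, the two distant links can share a slot, while in the crowded layout the long link's receiver sits next to the other sender and its SINR falls below the threshold. A scheduler would have to place those two links in different slots or choose different powers.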


The price of being near-sighted

2006


The questions of what can be computed, and how efficiently, are at the core of computer science. Not surprisingly, in distributed systems and networking research, an equally fundamental question is what can be computed in a distributed fashion. More precisely, if nodes of a network must base their decision on information in their local neighborhood only, how well can they compute or approximate a global (optimization) problem? In this paper we give the first poly-logarithmic lower bound on such local computation for (optimization) problems including minimum vertex cover, minimum (connected) dominating set, maximum matching, maximal independent set, and maximal matching. In addition, we present a new distributed algorithm for solving general covering and packing linear programs. For some problems this algorithm is tight with the lower bounds, for others it is a distributed approximation scheme. Together, our lower and upper bounds establish the local computability and approximability of a large class of problems, characterizing how much local information is required to solve these tasks.

We are grateful to ... and Schwartzman [7] for pointing out an error in an earlier draft [30] of this paper.

... |V| = n, and a parameter k (k might depend on n or some other property of G). At each node v ∈ V there is an independent agent (for simplicity, we identify the agent at node v with v as well). Every node v ∈ V has a unique identifier id(v) and possibly some additional input. We assume that each node v ∈ V can learn the complete neighborhood Γ_k(v) up to distance k in G (see below for a formal definition of Γ_k(v)). Based on this information, all nodes need to make independent computations and need to individually decide on their outputs without communicating with each other. Hence, the output of each node v ∈ V can be computed as a function of its k-neighborhood Γ_k(v).

Synchronous Message Passing Model: The described graph-theoretic local computation model is equivalent to the classic message passing model of distributed computing. In this model, the distributed system is modeled as a point-to-point communication network, described by an undirected graph G = (V, E), in which each vertex v ∈ V represents a node (host, device, processor, ...) of the network, and an edge (u, v) ∈ E is a bidirectional communication channel that connects the two nodes. Initially, nodes have no knowledge about the network graph; they only know their own identifier and potential additional inputs. All nodes wake up simultaneously and computation proceeds in synchronous rounds. In each round, every node can send one arbitrarily long message to each of its neighbors. Since we consider point-to-point networks, a node may send different messages to different neighbors in the same round. Additionally, every node is allowed to perform local computations based on information obtained in messages of previous rounds. Communication is reliable, i.e., every message that is sent during a communication round is correctly received by the end of the ro...
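The stated equivalence between k synchronous rounds and knowledge of the k-neighborhood can be checked on a small graph. The sketch below simulates flooding in the synchronous message passing model and compares what each node has learned with the set of nodes within distance k, found by breadth-first search. It is a simplification: nodes exchange only identifiers, not the full topology of their neighborhood, and the function names `flood` and `ball` are hypothetical.

```python
from collections import deque


def flood(adj, k):
    """Simulate k synchronous rounds in which each node sends all it knows."""
    known = {v: {v} for v in adj}
    for _ in range(k):
        # Messages are composed from the state at the start of the round,
        # so information travels exactly one hop per round.
        outgoing = {v: set(known[v]) for v in adj}
        for v in adj:
            for u in adj[v]:
                known[u] |= outgoing[v]
    return known


def ball(adj, v, k):
    """Identifiers of all nodes within distance k of v (breadth-first search)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        if dist[x] == k:
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return set(dist)


# Path graph 0 - 1 - 2 - 3 - 4 as an undirected adjacency list.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
after_two_rounds = flood(path, 2)
```

After two rounds, node 0 knows nodes 0, 1, and 2, and the middle node knows the whole path. In general, what each node has learned after k rounds matches its distance-k ball, which is why a lower bound on the required neighborhood radius is also a lower bound on the number of communication rounds.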


A case for adapting channel width in wireless networks

Chandra

Mahajan

Moscibroda

et al. 2008

SIGCOMM Comput. Commun. Rev.

168

197


We study a fundamental yet under-explored facet in wireless communication: the width of the spectrum over which transmitters spread their signals, or the channel width. Through detailed measurements in controlled and live environments, and using only commodity 802.11 hardware, we first quantify the impact of channel width on throughput, range, and power consumption. Taken together, our findings make a strong case for wireless systems that adapt channel width. Such adaptation brings unique benefits. For instance, when the throughput required is low, moving to a narrower channel increases range and reduces power consumption; in fixed-width systems, these two quantities are always in conflict. We then present SampleWidth, a channel width adaptation algorithm for the base case of two communicating nodes. This algorithm is based on a simple search process that builds on top of existing techniques for adapting modulation. Per specified policy, it can maximize throughput or minimize power consumption. Evaluation using a prototype implementation shows that SampleWidth correctly identifies the optimal width under a range of scenarios. In our experiments with mobility, it increases throughput by more than 60% compared to the best fixed-width configuration.
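The abstract describes SampleWidth only as "a simple search process", so the following Python sketch is not the published algorithm. It is a generic greedy search over an ordered list of channel widths that probes adjacent widths and moves whenever a policy metric improves. The width values, the `measure` callback, and the function name `adapt_width` are all hypothetical.

```python
def adapt_width(widths, current, measure, rounds=10):
    """Greedy neighbor search over channel widths.

    widths:  candidate widths in increasing order (values are illustrative).
    current: the starting width, which must be in widths.
    measure: hypothetical callback returning a score to maximize for a width,
             for example measured throughput or negated power draw.
    """
    i = widths.index(current)
    best = measure(widths[i])
    for _ in range(rounds):
        moved = False
        for j in (i - 1, i + 1):  # probe the narrower, then the wider neighbor
            if 0 <= j < len(widths):
                score = measure(widths[j])
                if score > best:
                    i, best, moved = j, score, True
                    break
        if not moved:
            break  # neither neighbor is better: stay at this width
    return widths[i]


# Toy metric that peaks at 20 MHz; a real system would probe the link instead.
widths_mhz = [5, 10, 20, 40]
toy_metric = lambda w: -(w - 20) ** 2
chosen_width = adapt_width(widths_mhz, current=5, measure=toy_metric)
```

Swapping the `measure` callback is how a policy choice such as maximizing throughput or minimizing power consumption would be expressed in this sketch. A greedy search like this can stop at a local optimum if the metric is not unimodal in width.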


What cannot be computed locally!

2004


We give time lower bounds for the distributed approximation of minimum vertex cover (MVC) and related problems such as minimum dominating set (MDS). In k communication rounds, MVC and MDS can only be approximated by factors Ω(n^{c/k^2}/k) and Ω(Δ^{1/k}/k) for some constant c, where n and Δ denote the number of nodes and the largest degree in the graph. The number of rounds required in order to achieve a constant or even only a polylogarithmic approximation ratio is at least Ω(log n / log log n) and Ω(log Δ / log log Δ). By a simple reduction, the latter lower bounds also hold for the construction of maximal matchings and maximal independent sets.


Allocating dynamic time-spectrum blocks in cognitive radio networks

et al. 2007

