Abstract-Our work is motivated by the desire to design packet switches with large aggregate capacity and fast line rates. In this paper, we consider building a packet switch from multiple lower speed packet switches operating independently and in parallel. In particular, we consider a (perhaps obvious) parallel packet switch (PPS) architecture in which arriving traffic is demultiplexed over k identical lower speed packet switches, switched to the correct output port, and then recombined (multiplexed) before departing from the system. In essence, the packet switch performs packet-by-packet load balancing, or inverse multiplexing, over multiple independent packet switches. Each lower speed packet switch operates at a fraction of the line rate, R; for example, each packet switch can operate at rate R/k, where k is the number of lower speed packet switches. It is a goal of our work that all memory buffers in the PPS run slower than the line rate. Ideally, a PPS would share the benefits of an output-queued switch, i.e., the delay of individual packets could be precisely controlled, allowing the provision of guaranteed qualities of service.

In this paper, we ask the question: Is it possible for a PPS to precisely emulate the behavior of an output-queued packet switch with the same capacity and the same number of ports? We show that it is theoretically possible for a PPS to emulate a first-come first-served (FCFS) output-queued (OQ) packet switch if each lower speed packet switch operates at a rate of approximately 2R/k (i.e., with a speedup of approximately two). We further show that it is theoretically possible for a PPS to emulate a wide variety of quality-of-service queueing disciplines if each lower speed packet switch operates at a rate of approximately 3R/k. These results turn out to be impractical because of their high communication complexity, but a practical high-performance PPS can be designed if we slightly relax our original goal and allow a small fixed-size coordination buffer running at the line rate in both the demultiplexer and the multiplexer.
We determine the size of this buffer and show that it can eliminate the need for a centralized scheduling algorithm, allowing a fully distributed implementation with low computational and communication complexity. Furthermore, we show that if each lower speed packet switch operates at a rate of R/k (i.e., without speedup), the resulting PPS can emulate an FCFS-OQ switch within a delay bound.

Index Terms-Clos network, inverse multiplexing, load balancing, output queueing, packet switch.
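The load-balancing idea described above can be illustrated with a toy model. The sketch below is illustrative only and is not the paper's emulation algorithm: it assumes a single input, a round-robin demultiplexer per output, and a multiplexer that reads the layers back in the same round-robin order, which preserves FCFS order per output in this simplified setting. All class and method names are hypothetical.

```python
from collections import deque

class ParallelPacketSwitch:
    """Toy PPS model: k slower switch layers, each modeled as one FIFO
    per output port. Illustrative only; not the paper's algorithm."""

    def __init__(self, k, num_ports):
        self.k = k
        # layer_queues[layer][port]: FIFO inside each lower speed switch.
        self.layer_queues = [[deque() for _ in range(num_ports)]
                             for _ in range(k)]
        self.next_write = [0] * num_ports  # demultiplexer pointer per output
        self.next_read = [0] * num_ports   # multiplexer pointer per output

    def enqueue(self, packet, port):
        # Demultiplexer: spray packets destined to `port` over the k
        # layers round-robin, so each layer sees roughly 1/k of the load.
        layer = self.next_write[port]
        self.layer_queues[layer][port].append(packet)
        self.next_write[port] = (layer + 1) % self.k

    def dequeue(self, port):
        # Multiplexer: read layers in the same round-robin order used by
        # the demultiplexer, so packets leave in arrival (FCFS) order.
        layer = self.next_read[port]
        q = self.layer_queues[layer][port]
        if not q:
            return None
        self.next_read[port] = (layer + 1) % self.k
        return q.popleft()
```

With multiple inputs, keeping departures in OQ order is exactly the hard part the paper addresses; this sketch only shows why each internal memory can run at a fraction of the line rate.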
All routers contain buffers to hold packets during times of congestion. When designing a high-capacity router (or linecard), it is challenging to design these buffers because of their speed and size, both of which grow linearly with the line rate, R. With today's DRAM technology, it is barely possible to design buffers for a 40 Gb/s linecard in which packets are written to (read from) memory at the rate at which they arrive (depart). Over time, the problem will get harder: link rates will increase, linecards will connect to more lines, and buffers will get larger. Ideally, we would like a memory with the density of DRAM and the speed of SRAM, and so some commercial routers use hybrid packet buffers built from a combination of small fast SRAM and large slow DRAM. The SRAM holds ("caches") the heads and tails of packet FIFOs, allowing arriving packets to be written quickly to the tail and departing packets to be read quickly from the head. The large DRAMs are used for bulk storage, to hold the majority of packets in each FIFO, those that are neither at the head nor the tail. Because of the relatively long time needed to write to (or read from) the DRAMs, data is transferred between SRAM and DRAM in large fixed-size blocks of b bytes, consisting of perhaps many packets at a time. A memory manager shuttles packets between the SRAM cache and the DRAM with two goals: (1) arriving packets are written to DRAM before the SRAM overflows, and (2) departing packets are guaranteed to be in the SRAM when it is their turn to leave. In this paper, we find optimal memory managers that achieve both goals while minimizing the size of the SRAM cache. When the delay through the buffer is minimized, the size of the SRAM cache is proportional to Qb ln Q, where Q is the number of FIFOs that the buffer maintains. There is a tradeoff between the size of the SRAM and the minimum pipeline delay through the packet buffer.
When a pipeline delay can be tolerated, we find memory managers that reduce the required SRAM cache size so that it is proportional to Qb.
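The head/tail caching scheme described above can be sketched as follows. This is a minimal single-FIFO model of the caching idea only, not the paper's optimal memory-management algorithm: it assumes packets are unit-sized, that a DRAM block holds b packets, and that a block transfer happens whenever a full block accumulates in the tail cache. All names are hypothetical.

```python
from collections import deque

class HybridPacketBuffer:
    """Toy model of one FIFO in a hybrid SRAM/DRAM packet buffer:
    fast SRAM caches the head and tail; slow DRAM holds the middle
    in fixed-size blocks of b packets."""

    def __init__(self, b):
        self.b = b
        self.tail_sram = deque()  # most recently arrived packets
        self.head_sram = deque()  # packets about to depart
        self.dram = deque()       # bulk storage: b-packet blocks

    def write(self, packet):
        # Arrivals always land in the fast tail cache.
        self.tail_sram.append(packet)
        # Once a full block accumulates, move it to DRAM in one
        # wide, slow transfer (amortizing DRAM access time).
        if len(self.tail_sram) >= self.b:
            block = [self.tail_sram.popleft() for _ in range(self.b)]
            self.dram.append(block)

    def read(self):
        # Departures are always served from the fast head cache;
        # refill it from DRAM a whole block at a time.
        if not self.head_sram:
            if self.dram:
                self.head_sram.extend(self.dram.popleft())
            elif self.tail_sram:
                # FIFO nearly empty: bypass DRAM entirely.
                self.head_sram.append(self.tail_sram.popleft())
        return self.head_sram.popleft() if self.head_sram else None
```

The memory-management problem the paper solves is deciding, across Q such FIFOs sharing one DRAM, which FIFO's block to transfer next so that no head cache underflows and no tail cache overflows; this sketch shows only the per-FIFO data movement.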