Abstract-One of the most widely used architectures for packet switches is the crossbar. A special version of it is the buffered crossbar, where small buffers are associated with the crosspoints. The advantages of this organization over the unbuffered architecture are that it needs much simpler and slower scheduling circuits, while it can shape the switched traffic according to a given set of Quality of Service (QoS) criteria more efficiently. Furthermore, by supporting variable-length packets throughout a buffered crossbar: a) there is no need for segmentation and reassembly circuits, b) no internal speedup is necessary, and c) synchronization between the input and output clock domains is simplified. In this paper we present an architecture, a hardware implementation analysis, and a performance evaluation of such a buffered crossbar. The proposed organization is simple yet powerful, and can be easily implemented using today's technologies. Our evaluation shows that it outperforms most existing packet switch architectures, while its hardware cost is kept to a minimum.
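The crosspoint-buffer organization described above can be sketched in a few lines: each crosspoint holds a small FIFO, inputs are only backpressured by a full crosspoint buffer, and each output drains its column independently. This is a toy illustration under our own assumptions (per-slot round-robin output service, a uniform buffer capacity), not the paper's exact design:

```python
from collections import deque

class BufferedCrossbar:
    """Toy N x N buffered crossbar: a small FIFO per crosspoint,
    with inputs and outputs scheduled independently."""

    def __init__(self, n, xp_capacity=2):
        self.n = n
        self.cap = xp_capacity
        # xp[i][j] buffers cells from input i destined to output j
        self.xp = [[deque() for _ in range(n)] for _ in range(n)]
        self.rr = [0] * n  # round-robin pointer per output

    def enqueue(self, i, j, cell):
        """Input i offers a cell to output j; False means the
        crosspoint buffer is full (backpressure to the input)."""
        if len(self.xp[i][j]) >= self.cap:
            return False
        self.xp[i][j].append(cell)
        return True

    def output_step(self, j):
        """Output j serves one cell per slot, round-robin over inputs."""
        for k in range(self.n):
            i = (self.rr[j] + k) % self.n
            if self.xp[i][j]:
                self.rr[j] = (i + 1) % self.n
                return self.xp[i][j].popleft()
        return None
```

Note that the input and output schedulers never coordinate with each other; the crosspoint buffers decouple them, which is the source of the simpler scheduling circuits claimed above.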
Abstract-Three-stage non-blocking switching fabrics are the next step in scaling current crossbar switches to many hundreds or a few thousand ports. Congestion management, however, is the central open problem; without it, performance suffers heavily under real-world traffic patterns. Schedulers for bufferless crossbars perform congestion management but are not scalable to high valencies and to multi-stage fabrics. Distributed scheduling, as used in buffered crossbars, is scalable but has never been scaled beyond crossbar valencies. We combine ideas from central and distributed schedulers, from request-grant protocols and from credit-based flow control, to propose a novel, practical architecture for scheduling in non-blocking buffered switching fabrics. The new architecture relies on multiple, independent, single-resource schedulers, operating in a pipeline. It: (i) isolates well-behaved from congested flows; (ii) provides throughput in excess of 95% under unbalanced traffic, and delays that successfully compete against output queueing; (iii) provides weighted max-min fairness; (iv) directly operates on variable-size packets or multi-packet segments; (v) resequences cells or segments using very small buffers; and (vi) can be realistically implemented for a 1024×1024 reference fabric made out of 32×32 buffered crossbar switch elements. This paper carefully studies the many intricacies of the problem and the solution, discusses implementation, and provides performance simulation results.
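The credit-based flow control that the abstract builds on can be illustrated minimally: the upstream node keeps a credit counter equal to its view of free downstream buffer space, sends only while credits remain, and regains a credit whenever the downstream buffer frees a slot. This is a generic sketch of the mechanism, not the paper's exact protocol:

```python
class CreditFlowControl:
    """Minimal credit-based flow control between one upstream sender
    and one fixed-size downstream buffer (illustrative sketch)."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # upstream's count of free slots
        self.buffer = []              # downstream buffer contents

    def send(self, pkt):
        """Upstream may transmit only while it holds credits,
        so the downstream buffer can never overflow."""
        if self.credits == 0:
            return False              # must wait for a returned credit
        self.credits -= 1
        self.buffer.append(pkt)
        return True

    def drain(self):
        """Downstream consumes one packet and returns a credit upstream."""
        if self.buffer:
            pkt = self.buffer.pop(0)
            self.credits += 1
            return pkt
        return None
```

Because transmission is gated on credits rather than on observed loss, the scheme is lossless by construction, which is what makes small crosspoint and switch-element buffers viable in the fabric described above.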
ExaNeSt is one of three European projects that support a ground-breaking computing architecture for exascale-class systems built upon power-efficient 64-bit ARM processors. This group of projects shares an 'everything-close' and 'share-anything' paradigm, which trims down the power consumption (by shortening the distance of signals for most data transfers) as well as the cost and footprint area of the installation (by reducing the number of devices needed to meet performance targets). In ExaNeSt, we will design and implement: (i) a physical rack prototype and its liquid-cooling subsystem providing ultra-dense compute packaging, (ii) a storage architecture with distributed (in-node) non-volatile memory (NVM) devices, (iii) a unified, low-latency interconnect, designed to efficiently uphold desired Quality-of-Service guarantees for a mix of storage and inter-processor flows, and (iv) efficient rack-level memory sharing, where each page is cacheable at only a single node. Our target is to test alternative storage and interconnect options on actual hardware, using real-world HPC applications. The ExaNeSt consortium brings together technology, skills, and knowledge across the entire value chain, from computing IP, packaging, and system deployment, all the way up to operating systems, storage, HPC, big data frameworks, and cutting-edge applications.
The crossbar is the most popular packet switch architecture. By adding small buffers at the crosspoints, important advantages can be obtained: (1) Crossbar scheduling is simplified. (2) High throughput is achievable. (3) Weighted scheduling becomes feasible. In this paper we study the fairness properties of a buffered crossbar with weighted fair schedulers. We show by means of simulation that, under heavy demand, the system will allocate throughput in a weighted max-min fair manner. We study the impact of the size of the crosspoint buffers in approximating the weighted max-min fair rates, and we find that a small amount of buffering per crosspoint (3-8 cells) suffices for the maximum percentage discrepancy to fall below 5% for N×N switches.

1. INTRODUCTION

Switches, and the routers that use them, are the basic building blocks for constructing high-speed networks that employ point-to-point links. As the demand for network throughput keeps climbing, switches with an increasing number of faster ports are needed. At the same time, mechanisms are sought for higher sophistication in quality of service (QoS) guarantees. The crossbar is the simplest fabric for high-speed switches. It is the architecture of choice for up to several tens of ports, although for higher port counts N, the O(N²) order of the crossbar cost makes other alternatives more attractive. The hardest part of a high-speed crossbar is the scheduler needed to keep it busy. With virtual-output queues (VOQ) at the input ports, the crossbar scheduler has to coordinate the use of 2N interdependent resources. Each input has to choose among N candidate VOQs, thus potentially affecting all N outputs; at the same time, each output has to choose among potentially all N inputs, thus making all 2N port schedulers depend on each other.
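The weighted max-min fair allocation that the simulations above approximate can be computed exactly by progressive filling: repeatedly give each still-unsatisfied flow capacity in proportion to its weight, and freeze any flow whose demand is met. The sketch below is our illustration for a single shared link (the paper's setting is a full crossbar, where the same principle applies per contended port):

```python
def weighted_max_min(demands, weights, capacity):
    """Weighted max-min fair rates on one shared link via
    progressive (water-filling) allocation. Flow i receives at
    most demands[i]; leftover capacity is reshared among the
    remaining flows in proportion to weights[i]."""
    rates = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active and remaining > 1e-12:
        total_w = sum(weights[i] for i in active)
        fair = remaining / total_w        # rate per unit of weight
        # flows whose residual demand fits under the fair share
        done = [i for i in active
                if demands[i] - rates[i] <= fair * weights[i]]
        if not done:
            # no flow is demand-limited: split all remaining capacity
            for i in active:
                rates[i] += fair * weights[i]
            remaining = 0
        else:
            # satisfy the demand-limited flows and reshare the rest
            for i in done:
                remaining -= demands[i] - rates[i]
                rates[i] = demands[i]
                active.discard(i)
    return rates
```

For example, three persistently backlogged flows with weights 1, 2, 1 on a link of capacity 8 receive rates 2, 4, 2, which is the weighted max-min outcome the crosspoint buffers are shown to approximate.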
Known architectures for high-speed crossbar scheduling include [1] [2] [3]; their complexity and cost increase significantly when the number of ports rises, thus negatively affecting the achievable port speed. An advanced form of quality of service (QoS) architecture uses weighted round-robin (WRR) scheduling, often in the form of weighted fair queueing (WFQ) [4], which takes weight factors into consideration when determining "equality". This type of scheduling is needed when some customers pay more than others, or when each flow is an aggregate of a different number of sub-flows and we wish to treat sub-flows equally. The weight factors may be static (during the lifetime
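The WRR discipline mentioned above is easy to sketch: in each round, every backlogged flow may send a number of packets proportional to its weight, so long-run throughput follows the weight ratios. This is a deliberately simplified packet-count version (practical WFQ variants account for packet lengths, e.g. via deficit counters):

```python
def weighted_round_robin(queues, weights, rounds):
    """Simplified WRR: in each round, queue i may transmit up to
    weights[i] packets. With all queues backlogged, flow i's share
    of the output is weights[i] / sum(weights)."""
    out = []
    for _ in range(rounds):
        for q, w in zip(queues, weights):
            for _ in range(w):
                if q:
                    out.append(q.pop(0))
    return out
```

With weights 2 and 1, a backlogged flow A gets two transmission opportunities for every one of flow B, which is exactly the proportional "equality" the weight factors encode.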
ABSTRACT: Per-flow queueing with sophisticated scheduling is one of the methods for providing advanced Quality-of-Service (QoS) guarantees. The hardest and most interesting scheduling algorithms rely on a common computational primitive, implemented via priority queues. To support such scheduling for a large number of flows at OC-192 (10 Gbps) rates and beyond, pipelined management of the priority queue is needed. Large priority queues can be built using either calendar queues or heap data structures; heaps feature smaller silicon area than calendar queues. We present heap management algorithms that can be gracefully pipelined; they constitute modifications of the traditional ones. We discuss how to use pipelined heap managers in switches and routers and their cost-performance tradeoffs. The design can be configured to any heap size, and, using 2-port 4-wide SRAMs, it can support initiating a new operation on every clock cycle, except that an insert operation or one idle (bubble) cycle is needed between two successive delete operations. We present a pipelined heap manager implemented in synthesizable Verilog form, as a core integratable into ASICs, along with cost and performance analysis information. For a 16K-entry example in 0.13-micron CMOS technology, silicon area is below 10 mm² (less than 8% of a typical ASIC chip) and performance is a few hundred million operations per second. We have verified our design by simulating it against three heap models of varying abstraction. KEYWORDS: high speed network scheduling, weighted round-robin, weighted fair queueing, priority queue, pipelined hardware heap, synthesizable core.
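The computational primitive referred to above, the one the pipelined hardware heap accelerates, is just a priority queue keyed by each flow's next service time: insert a flow with its timestamp, then repeatedly delete the minimum to find the next flow to serve. A software sketch using a standard binary heap (here Python's heapq, purely to show the insert/delete-min operations the hardware pipelines):

```python
import heapq

def schedule(events):
    """Serve flows in timestamp order using a binary-heap priority
    queue: the insert and delete-min operations shown here are the
    per-packet primitive a WFQ-class scheduler performs."""
    pq = []
    order = []
    for ts, flow in events:
        heapq.heappush(pq, (ts, flow))   # insert: O(log n)
    while pq:
        ts, flow = heapq.heappop(pq)     # delete-min: earliest timestamp
        order.append(flow)
    return order
```

A sequential software heap like this performs one O(log n) operation at a time; the point of the paper's pipelined design is to overlap the levels of the heap so a new operation can start nearly every clock cycle.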