Produção do espaço urbano e processos espaciais em Natal: o conjunto Ponta Negra em foco

To exploit the power of modern heterogeneous multiprocessor embedded platforms on partitioned applications, the designer usually needs to efficiently map and schedule all the tasks and the communications of the application, respecting the constraints imposed by the target architecture. Since the problem is heavily constrained, common methods used to explore such design space usually fail, obtaining low-quality solutions. In this paper, we propose an ant colony optimization (ACO) heuristic that, given a model of the target architecture and the application, efficiently executes both scheduling and mapping to optimize the application performance. We compare our approach with several other heuristics, including simulated annealing, tabu search, and genetic algorithms, on the performance to reach the optimum value and on the potential to explore the design space. We show that our approach obtains better results than other heuristics by at least 16% on average, despite an overhead in execution time. Finally, we validate the approach by scheduling and mapping a JPEG encoder on a realistic target architecture

show abstract

Distributed Louvain Algorithm for Graph Community Detection

Ghosh

Halappanavar

Tumeo

et al. 2018

View full text Add to dashboard Cite

Aho-Corasick String Matching on Shared and Distributed-Memory Parallel Architectures

Tumeo

Villa

Chavarŕıa-Miranda

2012

IEEE Trans. Parallel Distrib. Syst.

View full text Add to dashboard Cite

Accelerating DNA analysis applications on GPU clusters

Tumeo

Villa

2010

View full text Add to dashboard Cite

Approximate weighted matching on emerging manycore and multithreaded architectures

Halappanavar

Feo

Villa

et al. 2012

The International Journal of High Performance Computing Applica

View full text Add to dashboard Cite

Graph matching is a prototypical combinatorial problem with many applications in high-performance scientific computing. Optimal algorithms for computing matchings are challenging to parallelize. Approximation algorithms are amenable to parallelization and are therefore important to compute matchings for large-scale problems. Approximation algorithms also generate nearly optimal solutions that are sufficient for many applications. In this paper we present multithreaded algorithms for computing half-approximate weighted matching on state-of-the-art multicore (Intel Nehalem and AMD Magny-Cours), manycore (Nvidia Tesla and Nvidia Fermi), and massively multithreaded (Cray XMT) platforms. We provide two implementations: the first uses shared work queues and is suited for all platforms; and the second implementation, based on dataflow principles, exploits special features available on the Cray XMT. Using a carefully chosen dataset that exhibits characteristics from a wide range of applications, we show scalable performance across different platforms. In particular, for one instance of the input, an R-MAT graph (RMAT-G), we show speedups of about [Formula: see text] on [Formula: see text] cores of an AMD Magny-Cours, [Formula: see text] on [Formula: see text] cores of Intel Nehalem, [Formula: see text] on Nvidia Tesla and [Formula: see text] on Nvidia Fermi relative to one core of Intel Nehalem, and [Formula: see text] on [Formula: see text] processors of Cray XMT. We demonstrate strong as well as weak scaling for graphs with up to a billion edges using up to 12,800 threads. We avoid excessive fine-tuning for each platform and retain the basic structure of the algorithm uniformly across platforms. An exception is the dataflow algorithm designed specifically for the Cray XMT. To the best of the authors' knowledge, this is the first such large-scale study of the half-approximate weighted matching problem on multithreaded platforms. Driven by the critical enabling role of combinatorial algorithms such as matching in scientific computing and the emergence of informatics applications, there is a growing demand to support irregular computations on current and future computing platforms. In this context, we evaluate the capability of emerging multithreaded platforms to tolerate latency induced by irregular memory access patterns, and to support fine-grained parallelism via light-weight synchronization mechanisms. By contrasting the architectural features of these platforms against the Cray XMT, which is specifically designed to support irregular memory-intensive applications, we delineate the impact of these choices on performance.

show abstract

High level synthesis of RDF queries for graph analytics

Castellana

Minutoli

Morari

et al. 2015

View full text Add to dashboard Cite

Abstract-In this paper we present a set of techniques that enable the synthesis of efficient custom accelerators for memory intensive, irregular applications. To address the challenges of irregular applications (large memory footprint, unpredictable finegrained data accesses, and high synchronization intensity), and exploit their opportunities (thread level parallelism, memory level parallelism), we propose a novel accelerator design that employs an adaptive and Distributed Controller (DC) architecture, and a Memory Interface Controller (MIC) that supports concurrent and atomic memory operations on a multi-ported/multi-banked shared memory. Among the multitude of algorithms that may benefit from our solution, we focus on the acceleration of graph analytics applications and, in particular, on the synthesis of SPARQL queries on Resource Description Framework (RDF) databases. We achieve this objective by incorporating the synthesis techniques into Bambu, an Open Source high-level synthesis tools, and interfacing it with GEMS, the Graph database Engine for Multithreaded Systems. The GEMS' front-end generates optimized C implementations of the input queries, modeled as graph pattern matching algorithms, which are then automatically synthesized by Bambu. We validate our approach by synthesizing several SPARQL queries from the Lehigh University Benchmark (LUBM).

show abstract

A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs

Tumeo

Monchiero

Palermo

et al. 2007

View full text Add to dashboard Cite

Multimedia applications, and in particular the encoding and decoding of standard image and video formats, are usually a typical target for Systemson-Chip (SoC). The bi-dimensional Discrete Cosine Transformation (2D-DCT) is a commonly used frequency transformation in graphic compression algorithms. Many hardware implementations, adopting disparate algorithms, have been proposed for Field Programmable Gate Arrays (FPGA). These designs focus either on performance or area, and often do not succeed in balancing the two aspects.In this paper, we present a design of a fast 2D-DCT hardware accelerator for a FPGA-based SoC. This accelerator makes use of a single seven stages 1D-DCT pipeline able to alternate computation for the even and odd coefficients in every cycle. In addition, it uses special memories to perform the transpose operations. Our hardware takes 80 clock cycles at 107MHz to generate a complete 8x8 2D DCT, from the writing of the first input sample to the reading of the last result (including the overhead of the interface logic). We show that this architecture provides optimal performance/area ratio with respect to several alternative designs.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Antonino Tumeo

AWB-GCN: A Graph Convolutional Network Accelerator with Runtime Workload Rebalancing

Ant Colony Heuristic for Mapping and Scheduling Tasks and Communications on Heterogeneous Embedded Systems

Distributed Louvain Algorithm for Graph Community Detection

Aho-Corasick String Matching on Shared and Distributed-Memory Parallel Architectures

Accelerating DNA analysis applications on GPU clusters

Approximate weighted matching on emerging manycore and multithreaded architectures

High level synthesis of RDF queries for graph analytics

A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs

Contact Info

Product

Resources

About