ALMARVI is a collaborative European research project funded by Artemis, involving 16 industrial and academic partners across 4 countries who work together to address computational challenges in image and video processing in 3 application domains: healthcare, surveillance, and mobile. This paper is an editorial for a special issue; it describes the integrated system the partners created as a cross-domain solution for the project, and introduces the partner articles published in this issue, which cover the technological developments achieved within ALMARVI across all system layers, from hardware to applications. We illustrate the challenges faced within the project with use cases from the three targeted application domains, and show how these relate to the project's 4 main objectives, one for each principal challenge faced by high-performance image and video processing systems: massive data rates, low power consumption, composability, and robustness. We present a system stack composed of algorithms, design frameworks, and platforms as a solution to these challenges. Finally, the use cases from the three application domains are mapped onto the system stack and evaluated against each of the 4 ALMARVI objectives.
Many high data-rate video-processing applications are subject to a trade-off between throughput and the sizes of the buffers in the system (the storage distribution). These applications have strict throughput requirements, as throughput directly relates to functional correctness. Furthermore, the size of the storage distribution relates to resource usage, which should be minimized in many practical cases. The computation kernels of high data-rate video-processing applications can often be specified as cyclo-static dataflow graphs. We therefore study the problem of minimizing the total (weighted) size of the storage distribution under a throughput constraint for cyclo-static dataflow graphs. By combining ideas from the area of monotonic optimization with the causal dependency analysis of a state-of-the-art storage optimization approach, we create an algorithm that scales better than the state-of-the-art approach. Our algorithm can provide, at any time, a solution together with a bound on its suboptimality, and it iteratively improves both until the optimal solution is found. We evaluate our algorithm on several models from the literature and on models of a high data-rate video-processing application from the healthcare domain. Our experiments show performance increases of up to several orders of magnitude.
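To make the throughput/storage trade-off concrete, the following is a minimal sketch, not the paper's monotonic-optimization algorithm: for a hypothetical two-actor dataflow edge with made-up rates, execution times, and throughput target, a self-timed simulation measures the throughput achievable with a given buffer capacity, and, because throughput is monotone in the buffer capacity, a binary search finds the smallest buffer that still meets the target.

```cpp
#include <cstdio>

// Self-timed simulation of a single producer->consumer channel with
// capacity 'buf' tokens. The producer runs for 'tp' time units per firing
// and adds 'p' tokens when it finishes; the consumer removes 'c' tokens
// when it starts and runs for 'tc' time units. Returns consumer firings
// per time unit over a finite horizon, or 0.0 on deadlock.
double throughput(int buf, int p, int c, double tp, double tc, int firings) {
    double now = 0.0;
    double prodEnd = -1.0, consEnd = -1.0;  // completion times; < 0 means idle
    int tokens = 0;                         // tokens currently in the channel
    int done = 0;
    while (done < firings) {
        // start the producer if it is idle and its output fits in the buffer
        if (prodEnd < 0.0 && tokens + p <= buf) prodEnd = now + tp;
        // start the consumer if it is idle and enough tokens are available
        if (consEnd < 0.0 && tokens >= c) { tokens -= c; consEnd = now + tc; }
        if (prodEnd < 0.0 && consEnd < 0.0) return 0.0;  // deadlock: buf too small
        // advance time to the earliest completion event
        if (consEnd < 0.0 || (prodEnd >= 0.0 && prodEnd <= consEnd)) {
            now = prodEnd; tokens += p; prodEnd = -1.0;   // producer finishes
        } else {
            now = consEnd; consEnd = -1.0; ++done;        // consumer finishes
        }
    }
    return done / now;
}

int main() {
    const int p = 3, c = 2;           // production/consumption rates (made up)
    const double tp = 2.0, tc = 1.0;  // actor execution times (made up)
    const double target = 0.45;       // required consumer firings per time unit
    int lo = 1, hi = 64;              // search range for the buffer capacity
    while (lo < hi) {                 // smallest buf with throughput >= target
        int mid = (lo + hi) / 2;
        if (throughput(mid, p, c, tp, tc, 10000) >= target) hi = mid;
        else lo = mid + 1;
    }
    std::printf("minimal buffer capacity: %d tokens\n", lo);
}
```

In the paper's setting the search space is a multi-dimensional storage distribution over all channels of a cyclo-static graph rather than a single buffer, which is where the monotonic-optimization machinery and the causal dependency analysis come in; the sketch only illustrates the underlying trade-off being optimized.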
Data movement has long been identified as the biggest challenge facing the designers of modern computer systems. To tackle this challenge, many novel data compression algorithms have been developed. Variable-rate compression algorithms are often favored over fixed-rate ones; however, variable-rate decompression is difficult to parallelize. Most existing algorithms adopt a single parallelization strategy suited to a particular hardware platform, an approach that fails to harness the parallelism found in today's diverse hardware architectures. We propose a parallelization method for tiled variable-rate compression algorithms that consists of multiple strategies which can be applied interchangeably, allowing an algorithm to apply the strategy most suitable for a specific hardware platform. Our strategies are based on generating metadata during encoding, which is then used to parallelize the decoding process. To demonstrate their effectiveness, we implement our strategies in ZFP, a state-of-the-art compression algorithm. We show that the strategies suited to multicore CPUs differ from those suited to GPUs. On a CPU, we achieve a near-optimal decoding speedup with a metadata overhead that is consistently less than 0.04% of the compressed data size. On a GPU, we achieve average decoding rates of up to 100 GiB/s. Our strategies allow the user to trade off decoding throughput against metadata size overhead.
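The following is a minimal sketch of one such metadata strategy (hypothetical; it does not reflect ZFP's actual bit-stream format or API): the encoder records the byte offset of every tile in the compressed stream, and this offset table is the metadata that makes tiles independently, and therefore concurrently, decodable. A toy run-length coder stands in for the real variable-rate compressor.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <thread>

// Toy variable-rate "compressor": run-length encode one tile into
// (run, value) byte pairs. Stands in for a real codec such as ZFP.
static std::vector<uint8_t> encodeTile(const std::vector<uint8_t>& tile) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < tile.size();) {
        uint8_t v = tile[i], run = 0;
        while (i < tile.size() && tile[i] == v && run < 255) { ++i; ++run; }
        out.push_back(run); out.push_back(v);
    }
    return out;
}

static std::vector<uint8_t> decodeTile(const uint8_t* in, size_t len) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i + 1 < len; i += 2)
        out.insert(out.end(), in[i], in[i + 1]);  // expand (run, value) pair
    return out;
}

int main() {
    // Four tiles with different compressibility -> variable-rate output.
    std::vector<std::vector<uint8_t>> tiles = {
        std::vector<uint8_t>(64, 7),        // highly compressible
        {1, 2, 3, 4, 5, 6, 7, 8},           // incompressible
        std::vector<uint8_t>(64, 0),
        {9, 9, 9, 1, 1, 2, 3, 3},
    };

    // Encode: concatenate tile streams, recording offsets as metadata.
    std::vector<uint8_t> stream;
    std::vector<size_t> offsets;            // the metadata
    for (const auto& t : tiles) {
        offsets.push_back(stream.size());
        auto enc = encodeTile(t);
        stream.insert(stream.end(), enc.begin(), enc.end());
    }
    offsets.push_back(stream.size());       // sentinel: end of stream

    // Decode: offsets make tiles independent, so decode them in parallel.
    std::vector<std::vector<uint8_t>> decoded(tiles.size());
    std::vector<std::thread> workers;
    for (size_t t = 0; t < tiles.size(); ++t)
        workers.emplace_back([&, t] {
            decoded[t] = decodeTile(stream.data() + offsets[t],
                                    offsets[t + 1] - offsets[t]);
        });
    for (auto& w : workers) w.join();

    for (size_t t = 0; t < tiles.size(); ++t)
        std::printf("tile %zu: %s\n", t,
                    decoded[t] == tiles[t] ? "ok" : "MISMATCH");
}
```

Note that the offset table grows with the number of tiles rather than with the data volume, which is why such metadata can stay small relative to the compressed stream; other strategies in this design space trade a larger table for finer-grained parallelism.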
This paper presents and evaluates an approach to deploying image and video processing pipelines that are developed in a frame-oriented fashion on a stream-oriented hardware platform such as an FPGA. First, this calls for a specialized streaming memory hierarchy and an accompanying software framework that transparently moves image segments between the stages of the image processing pipeline. Second, we use softcore VLIW processors, which are targetable by a C compiler and have hardware debugging capabilities, to evaluate and debug the software before moving to a High-Level Synthesis flow. The algorithm development phase, including debugging and optimizing on the target platform, is often a very time-consuming step in the development of a new product. Our proposed platform allows both software developers and hardware designers to test iterations in a matter of seconds (compilation time) instead of hours (synthesis or circuit-simulation time).
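As an illustration, here is a minimal software sketch of the streaming pattern such a memory hierarchy supports (assumed for exposition; it is not the ALMARVI framework or its API): a 3x3 filter, which a frame-oriented developer would write over a whole image, is expressed as a stage that consumes one row per call and keeps only a three-line circular buffer, i.e., a small image segment rather than the full frame.

```cpp
#include <cstdint>
#include <cstdio>
#include <array>

constexpr int W = 8;                     // image width (toy value)
using Row = std::array<uint8_t, W>;

// Stream-oriented stage: consumes one input row per call; once three rows
// are buffered it emits one output row of a 3x3 box filter. Only three
// lines of the image are ever resident in the stage at a time.
class Blur3x3 {
    Row lines[3];                        // circular line buffer (3 rows)
    int filled = 0;
public:
    bool push(const Row& in, Row& out) {
        lines[filled % 3] = in;
        ++filled;
        if (filled < 3) return false;    // not enough vertical context yet
        const Row& a = lines[(filled - 3) % 3];  // oldest row
        const Row& b = lines[(filled - 2) % 3];
        const Row& c = lines[(filled - 1) % 3];  // newest row
        for (int x = 0; x < W; ++x) {
            int l = x > 0 ? x - 1 : x;           // clamp at image borders
            int r = x < W - 1 ? x + 1 : x;
            out[x] = static_cast<uint8_t>(
                (a[l] + a[x] + a[r] +
                 b[l] + b[x] + b[r] +
                 c[l] + c[x] + c[r]) / 9);
        }
        return true;
    }
};

int main() {
    Blur3x3 stage;
    Row out;
    for (int y = 0; y < 8; ++y) {        // producer: stream rows in
        Row in;
        for (int x = 0; x < W; ++x) in[x] = static_cast<uint8_t>(16 * y + x);
        if (stage.push(in, out))         // consumer: filtered rows stream out
            std::printf("row %d -> first pixel %u\n", y, out[0]);
    }
}
```

On an FPGA, a High-Level Synthesis tool can map such a line buffer onto on-chip block RAM, so the frame-oriented formulation becomes stream-friendly without the developer managing off-chip frame traffic by hand.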