“…Niknam et al [23] is the first paper presenting the parallelization of Suzuki algorithm on a 16-cores AMD Opteron 885. The max speedup is ×2.5 on 4 threads for 256×256.…”
International audienceIn the last decade, many papers have been published to present sequential connected component labeling (CCL) algorithms. As modern processors are multi-core and tend to many cores, designing a CCL algorithm should address parallelism and multithreading. After a review of sequential CCL algorithms and a study of their variations, this paper presents the parallel version of the Light Speed Labeling for Connected Component Analysis (CCA) and compares it to our parallelized implementations of State-of-the-Art sequential algorithms. We provide some benchmarks that help to figure out the intrinsic differences between these parallel algorithms. We show that thanks to its run-based processing, the LSL is intrinsically more efficient and faster than all pixel-based algorithms. We show also, that all the pixel-based are memory-bound on multi-socket machines and so are inefficient and do not scale, whereas LSL, thanks to its RLE compression can scale on such high-end machines. On a 4×15-core machine, and for 8192×8192 images, LSL outperforms its best competitor by a factor ×10.8 and achieves a throughput of 42.4 gigapixel labeled per second
“…Niknam et al [23] is the first paper presenting the parallelization of Suzuki algorithm on a 16-cores AMD Opteron 885. The max speedup is ×2.5 on 4 threads for 256×256.…”
International audienceIn the last decade, many papers have been published to present sequential connected component labeling (CCL) algorithms. As modern processors are multi-core and tend to many cores, designing a CCL algorithm should address parallelism and multithreading. After a review of sequential CCL algorithms and a study of their variations, this paper presents the parallel version of the Light Speed Labeling for Connected Component Analysis (CCA) and compares it to our parallelized implementations of State-of-the-Art sequential algorithms. We provide some benchmarks that help to figure out the intrinsic differences between these parallel algorithms. We show that thanks to its run-based processing, the LSL is intrinsically more efficient and faster than all pixel-based algorithms. We show also, that all the pixel-based are memory-bound on multi-socket machines and so are inefficient and do not scale, whereas LSL, thanks to its RLE compression can scale on such high-end machines. On a 4×15-core machine, and for 8192×8192 images, LSL outperforms its best competitor by a factor ×10.8 and achieves a throughput of 42.4 gigapixel labeled per second
Section: A General Parameterizable Ordered Sparsecclmentioning
confidence: 99%
“…Most of CCL algorithms used to be sequential ones developed on single-core processors [7] [3]. Recently, new parallel algorithms were developed for multi-core processors [13] [5], SIMD processors [18] [11] [8] and GPUs [14] [9]. These algorithms are very efficient for natural images but not for very low density images (very few pixels set to one) like those generated in HEP experiment.…”
Connected components labeling and analysis for dense images have been extensively studied on a wide range of architectures. Some applications, like particles detectors in High Energy Physics, need to analyse many small and sparse images at high throughput. Because they process all pixels of the image, classic algorithms for dense images are inefficient on sparse data. We address this inefficiency by introducing a new algorithm specifically designed for sparse images. We show that we can further improve this sparse algorithm by specializing it for the data input format, avoiding a decoding step and processing multiple pixels at once. A benchmark on Intel and AMD CPUs shows that the algorithm is from ×1.6 to ×2.5 faster on sparse images.
“…CCL algorithms are often required to be optimized to run in real-time and have been ported on a wide set of parallel machines [1] [13]. After an era on single-core processors, where many sequential algorithms were developed [7] and codes were released [2], new parallel algorithms were developed on multicore processors [15] [6].…”
Connected Component Labeling (CCL) is a fundamental algorithm in computer vision, and is often required for real-time applications. It consists in assigning a unique number to each connected component of a binary image. In recent years, we have seen the emergence of direct parallel algorithms on multicore CPUs, GPUs and FPGAs whereas, there are only iterative algorithms for SIMD implementation. In this article, we introduce new direct SIMD algorithms for Connected Component Labeling. They are based on the new Scatter-Gather, Collision Detection (CD) and Vector Length (VL) instructions available in the recent Intel AVX512 instruction set. These algorithms have also been adapted for multicore CPU architectures and tested for each available SIMD vector length. These new algorithms based on SIMD Union-Find algorithms can be applied to other domains such as graphs algorithms manipulating Union-Find structures.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.