GPUs offer high performance for highly parallel applications at low cost. Support for integer and logical instructions in recent generations of GPUs makes it easier to implement cipher algorithms on them. However, decisions such as the granularity of parallel processing and the memory allocation of variables impose a heavy burden on programmers. This paper therefore presents results of several experiments conducted to elucidate the relation between the memory allocation styles of AES variables and the granularity of the parallelism exploited from AES encoding processes, using CUDA on an NVIDIA GeForce GTX 285 (Nvidia Corp.). The results show that a granularity of 16 bytes/thread gave the highest performance, achieving approximately 35 Gbps throughput. The choice of memory allocation and granularity affected performance by roughly 2%-30% relative to a standard implementation, which shows that these decisions are the most important factors for efficient AES encryption on a GPU. Moreover, an implementation that overlaps processing with data transfer yielded 22.5 Gbps throughput including the data transfer time.
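The notion of granularity above (how many plaintext bytes each GPU thread encrypts) can be illustrated with a minimal host-side sketch. This is not AES: the per-block transform is a placeholder XOR, and all names are illustrative; on the GPU, each loop iteration would run as an independent thread.

```python
# Illustrative model of 16 bytes/thread granularity: each "thread"
# encrypts exactly one 16-byte block, the granularity the experiments
# above found fastest. The XOR transform is a stand-in for AES rounds.
BLOCK = 16

def encrypt_block(block: bytes, key: bytes) -> bytes:
    # Placeholder per-block transform (stands in for the AES rounds).
    return bytes(b ^ k for b, k in zip(block, key))

def encrypt_16_bytes_per_thread(plaintext: bytes, key: bytes) -> bytes:
    assert len(plaintext) % BLOCK == 0 and len(key) == BLOCK
    # On the GPU, each block below would be handled by its own thread;
    # blocks are independent, so they can all proceed in parallel.
    blocks = [plaintext[i:i + BLOCK] for i in range(0, len(plaintext), BLOCK)]
    return b"".join(encrypt_block(b, key) for b in blocks)
```

Coarser granularities (e.g., 4 or 8 bytes/thread with several threads cooperating on one block) would require inter-thread communication within each block, which is part of why the one-block-per-thread mapping performed best.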
As data protection with encryption becomes more important day by day, encryption using general-purpose computation on a graphics processing unit (GPGPU) has attracted attention as one way to realize high-speed data protection. GPUs have evolved in recent years into powerful parallel computing devices with a high cost-performance ratio, but many factors affect GPU performance. From earlier work on obtaining higher AES performance with GPGPU in various ways, we derived two technical findings: (1) 16 bytes/thread is the best granularity, and (2) the best memory allocation style stores the extended key and substitution table in shared memory and the plaintext in registers. However, AES is not the only cipher algorithm widely used in the real world. This study was therefore undertaken to test the hypothesis that these two findings also apply to implementations of other symmetric block ciphers on two generations of GPU. We targeted five 128-bit symmetric block ciphers (AES, Camellia, CIPHERUNICORN-A, Hierocrypt-3, and SC2000) from the e-government recommended ciphers list of the CRYPTography Research and Evaluation Committees (CRYPTREC) in Japan. We evaluated the performance of these five ciphers on a machine with a 4-core CPU and each GPU using three methods: (A) throughput without data transfer, (B) throughput with data transfer and overlapped encryption processing on the GPU, and (C) throughput with data transfer and non-overlapped encryption processing on the GPU. The results demonstrate that the SC2000 implementation achieved an extremely high 73.4 Gbps with method (A) on a Tesla C2050; with methods (B) and (C) the throughput fell to 33.4 Gbps and 18.3 Gbps, respectively. Method (B) delivered effective throughput approximately 4.7 times higher than that obtained with 8 threads on the 4-core CPU.
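The gap between methods (B) and (C) comes from whether PCIe data transfer overlaps with kernel execution. A back-of-the-envelope timing model (all numbers illustrative, not measurements from the paper) shows why overlapping raises effective throughput:

```python
def effective_gbps(data_gb: float, compute_s: float, transfer_s: float,
                   overlap: bool) -> float:
    """Toy timing model. With overlapping (method B), transfer and
    encryption proceed concurrently, so total time is dominated by the
    slower of the two; without overlapping (method C), the two times
    simply add. Returns effective throughput in gigabits per second."""
    total = max(compute_s, transfer_s) if overlap else compute_s + transfer_s
    return data_gb * 8.0 / total
```

For example, with hypothetical figures of 1 GB of data, 0.1 s of compute, and 0.2 s of transfer, overlapping gives 40 Gbps effective throughput while the non-overlapped pipeline gives about 26.7 Gbps, mirroring the (B) versus (C) pattern reported above.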
SUMMARY This paper is concerned with content-addressable memory (CAM), a kind of functional memory. We propose flexible multiport content-addressable memory (FMCAM), which realizes flexible retrieval functions and a multiport structure, and present its implementation and evaluation. CAM is known for the high speed of its matching retrieval function; on the other hand, practical problems concerning the trade-off between processing speed and the number of comparators, as well as the problem of cost, have prevented its widespread use. Consequently, we developed FMCAM as follows. A multiport CAM is realized while limiting the increase in the number of comparators and retaining high-speed matching retrieval. Flexible retrieval functions can be added through implementation in an FPGA. By means of a categorization process and a ring counter, multiple processing operations can be performed immediately while reducing the number of comparators, which improves the performance of the CAM itself. Another advantage is that an FMCAM alone can perform the same processing in situations where a parallel CAM would otherwise be required. After implementing the proposed FMCAM, several comparative tests were performed. Excellent values were obtained: the AT product was 37.5% lower on average than in a conventional CAM, and hardware complexity increased only by a factor of 1.57 even when the number of ports was increased by a factor of 4.
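The matching retrieval function that makes CAM attractive inverts ordinary memory addressing: instead of supplying an address and reading data, the application supplies a search word and the memory returns every address holding it. A minimal software model (names illustrative; in real CAM hardware, one comparator per cell performs all comparisons in parallel within a cycle, which this sequential loop does not capture):

```python
class SimpleCAM:
    """Software model of CAM matching retrieval: write data to
    addresses, then search for which addresses hold a given word.
    This models only the function, not the parallel hardware speed."""

    def __init__(self, size: int):
        self.cells = [None] * size

    def write(self, address: int, word: bytes) -> None:
        self.cells[address] = word

    def search(self, word: bytes) -> list:
        # In hardware, every cell's comparator fires simultaneously;
        # here we scan sequentially and collect all matching addresses.
        return [a for a, w in enumerate(self.cells) if w == word]
```

The cost problem described above follows directly from this picture: one comparator per cell per port, so a naive multiport design multiplies comparator count by the number of ports, which is the growth FMCAM's categorization and ring-counter scheme limits.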
Nowadays, hash functions are used for password management. A hash function should possess three characteristics: pre-image resistance, second pre-image resistance, and collision resistance. All three rest on the assumption that it is computationally difficult to find the original message from a given hash value. However, the security level of password management is further reduced when a high-speed hash function is implemented on a GPU. In this paper, a high-speed implementation of the hash function Keccak-512 using the CUDA integrated development environment for GPUs is proposed. Four techniques are used to speed it up. The first is reshaping the lookup tables used in steps ρ and π from two-dimensional arrays into one-dimensional arrays. The second is an investigation of the effect of placing constant values in constant memory and shared memory. The third is finding the optimal block-thread configuration and evaluating the implementation according to its occupancy. The last is using CUDA streams with overlapping to hide the overhead of data transfer behind GPU processing. As a result, the throughput of the implemented Keccak on a GeForce GTX 1080 reached a maximum of 64.58 GB/s, about 14.0 times faster than the previous research result. In addition, the safety of Keccak is discussed, especially with respect to pre-image resistance. To implement a high-speed hash function for password cracking, we developed a special program for passwords of up to 71 characters. Moreover, the throughputs of hashing two and three times are also evaluated, and the results show that hashing multiple times can greatly improve the security level of Keccak for password management. With a large number of iterations, the measured efficiency of iterated hashing was about 90%: that is, the time required to hash one password 1000 times was almost the same as the total time to sequentially hash 900 passwords.
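The iterated (multiple-times) hashing evaluated above can be sketched with Python's standard library. Note that `hashlib.sha3_512` is the standardized SHA-3 variant of Keccak-512, whose padding differs from the original Keccak submission; it is used here only as a convenient stand-in to show the iteration structure.

```python
import hashlib

def iterated_hash(password: bytes, iterations: int) -> bytes:
    """Hash the password repeatedly, feeding each digest back in.
    Each added iteration multiplies an attacker's per-guess cost,
    which is why multiple-times hashing raises the security margin
    against GPU-accelerated cracking."""
    digest = password
    for _ in range(iterations):
        digest = hashlib.sha3_512(digest).digest()
    return digest
```

The roughly 90% efficiency reported above means the defender pays little overhead per iteration beyond the raw hash cost, so a large iteration count translates almost linearly into extra work for an attacker.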