High-productivity languages for parallel computing become more important as parallel environments including multicores become more common. Cilk is such a language. It provides good load balancing for many applications including irregular ones; that is, it keeps all workers busy by creating plenty of "logical" threads and adopting the oldest-first work stealing strategy. This paper proposes a "logical thread"-free framework called Tascell, which achieves a higher performance and supports a wider range of parallel environments including clusters without loss of productivity. A Tascell worker spawns a "real" task only when requested by another idle worker. The worker performs the spawning by temporarily "backtracking" and restoring its oldest task-spawnable state. Our approach eliminates the cost of spawning/managing logical threads. It also promotes the reuse of workspaces and improves the locality of reference since it does not need to prepare a workspace for each concurrently runnable logical thread. Furthermore, Tascell enables elegant and highly-efficient backtrack search algorithms with delayed workspace copying. For instance, our 16-queens problem solver is 1.86 times faster than Cilk on a system with two dual-core processors. Our approach also enables a single program to run in both shared and distributed memory environments with reasonable efficiency and scalability.
Abstract. We propose a new language concept called "L-closures" for a running program to legitimately inspect/modify the contents of its execution stack. L-closures are lightweight lexical closures created by evaluating nested function definitions. A lexical closure can access the lexically-scoped variables in the creation-time environment and indirect calls to it provide legitimate stack access. By using an intermediate language extended with L-closures in high-level compilers, high-level services such as garbage collection, check-pointing, multithreading and load balancing can be implemented elegantly and efficiently. Each variable accessed by an L-closure uses private and shared locations for giving the private location a chance to get a register. Operations to keep coherency with shared locations as well as operations to initialize L-closures are delayed until an L-closure is actually invoked. Because most high-level services create L-closures very frequently but call them infrequently (e.g., to scan roots in garbage collection), the total overhead can be reduced significantly. Since the GNU C compiler provides nested functions, we enhanced GCC at relatively low implementation costs. The results of performance measurements exhibit quite low costs of creating and maintaining L-closures.
The SC language system was developed to provide a transformation-based language extension scheme for SC languages (extended/plain C languages with an S-expression-based syntax). Using this system, many flexible extensions to the C language can be implemented by means of transformation rules over S-expressions at low cost, mainly because of the preexisting Common Lisp capabilities for manipulating S-expressions. This paper presents the LW-SC (Lightweight-SC) language as an important application of this system, featuring nested functions (i.e., functions defined inside other functions). A function can manipulate its caller's local variables (or local variables of its indirect callers) by indirectly calling a nested function of its callers. Thus, many high-level services with "stack walk" can be easily and elegantly implemented by using LW-SC as an intermediate language. Moreover, such services can be implemented efficiently because we designed and implemented LW-SC to provide "lightweight" nested functions by aggressively reducing the costs of creating and maintaining nested functions. The GNU C compiler also provides nested functions as an extension to C, but our sophisticated translator to standard C is more portable and efficient for occasional "stack walk."
The QR factorization of a matrix is a fundamental operation in linear algebra and it is widely utilized in scientific simulations. Although the QR factorization requires a memory storage of O(N 2 ) and the computational cost of O(N 3 ), they can be reduced if the matrix could be approximated using low-rank structured matrices, such as hierarchical matrices (H-matrices). In this paper, we study the QR factorization based on the block low-rank (BLR-) matrix representation, which is a simplified variant of the H-matrix. We demonstrate how the BLR block size should be defined in such matrices and confirm that the memory and computational complexities of the BLR are O(N 1.5 ) and O(N 2 ) when using an appropriate block size. Furthermore, using the simple structure of BLR-matrices, we also present a parallel algorithm for the QR factorization of BLR-matrices on distributed memory systems. In numerical experiments performed, we observe that the accuracy of the QR factorization is controllable by the accuracy of the BLR-matrix approximation of the original matrix. Finally, we verify that our algorithm enables us to perform the QR factorization of matrices of several hundred thousand N within a reasonable amount of time using moderate computer resources.
This paper proposes a parallel algorithm to extract all connected subgraphs, each of which shares a common itemset whose size is not less than a given threshold, from a given graph in which each vertex is associated to an itemset. We also propose implementations of this algorithm using the task-parallel language Tascell. This kind of graph mining can be applied to analysis of social or biological networks. We have already proposed an efficient sequential search algorithm called COPINE for this problem. COPINE reduces the search space of a dynamically growing tree structure by pruning its branches corresponding to the following subgraphs; already visited, having itemsets smaller than the threshold, and having already-visited supergraphs with identical itemsets. For the third pruning, we use a table associating already-visited subgraphs and their itemsets. To avoid excess pruning in a parallel search where a unique set of subtrees (tasks) is assigned to each worker, we should put a certain restriction on a worker when it is referring to a table entry registered by another worker. We designed a parallel algorithm as an extension of COPINE by introducing this restriction. A problem of the implementation is how workers efficiently share the table entries so that a worker can safely use as many entries registered by other workers as possible. We implemented two sharing methods: (1) a victim worker makes a copy of its own table and passes it to a thief worker when the victim spawns a task by dividing its task and assigns it to the thief, and (2) a single table controlled by locks is shared among workers. We evaluated these implementations using a real protein network. As a result, the single table implementation achieved a speedup of approximately a factor four with 16 workers.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.