Data, code, and workflows should be available and cited
Desktop resources are attractive for running compute-intensive distributed applications. Several systems that aggregate these resources in desktop grids have been developed. While these systems have been used successfully for a wide variety of high-throughput applications, there has been little insight into the detailed temporal structure of CPU availability of desktop grid resources. Yet this structure is critical to characterizing the utility of desktop grid platforms for both task-parallel and even data-parallel applications. We address the following questions: (i) What are the temporal characteristics of desktop CPU availability in an enterprise setting? (ii) How do these characteristics affect the utility of desktop grids? (iii) Based on these characteristics, can we construct a model of server "equivalents" for the desktop grids, which can be used to predict application performance? We present measurements of an enterprise desktop grid with over 220 hosts running the Entropia commercial desktop grid software. We use these measurements to characterize CPU availability and to develop a performance model for desktop grid applications at various task granularities, showing that there is an optimal task size. We then introduce a new metric, cluster equivalence, which we use to quantify the utility of the desktop grid relative to that of a dedicated cluster.
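A minimal sketch of how a cluster-equivalence ratio might be computed from per-host availability traces; the trace format, the fixed-task-size throughput model, and all function names are illustrative assumptions, not the paper's exact method.

```python
# Hypothetical sketch: estimating a "cluster equivalence" ratio from
# per-host CPU-availability traces. The simple task-completion model
# below (a task interrupted by an outage is lost and restarted) is an
# illustrative assumption, not the paper's exact model.

def tasks_completed(intervals, task_len):
    """Count tasks of length task_len that fit wholly inside each
    uninterrupted availability interval (partial intervals yield
    nothing, since an interrupted task must be restarted)."""
    return sum(int(length // task_len) for length in intervals)

def cluster_equivalence(host_traces, task_len, horizon):
    """Ratio of desktop-grid throughput to one dedicated cluster node.

    host_traces: per-host lists of uninterrupted availability
                 interval lengths (seconds) within the horizon.
    task_len:    task size in seconds.
    horizon:     total wall-clock window in seconds.
    """
    grid_tasks = sum(tasks_completed(t, task_len) for t in host_traces)
    dedicated_tasks = horizon // task_len  # one always-on node
    return grid_tasks / dedicated_tasks

# Example: 3 desktop hosts over a 1-hour window, 5-minute tasks.
traces = [[1800, 900], [3600], [600, 600, 1200]]
print(cluster_equivalence(traces, task_len=300, horizon=3600))
```

Under this toy model, shrinking the task size recovers more of the fragmented availability but (in a fuller model) increases per-task overhead, which is why an optimal task size exists.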
Today's computational, experimental, and observational sciences rely on computations that involve many related tasks. The success of a scientific mission often hinges on the computer automation of these workflows. In April 2015, the US Department of Energy (DOE) invited a diverse group of domain and computer scientists from national laboratories supported by the Office of Science and the National Nuclear Security Administration, from industry, and from academia to review the workflow requirements of DOE's science and national security missions, to assess the current state of the art in science workflows, to understand the impact of emerging extreme-scale computing systems on those workflows, and to develop requirements for automated workflow management in existing and future environments. This article is a summary of the opinions of over 50 leading researchers attending this workshop. We highlight use cases, computing systems, and workflow needs, and conclude by summarizing the challenges that, in this community's view, prevent large-scale scientific workflows from becoming a mainstream tool for extreme-scale science.
Pseudoknots have been recognized to be an important type of RNA secondary structure responsible for many biological functions. PseudoBase, a widely used database of pseudoknot secondary structures developed at Leiden University, contains over 250 records of pseudoknots obtained in the past 25 years through crystallography, NMR, mutational experiments, and sequence comparisons. To address the growing analysis needs of researchers studying RNA structures and to bring together information from multiple sources across the Internet on a single platform, we designed and implemented PseudoBase++, an extension of PseudoBase for easy searching, formatting, and visualization of pseudoknots. PseudoBase++ (http://pseudobaseplusplus.utep.edu) maps the PseudoBase dataset into a searchable relational database with additional functionality such as searching by pseudoknot type. PseudoBase++ links each pseudoknot in PseudoBase to the GenBank record of the corresponding nucleotide sequence and allows scientists to automatically visualize RNA secondary structures with PseudoViewer. It also includes capabilities for fine-grained reference searching and for collecting new pseudoknot information.
We present a mathematical framework for constructing and analyzing parallel algorithms for lattice kinetic Monte Carlo (KMC) simulations. The resulting algorithms have the capacity to simulate a wide range of spatio-temporal scales in spatially distributed, non-equilibrium physicochemical processes with complex chemistry and transport micro-mechanisms. Rather than focusing on constructing the stochastic trajectories exactly, our approach relies on approximating the evolution of observables, such as density, coverage, and correlations. More specifically, we develop a spatial domain decomposition of the Markov operator (generator) that describes the evolution of all observables according to the kinetic Monte Carlo algorithm. This domain decomposition corresponds to a decomposition of the Markov generator into a hierarchy of operators and can be tailored to specific hierarchical parallel architectures such as multi-core processors or clusters of graphics processing units (GPUs). Based on this operator decomposition, we formulate parallel fractional-step kinetic Monte Carlo algorithms by employing the Trotter theorem and its randomized variants; these schemes (a) are partially asynchronous on each fractional-step time window, and (b) are characterized by their communication schedule between processors. The proposed mathematical framework allows us to rigorously justify the numerical and statistical consistency of the proposed algorithms, showing the convergence of our approximating schemes to the original serial KMC. The approach also provides a systematic evaluation of different processor communication schedules. We carry out a detailed benchmarking of the parallel KMC schemes using available exact solutions, for example in Ising-type systems, and we demonstrate the capability of the method to simulate complex spatially distributed reactions at very large scales on GPUs. Finally, we discuss workload balancing between processors and propose a re-balancing scheme based on probabilistic mass transport methods.
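As a generic illustration of the operator-splitting idea described above (the notation here is ours, not necessarily the paper's): if the domain decomposition splits the Markov generator into sub-generators acting on different subdomains, say $\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2$, the Lie-Trotter theorem gives

$$
e^{t\mathcal{L}} = e^{t(\mathcal{L}_1 + \mathcal{L}_2)} = \lim_{n \to \infty} \left( e^{(t/n)\mathcal{L}_1} \, e^{(t/n)\mathcal{L}_2} \right)^{n},
$$

so over a fractional time step $\Delta t = t/n$ each processor can evolve the dynamics generated by its own sub-operator independently, at the cost of a local splitting error of order $O(\Delta t^2)$ per step for this first-order scheme (the symmetric Strang variant improves the global error to second order).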
Distributed computing using PCs volunteered by the public can provide high computing capacity at low cost. However, computational results from volunteered PCs have a non-negligible error rate, so result validation is needed to ensure overall correctness. A generally applicable technique is "redundant computing", in which each computation is done on several separate computers and results are accepted only if there is a consensus. Variations in numerical processing between computers (due to a variety of hardware and software factors) can lead to different results for the same task. In some cases this can be addressed by doing a "fuzzy comparison" of results, so that two results are considered equivalent if they agree within given tolerances. However, this approach is not applicable to applications that are "divergent", that is, for which small numerical differences can produce large differences in the results. In this paper we examine the problem of validating results of divergent applications. We present a novel approach called Homogeneous Redundancy (HR), in which the redundant instances of a computation are dispatched to numerically identical computers, allowing strict equality comparison of the results. HR has been deployed in Predictor@home, a worldwide community effort to predict protein structure from sequence.
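A minimal sketch of the homogeneous-redundancy idea as described above: group hosts by a numerical-equivalence key, dispatch all replicas of a task into one group, and accept a result only on strict (bitwise) consensus. The equivalence-class key and the replica count are assumptions for illustration; the deployed mechanism differs in detail.

```python
from collections import defaultdict

# Hypothetical sketch of Homogeneous Redundancy (HR). The class key
# below (OS + CPU vendor) is an illustrative stand-in for whatever
# attributes make two hosts numerically identical.

def hr_class(host):
    return (host["os"], host["cpu_vendor"])

def dispatch(task, hosts, replicas=2):
    """Pick one equivalence class with enough hosts and send every
    replica of the task into that class."""
    by_class = defaultdict(list)
    for h in hosts:
        by_class[hr_class(h)].append(h)
    for members in by_class.values():
        if len(members) >= replicas:
            return [(task, h) for h in members[:replicas]]
    raise RuntimeError("no equivalence class large enough")

def validate(results):
    """Strict consensus: accept iff all replica outputs are identical."""
    return results[0] if len(set(results)) == 1 else None

hosts = [{"os": "linux", "cpu_vendor": "intel"},
         {"os": "linux", "cpu_vendor": "intel"},
         {"os": "win", "cpu_vendor": "amd"}]
print(dispatch("work_unit_42", hosts))   # both replicas go to linux/intel
print(validate(["0x3fa1", "0x3fa1"]))    # consensus -> accepted
```

Because both replicas run in the same numerical environment, equality can be exact, which is what makes the scheme usable for divergent applications where tolerance-based fuzzy comparison fails.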
We test the feasibility of rapidly detecting and characterizing earthquakes with the Quake-Catcher Network (QCN) that connects low-cost microelectromechanical systems accelerometers to a network of volunteer-owned, Internet-connected computers. Following the 3 September 2010 M 7.2 Darfield, New Zealand, earthquake we installed over 180 QCN sensors in the Christchurch region to record the aftershock sequence. The sensors are monitored continuously by the host computer and send trigger reports to the central server. The central server correlates incoming triggers to detect when an earthquake has occurred. The location and magnitude are then rapidly estimated from a minimal set of received ground-motion parameters. Full seismic time series are typically not retrieved for tens of minutes or even hours after an event. We benchmark the QCN real-time detection performance against the GNS Science GeoNet earthquake catalog. Under normal network operations, QCN detects and characterizes earthquakes within 9.1 s of the earthquake rupture and determines the magnitude within 1 magnitude unit of that reported in the GNS catalog for 90% of the detections.

Introduction

Over the past decade, several cyber-social-seismic networks have been developed, including the Personal Seismic Network (Cranswick et al., 1993), NetQuakes (Luetgert et al., 2009), the Quake-Catcher Network (QCN; Cochran, Lawrence, Christensen, and Chung, 2009; Cochran, Lawrence, Christensen, and Jakka, 2009), and the Community Seismic Network (Clayton et al., 2011). New sensor technology and computational techniques provide an avenue for creating very large cyber-social-seismic networks by reducing instrument costs, minimizing needed infrastructure, and harnessing public interest. Small low-cost ($30-$3000) microelectromechanical systems (MEMS) triaxial sensors provide ground-acceleration measurements of moderate to large earthquakes (Cochran, Lawrence, Christensen, and Chung, 2009; Cochran, Lawrence, Christensen, and Jakka, 2009; Chung et al., 2011; Cochran et al., 2011). Data from these low-cost sensors are transmitted to a central server either through an Internet-connected computer or via any available wireless connection (Luetgert et al., 2009; Cochran, Lawrence, Christensen, and Chung, 2009; Cochran, Lawrence, Christensen, and Jakka, 2009; Clayton et al., 2011). These networks minimize the costs associated with monitoring the sensors by utilizing the host's computing resources, A/C power, Internet, and shelter (Luetgert et al., 2009; Cochran, Lawrence, Christensen, and Chung, 2009; Cochran, Lawrence, Christensen, and Jakka, 2009; Clayton et al., 2011).

The QCN represents one type of cyber-social-seismic network. In the QCN architecture, MEMS sensors are connected directly to Universal Serial Bus (USB) ports on a host's computer; the computer monitors the sensor and sends time series and ground-motion parameters to a central server. This is a low-cost paradigm compared to traditional sensor networks and even other cyber-social-seismic networks such as the NetQ...
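The server-side detection step can be pictured as a windowed coincidence count over incoming trigger reports. The sketch below is an illustrative stand-in, not the QCN server's actual algorithm; the window length, station quorum, and data fields are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Trigger:
    station_id: str
    time: float        # trigger time, seconds since epoch
    peak_accel: float  # peak ground acceleration, m/s^2

def detect(triggers, window=5.0, min_stations=4):
    """Scan time-sorted triggers and return the start time of the
    first window containing >= min_stations distinct stations,
    else None. A real system would also require the stations to be
    spatially clustered before declaring an event."""
    triggers = sorted(triggers, key=lambda t: t.time)
    for i, first in enumerate(triggers):
        stations = {t.station_id for t in triggers[i:]
                    if t.time - first.time <= window}
        if len(stations) >= min_stations:
            return first.time
    return None
```

Requiring several distinct stations within a short window is what suppresses the single-host false triggers (a bumped desk, a slammed door) that a volunteer-hosted network inevitably produces.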
The performance of several two-step scoring approaches for molecular docking was assessed for the ability to predict binding geometries and free energies. Two new scoring functions designed for "step 2 discrimination" were proposed and compared to our CHARMM implementation of the linear interaction energy (LIE) approach using the Generalized-Born with Molecular Volume (GBMV) implicit solvation model. A scoring function S1 was proposed by considering only "interacting" ligand atoms as the "effective size" of the ligand, and extended to an empirical regression-based pair potential S2. The S1 and S2 scoring schemes were trained and five-fold cross-validated on a diverse set of 259 protein-ligand complexes from the Ligand Protein Database (LPDB). The regression-based parameters for S1 and S2 also demonstrated reasonable transferability in the CSARdock 2010 benchmark using a new dataset (NRC HiQ) of diverse protein-ligand complexes. The ability of the scoring functions to accurately predict ligand geometry was evaluated by calculating their discriminative power (DP) to identify native poses. The parameters for the LIE scoring function with the optimal DP for geometry (step 1 discrimination) were found to be very similar to the best-fit parameters for binding free energy over a large number of protein-ligand complexes (step 2 discrimination). Reasonable performance of the scoring functions in enrichment of active compounds across four different protein target classes established that the parameters for S1 and S2 provide reasonable accuracy and transferability. Additional analysis was performed to definitively separate scoring function performance from molecular-weight effects. This analysis included the prediction of ligand binding efficiencies for a subset of the CSARdock NRC HiQ dataset in which the number of ligand heavy atoms ranged from 17 to 35, the range in which improved accuracy of predicted ligand efficiencies is most relevant to real-world drug design efforts.
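For context, the linear interaction energy approach mentioned above estimates binding free energy from ensemble-averaged ligand-environment interaction energies. In its standard generic form (shown as background, not as this work's exact parameterization),

$$
\Delta G_{\mathrm{bind}} \approx \alpha \, \langle \Delta V_{\mathrm{vdW}} \rangle + \beta \, \langle \Delta V_{\mathrm{elec}} \rangle + \gamma,
$$

where each $\langle \Delta V \rangle$ is the difference between the bound and free ensemble averages of the ligand's van der Waals or electrostatic interaction energy with its surroundings, and $\alpha$, $\beta$, $\gamma$ are empirically fitted coefficients; the "step 1 versus step 2" question above is whether one set of such coefficients can serve both pose discrimination and affinity ranking.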