Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources. In this paper we create a taxonomy of data provenance characteristics and apply it to current research efforts in e-science, focusing primarily on scientific workflow approaches. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. The survey culminates with an identification of open research problems in the field.
Advances in networking technologies will soon make it possible to use the global information infrastructure in a qualitatively different way—as a computational as well as an information resource. As described in the recent book The Grid: Blueprint for a New Computing Infrastructure, this Grid will connect the nation’s computers, databases, instruments, and people in a seamless web of computing and distributed intelligence, which can be used in an on-demand fashion as a problem-solving resource in many fields of human endeavor—and, in particular, science and engineering. The availability of grid resources will give rise to dramatically new classes of applications, in which computing resources are no longer localized but, rather, distributed, heterogeneous, and dynamic; computation is increasingly sophisticated and multidisciplinary; and computation is integrated into our daily lives and, hence, subject to stricter time constraints than at present. The impact of these new applications will be pervasive, ranging from new systems for scientific inquiry, through computing support for crisis management, to the use of ambient computing to enhance personal mobile computing environments. To realize this vision, significant scientific and technical obstacles must be overcome. Principal among these is usability. The goal of the Grid Application Development Software (GrADS) project is to simplify distributed heterogeneous computing in the same way that the World Wide Web simplified information sharing over the Internet. To that end, the project is exploring the scientific and technical problems that must be solved to make it easier for ordinary scientific users to develop, execute, and tune applications on the Grid. 
In this paper, the authors describe the vision and strategies underlying the GrADS project, including the base software architecture for grid execution and performance monitoring, strategies and tools for construction of applications from libraries of grid-aware components, and development of innovative new science and engineering applications that can exploit these new technologies to run effectively in grid environments.
The increasing ability for the earth sciences to sense the world around us is resulting in a growing need for data-driven applications that are under the control of data-centric workflows composed of grid and web services. The focus of our work is on provenance collection for these workflows, necessary to validate the workflow and to determine quality of generated data products. The challenge we address is to record uniform and usable provenance metadata that meets the domain needs while minimizing the modification burden on the service authors and the performance overhead on the workflow engine and the services. The framework, based on a loosely-coupled publish-subscribe architecture for propagating provenance activities, satisfies the needs of detailed provenance collection while a performance evaluation of a prototype finds a minimal performance overhead (in the range of 1% for an eight service workflow using 271 data products).
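The publish-subscribe idea in the abstract above can be illustrated with a minimal in-process sketch. It is not the authors' framework: the names `ProvenanceActivity`, `Broker`, `ProvenanceStore`, the topic string, and the activity fields are all hypothetical, chosen only to show how services might emit provenance notifications without knowing who records them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class ProvenanceActivity:
    """One provenance notification (hypothetical schema, not from the paper)."""
    workflow_id: str
    service_id: str
    kind: str          # assumed labels, e.g. "invoked" or "data_produced"
    detail: str = ""


Handler = Callable[[ProvenanceActivity], None]


class Broker:
    """Toy in-process broker: publishers and subscribers share only a topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, activity: ProvenanceActivity) -> None:
        # Deliver to every handler registered for the topic; the publisher
        # does not know who (if anyone) is listening.
        for handler in list(self._subscribers[topic]):
            handler(activity)


class ProvenanceStore:
    """Subscriber that groups received activities by workflow."""

    def __init__(self) -> None:
        self.by_workflow: DefaultDict[str, List[ProvenanceActivity]] = defaultdict(list)

    def record(self, activity: ProvenanceActivity) -> None:
        self.by_workflow[activity.workflow_id].append(activity)


def run_demo() -> ProvenanceStore:
    broker = Broker()
    store = ProvenanceStore()
    broker.subscribe("provenance", store.record)
    # A service only has to publish; it needs no reference to the store.
    broker.publish("provenance", ProvenanceActivity("wf-1", "svc-a", "invoked"))
    broker.publish(
        "provenance",
        ProvenanceActivity("wf-1", "svc-a", "data_produced", "output.nc"),
    )
    return store
```

Because a service's only obligation in this sketch is a `publish` call, the change required of service authors stays small, which is the kind of loose coupling the abstract describes. A real deployment would presumably use a networked messaging layer rather than direct function calls.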
Each year across the US, mesoscale weather events (flash floods, tornadoes, hail, strong winds, lightning, and localized winter storms) cause hundreds of deaths, routinely disrupt transportation and commerce, and lead to economic losses averaging more than US$13 billion [1]. Although mitigating the impacts of such events would yield enormous economic and societal benefits, research leading to that goal is hindered by rigid IT frameworks that can't accommodate the real-time, on-demand, dynamically adaptive needs of mesoscale weather research; its disparate, high-volume data sets and streams; or the tremendous computational demands of its numerical models and data-assimilation systems. In response to the increasingly urgent need for a comprehensive national cyberinfrastructure in mesoscale meteorology, particularly one that can interoperate with those being developed in other relevant disciplines, the US National Science Foundation (NSF) funded a large information technology research (ITR) grant in 2003, known as Linked Environments for Atmospheric Discovery (LEAD). A multidisciplinary effort involving nine institutions and more than 100 scientists, students, and technical staff in meteorology, computer science, social science, and education, LEAD addresses the fundamental research challenges needed to create an integrated, scalable framework for adaptively analyzing and predicting the atmosphere. LEAD's foundation is dynamic workflow orchestration and data management in a Web services framework. These capabilities provide for the use of analysis tools, forecast models, and data repositories,