Abstract. Data streams flowing from the physical environment are as unpredictable as the environment itself. Radars go down, long-haul networks drop packets, and readings are corrupted on the wire. Yet data-driven scientific models and data mining algorithms do not necessarily account for these inaccuracies when assimilating the data. Low-overhead provenance collection partially solves this problem. We propose a data model and a collection model for near-real-time provenance collection. We define a system architecture for stream provenance tracking and motivate it with a real-world application in meteorological forecasting.
Each year across the USA, destructive weather events disrupt transportation and commerce, resulting in loss of both life and property. Mitigating the impacts of such severe events requires innovative software tools and cyberinfrastructure through which scientists can monitor data for specific severe weather events, such as thunderstorms, and launch focused modeling computations to predict and forecast these evolving events. Bringing about a paradigm shift in meteorology research and education through advances in cyberinfrastructure is one of the key research objectives of the Linked Environments for Atmospheric Discovery (LEAD) project, a large-scale, interdisciplinary, NSF-funded project spanning ten institutions. In this paper we address the challenges of making cyberinfrastructure frameworks responsive to real-time conditions in the physical environment, driven by use cases in mesoscale meteorology. The contribution of the research is twofold. On the cyberinfrastructure side, we propose a model for bridging between the physical environment and e-Science workflow systems, specifically through event processing systems, and provide a proof-of-concept implementation of that model in the context of the LEAD cyberinfrastructure. On the algorithmic side, we propose efficient stream mining algorithms that can run continuously in real time over large volumes of observational data.