Executive Summary

The database research community is rightly proud of its success in basic research and its remarkable record of technology transfer. Now the field needs to radically broaden its research focus to attack the issues of capturing, storing, analyzing, and presenting the vast array of online data. The database research community should embrace a broader research agenda: broadening the definition of database management to cover all the content of the Web and other online data stores, and rethinking our fundamental assumptions in light of technology shifts. To accelerate this transition, we recommend changing the way research results are evaluated and presented. In particular, we advocate encouraging more speculative and long-range work, moving conferences to a poster format, and publishing all research literature on the Web.
A group of senior database researchers gathers every few years to assess the state of database research and to point out problem areas that deserve additional focus. This report summarizes the discussion and conclusions of the sixth ad-hoc meeting held May 4-6, 2003 in Lowell, Mass. It observes that information management continues to be a critical component of most complex software systems. It recommends that database researchers increase focus on: integration of text, data, code, and streams; fusion of information from heterogeneous data sources; reasoning about uncertain data; unsupervised data mining for interesting correlations; information privacy; and self-adaptation and repair.
Many of the data sources used in stream query processing are known to exhibit bursty behavior. Data in a burst often has different characteristics than steady-state data, and therefore may be of particular interest. In this paper, we describe the Data Triage architecture that we are adding to TelegraphCQ to provide low-latency results with good accuracy under such bursts.

Background

One of the distinguishing properties of stream query processors is that they produce query results in real time. For applications like financial market analysis, network monitoring, and inventory tracking, timely query results are of great importance. Studies show that common sources of streaming data (network traffic, environmental monitoring, software logs, etc.) often exhibit "bursty" behavior [1] [2]. Bursty behavior is characterized by periods of low data rates punctuated by "bursts" of high data rates that vary in their length and speed. Available network bandwidth and incoming query workloads may also be affected during bursts, leading to a situation in which the effective load on a stream query processor can vary rapidly and unpredictably by orders of magnitude. Note that bursts often produce not only more data, but also different data than usual. This will often be the case, for example, in crisis scenarios (network attacks, environmental incidents, software malfunctions, etc.), where a high volume of unusual readings may be reported to the system. Hence, analysts may be particularly eager to capture the properties of the data in a burst. The requirement for low result latency under heavy load raises design challenges, since query processors must return useful results quickly regardless of the rate at which they receive data. Much recent work has focused on methods of coping with excessive data rates in streaming query processors by shedding load.
Figure 1. The Data Triage load-shedding architecture. We embed triage queues inside the gateway modules that convert data streams into the system's internal format. If the query engine cannot consume tuples at the rate they enter the triage queues, the system builds synopses of the excess tuples.

We refer the reader to our technical report [3] for a more in-depth description of previous work. We believe that bursty data arrival poses unique challenges for load shedding that have not been adequately addressed in previous work. Since bursts can occur suddenly, load-shedding mechanisms need to react quickly to changes in data rates. Due to the very high variation in bandwidth exhibited by bursty data sources, load-shedding mechanisms need to produce accurate query results with low latency across a wide range of system loads. Finally, because bursts may contain the most interesting information, load shedding should not simply discard excess data; it must capture properties of the missing data. Archit...
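The triage-queue idea described above can be illustrated with a minimal sketch. This is not TelegraphCQ's actual implementation; the class name, the `capacity` bound, and the use of a simple value histogram as the synopsis are all illustrative assumptions, standing in for whatever synopsis structure the real system would use.

```python
from collections import deque, Counter

class TriageQueue:
    """Bounded queue that synopsizes overflow instead of dropping it.

    When the consumer (the query engine) falls behind and the queue
    fills, excess tuples are folded into a compact synopsis (here, a
    simple histogram over one attribute) rather than discarded.
    NOTE: an illustrative sketch, not TelegraphCQ's real data structure.
    """

    def __init__(self, capacity, key):
        self.capacity = capacity
        self.key = key             # extracts the attribute to summarize
        self.queue = deque()       # full tuples awaiting the query engine
        self.synopsis = Counter()  # histogram of triaged (overflow) tuples

    def enqueue(self, tup):
        if len(self.queue) < self.capacity:
            self.queue.append(tup)              # normal path: keep the tuple
        else:
            self.synopsis[self.key(tup)] += 1   # overflow path: summarize it

    def dequeue(self):
        return self.queue.popleft() if self.queue else None

# During a burst, only `capacity` tuples survive verbatim; the rest are
# still represented, approximately, in the synopsis.
tq = TriageQueue(capacity=2, key=lambda pkt: pkt["src"])
for pkt in [{"src": "10.0.0.1"}, {"src": "10.0.0.2"},
            {"src": "10.0.0.1"}, {"src": "10.0.0.3"}]:
    tq.enqueue(pkt)
```

The key design point is that the overflow path does constant work per tuple, so the gateway keeps up with the burst while still recording something about the data it could not deliver in full.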
Many of the data sources used in stream query processing are known to exhibit bursty behavior. We focus here on passive network monitoring, an application in which the data rates typically exhibit a large peak-to-average ratio. Provisioning a stream query processor to handle peak rates in such a setting can be prohibitively expensive. In this paper, we propose to solve this problem by provisioning the query processor for typical data rates instead of the much higher peak data rates. To enable this strategy, we present mechanisms and policies for managing the tradeoffs between the latency and accuracy of query results when bursts exceed the steady-state capacity of the query processor. We describe the current status of our implementation and present experimental results on a testbed network monitoring application to demonstrate the utility of our approach.

Introduction

Many of the emerging applications in stream query processing are known to exhibit high-speed, bursty data rates. The behavior of data streams in such applications is characterized by relatively long periods of calm, punctuated by "bursts" of high-speed data. The peak data rate exhibited in a burst is typically many times the average data rate. In this paper, we focus on an application that is particularly prone to bursty data: passive network monitoring. A passive network monitor (see Figure 1) is a device attached to a high-traffic network link that monitors and analyzes the packets on the link. There has been significant interest in bringing the power of declarative queries to passive network monitors [8,15]. Declarative queries are easy to change in response to the evolution of networked applications, protocols, and attacks, and they free network operators from the drudgery of hand-optimizing their monitoring code. However, high-speed bursts and tight time constraints make implementing declarative query processing for network monitoring a difficult problem.
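The peak-to-average ratio that motivates this provisioning argument is easy to make concrete. The sketch below uses invented per-window packet counts purely for illustration; the function name and the traffic figures are not from the paper.

```python
def peak_to_average(counts):
    """Peak-to-average ratio of per-window arrival counts.

    A processor provisioned for the average rate needs roughly this
    factor of extra headroom (or a load-shedding fallback) to survive
    the worst window. Illustrative only; counts are hypothetical.
    """
    avg = sum(counts) / len(counts)
    return max(counts) / avg

# Calm traffic punctuated by a single burst window: provisioning for
# the peak here means buying ~4x the capacity the average rate needs.
counts = [10, 12, 9, 100, 11, 10]   # packets per one-second window
ratio = peak_to_average(counts)
```

When this ratio is large, the paper's strategy of provisioning for the typical rate and trading accuracy for latency during bursts becomes far cheaper than provisioning for the peak.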
The bursty nature of network traffic is well documented in the literature [20,24]. Situations like SYN floods can multiply the effects of bursts by simultaneously increasing bandwidth usage and decreasing packet size. In such situations, even keeping simple counters during bursts is considered difficult [11]. Bursts often produce not only more data, but also different data than usual. This will often be the case, for example, in crisis scenarios such as a denial-of-service attack or a flash crowd. Because network operators need
Applications that query data streams in order to identify trends, patterns, or anomalies can often benefit from comparing the live stream data with archived historical stream data. However, searching this historical data in real time has so far been considered prohibitively expensive. One of the main bottlenecks is the cost of updating the indexes over the archived data. In this paper, we address this problem by using our highly efficient bitmap indexing technology (called FastBit) and demonstrate that the index update operations are sufficiently efficient for this bottleneck to be removed. We describe our prototype system based on the TelegraphCQ streaming query processor and the FastBit bitmap index. We present a detailed performance evaluation of our system using a complex query workload for analyzing real network traffic data. The combined system uses TelegraphCQ to analyze streams of traffic information and FastBit to correlate current behaviors with historical trends. We demonstrate that our system can simultaneously analyze (1) live streams with high data rates and (2) a large repository of historical stream data.
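To see why bitmap indexes suit append-mostly archives of stream data, consider a toy equality-encoded bitmap index: appending a row sets a single bit in one bitmap, so keeping the index current is cheap. This sketch is a simplification for illustration only; FastBit's actual indexes use compressed encodings (and other refinements) that this class does not attempt to model.

```python
class BitmapIndex:
    """Toy equality-encoded bitmap index: one bitmap per distinct value.

    Appending a row sets exactly one bit, which is why append-mostly
    workloads (such as archiving stream tuples) can keep such an index
    up to date at low cost. Illustrative sketch, not FastBit itself.
    """

    def __init__(self):
        self.bitmaps = {}   # value -> int used as a bit vector
        self.n_rows = 0

    def append(self, value):
        # Set bit n_rows in the bitmap for this value; O(1) bitmaps touched.
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << self.n_rows)
        self.n_rows += 1

    def rows_equal(self, value):
        # Equality query: decode the bitmap back into row ids.
        bm = self.bitmaps.get(value, 0)
        return [i for i in range(self.n_rows) if (bm >> i) & 1]

# Index the destination port of five archived packets.
idx = BitmapIndex()
for port in [80, 443, 80, 22, 80]:
    idx.append(port)
```

Queries then reduce to bitwise operations over the stored bitmaps, which is what makes correlating live behavior with large historical archives feasible.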
We present the design and analysis of a customized access method for the content-based image retrieval system Blobworld. Using the amdb access method analysis tool, we analyze three existing multidimensional access methods that support nearest neighbor search in the context of the Blobworld application. Based on this analysis, we propose several variants of the R-tree, tailored to address the problems the analysis revealed. We implemented the access methods we propose in the Generalized Search Trees (GiST) framework and analyzed them using amdb, a tool that enables visualization and performance analysis of access methods. We found that two of our access methods have better performance characteristics for the Blobworld application than any of the traditional multidimensional access methods we examined. Based on this experience, we draw conclusions for nearest neighbor access method design, and for the task of constructing custom access methods tailored to particular applications. In particular, we found that our "Top X Jagged Bites" bounding predicate performed better than all the other access methods we tested.
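The baseline that all of these multidimensional access methods are trying to beat is a brute-force scan over every feature vector. A minimal sketch of that baseline clarifies what an R-tree variant buys: it prunes bounding regions instead of computing the distance to every point. The feature vectors below are invented for illustration.

```python
import math

def nearest_neighbor(points, query):
    """Brute-force nearest-neighbor search over feature vectors.

    This O(n) scan is the baseline; tree-structured access methods
    (R-tree variants, GiST-based trees) aim to visit far fewer points
    by pruning whole bounding regions. Points here are hypothetical.
    """
    return min(points, key=lambda p: math.dist(p, query))

# Three 2-D "image region" feature vectors and one query point.
feats = [(0.1, 0.9), (0.4, 0.4), (0.8, 0.2)]
best = nearest_neighbor(feats, (0.5, 0.5))
```

An access-method analysis tool like amdb essentially measures how much of this scan a given tree structure manages to avoid on a real workload.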