Kapil Surlaker scite author profile

Espresso is a document-oriented distributed data serving platform built to address LinkedIn's requirements for a scalable, performant, source-of-truth primary store. It provides a hierarchical document model, transactional support for modifications to related documents, real-time secondary indexing, on-the-fly schema evolution, and a timeline-consistent change capture stream. This paper describes the motivation and design principles behind Espresso, the data model and capabilities exposed to clients, and the details of the replication and secondary indexing implementation, and presents a set of experimental results that characterize the performance of the system along various dimensions. When we set out to build Espresso, we chose to apply industry best practices, published research, and our own internal experience with different consistency models. Along the way, we built a novel generic distributed cluster management framework, a partition-aware change-capture pipeline, and a high-performance inverted index implementation.


Untangling cluster management with Helix

Gopalakrishna, Shi, Zhang, et al. 2012


All aboard the Databus!

Das, Botev, Surlaker, et al. 2012


In Internet architectures, data systems are typically categorized into source-of-truth systems that serve as primary stores for user-generated writes, and derived data stores or indexes that serve reads and other complex queries. The data in these secondary stores is often derived from the primary data through custom transformations, sometimes involving complex processing driven by business logic. Similarly, data in caching tiers is derived from reads against the primary data store, but needs to be invalidated or refreshed when the primary data is mutated. A fundamental requirement emerging from these kinds of data architectures is the need to reliably capture, flow, and process primary data changes. We have built Databus, a source-agnostic distributed change data capture system, which is an integral part of LinkedIn's data processing pipeline. The Databus transport layer provides latencies in the low milliseconds and handles throughput of thousands of events per second per server while supporting infinite look-back capabilities and rich subscription functionality. This paper covers the design, implementation, and trade-offs underpinning the latest generation of Databus technology. We also present experimental results from stress-testing the system and describe our experience supporting a wide range of LinkedIn production applications built on top of Databus.
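
The cache-invalidation case described in this abstract can be illustrated with a minimal sketch. It is not Databus's API: `ChangeEvent`, `PrimaryStore`, `CacheInvalidator`, and the `scn` sequence number are hypothetical names chosen for illustration, and the change log is an in-memory list rather than a distributed transport. The sketch only shows the general idea of a subscriber that replays primary-store changes from its own checkpoint.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ChangeEvent:
    """One captured mutation (hypothetical shape, not the Databus event format)."""
    scn: int      # assumed sequence number, increasing with each write
    key: str
    value: Any


@dataclass
class PrimaryStore:
    """Toy source-of-truth store that appends every write to a change log."""
    rows: Dict[str, Any] = field(default_factory=dict)
    log: List[ChangeEvent] = field(default_factory=list)

    def put(self, key: str, value: Any) -> None:
        self.rows[key] = value
        self.log.append(ChangeEvent(len(self.log) + 1, key, value))

    def events_since(self, scn: int) -> List[ChangeEvent]:
        # Events are stored in write order with scn = position + 1,
        # so everything after `scn` is simply the tail of the list.
        return self.log[scn:]


@dataclass
class CacheInvalidator:
    """Subscriber that evicts cached keys whose primary copy has changed."""
    cache: Dict[str, Any] = field(default_factory=dict)
    checkpoint: int = 0  # last sequence number this subscriber has applied

    def poll(self, store: PrimaryStore) -> int:
        events = store.events_since(self.checkpoint)
        for event in events:
            self.cache.pop(event.key, None)  # drop the stale entry, if any
            self.checkpoint = event.scn
        return len(events)
```

Because the subscriber keeps its own checkpoint, it can stop and later resume from where it left off, which is the property a look-back capability relies on.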


Gobblin

Qiao, Takiar, et al. 2015

Proc. VLDB Endow.


Data ingestion is essential for companies and organizations that collect and analyze large volumes of data. This paper describes Gobblin, a generic data ingestion framework for Hadoop and one of LinkedIn's latest open source products. At LinkedIn we need to ingest data from various sources, such as relational stores, NoSQL stores, streaming systems, REST endpoints, and filesystems, into our Hadoop clusters. Maintaining independent pipelines for each source can lead to various operational problems. Gobblin aims to solve this issue by providing a centralized data ingestion framework that makes it easy to ingest data from a variety of sources. Gobblin distinguishes itself from similar frameworks by focusing on three core principles: generality, extensibility, and operability. Gobblin supports a mixture of data sources out of the box and can be easily extended to support more. This enables an organization to use a single framework to handle different data ingestion needs, making it easy and inexpensive to operate. Moreover, with an end-to-end metrics collection and reporting module, Gobblin makes it simple and efficient to identify issues in production.
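
The idea of one framework serving many source types can be sketched as follows. This is an illustration under stated assumptions, not Gobblin's actual interfaces: `Source`, `ListSource`, and `ingest` are hypothetical names, the sink is a plain Python list standing in for a Hadoop destination, and the returned per-source counts are only a stand-in for a metrics module.

```python
from typing import Dict, Iterable, Iterator, List, Protocol


class Source(Protocol):
    """Anything that has a name and can yield records (hypothetical interface)."""
    name: str

    def records(self) -> Iterator[dict]:
        ...


class ListSource:
    """Stand-in for a real connector (database, REST endpoint, file, ...)."""

    def __init__(self, name: str, rows: List[dict]) -> None:
        self.name = name
        self.rows = rows

    def records(self) -> Iterator[dict]:
        return iter(self.rows)


def ingest(sources: Iterable[Source], sink: List[dict]) -> Dict[str, int]:
    """Run every source through the same pipeline and count records per source."""
    counts: Dict[str, int] = {}
    for source in sources:
        written = 0
        for record in source.records():
            # Tag each record with its origin so downstream jobs can tell them apart.
            sink.append({"source": source.name, **record})
            written += 1
        counts[source.name] = written
    return counts
```

Supporting a new kind of source in this sketch means writing one more class with a `name` and a `records()` method; the shared `ingest` loop and its counting stay unchanged.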



Kapil Surlaker

Data Infrastructure at LinkedIn

On brewing fresh espresso

