Evaluation and Assessment in Software Engineering 2021
DOI: 10.1145/3463274.3463361
Benchmarking as Empirical Standard in Software Engineering Research

Abstract: In empirical software engineering, benchmarks can be used for comparing different methods, techniques and tools. However, the recent ACM SIGSOFT Empirical Standards for Software Engineering Research do not include an explicit checklist for benchmarking. In this paper, we discuss benchmarks for software performance and scalability evaluation as example research areas in software engineering, relate benchmarks to some other empirical research methods, and discuss the requirements on benchmarks that may constitut…

Cited by 16 publications (16 citation statements)
References 36 publications (46 reference statements)
“…The recently published ACM SIGSOFT Empirical Standard for Benchmarking (Ralph et al 2021; Hasselbring 2021) names four essential components of a benchmark:
- the quality to be benchmarked (e.g., performance, availability, scalability, security)
- the metric(s) to quantify the quality
- the measurement method(s) for the metric (if not obvious)
- the workload, usage profile and/or task sample the system under test is subject to (i.e., what the system is doing when the measures are taken)…”
Section: Components of Benchmarks
confidence: 99%
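The four components named in the quoted standard can be pictured as a minimal configuration object. The sketch below is hypothetical and not part of the standard; the `Benchmark` class and all field names are illustrative choices only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Benchmark:
    """Hypothetical sketch of the four benchmark components named in the
    ACM SIGSOFT Empirical Standard for Benchmarking (illustrative only)."""
    quality: str                              # quality to be benchmarked, e.g. "performance"
    metrics: List[str]                        # metric(s) quantifying the quality
    measure: Callable[[], Dict[str, float]]   # measurement method producing metric values
    workload: Callable[[], None]              # workload / usage profile for the system under test

    def run(self) -> Dict[str, float]:
        self.workload()       # exercise the system under test
        return self.measure() # take the measures

# Usage: a toy performance benchmark with a fixed, fake measurement.
bench = Benchmark(
    quality="performance",
    metrics=["latency_ms"],
    measure=lambda: {"latency_ms": 12.3},
    workload=lambda: None,
)
result = bench.run()
```

Making the workload and measurement method explicit, separate fields mirrors the standard's point that they are distinct components: the same workload can be reused with different metrics, and vice versa.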
“…In empirical software engineering research, benchmarks are an established research method to compare different methods, techniques, and tools based on a standardized method (Sim et al 2003; Tichy 2014; Hasselbring 2021). For traditional performance attributes such as latency or throughput, well-known (and often straightforward) metrics and measurement methods exist (Kounev et al 2020).…”
Section: Introduction
confidence: 99%
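For latency and throughput, the "straightforward" measurement the quote alludes to can be sketched with a small timing harness. This is a generic illustration, not the method from Kounev et al.; `operation` and the function name are placeholder assumptions.

```python
import time

def measure_latency_and_throughput(operation, n=1000):
    """Run `operation` n times; report mean latency (s) and throughput (ops/s).
    A minimal sketch of two classic performance metrics."""
    latencies = []
    start = time.perf_counter()
    for _ in range(n):
        t0 = time.perf_counter()
        operation()                               # the measured operation
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "mean_latency_s": sum(latencies) / n,     # average per-operation time
        "throughput_ops_s": n / elapsed,          # completed operations per second
    }

# Usage with a trivial CPU-bound operation.
stats = measure_latency_and_throughput(lambda: sum(range(100)), n=100)
```

In practice one would also report percentiles (e.g., p95/p99 latency) rather than only the mean, since tail behavior often matters more than the average.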
“…Based on guidelines on benchmarking best practices [29, 78] and inspired by the microservice benchmark suite DeathStarBench [26], we formulate the following design principles:…”
Section: Design Principles
confidence: 99%
“…We conduct a performance benchmarking experiment [29] with an open-loop load generator in the data center region Northern Virginia (us-east-1), as commonly used by other serverless studies [10, 14, 16, 81, 82]. We collected over 7.5 million traces through over 12 months of experimentation in 2021 and 2022.…”
Section: Experiments Design
confidence: 99%
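An open-loop load generator, as used in the quoted experiment, issues requests on a fixed schedule regardless of how long responses take; a closed-loop generator would instead wait for each response before sending the next. A minimal threaded sketch, assuming a caller-supplied `request_fn` (all names here are illustrative, not from the cited study):

```python
import threading
import time

def open_loop_load(request_fn, rate_per_s, duration_s):
    """Fire `request_fn` at a fixed rate for `duration_s` seconds.
    Requests run in their own threads, so slow responses never delay
    the send schedule (the defining property of an open loop)."""
    interval = 1.0 / rate_per_s
    threads = []
    start = time.perf_counter()
    next_fire = start
    while next_fire - start < duration_s:
        now = time.perf_counter()
        if now < next_fire:
            time.sleep(next_fire - now)   # wait until the scheduled send time
        t = threading.Thread(target=request_fn)
        t.start()
        threads.append(t)
        next_fire += interval             # schedule is fixed, not response-driven
    for t in threads:
        t.join()
    return len(threads)                   # number of requests issued

# Usage: ~50 req/s for 0.2 s against a simulated 10 ms operation.
count = open_loop_load(lambda: time.sleep(0.01), rate_per_s=50, duration_s=0.2)
```

Because the schedule is fixed, an overloaded system under test accumulates queued requests instead of throttling the generator, which is exactly why open-loop designs expose latency degradation that closed-loop designs can hide.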
“…While there might be indications that one of the models is better, how can you know for sure? We need a reliable approach for benchmarking different models [19], i.e., test automation that helps us detect if any GDMs digress from acceptable behavior.…”
confidence: 99%