2021
DOI: 10.48550/arxiv.2109.01164
Preprint

Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development

Abstract: This paper introduces a human-in-the-loop (HITL) data annotation pipeline to generate high-quality, large-scale speech datasets. The pipeline combines human and machine advantages to more quickly, accurately, and cost-effectively annotate datasets with machine pre-labeling and fully manual auditing. Quality control mechanisms such as blind testing, behavior monitoring, and data validation have been adopted in the annotation pipeline to mitigate potential bias introduced by machine-generated labels. Our A/B tes…

Cited by 1 publication
References 28 publications
(32 reference statements)