Interspeech 2017
DOI: 10.21437/interspeech.2017-1390

Polyglot and Speech Corpus Tools: A System for Representing, Integrating, and Querying Speech Corpora

Abstract: Speech datasets from many languages, styles, and sources exist in the world, representing significant potential for scientific studies of speech, particularly given structural similarities among all speech datasets. However, studies using multiple speech corpora remain difficult in practice, due to corpus size, complexity, and differing formats. We introduce open-source software for unified corpus analysis: integrating speech corpora and querying across them. Corpora are stored in a custom 'polyglot persistence…

Cited by 7 publications (6 citation statements)
References: 15 publications (11 reference statements)
“…The corpus contained 11863 qualifying word pairs, which were extracted along with existing time-aligned phonetic transcription using the Montreal Corpus Tools software (McAuliffe, Stengel-Eskin, Socolof, & Sonderegger, 2017). The Pitt et al (2007) transcriptions were prepared automatically and subsequently hand-corrected by phonetically trained research assistants.…”
Section: Dataset (mentioning)
confidence: 99%
“…In order to standardize the acoustic analysis across corpora, the Integrated Speech Corpus Analysis (ISCAN) tool was developed for use in this kind of cross-dialectal study in the context of the SPADE project. This section provides a brief overview of the ISCAN system: see McAuliffe et al (2017b, 2019) and the ISCAN documentation page for details of the implementation. The process of deriving a dataset from raw corpus files consists of three major steps.…”
Section: Discussion (mentioning)
confidence: 99%
“…Using the transcripts from each corpus, we identified target instances of /b d g/ and /p t k/ using the PolyglotDB Python package (McAuliffe, Stengel-Eskin, Socolof, & Sonderegger, 2017), and compared the SpiCE bilinguals' productions to each of the monolingual comparison corpora. The analyses described in the following studies report the results of mixed effects models for each study: cross-corpus comparisons, and a within-SpiCE analysis examining factors that influence language mode.…”
Section: General Methods (mentioning)
confidence: 99%