2004
DOI: 10.1007/978-3-540-30222-3_61
The CLEF 2003 Cross-Language Spoken Document Retrieval Track

Abstract: This paper summarizes the Cross-Language Spoken Document Retrieval (CL-SDR) track held at CLEF 2004. The CL-SDR task at CLEF 2004 was again based on the TREC-8 and TREC-9 SDR tasks. This year the CL-SDR task was extended to explore the unknown story boundaries condition introduced at TREC. The paper reports results from the participants showing that, as expected, cross-language results are reduced relative to a monolingual baseline, although the amount by which they are degraded varies for different to…
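The "reduction relative to a monolingual baseline" can be made concrete with a small, hypothetical calculation. The MAP values below are invented for illustration only; the ~15% figure merely echoes the order of reduction reported by citing work later on this page, not numbers from the paper itself.

```python
# Hypothetical illustration of "reduction relative to a monolingual
# baseline"; the MAP values are invented, not taken from the paper.
def relative_reduction(mono_map: float, cross_map: float) -> float:
    """Percentage drop of cross-language MAP versus the monolingual baseline."""
    return 100.0 * (mono_map - cross_map) / mono_map

# Invented example: monolingual MAP 0.40, cross-language MAP 0.34
print(round(relative_reduction(0.40, 0.34), 1))  # 15.0
```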

Cited by 17 publications (7 citation statements)
References 6 publications
“…In this paper we explore the use of Query Expansion (QE) methods for IR for user-generated content where the information is primarily in the spoken data stream, for which search relies on Spoken Content Retrieval (SCR) techniques. Research on SCR initially investigated IR for planned speech content such as news broadcasts and documentaries [1], [2]. The focus then shifted towards spoken content that is produced spontaneously such as interviews, lectures and TV shows [3].…”
Section: Introduction
confidence: 99%
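The Query Expansion named in this statement is often realized as pseudo-relevance feedback: terms frequent in the top-ranked documents of an initial retrieval run are appended to the query. The sketch below is a minimal illustration under assumed toy data; the corpus, parameters, and term-selection rule are invented, not the cited paper's method.

```python
# Minimal sketch of pseudo-relevance-feedback query expansion.
# All data and parameter choices here are illustrative assumptions.
from collections import Counter

def expand_query(query_terms, ranked_docs, top_k=2, n_new_terms=2):
    """Add the most frequent non-query terms from the top-k retrieved docs."""
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        counts.update(t for t in doc if t not in query_terms)
    new_terms = [t for t, _ in counts.most_common(n_new_terms)]
    return list(query_terms) + new_terms

# Toy "retrieved documents", already ranked by an initial query run
docs = [["broadcast", "news", "radio", "news"],
        ["radio", "interview", "news"],
        ["cooking", "recipe"]]
print(expand_query(["news"], docs))  # ['news', 'radio', 'broadcast']
```

In real SCR systems the expansion terms would be weighted (e.g. by a relevance model) rather than simply appended, but the top-k feedback loop is the same.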
“…Our work focuses on the blip10000 Internet video archive [8]. These videos were uploaded to the social video sharing site blip.tv by 2,237 different uploaders, covering 25 different topics, with varying recording quality and differing lengths. The statistics of Automatic Speech Recognition (ASR) transcripts extracted from these videos are shown in Table I.…”
Section: Introduction
confidence: 99%
“…Earlier speech corpora contained relatively clean audio, often with a single speaker reading from a prepared text, such as the TIMIT collection (Garofolo et al, 1990) or broadcast news corpora, which have been used as data sets for speech retrieval experiments in both TREC (Garofolo et al, 2000) and CLEF (Federico and Jones, 2003), and for Topic Detection and Tracking (Allan et al, 1998). These more formal settings or samples of formal content are useful for the study of acoustic qualities of human speech, but represent a more idealized scenario than practical audio processing tasks of interest today.…”
Section: Related Datasets
confidence: 99%
“…These CLIR tasks were done using topics in several European languages. No metadata was provided in these tasks, but some interesting findings indicate that even with the manually translated queries, the best CLIR performance resulted in a 15% reduction from the monolingual ones (Federico & Jones, 2004), while using dictionary term-by-term translation, this reduction increased to between about 40% and 60%, which highlights the challenge for CLIR over video collections (Federico et al, 2005).…”
Section: Related Work
confidence: 99%
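The dictionary term-by-term translation this statement contrasts with manual translation can be sketched in a few lines: each source-language query term is replaced by its dictionary translations, and out-of-vocabulary terms pass through untranslated. The bilingual dictionary and queries below are made-up examples, not data from the cited evaluations.

```python
# Illustrative sketch of dictionary-based term-by-term query translation
# for CLIR; the dictionary and query are invented examples.
def translate_query(query, dictionary):
    """Replace each source term with its dictionary translations;
    out-of-vocabulary terms are kept untranslated."""
    translated = []
    for term in query:
        translated.extend(dictionary.get(term, [term]))
    return translated

fr_en = {"radio": ["radio"],
         "nouvelles": ["news"],
         "emission": ["broadcast", "show"]}
print(translate_query(["nouvelles", "radio"], fr_en))  # ['news', 'radio']
```

Ambiguous entries (like "emission" above) expand to several candidates, which dilutes the query; this translation ambiguity is one reason term-by-term translation degrades retrieval much more than manual query translation.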
“…From 2002-2004 the Cross-Language Spoken Document Retrieval (CL-SDR) task investigated news story document retrieval using data from the NIST TREC 8-9 Spoken Document Retrieval (SDR) tasks with manually translated queries (Federico & Jones, 2004; Federico, Bertoldi, Levow, & Jones, 2005). The aim of these tasks was to evaluate CLIR systems on noisy automatic transcripts of spoken documents with known story boundaries, which involved the retrieval of American English news broadcasts from both unsegmented and segmented transcripts taken from radio and TV news.…”
Section: Related Work
confidence: 99%