Large Scale Audio-Visual Video Analytics Platform for Forensic Investigations of Terroristic Attacks

Strobel, Alexander; Boyer, Martin; Lindley, Andrew; Schreiber, David; Philipp, Thomas

doi:10.1007/978-3-030-05716-9_9

Cited by 5 publications

(6 citation statements)

References 22 publications

“…For the evaluation we are using a Convolutional Recurrent Neural Network (CRNN) [3,22]. A CRNN is a combination of a Convolutional Neural Network (CNN) stack and a Recurrent Neural Network (RNN).…”

Section: Model Architecture (mentioning)

confidence: 99%

“…Audio representations aim to capture intrinsic properties and characteristics of the audio content to facilitate complex tasks such as classification (acoustic scenes [6,16], music genres [15]), regression (emotion recognition [31]) or similarity estimation (music, [13] general audio [22]). In the context of this paper we focus on their application in audio similarity estimation and retrieval.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised cross-modal audio representation learning from unstructured multilingual text

Strobel

Gordea

Knees

2020

Proceedings of the 35th Annual ACM Symposium on Applied Computing

Self Cite

We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track relatedness. By applying Latent Semantic Indexing (LSI), we embed corresponding textual information into a latent vector space from which we derive track relatedness for online triplet selection. This LSI topic modelling facilitates fine-grained selection of similar and dissimilar audio-track pairs to learn the audio representation using a Convolutional Recurrent Neural Network (CRNN). In this way, we directly project the semantic context of the unstructured text modality onto the learned representation space of the audio modality without deriving structured ground-truth annotations from it. We evaluate our approach on the Europeana Sounds collection and show how to improve search in digital audio libraries by harnessing the multilingual metadata provided by numerous European digital libraries. We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection. The learned representations perform comparably to the baseline of handcrafted features and exceed this baseline in similarity retrieval precision at higher cut-offs with only 15% of the baseline's feature vector length.
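The triplet idea summarized in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the selection rule (most and least text-similar track), the margin of 0.2, the function names, and the toy vectors are all assumptions made for illustration. The text vectors stand in for LSI embeddings, and the audio vectors stand in for CRNN outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_triplet(anchor, text_vecs):
    """Hypothetical selection rule: the positive is the track whose text
    vector is most similar to the anchor's, the negative the least similar."""
    others = [i for i in range(len(text_vecs)) if i != anchor]
    sims = {i: cosine(text_vecs[anchor], text_vecs[i]) for i in others}
    positive = max(others, key=sims.get)
    negative = min(others, key=sims.get)
    return positive, negative

def triplet_loss(a, p, n, margin=0.2):
    """Standard triplet margin loss on embeddings; the margin is an assumed value."""
    return max(0.0, euclidean(a, p) - euclidean(a, n) + margin)

# Toy data: text vectors stand in for LSI embeddings of track metadata,
# audio vectors stand in for the learned audio representation.
text_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
audio_vecs = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]

pos, neg = select_triplet(0, text_vecs)
loss = triplet_loss(audio_vecs[0], audio_vecs[pos], audio_vecs[neg])
print(pos, neg, loss)
```

In this toy case the text-related track is far away in the audio space, so the loss is positive; training would pull it closer to the anchor and push the unrelated track away.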

“…Implementation: The implemented approach - detailed in [14] - is a combination of the models developed in the Detection and Classification of Acoustic Scenes and Events (DCASE) [11] international evaluation campaign [9,17,18] and the approach presented in [21]. The model applies a Convolutional Recurrent Neural Network (CRNN) [14] with an attention layer on log-scaled Mel-Spectrogram inputs (9.92 seconds audio, 44.1 kHz sample rate, 80 Mel-bands, 2048 samples STFT-window size with 50% overlap). It was trained on a pre-processed subset of the Audioset dataset [4].…”

Section: Sound Event Detection (SED) (mentioning)

confidence: 99%
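The input parameters quoted in the statement above imply a specific spectrogram size, which the following sketch works out. Only the numeric parameters come from the quoted text; the constant names, the `frame_count` helper, and the two framing conventions (no padding versus centred padding) are assumptions, since the statement does not say which convention the authors used.

```python
# Parameters taken from the quoted statement.
SAMPLE_RATE = 44_100   # Hz
CLIP_SECONDS = 9.92    # seconds of audio per input
WINDOW = 2048          # STFT window size in samples
OVERLAP = 0.5          # 50% overlap between consecutive windows
N_MELS = 80            # number of Mel bands

def frame_count(n_samples, window, hop, centered):
    """Number of STFT frames under two common conventions (hypothetical helper).

    centered=False: only windows that fit entirely inside the signal.
    centered=True:  the signal is padded so frames are centred on hop multiples.
    """
    if centered:
        return 1 + n_samples // hop
    return 1 + (n_samples - window) // hop

hop = int(WINDOW * (1 - OVERLAP))               # 1024 samples
n_samples = round(SAMPLE_RATE * CLIP_SECONDS)   # 437472 samples

frames_valid = frame_count(n_samples, WINDOW, hop, centered=False)
frames_centered = frame_count(n_samples, WINDOW, hop, centered=True)

# Resulting log-Mel input shape as (Mel bands, time frames).
print((N_MELS, frames_valid))      # (80, 426)
print((N_MELS, frames_centered))   # (80, 428)
```

Either way, each 9.92-second clip becomes an 80-band matrix with a little over 420 time frames, which is the kind of input a CRNN processes along the time axis.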

“…Implementation: The developed approach to a multi-class multitarget tracking method - also detailed in [14] - was trained and optimized on the specific scenario-relevant object categories. It is based on an appearance based tracker as in [19] and aims to add additional features such as targets motion and mutual interaction [20], as well as learning temporal dependencies as in [12] [20].…”

Section: Sound Event Detection (SED) (mentioning)

confidence: 99%

Multi-Modal Video Forensic Platform for Investigating Post-Terrorist Attack Scenarios

Strobel

Lindley

Jalali

et al. 2020

Preprint

Self Cite

The forensic investigation of a terrorist attack poses a significant challenge to the investigative authorities, as often several thousand hours of video footage must be viewed. Large scale Video Analytic Platforms (VAP) assist law enforcement agencies (LEA) in identifying suspects and securing evidence. Current platforms focus primarily on the integration of different computer vision methods and thus are restricted to a single modality. We present a video analytic platform that integrates visual and audio analytic modules and fuses information from surveillance cameras and video uploads from eyewitnesses. Videos are analyzed according to their acoustic and visual content. Specifically, Audio Event Detection is applied to index the content according to attack-specific acoustic concepts. Audio similarity search is utilized to identify similar video sequences recorded from different perspectives. Visual object detection and tracking are used to index the content according to relevant concepts. Innovative user-interface concepts are introduced to harness the full potential of the heterogeneous results of the analytical modules, allowing investigators to follow up more quickly on leads and eyewitness reports.

“…In [3] the authors used lips reading to speech recognition. The authors in [4] presented a platform for audio-visual video analysis to assist agencies in analyzing and identifying suspects from large scale videos recorded after a terrorist attack.…”

Section: Introduction (mentioning)

confidence: 99%

Security Detection in Audio Events: A Comparison of Classification Methods

Nasser

2020

JAMCS

The security of public places is becoming more important with the increased rate of violence and subversion. Recently, several studies have proposed methods to automatically detect abnormal events in public places, such as car crashes, violence or other hazardous events, in an attempt to improve security and save lives. Most of this research uses supervised classification techniques to classify the audio signals. This paper proposes the use of kernel principal component analysis (KPCA) to reduce the number of MFCC features extracted from the audio signal, followed by an unsupervised classification algorithm. Moreover, this paper presents the results of several supervised and unsupervised classification methods for audio event detection and compares these results with the result of the proposed approach. Experiments are done using a real data set recorded on a means of public transportation. The obtained results reveal that K-means on 2 KPCA components gave good results for triggering a true alarm as well as detecting a false alarm: the percentages of false and missed alarms were 4.5% and 7.8% respectively, whereas these values were 0.8% and 9.3% respectively for kernel k-means. Nevertheless, the DNN gave the best results, with a false alarm rate of 0% and a missed alarm rate of 1.4%.
