Deep learning-based ECG interpretation has recently emerged as a promising approach for automating report generation, reducing time and expense in the clinic. However, ECG descriptions generated by previous methods fall considerably short of cardiologists' diagnoses. This research proposes a deep network that extracts features by combining two branches: one processing the 1D input signal and one processing its 2D spectrogram. Fusing the 1D and 2D features guides the Transformer-based captioning network to focus on intricate ECG components such as the P and T waves. In addition, we propose a novel prior-knowledge module that employs a co-attention Siamese network to improve accuracy and better capture distinctive ECG characteristics. We also introduce an efficient augmentation technique to mitigate data imbalance, and adopt the FNet module to streamline the attention blocks in the Transformer architecture. Performance is validated on the PhysioNet Challenge 2021 database of approximately 88,243 twelve-lead ECG recordings. We train the model on free-text ECG annotations and obtain BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE, and CIDEr scores of 0.88, 0.87, 0.81, 0.79, 0.62, 0.88, and 8.35, respectively, when comparing the generated captions against clinical-standard reports. On the public China Physiological Signal Challenge dataset, our method outperforms earlier efforts on classifying sinus rhythm, first-degree atrioventricular block, and premature ventricular contraction (PVC). This study shows that a deep neural network appropriately trained on unstructured free-text physician annotations can generate ECG reports that assist clinicians in evaluating ECGs and reduce cardiologists' clinical workload in interpreting cardiovascular disease.