Named entity recognition (NER) is one of the spoken language understanding (SLU) tasks, which usually extract semantic information from textual documents. Until now, NER from speech has been performed through a pipeline process that first applies automatic speech recognition (ASR) to the audio and then applies NER to the ASR outputs. Such an approach has some disadvantages (error propagation, an ASR tuning metric that is sub-optimal with regard to the final task, a reduced search space at the ASR output level, ...), and it is known that more integrated approaches outperform sequential ones when they can be applied. In this paper, we present a first study of an end-to-end approach that directly extracts named entities from speech through a single neural architecture. In this way, ASR and NER can be optimized jointly. Experiments are carried out on easily accessible French data, distributed in several evaluation campaigns. Experimental results show that this end-to-end approach provides better results (F-measure=0.69 on test data) than a classical pipeline approach to detect named entity categories (F-measure=0.65).
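As a reading aid for the F-measure figures quoted above, the following is a minimal sketch of how precision, recall, and F-measure can be computed from two sets of entity annotations. The function name `f_measure`, the `(category, text)` tuple format, and the toy annotations are assumptions made for illustration; the abstract does not describe the exact scoring protocol used in the paper.

```python
def f_measure(reference, hypothesis):
    """Return (precision, recall, F-measure) for two collections of annotations."""
    ref, hyp = set(reference), set(hypothesis)
    correct = len(ref & hyp)  # annotations found in both sets
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy annotations: (entity category, surface text)
reference = [("pers", "marie curie"), ("loc", "paris"), ("org", "sorbonne")]
hypothesis = [("pers", "marie curie"), ("loc", "paris"), ("loc", "sorbonne")]

precision, recall, f1 = f_measure(reference, hypothesis)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

In this toy case two of the three hypothesized entities match the reference, so precision, recall, and F-measure are all 2/3.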
We present an end-to-end approach to extract semantic concepts directly from the speech audio signal. To overcome the lack of data available for this spoken language understanding approach, we investigate the use of a transfer learning strategy based on the principles of curriculum learning. This approach allows us to exploit out-of-domain data that can help to prepare a fully neural architecture. Experiments are carried out on the French MEDIA and PORTMEDIA corpora and show that this end-to-end SLU approach reaches the best results ever published on this task. We compare our approach to a classical pipeline approach that uses ASR, POS tagging, a lemmatizer, a chunker, and other NLP tools that enrich the ASR outputs fed to a text-to-concepts SLU system. Last, we explore the promising capacity of our end-to-end SLU approach to address the problem of domain portability.
For the first time, the maximum thermal budget of in-situ doped source/drain State Of The Art (SOTA) FDSOI bottom MOSFET transistors is quantified to ensure transistor stability in Sequential 3D (CoolCube™) integration. We observe no degradation of the Ion/Ioff trade-off up to 550°C. Thanks to both metal gate work-function stability, especially on short devices, and silicide stability improvement, the top MOSFET temperature could be relaxed up to 500°C. Laser anneal is then considered as a promising candidate for junction activation. Based on in-depth morphological and electrical characterizations, it demonstrates very promising results for high performance Sequential 3D integration.
Speaker segmentation consists in partitioning a conversation between one or more speakers into speaker turns. The task is usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped speech detection); we propose instead to train an end-to-end segmentation model that performs it directly. Inspired by the original end-to-end neural speaker diarization approach (EEND), the task is modeled as a multi-label classification problem using permutation-invariant training. The main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (every 16 ms). Experiments on multiple speaker diarization datasets show that our model can be used with great success for both voice activity detection and overlapped speech detection. Our proposed model can also be used as a post-processing step, to detect and correctly assign overlapped speech regions. Relative diarization error rate improvement over the best considered baseline (VBx) reaches 17% on AMI, 13% on DIHARD 3, and 13% on VoxConverse.
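To make the idea of permutation-invariant training for multi-label classification more concrete, here is a minimal sketch that scores per-frame speaker activity probabilities against reference labels under every speaker permutation and keeps the lowest loss. It is not the authors' implementation: the function names `bce` and `pit_loss`, the use of binary cross-entropy, the brute-force search over permutations, and the toy four-frame example are all assumptions for illustration.

```python
import math
from itertools import permutations


def bce(target, prob, eps=1e-7):
    """Binary cross-entropy for one (target, probability) pair."""
    prob = min(max(prob, eps), 1 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def pit_loss(targets, probs):
    """Return (lowest mean loss, best permutation).

    targets, probs: lists of frames; each frame holds one value per speaker.
    perm[out_idx] is the reference speaker matched to model output out_idx.
    """
    num_speakers = len(targets[0])
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(num_speakers)):
        total = 0.0
        for t_frame, p_frame in zip(targets, probs):
            for out_idx, ref_idx in enumerate(perm):
                total += bce(t_frame[ref_idx], p_frame[out_idx])
        loss = total / (len(targets) * num_speakers)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Hypothetical toy chunk: 4 frames, 2 speakers (1 = speaker active)
targets = [[1, 0], [1, 1], [0, 1], [0, 0]]
# The model outputs the speakers in swapped order
probs = [[0.1, 0.9], [0.8, 0.9], [0.9, 0.2], [0.1, 0.1]]

loss, perm = pit_loss(targets, probs)
print(perm, round(loss, 3))
```

Because the loss is taken over the best permutation, the model is not penalized for emitting the speakers in a different order from the reference. The brute-force search is only practical for a small number of speakers per chunk.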
This paper presents a deep-learning-based algorithm dedicated to the processing of speckle noise in phase measurements in digital holographic interferometry. The deep learning architecture is trained with phase fringe patterns that include faithful speckle noise, which has non-Gaussian statistics, is non-stationary, and exhibits a spatial correlation length. The performance of the speckle de-noiser is estimated with metrics, and the proposed approach exhibits state-of-the-art results. In order to train the network to de-noise phase fringe patterns, a database is built from a set of noise-free and speckled phase data. The algorithm is applied to de-noising experimental data from wide-field digital holographic vibrometry. Comparison with the state-of-the-art algorithm confirms the achieved performance.
This work investigates spoken language understanding (SLU) systems in the scenario where the semantic information is extracted directly from the speech signal by means of a single end-to-end neural network model. Two SLU tasks are considered: named entity recognition (NER) and semantic slot filling (SF). For these tasks, in order to improve the model performance, we explore various techniques including speaker adaptation, a modification of the connectionist temporal classification (CTC) training criterion, and sequential pretraining.
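For readers unfamiliar with CTC, the sketch below shows the standard mapping from a frame-level label sequence to an output sequence: consecutive repeated labels are merged, then blank symbols are dropped. This illustrates plain CTC only, not the modified training criterion mentioned in the abstract, whose details are not given here. The blank symbol `_`, the function name `ctc_collapse`, and the toy frame sequence are assumptions.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Hypothetical best label per frame, as a greedy decoder might produce
frames = list("__hh_e_ll_ll__oo_")
decoded = "".join(ctc_collapse(frames))
print(decoded)
```

The blank between the two runs of `l` is what allows the doubled letter to survive the merge step.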
Fig. 3 | Model dependencies and results of the rSA. a, MDS of the RDMs describing the relationships between pictures in terms of their collective (national news bulletins and reports on World War II), semantic (that is, Wikipedia World War II articles), spatial (that is, Memorial layout) and temporal (that is, acquisition order) properties. Collective and semantic RDMs included six to ten selected topics (Fig. 2c) and their ten iterations. Temporal RDMs included the six possible routes around the Memorial. b, The dmPFC and vmPFC regions of interest (ROIs). c, Similarities between the upper triangular portions of image arrangements (left), dmPFC (middle) and vmPFC (right) RDMs, and collective, semantic and contextual model RDMs. Bar graphs display the mean beta coefficients from the regression model (top) and Spearman's correlation coefficients (bottom) across participants (N = 24). Horizontal lines indicate significant differences at P < 0.05, false discovery rate (FDR)-corrected for multiple comparisons. Error bars reflect 95% bootstrapped CIs (and thus indicate significance when they do not overlap with zero). Dashed horizontal lines indicate the noise ceiling (that is, an estimate of the reliability of the neural data; see Methods), which reflects the expected performance of the (unknown) true model given the noise and variability among study participants (ref. 34). Regarding brain imaging data, the collective RDM reaches the noise ceiling, indicating that collective schemas account for a significant amount of the true neural dissimilarity structure. Note that in the context of the current experiment, the small correlation between individuals' brain dissimilarity structures and the expected true model arises from the fact that we have only one measurement of each pattern of memory activity (that is, participants recall the picture only once), making the estimate of the neural dissimilarity noisy.
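The comparison described in panel c, Spearman's correlation between the upper triangular portions of two RDMs, can be sketched as follows. This is an illustrative reconstruction, not the study's analysis code: the helper names, the tie handling by average ranks, and the two small 4 × 4 matrices are assumptions.

```python
def upper_triangle(matrix):
    """Return the entries strictly above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation (assumes neither input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rdm(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two RDMs."""
    return pearson(ranks(upper_triangle(rdm_a)), ranks(upper_triangle(rdm_b)))


# Hypothetical symmetric dissimilarity matrices for four pictures
neural_rdm = [[0.0, 0.2, 0.9, 0.7],
              [0.2, 0.0, 0.8, 0.6],
              [0.9, 0.8, 0.0, 0.3],
              [0.7, 0.6, 0.3, 0.0]]
model_rdm = [[0, 1, 10, 5],
             [1, 0, 7, 4],
             [10, 7, 0, 2],
             [5, 4, 2, 0]]

rho = spearman_rdm(neural_rdm, model_rdm)
print(round(rho, 3))
```

Only the upper triangle is used because an RDM is symmetric with a zero diagonal, so the remaining entries carry no additional information. Here the two toy matrices order the picture pairs identically, so the rank correlation is 1.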