“…Additionally, we estimate the direction of arrival (DOA) θ_s for each detected sound s, using a set of DOA estimates from the raw signal (as many as there are detected acoustic events at each time step), which are then mapped to x, y, z coordinates. This process can leverage additional semantic information from the vision stream, as shown in [47]. The most likely pairs {acoustic_event, θ_s} for co-occurring events are estimated in the spatial model using visual data.…”
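
The mapping from a DOA estimate to x, y, z coordinates mentioned in the excerpt can be illustrated with a minimal sketch. The excerpt does not say how the mapping is done, so the following assumes that each DOA is given as an azimuth and elevation angle in degrees and is projected onto a sphere of fixed radius around the microphone array. The function name `doa_to_xyz`, the angle convention, and the `radius` parameter are all hypothetical and are not taken from the cited work.

```python
import math


def doa_to_xyz(azimuth_deg, elevation_deg, radius=1.0):
    """Convert a DOA given as (azimuth, elevation) into Cartesian coordinates.

    Assumed convention (not from the source): azimuth is measured in the
    x-y plane from the +x axis toward +y, elevation is measured upward
    from the x-y plane, and the source lies at distance `radius`.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = radius * math.cos(el) * math.cos(az)
    y = radius * math.cos(el) * math.sin(az)
    z = radius * math.sin(el)
    return (x, y, z)


# Hypothetical detections at one time step: (event label, azimuth, elevation).
detections = [("speech", 30.0, 0.0), ("door_slam", -90.0, 15.0)]
positions = {label: doa_to_xyz(az, el) for label, az, el in detections}
```

With a unit radius the result is only a direction vector. Recovering an actual position would need a distance estimate, which a DOA alone does not provide.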