In this report, we summarize the takeaways from the first NeurIPS 2021 NetHack Challenge. Participants were tasked with developing a program or agent that can win (i.e., 'ascend' in) the popular dungeon-crawler game NetHack by interacting with the NetHack Learning Environment (NLE), a scalable, procedurally generated, and challenging Gym environment for reinforcement learning (RL). The challenge showcased community-driven progress in AI, with many diverse approaches significantly beating the previous best results on NetHack. Furthermore, it served as a direct comparison between neural (e.g., deep RL), symbolic, and hybrid systems, demonstrating that on NetHack symbolic bots currently outperform deep RL by a large margin. Lastly, no agent came close to winning the game, illustrating NetHack's suitability as a long-term benchmark for AI research.
School of Computing, University of Eastern Finland
ABSTRACT
It is well known that for the speaker recognition task, gender-dependent acoustic modeling performs better than gender-independent modeling. The common practice is to use ground-truth gender labels to train gender-dependent models. However, such information is not necessarily available, especially if speakers are enrolled remotely. A way to overcome this is to use a gender classification system, which introduces an additional layer of uncertainty. To date, such uncertainty has not been studied. We implement two gender classifier systems and test them with two different corpora and speaker verification systems. We find that estimated gender information can improve speaker verification accuracy over gender-independent methods. Our detailed analysis suggests that gender estimation must be sufficiently accurate to yield improvements in speaker verification performance.
In video-based training, clinicians practice and advance their skills on surgeries performed by their colleagues and themselves. Although microsurgeries are recorded daily, training centers lack the workforce to manually annotate the segments important for practitioners, such as instrument presence and position. In this work, we propose intelligent instrument detection using a Convolutional Neural Network (CNN) to augment microsurgical training. The network was trained on real microsurgical practice videos for which human annotators manually gathered a large corpus of instrument positions. Under the challenging conditions of a highly magnified and often blurred view, the CNN was able to correctly detect a needle-holder (a dominant tool in suturing practice) with 78.3% accuracy (F-score = 0.84) at a recognition speed above 15 FPS. The result is promising in the emerging domain of augmented medical training, where instrument recognition benefits microsurgical training.
As automatic speaker verification (ASV) systems are vulnerable to spoofing attacks, they are typically used in conjunction with spoofing countermeasure (CM) systems to improve security. For example, the CM can first determine whether the input is human speech, then the ASV can determine whether this speech matches the speaker's identity. The performance of such a tandem system can be measured with a tandem detection cost function (t-DCF). However, ASV and CM systems are usually trained separately, using different metrics and data, which does not optimize their combined performance. In this work, we propose to optimize the tandem system directly by creating a differentiable version of the t-DCF and by employing techniques from reinforcement learning. The results indicate that these approaches offer better outcomes than fine-tuning, with our method providing a 20% relative improvement in the t-DCF on the ASVspoof 2019 dataset in a constrained setting.
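The core idea of making a detection cost differentiable can be illustrated with a small sketch. Detection costs such as the t-DCF are built from miss and false-alarm rates, which involve non-differentiable hard threshold decisions; replacing the indicator function with a steep sigmoid yields soft error counts that admit gradients. The function names, the `alpha` steepness parameter, and the simplification to a single detector are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_error_rates(scores, labels, threshold, alpha=10.0):
    """Soft (differentiable) miss and false-alarm rates for one detector.

    The hard decision 1[score >= threshold] is replaced by
    sigmoid(alpha * (score - threshold)), so the error rates become
    smooth functions of the scores and gradients can flow back to
    the scoring model. alpha controls how closely the soft decision
    approximates the hard one.
    """
    p_accept = sigmoid(alpha * (scores - threshold))
    p_miss = np.mean((1.0 - p_accept)[labels == 1])  # targets rejected
    p_fa = np.mean(p_accept[labels == 0])            # non-targets accepted
    return p_miss, p_fa
```

In a full tandem setting, soft error rates for both the CM and the ASV subsystem would be combined with the t-DCF's cost and prior weights into a single differentiable objective.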
Mapping states to actions in deep reinforcement learning is mainly based on visual information. The commonly used approach for dealing with visual information is to extract pixels from images and use them as the state representation for the reinforcement learning agent. However, any vision-only agent is handicapped by being unable to sense audible cues. Using hearing, animals are able to sense targets that are outside of their visual range. In this work, we propose the use of audio as information complementary to vision in the state representation. We assess the impact of such a multi-modal setup on reach-the-goal tasks in the ViZDoom environment. Results show that the agent improves its behaviour when visual information is accompanied by audio features.
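The multi-modal state described above amounts to concatenating a visual representation with audio features into one vector that the agent's policy consumes. The sketch below, with assumed names and stand-in features (flattened pixels and pooled FFT magnitudes rather than a learned CNN encoder and mel spectrograms), shows the basic construction:

```python
import numpy as np

def build_state(frame, audio_chunk, n_bands=16):
    """Concatenate visual and audio features into a single state vector.

    frame: HxWxC uint8 image from the environment.
    audio_chunk: 1-D array of waveform samples for the current step.
    A real agent would typically pass the frame through a CNN encoder
    and compute mel-spectrogram features; flattened pixels and a
    band-pooled FFT magnitude serve as simple stand-ins here.
    """
    visual = (frame.astype(np.float32) / 255.0).ravel()
    spectrum = np.abs(np.fft.rfft(audio_chunk))
    # pool the spectrum into a fixed number of coarse frequency bands
    bands = np.array_split(spectrum, n_bands)
    audio = np.array([b.mean() for b in bands], dtype=np.float32)
    return np.concatenate([visual, audio])
```

The fixed-length result can be fed directly to any standard policy network, which is what makes simple concatenation an attractive first approach to multi-modal RL states.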