“…The first step of our experiments aims to find the (C, K, p, n) combination that returns the best results in the classification of a target word (SWC), an environmental sound (US8K), and an ambulance siren (A3S-Synth). We train prototypical networks in the (C, K) ∈ {(2, 1), (2, 5), (5, 1), (5, 5), (10, 1), (10, 5), (10, 10)} configurations on the SWC and US8K datasets, while we employ (C, K) ∈ {(2, 1), (2, 5), (2, 10)} on A3S-Synth. At inference time, all models are evaluated by constructing positive and negative support embeddings in the same (p, n) combinations, with p ∈ {1, 5} and n ∈ {1, 5, 10, 50}.…”
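
To make the size of this search concrete, the grid described above can be enumerated as in the following minimal sketch. It assumes that every trained (C, K) model is evaluated on the full cross product of the listed p and n values, which is how we read "the same (p, n) combinations". The names `TRAIN_CONFIGS`, `P_VALUES`, `N_VALUES`, and `experiment_grid` are hypothetical and are not taken from the authors' code.

```python
from itertools import product

# (C, K) training configurations per dataset, as listed in the text.
SHARED_CONFIGS = [(2, 1), (2, 5), (5, 1), (5, 5), (10, 1), (10, 5), (10, 10)]
TRAIN_CONFIGS = {
    "SWC": SHARED_CONFIGS,
    "US8K": SHARED_CONFIGS,
    "A3S-Synth": [(2, 1), (2, 5), (2, 10)],
}

# Numbers of positive (p) and negative (n) support examples used at inference.
P_VALUES = (1, 5)
N_VALUES = (1, 5, 10, 50)


def experiment_grid():
    """Return every (dataset, C, K, p, n) tuple in the search.

    Assumption: each trained model is evaluated with all (p, n) pairs.
    """
    grid = []
    for dataset, configs in TRAIN_CONFIGS.items():
        for (c, k), (p, n) in product(configs, product(P_VALUES, N_VALUES)):
            grid.append((dataset, c, k, p, n))
    return grid


if __name__ == "__main__":
    grid = experiment_grid()
    for dataset in TRAIN_CONFIGS:
        count = sum(1 for row in grid if row[0] == dataset)
        print(f"{dataset}: {count} (C, K, p, n) combinations")
    print(f"total: {len(grid)}")
```

Under that assumption, SWC and US8K each contribute 7 × 8 = 56 combinations and A3S-Synth contributes 3 × 8 = 24, for 136 evaluations in total.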