“…These works and others often focus on a robot's ability to handle a specific aspect of multi-party interaction: receiving and responding to multiple requests [15,26], group detection [28,29], speech recognition [10,11], gesture generation [18], body orientation generation [30], gaze generation [4], etc. Relevant studies in multi-party turn-taking [3,14] use hand-crafted features (e.g., whether someone is speaking, head pose, prosody) to decide when the robot should take a turn, but do not incorporate the content of the speech. The closest multi-party work to ours, [15], uses manually labeled human-human and human-robot data to learn low-level submodules governing how a bartender robot should interact with multiple customers (e.g., classifying user engagement or producing pre-defined utterances).…”