“…Following the serialized output training (SOT) framework [28], a multi-talker transcription is represented as a single sequence $Y$ by concatenating the word sequences of the individual speakers with a special "speaker change" symbol $\langle sc \rangle$. For example, the reference token sequence to $Y$ for the three-speaker case is given as $R = \{r^1_1, \ldots, r^1_{N^1}, \langle sc \rangle, r^2_1, \ldots, r^2_{N^2}, \langle sc \rangle, r^3_1, \ldots, r^3_{N^3}, \langle eos \rangle\}$, where $r^j_i$ represents the $i$-th token of the $j$-th speaker. A special symbol $\langle eos \rangle$ is inserted at the end of all transcriptions to determine the termination of inference.…”
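
The serialization described in the passage can be illustrated with a minimal Python sketch. It is not taken from the cited work: the function name `serialize_sot`, the string spellings `"<sc>"` and `"<eos>"`, and the example words are all hypothetical, and the speaker order is assumed to be decided by the caller, since the passage does not say how speakers are ordered.

```python
from typing import List, Sequence

# Assumed string spellings for the two special symbols; the passage only
# names them "sc" (speaker change) and "eos" (end of sequence).
SC = "<sc>"
EOS = "<eos>"


def serialize_sot(speaker_tokens: Sequence[Sequence[str]]) -> List[str]:
    """Concatenate per-speaker token sequences into one reference sequence.

    A speaker-change symbol is placed between consecutive speakers and a
    single end-of-sequence symbol closes the whole transcription.
    The order of `speaker_tokens` is taken as given (an assumption).
    """
    serialized: List[str] = []
    for j, tokens in enumerate(speaker_tokens):
        if j > 0:
            # Separate speaker j from the previous speaker.
            serialized.append(SC)
        serialized.extend(tokens)
    # One terminating symbol after the last speaker's tokens.
    serialized.append(EOS)
    return serialized


# Three-speaker example with made-up words.
reference = serialize_sot([["hello", "there"], ["good", "morning"], ["hi"]])
print(reference)
# ['hello', 'there', '<sc>', 'good', 'morning', '<sc>', 'hi', '<eos>']
```

With three speakers the result contains exactly two speaker-change symbols and one end symbol, matching the structure of $R$ in the passage.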