“…We use them to label/tag the actors and track their temporal behavior. With steering-wheel-free and pedal-free cars around the corner [11], [12], [13], it is only natural that Multi-Object Tracking (MOT) [14], [15], [16] and language-based navigation [17], [18], [19], [20], [21] will be inevitable features of any future SDV. Complementary appearance cues present in the BEV occupancy space and the RGB image space provide strong priors for the above-mentioned tasks.…”