“…They consist of spatio-temporal information sourced from two representations (BEV and RV), physical object dimensions encoded in the input BEV images, occlusion information provided by the RV images, and rich semantics conveyed by a camera image. When these features are fed into the MotionNet backbone network, they yield accurate pixel-wise joint perception and motion prediction in real time.…

Output:
    [1], :] += 1
  end
  P /= count           // average (avoid division by 0)
  mask = (count == 0)
  P[mask, :] = -1      // assign -1 to empty cells
end”
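The averaging step in the quoted pseudocode fragment can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the beginning of the algorithm is truncated in the excerpt, so the function name `average_features_per_cell`, the `cells` array of per-point (row, col) grid indices, and the `feats` array are hypothetical. Only `P`, `count`, `mask`, and the `-1` fill value come from the fragment. The sketch also clamps the count to 1 before dividing, which is one way to honor the "avoid division by 0" comment; the fragment itself divides first and masks afterwards.

```python
import numpy as np


def average_features_per_cell(cells, feats, grid_shape):
    """Average point features per grid cell; empty cells are filled with -1.

    cells:      (N, 2) integer array of (row, col) indices (hypothetical input).
    feats:      (N, C) float array of per-point features (hypothetical input).
    grid_shape: (H, W) size of the grid.
    """
    h, w = grid_shape
    rows, cols = cells[:, 0], cells[:, 1]

    P = np.zeros((h, w, feats.shape[1]), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.int64)

    # Unbuffered accumulation, so points that share a cell are all counted.
    np.add.at(P, (rows, cols), feats)
    np.add.at(count, (rows, cols), 1)

    mask = count == 0
    # Clamp the divisor to 1 so empty cells do not trigger a division by zero.
    P /= np.maximum(count, 1)[..., None]
    P[mask, :] = -1  # assign -1 to empty cells
    return P, count


if __name__ == "__main__":
    cells = np.array([[0, 0], [0, 0], [1, 1]])
    feats = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    P, count = average_features_per_cell(cells, feats, (2, 2))
    print(P[0, 0], P[0, 1], count[0, 0])
```

`np.add.at` is used instead of `P[rows, cols] += feats` because fancy-indexed in-place addition applies only one update per repeated index, which would undercount cells that contain several points.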