“…Training models with weak labels has come a long way in computer vision tasks such as semantic segmentation [44], [45], [46], object detection [47], [48], and temporal action localization (TAL) [21], [22], [23]. In contrast to fully supervised TAL [49], [50], [51], [52], WTAL methods do not require extensive frame-level annotations and instead adopt video-level [23], [53], [54], [55], [56] or point-level (key-frame) [57], [58], [59], [60] labels during training. Since different video-level WTAL approaches have different emphases, we can categorize them as foreground-only, background-assisted, or pseudo-label-guided.…”