“…Self-supervised learning. Our work builds off efforts to learn perceptual models that are "self-supervised" by leveraging natural contextual signals in images [10,22,33,38,24], videos [46,32,43,44,13,20], and even radio signals [48]. These approaches utilize the power of supervised learning while not requiring manual annotations, instead deriving supervisory signals from the structure in …

Procedure to generate the sound of a pixel: pixel-level visual features are extracted by temporal max-pooling over the output of a dilated ResNet applied to T frames.…”
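
The temporal max-pooling step mentioned in the caption can be illustrated with a minimal sketch. It assumes the per-frame network output is already available as nested lists indexed `[t][k][y][x]` (frame, channel, row, column); that layout, the name `temporal_max_pool`, and the toy values are hypothetical and not taken from the paper. The dilated ResNet itself is not modeled here, and a real pipeline would more likely take a maximum over the time axis of a tensor.

```python
from typing import List

# Assumed layout: one frame's features are indexed [k][y][x]
# (channel, row, column). This is an illustrative choice only.
FrameFeatures = List[List[List[float]]]


def temporal_max_pool(features: List[FrameFeatures]) -> FrameFeatures:
    """Collapse T per-frame feature maps into one map by taking,
    for every channel and pixel, the maximum value over the T frames."""
    if not features:
        raise ValueError("need at least one frame")
    num_channels = len(features[0])
    height = len(features[0][0])
    width = len(features[0][0][0])
    return [
        [
            [max(frame[k][y][x] for frame in features) for x in range(width)]
            for y in range(height)
        ]
        for k in range(num_channels)
    ]


# Toy stand-in for the network output: T=3 frames, K=2 channels, 1x2 pixels.
toy_features = [
    [[[0.1, 0.9]], [[0.5, 0.2]]],
    [[[0.4, 0.3]], [[0.1, 0.8]]],
    [[[0.2, 0.7]], [[0.6, 0.0]]],
]
pooled = temporal_max_pool(toy_features)
```

The result has the same channel and spatial shape as a single frame, so each pixel ends up with one feature vector regardless of how many frames were pooled.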