“…The dilation value has a minimum of 1, which is equivalent to standard NA, and an upper bound of ⌊n/k⌋, where n is the number of tokens and k is the kernel (neighborhood) size. DilateFormer [45] shows that distant patches in the shallow layers are mostly irrelevant to semantic modeling in mainstream vision tasks, so we set different dilation values in the shallow and deep layers. Specifically, we set the dilation values to 1, 1, 1, 1, 1, 2, 1, 2, 1, 3, 1, and 3 in the first two RDiNAGs and to 1, 1, 1, 2, 1, 3, 1, 4, 1, 6, 1, and 8 in the last four RDiNAGs.…”
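
The bound and the two schedules quoted above can be checked with a short sketch. This is only an illustration under stated assumptions: the names `max_dilation`, `check_schedule`, `SHALLOW_DILATIONS`, and `DEEP_DILATIONS` are hypothetical and do not come from the paper's code, and the token count and kernel size in the usage note are example values, not the paper's settings.

```python
from typing import Sequence

# Dilation schedules as listed in the quoted passage.
SHALLOW_DILATIONS = (1, 1, 1, 1, 1, 2, 1, 2, 1, 3, 1, 3)  # first two RDiNAGs
DEEP_DILATIONS = (1, 1, 1, 2, 1, 3, 1, 4, 1, 6, 1, 8)     # last four RDiNAGs


def max_dilation(n: int, k: int) -> int:
    """Return the upper bound floor(n / k) for n tokens and kernel size k."""
    if k < 1 or n < k:
        raise ValueError("expected 1 <= k <= n")
    return n // k


def check_schedule(dilations: Sequence[int], n: int, k: int) -> bool:
    """Return True if every dilation d satisfies 1 <= d <= floor(n / k)."""
    upper = max_dilation(n, k)
    return all(1 <= d <= upper for d in dilations)
```

For example, with an assumed n = 64 tokens and k = 7, `max_dilation(64, 7)` is 9, so both schedules pass `check_schedule`. With an assumed n = 32 and k = 7 the bound drops to 4, so the deep schedule (which contains 6 and 8) fails while the shallow one still passes.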