This paper investigates how to fuse grayscale and thermal video data for detecting foreground objects under challenging scenarios. To this end, we propose an intuitive yet effective method, called WEighted Low-rank Decomposition (WELD), which adaptively pursues the cross-modality low-rank representation. Specifically, we form two data matrices by accumulating sequential frames from the grayscale and the thermal videos, respectively. Within these two observation matrices, WELD detects moving foreground pixels as sparse outliers against the low-rank background structure, and incorporates weight variables to make the models of the two modalities complementary to each other. Smoothness constraints on object motion are also introduced in WELD to further improve robustness to noise. For optimization, we propose an iterative algorithm that efficiently solves the low-rank models via three sub-problems. Moreover, we employ a method based on edge-preserving filtering to substantially speed up WELD while preserving its accuracy. To provide a comprehensive evaluation benchmark for grayscale-thermal foreground detection, we create a new dataset of 25 aligned grayscale-thermal video pairs with high diversity. Extensive experiments on both the newly created dataset and the public OSU3 dataset show that WELD achieves superior performance and comparable efficiency against other state-of-the-art approaches.
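To make the low-rank-plus-sparse idea concrete, the sketch below decomposes a frame matrix `D` (one column per frame) into a low-rank background `L` and sparse foreground outliers `S` using a generic inexact-ALM robust PCA iteration. This is only an illustrative baseline under standard RPCA assumptions, not the paper's WELD model: it handles a single modality and omits the cross-modality weight variables and motion-smoothness constraints described above.

```python
import numpy as np

def rpca_sketch(D, lam=None, iters=100, rho=1.2):
    """Split D into low-rank L (background) + sparse S (foreground)
    with a basic inexact-ALM robust PCA loop.

    Generic illustration only -- no modality weights or smoothness
    terms from WELD."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))          # standard RPCA sparsity weight
    mu = 0.25 * m * n / (np.abs(D).sum() + 1e-12)  # initial penalty parameter
    Y = np.zeros_like(D)                           # Lagrange multipliers
    S = np.zeros_like(D)
    for _ in range(iters):
        # Low-rank update: singular-value thresholding of D - S + Y/mu
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Sparse update: elementwise soft-thresholding
        R = D - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Dual update and penalty growth
        Y = Y + mu * (D - L - S)
        mu = min(mu * rho, 1e7)
    return L, S

# Toy example: rank-1 "background" plus a few sparse "moving" pixels.
rng = np.random.default_rng(0)
bg = np.outer(rng.random(60), rng.random(20))  # 60 pixels x 20 frames
D = bg.copy()
D[5, 10] += 5.0   # injected foreground outliers
D[17, 3] -= 4.0
L, S = rpca_sketch(D)
```

In WELD the two modalities would contribute two such data matrices, with per-pixel weights coupling their decompositions so that each modality compensates for the other's failure cases.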