Tracking an unknown and time-varying number of targets (e.g., speakers) in indoor environments using audio-visual (AV) modalities has attracted increasing interest in numerous fields, including video conferencing, individual speaker discrimination, and human-computer interaction. The audio-visual sequential Monte Carlo probability hypothesis density (AV-SMC-PHD) filter is a popular baseline for multi-target tracking, offering an elegant framework for fusing audio-visual information and handling a varying number of speakers. However, the performance of this filter can be adversely affected by the weight degeneracy problem, in which the weights of most particles become very small over the iterations of the algorithm while only a few remain significant. In this paper, we briefly discuss multi-target tracking.
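To make the weight degeneracy problem concrete, the following minimal sketch computes the effective sample size, N_eff = 1 / sum(w_i^2) over normalized weights, which is a common diagnostic for degeneracy in particle filters generally. It is not taken from the AV-SMC-PHD filter itself: the function name `effective_sample_size` and the two example weight sets are assumptions chosen purely for illustration.

```python
def effective_sample_size(weights):
    """Return N_eff = 1 / sum(w_i^2) after normalizing the weights to sum to 1.

    Hypothetical helper for illustration; not part of any specific filter.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


# Healthy particle set: all 100 particles carry equal weight.
uniform = [1.0] * 100

# Degenerate particle set: one particle dominates, the rest are negligible.
degenerate = [1.0] + [1e-6] * 99

ess_uniform = effective_sample_size(uniform)
ess_degenerate = effective_sample_size(degenerate)

print(f"uniform weights:    N_eff = {ess_uniform:.2f}")
print(f"degenerate weights: N_eff = {ess_degenerate:.2f}")
```

With equal weights, N_eff equals the number of particles; when a single particle dominates, N_eff drops towards one, meaning that almost all of the computation is spent on particles that contribute little to the estimate.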