We present a method to identify and localize people by combining existing CCTV camera infrastructure with the inertial sensors (accelerometer and magnetometer) in each person's mobile phone. Because a person's motion path, as observed by the camera, must match the local motion measurements from their phone, we can uniquely associate each person with a phone ID by detecting the statistical dependence between the phone and camera measurements. To do so, we model each person with a two-measurement HMM, with one camera measurement and one phone measurement, and use a maximum a posteriori formulation to find the most likely ID assignments. Through sensor fusion, our method largely bypasses the motion correspondence problem from computer vision and is able to track people across large spatial or temporal gaps in sensing. We evaluate the system through simulations and experiments in a real camera network testbed.
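
As a rough illustration of the ID-assignment step, the sketch below matches camera tracks to phone IDs by maximizing a joint log-likelihood. It is not the formulation used in this work: the per-step Gaussian heading model, the uniform prior over assignments (under which the maximum a posteriori and maximum likelihood solutions coincide), the exhaustive search, and all names (`heading_log_likelihood`, `map_assignment`, `sigma`) are assumptions made for the example. In a fuller treatment, the pairwise score would come from the two-measurement HMM rather than from an independent per-sample comparison.

```python
import itertools
import math


def heading_log_likelihood(camera_headings, phone_headings, sigma=0.5):
    """Log-likelihood that two heading sequences (radians) share one motion.

    Assumption: i.i.d. Gaussian noise on the wrapped heading difference.
    This stands in for a per-person HMM likelihood.
    """
    total = 0.0
    for c, p in zip(camera_headings, phone_headings):
        # Wrap the angular difference into (-pi, pi].
        diff = math.atan2(math.sin(c - p), math.cos(c - p))
        total += -0.5 * (diff / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return total


def map_assignment(camera_tracks, phone_traces):
    """Return {track_id: phone_id} maximizing the summed log-likelihood.

    Assumption: uniform prior over assignments, so MAP reduces to maximum
    likelihood. Exhaustive search is only practical for a handful of people.
    Returns None if there are fewer phones than camera tracks.
    """
    track_ids = sorted(camera_tracks)
    phone_ids = sorted(phone_traces)
    best_score, best = -math.inf, None
    for perm in itertools.permutations(phone_ids, len(track_ids)):
        score = sum(
            heading_log_likelihood(camera_tracks[t], phone_traces[p])
            for t, p in zip(track_ids, perm)
        )
        if score > best_score:
            best_score, best = score, dict(zip(track_ids, perm))
    return best


if __name__ == "__main__":
    # Hypothetical data: headings seen by the camera vs. reported by phones.
    camera_tracks = {"track_a": [0.0, 0.1, 0.2], "track_b": [1.5, 1.6, 1.4]}
    phone_traces = {"phone_1": [1.45, 1.65, 1.35], "phone_2": [0.05, 0.12, 0.18]}
    print(map_assignment(camera_tracks, phone_traces))
```

For larger groups, the exhaustive search could be replaced by a polynomial-time assignment solver; the brute-force version is kept here only because it makes the objective explicit.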