Figure 1: Given a video sequence captured from an egocentric viewpoint, we segment all the objects that the actor/observer interacts with. We achieve this with a neural architecture, NeuralDiff, that learns to decompose each frame into a static background and a dynamic foreground, the latter comprising the manipulated objects, which move only occasionally in the sequence, and the actor's body, which moves continually and heavily occludes the scene. The network contains three streams that, via different inductive biases, reconstruct the background, the objects, and the actor in 3D; it can therefore render images and their segmentations even from viewpoints that do not occur in the original video sequence (see the left part of the figure, where the camera is assumed to remain at its initial position while the action unfolds).
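To make the three-stream decomposition concrete, below is a minimal, hypothetical PyTorch sketch of the compositing idea: each stream predicts a colour and a density per 3D point, the densities add, the colours are mixed in proportion to each stream's density, and that same density share acts as a soft per-stream segmentation. All names (`Stream`, `ThreeStreamField`) and the tiny MLPs are illustrative assumptions; the actual model additionally uses positional encodings, per-frame codes, stream-specific inductive biases, and full volume rendering along camera rays.

```python
import torch
import torch.nn as nn


class Stream(nn.Module):
    """One radiance-field head predicting colour and density per 3D point.

    Hypothetical minimal MLP standing in for one of the three streams.
    """

    def __init__(self, in_dim: int = 3, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 RGB channels + 1 density
        )

    def forward(self, x: torch.Tensor):
        out = self.mlp(x)
        rgb = torch.sigmoid(out[..., :3])    # colour in [0, 1]
        sigma = torch.relu(out[..., 3:])     # non-negative density
        return rgb, sigma


class ThreeStreamField(nn.Module):
    """Composites background, foreground-object, and actor streams."""

    def __init__(self):
        super().__init__()
        self.background = Stream()
        self.foreground = Stream()
        self.actor = Stream()

    def forward(self, points: torch.Tensor):
        rgb_b, sig_b = self.background(points)
        rgb_f, sig_f = self.foreground(points)
        rgb_a, sig_a = self.actor(points)
        sigma = sig_b + sig_f + sig_a
        eps = 1e-8
        # Density-weighted colour mix of the three streams.
        rgb = (sig_b * rgb_b + sig_f * rgb_f + sig_a * rgb_a) / (sigma + eps)
        # Soft segmentation: each stream's share of the total density.
        seg = torch.cat([sig_b, sig_f, sig_a], dim=-1) / (sigma + eps)
        return rgb, sigma, seg


if __name__ == "__main__":
    model = ThreeStreamField()
    pts = torch.rand(1024, 3)  # stand-in for 3D samples along camera rays
    rgb, sigma, seg = model(pts)
    print(rgb.shape, sigma.shape, seg.shape)  # (1024, 3) (1024, 1) (1024, 3)
```

Because the streams only interact through this additive compositing, each can carry its own inductive bias (a static background, slowly changing objects, a fast-moving actor), and the soft segmentation falls out of the learned densities rather than requiring explicit mask supervision.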