“…Since then, a series of frameworks have combined multi-view image information to infer camera motion and scene geometry [88,11,69,70,23,75]. While most works rely on generic network architectures, only a few combine learning with traditional geometric optimization [70,69,11]. We base our model on DeepV2D [70], which couples supervised depth training based on a cost volume architecture with geometric pose graph optimization.…”