Depth Fusion For Large Scale Environments

This project aimed to implement spatial hashing for the TSDF volume data structure in the OpenCV RGBD module, and leverage the same to build a scalable submap based online 3D scene reconstruction system with little to no drift.

TSDF volumes are widely agreed upon in the research community to be locally consistent with minimal drift, therefore a natural mapping model is a PoseGraph[2] of overlapping submaps, each representing a local section of the entire scene. This mapping model allows for periodic global optimization, which corrects accumulating drift retrospectively in the model, as new sensor measurements are incorporated.

The following delineates the pipeline:

This implementation uses the existing Kinect Fusion[1] pipeline in OpenCV.


A sample demo of the application is as shown below, running on the ICL-NUIM dataset:

The implementation in OpenCV contains the following primitives.

2.1. Hash-table based TSDF volume

This implementation is based on the seminal work on Voxel hashing[7]. A regular TSDF volume represents a scene as a 3D volume grid of (truncated) signed distance functions. These truncated signed distance functions are simply the shortest distance of each point to its closest surface. While this is simple to implement, this constrains a user to reconstruct scenes of limited size, since the volume size has to be preallocated.

The Hash based volume, stores the volume as an unordered_map of smaller TSDF volume units (volumeUnits), each representing canonically 83 or 163 resolution. These volume units are indexed by a 3 dimensional Vector, which is hashed appropriately to minimise the number of hash collisions[3].

This TSDF volume requires the following important functionalities:

The following PR provides a CPU implementation of Hash-table TSDF volume in OpenCV:

[GSoC] Add new Hashtable based TSDF volume to RGBD module by akashsharma02 · Pull Request #2566 · opencv/opencv_contrib
This work is part of GSoC and is potentially for the first evaluation phase of the program. My mentor for the GSoC time period is: @savuor My proposal is available here: Please note that this is still Work in Progress, as the system is not yet reliable and does not work well.
2.2. Submap

The submap class is an abstraction over the hashTSDF volume to support the large_kinfu pipeline. Some questions that are especially relevant with submap based 3D reconstruction are:

  1. What is the appropriate size of the submap volume?
  2. When do you terminate and initialize a new submap?
  3. How do you track and create constraints between multiple submaps in a scene for downstream optimization of poses?

We address question 1. and 2. by using a camera overlap metric, which states that if the current camera frame consistently views - for a threshold number of frames - a new set of volume units that are only allocated recently and haven't been part of the older frames, then it means that the camera must have moved to a new part of the scene.[4] Once a new submap is instantiated, we initialize it with a submap \(SE(3)\) pose of the frame at which it was created.

We maintain a list of active submaps, all of which are simultaneously tracked at each time-step. The simultaneous tracking provides us with a camera pose w.r.t each submap as \(T^t_{s_i c}\) where \(s_i\) represents the \(i^{th}\) submap coordinate frame, \(c\) represents the camera coordinate frame and \(t\) represents the current time-step. The relative constraint at each time-step between submap \(s_j\) and \(s_i\) can be obtained as:

$$T^t_{s_j s_i} = {T^t_{s_i c}}^{-1} \circ T^t_{s_j c}$$

A robust estimate of the constraints between submaps over multiple timesteps is then obtained using a simple implementation of Iteratively Reweighted Least Squares (IRLS), which eliminates outlier constraint estimates using the Huber norm.[4]

2.3 PoseGraph optimization

Once we have a scene containing dynamically created submaps, we are required to optimize the submap poses to eliminate accumulating camera tracking drift and improved reconstruction

We implement a simple PoseGraph class, and implement second order optimization methods such as Gauss Newton, and Levenberg Marquardt.

The idea here is to abstract the submaps as nodes of 3D \(SE(3)\) poses, and to use the sensor measurements i.e., the robust Pose Constraints between submaps, as obtained from the previous section to correct the pose estimates. For a given submap pair \(s_j\) and \(s_i\) with an existing pose constraint \(\hat{T}_{s_j s_i}\), we formulate an error metric (factor) as follows:

$$r = \hat{T}{s_j s_i} \ominus (T_{s_i c} \ominus T_{s_j c})$$

Where \(\ominus\) denotes the \(SE(3)\) right composition i.e., \(A \ominus B \triangleq Log(B^{-1} \circ A)\)[5]

We minimize the residual rr by linearizing the function and then solving the linear system of equations using a Cholesky solver or a QR solver. (We leverage Eigen library for the linear system solver).[2]

NOTE: Currently, the implementation of Levenberg Marquardt is unstable, and thus for the time being Ceres library is used for the same[6]. However, we will refine the implementation of the optimizer to make the module dependency-free in the future.

The following Pull Request implements the Submap and PoseGraph optimization:

[GSoC] [WIP] Add Submaps and PoseGraph optimization for Large Scale Depth Fusion by akashsharma02 · Pull Request #2619 · opencv/opencv_contrib
This work is part of GSoC and is for the 2nd and 3rd part of the evaluation phase of the program. My mentor for the GSoC time period is: @savuor My proposal is available here: Checklist before review: Track multiple submaps simultaneously Posegraph data structure and Optimizer (Uses Ceres for PoseGraph optimization) Add support for computing constraints between submaps (In progress/Partially done) Relocalization of tracking (Dropping for GSoC.
Extensions and Future Work

A large omission in this work is the Relocalization module that is imperative to prevent spurious creation of submaps when the camera revisits previously created submap sections. I plan to add this extension after GSoC.

Typically relocalization modules are implemented using Fast Bag of Words or Dynamic Bag of Words (FBoW, DBoW2/3) [8] methods, which maintain a vocabulary of distinct keyframes as bag of words which can be quickly queried during tracking to detect cases of loop closure and can be used for camera relocalization if tracking fails.


It has been very pleasant working with my mentor Rostislav Vasilikhin, (and with Mihai Bujanca), who was enthusiastic and forthcoming with all the issues during the 3 months of Google Summer of Code.

It was especially rewarding and pedagogic implementing modules that form the basis of most Dense SLAM systems today, and the added benefit of making a significant open-source contribution in the process has made summer of 2020 worthwhile and interesting.

I would like to thank Google for giving me this opportunity as well!


[1] Newcombe, Richard A., Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. “KinectFusion: Real-Time Dense Surface Mapping and Tracking.” In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, 127–36, 2011.

[2] Dellaert, Frank, and Michael Kaess. “Factor Graphs for Robot Perception.” Foundations and Trends in Robotics 6, no. 1–2 (2017): 1–139.

[3] Daniel James. “Boost Hash combine” hash_combine

[4] Kähler, Olaf, Victor A. Prisacariu, and David W. Murray. “Real-Time Large-Scale Dense 3D Reconstruction with Loop Closure.” In Computer Vision – ECCV 2016, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 9912:500–516. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2016.

[5] Solà, Joan, et al. “A Micro Lie Theory for State Estimation in Robotics.” ArXiv:1812.01537 [Cs], Aug. 2020,

[6]Ceres Solver — A Large Scale Non-Linear Optimization Library. Accessed 29 Aug. 2020.

[7] Nießner, Matthias, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. “Real-Time 3D Reconstruction at Scale Using Voxel Hashing.” ACM Transactions on Graphics 32, no. 6 (November 1, 2013): 1–11.

[8] Galvez-López, Dorian, and Juan D. Tardos. “Bags of Binary Words for Fast Place Recognition in Image Sequences.” IEEE Transactions on Robotics 28, no. 5 (October 2012): 1188–97.