This project aimed to implement spatial hashing for the TSDF volume data structure in the OpenCV RGBD module, and leverage the same to build a scalable submap based online 3D scene reconstruction system with little to no drift.
TSDF volumes are widely agreed upon in the research community to be locally consistent with minimal drift, therefore a natural mapping model is a PoseGraph^{[2]} of overlapping submaps, each representing a local section of the entire scene. This mapping model allows for periodic global optimization, which corrects accumulating drift retrospectively in the model, as new sensor measurements are incorporated.
The following delineates the pipeline:
This implementation uses the existing Kinect Fusion^{[1]} pipeline in OpenCV.
A sample demo of the application is as shown below, running on the ICL-NUIM dataset:
The implementation in OpenCV contains the following primitives.
This implementation is based on the seminal work on Voxel hashing^{[7]}. A regular TSDF volume represents a scene as a 3D volume grid of (truncated) signed distance functions. These truncated signed distance functions are simply the shortest distance of each point to its closest surface. While this is simple to implement, this constrains a user to reconstruct scenes of limited size, since the volume size has to be preallocated.
The Hash based volume, stores the volume as an unordered_map of smaller TSDF volume units (volumeUnits), each representing canonically 8^{3} or 16^{3} resolution. These volume units are indexed by a 3 dimensional Vector, which is hashed appropriately to minimise the number of hash collisions[3].
This TSDF volume requires the following important functionalities:
The following PR provides a CPU implementation of Hash-table TSDF volume in OpenCV:
The submap class is an abstraction over the hashTSDF volume to support the large_kinfu pipeline. Some questions that are especially relevant with submap based 3D reconstruction are:
We address question 1. and 2. by using a camera overlap metric, which states that if the current camera frame consistently views - for a threshold number of frames - a new set of volume units that are only allocated recently and haven't been part of the older frames, then it means that the camera must have moved to a new part of the scene.[4] Once a new submap is instantiated, we initialize it with a submap \(SE(3)\) pose of the frame at which it was created.
We maintain a list of active submaps, all of which are simultaneously tracked at each time-step. The simultaneous tracking provides us with a camera pose w.r.t each submap as \(T^t_{s_i c}\) where \(s_i\) represents the \(i^{th}\) submap coordinate frame, \(c\) represents the camera coordinate frame and \(t\) represents the current time-step. The relative constraint at each time-step between submap \(s_j\) and \(s_i\) can be obtained as:
A robust estimate of the constraints between submaps over multiple timesteps is then obtained using a simple implementation of Iteratively Reweighted Least Squares (IRLS), which eliminates outlier constraint estimates using the Huber norm.^{[4]}
Once we have a scene containing dynamically created submaps, we are required to optimize the submap poses to eliminate accumulating camera tracking drift and improved reconstruction
We implement a simple PoseGraph class, and implement second order optimization methods such as Gauss Newton, and Levenberg Marquardt.
The idea here is to abstract the submaps as nodes of 3D \(SE(3)\) poses, and to use the sensor measurements i.e., the robust Pose Constraints between submaps, as obtained from the previous section to correct the pose estimates. For a given submap pair \(s_j\) and \(s_i\) with an existing pose constraint \(\hat{T}_{s_j s_i}\), we formulate an error metric (factor) as follows:
$$r = \hat{T}{s_j s_i} \ominus (T_{s_i c} \ominus T_{s_j c})$$
Where \(\ominus\) denotes the \(SE(3)\) right composition i.e., \(A \ominus B \triangleq Log(B^{-1} \circ A)\)^{[5]}
We minimize the residual rr by linearizing the function and then solving the linear system of equations using a Cholesky solver or a QR solver. (We leverage Eigen library for the linear system solver).^{[2]}
NOTE: Currently, the implementation of Levenberg Marquardt is unstable, and thus for the time being Ceres library is used for the same^{[6]}. However, we will refine the implementation of the optimizer to make the module dependency-free in the future.
The following Pull Request implements the Submap and PoseGraph optimization:
A large omission in this work is the Relocalization module that is imperative to prevent spurious creation of submaps when the camera revisits previously created submap sections. I plan to add this extension after GSoC.
Typically relocalization modules are implemented using Fast Bag of Words or Dynamic Bag of Words (FBoW, DBoW2/3) ^{[8]} methods, which maintain a vocabulary of distinct keyframes as bag of words which can be quickly queried during tracking to detect cases of loop closure and can be used for camera relocalization if tracking fails.
It has been very pleasant working with my mentor Rostislav Vasilikhin, (and with Mihai Bujanca), who was enthusiastic and forthcoming with all the issues during the 3 months of Google Summer of Code.
It was especially rewarding and pedagogic implementing modules that form the basis of most Dense SLAM systems today, and the added benefit of making a significant open-source contribution in the process has made summer of 2020 worthwhile and interesting.
I would like to thank Google for giving me this opportunity as well!