
3D-Reconstruction with Uncalibrated Stereo

Problem Statement

As a drone delivery company, we face challenges in efficiently navigating urban environments with complex structures and obstacles. Traditional 2D mapping techniques may not provide sufficient situational awareness, hindering safe and precise delivery operations. To address this, we need a robust 3D reconstruction solution that can accurately model the 3D environment, enabling our drones to plan optimal flight paths, detect obstacles, and ensure reliable and timely deliveries in densely populated areas. This technology should enhance our drone's perception capabilities, allowing for better collision avoidance and navigation in dynamic and ever-changing urban landscapes.

Abstract

In our study, we evaluated the accuracy and efficiency of Multi-View Stereo (MVS) and Structure-from-Motion (SFM) technologies for 3D reconstruction. We utilized publicly available online datasets to assess the performance of these methods in generating detailed 3D models. The results highlighted MVS's proficiency in producing dense and intricate reconstructions. Meanwhile, SFM demonstrated greater efficiency in scenarios where a wide range of viewpoints was involved.

To implement 3D reconstruction on the drone, we will integrate a camera or a stereo camera under the drone. The camera(s) will capture multiple images of the environment from different viewpoints as the drone flies. These images will then be processed using Structure-from-Motion (SFM) or Multi-View Stereo (MVS) algorithms.

In the SFM approach, the drone will capture a series of 2D images while moving through the area of interest. Our algorithm will analyze the images to estimate the camera poses and triangulate the 3D points in the scene. These 3D points will be combined to create a detailed 3D reconstruction of the environment.

Alternatively, with a stereo camera, the drone will have two synchronized cameras placed side by side to capture stereoscopic image pairs. The depth information from the stereo images can be used to perform real-time depth mapping and 3D reconstruction of the scene.

The implemented 3D reconstruction system will enable the drone to create accurate 3D models of the surroundings in real-time. This capability will facilitate obstacle detection and avoidance, optimize path planning, and enhance overall situational awareness during drone missions.

Special thanks to Prof. Shree Nayar for his awesome explanations on Stereo Vision!

Datasets

  1. KITTI Dataset

  2. Middlebury Datasets:

  3. Fountain Dataset

Plan of Action

  1. Uncalibrated Stereo
  2. Epipolar Geometry
  3. Essential Matrix
  4. Fundamental Matrix
  5. Correspondences
  6. Estimating Depth
  7. Multi-View Stereo
  8. Structure from Motion

1. Uncalibrated Stereo

In the previous project - Pseudo LiDARS with Stereo Vision - we saw how we could use two cameras separated by a distance b (the baseline) and, using the intrinsic and extrinsic parameters, calculate the disparity, build a depth map, and ultimately recover the distances of objects. In that scenario, all the parameters were provided to us by the KITTI dataset.

But what if we have two images of the same scene taken by two different people, perhaps with two different cameras, and we only know the internal parameters of the cameras? From these two arbitrary views, can we compute the translation and rotation of one camera w.r.t. the other camera? If so, can we compute a 3D model of the scene? In this project, we will devise a method to estimate the 3D structure of a static scene from two arbitrary views.

Assumption:

  1. Intrinsic parameters K_l and K_r are known for both cameras (obtained from the image metadata).
  2. Extrinsic parameters (relative position and orientation of the cameras) are unknown.

Steps:

  1. Find external parameters, i.e., the relationship between the 3D coordinate frames of the two cameras.

  2. To achieve this, we need to find reliable correspondences between the two arbitrary views. For calibration, a minimum of eight matches is sufficient.

  3. Once correspondences are established, we can determine the rotation and translation of each camera with respect to the other. This calibrates the stereo system, and then we can find dense correspondences.

  4. Dense correspondences mean finding the corresponding points in the right image for every point in the left image. Knowing the rotation and translation facilitates a one-dimensional search to find the dense correspondences.

  5. With the dense correspondence map, we can triangulate and find the 3D location of points in the scene, computing their depth.


2. Epipolar Geometry

Our goal is to find the relative position and orientation between the two cameras (calibrating an uncalibrated stereo system). The relative position and orientation are described by the epipolar geometry of the stereo system. It is represented as t and R in the image below.

2.1 Epipoles

The epipoles of the stereo system, e_l and e_r, are the projections of the center of the right camera onto the left camera image and of the center of the left camera onto the right camera image, respectively. Note that for any stereo system, the epipoles are unique to that system.

2.2 Epipolar Plane

The epipolar plane of a scene point P is defined as the plane formed by the camera origins O_l and O_r, the epipoles e_l and e_r, and the scene point P. Note that each point in the scene has a unique epipolar plane.

2.3 Epipolar Constraint

We first compute a vector that is normal to the epipolar plane: the cross-product of the unknown translation vector t and the vector x_l that corresponds to point P in the left coordinate frame. Moreover, since x_l itself lies in the epipolar plane, its dot product with this normal vector must be zero, which then gives us our epipolar constraint:

x_l · (t × x_l) = 0

We re-write our epipolar constraint in vector and then matrix form. Note that the 3x3 skew-symmetric matrix [t_x] built from t is known as the translation matrix:

x_l^T [t_x] x_l = 0     (1)

We have a second constraint: we can relate the three-dimensional coordinates of point P in the left camera to the three-dimensional coordinates of the same point in the right camera using t and R:

x_l = R x_r + t     (2)

Now if we substitute equation 2 into equation 1:

x_l^T [t_x] (R x_r + t) = 0

In the equation above, the product of the translation matrix with the translation vector is zero, since [t_x] t = t × t = 0.

Hence, we are left with the equation below:

x_l^T [t_x] R x_r = 0

The product of the translation matrix with the rotation matrix gives the Essential Matrix:

E = [t_x] R

Hence, plugging it into our equation above:

x_l^T E x_r = 0


3. Essential Matrix

Note that it is possible to decompose our Essential Matrix into the translation and rotation matrices. Notice that the translation matrix [t_x] is a skew-symmetric matrix and R is an orthonormal matrix. It is possible to decouple the translation matrix and rotation matrix from their product using Singular Value Decomposition. Therefore, if we can compute the Essential Matrix, we can calculate the translation t and rotation R. Once we have these two unknowns, we have calibrated our uncalibrated stereo system.

One issue with the epipolar constraint is that x_l and x_r both represent the 3D coordinates of the same scene point, and we do not have this information. That is, when we take two pictures, two arbitrary views of the scene, we do not have the locations of points in the scene in 3D. In fact, that's exactly what we hope to recover ultimately.

However, we do have the image coordinates of the scene points, that is, the projections of the scene onto the two images: (u_l, v_l) and (u_r, v_r).

To incorporate the image coordinates, we start with the perspective projection equations of the left camera, where f_x, f_y are the focal lengths in pixels and (o_x, o_y) is the principal point:

u_l = f_x (x_l / z_l) + o_x,    v_l = f_y (y_l / z_l) + o_y

We multiply the two equations by z_l:

z_l u_l = f_x x_l + o_x z_l,    z_l v_l = f_y y_l + o_y z_l

Now, using homogeneous coordinates:

z_l [u_l, v_l, 1]^T = K_l [x_l, y_l, z_l]^T

We then have the product of the 3x3 camera matrix K_l with the three-dimensional coordinates (x_l, y_l, z_l) of the scene point in the left camera frame. Note that we know the camera matrix, as we assumed we know all the internal parameters of the two cameras.

We then have one such equation for the left camera and one for the right camera, both corresponding to the same point in the scene. Re-writing them in a more compact form, with u_l and u_r denoting the homogeneous image points [u_l, v_l, 1]^T and [u_r, v_r, 1]^T:

z_l u_l = K_l x_l,    z_r u_r = K_r x_r

We want to make x_l and x_r the subject of the formula:

Left camera:

x_l = z_l K_l^{-1} u_l

Right camera:

x_r = z_r K_r^{-1} u_r

Recall our epipolar constraint:

x_l^T E x_r = 0

We now substitute the two expressions above into the constraint:

z_l z_r (u_l^T K_l^{-T} E K_r^{-1} u_r) = 0

Note that in the equation above we have our image coordinates and the Essential Matrix; however, we also have z_l and z_r, which are the depths of the same scene point in the two cameras.

In the camera coordinate frame, the origin is located at the effective pinhole of the camera. Consequently, the depth of any point cannot be zero, except when the point coincides with the center of projection. Considering that the scene lies in front of the cameras, it is reasonable to assert that the depth values z_l and z_r are not equal to zero.

Hence, if z_l and z_r are not zero, the remaining factor must be zero, because the right-hand side is zero. So we can simply eliminate z_l and z_r from the equation to get the expression:

u_l^T K_l^{-T} E K_r^{-1} u_r = 0

Note that K_l^{-T} and K_r^{-1} are 3x3 matrices; hence, the product of K_l^{-T}, the Essential Matrix, and K_r^{-1} is also a 3x3 matrix, known as the Fundamental Matrix:

F = K_l^{-T} E K_r^{-1}

Re-writing the constraint:

u_l^T F u_r = 0

One important observation is that finding the fundamental matrix using this constraint enables us to easily derive the essential matrix. Since we have prior knowledge of the internal parameters K_l and K_r, determining the essential matrix becomes straightforward: E = K_l^T F K_r. By applying singular value decomposition to the essential matrix, taking into account the skew-symmetry of the translation matrix [t_x] and the orthonormality of the rotation matrix R, we can decompose it into its constituent parts to obtain t and R.


4. Fundamental Matrix

We now have our epipolar constraint, which essentially tells us that there is a single 3x3 matrix - the fundamental matrix that we need to find to calibrate our uncalibrated stereo system. So our goal here is to estimate the fundamental matrix.

Initially, our task is to identify a limited number (normally 8) of corresponding features in the provided pair of images. Although the features might appear different due to the distinct viewpoints, applying techniques like SIFT to both images allows us to obtain reliable matches. These matches will serve as a starting point for further analysis.

Let's take one of these correspondences of the left and right images and plug them into our epipolar constraint:

We know the image coordinates in the left and right camera images; hence, we only need to find the Fundamental Matrix. We expand the matrix product to get one linear equation per scene point:

If we now stack all these equations for all corresponding points then we get:

Which can also be written as a more compact form as:

Note that we know everything in matrix A, as it only contains the image coordinates in the left and right cameras. We also have the fundamental matrix written as a vector f. The fundamental matrix F and any non-zero scalar multiple kF describe the same epipolar geometry. This implies that F is only defined up to a scale factor, allowing for scaling without changing the resulting epipolar geometry. Consequently, the scale of F can be fixed arbitrarily.

Next, we need to find the least-squares solution for the fundamental matrix F. We want Af to be as close to 0 as possible while ||f|| = 1 (which is the same as saying we are fixing the scale of F):

minimize ||Af||^2    such that    ||f||^2 = 1

This is referred to as a constrained linear least squares problem.

Recall, we saw the same when solving the Projection Matrix during Camera calibration in the Pseudo LiDARS with Stereo Vision project.
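A minimal sketch of this estimation (the function name is hypothetical, and Hartley's coordinate normalization is omitted for brevity): the right singular vector of A associated with its smallest singular value minimizes ||Af|| subject to ||f|| = 1.

    import numpy as np

    def estimate_fundamental_matrix(pts_left, pts_right):
        """Eight-point style estimate of F from N >= 8 correspondences (Nx2 arrays)."""
        A = []
        for (ul, vl), (ur, vr) in zip(pts_left, pts_right):
            # One row of the linear system u_l^T F u_r = 0 per correspondence
            A.append([ul * ur, ul * vr, ul,
                      vl * ur, vl * vr, vl,
                      ur, vr, 1.0])
        A = np.asarray(A)

        # Constrained least squares: the last right singular vector of A
        _, _, Vt = np.linalg.svd(A)
        F = Vt[-1].reshape(3, 3)

        # Enforce the rank-2 property of F by zeroing its smallest singular value
        U, S, Vt = np.linalg.svd(F)
        S[-1] = 0.0
        return U @ np.diag(S) @ Vt

In practice, OpenCV's cv2.findFundamentalMat with the cv2.FM_8POINT flag performs this estimation, including the normalization step.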

Now that we have the Fundamental Matrix F, we can compute the Essential Matrix E from the known left and right intrinsic camera matrices. Once we have the Essential Matrix, we can decompose it using Singular Value Decomposition into the translation and rotation matrices.
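A short sketch of this step, using the convention derived above (u_l^T F u_r = 0, hence E = K_l^T F K_r) and OpenCV's decomposeEssentialMat; the function name is illustrative:

    import cv2

    def decompose_for_pose(F, K_left, K_right):
        """Recover the candidate rotations and translation direction from F."""
        E = K_left.T @ F @ K_right
        # SVD-based decomposition: two rotations and a unit-norm translation,
        # giving four candidate poses; the valid one keeps points in front of both cameras.
        R1, R2, t = cv2.decomposeEssentialMat(E)
        return E, R1, R2, t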

Our uncalibrated stereo system is now fully calibrated!

5. Correspondences

With our stereo system now fully calibrated, the next step is to identify correspondences between points in the left and right images. Our goal is to find the corresponding point in the right image for every point in the left image. Ultimately, this process will enable us to create a detailed depth map of the 3-dimensional scene.

Recall that in a simple stereo setup there are two cameras: the left camera and the right camera. Their relationship is straightforward, with the right camera positioned horizontally at a distance called the baseline from the left camera. Both cameras have parallel optical axes, so any point in the left image corresponds to a point in the right image lying on the same horizontal scanline. This reduces stereo matching to a 1-dimensional search. A similar principle applies to uncalibrated stereo systems.

In uncalibrated stereo setups, the stereo matching problem still involves a one-dimensional search. After finding the rotation and translation between the two cameras, the question arises about which line in the right image to search for corresponding points.

Recall that the epipolar plane is unique for any given point in a scene. It includes the point P, the epipoles, and the centers of the two cameras. The epipolar plane intersects with our two image planes to produce the epipolar lines, as shown in pink above. Hence, every scene point has two corresponding epipolar lines, one on each of the two image planes.

When we have a point in the image plane, for example u_l in the left image plane, then the corresponding point in the right image must lie on the epipolar line in the right image plane. If we consider u_l, which represents a single outgoing ray, the epipolar line is defined by projecting all the points on this ray onto the right image. Consequently, for any point on this ray, finding its matching point in the right image involves a search along a single line. However, the question remains: which particular line should we search along?

It has been found that, with the fundamental matrix and points in the left image, it is possible to derive the equation of the straight line in the right image. This equation serves as a guide during the search process to find the corresponding matching points.

Once more, the process of finding correspondences reduces to a 1-dimensional search.

5.1 Epipolar Line

Let's assume we have the Fundamental Matrix F and a single point (u_l, v_l) in the left image. In the epipolar constraint below, we want to find an expression that the corresponding point (u_r, v_r) must satisfy:

[u_l, v_l, 1] F [u_r, v_r, 1]^T = 0

By expanding our matrix equation:

(f11 u_l + f21 v_l + f31) u_r + (f12 u_l + f22 v_l + f32) v_r + (f13 u_l + f23 v_l + f33) = 0

Simplifying further, we obtain the equation of a straight line in the right image:

a u_r + b v_r + c = 0,    where [a, b, c]^T = F^T [u_l, v_l, 1]^T

When we have the fundamental matrix, for any point provided in the left image, it becomes possible to determine the specific line in the right image where the corresponding point should be located. Likewise, if we have a point in the right image, we can use the same method to calculate the equation for the epipolar line in the left image. This equation then directs the search to find the corresponding point for the given point in the right image.
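A minimal NumPy sketch of this computation, under the convention used above (u_l^T F u_r = 0, so the line in the right image is F^T times the homogeneous left point); the helper name is hypothetical:

    import numpy as np

    def epipolar_line_in_right(F, point_left):
        """Coefficients (a, b, c) of the line a*u_r + b*v_r + c = 0 in the right image."""
        u_l = np.array([point_left[0], point_left[1], 1.0])   # homogeneous left image point
        a, b, c = F.T @ u_l
        return a, b, c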

To establish correspondences between points in the left and right images, we take a small window around a point in the left image and apply template matching along the determined red line in the right image. This process helps us find the best-matching point in the right image for the given point in the left image, enabling accurate matching between the two images.

Once we have the corresponding pairs in the left and right images, we can use these pairs for triangulation, enabling the computation of the depths of the three-dimensional coordinates for points in the scene.


6. Estimating Depth

At this stage, we have established correspondences between points in the left and right images. Now, our objective is to utilize the image coordinates of these corresponding pairs to estimate the three-dimensional coordinates of each point in the scene. This process, known as triangulation, allows us to compute the spatial positions accurately.

Below are the imaging equations for point P: its coordinates in the 3D frames of the left and right cameras are projected onto the respective image planes via perspective projection using the internal parameters. Note that the 3x4 matrices are the intrinsic matrices (padded with a zero column).

Left Camera Imaging Equation:

Right Camera Imaging Equation:

Note that since we have now calibrated our uncalibrated stereo system, we also have the relative position and orientation between the two cameras:

We thus substitute it in the left camera imaging equation:

Note that the product of the intrinsic matrix and the extrinsic matrix becomes the Projection Matrix.

Keeping the Right Camera Imaging Equation the same:

When we rearrange these two equations, we get them in the form of a linear system in the unknown 3D coordinates of the scene point:

A x = b

This is an overdetermined system of linear equations. We find the least-squares solution using the pseudo-inverse:

x = (A^T A)^{-1} A^T b
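As a small illustration (the function name and the SVD-based solution are assumptions; the pseudo-inverse gives an equivalent least-squares answer), here is linear triangulation of a single correspondence from the two 3x4 projection matrices described above:

    import numpy as np

    def triangulate_point(P_left, P_right, uv_left, uv_right):
        """Linear triangulation of one correspondence from two 3x4 projection matrices."""
        u_l, v_l = uv_left
        u_r, v_r = uv_right
        # Each image point contributes two rows of the overdetermined system
        A = np.vstack([u_l * P_left[2] - P_left[0],
                       v_l * P_left[2] - P_left[1],
                       u_r * P_right[2] - P_right[0],
                       v_r * P_right[2] - P_right[1]])
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]                   # least-squares solution up to scale
        return X[:3] / X[3]          # inhomogeneous 3D point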

We repeat this for every pair of corresponding points in the left and right images, and that gives us a complete 3D depth map of the scene. This is how we go from arbitrary views to depth maps of the scene through this calibration process.


7. Multi-View Stereo

Multi-View Stereo (MVS) is a technique used for dense 3D reconstruction of scenes using multiple 2D images of the same scene taken from different viewpoints. MVS aims to generate a dense 3D representation of the scene, where every pixel in the images is assigned a corresponding 3D point in the reconstructed space.

When the camera parameters are known, the epipolar geometry of the scene allows us to establish correspondences between pixels in different images: matching a pixel in one image with pixels in another image becomes a one-dimensional (1D) search problem.

Matching pixels across images is a challenging problem that is not unique to stereo or multi-view stereo. Optical flow is another active field that addresses the problem of dense correspondence across images. However, there are some key differences between MVS and optical flow:

  1. Optical flow typically deals with a two-image problem, similar to two-view stereo, whereas MVS involves multiple images.

  2. In optical flow, camera calibration is not assumed, while MVS relies on known camera parameters.

  3. The main application of optical flow is image interpolation and motion estimation, whereas MVS is primarily focused on 3D reconstruction and depth estimation of a scene.

7.1 Two-view Stereo Reconstruction

Before we dive into MVS, let's try something simpler: the Two-View Stereo. We pick up where we left off in the Pseudo LiDARS with Stereo Vision project.

We start by taking two pictures from the KITTI Dataset: a left and right image:

We then calculate the disparity between these two images using the Semi-Global Block Matching (SGBM) algorithm:

    # Compute disparity map
    disparity_map = compute_disparity(left_image, right_image, num_disparities=90, block_size=5, window_size=5, matcher="stereo_sgbm", show_disparity=True)
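compute_disparity is a helper from this repository; a rough stand-in built on OpenCV's StereoSGBM (an assumption about its internals, not the actual implementation) could look like this:

    import cv2
    import numpy as np

    def compute_disparity_sketch(left_gray, right_gray, num_disparities=90, block_size=5):
        """Rough stand-in for compute_disparity using Semi-Global Block Matching."""
        num_disparities = ((num_disparities + 15) // 16) * 16   # SGBM needs a multiple of 16
        matcher = cv2.StereoSGBM_create(minDisparity=0,
                                        numDisparities=num_disparities,
                                        blockSize=block_size,
                                        P1=8 * block_size ** 2,    # smoothness penalty (small disparity changes)
                                        P2=32 * block_size ** 2)   # smoothness penalty (large disparity changes)
        # StereoSGBM returns fixed-point disparities scaled by 16
        return matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0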

The result:

7.1.1 Q-Matrix

In order to use the reprojectImageTo3D function from OpenCV, we first need to calculate the Q-matrix, also known as the disparity-to-depth mapping matrix.

Q-matrix

  • 4x4 matrix that takes the disparity value and calculates the depth of the point in 3D space.
  • It represents the relationship between disparity values and depth.

Depth Map

  • 2D image that represents the depth of each pixel in the image.

Disparity Map

  • 2D image that represents, for each pixel, the horizontal offset (disparity) between corresponding pixels in the left and right images of a stereo pair.
  • Used to calculate the depth of a point in 3D space.

In summary, the Q-matrix provides the mathematical mapping between pixel disparities and 3D coordinates. It allows us to transform the disparity information captured by the stereo camera system into depth information, which we can use to perform tasks such as 3D reconstruction or scene understanding.

    # Compute Q matrix
    Q = np.zeros((4, 4))

    # Perform stereo rectification
    _, _, _, _, Q, _, _ = cv2.stereoRectify(cameraMatrix1=camera_matrix_left,
                                            cameraMatrix2=camera_matrix_right,
                                            distCoeffs1=None,
                                            distCoeffs2=None,
                                            imageSize=left_image.shape[:2],  # Provide the image size
                                            R=np.identity(3),
                                            T=np.array([baseline[0], 0., 0.]),  # Translation vector
                                            R1=None,
                                            R2=None,
                                            P1=None,
                                            P2=None,
                                            Q=Q)
[[   1.            0.            0.         -609.5593338 ]
 [   0.            1.            0.         -172.85400391]
 [   0.            0.            0.          721.53771973]
 [   0.            0.           -1.8771874     0.        ]]

We can now use the Q-matrix and the disparity map to obtain 3D points. For every pixel (u, v) with disparity d, the reprojection is:

[X, Y, Z, W]^T = Q [u, v, d, 1]^T

and the inhomogeneous 3D point is (X/W, Y/W, Z/W). Where:

  • b: baseline

  • d: disparity value at pixel (u, v)

  • [X, Y, Z, W]^T: 3D coordinates of the scene point in homogeneous coordinates

  • Q: the disparity-to-depth mapping matrix

  • (u, v): pixel coordinates

  • c_x and c_y: optical centers (principal point)

  • c_x - c'_x: difference between the optical centers of the two cameras, equal to 0 if they are aligned.

    # Obtain 3D points from the disparity map
    points = cv2.reprojectImageTo3D(disparity_map.copy(), Q)
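Before visualizing or saving the result, it is common to mask out pixels with an invalid disparity and attach a colour to each remaining point. An illustrative continuation (not from the project code), reusing disparity_map, points, and left_image from above:

    import numpy as np

    # Keep only pixels whose disparity was actually computed and whose depth is finite
    mask = (disparity_map > disparity_map.min()) & np.isfinite(points[..., 2])

    valid_points = points[mask]       # N x 3 array of 3D coordinates
    valid_colors = left_image[mask]   # matching colours (or intensities) from the left image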

Below is the result of a 3D reconstruction scene:

bandicam.2023-07-28.21-35-24-775.mp4

7.2 Multi-view Stereo Reconstruction

We will now dive into multi-view stereo. We will take two datasets: Temple of the Dioskouroi in Agrigento, Sicily and The Rock. We have 12 images for each dataset and for each image we have their extrinsic and intrinsic parameters.

Below is a video for the 12 images for the datasets:

GIF 1 GIF 2

7.2.1 Topologies

Recall that we had to calculate the disparity between two images. In the two-view reconstruction this was straightforward; however, when we have more than 2 images, there are different ways to pair the images for disparity calculation. Below are 5 topologies for calculating disparities. We can choose to have a dense 3D model at the end with the 360 or overlapping topologies, a medium-dense point cloud with adjacent, or a sparse one with skipping_1 and skipping_2.

Image 1 Image 2 Image 3 Image 4 Image 5

Below is the code for the topologies:

topologies['360'] = tuple(zip((0,1,2,3,4,5,6,7,8,9,10,11),
                              (1,2,3,4,5,6,7,8,9,10,11,0)))

topologies['overlapping'] = tuple(zip((0,1,2,3,4,5,6,7,8,9,10),
                                      (1,2,3,4,5,6,7,8,9,10,11)))

topologies['adjacent'] = tuple(zip((0,2,4,6,8,10),
                                   (1,3,5,7,9,11)))

topologies['skipping_1'] = tuple(zip((0,3,6,9),
                                     (1,4,7,10)))

topologies['skipping_2'] = tuple(zip((0,4,8),
                                     (1,5,9)))
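For reference, each topology above is just a tuple of (left, right) index pairs, so a reconstruction loop can iterate over the chosen topology directly; a tiny illustrative example:

    # Each pair names the two images whose disparity will be computed
    for left_idx, right_idx in topologies['overlapping']:
        print(f"compute disparity between image {left_idx} and image {right_idx}")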

7.2.2 Camera Calibration

Before rectification, the intrinsic and extrinsic parameters of the two cameras are estimated through a calibration process. The calibration determines the camera matrices, distortion coefficients, and relative pose (rotation and translation) between the cameras.

    # StereoRectifyOptions
    StereoRectifyOptions = {
        'imageSize': (w, h),
        # Specifies the desired size of the rectified stereo images. 'w' and 'h' are width and height, respectively.
        'flags': (0, cv2.CALIB_ZERO_DISPARITY)[0],
        # Flag for stereo rectification. 0: Disparity map is not modified. cv2.CALIB_ZERO_DISPARITY: Zero disparity at all pixels.
        'newImageSize': (w, h),  # Size of the output rectified images after the rectification process.
        'alpha': 0.5
        # Balance between preserving all pixels (alpha = 0.0) and completely rectifying the images (alpha = 1.0).
    }

    # RemapOptions
    RemapOptions = {
        'interpolation': cv2.INTER_LINEAR
        # Interpolation method used during the remapping process. Bilinear interpolation for smoother results.
    }

    # CameraArrayOptions
    CameraArrayOptions = {
        'channels': 1,
        # Number of color channels in the camera images. 1 for grayscale images, 3 for RGB color channels.
        'num_cameras': 12,  # Total number of cameras in the camera array.
        'topology': 'adjacent'
        # Spatial arrangement or topology of the camera array ('adjacent', 'circular', 'linear', 'grid', etc.).
    }

7.2.3 Rectification

Stereo rectification is a geometric transformation applied to a pair of stereo images to make them appear as if they were taken from the same viewpoint. The goal of stereo rectification is to simplify the stereo correspondence problem by aligning the epipolar lines in the two images.

The rectification process warps the two images so that their epipolar lines become parallel to the horizontal axis. This ensures that for each point in one image, its corresponding point in the other image lies along the same horizontal scan line. Rectification simplifies the stereo-matching process as it reduces the search space for finding corresponding points from 2D search to 1D.

    # 1. Rectification
    left_image_rectified, right_image_rectified = rectify_and_show_results(opencv_matcher, image_index=index, show_image=False)

left_maps and right_maps represent the rectification maps for the left and right cameras. These rectification maps are used to transform the pixels of the original left or right image to their corresponding positions in the rectified image. The cv2.initUndistortRectifyMap function generates two maps, left_maps[0] and left_maps[1], which are needed for the rectification process. These maps are essentially lookup tables that define the pixel correspondence between the original and rectified images (a short usage sketch follows the list below).

  1. left_maps[0]: Represents the horizontal (x) displacement map. It contains the x-coordinate of the destination pixel in the rectified image for each pixel in the original left image. When applied to the original left image, it remaps the pixels to their new positions in the rectified image along the horizontal direction.

  2. left_maps[1]: Represents the vertical (y) displacement map. It contains the y-coordinate of the destination pixel in the rectified image for each pixel in the original left image. When applied to the original left image, it remaps the pixels to their new positions in the rectified image along the vertical direction.
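Applying these maps is a single cv2.remap call; a short illustrative snippet, with variable names taken from the description above:

    import cv2

    # Resample the original left image into its rectified counterpart using both maps
    left_image_rectified = cv2.remap(left_image, left_maps[0], left_maps[1], cv2.INTER_LINEAR)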

Below is the image of the rock, the left_maps[1] and left_image_rectified.

7.2.4 Disparity Estimation

Once the images are rectified, they can be re-projected into 3D space using disparity information.

    # StereoMatcherOptions
    StereoMatcherOptions = {
        'MinDisparity': -64,  # Minimum disparity value, influences the maximum depth.
        'NumDisparities': 256,  # Number of disparities, influences the minimum depth.
        'BlockSize': 21,  # Size of the window used for matching (odd number).
        'SpeckleWindowSize': 0,  # Speckle window size, must be strictly positive to turn on speckle post-filter.
        'SpeckleRange': 0,  # Speckle range, must be >= 0 to enable speckle post-filter.
        'Disp12MaxDiff': 0  # Maximum allowed difference (in integer pixel units) in the left-right disparity check.
    }

    # StereoBMOptions
    StereoBMOptions = {
        'PreFilterType': (cv2.StereoBM_PREFILTER_NORMALIZED_RESPONSE, cv2.StereoBM_PREFILTER_XSOBEL)[0],
        # Pre-filter type, can be StereoBM_PREFILTER_NORMALIZED_RESPONSE or StereoBM_PREFILTER_XSOBEL.
        'PreFilterSize': 5,  # Size of the pre-filter kernel, must be odd and within the range 5..255.
        'PreFilterCap': 63,  # Pre-filter cap, must be within the range 1..63. Used to truncate pixel values.
        'TextureThreshold': 10,  # Texture threshold, minimum difference to accept the computed disparity.
        'UniquenessRatio': 10
        # Uniqueness ratio, defines the minimum margin between the best and second-best matching cost.
    }

    # StereoSGBMOptions
    StereoSGBMOptions = {
        'PreFilterCap': 0,  # Pre-filter cap.
        'UniquenessRatio': 0,
        # Uniqueness ratio, defines the minimum margin between the best and second-best matching cost.
        'P1': 16 * 21 * 21,  # "Depth Change Cost"
        'P2': 16 * 21 * 21,  # "Depth Step Cost"
        'Mode': (cv2.StereoSGBM_MODE_SGBM, cv2.StereoSGBM_MODE_HH,
                              cv2.StereoSGBM_MODE_SGBM_3WAY)[1]
        # Matching mode, can be StereoSGBM_MODE_SGBM, StereoSGBM_MODE_HH, or StereoSGBM_MODE_SGBM_3WAY.
    }
    # 2. Disparity
    disparity_img = compute_and_show_disparity(left_image_rectified, right_image_rectified)

Below is the disparity map using SGBM:

7.2.5 Project to 3D

We will now use our Q-matrix and disparity map to project our image into 3D space.

    # 3. Project to 3D
    output_file_path = reproject_and_save_ply(disparity_img, opencv_matcher, index, output_folder)

The 3D reconstruction of the rock:

bandicam.2023-07-30.13-37-38-632.mp4

The 3D reconstruction of the temple:

bandicam.2023-07-30.13-46-13-779.mp4

Note that we got better results for The Rock due to its simpler structure and a better disparity map. We also have some noise, which needs to be filtered. For the Temple's point cloud, we can see the outline of the temple, but it is obscured by noise. For further improvement, CREStereo could be used to build a better disparity map.


8. Structure from Motion

Structure from Motion aims to recover the three-dimensional (3D) structure of a scene and camera motion from a set of 2D images or video frames. In simpler terms, it's about reconstructing the 3D shape of objects and the camera's trajectory as it moves through a scene, using only the 2D images or video as input.

The process of Structure from Motion typically involves the following steps:

  1. Feature Detection and Matching: Distinctive features, such as corners, edges, or keypoints, are detected in each image. These features are then matched across multiple images to find corresponding points.

  2. Camera Pose Estimation: Given the correspondences between image points, the next step is to estimate the camera poses (positions and orientations) for each image. This can be achieved using techniques like the RANSAC algorithm. This step is the most important: we estimate the Fundamental Matrix, derive the Essential Matrix from it, and then decompose the Essential Matrix to recover the relative rotation and translation.

  3. Triangulation: Triangulation is the process of finding the 3D coordinates of points in the scene by intersecting corresponding rays from multiple camera viewpoints.

  4. Bundle Adjustment: Bundle adjustment is an optimization process used to refine the camera poses and 3D points simultaneously. It minimizes the reprojection error, which is the difference between the observed image points and the reprojected 3D points on the image plane.

The result of the Structure from Motion process is a sparse 3D point cloud representation of the scene and the camera poses for each image frame.

8.1 Two-View Structure from Motion

Similarly to MVS, let's start simple with two images and reconstruct a 3D model. We will use the Fountain Dataset which consists of 11 images.

GIF

8.1.1 Feature Matching

We will now use the SIFT algorithm in order to find the matching features on two images. We will also need the x-y coordinates of these matching features.

    ### --------------- FEATURE MATCHING ----------------------- ###

    # Feature Matching using SIFT
    keypoints1, descriptors1, keypoints2, descriptors2, good_matches = match_features_sift(img1_rgb, img2_rgb, show_image=True)

    # Obtain the image coordinates (x, y) of the matched keypoints in both images
    img1pts, img2pts = get_matched_keypoints_coordinates(keypoints1, keypoints2)
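match_features_sift is a project helper; a plausible minimal version that combines the matching and coordinate-extraction steps (an assumption, not the actual implementation), built on OpenCV's SIFT and Lowe's ratio test:

    import cv2
    import numpy as np

    def match_features_sift_sketch(img1_rgb, img2_rgb, ratio=0.75):
        """Return matched (x, y) coordinates in both images using SIFT + ratio test."""
        gray1 = cv2.cvtColor(img1_rgb, cv2.COLOR_RGB2GRAY)
        gray2 = cv2.cvtColor(img2_rgb, cv2.COLOR_RGB2GRAY)
        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(gray1, None)
        kp2, des2 = sift.detectAndCompute(gray2, None)

        # Lowe's ratio test on 2-nearest-neighbour matches
        matcher = cv2.BFMatcher()
        good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
                if m.distance < ratio * n.distance]

        pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
        pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
        return pts1, pts2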

Note that if the two images are quite different or have a significantly different orientation then it will be hard to find matching features. For example, in the image below we only have 46 matching features.

If we take two pictures with minimal orientation change, then we have the possibility of having more features matching. Here we have 1577 features from both images.

8.1.2 Fundamental Matrix

We will now use the findFundamentalMat function from OpenCV to find the Fundamental Matrix, F. The function also returns the mask, which indicates the inliers used to estimate F. We will try different algorithms to calculate F: FM_LMEDS, FM_8POINT, FM_RANSAC and FM_7POINT.

    ### --------------- FUNDAMENTAL MATRIX ----------------------- ###

    F, mask = find_fundamental_matrix(img1pts, img2pts)
[[-2.32608990e-08 -4.14097505e-07  2.13234106e-04]
 [ 1.10774376e-06 -9.43696790e-08  1.04326766e-02]
 [-1.17953256e-03 -1.15538347e-02  1.00000000e+00]]
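find_fundamental_matrix is a thin project helper; a plausible minimal version (an assumption, not the actual implementation) wraps cv2.findFundamentalMat:

    import cv2

    def find_fundamental_matrix_sketch(img1pts, img2pts, method=cv2.FM_RANSAC):
        """Estimate F and an inlier mask from matched image points."""
        # Positional arguments: points1, points2, method, RANSAC reprojection threshold, confidence
        F, mask = cv2.findFundamentalMat(img1pts, img2pts, method, 1.0, 0.99)
        return F, mask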

8.1.3 Epipolar Lines

Using the Fundamental Matrix F and feature matching coordinates, we will compute the epipolar lines.

    ### --------------- EPIPOLAR LINES ----------------------- ###

    # Compute epipolar lines for both images
    lines_img1 = compute_epipolar_lines(F, img1pts, 1)
    lines_img2 = compute_epipolar_lines(F, img2pts, 2)
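compute_epipolar_lines is likewise a project helper; it presumably wraps cv2.computeCorrespondEpilines, roughly as follows (an assumption):

    import cv2
    import numpy as np

    def compute_epipolar_lines_sketch(F, points, which_image):
        """Lines (a, b, c) in the other image for points taken from image `which_image` (1 or 2)."""
        pts = np.asarray(points, dtype=np.float32).reshape(-1, 1, 2)
        return cv2.computeCorrespondEpilines(pts, which_image, F).reshape(-1, 3)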

We will then plot these lines on our images:

    # Draw epipolar lines on both images
    img1_copy_1, img2_copy_2 = draw_epipolar_lines(img2_rgb, img1_rgb, lines_img2, img2pts, img1pts, draw_only=10)
    img1_copy_3, img2_copy_4 = draw_epipolar_lines(img1_rgb, img2_rgb, lines_img1, img1pts, img2pts, draw_only=10)

Below is the result of the epipolar lines when trying different algorithms for the Fundamental Matrix:

FM_7POINT FM_8POINT
Image 1 Image 2
FM_LMEDS FM_RANSAC
Image 3 Image 4

8.1.4 Essential Matrix

Note that we have the intrinsic parameters (given in the dataset) but not the extrinsic ones. Hence, we have our K matrix (our camera matrix), and we will use it to estimate the Essential Matrix.

    ### --------------- ESSENTIAL MATRIX ----------------------- ###

    K = np.array([[2759.48, 0, 1520.69],
                  [0, 2764.16, 1006.81],
                  [0, 0, 1]])

    E = K.T.dot(F.dot(K))
[[ -0.17712546  -3.15858845  -0.6596703 ]
 [  8.44947533  -0.72103913  33.2312819 ]
 [ -0.27489299 -33.93990701  -0.68567946]]

8.1.5 Camera Pose

We derived the Essential Matrix from the Fundamental Matrix. Now, we will decompose the Essential Matrix to recover R and t. Recall the formula E = [t_x] R.

    ### --------------- CAMERA POSE ----------------------- ###

    retval, R, t, mask = cv2.recoverPose(E, img1pts, img2pts, K)

Where R =

[[ 0.98756121 -0.02267659 -0.1555912 ]
 [ 0.02552626  0.99954058  0.01634136]
 [ 0.15514915 -0.02010975  0.98768636]]

and t =

[[ 0.99550392]
 [ 0.0178422 ]
 [-0.09302477]]

8.1.6 Triangulation

We will now use triangulation and get the 3D point cloud using the triangulatePoints function from OpenCV. The steps are as follows:

  1. Convert 2D points to homogeneous coordinates
  2. Normalize 2D points using the camera intrinsic matrix (K)
  3. Convert back to inhomogeneous coordinates
  4. Triangulate 3D points using OpenCV's triangulatePoints function
  5. Convert 4D homogeneous points to inhomogeneous 3D points
    ### --------------- TRIANGULATION ----------------------- ###

    points_3d = triangulate_points(img1pts, img2pts, K, R, t)
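A rough stand-in for triangulate_points following the five steps above (an assumption about its internals, not the project's exact code); cv2.undistortPoints performs steps 1-3 in a single call:

    import cv2
    import numpy as np

    def triangulate_points_sketch(img1pts, img2pts, K, R, t):
        """Triangulate matched points given intrinsics K and relative pose (R, t)."""
        # Steps 1-3: normalise the pixel coordinates with the intrinsic matrix
        pts1 = cv2.undistortPoints(np.asarray(img1pts, dtype=np.float64).reshape(-1, 1, 2), K, None).reshape(-1, 2)
        pts2 = cv2.undistortPoints(np.asarray(img2pts, dtype=np.float64).reshape(-1, 1, 2), K, None).reshape(-1, 2)

        # Camera 1 at the origin, camera 2 at (R | t), both in normalised coordinates
        P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = np.hstack([R, t.reshape(3, 1)])

        # Step 4: triangulate, then step 5: convert 4D homogeneous points to 3D
        pts_4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
        return (pts_4d[:3] / pts_4d[3]).T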

8.1.7 Project to 3D

Now that we have the 3D points, we can also extract the colors from the image and then save the point cloud in .ply format with the respective colors.

    ### --------------- IMAGE COLOR EXTRACTION ----------------------- ###

    # Extract color from img1_rgb
    colors_img1 = []
    for pt in img1pts:
        x, y = np.round(pt).astype(int)
        color = img1_rgb[y, x]
        colors_img1.append(color)

    colors_img1 = np.array(colors_img1)

    # Extract color from img2_rgb
    colors_img2 = []
    for pt in img2pts:
        x, y = np.round(pt).astype(int)
        color = img2_rgb[y, x]
        colors_img2.append(color)

    colors_img2 = np.array(colors_img2)

    # Define the file name for the PLY file
    output_file_path = os.path.join(output_folder, f"out_color{index}.ply")

    # Write the points and colors to the PLY file
    write_ply(output_file_path, points_3d, colors_img1)
    print("Point cloud with color saved to:", output_file_path)

From this sparse point cloud, we see that we successfully recovered the color and shape of the golden fish in the middle and the golden plate at the top, as well as an outline of the fountain. Notice that we have less noise compared to MVS; however, we also have a less dense point cloud.

Image 1 Image 2

8.2 Multi-View Structure from Motion

Now we will see how to construct a 3D point cloud with all 11 images. As in MVS, we will make use of the topologies to calculate the fundamental matrices. We will use the overlapping topology, which gives us 10 separate point clouds. We then need to stack them (np.vstack) in order to get a combined 3D reconstructed model. Below is the result of the separate 3D models:

bandicam.2023-07-31.12-40-33-203.mp4

Notice that from the first model, we get a good outline of the fountain and then when we superimpose the other point clouds, they also have a good outline however they are not correctly aligned. It seems that all of them have a different center, hence a different orientation and position.

8.3 Bundle Adjustment

In SfM, the camera poses and 3D points are initially estimated from feature correspondences in the images. However, due to noise, inaccuracies, and unmodeled factors during the initial reconstruction, the estimated 3D points may not perfectly align with the observed image points as we saw above.

Bundle Adjustment aims to improve the accuracy of the reconstruction by finding the best possible camera poses and 3D point positions that minimize the discrepancies between the projected 3D points and the actual image observations. It considers all the images simultaneously, taking into account the correlations between camera poses and 3D points, to achieve a globally optimal solution. It simultaneously adjusts the camera positions and orientations, as well as the positions of the 3D points, in order to minimize the reprojection error between the 2D image observations and their corresponding 3D points.
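As a rough, self-contained illustration (not the project's code, and heavily simplified), the reprojection error that Bundle Adjustment minimizes can be expressed as a residual function and handed to a generic solver such as SciPy's least_squares; the packing of the parameter vector here is an assumption:

    import numpy as np
    import cv2
    from scipy.optimize import least_squares

    def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, observed_uv, K):
        """Residuals between observed pixels and reprojected 3D points (illustrative only)."""
        cam_params = params[:n_cams * 6].reshape(n_cams, 6)   # per camera: Rodrigues rvec (3) + tvec (3)
        points_3d = params[n_cams * 6:].reshape(n_pts, 3)
        residuals = []
        for c, p, uv in zip(cam_idx, pt_idx, observed_uv):
            rvec, tvec = cam_params[c, :3], cam_params[c, 3:]
            projected, _ = cv2.projectPoints(points_3d[p].reshape(1, 1, 3), rvec, tvec, K, None)
            residuals.append(projected.ravel() - uv)
        return np.concatenate(residuals)

    # result = least_squares(reprojection_residuals, x0,
    #                        args=(n_cams, n_pts, cam_idx, pt_idx, observed_uv, K))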

Some methods for Bundle Adjustment:

  1. Levenberg-Marquardt (LM): It is an iterative method that combines the steepest descent method and the Gauss-Newton method to efficiently find the optimal parameters.

  2. Gauss-Newton: The Gauss-Newton algorithm is an iterative method used to solve non-linear least squares problems.

  3. Conjugate Gradient (CG): Conjugate Gradient is an iterative method used to solve large-scale linear systems.

  4. Sparse Bundle Adjustment (SBA): SBA is an extension of Bundle Adjustment that takes advantage of the sparsity in the Jacobian matrix, which represents the derivatives of the cost function with respect to the camera poses and 3D points.

  5. Robust Bundle Adjustment: Robust techniques, such as Huber loss or Tukey loss, can be applied to Bundle Adjustment to handle outliers and improve the robustness of the optimization process against noisy or erroneous measurements.

8.4 Iterative Closest Point (ICP)

If we have two 3D reconstructions obtained from different viewpoints and want to align them to create a more complete and accurate 3D model, Iterative Closest Point (ICP) can be used to find the best rigid transformation (translation and rotation) that aligns the two point clouds.
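As an illustration (not the project's code), Open3D's point-to-point ICP can recover such a rigid transform between two of the partial point clouds; the file names below are placeholders:

    import numpy as np
    import open3d as o3d

    # Two partial reconstructions saved as PLY files (placeholder names)
    source = o3d.io.read_point_cloud("cloud_0.ply")
    target = o3d.io.read_point_cloud("cloud_1.ply")

    threshold = 0.05          # maximum correspondence distance; tune to the scale of the scene
    trans_init = np.eye(4)    # start from the identity transform

    reg = o3d.pipelines.registration.registration_icp(
        source, target, threshold, trans_init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())

    source.transform(reg.transformation)   # apply the recovered rigid transform to align the clouds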

As Bundle Adjustment is a complex and tedious process, we will instead use ICP to enhance the accuracy and completeness of the 3D reconstructions. Below is the result before and after using ICP:

GIF 1 GIF 2

We successfully aligned both point clouds!

Conclusion

In this project, we saw the 3D reconstruction of a Rock, Temple, and Fountain using various photogrammetry techniques. We observed that with Multi-View Stereo (MVS), a denser and more detailed 3D reconstruction is obtained, making it preferable for scenes with textureless or repetitive regions. On the other hand, Structure-from-Motion (SFM) is faster and more computationally efficient, making it suitable for real-time applications and smaller scenes.


However, further improvements can be achieved by incorporating Neural Radiance Fields (NeRFs) into the pipeline. NeRFs are an innovative approach to 3D scene reconstruction that utilizes neural networks to model volumetric scenes. Unlike traditional methods such as Multi-View Stereo (MVS) and Structure-from-Motion (SFM), which rely on sparse point clouds or depth maps, NeRFs learn the radiance at every 3D point in the scene.

This enables NeRFs to generate highly detailed and photorealistic reconstructions, even in regions with limited data or texture. NeRFs offer a continuous and consistent scene representation, filling in missing data, refining surface geometry, and providing a comprehensive 3D representation that preserves intricate details. With their ability to leverage information from multiple views and generate high-fidelity renderings, NeRFs can surpass MVS and SFM in terms of accuracy and realism, making them a strong choice for tasks requiring rich 3D scene understanding and visualization.


