Project Augmata (Robotics Club)
- Guining Pertin
- Apr 18, 2019
- 6 min read
Introduction
Mixed Reality is the 21st century’s passport to an ethereal world of endless possibilities. The merging of the real world with virtual technology produces new environments in which physical and digital objects coexist and interact in real time. By leveraging the core strengths of both virtual and augmented reality, with more emphasis on the former, the technology is paving the way for a new, engaging and immersive way of experiencing our reality.
Project Augmata is an effort to bring together the worlds of Virtual Reality and Augmented Reality to form what is called Mixed Reality.
We have developed a budget-friendly mixed reality headset using two webcams, programmed entirely in Python with OpenCV and OpenGL. The current version was used to display Pokemon characters in mixed reality during TechEvince 2019 – the Annual Technical Exhibition of IIT Guwahati.
The project started with the aim of developing a mixed reality headset that can display 3D CAD designs of construction sites overlaid on the real world. This would help in designing and planning the ongoing work and hence augment manual work.
Mark II
Mark II will have Leap Motion enabled with the Unity Engine for gesture control and better rendering of objects.
Tested barrel transformation on the images.
Tested OpenCV’s FAST feature detector.
Used OpenCV’s ORB feature detector.
Used KMeans to cluster ORB and FAST features to estimate surfaces.
NB: Project Halted



Mark I

Mk I uses –
ArUco markers for localization
OpenGL for rendering
OpenCV for image processing and computer vision
Working
Detect and identify the ArUco marker from the scene.
Compare the detected corners with reference corners to estimate homography.
Determine the projection matrix.
Use OpenGL to render the object to the scene with the projection matrix as input.
So that’d give a very crude idea of what’s happening; let me explain it in detail –
First things first!
We all use cameras and we know for sure that they give us 2D images: what we capture is transformed from the 3D world onto a 2D plane.
Somehow, someone somewhere worked out how the 3D and 2D coordinates of a point are related, given by –
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
We know that the distance of a point in the real world cannot be determined uniquely from a single image, just as with a single eye (monocular vision). This scale ambiguity is captured by the variable s on the LHS, which multiplies the 2D image vector in homogeneous form (here, the 1 appended at the end is basically what makes it a point in homogeneous coordinates).
The first matrix on the RHS is the camera matrix, which holds the focal lengths fx and fy and the principal point cx and cy. Yes, this point might differ from the image centre you would get by simply dividing the width and height by 2.
The bigger 3×4 matrix is the transformation matrix that takes the 3D world coordinates in homogeneous form into the camera frame, just before they pass through the camera matrix. It contains both rotation and translation information, in the form of a rotation matrix and a translation vector.
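To make that concrete, here is a tiny NumPy example of the same equation in action. The intrinsics, pose and point are made up purely for illustration:
import numpy as np

# Made-up intrinsics, only to show how the equation is applied
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
K = np.array([[fx, 0, cx],
              [0, fy, cy],
              [0,  0, 1.0]])                # camera matrix

R = np.eye(3)                               # rotation (identity for this toy example)
t = np.array([[0.0], [0.0], [5.0]])         # translation: 5 units in front of the camera
Rt = np.hstack([R, t])                      # the 3x4 [R | t] matrix

X_world = np.array([0.1, 0.2, 0.0, 1.0])    # a 3D world point in homogeneous form
s_uv = K @ Rt @ X_world                     # this is s * [u, v, 1]
u, v = s_uv[:2] / s_uv[2]                   # divide by s to get the pixel coordinates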
WTH is homography then?
The 3D world coordinates really depend on how the coordinate frame has been set up.
Consider what happens if we have a planar object and attach this world coordinate frame to it, with the origin on the object, the Z axis pointing up and away from it, and the X and Y axes along the length and breadth of the object.

All the points on our object now have Z = 0! This has a remarkable consequence: every point in this world (the planar object) is essentially a 2D point, varying only in X and Y. So we can finally write the complete transformation as a single linear transformation (which includes both the camera matrix and the transformation matrix)! Writing the camera matrix as K and the columns of the rotation matrix as r1, r2, r3:
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix} \begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix}
Which means we can simply drop the 3rd column of the transformation, the one that acted on the Z axis:
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}
The camera matrix and this transformation can be combined together to give what is called the Homography Matrix:
H = K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}
Giving us this simple transformation!
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = H \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}
If you know the coordinates of a point on the object, X, and the same point in the image, x, you can solve for this matrix H using an inverse (or rather a pseudo-inverse in this case), although a single point won’t give a unique solution. So you take more data points whose X and x are known – at least four correspondences – and then perform the inversion to get the required matrix.
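For illustration only (this is not the code we actually ran, and the helper name is mine), here is a minimal NumPy sketch of that pseudo-inverse idea, using the standard DLT formulation solved with an SVD:
import numpy as np

def estimate_homography(ref_pts, img_pts):
    # ref_pts, img_pts: (N, 2) arrays of matching points X and x, with N >= 4
    rows = []
    for (X, Y), (u, v) in zip(ref_pts, img_pts):
        rows.append([-X, -Y, -1,  0,  0,  0, u * X, u * Y, u])
        rows.append([ 0,  0,  0, -X, -Y, -1, v * X, v * Y, v])
    A = np.array(rows)
    # The flattened H is the null vector of A: take the singular vector
    # with the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]   # fix the arbitrary scale so that H[2,2] = 1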
But but, we were never given these X and x!
So this is where feature matching comes in. In basic terms, a feature detection algorithm finds image features (edges, corners etc.) in a reference image, while a feature descriptor provides values that distinguish one feature from another (e.g. a value of 1 for edges and 0 for corners). The same is done for the actual image we are working with, and a feature matching algorithm then tries to match the two sets of features from the reference and the current image. Another algorithm, RANSAC, makes the matching robust by removing outliers.
So if our reference image is the top-down view of the planar object, its features give us the X vectors, while the image of the same object as seen through the camera has those same features at x.
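As a rough sketch of how such X/x correspondences could be collected with OpenCV’s ORB detector and a brute-force matcher (the file names are placeholders, and this is not our exact code):
import cv2
import numpy as np

# Placeholder file names, purely for illustration
ref_gray = cv2.imread("reference.png", cv2.IMREAD_GRAYSCALE)   # top-down reference image
frame_gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)     # current camera frame

orb = cv2.ORB_create()
kp_ref, des_ref = orb.detectAndCompute(ref_gray, None)
kp_img, des_img = orb.detectAndCompute(frame_gray, None)

# Brute-force matching on the binary ORB descriptors
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_ref, des_img), key=lambda m: m.distance)

# Matched coordinates: X in the reference image, x in the camera image
refpoints = np.float32([kp_ref[m.queryIdx].pt for m in matches])
img_points = np.float32([kp_img[m.trainIdx].pt for m in matches])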
Initially we tried implementing all of these and ended up quite frustrated, until someone pointed us to this –
homography, mask = cv2.findHomography(refpoints, img_points, cv2.RANSAC)
Which was a huge blow, since the entire thing was already implemented in a single function!
Our initial version worked with any kind of marker provided a good reference picture was taken, but it often ran into problems during feature matching. With time running out of our hands (on our final night XD), we shifted back to using the ArUco markers.
ArUco Markers:

These are binary patterns that can be easily recognized using OpenCV’s aruco module. The function below returns the corner coordinates of the marker from the image:
corners, ids, rejected = aruco.detectMarkers(img_gray, markers_dict, parameters=parameters)
Now we don’t even need to perform feature matching since the corners are always indexed the same. So putting our reference corners and image corners into the cv2.findHomography() function, we receive the homography matrix.
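Putting the two together, a minimal sketch of this marker-based step (the dictionary choice, marker size and file name here are assumptions, not necessarily what we used):
import cv2
import numpy as np
from cv2 import aruco

markers_dict = aruco.Dictionary_get(aruco.DICT_4X4_50)   # assumed dictionary
parameters = aruco.DetectorParameters_create()

frame = cv2.imread("frame.png")                          # placeholder input frame
img_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
corners, ids, rejected = aruco.detectMarkers(img_gray, markers_dict, parameters=parameters)

if ids is not None:
    # Reference corners of a square marker in its own plane, listed in the
    # same order that detectMarkers reports them: TL, TR, BR, BL
    size = 100.0                                         # assumed marker size in model units
    ref_corners = np.float32([[0, 0], [size, 0], [size, size], [0, size]])
    img_corners = corners[0].reshape(4, 2)
    # With only the four marker corners there is nothing for RANSAC to reject,
    # so the default least-squares fit is enough in this sketch
    homography, mask = cv2.findHomography(ref_corners, img_corners)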
Homography -> 3D projection?
We know the composition of the homography matrix. So given the camera matrix, we can get back the transformation matrix from the world coordinates to the camera frame. But there’s a problem again!
The transformation matrix recovered from our homography has lost its 3rd column, i.e. it has only 2 rotational columns plus the translational column. But our 3D object to be rendered has Z ≠ 0 for most of its points. We need to somehow recover this 3rd column vector to build the complete 3×4 matrix.
We know that the columns of a rotation matrix are orthonormal, which means we could obtain the last column as the cross product of the two we already have. But since we are only working with estimates of these columns, they are very likely not orthonormal to each other, and the projection would end up slightly squished.
This has been an area of research and a lot of methods exist. For my case, I used a simple method inspired by https://github.com/juangallostra/augmented-reality:
Say the 2 columns are the vectors G1 and G2. We wish they were orthonormal, but they aren’t, so simply taking their cross product and using the three vectors as-is won’t give us a proper rotation matrix.
What we need are two orthonormal vectors lying in this G1–G2 plane but close to the original G1 and G2, plus a third vector orthonormal to both of them, which will form the 3rd column.
The first two orthonormal vectors will basically be our basis for the plane (subspace) spanned by G1 and G2. We can find them by first finding a vector p orthogonal to both G1 and G2 (their cross product).
Then we take another cross product, of p and G1, to get a vector d that lies close to G2 but is orthogonal to G1.
We see that we finally end up with 3 vectors, G1, p and d, all of them orthogonal to each other and with d lying close to G2.
We can now normalize these and use them as our columns for the rotation matrix!
Projection matrix = [unit_v(G1), unit_v(d), unit_v(p), normalized(t)]
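A minimal sketch of this orthonormalisation step, assuming the homography H and camera matrix K from before (the helper name is mine, and exactly how t gets normalised is an assumption taken from the referenced repo):
import numpy as np

def projection_from_homography(K, H):
    G = np.linalg.inv(K) @ H                  # G = [G1, G2, t], up to scale
    G1, G2, t = G[:, 0], G[:, 1], G[:, 2]
    p = np.cross(G1, G2)                      # orthogonal to both G1 and G2
    d = np.cross(p, G1)                       # in the G1-G2 plane, close to G2, orthogonal to G1
    R = np.column_stack((G1 / np.linalg.norm(G1),
                         d / np.linalg.norm(d),
                         p / np.linalg.norm(p)))
    # Normalising t: divide by the geometric mean of the two column norms
    # (an assumption on my part, following the referenced repo)
    scale = np.sqrt(np.linalg.norm(G1) * np.linalg.norm(G2))
    Rt = np.column_stack((R, t / scale))      # the complete 3x4 [R | t]
    return K @ Rt                             # full 3x4 projection matrix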
Rendering the 3D object
This projection matrix from the previous step is finally used with the object model to render it, using the render function taken from https://github.com/juangallostra/augmented-reality.
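The actual drawing happens through OpenGL, but just to illustrate how that 3×4 matrix is applied to the model, here is a rough cv2-only sketch (the helper name, vertex/face format and colour are made up):
import cv2
import numpy as np

def draw_model(frame, vertices, faces, projection):
    # vertices: (N, 3) model points, faces: lists of vertex indices,
    # projection: the 3x4 matrix from the previous step
    verts_h = np.hstack([vertices, np.ones((len(vertices), 1))])  # N x 4 homogeneous points
    proj = verts_h @ projection.T                                 # N x 3, i.e. s * [u, v, 1]
    pts2d = (proj[:, :2] / proj[:, 2:3]).astype(np.int32)         # perspective divide
    for face in faces:
        cv2.fillConvexPoly(frame, pts2d[list(face)], (0, 200, 0)) # made-up colour
    return frame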
Members
1st Year Team (as of TechEvince 2019)
Aadi Gupta
Anirudh Varanasi
Dibyakanti Kumar
Manish Agarwal
Shivansh Mishra
Soham Khadilkar
Sps Pranav
Vishisht Priyadarshi
Mentor – Guining Pertin