Friday, February 18, 2011

Using SIMD for Hardware Acceleration

Most modern processors come with a feature of operating the same instruction on multiple instances of data. The acronym SIMD stands for "Single Instruction Multiple Data" describes just that.

Intel first introduced the MMX instructions that could operate on multiple data in the Pentium CPU's way back in 1994. Intel supported this by coming out with an extended instruction set henceforth called Streaming SIMD Extensions or SSE. The SSE instruction set which first came with the Pentium III processor uses 128 bit registers that can be used to pack 4 integers or 4 floating point data types. The advantage of doing this becomes clear if one tries to consecutively add four different sets of operands. Using the basic instruction set one needs to:
  1. Move the two operands to registers.
  2. Add the operands.
  3. Move result to memory.
  4. Repeat the above steps for three more sets of operands.
But with just three MMX instructions all these can be done at once for the four sets of operands hence rapidly improving performance.

Before I show you how to use these SSE instructions it is important to know how to access data in a SIMD register. As you know a SIMD register is 128 bit which means it can hold 4 x 32 it data or in other 4 ints or 4 floats. This data in order is referred as [x, y, z, w]. The following SSE instruction: addps adds the four ints or floats in the 128 bit XMM0 register and the 128 bit XMM1 register and stores the results back in the XMM0 register.

addps xmm0 xmm1;

The four floats that are loaded into the SSE register can be moved from memory individually but such operations are slow. Moreover moving data between the FPU (Floating Point Processing Unit) registers and the CPU registers is particularly slow because the CPU has to wait for the FPU to complete the present operation at hand. Hence it is a good practice to leave the data in the SSE registers unless and until space has to be cleared.

Let us now see how we can leverage these SIMD instructions from C/C++. Many compilers provide different data types for SIMD operations. Here I will discuss only the Microsoft Visual Studio compiler. The MVCC provides a predefined datatype __m128, which can be used to declare a variable which holds data in a MMX register. A __m128 type variable is stored directly in a MMX register without ever being put in the memory or the CPU registers. It is the programmer's responsibility to align the data to 16 byte address once you load its contents directly into memory.

Here's a sample program which demonstrates the usage of SIMD to perform addition on four floating point data.

__m128 addMMX(__m128 a, __m128 b)
__m128 result;
/* inline assembly */
movaps xmm0, xmmword ptr [a]
movaps xmm1, xmmword ptr [b]
addps xmm0, xmm1
movaps xmmword ptr [result], xmm0
return result;

This is however a bad approach because the code is not portable and one has to embed inline assembly into high level code. A better way to do the same thing is to use intrinsics. Intrinsics are special commands that look and behave like C functions but are internally expanded to inline assembly code by the compiler. In order to use intrinsics be sure to include the xmmintrin.h file into your code.

#include <xmmintrin.h>

__m128 addSIMDwithIntrinsics(__m128 a, __m128 b)
/* use intrisics */
__m128 result = _mm_add_ps(a,b);
return result;

To load 4 floats into the MMX register simply use the load intrinsic.

/* be sure to 16 byte align your arrays to
reduce the number of fetch cycles required to load */

__declspec(align(16)) float A[] = {1.0f, 2.0f, 3.0f, 4.0f};
__declspec(align(16)) float B[] = {4.0f, 3.0f, 2.0f, 1.0f};
__declspec(align(16)) float C[] = {0.0f, 0.0f, 0.0f, 0.0f};

int main(int args, char* argv[])
/* load a and b from the arrays above */
__m128 a = _mm_load_ps(&A[0]);
__m128 b = _mm_load_ps(&B[0]);
__m128 c;

/* call addSIMDwithIntrinsics() function from above */
c = addSIMDwithIntrinsics(a, b);

/* write the result back to array */
_mm_store_ps(&C, c);

The next time you set out to write FFT functions or just about any repetitive math operations be sure to utilize this feature.

Saturday, April 10, 2010

The limits of our perception

The last century can be seen as the golden age of modern civilization. There has been exponential scientific growth and expansion of human knowledge about self and our surroundings. We can now predict the behavior of almost everything with a high degree of precision. Despite the extent of our knowledge there is still uncertainty prevalent in our science. Our understanding of the quantum realms and the obscure phenomena in the vastness of space is still hazy. Theoretically our equations give us a fair picture. Or at least that is what we think. The major part of the 20th century was spent in unifying the most successful theories we have - The general theory of relativity and the quantum mechanical theory. Even now our physicists are in pursuit of the grand unification theory, a theory that if formulated will govern the behavior of particles in both infinitesimal and the infinitely large realms.

Now here surfaces the major flaw. What exactly is the infinitesimal small and the infinitely large? Our mathematics focuses on the real number system as the superset. ( complex numbers are ignored here for the sake of practicality). The real number line is defined such that there are infinitely many real numbers between any two given real numbers. In other words the set of infinity is a subset of itself. ( You might argue saying that every set is a subset of itself but remember that here we talk about a bounded subset). This gives rise to a major anomaly. How can infinity be contained between two well defined finite real numbers, when the distance between them is also a finite real number?

Consider a line of finite length say 10cm. Lets call this line "line A". Consider another line, "line B" of length 5 cm. Based on inequality one can definitely say that line A is larger than line B.

Now, let me start my argument. I argue that both the lines are of the same length. My argument is backed by the fact that every point on the line A can be paired with a point on line B. This one-one correspondence is possible because both line A and line B contain infinite number of points. Due to this one-one correspondence we can also say that both the lines have the same number of points and hence they are of the same length.

You may call me insane, but before you do let me explain why and how this ambiguity prevails. Every object in this universe has an upper and lower limit of perception. All observations that can be made by that object lie within its own limits of perception or what I call the region of perception. These limits equally apply to all the dimensions perceivable by that object. We humans can only observe objects that are larger and more massive than photons. This is because only such objects can reflect photons back to our eye. Similarly we can extend this argument to the larger scales. after a certain point any object bigger than the limiting boundary value of perception seems equally big. Hence we cannot differentiate them. It is just not humanly possible. But you may say that we can build apparatus that can make these measurements for us but again remember that any apparatus that we may build wcertainly lie in our boundary of perception and hence cannot make observations beyond our scale.

So in order to completely understand our universe we require the ability to perceive in any scale. Un-defining infinity and eliminating it from our system of mathematics is the necessary condition to advance our knowledge. But the limits imposed on our region of perception prevent us from doing so. There may be ways to achieve this indirectly. The Vedas hold countless accounts about how our ancient sages freed their minds and opened the doors to the limitless knowledge. But the true essence of the Vedas is scarce in today's world. The translations to the original version are adulterated with the personal interpretations of the translators. The Vedas must be relished as they are, untouched. It is one of my goals to read them in their true form.

I do not assert that the Vedas hold the key to our advancement. But I certainly believe they can direct us in a right path and accelerate us towards achieving it. The top scientists and great thinkers of today might just rediscover this knowledge for us.

Thursday, April 8, 2010

Kick to start!

It has been a few days since my "so called" review, during which I let myself take a few days off before and after. I am conveniently extending my break until the end of this week. Call me lazy, YES call me LAZY! All that I want to do now is slouch around on my bed and brood over each and every thought that comes by in my head.

Very ironically most of my crazy thoughts occur to me in an effortless fleeting moment when I am close to sleep and when my mind is blank ( I am positively sure my mind is blank ). I guess in this state I tend to shut down the logical part of my brain and substitute it with a magic 8 ball ! With no criteria for selection and with the slightest hint of external stimuli, I rapidly argue with myself, conclude the subject and wait for the next crazy thought. Much like a powerful vacuum cleaner that swallows everything that is in its way.

On a hot day in Vellore ( the town from where the devil leaves with sunburns ), while waiting for my friend Abhiram at the canteen table with food laid out in front, I felt myself falling in this crazy trance again. When I am in it I know I am in it but I am so comfortable that I want to be in it! I don't remember what I was looking at but the first thing I noticed was a house fly. It was sitting on the rim on my friends juice mug ready to take a sip! In an instinctive motion I unleashed the "human fly swat attack" which the fly dodged easily. A couple of more failed attempts aaaaand...
first thought: The fly must be really good at computing trajectories and judging collisions. Assertion: The fly's brain must be quite adept at calculations and recognition, much more powerful in this respect than our first microprocessors.

second thought: But the fly must be dumb enough to miss the logic behind all the "fried" flies in the pest-o-flash tray because it zooms straight toward it!
Assertion: Its brain is fast but cannot process logical information.

Third thought: Maybe the fly thinks it is not one of the hundreds of others that lay burnt in the tray!
Assertion: The fly doesnt know its a fly!!

Then comes Abhiram to save my brain some effort. The guy is as crazy as I am! I just casually mention my new observations and he quips : " Flies have hundreds of eyes and their brains have to be powerful and quick enough to process all that information" - True! We then continue to compare a fly's brain to our early microprocessors, the details of which I should avoid for sanity's sake!

Wednesday, March 31, 2010

Problems with

I would recommend that you do ARToolKit development on Linux. There are many hassles one can face while compiling and trying to build ARToolKit for Windows. I couldn't find a V4L driver for my cheap camera and hence I was stuck with no option but to develop on Windows.

For those of you who are stuck with Windows like me, I suggest you use MS VC++ 2005 or later. VC++ 6.0 has some known bugs with class templates. Once you build and try to run your first AR application you are most likely to get an error like:
"Could not find DsRenderer runtime" or sorts.
ARToolKit depends on DSVideoLib 0.0.8 or higher for video related functions that internally call the VFW or V4L driver. To solve this problem you have to register the as a service. Just open command prompt as administrator, navigate to


then type Regsvr32 "path\"

where path is the absolute path to where you contain the file If you cant find it anywhere, then you are most likely to find it in the "bin" folder of ARToolKit 2.65 version.

Once the service is successfully registered you can run your application smoothly.

Discovering ARToolKit

During my initial days at IIIT-H, I had no definite goals apart from the fact that I wanted to spend my time here productively. Though I had some ideas in mind, my guide suggested me to put them away for later as 6 months was too less a time to complete them. He gave me a week off to think of some good things that I could do.

I wanted to work on CUDA. CUDA (Compute Unified Device Architeture) is a revolutionary concept from nVidia. The principle of the CUDA architecture is fairly simple. Instead of devoting almost half the transistors in the processor IC to cache and registers, CUDA devices ( all the latest nVidia GPUs and nVidia Tesla supercomputer) have most transistors devoted to computing. This means lesser cache (which is made unnecessary in GPUs due to the copious amount of equally fast video memory that they come with these days) and more raw computing power. In a nutshell CUDA devices are essentially an array of multiprocessors with shared device memory and they can outperform any processor in the market today by huge margins! Anyway, I was in IIIT -H to do some constructive research on CUDA and my guide seemed to think that this will take more time than I have.

During the course of my week off , I spent most of the time learning about programmable shaders and writing a few myself. I even learnt how to program CUDA devices. I also wrote a few image processing routines that harnessed CUDA's multiprocessing power ( I have a nVidia Geforce 8600 GTS , a CUDA device with a CUDA computability of 1.2). All this gave me no new ideas other than the ones I came here with.

Towards the end of the week I received an email from my guide. In it he described this concept called Augmented Reality and sent me a few links to a toolkit called ARToolKit developed by Dr. Hirokazu Kato and supported by the Human Interface Technology Laboratory at the University of Washington. As I browsed through the sites and googled a few terms, I found myself increasingly fascinated by this new technology. Though it looked complex and enthralling, looking through the sources and reading a few books on computer vision ( Multiple View Geometry in Computer Vision - R.Hartley and A.Zisserman ) I realized that the principle involved was quite simple.

Those of you familiar with the fundamental graphics pipeline will know that at every stage the vertices are changed from one coordinate system (CS) to another by simply multiplying with an appropriate transformation matrix. Similarly, ARToolKit finds the transformation matrix that maps the coordinates in the real world CS to the coordinates in the camera projection plane CS, where the image is formed. Once this matrix is obtained any given virtual object can be placed anywhere in the scene and its corresponding camera plane coordinates can be obtained by multiplying its coordinates with that matrix. This matrix is called the camera matrix or in ARToolKit terms a Transformation Matrix and it is a 3 x 4 matrix.

ARToolKit uses markers of known dimensions and shapes in order to achieve this. The program first looks for a square box with a darkened border in the scene first. Once it finds it, It matches the pattern inside the box to the pattern templates that it has. This can be quite trivially achieved with common image processing routines. once this marker of known dimensions is detected, it is easy to obtain the camera matrix with the pattern of known dimensions and the camera's focal length.

A Typical ARToolKit style marker with a dark bounding box and a pattern inside.

Though it sounds simple, It is difficult to perfect this technique. most cameras have glitches in their output called distortions. These distortions occur due to improper manufacturing of the lens. Radial distortion if the lens bulges too much and barrel distortion etc. It is impossible to manufacture a perfect lens. However these distortions can be corrected by certain complex methods that I wont get my hands dirty with because thankfully ARToolKit does all that hard work.
Once the camera matrix is known all the hard work is done as far as the augmented reality part is concerned. All that is left for us to do is to place our objects where ever we want them and use this camera matrix to transform it to the camera plane CS.
The code that follows is an example given with the ARToolKit. It draws a cube on the detected pattern.

#ifdef _WIN32
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#ifndef __APPLE__
#include <gl/gl.h>
#include <gl/glut.h>
#include <opengl/gl.h>
#include <glut/glut.h>
#include <ar/gsub.h>
#include <ar/video.h>
#include <ar/param.h>
#include <ar/ar.h>

// Camera configuration.
#ifdef _WIN32
char *vconf = "Data\\WDM_camera_flipV.xml";
char *vconf = "";

int xsize, ysize;
int thresh = 100;
int count = 0;

char *cparam_name = "Data/camera_para.dat";
ARParam cparam;

char *patt_name = "Data/patt.hiro";
int patt_id;
double patt_width = 80.0;
double patt_center[2] = {0.0, 0.0};
double patt_trans[3][4];

static void init(void);
static void cleanup(void);
static void keyEvent( unsigned char key, int x, int y);
static void mainLoop(void);
static void draw( void );

int main(int argc, char **argv)
glutInit(&argc, argv);

argMainLoop( NULL, keyEvent, mainLoop );
return (0);

static void keyEvent( unsigned char key, int x, int y)
/* quit if the ESC key is pressed */
if( key == 0x1b ) {
printf("*** %f (frame/sec)\n", (double)count/arUtilTimer());

/* main loop */
static void mainLoop(void)
ARUint8 *dataPtr;
ARMarkerInfo *marker_info;
int marker_num;
int j, k;

/* grab a vide frame */
if( (dataPtr = (ARUint8 *)arVideoGetImage()) == NULL ) {
if( count == 0 ) arUtilTimerReset();

argDispImage( dataPtr, 0,0 );

/* detect the markers in the video frame */
if( arDetectMarker(dataPtr, thresh, &marker_info, &marker_num) < 0 ) {


/* check for object visibility */
k = -1;
for( j = 0; j < marker_num; j++ ) {
if( patt_id == marker_info[j].id ) {
if( k == -1 ) k = j;
else if( marker_info[k].cf < marker_info[j].cf ) k = j;
if( k == -1 ) {

/* get the transformation between the marker and the real camera */
arGetTransMat(&marker_info[k], patt_center, patt_width, patt_trans);



static void init( void )
ARParam wparam;

/* open the video path */
if( arVideoOpen( vconf ) < 0 ) exit(0);
/* find the size of the window */
if( arVideoInqSize(&xsize, &ysize) < 0 ) exit(0);
printf("Image size (x,y) = (%d,%d)\n", xsize, ysize);

/* set the initial camera parameters */
if( arParamLoad(cparam_name, 1, &wparam) < 0 ) {
printf("Camera parameter load error !!\n");
arParamChangeSize( &wparam, xsize, ysize, &cparam );
arInitCparam( &cparam );
printf("*** Camera Parameter ***\n");
arParamDisp( &cparam );

if( (patt_id=arLoadPatt(patt_name)) < 0 ) {
printf("pattern load error !!\n");

/* open the graphics window */
argInit( &cparam, 1.0, 0, 0, 0, 0 );

/* cleanup function called when program exits */
static void cleanup(void)

static void draw( void )
double gl_para[16];
GLfloat mat_ambient[] = {0.0, 0.0, 1.0, 1.0};
GLfloat mat_flash[] = {0.0, 0.0, 1.0, 1.0};
GLfloat mat_flash_shiny[] = {50.0};
GLfloat light_position[] = {100.0,-200.0,200.0,0.0};
GLfloat ambi[] = {0.1, 0.1, 0.1, 0.1};
GLfloat lightZeroColor[] = {0.9, 0.9, 0.9, 0.1};

argDraw3dCamera( 0, 0 );
glClearDepth( 1.0 );

/* load the camera transformation matrix */
argConvGlpara(patt_trans, gl_para);
glLoadMatrixd( gl_para );

glLightfv(GL_LIGHT0, GL_POSITION, light_position);
glLightfv(GL_LIGHT0, GL_AMBIENT, ambi);
glLightfv(GL_LIGHT0, GL_DIFFUSE, lightZeroColor);
glMaterialfv(GL_FRONT, GL_SPECULAR, mat_flash);
glMaterialfv(GL_FRONT, GL_SHININESS, mat_flash_shiny);
glMaterialfv(GL_FRONT, GL_AMBIENT, mat_ambient);
glTranslatef( 0.0, 0.0, 25.0 );
glDisable( GL_LIGHTING );

glDisable( GL_DEPTH_TEST );

If you want to play around with this you should first print out all the markers provided with the toolkit.

The output of this program can be seen once you display your "Hiro" marker in front of the camera. A blue cube appears on the pattern and it turns in accordance to your movements.

Most of the code is self explanatory and the rest will be clear when you write a few programs yourself so I wont waste time in elaborating it.

So this was my first encounter with ARToolKit.

The next day I met my guide ( after trying out ARToolKit ) and discussed with him. We decided that I should develop a demo in which humans can interactively control these objects. Now as to what the demo should be never surfaced until a little later but until then i was tinkering with ARToolKit. In between I tried to recreate the functionality of ARToolKit using OpenCV which I left midway ( I hope to resume it soon ) as soon as I had a solid idea as to what I was going to about the demo.

Augmented Reality

I am currently in IIIT-Hyderabad immersed in my current project on augmented reality.I was introduced to it by a professor here a few months ago. For those of you who haven't heard of augmented reality, it is an upcoming field that will soon play a huge part in revolutionizing human-computer interaction. Augmented-Reality, as the term suggests, means to add virtual elements to a real environment. In other words it can be described as a merger of reality and synthesized virtual realms. The results are truly a feast for the eyes! For instance in the picture show here, the two people who are viewing through their head mounted displays, fed by head mounted cameras can see the virtual city placed on a real table! The city is rendered by the program and placed in the correct position and in the right orientation as per certain calculations which I will describe in detail further ahead.

Augmented reality can be seen as a progeny of existing computer vision concepts. The basic principle is simple:
  • Firstly it is assumed that the camera sees what the user sees, which is almost true given its close proximity to the user's eye and after methodically eliminating radial and axial distortions.
  • Since we wish to place our virtual objects in the scene and make it look convincingly real, we need to know the mapping from the scene to the image plane of the camera. To make it even simpler - The virtual objects need to be aligned with the perspective of the scene. For example, a user looking at a real teapot and a virtual teapot, placed at the same distance from the camera should be convinced that the scene is real. The virtual teapot should not look "out of place" in the scene. There are several novel methods that can be used to achieve this which I will go through in much more detail as I describe my project ahead.
  • The scene should be rendered efficiently. Since the frame rate of the video should be maintained at a reasonable and constant value, we need to finish our frame by frame processing quickly and efficiently.
Now my project here in IIIT - Hyderabad is to find ways to add human interaction into this. So I set out to make a demo, a game rather. My game is quite simple. There is a plane that a user will be holding in front of a camera (like a sheet of paper) . A virtual maze will be created on this plane along with a virtual ball. The objective of this game is that the user should guide the ball through the maze by tilting the plane in his hand. Remember that the ball should be under the influence of gravity to make this possible! :)

It has been almost 2 months since I have been brooding on this project and now I have a working prototype. I now have a 4 sided boundary region in which the user can play with his ball, collision, gravity and a fully functional force accumulator model included. I regret not blogging my progress all this while but now I have some catching up to do. I will devote any free time I have to enumerate the progress that I made over the past 2 months and other proceedings in my project. I will soon upload my game so that you can play it too!

Back to Blogging!

I have been off blogging for long and I now wish to make up for the lost time.
Blogging is a healthy habit. Its enables one to maintain a log his thoughts and actions. Once you look back after years of blogging, you can see your life since then as a smooth slide show. This thought excites me. I often have lots to share and I feel a surge of enthusiasm to log my work here. But on each and every occasion such as this my laziness has successfully been able to take over me. Now I am determined to get over that phase of mine! It helps when I think of my blog as a means of torturing people with my rantings ;) ( Just kidding..hehe). Anyway, I am looking forward to some productive blog activity and urge you people to motivate me by posting your queries and arguments.