Thanks for the A2A.
I think Nitish's answer is what you're looking for, so I'll try to provide some technical background and help you get started.
I'll start by answering your two main questions.
Since you've mentioned that you are new to computer vision, I'll start this answer at a very basic level.
I assume that you have basic linear algebra knowledge: you know what a matrix is, and you know about arrays in C/C++.
First, some background:
Almost all approaches to computer vision work by capturing, analyzing, and processing a single frame/image at a time. A video is simply a sequence of static images, called frames, displayed one after another with a fixed delay between them. This rate is expressed in "frames per second" (fps). Around 25 to 30 fps is typical, and anything above 30 fps looks smooth to the eye.
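To make the relationship concrete, here is a tiny sketch (the helper name is mine, not an OpenCV function) of how fps translates into the delay between frames:

```python
def frame_delay_ms(fps):
    """Delay between consecutive frames, in milliseconds."""
    return 1000.0 / fps

print(frame_delay_ms(25))  # 40.0 -> 40 ms between frames at 25 fps
print(frame_delay_ms(30))  # ~33.3 ms between frames at 30 fps
```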
Now, on to a static image. Imagine a two-dimensional black-and-white image. This image can be thought of as a 2D matrix, and each element in this matrix is what we call a "pixel". In the case of a pure black-and-white image, only two values are needed: let's say 0 is black and 1 is white. This is the usual convention, since the values in the matrix represent "intensity": black has the least intensity, while white has the most.
Now, let's take this a step further, to a grayscale image: an image with varying shades of grey. The basic structure is still a 2D matrix, but each element needs to be more than a binary 0 or 1, since we need different intensities or shades. So we assign values within a range, and that range is defined by the "depth". An 8-bit depth means you can have values from 0 to 255, while a 16-bit depth gives 0 to 65535. As you can imagine, the greater the depth, the more shades and detail are possible. (The default depth in OpenCV is 8 bits.)
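A quick sketch of how bit depth determines the value range (the helper is illustrative, not part of OpenCV):

```python
def intensity_range(depth_bits):
    """Smallest and largest intensity representable at a given bit depth."""
    return (0, 2 ** depth_bits - 1)

print(intensity_range(1))   # (0, 1): pure black/white
print(intensity_range(8))   # (0, 255): OpenCV's default depth
print(intensity_range(16))  # (0, 65535)
```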
When you load an image in OpenCV, it also treats it as a matrix; however, it "flattens" the matrix into a one-dimensional array, in row-major order (row after row).
You can have a look at the cv::Mat structure to understand this better:
http://docs.opencv.org/modules/c...
Notice how the data is stored in a one-dimensional array, "data", of type "uchar" (unsigned char). This means that each element in the array is 8 bits, which is fine if your depth is 8 bits; but if your depth is more than 8 bits, say 16 bits, then two consecutive elements of "data" together store a single pixel.
So to access a single pixel of your image in the one-dimensional "data" array, you need to know the depth, the number of rows, and the number of columns.
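As a sketch of the arithmetic involved (the function name is mine; OpenCV computes this internally, and the real Mat layout can also include row padding via its "step" field):

```python
def pixel_offset(row, col, n_cols, bytes_per_pixel=1):
    """Byte offset of pixel (row, col) in a flattened row-major buffer."""
    return (row * n_cols + col) * bytes_per_pixel

# A 2x3 grayscale image, flattened row by row:
#   [[10, 20, 30],
#    [40, 50, 60]]  ->  [10, 20, 30, 40, 50, 60]
flat = [10, 20, 30, 40, 50, 60]
print(flat[pixel_offset(1, 2, n_cols=3)])  # 60: the pixel at row 1, col 2
```

With a 16-bit depth, `bytes_per_pixel=2` and the offset points at the first of the two bytes holding that pixel.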
Things get more complicated with color images. A color image is made up of three components: Red, Green, and Blue (RGB). OpenCV prefers the BGR ordering, which is identical except that blue is stored first, then green, then red.
This is organized as a 3-dimensional matrix: you can imagine it as three 2D matrices stacked on top of each other. Once again, OpenCV stores this 3D matrix in a 1D array in row-major order; however, it interleaves the B, G, and R values of each pixel and stores them together. Again, depth matters: the greater the depth, the more colors can be represented.
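Extending the earlier index arithmetic to the interleaved BGR layout (again, an illustrative sketch, not an OpenCV API):

```python
def bgr_offset(row, col, n_cols, channel):
    """Index of one channel of pixel (row, col) in an interleaved BGR buffer.

    channel: 0 = Blue, 1 = Green, 2 = Red (OpenCV's default ordering).
    """
    return (row * n_cols + col) * 3 + channel

# A 1x2 image stored as [B0, G0, R0, B1, G1, R1]:
flat = [255, 0, 0,  0, 0, 255]  # pixel 0 is pure blue, pixel 1 is pure red
print(flat[bgr_offset(0, 1, n_cols=2, channel=2)])  # 255: red channel of pixel 1
```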
Now, on to your questions.
As you can see, a single color image carries a tremendous amount of data, and OpenCV performs almost all of its operations on Mat structures. Now imagine a sequence of many such frames forming a video: that is a lot of data. Obviously, it's not sensible to store all of it in its "raw" form. We also need other data, such as the frame rate, audio, and other metadata, to be stored alongside it. So all of this is placed in a "container", which is simply a definition of how the data is organized and stored. But the frames contain a lot of redundant data and consume a lot of disk space, so we "compress" them. The result is compressed video, stored in a container, written to disk.
The compression is done by an algorithm that "encodes" the data into a stream of bits (we don't need to know exactly how). So, when you want to work with a frame from a video, the first thing that needs to happen is "decoding". This process decompresses the data, separates it from the audio and other metadata, and organizes it in an OpenCV-friendly format.
The data available after reading a file like this is usually referred to as "raw" data, which basically means that it has not been edited or modified.
Please note that if you are grabbing an image from a camera or webcam, then technically the raw image is the frame you grab from the camera. When you use a VideoCapture object in OpenCV, it automatically interfaces with your camera drivers to decode incoming frames. This often involves some post-processing to fit OpenCV's formats; however, we don't count this as "processing" and treat the output of all this as the "raw" data.
(Side note: it's not uncommon for webcams to use some sort of color filter array when they capture or send images. Example: the Bayer filter.)
Hope this answers your basic queries and was informative!
Now, on to the specifics.
GPUs are fantastic with large data sets that have a great deal of data-parallelism, i.e., where each element can be operated on individually. Think of matrix addition: each element of the sum matrix is the sum of one element from each input matrix, with no dependence on the others. So rather than iterating over each element, a GPU can exploit this parallelism and work on the whole matrix at once. (This is a very high-level and simplistic explanation, but it suffices here.) As such, GPUs are excellent for processing images/frames. You needn't worry about how to do this, as OpenCV provides some excellent bindings for it.
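The matrix-addition example can be made concrete. Each output element depends only on the two inputs at the same position, which is exactly the independence a GPU exploits:

```python
# Matrix addition: every output element depends only on the two inputs
# at the same position, so all of these additions are independent and
# could, in principle, run simultaneously on a GPU.
A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
C = [[A[i][j] + B[i][j] for j in range(len(A[0]))] for i in range(len(A))]
print(C)  # [[11, 22], [33, 44]]
```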
A modern CPU has built-in support for decoding video at the hardware level. Also, grabbing a frame from a camera requires interfacing with the camera, which is best done by the CPU. So, as others have told you: let the CPU decode, and do your processing on the GPU.
Dividing tasks between CPUs and GPUs can be tricky at times. So here's a few points:
1) Tasks that interface with the disk and peripherals, such as reading and writing files or grabbing frames from a camera, are best suited to the CPU.
2) Per-frame data processing in the form of mathematical operations (filters, convolutions, transforms, etc.) is best handled by the GPU.
3) When implementing an algorithm, try to group several GPU operations together. That is, once you've uploaded the frame to the GPU, perform as many tasks as possible on the GPU before transferring it back to the CPU. There is significant overhead in copying data between the CPU and the GPU, in either direction.
For the specifics of how to do this, I believe Nitish Satyavolu's answer to this post is good and should be what you're looking for!
Good luck! And welcome to the world of computer vision with OpenCV!