Author: Slyne_D
Link: https://www.jianshu.com/p/395f0582c5f7
Source: Jianshu (简书)
Feature Descriptor
A feature descriptor is a representation of an image: it extracts the useful information and throws away the extraneous information. Typically, a feature descriptor converts an image of size w*h*3 (width * height * 3 channels) into a feature vector/array of length n. For example, a 64*128*3 image can be converted into a feature vector of length 3780. The feature vector is not useful for viewing the image itself, but it works very well for tasks like image recognition and object detection when fed into a classifier such as a Support Vector Machine (SVM).
What kind of features are useful? Suppose we want to detect the buttons on clothing in an image. A button is usually circular and has a few holes. You can run an edge detector on the image to keep only the edges, and the button then becomes easy to recognize. For this task the edge information is useful and the color information is not. Good features should also be discriminative: they should be able to tell a button apart from other circular objects.
In the Histogram of Oriented Gradients (HOG) descriptor, the distribution of gradient directions is used as the feature. Gradients along the x and y axes of an image are useful because the gradient magnitude is large around edges and corners, and we know that edges and corners carry a lot of information about object shape.
(The HOG descriptor is not limited to a single length; many other configurations are possible. This post records just one way of computing it.)
How do we compute the histogram of oriented gradients?
We will first explain it using one patch of an image.
Step 1: Preprocessing
A patch can be of any size, but it must have a fixed aspect ratio. For example, with an aspect ratio of 1:2, the patch can be 100*200, 128*256, or 1000*2000, but not 101*205.
The image here is 720*475. We select a 100*200 patch for computing the HOG feature, crop it out of the image, and then resize it to 64*128. (The original paper by Dalal and Triggs also mentions gamma correction as a preprocessing step, but the gains are minor, so it is skipped here.)
[Figure: hog_preprocess — cropping the 100*200 patch and resizing it to 64*128]
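As a minimal sketch of this step (the file name and patch coordinates below are hypothetical, not from the original post), the crop and resize could look like this:
# Python preprocessing sketch
import cv2
img = cv2.imread('image.png')            # hypothetical 720*475 input image
x, y, w, h = 260, 100, 100, 200          # hypothetical patch, aspect ratio 1:2
patch = img[y:y+h, x:x+w]                # crop the 100*200 patch
patch = cv2.resize(patch, (64, 128))     # resize to 64*128 (width, height)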
Step 2: Calculate the Gradient Images
First we compute the horizontal and vertical gradients; only then can we compute the histogram of gradients. This can be done by filtering the image with the two kernels below, or directly with OpenCV's Sobel operator with kernel size 1.
[Figure: gradient kernels — horizontal kernel kx = [-1, 0, 1] and vertical kernel ky = [-1, 0, 1] transposed]
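As a sketch of what "filtering with these kernels" means (my own illustration, reusing the bolt.png file from the code below), cv2.filter2D with these explicit 1-D kernels gives the same gradients as the Sobel calls that follow:
# Python filtering with the explicit 1-D kernels
import cv2
import numpy as np
im = cv2.imread('bolt.png')
im = np.float32(im) / 255.0
kx = np.array([[-1, 0, 1]], dtype=np.float32)  # horizontal kernel
ky = kx.T                                      # vertical kernel
gx = cv2.filter2D(im, cv2.CV_32F, kx)
gy = cv2.filter2D(im, cv2.CV_32F, ky)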
Equivalently, the OpenCV Sobel code is as follows:
// C++ gradient calculation.
// Read image
Mat img = imread("bolt.png");
img.convertTo(img, CV_32F, 1/255.0);
// Calculate gradients gx, gy
Mat gx, gy;
Sobel(img, gx, CV_32F, 1, 0, 1);
Sobel(img, gy, CV_32F, 0, 1, 1);
# Python gradient calculation
import cv2
import numpy as np
# Read image
im = cv2.imread('bolt.png')
im = np.float32(im) / 255.0
# Calculate gradients gx, gy
gx = cv2.Sobel(im, cv2.CV_32F, 1, 0, ksize=1)
gy = cv2.Sobel(im, cv2.CV_32F, 0, 1, ksize=1)
Next, the gradient magnitude g and direction theta are computed with the following formulas:
g = sqrt(gx^2 + gy^2), theta = arctan(gy / gx)
These can be computed with OpenCV's cartToPolar function:
// C++ Calculate gradient magnitude and direction (in degrees)
Mat mag, angle;
cartToPolar(gx, gy, mag, angle, 1);
# Python Calculate gradient magnitude and direction ( in degrees )
mag, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)
The computed gradient images are shown below:
Left: absolute value of the x-gradient. Center: absolute value of the y-gradient. Right: gradient magnitude.
Notice that the x-gradient fires mainly on vertical lines, the y-gradient on horizontal lines, and the gradient magnitude wherever the pixel intensity changes sharply. (Note: the image origin is the top-left corner; the x-axis is horizontal and the y-axis is vertical.)
The gradient image removes a lot of non-essential information (such as a constant background color) and emphasizes outlines. In other words, you can still easily tell from the gradient image that there is a person in the picture.
At every pixel the gradient has a magnitude and a direction. For a color image, the gradient is computed on all three channels; the magnitude at a pixel is the maximum of the three channels' magnitudes, and the angle (direction) is the angle corresponding to that maximum magnitude.
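As a small sketch of this per-pixel channel selection (my own illustration, not OpenCV's internal code), assuming mag and angle of shape (H, W, 3) from the cartToPolar call above:
# Python channel-max selection sketch
import numpy as np
idx = np.argmax(mag, axis=2)            # (H, W) index of the strongest channel
rows, cols = np.indices(idx.shape)
max_mag = mag[rows, cols, idx]          # per-pixel maximum magnitude
max_angle = angle[rows, cols, idx]      # angle of that strongest channel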
Step 3: Compute Gradient Histograms in 8*8 Cells
In this step, the patch above is divided into 8*8 cells (see the figure below), and a gradient histogram is computed for each cell. Why 8*8? One important reason for using a feature descriptor is that it provides a compact/compressed representation. An 8*8 image patch contains 8*8*3 = 192 pixel values. Its gradient has 2 values per pixel (magnitude and direction, taking the channel with the largest magnitude), which adds up to 8*8*2 = 128 numbers. Below we will see how these 128 numbers are represented by a 9-bin histogram, stored as an array of 9 numbers. Besides being compact, a histogram over a patch is also more robust to noise: an individual gradient may be noisy, but a histogram over the whole cell is much less sensitive to it.
The patch is 64*128; dividing it into 8*8 cells gives 64/8 * 128/8 = 8*16 = 128 cells.
For a 64*128 patch, 8*8 cells are big enough to capture interesting features such as the face or the top of the head.
The histogram is a vector of 9 bins corresponding to the angles 0, 20, 40, 60 ... 160.
Let's first look at what the gradients of one 8*8 cell look like:
Center: the cell's gradients shown as arrows. Right: the same cell's gradients shown as numbers.
In the center image, the direction of each arrow is the gradient direction and its length is the gradient magnitude; notice how the arrows point in the direction of intensity change, and the magnitude measures how big that change is.
In the matrix of gradient directions on the right, the angles lie between 0 and 180 degrees rather than 0 and 360 degrees. These are called "unsigned" gradients, because a gradient and its negative are represented by the same number; in other words, a gradient arrow and the same arrow rotated by 180 degrees are considered identical. Why not use 0-360 degrees? In practice, unsigned gradients have been found to work better than signed gradients for pedestrian detection. Some HOG implementations let you choose signed gradients.
The next step is to create a histogram for each of these 8*8 cells. The histogram has 9 bins corresponding to the angles 0, 20, 40, ... 160.
The figure below illustrates the process, using the gradient magnitudes and directions of the cell from the previous figure. The bin is selected based on the direction, and the vote (the value added to the bin) is determined by the magnitude. Look first at the pixel encircled in blue: its angle is 80 and its magnitude is 2, so it adds 2 to the 5th bin. The pixel encircled in red has an angle of 10 and a magnitude of 4; since 10 lies exactly halfway between 0 and 20, its magnitude is split evenly between the 0 bin and the 20 bin.
[Figure: histogram of gradients]
One more detail to be aware of: if an angle is greater than 160, i.e. between 160 and 180, the angle wraps around, because 0 and 180 are equivalent. So in the example below, a pixel with an angle of 165 degrees contributes proportionally to both the 0 bin and the 160 bin.
[Figure: the case of an angle greater than 160]
Adding the contributions of all pixels in the 8*8 cell into these 9 bins builds the 9-bin histogram. For the cell above, it looks like this:
[Figure: histogram of the 8*8 cell]
Here, in our representation, the y-axis is 0 degrees (pointing from top to bottom). You can see that a lot of weight is distributed in the bins near 0 and 180, which is just another way of saying that most gradients in this cell point either up or down.
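As a minimal sketch of the binning step (my own illustration under the conventions above, not a reference implementation):
# Python 9-bin cell histogram with proportional vote splitting
import numpy as np
def cell_histogram(mag, angle, nbins=9):
    # mag, angle: (8, 8) arrays; angles are unsigned, in [0, 180).
    bin_width = 180.0 / nbins               # 20 degrees per bin
    hist = np.zeros(nbins)
    for m, a in zip(mag.ravel(), angle.ravel()):
        pos = a / bin_width                 # fractional bin position
        lo = int(pos) % nbins               # lower bin (e.g. 165 -> the 160 bin)
        hi = (lo + 1) % nbins               # upper bin (wraps 160 -> 0)
        frac = pos - int(pos)               # fraction voted to the upper bin
        hist[lo] += m * (1.0 - frac)
        hist[hi] += m * frac
    return hist
For example, angle 80 with magnitude 2 lands entirely in the 80 bin, angle 10 with magnitude 4 splits evenly between the 0 and 20 bins, and angle 165 splits 3:1 between the 160 and 0 bins, matching the figures above.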
Step 4: 16*16 Block Normalization
In the steps above we created histograms based on the image's gradients, but the gradients of an image are sensitive to the overall lighting. If you divide all pixel values by 2, the gradient magnitudes are halved, and so are the histogram values; the histogram alone therefore does not remove the effect of lighting. Ideally we want our feature descriptor to be independent of lighting variations, so we normalize the histograms to make them robust against such changes.
First consider how a vector is L2-normalized:
v = [128, 64, 32]
L2 norm: [(128^2) + (64^2) + (32^2)]^0.5 = 146.64
Dividing each element of v by 146.64 gives [0.87, 0.43, 0.22].
Now consider the vector 2*v: after normalization it is still [0.87, 0.43, 0.22]. You can see that normalization removes the scale.
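A two-line check of this (my own sketch):
# Python L2 normalization check
import numpy as np
v = np.array([128.0, 64.0, 32.0])
print(v / np.linalg.norm(v))              # unit-length version of v
print(2 * v / np.linalg.norm(2 * v))      # same vector: scale is removed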
You might think of normalizing the 9*1 histogram directly, and that would work, but a better approach is to normalize over a 16*16 block: the four 9*1 histograms in the block are concatenated into a 36*1 vector, which is then normalized. The window is then moved over by 8 pixels (see the animation), and this process is repeated over the whole patch; a sketch follows the figure below.
[Animation: 16x16 block normalization]
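A minimal sketch of the sliding-block normalization (my own illustration, assuming a precomputed cell_hists array of shape (16, 8, 9) holding one 9-bin histogram per cell, e.g. from the cell_histogram function above):
# Python 16*16 block normalization sketch
import numpy as np
def block_normalize(cell_hists, eps=1e-6):
    # cell_hists: (16, 8, 9) per-cell histograms of the 64*128 patch.
    n_rows, n_cols = cell_hists.shape[:2]        # 16 x 8 cells
    blocks = []
    for r in range(n_rows - 1):                  # 15 vertical positions
        for c in range(n_cols - 1):              # 7 horizontal positions
            block = cell_hists[r:r+2, c:c+2].ravel()   # 4 cells -> 36 values
            blocks.append(block / (np.linalg.norm(block) + eps))
    return np.array(blocks)                      # shape (105, 36)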
Step 5: Calculate the HOG Feature Vector
To compute the feature vector for the entire patch, all the 36*1 vectors are concatenated into one giant vector. Its size can be computed as follows (a sanity check in code comes after this list):
- How many 16*16 blocks are there? 7 horizontal and 15 vertical positions, for a total of 7*15 = 105 positions.
- Each 16*16 block is represented by a 36*1 vector, so concatenating them all gives a 36*105 = 3780-dimensional vector.
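As a sanity check (my own, not from the original post): the defaults of OpenCV's built-in HOGDescriptor match the configuration described here (64*128 window, 16*16 blocks, 8-pixel block stride, 8*8 cells, 9 bins), so its output should have length 3780:
# Python HOG length sanity check
import cv2
import numpy as np
hog = cv2.HOGDescriptor()                       # defaults match this post
patch = np.zeros((128, 64, 3), dtype=np.uint8)  # dummy 64*128 patch
descriptor = hog.compute(patch)
print(descriptor.size)                          # 3780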
Visualizing HOG
The HOG descriptor is usually visualized by plotting the 9*1 normalized histograms over the 8*8 cells, as in the figure below. You will notice that the dominant directions of the histograms capture the person's shape, especially around the torso and legs.
[Figure: visualizing the normalized histograms over the image]
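OpenCV offers no easy way to draw this visualization, but scikit-image can produce one. A minimal sketch (assuming a recent scikit-image is installed and bolt.png exists):
# Python HOG visualization with scikit-image
import cv2
from skimage.feature import hog
im = cv2.imread('bolt.png', cv2.IMREAD_GRAYSCALE)
features, hog_image = hog(im, orientations=9, pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2), visualize=True)
hog_image = cv2.normalize(hog_image, None, 0, 1, cv2.NORM_MINMAX)
cv2.imshow('HOG visualization', hog_image)
cv2.waitKey(0)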
English original:
https://www.learnopencv.com/histogram-of-oriented-gradients/