[C#] A cross-platform, SIMD hardware-accelerated vector algorithm for converting a Bgr24 color bitmap to a grayscale Bgr24 bitmap

Article link: https://blog.youkuaiyun.com/zyl910/article/details/143928924

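The previous article covered the "Bgr24 color bitmap to Gray8 grayscale bitmap" algorithm. This article looks at "Bgr24 color bitmap to grayscale Bgr24 bitmap". The difference is that the destination bitmap is also in Bgr24 format; only the pixel data changes from color to grayscale. These algorithms are cross-platform as well: the same source code runs on both X86 and Arm architectures and gets SIMD hardware acceleration on each.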

1. Scalar algorithm

1.1 Algorithm implementation

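The principle is the same as in the previous article. The only difference is how the destination address is computed and written, because each pixel now requires 3 bytes to be written.
The source code is as follows.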

public static unsafe void ScalarDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) {
    const int cbPixel = 3; // Bgr24
    // 16.16 fixed-point weights; the three weights sum to exactly mulPoint.
    const int shiftPoint = 16;
    const int mulPoint = 1 << shiftPoint; // 0x10000
    const int mulRed = (int)(0.299 * mulPoint + 0.5); // 19595
    const int mulGreen = (int)(0.587 * mulPoint + 0.5); // 38470
    const int mulBlue = mulPoint - mulRed - mulGreen; // 7471
    byte* pRow = pSrc;
    byte* qRow = pDst;
    for (int i = 0; i < height; i++) {
        byte* p = pRow;
        byte* q = qRow;
        for (int j = 0; j < width; j++) {
            // Bgr24 byte order: p[0] = blue, p[1] = green, p[2] = red.
            byte gray = (byte)((p[2] * mulRed + p[1] * mulGreen + p[0] * mulBlue) >> shiftPoint);
            q[0] = q[1] = q[2] = gray;
            p += cbPixel; // Bgr24
            q += cbPixel; // Bgr24 store grayscale.
        }
        // Advance by stride rather than width * cbPixel, because rows may be padded.
        pRow += strideSrc;
        qRow += strideDst;
    }
}
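To make the fixed-point arithmetic concrete, here is a minimal sketch that applies the same formula to single pixels through safe spans instead of pointers. The class name GrayFixedPointDemo and its methods are hypothetical helpers written for illustration, and the sketch assumes tightly packed pixels with no row padding.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original article's code.
public static class GrayFixedPointDemo {
    const int ShiftPoint = 16;
    const int MulPoint = 1 << ShiftPoint;               // 65536
    const int MulRed = (int)(0.299 * MulPoint + 0.5);   // 19595
    const int MulGreen = (int)(0.587 * MulPoint + 0.5); // 38470
    const int MulBlue = MulPoint - MulRed - MulGreen;   // 7471

    // Gray value of one pixel, using the same formula as ScalarDoBatch.
    public static byte ToGray(byte b, byte g, byte r) {
        return (byte)((r * MulRed + g * MulGreen + b * MulBlue) >> ShiftPoint);
    }

    // Converts tightly packed Bgr24 pixels (assumption: no row padding)
    // into grayscale Bgr24 pixels of the same length.
    public static void Convert(ReadOnlySpan<byte> src, Span<byte> dst) {
        if (dst.Length < src.Length) throw new ArgumentException("dst is too small.", nameof(dst));
        for (int i = 0; i + 2 < src.Length; i += 3) {
            byte gray = ToGray(src[i], src[i + 1], src[i + 2]);
            dst[i] = dst[i + 1] = dst[i + 2] = gray;
        }
    }
}
```

With these weights, pure red (B=0, G=0, R=255) maps to 76, pure green to 149, pure blue to 29, and white to 255. Because the three weights sum to exactly 65536, white stays at 255 and the result can never exceed a byte.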

1.2 Benchmark code

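BenchmarkDotNet is used for benchmarking.
The benchmark code for the scalar algorithm can be written with the common functions from the previous article. The source code is as follows.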

[Benchmark(Baseline = true)]
public void Scalar() {
    ScalarDo(_sourceBitmapData, _destinationBitmapData, 0);
}

[Benchmark]
public void ScalarParallel() {
    ScalarDo(_sourceBitmapData, _destinationBitmapData, 1);
}

public static unsafe void ScalarDo(BitmapData src, BitmapData dst, int parallelFactor = 0) {
    int width = src.Width;
    int height = src.Height;
    int strideSrc = src.Stride;
    int strideDst = dst.Stride;
    byte* pSrc = (byte*)src.Scan0.ToPointer();
    byte* pDst = (byte*)dst.Scan0.ToPointer();
    int processorCount = Environment.ProcessorCount;
    // batchSize is the number of rows per parallel task; 0 disables parallelism.
    int batchSize = 0;
    if (parallelFactor > 1) {
        batchSize = height / (processorCount * parallelFactor);
    } else if (parallelFactor == 1) {
        if (height >= processorCount) batchSize = 1;
    }
    bool allowParallel = (batchSize > 0) && (processorCount > 1);
    if (allowParallel) {
        int batchCount = (height + batchSize - 1) / batchSize; // ceil((double)height / batchSize)
        Parallel.For(0, batchCount, i => {
            int start = batchSize * i;
            int len = batchSize;
            if (start + len > height) len = height - start; // The last batch may be shorter.
            byte* pSrc2 = pSrc + start * strideSrc;
            byte* pDst2 = pDst + start * strideDst;
            ScalarDoBatch(pSrc2, strideSrc, width, len, pDst2, strideDst);
        });
    } else {
        ScalarDoBatch(pSrc, strideSrc, width, height, pDst, strideDst);
    }
}
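The row-batching arithmetic in ScalarDo can be checked in isolation. The following is a small sketch that computes the same (start, len) row ranges without touching any bitmap; RowBatchDemo and SplitRows are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; mirrors the batching arithmetic of ScalarDo.
public static class RowBatchDemo {
    // Returns the (start, len) row ranges that cover [0, height) in batches of batchSize rows.
    public static List<(int Start, int Len)> SplitRows(int height, int batchSize) {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        var ranges = new List<(int Start, int Len)>();
        int batchCount = (height + batchSize - 1) / batchSize; // ceil(height / batchSize)
        for (int i = 0; i < batchCount; i++) {
            int start = batchSize * i;
            int len = Math.Min(batchSize, height - start); // The last batch may be shorter.
            ranges.Add((start, len));
        }
        return ranges;
    }
}
```

For example, SplitRows(10, 4) yields three ranges: (0, 4), (4, 4) and (8, 2). Every row is covered exactly once, which is what lets each Parallel.For iteration write to its own rows without synchronization.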

2. Vector algorithm

2.1 Algorithm approach

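Building on the 24-bit to 8-bit grayscale conversion, the following approach can be used: on each iteration, read 3 vectors from the source bitmap and de-interleave the 3-element groups to obtain the R, G and B planes. Then use vectorized multiplication and addition to compute the gray values. Finally, take the single vector that holds the gray values, interleave it into 3-element groups, and store the result to the destination bitmap.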

The difference from the "Bgr24 color bitmap to Gray8 grayscale bitmap" vector algorithm is the final "3-element group interleave" step.
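As a reference for what the "3-element group interleave" has to produce, here is a scalar sketch. Group3ZipReference, Zip3 and GrayToBgr24 are hypothetical names used only for illustration; a vector implementation would be expected to produce the same bytes.

```csharp
using System;

// Hypothetical scalar reference for illustration; not the article's vector code.
public static class Group3ZipReference {
    // Interleaves three planes into x0,y0,z0, x1,y1,z1, ...
    public static void Zip3(ReadOnlySpan<byte> x, ReadOnlySpan<byte> y, ReadOnlySpan<byte> z, Span<byte> dst) {
        if (y.Length != x.Length || z.Length != x.Length) throw new ArgumentException("Planes must have equal length.");
        if (dst.Length < 3 * x.Length) throw new ArgumentException("dst is too small.", nameof(dst));
        for (int i = 0; i < x.Length; i++) {
            dst[3 * i] = x[i];
            dst[3 * i + 1] = y[i];
            dst[3 * i + 2] = z[i];
        }
    }

    // For grayscale Bgr24 output, all three planes are the same gray plane.
    public static void GrayToBgr24(ReadOnlySpan<byte> gray, Span<byte> dst) {
        Zip3(gray, gray, gray, dst);
    }
}
```

For instance, the gray plane { 10, 20 } expands to { 10, 10, 10, 20, 20, 20 }: each gray value is repeated into the B, G and R bytes of its pixel.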

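For example, the SSE instruction set uses 128-bit vectors, so one vector is 16 bytes. Reading 3 vectors from the source bitmap therefore reads 48 bytes, that is, 16 RGB pixels. Applying the "3-element group interleave" to the gray vector then produces 3 vectors. Storing those 3 vectors to the destination bitmap writes 48 bytes, that is, 16 RGB pixels.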
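The 16-gray-bytes to 48-output-bytes step can be sketched with the portable Vector128.Shuffle API. This is a simplified illustration under stated assumptions, not the author's implementation: it assumes .NET 7 or later, it relies on all three channels being the same gray vector so that one shuffle per output vector is enough, and the names GrayZipShuffleSketch and Expand16 are hypothetical.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical sketch for illustration; assumes .NET 7+ for Vector128.Shuffle.
public static class GrayZipShuffleSketch {
    // Output byte k (0..47) must equal gray[k / 3]; each table holds 16 of those indices.
    static readonly Vector128<byte> Idx0 = Vector128.Create((byte)0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5);
    static readonly Vector128<byte> Idx1 = Vector128.Create((byte)5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10);
    static readonly Vector128<byte> Idx2 = Vector128.Create((byte)10, 11, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15, 15);

    // Expands 16 gray bytes into 48 bytes, that is, 16 grayscale Bgr24 pixels.
    public static void Expand16(ReadOnlySpan<byte> gray16, Span<byte> dst48) {
        if (gray16.Length < 16) throw new ArgumentException("Need 16 gray bytes.", nameof(gray16));
        if (dst48.Length < 48) throw new ArgumentException("Need 48 output bytes.", nameof(dst48));
        Vector128<byte> gray = Vector128.Create(gray16); // Loads the first 16 bytes.
        Vector128.Shuffle(gray, Idx0).CopyTo(dst48);           // Pixels 0..4 and the B byte of pixel 5.
        Vector128.Shuffle(gray, Idx1).CopyTo(dst48.Slice(16)); // Rest of pixel 5 through G of pixel 10.
        Vector128.Shuffle(gray, Idx2).CopyTo(dst48.Slice(32)); // R of pixel 10 through pixel 15.
    }
}
```

A general 3-element group interleave of three different vectors needs more work per output vector, since each output mixes bytes from all three inputs. The shortcut above applies only because the B, G and R planes are identical in a grayscale image.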

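For interleaving 3-element groups, shuffle-type instructions can be used. For example, for 128-bit vectors on the X86 architecture, the SSSE3 _mm_shuffle_epi8 instruction can be used, which corresponds to the Ssse3.Shuffle method in .NET. The source code is as follows.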

static readonly Vector128<byte> YGroup3Zip_Shuffle_Byte_X_Part0 = Vector128.Create((sbyte)0, -1, -1, 1, -1, -1, 2, -1, -1, 3