发布 VectorTraits v3.0（支持 X86架构的Avx512系列指令集，支持 Wasm架构及PackedSimd指令集等）

最新推荐文章于 2025-12-28 09:18:17 发布

原创

最新推荐文章于 2025-12-28 09:18:17 发布 · 1.3k 阅读

12 ·

CC 4.0 BY-SA版权

文章标签：

#wasm #c# #.net #SIMD

文章目录

VectorTraits已更新至 v3.0版。支持Vector512类型及 X86架构的Avx512系列指令集; 支持 Wasm架构及PackedSimd指令集; 还提供了多向量换位（YShuffleX2、YShuffleX3）、交织(Group2Zip, Group2Unzip) 等原创的向量方法。

NuGet: https://www.nuget.org/packages/VectorTraits/3.0.0
源代码: https://github.com/zyl910/VectorTraits
在线文档: https://zyl910.github.io/VectorTraits_doc/

变更日志如下。

Supports Vector512 type and X86 architecture’s Avx512 family instruction sets (支持Vector512类型及 X86架构的Avx512系列指令集).
Improved algorithms for 128/256 bit vectors using the Avx512 family instruction sets (使用Avx512系列指令集, 改进了128/256位向量的算法).
Supports Wasm(WebAssembly) architecture, supports PackedSimd instruction set (支持 Wasm(WebAssembly)架构及PackedSimd指令集).
Added VectorTraits.Benchmarks.Wasm project. Used for unit test and benchmark vector types on the Wasm architecture (增加了 VectorTraits.Benchmarks.Wasm 项目. 用于在 Wasm 架构上对向量类型进行单元测试与基准测试).
Support for .NET 8.0 new vector methods (支持 .NET 8.0 新增的向量方法): WidenLower, WidenUpper.
Provides vector methods for de-interleave (提供解交织的向量方法): YGroup2Unzip, YGroup2UnzipEven, YGroup2UnzipOdd, YGroup3Unzip, YGroup3UnzipX2, YGroup4Unzip, YGroup6Unzip_Bit128.
Provides vector methods for interleave (提供交织的向量方法): YGroup2Zip, YGroup2ZipHigh, YGroup2ZipLow, YGroup3Zip, YGroup3ZipX2, YGroup4Unzip, YGroup6Zip_Bit128.
Provides vector methods for reconstruction groups (提供重新构造组的向量方法): YGroup1ToGroup3, YGroup1ToGroup4, YGroup1ToGroup4WithW, YGroup3ToGroup4, YGroup4ToGroup3.
Provides vector methods for multi vector shuffle (提供多向量换位的向量方法): YShuffleX2, YShuffleX2Insert, YShuffleX2Kernel, YShuffleX3, YShuffleX3Insert, YShuffleX3Kernel, YShuffleX4, YShuffleX4Insert, YShuffleX4Kernel.
Provides vector methods for transpose (提供转置的向量方法): YGroup2Transpose, YGroup2TransposeEven, YGroup2TransposeOdd.
Provides vector methods for compare (提供比较的向量方法): GreaterThan_Unsigned.
Provides extension method for vectors (提供向量的扩展方法): AsSigned, AsUnsigned.
VectorSameWExtensions add As method. Vector<T> add As Extension Methods (Vector<T> 增加 As 扩展方法).
The IsHardwareAccelerated property has been added to the Vectors(/Vector128s/Vector256s/Vector512s) classes (Vectors(/Vector128s/Vector256s/Vector512s) 类增加了 IsHardwareAccelerated 属性).
Added 128-bit integer type - ExInt128/ExUInt128 (增加了128位整数类型 - ExInt128/ExUInt128).
(Experimental) Experimentally added the ExType(Extended type) mechanism to enable vector types to support 128-bit integers(ExInt128/ExUInt128). However, it is found that some functions do not work properly under some .NET versions. Therefore, it is recommended to prioritize using functions with the suffix “_Bit128” instead of ExType (实验性的增加了 ExType(扩展类型) 机制, 使向量类型能支持128位整数(ExInt128/ExUInt128). 但发现某些.NET版本下，个别函数的工作不正常. 故建议优先使用“_Bit128”后缀的函数, 而不是 ExType).
The MathBitOperations class has been added. It provides these functions (增加了 MathBitOperations 类. 它提供了这些函数): Crc32C, IsPow2, LeadingZeroCount, Log2, PopCount, RoundUpToPowerOf2, RotateLeft, RotateRight, TrailingZeroCount.
Package “System.Runtime.CompilerServices.Unsafe” upgraded to version 5.0.0. UnsafeUtil obsoletes methods such as IsNullRef, NullRef, SkipInit (Unsafe包升级到 5.0.0 版. UnsafeUtil 废弃了 IsNullRef 等函数).
The UnsafeUtil class add methods (UnsafeUtil类增加了方法): GetArrayDataReference, Dec, Inc, PreDec, PreInc, PostDec, PostDecExcept, PostDecExceptZero, PostInc, PostIncExcept, PostIncExceptZero .
Optimize type conversion for vector generic types by replacing (object) with the As method (优化向量泛型类型的类型转换, 用 As 方法取代 (object)).
Optimize the combining of two vectors, using WithUpper instead of Create (优化两个向量的组合, 用 WithUpper 代替 Create). e,g, Narrow and more.
Optimized hardware acceleration of YBitToByte, YBitToInt16, YBitToInt32, YBitToInt64 methods on all architecture. It no longer uses OnesComplement (优化YBitToByte, YBitToInt16, YBitToInt32, YBitToInt64方法在所有架构的硬件加速. 它不再使用 OnesComplement).
Optimized hardware acceleration of Shuffle/YShuffleInsert methods on X86 architecture. Use EqualsShift arithmetic (优化Shuffle/YShuffleInsert方法在X86架构的硬件加速. 使用 EqualsShift 算法). For 16~64 bit types.
Fix all YShuffleG4X2 methods, remove redundant ConstantExpected attribute(修正所有 YShuffleG4X2 方法, 移除多余的 ConstantExpected 特性).
Removal of obsolete project file VectorTraits_vs2019.sln (移除过时的项目文件 VectorTraits_vs2019.sln).
Deprecation notice: Deprecation notice: The next version will remove such as WVectorTraits128AdvSimdB64/WVectorTraits128Avx2 classes (废弃预告: 下个版本将会移除 WVectorTraits128AdvSimdB64, WVectorTraits128Avx2 等类).

完整列表: ChangeLog

支持 X86架构的Avx512系列指令集

支持Avx512时的输出信息

当支持Avx512指令集，范例程序 samples/VectorTraits.Sample 的输出信息如下。

VectorTraits.Sample

IsRelease:	True
Environment.ProcessorCount:	16
Environment.Is64BitProcess:	True
Environment.OSVersion:	Microsoft Windows NT 10.0.22631.0
Environment.Version:	8.0.8
Stopwatch.Frequency:	10000000
RuntimeEnvironment.GetRuntimeDirectory:	C:\Program Files\dotnet\shared\Microsoft.NETCore.App\8.0.8\
RuntimeInformation.FrameworkDescription:	.NET 8.0.8
RuntimeInformation.OSArchitecture:	X64
RuntimeInformation.OSDescription:	Microsoft Windows 10.0.22631
RuntimeInformation.RuntimeIdentifier:	win-x64
IntPtr.Size:	8
BitConverter.IsLittleEndian:	True
Vector.IsHardwareAccelerated:	True
Vector<byte>.Count:	32	# 256bit
Vector<float>.Count:	8	# 256bit
Vector128.IsHardwareAccelerated:	True
Vector256.IsHardwareAccelerated:	True
Vector512.IsHardwareAccelerated:	True
Vector<T>.Assembly.CodeBase:	file:///C:/Program Files/dotnet/shared/Microsoft.NETCore.App/8.0.8/System.Private.CoreLib.dll
GetTargetFrameworkDisplayName(VectorTextUtil):	.NET 8.0
GetTargetFrameworkDisplayName(TraitsOutput):	.NET 8.0
VectorTraitsGlobal.InitCheckSum:	-2122844161	# 0x8177F7FF
VectorEnvironment.CpuModelName:	AMD Ryzen 7 7840H w/ Radeon 780M Graphics
VectorEnvironment.SupportedInstructionSets:	Aes, Avx, Avx2, Avx512BW, Avx512CD, Avx512DQ, Avx512F, Avx512Vbmi, Avx512VL, Bmi1, Bmi2, Fma, Lzcnt, Pclmulqdq, Popcnt, Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, X86Base
Vector128s.Instance:	WVectorTraits128Avx2	// Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Avx, Avx2, Avx512VL
Vector256s.Instance:	WVectorTraits256Avx2	// Avx, Avx2, Sse, Sse2, Avx512VL
Vector512s.Instance:	WVectorTraits512Avx512	// Avx512BW, Avx512DQ, Avx512F, Avx512Vbmi, Avx, Avx2, Sse, Sse2
Vectors.Instance:	VectorTraits256Avx2	// Avx, Avx2, Sse, Sse2, Avx512VL
Vectors.BaseInstance:	VectorTraits256Base

src:    <0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7>        # (0000 0001 0002 0003 0004 0005 0006 0007 0000 0001 0002 0003 0004 0005 0006 0007)
ShiftLeft:      <0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14>  # (0000 0002 0004 0006 0008 000A 000C 000E 0000 0002 0004 0006 0008 000A 000C 000E)
Equals to BCL ShiftLeft:        True
Equals to ShiftLeft_Const:      True

desc:   <15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0>  # (000F 000E 000D 000C 000B 000A 0009 0008 0007 0006 0005 0004 0003 0002 0001 0000)
Shuffle:        <14, 12, 10, 8, 6, 4, 2, 0, 14, 12, 10, 8, 6, 4, 2, 0>  # (000E 000C 000A 0008 0006 0004 0002 0000 000E 000C 000A 0008 0006 0004 0002 0000)
Equals to BCL Shuffle:  True
Equals to Shuffle_Core: True

ShiftLeft_AcceleratedTypes:     SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64        # (00001FE0)
Shuffle_AcceleratedTypes:       SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64, Single, Double        # (00007FE0)

其中最关键的是这几行，分别展示了各种向量类型所用的指令集。

Vector128s.Instance:	WVectorTraits128Avx2	// Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Avx, Avx2, Avx512VL
Vector256s.Instance:	WVectorTraits256Avx2	// Avx, Avx2, Sse, Sse2, Avx512VL
Vector512s.Instance:	WVectorTraits512Avx512	// Avx512BW, Avx512DQ, Avx512F, Avx512Vbmi, Avx, Avx2, Sse, Sse2
Vectors.Instance:	VectorTraits256Avx2	// Avx, Avx2, Sse, Sse2, Avx512VL

输出信息中的“Vector512.IsHardwareAccelerated: True”，表示512位向量支持硬件加速。

支持 Wasm架构及PackedSimd指令集

支持PackedSimd时的输出信息

当在WASM环境下运行，且支持PackedSimd指令集，范例程序的输出信息如下。

VectorTraits.Sample on Wasm

IsRelease:	True
Environment.ProcessorCount:	1
Environment.Is64BitProcess:	False
Environment.OSVersion:	Other 1.0.0.0
Environment.Version:	8.0.4
Stopwatch.Frequency:	1000000000
RuntimeEnvironment.GetRuntimeDirectory:	/
RuntimeInformation.FrameworkDescription:	.NET 8.0.4
RuntimeInformation.OSArchitecture:	Wasm
RuntimeInformation.OSDescription:	Browser
RuntimeInformation.RuntimeIdentifier:	browser-wasm
IntPtr.Size:	4
BitConverter.IsLittleEndian:	True
Vector.IsHardwareAccelerated:	False
Vector<byte>.Count:	16	# 128bit
Vector<float>.Count:	4	# 128bit
Vector128.IsHardwareAccelerated:	True
Vector256.IsHardwareAccelerated:	False
Vector512.IsHardwareAccelerated:	False
Vector<T>.Assembly.CodeBase:	
GetTargetFrameworkDisplayName(VectorTextUtil):	.NET 8.0
GetTargetFrameworkDisplayName(TraitsOutput):	.NET 8.0
VectorTraitsGlobal.InitCheckSum:	-2122844158	# 0x8177F802
VectorEnvironment.CpuModelName:	
VectorEnvironment.SupportedInstructionSets:	PackedSimd
Vector128s.Instance:	WVectorTraits128PackedSimd	// PackedSimd
Vectors.Instance:	VectorTraits128PackedSimd	// PackedSimd
Vectors.BaseInstance:	VectorTraits128Base

src:	<0, 1, 2, 3, 4, 5, 6, 7>	# (0000 0001 0002 0003 0004 0005 0006 0007)
ShiftLeft:	<0, 2, 4, 6, 8, 10, 12, 14>	# (0000 0002 0004 0006 0008 000A 000C 000E)
Equals to BCL ShiftLeft:	True
Equals to ShiftLeft_Const:	True

desc:	<7, 6, 5, 4, 3, 2, 1, 0>	# (0007 0006 0005 0004 0003 0002 0001 0000)
Shuffle:	<14, 12, 10, 8, 6, 4, 2, 0>	# (000E 000C 000A 0008 0006 0004 0002 0000)
Equals to BCL Shuffle:	True
Equals to Shuffle_Core:	True

ShiftLeft_AcceleratedTypes:	None	# (00000000)
Shuffle_AcceleratedTypes:	None	# (00000000)

其中最关键的是这几行：

Vector.IsHardwareAccelerated:	False
Vector128.IsHardwareAccelerated:	True
Vector256.IsHardwareAccelerated:	False
Vector512.IsHardwareAccelerated:	False
VectorEnvironment.SupportedInstructionSets:	PackedSimd
Vector128s.Instance:	WVectorTraits128PackedSimd	// PackedSimd
Vectors.Instance:	VectorTraits128PackedSimd	// PackedSimd
Vectors.BaseInstance:	VectorTraits128Base

可以看出，浏览器支持PackedSimd。于是128位向量（Vector128）具有硬件加速。

`VectorTraits.Benchmarks.Wasm` 使用说明

传统的 .NET 程序，无论是命令行程序，还是UI界面程序，甚至 ASP.Net 后台程序，都是在本机上运行，而不是在浏览器中运行。
为了能直接在浏览器中运行 .NET 程序, 需使用“Blazor Wasm”项目类型。它会将 .NET 程序编译为Wasm，用于在浏览器中运行。

本库已经建立了 VectorTraits.Benchmarks.Wasm 项目，可以用它来测试 .NET 程序在Wasm架构运行时的表现。且支持基准测试与单元测试。

启动VectorTraits.Benchmarks.Wasm 项目后，VS会自动打开浏览器，并浏览主页（Home）。如下图。

在左侧的导航栏中点击“View benchmark”，便会打开基准测试的页面，用于对比测试各种算法的性能。点击“Test”按钮，便会开始测试。该测试是异步进行的，还会在“Test”按钮的右侧显示进度。

默认情况下，仅运行少量的基准测试。若勾选了“AllowFakeBenchmark”复选框，会运行所有的基准测试，会花费一个小时或更长的时间。

在左侧的导航栏中点击“View unit test”，便会打开基准测试的页面，用于验证算法的正确性。点击“Test”按钮，便会开始测试。

由于 .NET 8.0 时的“Blazor Wasm”尚不支持多线程，且NUnit尚不支持异步执行。故只能以同步方式运行，需等到所有单元测试均执行完毕后，才能获得结果。一般需花费数分钟不等。

新增了向量方法

支持 `.NET 8.0` 新增的向量方法

.NET 8.0 新增了这些向量方法——

WidenLower: Widens the lower half of a Vector into a Vector (将向量的低半部分扩宽为一个向量).
Mnemonic: rt[i] := widen(source[i]).
WidenUpper: Widens the upper half of a Vector into a Vector (将向量的高半部分扩宽为一个向量).
Mnemonic: rt[i] := widen(source[i - Count/2]).

Vectors/Vector128s/Vector256s/Vector512s 均支持了这些方法。

提供交织与解交织的向量方法

在使用向量化运算时，有时需要重新排列元素的位置。例如在做图形、图像处理时，经常需要对已打包的RGB或RGBA数据进行交织和解交织运算。

虽然 .NET 7.0 开始提供了 Shuffle 方法，但它存在很多问题——变长的Vector类型里还未提供该方法，直到 .NET 8.0很多架构上仍没有硬件加速。
为了解决官方的 Shuffle 方法的问题，本库提供 Shuffle 系列方法，能够在 X86、Arm等架构上享受硬件加速。但还存在2个缺点——

若让使用者基于Shuffle来开发交织和解交织运算的函数，写起来会比较繁琐。
对于交织和解交织算法来说，Shuffle实现的算法的性能并不是最优的。有一些特殊算法，能带来更大的性能提升。

于是本库提供了交织与解交织的向量方法。且将其抽象为 2~4-元素组的处理，使它们能具有更强的通用性。相关方法的说明见下。

YGroup2Unzip[/_Bit128]: De-Interleave 2-element groups into 2 vectors. It converts the 2-element groups AoS to SoA (将2-元素组解交织为2个向量. 它能将2元素组的数组结构体转为结构体数组).
Mnemonic: x[i] =: element_ref(2*i, data0, data1), y[i] =: element_ref(2*i+1, data0, data1).
YGroup2UnzipEven: De-Interleave the 2-element groups into 2 vectors, and return the vector of even positions (将2-元素组解交织为2个向量, 并返回偶数位置的数据).
Mnemonic: rt[i] =: element_ref(2*i, data0, data1).
YGroup2UnzipOdd: De-Interleave the 2-element groups into 2 vectors, and return the vector of odd positions (将2-元素组解交织为2个向量, 并返回奇数位置的数据).
Mnemonic: rt[i] =: element_ref(2*i+1, data0, data1).
YGroup2Zip[/_Bit128]: Interleave 2 vectors into 2-element groups. It converts the 2-element groups SoA to AoS (将2个向量交织为2-元素组. 它能将2元素组的结构体数组转为数组结构体).
Mnemonic: element_ref(i, data0, data1) := (0==(i&1))?( x[i2] ):( y[i2] ), i2 := i/2.
YGroup2ZipHigh: Interleave 2 vectors into 2-element groups and returns the data in the high position. (将2个向量交织为2-元素组, 并返回高位置的数据).
Mnemonic: rt[i] := (0==(i&1))?( x[i2] ):( y[i2] ), i2 := (i+T.Count)/2.
YGroup2ZipLow: Interleave 2 vectors into 2-element groups and returns the data in the low position. (将2个向量交织为2-元素组, 并返回低位置的数据).
Mnemonic: rt[i] := (0==(i&1))?( x[i2] ):( y[i2] ), i2 := i/2.
YGroup3Unzip[/_Bit128]: De-Interleave 3-element groups into 3 vectors. It converts the 3-element groups AoS to SoA. It can also deinterleave packed RGB pixel data into R,G,B planar data (将3-元素组解交织为3个向量. 它能将3元素组的数组结构体转为结构体数组. 它还能将已打包的RGB像素数据, 解交织为 R,G,B 平面数据).
Mnemonic: x[i] =: element_ref(3*i, data0, data1, data2), y[i] =: element_ref(3*i+1, data0, data1, data2), z[i] =: element_ref(3*i+2, data0, data1, data2).
YGroup3UnzipX2[/_Bit128]: De-Interleave 3-element groups into 3 vectors and process 2x data (将3-元素组解交织为3个向量, 且处理2倍数据).
Mnemonic: (x, y, z) = YGroup3Unzip(data0, data1, data2), (xB, yB, zB) = YGroup3Unzip(data3, data4, data5).
YGroup3Zip[/_Bit128]: Interleave 3 vectors into 3-element groups. It converts the 3-element groups SoA to AoS. It can also interleave R,G,B planar data into packed RGB pixel data (将3个向量交织为3-元素组. 它能将3元素组的结构体数组转为数组结构体. 它还能将 R,G,B 平面数据, 交织为已打包的RGB像素数据).
Mnemonic: element_ref(i, data0, data1, data2) := (0==(i%3))?( x[i2] ):( (1==(i%3))?( y[i2] ):( z[i2] ) ), i2 := i/3.
YGroup3ZipX2[/_Bit128]: Interleave 3 vectors into 3-element groups and process 2x data (将3个向量交织为3-元素组, 且处理2倍数据).
Mnemonic: (data0, data1, data2) = YGroup3Zip(x, y, z), (data3, data4, data5) = YGroup3Zip(xB, yB, zB).
YGroup4Unzip[/_Bit128]: De-Interleave 4-element groups into 4 vectors. It converts the 4-element groups AoS to SoA. It can also deinterleave packed RGBA pixel data into R,G,B,A planar data (将4-元素组解交织为4个向量. 它能将4元素组的数组结构体转为结构体数组. 它还能将已打包的RGBA像素数据, 解交织为 R,G,B,A 平面数据).
Mnemonic: x[i] =: element_ref(4*i, data0, data1, data2, data3), y[i] =: element_ref(4*i+1, data0, data1, data2, data3), z[i] =: element_ref(4*i+2, data0, data1, data2, data3), w[i] =: element_ref(4*i+3, data0, data1, data2, data3).
YGroup4Zip[/_Bit128]: Interleave 4 vectors into 4-element groups. It converts the 4-element groups SoA to AoS. It can also interleave R,G,B,A planar data into packed RGBA pixel data (将4个向量交织为4-元素组. 它能将4元素组的结构体数组转为数组结构体. 它还能将 R,G,B,A 平面数据, 交织为已打包的RGBA像素数据).
Mnemonic: element_ref(i, data0, data1, data2, data3) := (0==(i&3))?( x[i2] ):( (1==(i&3))?( y[i2] ):( (2==(i&3))?( z[i2] ):( w[i2] ) ) ), i2 := i/4.
YGroup6Unzip_Bit128: De-Interleave 6-element groups into 6 vectors. It converts the 6-element groups AoS to SoA (将6-元素组解交织为6个向量. 它能将6元素组的数组结构体转为结构体数组). It is specialized for process 128-bit element (它专门用于处理128位元素).
Mnemonic: x[i] =: element_ref(6*i, data0, data1, data2, data3, data4, data5), y[i] =: element_ref(6*i+1, data0, data1, data2, data3, data4, data5), z[i] =: element_ref(6*i+2, data0, data1, data2, data3, data4, data5), w[i] =: element_ref(6*i+3, data0, data1, data2, data3, data4, data5), u[i] =: element_ref(6*i+4, data0, data1, data2, data3, data4, data5), v[i] =: element_ref(6*i+5, data0, data1, data2, data3, data4, data5).
YGroup6Zip_Bit128: Interleave 6 vectors into 6-element groups. It converts the 6-element groups SoA to AoS (将6个向量交织为6-元素组. 它能将6元素组的结构体数组转为数组结构体). It is specialized for process 128-bit element (它专门用于处理128位元素).
Mnemonic: element_ref(i, data0, data1, data2, data3, data4, data5) := (0==(i%6))?( x[i2] ):( (1==(i%6))?( y[i2] ):( (2==(i%6))?( z[i2] ):( (3==(i%6))?( w[i2] ):( (4==(i%6))?( u[i2] ):( v[i2] ) ) ) ) ), i2 := i/6.

对于 3-元素组的交织，有一种算法是可以同时处理2倍数据的，于是提供了 YGroup3UnzipX2、YGroup3ZipX2 方法。且这些算法的内部需要处理“128位6-元素组”的交织，于是还暴露了 YGroup6Unzip_Bit128、YGroup6Zip_Bit128 方法。

YGroup3Unzip的范例代码

下面是范例代码。

private static void ShowYGroup3Unzip(TextWriter writer) {
   
   
    writer.WriteLine("[YGroup3Unzip]");
    // Byte
    int cnt = Vector<byte>.Count;
    var a0 = Vectors.CreateByDoubleLoop<byte>(0, 1);
    var a1 = Vectors.CreateByDoubleLoop<byte>(cnt * 1, 1);
    var a2 = Vectors.CreateByDoubleLoop<byte>(cnt * 2, 1);
    var s0 = Vectors.YGroup3Unzip(a0, a1, a2, out var s1, out var s2);
    var t0 = Vectors.YGroup3Zip(s0, s1, s2, out var t1, out var t2);
    VectorTextUtil.WriteLine(writer, "Before      :\t{0}, {1}, {2}", a0, a1, a2);
    VectorTextUtil.WriteLine(writer, "YGroup3Unzip:\t{0}, {1}, {2}", s0, s1, s2);
    VectorTextUtil.WriteLine(writer, "YGroup3Zip  :\t{0}, {1}, {2}", t0, t1, t2);
    writer.WriteLine();
}

这个方法里，先使用 CreateByDoubleLoop 创建向量，使 a0、a1、a2 为连续的数字。

随后使用 YGroup3Unzip 对这3个向量进行 3-元素组的解交织，结果存储到 s0、s1、s2。若 a0、a1、a2 存储的是已打包的RGB像素值的话，可以看出 R、G、B 通道的数值正好被分离到 s0、s1、s2 这3个向量里了。

再使用 YGroup3Zip 对这3个向量进行 3-元素组的交织，结果存储到 t0、t1、t2。它们与先前的 a0、a1、a2 的值相同。

下面是输出结果。

[YGroup3Unzip]
Before      :   <0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31>, <32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63>, <64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95>      # (00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F), (20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F), (40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F)
YGroup3Unzip:   <0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93>, <1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58, 61, 64, 67, 70, 73, 76, 79, 82, 85, 88, 91, 94>, <2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44, 47, 50, 53, 56, 59, 62, 65, 68, 71, 74, 77, 80, 83, 86, 89, 92, 95>      # (00 03 06 09 0C 0F 12 15 18 1B 1E 21 24 27 2A 2D 30 33 36 39 3C 3F 42 45 48 4B 4E 51 54 57 5A 5D), (01 04 07 0A 0D 10 13 16 19 1C 1F 22 25 28 2B 2E 31 34 37 3A 3D 40 43 46 49 4C 4F 52 55 58 5B 5E), (02 05 08 0B 0E 11 14 17 1A 1D 20 23 26 29 2C 2F 32 35 38 3B 3E 41 44 47 4A 4D 50 53 56 59 5C 5F)
YGroup3Zip  :   <0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31>, <32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63>, <64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95>      # (00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F), (20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F), (40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F)

由于是在支持Avx2指令集的X86电脑上运行，故向量是256位的。

使用 YGroup3Unzip与YGroup3Zip，能很方便处理RGB像素值。一般步骤为：先用 YGroup3Unzip进行解交织，然后分别对3个通道的数据进行处理。最后使用 YGroup3Zip，将3个通道的数据，交织为已打包的RGB像素值。

提供重新构造组的向量方法

将 3-元素组与 4-元素组进行转换，虽然可以先解交织（Unzip），然后再做交织处理（Zip）。但是有些算法的性能更高，于是本库提供了重新构造组的向量方法。相关方法的说明见下。

YGroup1ToGroup3: Convert a 1-element group, to a 3-element group. It also converts grayscale pixel data to packed RGB pixel data (将1-元素组, 转为3-元素组. 它还能将灰度像素数据, 转换为已打包的RGB像素数据).
Mnemonic: View for group: (result0, result1, result2) = YGroup3Zip(x, x, x). View for element: element_ref(i, result0, result1, result2) := x[i/3].
YGroup1ToGroup4: Convert a 1-element group, to a 4-element group. It also converts grayscale pixel data to packed RGBA pixel data (将1-元素组, 转为4-元素组. 它还能将灰度像素数据, 转换为已打包的RGBA像素数据).
Mnemonic: View for group: (result0, result1, result2, result4) = YGroup4Zip(x, x, x, x). View for element: element_ref(i, result0, result1, result2, result4) := x[i/4].
YGroup1ToGroup4WithW: Convert a 1-element group and w argument, to a 4-element group. It also converts grayscale pixel data to packed RGBA pixel data (将1-元素组及w参数, 转为4-元素组. 它还能将灰度像素数据, 转换为已打包的RGBA像素数据).
Mnemonic: View for group: (result0, result1, result2, result4) = YGroup4Zip(x, x, x, w). View for element: element_ref(i, result0, result1, result2, result4) := ((i%4)<3)?( x[i2] ):( w[i2] ), i2 := i/4.
YGroup3ToGroup4: Convert a 3-element group, to a 4-element group. It also converts packed RGB pixel data to packed RGBA pixel data (将3-元素组, 转为4-元素组. 它还能将已打包的RGB像素数据, 转换为已打包的RGBA像素数据).
Mnemonic: View for group: (result0, result1, result2, result3) = YGroup4Zip(YGroup3Unzip(data0, data1, data2), Vector.Zero)). View for element: element_ref(i, result0, result1, result2, result3) := (3!=(i%4))?element_ref((i/4)*3+(i%4), data0, data1, data2):0.
YGroup4ToGroup3: Convert a 4-element group, to a 3-element group. It also converts packed RGBA pixel data to packed RGB pixel data (将4-元素组, 转为3-元素组. 它还能将已打包的RGBA像素数据, 转换为已打包的RGB像素数据).
Mnemonic: View for group: (result0, result1, result2) = YGroup3Zip(YGroup4Unzip(data0, data1, data2, data3))). View for element: element_ref(i, result0, result1, result2) := element_ref((i/3)*4+(i%3), data0, data1, data2, data3).

提供转置的向量方法

提供多向量换位的向量方法

.NET8增加了Arm架构的多寄存器的查表函数（VectorTableLookup/VectorTableLookupExtension）。
单个向量查表（如 vqvtbl1q_u8）时，只能在 16字节（128位）的范围内进行查表。而使用4个向量查表（如 vqtbl4q_u8 ）时，能在 16*4=64字节（512位）的范围内进行查表。

且.NET8.0增加了对Avx512系列指令集的支持。Avx512提供了2个向量查表的指令，例如 VPERMI2B。

本库参考了这2类指令，提供了 2~4个向量换位的方法。即下面的这些方法。

YShuffleX2[/_Args/_Core]: Shuffle and clear on 2 vectors (在2个向量上进行换位并清零). Creates a new vector by selecting values from an input vector using a set of indices (通过使用一组索引从输入向量中选择值，来创建一个新向量). If the indices value is out of range, the element will be cleared (若索引值超出范围, 元素会被清零).
Mnemonic: rt[i] := (0<=indices[i] && indices[i]<(Count*2))?( element_ref(indices[i], vector0, vector1) ):0.
YShuffleX2Insert[/_Args/_Core]: Shuffle and insert on 2 vectors (在2个向量上进行换位并插入). Creates a new vector by selecting values from an input vector using a set of indices (通过使用一组索引从输入向量中选择值，来创建一个新向量). If the index value is out of range, the elements of the background vector will be inserted (若索引值超出范围, 会插入背景向量的元素).
Mnemonic: rt[i] := (0<=indices[i] && indices[i]<(Count*2))?( element_ref(indices[i], vector0, vector1) ):back[i].
YShuffleX2Kernel[/_Args/_Core]: Only shuffle on 2 vectors (在2个向量上进行仅换位). Creates a new vector by selecting values from an input vector using a set of indices (通过使用一组索引从输入向量中选择值，来创建一个新向量). If the index value is out of range, the result is undefined (若索引值超出范围, 结果是未定义的). You can use the IndexMask mask to constrain the parameters (可使用 IndexMask 掩码来约束参数).
Mnemonic: rt[i] := element_ref(indices[i], vector0, vector1). Conditions: 0<=indices[i] && indices[i]<(Count*2).
YShuffleX3[/_Args/_Core]: Shuffle and clear on 3 vectors (在3个向量上进行换位并清零). Creates a new vector by selecting values from an input vector using a set of indices (通过使用一组索引从输入向量中选择值，来创建一个新向量). If the indices value is out of range, the element will be cleared (若索引值超出范围, 元素会被清零).
Mnemonic: rt[i] := (0<=indices[i] && indices[i]<3*Count)?( element_ref(indices[i], vector0, vector1, vector2) ):0.
YShuffleX3Insert[/_Args/_Core]: Shuffle and insert on 3 vectors (在3个向量上进行换位并插入). Creates a new vector by selecting values from an input vector using a set of indices (通过使用一组索引从输入向量中选择值，来创建一个新向量). If the index value is out of range, the elements of the background vector will be inserted (若索引值超出范围, 会插入背景向量的元素).
Mnemonic: rt[i] := (0<=indices[i] && indices[i]<3*Count)?( element_ref(indices[i], vector0, vector1, vector2) ):back[i].
YShuffleX3Kernel[/_Args/_Core]: Only shuffle on 3 vectors (在3个向量上进行仅换位). Creates a new vector by selecting values from an input vector using a set of indices (通过使用一组索引从输入向量中选择值，来创建一个新向量). If the index value is out of range, the result is undefined (若索引值超出范围, 结果是未定义的).
Mnemonic: rt[i] := element_ref(indices[i], vector0, vector1, vector2). Conditions: 0<=indices[i] && indices[i]<(Count*3).
YShuffleX4[/_Args/_Core]: Shuffle and clear on 4 vectors (在4个向量上进行换位并清零). Creates a new vector by selecting values from an input vector using a set of indices (通过使用一组索引从输入向量中选择值，来创建一个新向量). If the indices value is out of range, the element will be cleared (若索引值超出范围, 元素会被清零).
Mnemonic: rt[i] := (0<=indices[i] && indices[i]<(Count*4))?( element_ref(indices[i], vector0, vector1, vector2, vector3) ):0.
YShuffleX4Insert[/_Args/_Core]: Shuffle and insert on 4 vectors (在4个向量上进行换位并插入). Creates a new vector by selecting values from an input vector using a set of indices (通过使用一组索引从输入向量中选择值，来创建一个新向量). If the index value is out of range, the elements of the background vector will be inserted (若索引值超出范围, 会插入背景向量的元素).
Mnemonic: rt[i] := (0<=indices[i] && indices[i]<(Count*4))?( element_ref(indices[i], vector0, vector1, vector2, vector3) ):back[i].
YShuffleX4Kernel[/_Args/_Core]: Only shuffle on 4 vectors (在4个向量上进行仅换位). Creates a new vec