Several SSE intrinsics: _mm_prefetch / _mm_movehl_ps / _mm_shuffle_ps

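This article describes how to use three SSE intrinsics: the prefetch intrinsic _mm_prefetch, the data-movement intrinsic _mm_movehl_ps, and the shuffle intrinsic _mm_shuffle_ps. Concrete examples show how they can improve memory access and data processing efficiency.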


 

1 _mm_prefetch

void _mm_prefetch(char *p, int i)

The argument "*p" gives the address of the byte (and corresponding cache line) to be prefetched. The value "i" gives a constant (_MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, or _MM_HINT_NTA) that specifies the type of prefetch operation to be performed.

T0 (temporal data)--prefetch data into all levels of the cache hierarchy.

T1 (temporal data with respect to first-level cache)--prefetch data into every cache level except the first (the quoted text numbers that level 0; it is usually called L1), that is, into the second-level cache and higher.

T2 (temporal data with respect to second-level cache)--prefetch data into every cache level except the first two, that is, into the third-level cache and higher. The exact placement is implementation-specific.

NTA (non-temporal data with respect to all cache levels)--prefetch data into a non-temporal cache structure. (This hint can be used to minimize pollution of caches.)

  
In short, the intrinsic prefetches the cache line that contains address p, and the hint i (_MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, or _MM_HINT_NTA) selects which cache levels receive the data.
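If the data has already been loaded into the cache before the CPU operates on it, fewer accesses miss the cache and have to go to main memory, which speeds up the operation. This kind of explicit prefetching can be used, for example, to optimize memory copies.

Note that the CPU is entirely free in how it handles data. A prefetch instruction is only a hint that supplements the CPU's own behavior according to our expectations. If the CPU does not actually need the data we pull into the cache, the prefetch can backfire, for example by evicting useful cache lines on a multitasking system. On such systems, however, a thread or process switch takes far longer than a prefetch, so the effect of thread or process switching on prefetching can be ignored.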
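The idea of prefetching ahead of a memory copy can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name copy_with_prefetch, the 64-byte cache line size, and the prefetch distance are all assumptions chosen for the example, and a real benefit has to be confirmed by measurement on the target machine.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Assumption: 64-byte cache lines, i.e. 16 floats per line. */
#define FLOATS_PER_LINE 16
/* Hypothetical tuning value: prefetch four cache lines ahead of the copy. */
#define PREFETCH_AHEAD (4 * FLOATS_PER_LINE)

/* Copy n floats from src to dst, hinting the cache about upcoming source data. */
void copy_with_prefetch(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* One hint per cache line; the bounds check keeps the address inside src. */
        if ((i % FLOATS_PER_LINE) == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)(src + i + PREFETCH_AHEAD), _MM_HINT_T0);
        dst[i] = src[i];
    }
}
```

The prefetch never changes the result of the copy; it only affects timing, which is why the hint can be dropped or retuned without touching the rest of the loop.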


2 _mm_movehl_ps

Moves the upper two single-precision, floating-point values of  b  to the lower two single-precision, floating-point values of the result. The upper two single-precision, floating-point values of a are passed through to the result.

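That is, the high 64 bits of b are moved to the low 64 bits of the result, and the high 64 bits of a are passed through to the high 64 bits of the result.

For example (elements listed from highest to lowest):

__m128 r = _mm_movehl_ps( a, b ); // r = {a3, a2, b3, b2}

s = _mm_movehl_ps( x , x ); // s = {x3, x2, x3, x2}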
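The element order can be checked with a small sketch. The function name movehl_demo and the sample values are hypothetical; the output array is written lowest element first, so it appears reversed relative to the {a3, a2, b3, b2} notation.

```c
#include <xmmintrin.h>

/* Store _mm_movehl_ps(a, b) into out[0..3], lowest element first. */
void movehl_demo(float out[4])
{
    /* _mm_set_ps takes its arguments from the highest element to the lowest. */
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);      /* a3=4  a2=3  a1=2  a0=1  */
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);  /* b3=40 b2=30 b1=20 b0=10 */

    __m128 r = _mm_movehl_ps(a, b);                     /* r = {a3, a2, b3, b2} */

    _mm_storeu_ps(out, r);                              /* out = {b2, b3, a2, a3} */
}
```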

 


 

 


3 _mm_shuffle_ps

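Selects four specific single-precision, floating-point values from a and b, based on the mask i. Here i is an 8-bit constant made up of four 2-bit fields: bits 1:0 and 3:2 pick the elements of a that become result elements 0 and 1, and bits 5:4 and 7:6 pick the elements of b that become result elements 2 and 3.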

__m128 _mm_shuffle_ps(__m128 a , __m128 b , int i );

 

s = _mm_shuffle_ps( r , r , 1 ); // r = {r3, r2, r1, r0}, s = {r0, r0, r0, r1}
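How the four 2-bit fields of the mask map to result elements can be seen in a short sketch. The function name shuffle_imm_demo and the sample values are hypothetical; the second mask, 0x1B, is added here only as a further illustration and reverses the vector.

```c
#include <xmmintrin.h>

/* Apply two raw 8-bit masks to the same vector; outputs are lowest element first. */
void shuffle_imm_demo(float s_out[4], float rev_out[4])
{
    __m128 r = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);  /* r3=3 r2=2 r1=1 r0=0 */

    /* Mask 1 = binary 00 00 00 01: element 0 takes r1, elements 1..3 take r0. */
    __m128 s = _mm_shuffle_ps(r, r, 1);

    /* Mask 0x1B = binary 00 01 10 11: elements 0..3 take r3, r2, r1, r0. */
    __m128 rev = _mm_shuffle_ps(r, r, 0x1B);

    _mm_storeu_ps(s_out, s);
    _mm_storeu_ps(rev_out, rev);
}
```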

 

It can arrange the elements of the two operands in a chosen order and assign them to the destination. For example,

__m128 b = _mm_shuffle_ps ( a , a , 0 );

  

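Every element of b is the element of a with index 0. The third argument controls which elements are picked: it is an 8-bit constant in which each pair of bits selects one element. When using the intrinsic, it is best to build this constant with the _MM_SHUFFLE macro, which spells out the selection.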

Shuffle Function Macro

_MM_SHUFFLE(z, y, x, w)

/* expands to the following value */

(z<<6) | (y<<4) | (x<<2) | w

 

m3 = _mm_shuffle_ps(m1, m2, _MM_SHUFFLE(1, 0, 3, 2))

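It is a simple selection operation on the operands m1 and m2.

With the mask _MM_SHUFFLE(z, y, x, w), the two low elements of the result are elements w and x of m1, and the two high elements are elements y and z of m2. For the call above, m3 = {m2[1], m2[0], m1[3], m1[2]}, listed from highest to lowest.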

 

One small, rather formal remark:
_MM_SHUFFLE(z, y, x, w) does not select anything by itself; the macro only builds a mask. The SHUFPS instruction (or its _mm_shuffle_ps wrapper) performs the selection, using the mask built by _MM_SHUFFLE.
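The distinction between building the mask and performing the selection can be shown with a short sketch. The function names and sample values are hypothetical; only _MM_SHUFFLE, _mm_shuffle_ps, _mm_set_ps and _mm_storeu_ps come from the SSE headers.

```c
#include <xmmintrin.h>

/* The macro only computes an integer: (1<<6) | (0<<4) | (3<<2) | 2 = 0x4E. */
int shuffle_mask_value(void)
{
    return _MM_SHUFFLE(1, 0, 3, 2);
}

/* The selection itself is done by _mm_shuffle_ps; out is lowest element first. */
void shuffle_macro_demo(float out[4])
{
    __m128 m1 = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);      /* m1[0..3] = 1, 2, 3, 4     */
    __m128 m2 = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);  /* m2[0..3] = 10, 20, 30, 40 */

    __m128 m3 = _mm_shuffle_ps(m1, m2, _MM_SHUFFLE(1, 0, 3, 2));

    _mm_storeu_ps(out, m3);  /* out = {m1[2], m1[3], m2[0], m2[1]} */
}
```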

 

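Example:

Suppose we define a union

typedef union {  __m128 m;  float m128_f32[4];  } my_m128;

and build a vector with

 __m128 m1 = _mm_set_ps( 1.0f, 2.0f, 3.0f, 4.0f );

_mm_set_ps takes its arguments from the highest element down to the lowest, so viewing m1 through the union gives m128_f32[0] = 4, m128_f32[1] = 3, m128_f32[2] = 2, m128_f32[3] = 1. (A brace initializer { 1.0f, 2.0f, 3.0f, 4.0f } fills the elements in memory order instead, giving m128_f32[0] = 1 through m128_f32[3] = 4.)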
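The ordering seen through such a union can be verified with a small sketch. The function name union_view_demo is hypothetical; _mm_setr_ps is included only for contrast, since it takes its arguments in the opposite order from _mm_set_ps.

```c
#include <xmmintrin.h>

typedef union { __m128 m; float m128_f32[4]; } my_m128;

/* Copy the four floats of two differently built vectors, index 0 first. */
void union_view_demo(float set_out[4], float setr_out[4])
{
    my_m128 u_set, u_setr;

    u_set.m  = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);   /* arguments: element 3 down to 0 */
    u_setr.m = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);  /* arguments: element 0 up to 3   */

    for (int i = 0; i < 4; ++i) {
        set_out[i]  = u_set.m128_f32[i];   /* 4, 3, 2, 1 */
        setr_out[i] = u_setr.m128_f32[i];  /* 1, 2, 3, 4 */
    }
}
```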

 


 

 

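Appendix:

Let us review how the cross product is computed.

c = a x b

can be written as:

c = ( a.y*b.z - a.z*b.y,  a.z*b.x - a.x*b.z,  a.x*b.y - a.y*b.x )

This can be expressed with SSE intrinsics by shuffling the operands, multiplying, and subtracting.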
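A minimal sketch of one common formulation follows; it is not necessarily the code the original post showed. It assumes each vector is stored as (x, y, z, w) in elements 0 to 3, and the function name cross3 is hypothetical.

```c
#include <xmmintrin.h>

/* Cross product of the x, y, z parts of a and b; the w element of the result is 0. */
__m128 cross3(__m128 a, __m128 b)
{
    /* Rotate the components: (y, z, x, w) and (z, x, y, w). */
    __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 a_zxy = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 1, 0, 2));
    __m128 b_zxy = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 1, 0, 2));

    /* (ay*bz - az*by, az*bx - ax*bz, ax*by - ay*bx, aw*bw - aw*bw) */
    return _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy), _mm_mul_ps(a_zxy, b_yzx));
}
```

For instance, cross3 applied to (1, 2, 3) and (4, 5, 6) gives (-3, 6, -3).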

 

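The dot product of two three-component vectors, a . b = a.x*b.x + a.y*b.y + a.z*b.z, can be computed with one multiplication followed by additions of the shuffled products.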
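One way to write this is sketched below, again assuming the (x, y, z, w) layout in elements 0 to 3; the name dot3 is hypothetical. The w elements are multiplied but never added, so their contents do not affect the result.

```c
#include <xmmintrin.h>

/* Dot product of the x, y, z parts of a and b. */
float dot3(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);                               /* (ax*bx, ay*by, az*bz, aw*bw) */
    __m128 y = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1));  /* broadcast ay*by */
    __m128 z = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 2, 2));  /* broadcast az*bz */

    /* Scalar adds touch only element 0: ax*bx + ay*by + az*bz. */
    __m128 s = _mm_add_ss(_mm_add_ss(p, y), z);

    return _mm_cvtss_f32(s);  /* extract element 0 */
}
```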

 

These two examples point to a general pattern for adding together the elements within a single vector (a horizontal sum).
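A sketch of such a horizontal sum over all four elements, built from the two intrinsics discussed earlier (_mm_movehl_ps and _mm_shuffle_ps), might look like this. The name hsum4 is hypothetical, and other instruction sequences are equally valid.

```c
#include <xmmintrin.h>

/* Return v0 + v1 + v2 + v3. */
float hsum4(__m128 v)
{
    __m128 hi  = _mm_movehl_ps(v, v);          /* elements 0..1 = (v2, v3)            */
    __m128 sum = _mm_add_ps(v, hi);            /* elements 0..1 = (v0+v2, v1+v3)      */
    __m128 odd = _mm_shuffle_ps(sum, sum, 1);  /* element 0 = v1+v3                   */

    return _mm_cvtss_f32(_mm_add_ss(sum, odd)); /* (v0+v2) + (v1+v3) */
}
```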

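Extending this gives a function for the length of a vector; the first step is a function that computes the sum of the squared components.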
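The two steps can be sketched as follows for a three-component vector stored as (x, y, z, w). The names length_sq3 and length3 are hypothetical, and the square root uses the scalar SSE intrinsic _mm_sqrt_ss; this is one possible arrangement rather than the original post's code.

```c
#include <xmmintrin.h>

/* Sum of squared components: x*x + y*y + z*z (w is ignored). */
float length_sq3(__m128 v)
{
    __m128 sq = _mm_mul_ps(v, v);                                /* (x*x, y*y, z*z, w*w) */
    __m128 y  = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(1, 1, 1, 1)); /* broadcast y*y */
    __m128 z  = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 2, 2, 2)); /* broadcast z*z */

    return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(sq, y), z));
}

/* Length of the vector: square root of the sum of squares. */
float length3(__m128 v)
{
    __m128 s = _mm_set_ss(length_sq3(v));   /* place the sum in element 0 */
    return _mm_cvtss_f32(_mm_sqrt_ss(s));   /* scalar square root of element 0 */
}
```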

 

 

References: MSDN

http://hi.baidu.com/sige_online/blog/item/a80522ceec812433b700c829.html

http://www.codeguru.com/forum/archive/index.php/t-337156.html

http://blog.youkuaiyun.com/igame/archive/2007/08/21/1752430.aspx

 
