oneAPI GPU 优化指南 - 预取

C与Fortran语言OpenMP中的预取技术

最新推荐文章于 2025-07-05 21:05:15 发布

翻译最新推荐文章于 2025-07-05 21:05:15 发布 · 210 阅读

文章标签：

#oneapi

本章节翻译by chenchensmail@163.com 原文：Clauses: is_device_ptr, use_device_ptr, has_device_addr,... (intel.com)

用户引导的数据预取是一种隐藏由于低级缓存未命中和主内存访问产生的延迟的有用技术。 OpenMP 针对 Intel® GPU 的部署现在使用预取编译指示启用此功能，语法如下：

C OpenMP 预取:

#pragma ompx prefetch data([prefetch-hint-modifier:],arrsect, [,arrsect] ) [ if (condition) ]

Fortran OpenMP 预取:

!$omp prefetch data( [prefetch-hint-modifier:] arrsect [, arrsect] ) [if (condition)]

上述预取编译指示是 Intel® 扩展，并适用于 Intel® 数据中心 Max 系列 GPU 及以后的产品。编译指示的主要方面包括：

Prefetch-hint: 使用可选的 prefetch-hint-modifier 指定预取数据的目标。

有效值为 0 (No-op)， 2 (仅预取到 L2) 和 4(预取到 L1 和 L2)。如果未指定值，默认值为 0。
Use of array section: 使用 OpenMP 语法 [lower-bound : length] 指定连续数组部分 arrsect。例如，在 C 中使用 a[2:4]，在 Fortran 中使用 a(2:4)。

如果只需要预取一个元素，在 C 中我们可以使用 a[2:1] 或 a[2]。Fortran 需要明确指定长度，即使是单个元素的预取，如 a(2:1)。
Default prefetch size: 即使请求预取单个数组元素，硬件也会预取包含该元素的整个缓存行。在 Intel® 数据中心 Max系列 GPU 中，缓存行的大小为 64 字节。
Faulting: 在 Intel® 数据中心 Max系列 GPU 中，预取指令如果是错误的，这意味着访问无效地址可能会导致分段错误。可以使用 pragma 中的可选 if 条件来防止越界访问。
Non-blocking: 预取编译指示不会阻塞，它不用等待预取完成。

C 语言 OpenMP 中的预取

以下示例显示了一个简化的一维 N-body 力学 kernel 。外部 for 循环遍历计算力量的粒子。内部循环以 TILE_SIZE 粒子的 batches 遍历交互粒子。我们可以在计算当前粒子图块期间预取下一个粒子图块。如上所述，预取总是带入 64 字节的数据。因此，我们只需要每 16 个单精度浮点值预取一次，这可以通过使用 if ( (next_tile % 16) == 0 ) 来实现。使用这种掩码条件可能并不总是有帮助，请参阅下面代码片段后面的附加注释。使用的预取提示是 4(预取到 L1 和 L2 缓存)。下面只显示了部署了的 kernel。

 1#define WORKGROUP_SIZE 1024
 2#define PREFETCH_HINT 4 // 4 = prefetch to L1 and L3;  2 = prefetch to L3
 3#define TILE_SIZE 64
 4
 5void nbody_1d_gpu(float *c, float *a, float *b, int n1, int n2) {
 6#pragma omp target teams distribute parallel for thread_limit(WORKGROUP_SIZE)
 7  for (int i = 0; i < n1; i++) {
 8    const float ma0 = 0.269327f, ma1 = -0.0750978f, ma2 = 0.0114808f;
 9    const float ma3 = -0.00109313f, ma4 = 0.0000605491f, ma5 = -0.00000147177f;
10    const float eps = 0.01f;
11
12    float dx = 0.0;
13    float bb[TILE_SIZE];
14    for (int j = 0; j < n2; j += TILE_SIZE) {
15      // load tile from b
16      for (int u = 0; u < TILE_SIZE; ++u) {
17        bb[u] = b[j + u];
18#ifdef PREFETCH
19        int next_tile = j + TILE_SIZE + u;
20        if ((next_tile % 16) == 0) {
21#pragma ompx prefetch data(PREFETCH_HINT : b[next_tile]) if (next_tile < n2)
22        }
23#endif
24      }
25#pragma unroll(TILE_SIZE)
26      for (int u = 0; u < TILE_SIZE; ++u) {
27        float delta = bb[u] - a[i];
28        float r2 = delta * delta;
29        float s0 = r2 + eps;
30        float s1 = 1.0f / sqrtf(s0);
31        float f =
32            (s1 * s1 * s1) -
33            (ma0 + r2 * (ma1 + r2 * (ma2 + r2 * (ma3 + r2 * (ma4 + ma5)))));
34        dx += f * delta;
35      }
36    }
37    c[i] = dx * 0.23f;
38  }
39}

条件 if ( (next_tile % 16) == 0 ) 可以在数组索引未向量化时节省预取开销。在上面的示例中，只有索引 i 被向量化，所以当我们预取用 j 索引的 b[] 时，有助于每 16 个元素只进行一次预取。另一方面，如果我们要在索引 i 上预取数组，则预取被向量化，因此掩码条件可能无法提供任何好处。用户将需要通过实验确定他们应用程序最佳方法。

编译命令:

不带预取:

icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium nbody_c.cpp -o test_c

带预取:

icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium -DPREFETCH nbody_c.cpp -o test_c

运行命令:

LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-cl-strict-aliasing -cl-fast-relaxed-math" ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_PLUGIN_PROFILE=T,usec IGC_ForceOCLSIMDWidth=16 ./test_c

默认的 SIMD 宽度由后端设备编译器（Intel® 图形编译器或 IGC ）在 Intel® 数据中心 Max系列 GPU 中自动选择为 16 或 32，编译器启发式方法考虑到了 kernel 中的寄存器压力等因素。可以使用 IGC 环境变量 IGC_ForceOCLSIMDWidth=16 来请求 IGC 编译器强制 SIMD 宽度为 16。对于上述 kernel ， SIMD16 提供了更好的性能。在运行命令中，我们还启用了 OpenMP 的内置分析器，使用 LIBOMPTARGET_PLUGIN_PROFILE=T,usec 。没有预取的运行输出如下。

得到的输出 = 222700231.430
预期输出 = 222700339.016

总时间 =    205.4 milliseconds
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) Graphics [0x0bd6], Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_46_3c0d785c__Z12nbody_1d_gpuPfS_S_ii_l15
Kernel 1                  : __omp_offloading_46_3c0d785c__Z15clean_cache_gpuPdi_l69
Kernel 2                  : __omp_offloading_46_3c0d785c__Z4main_l98
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (usec)                        Device Time (usec)
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :  598283.05 598283.05 598283.05 598283.05      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :    9578.23    798.19      0.00   8728.03      0.00      0.00      0.00      0.00     12.00
DataRead (Device to Host) :      77.01     77.01     77.01     77.01      5.68      5.68      5.68      5.68      1.00
DataWrite (Host to Device):     713.11    356.55    179.05    534.06     15.76      7.88      5.04     10.72      2.00
Kernel 0                  :  205292.22   2052.92   2033.95   2089.98 203572.32   2035.72   1984.96   2073.12    100.00
Kernel 1                  :  109194.28   1091.94   1076.94   1681.09 107051.52   1070.52   1062.40   1107.04    100.00
Kernel 2                  :    1746.89   1746.89   1746.89   1746.89      3.84      3.84      3.84      3.84      1.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :    2647.88   2647.88   2647.88   2647.88      0.00      0.00      0.00      0.00      1.00
======================================================================================================================

从上面的输出中，GPU kernel 执行 (Kernel 0) 的平均设备时间为 2036 微秒。如果我们运行启用预取的二进制文件，观察到的平均 kernel 执行的设备时间为 1841 微秒，如下所示：

得到的输出 = 222700231.430
预期输出 = 222700339.016

总时间 =    185.9 milliseconds
======================================================================================================================
LIBOMPTARGET_PLUGIN_PROFILE(LEVEL0) for OMP DEVICE(0) Intel(R) Graphics [0x0bd6], Thread 0
----------------------------------------------------------------------------------------------------------------------
Kernel 0                  : __omp_offloading_43_3c0d785c__Z12nbody_1d_gpuPfS_S_ii_l15
Kernel 1                  : __omp_offloading_43_3c0d785c__Z15clean_cache_gpuPdi_l69
Kernel 2                  : __omp_offloading_43_3c0d785c__Z4main_l98
----------------------------------------------------------------------------------------------------------------------
                          : Host Time (usec)                        Device Time (usec)
Name                      :      Total   Average       Min       Max     Total   Average       Min       Max     Count
----------------------------------------------------------------------------------------------------------------------
Compiling                 :  499351.98 499351.98 499351.98 499351.98      0.00      0.00      0.00      0.00      1.00
DataAlloc                 :    9609.94    800.83      0.00   8740.19      0.00      0.00      0.00      0.00     12.00
DataRead (Device to Host) :      77.01     77.01     77.01     77.01      4.96      4.96      4.96      4.96      1.00
DataWrite (Host to Device):     722.17    361.08    185.01    537.16     16.40      8.20      5.44     10.96      2.00
Kernel 0                  :  185793.88   1857.94   1839.88   1919.03 184075.20   1840.75   1824.00   1874.56    100.00
Kernel 1                  :  109442.95   1094.43   1076.94   1590.01 107334.56   1073.35   1062.40   1115.68    100.00
Kernel 2                  :    1821.99   1821.99   1821.99   1821.99      3.84      3.84      3.84      3.84      1.00
Linking                   :       0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      1.00
OffloadEntriesInit        :    2493.14   2493.14   2493.14   2493.14      0.00      0.00      0.00      0.00      1.00
======================================================================================================================

请注意，实际性能取决于使用的硬件和软件堆栈，因此用户可能会看到不同的性能数字。

Fortran 语言 OpenMP 中的预取

下面以 Fortran 展示了相同的 nbody1d kernel。预取编译指示插入在与之前相同的位置，预取提示值为 4，并且再次于每 16 个元素中只预取一个。

 1#define WORKGROUP_SIZE 1024
 2#define PREFETCH_HINT 4     ! 4 = prefetch to L1 and L3;  2 = prefetch to L3
 3#define TILE_SIZE 64
 4
 5      subroutine nbody_1d_gpu(c, a, b, n1, n2)
 6      implicit none
 7      integer n1, n2
 8      real a(0:n1-1), b(0:n2-1), c(0:n1-1)
 9      real dx, bb(0:TILE_SIZE-1), delta, r2, s0, s1, f
10      integer i,j,u,next
11      real ma0, ma1, ma2, ma3, ma4, ma5, eps
12      parameter (ma0=0.269327, ma1=-0.0750978, ma2=0.0114808)
13      parameter (ma3=-0.00109313, ma4=0.0000605491, ma5=-0.00000147177)
14      parameter (eps=0.01)
15
16!$omp target teams distribute parallel do thread_limit(WORKGROUP_SIZE)
17!$omp& private(i,dx,j,u,bb,next,delta,r2,s0,s1,f)
18      do i = 0, n1-1
19        dx = 0.0
20        do j = 0, n2-1, TILE_SIZE
21          ! load tile from b
22          do u = 0, TILE_SIZE-1
23            bb(u) = b(j+u)
24#ifdef PREFETCH
25            next = j + TILE_SIZE + u
26            if (mod(next,16).eq.0) then
27!$omp prefetch data(PREFETCH_HINT:b(next:next))if(next<n2)
28            endif
29#endif
30          enddo
31          ! compute
32          !DIR$ unroll(TILE_SIZE)
33          do u = 0, TILE_SIZE-1
34            delta = bb(u) - a(i)
35            r2 = delta*delta
36            s0 = r2 + eps
37            s1 = 1.0 / sqrt(s0)
38            f = (s1*s1*s1)-(ma0+r2*(ma1+r2*(ma2+r2*(ma3+r2*(ma4+ma5)))))
39            dx = dx + f*delta
40          enddo
41        enddo
42        c(i) = dx*0.23
43      enddo
44      end subroutine

编译命令:

不带预取:

ifx -O3 -g -fiopenmp -fopenmp-targets=spir64 -fpconstant -fpp -ffast-math -fno-sycl-instrument-device-code -mcmodel=medium nbody_f.f -o test_f

带预取:

ifx -O3 -g -fiopenmp -fopenmp-targets=spir64 -fpconstant -fpp -ffast-math -fno-sycl-instrument-device-code -mcmodel=medium -DPREFETCH nbody_f.f -o test_f

运行命令:

LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-cl-strict-aliasing -cl-fast-relaxed-math" ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_PLUGIN_PROFILE=T,usec IGC_ForceOCLSIMDWidth=16 ./test_f

这里没有显示输出，因为它看起来像 C 示例的输出。不带预取和带预取的平均 kernel 执行时间分别为2017微秒和1823微秒。再次，请注意，用户可能会看到不同的性能数字，这取决于实际使用的硬件和软件堆栈。

C 语言 OpenMP SIMD 中的预取

OpenMP 部署还支持一种 SIMD 编程模型，其中所有计算都包含 Intel® 数据中心 Max 系列 GPU 中指定的 16 或 32 个 SIMD 通道的 EU 线程。相应地，即使在 OpenMP 中的 thread_limit() 子句也具有修改后的含义，现在也指定了每个 work-group 的 EU 线程数。下面列出了 nbody1d Kernel 的 OpenMP SIMD 版本。我们需要明确指定 SIMD 宽度，即 VECLEN=16。在撰写本文时，建议在 simd 子句的范围之外使用预取编译指示，这意味着只有一个 SIMD 通道将发出预取指令。在这个例子中，16 个通道中的 1 个将执行预取，这正是我们需要的——所以我们不再需要上面之前几个示例中使用过的 if ( (next_tile % 16) == 0 )。

 1#define WORKGROUP_SIZE 1024
 2#define PREFETCH_HINT 4 // 4 = prefetch to L1 and L3;  2 = prefetch to L3
 3#define TILE_SIZE 64
 4
 5void nbody_1d_gpu(float *c, float *a, float *b, int n1, int n2) {
 6#pragma omp target teams distribute parallel for thread_limit(WORKGROUP_SIZE / \
 7                                                              VECLEN)
 8  for (int i = 0; i < n1; i += VECLEN) {
 9    const float ma0 = 0.269327f, ma1 = -0.0750978f, ma2 = 0.0114808f;
10    const float ma3 = -0.00109313f, ma4 = 0.0000605491f, ma5 = -0.00000147177f;
11    const float eps = 0.01f;
12
13    float dx[VECLEN];
14    float aa[VECLEN], bb[TILE_SIZE];
15#pragma omp simd simdlen(VECLEN)
16#pragma unroll(0)
17    for (int v = 0; v < VECLEN; ++v) {
18      dx[v] = 0.0f;
19      aa[v] = a[i + v];
20    }
21    for (int j = 0; j < n2; j += TILE_SIZE) {
22      // load tile from b
23      for (int u = 0; u < TILE_SIZE; u += VECLEN) {
24#pragma omp simd simdlen(VECLEN)
25#pragma unroll(0)
26        for (int v = 0; v < VECLEN; ++v)
27          bb[u + v] = b[j + u + v];
28#ifdef PREFETCH
29        int next_tile = j + TILE_SIZE + u;
30#pragma ompx prefetch data(PREFETCH_HINT : b[next_tile]) if (next_tile < n2)
31#endif
32      }
33// compute current tile
34#pragma omp simd simdlen(VECLEN)
35#pragma unroll(0)
36      for (int v = 0; v < VECLEN; ++v) {
37#pragma unroll(TILE_SIZE)
38        for (int u = 0; u < TILE_SIZE; ++u) {
39          float delta = bb[u] - aa[v];
40          float r2 = delta * delta;
41          float s0 = r2 + eps;
42          float s1 = 1.0f / sqrtf(s0);
43          float f =
44              (s1 * s1 * s1) -
45              (ma0 + r2 * (ma1 + r2 * (ma2 + r2 * (ma3 + r2 * (ma4 + ma5)))));
46          dx[v] += f * delta;
47        }
48      }
49    }
50#pragma omp simd simdlen(VECLEN)
51#pragma unroll(0)
52    for (int v = 0; v < VECLEN; ++v) {
53      c[i + v] = dx[v] * 0.23f;
54    }
55  }
56}

编译命令:

我们需要使用一个额外的编译开关 -fopenmp-target-simd 来启用 SIMD 编程模型。因此，编译命令如下：

不带预取:

icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium -fopenmp-target-simd nbody_c_simd.cpp -o test_c_simd

带预取:

icpx -O3 -g -fiopenmp -fopenmp-targets=spir64 -mcmodel=medium -DPREFETCH -fopenmp-target-simd nbody_c_simd.cpp -o test_c_simd

运行命令:

LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-cl-strict-aliasing -cl-fast-relaxed-math" ZE_AFFINITY_MASK=0.0 LIBOMPTARGET_PLUGIN_PROFILE=T,usec ./test_c_simd

注意，我们不再需要环境变量 IGC_ForceOCLSIMDWidth=16，因为mSIMD 宽度已在 OpenMP 代码中明确指定。

输出看起来像前面的示例，所以没有显示。不带预取和带预取的平均 kernel 执行时间分别为 2008 微秒和 1810 微秒。如前所述，用户可能会看到不同的性能数字，这取决于实际使用的硬件和软件堆栈。