OpenCL™ C 6.15.11. Async Copies From Global to Local Memory, Local to Global Memory, and Prefetch

License: CC 4.0 BY-SA

Original source: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#async-copies


6.15.11. Async Copies From Global to Local Memory, Local to Global Memory, and Prefetch

The OpenCL C programming language implements the following functions that provide asynchronous copies between global and local memory and a prefetch from global memory.

The async copy and wait group events functions are performed by all work-items in a work-group and therefore must be encountered by all work-items in a work-group executing the kernel with the same argument values, otherwise the results are undefined. This rule applies to ND-ranges implemented with uniform and non-uniform work-groups.

If an async copy or wait group events function is inside a conditional statement then all work-items in the work-group must enter the conditional if any work-item in the work-group enters the conditional statement and executes the async copy or wait group events function.

If an async copy or wait group events function is inside a loop then all work-items in the work-group must execute the async copy or wait group events function on each iteration of the loop if any work-item executes the async copy or wait group events function on that iteration.

The generic type name gentype indicates that the function can take any of the following as the type for the arguments unless otherwise stated:

char, charn, uchar, or ucharn
short, shortn, ushort, or ushortn
int, intn, uint, or uintn
long [61], longn, ulong, or ulongn
float, floatn
double [62] or doublen
half [63] or halfn

n is 2, 3 [64], 4, 8, or 16.

All functions taking or returning half types are supported only when the cl_khr_fp16 extension macro is supported.

Table 24. Built-in Async Copy and Prefetch Functions

Function	Description
event_t async_work_group_copy(__local gentype *dst, const __global gentype *src, size_t num_gentypes, event_t event)
event_t async_work_group_copy(__global gentype *dst, const __local gentype *src, size_t num_gentypes, event_t event)

Perform an async copy of num_gentypes gentype elements from src to dst.

Returns an event object that can be used by wait_group_events to wait for the async copy to finish. The event argument can also be used to associate the async_work_group_copy with a previous async copy, allowing an event to be shared by multiple async copies; otherwise event should be zero. 0 can be implicitly and explicitly cast to `event_t` type.

If the event argument is non-zero, the event object supplied in the event argument will be returned.

This function does not perform any implicit synchronization of source data such as using a barrier before performing the copy.

event_t async_work_group_strided_copy(__local gentype *dst, const __global gentype *src, size_t num_gentypes, size_t src_stride, event_t event)
event_t async_work_group_strided_copy(__global gentype *dst, const __local gentype *src, size_t num_gentypes, size_t dst_stride, event_t event)

Perform an async gather of num_gentypes `gentype` elements from src to dst. The src_stride is the stride in elements for each `gentype` element read from src. The dst_stride is the stride in elements for each `gentype` element written to dst.

Returns an event object that can be used by wait_group_events to wait for the async copy to finish. The event argument can also be used to associate the async_work_group_strided_copy with a previous async copy, allowing an event to be shared by multiple async copies; otherwise event should be zero. 0 can be implicitly and explicitly cast to event_t type.

If the event argument is non-zero, the event object supplied in the event argument will be returned.

This function does not perform any implicit synchronization of source data such as using a barrier before performing the copy.

The behavior of async_work_group_strided_copy is undefined if src_stride or dst_stride is 0, or if the src_stride or dst_stride values cause the src or dst pointers to exceed the upper bounds of the address space during the copy.

Requires support for OpenCL C 1.1 or newer.

void wait_group_events(int num_events, event_t *event_list)

Wait for events that identify the async_work_group_copy operations to complete. The event objects specified in event_list will be released after the wait is performed.

void prefetch(const __global gentype *p, size_t num_gentypes)

Prefetch `num_gentypes * sizeof(gentype)` bytes into the global cache. The prefetch instruction is applied to a work-item in a work-group and does not affect the functional behavior of the kernel.

`void async_work_group_copy_fence(cl_mem_fence_flags flags)`

Orders async copies produced by the work-items of a work-group executing a kernel. Async copies preceding the async_work_group_copy_fence must complete their access to the designated memory or memories, including both reads-from and writes-to it, before async copies following the fence are allowed to start accessing these memories. In other words, every async copy preceding the async_work_group_copy_fence must happen-before every async copy following the fence, with respect to the designated memory or memories.

The flags argument specifies the memory address space and can be set to a combination of the following literal values:

`CLK_LOCAL_MEM_FENCE`
`CLK_GLOBAL_MEM_FENCE`

The async fence is performed by all work-items in a work-group and this built-in function must therefore be encountered by all work-items in a work-group executing the kernel with the same argument values; otherwise the results are undefined. This rule applies to ND-ranges implemented with uniform and non-uniform work-groups.

Requires support for the `cl_khr_async_work_group_copy_fence` extension macro.
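To make the stride semantics of async_work_group_strided_copy in Table 24 concrete, here is a minimal host-side sketch in plain C. It is only a model of the documented indexing, not the device implementation: the helper names model_strided_gather and model_strided_scatter are hypothetical, the element type is fixed to float for brevity, and the loops run sequentially, whereas the real built-in is asynchronous and performed by the whole work-group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the global -> local form: element i of dst is
   read from src[i * src_stride] (stride counted in elements). */
static void model_strided_gather(float *dst, const float *src,
                                 size_t num_elems, size_t src_stride)
{
    assert(src_stride != 0); /* a stride of 0 is undefined behavior */
    for (size_t i = 0; i < num_elems; ++i)
        dst[i] = src[i * src_stride];
}

/* Hypothetical model of the local -> global form: element i of src is
   written to dst[i * dst_stride] (stride counted in elements). */
static void model_strided_scatter(float *dst, const float *src,
                                  size_t num_elems, size_t dst_stride)
{
    assert(dst_stride != 0); /* a stride of 0 is undefined behavior */
    for (size_t i = 0; i < num_elems; ++i)
        dst[i * dst_stride] = src[i];
}
```

With a stride of 2, for instance, the gather model reads source elements 0, 2, 4, ... into consecutive destination slots, and the scatter model does the reverse.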

The kernel must wait for the completion of all async copies using the wait_group_events built-in function before exiting; otherwise the behavior is undefined.

6.15.11.1. Extended Async Copy Functions

If the cl_khr_extended_async_copies extension macro is supported, additional Built-in Extended Async Copy Functions are provided which interpret the source and destination as 2D or 3D data.

async_work_group_strided_copy is a special case of async_work_group_copy_2D2D, namely one which copies a single column to a single line or vice versa. For example:
async_work_group_strided_copy(dst, src, num_gentypes, src_stride, event) is equal to async_work_group_copy_2D2D(dst, 0, src, 0, sizeof(gentype), 1, num_gentypes, src_stride, 1, event)
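The equivalence stated above can be checked with a small host-side model in plain C. This is a sketch under assumptions, not the device implementation: model_copy_2d2d and model_strided_gather_int are hypothetical helper names, the copies run sequentially, and the event argument is omitted because nothing here is asynchronous.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the 2D2D addressing: offsets and total line
   lengths are given in elements, pointer arithmetic is done in bytes. */
static void model_copy_2d2d(void *dst, size_t dst_offset,
                            const void *src, size_t src_offset,
                            size_t num_bytes_per_element,
                            size_t num_elements_per_line, size_t num_lines,
                            size_t src_total_line_length,
                            size_t dst_total_line_length)
{
    /* Overlapping lines are undefined behavior in the real built-in. */
    assert(src_total_line_length >= num_elements_per_line);
    assert(dst_total_line_length >= num_elements_per_line);

    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    for (size_t line = 0; line < num_lines; ++line) {
        size_t s_elem = src_offset + line * src_total_line_length;
        size_t d_elem = dst_offset + line * dst_total_line_length;
        memcpy(d + d_elem * num_bytes_per_element,
               s + s_elem * num_bytes_per_element,
               num_elements_per_line * num_bytes_per_element);
    }
}

/* Hypothetical model of the strided gather, fixed to int elements. */
static void model_strided_gather_int(int *dst, const int *src,
                                     size_t num_elems, size_t src_stride)
{
    assert(src_stride != 0);
    for (size_t i = 0; i < num_elems; ++i)
        dst[i] = src[i * src_stride];
}
```

Calling model_copy_2d2d with one element per line, num_gentypes lines, a source line length of src_stride and a destination line length of 1 should produce the same destination contents as the strided gather model.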

The functions described in this section support arbitrary gentype-based buffers by casting pointers to void*.

These functions do not perform any implicit synchronization of source data such as using a barrier before performing the copy.

These functions are performed by all work-items in a work-group and must therefore be encountered by all work-items in a work-group executing the kernel with the same argument values; otherwise the results are undefined.

The src_offset, dst_offset, src_total_line_length, dst_total_line_length, src_total_plane_area and dst_total_plane_area function arguments are expressed in elements.

Both src_total_line_length and dst_total_line_length describe the number of elements between the beginning of the current line and the beginning of the next line.

Both src_total_plane_area and dst_total_plane_area describe the number of elements between the beginning of the current plane and the beginning of the next plane.

These functions return an event object that can be used by wait_group_events to wait for the async copy to finish. The event argument can also be used to associate the async copy with a previous async copy allowing an event to be shared by multiple async copies; otherwise event should be zero. If the event argument is non-zero, the event object supplied as the event argument will be returned.

Table 25. Built-in Extended Async Copy Functions

Function	Description
event_t async_work_group_copy_2D2D(__local void *dst, size_t dst_offset, const __global void *src, size_t src_offset, size_t num_bytes_per_element, size_t num_elements_per_line, size_t num_lines, size_t src_total_line_length, size_t dst_total_line_length, event_t event)
event_t async_work_group_copy_2D2D(__global void *dst, size_t dst_offset, const __local void *src, size_t src_offset, size_t num_bytes_per_element, size_t num_elements_per_line, size_t num_lines, size_t src_total_line_length, size_t dst_total_line_length, event_t event)

Perform an async copy of (num_elements_per_line * num_lines) elements of size num_bytes_per_element from (src + (src_offset * num_bytes_per_element)) to (dst + (dst_offset * num_bytes_per_element)). All pointer arithmetic is performed with implicit casting to `char*` by the implementation.

Each line contains num_elements_per_line elements of size num_bytes_per_element. After each line of transfer, the src address is incremented by src_total_line_length elements (i.e. src_total_line_length * num_bytes_per_element bytes), and the dst address is incremented by dst_total_line_length elements (i.e. dst_total_line_length * num_bytes_per_element bytes), for the next line of transfer.

The behavior of async_work_group_copy_2D2D is undefined if the source or destination addresses exceed the upper bounds of the address space during the copy.

The behavior of async_work_group_copy_2D2D is also undefined if the src_total_line_length or dst_total_line_length values are smaller than num_elements_per_line, i.e. overlapping of lines is undefined.
event_t async_work_group_copy_3D3D(__local void *dst, size_t dst_offset, const __global void *src, size_t src_offset, size_t num_bytes_per_element, size_t num_elements_per_line, size_t num_lines, size_t num_planes, size_t src_total_line_length, size_t src_total_plane_area, size_t dst_total_line_length, size_t dst_total_plane_area, event_t event)
event_t async_work_group_copy_3D3D(__global void *dst, size_t dst_offset, const __local void *src, size_t src_offset, size_t num_bytes_per_element, size_t num_elements_per_line, size_t num_lines, size_t num_planes, size_t src_total_line_length, size_t src_total_plane_area, size_t dst_total_line_length, size_t dst_total_plane_area, event_t event)

Perform an async copy of ((num_elements_per_line * num_lines) * num_planes) elements of size num_bytes_per_element from (src + (src_offset * num_bytes_per_element)) to (dst + (dst_offset * num_bytes_per_element)), arranged in num_planes planes. All pointer arithmetic is performed with implicit casting to `char*` by the implementation.

Each plane contains num_lines lines. Each line contains num_elements_per_line elements. After each line of transfer, the src address is incremented by src_total_line_length elements (i.e. src_total_line_length * num_bytes_per_element bytes), and the dst address is incremented by dst_total_line_length elements (i.e. dst_total_line_length * num_bytes_per_element bytes), for the next line of transfer.

The behavior of async_work_group_copy_3D3D is undefined if the source or destination addresses exceed the upper bounds of the address space during the copy.

The behavior of async_work_group_copy_3D3D is also undefined if the src_total_line_length or dst_total_line_length values are smaller than num_elements_per_line, i.e. overlapping of lines is undefined.

The behavior of async_work_group_copy_3D3D is also undefined if src_total_plane_area is smaller than (num_lines * src_total_line_length), or dst_total_plane_area is smaller than (num_lines * dst_total_line_length), i.e. overlapping of planes is undefined.
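The addressing rules of Table 25 can be summarized in a short host-side model in plain C. This is a sketch, not the device implementation: model_copy_3d3d is a hypothetical name, the copy runs sequentially with no event, and it assumes (following the definition of the total plane area as the distance between the starts of consecutive planes) that plane p, line l begins at element offset + p * total_plane_area + l * total_line_length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the 3D3D addressing. All offsets, line lengths
   and plane areas are in elements; addresses are computed in bytes. */
static void model_copy_3d3d(void *dst, size_t dst_offset,
                            const void *src, size_t src_offset,
                            size_t num_bytes_per_element,
                            size_t num_elements_per_line,
                            size_t num_lines, size_t num_planes,
                            size_t src_total_line_length,
                            size_t src_total_plane_area,
                            size_t dst_total_line_length,
                            size_t dst_total_plane_area)
{
    /* Overlapping lines or planes are undefined in the real built-in. */
    assert(src_total_line_length >= num_elements_per_line);
    assert(dst_total_line_length >= num_elements_per_line);
    assert(src_total_plane_area >= num_lines * src_total_line_length);
    assert(dst_total_plane_area >= num_lines * dst_total_line_length);

    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    for (size_t plane = 0; plane < num_planes; ++plane) {
        for (size_t line = 0; line < num_lines; ++line) {
            size_t s_elem = src_offset + plane * src_total_plane_area
                          + line * src_total_line_length;
            size_t d_elem = dst_offset + plane * dst_total_plane_area
                          + line * dst_total_line_length;
            memcpy(d + d_elem * num_bytes_per_element,
                   s + s_elem * num_bytes_per_element,
                   num_elements_per_line * num_bytes_per_element);
        }
    }
}
```

Setting num_planes to 1 reduces this model to the 2D2D case, since the plane terms then contribute nothing to either offset.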
