Question

As is well known, AMD's OpenCL implementation supports wavefronts (August 2015): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf

For example, the AMD Radeon HD 7770 GPU supports more than 25,000 in-flight work-items and can switch to a new wavefront (containing up to 64 work-items) in a single cycle.
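The "more than 25,000" figure can be sanity-checked with a little arithmetic. Below is a minimal sketch in C. It assumes 10 compute units, 4 SIMD units per compute unit, and up to 10 resident wavefronts per SIMD; none of these numbers appear in the excerpt quoted above. The function name is hypothetical.

```c
#include <assert.h>

/* Upper bound on in-flight work-items for a GPU organised as
   compute units -> SIMD units -> resident wavefronts.
   ASSUMPTION: all four parameters are supplied by the caller;
   none of them are taken from the quoted guide excerpt. */
unsigned long max_in_flight_work_items(unsigned compute_units,
                                       unsigned simds_per_cu,
                                       unsigned wavefronts_per_simd,
                                       unsigned wavefront_size)
{
    assert(wavefront_size > 0);
    return (unsigned long)compute_units * simds_per_cu
         * wavefronts_per_simd * wavefront_size;
}
```

Under those assumptions, `max_in_flight_work_items(10, 4, 10, 64)` gives 25,600, which is consistent with "more than 25,000".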

But why is there no mention of wavefronts in the OpenCL 1.0 / 2.0 / 2.2 standards?

None of the PDFs contains the word "WaveFront": https://www.khronos.org/registry/OpenCL/specs/

I also found:

2013: https://community.amd.com/thread/160658

OpenCL is an open standard. It still does not support this hybrid concept. It does not even support wavefront/warp.

2013: https://stackoverflow.com/a/19874984/1558037

That is why the concept is not in the OpenCL specification itself.

2011: https://forums.khronos.org/showthread.php/7211-How-can-i-split-my-work-load-in-a-GPU-with-OpenCL

Standard OpenCL has no concept of a "wavefront".

2011: https://www.cvg.ethz.ch/teaching/2011spring/gpgpu/GPU-Optimization.pdf

Is it really true that the official OpenCL 2.2 standard still does not support wavefronts?

Conclusion:

There are no wavefronts in the OpenCL standard, but OpenCL 2.0 introduces sub-groups, which have a SIMD execution model similar to wavefronts.

Page -1: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf

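6.4.2 Work-group/sub-group level functions

OpenCL 2.0 introduces a Khronos sub-group extension. Sub-groups are a logical abstraction of the hardware SIMD execution model akin to wavefronts, warps, or vectors and permit programming closer to the hardware in a vendor-independent manner. This extension includes a set of cross-sub-group built-in functions that match the set of the cross-work-group built-in functions specified above.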

Answer 1

They must have adopted a more dynamic approach in this version of the specification: https://www.khronos.org/registry/OpenCL/specs/opencl-2.2.pdf

sub-group

and

Sub-group: Sub-groups are an implementation-dependent grouping of work-items within a
work-group. The size and number of sub-groups is implementation-defined.

and

Work-groups are further divided into sub-groups,
which provide an additional level of control over execution.

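So even though it is not called a wavefront, it can now be queried at runtime, and

In the absence of synchronization functions (e.g. a barrier), work-items within a sub-group may be serialized. In the presence of sub-group functions, work-items within a sub-group may be serialized before any given sub-group function, between dynamically encountered pairs of sub-group functions, and between a work-group function and the end of the kernel.

so even lock-step execution is sometimes lost.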
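To make the "work-groups are further divided into sub-groups" idea concrete, here is a minimal host-side sketch in plain C. It is a model, not the OpenCL API: the function names are hypothetical, and it assumes a simple linear mapping of local IDs to sub-groups, whereas the specification leaves the real mapping implementation-defined.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sub-groups needed to cover a work-group. Rounds up,
   so the last sub-group may be only partially filled. */
size_t num_sub_groups(size_t work_group_size, size_t sub_group_size)
{
    assert(sub_group_size > 0);
    return (work_group_size + sub_group_size - 1) / sub_group_size;
}

/* ASSUMPTION: linear mapping of local IDs to sub-groups.
   A real implementation is free to map work-items differently. */
size_t sub_group_id_of(size_t local_id, size_t sub_group_size)
{
    assert(sub_group_size > 0);
    return local_id / sub_group_size;
}

/* Position of a work-item inside its sub-group, same assumption. */
size_t sub_group_local_id_of(size_t local_id, size_t sub_group_size)
{
    assert(sub_group_size > 0);
    return local_id % sub_group_size;
}
```

For a work-group of 256 work-items, this model yields 4 sub-groups with a sub-group size of 64 (wavefront-like) and 8 with a size of 32 (warp-like). In an actual kernel these values would come from the sub-group built-ins such as get_sub_group_size() and get_sub_group_id() rather than being computed by hand.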

Besides that,

The mapping of work-items to
sub-groups is implementation-defined and may be queried at runtime.

says that some kind of intra-sub-group communication exists, because OpenCL now has child-kernel definitions:

 sub_group_all() and
sub_group_broadcast() and are described in OpenCL C++ kernel language and IL specifications.
The use of these sub-group functions implies sequenced-before relationships between statements
within the execution of a single work-item in order to satisfy data dependencies.
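As a rough illustration of what these two sub-group functions compute, here is a minimal host-side model in plain C. It is only a sketch: the `model_` functions are hypothetical stand-ins, each array holds one entry per work-item in a sub-group, and nothing here calls the real OpenCL built-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Model of sub_group_all(): returns 1 only if every work-item's
   predicate is nonzero, otherwise 0. */
int model_sub_group_all(const int *predicates, size_t sub_group_size)
{
    for (size_t i = 0; i < sub_group_size; ++i) {
        if (predicates[i] == 0) {
            return 0;
        }
    }
    return 1;
}

/* Model of sub_group_broadcast(): every work-item receives the value
   held by the work-item at position source_lane. */
int model_sub_group_broadcast(const int *values, size_t sub_group_size,
                              size_t source_lane)
{
    assert(source_lane < sub_group_size);
    return values[source_lane];
}
```

Both results depend on values from other work-items in the same sub-group, which is why the quoted text talks about sequenced-before relationships: each work-item must have produced its input before the collective result can be formed.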

Finally, with something like

Device-side enqueue: A mechanism whereby a kernel-instance is enqueued by a kernel-instance
running on a device without direct involvement by the host program. This produces nested
parallelism; i.e. additional levels of concurrency are nested inside a running kernel-instance.
The kernel-instance executing on a device (the parent kernel) enqueues a kernel-instance (the
child kernel) to a device-side command queue. Child and parent kernels execute asynchronously
though a parent kernel does not complete until all of its child-kernels have completed.

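you should be able to spawn your own (upgraded?) wavefronts of whatever size you need; they run concurrently with the parent kernel (and can communicate with the threads inside a sub-group), but they are not called wavefronts because, IMHO, they are not hard-coded by the hardware.

The 2.0 API spec says: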

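kernel void launcher()
{
    // Child NDRange: a single work-item in one dimension.
    ndrange_t ndrange = ndrange_1D(1);

    // Enqueue the block as a child kernel; with WAIT_KERNEL it starts
    // only after this (parent) kernel has finished executing.
    enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,
                   ^{
                       size_t id = get_global_id(0);  // global ID inside the child kernel
                   });
}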

This is reminiscent of AMD's 16-wide SIMDs and NVIDIA's 32-wide SIMDs compared with some imaginary FPGA's 95-wide compute core. Pseudo-wavefronts, maybe?
