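I was browsing the project linked from the "Selecting Device Objects for Compute Processing" page in the Metal documentation (linked here). There I noticed a clever use of threadgroup memory that I would like to adopt in my own particle simulator. Before I do, though, I need to understand one particular aspect of threadgroup memory and what the developers are doing in this case.
The code contains a segment like this: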
    // In AAPLKernels.metal
    // Parameter of the kernel
    threadgroup float4* sharedPosition [[ threadgroup(0) ]]
    // Body
    ...
    // For each particle / body
    for(i = 0; i < params.numBodies; i += numThreadsInGroup)
    {
        // Because sharedPosition uses the threadgroup address space, 'numThreadsInGroup' elements
        // of sharedPosition will be initialized at once (not just one element at lid as it
        // may look like)
        sharedPosition[threadInGroup] = oldPosition[sourcePosition];
        j = 0;
        while(j < numThreadsInGroup)
        {
            acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
            acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
            acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
            acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
            acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
            acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
            acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
            acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        } // while
        sourcePosition += numThreadsInGroup;
    } // for
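In particular, the comment just before the sharedPosition assignment, the one starting with "Because...", confuses me. I have not read anywhere that a threadgroup memory write happens on all threads of the same threadgroup at once. In fact, I thought a barrier would be needed before reading from the shared memory pool again to avoid undefined behavior, because every thread goes on to read from the entire threadgroup memory pool after the assignment (and the assignment is, of course, a write). Why is no barrier needed here?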
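For reference, this is roughly what I would have expected the loop to look like. It is only a minimal sketch of my understanding, not the sample's actual code: the kernel name, the buffer indices, and computeAccelerationSketch are hypothetical stand-ins I made up for illustration, and it assumes that numBodies is a multiple of the threadgroup size and that the grid has exactly numBodies threads, so that every thread reaches both barriers.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical stand-in for the sample's computeAcceleration (softened gravity).
// Assumption: the w component of a position holds the body's mass.
static float3 computeAccelerationSketch(float4 source, float4 current, float softeningSqr)
{
    float3 r = source.xyz - current.xyz;
    float distSqr = dot(r, r) + softeningSqr;
    float invDist = rsqrt(distSqr);
    return r * (source.w * invDist * invDist * invDist);
}

// Hypothetical kernel; buffer indices are assumptions, not the sample's layout.
kernel void accumulateAccelerationSketch(
    device const float4* oldPosition     [[ buffer(0) ]],
    device float4*       newAcceleration [[ buffer(1) ]],
    constant uint&       numBodies       [[ buffer(2) ]],
    constant float&      softeningSqr    [[ buffer(3) ]],
    threadgroup float4*  sharedPosition  [[ threadgroup(0) ]],
    uint threadInGrid      [[ thread_position_in_grid ]],
    uint threadInGroup     [[ thread_position_in_threadgroup ]],
    uint numThreadsInGroup [[ threads_per_threadgroup ]])
{
    float4 currentPosition = oldPosition[threadInGrid];
    float3 acceleration = float3(0.0f);

    for (uint i = 0; i < numBodies; i += numThreadsInGroup)
    {
        // Each thread writes exactly one element of the tile.
        sharedPosition[threadInGroup] = oldPosition[i + threadInGroup];

        // Wait until every thread in the threadgroup has written its element
        // before any thread reads the whole tile.
        threadgroup_barrier(mem_flags::mem_threadgroup);

        for (uint j = 0; j < numThreadsInGroup; ++j)
        {
            acceleration += computeAccelerationSketch(sharedPosition[j], currentPosition, softeningSqr);
        }

        // Wait until every thread has finished reading the tile
        // before the next iteration overwrites it.
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    newAcceleration[threadInGrid] = float4(acceleration, 0.0f);
}
```

The two threadgroup_barrier calls are the part I expected to see in the sample: one between the write and the reads of the tile, and one before the tile is overwritten in the next iteration.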