
Questions about OpenCL global and local work size


Searching the NVIDIA forums I found these questions, which are also of interest to me, but nobody had answered them in the last four days or so. Can you help?

Original forum post: Digging into OpenCL and reading tutorials, some things stayed unclear to me. Here is a collection of my questions regarding local and global work sizes.

1. Must the global_work_size be smaller than CL_DEVICE_MAX_WORK_ITEM_SIZES? On my machine CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 64.

2. Is CL_KERNEL_WORK_GROUP_SIZE the recommended work_group_size for the used kernel?
2b. Or is this the only work_group_size the GPU allows? On my machine CL_KERNEL_WORK_GROUP_SIZE = 512.

3. Do I need to divide the work into work groups, or can I have only one by not specifying local_work_size?
3b. What do I have to pay attention to when I only have one work group?

4. What does CL_DEVICE_MAX_WORK_GROUP_SIZE mean? On my machine CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, 512, 64.
4b. Does this mean I can have one work group which is as large as CL_DEVICE_MAX_WORK_ITEM_SIZES?

Added by edit:

5. Does global_work_size have to be a divisor of CL_DEVICE_MAX_WORK_ITEM_SIZES? In my code global_work_size = 20.

Thanks for your help!

opencl

asked Oct 18 '10 at 7:05 by Framester

2 Answers

Accepted answer (39 votes):
In general you can choose global_work_size as big as you want, while local_work_size is constrained by the underlying device/hardware, so all the query results tell you the possible dimensions for local_work_size rather than for global_work_size. The only constraint on the global_work_size is that it must be a multiple of the local_work_size (in each dimension).
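
For example, a minimal host-side sketch (queue and kernel are placeholders for handles created elsewhere, and the sizes are made up for illustration) where the global size is a multiple of the local size in both dimensions:

    size_t local_work_size[2]  = {16, 16};       /* 256 work-items per group                        */
    size_t global_work_size[2] = {1024, 1024};   /* each dimension is a multiple of the local size  */

    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        2,       /* work_dim            */
                                        NULL,    /* global_work_offset  */
                                        global_work_size,
                                        local_work_size,
                                        0, NULL, NULL);   /* check err in real code */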

The work group sizes specify the sizes of the work groups, so if CL_DEVICE_MAX_WORK_ITEM_SIZES is 512, 512, 64, that means your local_work_size can't be bigger than 512 for the x and y dimensions and 64 for the z dimension.
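
To see these limits on your own device you can query them at runtime; a sketch, assuming device is a valid cl_device_id and <CL/cl.h> and <stdio.h> are included:

    size_t max_item_sizes[3];   /* assumes a device with 3 work-item dimensions, the usual case */

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(max_item_sizes), max_item_sizes, NULL);

    printf("max local size per dimension: %zu x %zu x %zu\n",
           max_item_sizes[0], max_item_sizes[1], max_item_sizes[2]);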

However, there is also a constraint on the local group size depending on the kernel. This is expressed through CL_KERNEL_WORK_GROUP_SIZE. Your cumulative work group size (as in the product of all dimensions, e.g. 256 if you have a local size of 16, 16, 1) must not be greater than that number. This is due to the limited hardware resources that have to be divided between the threads (from your query results I assume you are programming on an NVIDIA GPU, so the amount of local memory and registers used by a thread will limit the number of threads which can be executed in parallel).
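
A sketch of querying this per-kernel limit (kernel and device are assumed to be valid handles for an already built program):

    size_t kernel_wg_size;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_wg_size), &kernel_wg_size, NULL);

    /* A local size of e.g. 16 x 16 x 1 is only legal if 16*16*1 <= kernel_wg_size. */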

CL_DEVICE_MAX_WORK_GROUP_SIZE defines the maximum size of a work group in the same manner as CL_KERNEL_WORK_GROUP_SIZE, but specific to the device instead of the kernel (and it should be a scalar value, i.e. 512).
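
It is queried the same way as the per-dimension limits, but note that it is a single size_t rather than an array (again assuming device is a valid cl_device_id):

    size_t device_wg_size;   /* one scalar, e.g. 512, not one value per dimension */

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(device_wg_size), &device_wg_size, NULL);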

You can choose not to specify local_work_size, in which case the OpenCL implementation will choose a local work group size for you (so it is not guaranteed that it uses only one work group). However, it's generally not advisable, since you don't know how your work is divided into work groups and, furthermore, it's not guaranteed that the chosen work group size will be optimal.
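
In code, leaving the choice to the implementation just means passing NULL for the local size (handles as in the sketches above):

    size_t global_work_size[1] = {4096};

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size,
                           NULL,   /* implementation picks the work group size */
                           0, NULL, NULL);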

However, you should note that using only one work group is generally not a good idea performance-wise (and why use OpenCL if performance is not a concern?). In general, a work group has to execute on one compute unit, while most devices have more than one (modern CPUs have 2 or more, one for each core, while modern GPUs can have up to 20). Furthermore, even the one compute unit on which your work group executes might not be fully used, since several work groups can execute on one compute unit in an SMT style. To use NVIDIA GPUs optimally you need 768/1024/1536 threads per compute unit (depending on the generation, meaning G80/GT200/GF100), and while I don't know the numbers for AMD right now, they are of the same magnitude, so it's good to have more than one work group. Furthermore, for GPUs it's typically advisable to have work groups of at least 64 threads (and a number of threads per work group divisible by 32/64 (NVIDIA/AMD)), because otherwise you will again have reduced performance (32/64 is the minimum granularity for execution on GPUs, so if you have fewer items in a work group, it will still execute as 32/64 threads, but discard the results from the unused threads).
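
A common pattern that follows from this is to pick a local size that is a multiple of 32/64, round the global size up to the next multiple of the local size, and discard the padding work-items inside the kernel. A sketch for a hypothetical 1-D problem of n elements (handles and the kernel name are made up for illustration):

    /* Host side: n is the real problem size, 64 the chosen local size. */
    size_t n = 1000;
    size_t local_work_size[1]  = {64};
    size_t global_work_size[1] =
        {((n + local_work_size[0] - 1) / local_work_size[0]) * local_work_size[0]};   /* 1024 */

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           global_work_size, local_work_size, 0, NULL, NULL);

    /* Kernel side (OpenCL C): guard against the padded work-items. */
    __kernel void scale(__global float *data, uint n)
    {
        size_t i = get_global_id(0);
        if (i < n)
            data[i] *= 2.0f;
    }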

answered Oct 18 '10 at 13:50, edited Oct 18 '10 at 15:21, by Grizzly

Thank you very much. One thing I have to ask: when you say smaller, do you mean < or <=? – Framester Oct 18 '10 at 14:37
@Framester: smaller or equal, edited to fix that – Grizzly Oct 18 '10 at 15:22
I would edit the recommendation "it is not advisable not to specify a workgroup size", since for many, many operations it IS the best choice. – DarkZeros Oct 15 '13 at 11:44





Second answer (0 votes):
This should be a comment, but I cannot comment because of my points. If CL_KERNEL_WORK_GROUP_SIZE limits the maximum work group size for a kernel, then how can we specify the work group size? If we specify some value, it might be bigger than what the clGetKernelWorkGroupInfo function returns.

I have this problem: my device has a max work group size of 256. For a kernel I specified a work group size of 128. When I run the kernel for the first time inside a loop, it is OK. But on the second iteration, or sometimes the third, the CL_KERNEL_WORK_GROUP_SIZE gets reduced from 256 to 64. This throws errors, as my work group size is 128. How can I solve this problem?

answered Apr 14 at 19:55 by Luniam
Hi Luniam, why don't you post your comment as a new question and link to my question? Then the community can help you with your question! – Framester Apr 15 at 8:11
That should not happen; kernels don't change their properties over time. Please post a full question with some code, and we might help you. – DarkZeros Apr 15 at 8:58