
CUDA Programming: Thread, Block and Grid

2017-10-19 21:32
All threads of a block execute on the same streaming multiprocessor (SM/SMX).

For more on thread blocks, see the CUDA C Programming Guide, in particular the description of a thread block in Section D.1.2 Glossary of the CUDA Dynamic Parallelism chapter:

A Thread Block is a group of threads which execute on the same multiprocessor (SMX).
Threads within a Thread Block have access to shared memory and can be explicitly synchronized. 
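
To make the quoted point concrete, here is a minimal kernel sketch of block-level cooperation through shared memory and __syncthreads(). The kernel name reverseInBlock and the block size of 256 are assumptions for illustration; the launch is assumed to use blocks of exactly BLOCK_SIZE threads on an array whose length is a multiple of BLOCK_SIZE.

```cpp
// Sketch: threads of one block cooperate through shared memory and
// synchronize explicitly with __syncthreads(). Assumes blockDim.x == BLOCK_SIZE
// and an input length that is a multiple of BLOCK_SIZE.
#define BLOCK_SIZE 256

__global__ void reverseInBlock(const float *in, float *out)
{
    __shared__ float tile[BLOCK_SIZE];      // visible to all threads of this block

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = in[gid];                    // each thread loads one element
    __syncthreads();                        // wait until the whole tile is loaded

    out[gid] = tile[blockDim.x - 1 - tid];  // read a value written by another thread
}
```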


A kernel can be executed by multiple equally-shaped thread blocks. The programming guide explains:

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core.
On current GPUs, a thread block may contain up to 1024 threads.
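
The 1024-thread limit, together with the warp size and the per-SM thread limit, can be queried at run time with the CUDA runtime API; a minimal host-side sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // query device 0

    // maxThreadsPerBlock is 1024 on current GPUs, as quoted above
    printf("maxThreadsPerBlock          : %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsPerMultiProcessor : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("warpSize                    : %d\n", prop.warpSize);
    return 0;
}
```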

However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
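
As a sketch of the "threads per block times number of blocks" relation, a typical 1D launch rounds the block count up so that the launched threads cover all n elements. The kernel scaleKernel and the block size of 256 are illustrative assumptions.

```cpp
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard against the rounded-up tail
        data[i] *= factor;
}

void launchScale(float *d_data, float factor, int n)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / threadsPerBlock)
    // total threads launched = blocks * threadsPerBlock >= n
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, factor, n);
}
```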

Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.
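
For a two-dimensional grid the same idea uses dim3 for both the block and the grid; a sketch for a width x height image, where the kernel name and the 16 x 16 block shape are assumptions:

```cpp
__global__ void invertImage(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

void launchInvert(unsigned char *d_img, int width, int height)
{
    dim3 block(16, 16);                                  // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,          // enough blocks to cover the image
              (height + block.y - 1) / block.y);
    invertImage<<<grid, block>>>(d_img, width, height);
}
```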

Take the NVIDIA GK110 architecture as an example; the explanation below is quoted from https://devtalk.nvidia.com/default/topic/897696/relationship-between-threads-and-gpu-core-units/?offset=6

A SMX consists of 4 subpartitions, each containing a warp scheduler, resources (register file, scheduler slots) and execution units. The SMX also contains shared execution units such as the texture unit, shared memory unit, and double precision units.

The compute work distributor distributes thread blocks to an SMX when the SMX has sufficient available resources for the thread block. The thread block is divided into warps. Each warp is allocated to an SM subpartition and warp resources such as registers are allocated. A warp will stay on that specific subpartition until it completes. When it completes, its resources will be freed.
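
The way a block is split into warps can be observed from inside a kernel; a minimal sketch assuming a one-dimensional block and the built-in warpSize of 32 (the kernel name is illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// One printf per warp: shows how a 1D block of 128 threads is split
// into warps of warpSize (32) threads each.
__global__ void printWarpLayout()
{
    int warpId = threadIdx.x / warpSize;   // which warp of this block
    int laneId = threadIdx.x % warpSize;   // lane (0..31) within that warp

    if (laneId == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpId, threadIdx.x);
}

int main()
{
    printWarpLayout<<<2, 128>>>();         // 2 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();               // wait so the device printf output appears
    return 0;
}
```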

Each cycle, each warp scheduler will pick an eligible warp (one that is not stalled) and issue 1 or 2 instructions from that warp. These instructions are dispatched to execution units (single precision/integer unit, double precision unit, special function unit, load/store unit, texture unit, shared memory unit, etc.). Each of the execution units is pipelined, so the warp scheduler can execute instructions from the same warp or a different warp N cycles later. ALU instructions tend to have fixed latency (measurable by microbenchmarks), whereas SMX shared units such as the double precision unit, and memory units such as shared memory and the texture unit, have variable latency.

The reason the SMX can manage 2048 threads (= 64 warps) is so that each warp scheduler has a sufficient pool of warps to hide long-latency instructions, or to hide short-latency instructions, without adding the area and power cost of out-of-order execution.
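
How many warps actually become resident on an SMX for a given kernel, and therefore how much latency-hiding headroom the schedulers have, can be estimated with the CUDA occupancy API. A host-side sketch, using a hypothetical kernel myKernel and an assumed block size of 256:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* hypothetical kernel body */ }

int main()
{
    int blockSize = 256;
    int numBlocksPerSm = 0;

    // Active blocks per multiprocessor for this kernel and block size
    // (0 bytes of dynamic shared memory assumed).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, myKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int activeWarps = numBlocksPerSm * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;  // 64 on GK110
    printf("occupancy: %d / %d warps per SM\n", activeWarps, maxWarps);
    return 0;
}
```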

In addition, Appendix A of NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf mentions that:

CUDA Hardware Execution

CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores
and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads,
they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses. 
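
A sketch of the two recommendations in that last sentence: keep the threads of a warp on the same code path, and have consecutive threads access nearby (consecutive) addresses. All kernel names below are illustrative.

```cpp
// Coalesced: consecutive threads of a warp read consecutive addresses,
// so each warp's accesses fall into few memory segments.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads read addresses far apart, so each warp
// touches many memory segments and effective bandwidth drops.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}

// Divergent: threads of the same warp take different branches, so the
// warp executes both paths one after the other.
__global__ void divergentBranch(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0)      // even and odd lanes diverge
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }
}
```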