CUDA Programming: thread, block, and grid
2017-10-19 21:32
All threads in a block execute on the same streaming multiprocessor (SMX).
For thread blocks, see the description in Section D.1.2 Glossary of the CUDA Dynamic Parallelism chapter of the CUDA C Programming Guide:
A Thread Block is a group of threads which execute on the same multiprocessor (SMX).
Threads within a Thread Block have access to shared memory and can be explicitly synchronized.
A kernel can be executed by multiple equally-shaped thread blocks.
There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core.
On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.
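The sizing rule above can be sketched in host code (a minimal illustration; `blocksForGrid` is a hypothetical helper name, not a CUDA API):

```cpp
// Hypothetical helper: round-up division gives the number of equally-shaped
// blocks needed to cover n elements. threadsPerBlock must respect the
// hardware limit (1024 threads per block on current GPUs).
unsigned int blocksForGrid(unsigned int n, unsigned int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

For example, n = 1000 with 256 threads per block yields 4 blocks (1024 total threads), so every element is covered and the last block is partly idle when n is not a multiple of the block size.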
Take the NVIDIA GK110 architecture as an example, as described in this NVIDIA DevTalk thread: https://devtalk.nvidia.com/default/topic/897696/relationship-between-threads-and-gpu-core-units/?offset=6
An SMX consists of 4 subpartitions, each containing a warp scheduler, resources (register file, scheduler slots), and execution units. The SMX also contains shared execution units such as the texture unit, shared memory unit, and double precision units.
The compute work distributor distributes thread blocks to an SMX when the SMX has sufficient available resources for the thread block. The thread block is divided into warps. Each warp is allocated to an SM subpartition, and warp resources such as registers are allocated. A warp stays on that subpartition until it completes; when it completes, its resources are freed.
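The warp subdivision described above amounts to another round-up division, this time by the warp size of 32 (a sketch; the helper name is illustrative):

```cpp
// Warps are groups of 32 threads; a block is divided into ceil(threads/32)
// warps, and a partially filled final warp still consumes a full warp slot.
constexpr unsigned int kWarpSize = 32;

unsigned int warpsPerBlock(unsigned int threadsPerBlock) {
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}
```

For example, a 100-thread block occupies 4 warps, the last of which has only 4 active threads.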
Each cycle, each warp scheduler picks an eligible warp (one that is not stalled) and issues 1 or 2 instructions from it. These instructions are dispatched to execution units (single precision/integer unit, double precision unit, special function unit, load/store unit, texture unit, shared memory unit, etc.). Each of the execution units is pipelined, so the warp scheduler can issue instructions from the same warp or a different warp N cycles later. ALU instructions tend to have fixed latency (measurable by microbenchmarks), whereas shared SMX units such as the double precision unit, and memory units such as shared memory and the texture unit, have variable latency.
The reason the SMX can manage 2048 threads (= 64 warps) is so that each warp scheduler has a sufficient pool of warps to hide long-latency instructions, or to hide short-latency instructions, without adding the area and power cost of out-of-order execution.
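The arithmetic behind the "2048 threads = 64 warps" figure follows from the GK110 numbers quoted above (a sketch; the function names are mine):

```cpp
// GK110 figures from the text: up to 2048 resident threads per SMX,
// warps of 32 threads, and 4 subpartitions (one warp scheduler each).
constexpr unsigned int kMaxThreadsPerSMX = 2048;
constexpr unsigned int kWarpSize         = 32;
constexpr unsigned int kSubpartitions    = 4;

constexpr unsigned int maxWarpsPerSMX()    { return kMaxThreadsPerSMX / kWarpSize; }
constexpr unsigned int warpsPerScheduler() { return maxWarpsPerSMX() / kSubpartitions; }
```

Each of the 4 schedulers thus has up to 16 resident warps to choose from every cycle, which is the pool used to hide instruction latency.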
In addition, Appendix A of the NVIDIA Kepler GK110 Architecture Whitepaper (NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf) mentions:
CUDA Hardware Execution
CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores
and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads,
they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses.
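The per-thread index arithmetic behind that coalesced-access advice can be sketched in plain C++ (stand-ins for CUDA's built-in blockIdx, blockDim, and threadIdx; the helper names are mine):

```cpp
constexpr unsigned int kWarpSize = 32;

// Global index of a scalar thread in a one-dimensional launch.
unsigned int globalIdx(unsigned int blockIdx, unsigned int blockDim,
                       unsigned int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// A thread's position within its warp (lane) and its warp's index in the block.
unsigned int laneId(unsigned int threadIdx) { return threadIdx % kWarpSize; }
unsigned int warpId(unsigned int threadIdx) { return threadIdx / kWarpSize; }
```

Threads in the same warp have consecutive lane IDs, so indexing data by globalIdx makes a warp touch 32 adjacent addresses, which is the "nearby addresses" access pattern the whitepaper recommends.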