
PyCUDA Study Notes -- pagelocked memory

2016-02-23 05:03

PyCUDA: pagelocked memory

In GPU programming, data has to be transferred from the CPU (host) to the GPU (device) before a kernel can use it, and that transfer can take a noticeable amount of time.

The usual way of transferring data:

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

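a = np.random.randn(256, 256)
a = a.astype(np.float32)           # randn returns float64; convert to float32
a_gpu = cuda.mem_alloc(a.nbytes)   # allocate device memory of the same size
cuda.memcpy_htod(a_gpu, a)         # copy from pageable host memory to the device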

##following codes ...


Transferring page-locked host memory from host to device (i.e. from CPU to GPU)

The code above copies from ordinary pageable host memory, which can make the transfer slow, so we try page-locked memory instead.

Background on page-locked (pinned) host memory

CPU data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the data to the pinned array, and then transfer the data from the pinned array to device memory. Allocating the host array in pinned memory from the start avoids that extra staging copy.
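The cost of the staging copy can be checked with a small timing sketch. This is only an illustration: the helper names `time_htod_copy` and `compare_pageable_and_pinned`, the array shape, and the repeat count are assumptions made for this example, and the measured numbers depend entirely on the machine. The function returns None when PyCUDA or a CUDA device is not available.

```python
import time

import numpy as np


def time_htod_copy(cuda, host_array, repeats=20):
    """Average wall-clock seconds for one host-to-device copy of host_array."""
    dev = cuda.mem_alloc(host_array.nbytes)
    cuda.memcpy_htod(dev, host_array)      # warm-up copy, not timed
    cuda.Context.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        cuda.memcpy_htod(dev, host_array)
    cuda.Context.synchronize()             # make sure every copy has finished
    return (time.perf_counter() - start) / repeats


def compare_pageable_and_pinned(shape=(2048, 2048)):
    """Time the same copy from pageable and from page-locked host memory.

    Returns None when PyCUDA or a CUDA device is not available.
    """
    try:
        import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
        import pycuda.driver as cuda
    except Exception as exc:  # missing package, driver, or device
        print("skipping: PyCUDA/GPU not available:", exc)
        return None

    pageable = np.random.randn(*shape).astype(np.float32)
    pinned = cuda.pagelocked_empty(shape, np.float32)
    pinned[:] = pageable                   # same data, but in pinned storage

    return {
        "pageable": time_htod_copy(cuda, pageable),
        "pinned": time_htod_copy(cuda, pinned),
    }


if __name__ == "__main__":
    print(compare_pageable_and_pinned())
```

Larger arrays usually make the difference easier to see, since small copies are dominated by per-call overhead.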

Device Interface:

pycuda.driver.pagelocked_empty(shape, dtype, order="C", mem_flags=0)

Allocate a pagelocked numpy.ndarray of shape, dtype and order.

mem_flags: may be one of the values in host_alloc_flags. It may only be non-zero on CUDA 2.2 and newer:

The default one is equivalent to cudaMallocHost(void** ptr, size_t size) in CUDA, which allocates size bytes of host memory that is page-locked and accessible to the device.

PORTABLE:

The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.

DEVICEMAP:

Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().

WRITECOMBINED:

Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host -> device transfers.
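As a rough sketch of passing one of these flags, the following allocates a write-combined buffer, fills it from the CPU, and copies it to the device. The function name `upload_via_write_combined` is hypothetical and only used here for illustration. Because WC memory is slow to read on the CPU, the check reads the data back from the device into an ordinary NumPy array rather than reading the WC buffer itself. It returns None when PyCUDA or a GPU is not available.

```python
import numpy as np


def upload_via_write_combined(values):
    """Stage values in a write-combined pinned buffer and copy them to the GPU.

    Returns the data read back from the device, or None without PyCUDA/GPU.
    """
    try:
        import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
        import pycuda.driver as cuda
    except Exception as exc:  # missing package, driver, or device
        print("skipping: PyCUDA/GPU not available:", exc)
        return None

    values = np.ascontiguousarray(values, dtype=np.float32)
    wc_buf = cuda.pagelocked_empty(
        values.shape,
        np.float32,
        mem_flags=cuda.host_alloc_flags.WRITECOMBINED,
    )
    wc_buf[:] = values                 # the CPU only writes to the WC buffer

    dev = cuda.mem_alloc(values.nbytes)
    cuda.memcpy_htod(dev, wc_buf)      # the device reads it across the bus

    # Verify through ordinary memory instead of reading wc_buf on the CPU.
    check = np.empty_like(values)
    cuda.memcpy_dtoh(check, dev)
    return check
```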

cuda.pagelocked_empty(shape, dtype, order="C")
cuda.pagelocked_zeros( .. )
cuda.pagelocked_empty_like( array )
cuda.pagelocked_zeros_like( array )
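A minimal sketch of how these allocators might be combined with asynchronous copies, which require page-locked host buffers to actually overlap with other work. The function name `roundtrip_through_pinned` is made up for this example, and the round trip (host to device and straight back) exists only to show the calls, not as a useful workload. It returns None when PyCUDA or a GPU is not available.

```python
import numpy as np


def roundtrip_through_pinned(host_data):
    """Copy host_data to the GPU and back using pinned buffers and a stream.

    Returns an ordinary NumPy copy of the result, or None without PyCUDA/GPU.
    """
    try:
        import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
        import pycuda.driver as cuda
    except Exception as exc:  # missing package, driver, or device
        print("skipping: PyCUDA/GPU not available:", exc)
        return None

    host_data = np.ascontiguousarray(host_data)

    # Pinned input buffer with the same shape and dtype as the source array.
    staging = cuda.pagelocked_empty_like(host_data)
    staging[:] = host_data

    # Pinned, zero-initialised output buffer.
    result = cuda.pagelocked_zeros_like(host_data)

    dev = cuda.mem_alloc(host_data.nbytes)
    stream = cuda.Stream()

    # Both copies are queued on the same stream, so they run in order.
    cuda.memcpy_htod_async(dev, staging, stream)
    cuda.memcpy_dtoh_async(result, dev, stream)
    stream.synchronize()               # wait until the queued copies finish

    return np.array(result)            # copy out of pinned memory
```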


However, when I try to access the page-locked memory from the CPU, it appears to be very slow.
Tags: gpu cuda pycuda