When to use cudaHostRegister() and cudaHostAlloc()? What is the meaning of "Pinned or page-locked" memory? What are the equivalents in OpenCL?

Question

I am new to these NVIDIA APIs, and some of the terminology is not clear to me. I was wondering whether somebody could help me understand, in simple terms, when and how to use these CUDA functions. To be more precise:

While studying how to speed up applications by running a kernel in parallel (with CUDA, for example), I ran into the problem of speeding up host-device transfers. I have gathered some information from the web, but I am a little confused. It is clear that transfers can be faster when it is possible to use cudaHostRegister() and/or cudaHostAlloc(). Here it is explained that

"you can use the cudaHostRegister() command to take some data (already allocated) and pin it avoiding extra copy to take into the GPU".

What is the meaning of "pin the memory"? Why is it so fast? How do I pin memory in advance in this context? Later, in the same video at the link, they go on to explain that

"if you are transferring PINNED memory, you can use the asynchronous memory transfer, cudaMemcpyAsync(), which lets the CPU keep working during the memory transfer".

Are PCIe transactions managed entirely by the CPU? Is there a bus manager that takes care of this? Partial answers are also very welcome, so that I can piece the puzzle together in the end.

Links to the equivalent APIs in OpenCL would also be appreciated.

I've already read this, and it is still not clear or complete enough for me. I am looking for more information on the web and trying to study the topic in depth, but I am missing some basic concepts such as "pin the memory". That's why I wrote the question. Thank you for the help in any case :) – Leos313, Sep 12 '16 at 17:22
http://linux.die.net/man/2/mlock - the first paragraph should answer pretty much all your background questions. If you can't understand it, I fear you are asking your question in the wrong place. — talonmies, Sep 12 '16 at 18:24

score 13 · Accepted Answer · edited May 23 '17 at 11:46

What is the meaning of "pin the memory"?

It means making the memory page-locked. That tells the operating system's virtual memory manager that the memory pages must stay in physical RAM so that they can be directly accessed by the GPU across the PCI Express bus.
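As a minimal sketch of the two ways to obtain page-locked host memory with the CUDA runtime API: the helper names `alloc_pinned` and `pin_existing` are hypothetical and only serve the illustration, while `cudaHostAlloc()`, `cudaHostRegister()`, `cudaHostUnregister()` and `cudaFreeHost()` are the actual runtime calls.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Option A: allocate a new host buffer that is page-locked from the start.
// Such a buffer must be released with cudaFreeHost(), not free().
// (alloc_pinned is a hypothetical helper name.)
float* alloc_pinned(size_t n) {
    float* p = nullptr;
    cudaError_t err = cudaHostAlloc((void**)&p, n * sizeof(float),
                                    cudaHostAllocDefault);
    return err == cudaSuccess ? p : nullptr;
}

// Option B: page-lock a buffer that already exists (for example from malloc).
// Returns true on success. Call cudaHostUnregister() on the same pointer
// before freeing the buffer. (pin_existing is a hypothetical helper name.)
bool pin_existing(float* p, size_t n) {
    return cudaHostRegister(p, n * sizeof(float),
                            cudaHostRegisterDefault) == cudaSuccess;
}
```

Roughly speaking, `cudaHostAlloc()` fits when you control the allocation yourself, and `cudaHostRegister()` fits when the data already lives in memory that some other code allocated. Pinned memory is a limited resource, so pinning very large buffers can hurt overall system performance.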

Why is it so fast?

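In one word: DMA. When the memory is page-locked, the GPU's DMA engine can run the transfer directly without involving the host CPU, which reduces overall latency and decreases net transfer times. With ordinary pageable memory, the driver first has to copy the data into a temporary pinned staging buffer, which is the extra copy mentioned in the quote above.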
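To illustrate how the CPU stays free during such a transfer, here is a small sketch built on `cudaMemcpyAsync()` and a stream. The function name `copy_while_cpu_works` is hypothetical, and the sketch assumes that `pinned_src` was obtained through `cudaHostAlloc()` or `cudaHostRegister()` and that `dev_dst` comes from `cudaMalloc()`.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Queues an asynchronous host-to-device copy of n floats, then lets the CPU
// spin on its own work while the GPU's DMA engine moves the data.
// Returns the number of polls the CPU managed before the copy finished,
// or -1 on error. (copy_while_cpu_works is a hypothetical helper name.)
long copy_while_cpu_works(const float* pinned_src, float* dev_dst, size_t n) {
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1;

    // Returns once the transfer is queued. The source has to be pinned for
    // the copy to actually overlap with host execution.
    cudaError_t err = cudaMemcpyAsync(dev_dst, pinned_src, n * sizeof(float),
                                      cudaMemcpyHostToDevice, stream);
    long polls = 0;
    if (err == cudaSuccess) {
        // cudaStreamQuery() reports cudaErrorNotReady while work is pending.
        // Real code would do useful CPU work here instead of counting.
        while ((err = cudaStreamQuery(stream)) == cudaErrorNotReady) {
            ++polls;
        }
    }
    cudaStreamDestroy(stream);
    return err == cudaSuccess ? polls : -1;
}
```

If the source buffer is pageable instead, the call still works, but the copy generally cannot overlap with host execution in the same way.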

Are the PCIe transaction managed entirely from the CPU?

No. See above.

Is there a manager of a bus that takes care of this?

No. The GPU manages the transfers. In this context there is no such thing as a bus master.

answered Sep 13 '16 at 06:15

talonmies

Do you know the equivalent of these instructions for OpenCL? – Leos313 Aug 24 '17 at 08:06

@Leos313 I think the equivalent OpenCL instruction is `clEnqueueMapBuffer` – Philippe Fisher Jul 17 '18 at 17:09

Kshitij Lakhani · Answer 2 · 2021-01-28T19:26:20.620

EDIT: It seems that CUDA treats pinned and page-locked as the same thing, as per the Pinned Host Memory section in this blog post written by Mark Harris. This means my answer is moot and the best answer should be taken as is.

I bumped into this question while looking for something else. For all future users, I think @talonmies answers the question perfectly, but I'd like to bring to notice a slight difference between locking and pinning pages - the former ensures that the memory is not pageable but the kernel is free to move it around and the latter ensures that it stays in memory (i.e. non-pageable) but also is mapped to the same address. Here's a reference to the same.
