I am just new with this APIs of the Nvidia and some expressions are not so clear for me. I was wondering if somebody can help me to understand when and how to use these CUDA commands in a simply way. To be more precise:
Studing how is possible to speed up some applications with parallel execution of a kernel (with CUDA for example), at some point I was facing the problem of speeding up the interaction Host-Device.
I have some informations, taken surfing on the web, but I am little bit confused.
It clear that you can go faster when it is possible to use cudaHostRegister() and/or cudaHostAlloc(). Here it is explained that
"you can use the
cudaHostRegister()command to take some data (already allocated) and pin it avoiding extra copy to take into the GPU".
What is the meaning of "pin the memory"? Why is it so fast? How can I do this previously in this field? After, in the same video in the link, they continue explaining that
"if you are transferring PINNED memory, you can use the asynchronous memory transfer,
cudaMemcpyAsync(), which let's the CPU keep working during the memory transfer".
Are the PCIe transaction managed entirely from the CPU? Is there a manager of a bus that takes care of this? Also partial answers are really appreciated to re-compose the puzzle at the end.
It is also appreciate to have some link about the equivalent APIs in OpenCL.