-
Hi!
Program response:
Here, memcpy and tMandelbrot are separate launches, and Queue_pl is the benchmark from the example. Let's swap the lines inside the benchmark:
Now the data is copied in parallel while the fractal is being calculated. I'm a beginner and haven't worked with pure CUDA. Is this a feature of GeForce or of SYCL? Is it possible to run the kernel in parallel after memcpy? Benchmark and fractal code:
Replies: 2 comments 1 reply
-
Hi, AdaptiveCpp will automatically attempt to overlap kernels and data transfers if you use an out-of-order queue (as here), there are no dependencies between the operations, and the hardware supports it. It may also attempt to run multiple kernels simultaneously. This relies on some heuristics. I cannot tell you why you observe the operations running in parallel in the second case but not in the first. In either case, it will not explicitly wait on the data transfer or kernel. Perhaps in the first case the timing is just unlucky, or some host latencies obscure the overlap. If you want more control over the overlapping and concurrency of independent operations, construct multiple in-order queues and submit independent operations to different queues. They will be executed in parallel if the hardware can do it.
-
Figured it out: I needed to use pinned memory.
I allocated memory on the host in the usual way:
But it had to be like this:
I'm sure if I had specified this in the question, you would have answered.
Thank you so much for the comment!
Multiple in-order queues are not only faster but also more convenient than a single out-of-order queue: instead of specifying dependencies between events, you can put the set of commands for each task into its own in-order queue.