The GPU waits for MEMCPY and only then starts the KERNEL #1864

EmerAX7 · 2025-06-24T10:28:24Z

Hi!
I am studying an example in which the next chunk of data is copied while the current one is being processed.
I work with an NVIDIA GeForce RTX 2070 graphics card.
The example below calculates the Mandelbrot fractal. The fractal itself is not essential; what matters is that the calculation does not depend on the data being copied in parallel.

...

sycl::property_list properties{sycl::property::queue::enable_profiling()};
auto qu = sycl::queue{sycl::gpu_selector_v, properties};

auto sBuf  = sycl::malloc_device<quint16>(pixAll, qu); //destination of the memcpy
auto sFrc  = sycl::malloc_device<quint16>(pixImg, qu); //fractal

benchmark([&]() {
    e1 = qu.memcpy(sBuf, buf16, pixAll * sizeof(quint16));
    e2 = tMandelbrot(qu, -0.9, -0.9, 0.9, 0.9, width, height, 4095, sFrc, {});
    e1.wait();
    e2.wait();
}, "Queue_prl");
    
int64_t e1_end    = e1.get_profiling_info<sycl::info::event_profiling::command_end>();
int64_t e2_start  = e2.get_profiling_info<sycl::info::event_profiling::command_start>();
    
std::cout << "E2_start - E1_end: " << e2_start-e1_end << std::endl;

Program output:

memcpy       : 13526 mcs
tMandelbrot  : 30530 mcs
Queue_prl    : 42079 mcs
E2_start - E1_end: 109997

Here memcpy and tMandelbrot are separate launches, and Queue_prl is the benchmark from the example.
From this we can see that Queue_prl is roughly the sum of the two execution times. Also, the start of the E2 event is later than the end of E1, i.e. E2 was waiting for E1 to complete.
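A single difference such as e2_start - e1_end only shows which event came first. If command_start and command_end are queried for both events, the amount of real overlap can be computed directly. Below is a minimal sketch in plain C++; the Interval type and the overlap_ns name are hypothetical and not part of SYCL, and it assumes the timestamps are in nanoseconds, which is how SYCL 2020 specifies profiling values.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper type: one event's execution interval in nanoseconds,
// as it would be filled from the command_start / command_end profiling queries.
struct Interval {
    std::int64_t start_ns;
    std::int64_t end_ns;
};

// Time during which both intervals were active; 0 if they never overlap.
inline std::int64_t overlap_ns(const Interval &a, const Interval &b)
{
    const std::int64_t begin = std::max(a.start_ns, b.start_ns);
    const std::int64_t end   = std::min(a.end_ns, b.end_ns);
    return std::max<std::int64_t>(0, end - begin);
}
```

A result of 0 means the two commands ran strictly one after the other, which is what the positive gap in the output above suggests; any positive result is the time the copy and the kernel were actually running concurrently.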

Let's swap the lines inside the benchmark:

    e1 = tMandelbrot(qu, -0.9, -0.9, 0.9, 0.9, width, height, 4095, sFrc, {});
    e2 = qu.memcpy(sBuf, buf16, pixAll * sizeof(quint16));
    e1.wait();
    e2.wait();

memcpy       : 14009 mcs
tMandelbrot  : 31086 mcs
Queue_prl    : 30888 mcs
E2_start - E1_end: -30846426

Now, while the fractal is being calculated, the data is being copied in parallel.
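The reasoning behind both conclusions is that a serialized run should take about memcpy + tMandelbrot, while a fully overlapped run should take about the longer of the two. A minimal sketch of that check in plain C++ follows; the Schedule enum and the classify function are hypothetical names, and the nearest-match rule is only a rough heuristic, not a precise measurement.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Hypothetical labels for the two scheduling outcomes discussed above.
enum class Schedule { Serialized, Overlapped };

// Compares the measured wall time with the two idealized predictions
// (sum of both durations vs. the longer one) and returns the closer match.
inline Schedule classify(std::int64_t copy_us, std::int64_t kernel_us,
                         std::int64_t total_us)
{
    const std::int64_t serial  = copy_us + kernel_us;           // one after the other
    const std::int64_t overlap = std::max(copy_us, kernel_us);  // fully concurrent
    const std::int64_t d_serial  = std::abs(total_us - serial);
    const std::int64_t d_overlap = std::abs(total_us - overlap);
    return d_serial <= d_overlap ? Schedule::Serialized : Schedule::Overlapped;
}
```

With the numbers from the two runs above, classify(13526, 30530, 42079) gives Schedule::Serialized and classify(14009, 31086, 30888) gives Schedule::Overlapped.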

I'm a beginner and haven't worked with pure CUDA. Is this a feature of GeForce or of SYCL? Is it possible to have the kernel run in parallel with the memcpy when it is submitted after the memcpy?

Benchmark and fractal code:

template <typename Func>
auto benchmark(Func &&func, string str, int n=12, int d=0)
{
    auto tm_beg = chrono::steady_clock::now();
    func();
    auto tm_end = chrono::steady_clock::now();
    auto tm_time = chrono::duration_cast<chrono::microseconds>(tm_end - tm_beg);
    cout << addVsps(str, n, d) << " : " << tm_time.count() << " mcs" << endl;
}

sycl::event tMandelbrot(sycl::queue &qu, float x1, float y1, float x2, float y2,
                            int width, int height, int maxIters, unsigned short * image, const std::vector<sycl::event> &evt)
{
    float dx = (x2-x1)/width, dy = (y2-y1)/height;
    int pxc = width*height;
    sycl::range nRng = sycl::range<1>(pxc);
    return qu.parallel_for(nRng, evt, [=](sycl::id<1> idx) {
        auto gid = idx[0];
        int j = gid/width;    
        int i = gid - j*width; 
        //
        float sx = x1+dx*i;
        float sy = y1+dy*j;
        float zx=0, zy=0;
        int cnt = 0;
        while (zx*zx + zy*zy <= 4.0f && cnt <= maxIters) {
            float nx = zx*zx - zy*zy + sx;
            zy = 2.0f * zx * zy + sy;
            zx = nx;
            cnt += 1;
        }
        image[gid] = cnt - tMCLR;
    });
}

illuhad · 2025-06-24T13:44:17Z

illuhad
Jun 24, 2025
Maintainer

Hi,

AdaptiveCpp will automatically attempt to overlap kernels and data transfers, provided that you use out-of-order queues (as here), there are no dependencies between the operations, and the hardware supports it. It may also attempt to run multiple kernels simultaneously.

This relies on some heuristics. I cannot tell you why you observe the operations being run in parallel in the second case but not in the first. In either case, it will not explicitly wait on the data transfer or kernel. Perhaps in the first case the timing is just unlucky, or some host latencies obscure the overlapping.

If you want more control over the overlapping and concurrency of independent operations, then construct multiple in-order queues and submit independent operations to different queues. They will be executed in parallel if the hardware can do it.
Doing this optimization explicitly rather than relying on out-of-order queues to do it automatically usually results in better performance, since the application developer knows patterns better than the SYCL runtime, which needs to figure out things at runtime.

EmerAX7 · 2025-06-27T14:27:19Z

EmerAX7
Jun 27, 2025
Author

Figured it out: I needed to use pinned memory.
I allocated memory on the host in the usual way:

buf16 = new quint16[pixAll];
bufFr = new quint16[pixImg];

But it had to be like this:

buf16 = sycl::malloc_host<quint16>(pixAll, qu);
bufFr = sycl::malloc_host<quint16>(pixImg, qu);

I'm sure if I had specified this in the question, you would have answered.

Thank you so much for the comment!
Multiple in-order queues are not only faster but also more convenient than a single out-of-order queue: instead of specifying dependencies between events, you can put the set of commands for each task into a separate in-order queue.

1 reply

illuhad Jun 27, 2025
Maintainer

Yes, using pinned memory can have an effect here.

However, it's usually only needed when you want to overlap two data transfers (not a kernel and a data transfer). We also documented this here: https://www.iwocl.org/wp-content/uploads/21-presentation-iwocl-syclcon-2022-applencourt.pdf

But I guess there could be differences between different GPU models and driver versions. It might also make a difference that we looked at data center GPUs in that poster, not at consumer GPUs.
