VCL: a new open source VirtIO-GPU OpenCL driver leveraging hardware acceleration

Written by Antonio Caggiano and Marco Liebel

Oct 15, 2024

If you are developing, testing, or using OpenCL applications, you are likely aiming to harness the power of heterogeneous computing. OpenCL enables the acceleration of general-purpose workloads by executing them on the GPU, but what about workloads within a virtual machine? Is it possible to leverage OpenCL to accelerate those workloads on the physical GPU? And how would one achieve this?

In this post, we will examine VirtIO-GPU, a VirtIO-based graphics adapter, and VCL, an OpenCL driver by Qualcomm Technologies, Inc. for VirtIO-GPU. Using VCL, you can leverage the host's graphics hardware to speed up OpenCL applications in guest virtual machines.

How can we use the host GPU to accelerate operations in a virtual machine?

A virtual machine functions as a computer within another computer. Hardware virtualization involves a host, which is the underlying machine, and a guest, which is the virtual machine running on top of the host. To improve the performance of virtual machines, techniques like hardware-assisted virtualization and paravirtualization have evolved to execute guest content directly on the host. That content can be executed on the CPU (as in Linux KVM) and on the GPU.

Taking advantage of the GPU on the host is a complex topic that can be tackled in different ways. Ours involves a combination of API remoting, device emulation and forwarding API calls from guest to host for execution. For that to work, a virtual GPU acts much like a physical graphics card from the guest perspective; therefore, the guest OS needs a corresponding device driver.

VirtIO-GPU is such a virtual GPU. It’s a VirtIO-based graphics adapter and OASIS standard, and the Linux kernel has provided a driver for it since 2013. VirtIO-GPU enables graphics API user space drivers such as Virgl and Venus, which satisfy the requirements for most graphics applications in a DRM-enabled Unix guest.

But there is still one Khronos API absent: OpenCL. True, it’s possible to run OpenCL applications on top of Vulkan with projects like clvk or ANGLE, but those projects come with their own limitations. clvk still depends on Venus, which is relatively new and may not be available on some targets. And ANGLE has many dependencies and supports only a limited set of targets.

Considering these constraints, we recognized the necessity for an OpenCL driver to accelerate guest workloads using the host GPU. Thus, we began developing VCL, a compact OpenCL driver for VirtIO-GPU that relies solely on libdrm.

VCL architecture

As shown in the block diagram below, VCL takes great inspiration from other Mesa drivers.

VCL architecture: highlighted in red, the vcl driver on the guest, and the vcomp virglrenderer backend on the host.

At the bottom of the guest block is a layer which uses libdrm to interact with the VirtIO-GPU, driving all communications from the guest to the host. The layer effectively forwards API calls from the guest to the host via the VirtIO-GPU using techniques common to Virgl and Venus. The driver is written in Rust; therefore, at the very top of the guest block is a layer that implements the OpenCL API the way Rusticl does.

Output parameters

Many OpenCL functions expect output parameters where an implementation can store return values. In the VCL transport layer, a command ID is associated with each OpenCL function. This ID, together with the arguments passed to the function, is forwarded to the host. In practice, the driver encodes one or more commands into a buffer, which is then passed to the EXECBUFFER ioctl.

In an environment where host and guest do not share memory, passing an output parameter to the host involves transmitting a pointer to the guest's address space. This pointer cannot be dereferenced by the host, as it does not reside in the host's address space; however, the host still requires the value it points to. The solution is to pass both the pointer and the value it references. For a pointer to an array, this means passing the size of the array along with all its values.
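The "pointer plus value" idea can be sketched in C as follows. This is only an illustration, not VCL's wire format: the `demo_encoder` type, the helper names, and the layout (command ID, scalar argument, guest pointer, element count, current values) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size command buffer; not VCL's real encoder. */
struct demo_encoder {
    uint8_t data[256];
    size_t  size;
};

/* Append raw bytes (host byte order, which is fine for a sketch).
 * Returns 0 on success, -1 if the buffer is full. */
static int demo_encode_bytes(struct demo_encoder *enc, const void *src, size_t len)
{
    assert(enc != NULL);
    if (len > sizeof(enc->data) - enc->size)
        return -1;
    memcpy(enc->data + enc->size, src, len);
    enc->size += len;
    return 0;
}

static int demo_encode_u32(struct demo_encoder *enc, uint32_t v)
{
    return demo_encode_bytes(enc, &v, sizeof(v));
}

static int demo_encode_u64(struct demo_encoder *enc, uint64_t v)
{
    return demo_encode_bytes(enc, &v, sizeof(v));
}

/* Encode an output array: the guest pointer (opaque to the host),
 * the element count, then the values the pointer currently refers to. */
static int demo_encode_out_u32_array(struct demo_encoder *enc,
                                     const uint32_t *guest_ptr, uint32_t count)
{
    if (demo_encode_u64(enc, (uint64_t)(uintptr_t)guest_ptr) ||
        demo_encode_u32(enc, count))
        return -1;
    if (guest_ptr == NULL)
        return 0; /* a NULL pointer carries no values */
    return demo_encode_bytes(enc, guest_ptr, (size_t)count * sizeof(uint32_t));
}

/* Encode one hypothetical command: ID, one scalar argument, one output array. */
static int demo_encode_command(struct demo_encoder *enc, uint32_t command_id,
                               uint32_t arg, const uint32_t *out, uint32_t count)
{
    if (demo_encode_u32(enc, command_id) || demo_encode_u32(enc, arg))
        return -1;
    return demo_encode_out_u32_array(enc, out, count);
}
```

With this assumed layout, a command carrying a two-element output array occupies 4 + 4 + 8 + 4 + 8 = 28 bytes, and the host can rebuild the array without ever touching guest memory.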

The host receives the command buffer in virgl_renderer_submit_cmd() and forwards it to the vcomp context submit_cmd() function. The host decodes the command buffer and executes commands. Since guest pointers cannot be used on the host, local variables are defined instead and used as output parameters to call the corresponding OpenCL function. At this point both the return and the output values should be sent back to the guest. This process is similar to the encode-command step, as we encode a reply for the guest.
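The host-side steps (decode, call with a local output variable, encode a reply) might look roughly like the sketch below. Everything in it is hypothetical: `demo_host_get_count()` stands in for a real OpenCL call with an output parameter, and the command and reply layouts are assumptions, not the vcomp format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CMD_GET_COUNT 1u /* hypothetical command ID */

/* Stand-in for a host OpenCL call that has an output parameter. */
static int32_t demo_host_get_count(uint32_t *out_count)
{
    *out_count = 3;
    return 0; /* success */
}

/*
 * Decode one command, run it, and encode the reply.
 * Assumed command layout: u32 command ID, u64 guest pointer.
 * Assumed reply layout:   i32 return value, u32 output value.
 * Returns the number of reply bytes written, or 0 on a malformed command.
 */
static size_t demo_host_execute(const uint8_t *cmd, size_t cmd_size,
                                uint8_t *reply, size_t reply_size)
{
    uint32_t command_id;
    uint64_t guest_ptr;

    assert(cmd != NULL && reply != NULL);
    if (cmd_size < sizeof(command_id) + sizeof(guest_ptr) ||
        reply_size < sizeof(int32_t) + sizeof(uint32_t))
        return 0;

    memcpy(&command_id, cmd, sizeof(command_id));
    memcpy(&guest_ptr, cmd + sizeof(command_id), sizeof(guest_ptr));
    if (command_id != DEMO_CMD_GET_COUNT)
        return 0;

    /* The guest pointer is never dereferenced on the host; a local
     * variable takes its place as the output parameter. */
    (void)guest_ptr;
    uint32_t count = 0;
    int32_t ret = demo_host_get_count(&count);

    /* Encode both the return value and the output value for the guest. */
    memcpy(reply, &ret, sizeof(ret));
    memcpy(reply + sizeof(ret), &count, sizeof(count));
    return sizeof(ret) + sizeof(count);
}
```

The guest would then decode the reply in the same order and store the output value at the address it originally passed.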

There is no reverse-ioctl mechanism, so how do we send a reply back to the guest? The guest allocates a Virgl resource to be used as a reply buffer. The host encodes the reply into the resource created by the guest. From the guest's perspective, the EXECBUFFER ioctl is blocking, so as soon as it returns, the guest triggers a transfer from the host and maps the resource to read the reply with the corresponding VirtIO-GPU ioctls, as shown below.

Sequence diagram of the execution of a command buffer from guest to host

VirtIO-GPU resources

There are two VirtIO-GPU ioctls for creating resources:

CREATE_RESOURCE
CREATE_RESOURCE_BLOB

While blob resources require shared memory, normal resources can be used for transferring data when shared memory is not available.

virglrenderer uses resources mostly for OpenGL contexts, but when a resource is requested with CUSTOM_BINDING usage, virglrenderer creates a virgl/vrend resource backed by host memory only.

When creating a resource with the corresponding ioctl, QEMU attaches iovecs pointing to a destination in guest memory. This step does not happen if the resource is created with custom commands in EXECBUFFER.

Once the host has written data to the iovecs, the guest can trigger a transfer from the host, wait for the transfer to complete, and map the resource memory for reading the reply, as shown below.

Sequence diagram of the creation of a resource from host to guest

Command and reply buffers

A vcomp context receives a buffer with OpenCL commands from the VCL driver and it is supposed to generate a reply for most of those commands. A vcomp context decodes the first few bytes to get a command ID. For each command ID there is a dispatch function that continues decoding command arguments and possibly encodes a reply. The reply is encoded into a reply buffer that will be decoded later by the guest.

Without shared memory we cannot use the same addresses on both guest driver and host backend. That is because a memory allocation on the guest lives in a different address space with respect to the host, and vice versa. The guest is required to create a VirtIO-GPU resource with custom binding for the reply buffer. On the host, that would correspond to a virgl/vrend resource backed by host memory with an associated iovec pointing to guest memory.

From the host's perspective, vcomp does not know where to encode the reply until the guest driver tells the host backend which resource to use. That is done with a custom command from the guest, clSetReplyBufferMESA(resource_id); therefore, it must be sent before any other command that would require a reply. See the diagram below.
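The ordering constraint can be modelled with a few lines of C. This is a toy model under stated assumptions: the `demo_reply_state` type and function names are invented for the example, and treating resource ID 0 as "not set" is an assumption, not something taken from virglrenderer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host context state; 0 is assumed to mean "no reply resource". */
struct demo_reply_state {
    uint32_t reply_resource_id;
};

/* Models the effect of the custom set-reply-buffer command. */
static void demo_set_reply_buffer(struct demo_reply_state *st, uint32_t resource_id)
{
    assert(st != NULL);
    st->reply_resource_id = resource_id;
}

/* Returns 0 if the command can run, -1 if it needs a reply
 * but no reply resource has been announced yet. */
static int demo_submit(const struct demo_reply_state *st, bool needs_reply)
{
    assert(st != NULL);
    if (needs_reply && st->reply_resource_id == 0)
        return -1;
    return 0;
}
```

In this model a command that needs a reply is rejected until `demo_set_reply_buffer()` has run, which mirrors why the set-reply-buffer command has to come first.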

Sequence diagram of the creation of a resource for reply from host to guest

Transport layer

The transport layer is a collection of functions used for encoding and decoding OpenCL commands and their arguments. It’s the same approach used in the Venus driver and, thanks to the work of some very nice guys from Google, most of this code is generated by a Python script in the venus-protocol repository. Generation of Venus protocol source code happens in two steps:

parsing an XML file containing the Vulkan specification
generating Venus protocol code based on the specification

As with Vulkan, there is an official XML file containing the OpenCL specification, cl.xml, so the VCL protocol source code can be generated in a similar fashion. We cloned venus-protocol and heavily modified it according to our needs, so we now call it vcl-protocol.

Another virglrenderer context

Creating a new kind of virglrenderer context requires the CONTEXT_INIT VirtIO-GPU kernel parameter, available since kernel 5.16. Without this we would be able to create only VIRGL/VIRGL2 contexts and submit commands to only those kinds of contexts.

The CONTEXT_INIT kernel parameter brings the new CONTEXT_INIT ioctl to the VirtIO-GPU. The ioctl is called as a result of querying the OpenCL platforms via the VCL driver, and it triggers the creation of the host virglrenderer/vcomp context. The guest will then be able to send the next commands – such as creating resources, transferring data, mapping resource memory and executing a command buffer – to the correct virglrenderer context on the host.

By reading the kernel implementation of this ioctl, we can see that the capset ID is stored in vfpriv.context_init and a create context command is issued with it. That will go through QEMU's virtio-gpu-virgl, passing the context_init (with value VCL capset) to virglrenderer.

Host backend

Triggered by the CONTEXT_INIT ioctl, a vcomp context gets created on the host by vcomp_context_create() in virglrenderer/src/vcl/vcomp_context.c.

The vcomp context maintains a dispatch object, which contains function pointers for each OpenCL command. The dispatch object is initialized by vcomp_context_init_dispatch(), which sets those function pointers to the various vcomp_dispatch_*() functions.

The vcomp context receives command buffers from the EXECBUFFER ioctl, containing one or more commands, via vcomp_context_submit_cmd(). The command buffer is decoded and the commands are dispatched one by one via vcl_dispatch_command() to the corresponding vcomp_dispatch_*() function.

Adding support for a new OpenCL function means implementing the corresponding dispatch function and registering it in the dispatch object.
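A dispatch object of this kind is essentially a table of function pointers indexed by command ID. The sketch below shows the pattern in isolation; the `demo_*` names, the two command IDs, and the handler signature are hypothetical and much simpler than the real vcomp code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command IDs for the example. */
enum demo_command {
    DEMO_CMD_GET_PLATFORM_IDS = 0,
    DEMO_CMD_GET_DEVICE_IDS   = 1,
    DEMO_CMD_COUNT
};

struct demo_context;
typedef int (*demo_dispatch_fn)(struct demo_context *ctx);

struct demo_context {
    demo_dispatch_fn dispatch[DEMO_CMD_COUNT];
    int last_command; /* records which handler ran, for illustration only */
};

static int demo_dispatch_get_platform_ids(struct demo_context *ctx)
{
    ctx->last_command = DEMO_CMD_GET_PLATFORM_IDS;
    return 0;
}

static int demo_dispatch_get_device_ids(struct demo_context *ctx)
{
    ctx->last_command = DEMO_CMD_GET_DEVICE_IDS;
    return 0;
}

/* Register one handler per command ID; a new command adds one line here. */
static void demo_context_init_dispatch(struct demo_context *ctx)
{
    assert(ctx != NULL);
    ctx->last_command = -1;
    ctx->dispatch[DEMO_CMD_GET_PLATFORM_IDS] = demo_dispatch_get_platform_ids;
    ctx->dispatch[DEMO_CMD_GET_DEVICE_IDS]   = demo_dispatch_get_device_ids;
}

/* Look up and call the handler; -1 for unknown or unregistered IDs. */
static int demo_dispatch_command(struct demo_context *ctx, uint32_t command_id)
{
    assert(ctx != NULL);
    if (command_id >= DEMO_CMD_COUNT || ctx->dispatch[command_id] == NULL)
        return -1;
    return ctx->dispatch[command_id](ctx);
}
```

In a real backend each handler would go on to decode its own arguments from the command buffer; here the handlers only record that they ran.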

Handle mapping

To support cl_khr_icd, driver objects have to be ICD compatible. In other words, they should look like this:

struct _cl_<object>
{
    struct _cl_icd_dispatch *dispatch;
    // ... remainder of internal data
};

In Rust, that would look like this:

#[repr(C)]
pub struct _cl_<object> {
    dispatch: *mut _cl_icd_dispatch,
    // ...
}

That precludes us from using handles from the host. Imagine we call clGetPlatformIDs() to get a platform_id handle from the host directly. It’s a pointer to host memory, and if the host OpenCL implementation supports cl_khr_icd, it will have a dispatch pointer set in platform_id->dispatch.

Unfortunately, we cannot use that pointer on the guest. As soon as we attempt to dereference it, we get a segmentation fault due to the difference between guest and host address spaces.

The solution adopted by Venus is to create objects in the guest driver and maintain a mapping between guest handles and host handles.

When the guest creates a new VkInstance, it calls vkCreateInstance(), passing an output pointer where it expects the handle to be returned. On the host, the decoder maintains a hash table – decoder->object_table – using the pointer from the guest as key and vkr_objects as values. The vcl driver follows the same approach, so for every guest handle there will be a corresponding vcomp_object on the host.
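To make the mapping concrete, here is a small self-contained object table keyed on the guest pointer value. It is a sketch only: the fixed-size open-addressing table, the hash function, and the `demo_*` names are choices made for this example and are not how virglrenderer implements its object table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TABLE_SLOTS 64 /* arbitrary size for the example */

struct demo_object {
    uint64_t guest_handle; /* key: pointer value from the guest, never dereferenced */
    void    *host_handle;  /* value: the real host-side handle */
    int      used;
};

/* Must be zero-initialized before use. */
struct demo_object_table {
    struct demo_object slots[DEMO_TABLE_SLOTS];
};

static size_t demo_hash(uint64_t key)
{
    /* Multiplicative hashing: keep the top 6 bits, giving 0..63. */
    return (size_t)((key * 0x9E3779B97F4A7C15ull) >> 58);
}

/* Insert or update a mapping; returns 0 on success, -1 if the table is full. */
static int demo_table_insert(struct demo_object_table *t, uint64_t guest_handle,
                             void *host_handle)
{
    assert(t != NULL);
    size_t start = demo_hash(guest_handle);
    for (size_t i = 0; i < DEMO_TABLE_SLOTS; i++) {
        struct demo_object *o = &t->slots[(start + i) % DEMO_TABLE_SLOTS];
        if (!o->used || o->guest_handle == guest_handle) {
            o->guest_handle = guest_handle;
            o->host_handle = host_handle;
            o->used = 1;
            return 0;
        }
    }
    return -1;
}

/* Translate a guest handle into the host handle; NULL if unknown. */
static void *demo_table_lookup(const struct demo_object_table *t,
                               uint64_t guest_handle)
{
    assert(t != NULL);
    size_t start = demo_hash(guest_handle);
    for (size_t i = 0; i < DEMO_TABLE_SLOTS; i++) {
        const struct demo_object *o = &t->slots[(start + i) % DEMO_TABLE_SLOTS];
        if (!o->used)
            return NULL; /* no deletions in this sketch, so an empty slot ends the probe */
        if (o->guest_handle == guest_handle)
            return o->host_handle;
    }
    return NULL;
}
```

Whenever a later command refers to a guest handle, the host looks it up in the table and calls the host OpenCL implementation with the host handle instead.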

Irregular API

Note, however, that some OpenCL functions are irregular. For example, when the guest creates a cl_context, it calls clCreateContext(), but instead of using an output parameter, it expects the handle as a return value. This is a problem as we need to pass a guest pointer to the host for handle mapping. For reference, this is the complete signature of the function:

cl_context clCreateContext(
    const cl_context_properties* properties,
    cl_uint num_devices,
    const cl_device_id* devices,
    void (CL_CALLBACK* pfn_notify)(const char*, const void*, size_t, void*),
    void* user_data,
    cl_int* errcode_ret
);

To account for that, and for all OpenCL functions returning handles as function return values, we introduced a custom set of OpenCL functions. They put handles in output parameters and return cl_int for error reporting.

In practice, an OpenCL object gets created in the guest driver first, then a pointer to its handle is sent as an output parameter through a custom OpenCL function such as clCreateContextMESA(). On the host, the decoder uses the pointer from the guest as key when inserting a new vcomp object into the object table. For reference, here is the signature of this new function:

cl_int clCreateContextMESA(
    const cl_context_properties* properties,
    cl_uint num_devices,
    const cl_device_id* devices,
    void (CL_CALLBACK* pfn_notify)(const char*, const void*, size_t, void*),
    void* user_data,
    cl_context* out_context
);
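One way to picture how the standard entry point and the output-parameter variant fit together on the guest is the sketch below. It uses stand-in types rather than the OpenCL headers so that it compiles on its own, and it trims the argument list to a device count. `demo_create_context()` and `demo_create_context_mesa()` are hypothetical names, and the body of the MESA-style function only marks where encoding and submission would happen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef int32_t demo_cl_int;        /* stand-in for cl_int */
#define DEMO_CL_SUCCESS     0
#define DEMO_ERR_NO_MEMORY  (-6)    /* mirrors CL_OUT_OF_HOST_MEMORY */

/* Stand-in for a guest-side context object. */
struct demo_cl_context {
    uint32_t num_devices;
};
typedef struct demo_cl_context *demo_context_handle;

/* Output-parameter variant: the handle travels as an output parameter
 * and the error code is the return value. */
static demo_cl_int demo_create_context_mesa(uint32_t num_devices,
                                            demo_context_handle *out_context)
{
    assert(out_context != NULL && *out_context != NULL);
    /* A real driver would encode the guest handle and the arguments
     * here and submit them to the host. */
    (*out_context)->num_devices = num_devices;
    return DEMO_CL_SUCCESS;
}

/* Standard-shaped entry point: returns the handle and reports errors
 * through errcode_ret, built on top of the output-parameter variant. */
static demo_context_handle demo_create_context(uint32_t num_devices,
                                               demo_cl_int *errcode_ret)
{
    /* Create the guest object first so its address can serve as the key. */
    demo_context_handle ctx = calloc(1, sizeof(*ctx));
    demo_cl_int err = ctx ? demo_create_context_mesa(num_devices, &ctx)
                          : DEMO_ERR_NO_MEMORY;
    if (err != DEMO_CL_SUCCESS) {
        free(ctx);
        ctx = NULL;
    }
    if (errcode_ret != NULL)
        *errcode_ret = err;
    return ctx;
}
```

The application still sees the standard shape (handle returned, error code through a pointer), while the transport only ever deals with the regular output-parameter form.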

Benchmarks

To illustrate the performance boost from hardware acceleration with VCL and VirtIO-GPU, here are some results from benchmarks we ran with cl-mem and clpeak.

The most dramatic difference is in single-precision compute:

Single-precision compute benchmarks comparing VCL and Rusticl

Red indicates Rusticl/llvmpipe (software OpenCL implementation running entirely on the guest). Green indicates VCL (backed by host with an integrated graphics card). Without hardware acceleration Rusticl reaches less than 200 GFLOPS; with acceleration, VCL attains 2000 – 2400 GFLOPS.

The difference is similarly striking in integer compute:

Integer compute (fast 24-bit) benchmarks comparing VCL (green) and Rusticl (red)

Without acceleration, Rusticl attains at most 150 GIOPS, while VCL and hardware acceleration attain over 600 GIOPS.

The benefits of hardware acceleration are even more pronounced across these tests:

integer compute (int, int2, int4, int8, int16)
integer char (8-bit) compute
integer short (16-bit) compute

Note that the copy operations VCL performs (from guest to host and back) affect memory performance:

Memory benchmarks comparing VCL and Rusticl

Next steps

At present, the VCL driver supports most of the functionality available in OpenCL 1.0 and passes over 600 piglit tests. The driver does not include its own compiler; instead, it sends the program source directly to the host. Compilation therefore depends on the host driver, which lets most program tests pass with relative ease.

Interested in trying out VCL or contributing to its development? Here are a few options:

Build VCL for yourself and even include it in your products. Note that there are dependencies, and you should have some experience with building Mesa. The Mesa documentation is a good place to begin.
Wait for VCL to be included in your next distribution upgrade (when it is eventually accepted by the community).
Review the merge requests for Mesa and virglrenderer, then contribute.
Watch the presentation “QQVP: Qualcomm’s SystemC and QEMU modelling solution” from Linaro Connect 2024.

We now have a dedicated channel for open-source projects on Developer Discord. Join the community of like-minded developers to connect, get support and exchange ideas.
