Multi-threaded Multi-buffered UI Rendering
Added 2024-08-02 01:14:37 +0000 UTC
When people think of optimisation they usually think about fast 3D rendering, quick loading screens, powerful physics engines and small download sizes.
But one day I was profiling Sector's Edge and found that the user interface took the same amount of time to render as the 3D world. I was shocked - the UI was so simple but so slow!
It turns out that UI is still a form of rendering and requires the same optimisations as 3D rendering. OpenGL makes it easy to apply these optimisations to buffers (3D models, like in my YouTube video here), but it's trickier to apply them to textures.
To achieve a smooth, stutter-free UI rendering system, we need to:
Allocate and paint bitmaps on background threads
Send bitmap data to the GPU without causing stutters
Ensure we aren't updating textures while they're being rendered
Reuse memory, textures and buffers to prevent stuttering
Overview
The UI you see in games starts out as a bitmap on the CPU. A bitmap is a grid of pixels, and we can change the colour of each pixel like so:
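As a minimal sketch of that idea, here is a bitmap modelled as a plain array of 32-bit pixels, with no SkiaSharp involved so it runs on its own. The `PixelGrid` class and its members are hypothetical names for illustration; in SkiaSharp the closest equivalent is `SKBitmap.SetPixel`.

```csharp
using System;

// A bitmap modelled as a flat array of 32-bit pixels (0xAARRGGBB), stored row by row.
public sealed class PixelGrid
{
    public int Width { get; }
    public int Height { get; }
    public uint[] Pixels { get; }

    public PixelGrid(int width, int height)
    {
        Width = width;
        Height = height;
        Pixels = new uint[width * height]; // all zeroes, i.e. fully transparent
    }

    // Row-major indexing: skip y full rows, then move x pixels along the row.
    public void SetPixel(int x, int y, uint argb) => Pixels[y * Width + x] = argb;

    public uint GetPixel(int x, int y) => Pixels[y * Width + x];
}

public static class PixelGridDemo
{
    public static PixelGrid MakeExample()
    {
        var bitmap = new PixelGrid(4, 3);
        bitmap.SetPixel(1, 2, 0xFFFF0000); // opaque red
        bitmap.SetPixel(3, 0, 0xFF00FF00); // opaque green
        return bitmap;
    }
}
```

The flat array layout matters later on: it is the same block of memory that eventually gets copied to the GPU.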

To draw this bitmap on your screen, we need to first copy it to a texture on the GPU. We'll create a texture of the same size, and then copy the bitmap data to it.

Then we can use this texture in a shader to draw it on the screen.
Sounds simple! But there are crucial optimisations we need to implement at each stage.
Setup
For this demo we're using SkiaSharp to create and paint bitmaps, and RichTextKit to paint the text.
Skia is a popular UI rendering library (most browsers use it) and is written in C++. SkiaSharp lets us use it in C#.
RichTextKit supports rendering all kinds of text (colours, fonts, weights, styles) and was created by a very smart programmer (my dad)
Multithreading
Our first goal is to allocate and paint bitmaps on background threads. This is considered heavy lifting because allocating memory for large bitmaps is slow, and text is notoriously slow to paint.
'Creating' refers to allocating memory for the pixels
'Painting' refers to modifying the pixels in a bitmap, e.g. text, vector rasterisation, shapes
'Rendering' refers to displaying the texture on your screen
Let's say we're making a first person shooter game and we want to show the player's ammo in the bottom right of the screen. This would consist of 3 elements:
Weapon icon
Ammo text
Rectangle border

We could paint these 3 elements on the main thread and the game would still run quickly, because it only takes a fraction of a millisecond to paint them. But games usually have larger, complex UI elements that update often. Painting these on the main thread every frame would slow the game down.
So rather than painting these bitmaps on the main thread, we'll store the parameters of each paint function in a list of PaintCommand objects. This means we've saved a copy of all the information required to paint this bitmap, and can send it to a background thread for painting.
For example, the DrawIcon function takes these parameters:

Paint Commands
There are many types of PaintCommands, such as:
PaintCommandBitmap
PaintCommandText
PaintCommandRectangle

Each of these commands shares 3 common functions, but each has its own implementation:
Paint() - modify the pixels of the bitmap
ExpandBounds() - when we have multiple commands, we need to combine their bounding boxes together to figure out the final size of the bitmap
UpdateCRC() - if we issue the same list of Paint Commands over and over again, we should only need to paint the bitmap once and then re-use it. To figure out if we've issued the same list of Paint Commands, we iterate over each command and build up a CRC from its parameters. If the CRC matches the previous one, the list hasn't changed (barring a rare collision)
For example, here's a PaintCommandBitmap, which draws a bitmap at the specified position.
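A rough sketch of how such a command could be shaped, using a plain pixel array in place of a SkiaSharp bitmap so it is self-contained. This is not the code from Sector's Edge: the `PixelGrid` and `Bounds` types, the `sourceId` parameter and the bitwise CRC-32 helper are all assumptions made for illustration, and the blit is a straight copy with no alpha blending.

```csharp
using System;

// Hypothetical stand-in for a bitmap: row-major 0xAARRGGBB pixels.
public sealed class PixelGrid
{
    public readonly int Width, Height;
    public readonly uint[] Pixels;

    public PixelGrid(int width, int height)
    {
        Width = width;
        Height = height;
        Pixels = new uint[width * height];
    }
}

// Bounding box in screen coordinates; Right and Bottom are exclusive.
public readonly struct Bounds
{
    public readonly int Left, Top, Right, Bottom;

    public Bounds(int left, int top, int right, int bottom)
    {
        Left = left; Top = top; Right = right; Bottom = bottom;
    }

    // An "inside-out" box, so the first Union simply returns the other box.
    public static readonly Bounds Empty =
        new Bounds(int.MaxValue, int.MaxValue, int.MinValue, int.MinValue);

    public Bounds Union(Bounds o) => new Bounds(
        Math.Min(Left, o.Left), Math.Min(Top, o.Top),
        Math.Max(Right, o.Right), Math.Max(Bottom, o.Bottom));
}

public abstract class PaintCommand
{
    // originX/originY is the screen position of the target bitmap's top-left corner.
    public abstract void Paint(PixelGrid target, int originX, int originY);
    public abstract Bounds ExpandBounds(Bounds current);
    public abstract uint UpdateCRC(uint crc);

    // Feeds one 32-bit value into a running CRC-32 (bitwise: short rather than fast).
    protected static uint Crc32(uint crc, int value)
    {
        crc = ~crc;
        for (int i = 0; i < 4; i++)
        {
            crc ^= (byte)(value >> (8 * i));
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }
}

public sealed class PaintCommandBitmap : PaintCommand
{
    private readonly PixelGrid source;
    private readonly int sourceId; // assumed stable id for the source image, used only for the CRC
    private readonly int x, y;     // position in screen coordinates

    public PaintCommandBitmap(PixelGrid source, int sourceId, int x, int y)
    {
        this.source = source; this.sourceId = sourceId; this.x = x; this.y = y;
    }

    public override void Paint(PixelGrid target, int originX, int originY)
    {
        int netX = x - originX, netY = y - originY; // screen -> bitmap coordinates
        for (int row = 0; row < source.Height; row++)
        {
            int ty = netY + row;
            if (ty < 0 || ty >= target.Height) continue; // clip rows outside the target
            for (int col = 0; col < source.Width; col++)
            {
                int tx = netX + col;
                if (tx < 0 || tx >= target.Width) continue; // clip columns outside the target
                target.Pixels[ty * target.Width + tx] = source.Pixels[row * source.Width + col];
            }
        }
    }

    public override Bounds ExpandBounds(Bounds current) =>
        current.Union(new Bounds(x, y, x + source.Width, y + source.Height));

    public override uint UpdateCRC(uint crc) => Crc32(Crc32(Crc32(crc, sourceId), x), y);
}
```

Two commands built with the same parameters produce the same CRC, so a caller can compare the CRC of the new command list against the previous one and skip repainting when they match.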

Painting on Background Threads
Let's say you have a friend in real life who can paint really nice pictures. You describe to them what you want the painting to look like, and then they go off and paint it. A few days later they give you a beautifully framed painting exactly like you asked for.
This is what we want to achieve with our bitmap painting. We have a class called Painter that we can give a list of commands to. Then - on another thread - the Painter will figure out how big the bitmap needs to be, allocate memory for it, and execute each command.
Since the command list contains all the information the Painter needs, the Painter can safely perform its work on background threads.
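To make the hand-off concrete, here is a minimal sketch of a painter that copies the command list and does all the work on a thread-pool thread. It assumes a single, hypothetical `FillRect` command and plain pixel arrays; the real Painter handles many command types, and the names below are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical immutable command: fill a rectangle given in screen coordinates.
public readonly struct FillRect
{
    public readonly int X, Y, Width, Height;
    public readonly uint Colour;

    public FillRect(int x, int y, int width, int height, uint colour)
    {
        X = x; Y = y; Width = width; Height = height; Colour = colour;
    }
}

// Result of a paint job: the pixels plus where to draw them on screen.
public sealed class PaintedBitmap
{
    public readonly int ScreenX, ScreenY, Width, Height;
    public readonly uint[] Pixels;

    public PaintedBitmap(int screenX, int screenY, int width, int height, uint[] pixels)
    {
        ScreenX = screenX; ScreenY = screenY; Width = width; Height = height; Pixels = pixels;
    }
}

public static class Painter
{
    // Copies the commands so later changes by the caller can't affect the job,
    // then paints on a thread-pool thread.
    public static Task<PaintedBitmap> PaintAsync(IEnumerable<FillRect> commands)
    {
        var snapshot = new List<FillRect>(commands);
        return Task.Run(() => Paint(snapshot));
    }

    private static PaintedBitmap Paint(List<FillRect> commands)
    {
        if (commands.Count == 0)
            return new PaintedBitmap(0, 0, 0, 0, Array.Empty<uint>());

        // 1. Work out how big the bitmap needs to be.
        int left = int.MaxValue, top = int.MaxValue, right = int.MinValue, bottom = int.MinValue;
        foreach (var c in commands)
        {
            left = Math.Min(left, c.X);
            top = Math.Min(top, c.Y);
            right = Math.Max(right, c.X + c.Width);
            bottom = Math.Max(bottom, c.Y + c.Height);
        }
        int width = right - left, height = bottom - top;

        // 2. Allocate memory for it.
        var pixels = new uint[width * height];

        // 3. Execute each command, shifted into bitmap coordinates.
        foreach (var c in commands)
            for (int y = c.Y - top; y < c.Y - top + c.Height; y++)
                for (int x = c.X - left; x < c.X - left + c.Width; x++)
                    pixels[y * width + x] = c.Colour;

        return new PaintedBitmap(left, top, width, height, pixels);
    }
}
```

On the main thread you would typically check `task.IsCompleted` once per frame and keep drawing the previous bitmap until the new one is ready, rather than blocking on the result.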
Before asking the Painter to draw something, we need to figure out the size of the final bitmap. We'll do this by looping over each command and combining their bounding boxes together:
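A minimal sketch of that loop, assuming each command can report a screen-space rectangle. The `Rect` type is hypothetical, and the three rectangles are made-up positions for the ammo panel, chosen only so that the result matches the 122x54 size used in this post.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical screen-space rectangle.
public readonly struct Rect
{
    public readonly int X, Y, Width, Height;

    public Rect(int x, int y, int width, int height)
    {
        X = x; Y = y; Width = width; Height = height;
    }

    public int Right => X + Width;
    public int Bottom => Y + Height;

    // Smallest rectangle that contains both a and b.
    public static Rect Union(Rect a, Rect b)
    {
        int left = Math.Min(a.X, b.X);
        int top = Math.Min(a.Y, b.Y);
        int right = Math.Max(a.Right, b.Right);
        int bottom = Math.Max(a.Bottom, b.Bottom);
        return new Rect(left, top, right - left, bottom - top);
    }
}

public static class BoundsDemo
{
    public static Rect CombinedBounds(IReadOnlyList<Rect> commandBounds)
    {
        if (commandBounds.Count == 0) return new Rect(0, 0, 0, 0);

        Rect total = commandBounds[0];
        for (int i = 1; i < commandBounds.Count; i++)
            total = Rect.Union(total, commandBounds[i]);
        return total;
    }

    // Made-up layout for the ammo panel in the bottom right of a 1920x1080 screen.
    public static Rect AmmoPanelBounds() => CombinedBounds(new[]
    {
        new Rect(1790, 1020, 48, 48),  // weapon icon
        new Rect(1846, 1032, 60, 24),  // ammo text
        new Rect(1784, 1017, 122, 54), // rectangle border
    });
}
```

The combined rectangle gives both the size of the bitmap to allocate and the screen position at which the finished bitmap should be drawn.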

Now we'll ask the Painter to allocate a bitmap that's 122px wide and 54px tall, and paint our list of commands to it.
Coordinate System
To paint a command, its position first needs to be converted from screen coordinates to bitmap coordinates. For example, if we have a very basic bitmap that's just a green rectangle, we'd store a command with data:
Position: 275x, 80y
Size: 50 wide, 50 tall
Colour: Green

Since our bitmap is 50x50px in size, if we paint a rectangle within the bitmap at position (275, 80), we'll be drawing way outside the bitmap. What we should do is paint the rectangle at position (0, 0) within the bitmap, and then draw the final bitmap at (275, 80) on the screen.
Each Paint command will perform this conversion before painting, e.g. in PaintCommandBitmap the position is converted to a bitmap-relative netPosition:
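The conversion itself is a single subtraction. As a small worked example (the `Coordinates` helper is a hypothetical name, not part of the original code):

```csharp
public static class Coordinates
{
    // Converts a screen-space position into a position relative to the
    // top-left corner of the bitmap, given where that corner sits on screen.
    public static (int X, int Y) ToBitmapSpace(
        int screenX, int screenY, int bitmapScreenX, int bitmapScreenY)
        => (screenX - bitmapScreenX, screenY - bitmapScreenY);
}
```

For the green rectangle, the bitmap's top-left corner is at (275, 80) on screen, so `ToBitmapSpace(275, 80, 275, 80)` gives (0, 0). A second element placed at (285, 95) on screen in the same bitmap would be painted at (10, 15).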

GPU Optimisations
All CPU optimisations are now applied, and it's time to move on to the GPU. There isn't as much code here, but the concepts are important to understand.
Let's start with the reason why texture transfers cause stutters.
Let's say our game is running at 60 FPS, which means the GPU takes roughly 16ms to render one frame. Each frame consists of many OpenGL calls, mainly these 4 over and over again:
Bind shader
Bind texture
Bind buffer
Render buffer
If the CPU only takes 3ms to enqueue these commands, the GPU will get further and further behind the CPU:

From my research, NVIDIA allows their GPUs to be at most 3 frames behind the CPU, and AMD allows their GPUs to be at most 5 frames behind. In the above scenario, the GPU already has 3 frames of work to process, so NVIDIA will make the CPU wait for the GPU to finish rendering Frame 1. This causes a stall on the CPU:
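The arithmetic behind that stall can be sketched with a very simplified model: the CPU may not start a frame until the GPU has finished the frame that was queued `maxFramesAhead` frames earlier. The model and its names are assumptions for illustration; real drivers are more complicated than this.

```csharp
using System;

public static class FrameQueueModel
{
    // Returns, for each frame, how long (in ms) the CPU has to wait before it may start.
    public static double[] CpuStalls(int frames, double cpuMs, double gpuMs, int maxFramesAhead)
    {
        var stalls = new double[frames];
        var gpuEnd = new double[frames];
        double cpuFree = 0; // time at which the CPU finished enqueuing the previous frame

        for (int i = 0; i < frames; i++)
        {
            // The driver blocks the CPU until the GPU has finished frame (i - maxFramesAhead).
            double allowed = i >= maxFramesAhead ? gpuEnd[i - maxFramesAhead] : 0;
            double cpuStart = Math.Max(cpuFree, allowed);
            stalls[i] = cpuStart - cpuFree;
            cpuFree = cpuStart + cpuMs;

            // The GPU starts a frame once it is fully enqueued and the previous frame is done.
            double gpuStart = Math.Max(cpuFree, i > 0 ? gpuEnd[i - 1] : 0);
            gpuEnd[i] = gpuStart + gpuMs;
        }
        return stalls;
    }
}
```

With 3ms of CPU work, 16ms of GPU work and a limit of 3 frames, `CpuStalls(6, 3, 16, 3)` gives stalls of 0, 0, 0, 10, 13 and 13ms. The CPU runs freely for three frames, waits for Frame 1 to finish before starting Frame 4, and from then on is paced by the GPU.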

These stalls can be mitigated by either:
Optimising rendering / shaders / etc to reduce the GPU frame time
Moving work from background threads onto the main thread, to increase the CPU frame time
Moving all heavy lifting to background CPU threads, so it doesn't matter if the main thread stalls
Did You Know - in the diagram above, the gap between when the CPU issues Frame 4 and when the GPU finishes processing Frame 4 is the cause of Input Lag. There's a 30ms delay between when the CPU processes your mouse movements, and when they are reflected on your screen
CPUs can also cause a stutter by updating a texture that's currently in use. Let's say the CPU has enqueued 2 frames of commands, and then during the 3rd frame it updates a texture.

If this texture was used during frame 1 or 2, the OpenGL Driver will say "Hey, the GPU hasn't finished rendering that texture yet. You'll have to wait". This causes the CPU to stall until that texture is no longer being rendered:

Now the CPU can copy the new UI bitmap data to the texture. This will cause a stall on the GPU, because the GPU can't start rendering the texture until the memory transfer completes:

The player will only perceive the 5ms stutter on the GPU timeline, but we've still lost valuable time on the CPU.
Solving the Stutters
The workaround to this is to create another texture that we'll store the new bitmap data in, rather than attempting to update a texture that's being used by the GPU. This removes all stalls on the CPU, but the GPU is still slowed down by the time it takes to transfer the texture from the CPU to the GPU (orange Stall bar in the diagram above). The CPU is also transferring the texture data on its main thread, which slows the CPU down.
To solve this, we'll use multithreading on both the CPU and GPU to transfer the texture asynchronously, in the background. OpenGL has excellent support for updating buffers on background threads (Persistent-Mapped Buffers), but it doesn't have the same functionality for textures.
To work around this, OpenGL has a special kind of buffer called a Pixel Buffer Object (PBO). This buffer is special because its contents can be copied directly into a texture. This means we can:
create a PBO on the main thread
map the PBO on the main thread
copy bitmap data to the mapped buffer on a background thread
flush the buffer on the main thread
The expensive memory copy is now performed on a background thread, and when the flush completes our bitmap data is now stored on the PBO on the GPU!
The last step is to copy the data from the PBO to the texture. This is much faster than copying from our CPU to the texture, and since both PBO and texture live on the GPU, it won't stall the CPU or the GPU.
However, when copying from a PBO to a texture, NVIDIA GPUs will throw this warning:
Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering.
Although GPUs have thousands of cores, they can only process one OpenGL command at a time. The GPU will stop rendering, copy the texture data, and then resume rendering. This means our texture transfer is slowing the game down.

NVIDIA solved this by adding Copy Engines to their GPUs, which can copy texture data around at the same time that the Compute Engine performs rendering. AMD has a similar feature, but I'm not sure what it's called.

Copy Engines
To utilise these copy engines, we need to understand how OpenGL contexts work.
If you're running two games at once on your computer, they will each have their own OpenGL context. By default, each context stores its own textures, buffers, shaders, framebuffers, and has its own queue of frames to process. This means both games are kept completely separate, and prevents Game A from rendering a texture that's owned by Game B.
Because the OpenGL Driver knows that the two games don't share resources, it allows Game A to transfer textures while Game B is rendering.
To utilise this functionality within one game, we can create two OpenGL contexts when the game starts up. The first OpenGL context will only be used on the main thread and performs all our typical OpenGL calls. The second OpenGL context will only be used on a background thread and will manage all texture copies.
Since both contexts were created within the same process, OpenGL allows us to share some objects - e.g. buffers and textures - between the two.
All of our code stays the same, except the glTexSubImage2D call to copy data from the PBO to the texture will be executed on the 2nd OpenGL context. Since this context runs on a background thread, its glTexSubImage2D calls will be processed by NVIDIA's Copy Engine. The 1st context will continue to render commands on the main thread unimpeded.
However, since we are sharing textures across multiple threads, the main thread needs to know when the texture transfer has completed and it's safe to render the texture. Otherwise, the main thread will attempt to render the texture while it's half-updated, which may display a corrupt texture on your screen or cause the GPU to stall while it waits for the transfer to finish.
To do this, we'll create a fence using glFenceSync after the glTexSubImage2D call on the background thread. The background thread will poll the status of this fence, and when it has signaled it will tell the main thread that it's safe to render the texture.
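The CPU side of that hand-off can be sketched without any OpenGL calls. In the sketch below, `fenceSignaled` is a stand-in for a real fence query (for example `glClientWaitSync` with a zero timeout), and `SharedTexture` and `TransferThread` are hypothetical names; only the polling and the thread-safe "ready" flag are shown.

```csharp
using System;
using System.Threading;

// Tracks whether a texture shared between the two contexts is safe to render.
public sealed class SharedTexture
{
    private int ready; // 0 = transfer in flight, 1 = safe to render

    // Read by the main thread each frame before it draws with this texture.
    public bool IsReady => Volatile.Read(ref ready) == 1;

    // Called by the background thread just before it starts a new transfer.
    public void MarkInFlight() => Volatile.Write(ref ready, 0);

    // Called by the background thread once the fence has signalled.
    public void MarkReady() => Volatile.Write(ref ready, 1);
}

public static class TransferThread
{
    // Polls the fence on the background thread, then publishes the result.
    public static void WaitForFence(SharedTexture texture, Func<bool> fenceSignaled)
    {
        while (!fenceSignaled())
            Thread.Yield(); // give other threads a chance to run between polls

        texture.MarkReady();
    }
}
```

While `IsReady` is false, the main thread can keep drawing the previous texture for that UI element, so neither thread ever has to block on the other.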
Code
In a few hours I'm flying to Japan for a holiday. When I'm back, I will create a public GitHub repo containing the code for the CPU and GPU optimisations covered in this post.
If you are a paid Patreon member, you can access this code on the lagcomp branch on the ve repository. I still have much to improve and clean, but the main files are:
BitmapPainter.cs contains the multithreaded painting code
VECX.cs is the base class for all UI elements
The BackgroundOpenGLThread function in ClientOpenGL.cs performs the texture copies. It's only 25 lines of code!
TexLayer.cs manages the pooling of textures and PBOs