vercidium

Raytraced Audio - No More Voxels

Added 2025-11-05 06:24:12 +0000 UTC

Since I posted my Raytraced Audio video back in April, nearly everything has changed:

no more voxels - rays now collide against 3D primitives (meshes, spheres, cubes, etc)
no more basic reverb blending - reverb is now calculated using echograms, materials and energy-based models
basic occlusion and permeation system replaced with energy-based models and air absorption
materials are now supported (scattering, absorption, transmission)
reduced latency when playing new sounds
zero memory allocations at runtime

The video below showcases a few of these features. I plan to add more wildlife and sound effects to this forest sandbox and release it as a demo for you to try yourself. Stay tuned!

In this post we'll talk about the first major change: voxels.

No More Voxels

The original raytraced audio system was built for Sector's Edge, which had a voxel environment. Casting rays through voxel worlds is incredibly fast, and meant thousands of rays could be cast in real time, even on older laptops.

So when I extracted the raytracing system out of Sector's Edge, it made sense to keep all the original voxel code. It already ran great; all I needed to do was convert non-voxel worlds to voxels.

I spent a couple of months on this and created a system that could convert meshes and low poly shapes into voxels in real time on background threads. But in the back of my mind I kept asking "how will this work for large outdoor environments?".

There were two issues with the voxel approach:

Memory usage - the larger the world, the more voxels need to be allocated
Diagonal and curved surfaces become jagged when converted to voxels, meaning rays won't reflect correctly

So I started creating a new system that casts rays against low poly 3D primitives (triangles, spheres, prisms, etc) without converting them to voxels. This has a few benefits:

Worlds can be any size. Since ray intersections are calculated using maths rather than stepping through a voxel grid, the distance between objects doesn't affect performance
Rays always reflect at the correct angle off curved and angled surfaces
Memory usage is negligible in comparison - no need to allocate millions of voxels

At first this new system was about 6x slower than voxel raymarching, but after applying the optimisations below, it ended up 8x faster.

Optimisation 1: Bounding Volume Hierarchies

Originally, when casting a ray I was checking if it intersected with every object in the world, which becomes very slow when more objects are added. To alleviate this, I organised the objects into a Bounding Volume Hierarchy (BVH).

This data structure groups objects by their 3D position, so that a ray is only tested against objects near its path, rather than every object in the scene.

There are many articles that explain BVHs already, so the quick explanation is that close-by objects are grouped together within a box. If a ray doesn't intersect with the box, it's guaranteed not to intersect with any object inside it. This reduces the number of objects a ray needs to be tested against.
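To make the box test concrete, here is a minimal Python sketch of a ray-versus-box "slab" test and a recursive walk over a tiny hand-built hierarchy. The Node class, the function names and the example scene are all hypothetical and chosen for illustration; a real BVH would also be built automatically and would run exact intersection tests on the objects it returns.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                      # minimum corner of the bounding box
    hi: tuple                                      # maximum corner of the bounding box
    children: list = field(default_factory=list)   # child nodes (empty for a leaf)
    objects: list = field(default_factory=list)    # object names stored in this node

def ray_hits_box(origin, direction, lo, hi):
    """Slab test: True if the ray passes through the axis-aligned box."""
    t_near, t_far = 0.0, float("inf")
    for o, d, a, b in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < a or o > b:       # parallel to this slab and outside it
                return False
            continue
        t0, t1 = (a - o) / d, (b - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:           # the slabs do not overlap along the ray
            return False
    return True

def candidates(node, origin, direction):
    """Collect objects whose enclosing boxes the ray passes through."""
    if not ray_hits_box(origin, direction, node.lo, node.hi):
        return []                    # the whole subtree is skipped
    found = list(node.objects)
    for child in node.children:
        found += candidates(child, origin, direction)
    return found

# Hypothetical scene: two leaves far apart, grouped under one root box.
left = Node((0, 0, 0), (1, 1, 1), objects=["rock"])
right = Node((10, 0, 0), (11, 1, 1), objects=["tree"])
root = Node((0, 0, 0), (11, 1, 1), children=[left, right])

print(candidates(root, (0.5, 0.5, -5.0), (0.0, 0.0, 1.0)))  # ['rock']
```

The ray in the last line only passes through the left box, so the tree is never tested.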

For most scenes, performance was on par with voxel raymarching. But when objects were more spread out, the voxel approach was still about 3x faster.

Optimisation 2: Trails

This next optimisation completely left voxel raymarching in the dust. It significantly reduces the number of rays that need to be cast, and can also scale onto compute shaders.

Let's say our world has 30 sounds and 1 listener, and we're casting 1000 rays with 8 bounces each. This means we need to cast 248,000 rays total, to calculate reverb properties and figure out how occluded each sound should be. This is far too many rays for the average CPU to cast in real time.
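As a rough check on that number, the sketch below assumes one line-of-sight ray from every bounce point to the listener and to each sound. That breakdown is an assumption made for illustration; it ignores the bounce rays themselves and the permeation rays, which would push the total higher.

```python
rays = 1000
bounces = 8
voices = 30
listeners = 1

# Every ray produces one bounce point per bounce.
bounce_points = rays * bounces

# Assumed: each bounce point fires one ray at the listener and one at each voice.
los_rays = bounce_points * (voices + listeners)

print(bounce_points, los_rays)  # 8000 248000
```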

This new trail-based system reduces the number of rays that need to be cast each frame by 85% by:

reusing the results of past raytracing
creating dynamic BVHs for even faster ray-intersection tests
performing early plane-side tests to skip BVH checks entirely
breaking work into smaller stages that can be easily parallelised

With all optimisations applied, the new raytracer is about 8x faster than voxel raymarching.

Let's dive into how it works. Each time a ray bounces, it checks for line-of-sight to the listener and each voice. It also casts a permeation ray to each voice on the 1st bounce only (configurable).

The naïve approach casts all of these rays every frame, but this is wasteful.

If we look at the full path the ray took, we can break it into four parts:

the ray's trail (green)
multiple line-of-sight rays to the listener (blue)
one line-of-sight ray to each voice (yellow)
one permeation ray to each voice (orange)

This green trail will form the foundation of all our raytracing data and optimisations.

The second half of this post dives into the optimisations that power this new trail-based raytracing system, and is available to Patrons only.

The addition of materials has opened up many opportunities for increasing the realism of this raytraced audio system. In the next post I'll talk about how I used materials to implement a new energy-based reverb, occlusion and permeation system.

This new system produces nearly identical results with 1024 vs 32 rays, which means the minimum required hardware can potentially be reduced significantly.

For paid Patrons - continue reading!

Trail Building

The new system creates a trail for each ray, which stores the position of every bounce, the object it bounced off, and how much energy the ray lost (depending on the object's material).
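A trail like this could be represented with two small record types. The sketch below is one possible layout in Python, not the SDK's actual data structure: the field names are hypothetical, and treating energy loss as a fraction that multiplies across bounces is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Hit:
    position: tuple      # where the ray bounced
    object_id: int       # which object it bounced off
    energy_lost: float   # assumed: fraction of energy absorbed by the material (0..1)

@dataclass
class Trail:
    origin: tuple                               # where the ray was cast from
    hits: list = field(default_factory=list)    # one Hit per bounce, in order

    def remaining_energy(self):
        """Energy left after every bounce, starting from 1.0."""
        energy = 1.0
        for hit in self.hits:
            energy *= 1.0 - hit.energy_lost
        return energy

trail = Trail(origin=(0.0, 1.7, 0.0))
trail.hits.append(Hit((4.0, 1.7, 0.0), object_id=12, energy_lost=0.5))
trail.hits.append(Hit((4.0, 0.0, 3.0), object_id=7, energy_lost=0.5))
print(trail.remaining_energy())  # 0.25
```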

Once this trail has been constructed, it can be reused until one of the following events occurs:

A - one of the objects the trail bounced off moves
B - a new object is added that intersects with any part of the trail
C - the listener moves (past a certain threshold)

Event A - Reflection Updates

Let's say the sphere on hit 6 moves up and down. The first 5 hits in the trail remain unaffected, but the rest of the trail needs to be updated:

We don't need to cast any rays to determine which objects have moved. We can simply set a 'dirty' flag on an object when it moves, and then loop over each hit in the trail to check if the object is dirty. If so, we'll 'trim' the rest of the trail, and cast rays 7 and 8 again.
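The dirty-flag check can be sketched in a few lines of Python. Here a trail is reduced to the list of objects it bounced off, and the function name and object names are hypothetical. The sketch keeps only the hits before the first dirty object, matching "the first 5 hits remain unaffected"; everything after that point would then be recast.

```python
def trim_trail(hit_objects, dirty_objects):
    """Return the leading hits that are still valid; no rays are cast here."""
    for index, hit_object in enumerate(hit_objects):
        if hit_object in dirty_objects:
            return hit_objects[:index]   # drop this hit and everything after it
    return hit_objects                   # nothing moved, the whole trail is reusable

# Hypothetical 8-bounce trail where hit 6 landed on a sphere that has since moved.
trail = ["wall", "floor", "tree", "rock", "wall", "sphere", "floor", "tree"]
kept = trim_trail(trail, {"sphere"})
print(len(kept))  # 5
```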

Event B - Intersection Updates

If a new object is added, or an existing object moves, we need to check if it intersects with any part of the trail.

We only need to check against objects that are moving - not the entire BVH. That's because the trail has already been constructed by raytracing against the entire BVH. This makes it very quick to test for new intersections.

If a new intersection occurs, we'll trim the trail from that point onwards, and then rebuild the trail. Note that rebuilding the rest of the trail requires intersecting rays against the entire BVH, because they are now reflected in different directions.
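A minimal sketch of that intersection check, assuming the moving objects are spheres (the real system supports other primitives too). Each segment of the trail is tested only against the moving spheres, and the index of the first blocked segment tells us where to trim. All names are hypothetical.

```python
import math

def segment_hits_sphere(a, b, centre, radius):
    """True if the segment from a to b passes within `radius` of `centre`."""
    ab = [q - p for p, q in zip(a, b)]
    ac = [c - p for p, c in zip(a, centre)]
    length_sq = sum(v * v for v in ab)
    if length_sq == 0.0:
        t = 0.0
    else:
        # Parameter of the closest point on the segment, clamped to [0, 1].
        t = max(0.0, min(1.0, sum(u * v for u, v in zip(ab, ac)) / length_sq))
    closest = [p + t * v for p, v in zip(a, ab)]
    return math.dist(closest, centre) <= radius

def first_blocked_segment(points, moving_spheres):
    """Index of the first trail segment hit by a moving sphere, else None."""
    for i in range(len(points) - 1):
        for centre, radius in moving_spheres:
            if segment_hits_sphere(points[i], points[i + 1], centre, radius):
                return i
    return None

# Trail points: listener position followed by two bounce positions.
points = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (10.0, 10.0, 0.0)]
print(first_blocked_segment(points, [((10.0, 5.0, 0.0), 1.0)]))  # 1
```

The cost of this check scales with the number of moving objects rather than the size of the scene, which is why it is cheap compared to a full BVH query.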

Event C - Listener Updates

If the listener moves, we need to check if we still have line-of-sight between the listener and the 1st hit in the trail. If not, or if the listener has moved a significant distance away from the original raycast position, we must recalculate the whole trail.

But if the listener has only moved a little bit, the trails are still valid and can be reused. Note that the reflection angles remain stable even though the listener is moving.
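The listener rule boils down to two conditions, sketched below in Python. The function name, the threshold value and the idea of passing the line-of-sight result in as a boolean are all assumptions for illustration; in practice the line-of-sight result would come from a ray cast to the first hit.

```python
import math

def trail_still_valid(cast_position, listener_position, threshold, has_los_to_first_hit):
    """A trail survives if the listener stayed close and can still see hit 1."""
    moved = math.dist(cast_position, listener_position)
    return has_los_to_first_hit and moved <= threshold

# Hypothetical threshold of half a unit.
print(trail_still_valid((0.0, 0.0, 0.0), (0.3, 0.0, 0.0), 0.5, True))   # True
print(trail_still_valid((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 0.5, True))   # False
```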

Voice Occlusion

Now that we have a complete trail, we can perform line-of-sight (LOS, yellow rays) checks by casting rays from each point in the trail, to each voice in the scene:

Similar to trail trimming, these LOS checks are also cached and stored in the trail. To check if these LOS checks are still valid, we'll take two different approaches.

To validate a successful LOS check, we only need to check if any new/dirty objects intersect with the yellow ray. If not, it's still guaranteed to have line-of-sight with the voice.

To validate a failed LOS check (red line), we'll keep track of the primitive that blocked our line-of-sight:

If the primitive hasn't moved, it's guaranteed to still be blocking line-of-sight
If it has moved, then we need to check if it's still blocking line-of-sight. If so, the LOS check still fails
If it has moved and is no longer blocking line-of-sight, then we need to cast a ray again against the entire BVH

If the voice itself moves:

for successful LOS checks, we need to cast a ray against the entire BVH
for failed LOS checks, we only need to check if the last blocking primitive is still blocking line-of-sight. If so, we know LOS still fails

We can perform these simple validation checks because LOS is either a pass or a fail, so we only need to check if one object is breaking line-of-sight.
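The validation rules above can be written as a small decision function. This Python sketch takes the geometric tests as precomputed booleans so it can focus on the branching; the function name, parameters and return strings are hypothetical. A real implementation would only run the cheap intersection tests when a branch actually needs them. The case where the voice has moved and the old blocker no longer blocks is not spelled out in the text, so falling back to a full recast there is an assumption.

```python
def validate_los(cached_pass, voice_moved, dirty_object_blocks,
                 blocker_moved, blocker_still_blocks):
    """Return 'pass', 'fail', or 'recast' (cast against the entire BVH)."""
    if cached_pass:
        if voice_moved:
            return "recast"              # the ray's end point changed
        # Only new/dirty objects can break a previously clear ray.
        return "fail" if dirty_object_blocks else "pass"

    # Cached result was a failure, blocked by a known primitive.
    if not voice_moved and not blocker_moved:
        return "fail"                    # nothing relevant changed
    # Something moved: one cheap test against the old blocker decides.
    return "fail" if blocker_still_blocks else "recast"

print(validate_los(cached_pass=False, voice_moved=False, dirty_object_blocks=False,
                   blocker_moved=True, blocker_still_blocks=False))  # recast
```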

Plane Checks

When a ray bounces off a surface, we can create a fake plane that aligns to the normal of the surface (blue dotted line):

Because the reflected ray will be bouncing away from this dotted line, it's guaranteed that it won't intersect with anything on the other side (blue area) of the plane. This means:

LOS is guaranteed to fail with any voices on the other side
Ray-intersection tests are guaranteed to not hit objects on the other side

This effectively cuts the scene in half on each bounce, and reduces the number of ray-intersection tests we need to perform.
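The plane-side test is a single dot product. The Python sketch below assumes the surface normal is unit length and points towards the side the ray reflects into, and it approximates each object by a bounding sphere so that objects straddling the plane are not skipped. The function name and that bounding-sphere choice are assumptions.

```python
def behind_plane(centre, hit_position, surface_normal, radius=0.0):
    """True if a sphere (or a point, when radius is 0) lies fully behind the plane."""
    offset = [c - h for c, h in zip(centre, hit_position)]
    signed_distance = sum(o * n for o, n in zip(offset, surface_normal))
    return signed_distance < -radius

# Bounce off a floor at the origin, with the normal pointing up.
hit, normal = (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
print(behind_plane((3.0, -2.0, 0.0), hit, normal))        # True: voice below the floor
print(behind_plane((3.0, 2.0, 0.0), hit, normal))         # False: voice above the floor
print(behind_plane((3.0, -0.5, 0.0), hit, normal, 1.0))   # False: object straddles the plane
```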

LOS Compute

This raytracing system is now split into two stages: trail building, and line-of-sight checks. Trail building contains the complex bouncing logic and will likely always run on the CPU.

The LOS checks, on the other hand, can theoretically be organised into an array of start + end positions and loaded onto a compute shader. Then all of these LOS rays can be fired on the GPU in parallel.
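One way to picture that array is a flat buffer of 32-bit floats with six values per ray. The Python sketch below only shows the packing step; the layout and the function name are assumptions, and uploading the buffer to a GPU is not shown.

```python
from array import array

def pack_los_rays(trail_points, voice_positions):
    """Flatten every (trail point, voice) pair into [sx, sy, sz, ex, ey, ez, ...]."""
    buffer = array("f")                  # 32-bit floats, tightly packed
    for start in trail_points:
        for end in voice_positions:
            buffer.extend(start)
            buffer.extend(end)
    return buffer

trail_points = [(0.0, 1.0, 0.0), (4.0, 1.0, 0.0)]
voices = [(5.0, 1.0, 0.0), (0.0, 1.0, 8.0), (2.0, 3.0, 2.0)]
rays = pack_los_rays(trail_points, voices)
print(len(rays) // 6)  # 6
```

Because every ray is independent of the others, a buffer like this maps naturally onto one GPU thread per ray.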

This feature won't be in the first C# SDK release, but I'm planning to work on it soon. I'm also unsure if the latency between the CPU and GPU will affect this; it may be faster to keep it on the CPU.

Reduced Playback Latency

All trails are stored in memory and refreshed in real-time to ensure they are always up to date. This means when a new sound is played, we can instantly perform LOS checks from every position in every trail. This reduces the amount of work needed to initialise new sounds, and reduces the delay before they can play.

Summary

When all of the above optimisations are combined, a huge number of rays can be cached. In the forest demo at the top of this post, it used to take 3.5ms to cast all rays with voxel raymarching.

With the new system, when the camera is moving around, 85% of rays are cached and it takes 0.4ms to refresh all trails + LOS checks. When the camera is still, 100% of rays are cached and it takes 0.06ms to refresh. This means the new system is:

8x faster when the camera moves around
58x faster when the camera is still
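These two figures follow directly from the timings quoted above, rounded down to whole numbers:

```python
voxel_ms = 3.5    # voxel raymarching, from the forest demo
moving_ms = 0.4   # new system, camera moving
still_ms = 0.06   # new system, camera still

moving_speedup = voxel_ms / moving_ms   # roughly 8.75
still_speedup = voxel_ms / still_ms     # roughly 58.3

print(int(moving_speedup), int(still_speedup))  # 8 58
```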

This new system produces nearly identical results with 1024 vs 32 trails, which means the minimum required hardware can potentially be reduced significantly.

Thanks for reading!