The xpans Ecosystem
xpans is an open ecosystem for source-based spatial audio technologies. Audio sources have 3D positions in virtual space, and can have several other spatial properties.
Philosophy
Unassuming
xpans aims to make as few assumptions as possible about both creators and listeners.
As an ecosystem, xpans does not assume where the listener may be positioned within a scene. While the ecosystem at its core doesn’t make this assumption, directional rendering modes (e.g. directional stereo, surround sound) do render audio sources relative to a central listening position. Keep in mind that different rendering modes have their own constraints, as they are designed for different listening configurations.
Objective
xpans’ rendering modes are intended to do just enough to create an immersive experience, remaining simple and predictable.
This principle ensures creators have control over how their mix sounds and can more easily predict how their mix would sound in a rendering mode they haven’t monitored. Rendering modes can filter and delay audio signals just enough to provide a convincing spatial impression, being sure not to make artistic choices on behalf of the creator.
Open
While describing xpans as an ‘open’ ecosystem can refer to it being open-source, it mainly refers to its focus on interoperability, extendability, and the ability for anyone to implement it.
There isn’t a strong concept of ‘official’ technologies within the ecosystem. If a technology shares the same concept of an audio source (or even a subset of it) and interoperates well with other technologies within the ecosystem, then it’s probably safe to say it’s part of the xpans Ecosystem in one way or another.
Audio Sources
An audio source is a combination of one audio channel and spatial metadata that describes the spatial properties of the audio channel. In other ecosystems, this concept is known as Object-based audio.
Properties
Position
Each audio source has a three-dimensional position within virtual space. This is typically represented using X, Y, and Z coordinates, although other coordinate systems can be used.
Cartesian orientation
Currently, our official software uses left-handed, z-up coordinates.
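As a sketch, a position in this convention might be represented as follows. The axis meanings in the comments (x right, y forward, z up) and the conversion helper are assumptions for illustration; only ‘left-handed, z-up’ is stated above.

```python
from dataclasses import dataclass

@dataclass
class Position:
    # Left-handed, z-up. The per-axis meanings below are an assumption
    # for illustration; only "left-handed, z-up" is specified above.
    x: float  # right
    y: float  # forward
    z: float  # up

def from_right_handed_y_up(x, y, z):
    # One possible conversion from a right-handed, y-up space (common in
    # graphics APIs): swapping two axes also flips handedness.
    return Position(x=x, y=z, z=y)
```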
Extent
An audio source may have an extent, which can be thought of as its size. This determines the space an audio source will occupy in a scene.
Currently, extent is only scarcely (and in some places, incorrectly) implemented in the ecosystem. More work needs to be done to find suitable algorithms for correctly solving extent’s effect in each rendering mode, especially directional rendering modes like directional stereo and headphones.
Shape
Each audio source with extent greater than zero along more than one axis must have an explicitly and unambiguously defined shape. An audio source with extent but without shape is invalid within the xpans Ecosystem.
Currently, only cuboidal shapes are implemented as we have yet to decide how to represent and render other shapes.
Rotation
Audio sources with extent and shape may also have a rotation. This determines how a source’s extent is rotated, with the source’s position as the origin point.
Currently, rotation is completely unimplemented. This is because it depends on extent to have any effect, and extent still has many unsolved implementation challenges.
Audio sources in motion
Just like audio data, spatial data is sampled at a particular rate. xpans does not have a concept of a separate spatial sampling rate. Instead, spatial samples can exist at the same time as any audio sample. This allows for extreme control and precision in how a mix is rendered. However, it can mean a large amount of spatial data for scenes with lots of frequently moving audio sources.
While the xpans ecosystem may not enforce a separate spatial sampling rate, certain spatial codecs, formats, or user configurations may enforce a spatial rate limit to decrease file size and/or increase performance. In such cases, spatial samples and/or renderer interpretations may be interpolated to prevent unwanted artifacts.
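To illustrate how sparse spatial samples can be interpolated, here is a minimal sketch that linearly interpolates a source’s position between the two surrounding spatial samples. Linear interpolation and the function names are assumptions; xpans leaves the interpolation method to the codec or renderer.

```python
def lerp(a, b, t):
    # Linear interpolation between scalars a and b at fraction t in [0, 1].
    return a + (b - a) * t

def position_at(sample_index, keyframes):
    """keyframes: sorted list of (audio_sample_index, (x, y, z)) pairs.
    Returns the position at sample_index, linearly interpolating between
    the surrounding spatial samples, clamping at either end.
    (A hypothetical helper; not part of any xpans interface.)"""
    if sample_index <= keyframes[0][0]:
        return keyframes[0][1]
    for (i0, p0), (i1, p1) in zip(keyframes, keyframes[1:]):
        if i0 <= sample_index <= i1:
            t = (sample_index - i0) / (i1 - i0)
            return tuple(lerp(a, b, t) for a, b in zip(p0, p1))
    return keyframes[-1][1]
```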
Spatial Scenes
A spatial scene is simply a collection of audio sources.
xpans does not set a limit on the number of audio sources in a scene; however, the format you are using to store and/or deliver your spatial content may impose one.
Roadmap
This is a rough, non-exhaustive roadmap for the xpans Ecosystem.
This roadmap may not cover an individual application’s planned features. This is a generic roadmap that covers the intended future of widely shared functionality between applications in the broader ecosystem.
This roadmap is split into several sections, NOT stages. Sections indicate how essential, or “core”, a task is to the ecosystem, NOT when it should be completed in a linear timeline (although some tasks may depend on others).
Foundational
- Implement proper extent rendering in available rendering modes.
- Add rotation to XSR, SPE, and other xpans interfaces.
- Implement rotation rendering in available rendering modes.
- Decide on shape representation(s), and implement in interfaces.
- Implement all available shape representations in all available rendering modes.
Critical
- Surround-sound rendering
- Spectral filtering (HRTF) in headphone render mode
- Positional and orientational headtracking for headphone listening
- Standard protocol for headtracking data
Generally important
- Losslessly compressed spatial codec
- Positional rendering mode(s) (Distance Based Amplitude Panning?)
- Spatially-aware audio effects for production (e.g. reverb, chorus/flanger)
Important for some applications
- Screen-related audio sources (for audio-visual media)
Planned
- Standalone SPE protocol
Rendering
Rendering can be thought of as having two stages: Source Interpretation and Sample Processing.
Source Interpretation is when an audio source’s spatial data is transformed into values that the Sample Processing stage will use to modify and output audio samples. The result of Interpretation is also called an interpretation. Interpretations vary in the type of data they hold, as they are specific to the target rendering format.
Interpretation does not interact with audio data, and Sample Processing does not directly interact with spatial data. Interpretations are intermediate data that sit between Interpretation and Sample Processing.
This separation of stages allows Interpretation to happen only when necessary. If an audio source’s properties are unchanged, it does not need to be re-interpreted (except in cases where the scene transforms relative to the listener, i.e. headtracking).
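The two-stage split can be sketched as follows. The `interpret_stereo` function and its property names are hypothetical (not xpans’ actual interpretation algorithm); the point is that interpretations are cached and only recomputed when a source’s properties change.

```python
import math

def interpret_stereo(props):
    # Hypothetical interpreter: derives constant-power stereo gains from
    # a normalized "pan" property. An interpretation here is a gain pair.
    theta = props["pan"] * math.pi / 2
    return (math.cos(theta), math.sin(theta))

class Renderer:
    """Caches interpretations so Source Interpretation only runs when a
    source's spatial properties change."""

    def __init__(self, interpret):
        self._interpret = interpret
        self._cache = {}  # source_id -> (props, interpretation)

    def interpretation_for(self, source_id, props):
        cached = self._cache.get(source_id)
        if cached is not None and cached[0] == props:
            return cached[1]  # properties unchanged: reuse interpretation
        interpretation = self._interpret(props)
        self._cache[source_id] = (dict(props), interpretation)
        return interpretation

    def process_sample(self, source_id, props, sample):
        # Sample Processing: apply the interpretation to one audio sample.
        # Note it never touches spatial data directly.
        gain_l, gain_r = self.interpretation_for(source_id, props)
        return (sample * gain_l, sample * gain_r)
```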
Stereo
Stereo Mode
Positional
Positional stereo is intended for stereo speaker setups rather than headphone listening.
Directional
Directional stereo might be more immersive for headphone listeners.
Pan laws
The following pan laws are implemented:
- -3dB with sine taper
- -3dB with square root taper
- -6dB with linear taper
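One common formulation of these pan laws, with `pan` ranging from 0 (full left) to 1 (full right), might look like the following. These are textbook tapers, not necessarily the exact code used by xpans’ renderers.

```python
import math

def pan_sine(pan):
    # -3 dB with sine taper: equal gains of ~0.707 (-3 dB) at center.
    theta = pan * math.pi / 2
    return math.cos(theta), math.sin(theta)

def pan_sqrt(pan):
    # -3 dB with square root taper: also ~0.707 per channel at center.
    return math.sqrt(1.0 - pan), math.sqrt(pan)

def pan_linear(pan):
    # -6 dB with linear taper: equal gains of 0.5 (-6 dB) at center.
    return 1.0 - pan, pan
```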
Headphones
Maximum ITD
The maximum amount of time an audio source will be delayed in the ear farthest from the source. The ear closest to the source always experiences zero delay.
Maximum ITD is typically represented in nanoseconds or microseconds. It should never exceed 1 millisecond for practical renders.
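A sketch of how a maximum ITD parameter might translate into a per-source delay. Scaling by |sin(azimuth)| and the 660 µs example value (a typical figure for human ITD, well under the 1 millisecond ceiling) are assumptions for illustration.

```python
import math

def far_ear_delay_samples(azimuth_rad, max_itd_us, sample_rate):
    """Delay (in samples) applied to the ear farthest from the source.
    The near ear always gets zero delay. Scaling by |sin(azimuth)| is an
    assumption for illustration; azimuth 0 = front, pi/2 = fully lateral."""
    max_itd_s = max_itd_us * 1e-6
    return max_itd_s * abs(math.sin(azimuth_rad)) * sample_rate
```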
Distance
Distance cues are rendered by reducing the interaural level difference (ILD) as distance increases. This effect relies on the interaural time difference (ITD) to preserve the listener’s localization of the source; without ITD, listeners may localize the direction of audio sources incorrectly. Distance does not affect ITD.
Distance curve
Linear
Applies a linear curve to the normalized distance.
Exponential
Applies an exponential curve to the normalized distance.
Minimum distance
The distance from the listener where an audio source will sound closest.
Maximum distance
The distance from the listener where an audio source will sound farthest.
Distance effect
The distance effect parameter scales the amount that the ILD will be reduced based on distance. When distance effect is 1.0, the ILD will be reduced to zero when a source is greater than or equal to the maximum distance. When distance effect is zero, the ILD will never be reduced regardless of a source’s distance. It is not recommended to set the distance effect to a high value as it may make it hard for listeners to localize frequencies above 1,500 Hz.
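Putting the distance parameters together, a minimal sketch of how the ILD might be scaled could look like this. The exponential curve’s exponent and the function names are assumptions; only the endpoint behavior (effect of 1.0 and 0.0) follows from the text above.

```python
def normalized_distance(d, d_min, d_max):
    # Clamp d into [d_min, d_max] and normalize to [0, 1].
    if d_max <= d_min:
        return 0.0
    return min(max((d - d_min) / (d_max - d_min), 0.0), 1.0)

def linear_curve(n):
    return n

def exponential_curve(n, exponent=2.0):
    # One possible exponential-style curve; the actual shape used by
    # xpans renderers is not specified here.
    return n ** exponent

def ild_scale(d, d_min, d_max, effect, curve=linear_curve):
    """Fraction of the full ILD kept at distance d.
    effect = 1.0 -> ILD reaches zero at or beyond the maximum distance;
    effect = 0.0 -> ILD is never reduced, regardless of distance."""
    return 1.0 - effect * curve(normalized_distance(d, d_min, d_max))
```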
Pan laws
The following pan laws are implemented:
- -3dB with sine taper
- -3dB with square root taper
- -6dB with linear taper
Mono
Mono rendering ignores all audio sources’ spatial data and sums the scene’s audio channels together into one, optionally copying the resulting audio channel to several more.
Channel Count
The channel count parameter controls how many channels the mono rendering
will output. When set to 1, only one channel will be output. When set to 2
or higher, the sum of all sources’ audio signals will be output to that number
of channels with each output channel containing identical audio data.
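Mono rendering as described can be sketched in a few lines, assuming each source contributes one equal-length list of samples (names are illustrative):

```python
def render_mono(source_channels, channel_count=1):
    """source_channels: one equal-length list of samples per audio source.
    Sums all sources into a single signal, duplicated across
    channel_count identical output channels. Spatial data is ignored."""
    length = len(source_channels[0]) if source_channels else 0
    mono = [sum(ch[i] for ch in source_channels) for i in range(length)]
    return [list(mono) for _ in range(channel_count)]
```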
Creating
This section covers the basics on creating within the xpans Ecosystem.
Using the Essential Plugins
xpans distributes a suite of plugins, named the Essential Plugins. This set of plugins is meant to provide foundational utilities for creating in the xpans Ecosystem within your DAW.
Requirements
You’ll need a DAW that allows for tracks with a high channel count. For example, REAPER allows 128 channels per track.
Creating a spatial audio scene
It’s recommended to have each audio source be its own track. Each track should have the Scene Editor plugin in the FX chain, usually at the very end (unless you are using spatially-aware effects). These tracks shouldn’t route audio to the main/master track.
You’ll also need a scene bus: a track that receives all of your audio sources’ audio data via audio routing, and spatial data via MIDI routing. This track shouldn’t route audio to the main/master track.
Routing audio and spatial data
Each audio source track should route its audio to its own separate audio channel within the scene bus. All MIDI messages should exist on the same MIDI channel, but each audio source track’s MIDI must also be routed to the scene bus.
Note: Audio and MIDI routing varies across DAWs. Refer to your DAW’s documentation for more information.
In the Scene Editor plugin of each source, you’ll also need to set the Source ID to the channel number your audio source occupies in the scene bus.
Note: At the time of writing, Source IDs count from zero. The first audio source in your scene will have a Source ID of 0.
Monitoring
You’ll also need a way to listen to your scene. Create another track with the same channel count as your scene bus. Route all audio and MIDI from the scene bus to this track. Add a monitoring plugin (e.g. Headphone Monitor, Stereo Monitor, or Mono Monitor) to this track. Make sure that this track is the only track routing audio to your main/master track. Enjoy!
Exporting your scene
Add a Scene Exporter plugin onto your scene bus.
Click ‘Set Export Path’ to choose a destination for your spatial data.
Move the playhead to the start of your scene, and click ‘Set Scene Start’. Do the same for the end of your scene.
When you are ready to export, move the playhead back to the beginning of the scene, or before it. Then, click the ‘Export’ button. The next time the project is played from the start of your scene to the end, the Scene Exporter will write a xpans Spatial Record (.xsr) file to the export path you specified.
No audio data is stored in a .xsr file. It’s just spatial data. You will need to render the scene bus (NOT the master/main track) to an audio file as well.
Depending on the DAW you are using, spatial data and audio data can be exported at the same time. Scene Exporter should export your spatial data during either live playback or offline rendering in roughly the same manner.
Tips and tricks
Scene Editor supports sample-accurate automation. If your DAW is configured correctly, you can have spatial properties update as frequently as your project sample rate. Note that this can increase the size of your exported .xsr file drastically.
Best Practices
Maintaining signal separation
In xpans, each audio signal or ‘voice’ is meant to exist in only one audio source. When an audio signal is shared between two or more audio sources (known as signal sharing), it can be subject to phase interference in rendering modes that use delay to create or enhance spatial impression (e.g. headphones).
This can be unpleasant for listeners and can make your mix extraordinarily difficult to quality control. It is okay to combine multiple audio signals into one audio source, but you must ensure no audio signals are shared by two or more audio sources.
Note: If you would like to intentionally break this rule for creative experimentation, make sure you are monitoring your mix in as many formats as possible to ensure your intention is preserved.
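To see why signal sharing is risky in delay-based rendering modes, consider averaging a sine wave with a delayed copy of itself: the result is a comb filter with deep notches. A sketch of its magnitude response:

```python
import math

def comb_gain(freq_hz, delay_s):
    # Magnitude response at freq_hz of averaging a signal with a copy of
    # itself delayed by delay_s: |0.5 * (1 + e^(-j*2*pi*f*delay))|,
    # which simplifies to |cos(pi * f * delay)|.
    return abs(math.cos(math.pi * freq_hz * delay_s))
```

With a 0.5 ms difference between the two sources’ delays, a shared signal is completely cancelled at 1 kHz (and at 3 kHz, 5 kHz, and every other odd multiple), which is why each voice should live in exactly one audio source.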
Panning across audio sources
Panning across audio sources inherently involves signal sharing. It’s best to leave channel-based panning to the renderer while creating in xpans.
Using channel-based synthesizers and/or effects
Channel-based synthesizers and effects pan audio signals across multiple output channels. By converting those channels into audio sources, you will create phase interference. Always separate individual voices into their own audio sources or reach for spatially-aware audio plugins that use the xpans Spatial Property Exchange (SPE).
Spatial property exchange (SPE)
The Spatial Property Exchange is an abstract protocol for sending and receiving spatial properties across applications.
SPE enables spatially-aware audio processing. Audio effects using SPE can create more immersive output as they are aware of where their input signals are located within the scene. SPE also allows output signals to be given spatial properties.
SPE-enabled applications can be designed so that the user only needs to think of audio sources as one abstract idea instead of a pairing of audio data and spatial data.
SPE-MIDI
SPE-MIDI implements the SPE protocol using MIDI System Exclusive messages for use in hosts and plugins that don’t natively support SPE. It is currently the only way to use SPE.
SPE-MIDI messages must be manually routed by the user. While audio data requires each audio source to have its own dedicated audio channel, SPE does not require a separate MIDI channel for each audio source. Each SPE-MIDI message includes the audio channel number it refers to. This number is referred to as the ‘Source ID’. SPE-MIDI-enabled plugins will likely require you to provide source IDs.
Unfortunately, without native SPE integration or generic inter-plugin/host communication, we must tediously ensure our source IDs align with our audio channels when using SPE-MIDI.
Refer to SPE-MIDI’s source code repository for a more technical overview and a rough specification.
xpans Spatial Record (XSR)
The xpans Spatial Record format is a data structure for serializing and deserializing spatial properties. It contains the minimal amount of information for reproducing a scene.
Use case
XSR is not an ideal format for many cases. It is meant to be easily read, written, and adapted in order to support the development of xpans, especially in the ecosystem’s early stages.
Data layout
XSR begins with the sample rate of the scene.
After that, it holds an array of spatial samples. In XSR, a spatial sample only needs to be recorded if it is the first sample of the scene or if any audio source data has changed since the previous sample.
Spatial samples contain the sample number and an array of ‘events’. Each event relates to only one audio source. An event contains the Source ID the event is referring to, and the spatial properties that have changed since the last sample.
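As an in-memory sketch of this layout (not the actual on-disk encoding; type and field names are hypothetical):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Event:
    source_id: int              # which audio source this event refers to
    changed: Dict[str, object]  # only the properties changed since the
                                # previous sample, e.g. {"position": (x, y, z)}

@dataclass
class SpatialSample:
    sample_number: int          # audio sample index this applies to
    events: List[Event]

@dataclass
class Scene:
    sample_rate: int            # XSR begins with the scene's sample rate
    samples: List[SpatialSample]
```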
Refer to XSR’s source code repository for a more technical overview and a rough specification.