Dr. Henney Oh, co-founder and CEO of spatial audio specialist G’Audio Lab, talks us through the processes of capturing, mixing and rendering sound for virtual reality and 360-degree video applications.
The premise of VR and 360-degree video is to simulate an alternate reality. For this to be truly immersive, it needs convincing sound to match the visuals. Humans rely heavily on sound cues to understand our environment, which is why immersive graphics need equally immersive 3D audio that replicates the natural listening experience. The challenge becomes how to draw the viewer’s attention to a specific point when there is continuous imagery in every direction, and sound cues can help with that.
The key to creating realistic audio for this is to synchronise sounds with the user’s head orientation and view in real time. This replicates the human hearing mechanism, which makes the listening experience more convincing. Producing truly immersive sound requires several steps: first, you must capture the audio signals, then mix the signals and finally render the sound for the listener.
To replicate the natural listening experience, the use of two types of audio signal – Ambisonics and object-based – is essential.
Ambisonics is a technique that employs a spherical microphone to capture a sound field in all directions, including above and below the listener. This requires placing a soundfield microphone (also known as an Ambisonics or 360 microphone) near the position from which you intend to listen. Keep in mind that these microphones record a full sphere of sound at the position of the microphone, so be strategic about where you place them. It’s also important that your mic is not visible in the scene, so we encourage placing the microphone directly below the 360 camera.
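A soundfield microphone outputs these directional components directly, but the same first-order B-format can also be synthesised from a mono source at a known direction, which is useful for understanding what the format encodes. A minimal NumPy sketch, assuming the common AmbiX convention (ACN channel order, SN3D normalisation) – the article does not specify a convention, so this is illustrative:

```python
import numpy as np

def encode_first_order(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal into first-order Ambisonics (AmbiX: ACN/SN3D).

    Azimuth is measured anticlockwise from straight ahead, elevation upward.
    Returns a (4, N) array in ACN channel order: W, Y, Z, X.
    """
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    w = np.asarray(signal, dtype=float)   # omnidirectional component
    y = w * np.sin(az) * np.cos(el)       # left(+) / right(-) axis
    z = w * np.sin(el)                    # up / down axis
    x = w * np.cos(az) * np.cos(el)       # front / back axis
    return np.stack([w, y, z, x])
```

A source encoded straight ahead ends up entirely in the W and X components, with nothing on the left-right (Y) or vertical (Z) axes.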
In addition to capturing audio from a soundfield microphone, content creators also need to acquire sounds from each individual object as a mono source. This enables you to attach higher fidelity sounds to objects as they move through the scene for added control and flexibility. With this object-based audio technique, you can control the sound attributed to each object in the scene and adjust those sounds depending on the user’s view.
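To illustrate the metadata this involves, an object track can be thought of as a mono signal plus a (possibly time-varying) position. A minimal sketch, using inverse-distance attenuation as an assumed example of the kind of view-dependent adjustment described above – the names and the attenuation law here are illustrative, not part of any particular tool:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class AudioObject:
    """A mono source attached to an on-screen object."""
    samples: np.ndarray   # the spot-mic recording, mono
    position: tuple       # (x, y, z) in metres, listener at the origin

def distance_gain(obj, ref_distance=1.0):
    """Inverse-distance attenuation, clamped inside a reference distance."""
    d = max(float(np.linalg.norm(obj.position)), ref_distance)
    return ref_distance / d
```

Because the position lives alongside the audio rather than being baked into a loudspeaker mix, the player can recompute gains like this every frame as the user looks around.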
Capturing mono sound can also be tricky, because the traditional boom microphone does not work in VR. In a synchronised 360 recording there is nowhere to hide a boom microphone, so it is helpful to place a lavalier microphone directly on the individual, hidden underneath their apparel.
Previously, sound mixing was typically dictated by its target loudspeaker layout, but today’s object-based audio techniques allow individual objects on screen, like a dinosaur, to be independent of the reproduction layout, the user’s listening point and even the sonic space. This is possible because you can send all of the object tracks to the player side. As with traditional mixing, you might need to add extra Foley, ADR and background music tracks to complete the sonic scene.
Combining object, Ambisonics and channel signals (like traditional 2.0 if needed) and balancing them plays an important role in mixing and mastering 3D audio. If you captured the objects and the Ambisonics together, be aware that the Ambisonics signal already contains the objects. You may need an additional process to remove or balance those object tracks so they aren’t counted twice.
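Conceptually, this balancing step is a gain structure over the stems. A toy sketch, assuming the objects have already been encoded to the same (4, N) B-format as the bed; here lowering `bed_gain` stands in for the more careful removal or ducking process a real mix would use:

```python
import numpy as np

def mix_scene(ambisonic_bed, encoded_objects, bed_gain=1.0, object_gain=1.0):
    """Sum a first-order Ambisonics bed with pre-encoded object tracks.

    If the spot-mic'd objects also bleed into the soundfield recording,
    reduce bed_gain (or process the bed) so they aren't counted twice.
    All inputs are (4, N) arrays.
    """
    mix = bed_gain * np.asarray(ambisonic_bed, dtype=float)
    for track in encoded_objects:
        mix = mix + object_gain * np.asarray(track, dtype=float)
    return mix
```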
Traditionally, you only needed to synchronise your sound with your image in the time domain, which is referred to as lip-synchronisation. But with cinematic VR and 360 video, you also need to work on spatial synchronisation between the sound and the image. For example, when producing traditional cinematic audio, you only need to watch an actor’s mouth and play the sound according to its movements.
With VR and 360 video content, you not only need to consider the actor’s mouth movements but also carefully place the sound according to the position of the actor on the 360 screen, which requires a new and more dedicated sound mastering tool. Specifically, it’s now important to use a tool that lets you edit as you watch, so that while watching the visuals, you can match the sounds accordingly in both space and time.
There are many special processes needed on top of the conventional mixing workflow, requiring a dedicated authoring tool to work properly and conveniently.
Historically, content creators relied on DAWs for everything from mixing to mastering into a target layout, so the output of a DAW was a pre-rendered sound bed. However, with VR, sound rendering must take place on the listener’s end, which in this case is the actual VR hardware – most frequently a head-mounted display (HMD). All of the possible scenarios have to be processed on the HMD itself, which can require a huge amount of additional processing power. As such, the key is to minimise latency and the amount of computation needed for rendering while still maintaining high quality.
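The core per-frame operation such a renderer performs is counter-rotating the sound field by the listener’s head orientation, so the scene stays anchored to the world as the head turns. A yaw-only sketch for a first-order AmbiX signal (a full renderer would also handle pitch and roll, then render the result binaurally):

```python
import numpy as np

def rotate_yaw(bformat, head_yaw_deg):
    """Counter-rotate first-order AmbiX (W, Y, Z, X) by the head's yaw.

    Positive yaw = head turning left, so a source that was straight ahead
    should then appear to the listener's right. W and Z are unaffected by
    rotation about the vertical axis.
    """
    a = np.radians(head_yaw_deg)
    w, y, z, x = bformat
    x_rot = x * np.cos(a) + y * np.sin(a)
    y_rot = y * np.cos(a) - x * np.sin(a)
    return np.stack([w, y_rot, z, x_rot])
```

Because this is just a small matrix multiply per block of samples, the rotation itself is cheap; the expensive part of on-device rendering is the binaural filtering that follows it.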
A benefit of the renderer being on the listener’s end is the possibility of unprecedented levels of personalisation. Keep in mind that with a conventional pre-rendered bed, you can’t vary the rendering for each user. However, personalisation is still a long way off, as measuring an individual’s personalised HRTF remains an expensive and time-consuming process.
In addition to the special capturing and mixing techniques, we believe high-quality rendering is the most crucial enabler for completing the VR audio experience. One way to improve this experience is to use the same binaural rendering engine in the DAW and on the player side, which requires a type of end-to-end solution like the one we’ve developed at G’Audio Lab. Our Sol player for cinematic VR and 360 video allows for real-time rendering by reflecting the HMD user’s head orientation and interactive motions, with real-time calculation of relative sound source positions. Sol leverages G’Audio’s binaural rendering technology, which was adopted into the Moving Picture Experts Group’s next-generation international standard, MPEG-H 3D Audio, for requiring minimal processing power while delivering the best possible audio quality. VR content can therefore be played as intended without being degraded by hardware limitations.
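For illustration only – this is not Sol’s algorithm – the simplest possible “decode” of a first-order scene to two ears uses a pair of virtual cardioid microphones facing left and right. A real binaural renderer instead filters virtual-loudspeaker feeds through HRTFs to recreate the timing and spectral cues of human hearing, which this crude level-only sketch deliberately omits:

```python
import numpy as np

def crude_stereo_decode(bformat):
    """Decode first-order AmbiX (W, Y, Z, X) with two virtual cardioids
    aimed at +90 and -90 degrees azimuth.

    Purely illustrative: no HRTF filtering, so there are no true binaural
    cues here, only left-right level differences.
    """
    w, y, _, _ = bformat
    left = 0.5 * (w + y)
    right = 0.5 * (w - y)
    return left, right
```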
When compared to the solutions available for creating VR video, tools for producing truly immersive sound still have some catching up to do. However, there’s been an overarching shift in the industry to focus on audio, and I’m confident we’ll see huge strides made in the months to come.
Dr. Henney Oh is co-founder and CEO of G’Audio Lab, a spatial audio company dedicated to developing immersive and extensive interactive 3D audio production software solutions for creative professionals. www.gaudiolab.com