Binaural hearing (hearing with two ears and the ability to locate sound sources) is essential to human perception. Virtual and augmented reality systems use binaural audio to immerse users in the environment: the appropriate acoustic stimuli can be rendered according to the user's location and head orientation. However, portable devices such as phones typically record only mono audio.
A recent study proposes using data from LiDAR and RGB-depth cameras for binauralization. A multimodal neural network synthesizes a binaural version of mono audio from the 3D point cloud of the sound sources in the scene. Unlike previous methods, it operates in the waveform domain, which allows the model to be trained end-to-end. Quantitative and perceptual evaluations show the effectiveness of the proposed model and point to potential AR/VR applications.
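To make the end-to-end claim concrete, the minimal sketch below shows a training step for a waveform-domain model of this kind: because the network outputs raw waveforms, the loss can be computed directly on the binaural signal and gradients reach every layer, with no separate spectrogram-inversion stage. The `train_step` function, the tensor shapes, and the L1 waveform loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, mono, points, binaural_target):
    """One end-to-end training step (sketch, assumed shapes).

    mono:            (B, 1, T)  single-channel waveform
    points:          (B, N, 6)  point cloud, xyz + rgb per point
    binaural_target: (B, 2, T)  ground-truth two-channel waveform
    """
    optimizer.zero_grad()
    binaural_pred = model(mono, points)          # (B, 2, T) raw waveform output
    # Loss is taken directly on waveforms, so the whole model is trained
    # end-to-end without any phase-reconstruction or inversion step.
    loss = F.l1_loss(binaural_pred, binaural_target)
    loss.backward()
    optimizer.step()
    return loss.item()
```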
Binaural sound that matches its visual counterpart is crucial for bringing meaningful and immersive experiences to people in augmented reality (AR) and virtual reality (VR) applications. Recent works have shown that it is possible to generate binaural audio from mono using 2D visual information as guidance. Using 3D visual information may allow for a more accurate representation of a virtual audio scene in VR/AR applications. This paper proposes Points2Sound, a multi-modal deep learning model which generates a binaural version of mono audio using 3D point cloud scenes. Specifically, Points2Sound consists of a vision network, which extracts visual features from the point cloud scene, conditioning an audio network that operates in the waveform domain to synthesize the binaural version. Both quantitative and perceptual evaluations indicate that the proposed model is preferred over a reference case based on a recent 2D mono-to-binaural model.
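The two-network structure described in the abstract can be sketched as a point-cloud encoder that produces a global visual feature, which then conditions a waveform-domain audio network. The code below is a minimal stand-in under stated assumptions: a PointNet-style shared MLP with max pooling for the vision side, a 1-D convolutional encoder/decoder with feature-wise modulation for the audio side; the paper's actual layer choices and conditioning mechanism may differ.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Shared MLP over points + max pooling -> one global visual feature vector
    (PointNet-style stand-in; the paper's vision network may differ)."""
    def __init__(self, in_dim=6, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, points):                        # points: (B, N, 6) xyz + rgb
        return self.mlp(points).max(dim=1).values     # (B, feat_dim)

class ConditionedWaveformNet(nn.Module):
    """1-D convolutions on raw audio, conditioned on the visual feature."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.enc = nn.Conv1d(1, hidden, kernel_size=15, padding=7)
        self.film = nn.Linear(feat_dim, hidden)        # feature-wise scaling (assumed)
        self.dec = nn.Conv1d(hidden, 2, kernel_size=15, padding=7)

    def forward(self, mono, visual_feat):              # mono: (B, 1, T)
        h = torch.relu(self.enc(mono))
        h = h * self.film(visual_feat).unsqueeze(-1)   # modulate audio features
        return self.dec(h)                             # (B, 2, T) binaural estimate

# Usage sketch: the point-cloud feature conditions the waveform network.
vision, audio = PointCloudEncoder(), ConditionedWaveformNet()
mono = torch.randn(1, 1, 48_000)
points = torch.randn(1, 2048, 6)
binaural = audio(mono, vision(points))                 # (1, 2, 48_000)
```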
Research paper: Lluís, F., Chatziioannou, V., and Hofmann, A., “Points2Sound: From mono to binaural audio using 3D point cloud scenes”, 2021. Link: https://arxiv.org/abs/2104.12462