Over the past months, I was privileged to attend two future-focused meetings related to the topics of this blog. The first, hosted by Starkey Hearing Technologies in Berkeley, CA, was titled “Listening Into 2030.” It gathered top auditory scientists from around the world and asked them to envision how listening technologies will impact the way humans experience the world of sound and communicate with one another over the next 15 years. The second was the inaugural International Conference on Audio for Virtual and Augmented Reality (AVAR), held in conjunction with the Audio Engineering Society convention in Los Angeles, CA. The AVAR conference brought together hardware and software designers, content creators, and scientists to explore the state of the art in immersive audio experiences for VR and AR, identify the challenges facing the field, and share research that charts a way forward. Both meetings were eye-opening, to say the least.
I was particularly struck by how strongly most participants agreed with one another and shared a common vision (perhaps reflecting the current zeitgeist). However, this shared vision manifested in different ways at the two conferences. At Listening Into 2030, scientists shared a clear vision of what can be achieved by auditory augmentation, but disagreed about the goals of such systems. At AVAR, engineers and scientists appeared quite uniform regarding the goals (convincing “3D” immersive sound) but differed in their views about the approach. The vast majority continue to focus on filter-based recreations of real-world measured acoustics (the so-called Head-Related Transfer Function, or HRTF, approach). A much smaller minority (including myself) emphasized the importance of understanding perception itself–in other words, how the brain and mind shape our auditory experience.
Listening Into 2030
The Berkeley conference included a series of short, provocative presentations (“rants”) by tech-industry and neuroscience experts, combined with workshops on envisioning future technologies, applications, and users. Rants touched on topics like attention and distraction, mind-reading by machine interfaces, the challenges of big data (and privacy!) for ubiquitous audio recording, and how much “hearables” will need to evolve in order to support positive experiences for all-day, everyday users.
The design workshop components of Listening Into 2030 investigated a wide range of technologies visible on the horizon, from rooms with variable acoustics to optimize communication and/or privacy (think of classrooms that can be adjusted to encourage interaction while discouraging auditory distraction), to hearing aids that sense and adapt to a patient’s listening difficulties as they age. Many of the discussions focused on aspects of clinical audiology (detecting and addressing listening difficulty; ensuring universal access across global and economic boundaries), but even these issues were seen as integrated with the notion that new technologies will enhance listening for all users, impaired or not.
Although the discussions originated in numerous small groups, a shared technological vision quickly became clear. In that vision, devices for auditory augmentation will become widely available and technologically capable of (a) sensing and understanding the auditory scene surrounding a listener, and (b) altering sound to modify that scene in various ways. Some of these modifications will compensate for listening difficulties, as current hearing aids do, but will address a wider variety of concerns. Others will augment the auditory scene by adding layers of information (auditory tagging of real-world objects), telepresent communication partners, or interface elements that use spatialized speech or other auditory cues. Still other modifications might reduce unwanted sounds or replace them with audio more suited to the listener’s goals. Imagine a commuter “turning down” the voices of other passengers and bringing up an immersive forest soundscape more conducive to reading or other cognitively focused work.
You might recognize many of these ideas from my inaugural post, Bin-Li: A short story about binaural listening agents and hearing aids of the future. That story describes several more examples of how I think this type of technology will impact listeners (typical or impaired) and society. My experience at Listening Into 2030 convinced me that these ideas are shared by most of the field because the means to achieve them are close at hand. It is the nature of the zeitgeist that society collectively embraces a shared vision, and world-changing inventions are simultaneously co-discovered by many groups. I hope that we can continue to embrace this shared vision with the goal that implementations will reflect our needs as a society and as individuals, not just those of the competitive marketplace and its closely guarded intellectual property.
The Gold Rush - Audio for Virtual and Augmented Reality
I've been interested in virtual reality for a long time–fanatically in the SGI and Mondo 2000 days of the early 1990s and more casually in the years since. I’ve been astounded by the rapid embrace of this nascent technology in the last few years. Public interest exploded after Oculus Rift’s Kickstarter campaign demonstrated that the technology is now feasible and affordable at the individual level. Since then, the number of people with first-hand experience of VR has grown exponentially, and torrents of money have flowed into startups and the VR divisions of tech companies large and small. VR headsets are now available across a wide range of performance and expense, from mobile versions that run on ubiquitous smartphones all the way to headsets driven by high-end gaming PCs. With the recent commercial releases of the Oculus Rift and Sony’s PlayStation VR, the mass marketing of VR for entertainment has definitively begun.
But VR–along with its cousins, augmented (AR) and mixed reality (MR)–has the potential to change computing and communication in ways that go far beyond gaming or entertainment. Ubiquitous AR- and MR-driven interfaces promise the next evolution in computing platforms, similar to the past decade’s transition from desktop to mobile computing. And the immersive quality of VR and AR experiences suggests new modes of interpersonal communication, offering heightened degrees of presence and telepresence that will become increasingly similar to face-to-face interaction. In real life, a huge part of that experience (of presence, immersion, or spatial awareness of the environment in all directions) comes from our sense of hearing, and it stands to reason that hearing will be just as important for VR, AR, and MR experiences.
The AES Conference on Audio for Virtual and Augmented Reality (AVAR), organized by Linda Gedemer and Andres Mayo of AES, brought together technology developers, product engineers, audio engineers, content creators, and scientists working on the audio components of VR/AR experiences. Sponsors, vendors, and representatives from industry ran the gamut of recognizable names in audio (Dolby, Intel, DTS, Fraunhofer) and VR (Oculus, Magic Leap), as well as numerous startups developing tools for content authoring, spatial audio processing, and so on. I was struck by two general observations about the conference.
The first observation, supported by any number of articles on developments in the tech industry, is the frantic “Gold Rush” nature of VR at this moment. No single platform has reached the kind of market penetration that will ultimately define the “standard” for experience, so the future is wide open for any company to enter and dominate the market with a convincing product. Established tech companies like Sony, Microsoft, Google, and Facebook (which owns Oculus) are entering the market with fully realized platforms aimed at doing just that. Others may be testing the waters or identifying niche markets where a dominant role can be established. Smaller startups may be focusing on specific technological innovations in the realms of sensor integration, display technology, content capture, and so forth. The presence of all these companies at AVAR suggests that players at all levels understand the importance of audio, and auditory experience, to the ultimate success of VR and AR platforms.
The second observation was more of a surprise to me. Spatial audio (or “3D” audio) was, of course, a major topic of discussion at the conference. Some interesting presentations focused on rendering sound propagation in virtual spaces and on working with object-based surround-sound formats (e.g., Dolby Atmos) for room-based or binaural AVAR. But a larger number of discussions focused on optimizing head-related transfer functions (HRTFs) for virtual 3D sound. That problem has been studied extensively for 25 years, and the approach, along with its many limitations, has been understood by hearing scientists for most of that time. Yet AVAR technologists remain focused on refining this technology, presumably in the hope of a breakthrough that transcends the limitations and creates a universally compelling experience.
The HRTF approach is intuitively simple: the head and ears alter the sound that reaches the eardrum, and the pattern of alterations depends strongly on the direction from which a sound arrives. The brain is very good at detecting these alterations, and it uses that information to determine the direction of a sound source. Small microphones placed inside the ear canal can capture the pattern of alterations for each direction–that pattern is the HRTF–and store it in a computer file. The stored HRTF can then be used to filter new sounds so that the signal at the eardrum matches what would arrive from a particular (virtual) location in space.
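To make the idea concrete, here is a minimal sketch of HRTF-based binaural rendering in Python, assuming you already have a pair of head-related impulse responses (HRIRs, the time-domain form of the HRTF) measured for the desired direction. The random “HRIRs” below are placeholders, not real measurements.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono source with left/right HRIRs to produce a stereo
    signal that should appear to arrive from the measured direction."""
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right], axis=-1)

# Illustrative usage with synthetic data (stand-ins for measured HRIRs):
fs = 44100
mono = np.random.randn(fs)            # one second of noise as the source
hrir_l = np.random.randn(256) * 0.01  # placeholder left-ear HRIR
hrir_r = np.random.randn(256) * 0.01  # placeholder right-ear HRIR
stereo = render_binaural(mono, hrir_l, hrir_r)
```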
The strength of the HRTF approach is that it provides a way to deliver sounds that accurately match the acoustics of real-world listening. In other words, we can use the approach to get the physical cues exactly right. A major complication is that each person’s HRTF is different, because the size and shape of the head, shoulders, and external ear differ from person to person. Research studies show that listeners localize sounds more accurately when the sounds are synthesized with their own HRTF (listening through their own ears) than with a different HRTF. That is, spatial hearing is more accurate with individualized HRTFs than with generalized HRTFs based on an “average” listener, a recording manikin, etc. It’s easy to see that AVAR systems will ship with a limited set of generalized HRTFs; the result will be audio experiences that are acceptable for some listeners, unacceptable for others, and perfect for no one. In particular, because men, women, and children differ tremendously in overall size, the “average” HRTF differs markedly across these groups. It isn’t clear how vendors should target these groups, but the risk of creating experiences that work for men but not for women, or vice versa, is clear.
Several presentations at the AVAR conference focused on the problem of customizing HRTFs using photographs or scanned measurements of individual users. Although technological innovations have changed the implementations, many such approaches have already been investigated in the literature on virtual auditory stimulation. Some degree of enhanced accuracy can be achieved, for example, simply by scaling the HRTF according to ear or body size. It isn’t clear to me, as an observer, whether the AVAR industry is aware of that literature or simply reinventing the wheel.
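As an illustration of the kind of simple customization already explored in that literature, the sketch below scales a measured HRIR in time to approximate a listener whose head and ears are larger or smaller than the person (or manikin) it was measured on. The scale factor and the placeholder HRIR are purely illustrative.

```python
import numpy as np
from scipy.signal import resample

def scale_hrir(hrir, scale_factor):
    """Stretch (scale_factor > 1) or compress (< 1) the HRIR in time.
    Stretching shifts its spectral peaks and notches down in frequency;
    compressing shifts them up, roughly mimicking smaller ears."""
    n_out = int(round(len(hrir) * scale_factor))
    return resample(hrir, n_out)

# e.g., approximate a listener with ears ~5% smaller than the manikin's by
# compressing the HRIR in time (shifting spectral features up ~5%):
hrir = np.random.randn(256) * 0.01   # placeholder for a measured HRIR
customized = scale_hrir(hrir, 1 / 1.05)
```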
A major and surprising weakness of the HRTF approach is that even when the physical cues are reproduced perfectly (with individualized HRTFs, etc.), the virtual experience is typically very different from the real one. Virtual sounds are often heard “around” the head, an improvement over the “within-head” perception produced by normal headphone listening, but not “out there” where the source is intended to be. That is, virtual experiences lack proper externalization, and distance in particular is misperceived. Research by Durand Begault and his NASA colleagues in the 1990s demonstrated that externalization could be maximized by a combination of individualized HRTFs, low-latency head tracking, and the addition of reverberation (see Begault et al. 2001 for a review of this work). That is, multiple auditory and non-auditory cues combine to shape auditory spatial perception. Contemporary approaches to AVAR should thus benefit from their inclusion of head tracking and room-acoustics modeling. For some entertainment applications, these factors might even be convincing enough to outweigh the inaccuracy of HRTF cues for many listeners.
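As a rough sketch of how head tracking and reverberation fold into the same rendering loop shown earlier, consider the following. Here hrir_for_direction() is a hypothetical lookup into a set of measured HRIR pairs, and the exponentially decaying noise tail is a crude stand-in for a real room model.

```python
import numpy as np
from scipy.signal import fftconvolve

def hrir_for_direction(azimuth_deg):
    """Hypothetical lookup: return (left, right) HRIRs nearest to azimuth."""
    rng = np.random.default_rng(int(azimuth_deg) % 360)
    return rng.standard_normal(256) * 0.01, rng.standard_normal(256) * 0.01

def render_frame(mono, source_az, head_yaw, reverb_ir, wet=0.3):
    # Head tracking: render the source at its direction *relative to the head*.
    rel_az = (source_az - head_yaw) % 360
    hrir_l, hrir_r = hrir_for_direction(rel_az)
    dry_l = fftconvolve(mono, hrir_l, mode="full")
    dry_r = fftconvolve(mono, hrir_r, mode="full")
    # Reverberation: mix in a shared room response to aid externalization.
    wet_sig = fftconvolve(mono, reverb_ir, mode="full")
    n = max(len(dry_l), len(wet_sig))
    out = np.zeros((n, 2))
    out[:len(dry_l), 0] += (1 - wet) * dry_l
    out[:len(dry_r), 1] += (1 - wet) * dry_r
    out[:len(wet_sig), :] += wet * wet_sig[:, None]
    return out

# Example: 100 ms of noise, source fixed at 30° azimuth, head yawed 20° the
# other way, so the source is rendered at 50° relative to the head.
fs = 44100
reverb = np.random.randn(fs // 2) * np.exp(-np.arange(fs // 2) / (0.1 * fs))
frame = render_frame(np.random.randn(fs // 10), source_az=30, head_yaw=-20,
                     reverb_ir=reverb)
```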
If AVAR technology workers seem fixated on 25-year-old approaches to 3D audio, it’s not because the field of spatial hearing has not progressed. One way that field has progressed is in our understanding of how multiple auditory and non-auditory cues combine to shape our perception of auditory space, and how much weight listeners place on each of those cues. Interestingly, that weighting can vary tremendously between individual listeners and across different listening contexts.
So, for example, reverberation might greatly enhance externalization for me but not for you, because we differ in how much weight we give to the reverberation cue. Similarly, listeners differ in how much weight they give to interaural timing and intensity cues. The weighting patterns can even change as listeners move between different rooms or engage in different tasks. Thus, the same physical stimulus can produce very different perceptual experiences for different listeners or in different contexts. AVAR algorithms that attempt to manipulate spatial perception by altering these cues will need to account for such individual and contextual variability in order to produce compelling virtual experiences that match the real world.
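As a toy illustration of why this matters (not a published model), the sketch below treats perceived direction as a weighted average of per-cue estimates: two hypothetical listeners receive identical physical cues yet arrive at different percepts because their cue weights differ. The cue names and weights are invented for the example.

```python
import numpy as np

def perceived_azimuth(cue_estimates, weights):
    """Weighted average of per-cue direction estimates (degrees)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, cue_estimates))

# Both listeners receive the same cues, each cue "pointing" to a direction:
cues = {"itd": 28.0, "ild": 35.0, "spectral": 10.0}   # degrees azimuth
estimates = list(cues.values())

listener_a = [0.6, 0.3, 0.1]   # relies mostly on interaural timing
listener_b = [0.2, 0.3, 0.5]   # relies mostly on spectral (HRTF) cues

print(perceived_azimuth(estimates, listener_a))  # ~28.3 degrees
print(perceived_azimuth(estimates, listener_b))  # ~21.1 degrees
```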