Hearables and Auditory Virtual and Augmented Reality (2016)

The future of listening meets the virtual-reality gold rush

Chis Stecker — Thu, 20 Oct 2016 10:11:00 +0000

Over the past months, I was privileged to attend two future-focused meetings related to the topics of this blog. The first was held in Berkeley CA, hosted by Starkey Hearing Technologies, and titled “Listening Into 2030.” It gathered top auditory scientists from around the world and asked them to envision how listening technologies will impact the way humans experience the world of sound and communicate with other individuals over the next 15 years. The second was the inaugural International Conference on Audio for Virtual and Augmented Reality (AVAR), held in conjunction with the Audio Engineering Society convention in Los Angeles CA. The AVAR conference brought together hardware and software designers, content creators, and scientists to explore the state of the art in immersive audio experiences for VR and AR, identify the challenges facing the field, and share research that charts a way forward. Both meetings were eye-opening to say the least.

I was particularly struck by how strongly most participants agreed with one another and shared a common vision (perhaps reflecting the zeitgeist of the times). However, this shared vision manifested in different ways at the two conferences. At Listening Into 2030, scientists shared a clear vision of what can be achieved by auditory augmentation, but disagreed about the goals of such systems. At AVAR, engineers and scientists appeared quite uniform regarding the goals (convincing “3D” immersive sound) but differed in their views about the approach. The vast majority continue to focus on filter-based recreations of real-world measured acoustics (the so-called Head-Related Transfer Function, or HRTF, approach). A much smaller minority (including myself) emphasized the importance of understanding perception itself–in other words, how the brain and mind shape our auditory experience.

Listening Into 2030

The Berkeley conference included a series of short, provocative presentations (“rants”) by tech-industry and neuroscience experts, which were combined with workshops on envisioning future technologies, applications, and users. Rants touched on topics like attention and distraction, mind-reading by machine interfaces, challenges of big data (and privacy!) for ubiquitous audio recording, and how much “hearables” will need to evolve in order to support positive experiences by all-day/everyday users.

The design workshop components of Listening Into 2030 investigated a wide range of technologies visible on the horizon, from rooms with variable acoustics to optimize communication and/or privacy (think of classrooms that can be adjusted to encourage interaction while discouraging auditory distraction), to hearing aids that sense and adapt to a patient’s listening difficulties as they age. Many of the discussions focused on aspects of clinical audiology (detecting and addressing listening difficulty; ensuring universal access across global and economic boundaries), but even these issues were seen as integrated with the notion that new technologies will enhance listening for all users, impaired or not.

Although the discussions originated in numerous small groups, a shared technological vision quickly became clear. In that vision, devices for auditory augmentation will become widely available and technologically capable of (a) sensing and understanding the auditory scene surrounding a listener, and (b) altering sound to modify that scene in various ways. Some of these modifications will compensate for listening difficulties, as in current hearing aids but addressing a wider variety of concerns. Others augment the auditory scene by adding layers of information (auditory tagging of real-world objects), telepresent communication partners, or interface elements that use spatialized speech or other auditory cues. Still other modifications might reduce unwanted sounds or replace them with audio more suited to the listener’s goals. Imagine a commuter “turning down” the voices of other passengers and bringing up an immersive forest soundscape more conducive to reading or other cognitively focused work.

You might recognize many of these ideas from my inaugural post, Bin-Li: A short story about binaural listening agents and hearing aids of the future. That story describes several more examples of how I think this type of technology will impact listeners (typical or impaired) and society. My experience at Listening Into 2030 convinced me that these ideas are shared by most of the field because the means to achieve them are close at hand. It is the nature of the zeitgeist that society collectively embraces a shared vision, and world-changing inventions are simultaneously co-discovered by many groups. I hope that we can continue to embrace this shared vision with the goal that implementations will reflect our needs as a society and as individuals, not just those of the competitive marketplace and its closely guarded intellectual property.

The Gold Rush - Audio for Virtual and Augmented Reality

I've been interested in virtual reality for a long time–fanatically in the SGI and Mondo 2000 days of the early 1990s and more casually in the years since. I’ve been astounded by the rapid embrace of this nascent technology in the last few years. Public interest exploded after Oculus Rift’s Kickstarter campaign demonstrated that the technology is now feasible at an individual and affordable level. Since then, the number of people with first-hand experience of VR has grown exponentially, and torrents of money have flowed into startups and VR divisions of tech companies large and small. VR headsets are now available across a wide range of performance and expense, from mobile versions that run on ubiquitous smartphone platforms all the way to high-end gaming PCs. With the recent commercial releases of Oculus Rift and Sony’s PlayStation VR, the mass marketing of VR for entertainment has definitively begun.

But VR–along with its cousins augmented (AR) and mixed reality (MR)–offers to change computing and communication in ways that go far beyond gaming or entertainment. Ubiquitous AR and MR-driven interfaces promise the next evolution in computing platforms, similar to the past decade’s transition from desktop to mobile computing. And the immersive quality of VR and AR experiences suggests new modes for interpersonal communication, offering heightened degrees of presence and telepresence that will become increasingly similar to face-to-face interaction. In real life, a huge part of that experience (of presence, immersion, or spatial awareness of the environment in all directions) comes from our sense of hearing, and it stands to reason that hearing will be just as important for VR, AR, and MR experiences.

The AES Conference on Audio for Virtual and Augmented Reality (AVAR), organized by Linda Gedemer and Andres Mayo of AES, brought together technology developers, product engineers, audio engineers, content creators, and scientists working on the audio components of VR/AR experiences. Sponsors, vendors, and representatives from industry spanned the gamut of recognizable names in audio (Dolby, Intel, DTS, Fraunhofer) and VR (Oculus, Magic Leap), as well as numerous startups and vendors developing tools for content authoring, spatial audio processing, etc. I was struck by two general observations about the conference.

The first observation, supported by any number of articles on developments in the tech industry, is the frantic “Gold Rush” nature of VR at this moment in time. No single platform has reached the kind of market penetration that will ultimately define the “standard” for experience. So the future is wide open for any company to enter and dominate the market with a convincing product. Established tech companies like Sony, Microsoft, Google, and Facebook (which owns Oculus) are entering the market with fully realized platforms aimed to do just that. Others may be testing the waters or identifying niche markets where a dominant role can be established. Smaller startups may be focusing on specific technological innovations in the realms of sensor integration, display technology, content capture, and so forth. The presence of all these companies at AVAR suggests that players at all levels understand the importance of audio, and auditory experience, to the ultimate success of VR and AR platforms.

The second observation was more of a surprise to me. Spatial audio (or “3D” audio) was, of course, a major topic of discussion at the conference. Some interesting presentations focused on rendering sound propagation in virtual spaces and working with object-based surround-sound formats (e.g. Dolby Atmos) for room-based or binaural AVAR. But a larger number of discussions focused on optimizing head-related transfer functions (HRTF) for virtual 3D sound. The latter problem has been studied extensively for 25 years. The approach, and its many limitations, have been understood by hearing scientists for most of that time. Yet AVAR technologists remain focused on refining this technology, presumably because they hope for a breakthrough that transcends the limitations and creates a universally compelling experience.

The HRTF approach is intuitively simple: the head and ears alter the sound that reaches the eardrum, and the pattern of alterations depends strongly on the direction from which a sound arrives. The brain is very good at detecting these alterations, and it uses that information to determine the direction of a sound source. Small microphones placed inside the ear canal can capture the pattern of alterations for each direction, and the measurements can be stored in a computer file. That stored pattern is the HRTF, and it can be used to synthesize new sounds so that the signal at the eardrum matches that of a source at a particular (virtual) location in space.
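To make the synthesis step concrete, here is a minimal sketch in plain Python. It assumes the HRTF for one direction is available in its time-domain form, a pair of head-related impulse responses (HRIRs), one per ear. The function names and the three-sample "HRIRs" are made up for illustration; real measurements are hundreds of samples long and a real renderer would use a fast FFT-based convolution.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution; output has len(signal) + len(impulse_response) - 1 samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono source with the left- and right-ear impulse responses."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy (invented) HRIRs for a source on the listener's left:
# the right-ear signal arrives two samples later and at half the amplitude.
hrir_left = [1.0, 0.0, 0.0]
hrir_right = [0.0, 0.0, 0.5]

left, right = render_binaural([1.0, 0.5], hrir_left, hrir_right)
print(left)
print(right)
```

Played over headphones, the two outputs carry the timing and level differences encoded in the impulse responses. Moving the virtual source means swapping in the HRIR pair measured for the new direction.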

The strength of the HRTF approach is that it provides a way to deliver sounds that accurately match the acoustics of real-world listening. In other words, we can use the approach to get the physical cues exactly right. A major complication is that each person’s HRTF is different because the size and shape of the head, shoulders, and external ear differ from person to person. Research studies show that listeners are more accurate at localizing sounds synthesized with their own HRTF (listening through their own ears) than with a different HRTF. That is, spatial hearing is more accurate if we use individualized HRTFs rather than generalized HRTFs based on an “average” listener, recording manikin, etc. It’s easy to see that AVAR systems will ship with a limited set of generalized HRTFs. The result will be audio experiences that are acceptable for some listeners, unacceptable for others, and perfect for no one. In particular, because men, women, and children differ tremendously in terms of overall size, the “average” HRTF differs tremendously across these groups. It isn’t clear how vendors should target these groups, but the potential for peril, in terms of creating experiences that work for men but not women or vice versa, is clear.

Several presentations at the AVAR conference focused on the problem of customizing HRTFs using photographs or scanned measurements of individual users. Although technological innovations have changed the implementations, many such approaches have already been investigated in the literature on virtual auditory stimulation. Some degree of enhanced accuracy can be achieved, for example, simply by scaling the HRTF according to ear or body size. It isn’t clear to me, as an observer, whether the AVAR industry is aware of that literature or simply reinventing the wheel.

A major and surprising weakness of the HRTF approach is that even when the physical cues are reproduced perfectly (with individualized HRTFs, etc.), the virtual experience is typically very different from the real. Virtual sounds are often heard “around” the head, an improvement over the “within-head” perception produced by normal headphone listening, but not “out there” where the source is intended to be. That is, virtual experiences lack proper externalization, and distance is particularly misperceived. Research by Durand Begault and his NASA colleagues in the 1990s demonstrated that externalization could be maximized by a combination of individualized HRTFs, low-latency head tracking, and the addition of reverberation (see Begault et al. 2001 for a review of this work). That is, multiple auditory and non-auditory cues combine to shape auditory spatial perception. Contemporary approaches to AVAR should thus benefit from their inclusion of head tracking and room-acoustics modeling. For some entertainment applications, these factors might even be convincing enough to outweigh the inaccuracy of HRTF cues for many listeners.

If AVAR technology workers seem fixated on 25-year-old approaches to 3D audio, it’s not because the field of spatial hearing has not progressed. One way that field has progressed is in our understanding of how multiple auditory and non-auditory cues are combined to shape our perception of auditory space, and how much weight listeners place on each of those various cues. Interestingly, that weighting can vary tremendously between individual listeners and across different listening contexts.

So, for example, reverberation might greatly enhance externalization for me but not for you, because we differ in how much weight we give to the reverberation cue. Similarly, listeners differ in how much weight they give to interaural timing and intensity cues. The weighting patterns can even change as listeners move between different rooms or engage in different tasks. Thus, the same physical stimulus can produce very different perceptual experiences for different listeners or in different contexts. Attempts by AVAR algorithms to manipulate spatial perception by altering various cues will need to understand and take account of these variables in order to produce compelling virtual experiences that match the real world.

Original post blogged on b2evolution.

Hearables that do more than just listen

Chis Stecker — Thu, 13 Oct 2016 12:47:00 +0000

Despite public familiarity with digital hearing aids and related sound-processing devices, the initial market for hearable technology seems to be defined less by hearing than by other concerns. A few counterexamples aside (Doppler Labs, for example), many devices appear simply as new form factors for wearable fitness trackers (Bragi Dash, Samsung’s Gear IconX). For those applications, a variety of sensors come into play: accelerometers, heart-rate monitors, etc. But what about devices intended mainly to process sound and augment hearing? What use can they make of non-audio sensors? In this post, I want to explore examples from the research world, where the future of auditory augmentation looks increasingly “multisensory.”

Last time, I wrote about FM systems for hearing aids, and how those systems might be enhanced by room-level monitoring and restoration of auditory spatial cues (see Just around the corner: enhanced FM systems, hearables for concert-goers). Typically, FM systems use radio waves to broadcast the signal of a microphone directly to a listener’s hearing aids. If the microphone is located close to (or worn by) a target talker–a classroom teacher, for example–a tremendous advantage in signal-to-noise ratio can be achieved. The listener hears a clean signal, as if standing very close to the talker.

FM systems have clear advantages in many scenarios, but they are especially limited in situations with multiple talkers of potential interest, such as at a cocktail party. Some FM systems provide two microphones, so that two talkers can transmit on separate channels. Such a system might present a mix of both talkers at all times, or allow the listener to manually select one or the other, or employ some type of auto-switching algorithm (switching to the louder source, perhaps). There are advantages and disadvantages of each approach, but one thing is clear, regardless: part of the time, the system will transmit the “wrong” signal, so that the input fails to match the listener’s goals and/or attentional focus. A more difficult and effortful listening situation is thus created, potentially offsetting the signal-to-noise advantage of the FM system. The situation could be drastically improved if a system could track the listener’s attention in real time, and use that information to present listeners with the most relevant/important sound.
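The "switch to the louder source" idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the algorithm of any actual FM product: the frame-based RMS comparison, the 3 dB switching margin, and the function names are all my assumptions.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def auto_switch(frames_a, frames_b, margin_db=3.0):
    """For each frame, pick channel 'A' or 'B'.

    The selection only changes when the other channel is louder than the
    current one by at least margin_db, which avoids rapid flip-flopping.
    """
    ratio = 10 ** (margin_db / 20.0)
    current = "A"
    choices = []
    for frame_a, frame_b in zip(frames_a, frames_b):
        level_a, level_b = rms(frame_a), rms(frame_b)
        if current == "A" and level_b > level_a * ratio:
            current = "B"
        elif current == "B" and level_a > level_b * ratio:
            current = "A"
        choices.append(current)
    return choices


frames_a = [[0.5, 0.5], [0.1, 0.1], [0.2, 0.2]]
frames_b = [[0.1, 0.1], [0.5, 0.5], [0.25, 0.25]]
choices = auto_switch(frames_a, frames_b)
print(choices)
```

Note what the sketch cannot know: loudness says nothing about which talker the listener actually cares about, which is exactly the failure mode described above.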

MicUp: Head-controlled multi-channel wireless for hearing aids.

One approach to monitoring listeners’ attention is to keep track of which talkers they are facing. As a conversation evolves and different talkers add to it, most listeners naturally turn their heads back and forth to follow the action. Directional microphones–which selectively amplify sounds arriving from the front–can take advantage of this fact. But directional microphones are not as powerful, in signal-to-noise terms, as FM wireless systems. Scientists Owen Brimijoin and Alan Archer-Boyd, working with the MRC Institute for Hearing Research in Glasgow, Scotland, have implemented a different approach using wireless transmission and simple computer vision. In their system, called “MicUp,” each of several talkers wears a small badge that carries a microphone and an infrared light. Invisible to human eyes, the lights flash in a pattern that can be detected by small infrared cameras worn on the listener’s head. The principle is very similar to that used by Nintendo’s Wii controller. Because each badge uses a unique pattern of light flashes, the camera can “see” which badge(s) the listener is facing toward, and a simple device can then adjust the level of those badges in the mix delivered to the hearing aids. Because the cameras can also see where each badge is, additional processing can provide spatial cues so that sounds appear to come from the correct location. The result combines the advantages of FM systems with rapid tracking of the listener’s attentional focus among multiple talkers. Although numerous challenges remain, such as how to track badges when other objects get in the way, and how to incorporate the cameras into comfortable wearable frames, systems like MicUp suggest how future systems might integrate audio, video, and data signals to provide seamless perceptual experiences. For more information, visit Dr. Brimijoin’s web page at https://www.nottingham.ac.uk/medicine/people/owen.brimijoin or Dr. Archer-Boyd's at http://www-hearing-research.eng.cam.ac.uk/Main/HearingPeople.
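As a rough illustration of the level-adjustment step, the Python sketch below turns each badge's angle relative to the listener's nose into a gain for the mix. I do not know the weighting MicUp actually uses; the Gaussian shape, the 30-degree width, the -12 dB floor, and the names are all hypothetical.

```python
import math


def badge_gains(badge_azimuths_deg, beam_width_deg=30.0, floor_db=-12.0):
    """Map each badge's angle off the listener's nose (degrees) to a linear gain.

    Badges straight ahead get full level; gain falls off smoothly with angle
    but never drops below floor_db, so off-axis talkers stay faintly audible.
    """
    floor = 10 ** (floor_db / 20.0)
    gains = {}
    for badge, azimuth in badge_azimuths_deg.items():
        weight = math.exp(-0.5 * (azimuth / beam_width_deg) ** 2)
        gains[badge] = max(weight, floor)
    return gains


# Hypothetical badge names and angles as seen by the head-worn camera.
gains = badge_gains({"anna": 0.0, "ben": 90.0})
print(gains)
```

Keeping a floor rather than muting off-axis talkers is a design choice: it lets a listener notice when someone else starts speaking and turn toward them.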

Visually guided hearing aids.

A related way to identify what listeners are attending to is to consider where they are looking. By monitoring eye position, the direction of gaze can be computed quite accurately. Eyeglass-mounted eye tracking for research is now available from a number of vendors, suggesting that real-time, all-day eye tracking may become available (and affordable) in the near future. Scientists at Boston University, led by Prof. Gerald Kidd, have begun testing a new system that uses eye gaze to control steerable directional hearing aids (Kidd et al. 2013). A head-worn microphone array is used to implement directional audio “beam-forming.” The multiple microphone signals are combined in various ways to alter the directional pattern of microphone sensitivity (ranging from broad to narrow, front to side, etc.). The beam-forming system is controlled by the listener’s eye gaze, so that sound is amplified from wherever the listener is looking. Similarly to MicUp, the Visually Guided Hearing Aid combines auditory and visual information to enhance important over distracting sounds, here in a single wearable package.
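The simplest form of beam-forming is "delay and sum": delay each microphone signal so that sound from the look direction lines up across channels, then average. The Python sketch below assumes a straight-line array, a far-field source, and whole-sample delays. It illustrates the general principle and is not the processing used in the Visually Guided Hearing Aid; the names and numbers are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature


def steering_delays(mic_positions_m, look_angle_deg, sample_rate):
    """Whole-sample delays that align a plane wave arriving from look_angle_deg.

    Microphones lie on a line (positive = listener's right); 0 degrees is
    straight ahead and positive angles are to the right.
    """
    s = math.sin(math.radians(look_angle_deg))
    # Relative arrival time at each microphone (earlier = more negative).
    arrivals = [-x * s / SPEED_OF_SOUND for x in mic_positions_m]
    latest = max(arrivals)
    return [round((latest - t) * sample_rate) for t in arrivals]


def delay_and_sum(channels, delays):
    """Delay each channel by its steering delay and average the results."""
    n = len(channels[0])
    out = [0.0] * n
    for channel, delay in zip(channels, delays):
        for i in range(delay, n):
            out[i] += channel[i - delay]
    return [v / len(channels) for v in out]


# A click from the far right reaches the right-hand microphone one sample early.
channels = [[0, 0, 1, 0], [0, 1, 0, 0]]
delays = steering_delays([0.0, 0.0343], 90.0, 10000)
steered = delay_and_sum(channels, delays)
unsteered = delay_and_sum(channels, [0, 0])
print(delays, steered, unsteered)
```

Steered toward the source, the two copies of the click add up to full amplitude; steered straight ahead, they stay misaligned and the peak is halved. In a gaze-controlled system, the look angle would come from the eye tracker.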

Brain-computer interfaces in hearable devices?

MicUp and the Visually Guided Hearing Aid both aim to enhance sounds arriving from an attended direction, and use overt signals about attention (head and eye orientation) to do so. But human listeners can also pay attention to sounds without turning and looking directly at the target talker (psychologists call this “covert” attention). Could future devices measure attention by some other means, and use that information to augment auditory experience of covertly attended items?

It turns out that auditory responses in the human brain mimic key features of attended sounds. When listeners are presented with two competing speech streams, brainwaves measured with electroencephalography (EEG) or magnetoencephalography (MEG) entrain to the envelopes of the attended stream. Computer algorithms can then decode the signals from scalp-attached electrodes and determine which source the listener is attending (see Ding and Simon, 2012). Currently, this type of “brain reading” requires a great deal of data from multiple sensors. But in the future, decoding algorithms might exploit redundancies to achieve real-time performance with a smaller number of sensors. In fact, compact EEG sensors have already been developed to integrate into wearable and hearable form factors (Looney et al. 2012, Bleichner et al. 2015, Mirkovic et al. 2016). These employ electrodes placed near or inside the ear to make electrical contacts in close proximity to the auditory parts of the human brain. These early studies have demonstrated the technological feasibility of such devices, as well as their sensitivity to auditory brain responses in competing-talker scenarios.
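The final decision step of such a decoder can be illustrated with a toy Python example: compare an envelope estimated from the brain signal against the envelope of each talker and pick the best match. This is a deliberately simplified stand-in, not the method of the studies cited above. It assumes the hard part (turning multichannel EEG into a single envelope estimate) has already been done, and the data are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def decode_attended(neural_envelope, talker_envelopes):
    """Return the talker whose speech envelope best matches the neural estimate."""
    scores = {
        name: pearson(neural_envelope, envelope)
        for name, envelope in talker_envelopes.items()
    }
    return max(scores, key=scores.get), scores


# Invented example: the neural estimate is a noisy copy of talker A's envelope.
neural = [0.1, 0.9, 0.2, 0.8, 0.1]
talkers = {"A": [0, 1, 0, 1, 0], "B": [0, 0, 1, 1, 0]}
attended, scores = decode_attended(neural, talkers)
print(attended, scores)
```

With real recordings the correlations are far weaker than in this toy case, which is one reason current decoders need a great deal of data before committing to an answer.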

These three examples demonstrate the potential of harnessing information from a wide variety of sensors, sensory modalities, and data channels for auditory devices. Multi-sensor integration will lead to hearable devices and opportunities for auditory augmentation far beyond what could be provided by sound alone.

Thanks to Simon Carlile (Starkey Hearing Technologies) and Owen Brimijoin (MRC IHR) for specific discussions that inspired and led to this post.


Just around the corner: enhanced FM systems, hearables for concert-goers

Chis Stecker — Thu, 01 Sep 2016 06:03:00 +0000

Last month, I posted (from 20 years in the future) about how the integration of hearable technology, augmented reality, and artificial intelligence might change the way we think about hearing aids and communication disorders. It only takes a bit of reflection to realize that the hearing aids of the future will offer capabilities even normal-hearing users will want to access. Similarly, many of the greatest benefits for impaired listeners may come from technologies developed for other purposes such as auditory telepresence and social communication. Today, I want to look a little closer to the present. What steps in these directions could be taken with today's technology? What applications might lie just around the corner that could benefit hearing aid users, or entertainment-minded listeners? Two exciting but achievable developments come to mind: enhanced FM systems for hearing-aid listening, and hearable applications for concert-goers.

Enhanced FM systems

Today's hearing aids aim to restore or enhance the audibility of target sounds–such as speech–in listeners with reduced auditory sensitivity (hearing loss). Amplification can enhance all sounds equally, or be programmed to enhance quiet sounds more than loud sounds (compression). Amplification can also be directional, amplifying sounds in front of the listener but not to the sides or behind. When a hearing-aid user knows in advance which talker they want to hear, another very powerful option becomes available: the talker can wear a microphone that transmits her speech directly to the hearing aids using FM radio signals. Using an FM system in this way, good audibility can be experienced no matter where the target talker stands in the room–even in the presence of other distracting noises.
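The difference between equal amplification and compression is easy to see in a static input/output rule. The Python sketch below is a generic textbook-style compressor curve, not a real fitting prescription; the 30 dB gain, 50 dB kneepoint, and 2:1 ratio are arbitrary values chosen for illustration.

```python
def compressed_output_db(input_db, gain_db=30.0, knee_db=50.0, ratio=2.0):
    """Output level (dB) for a given input level (dB).

    Below the kneepoint every sound gets the full gain. Above it, each extra
    dB of input yields only 1/ratio dB of extra output, so loud sounds
    receive less gain than quiet ones.
    """
    if input_db <= knee_db:
        return input_db + gain_db
    return knee_db + gain_db + (input_db - knee_db) / ratio


for level in (40.0, 80.0):
    out = compressed_output_db(level)
    print(level, "dB in ->", out, "dB out, gain", out - level, "dB")
```

With these example settings a quiet 40 dB sound receives the full 30 dB of gain, while a loud 80 dB sound receives only 15 dB.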

Imagine an FM system used in a classroom setting. A student with hearing aids might normally experience tremendous difficulty understanding the teacher in a room full of restless kids. But with an FM system in place, the teacher's voice comes through loud and clear, beamed directly to hearing aids in both ears. The sounds of other children are still audible through the mic channels of the hearing aids, but the teacher's voice is heard as if through headphones. It's an invaluable and well-loved approach to giving impaired listeners the information they need to communicate effectively. Modern FM systems can adjust levels automatically and switch between channels tuned to different talkers. As the devices shift to digital audio signals, these capabilities will grow even more.

Despite their many clear benefits, FM systems are not perfect solutions. Some readers might have noticed from my description that the talker's voice is currently delivered to both ears at the same time. That means that the listener's perception is a lot like listening to music over headphones: sound appears in the middle of the head, rather than "out there" at the talker's location. This doesn't seem to be a problem for understanding speech, but it could certainly be a problem for spatial awareness: it may not be clear whether the teacher is instructing students nearby or across the room, or where to look when he requests "Eyes on me!". There is currently a lot of debate about whether this disruption of the natural spatial characteristics could be a problem for the development of spatial hearing. I'm not going to comment on that issue; instead, I'd like to imagine what we would need to build an FM system with more natural localization cues.

We know that the most important cues for sound localization are differences between sounds at the two ears. Specifically, sounds are louder and arrive earlier at the ear nearer to a sound source, giving rise to the so-called interaural level difference (ILD) and the interaural time difference (ITD) cues. An FM system capable of providing these cues would need to make small adjustments to the sound in each ear, and these would need to be updated as the talker moves around the room or the listener turns his head. Assuming that these signal processing steps are performed by a computer and not by the hearing aids (a safe assumption given current technology), that would also require broadcasting a separate signal to each hearing aid (i.e., the FM signal should be in stereo).
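To give a sense of the magnitudes involved, the Python sketch below estimates both cues from the direction of the source. The ITD uses Woodworth's classic spherical-head approximation. The ILD function is only a crude placeholder of my own, since real ILDs depend strongly on frequency. The head radius and the 10 dB maximum are assumed values.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air
HEAD_RADIUS = 0.0875    # m, a commonly assumed average; real heads vary


def itd_seconds(azimuth_deg, head_radius=HEAD_RADIUS):
    """Woodworth spherical-head ITD for azimuths between -90 and 90 degrees.

    Positive azimuth means the source is to the right; a positive result
    means the sound reaches the right ear first.
    """
    theta = abs(math.radians(azimuth_deg))
    itd = (head_radius / SPEED_OF_SOUND) * (theta + math.sin(theta))
    return itd if azimuth_deg >= 0 else -itd


def ild_db(azimuth_deg, max_ild_db=10.0):
    """Placeholder ILD that grows with the sine of azimuth (frequency ignored)."""
    return max_ild_db * math.sin(math.radians(azimuth_deg))


for azimuth in (0, 30, 90):
    print(azimuth, round(itd_seconds(azimuth) * 1e6), "microseconds,",
          round(ild_db(azimuth), 1), "dB")
```

Even for a source directly to one side, the ITD is well under a millisecond, which is why the adjustments need to be small, precise, and updated quickly as people move.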

So, technically, we require a system that can (1) track the talker's location in the room, (2) track the listener's location and head orientation, and (3) broadcast a stereo signal to the hearing aids. Does such technology exist? Certainly. There are numerous products–at all different price points–designed to track motion and orientation using cameras, electrical signals, gyroscopes, etc. Some use remote cameras and are relatively non-intrusive (e.g., Microsoft Kinect), while others provide more accuracy but require a sensor or target to be worn (e.g., Vicon, Polhemus). The key point is that current motion-capture technology is already suitable for this application. Similarly, stereo broadcasting to hearing aids is also possible, given that many two-channel FM systems are currently in use.

A rudimentary binaural FM system could implement a very simple real-time algorithm to introduce ITD and ILD cues appropriate to the relative positions of talker and listener. These would provide reliable information that, when paired with motor and visual information, might even produce realistic spatial perception. A more advanced system might use head tracking data with recordings of head-related transfer functions in order to provide more realistic 3-D audio cues. Both are established approaches that any modern PC can implement in real time.
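A rudimentary version of that algorithm might look like the Python sketch below: work out where the talker is relative to the listener's nose from the tracking data, then delay and attenuate the signal for the far ear. The coordinate convention, the function names, and the whole-sample delay are my assumptions; the ITD and ILD values would come from a head model or from measurements.

```python
import math


def relative_azimuth_deg(listener_xy, head_yaw_deg, talker_xy):
    """Talker direction relative to the listener's nose, in (-180, 180].

    Assumed convention: yaw 0 faces the +y direction and positive angles
    are clockwise, i.e. toward the listener's right.
    """
    dx = talker_xy[0] - listener_xy[0]
    dy = talker_xy[1] - listener_xy[1]
    bearing = math.degrees(math.atan2(dx, dy))
    return (bearing - head_yaw_deg + 180.0) % 360.0 - 180.0


def apply_itd_ild(mono, itd_s, ild_db, sample_rate):
    """Return (left, right) signals with the far ear delayed and attenuated.

    Positive itd_s and ild_db mean the source is on the right, so the left
    ear is the far ear. The delay is rounded to a whole number of samples.
    """
    delay = round(abs(itd_s) * sample_rate)
    far_gain = 10 ** (-abs(ild_db) / 20.0)
    near = [float(s) for s in mono] + [0.0] * delay
    far = [0.0] * delay + [far_gain * s for s in mono]
    source_on_right = itd_s > 0 or (itd_s == 0 and ild_db > 0)
    return (far, near) if source_on_right else (near, far)


azimuth = relative_azimuth_deg((0.0, 0.0), 0.0, (1.0, 0.0))
left, right = apply_itd_ild([1.0, 1.0], 0.0005, 20.0, 4000)
print(azimuth, left, right)
```

In a running system both functions would be called every time the motion tracker reports a new position or head orientation, and the two outputs would be sent on separate channels to the left and right hearing aids.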

In all likelihood, our FM system would involve several components installed in a room (such as a classroom): at minimum, one or more motion-capture cameras and a PC. Could we also use installed hardware in place of the talker's body-worn microphone? An array of directional microphones embedded in the walls, ceiling, or furniture would be well suited to pick up the talker's voice. The challenge would be knowing which microphones to patch into the FM system, since some will be dominated by other noise sources. Recall, however, that the system is already required to track the talker's position in the room. This information could certainly be used to generate an appropriate mix of microphone signals that capture and isolate the talker's speech with no body-worn microphone at all. Given the right motion-capture software, it should even be feasible to track multiple potential talkers, adjusting the mix dynamically to emphasize them as they speak up.
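One simple way to turn a tracked position into a microphone mix is to weight each installed microphone by its closeness to the talker. The Python sketch below is a hypothetical illustration only: the inverse-distance rule, the `sharpness` exponent, and the 10 cm minimum distance are assumptions, and a real installation would also account for microphone directionality and room acoustics.

```python
import math


def mic_mix_weights(talker_xy, mic_positions, sharpness=2.0):
    """Mixing weights that favor microphones close to the talker.

    Each weight is proportional to 1 / distance**sharpness, and the weights
    are normalized to sum to 1.
    """
    raw = []
    for mic_x, mic_y in mic_positions:
        distance = math.hypot(talker_xy[0] - mic_x, talker_xy[1] - mic_y)
        distance = max(distance, 0.1)  # avoid dividing by a near-zero distance
        raw.append(1.0 / distance ** sharpness)
    total = sum(raw)
    return [w / total for w in raw]


# Talker standing 1 m from the first microphone and 2 m from the second.
weights = mic_mix_weights((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)])
print(weights)
```

Recomputing the weights as the talker moves, or computing one set per tracked talker, gives the dynamic mix described above.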

So, how long until an FM listener and his student cohort can walk into a classroom and launch into a discussion, with the room invisibly tracking and adjusting FM signals to provide optimal signal-to-noise ratios and appropriate spatial cues for each talker? Certainly not 15 or 20 years. Each piece of this technology currently exists; it should be a matter of 1-2 years, or an Engineering Master's Thesis, to integrate them.

Hearables for concert-goers

By now, we all know that attending rock concerts without hearing protection is a bad idea. For many of us, that has meant progressing from disposable foam ear plugs (which kill all the high frequencies and make the music sound terrible) to spending $10-$15 on high-fidelity ear plugs with good frequency balance. You might even consider investing (wisely) hundreds of dollars in custom ear plugs, shaped–like hearing aids–precisely to your ears and offering customizable attenuation. It makes a big difference to listen in comfort and safety.

Musicians face a more serious and complicated version of the same problem: they are exposed more frequently–and for longer durations–than casual concert-goers, and they have a critical need to hear their music clearly as they perform. On-stage monitor speakers can present dangerously high levels of sound as the engineers attempt to overcome room and crowd noise while helping musicians hear themselves in the mix.

In-ear monitors have become an increasingly popular solution to this problem for musicians. Custom molded to individual ears, they block outside noise like powerful custom earplugs while their high-quality transducers act like custom earphones. Typically, the monitors receive an audio signal from the on-stage monitor mix, and engineers can adjust each musician's signal to craft an individual mix of all the instruments. The result is that each musician can hear themselves clearly while listening at a much lower level than with on-stage monitors.

As in every other field, the technology for in-ear monitors continues to advance. Monitor systems now transmit and receive wireless signals, with increasing options for "personal mixing systems" that allow each musician, rather than a sound engineer, to adjust their own mix directly. Such systems allow more flexibility in changing the monitor mix from song to song as performance needs change.

Much like hearable technology in general, in-ear monitors reflect the convergence of several technologies drawn from hearing aids (custom-molded inserts), earphones, and wireless communication technology. As such technologies continue to converge in hearable gear for the general public, will non-musicians want access to the capabilities that on-stage musicians have now? I asked my colleague, Erick Gallun, what that might look like:

Imagine attending a concert and, instead of slipping your earplugs in and shutting out your party's conversation, you insert your hearables and ~~set them to forward speech from your friends, but not other attendees, until the music starts [oops, getting ahead of ourselves here]~~ select from a number of mixes "published" by the sound engineer: standard front-of-house mix, vocals-heavy mix, front-of-house with crowd cancellation, etc. Or, dial in your own mix from the individual-instrument signals sent to musicians' monitors. Erick confessed that maybe only "music nerds" would want access to those signals. But, we reasoned, if hearables can stand in as hearing protectors (and they should), will they simply go silent to emulate ear plugs? Or should they provide a signal of some sort? And if so, what sort? The concert's own musical program seems the obvious choice.

The technology for this type of custom-mix concert is already available in the form of wireless audio transmitters and in-ear monitoring systems. Heck, a pretty solid demo could probably be built using a PC for local digital "broadcasting" and smartphone apps for the audience members. Would it be compelling enough to actually use? For many current concert-goers, it might not. But for many potential attendees who avoid concerts because they can't hear the band over all the noise, it just might.
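To make the "dial in your own mix" idea a little more concrete, here is a minimal sketch of what the listener's side of such a demo might compute. Everything in it is an assumption for illustration: the stem names, the two preset mixes, and the `mix_stems` function are hypothetical, and a real system would also have to handle streaming, latency, and synchronization, none of which is shown here.

```python
from typing import Dict, List

# Hypothetical presets a sound engineer might "publish": one gain per stem.
# The stem names and gain values are illustrative only.
PRESETS: Dict[str, Dict[str, float]] = {
    "front_of_house": {"vocals": 1.0, "guitar": 1.0, "bass": 1.0, "drums": 1.0},
    "vocals_heavy": {"vocals": 1.5, "guitar": 0.6, "bass": 0.6, "drums": 0.5},
}


def mix_stems(stems: Dict[str, List[float]], gains: Dict[str, float]) -> List[float]:
    """Sum per-instrument sample blocks using the listener's chosen gains.

    Samples are assumed to be floats in the range [-1.0, 1.0]. A stem with
    no entry in `gains` is treated as muted (gain 0.0).
    """
    if not stems:
        return []
    length = len(next(iter(stems.values())))
    if any(len(samples) != length for samples in stems.values()):
        raise ValueError("all stems must contain the same number of samples")

    mix = [0.0] * length
    for name, samples in stems.items():
        gain = gains.get(name, 0.0)
        for i, sample in enumerate(samples):
            mix[i] += gain * sample

    # Hard-clip the sum so it stays within the valid sample range.
    return [max(-1.0, min(1.0, sample)) for sample in mix]
```

A listener who picks the "vocals-heavy" mix would call `mix_stems(stems, PRESETS["vocals_heavy"])` on each incoming block of audio, while a "music nerd" could pass in their own dictionary of gains instead.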

One might also wonder how audiences will feel about attending concerts where each person listens through their own devices. Some concert-goers might find the experience socially isolating. Others might find the shared earphone experience to be more intimate. Interested in those issues? They are already being explored by pioneers of the "Silent Disco" movement.

Of course, there will probably be issues of copyright and broadcast licensing once bands start live-streaming to personal devices, but sooner or later an enterprising club or band will conduct the necessary experiments. With current technology, they could develop (and control) compelling new audience experiences. Eventually, though, hearables may become capable of forwarding signals directly to other devices. Imagine pulling up an audio stream from another listener in the first row, or dialing up a mix across potentially hundreds of time- and frequency-calibrated auditory viewpoints, cancelling out the various elements of crowd noise to obtain an ideal "crowd's ear view" of the performance. That type of sharing will open up amazing new possibilities, not just for music but throughout daily life. It will also raise serious concerns about privacy and ownership of communication. That, however, is a discussion for another day.

Original post blogged on b2evolution.

Bin-Li: A short story about binaural listening agents and hearing aids of the future.

Chris Stecker — Fri, 29 Jul 2016 08:18:00 +0000

I "met" Bin-Li around the time of my 65th birthday, in 2036. I’d had hearing aids before…high-tech hearing aids that amplified the sounds my ears were no longer sensitive to. They had smart algorithms for reducing noise and different modes for focusing on a single conversation versus listening broadly to the world around me. They even had modes that were halfway decent for listening to music. But Bin-Li is different. Bin-Li (my audiologist told me this was short for “Binaural Listener”) is like a computerized agent that listens to sound through my own ears, understands, and remembers the events and conversations that are going on around me. She can even read my brainwaves–in a simple fashion–to help decide which parts I most want to hear and understand.

“Bin-Li, what did he just say?” Sometimes I feel like a broken record, asking Bin-Li to repeat something or recall an earlier part of the conversation. But then I think back to my grandfather, and his struggles with old-fashioned hearing aids. He never seemed to understand anything that was said, and he was always struggling with the volume setting, trying to find a balance where he could pick up someone’s voice without too much extra noise. He never could; instead, he spent most of his time withdrawn from conversations, sitting there with a blank or exasperated look. He was a fiercely intelligent man; you knew he had a lot to say, and that he desperately wanted to be part of the banter, if only he could make it out. Or I think back to my own father, who was constantly asking my mother to repeat what someone had just said. And how exasperated she was, that he never seemed to be paying attention to what she said, or what anyone else said.

Bin-Li’s calm and reassuring voice is never exasperated. She’s always there, close by my shoulder, ready to discreetly repeat or explain a bit of conversation. In response to “What did they say?” Bin-Li will tell me, “The man on the left asked what restaurant you should visit tonight. The woman on the right responded that she’d had too much Chinese this week; maybe Thai would be better.” In fact, Bin-Li can usually identify each talker by name, and more: “Bin-Li, who is that speaking now?” She’ll reply “That’s Mary Wilson. She works at your daughter’s school, in the office. You met her last year at the Christmas party. She has a son, Jack, and a husband, John.”

Bin-Li is more than just a communication aid; she’s also a memory aid. She experiences my conversations; she can play them back, review them, and can even understand them. She can identify important items and add them to my itinerary or to my contacts. She can interface with my phone and use it, for example, to make restaurant reservations while I’m in a crowded, noisy bar. She can send messages, dictate notes. Many of these are things that my phone could do twenty years ago. But somehow it’s different, having her there with me, all the time. Especially now that it’s become so difficult for me to understand what people are saying around me.

Bin-Li’s voice is produced by two earpieces that seal snugly and comfortably in my ears. But her voice does not appear inside my head, like listening to music over headphones. Not normally, anyway; sometimes I like to have her voice close to my ear, a sort of “inner-voice” that guides me as I move through the world. But more often, I use the standard setting, which makes her appear as if she is in the room with me, just over my left shoulder. When I turn my head, her voice does not move along with it, but stays in the right place just like any other sound in the world. And she always sounds as if she is properly in the room I’m in. It’s hard to explain, but it’s very unlike listening to, say, an audiobook with my old-fashioned stereo earphones (or even modern "binaural" recordings). That always sounded strange and artificial, like a photo inserted haphazardly into a scene with the wrong lighting or camera angle. The result is quite literally "out of place": a sound that comes from nowhere in particular, inside my head, or just somehow not belonging to the room I’m in. Bin-Li is different. She seems real, tangible. A lot of that, I think, has to do with where she seems to be when she speaks to me. Right there, just beyond my left shoulder. Always, that is, unless she finds someone standing in her place. Then she moves, as naturally as anything, to a different place where I can easily separate her voice from the others.

My old "directional" hearing aids made everything sound like it was in the middle of my head, and mushed together. But with Bin-Li, I hear separated talkers, in separated locations. When I turn my head to look at a talker, I hear that talker in the correct place. Usually, Bin-Li puts the talkers in the places they should be, so that when I look I can see the talkers in the locations I hear them. But Bin-Li can move the sources of sound to make it easier to tell them apart, if I ask her to. The new locations are always totally compelling. Just as with Bin-Li’s own voice, the locations appear fixed when I turn my head, and convincingly in the room.

Last week we went to a noisy jazz club. There was a lot of musical sound in the club–some coming from the band on stage, some coming from the PA speakers (which seemed to be everywhere)–not to mention the important conversation at our table. I asked Bin-Li to “collapse” the music and put it onstage. I’ve read a little about this, and find it extremely interesting. It’s a hard problem, because the sounds in the room–the music, the loudspeakers, the talkers–are mixed in with all kinds of echoes, reverberation, and noise. Bin-Li’s algorithms can sort that out, and in doing so they can figure out which sounds belong to the band, and which to the room itself. Bin-Li recreated the sound of the band, on the stage and with much less extra noise and reverberation–an acoustic experience much more like listening to music on my living-room stereo at home. It was a very pleasant experience, even for this hearing-impaired listener. I could hear the talkers at my table, each in their correct place, and still appreciate the music, which I could even turn toward and focus on when an interesting solo caught my ear.

I’m very thankful for Bin-Li and this new technology that has replaced my hearing aids. My communication is more effective, and I feel more connected to the space and to the people in it, my communication partners. Supplementing my own understanding and my memory for who is talking, Bin-Li makes me feel younger and more engaged.

But I’m not the only person using this technology. In fact, most of the users aren’t even hearing impaired at all. My kids and grandkids also have devices like Bin-Li. They call them “hearables,” an admittedly cutesy name that combines “hearing aids” with “wearable computing”. They use them for different things. Of course, they can use Bin-Li in much the same way I do, to remember conversations, identify people they’ve only met once or twice, to clean up a noisy listening environment. But mostly they use it for socializing with other users. These days, kids and younger adults always seem to be talking to someone who isn’t there. They wander the streets in animated conversations with real people who can’t be seen because they are located someplace else, but with whom they interact in much the same way they would if physically present. I suppose they never get bored or lonely, because their friends are always with them. And their friends can listen through their ears, to experience what’s happening in each other’s environment. I’ve even seen them do this while standing in the same room, at parties. When one of the kids shouts “Hey, you gotta listen to this,” their friends in the room and all around the world who are part of their current conversation can hear (in some kind of realistic sense that I don’t fully understand) what that person is talking about. They can play it back, experience the same space even though they might be on different continents, but most importantly experience the act of close conversation with their friends and colleagues.

Every once in a while, one of the kids calls me up like this. They don’t call it “calling;” they call it something else, but to me it seems like a phone call. There’s a little beep, and then Bin-Li tells me “Your grandson Jeffrey would like to speak to you. Should I add his layer?” When I say “yes,” suddenly it is as if Jeffrey is there in the room. If I closed my eyes, I would have a hard time telling that he isn’t. His voice sounds, just like Bin-Li, to be in the same room with me. When I turn my head, his voice stays in the correct place (just like all the other sound sources Bin-Li renders for me). We have a conversation: we laugh, we talk, we tell jokes. The exasperating thing is that the way kids use this technology, I never know when to hang up. They seem to just leave it on, like a full-time communication channel with each of the people in their lives. I suspect they “mute” the parts of their conversations they don’t want me to hear. Or maybe their version of Bin-Li knows which parts are addressed to me and which are not. Admittedly, I don’t understand this part, but it’s pretty interesting, and it’s really changed the world. People are running around having these “layered” conversations, regardless of their physical proximity.

I suppose we should have seen this technology coming. Twenty years ago, we certainly had earphones that fit in the ears, which people wore almost non-stop for music listening. We had advanced hearing aids that could take in sound, process it, and play the modified sound to the listener. We had the rudiments of artificially intelligent agents, in our phones: voices that we could talk to and make requests of. We had ubiquitous technology; everyone had a phone in their pocket. Now, I talk about my “phone” as if it’s a real thing, but it’s just a tiny function incorporated into Bin-Li. The world has sure changed.

Yes, even twenty years ago, everyone was running around with buds in their ears. The difference is that back then they were isolated. They were isolated from the world around them, and they weren’t really integrated into the world of communication that they were trying to connect to. Some people ran around with “Bluetooth” headsets. They talked to people who weren’t there, much like the kids do today. But the people who weren’t there were simply voices in the ear; they didn’t really belong to the space, in the way that we now take for granted. I can hardly imagine how difficult a conference call with 8-12 people must have been back then.

Today’s technology is pretty amazing, and I can’t wait to see where it goes next. I wish I could have been there twenty years ago, as it was all coming together. As people were finally learning how to exploit spatial hearing to build “binaural listeners” that could understand an auditory space and the talkers in that space, and then to turn that information into realistic and comprehensible auditory scenes for both normal-hearing and hearing-impaired listeners.

People like me, with sensorineural hearing loss, have poor sensitivity to some sound frequencies due to a loss of hair cells in the ear. It’s less of an issue these days than in the past, before the advent of advanced hearing aids. Now we can very reliably amplify the affected frequencies and restore sensitivity. But other people suffer from communication disorders that are more “central” or “cognitive.” For them, the problem isn’t in the ear, it’s in the brain. Some have trouble understanding speech; others have trouble dealing with echoes and reverberation. There’s no quick fix for such people. You can’t just make some sounds louder, but Bin-Li works for them because she does so much more than that. Bin-Li can simplify the sounds to isolate a single talker, if necessary, repeat or explain parts of a conversation, or show them on a visual display. I don’t use a visual display myself, but I’ve seen demos that generate real-time captions even with multiple talkers. So regardless of the nature of the communication disorder, this technology has helped tremendously.

Today, this technology is everywhere: in the audiology clinic, the entertainment industry, and in normal day-to-day activity. I can’t imagine a young person today who would walk around without their “hearables” in place. As one of my grandkids put it recently, “It would be like walking around with your eyes closed.”

-Chris Stecker, Nashville, April 26 2016
