March 8, 2001

Virtual Sound: The Future of Virtual Reality

Introduction

When people think about virtual environments, the first thing that comes to mind is 3D graphics. However, graphics alone fall short of the final goal of virtual reality. Sound plays just as important a role as graphics in maximizing virtual realism. The physiology of the human ear, the physics of sound, and the current software and hardware are all significant in the implementation of surround sound in virtual environments.

Physiology of the Ear

The physiology of the human ear as it pertains to virtual surround sound can be divided into the parts of the ear and the way in which we hear sound. The outer ear is most important to sound localization, since the pinna (also called the auricle) is used for spatial focusing and sound amplification. The pinna is the external part of the ear, which acts like a funnel to collect sound and channel it into the auditory canal. The auditory canal (also called the ear canal) enhances the frequencies humans are most sensitive to, such as those of the human voice, because its length makes it resonate at those frequencies. The auditory canal also contains the defense mechanisms that protect the inner parts of the ear from debris. Sensitive skin, small hairs, and wax (cerumen) make sure nothing gets through except sound.

The pressure variations that travel down the auditory canal are converted to vibrations of the eardrum (also called the tympanic membrane), which separates the outer and middle ear. The eardrum consists of three layers (skin, elastic fibrous tissue, and mucous membrane) and is held in place by the annulus, a ring of fibrous cartilage. The rest of the middle ear is made up of the ossicles, three small bones called the malleus, incus, and stapes, and the muscles that control them. These bones are driven by the eardrum and work together like a lever system to amplify the vibrations. The muscles attached to the bones contract involuntarily to deaden large vibrations caused by loud sounds. They act like an automatic volume control to prevent harmful vibrations from reaching the sensitive inner ear. The middle ear also amplifies sound by concentrating it onto the oval window, a membrane far smaller than the eardrum that acts as the interface between the middle and inner ear. Because the force collected over the whole eardrum is delivered by the ossicles to this much smaller area, the pressure at the oval window is many times greater than the pressure at the eardrum (commonly estimated at around twentyfold) before the vibration passes into the inner ear.

The vibrations created in the middle ear are immediately converted to hydraulic pressure within the cochlea, the most significant part of the inner ear. Pressure created at the oval window passes through the cochlear ducts and is released at the round window. Along one of the walls of the cochlear ducts is the basilar membrane, which separates the vibrations according to frequency. Pressure from the oval window creates a wave-like ripple across the membrane. One end of the membrane is tight and responds to high frequencies, while the other end is loose and responds to low frequencies. Nerve fibers along the membrane send the separated frequencies to the brain to complete the hearing process.

The hearing process is very important to creating virtual environments because it is the fastest form of sensory perception that humans have. The eyes can normally detect changes that are as close as 100 ms apart, but the ears can identify separate sounds that are only 2 ms apart. By this measure hearing resolves events fifty times more finely in time than vision, which suggests that spoken instructions can be registered much sooner than the same instructions displayed graphically on a screen. Virtual environments where time is critical, such as flight simulators, need to incorporate sound to reduce user reaction time.

The ears can also localize sounds that come from any direction, which is far more than we can see without moving our head. To do this the brain compares differences in sound pressure and in time-phase between the left and right ear. Sound-pressure level (SPL) corresponds roughly to the loudness of a sound, while time-phase is the difference in arrival time of the same sound wave at two locations (each ear), expressed as a fraction of its wavelength. Sound pressure decreases with distance, and the head itself shadows the far ear, so each ear receives the same sound at a slightly different SPL, and the brain can use the difference to estimate the angle of the source in the horizontal plane. Time-phase is the more accurate cue, except that the phase difference repeats with every wavelength and therefore becomes ambiguous for short wavelengths (high frequencies). By combining the SPL difference with the time-phase difference, the human ear can localize sounds in the horizontal plane to within a few degrees.
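The time difference between the ears can be estimated with a short sketch. It uses Woodworth's spherical-head approximation, which is a standard textbook model and not something taken from the sources of this paper. The head radius of 0.0875 m is an assumed average, and the function names are only illustrative.

```python
import math

SPEED_OF_SOUND = 344.0   # m/s, the value used in this article
HEAD_RADIUS = 0.0875     # m, assumed average head radius (not from the article)


def interaural_time_difference(azimuth_deg, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Woodworth's spherical-head estimate of the arrival-time difference, in seconds.

    azimuth_deg is the source angle from straight ahead, between 0 and 90 degrees.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius / c) * (theta + math.sin(theta))


def phase_ambiguity_frequency(itd_seconds):
    """Frequency above which the phase difference between the ears exceeds half a cycle."""
    return 1.0 / (2.0 * itd_seconds)


if __name__ == "__main__":
    for angle in (30, 60, 90):
        itd = interaural_time_difference(angle)
        print(f"{angle:2d} deg: ITD = {itd * 1e6:6.1f} microseconds, "
              f"phase ambiguous above about {phase_ambiguity_frequency(itd):6.0f} Hz")
```

Under these assumptions a source directly to one side arrives well under a millisecond earlier at the near ear, and the phase cue alone stops being unambiguous somewhere below 1 kHz, which is why the level difference is needed as a second cue.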

Vertical sound localization is not yet completely understood. The shape of the individual’s head, pinnae, and shoulders plays a significant role in identifying the vertical location of a sound source. This corresponds to the soundstage we hear, best described as the difference between listening to music on a set of speakers and on a set of in-ear headphones (the ones that you insert in your ear). With the speakers, the music appears to come from a ‘stage’ in front of you, situated between the two speakers, but with the headphones, the music is centered within your head between your ears. The cues we get from sound bouncing off of our shoulders, head, and pinnae recreate the original stage on which the music was recorded. Taking away those cues causes the sound to be internalized between our ears.

Physics of Sound

Sound has many properties that must be understood to create realistic sound for virtual environments. First, sound is a wave and, like light, shows the general behavior of waves. Sound can also be thought of as a vibration traveling through a medium such as air or water. In either case, we describe sound as a change in pressure with respect to time. The rest of this section covers the basic properties of sound and then the effects it produces.

Sound has four main properties: wavelength, period, amplitude and frequency. Wavelength is the distance between equal parts of the wave, for example the distance between adjacent crests on a plot of amplitude vs. distance. The period is the time between equal parts of the wave passing a fixed point, or the distance between adjacent crests on a plot of amplitude vs. time. The amplitude is the maximum height of the wave; the instantaneous value of the wave at any given point in time can be either positive or negative. The amplitude corresponds to the loudness of a sound, and sound levels are expressed in decibels (dB), a logarithmic scale. An increase of 3 dB corresponds to a doubling of acoustic power, while an increase of about 10 dB is needed for a sound to seem twice as loud to the human ear. The last property is frequency, given in Hertz (Hz), which is the number of cycles per second: it is the reciprocal of the period, and it equals the velocity through the medium divided by the wavelength. The velocity is measured in meters per second, and for sound in air at room temperature it is about 344 m/s. Conditions such as temperature affect the velocity, and therefore the wavelength that corresponds to a given frequency.
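The relationships between these properties are simple enough to sketch in a few lines. The function names are my own, and the decibel helper assumes a ratio of acoustic powers, for which the standard definition is ten times the base-10 logarithm of the ratio.

```python
import math

SPEED_OF_SOUND = 344.0  # m/s in air, the value used in this article


def wavelength(frequency_hz, c=SPEED_OF_SOUND):
    """Wavelength in meters: velocity divided by frequency."""
    return c / frequency_hz


def period(frequency_hz):
    """Period in seconds: the reciprocal of the frequency."""
    return 1.0 / frequency_hz


def level_change_db(power_ratio):
    """Change in level, in decibels, for a given ratio of acoustic powers."""
    return 10.0 * math.log10(power_ratio)


if __name__ == "__main__":
    for f in (100.0, 1000.0, 10000.0):
        print(f"{f:7.0f} Hz: wavelength {wavelength(f):.4f} m, period {period(f) * 1000:.3f} ms")
    print(f"doubling the power changes the level by {level_change_db(2.0):.2f} dB")
```

The loop shows how widely audible wavelengths vary, from several meters at low frequencies down to a few centimeters at high ones, which matters later for diffraction.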

Some effects caused by sound waves include interference, diffraction, and reflection. Interference comes in two forms, constructive and destructive, but to understand them, we must first understand phase. When dealing with two or more sound waves we must take into account each wave’s phase, which is the offset of the wave with respect to its starting point. The phase determines where the wave is in its cycle, and therefore its value, at a specific point in time. If two otherwise identical waves do not reach the same value at the same time, they are considered to be out of phase. For example, a sine wave and a cosine wave are exactly 90 degrees out of phase.

When two waves occupy the same space there is always interference. Constructive interference occurs when the waves are exactly in phase and their amplitudes add (both are positive or both negative at a given point in time), creating a sound with twice the amplitude of either original wave. Destructive interference occurs when two waves are out of phase, causing their amplitudes to subtract (one being positive and the other negative at a given point in time); when two equal waves are exactly 180 degrees out of phase, they cancel completely and no sound is heard at all. This concept of interference is very important when trying to simulate the acoustics of a room.
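Interference between two equal sine waves can be checked numerically. The sketch below is only an illustration: it compares the closed-form amplitude of the sum, 2A|cos(phase/2)|, with the peak found by sampling one full cycle. The function names and the sample count are arbitrary choices.

```python
import math


def combined_amplitude(phase_deg, amplitude=1.0):
    """Closed-form amplitude of the sum of two equal sine waves offset by phase_deg."""
    return 2.0 * amplitude * abs(math.cos(math.radians(phase_deg) / 2.0))


def sampled_peak(phase_deg, amplitude=1.0, samples=3600):
    """Peak of the summed waveform, found by sampling one full cycle."""
    phase = math.radians(phase_deg)
    peak = 0.0
    for k in range(samples):
        x = 2.0 * math.pi * k / samples
        value = amplitude * math.sin(x) + amplitude * math.sin(x + phase)
        peak = max(peak, abs(value))
    return peak


if __name__ == "__main__":
    for phase in (0, 90, 180):
        print(f"phase {phase:3d} deg: formula {combined_amplitude(phase):.4f}, "
              f"sampled {sampled_peak(phase):.4f}")
```

Waves that are in phase double in amplitude, waves that are 180 degrees apart cancel, and every offset in between gives a partial result.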

Diffraction is a more complicated property of all waves, which has to do with traveling through small openings or slits. When sound travels through an opening that is small relative to its wavelength, the edges of the opening cause the sound to bend in all directions. The intensity of the sound depends on the angle at which the listener is standing with respect to the opening. Consider the opening to be a doorway in a dorm room with a student listening to his/her stereo inside. Another student in the hallway can hear the music because the doorway allows the sound to bend and diffract down the hallway. As the second student walks closer to the door, the sound gets louder. However, only mid to low frequencies are heard at the extreme angles to the door, because low-frequency wavelengths are large compared to the width of the door opening, whereas high-frequency wavelengths are small. Without diffraction, the music would only be heard when the second student is directly in front of the door.

Reflection is much easier to understand since it can easily be compared to looking in a mirror. When light hits a surface some of it is absorbed in the surface and the rest of it bounces off at the same angle as it hit the surface. Certain surfaces like mirrors reflect light better than others do. This principle is also true for sound since we have established that it is also a wave. When sound hits a surface such as a wall, some of it reflects off the surface and the rest of it is absorbed. How much depends on the material: concrete reflects sound well, whereas tapestries do not. Reflections make a huge contribution to the acoustics of a room and are the primary cause of sound reverberation.
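A single reflection can be sketched with the image-source idea, a common simplification that is my own choice here and not taken from the sources of this paper. The wall is treated like a mirror: the reflected sound behaves as if it came from a copy of the source placed the same distance behind the wall. The two-dimensional layout, the wall at a fixed x position, and the function name are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 344.0  # m/s, the value used in this article


def first_reflection(source, listener, wall_x, c=SPEED_OF_SOUND):
    """Return (direct path, reflected path, extra delay in seconds) for one wall.

    source and listener are (x, y) positions in meters; the wall is the line x = wall_x.
    """
    sx, sy = source
    lx, ly = listener
    image_x = 2.0 * wall_x - sx          # mirror the source across the wall
    direct = math.hypot(lx - sx, ly - sy)
    reflected = math.hypot(lx - image_x, ly - sy)
    return direct, reflected, (reflected - direct) / c


if __name__ == "__main__":
    direct, reflected, delay = first_reflection((1.0, 0.0), (3.0, 0.0), wall_x=5.0)
    print(f"direct {direct:.1f} m, reflected {reflected:.1f} m, "
          f"echo arrives {delay * 1000:.1f} ms after the direct sound")
```

A real room adds many such reflections from every surface, each weakened by the absorption of the material it strikes, and their overlap is what we hear as reverberation.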

One of the best-known sound effects is the Doppler effect, which is caused by a listener in motion, a sound source in motion, or both. Consider a stationary listener standing on the side of the road while a truck, the sound source, drives by. The listener hears a higher frequency as the truck approaches; as it passes, the frequency drops and stays lower while the truck drives away. The frequency of the sound emitted by the truck never changes (as heard by the truck driver), and neither does the speed of sound through the air. What changes is the spacing of the wave crests that reach the listener. While the truck approaches, each crest is emitted a little closer to the listener than the one before, so the crests are bunched together and arrive more often, producing a higher frequency. Once the truck passes, each crest is emitted a little farther away, so the crests are stretched apart and arrive less often. The shift is easy to hear because the truck’s speed is a noticeable fraction of the speed of sound (344 m/s, compared to the 3x10^8 m/s speed of light).
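The truck example can be put into numbers with the standard Doppler relation for sound in still air. The horn frequency of 400 Hz and the truck speed of 25 m/s are made-up values for illustration, and the function name and sign convention are my own.

```python
SPEED_OF_SOUND = 344.0  # m/s, the value used in this article


def doppler_frequency(source_hz, source_speed, listener_speed=0.0, c=SPEED_OF_SOUND):
    """Frequency heard by the listener, assuming still air and motion along one line.

    source_speed is positive when the source moves toward the listener.
    listener_speed is positive when the listener moves toward the source.
    """
    if abs(source_speed) >= c:
        raise ValueError("source speed must be below the speed of sound")
    return source_hz * (c + listener_speed) / (c - source_speed)


if __name__ == "__main__":
    horn = 400.0   # Hz, hypothetical truck horn
    speed = 25.0   # m/s, hypothetical truck speed
    print(f"approaching: {doppler_frequency(horn, speed):.1f} Hz")
    print(f"receding:    {doppler_frequency(horn, -speed):.1f} Hz")
```

In a virtual environment the same calculation would be repeated every frame, using the component of each object's velocity along the line between source and listener.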

Hardware and Software for Virtual Sound

Now that we have briefly covered the way humans hear and the physics of the sounds we hear, we can look at the current hardware and software available for creating virtual surround sound. The hardware can be divided into two categories: sound output devices and sound processing devices.

Sound output devices have come a long way since stereo sound was invented, but the truth is that humans have only two ears and therefore hear only in stereo. Two different methods of sound output are currently used to simulate being surrounded by virtual sounds. The first is a multichannel setup consisting of four or more speakers surrounding the listener. By simply fading between channels we can create a 360-degree, two-dimensional sound field. However, as a user in a simulated VE spins around in the environment, the speakers must modify their output to follow the user’s movements, and once the user moves out of the "sweet spot," the center location for which the speakers are calibrated, all localization is compromised. The best application of this type of system is in VE simulators such as the CAVE, which surrounds the user with multiple projection screens. It is also popular in movie theaters and home entertainment systems. Yet this setup makes vertical sound localization almost impossible unless more speakers are added at different heights, and more speakers mean more hardware and more money.
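Fading between channels can be sketched as follows. This version uses constant-power panning between the two speakers nearest the source direction, which is one common way to fade and not necessarily what any particular product does. The square layout with speakers at 45, 135, 225, and 315 degrees is an assumption, and the angles must be listed in increasing order.

```python
import math

SPEAKER_ANGLES = [45.0, 135.0, 225.0, 315.0]  # assumed square layout, in degrees


def pan_gains(azimuth_deg, speaker_angles=SPEAKER_ANGLES):
    """Return one gain per speaker for a source at azimuth_deg.

    Only the two speakers on either side of the source are driven, and their
    gains follow a cosine/sine curve so the total power stays constant.
    """
    count = len(speaker_angles)
    azimuth = azimuth_deg % 360.0
    gains = [0.0] * count
    for i in range(count):
        start = speaker_angles[i]
        end = speaker_angles[(i + 1) % count]
        span = (end - start) % 360.0       # angular width of this speaker pair
        offset = (azimuth - start) % 360.0  # how far the source is into the pair
        if offset < span:
            t = offset / span
            gains[i] = math.cos(t * math.pi / 2.0)
            gains[(i + 1) % count] = math.sin(t * math.pi / 2.0)
            break
    return gains


if __name__ == "__main__":
    for angle in (45.0, 90.0, 0.0):
        print(angle, [round(g, 3) for g in pan_gains(angle)])
```

A source that sits exactly on a speaker comes from that speaker alone, while a source halfway between two speakers is shared equally between them.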

The alternative to a multichannel setup is stereo headphones, which integrate well with head-mounted VR goggles. With headphones, only two channels need to be rendered to create realistic sound, which reduces the hardware costs considerably. The head tracking used for the goggles can be used to alter the sound as well. However, in order to create a 3D sound space in real time, layers upon layers of processing must be applied to the sound before it reaches the user’s ears. Another drawback is that headphones bypass the external localization cues produced by reflections off the user’s head, shoulders, and pinnae, so those cues must be put back by filtering the sound with a head-related transfer function (HRTF). Because each user’s anatomy is different, each user perceives sound differently and would ideally require a different HRTF.

One of the first HRTF-based processors was the Convolvotron, which was developed by NASA and manufactured by Crystal River Engineering. It consists of two convolution engines, one per ear, which process the audio stream and output the results to a set of headphones. Convolution combines the audio stream with an impulse response, and the same process can apply a specific room’s impulse response to render the room’s 3D acoustics. Each engine of the Convolvotron uses a fixed head-related impulse response (HRIR, the time-domain counterpart of an HRTF) to create the sounds, and if that HRIR is reasonably close to the HRIR of the user, the user will receive the correct vertical and horizontal spatial cues through the headphones.
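The core operation of such a processor can be sketched in a few lines. This is plain direct-form convolution written for clarity and is not the Convolvotron's actual implementation; real systems use much faster methods. The impulse responses in the example are tiny placeholder values, not measured HRIR data.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two sequences of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono signal with a left and a right impulse response."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


if __name__ == "__main__":
    mono = [1.0, 0.5, 0.25]        # a short mono test signal
    hrir_left = [0.9, 0.3]         # placeholder values, not measured data
    hrir_right = [0.0, 0.6, 0.2]   # placeholder values; the leading zero mimics a later arrival
    left, right = render_binaural(mono, hrir_left, hrir_right)
    print("left: ", left)
    print("right:", right)
```

Each ear gets its own filtered copy of the same source, and the differences between the two copies carry the direction of the sound.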

Dolby Laboratories has recently developed a processor, Dolby Headphone, that converts their famous Dolby Pro Logic and Dolby 5.1 Surround Sound formats back to standard stereo for use with headphones. As with the Convolvotron, any headphones will work. The processor takes the 5.1 channel input, simulates rooms of various sizes to produce sound reflections off the walls, and sends the result to each ear. This system reduces the ambience found in commercial surround sound formats, since actual rear speaker locations are implemented in the processing. Because 5.1 sound is already being used in commercial first-person 3D games, the conversion from speakers to headphones is a snap. However, unlike the Convolvotron, this hardware does not support head tracking, which is essential for the full VE experience.

Another company that has made significant contributions to the 3D sound hardware realm is Aureal (the originators of A3D). Unfortunately, Aureal recently filed for Chapter 11 bankruptcy for reorganization purposes, so very little information is available on their latest advancements. However, Aureal’s latest sound card, the SQ3500, featuring A3D 3.0, did make it to the market. Aureal partnered with Dolby Laboratories in creating the latest version of A3D, which has features such as wavetracing, reverb, a geometry engine, and volumetric sound sources. Wavetracing resembles the ray tracing more commonly used for light sources: waves are traced from their source to different parts of the room and then to the virtual user. The geometry engine simplifies the wavetracing to the size and shape of the room, allowing echoes and reverberation to be created with realistic results. Volumetric sources are sound sources, such as a crowd or the ocean, that emit sound from a large area and not from a single point. A3D also incorporates Dolby’s ambient surround sound techniques used in 5.1 surround to create background sounds and music soundtracks for video games, and adds support for upcoming DVD games encoded in 5.1. A3D’s technology creates more realistic virtual surround sound in real time, but it does not yet support head tracking.

Since the hardware industry is so up to date on virtual sound, and since creating convincing surround sound requires very high-speed computation, there has been little need for software other than drivers for specific sound cards and files of the sounds to be placed in the environment. However, some companies have experimented with software plug-ins that emulate hardware functions, such as Sensaura’s Virtual Ear, which allows the user to adjust parameters to create a unique HRTF. Altering the head size, ear size, concha depth, and concha type can create a near-perfect HRTF for any user. The Virtual Ear is also compatible with many PC 3D audio technologies, such as A3D and DS3D (DirectSound3D, Microsoft’s 3D audio API). Other plug-ins for Winamp and other sound applications use virtual sound techniques as well, but have little use in VEs.

Implementation

Since there are several different ways (and levels of realism) to implement virtual surround sound, I will only explain how to do it using headphones and a digital signal processor (DSP) like the Convolvotron. The first step is to get a decent HRTF for the user. This can be done by using a generic HRTF, creating one with Sensaura’s Virtual Ear, or placing the user in a soundproof chamber with small microphones in their ears. The last method requires recording and analyzing the sounds from 144 different speaker locations in the chamber to create a map of the user’s HRTF. Once the HRTF is derived, it is uploaded to the DSP.

Inside the VE, sounds can be implemented exactly like objects, with [x,y,z] coordinates that can be translated for motion (such as our moving truck example). The VE then has to output the mono analog sound to the DSP along with the coordinates of the sound and the coordinates and orientation of the user. Each sound source must be processed separately, and most DSPs can handle about 16 different sounds at a time. The DSP then "convolves" two channels independently (one for each ear) using the analog sound, the coordinates, and the HRTF template. The two channels are sent to an amplifier and then to a set of stereo headphones.

Real-time head tracking, which is already used to render the graphics portion of the VE, alters the user’s coordinates and thus the input to the DSP. So as the user moves around in the VE, the sounds, whether static or in motion, change as well. This seems like a very simple process, but it has taken years to get the hardware fast enough to render 3D sound in real time.
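The geometry behind this step can be sketched as follows: given the coordinates of a sound and the tracked position and heading of the user, compute the direction and distance of the sound relative to the user's head. The coordinate convention (z up, heading measured counterclockwise from the +x axis, positive azimuth to the user's left) and the function name are assumptions for this example, and head tilt is ignored to keep it short.

```python
import math


def source_relative_to_listener(source, listener, yaw_deg):
    """Return (azimuth, elevation, distance) of a source as seen by the listener.

    source and listener are (x, y, z) positions; yaw_deg is the listener's heading.
    Angles are in degrees; azimuth is in the range [-180, 180).
    """
    dx = source[0] - listener[0]
    dy = source[1] - listener[1]
    dz = source[2] - listener[2]
    horizontal = math.hypot(dx, dy)
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    azimuth = math.degrees(math.atan2(dy, dx)) - yaw_deg
    azimuth = (azimuth + 180.0) % 360.0 - 180.0   # wrap into [-180, 180)
    elevation = math.degrees(math.atan2(dz, horizontal))
    return azimuth, elevation, distance


if __name__ == "__main__":
    truck = (0.0, 2.0, 0.0)
    user = (0.0, 0.0, 0.0)
    print(source_relative_to_listener(truck, user, yaw_deg=0.0))    # truck off to the left
    print(source_relative_to_listener(truck, user, yaw_deg=90.0))   # user has turned to face it
```

The resulting angles would select or interpolate the HRTF for that direction, and the distance would set the level, each time the head tracker reports a new position.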

Conclusion

In the world of virtual reality, all possibilities must be explored in order to create an effective realistic virtual environment. Since sound is the fastest form of input that humans have, virtual environments should not be designed without it. Studying the way in which we perceive sound and the physics of sound opens up new possibilities for adding realism. Using the current software and hardware to its fullest capabilities, surround sound can enhance numerous simulations and VEs as far reaching as military and EMT training, air traffic control, and flight simulations.

References

Worrall, David. "Physics and Psychophysics of Music (Course notes)". Online. Internet. 4 March 2001. http://online.anu.edu.au/ITA/ACAT/drw/PPofM/INDEX.html.

Hanavan, Perry C. "Virtual tour of the Ear". (15 November 2000). Online. Internet. 4 March 2001. http://www.augie.edu/perry/frames.htm.

Kuleza, Alex, Green, David & Christopher, Granite. "The Soundry". Online. Internet. 5 March 2001. http://library.thinkquest.org/19537/.

Dolby Laboratories. "Dolby Headphone". (2000). Online. Internet. 5 March 2001. http://www.dolby.com/headphone/.

Durham, Joel Jr. "The Future of 3D sound". (21 April 2000). Online. Internet. 4 March 2001. http://singapore.gamecenter.com/Hardware/Roundup/Futuresound/index.html.

Sensaura Technology. "The Virtual Ear". (2000). Online. Internet. 6 March 2001. http://www.sensaura.co.uk/wse/Tech/VirtualEar/STbotVEar.html.

 

 

 

 

Last Update: October 21, 2010

Copyright © Christopher Huyler and Huyler.net.
All rights reserved; unauthorized use prohibited.