Abstract

 

An audio-visual interactive composition was designed and written to explore listener envelopment, defined here as how closely connected listeners feel toward the work. Aspects of user control are also examined, as well as the differences in rendering the same interactive digital signal processing on different audio sources. The thematic content of both the audio and the visuals is discussed in relation to other aspects of the design, as is the concept of using the internet as a performance tool.

 

This project was designed in Max/MSP and JavaScript and has been implemented over the web utilizing the Web Audio API. A head-tracker controls aspects of the digital signal processing: the user’s head rotation and position are captured via webcam and analyzed, and that information controls the number of voices in a harmonizer, the Dry/Wet value of a chorus effect, the position of the sound source in a binaural field, and the gain of the processed audio.

 

Six etudes were written, recorded, and mixed binaurally as one piece, with content designed to highlight aspects of the design and explore listener connectivity. Visuals were created using WebGL in the Jitter environment of Max/MSP, drawing on Music Information Retrieval (MIR) techniques to further listener connectivity.

 

The project was reviewed by an expert panel with backgrounds in digital signal processing, interactive multimedia installations, computer music, visual arts, musical listening, and face-tracking technology. Results show that listeners felt more connected with the audio than in a typical listening experience. Results also indicate that the control aspects of the design are successful, and that the digital signal processing is effective on multiple audio sources.

Introduction and Motivation

At its core, computer music is a fusion of music composition and digital signal processing, allowing the computer to be viewed as an instrument instead of just a processor, and giving artists the ability to create unique compositions. Since its inception, computer music has seeped from the obscure into the mainstream, with signal processing techniques once only used by artists creating for niche markets now being used in all forms of popular music, from pop to rock to rap. The ability to use a computer as an instrument, and how users can interact with the computer, seems to have become a necessity for musicians, an ability described by Robert Moog as “one of potentially the most fruitful areas of electronic music instrument development” (Moog, 1978).

 

First developed in 1986, Max/MSP is a programming language designed mainly for use in music and multimedia environments. Its main user interface, which relies on visual objects, was first introduced by Miller Puckette in 1988. These objects allow users to quickly and easily manipulate bulk classes of code without disrupting the rest of the programming environment. The use of objects also makes Max/MSP accessible to musicians and others who are not familiar with programming. Users are not limited to the tools provided by Max/MSP, but can design their own as well, resulting in nearly endless possibilities (Puckette, 1988).

 

Recently, the consumer market seems to have trended towards headphone listening. During any walk on the street in a city like New York, one will encounter countless people consuming their chosen audio content over headphones. With this uptick, it would make sense that binaural listening in the consumer market may increase as well. Binaural audio takes stereo audio a step further, making full use of the directionality of the ears, allowing audio not only to be positioned in space from left to right, but in a full 360º sphere around the head, including top to bottom.

 

Since the onset of COVID-19, a majority of people have moved to working and learning in an online environment. Musicians have followed suit, broadcasting live concerts and events from home. However, the majority of these events are merely attempts at recreating live performance. Instead, musicians should look to finding new and unique ways to connect with an audience, making use of the tools at their disposal that they would not otherwise think to use, such as using the computer in new and interesting ways.

 

“Musicking,” a word coined by the author Christopher Small, is a verb form of the word “music,” defined as partaking in a musical performance in any capacity. Small argues that listening, dancing, rehearsing, composing, setting up a stage, running live sound, anything related to music, whether live or recorded, contributes to the music in its own way. He furthers this argument, stating that music should not be thought of as an object, but instead as an activity (Small, 1998).

 

This thesis explores the concept of “musicking,” aiming to better connect the listener to the audio through the creation of an interactive tool for binaural signal processing, with an audio-visual online installation built around it, thus giving the listener more agency than in a typical listening setting. The tool is based on a digital signal processing technique commonly used by the author: taking an input signal, harmonizing it, panning it in a binaural space, and applying a chorus effect to the signal and its copies. This has many applications in mixing, production, and sound design, effectively transforming any mono source into a unique binaural instrument. There is currently no single tool available on the market that accomplishes this.

 

The online installation built around this tool aims to better connect the listeners with the music, through the use of binaural audio, through listener interaction with the production, and through the use of a superimposed image of the listener on the visual component. A composition has been written centered around the designed tool. Listeners are able to control aspects of the digital signal processing through the use of a built-in head-tracker. The listener, the audio, and the visuals will all be interacting with each other, essentially creating an online environment in which the listener becomes a musician, and the computer becomes their instrument.

 

Literature Review

In order to design a tool such as the one proposed, a thorough background is required on the multiple facets contained within the design. First, one must look in depth at digital signal processing techniques, specifically the design of harmonizers, binaural panners, and choruses. A background on binaural audio, human perception, and head-tracking must also be introduced in order to make informed design decisions. Finally, similar interactive installations must be analyzed, looking at the current state of the industry and what can be learned from it.

 

2.1 DSP Tool Design

2.1.1 Harmonizer

The harmonizer is an effect that transposes an input signal to a new pitch, typically through use of a phase vocoder, and combines the transposed copy with the original input signal to create a new signal. Developed in 1966, the phase vocoder is an extremely powerful tool, used initially for speech processing (Flanagan & Golden, 1966). As opposed to its predecessor the channel vocoder, which works in the time domain, the phase vocoder works in the frequency domain, allowing for higher quality additive synthesis in the reproduction of the sound at a lower computational cost. Discrete Fourier Transforms (DFTs) are taken of the input over finite-duration windows, allowing analysis of the spectral content in terms of magnitude and phase. The analysis is then used to resynthesize the original input signal as a sum of sinusoids, allowing for copies of the original input signal (Dolson, 1986).

 

This ability to resynthesize the original input signal provides the ability to manipulate the new material in different ways. Two such options are through time scaling and pitch transposition. Originally, these were both done using similar processes. The temporal value of the input signal could be slowed down or sped up by simply spacing the inverse Short-Time Fourier Transforms (STFTs) further apart or closer together than the analysis STFTs, causing spectral changes to either occur slower or quicker. To change the pitch, first the signal is time scaled, and then the time scaled signal is played back at a new sample rate, preserving the original time and changing pitch. (Dolson & Laroche, 1999)

 

While the aforementioned techniques work very well, they are still somewhat likely to introduce artifacts, usually in the form of “phasiness” and “transient smearing.” This can occur due to improper unwrapping of the phase and is heard as a lack of presence or attack in the signals. When phase is unwrapped, it is important to consider the fact that phase gets reinitialized at the start of every cycle. If one were to compute the instantaneous frequency without this consideration, calculations would return drastic changes in frequency every 2π radians, which would cause these artifacts. Taking this into consideration as phase is unwrapped, one can avoid generating these artifacts (Park, 2010).
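As a hedged illustration of the principle just described (not the code used in this project), the following JavaScript sketch estimates the instantaneous frequency of a single FFT bin from the phase difference between successive frames, wrapping the deviation into the ±π range before converting it to Hz; all variable names are illustrative.

// Sketch: instantaneous-frequency estimate for one FFT bin (illustrative only).
// fftSize, hopSize, and sampleRate are assumed analysis parameters.
function instantaneousFrequency(bin, phase, prevPhase, fftSize, hopSize, sampleRate) {
  const expected = (2 * Math.PI * bin * hopSize) / fftSize;          // expected phase advance per hop
  let deviation = phase - prevPhase - expected;                      // deviation from that advance
  deviation -= 2 * Math.PI * Math.round(deviation / (2 * Math.PI));  // wrap into [-pi, pi]
  // Without the wrapping step above, the estimate would jump at every 2π radians,
  // producing exactly the artifacts described in the text.
  return ((expected + deviation) * sampleRate) / (2 * Math.PI * hopSize); // frequency in Hz
}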

 

In practice, these techniques have been shown to produce output that is both musically and technically sound. For example, AudioSculpt, developed by the Institute for Research and Coordination in Acoustics/Music (IRCAM), is an audio processing program based on the phase vocoder. It uses the aforementioned signal processing techniques to provide not only time-stretching and transposition, but also noise reduction through removal of unwanted frequencies in spectrum analysis, filtering by using the phase vocoder as a graphic equalizer with as many bands as frequency bins, and envelope preservation for accurate timbre representation. This last tool offers the ability to preserve the original timbre of the processed sound by creating a spectral envelope analysis that is subtracted from the sound pre-transposition and then added back in afterwards, preserving formant frequencies (Bogaards & Röbel, 2005).

 

One major focus when resynthesizing audio, especially the voice, is to maintain naturalness. (Farner, Röbel, & Rodet, 2009) proposed an update to the phase vocoder with a focus on naturalness and conducted a subjective listening experiment in which voices were changed to sound like a speaker of a different age or sex. While the listening test did not show positive results for the researchers’ goal of changing the perceived sex and age of the voices, it did show positive results for the naturalness of the new voices. The sound processing focused on detection and preservation of transients and an updated approach to spectral envelope estimation, pre-transposition subtraction, and post-transposition addition.

 

2.1.2 Binaural Panner

In signal processing, binaural audio can be achieved using a two-channel source played back over headphones. This is done by making use of how humans perceive sound through interaural time difference (ITD), the difference in the time it takes a sound to reach each ear as the wave travels around the head, and interaural intensity difference (IID), the difference in intensity of a wave as it reaches each ear. Any time a sound is heard, humans instinctively use this information to place the source in their sound field (Kendall, 1995).

 

In order to create a binaural audio scene from pre-recorded audio, audio signals are convolved with Head Related Transfer Functions (HRTFs), a process known as binaural synthesis. These HRTFs are frequency-domain representations derived from Head Related Impulse Responses (HRIRs), impulse responses measured at the ear that represent how the ear responds to frequencies arriving from all directions. While these depend on ear shape and head size and therefore differ from person to person, HRTFs can be sourced from dummy head recordings, which aim to recreate an average ear shape and head size. Such HRTFs are found to work for some, but not all, listeners (Geluso & Roginska, 2018).
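As a minimal sketch of binaural synthesis by HRIR convolution (assuming a Web Audio context and a stereo HRIR file whose name and path are placeholders, not assets from this project), a mono source can be placed in the binaural field as follows.

// Sketch: placing a mono source in a binaural field by convolving it with a
// left/right HRIR pair. The HRIR URL is a hypothetical placeholder.
async function binauralize(ctx, monoBuffer, hrirUrl) {
  const response = await fetch(hrirUrl);
  const hrirBuffer = await ctx.decodeAudioData(await response.arrayBuffer());

  const source = ctx.createBufferSource();
  source.buffer = monoBuffer;

  const convolver = ctx.createConvolver();
  convolver.normalize = false;    // keep the measured HRIR levels intact
  convolver.buffer = hrirBuffer;  // a stereo impulse response yields a two-channel, binaural output

  source.connect(convolver).connect(ctx.destination);
  source.start();
}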

 

One open-source software tool for binaural synthesis and rendering is the SuperCollider-based 3D Panner described in (Tatli, 2009), which positions itself as a tool for music creation, as opposed to other tools designed as research aids. This tool works with the IRCAM dataset of HRIRs, convolving the input signal with the HRTF for the appropriate position to produce the binaural signal. It has an added distance feature, entered in the Graphical User Interface (GUI), which reduces the signal by 6dB for each doubling of distance. However, this tool cannot be used to make the source appear closer, as it uses the original signal as the closest possible position to the listener.

 

In (Constantini, Fazi, Franck, & Pike, 2018), an open-source real-time binaural synthesis toolkit based in Python was introduced, aiming to aid research in binaural audio and to extend the tools and synthesis methods available at the time. The toolkit allows users to choose between three rendering methods: dynamic synthesis using HRTFs, High Order Ambisonics (HOA) synthesis, and virtual multichannel synthesis. The dynamic synthesis technique is similar to the one described above, in which HRTFs are convolved in real time with a mono input signal to place it in the sound field. The HOA technique works similarly, but uses an HOA input signal instead of a mono signal, and is capable of rendering object-based audio inputs to binaural. The third technique uses binaural room impulse responses (BRIRs) taken from a room with a multichannel loudspeaker setup and aims to place the listener in that specific space.

 

Other research has been done to find best practices in real-time binaural synthesis and panning. One such method is virtual vector base amplitude panning (VBAP). (Radu, Sandler, Shukla, & Stewart, 2019) implemented a VBAP system and conducted an evaluation against a First Order Ambisonics (FOA) approach. This VBAP method is derived using HRTFs measured from a virtual speaker array; for this experiment, an array of 8 speakers was used. Sound sources are located in the full binaural field by gain weighting the 3 closest speakers, which allows use of the full binaural sphere with only 16 HRTF measurements (one for each ear at each speaker position). The evaluation was conducted by recording HRTFs at 187 coordinates in each of the VBAP and FOA systems and comparing these new HRTFs with the originals, focusing on ITD error, IID error, and spectral error. For both the ITD and IID comparisons, the VBAP system outperformed the FOA system at all positions, while spectral error was too similar between the systems to determine better performance definitively, though the data trended toward better FOA performance.

 

2.1.3 Chorus

The chorus is a delay-based effect, achieved by adding delay lines, typically delayed by 10-25ms, to a signal. The addition of these lines slightly detunes the original input signal due to phasing of the fundamental frequency, creating a thickening effect. The delay lines are not static, but oscillate to create dynamic constructive and destructive interference with the input signal, adding to the effect (Park, 2010). Whereas more interference would be encouraged in a flanging effect, it is important to keep it controlled in a chorus. This can be done simply by ensuring the delay time stays within the 10-25ms range, as anything below 10ms would begin to introduce flanging and anything above 25ms would move toward an audibly separate sonic event. If more control is desired, a negative feedback path can be added within the delay circuit, which minimizes the interference of the delay lines with the input signal (Dattorro, 1997).
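A minimal modulated-delay chorus along these lines can be sketched with Web Audio nodes, with an LFO nudging the delay time around a centre value inside the 10-25ms range; the parameter values below are illustrative and are not those used in this project.

// Sketch: a basic modulated-delay chorus (illustrative parameter values).
function makeChorus(ctx, input, output) {
  const delay = ctx.createDelay(0.05);           // delay line, up to 50 ms
  delay.delayTime.value = 0.018;                 // centre delay inside the 10-25 ms range

  const lfo = ctx.createOscillator();            // slow sine oscillator
  lfo.frequency.value = 1.5;                     // modulation rate in Hz
  const depth = ctx.createGain();
  depth.gain.value = 0.004;                      // +/- 4 ms excursion, kept clear of flanging
  lfo.connect(depth).connect(delay.delayTime);   // modulate the delay time
  lfo.start();

  const wet = ctx.createGain();
  wet.gain.value = 0.5;

  input.connect(output);                               // dry path
  input.connect(delay).connect(wet).connect(output);   // modulated (wet) path
  return { delay, lfo, wet };
}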

 

(Platt, 2020) tested seven iterations of a binaural chorus effect, aiming to determine at what point in the signal chain binaural synthesis should occur, and at what azimuth positions the signal should be placed, to produce the best results. The chorus effect was first applied to the same audio input and tested with stereo output pairs convolved at different azimuth positions. Next, the effect was applied post-convolution and pre-summation, first convolving the same input at the positions mentioned above, then applying the chorus effect before the final left-right summation. The chorus effect was also applied post-convolution and post-summation, being placed on the stereo output of the binaural signal.

 

These versions were tested alongside a stereo version, analyzing listener preference, envelopment, chorus thickness, and externalization. Test subjects were students in New York University’s 3D Audio Course, as well as professional audio engineers. Results showed that the chorus effect performed best when applied post binaural synthesis and pre left-right summation.

 

2.2 Head-Tracking

Head-tracking is a technique typically used in audio to enhance binaural listening experiences, in which the listener can move his or her head within a sound field and in turn affect the location of a source in that sound field relative to the position of the listener’s head. This allows for added immersion, as the listener now has the added ability to use head motion to help determine source location. Head-trackers can provide three degrees-of-freedom (3DoF), in which the listener’s head orientation is tracked as rotation about three axes (yaw, pitch, and roll), or six degrees-of-freedom (6DoF), which adds tracking of movement within a space (Geluso & Roginska, 2018).

 

(Anderson, Begault, & Wenzel, 2001) tested subjects on localization, externalization, and perceived realism of the human voice in a binaural space, and how these were affected by individualized HRTFs, reverberation, and head-tracking. Localization errors in azimuth and elevation were found to improve only with added reverberation, and surprisingly were not improved by individualized HRTFs or head-tracking, most likely due to a lack of high frequency information in the audio signal. Head-tracking, however, was shown to significantly reduce front-back confusion of the sound source, which occurs when a listener flips a perceived source across the 90º azimuth axis, typically when there are no visual stimuli tied to the sound. Reverberation was also shown to significantly improve externalization, in which binaural sound produced over headphones is perceived in a field around the head as opposed to from ear to ear, as occurs with stereo signals. However, since the only stimulus used in this test was a 3-second audio clip of speech, the results cannot fully indicate whether these trends hold for other audio signals with different envelopes and spectral information.

 

At their core, head-trackers must provide certain elements in order to be used professionally in audio applications. These elements include stability, ensuring that the head-tracker does not drift when a position is tracked and motion stops; minimal latency, meaning there is no noticeable delay between head movement and system rendering; and resolution accuracy, ensuring that the movements tracked correspond to the actual movements of the listener. There are multiple head-tracking technologies available on the consumer market, each with its own advantages and disadvantages. One such technology is face tracking, using either one or multiple cameras and providing 6DoF. When one camera is used, a webcam can be routed into software, such as visageSDK, that tracks the head based on the movement of angles mapped to the face. When multiple cameras are used, the system has a better resolution of the face and can therefore track movements more accurately. However, the addition of cameras also adds latency. Another downside to this method is that lighting plays a strong role, and environments with changing light make it harder for the system to detect faces (Hess, 2012).

 

2.3 Gesture-Based Instruments and Human-Machine Interactivity

Interaction between machines and humans has captivated popular culture throughout the past century, with examples appearing throughout literature, film, and music. This in turn has driven the field of computer music forward. The first gesture-based instrument was the theremin, invented in 1920 by Leon Theremin, which immediately captured the attention of composers and performers. The theremin works by using the performer’s body as part of a capacitor, with two antennae controlling frequency and intensity respectively. The performer’s hand acts as a ground plate within the capacitor, breaking the electric fields from the antennae and controlling frequency and intensity through the distance between the hand and each antenna, with a change in distance directly affecting the change in electrical current (Yurii, 2006).

 

The theremin found, and continues to find, commercial success, with the instrument still being manufactured at the time of this research. One reason for this success may be that both composers and performers were drawn to the theremin, with pieces being composed specifically for the instrument as well as classical pieces being recreated featuring it. This is significant, as it shows that the theremin is viewed as a true instrument, not just an interesting piece of technology.

 

Another early gesture-based instrument, also created by Leon Theremin, is the terpsitone, which operates under similar principles as the theremin but on a much larger scale. The terpsitone works through the use of a large metal plate placed underneath a dance floor acting as a capacitor, with the upward and downward movement of dancers above either increasing or decreasing the capacitance, which in turn decreases or increases the frequency of an oscillator (Mason, 1936).

 

The terpsitone did not find the same commercial success as the theremin, due to several factors. One reason is most likely the size and design of the instrument: the terpsitone is quite large with an intricate design, which made it hard to reproduce. Another reason could be that the design left too much room for performer error, with minuscule movements directly affecting the frequency. While dancers may have more control over their bodies than the average person, the sensitivity of the terpsitone, combined with the fact that most dancers do not have the trained listening skills necessary to make minor pitch adjustments, proved too great an obstacle for the terpsitone to overcome.

 

As electronic music evolved, so did designers’ approaches to interacting with instruments and machines. In 1967, Max Mathews and F. Richard Moore developed GROOVE, or Generated Realtime Operations on Voltage-Controlled Equipment, at Bell Laboratories. GROOVE, described by the developers as “the environment for effective man-machine interactions” (Mathews, 1970), was a computer program that stored human movements, which in turn controlled a synthesizer. The movements were recorded to disc via multiple sensors attached to the computer, including a standard electronic keyboard and a “magic wand” tracked in three-dimensional space. Control lines would then send the information between the computer and the synthesizer at a sample rate of up to 200 samples/second (Mathews, 1970).

 

GROOVE was a significant project in furthering human-machine interactions. The wand itself is of note because it did not only store human movements but could also be programmed to be moved by the computer, meaning it could be controlled by both human and machine, one of the first designs to accomplish this feat (Park, 2009). GROOVE also inspired many artists, such as Laurie Spiegel, because of its ability to store and reproduce human movements and performances. Spiegel composed using GROOVE, focusing her composition techniques on the real-time interactivity the system allowed; her piece Appalachian Grove showed how versatile the system was. GROOVE set the standard for human-machine interaction, with the designers envisioning its future use for “anything a person normally controls” (Mathews, 1970). While GROOVE is no longer in use, due to the complexity of the design and the high cost of manufacturing (about $20,000 in 1967, roughly $160,000 at the time of this research), its legacy continues, as it was essentially the first form of automation in the audio realm.

 

Max Mathews’ work continued to further what was possible in the field of human-machine interaction in musical contexts. In the 1980s he began work on the Radio Baton, stemming from his desire to have “continuous information for expressive control” (Chadabe, p. 231) and his idea of “a controller that could sense the motion of the conductor’s hands and use those motions to control the expressive quantities of the instrument” (Park, 2009). The Radio Baton utilizes two batons, each with a radio transmitting antenna at the tip, and a surface with four receiving antennae. The receiving antennae track each baton in three-dimensional space, which allows the performer to control six properties of the sound.

 

In a discussion regarding the creation of new controllers and their musical ability, Mathews cites the Radio Baton as a controller that succeeds as a playable controller, whereas other designs are typically successful as creative uses of sensors but less successful in musical applications. He credits this playability to the fact that the score is saved in computer memory, removing “one of the mechanical problems that performers of normal instruments have” (Park, 2009). However, this also limits its ability to be used in other applications, such as improvisatory music.

 

The Radio Baton continues to influence other designs, such as the one presented in (Churnside, Leonard, & Pike, 2011), which allowed users to conduct a virtual orchestral performance by controlling dynamics and tempo through skeletal tracking. The audio for the installation was pre-recorded, and video of the recording was captured from the conductor’s position using a fish-eye lens camera. The installation was designed so that users first had their bodies scanned with their arms raised. After scanning and initialization into the skeletal tracking software, participants used vertical arm gestures to control dynamics and horizontal arm movement to control tempo. Temporal changes affect the rate at which the video plays back, and the audio is processed in the frequency domain, allowing for changes in speed without changes in pitch. The feedback from test subjects was too varied to draw absolute conclusions, as the design relied too heavily on the assumption that users had an understanding of conducting. This left users without that background unable to interact properly with the installation or to provide an accurate assessment of the perceived realism. While inconclusive, these results can still inform future decisions about interactive installation design and show the importance of understanding end-user background and capabilities when designing interactive installations.

 

MIDI, first introduced in (Smith & Wood, 1981), created a universal protocol for interfacing between humans and computers via a controller, which led to further innovation in human-machine interaction. MIDI controllers were developed both from traditional instruments, such as pianos, cellos, and trumpets, as well as from new approaches. One such new approach was demonstrated in Don Buchla’s Lightning, designed in 1991, and Lightning 2, designed in 1996, which utilized a wand controlled by a performer to send information to a receiver, which then translated the movements into MIDI data. The Lightning works by tracking the location and velocity of the wand in a predefined two-dimensional space via infrared sensors; a hardware receiver translates this into MIDI data, allowing it to be mapped to any user-defined parameter (Scott, 2010). While used in a number of compositions, the Lightning has not found much popularity. This could be due to the programming-intensive aspect of the design, as the MIDI mapping is not user friendly for those without a strong foundation in MIDI programming.

 

The Lady’s Glove, designed by Laetitia Sonami, was created to bridge the gap between controller and instrument. The Lady’s Glove works by connecting a multitude of sensors to a glove, each of which measures a different aspect of the hand’s motion and location. These sensors include accelerometers, pressure pads, switches, and light sensors, among others. The data transmitted from these sensors is sent into a Max/MSP patch, where it is decoded and used to control different aspects of sound processing (Chadabe, 1997).

 

Although it is used to control parameters, Sonami describes the glove as more instrument than controller due to the human limitations of control, which cause “musical thinking and ideas [to] become more a symbiosis between the controller, the software, and the hardware” (Rodgers, p. 229). This approach seems to be a common thread among the majority of successful gesture-based instruments: viewing the instrument as an extension of the body rather than as something separate. Sonami continues to use and develop The Lady’s Glove, demonstrating the success of the design; however, as the glove is a singular instrument and not meant to be recreated and reproduced, it is difficult to determine whether the design is user friendly and how well it would be received by a consumer market.

 

While these examples only begin to scratch the surface of the evolution of gesture-based instruments, analysis of the successes and failures of each can inform the creation of future designs. Projects furthering human-machine interaction will continue to drive electronic music forward, resulting in designs where there is little distinction between where the human ends and the machine begins.

Methodology

3.1 Design Overview

A high-level overview of the system designed for this work is visualized in Figure 1. First, the final composition was mixed in a binaural environment, working under the assumption that all voices from the harmonizer will be active in order to maintain enough headroom before clipping. From that mix, nine separate tracks were bounced: a two-channel binaural mix, and eight audio files containing the harmonized audio. When a user connects to the webpage, these nine audio files are loaded into the Web Audio API via buffers, where the remaining digital audio processing occurs. The eight audio sources are fed to a harmonizer toggle, determining how many voices are played back at any given time, then placed in a binaural space and put through the chorus effect; the binaural mix of the remaining audio content bypasses this processing. Once all files are ready for playback, a function is called that decodes all the audio buffers and begins playback of both the audio and video to ensure everything remains in sync. Concurrently, the user’s webcam feeds information to the head tracker, which interpolates the head movement frame-by-frame and sends the information to the harmonizer toggle, binaural panner, and chorus effect. The face mesh representation of the user’s face is superimposed on the browser, above the video file. The video file was created with a non-interactive mix of the full binaural scene, where the audio was analyzed and used to create video processing effects.

 

3.1.1 Harmonizer in Max/MSP

The harmonizer has been designed as an 8-voice harmonizer in Max/MSP. The harmonizer makes use of the phase vocoder technique, allowing for natural sounding pitch shifting of the audio content. DFTs are taken of the input signal every 1024 samples, overlapping 4 times per frame. After the center frequency is determined, it is multiplied by the predetermined transposition amount, then resynthesized at the new pitch.

 

In Max/MSP, the input signal is stored in a buffer~ object, allowing the input signal to be accessed within the eight separate pfft~ objects. Transposition amounts are defined using the kslider object, with middle C equating to no transposition. Message boxes containing MIDI notes corresponding to predetermined voicings of each chord used throughout the piece are fed to the kslider objects, avoiding the voice-stealing that occurs within the poly~ object and thus ensuring that each voice follows smoother voice leading. The transposition amount in semitones from the kslider object is then converted to a frequency multiplier to be used within the pfft~ subpatch.
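The semitone-to-multiplier conversion follows the standard equal-temperament relation; a one-line JavaScript equivalent (shown here for consistency with the web implementation, not as the patch itself) is:

// Equal temperament: +12 semitones doubles the frequency, -12 halves it.
const transpositionMultiplier = (semitones) => Math.pow(2, semitones / 12);
// e.g. transpositionMultiplier(7) ≈ 1.498, a perfect fifth above the input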

 

The pfft~ object is passed a non-zero fifth argument when it is called, ensuring it processes full-spectrum FFT frames and not the default mirrored frames. Other arguments for the pfft~ object set the FFT size to 1024 samples; the overlap factor to 4, allowing for 4 overlaps per frame; and the start onset to 0, to ensure harmonizing begins on the first sample. Within the pfft~ subpatch, a Hanning window is applied to the signal before it is fed into the fft~ object to prevent spectral leakage. The Cartesian coordinates taken from the output of the fft~ object are converted into polar coordinates through complex math instead of through the arctan~ object, outputting phase and amplitude values at a lower computational cost. These phase values are then used to calculate the phase difference in each frequency bin between frames, unwrapping the phase and in turn determining the center frequency. The sah~ object is used to store the transposition multiplier, ensuring that it remains constant for all FFT frames. Each of the 8 voices was routed out of Max/MSP to Pro Tools, where it was individually recorded, creating the 8 audio files used within the harmonizer toggle in the JavaScript implementation.

 

3.1.2 Proof of Concept in Max/MSP

Before implementing on the web, a proof of concept was designed within Max/MSP. This patch used the harmonizer designed above, sent the audio into a binaural panner, and then passed it through a chorus effect. For gestural control in this proof of concept, a Leap Motion device was used. The Leap Motion detects hand movement through cameras and infrared sensors, which is then decoded in Max/MSP using the aka.leapmotion object. The decoded information toggles how many voices are harmonized depending on the hand’s position along the y-axis, controls the Dry/Wet value of a chorus effect depending on the hand’s position along the x-axis, and controls an overall gain along the z-axis. This proof of concept was then used to determine the most efficient and user-friendly way to process the audio in the JavaScript implementation.

 

3.1.2.1 Binaural Panner in Max/MSP

The binaural panner placed the harmonized copies in a binaural space along the azimuth plane. The untransposed input signal was placed at 0º azimuth, with the copies at 45º, 315º, 90º, 270º, 135º, 225º, and 180º, as shown in Figure 3. The voices were organized with the first pair at 45º and 315º, the second pair at 135º and 225º, the third at 90º and 270º, and the final voice at 180º. Convolution was done using HRTFs from the Neumann KU100 microphone, chosen because they provide the best overall externalization results for listeners. The Max/MSP based HISSTools Impulse Response Toolkit presented in (Harker & Tremblay, 2012) was used to perform the real-time convolution, specifically the hirt.convolver~ object, which convolves the signal in real time with pre-selected impulse responses, in this case the aforementioned HRTFs. In order to perform this convolution in real time, a hybrid convolution method is used, in which the first samples of the signal are processed with linear convolution and FFT-based convolution is used on the subsequent samples.

 

3.1.2.2 Chorus Effect in Max/MSP

The chorus effect for this proof of concept was designed in Max/MSP, utilizing the tapin~ and tapout~ objects to control delay lines. A narrower subset of the 10-25ms range typical in chorus effects was chosen in order to ensure that the delay lines did not stray too close to flanging or to separate audio events. The chorus depth was set to a fixed percentage of 16%, and the chorus rate was set at 3.3Hz and oscillated through the rand~ object. Separate feedback paths were created for the left and right channels for additional control. The output of each channel was sent through the rampsmooth~ object to minimize digital artifacts, and a modulating bandpass filter was placed on the dry signal through the reson~, phasor~, and cos~ objects to add slightly to the effect, modulating at the same rate as the chorus itself. This bandpass filter was set with a center frequency of 6666Hz and modulated +/- 3333Hz. As the chorus was designed to take a mono input and provide a stereo output, the output gain was scaled to more closely match the input gain.

 

 

3.2 JavaScript Implementation

3.2.1 Web Audio API

For the browser implementation, all signal processing is done within the Web Audio API. The Web Audio API works by creating an Audio Context in which signal processing occurs, connecting nodes that perform different types of digital signal processing. This process works in conjunction with WebRTC (Web Real-Time Communication) and the Media Capture and Streams API, which allow both the audio and video components of this project to be implemented in the browser. Due to internet security protocols, this Audio Context cannot be instantiated until after a user gesture on the screen. For this work, a modal containing information about the project first loads on top of the webpage, and when the user clicks out of the modal the Audio Context initializes. Once initialized, the nine audio files are loaded into 17 buffers: one for the stereo mix and sixteen for two versions of each of the harmonized voices to be used in the Dry/Wet mix of the chorus effect processing. These buffers are decoded and then pass through the signal processing nodes detailed below before being played back through the user’s audio device.
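The following is a hedged sketch of that gesture-gated initialization and synchronized start; the file names, element IDs, and the direct connection to the destination are placeholders standing in for the project’s actual assets and processing graph.

// Sketch: create the AudioContext after a user gesture, decode the audio files,
// and start audio and video together. Names and paths are placeholders.
let ctx;
document.getElementById('intro-modal').addEventListener('click', async () => {
  ctx = new AudioContext();                        // allowed only after a user gesture

  const urls = ['mix.wav', 'voice1.wav' /* ...remaining harmonized voices... */];
  const buffers = await Promise.all(urls.map(async (url) => {
    const data = await (await fetch(url)).arrayBuffer();
    return ctx.decodeAudioData(data);              // returns an AudioBuffer
  }));

  const sources = buffers.map((buffer) => {
    const node = ctx.createBufferSource();
    node.buffer = buffer;
    node.connect(ctx.destination);                 // in the project, the processing nodes sit here
    return node;
  });

  const startAt = ctx.currentTime + 0.1;           // small shared offset keeps everything aligned
  sources.forEach((node) => node.start(startAt));  // synchronized, sample-accurate start
  document.getElementById('piece-video').play();   // begin the visual component at the same time
});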

 

3.2.2 Head-Tracking and Facial Mesh Representation

Head tracking is performed through the visageSDK HTML and JavaScript API, generously provided by Visage Technologies for use in this work. This head tracker is extremely robust, tracking 3D head position, facial expressions, gaze information, eye closure, facial feature points, and a 3D head model; however, for this work the focus is on 3D head position, specifically the yaw, pitch, and roll information, which is tracked in radian changes from center through pixel-by-pixel analysis of facial feature points. All of this information is returned and accessed through the FaceData class.

 

Pitch rotation, defined as up and down movement of the head, is used to control the number of voices in the harmonizer. A neutral pitch rotation, with the user looking straight ahead, results in 4 voices, with a rotation of -.4 radians, looking up, resulting in all 8 voices, and a rotation of .2 radians, looking down, resulting in only one voice. Voices are added or subtracted at intervals of .1 radians. Yaw rotation, defined as the user looking to his or her left or right, is used to control the Dry/Wet value of the chorus. A neutral yaw rotation, looking straight ahead, results in a Dry/Wet value of 50%, with a rotation of -.8 radians, looking directly to the left, resulting in just the dry signal and a rotation of .8 radians, looking directly to the right, resulting in a signal that is 100% wet. Unlike the harmonizer, these values are controlled linearly, with each subtle movement affecting the Dry/Wet amount. Roll rotation, defined as moving the chin up to the left or the right, affects where the voices are positioned in the binaural space. A neutral roll rotation, looking straight ahead, leaves the voices in place at 0º, 45º, 315º, 90º, 270º, 135º, 225º, and 180º, as depicted in Figure 3. A roll rotation of -.5 radians, with the user moving his or her chin up and to the left, results in the voices collapsing to the right, while a roll rotation of .4 radians, with the user moving his or her chin up and to the right, results in the voices collapsing to the left. An illustration depicting yaw, pitch, and roll is shown in Figure 5. The user’s distance from the webcam is also calculated and used to control the overall gain of the voices being processed. All radian change amounts were determined through the maximum rotation the designer could make in each direction before tracking was lost.
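A hedged sketch of this mapping is shown below; it interpolates between the end points given above rather than reproducing the exact 0.1-radian step table, and the function names are illustrative.

// Sketch: mapping head rotation to control values (end points taken from the text).
function voicesFromPitch(pitchRad) {
  // -0.4 rad (looking up) -> 8 voices, 0 (neutral) -> 4 voices, 0.2 rad (looking down) -> 1 voice
  if (pitchRad <= 0) {
    const t = Math.min(1, -pitchRad / 0.4);
    return Math.round(4 + 4 * t);
  }
  const t = Math.min(1, pitchRad / 0.2);
  return Math.round(4 - 3 * t);
}

function wetAmountFromYaw(yawRad) {
  // -0.8 rad (left) -> fully dry, 0 -> 50% wet, 0.8 rad (right) -> fully wet, linear in between
  const clamped = Math.min(0.8, Math.max(-0.8, yawRad));
  return (clamped + 0.8) / 1.6;   // 0..1 Dry/Wet value
}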

 

In order to retrieve the current values of the yaw and pitch, the .getFaceRotation() method from the visageSDK API is called on an array containing the FaceData every frame, and each element in the array is accessed individually to be sent to the harmonizer and chorus effect. The .getFaceTranslation() method from the visageSDK API is also called, accessing the element in the FaceData array corresponding to distance from the camera to be used in gain control. These values are sent to a hidden HTML form via the Document Object Model (DOM). An event is triggered via the .dispatchEvent() method of the Event class, allowing the values of the form to be sent to the corresponding digital signal processing via the DOM behind the scenes.
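A hedged sketch of this DOM-based hand-off is shown below; the element id, event type, and function names are placeholders rather than the project’s actual identifiers.

// Sketch: publishing a tracker value through a hidden form field and a dispatched event.
function publishYaw(yawRadians) {
  const field = document.getElementById('yaw-field');   // hidden form input (assumed id)
  field.value = yawRadians;
  field.dispatchEvent(new Event('change'));             // the audio code listens for this event
}

// Elsewhere, the chorus code picks the value up when the event fires:
document.getElementById('yaw-field').addEventListener('change', (event) => {
  const wet = parseFloat(event.target.value);           // scaled to a Dry/Wet amount downstream
  // ...update the wet and dry gain nodes here...
});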

 

The roll information is accessed in a similar manner. The .getFaceRotation() method is called, and the index of the array containing the roll information is accessed and sent via the DOM to the hidden form. However, it was determined that accessing this information frame by frame would cause audible glitches in the audio. In order to avoid these glitches, the information is retrieved using the setInterval() method. The roll value is stored in a variable every 100ms, which is then checked against a variable containing the roll value taken every 110ms. If these values differ, .dispatchEvent() is called and the roll value used in the audio processing is updated. If the values are the same, the roll value does not need updating. The designer found that this method resulted in the smoothest audio experience.

Figure 7. Retrieving and sending roll values.

 

The facial mesh representation of the user is calculated through analysis of the feature points and then drawn on the screen using the three.js 3D computer graphics JavaScript library, which allows 3D graphics and animations to be drawn in the browser via WebGL. The vertices of the feature points are analyzed against a preloaded 3D model, which is then used by three.js to render the mesh. The facial mesh is sized and positioned by first determining the user’s window size via the window.screen.width and window.screen.height properties of the Web API Screen interface, and then scaling accordingly.

 

3.2.3 Binaural Panner

Within the Web Audio processing, the signals are first passed through the binaural panner. This panner convolves each of the input signals with a predetermined HRIR, making use of how humans perceive sound to place each signal at a point in the binaural space. This implementation of the binaural panner works through the .createConvolver() method of the Web Audio API, which performs real-time convolution. The HRIRs are loaded into 11 arrays containing 8 audio buffers each, corresponding to the coordinates where the audio can be placed in the binaural space. The roll of the listener’s head position, accessed via the .addEventListener() method and the DOM, determines which one of these arrays the audio is convolved with at any given moment, calculated through a conditional statement analyzing the roll position.

Figure 9. Populating the convolution buffers and the conditional statement determining convolution.

At a neutral roll position, the array places the audio in a binaural space at azimuth positions 0º, 45º, 315º, 90º, 270º, 135º, 225º, and 180º, with elevation positions of 0º. Rolling the head in either direction results in the azimuth positions collapsing completely left or right to either 270º, 248º, 204º, 226º, 314º, 292º, 336º, and 0º, or 90º, 112º, 68º, 134º, 46º, 24º, 156º, and 1º after full rotation, with elevation positions of 75º for both. The azimuth and elevation positions increment linearly toward these positions through the other convolution arrays with each new roll value. Each time a new position is determined, the .setValueCurveAtTime() method is called on gain nodes that the audio passes through before the convolution. This method allows programmers to specify an array of gains for the audio to pass through, and the time it takes to iterate over the array. The gain passes through the array [.9, .75, .5, .25, 0, .25, .5, 1] over 200ms, ramping the gain down to zero (muted) and back up to 1 (full volume), effectively smoothing out any clicks that would occur due to the sudden change.
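A hedged sketch of this click-suppression ramp is shown below, using the gain curve and 200ms duration given above; swapping the convolver buffer near the silent midpoint of the ramp is an assumption about the mechanism, and the names are illustrative.

// Sketch: mute-and-restore ramp applied whenever the convolution position changes.
const rampCurve = new Float32Array([0.9, 0.75, 0.5, 0.25, 0, 0.25, 0.5, 1]);

function smoothBufferSwap(ctx, gainNode, convolver, newHrirBuffer) {
  gainNode.gain.setValueCurveAtTime(rampCurve, ctx.currentTime, 0.2);  // 200 ms ramp down and back up
  // Change the impulse response while the gain curve passes through zero,
  // so the discontinuity introduced by the new HRIR is inaudible.
  setTimeout(() => { convolver.buffer = newHrirBuffer; }, 100);
}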

 

3.2.4 Harmonizer Toggle

After passing through the binaural panner, the signals are passed through a harmonizer toggle. The harmonizer toggle works by checking the current value of the pitch rotation information, accessed via the DOM, and setting the gain values for the corresponding number of voices to either 1 (on) or 0 (muted), as determined through conditional statements. This is done in real time using the DOM and the .setValueAtTime() method of the Web Audio API, which allows the gain value of each voice to be set in real time.
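A hedged sketch of the toggle logic follows; voiceGains is assumed to be an array holding the eight per-voice GainNodes, ordered so that lower-indexed voices are kept active longest.

// Sketch: toggling how many harmonizer voices are audible.
function setActiveVoices(ctx, voiceGains, voiceCount) {
  voiceGains.forEach((gainNode, i) => {
    const value = i < voiceCount ? 1 : 0;                  // 1 = on, 0 = muted
    gainNode.gain.setValueAtTime(value, ctx.currentTime);  // applied in real time
  });
}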

 

3.2.5 Chorus Effect

The chorus effect works through use of both the .createDelay() and .setValueCurveAtTime() methods of the Web Audio API. It works by delaying the input signal by a predetermined amount, which slightly detunes the fundamental frequency of the input signal and creates the desired thickening effect. First, an array of delay times, each between 15 and 25ms, is created using the Math.random() function. The array is 1020 elements long, which is roughly the length of the audio in seconds. Delays are created for the eight “wet” voices via the .createDelay() node, and the delay times then change over time via .setValueCurveAtTime(), iterating over all 1020 elements in the delay time array over the 1020 seconds of the piece. These delays then pass through a modulating allpass filter, which results in a phase shift that adds to the effect (Park, 2010).
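A hedged sketch of this wandering delay line is shown below; the array length and 15-25ms range follow the text, while the routing of the allpass filter and the surrounding graph is omitted.

// Sketch: a delay whose delay time wanders between 15 and 25 ms across the piece
// (1020 random values spread over roughly 1020 seconds, as described above).
function makeWanderingDelay(ctx, pieceLengthSeconds = 1020) {
  const delayTimes = new Float32Array(1020);
  for (let i = 0; i < delayTimes.length; i++) {
    delayTimes[i] = 0.015 + Math.random() * 0.01;   // 15-25 ms, in seconds
  }
  const delay = ctx.createDelay(0.05);
  delay.delayTime.setValueCurveAtTime(delayTimes, ctx.currentTime, pieceLengthSeconds);
  return delay;
}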

 

In order to make the chorus effect interactive, gain nodes are created for both the wet and dry audio. The current yaw rotation information is accessed via the DOM and scaled to a value between .01 and 1, representing the Dry/Wet percentage. For a Dry/Wet value under 50%, this scaled value is subtracted from 1 to set the gain of the wet signals, and the dry signals pass through unaffected. For a Dry/Wet value above 50%, the scaled value is subtracted from the gain of the dry signals, and the wet signals pass through unaffected.

 

 

3.2.6 Gain Control

The final processing of the signals is done through gain nodes created for all 16 audio tracks. The user’s distance from the webcam is accessed via the DOM and set via the .setValueAtTime() method. Before setting the gain, the incoming value is first subtracted from 1 in order to reverse the direction, so that moving closer to the webcam increases the gain and moving away decreases it. This value is then slightly increased, so that what the programmer defined as a standard distance from the webcam results in the gain the programmer determined best fits the mix.
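A hedged sketch of this distance-driven gain follows; it assumes the tracker’s distance value is roughly normalized around 0-1 (as the subtraction from 1 suggests), and the 0.2 offset is a placeholder for the programmer’s chosen value.

// Sketch: master gain driven by the user's distance from the webcam.
function setGainFromDistance(ctx, gainNodes, cameraDistance) {
  const gain = (1 - cameraDistance) + 0.2;   // invert direction, then add a small offset
  gainNodes.forEach((node) => {
    node.gain.setValueAtTime(Math.max(0, gain), ctx.currentTime);
  });
}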

 

3.2.7 Server Side

A dedicated port on a server from the Computer Music Group at New York University is used for this project, controlled via a node.js file. Node.js allows JavaScript to run both in the browser and on the server, and enables websites to work in real time with push capability and two-way connections between the client and the server via sockets. The Express framework is used within node.js to create the server, which listens for requests on the specified hostname and port. An SSL key and certificate are also served, allowing for https access, which is needed for the browser to use the webcam and to create the Web Audio API Audio Context. When the server receives a request, it sends the HTML and associated JavaScript files to the user, giving the user the full experience. The server is accessed via Secure Shell Protocol (SSH), allowing for remote access. File transfers between the server and a local computer are done using SSH File Transfer Protocol (SFTP).
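A minimal sketch of a server of this kind is shown below; the certificate paths, hostname, and port are placeholders rather than the actual server configuration.

// Sketch: an Express application served over HTTPS so the webcam and AudioContext are available.
const express = require('express');
const https = require('https');
const fs = require('fs');

const app = express();
app.use(express.static('public'));           // serve the HTML and JavaScript files

const options = {
  key: fs.readFileSync('ssl/server.key'),    // SSL key and certificate enable https access,
  cert: fs.readFileSync('ssl/server.crt'),   // required for webcam use and the Audio Context
};

https.createServer(options, app).listen(8443, 'localhost', () => {
  console.log('Listening on https://localhost:8443');
});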

 

3.3 Audio Visual Composition

3.3.1 Lyrical Content

The thematic content of the lyrics is loosely based on Kurt Vonnegut’s Player Piano (1952) and closely related to the aforementioned design aspects of this project. The novel examines the role of machines in the workforce and the effect this automation has on people; the lyrical content of this piece looks to do the same, examining the role of machines in everyday life and how, throughout history, people seem originally averse to technological innovation until they learn to embrace it. This is personified through the narrator of the story, who, after setting the scene for the listener in the first etude, learns that he is to be uploaded and integrated into a machine. He at first resists, only to realize he has no way of stopping the inevitable. Once the upload is complete, he realizes that he is at peace and that there was no reason to be hesitant about integration into a machine.

 

3.3.2 Production and Mixing of Audio

The production and mixing of the audio aim to parallel the thematic content established by the lyrics. The piece is broken down into a suite of six etudes, with each one aiming to further the story of the piece in its own way. The first etude begins with a fairly straightforward instrumentation and light use of effects. As the piece progresses through the remaining five etudes, more effects are used and the instrumentation moves away from the use of acoustic instruments, until it reaches the climactic point in which no acoustic instruments are used and the instrumentation is purely electronic. The processing on the vocals mirrors this, adding noticeable autotune as the piece progresses. After this climax, the piece moves back to a similar instrumentation of the first section, reprising the melody, but with a slightly more electronic instrumentation. This is meant to highlight the integration of man and machine and their peaceful coexistence as discussed in the lyrics.

 

All acoustic audio content was recorded and mixed in Pro Tools with standard monophonic and stereo microphone configurations. The binaural mix was created using the Sennheiser AMBEO Orbit plugin, which uses an HRTF dataset similar to the one used in the JavaScript implementation of the binaural convolution, with both based on impulse responses from the Neumann KU100 microphone. The binaural scene was approached from a creative standpoint; that is, the binaural rendering was not meant to recreate a realistic audio scene (such as that of a listener in an audience or an in-the-band mix), but instead to make creative use of the binaural space. Five different sources were recorded to be used in the online digital signal processing: background vocals, a Wurlitzer electric piano, an electric guitar, an upright piano, and a monophonic digital synthesizer. Each of the etudes uses a different one of these sources in the online digital signal processing, exploring how different timbres are affected by the processing. Mixing was done using Beyerdynamic DT-990 Pro headphones, as open-back headphones have been shown to produce better externalization results when listening to binaural audio (Platt, 2020).

 

3.3.3 Production of Visual Component

The visual component contains two layers: a background layer created using the OpenGL environment in Max/MSP/Jitter, and a foreground layer consisting of the mesh representation of the user’s face. The foreground layer is meant to further the theme of the integration of humans and machines, as it essentially uploads and renders an image of the user into the computer. The background layer makes use of the psychological ties between the human auditory and visual systems, as presented in (Jones & Nevile, 2005). Music Information Retrieval (MIR) and other digital signal processing techniques are used to affect this image. An STFT of the audio is displayed on an open cube shape inside the OpenGL world throughout the entire piece. The amplitude level of each channel of the audio controls the scale and rotation of two spheres, through use of the peakamp~ object in Max/MSP. The spectral centroid, the point where the center of mass of a signal’s frequency spectrum lies, is scaled to control the color of the cube via the zsa.easy_centroid~ object. The background color fades linearly from blue to red through the first four sections of the piece, before becoming black in the fifth section and violet in the last section, again demonstrating the balance between man (depicted here as blue) and machine (depicted here as red). The alpha value for the erase color decreases with each section, creating an increasing smearing effect as the piece progresses.
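For reference, the spectral centroid used to drive the cube’s color can be expressed as the magnitude-weighted mean frequency of an FFT frame; the JavaScript sketch below illustrates the calculation (the zsa.easy_centroid~ object performs the equivalent analysis inside Max/MSP).

// Sketch: spectral centroid of one FFT frame, the "centre of mass" of the spectrum.
function spectralCentroid(magnitudes, sampleRate, fftSize) {
  const binWidth = sampleRate / fftSize;   // Hz per FFT bin
  let weighted = 0;
  let total = 0;
  for (let i = 0; i < magnitudes.length; i++) {
    weighted += magnitudes[i] * i * binWidth;
    total += magnitudes[i];
  }
  return total > 0 ? weighted / total : 0; // centroid frequency in Hz
}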

 

Videos are projected on the two spheres, sourced from footage shot by the developer as well as footage from the Prelinger Archive, an open-source archive of films related to U.S. cultural history. The videos for each section are meant to slightly relate to the story contained within that section. For example, as the narrator is being uploaded to the machine in the story, the videos show a reversed transition of a shot from a computer screen to a newscaster and two men walking around a life-sized model of a computer motherboard. The new videos for each section are triggered via a message box, which also sends the new bpm value to peakamp~ in order to ensure that the color changing occurs on beat throughout the piece. This message box also controls the perspective of the camera in the OpenGL world. The camera moves closer to the open cube shape with each section, until it goes completely through and looks back during the climactic section of the audio. The camera position then returns to inside the cube, furthering the metaphor of the integration of man into machine.

 

Evaluation

 

4.1 Description of Experts

Experts of different backgrounds were chosen for the evaluation of the experience. Two experts were chosen from the faculty at New York University, with backgrounds in interactive and multimedia installations as well as computer music and web audio. Another expert was chosen from the faculty at Boston College, with a background in multimedia visual arts, able to provide feedback from a different perspective. A representative from Visage Technologies was selected as well, providing expertise on the face-tracking technology and analyzing its use in this project. A fifth panelist was chosen with expertise in musical listening and subject matter: a Senior Data Curator at Spotify who applies music and cultural knowledge to train, evaluate, and improve the quality of personalized user experiences.

 

4.2 Interview Protocol and Reporting of Feedback from Experts

Due to COVID-19, interviews with the expert panel were conducted asynchronously, allowing panelists to listen to and interact with the project at their leisure. As this project is intended to be experienced in the browser, no aspect was lost by conducting the interviews in this manner. Panelists were provided with a brief description of the project and were informed of best practices for listening in a binaural environment as well as how to use their head movements to control the audio. A list of question prompts was provided to the panelists to focus the evaluation on certain areas. Panelists were asked to rate their sense of envelopment, specifically their connection to the audio as compared to a typical listening experience. Another area of focus was control: whether the control of the audio through head movements worked as expected, and whether the panelists desired any more or less control. Panelists were asked to discuss the different sources being processed and their preference among the sounds and interactions with each source. The next area of analysis was the content: whether the panelists picked up on any themes, whether they felt the content matched the project, and whether any specific moments stood out. Panelists were also encouraged to discuss anything else they felt throughout the experience.

 

4.3 Results and Discussion

4.3.1 Envelopment

Panelists were asked to compare their envelopment, defined as how closely connected they felt with the audio, in this listening experience with that of their typical listening experiences; they were also asked to discuss whether the envelopment felt consistent throughout the piece or whether they noticed any changes. Results from all panelists indicate that they felt more connected with the audio, with one mentioning that he “felt glued to the music.” Results also show that this feeling was fairly consistent throughout, with one panelist mentioning that it differed slightly at times based on the instrumentation and arrangement. The connection appears to stem from a few aspects of the design, listed here from most to least frequently mentioned:

 

(1)  having control of the audio

(2)  seeing their facial mesh on the screen, and

(3)  the binaural mix.

 

A few panelists mentioned that having control of the audio, and therefore having a task to perform throughout the experience other than simply listening, caused them to pay more attention to what was happening within the audio. At points, some panelists were not sure exactly which aspect of the audio they were controlling, which led them to listen more intently. The facial mesh representation on the screen added to this sense of envelopment, with panelists reporting that it heightened their awareness of their head position, which in turn caused them to further analyze their movements and how each movement affected the audio. The binaural audio was also discussed, with panelists mentioning that the “multi dimensional aspect of the piece” caused them to feel more immersed in the audio than in a typical listening experience.

 

This indicates that while a binaural mix contributes to a listener’s sense of connection with an audio scene, other factors play significant roles as well. The finding that the visual stimuli, namely the user’s facial mesh representation and the STFT on the 3D cube, enhanced the sense of connectivity with the audio extends previous research in the spatial audio field. (Cox, Davies, & Woodcock, 2019) tested the influence of visual stimuli on different aspects of spatial audio and found that the presence of visual stimuli contributed to listeners’ sense of space, realism, and the spatial clarity of the audio. However, that study also determined that the visual stimuli did not affect envelopment, which is in slight contrast with the findings in this work, as the visual stimuli were mentioned more often by the members of the expert panel than the binaural mix. This suggests that visuals can play both a direct and an indirect role in listener envelopment, connecting the listener directly to the content while enhancing the sense of space and spatial clarity of the binaural scene.

 

4.3.2 Control

Panelists were asked to discuss different factors related to the control of the audio, including whether the head movements controlled the signal processing as expected and whether any more or less control was desired. Feedback for this section was mixed; environment and background seemed to play roles in the results as well. Both panelists with non-music backgrounds reported that they did not feel they had as much control of the audio as they should, with one reporting that she felt as though her head movements had little to no effect. However, both of these panelists also mentioned that their ears may not be “sensitive” enough to detect small changes, as their ears are not musically trained. Another panelist reported audible glitches when moving the head too quickly.

 

Panelists with musical backgrounds reported that, overall, the head movements did control the signal processing as described. However, multiple panelists reported that “it was a bit disorienting at first, and not entirely intuitive which head movements would affect certain effects or instrumentation in the arrangement,” but that once they obtained a better understanding of the interactions as the piece progressed, the additional control added immensely to the experience.

 

Another variable mentioned was the listening environment. One panelist mentioned difficulty finding a setup that lit his face well enough for tracking. Another panelist reported that “the experience of being able to manipulate the volume was really different at night downstairs… versus morning upstairs.” These results indicate that lighting can play a crucial role in the face tracking, and thus in the signal processing: if the tracker cannot properly obtain the facial feature data, the corresponding processing parameters are not updated.
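
This dependency can be illustrated with a small sketch in which the DSP parameters are refreshed only when the tracker returns usable data; the tracker call, confidence threshold, and parameter setters below are hypothetical stand-ins, not the project’s actual Visage SDK code.

```javascript
// Hedged sketch of the dependency described above: DSP parameters are only
// refreshed when the tracker returns usable face data, so poor lighting that
// breaks tracking also freezes the processing at its last state.
// `tracker.detect()`, the 0.5 threshold, and the dsp setters are hypothetical.
let lastRotation = { yaw: 0, pitch: 0, roll: 0 };

function updateProcessing(tracker, dsp) {
  const face = tracker.detect();               // hypothetical tracker call
  if (face && face.confidence > 0.5) {         // usable tracking data
    lastRotation = face.rotation;
  }
  // With or without new data, the DSP is driven by the last known rotation,
  // so a lost face simply leaves the parameters where they were.
  dsp.setAzimuth(lastRotation.yaw);
  dsp.setChorusMix(Math.abs(lastRotation.roll));
}
```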

 

Panelists also reported desiring more control over the processed sound, each with a unique view of which sounds and what kind of control. One panelist mentioned a desire for a “more dramatic spectrum of sounds” with an added fifth head movement. Having more control over other sounds present in the piece was also mentioned, such as the ability to “eliminate the vocals, piano, or guitar entirely with a certain head movement.”

 

A different aspect of control mentioned by panelists was how drastic or subtle the control should be to stay in line with the artistic vision. One mentioned that he desired “more control in some instances and less in others” to stay in line with artistic intentions, and another reported that glitches in the head tracker due to improper lighting led to hearing “the edges of the spatializing effect, where the sound would jump from one side to another” and that some listeners may “want to ‘play’ the track that way, where other users might just want the more subtle sense of… realistic stereo imaging.”

 

Comparing these results with past findings in human-computer interaction allows for more concrete conclusions about the design. Before designing one of the first synthesizers, the Buchla 100, Don Buchla invented an Optically Controlled Synthesizer, which created waveshapes through the analysis of hand movements. Users would draw the waveshape with their hands, using finger positioning to change the timbre of the sound (Chadabe, 1997.) Buchla did not find much success with this invention and decided it was not worth pursuing, most likely because users had too much control over some aspects of the sound and not enough over others, with minuscule hand movements altering the timbre while providing very little control of the pitch.

 

Buchla’s experience suggests a successful design for this project: the panelists’ desires for more control stem from wanting to push the amount of control to its limits, as opposed to finding the amount of control provided inadequate for enjoying the experience. Furthermore, no panelist mentioned wanting less control over any of the attributes, which indicates that no aspect of the design provided so much control that it detracted from the experience. These results are substantial enough to provide a baseline methodology for future research in this realm of creative interactive digital signal processing, but more research should be conducted in order to set a standard.

 

4.3.3 Differences in Processed Sounds

Five different audio sources are processed throughout the piece, and panelists were asked to state their preferences, if they had any. The consensus among the panelists was a preference for the sections containing vocal processing, specifically the final section. This could be due to a number of factors, but the largest may be that the human auditory system detects human voices more readily than other sounds, and thus the panelists could most likely identify the changes in these sections more easily. Another factor could be that vocals tend to be the focus of contemporary music production, so modern listeners are immediately drawn to the vocals in a contemporary mix. This is in line with other feedback as well, as two panelists stated that the signal processing did not feel consistent throughout the piece, feeling there was more control in some sections than in others, even though these aspects of the design do not change. Apart from the vocals, two panelists mentioned preferring the processing on the Wurlitzer and the synthesizer over the guitar and piano, while a different panelist mentioned that the synthesizer was the “least interesting” of the processed sounds.

 

These results indicate that the digital signal processing worked successfully on a wide range of sounds containing different harmonic content, which in turn suggests that the signal processing would work across the majority of monophonic audio content, although further work should be conducted to test this, as well as to examine how the design would handle polyphonic material.

 

4.3.4 Content

Panelists were asked whether there were any particular moments or recurring themes that stood out to them, and whether they felt the content matched the project. One theme consistent among panelists was a recognition of evolving repetitive patterns that “sound familiar but not boring, something similar in a different way.” Another comment about the experience as a whole was that “so many moments stood out and will be etched forever in memory, in part because of the triple input of visuals, user agency, and actions, and the music all interacting in a way that makes for more neural connections and methods of memory processing.” One moment that stood out for multiple panelists was the very end of the piece, which one panelist described as “transformative.” Another specific moment that stood out was the beginning, with a panelist describing the “upward movement and sound of invitational mystery” as a “perfect beginning.”

 

In terms of the content matching the project, one panelist mentioned that the musical content, digital signal processing, and visuals were cohesive and worked very well together, but that the presence of the facial mesh slightly took away from the experience. Another panelist mentioned a desire to see other options before reaching a conclusion.

 

This last response is significant because it implies that this is a new artistic medium requiring further research, as opposed to new research in an established artistic medium. Furthermore, it implies that this project successfully achieved the goal of creating a new and unique way of connecting the listener with the music, using the tools the computer provides to allow an online musical performance that goes beyond an attempt to recreate a live performance.

 

4.3.5 Other

There were a range of topics discussed by panelists when asked if they had any additional thoughts about the experience. One panelist described the whole experience as “totally engaging,” and that she was “right there for the whole 17 minutes, which in this age of divided attention says a lot.” Multiple panelists pointed out a “synergy” between the visuals, head movements, and sound, with one panelist mentioning he “really liked how the visuals matched the musical themes throughout, and how it felt like [his face mesh] was being increasingly surrounded by the ‘digital void’ as the piece progressed.”

 

Other notes included ideas for further research, such as conducting a similar experiment with a control track that is not spatialized, or using the stems of a well-known song. Another panelist mentioned a desire for more specific direction on how to use the head tracker prior to listening, or “a demo where concepts and controls are demonstrated in a clear fashion - perhaps not even within a musical context.”

 

Outside of this reported feedback, one important piece of feedback was that the audio took too long to load. Multiple panelists emailed that they were not sure whether the design was working properly, as they did not receive immediate auditory feedback due to the loading times, and one panelist reported having to wait eight minutes for the audio to load. This extended wait time is most likely due to the size of the audio (the nine audio files total about 2.7 GB) and the way the audio is loaded into buffers, as playback does not begin until each audio file has loaded in its entirety.
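
The difference between this full-buffer approach and a streaming alternative can be sketched with the Web Audio API; the code below is not the project’s loading routine, only an illustration of the two strategies under the stated assumptions.

```javascript
// Hedged sketch of the two loading strategies discussed above.
// Full-buffer approach (consistent with the behaviour the panelists observed):
// nothing plays until the whole file has been fetched and decoded.
async function playFromBuffer(ctx, url) {
  const response = await fetch(url);
  const encoded = await response.arrayBuffer();   // waits for the entire file
  const buffer = await ctx.decodeAudioData(encoded);
  const src = ctx.createBufferSource();
  src.buffer = buffer;
  src.connect(ctx.destination);
  src.start();
}

// Streaming alternative: an <audio> element starts once enough data has
// buffered, and can still feed the Web Audio graph for further processing.
function playStreaming(ctx, url) {
  const el = new Audio(url);
  el.crossOrigin = "anonymous";
  const src = ctx.createMediaElementSource(el);
  src.connect(ctx.destination);
  el.play();
}
```

With a streamed source, the same head-tracked processing chain can still be applied to the signal, while playback begins after a short buffering period rather than after the full download.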

 

Conclusions and Future Work

 

This project has proven effective in enveloping the listener in the audio content, with all experts indicating that they felt more connected with the audio than in a typical listening environment. This conclusion, in conjunction with the responses about the content, indicates the potential for a new medium of audio consumption, in which the listener interacts with aspects of the audio, creating an experience unique to each listen. This has creative implications in audio production, giving composers the ability to write music while viewing the listener as both audience and performer, as well as in mixing and sound design. The finding is also significant for live performance, as it showcases a way for musicians to connect with their audiences in new and unique ways, using the internet as their venue.

 

This work has implications in other realms of audio listening as well. For example, similar work could be conducted to give people with hearing disabilities the ability to control aspects of sounds in both musical and non-musical contexts, which could help provide better listening experiences. The controls in this context could be limited to avoid interfering with artistic integrity, but could also be thought of much like an equalizer in a car: the audio can be personalized, yet its core components are not altered beyond recognition.

 

Future work on this project specifically will focus on providing a better user experience. The audio will be compressed to a lossless format, allowing for quicker load times. A more descriptive landing page with head-movement visualizations will be added to counteract the initial disorientation reported by the panelists in this work, along with a collapsible instruction panel giving users the ability to reread the instructions at any point without having to reload.

 

Other future work will be conducted to further explore immersing the listener in the audio, as well as to determine best practices for interactive digital signal processing, especially regarding control. Further analysis of the control can be conducted using Fitts’s law, mathematically predicting the ease or difficulty of this specific human-computer interaction (a sketch of this prediction is given below). Subjective experiments should be conducted in which listeners are given different amounts of control, to more clearly determine where the line between artistic intention and user control lies. Other work should explore giving each listener not only a unique listening experience but a unique visual experience as well, with the audio specific to each listening session controlling aspects of the video production.
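
As a pointer for that analysis, Fitts’s law in its Shannon formulation predicts movement time from the distance to a target and the target’s width; a minimal sketch, with placeholder coefficients since none have yet been fitted for this interface, is:

```javascript
// Hedged sketch of Fitts's law (Shannon formulation): MT = a + b * log2(D/W + 1).
// The coefficients a and b would have to be fitted from measured head-movement
// data for this interface; the defaults below are placeholders, not results.
function fittsMovementTime(distance, width, a = 0.1, b = 0.15) {
  const indexOfDifficulty = Math.log2(distance / width + 1);   // bits
  return a + b * indexOfDifficulty;                            // seconds, given fitted a and b
}
```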

Categorized Bibliography

 

Harmonizer

 

Bogaards, N. & Röbel, A. (2005.) An Interface for Analysis-Driven Sound Processing.

            AES Convention 119, Paper 6550

 

Dolson, M. (1986.) The Phase Vocoder: A Tutorial.

            Computer Music Journal, vol. 10 no. 4 pp. 14-27

 

Dolson, M. & Laroche, J. (1999.) Improved Phase Vocoder Time-Scale Modification of Audio.

IEEE Transactions on Speech and Audio Processing, vol. 7 no. 3

 

Farner, S.; Röbel, A.; & Rodet, X. (2009.) Natural Transformation of Type and Nature of the

Voice for Extending Vocal Repertoire in High-Fidelity Applications.

AES 35th International Conference: Audio for Games, Paper 16

 

Flanagan, J. & Golden, R. (1966.) Phase Vocoder.

            Bell Labs Technical Journal, vol. 45 iss. 9 pp. 1493-1509

 

 

Panner

 

Constantini, G.; Fazi, F.; Franck, A.; & Pike, C. (2018.) An Open Realtime Binaural Synthesis

Toolkit for Audio Research.

AES Convention 144, eBrief 412

 

Geluso, P. & Roginska, A. (2018.) Immersive sound: the art and science of binaural and

 multi-channel audio. pp. 90-117

New York, NY: Routledge, an imprint of the Taylor & Francis Group

 

Kendall, G. (1995.) A 3D Sound Primer: Directional Hearing and Stereo Reproduction.

            Computer Music Journal, vol. 19 no. 4, pp. 23-46

 

Radu, I.; Sandler, M.; Shukla, R.; & Stewart, R. (2019.) Real-Time Binaural Rendering with

Virtual Vector Base Amplitude Panning.

2019 AES International Conference on Immersive and Interactive Audio, Paper 66

 

 

Tatli, T. (2009.) 3D Panner: A Compositional Tool for Binaural Sound Synthesis.

            Proceedings of the International Computer Music Conference, vol. 2009 pp. 339-342

 

 

Chorus

 

Dattorro, J. (1997.) Effect Design, Part 2: Delay Line Modulation and Chorus.

            Journal of the Audio Engineering Society, vol. 45 no. 10, pp. 764-788

 

Park, T. (2010.) Introduction to Digital Signal Processing, Computer Musically Speaking.

            Singapore: World Scientific Publishing Co. Pte. Ltd. pp. 92-96, pp. 373-396

 

 

Max/MSP

 

Charles, J. (2008.) A Tutorial on Spectral Sound Processing Using Max/MSP and Jitter.

            Computer Music Journal, vol. 32, no. 3, pp. 87-102

 

Harker, A. & Tremblay, P. (2012.) The HISSTools Impulse Response Toolbox: Convolution for

            the Masses.

            International Computer Music Conference, 2012. pp. 148-155

 

Jones, R. & Nevile, B. (2005.) Creating Visual Music in Jitter: Approaches and Techniques.

            Computer Music Journal, vol. 29, no. 4 pp. 55-70

 

Ludovico, L.; Mauro, D.; & Pizzamiglio, D. (2010.) Head in Space: A Head-Tracking Based

Binaural Spatialization System.

Laboratorio di Informatica Musicale

 

Puckette, M. (1988.) The Patcher.

            IRCAM

 

 

Gesture Based Devices and Installations

 

Churnside, A.; Leonard, M.; & Pike, C. (2011.) Musical Movements– Gesture Based Audio

Interfaces.

AES Convention 131, Paper 8496.

 

Chadabe, J. (1997.) Electric Sound: The Past and Promise of Electronic Music. p. 147,

pp. 230-234

            Upper Saddle River, New Jersey: Simon & Schuster, A Viacom Company

 

Mason, C.P. (1936.) Theremin ‘Terpsitone’ A New Electronic Novelty.
            Radio Craft, December 1936

 

Mathews, M. & Moore, F. R. (1970.) GROOVE– A Computer Program for Real-Time Music

            and Sound Synthesis.

            The Journal of the Acoustical Society of America, vol. 47 no. 1, p. 132

 

Park, T. (2009.) An Interview with Max Mathews.

            Computer Music Journal, vol. 33 no. 3, pp. 9-22

 

Rodgers, T. (2010.) Pink Noises: Women on Electronic Music and Sound. p. 229

            Durham and London: Duke University Press

 

Scott, R. (2010.) Getting WiGi With It: Performing and Programming with an Infrared Gestural

            Instrument.

            eContact! Online Journal for Electroacoustic Practices, vol. 12 no. 3

 

Smith, D. & Wood, C. (1981.) The ‘USI,’ or Universal Synthesizer Interface.

            AES Convention 70, Paper 1845

 

Yurii, V. (2006.) History and design of Russian electro-musical instrument “Theremin.”

            AES Convention 120, Paper 6672

 

 

Other

 

Cox, T.; Davies, W.; & Woodcock, J. (2019.) Influence of Visual Stimuli on Perceptual Attributes

            of Spatial Audio.

            Journal of the Audio Engineering Society, vol. 67 no. 7-8, pp. 557-567

 

Moog, R. (1978.) The Human Finger– A Versatile Electronic Musical Instrument Component.

            Journal of the Audio Engineering Society, vol. 26 no. 12, pp. 958-960

 

Small, C. (1998.) Musicking: The Meanings of Performing and Listening.

            Middletown, CT: Wesleyan University Press

 

Vonnegut, K. (1952.) Player Piano.

            New York, NY: The Dial Press