Audio Expansion Techniques For Real-Time Communication And Intelligibility Enhancement

December 31st, 2022

Categories: Applications, MS / PhD Thesis, Software, Audio Research

Unwrapped Schematic of Cochlea


Novak, J. S. III


It has often been observed, informally and in academic literature, that speech directed toward individuals who are hard of hearing is naturally modified by talkers in an attempt to accommodate the listeners. These modifications take numerous forms, from the easily expressed “louder” and/or “slower” to the highly technical “vowel space expansion” and many others. Speech modifications are also made in numerous other circumstances, not only for hard of hearing listeners but for listeners in noisy backgrounds, second language listeners, and many others; these are known under the collective (if sometimes too-broad) umbrella of clear speech. There is, however, an asymmetry inherent to all these talker-listener scenarios: listeners can give their talker partners only the broadest guidance at best (“Louder!” “Slower!”), and talkers may simply ignore it.

With an interest in developing new and practical interventions for the elderly and hard of hearing, this dissertation takes as its research questions: First, the feasibility of the production of a signal processing system which allows real-time modification of audio signals. Second, the tolerance and effectiveness of partially emulating clear speech, by using signal processing techniques to modify (specifically, to slow) the rate of received speech signals. And finally, the effectiveness of placing the control of speech rate directly into the hands of listeners, a capability that has never been possible prior to electronic mediation.

This dissertation is organized as follows:
Chapter One is a survey of the circumstances which prompt these spontaneous speech modifications, and of the details of the modifications which are made.

Chapter Two presents an overview of the various methods of changing the rate of speech, including the simple but ineffective technique of re-sampling as well as several frequency domain techniques. These systems were intended for use on pre-existing, rather than live, signals.
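As an illustration of one well-known drawback of re-sampling, the sketch below stretches a pure tone to 1.4 times its length by linear interpolation and then estimates its frequency. It is not taken from the dissertation: the sample rate, the tone frequency, and the helper names `resample_linear` and `estimate_frequency` are assumptions chosen for the example.

```python
import math

def resample_linear(samples, factor):
    """Stretch `samples` to `factor` times their length by linear interpolation."""
    n_out = int(round(len(samples) * factor))
    out = []
    for i in range(n_out):
        pos = i / factor                      # fractional position in the input
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out

def estimate_frequency(samples, sample_rate):
    """Estimate the frequency of a pure tone by counting upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0.0 <= b)
    return crossings * sample_rate / len(samples)

SAMPLE_RATE = 8000   # Hz, assumed for illustration
TONE_HZ = 200.0      # test tone, assumed for illustration
FACTOR = 1.4         # 40% additional playback time

# One second of a pure tone, then the same tone slowed by re-sampling.
tone = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
slowed = resample_linear(tone, FACTOR)

original_hz = estimate_frequency(tone, SAMPLE_RATE)
slowed_hz = estimate_frequency(slowed, SAMPLE_RATE)
print(f"original ~{original_hz:.0f} Hz, slowed ~{slowed_hz:.0f} Hz")
```

Played back at the original sample rate, the slowed tone lasts 1.4 times as long, but its frequency falls to about 1/1.4 of the original. Duration and pitch change together under re-sampling, which is the coupling that dedicated time-stretching methods are designed to avoid.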

Chapter Three presents the details of several incarnations of this real-time system, including a single-talker system implemented on a laptop computer, presented in [1] and used directly (with two laptops for two talkers) in the study presented in [3]; a two-talker wireless system using Android smartphones and local network connectivity, presented in [2]; and an unpublished system using a central laptop base station, to which Android smartphones could connect even when hundreds of miles apart. This chapter contains the details required to handle both live signals and live control of the playback rate of those signals. Finally, this chapter includes the design details of the systems used in the three user studies of Chapters Five, Six, and Seven.

Chapter Four presents a user study designed to examine the effects of using real-time temporal modification software in a conversational setting, previously published in [3]. The key element of the study design was the use of a scored Diapix test [7] which was designed to elicit spontaneous conversational speech between two study participants without the involvement of or prompting by researchers. The main results of this study are (i) that at modest amounts of stretching (40% additional playback time, with lengthy silences not stretched) the speech patterns of subjects are not changed to a statistically significant degree, and (ii) under those same circumstances, there is no statistical evidence of changes in performance on the Diapix test.
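The arithmetic behind "40% additional playback time, with lengthy silences not stretched" can be sketched as follows. This is only an illustration under stated assumptions: the segment list, the 0.5-second threshold for what counts as a lengthy silence, and the function name `stretched_duration` are hypothetical and do not come from the dissertation.

```python
def stretched_duration(segments, factor, max_stretched_silence):
    """Total playback time when lengthy silences are passed through unstretched.

    segments: list of (duration_seconds, is_silence) pairs.
    Silences longer than `max_stretched_silence` keep their original length;
    every other segment is lengthened by `factor`.
    """
    total = 0.0
    for duration, is_silence in segments:
        if is_silence and duration > max_stretched_silence:
            total += duration            # lengthy silence: not stretched
        else:
            total += duration * factor   # speech or short pause: stretched
    return total

# Hypothetical utterance: speech, a long pause, more speech, a short pause.
segments = [(2.0, False), (1.5, True), (3.0, False), (0.1, True)]

original = sum(duration for duration, _ in segments)
uniform = original * 1.4
selective = stretched_duration(segments, factor=1.4, max_stretched_silence=0.5)
print(f"original {original:.2f} s, uniform {uniform:.2f} s, selective {selective:.2f} s")
```

In this made-up example, leaving the 1.5-second pause alone means the stretched audio runs 0.6 seconds shorter than it would under uniform stretching, while every speech segment is still slowed by the full 40%.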

Chapter Five presents the first of three user studies focusing on the little-studied subject of the interaction of user choice and user control over received speech rates in adverse conditions. This study was based on the preexisting QuickSIN test [8], which is designed to test and characterize a listener’s ability to understand short sentences of speech played against a background of noise. In the first part of this study, users were asked to listen to several short sentences, each played against multiple levels of background noise. For each level of background noise, the subjects were asked to use a computer interface to set the rate of the speech to whatever rate they believed was most helpful in understanding the foreground speech. (The background babble did not change rate.) These preferred expansion rates were recorded for later use and analysis. In the second part of the study, subjects were asked to listen to and immediately repeat back sentences at those same noise conditions, both with speech modified to their preferred expansion rates and with speech unmodified. The subjects’ repetitions of those sentences were also recorded for analysis.

Our results included statistically significant differences in preferred speech rate at opposite ends of the noise condition, with subjects requesting slower speech in the presence of increased noise, as well as an overall belief among subjects that slowing speech was beneficial to their performance. However, analysis showed that the subjects’ ability to repeat sentences was in fact mildly degraded at moderate to high noise conditions and was not improved at any noise condition. This work was published at Interspeech 2018 [4].

Chapter Six presents the second of the three user control studies. In this study, students who were non-native speakers of English and had recent TOEFL scores were recruited as test subjects. After familiarization with a graphical interface, test subjects were asked to participate in six comprehension tasks, each of which was a lengthy (several minutes long) audio track of a TOEFL test to which they had not previously been exposed. During three of these tasks, the subjects were given the ability to change the rate of playback as they saw fit, and were asked to do so in whatever fashion they believed would best help them understand the audio passage. The other three comprehension tasks were presented unmodified, without a control interface. After each task, the subjects were immediately given a short multiple-choice quiz about the contents of the previous audio passage. Subjects’ behavior (i.e., the rate changes specified) and quiz responses were collected and analyzed.

Our analysis of the data shows that a considerable majority of the subjects (roughly 2/3) used speech slowing in all three of their trials, and an overwhelming majority (80%) used the technique in at least one of their trials, but it did not reveal a statistically significant improvement or degradation in listening comprehension. There is also a weak correlation between lower TOEFL scores and more slowing. This work was published at Interspeech 2019 [5].

In the work described above, the algorithms employed stretch speech uniformly: if set to expand an audio track by 40%, all sounds, including all parts of speech, all background noises, and even silences, are expanded by the same rate. For large expansion factors, the speech so produced sounds somewhat unnatural, because this is not how talkers spontaneously produce slow speech.

With this in mind, Chapter Seven returns to the topic of speech in noise, but puts even more precise control in the hands of test subjects. A relatively simple neural network was developed to classify individual phonemes within an audio track into six broad phonemic categories. A study was then designed around this tool which asked subjects to listen to short sentences in noise, and to expand or contract the audio only for particular phonemes (e.g., first modify vowels, then modify fricatives, etc.). Following this, all modifications were applied simultaneously, and subjects were asked to listen to modified and unmodified (control) sentences in three levels of background noise and to transcribe the sentences into a computer interface.

Our analysis found that using this technique produced no statistically significant improvement or degradation of intelligibility at one noise level, and a statistically significant degradation at the other two noise levels.

Finally, Chapter Eight presents overall conclusions, with additional discussion of the entire work and of directions for future studies.




Novak, J. S. III, Audio Expansion Techniques For Real-Time Communication And Intelligibility Enhancement, Submitted as partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois Chicago, Chicago, IL, December 31st, 2022.