LA2020-2 Speech to background music ratio in audiobooks PDF

Title LA2020-2 Speech to background music ratio in audiobooks
Course Laboratorio de Acústica
Institution Universidad Nacional de Tres de Febrero

Acoustic Laboratory

December 2020, Argentina

SPEECH TO BACKGROUND MUSIC RATIO IN AUDIOBOOKS

SANTIAGO MARTÍNEZ

Ingeniería de Sonido, Universidad Nacional de Tres de Febrero.

Abstract - Audiobooks are becoming increasingly popular in many countries around the world, and a background music track is often added to them to encourage reading. This study analyzed the subjective preference for the speech to background music ratio when listening for long periods of time. A paired comparison test was carried out on 30 subjects between 18 and 30 years old. Four stimuli with different speech to background music ratios were used. The stimulus most preferred by the listeners was the one with a 9 dB speech to background music ratio, while the 3 dB stimulus was the least preferred.

1. INTRODUCTION

Audiobooks are recordings of people reading a book aloud. Besides being inclusive, since they allow visually impaired people to access books they could not otherwise read, they encourage the consumption of both fiction and educational books as a way to make better use of time. They have become very popular in recent years, especially among young people. The share of time spent listening to spoken word audio in the U.S. has increased 30% in the past six years, and 8% in the last year [1]. Because audiobooks are usually heard for long periods of time and while doing other activities, such as traveling or exercising, it is important that the audio does not cause exhaustion or boredom in the listener. That is why a music track is usually added to keep the reading pace. However, if the music track is at a high level it can be exhausting for the listener.

There are more audiobooks recorded by human readers in English than in Spanish. Artificial intelligence and text to speech algorithms address this problem. Contrary to what people usually think, these algorithms are not new: they have existed for more than 50 years [2]. However, in recent years they have improved considerably, achieving surprising results [3].

Bradley et al. [4] performed speech intelligibility tests and acoustical measurements in ten different classrooms and studied the influence of background noise and room reverberation time on speech recognition in a normal academic situation. Eiroa [5] conducted a speech intelligibility test with traffic noise and found a considerable decrease in intelligibility between -4 and -5 dB speech to noise ratio. It is worth mentioning that the traffic noise used was not stationary but varied over time.

The problem is that these studies do not analyze the exhaustion caused by having to pay closer attention when the speech-to-noise ratio is lower. For voice and music, there is no study that measures the ideal ratio for the listener. Perhaps, if the ratio is very high, the music can be confused with external noise. The aim of this study is to find the best speech to background music ratio, taking into account the comfort of the listener when listening to an audiobook.

2. PROCEDURE

2.1 Speech-to-noise ratio

The most important parameter affecting the ability to hear and understand speech in the presence of background noise is the speech-to-noise ratio. In this work, noise refers specifically to the background music. Other external noise sources are considered later in the analysis of the results. The speech is the voice generated by the text to speech system.
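As a brief worked illustration, the speech-to-noise ratio in decibels can be written as a difference of levels. The symbols below (L for a level in dBFS, A for the corresponding linear amplitude) are notation assumed here for illustration and do not appear in the original report; the numerical values are those of the stimulus with a voice level of -6 dBFS and a music level of -9 dBFS.

```latex
\mathrm{SNR}_{\mathrm{dB}}
  = L_{\mathrm{speech}} - L_{\mathrm{music}}
  = 20 \log_{10}\!\left(\frac{A_{\mathrm{speech}}}{A_{\mathrm{music}}}\right)

% Example with a voice at -6 dBFS and music at -9 dBFS:
\mathrm{SNR}_{\mathrm{dB}} = -6 - (-9) = 3\ \mathrm{dB},
\qquad
\frac{A_{\mathrm{speech}}}{A_{\mathrm{music}}} = 10^{3/20} \approx 1.41
```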

2.2 Just-noticeable difference (JND)

According to McShefferty et al., listeners can distinguish steps of 3 dB in the speech-to-noise ratio [6]. Their study was carried out on participants with an average age of 69. However, they report that the JND for the speech-to-noise ratio is independent of hearing loss, so adopting a JND of 3 dB should not pose a considerable problem.

2.3 Test stimuli

Each stimulus is an audio clip of a voice and a noise mixed at a given speech-to-noise ratio. The voice was generated with IBM text to speech using the SofiaV3 Latin American Spanish voice [7], and the music was taken from a classical music concert. The text chosen was an extract from "The Feather Pillow", by Horacio Quiroga: "Sobre el fondo, entre las plumas, moviendo lentamente las patas velludas, había un animal monstruoso, una bola viviente y viscosa" ("At the bottom, among the feathers, slowly moving its hairy legs, was a monstrous animal, a living, viscous ball"). The level of the voice was kept the same across stimuli and only the level of the music was varied, giving the different speech-to-noise ratios shown in Table 1. Figure 1 shows the generated text to speech signal, the music used and the sum of the two for stimulus N° 1.

Figure 1(a): Text to speech generated audio.

Table 1: Characteristics of each stimulus.

Stimulus   Voice level [dBFS]   Noise/Music level [dBFS]   Speech-to-noise ratio [dB]
N° 1       -6                   -9                         3
N° 2       -6                   -12                        6
N° 3       -6                   -15                        9
N° 4       -6                   -18                        12

The lowest speech-to-noise ratio was chosen so that intelligibility remains high, since the test is not intended to measure this parameter. If an intelligibility test were run with the chosen stimuli, an intelligibility close to 100% would be obtained, mainly because the subjects already know the sentence from hearing the voice alone as a calibration. It should also be taken into account that both Bradley's and Eiroa's works [4, 5] studied cases of negative speech-to-noise ratio, while this work is limited to the range of 3 to 12 dB.
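The level scheme of Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report does not say whether the dBFS values are RMS or peak levels, so RMS levels relative to a full scale of 1.0 are assumed here; the two sine tones are placeholders for the real voice and music recordings; and all function names are hypothetical. The fade in and fade out are not included.

```python
import math

def rms(signal):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def level_dbfs(signal):
    """RMS level in dB relative to full scale (assumed to be 1.0)."""
    return 20 * math.log10(rms(signal))

def scale_to_dbfs(signal, target_dbfs):
    """Scale a signal so that its RMS level equals target_dbfs."""
    gain = 10 ** (target_dbfs / 20) / rms(signal)
    return [gain * x for x in signal]

def make_stimulus(voice, music, voice_dbfs, music_dbfs):
    """Mix voice and music at the given levels.

    Returns the mixed samples and the resulting speech-to-noise ratio in dB.
    Peak level is not checked, so the sum may exceed full scale.
    """
    v = scale_to_dbfs(voice, voice_dbfs)
    m = scale_to_dbfs(music, music_dbfs)
    mix = [a + b for a, b in zip(v, m)]
    return mix, level_dbfs(v) - level_dbfs(m)

# Placeholder signals (assumption): one second of two sine tones.
FS = 8000
voice = [math.sin(2 * math.pi * 220 * n / FS) for n in range(FS)]
music = [math.sin(2 * math.pi * 330 * n / FS) for n in range(FS)]

# (voice level, music level) in dBFS for each stimulus, as in Table 1.
TABLE_1 = {1: (-6, -9), 2: (-6, -12), 3: (-6, -15), 4: (-6, -18)}

ratios = {}
for number, (voice_dbfs, music_dbfs) in TABLE_1.items():
    _, ratios[number] = make_stimulus(voice, music, voice_dbfs, music_dbfs)
    print(f"Stimulus {number}: speech-to-noise ratio = {ratios[number]:.1f} dB")
```

Keeping the voice gain fixed and changing only the music gain mirrors the procedure described above, so the perceived loudness of the speech stays constant across stimuli.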

2.4 Subjective test

The test is a paired comparison test. In order to make a correct statistical analysis, thirty five subjects were selected randomly, without any preference regarding age, gender or musical training. The only condition was that they must be native Spanish speakers. The subjective test was performed using Google Forms. First, the test subjects were told that they must wear a headset to participate, that they must be in a low noise environment, and that they could adjust the volume to a comfortable level at the start but could not change it during the rest of the test. Additionally, personal questions were asked, including "Do you listen to audiobooks or podcasts?" and "Do you have a hearing problem?".

Figure 1(b): Music/noise extracted from a concert.

Figure 1(c): Stimulus N° 1, with a 3 dB speech to noise ratio.

Figure 1: Panel (c) was created by adding the signal from (a) to the signal from (b), and applying a fade in and a fade out.

The subjects then hear a single voice-only audio that lasts 10 s, followed by twelve audios that they can repeat. Each of these contains two 12 s stimuli with 1 s of silence between them, so each audio lasts 25 s. The order of the stimuli in the test audio files was randomized to avoid any predictions by the subjects. In addition, each listener received a different sequence of stimuli to avoid any spoiling of the test. The subjects do not have the option of choosing a tie. With these times, the test lasts between 3 and 4 min, taking into account the reading time of the instructions and without counting repetitions. The question that guides the subject is: "If you were asked to listen to an audio like this for several hours, which would you find more comfortable?".
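A per-listener presentation order of this kind can be sketched as follows. The report does not state how the twelve audios were built from the four stimuli; the sketch assumes they are the twelve ordered pairs (each of the six pairs presented in both orders), and the function name and the use of a listener number as random seed are hypothetical choices made for illustration.

```python
import itertools
import random

STIMULI = [1, 2, 3, 4]   # stimulus numbers from Table 1
STIMULUS_S = 12          # duration of each stimulus in seconds
GAP_S = 1                # silence between the two stimuli of a pair

# Duration of one test audio: two stimuli plus the gap between them.
AUDIO_DURATION_S = 2 * STIMULUS_S + GAP_S

def presentation_order(listener_id):
    """Return a shuffled list of ordered stimulus pairs for one listener.

    Assumption: the twelve audios are all ordered pairs of distinct stimuli.
    Seeding with the listener number gives each listener a different but
    reproducible sequence.
    """
    rng = random.Random(listener_id)
    pairs = list(itertools.permutations(STIMULI, 2))
    rng.shuffle(pairs)
    return pairs

for listener_id in (1, 2):
    print(listener_id, presentation_order(listener_id))
```

Presenting each pair in both orders is one common way to balance any bias toward the first or second stimulus heard; whether the original test did this is not confirmed by the text.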

3. RESULTS

To validate the data, a maximum circular error rate (CER) of 0 was required, and tests of agreement and goodness of fit were performed. To implement the analysis of variance (ANOVA), the normality, homoscedasticity and independence of the samples were analyzed. In the test, 19 of the 30 subjects presented a CER of 0, which represents 63% of the sample. With these data, the tests of agreement and goodness of fit were carried out. The test of agreement indicated significant agreement among the responses of the filtered subjects, at a confidence level of 95%. The goodness of fit test showed no significant differences between the calculated values and the estimated scale at a 95% confidence level. Figure 2 shows the scale value of each of the stimuli. It can be seen that stimulus C (9 dB) is the most chosen, followed by stimuli D, B and A (3 dB) respectively.
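The consistency filter can be illustrated with a short sketch that counts circular triads (cases such as A preferred to B, B to D and D to A) in one subject's answers and normalizes by the maximum possible number, following Kendall's classical result for paired comparisons. This is an assumption about how the CER was computed, since the report does not give the formula; it also assumes one decision per unordered pair of stimuli, and the example answer sets are invented for illustration only.

```python
from itertools import combinations

def circular_triads(items, wins):
    """Count circular triads.

    `wins` is a set of (preferred, other) tuples with exactly one entry
    per unordered pair of items.
    """
    count = 0
    for a, b, c in combinations(items, 3):
        forward = (a, b) in wins and (b, c) in wins and (c, a) in wins
        backward = (b, a) in wins and (c, b) in wins and (a, c) in wins
        if forward or backward:
            count += 1
    return count

def max_circular_triads(n):
    """Kendall's maximum number of circular triads for n items."""
    if n % 2:
        return (n**3 - n) // 24
    return (n**3 - 4 * n) // 24

def circular_error_rate(items, wins):
    """Circular triads divided by the maximum possible number."""
    return circular_triads(items, wins) / max_circular_triads(len(items))

ITEMS = ["A", "B", "C", "D"]

# Invented answer sets (illustration only, not data from the test).
consistent = {("C", "D"), ("C", "B"), ("C", "A"),
              ("D", "B"), ("D", "A"), ("B", "A")}
one_triad = {("C", "D"), ("C", "B"), ("C", "A"),
             ("D", "B"), ("B", "A"), ("A", "D")}
two_triads = {("C", "D"), ("C", "B"), ("A", "C"),
              ("D", "B"), ("D", "A"), ("B", "A")}

for name, wins in [("consistent", consistent),
                   ("one triad", one_triad),
                   ("two triads", two_triads)]:
    print(name, circular_error_rate(ITEMS, wins))
```

With four stimuli the maximum number of circular triads is 2, which is why only the values 0, 0.5 and 1 can occur.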

Figure 2: Scale value of each stimulus in the test.

A correlation study was not carried out because the results were expected to peak, and did peak, in the middle of the range of the objective variable. No regression study was carried out either. The Shapiro-Wilk normality test gave a p-value below 0.05, so the null hypothesis of normality is rejected. Independence is guaranteed, as the subjects were not related. The ANOVA could not be carried out because the normality condition was not met, so the Kruskal-Wallis test was used. The null hypothesis was rejected (χ² = 9.622, p < 0.05), meaning that the parameters of the stimuli had a significant effect on the perception of the sound. In this test 100% of the subjects listen to podcasts or online radio, but 0% listen to audiobooks.

4. DISCUSSION

A maximum circular error rate (CER) of 0.5 was also considered because, as there were 4 stimuli, only a CER of 0, 0.5 or 1 could be found. Only 4 subjects obtained a CER equal to 0.5; accepting them would raise the valid sample to 76.6% of the subjects. Since this is not a significant increase, it was decided to discard these samples because they could introduce errors in the results.

5. CONCLUSION

A subjective test was conducted to evaluate human perception of the ratio between speech and background music. The consistency check retained 63% of the data. Since the test does not seek a correlation between an objective and a subjective variable, but rather a subjective optimum within the range of the objective variable, it was not necessary to perform a correlation analysis. Since none of the subjects listen to audiobooks, the test would have to be repeated, either recruiting people who listen to audiobooks or redirecting the research toward the speech to background music ratio in podcasts.

REFERENCES

[1] NPR, The Spoken Word Audio Report, Edison Research (2020).
[2] Sherwood, B. A., Fast text to speech algorithms for Esperanto, Spanish, Italian, Russian and English (1978).
[3] Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F., Chen, Z., Nguyen, P., Pang, R., Moreno, I. L., Wu, Y., Transfer Learning from Speaker Verification to Multispeaker Text To Speech Synthesis, 32nd Conference on Neural Information Processing Systems, Montreal (2018).
[4] Bradley, J. S., Reich, R. D., Norcross, S. G., On the combined effects of signal-to-noise ratio and room acoustics on speech intelligibility, 106, 1820 (1999).
[5] Eiroa, L. T., Influences of spectral centroid and signal to noise ratio over intelligibility of words and sentences, Report of Acoustic Laboratory, Universidad Nacional de Tres de Febrero (2018).
[6] McShefferty, D., Whitmer, W. M., Akeroyd, M. A., The Just-Noticeable Difference in Speech-to-Noise Ratio (2015).
[7] Watson, Text to Speech demo, text-to-speechdemo.ng.bluemix.net (last viewed on 1/11/2020).

