Method and apparatus for dynamically modifying the timbre of speech by frequency shifting of spectral envelope formants

Document No.: 1602700 — Publication date: 2020-01-07

Note: This technique, "Method and apparatus for dynamically modifying the timbre of speech by frequency shifting of spectral envelope formants", was designed and created by Jean-Julien Aucouturier, Pablo Arias and Axel Roebel on 2018-02-12. Its main content is as follows: The invention describes a method for modifying a sound signal, the method comprising: a step of obtaining a time frame of the sound signal in the frequency domain; applying a first transform of the sound signal in the frequency domain for at least one time frame, comprising: a step of extracting a spectral envelope of the sound signal for the at least one time frame; a step of calculating the frequencies of the formants of the spectral envelope; a step of modifying (350) the spectral envelope of the sound signal, the modifying comprising applying (351) a continuous increasing transformation function of the frequency of the spectral envelope, the continuous increasing transformation function being parameterized by at least two formant frequencies of the spectral envelope.

1. A method for modifying a sound signal, the method comprising:

a step of obtaining (310) a time frame of the sound signal in the frequency domain;

applying a first transform (320a) of the sound signal in the frequency domain for at least one time frame, comprising:

a step of extracting (330) a spectral envelope of the sound signal for the at least one time frame;

a step of calculating (340) the frequency of the formants of the spectral envelope;

a step of modifying (350) the spectral envelope of the sound signal, the modifying comprising applying (351) a continuous increasing transformation function of the frequency of the spectral envelope, the continuous increasing transformation function being parameterized by at least two formant frequencies of the spectral envelope.

2. The method of claim 1, wherein the step of modifying (350) the spectral envelope of the sound signal further comprises applying (352) a filter to the spectral envelope, the filter being parameterized by a frequency of a third formant (F3) of the spectral envelope of the sound signal.

3. The method according to claim 1 or 2, comprising the step of classifying (360) the time frames according to a set of time frame categories including at least a voiced frame category and a non-voiced frame category.

4. The method of claim 3, comprising:

applying, for each voiced sound frame, the first transform of the sound signal in the frequency domain (320 a);

applying, for each non-voiced frame, a second transform (320b) of the sound signal in the frequency domain, the second transform comprising: a step of applying a filter (370) centered on a predetermined frequency to increase the energy of the sound signal.

5. The method of claim 4, wherein the second transformation (320b) of the sound signal comprises:

a step of extracting (330) a spectral envelope of the sound signal for the at least one time frame;

applying (351b) a continuous increasing transformation function of the frequency of the spectral envelope, which continuous increasing transformation function is parameterized identically to a continuous increasing transformation function of the frequency of the spectral envelope of an immediately preceding time frame.

6. The method of any one of claims 1 to 5, wherein applying (351) a continuously increasing transform function of the frequency of the spectral envelope comprises:

calculating a modified frequency (410a, 420a, 430a, 440a, 450a) for a set of initial frequencies (410, 420, 430, 440, 450) determined from formants of the spectral envelope;

determining a linear interpolation (460, 461, 462, 463) between initial frequencies of the set of initial frequencies from formants of the spectral envelope and the modified frequencies.

7. The method of claim 6, wherein at least one modified frequency (420a, 430a, 440a) is obtained by multiplying an initial frequency (420, 430, 440) from the set of initial frequencies by a multiplier coefficient (a).

8. The method of claim 7, wherein the set of frequencies determined from formants of the spectral envelope comprises:

a first initial frequency (410) calculated from half the frequency of a first formant (F1) of the spectral envelope of the sound signal;

a second initial frequency (420) calculated from the frequency of a second formant (F2) of the spectral envelope of the sound signal;

a third initial frequency (430) calculated from the frequency of a third formant (F3) of the spectral envelope of the sound signal;

a fourth initial frequency (440) calculated from the frequency of a fourth formant (F4) of the spectral envelope of the sound signal;

a fifth initial frequency (450) calculated from the frequency of a fifth formant (F5) of the spectral envelope of the sound signal.

9. The method of claim 8, wherein:

calculating a first modified frequency (410a) as equal to the first initial frequency (410);

calculating a second modified frequency (420a) by multiplying the second initial frequency (420) by the multiplier coefficient (a);

calculating a third modified frequency (430a) by multiplying the third initial frequency (430) by the multiplier coefficient (a);

calculating a fourth modified frequency (440a) by multiplying the fourth initial frequency (440) by the multiplier coefficient (a);

calculating a fifth modified frequency (450a) equal to the fifth initial frequency (450).

10. A method according to claim 8 or 9, wherein each initial frequency is calculated from the frequency of a formant of the current time frame.

11. The method of claim 8, wherein each initial frequency is calculated from an average of the frequencies of formants of the same order for a number of consecutive time frames greater than or equal to two.

12. The method according to any one of claims 1 to 11, adapted for modifying the sound signal in real time, and wherein:

the sound signal comprises speech;

the step of obtaining (310) a time frame of the sound signal in the frequency domain comprises:

receiving an audio sample;

creating a frame of audio samples when a sufficient number of samples are available to form the frame;

applying a frequency transform to the audio samples of the frame.

13. The method according to any one of claims 1 to 12, adapted for applying a smile timbre to speech, wherein the at least two formant frequencies are frequencies of formants affected by the smile timbre of speech.

14. The method according to claim 13, wherein the continuous increasing transformation function of the frequency of the spectral envelope is determined during a training phase by comparing the spectral envelopes of phonemes spoken by a user in a neutral manner and while smiling.

15. A computer program product comprising program code instructions recorded on a computer readable medium for performing the steps of the method according to one of claims 1 to 12 when said program is run on a computer.

[ technical field ]

The present invention relates to the field of acoustic processing. More particularly, the present invention relates to modifying acoustic signals containing speech to provide a timbre, such as a smile timbre, to speech.

[ background of the invention ]

Smiling perceptibly alters the sound of our voice, to the point where customer service departments recommend that their representatives smile on the phone: even though the customer cannot see the smile, customer satisfaction is positively influenced.

The study of the acoustic characteristics of smiling speech is a recent area of research that has not yet been well documented. Smiling contracts the zygomatic muscles and changes the shape of the mouth, which affects the spectrum of speech. In particular, it has been shown that when a speaker smiles, the spectrum of the voice shifts toward higher frequencies, and when the voice is sad, it shifts toward lower frequencies.

The document Quené, H., Semin, G. R., & Foroni, F. (2012), Audible smiles and frowns affect speech comprehension, Speech Communication, 54(7), 917-922, describes an experiment on the perception of smiling speech. The experiment included recording words pronounced neutrally by an experimenter. It is based on the relationship between the frequency of formants and the timbre of speech; the formants of a speech sound are the energy maxima of its spectrum. The Quené experiment comprises: analyzing the formants of the speech when a word is uttered, storing the frequencies of the formants, generating modified formants by increasing the frequencies of the initial formants by 10%, and then re-synthesizing the word with the modified formants.

The Quené experiment makes it possible to obtain words that are perceived as having been pronounced while smiling. However, the synthesized words have a timbre that the user perceives as artificial.

Furthermore, the two-step architecture proposed by Quené requires that a portion of the signal be analyzed before the signal can be re-synthesized, resulting in a time shift between the moment the word is uttered and the moment its transformation can be broadcast. The Quené method therefore cannot modify speech in real time.

There are many interesting applications for modifying speech in real time. For example, real-time modification of speech may be applied to a call center application: the operator's voice may be modified in real time before transmission to the customer to appear more smiling. Thus, the customer will feel that his representative is smiling to him, which may increase customer satisfaction.

Another application is to modify the speech of non-player characters in a video game. Non-player characters are the characters controlled by the computer, usually secondary characters. These characters are typically associated with different responses to be spoken, which allow the player to progress through the plot of the video game. These responses are typically stored in the form of audio files and read when the player interacts with the non-player character. It is therefore of interest, starting from a single neutral audio file, to apply different filters to the neutral voice in order to produce timbres, such as smiling or tension, to simulate the mood of the non-player character and enhance the sense of immersion in the game.

Therefore, there is a need for a method of modifying the timbre of speech that is simple enough to be performed in real time with current computing power, and the modified speech is perceived as natural speech.

[ summary of the invention ]

To this end, the invention describes a method for modifying a sound signal, the method comprising: a step of obtaining a time frame of the sound signal in the frequency domain; applying a first transform of a sound signal in the frequency domain for at least one time frame, comprising: a step of extracting a spectral envelope of the sound signal for the at least one time frame; a step of calculating the frequency of the formants of the spectral envelope; a step of modifying a spectral envelope of the sound signal, said modifying comprising applying a continuous increasing transformation function of the frequency of the spectral envelope, the continuous increasing transformation function being parameterized by at least two formant frequencies of the spectral envelope.

Advantageously, the step of modifying the spectral envelope of the sound signal further comprises applying a filter to the spectral envelope, the filter being parameterized by the frequency of a third formant of the spectral envelope of the sound signal.

Advantageously, the method comprises: the method includes the step of classifying the time frames according to a set of time frame categories including at least a voiced frame category and a non-voiced frame category.

Advantageously, the method comprises: applying, for each voiced frame, the first transform of the sound signal in the frequency domain; for each non-voiced frame, applying a second transform of the sound signal in the frequency domain, the second transform comprising: a step of applying a filter to increase the energy of the sound signal centered at the predetermined frequency.

Advantageously, the second transformation of the sound signal comprises: a step of extracting a spectral envelope of the sound signal for the at least one time frame; applying a successive increasing transformation function of the frequency of the spectral envelope, which successive increasing transformation function is parameterized the same as the successive increasing transformation function of the frequency of the spectral envelope of an immediately preceding time frame.

Advantageously, applying a continuously increasing transform function of the frequency of the spectral envelope comprises: calculating modified frequencies for a set of initial frequencies determined from formants of the spectral envelope; determining a linear interpolation between initial frequencies in the set of initial frequencies from formants of the spectral envelope and the modified frequencies.

Advantageously, the at least one modified frequency is obtained by multiplying an initial frequency from the set of initial frequencies by a multiplier coefficient (a).

Advantageously, the set of frequencies determined from the formants of the spectral envelope comprises: a first initial frequency calculated from half the frequency of a first formant of the spectral envelope of the sound signal; a second initial frequency calculated from the frequency of a second formant of the spectral envelope of the sound signal; a third initial frequency calculated from the frequency of a third formant of the spectral envelope of the sound signal; a fourth initial frequency calculated from the frequency of a fourth formant of the spectral envelope of the sound signal; a fifth initial frequency calculated from the frequencies of a fifth formant of the spectral envelope of the sound signal.

Advantageously, the first modified frequency is calculated to be equal to said first initial frequency; calculating a second modified frequency by multiplying the second initial frequency by the multiplier coefficient; calculating a third modified frequency by multiplying the third initial frequency by the multiplier coefficient; calculating a fourth modified frequency by multiplying the fourth initial frequency by the multiplier coefficient; calculating a fifth modified frequency to be equal to the fifth initial frequency.

Advantageously, each initial frequency is calculated from the frequencies of the formants of the current time frame.

Advantageously, each initial frequency is calculated from the average of the frequencies of formants of the same order for a number of time frames greater than or equal to two consecutive time frames.

Advantageously, the method is a method for modifying an audio signal comprising speech in real time, comprising: receiving an audio sample; creating a frame of audio samples when a sufficient number of samples are available to form the frame; applying a frequency transform to the audio samples of the frame; a first transformation of the sound signal is applied to at least one time frame in the frequency domain.

The invention also describes a method of applying a smile timbre to speech, implementing a method for modifying a sound signal according to the invention, the at least two formant frequencies being formant frequencies affected by the smile timbre of speech.

Advantageously, during the training phase, said continuous increasing transformation function of the frequency of the spectral envelope is determined by comparing the spectral envelopes of the phonemes spoken when the user is neutral or smiling.

The invention also describes a computer program product comprising program code instructions recorded on a computer readable medium for carrying out the steps of the method when said program is run on a computer.

The invention makes it possible to modify speech in real time to influence the speech with a timbre, such as a smiling or stressed timbre.

The method of the invention is not very complex and can be performed in real time by ordinary computing power.

The present invention introduces a minimum delay between the initial speech and the modified speech.

The present invention produces speech that is perceived as natural.

The present invention may be implemented on many platforms using different programming languages.

[ description of the drawings ]

Further features will appear upon reading the following detailed description, provided as a non-limiting example, in accordance with the accompanying drawings, which show:

fig. 1 is an example of a spectral envelope of a vowel 'a' spoken by an experimenter with and without a smile;

FIG. 2 is an example of a system implementing the present invention;

FIGS. 3a and 3b are two exemplary methods according to the present invention;

fig. 4a and 4b are two examples of continuously increasing transform functions of the frequency of the spectral envelope of a time frame according to the invention;

FIGS. 5a, 5b and 5c are three examples of spectral envelopes of vowels modified according to the present invention;

FIGS. 6a, 6b, and 6c are three examples of spectrograms of phonemes uttered while smiling and not smiling;

FIG. 7 is an example of a vowel spectrogram transform according to the present invention;

fig. 8 shows three examples of vowel spectrogram transformations in accordance with three exemplary embodiments of the present invention.

[ detailed description ]

Fig. 1 shows an example of the spectral envelopes of a vowel 'a' spoken by an experimenter with and without smiling.

The diagram 100 shows two spectral envelopes: the spectral envelope 120 shows the spectral envelope of a vowel 'a' that an experimenter utters without smiling; the spectral envelope 130 shows the same experimenter but utters the same vowel 'a' when smiling. The two spectral envelopes 120 and 130 show the interpolation of the peaks of the fourier spectrum of the sound: the horizontal axis 110 represents frequency using a logarithmic scale; the vertical axis 111 represents the magnitude of sound at a given frequency.

Spectral envelope 120 includes a fundamental frequency F0 (121) and a plurality of formants, including a first formant F1 (122), a second formant F2 (123), a third formant F3 (124), a fourth formant F4 (125), and a fifth formant F5 (126).

Spectral envelope 130 includes a fundamental frequency F0 (131) and a plurality of formants, including a first formant F1 (132), a second formant F2 (133), a third formant F3 (134), a fourth formant F4 (135), and a fifth formant F5 (136).

It may be noted that although the overall appearance of the two spectral envelopes is the same (which allows the same phoneme 'a' to be identified whether the user utters it while smiling or not), smiling affects the frequencies of the formants. Indeed, the frequencies of the first, second, third, fourth and fifth formants F1 (132), F2 (133), F3 (134), F4 (135) and F5 (136) of the spectral envelope 130 of the phoneme uttered while smiling are higher than the frequencies of the first, second, third, fourth and fifth formants F1 (122), F2 (123), F3 (124), F4 (125) and F5 (126), respectively, of the spectral envelope 120 of the phoneme uttered neutrally. By contrast, the fundamental frequencies F0 (121 and 131) of the two spectral envelopes are the same.

At the same time, the spectral envelope of the smiling voice also has a greater intensity in the vicinity of the frequency of the third formant F3 (134).

These differences allow the listener to recognize both the uttered phoneme and how it was uttered (neutral or smiling).

Figure 2 shows an example of a system implementing the invention.

System 200 illustrates an exemplary embodiment of the present invention in the context of a connection between a user 240 and a call center agent 210. In this example, the call center agent 210 communicates using a microphone-equipped audio headset connected to a workstation. The workstation is connected to a server 220, and the server 220 may be used, for example, for an entire call center or a group of call center agents. The server 220 communicates with the relay antenna 230 by means of a communication link, allowing a radio link with the mobile phone of the user 240.

The system is given by way of example only and other architectures may be provided. For example, user 240 may use a fixed telephone. The call center agent may also use a phone connected to server 220. The invention can therefore be applied to all system architectures comprising at least one server or workstation allowing a connection between a user and a call centre agent.

Call center agent 210 typically speaks with neutral speech. The server 220, or the workstation of call center agent 210, may therefore apply a method according to the invention to modify the agent's speech in real time and send to customer 240 a modified speech that naturally appears smiling. The customer's perception of the interaction with the call center agent is thereby improved. In return, the customer may also respond pleasantly to a voice that appears to him to be smiling, which contributes to an overall improvement in the interaction between customer 240 and call center agent 210.

However, the present invention is not limited to this example. For example, the invention can be used for real-time modification of neutral speech. For example, the present invention may be used to impart a timbre (tension, smile, etc.) to the neutral sound of a non-player character of a video game so as to give the player a feeling that the non-player character is experiencing an emotion. Based on the same principle, the invention can be used for real-time modification of sentences spoken by the humanoid robot, in order to provide the user of the humanoid robot with the feeling that the latter is experiencing sensations and to improve the interaction between the user and the humanoid robot. The invention can also be applied to the sound of a player of an online video game or for therapeutic purposes for modifying the patient's sound in real time in order to improve the emotional state of the patient by giving him an impression that he is speaking in smiling speech.

Fig. 3a and 3b show two exemplary methods according to the present invention.

Fig. 3a shows a first exemplary method according to the present invention.

The method 300a is a method for modifying a sound signal and may for example be used for imparting an emotion to a neutral voice track. The emotion may include making the speech sound more smiling, but may also include making it less smiling, more tense, or imparting an intermediate emotional state to the speech.

The method 300a comprises the steps of obtaining 310 a time frame of the sound signal and transforming the time frame in the frequency domain. Step 310 comprises obtaining successive time frames forming the sound signal.

The audio frames may be obtained in different ways. For example, the audio frame may be obtained by recording an operator speaking into a microphone, reading an audio file, or receiving audio data, such as over a connection.

According to different embodiments of the invention, the time frame may be of fixed or variable duration. For example, the time frame may have a duration as short as possible while still allowing a good spectral analysis, e.g. 25 ms or 50 ms. Such a duration advantageously makes it possible to obtain a sound signal representative of a phoneme while limiting the lag produced by the modification of the sound signal.

The sound signals may be of different types according to different embodiments of the invention. For example, the sound signal may be a mono signal, a stereo signal or a signal comprising more than two channels. The method 300a may be applied to all or some of the channels of the signal. Also, the signal may be sampled according to different frequencies, for example 16000Hz, 22050Hz, 32000Hz, 44100Hz, 48000Hz, 88200Hz or 96000 Hz. The samples may be represented in different ways. For example, the samples may be sound samples represented in 8, 12, 16, 24, or 32 bits. Thus, the present invention may be applied to computer representations of any type of sound signal.

According to different embodiments of the invention, the time frames may be obtained directly in the form of their frequency transformation, or obtained in the time domain and transformed into the frequency domain.

For example, if the sound signal is initially stored or transmitted in a compressed audio format in which the signal is represented in the frequency domain, such as MP3 (MPEG-1/2 Audio Layer III), AAC (Advanced Audio Coding) or WMA (Windows Media Audio), the audio signal may be obtained directly in the frequency domain.

It is also possible to first obtain the frame in the time domain and then convert it to the frequency domain. For example, a microphone, such as the microphone into which call center operator 210 speaks, may be used to record the sound directly. The time frame is then formed by first storing a given number of consecutive samples (defined by the duration of the frame and the sampling frequency of the sound signal), and then applying a frequency transform to the sound signal. The frequency transform may be, for example, a DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform), MDCT (Modified Discrete Cosine Transform), or any other suitable transform that converts the sound samples from the time domain to the frequency domain.
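The windowing and frequency transform just described can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the frame length, hop size and Hann window are assumptions:

```python
import numpy as np

def frames_to_spectra(samples, frame_len=1024, hop=512):
    """Cut a mono signal into overlapping windowed frames and
    convert each frame to the frequency domain with an FFT."""
    window = np.hanning(frame_len)
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        spectra.append(np.fft.rfft(frame))  # one-sided spectrum
    return np.array(spectra)

# One second of a 440 Hz tone sampled at 16 kHz
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = frames_to_spectra(sig)
```

A 50 ms frame at 16 kHz would correspond to `frame_len = 800` samples; overlapping hops reduce discontinuities when frames are later re-synthesized.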

The method 300a next comprises applying a first transform 320a of the sound signal to the frequency domain for at least one time frame.

The first transformation 320a comprises a step of extracting 330 a spectral envelope of the sound signal for the at least one frame. Extracting the spectral envelope of a sound signal from the frequency transform of a frame is well known to those skilled in the art and may be done in many ways. The extraction may be accomplished, for example, by linear predictive coding, as described by Makhoul, J. (1975), Linear prediction: A tutorial review, Proceedings of the IEEE, 63(4), 561-580. It may also be done by a cepstral transformation, for example as described by Röbel, A., Villavicencio, F., & Rodet, X. (2007), On cepstral and all-pole based spectral envelope modeling with unknown model order, Pattern Recognition Letters, 28(11), 1343-1350. Any other spectral envelope estimation method known to those skilled in the art may also be used.
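As an illustration of one such estimator, a minimal LPC-based envelope sketch (autocorrelation method, pure NumPy) is given below; the model order of 18 is an assumption, not a value specified by the patent:

```python
import numpy as np

def lpc_envelope(frame, order=18, n_bins=513):
    """Estimate the spectral envelope of one windowed time frame
    by linear predictive coding (autocorrelation method)."""
    # Autocorrelation of the frame
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # Solve the Yule-Walker equations R a = r for the LPC coefficients
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Evaluate the all-pole model gain / |A(e^jw)| on the FFT grid
    a_full = np.concatenate(([1.0], -a))
    A = np.fft.rfft(a_full, 2 * (n_bins - 1))
    gain = np.sqrt(max(r[0] - np.dot(a, r[1:order + 1]), 1e-12))
    return gain / np.abs(A)

frame = np.hanning(800) * np.random.default_rng(0).standard_normal(800)
env = lpc_envelope(frame)
```

The envelope follows the peaks of the spectrum, which is what makes the formant maxima easy to locate in the next step.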

The first transformation 320a further comprises the step of calculating 340 the frequencies of the formants of said spectral envelope. The invention may use many formant extraction methods. The calculation of the formant frequencies of the spectral envelope may be accomplished, for example, using the method described in McCandless, S. (1974), An algorithm for automatic formant extraction using linear prediction spectra, IEEE Transactions on Acoustics, Speech, and Signal Processing, 22(2), 135-141.
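One classical way to obtain candidate formant frequencies from an LPC model is to take the angles of the complex roots of the prediction polynomial. This root-finding sketch is only an illustration; the patent cites McCandless's peak-picking method, which is more elaborate:

```python
import numpy as np

def formant_frequencies(lpc_coeffs, sample_rate, min_hz=90.0):
    """Estimate formant frequencies from the roots of the LPC
    polynomial A(z) = 1 - sum(a_k z^-k)."""
    a_full = np.concatenate(([1.0], -np.asarray(lpc_coeffs)))
    roots = np.roots(a_full)
    # Keep one root per complex-conjugate pair (positive imaginary part)
    roots = roots[np.imag(roots) > 0]
    freqs = np.angle(roots) * sample_rate / (2 * np.pi)
    # Discard near-DC artifacts and return formants in ascending order
    return np.sort(freqs[freqs > min_hz])
```

In practice, bandwidth thresholds on the root radii are also used to reject spurious poles; that refinement is omitted here for brevity.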

The method 300a further comprises the step of modifying 350 the spectral envelope of the sound signal. Modifying the spectral envelope of the sound signal makes it possible to obtain a spectral envelope that is more representative of the desired mood.

The step of modifying 350 the spectral envelope comprises: a continuous increasing transformation function of the frequency of the spectral envelope is applied 351, the continuous increasing transformation function being parameterized by at least two formant frequencies of the spectral envelope.

The frequencies of the spectral envelope are modified using a continuous increasing transformation function, so that the spectral envelope can be modified without discontinuities between consecutive frequencies. Furthermore, the continuous increasing transformation function is parameterized by at least two formant frequencies, so that a continuous transformation of the spectral envelope can be effected in the parts of the spectrum, delimited by the frequencies of particular formants, that are affected by a given emotion.
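The breakpoint mapping described in claims 8 and 9 can be sketched directly: the first initial frequency F1/2 and the fifth (F5) are left fixed, the frequencies derived from F2, F3 and F4 are scaled by the multiplier coefficient, and the mapping is linearly interpolated in between. The anchors at 0 Hz and the Nyquist frequency are an assumption added here so the function covers the whole spectrum:

```python
import numpy as np

def warp_function(freqs, formants, alpha, nyquist):
    """Continuous increasing frequency mapping parameterized by
    formant frequencies: F1/2 and F5 fixed, F2..F4 scaled by alpha,
    linear interpolation between breakpoints."""
    f1, f2, f3, f4, f5 = formants
    x = np.array([0.0, f1 / 2, f2, f3, f4, f5, nyquist])
    y = np.array([0.0, f1 / 2, alpha * f2, alpha * f3,
                  alpha * f4, f5, nyquist])
    return np.interp(freqs, x, y)
```

With a multiplier slightly above 1 (a "smiling" shift) the mapping remains strictly increasing as long as the scaled F4 stays below F5, so the warped envelope has no fold-overs.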

In an embodiment of the invention, the step of modifying 350 the spectral envelope of the sound signal further comprises: a dynamic filter is applied 352 to the spectral envelope, which filter is parameterized by the frequency of the third formant F3 of the spectral envelope of the sound signal.

This step makes it possible to increase or decrease the signal strength around the frequency of the third formant F3 of the spectral envelope of the sound signal so that the modified spectral envelope more closely approximates the spectral envelope of a phoneme uttered with a desired emotion. For example, as shown in fig. 1, an increase in the sound intensity in the vicinity of the frequency of the third formant F3 of the spectral envelope of the sound signal makes it possible to obtain a spectral envelope closer to the spectral envelope of the same phoneme spoken at the time of smiling.

The filter used in this step may be of different types according to different embodiments of the invention. For example, the filter may be a biquad peaking filter with a gain of 8 dB and Q = 1.2, centered on the frequency of the third formant F3. This filter makes it possible to increase the intensity of the spectrum at frequencies in the vicinity of formant F3, thereby obtaining a spectral envelope closer to that which a smiling speaker would produce.
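Such a peaking biquad can be realized, for example, with the widely used RBJ "Audio EQ Cookbook" coefficient formulas. This is an assumed realization for illustration; the patent does not specify a particular filter design:

```python
import numpy as np

def peaking_biquad(f0, fs, gain_db=8.0, q=1.2):
    """Peaking-EQ biquad coefficients (RBJ Audio EQ Cookbook) for a
    boost of gain_db centered on f0, e.g. the third formant F3."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]  # normalized so a[0] == 1

b, a = peaking_biquad(2500.0, 16000.0)
```

By construction the magnitude response at f0 equals exactly the requested gain, while frequencies far from f0 are left essentially untouched.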

Once the spectral envelope is modified, it may be applied to the sound spectrum. Many embodiments may be used to apply a spectral envelope to the sound spectrum. For example, each component of the spectrum may be multiplied by the corresponding value of the envelope, as described by Liuni, M., & Röbel, A. (2013), Phase vocoder and beyond, Musica/Tecnologia, vol. 7, p. 77-89.
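One common way to impose the new envelope, sketched below, is to scale each spectral bin by the ratio of the modified envelope to the original one; this ratio form is an illustrative choice, not mandated by the text:

```python
import numpy as np

def apply_envelope(spectrum, env_orig, env_mod, eps=1e-12):
    """Impose a modified envelope on a frame's spectrum by scaling
    each bin with the ratio of modified to original envelope."""
    # eps guards against division by near-zero envelope values
    return spectrum * (env_mod / np.maximum(env_orig, eps))
```

The ratio leaves the fine harmonic structure (and hence the fundamental frequency) untouched, which matches the observation of Fig. 1 that F0 is the same in neutral and smiling speech.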

Once the spectra are reconstructed, different processing may be applied to the frames according to different embodiments of the invention. In some embodiments of the invention, the inverse frequency transform may be applied directly to the sound frames in order to reconstruct the audio signal and listen directly to the audio signal. This enables, for example, listening to the voice of a modified non-player character of the video game.

The modified sound signal may also be transmitted for listening by a third-party user. This is the case, for example, for the embodiments relating to call center operators. In that case, the sound signal may be transmitted in raw or compressed form, in the frequency domain or in the time domain.

In some embodiments of the invention, the method 300a may be used to modify an audio signal comprising speech in real-time in order to assign a mood to neutral speech. Such real-time modification may be accomplished, for example, by:

receiving an audio sample, for example, recorded by a microphone in real-time;

creating a frame of audio samples when a sufficient number of samples are available to form the time frame;

applying a frequency transform to the audio samples of the frame;

applying the first transform 320a of the sound signal in the frequency domain to at least one transformed frame.

This method makes it possible to apply an expression to neutral speech in real time. The step of creating the frame (or windowing) introduces a lag in the execution of the method, since the audio samples can only be processed once all the samples of the frame have been received. However, this lag depends only on the duration of the time frame and may be small, for example if the time frame has a duration of 50 ms.

The invention also relates to a computer program product comprising program code instructions recorded on a computer readable medium for performing the method 300a or any other method according to different embodiments of the invention, when said program is run on a computer. The computer program may be stored and/or run, for example, on a workstation of call center operator 210 or on server 220.

Fig. 3b shows a second exemplary method according to the present invention.

The method 300b is also a method for modifying a sound signal such that time frames can be processed differently depending on the type of information contained in the time frames.

To this end, the method 300b includes the step of classifying 360 the time frames according to a set of time frame categories including at least one voiced frame category and one unvoiced frame category.

This step makes it possible to associate each frame with a category and to adapt the processing of the frame according to the category to which it belongs. The time frame may, for example, belong to the voiced-frame category if it comprises a vowel, and to the unvoiced-frame category if it does not comprise a vowel, e.g. if it comprises an unvoiced consonant. There are different methods for determining the voiced or unvoiced character of a time frame. For example, the ZCR (Zero Crossing Rate) of the frame may be calculated and compared to a threshold: voiced speech is quasi-periodic and crosses zero rarely, while unvoiced speech is noise-like and crosses zero often. If the ZCR is below the threshold, the frame is considered voiced, otherwise unvoiced.
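A minimal ZCR-based classifier along these lines may look as follows. This is an illustrative sketch; the threshold value is an assumption that would be tuned on real data:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.sign(frame)
    return np.mean(signs[1:] != signs[:-1])

def is_voiced(frame, threshold=0.1):
    """Classify a frame as voiced when its ZCR is below the threshold.

    Voiced speech (vowels) is dominated by low-frequency periodicity
    and crosses zero rarely; unvoiced sounds (e.g. fricatives) are
    noise-like and cross zero often.
    """
    return zero_crossing_rate(frame) < threshold
```

In practice such a decision is often combined with an energy criterion, but the ZCR comparison alone matches the test described in the text.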

The method 300b includes: for each voiced sound frame, a first transform 320a of the sound signal is applied in the frequency domain. All of the embodiments of the invention discussed with reference to fig. 3a may be applied to the first transformation 320a in the context of the method 300 b.

The method 300b comprises, for each non-voiced frame, applying a second transform 320b of the sound signal in the frequency domain.

The second transform 320b of the sound signal in the frequency domain comprises a step of applying a filter 370 to increase the energy of the sound signal around a frequency, e.g. a predetermined frequency. In one embodiment, the filter is a biquad filter with a gain of 8 dB and Q = 1, centered on a medium-high/treble frequency, e.g. 6000 Hz.

This feature makes it possible to optimize the transformation of the audio signal by applying a suitable transformation to non-voiced frames, whose spectral envelope has no formants.
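One possible realization of the filter of step 370 is the standard "Audio EQ Cookbook" peaking-EQ biquad, sketched below with the gain, Q and centre frequency given above. The implementation is illustrative, not the claimed apparatus:

```python
import numpy as np

def peaking_biquad(fs, f0=6000.0, gain_db=8.0, q=1.0):
    """Biquad peaking-EQ coefficients (b, a) boosting energy around f0.

    Standard Audio EQ Cookbook design; centre frequency, gain and Q
    match the values given in the text.
    """
    a_lin = 10 ** (gain_db / 40.0)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * a_lin, -2 * np.cos(w0), 1 - alpha * a_lin])
    a = np.array([1 + alpha / a_lin, -2 * np.cos(w0), 1 - alpha / a_lin])
    return b / a[0], a / a[0]

def biquad_filter(x, b, a):
    """Direct-form I difference equation applying the biquad to a signal."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = b[0] * x[n]
        if n >= 1:
            y[n] += b[1] * x[n - 1] - a[1] * y[n - 1]
        if n >= 2:
            y[n] += b[2] * x[n - 2] - a[2] * y[n - 2]
    return y
```

For this design the gain at the centre frequency is exactly the requested 8 dB, which can be checked by evaluating the transfer function at f0.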

In one embodiment of the present invention, the second transformation 320b of the sound signal further comprises: a step 330 for extracting the spectral envelope of the sound signal for the frame in question, and a step for applying 351b a continuous increasing transformation function of the frequency of the spectral envelope.

The continuously increasing transformation function applied in step 351b is parameterized identically to the continuously increasing transformation function applied to the frequency of the spectral envelope of the immediately preceding time frame. Thus, in this embodiment of the invention, if a voiced frame is followed by a non-voiced frame, the continuously increasing transformation function of the frequency of the envelope is parameterized according to the frequencies of the formants of the spectral envelope of the voiced frame, and is then applied to the immediately following non-voiced frame with the same parameters. If several non-voiced frames follow a voiced frame, the same transformation function with the same parameters may be applied to the consecutive non-voiced frames.

This feature makes it possible to apply a transformation function to the frequency of the spectral envelope of a non-voiced frame, even though the non-voiced frame does not comprise formants, while benefiting from a transformation that is as consistent as possible with the preceding voiced frame.

Fig. 4a and 4b show two examples of continuously increasing transform functions of the frequency of the spectral envelope of a time frame according to the invention.

Fig. 4a shows a first example continuous increasing transformation function of the frequency of the spectral envelope of a time frame according to the invention.

Function 400a defines the frequency of the modified spectral envelope, shown on the x-axis 401, as a function of the frequency of the initial spectral envelope, shown on the y-axis 402. The function thus makes it possible to construct a modified spectral envelope as follows: the intensity at each frequency of the modified spectral envelope is equal to the intensity at the frequency of the initial spectral envelope indicated by the function. For example, the intensity at frequency 411a of the modified spectral envelope is equal to the intensity at frequency 410a of the initial spectral envelope.

In one set of embodiments of the invention, the transform function for frequency is defined as follows:

a modified frequency is calculated for each initial frequency of a set of initial frequencies. In the example of function 400a, modified frequencies 411a, 421a, 431a, 441a, and 451a are calculated corresponding to initial frequencies 410a, 420a, 430a, 440a, and 450a, respectively;

next, linear interpolations are performed between successive pairs of initial and modified frequencies, the initial frequencies being determined from the formants of the spectral envelope. For example, linear interpolation 460 makes it possible to define linearly, for each initial frequency between the first initial frequency 410a and the second initial frequency 420a, a modified frequency between the first modified frequency 411a and the second modified frequency 421a.

Similarly:

linear interpolation 461 makes it possible to define linearly, for each initial frequency between the second initial frequency 420a and the third initial frequency 430a, a modified frequency between the second modified frequency 421a and the third modified frequency 431a;

linear interpolation 462 makes it possible to define linearly, for each initial frequency between the third initial frequency 430a and the fourth initial frequency 440a, a modified frequency between the third modified frequency 431a and the fourth modified frequency 441a;

linear interpolation 463 makes it possible to define linearly, for each initial frequency between the fourth initial frequency 440a and the fifth initial frequency 450a, a modified frequency between the fourth modified frequency 441a and the fifth modified frequency 451a.

The modified frequencies may be calculated in different ways. Some modified frequencies may be equal to the initial frequencies. Others may be obtained, for example, by multiplying the initial frequency by a multiplier coefficient α. This makes it possible to obtain a modified frequency higher or lower than the initial frequency, depending on whether the multiplier coefficient α is greater or smaller than 1. Typically, modified frequencies above the respective initial frequencies (α > 1) are associated with a happier or more smiling voice, while modified frequencies below the respective initial frequencies (α < 1) are associated with a more tense, less smiling voice. In general, the more the value of the multiplier coefficient α differs from 1, the more pronounced the applied effect. The value of the coefficient α thus makes it possible to define not only the transformation to be applied to the speech, but also the strength of this transformation.
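The construction described above (anchor frequencies, linear interpolation between them, and reading of envelope intensities) may be sketched as follows. The function name and the pinning of the band edges, which leaves frequencies outside the outermost anchors unchanged, are illustrative assumptions:

```python
import numpy as np

def warp_envelope(env, freqs, init_freqs, mod_freqs):
    """Warp a spectral envelope with a piecewise-linear frequency map.

    init_freqs/mod_freqs are matched anchor frequencies derived from
    the formants; between anchors the map is linearly interpolated,
    so it remains continuous and increasing.
    """
    # Pin the band edges so frequencies outside the anchors are unchanged.
    xs = np.concatenate([[freqs[0]], init_freqs, [freqs[-1]]])
    ys = np.concatenate([[freqs[0]], mod_freqs, [freqs[-1]]])
    # For each modified frequency, find the initial frequency it maps from...
    src = np.interp(freqs, ys, xs)
    # ...and read the envelope intensity there.
    return np.interp(src, freqs, env)
```

With identical anchor sets the envelope is returned unchanged; multiplying the anchors by a coefficient α > 1 moves the corresponding envelope features towards higher frequencies.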

In one set of embodiments of the invention, the initial frequencies used to parameterize the transform function are as follows:

a first initial frequency (410a) calculated as half the frequency of the first formant (F1) of the spectral envelope of the sound signal;

a second initial frequency (420a) calculated from the frequency of the second formant (F2) of the spectral envelope of the sound signal;

a third initial frequency (430a) calculated from the frequency of the third formant (F3) of the spectral envelope of the sound signal;

a fourth initial frequency (440a) calculated from the frequency of the fourth formant (F4) of the spectral envelope of the sound signal;

a fifth initial frequency (450a) calculated from the frequency of the fifth formant (F5) of the spectral envelope of the sound signal.

Frequencies of the spectral envelope lower than the first initial frequency 410a or higher than the fifth initial frequency 450a are therefore not modified. This makes it possible to limit the transformation to the frequencies corresponding to formants affected by a tense or smiling timbre of speech, and, for example, not to modify the fundamental frequency F0.

In one embodiment of the invention, the initial frequencies correspond to the frequencies of the formants of the current time frame. Thus, the parameters of the transformation function are updated for each time frame.

The initial frequencies may also be calculated as the average of the frequencies of formants of equal order over two or more consecutive time frames. For example, the first initial frequency 410a may be calculated as the average of the frequencies of the first formant F1 of the spectral envelope over n consecutive time frames, where n ≥ 2.

In one set of embodiments of the present invention, the frequency transformation is mainly applied between the second formant F2 and the fourth formant F4. The modified frequencies may thus be calculated as follows:

the first modified frequency 411a is calculated as equal to the first initial frequency 410a;

the second modified frequency 421a is calculated by multiplying the second initial frequency 420a by the multiplier coefficient α;

the third modified frequency 431a is calculated by multiplying the third initial frequency 430a by the multiplier coefficient α;

the fourth modified frequency 441a is calculated by multiplying the fourth initial frequency 440a by the multiplier coefficient α;

the fifth modified frequency 451a is calculated as equal to the fifth initial frequency 450a.
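The parameterization above may be sketched as follows. This is an illustrative fragment; the function name is an assumption:

```python
import numpy as np

def smile_anchor_points(formants, alpha=1.1):
    """Anchor pairs for the transform of Fig. 4a.

    formants = (F1, F2, F3, F4, F5) in Hz. The first anchor is half
    of F1 and the last is F5; both are left unchanged, while F2-F4
    are multiplied by alpha (alpha > 1 for a more smiling voice,
    alpha < 1 for a more tense one).
    """
    f1, f2, f3, f4, f5 = formants
    init = np.array([f1 / 2, f2, f3, f4, f5])
    mod = np.array([f1 / 2, alpha * f2, alpha * f3, alpha * f4, f5])
    return init, mod
```

Note that α must stay close enough to 1 that the modified anchors remain in increasing order (e.g. αF4 < F5), so that the resulting transform function is still continuously increasing.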

The example transformation function 400a makes it possible to transform the spectral envelope of the time frame to obtain a more smiling voice, thanks to the higher frequencies, in particular between the second formant F2 and the fourth formant F4.

In one embodiment, the multiplier coefficient α is predefined. For example, the multiplier factor α may be equal to 1.1 (10% increase in frequency).

In some embodiments of the invention, the multiplier coefficient α may depend on the modification strength of the speech to be generated.

In some embodiments of the invention, the multiplier coefficient α may also be determined for a given user. For example, it may be determined during a training phase in which the user utters phonemes with neutral speech and then with smiling speech. The frequencies of the different formants are compared between the phonemes uttered with neutral speech and with smiling speech, from which a multiplier coefficient α suited to the given user can be calculated.
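One possible form of such a training-phase estimate is sketched below. Taking the median ratio of smiling to neutral formant frequencies is an illustrative choice, not prescribed by the text:

```python
import numpy as np

def estimate_alpha(neutral_formants, smiling_formants):
    """Estimate a per-user multiplier from paired training utterances.

    Both arguments are lists of (F2, F3, F4) tuples measured on the
    same phonemes uttered neutrally and with a smile; alpha is taken
    as the median ratio of smiling to neutral formant frequency.
    """
    ratios = [s / n
              for neu, smi in zip(neutral_formants, smiling_formants)
              for n, s in zip(neu, smi)]
    return float(np.median(ratios))
```

The median makes the estimate robust to occasional formant-tracking errors in the training recordings.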

In one set of embodiments of the invention, the value of the coefficient α depends on the phoneme. In these embodiments, the method according to the invention comprises a step of detecting the current phoneme and defining the value of the coefficient α for the current frame. For example, the value of α may be determined for each phoneme during a training phase.

Fig. 4b shows a second example continuous increasing transformation function of the frequency of the spectral envelope of a time frame according to the invention.

Fig. 4b shows a second function 400b, by which a more tense, less smiling timbre may be given to the speech.

The illustration of fig. 4b is the same as the illustration of fig. 4 a: the frequency of the modified spectral envelope is shown on the x-axis 401 as a function of the frequency of the initial spectral envelope shown on the y-axis 402.

The function 400b is likewise established by calculating a modified frequency 411b, 421b, 431b, 441b, 451b for each initial frequency 410b, 420b, 430b, 440b, 450b, and then defining linear interpolations 460b, 461b, 462b and 463b between the initial and modified frequencies.

In the example of function 400b, the modified frequencies 411b and 451b are equal to the initial frequencies 410b and 450b, while the modified frequencies 421b, 431b and 441b are obtained by multiplying the initial frequencies 420b, 430b and 440b by a coefficient α < 1. Thus, the frequencies of the second, third and fourth formants F2, F3 and F4 of the spectral envelope modified by the function 400b will be lower than the frequencies of the corresponding formants of the initial spectral envelope. This makes it possible to give the voice a tense timbre.

The functions 400a and 400b are given as examples only. Any continuously increasing function of the frequency of the spectral envelope, parameterized by the frequencies of the formants of the envelope, may be used in the present invention. For example, a function defined on the basis of the frequencies of the formants associated with the smiling character of speech is particularly suitable.

Fig. 5a, 5b and 5c show three examples of spectral envelopes of vowels modified according to the present invention.

Fig. 5a shows a spectral envelope 510a of a phoneme 'e' spoken neutrally by an experimenter, and a spectral envelope 520a of the same phoneme 'e' spoken smilingly by the experimenter. Fig. 5a also shows a spectral envelope 530a modified by the method according to the invention to make the speech more smiling. Thus, spectral envelope 530a shows the result of applying the method according to the invention to spectral envelope 510a.

Fig. 5b shows a spectral envelope 510b of a phoneme 'a' spoken neutrally by the experimenter, and a spectral envelope 520b of the same phoneme 'a' spoken smilingly by the experimenter. Fig. 5b also shows a spectral envelope 530b modified by the method according to the invention to make the speech more smiling. Thus, spectral envelope 530b shows the result of applying the method according to the invention to spectral envelope 510b.

Fig. 5c shows a spectral envelope 510c of a phoneme 'e' spoken neutrally by a second experimenter, and a spectral envelope 520c of the same phoneme 'e' spoken smilingly by the second experimenter. Fig. 5c also shows a spectral envelope 530c modified by the method according to the invention to make the speech more smiling. Thus, spectral envelope 530c shows the result of applying the method according to the invention to spectral envelope 510c.

In these examples, the method according to the invention comprises applying the frequency transformation function 400a shown in Fig. 4a, and applying a biquad filter centered on the frequency of the third formant F3 of the envelope.

Fig. 5a, 5b and 5c show that the method according to the invention makes it possible to preserve the overall shape of the envelope of the phonemes, while modifying the position and amplitude of certain formants, in order to simulate a smiling-appearing sound, while preserving naturalness.

More particularly, it is worth noting that the spectral envelope transformed according to the invention is very similar to that of a smiling voice in the medium and high frequencies of the spectrum, as illustrated by the similar curves 521a and 531a, 521b and 531b, and 521c and 531c, respectively.

Fig. 6a, 6b and 6c show three examples of spectral diagrams of phonemes uttered with and without smiling.

Fig. 6a shows a spectrogram 610a of an 'a' phoneme uttered neutrally, as well as a spectrogram 620a of the same 'a' phoneme uttered with a smile. Fig. 6b shows a spectrogram 610b of an 'e' phoneme uttered neutrally, and a spectrogram 620b of the same 'e' phoneme uttered with a smile. Fig. 6c shows a spectrogram 610c of an 'i' phoneme uttered neutrally, as well as a spectrogram 620c of the same 'i' phoneme uttered with a smile.

Each spectrogram shows the evolution of sound intensity at different frequencies over time, as follows:

the horizontal axis represents time during the utterance of the phoneme;

the vertical axis represents different frequencies;

for a given time and frequency, the sound intensity is represented by the corresponding grey level: white represents zero intensity and dark grey represents a high intensity of the frequency at the corresponding time.

Consistent with the spectral envelopes shown in Fig. 1, it is generally observed that, in the case of smiling sounds, the energy is increased in the medium and high frequencies of the spectrum compared to neutral sounds: an increase in sound intensity can thus be seen in the medium and high frequencies of the spectrum, e.g. between regions 611a and 621a, 611b and 621b, and 611c and 621c, respectively.

Fig. 7 shows an example of a vowel spectrogram transform according to the present invention.

FIG. 7 shows a spectrogram 710 of a neutral-voiced 'i' phoneme, as well as a spectrogram 720 of the same 'i' phoneme with the present invention applied to make the speech more smiling.

Each spectrogram shows the evolution of the intensity of different frequencies over time, according to the same diagram as fig. 6a to 6 c.

In general, it can be observed that, consistent with the spectral envelopes shown in Figs. 5a to 5c, the sound intensity increases in the medium and high frequencies of the spectrum, as shown between regions 711 and 721. The effect of the simulated smiling speech is thus similar to the effect of a real smile as shown in Figs. 6a to 6c.

Fig. 8 shows three examples of vowel spectrogram transformations in accordance with three exemplary embodiments of the present invention.

In one set of embodiments of the invention, the value of the multiplier coefficient α may be modified over time, for example to simulate a gradual modification of the timbre of speech. For example, the value of the multiplier coefficient α may be increased to give the impression of increasingly smiling speech, or decreased to give the impression of increasingly stressed speech.
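Such a gradual modification may be obtained, for example, by ramping α linearly across frames, as in the following illustrative sketch (the function name and the start and end values are assumptions):

```python
import numpy as np

def alpha_trajectory(n_frames, alpha_start=1.0, alpha_end=1.2):
    """Per-frame multiplier values ramping linearly from start to end.

    A rising ramp gives the impression of increasingly smiling
    speech; a falling ramp, of increasingly tense speech.
    """
    return np.linspace(alpha_start, alpha_end, n_frames)
```

The value returned for each frame is then used to parameterize the transformation function of that frame, so the timbre evolves smoothly over the utterance.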

Spectrogram 810 represents the spectrogram of a vowel pronounced with a neutral tone and modified by the invention with a constant multiplier coefficient α. Spectrogram 820 represents the spectrogram of a vowel pronounced with a neutral tone and modified by the invention with a decreasing multiplier coefficient α. Spectrogram 830 represents the spectrogram of a vowel pronounced with a neutral tone and modified by the invention with an increasing multiplier coefficient α.

It can be observed that the evolution of the modified spectrogram over time differs between these examples: with a decreasing multiplier coefficient α, the intensity of the medium and high frequencies of the spectrum gradually increases 821 and then decreases 822. Conversely, with an increasing multiplier coefficient α, the intensity of the medium and high frequencies of the spectrum gradually decreases 831 and then increases 832.

This example demonstrates the ability to adapt the transformation of the spectral envelope according to the method of the present invention to produce effects in real time, such as producing more or less smiling speech.

The above examples illustrate the ability of the present invention to assign timbres to speech with reasonable computational complexity while ensuring that the modified speech appears natural. However, they are provided by way of example only and in no way limit the scope of the invention as defined in the claims below.
