Sound processing method, sound processing device and recording medium


Abstract: This technique, "Sound processing method, sound processing device and recording medium", was devised by 大道龙之介 and 嘉山启 on 2019-03-08. Its main content is as follows: The sound processing device includes a synthesis processing unit that deforms a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal representing a deformed sound in which a singing voice is deformed in accordance with a reference voice, and that generates the 3rd sound signal corresponding to the synthesized spectral envelope outline shape. The 1st difference is the difference between the 1st spectral envelope outline shape of a 1st sound signal representing the singing voice and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal; the 2nd difference is the difference between the 2nd spectral envelope outline shape of a 2nd sound signal representing the reference voice and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal.

1. A sound processing method realized by a computer, the method comprising:

deforming a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal; and

generating the 3rd sound signal corresponding to the synthesized spectral envelope outline shape,

wherein the 1st difference is a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, the 2nd difference is a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, and the 3rd sound signal represents a deformed sound in which the 1st sound is deformed in accordance with the 2nd sound.

2. The sound processing method according to claim 1,

wherein the temporal position of the 2nd sound signal relative to the 1st sound signal is adjusted so that the end point of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, coincides with the end point of a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable,

the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and

the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal.

3. The sound processing method according to claim 2,

wherein the 1st time and the 2nd time are each the later of the start point of the 1st stationary period and the start point of the 2nd stationary period.

4. The sound processing method according to claim 1,

wherein the temporal position of the 2nd sound signal relative to the 1st sound signal is adjusted so that the start point of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, coincides with the start point of a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable,

the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and

the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal.

5. The sound processing method according to claim 4,

wherein the 1st time and the 2nd time are each the start point of the 1st stationary period.

6. The sound processing method according to any one of claims 2 to 5,

wherein the 1st stationary period is determined in accordance with a 1st index representing a degree of change in the fundamental frequency of the 1st sound signal and a 2nd index representing a degree of change in the spectral shape of the 1st sound signal.

7. The sound processing method according to any one of claims 1 to 6,

wherein, in generating the synthesized spectral envelope outline shape,

a result of multiplying the 1st difference by a 1st coefficient is subtracted from the 1st spectral envelope outline shape, and a result of multiplying the 2nd difference by a 2nd coefficient is added thereto.

8. The sound processing method according to any one of claims 1 to 7,

wherein, in generating the synthesized spectral envelope outline shape,

a processing period of the 1st sound signal is extended in accordance with the time length of an expression period of the 2nd sound signal that is applied to the deformation of the 1st sound signal, and

the 1st spectral envelope outline shape in the extended processing period is deformed in accordance with the 1st difference in the extended processing period and the 2nd difference in the expression period, thereby generating the synthesized spectral envelope outline shape.

9. A sound processing apparatus comprising a memory and 1 or more processors,

wherein the 1 or more processors execute instructions stored in the memory to thereby:

deform a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal, and

generate the 3rd sound signal corresponding to the synthesized spectral envelope outline shape,

wherein the 1st difference is a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, the 2nd difference is a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, and the 3rd sound signal represents a deformed sound in which the 1st sound is deformed in accordance with the 2nd sound.

10. The sound processing apparatus according to claim 9,

wherein the temporal position of the 2nd sound signal relative to the 1st sound signal is adjusted so that the end point of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, coincides with the end point of a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable,

the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and

the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal.

11. The sound processing apparatus according to claim 10,

wherein the 1st time and the 2nd time are each the later of the start point of the 1st stationary period and the start point of the 2nd stationary period.

12. The sound processing apparatus according to claim 9,

wherein the temporal position of the 2nd sound signal relative to the 1st sound signal is adjusted so that the start point of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, coincides with the start point of a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable,

the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and

the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal.

13. The sound processing apparatus according to claim 12,

wherein the 1st time and the 2nd time are each the start point of the 1st stationary period.

14. The sound processing apparatus according to any one of claims 9 to 13,

wherein, in generating the synthesized spectral envelope outline shape, the 1 or more processors subtract from the 1st spectral envelope outline shape a result obtained by multiplying the 1st difference by a 1st coefficient, and add thereto a result obtained by multiplying the 2nd difference by a 2nd coefficient.

15. A computer-readable recording medium having recorded thereon a program that causes a computer to execute:

a 1st process of generating a synthesized spectral envelope outline shape of a 3rd sound signal by deforming a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, the 1st difference being a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, the 2nd difference being a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, and the 3rd sound signal representing a deformed sound in which the 1st sound is deformed in accordance with the 2nd sound; and

a 2nd process of generating the 3rd sound signal corresponding to the synthesized spectral envelope outline shape.

Technical Field

The present invention relates to a technique for processing a sound signal representing a sound.

Background

Various techniques have been proposed for adding sound expressions to speech. For example, patent document 1 discloses a technique of converting speech represented by a speech signal into speech having a characteristic sound quality, such as a rough or hoarse voice, by shifting each harmonic component of the speech signal in the frequency domain.

Patent document 1: Japanese Laid-Open Patent Publication No. 2014-2338

Disclosure of Invention

However, the technique of patent document 1 leaves room for improvement from the viewpoint of generating acoustically natural sound. In view of the above, an object of the present invention is to synthesize an acoustically natural sound.

In order to solve the above problem, a sound processing method according to a preferred aspect of the present invention deforms a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal, and generates the 3rd sound signal corresponding to the synthesized spectral envelope outline shape. The 1st difference is a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal; the 2nd difference is a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal; and the 3rd sound signal represents a deformed sound in which the 1st sound is deformed in accordance with the 2nd sound.

In order to solve the above problem, a sound processing device according to a preferred aspect of the present invention includes a memory and 1 or more processors. By executing instructions stored in the memory, the 1 or more processors realize a synthesis processing unit that deforms a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal, and that generates the 3rd sound signal corresponding to the synthesized spectral envelope outline shape. The 1st difference is a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal; the 2nd difference is a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal; and the 3rd sound signal represents a deformed sound in which the 1st sound is deformed in accordance with the 2nd sound.

In order to solve the above problem, a recording medium according to a preferred aspect of the present invention records a program that causes a computer to execute: a 1st process of generating a synthesized spectral envelope outline shape of a 3rd sound signal by deforming a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, the 1st difference being a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, the 2nd difference being a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, and the 3rd sound signal representing a deformed sound in which the 1st sound is deformed in accordance with the 2nd sound; and a 2nd process of generating the 3rd sound signal corresponding to the synthesized spectral envelope outline shape.

Drawings

Fig. 1 is a block diagram illustrating the configuration of a sound processing device according to an embodiment of the present invention.

Fig. 2 is a block diagram illustrating the functional configuration of the sound processing device.

Fig. 3 is an explanatory diagram of a stationary period in the 1st sound signal.

Fig. 4 is a flowchart illustrating a specific procedure of the signal analysis processing.

Fig. 5 illustrates the temporal variation of the fundamental frequency immediately after the start of utterance of a singing voice.

Fig. 6 illustrates the temporal variation of the fundamental frequency immediately before the end of utterance of a singing voice.

Fig. 7 is a flowchart illustrating a specific procedure of the release processing.

Fig. 8 is an explanatory diagram of the release processing.

Fig. 9 is an explanatory diagram of a spectral envelope outline shape.

Fig. 10 is a flowchart illustrating a specific procedure of the attack processing.

Fig. 11 is an explanatory diagram of the attack processing.

Detailed Description

Fig. 1 is a block diagram illustrating the configuration of a sound processing device 100 according to a preferred embodiment of the present invention. The sound processing device 100 of the present embodiment is a signal processing device that adds various sound expressions to the voice with which a user sings a musical piece (hereinafter referred to as the "singing voice"). A sound expression is an acoustic characteristic added to the singing voice (an example of the 1st sound). In the context of singing a musical piece, a sound expression is a musical expression or inflection associated with the utterance of the voice (i.e., singing). Specifically, singing expressions such as vocal fry, growl, and rough voice are preferred examples of sound expressions. A sound expression can also be described as a characteristic of sound quality.

A sound expression is particularly prominent in the portion of the singing voice immediately after the start of utterance, where the volume increases (hereinafter the "attack portion"), and in the portion immediately before the end of utterance, where the volume decreases (hereinafter the "release portion"). In consideration of this tendency, the present embodiment adds sound expressions to the attack portion and the release portion of the singing voice in particular.

As illustrated in fig. 1, the sound processing device 100 is realized by a computer system including a control device 11, a storage device 12, an operation device 13, and a sound emitting device 14. For example, a portable information terminal such as a mobile phone or a smartphone, or a portable or stationary information terminal such as a personal computer, is suitably used as the sound processing device 100. The operation device 13 is an input device that receives instructions from the user; for example, a plurality of operation members operated by the user, or a touch panel that detects contact by the user, is suitable as the operation device 13.

The control device 11 includes 1 or more processors, such as a CPU (Central Processing Unit), and executes various arithmetic and control processing. The control device 11 of the present embodiment generates a 3rd sound signal Y, which represents a voice (hereinafter referred to as the "deformed sound") obtained by adding a sound expression to the singing voice. The sound emitting device 14 is, for example, a speaker or headphones, and emits the deformed sound represented by the 3rd sound signal Y generated by the control device 11. For convenience, a D/A converter that converts the 3rd sound signal Y generated by the control device 11 from digital to analog is not illustrated. Although fig. 1 illustrates the sound processing device 100 as including the sound emitting device 14, a sound emitting device 14 separate from the sound processing device 100 may instead be connected to the sound processing device 100 by wire or wirelessly.

The storage device 12 is a memory constituted by a known recording medium such as a magnetic recording medium or a semiconductor recording medium, and stores the program executed by the control device 11 and various data used by the control device 11. The storage device 12 may be constituted by a combination of plural types of recording media. A storage device 12 separate from the sound processing device 100 (e.g., cloud storage) may also be provided, with the control device 11 writing to and reading from it via a communication network. In that case, the storage device 12 may be omitted from the sound processing device 100.

The storage device 12 of the present embodiment stores a 1st sound signal X1 and a 2nd sound signal X2. The 1st sound signal X1 is an acoustic signal representing the singing voice uttered by the user of the sound processing device 100 singing a musical piece. The 2nd sound signal X2 is an acoustic signal representing a voice (hereinafter referred to as the "reference voice") sung with sound expressions by a singer other than the user (e.g., a professional singer). The acoustic characteristics (e.g., sound quality) of the 1st sound signal X1 and the 2nd sound signal X2 differ. The sound processing device 100 of the present embodiment generates the 3rd sound signal Y of a deformed sound by adding the sound expressions of the reference voice (an example of the 2nd sound) represented by the 2nd sound signal X2 to the singing voice represented by the 1st sound signal X1. Note that whether the singing voice and the reference voice are of the same musical piece does not matter. Although the above assumes that the speaker of the singing voice and the speaker of the reference voice are different persons, they may be the same person; for example, the singing voice may be a voice sung by the user without sound expressions, and the reference voice a voice sung by the same user with sound expressions.

Fig. 2 is a block diagram illustrating the functional configuration of the control device 11. As illustrated in fig. 2, the control device 11 executes a program (i.e., a series of instructions for the processors) stored in the storage device 12, thereby realizing a plurality of functions (a signal analysis unit 21 and a synthesis processing unit 22) for generating the 3rd sound signal Y from the 1st sound signal X1 and the 2nd sound signal X2. The functions of the control device 11 may instead be realized by a plurality of separately configured devices, or some or all of the functions of the control device 11 may be realized by dedicated electronic circuitry.

The signal analysis unit 21 generates analysis data D1 by analyzing the 1st sound signal X1, and generates analysis data D2 by analyzing the 2nd sound signal X2. The analysis data D1 and the analysis data D2 generated by the signal analysis unit 21 are stored in the storage device 12.

The analysis data D1 is data indicating a plurality of stationary periods Q1 of the 1st sound signal X1. As illustrated in fig. 3, each stationary period Q1 indicated by the analysis data D1 is a variable-length period in which the fundamental frequency f1 and the spectral shape of the 1st sound signal X1 are temporally stable. The analysis data D1 specifies the time T1_S of the start point (hereinafter the "start point time") and the time T1_E of the end point (hereinafter the "end point time") of each stationary period Q1. The fundamental frequency f1 or the spectral shape (i.e., the phoneme) typically changes between 2 successive notes of a musical piece; accordingly, each stationary period Q1 is highly likely to correspond to 1 note of the musical piece.

Similarly, the analysis data D2 is data indicating a plurality of stationary periods Q2 of the 2nd sound signal X2. Each stationary period Q2 is a variable-length period in which the fundamental frequency f2 and the spectral shape of the 2nd sound signal X2 are temporally stable. The analysis data D2 specifies the start point time T2_S and the end point time T2_E of each stationary period Q2. Like each stationary period Q1, each stationary period Q2 is highly likely to correspond to 1 note of a musical piece.
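Although the patent does not prescribe a concrete data layout, the analysis data can be pictured as a simple list of stationary periods. The following Python sketch (all names are hypothetical) illustrates one possible representation:

from dataclasses import dataclass
from typing import List

@dataclass
class StationaryPeriod:
    start: float  # start point time (e.g., T1_S), in seconds
    end: float    # end point time (e.g., T1_E), in seconds

@dataclass
class AnalysisData:
    # Variable-length periods in which the fundamental frequency and
    # the spectral shape of the signal are temporally stable.
    periods: List[StationaryPeriod]

# Example: analysis data D1 containing two stationary periods.
d1 = AnalysisData(periods=[StationaryPeriod(0.31, 0.82),
                           StationaryPeriod(1.05, 1.64)])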

Fig. 4 is a flowchart of the processing S0 (hereinafter the "signal analysis processing") by which the signal analysis unit 21 analyzes the 1st sound signal X1. The signal analysis processing S0 of fig. 4 is started, for example, in response to an instruction from the user via the operation device 13. As illustrated in fig. 4, the signal analysis unit 21 calculates the fundamental frequency f1 of the 1st sound signal X1 for each of a plurality of unit periods (time frames) on the time axis (S01). A well-known technique may be employed to calculate the fundamental frequency f1. Each unit period is sufficiently shorter than the time length assumed for a stationary period Q1.

The signal analysis unit 21 calculates, for each unit period, a mel cepstrum M1 representing the spectral shape of the 1st sound signal X1 (S02). The mel cepstrum M1 is expressed by a plurality of coefficients representing the envelope of the spectrum of the 1st sound signal X1, and can also be regarded as a feature quantity representing the phoneme of the singing voice. A well-known technique may be employed to calculate the mel cepstrum M1. Note that MFCCs (Mel-Frequency Cepstral Coefficients) may be calculated instead of the mel cepstrum M1 as the feature quantity representing the spectral shape of the 1st sound signal X1.

The signal analysis unit 21 estimates, for each unit period, whether the singing voice represented by the 1st sound signal X1 is voiced (S03). That is, it determines whether the singing voice is voiced or unvoiced in each unit period. A known technique may be employed for the voiced/unvoiced estimation. The order of the calculation of the fundamental frequency f1 (S01), the calculation of the mel cepstrum M1 (S02), and the voiced/unvoiced estimation (S03) is arbitrary and is not limited to the order illustrated above.

The signal analysis unit 21 calculates, for each unit period, the 1st index indicating the degree of temporal change of the fundamental frequency f1 (S04). For example, the difference between the fundamental frequencies f1 of 2 successive unit periods is calculated as the 1st index. The more significant the temporal change of the fundamental frequency f1, the larger the value of the 1st index.

The signal analysis unit 21 calculates, for each unit period, the 2nd index indicating the degree of temporal change of the mel cepstrum M1 (S05). For example, a value obtained by combining (e.g., summing or averaging), over the plurality of coefficients, the differences between corresponding coefficients of the mel cepstrum M1 in 2 successive unit periods is suitable as the 2nd index. The more significant the temporal change in the spectral shape of the singing voice, the larger the value of the 2nd index. For example, the 2nd index takes a large value near times at which the phoneme of the singing voice changes.

The signal analysis unit 21 calculates, for each unit period, the change index Δ corresponding to the 1st index and the 2nd index (S06). For example, a weighted sum of the 1st index and the 2nd index is calculated for each unit period as the change index Δ. The weights of the 1st index and the 2nd index are set to predetermined fixed values or to variable values corresponding to an instruction from the user via the operation device 13. As understood from the above description, the change index Δ tends to take a larger value as the temporal change of the fundamental frequency f1 or the mel cepstrum M1 (i.e., the spectral shape) of the 1st sound signal X1 increases.

The signal analysis unit 21 identifies a plurality of stationary periods Q1 in the 1st sound signal X1 (S07). The signal analysis unit 21 of the present embodiment identifies the stationary periods Q1 in accordance with the result of the voiced/unvoiced estimation for the singing voice (S03) and the change index Δ. Specifically, the signal analysis unit 21 defines, as one stationary period Q1, a series of consecutive unit periods in which the singing voice is estimated to be voiced and the change index Δ is below a predetermined threshold. Unit periods in which the singing voice is estimated to be unvoiced, or in which the change index Δ exceeds the threshold, are excluded from the stationary periods Q1. Having defined the stationary periods Q1 of the 1st sound signal X1 in the above manner, the signal analysis unit 21 stores, in the storage device 12, analysis data D1 specifying the start point time T1_S and the end point time T1_E of each stationary period Q1 (S08).
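As a rough illustration of steps S04 to S07, the following Python sketch (a hypothetical implementation; it assumes the frame-level fundamental frequencies, mel-cepstrum coefficients, and voiced/unvoiced flags of steps S01 to S03 are already given) computes the change index Δ and extracts the stationary periods:

import numpy as np

def find_stationary_periods(f0, mcep, voiced, w1=1.0, w2=1.0, threshold=0.5):
    """Identify stationary periods from frame-level features (S04-S07).

    f0:     (N,) fundamental frequency per unit period (frame)
    mcep:   (N, K) mel-cepstrum coefficients per frame
    voiced: (N,) boolean voiced/unvoiced decision per frame
    Returns a list of (start_frame, end_frame) pairs.
    """
    # 1st index: degree of temporal change of the fundamental frequency (S04).
    index1 = np.abs(np.diff(f0, prepend=f0[0]))
    # 2nd index: degree of temporal change of the spectral shape (S05),
    # combined (here: averaged) over the mel-cepstrum coefficients.
    index2 = np.mean(np.abs(np.diff(mcep, axis=0, prepend=mcep[:1])), axis=1)
    # Change index: weighted sum of the two indices (S06).
    delta = w1 * index1 + w2 * index2
    # A frame belongs to a stationary period if the voice is voiced and
    # the change index is below the threshold (S07).
    stable = np.asarray(voiced) & (delta < threshold)
    periods, start = [], None
    for n, flag in enumerate(stable):
        if flag and start is None:
            start = n
        elif not flag and start is not None:
            periods.append((start, n - 1))
            start = None
    if start is not None:
        periods.append((start, len(stable) - 1))
    return periods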

The signal analysis unit 21 also performs the signal analysis processing S0 described above on the 2nd sound signal X2 representing the reference voice, thereby generating analysis data D2. Specifically, the signal analysis unit 21 calculates the fundamental frequency f2 (S01), calculates a mel cepstrum M2 (S02), and performs the voiced/unvoiced estimation (S03) for each unit period of the 2nd sound signal X2. The signal analysis unit 21 then calculates the change index Δ corresponding to the 1st index indicating the degree of temporal change of the fundamental frequency f2 and the 2nd index indicating the degree of temporal change of the mel cepstrum M2 (S04 to S06). The signal analysis unit 21 identifies each stationary period Q2 of the 2nd sound signal X2 in accordance with the result of the voiced/unvoiced estimation for the reference voice (S03) and the change index Δ (S07), and stores, in the storage device 12, analysis data D2 specifying the start point time T2_S and the end point time T2_E of each stationary period Q2 (S08). The analysis data D1 and the analysis data D2 may instead be edited in accordance with instructions from the user via the operation device 13. Specifically, analysis data D1 specifying start point times T1_S and end point times T1_E designated by the user, and analysis data D2 specifying start point times T2_S and end point times T2_E designated by the user, are stored in the storage device 12. In that case, the signal analysis processing S0 may be omitted.

The synthesis processing unit 22 in fig. 2 deforms the 1st sound signal X1 using the analysis data D1 of the 1st sound signal X1 and the analysis data D2 of the 2nd sound signal X2. The synthesis processing unit 22 of the present embodiment includes an attack processing unit 31, a release processing unit 32, and a speech synthesis unit 33. The attack processing unit 31 performs attack processing S1, which adds the sound expression of an attack portion of the 2nd sound signal X2 to the 1st sound signal X1. The release processing unit 32 performs release processing S2, which adds the sound expression of a release portion of the 2nd sound signal X2 to the 1st sound signal X1. The speech synthesis unit 33 synthesizes the 3rd sound signal Y of the deformed sound from the processing results of the attack processing unit 31 and the release processing unit 32.

Fig. 5 illustrates the temporal change of the fundamental frequency f1 immediately after the start of utterance of a singing voice. As illustrated in fig. 5, a voiced period Va exists immediately before the stationary period Q1. The voiced period Va is a voiced period preceding the stationary period Q1 in which the acoustic characteristics of the singing voice (for example, the fundamental frequency f1 or the spectral shape) fluctuate unstably. For example, focusing on the stationary period Q1 immediately after the start of utterance of the singing voice, the period from the time τ1_A at which the utterance of the singing voice starts to the start point time T1_S of the stationary period Q1 corresponds to the voiced period Va. Although the above description focuses on the singing voice, a voiced period Va likewise exists immediately before each stationary period Q2 of the reference voice. In the attack processing S1, the synthesis processing unit 22 (specifically, the attack processing unit 31) adds the sound expression of the attack portion of the 2nd sound signal X2 to the voiced period Va of the 1st sound signal X1 and the stationary period Q1 that follows it.

Fig. 6 illustrates the temporal change of the fundamental frequency f1 immediately before the end of utterance of a singing voice. As illustrated in fig. 6, a voiced period Vr exists immediately after the stationary period Q1. The voiced period Vr is a voiced period following the stationary period Q1 in which the acoustic characteristics of the singing voice (for example, the fundamental frequency f1 or the spectral shape) fluctuate unstably. For example, focusing on the stationary period Q1 immediately before the end of utterance, the voiced section from the end point time T1_E of the stationary period Q1 to the time τ1_R at which the singing voice becomes silent corresponds to the voiced period Vr. Although the above description focuses on the singing voice, a voiced period Vr likewise exists immediately after each stationary period Q2 of the reference voice. In the release processing S2, the synthesis processing unit 22 (specifically, the release processing unit 32) adds the sound expression of the release portion of the 2nd sound signal X2 to the voiced period Vr of the 1st sound signal X1 and the stationary period Q1 immediately preceding it.

< Release processing S2 >

Fig. 7 is a flowchart illustrating the specific contents of the release processing S2 executed by the release processing unit 32. The release processing S2 of fig. 7 is performed for each stationary period Q1 of the 1st sound signal X1.

When the release processing S2 starts, the release processing unit 32 determines whether the sound expression of a release portion of the 2nd sound signal X2 should be added to the stationary period Q1 being processed in the 1st sound signal X1 (S21). Specifically, the release processing unit 32 determines that no release-portion sound expression is added to a stationary period Q1 that satisfies any of the conditions Cr1 to Cr3 below. However, the conditions for determining whether to add a sound expression to a stationary period Q1 of the 1st sound signal X1 are not limited to the following examples.

[Condition Cr1] The time length of the stationary period Q1 is below a prescribed value.

[Condition Cr2] The time length of the silent period immediately after the stationary period Q1 is below a prescribed value.

[Condition Cr3] The time length of the voiced period Vr following the stationary period Q1 exceeds a prescribed value.

It is difficult to add a sound expression with natural sound quality to a stationary period Q1 of sufficiently short time length. Therefore, when the time length of the stationary period Q1 is below the prescribed value (condition Cr1), the release processing unit 32 excludes the stationary period Q1 from the targets of sound-expression addition. When a sufficiently short silent period exists immediately after the stationary period Q1, that silent period may correspond to an unvoiced consonant in the middle of the singing voice, and adding a sound expression over an unvoiced consonant tends to produce an audibly unnatural result. In consideration of this tendency, when the time length of the silent period immediately after the stationary period Q1 is below the prescribed value (condition Cr2), the release processing unit 32 excludes the stationary period Q1 from the targets of sound-expression addition. Furthermore, when the voiced period Vr immediately after the stationary period Q1 is sufficiently long, a sufficient sound expression is likely already present in the singing voice. Therefore, when the time length of the voiced period Vr following the stationary period Q1 is sufficiently long (condition Cr3), the release processing unit 32 excludes the stationary period Q1 from the targets of sound-expression addition. When determining that no sound expression is to be added to the stationary period Q1 of the 1st sound signal X1 (S21: NO), the release processing unit 32 ends the release processing S2 without performing the processing described below (S22 to S26).

When determining that the sound expression of a release portion of the 2nd sound signal X2 is to be added to the stationary period Q1 of the 1st sound signal X1 (S21: YES), the release processing unit 32 selects, from among the plural stationary periods Q2 of the 2nd sound signal X2, the stationary period Q2 corresponding to the sound expression to be added to the stationary period Q1 (S22). Specifically, the release processing unit 32 selects a stationary period Q2 whose situation within the musical piece is similar to that of the stationary period Q1 being processed. Examples of the situation (context) considered for a given stationary period (hereinafter the "stationary period of interest") are: the time length of the stationary period of interest, the time length of the stationary period immediately after it, the pitch difference between the stationary period of interest and the one immediately after it, the pitch of the stationary period of interest, and the time length of the silent period immediately before it. The release processing unit 32 selects the stationary period Q2 whose difference from the stationary period Q1 with respect to the situations illustrated above is smallest.
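Step S22 can be pictured as a nearest-neighbour search over the context features listed above. In the following hypothetical Python sketch, each stationary period is described by a context feature vector, and the candidate with the smallest weighted difference is selected:

import numpy as np

def select_reference_period(context_q1, context_q2_list, weights=None):
    """Select the stationary period Q2 whose musical context is most
    similar to that of the target stationary period Q1 (step S22).

    context_q1:      context feature vector of Q1, e.g. (duration,
                     duration of the next stationary period, pitch
                     difference to the next period, pitch, duration of
                     the preceding silent period)
    context_q2_list: one context vector per candidate period Q2
    Returns the index of the selected candidate.
    """
    target = np.asarray(context_q1, dtype=float)
    candidates = np.asarray(context_q2_list, dtype=float)
    w = np.ones_like(target) if weights is None else np.asarray(weights, dtype=float)
    # Weighted absolute difference between the contexts of Q1 and each Q2.
    distances = np.sum(w * np.abs(candidates - target), axis=1)
    return int(np.argmin(distances))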

The release processing unit 32 then performs processing (S23 to S26) for adding the sound expression corresponding to the stationary period Q2 selected in the above manner to the 1st sound signal X1 (analysis data D1). Fig. 8 is an explanatory diagram of the processing by which the release processing unit 32 adds the sound expression of a release portion to the 1st sound signal X1.

In fig. 8, the waveforms and the temporal changes of the fundamental frequency of the 1st sound signal X1, the 2nd sound signal X2, and the deformed 3rd sound signal Y are shown together on the time axis. In fig. 8, the start point time T1_S and the end point time T1_E of the stationary period Q1 of the singing voice, the end point time τ1_R of the voiced period Vr immediately after the stationary period Q1, the start point time τ1_A of the voiced period Va corresponding to the note immediately after the stationary period Q1, the start point time T2_S and the end point time T2_E of the stationary period Q2 of the reference voice, and the end point time τ2_R of the voiced period Vr immediately after the stationary period Q2 are known information.

The release processing unit 32 adjusts the positional relationship on the time axis between the stationary period Q1 being processed and the stationary period Q2 selected in step S22 (S23). Specifically, the release processing unit 32 adjusts the position of the stationary period Q2 on the time axis with reference to an end point (T1_S or T1_E) of the stationary period Q1. As illustrated in fig. 8, the release processing unit 32 of the present embodiment determines the position of the 2nd sound signal X2 (stationary period Q2) on the time axis relative to the 1st sound signal X1 such that the end point time T2_E of the stationary period Q2 coincides with the end point time T1_E of the stationary period Q1 on the time axis.

< Extension of the processing period Z1_R (S24) >

The release processing unit 32 expands or contracts, on the time axis, the period (hereinafter the "processing period") Z1_R of the 1st sound signal X1 to which the sound expression of the 2nd sound signal X2 is added (S24). As illustrated in fig. 8, the processing period Z1_R is the period from the time Tm_R at which the addition of the sound expression starts (hereinafter the "synthesis start time") to the end point time τ1_R of the voiced period Vr immediately after the stationary period Q1. The synthesis start time Tm_R is the later of the start point time T1_S of the stationary period Q1 of the singing voice and the start point time T2_S of the stationary period Q2 of the reference voice. In the case illustrated in fig. 8, where the start point time T2_S of the stationary period Q2 lies after the start point time T1_S of the stationary period Q1, the start point time T2_S of the stationary period Q2 is set as the synthesis start time Tm_R. However, the synthesis start time Tm_R is not limited to the start point time T2_S.

As illustrated in fig. 8, the release processing unit 32 of the present embodiment extends the processing period Z1_R of the 1st sound signal X1 in accordance with the time length of the expression period Z2_R of the 2nd sound signal X2. The expression period Z2_R is the period of the 2nd sound signal X2 that exhibits the sound expression of the release portion and that is used for adding the sound expression to the 1st sound signal X1. As illustrated in fig. 8, the expression period Z2_R is the period from the synthesis start time Tm_R to the end point time τ2_R of the voiced period Vr immediately after the stationary period Q2.

Whereas a reference voice sung by a skilled vocalist such as a professional singer tends to carry a sufficient sound expression over an adequate length of time, the sound expression in a singing voice sung by a user unaccustomed to singing tends to be temporally insufficient. Reflecting this tendency, the expression period Z2_R of the reference voice is longer than the processing period Z1_R of the singing voice, as illustrated in fig. 8. Therefore, the release processing unit 32 of the present embodiment extends the processing period Z1_R of the 1st sound signal X1 to the time length of the expression period Z2_R of the 2nd sound signal X2.

The extension of the processing period Z1_R is realized by a mapping that associates each time t of the deformed 3rd sound signal Y (deformed sound) with a time t1 of the 1st sound signal X1 (singing voice). Fig. 8 illustrates the correspondence between the time t1 of the singing voice (vertical axis) and the time t of the deformed sound (horizontal axis).

The time t1 in the correspondence of fig. 8 is the time of the 1st sound signal X1 corresponding to the time t of the deformed sound. The reference line L indicated by the dash-dotted line in fig. 8 represents the case in which the 1st sound signal X1 is neither expanded nor contracted (t1 = t). A section in which the slope of the time t1 of the singing voice with respect to the time t of the deformed sound is smaller than that of the reference line L is a section in which the 1st sound signal X1 is stretched. A section in which the slope of t1 with respect to t is larger than that of the reference line L is a section in which the singing voice is contracted.

The correspondence between the time t1 and the time t is expressed by the nonlinear functions of equations (1a) to (1c) below.

[Formula 1]

t1 = t (t ≤ T_R) ...(1a)

t1 = T_R + (τ1_R - T_R)·η((t - T_R)/(τ2_R - T_R)) (T_R < t ≤ τ2_R) ...(1b)

t1 = τ1_R + (τ1_A - τ1_R)·(t - τ2_R)/(τ1_A - τ2_R) (τ2_R < t ≤ τ1_A) ...(1c)

As illustrated in fig. 8, the time T_R is a predetermined time between the synthesis start time Tm_R and the end point time τ1_R of the processing period Z1_R. For example, the later of the midpoint ((T1_S + T1_E)/2) between the start point time T1_S and the end point time T1_E of the stationary period Q1 and the synthesis start time Tm_R is set as the time T_R. As understood from equation (1a), the portion of the processing period Z1_R before the time T_R is neither expanded nor contracted. That is, the extension of the processing period Z1_R starts from the time T_R.

As understood from equation (1b), the portion of the processing period Z1_R after the time T_R is stretched strongly near the time T_R and is stretched on the time axis such that the degree of extension decreases as the end point time τ1_R is approached. The function η of equation (1b) is a nonlinear function for extending the processing period Z1_R more strongly toward the front on the time axis and reducing the degree of extension toward the rear. Specifically, for example, a quadratic function (η(t) = t²) is applied as the function η. As described above, in the present embodiment the processing period Z1_R is extended on the time axis such that the degree of extension decreases closer to the end point time τ1_R of the processing period Z1_R; therefore, the acoustic characteristics near the end point time τ1_R of the singing voice are well preserved in the deformed sound. Near the time T_R, audible discomfort caused by the extension tends to be less noticeable than near the end point time τ1_R, so even a large degree of extension near the time T_R, as in the above example, hardly degrades the acoustic naturalness of the deformed sound. In addition, as understood from equation (1c), the period from the end point time τ2_R of the expression period Z2_R to the start point time τ1_A of the next voiced period Va of the 1st sound signal X1 is shortened on the time axis. Since the 1st sound signal X1 contains no voice between the end point time τ1_R and the start point time τ1_A, this span can be shortened by partial deletion without problem.

As described above, the processing period Z1_R of the singing voice is extended to the time length of the expression period Z2_R of the reference voice. The expression period Z2_R of the reference voice, on the other hand, is not expanded or contracted on the time axis. That is, the time t2 of the adjusted 2nd sound signal X2 corresponding to the time t of the deformed sound coincides with the time t (t2 = t). As illustrated above, in the present embodiment the processing period Z1_R of the singing voice is extended in accordance with the time length of the expression period Z2_R, so there is no need to expand or contract the 2nd sound signal X2. Therefore, the sound expression of the release portion represented by the 2nd sound signal X2 can be added to the 1st sound signal X1 with high fidelity.
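Assuming the reconstruction of equations (1a) to (1c) given above (the exact parameterization of η is not preserved in the source, so this is a sketch rather than the definitive mapping), the correspondence from the time t of the deformed sound to the time t1 of the singing voice might be coded as follows:

def warp_time(t, T_R, tau1_R, tau2_R, tau1_A):
    """Map a time t of the deformed sound to the corresponding time t1 of
    the 1st sound signal X1, following the reconstruction of equations
    (1a) to (1c) above with eta(x) = x**2.
    """
    if t <= T_R:
        # (1a): no expansion or contraction before T_R.
        return t
    if t <= tau2_R:
        # (1b): nonlinear extension; the quadratic eta stretches the
        # signal strongly near T_R and ever less toward tau1_R.
        x = (t - T_R) / (tau2_R - T_R)
        return T_R + (tau1_R - T_R) * x ** 2
    # (1c): linear contraction of the silent span from tau1_R to tau1_A.
    x = (t - tau2_R) / (tau1_A - tau2_R)
    return tau1_R + (tau1_A - tau1_R) * x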

After extending the processing period Z1_R in the manner illustrated above, the release processing unit 32 deforms the extended processing period Z1_R of the 1st sound signal X1 in accordance with the expression period Z2_R of the 2nd sound signal X2 (S25 to S26). Specifically, synthesis of the fundamental frequency (S25) and synthesis of the spectral envelope outline shape (S26) are performed between the extended processing period Z1_R of the singing voice and the expression period Z2_R of the reference voice.

< Synthesis of the fundamental frequency (S25) >

The release processing unit 32 calculates the fundamental frequency F(t) at each time t of the 3rd sound signal Y by the calculation of equation (2) below.

[Formula 2]

F(t) = f1(t1) - λ1(f1(t1) - F1(t1)) + λ2(f2(t2) - F2(t2)) ...(2)

The smoothed fundamental frequency F1(t1) in equation (2) is obtained by smoothing the time series of the fundamental frequency f1(t1) of the 1st sound signal X1 on the time axis. Similarly, the smoothed fundamental frequency F2(t2) in equation (2) is obtained by smoothing the time series of the fundamental frequency f2(t2) of the 2nd sound signal X2 on the time axis. The coefficients λ1 and λ2 of equation (2) are set to non-negative values of 1 or less (0 ≤ λ1 ≤ 1, 0 ≤ λ2 ≤ 1).

The 2nd term of equation (2) subtracts, from the fundamental frequency f1(t1) of the 1st sound signal X1, the difference between the fundamental frequency f1(t1) and the smoothed fundamental frequency F1(t1) of the singing voice to a degree corresponding to the coefficient λ1. The 3rd term of equation (2) adds, to the fundamental frequency f1(t1) of the 1st sound signal X1, the difference between the fundamental frequency f2(t2) and the smoothed fundamental frequency F2(t2) of the reference voice to a degree corresponding to the coefficient λ2. As understood from the above, the release processing unit 32 functions as an element that replaces the difference between the fundamental frequency f1(t1) and the smoothed fundamental frequency F1(t1) of the singing voice with the difference between the fundamental frequency f2(t2) and the smoothed fundamental frequency F2(t2) of the reference voice. That is, the temporal change of the fundamental frequency f1(t1) in the extended processing period Z1_R of the 1st sound signal X1 is brought close to the temporal change of the fundamental frequency f2(t2) in the expression period Z2_R of the 2nd sound signal X2.
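As a minimal sketch of equation (2) (the patent does not specify the smoothing method, so a moving average is assumed here), the fundamental frequency F(t) could be computed per frame as follows:

import numpy as np

def synthesize_f0(f1, f2, lam1=1.0, lam2=1.0, win=15):
    """Per-frame evaluation of equation (2):
    F(t) = f1 - lam1*(f1 - F1) + lam2*(f2 - F2).

    f1, f2: fundamental-frequency sequences of the extended processing
            period Z1_R and the expression period Z2_R, already aligned
            frame by frame via the time warping above.
    """
    kernel = np.ones(win) / win
    F1 = np.convolve(f1, kernel, mode='same')  # smoothed fundamental frequency F1
    F2 = np.convolve(f2, kernel, mode='same')  # smoothed fundamental frequency F2
    return f1 - lam1 * (f1 - F1) + lam2 * (f2 - F2)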

< Synthesis of the spectral envelope outline shape (S26) >

The release processing unit 32 synthesizes a spectral envelope outline shape between the extended processing period Z1_R of the singing voice and the expression period Z2_R of the reference voice. As illustrated in fig. 9, the spectral envelope outline shape G1 of the 1st sound signal X1 is an intensity distribution obtained by further smoothing, in the frequency domain, the spectral envelope that is the outline of the spectrum of the 1st sound signal X1. Specifically, the spectral envelope outline shape G1 is the intensity distribution obtained by smoothing the spectral envelope to such an extent that phonemic characteristics (differences between phonemes) and individual characteristics (differences between speakers) are no longer noticeable. For example, the spectral envelope outline shape G1 is expressed by a predetermined number of lower-order coefficients among the plurality of coefficients of the mel cepstrum representing the spectral envelope. Although the above description focuses on the spectral envelope outline shape G1 of the 1st sound signal X1, the same applies to the spectral envelope outline shape G2 of the 2nd sound signal X2.
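One way to realize the "predetermined number of lower-order coefficients" is simply to truncate the mel cepstrum. The following sketch (the cutoff order is an assumption) derives an outline shape in that manner:

import numpy as np

def envelope_outline(mcep, order=8):
    """Derive a spectral envelope outline shape from the mel cepstrum of
    a spectral envelope by keeping only the low-order coefficients,
    which smooths away phonemic and speaker-specific detail.

    mcep: (K,) mel-cepstrum coefficients of the spectral envelope
    """
    mcep = np.asarray(mcep, dtype=float)
    outline = np.zeros_like(mcep)
    outline[:order] = mcep[:order]  # retain only the low-order terms
    return outline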

The release processing unit 32 calculates the spectral envelope outline shape (hereinafter the "synthesized spectral envelope outline shape") G(t) at each time t of the 3rd sound signal Y by the operation of equation (3) below.

[Formula 3]

G(t) = G1(t1) - μ1(G1(t1) - G1_ref) + μ2(G2(t2) - G2_ref) ...(3)

The symbol G1_ref in equation (3) denotes a reference spectral envelope outline shape. Among the spectral envelope outline shapes G1 of the 1st sound signal X1, the one at a specific time is used as the reference spectral envelope outline shape G1_ref (an example of the 1st reference spectral envelope outline shape). Specifically, the reference spectral envelope outline shape G1_ref is the spectral envelope outline shape G1(Tm_R) at the synthesis start time Tm_R (an example of the 1st time) of the 1st sound signal X1. That is, the time at which the reference spectral envelope outline shape G1_ref is extracted is the later of the start point time T1_S of the stationary period Q1 and the start point time T2_S of the stationary period Q2. The time at which the reference spectral envelope outline shape G1_ref is extracted is not limited to the synthesis start time Tm_R; for example, the spectral envelope outline shape G1 at an arbitrary time within the stationary period Q1 may be used as the reference spectral envelope outline shape G1_ref.

Similarly, the reference spectral envelope outline shape G2_ref in equation (3) is the spectral envelope outline shape G2 of the 2nd sound signal X2 at a specific time. Specifically, the reference spectral envelope outline shape G2_ref is the spectral envelope outline shape G2(Tm_R) at the synthesis start time Tm_R (an example of the 2nd time) of the 2nd sound signal X2. That is, the time at which the reference spectral envelope outline shape G2_ref is extracted is the later of the start point time T1_S of the stationary period Q1 and the start point time T2_S of the stationary period Q2. The time at which the reference spectral envelope outline shape G2_ref is extracted is not limited to the synthesis start time Tm_R; for example, the spectral envelope outline shape G2 at an arbitrary time within the stationary period Q2 may be used as the reference spectral envelope outline shape G2_ref.

The coefficients μ1 and μ2 of equation (3) are set to non-negative values of 1 or less (0 ≤ μ1 ≤ 1, 0 ≤ μ2 ≤ 1). The 2nd term of equation (3) subtracts, from the spectral envelope outline shape G1(t1) of the 1st sound signal X1, the difference between the spectral envelope outline shape G1(t1) of the singing voice and the reference spectral envelope outline shape G1_ref to a degree corresponding to the coefficient μ1 (an example of the 1st coefficient). The 3rd term of equation (3) adds, to the spectral envelope outline shape G1(t1) of the 1st sound signal X1, the difference between the spectral envelope outline shape G2(t2) of the reference voice and the reference spectral envelope outline shape G2_ref to a degree corresponding to the coefficient μ2 (an example of the 2nd coefficient). As understood from the above, the release processing unit 32 calculates the synthesized spectral envelope outline shape G(t) of the 3rd sound signal Y by deforming the spectral envelope outline shape G1(t1) in accordance with the difference between the spectral envelope outline shape G1(t1) of the singing voice and the reference spectral envelope outline shape G1_ref (an example of the 1st difference) and the difference between the spectral envelope outline shape G2(t2) of the reference voice and the reference spectral envelope outline shape G2_ref (an example of the 2nd difference). Specifically, the release processing unit 32 functions as an element that replaces the 1st difference with the 2nd difference. Step S26 described above is an example of the "1st process".
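A direct per-frame transcription of equation (3), treating the outline shapes as coefficient vectors and assuming the frame alignment established by the time warping above, could look like this:

import numpy as np

def synthesize_outline(G1_t1, G2_t2, G1_ref, G2_ref, mu1=1.0, mu2=1.0):
    """Per-frame evaluation of equation (3):
    G(t) = G1 - mu1*(G1 - G1_ref) + mu2*(G2 - G2_ref).

    G1_t1:  outline shape of the singing voice at the warped time t1
    G2_t2:  outline shape of the reference voice at time t2 (= t)
    G1_ref: reference outline shape at the 1st time (e.g., Tm_R)
    G2_ref: reference outline shape at the 2nd time
    All arguments are coefficient vectors of equal length.
    """
    G1_t1 = np.asarray(G1_t1, dtype=float)
    G2_t2 = np.asarray(G2_t2, dtype=float)
    return (G1_t1 - mu1 * (G1_t1 - np.asarray(G1_ref))
            + mu2 * (G2_t2 - np.asarray(G2_ref)))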

< Attack processing S1 >

Fig. 10 is a flowchart illustrating the specific contents of the attack processing S1 executed by the attack processing unit 31. The attack processing S1 of fig. 10 is performed for each stationary period Q1 of the 1st sound signal X1. The overall procedure of the attack processing S1 parallels that of the release processing S2.

When the attack processing S1 starts, the attack processing unit 31 determines whether the sound expression of an attack portion of the 2nd sound signal X2 should be added to the stationary period Q1 being processed in the 1st sound signal X1 (S11). Specifically, the attack processing unit 31 determines that no attack-portion sound expression is added to a stationary period Q1 that satisfies any of the conditions Ca1 to Ca5 below. However, the conditions for determining whether to add a sound expression to a stationary period Q1 of the 1st sound signal X1 are not limited to the following examples.

[Condition Ca1] The time length of the stationary period Q1 is below a prescribed value.

[Condition Ca2] The fluctuation width of the smoothed fundamental frequency F1 within the stationary period Q1 exceeds a prescribed value.

[Condition Ca3] The fluctuation width of the smoothed fundamental frequency F1 exceeds a prescribed value within a period of predetermined length that includes the start point of the stationary period Q1.

[Condition Ca4] The time length of the voiced period Va immediately before the stationary period Q1 exceeds a prescribed value.

[Condition Ca5] The fluctuation width of the fundamental frequency f1 in the voiced period Va immediately before the stationary period Q1 exceeds a prescribed value.

Condition Ca1, like condition Cr1 above, reflects the difficulty of adding a sound expression with natural sound quality to a stationary period Q1 of sufficiently short time length. When the fundamental frequency f1 fluctuates greatly within the stationary period Q1, a sufficient sound expression is likely already present in the singing voice; therefore, a stationary period Q1 in which the fluctuation width of the smoothed fundamental frequency F1 exceeds the prescribed value is excluded from the targets of sound-expression addition (condition Ca2). Condition Ca3 is similar to condition Ca2, but focuses in particular on the portion of the stationary period Q1 near the attack. Likewise, when the voiced period Va immediately before the stationary period Q1 is sufficiently long, or when the fundamental frequency f1 fluctuates greatly within the voiced period Va, a sufficient sound expression is likely already present in the singing voice. Therefore, a stationary period Q1 whose immediately preceding voiced period Va exceeds the prescribed time length (condition Ca4), and a stationary period Q1 in which the fluctuation width of the fundamental frequency f1 in the voiced period Va exceeds the prescribed value (condition Ca5), are excluded from the targets of sound-expression addition. When determining that no sound expression is to be added to the stationary period Q1 (S11: NO), the attack processing unit 31 ends the attack processing S1 without performing the processing described below (S12 to S16).

When determining that a sound expression of the attack portion of the 2nd sound signal X2 is to be added to the stationary period Q1 of the 1st sound signal X1 (S11: YES), the attack processing unit 31 selects, from the plurality of stationary periods Q2 of the 2nd sound signal X2, the stationary period Q2 corresponding to the sound expression to be added to the stationary period Q1 (S12). The method by which the attack processing unit 31 selects the stationary period Q2 is the same as the method by which the release processing unit 32 selects the stationary period Q2.

The attack processing unit 31 executes processing for adding the sound expression corresponding to the stationary period Q2 selected in the above procedure to the 1st sound signal X1 (S13 to S16). Fig. 11 is an explanatory diagram of the processing by which the attack processing unit 31 adds the sound expression of the attack portion to the 1st sound signal X1.

The attack processing unit 31 adjusts the positional relationship on the time axis between the stationary period Q1 to be processed and the stationary period Q2 selected in step S12 (S13). Specifically, as illustrated in Fig. 11, the attack processing unit 31 determines the position of the 2nd sound signal X2 (stationary period Q2) on the time axis relative to the 1st sound signal X1 such that the start time T2_S of the stationary period Q2 coincides with the start time T1_S of the stationary period Q1 on the time axis.
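As a concrete illustration of step S13, the following sketch computes the shift applied to the 2nd sound signal; the function name and the joint treatment of the attack and release cases are assumptions for illustration:

```python
def position_reference_signal(T1_S, T1_E, T2_S, T2_E, attack=True):
    """Shift (in seconds) applied to the 2nd sound signal X2 so that, on the
    shared time axis, the start of Q2 lands on the start of Q1 (attack
    processing S1) or the end of Q2 lands on the end of Q1 (release
    processing S2)."""
    return (T1_S - T2_S) if attack else (T1_E - T2_E)

# A time t2 in X2 then corresponds to time t2 + shift on the axis of X1.
```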

< Extension of the processing period Z1_a >

The attack processing unit 31 extends, on the time axis, the processing period Z1_a of the 1st sound signal X1 to which the sound expression of the 2nd sound signal X2 is added (S14). The processing period Z1_a is the period from the start time τ1_a of the voiced period Va immediately before the stationary period Q1 to the time Tm_a at which the addition of the sound expression ends (hereinafter referred to as the "synthesis end time"). The synthesis end time Tm_a is, for example, the start time T1_S of the stationary period Q1 (equal to the start time T2_S of the stationary period Q2 after the adjustment of step S13). That is, in the attack processing S1, the voiced period Va before the stationary period Q1 is extended as the processing period Z1_a. As described above, the stationary period Q1 is a period corresponding to a note of the music piece. Because the voiced period Va is extended while the stationary period Q1 is not, a change of the start time T1_S of the stationary period Q1 is suppressed. That is, the possibility that the onset of a note in the singing voice shifts forward or backward is reduced.

As illustrated in Fig. 11, the attack processing unit 31 of the present embodiment extends the processing period Z1_a of the 1st sound signal X1 in accordance with the time length of the expression period Z2_a in the 2nd sound signal X2. The expression period Z2_a is the period of the 2nd sound signal X2 in which the sound expression of the attack portion appears, and is the period used for adding the sound expression to the 1st sound signal X1. As illustrated in Fig. 11, the expression period Z2_a is the voiced period Va immediately before the stationary period Q2.

Specifically, the attack processing unit 31 extends the processing period Z1_a of the 1st sound signal X1 to the time length of the expression period Z2_a of the 2nd sound signal X2. Fig. 11 illustrates the correspondence between the time t1 of the singing voice (vertical axis) and the time t of the distorted sound (horizontal axis).

As illustrated in Fig. 11, in the present embodiment the processing period Z1_a is extended on the time axis such that the degree of extension becomes smaller as the position approaches the start time τ1_a of the processing period Z1_a. Therefore, the acoustic characteristics in the vicinity of the start time τ1_a of the singing voice are sufficiently maintained in the distorted sound. On the other hand, the expression period Z2_a of the reference voice is not expanded or contracted on the time axis. Therefore, the sound expression of the attack portion represented by the 2nd sound signal X2 can be accurately added to the 1st sound signal X1.
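The nonuniform extension of step S14 can be illustrated as a monotone time warp whose local stretch is 1 at the start time τ1_a and grows toward the synthesis end time Tm_a. The quadratic curve below is a hypothetical choice under the stated assumption; the embodiment does not prescribe a specific warping function:

```python
import numpy as np

def warp_z1a(t, tau1_a, Tm_a, L_new):
    """Map a time t on the distorted sound's axis (horizontal axis of Fig. 11)
    to a read position t1 inside the original processing period Z1_a.

    The period [tau1_a, Tm_a] of length L_old is extended to L_new (the time
    length of the expression period Z2_a). The local stretch equals 1 at
    tau1_a and increases toward Tm_a, so the vicinity of the start point of
    the singing voice is left almost unscaled. This quadratic warp stays
    monotone while L_old < L_new < 2 * L_old.
    """
    L_old = Tm_a - tau1_a
    s = np.clip((t - tau1_a) / L_new, 0.0, 1.0)  # normalized output position
    a = L_new / L_old                            # slope at s = 0 -> unit stretch there
    w = a * s + (1.0 - a) * s ** 2               # w(0) = 0, w(1) = 1
    return tau1_a + L_old * w
```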

Having extended the processing period Z1_a in the procedure illustrated above, the attack processing unit 31 deforms the extended processing period Z1_a of the 1st sound signal X1 in accordance with the expression period Z2_a of the 2nd sound signal X2 (S15 to S16). Specifically, between the extended processing period Z1_a of the singing voice and the expression period Z2_a of the reference voice, synthesis of the fundamental frequency (S15) and synthesis of the spectral envelope outline shape (S16) are performed.

Specifically, the attack processing unit 31 calculates the fundamental frequency f(t) of the 3rd sound signal Y from the fundamental frequency f1(t1) of the 1st sound signal X1 and the fundamental frequency f2(t2) of the 2nd sound signal X2 by the same calculation as expression (2) above (S15). That is, the attack processing unit 31 calculates the fundamental frequency f(t) of the 3rd sound signal Y by subtracting the difference between the fundamental frequency f1(t1) and the smoothed fundamental frequency F1(t1) from the fundamental frequency f1(t1) of the 1st sound signal X1 to a degree corresponding to the coefficient λ1, and adding the difference between the fundamental frequency f2(t2) and the smoothed fundamental frequency F2(t2) to a degree corresponding to the coefficient λ2. As a result, the temporal change of the fundamental frequency f1(t1) in the extended processing period Z1_a of the 1st sound signal X1 approaches the temporal change of the fundamental frequency f2(t2) in the expression period Z2_a of the 2nd sound signal X2.
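A minimal sketch of the calculation of step S15 (expression (2)), assuming the four frequency tracks have already been sampled on a common frame grid through the alignment and warp above; the names are illustrative:

```python
import numpy as np

def synth_f0(f1, F1, f2, F2, lam1=1.0, lam2=1.0):
    """f(t) = f1(t1) - lam1 * (f1(t1) - F1(t1)) + lam2 * (f2(t2) - F2(t2)).

    f1, f2     -- fundamental frequencies of X1 and X2 per frame (Hz)
    F1, F2     -- their smoothed counterparts (Hz)
    lam1, lam2 -- coefficients in [0, 1]
    """
    f1, F1, f2, F2 = map(np.asarray, (f1, F1, f2, F2))
    return f1 - lam1 * (f1 - F1) + lam2 * (f2 - F2)
```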

The attack processing unit 31 synthesizes the spectral envelope outline shapes between the extended processing period Z1_a of the singing voice and the expression period Z2_a of the reference voice (S16). Specifically, the attack processing unit 31 calculates the synthesized spectral envelope outline shape G(t) of the 3rd sound signal Y from the spectral envelope outline shape G1(t1) of the 1st sound signal X1 and the spectral envelope outline shape G2(t2) of the 2nd sound signal X2 by the same operation as expression (3) above. Step S16 described above is an example of the "1st process".

The reference spectral envelope outline shape G1_ref applied to expression (3) in the attack processing S1 is the spectral envelope outline shape G1(Tm_a) at the synthesis end time Tm_a (an example of the 1st time) in the 1st sound signal X1. That is, the time at which the reference spectral envelope outline shape G1_ref is extracted is the start time T1_S of the stationary period Q1.

Similarly, the reference spectral envelope outline shape G2_ref applied to expression (3) in the attack processing S1 is the spectral envelope outline shape G2(Tm_a) at the synthesis end time Tm_a (an example of the 2nd time) in the 2nd sound signal X2. That is, the time at which the reference spectral envelope outline shape G2_ref is extracted is the start time T2_S of the stationary period Q2, which coincides with the start time T1_S of the stationary period Q1 after the adjustment of step S13.

As understood from the above description, the attack processing unit 31 and the release processing unit 32 of the present embodiment each deform the 1st sound signal X1 (analysis data D1) in accordance with the 2nd sound signal X2 (analysis data D2) positioned on the time axis with reference to an end point (the start time T1_S or the end time T1_E) of the stationary period Q1. The time series of the fundamental frequency f(t) and the time series of the synthesized spectral envelope outline shape G(t) of the 3rd sound signal Y representing the distorted sound are generated by the attack processing S1 and the release processing S2 described above. The voice synthesis unit 33 of Fig. 2 generates the 3rd sound signal Y from the time series of the fundamental frequency f(t) and the time series of the synthesized spectral envelope outline shape G(t). The processing by which the voice synthesis unit 33 generates the 3rd sound signal Y is an example of the "2nd process".

The voice synthesis unit 33 of Fig. 2 synthesizes the 3rd sound signal Y of the distorted sound using the results of the attack processing S1 and the release processing S2 (that is, the analysis data after deformation). Specifically, the voice synthesis unit 33 adjusts each frequency spectrum g1 calculated from the 1st sound signal X1 so as to follow the synthesized spectral envelope outline shape G(t), and also shifts the fundamental frequency f1 of the 1st sound signal X1 to the fundamental frequency f(t). The adjustment of the frequency spectrum g1 and the fundamental frequency f1 is performed in the frequency domain, for example. The voice synthesis unit 33 synthesizes the 3rd sound signal Y by converting the adjusted spectra illustrated above into the time domain.
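The frequency-domain adjustment of the 2nd process can be pictured as a per-frame gain that moves each spectrum from the original outline toward the synthesized outline. The sketch below covers only the envelope correction and assumes both outlines have already been evaluated in dB on the STFT bins, leaving the shift of the fundamental frequency f1 to f(t) aside for brevity:

```python
import numpy as np

def apply_envelope_outline(frames, G1_db, G_db):
    """Adjust each frequency spectrum g1 of X1 to follow G(t) (a sketch).

    frames -- complex STFT frames of the 1st sound signal X1,
              shape (n_frames, n_bins)
    G1_db  -- original outline G1(t1) per frame, in dB on the same bins
    G_db   -- synthesized outline G(t) per frame, in dB on the same bins
    """
    gain = 10.0 ** ((G_db - G1_db) / 20.0)  # dB difference -> linear gain
    return frames * gain                    # then inverse-STFT to the time domain
```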

As described above, in the present embodiment, the difference (G1(t1) − G1_ref) between the spectral envelope outline shape G1(t1) of the 1st sound signal X1 and the reference spectral envelope outline shape G1_ref and the difference (G2(t2) − G2_ref) between the spectral envelope outline shape G2(t2) of the 2nd sound signal X2 and the reference spectral envelope outline shape G2_ref are synthesized into the spectral envelope outline shape G1(t1) of the 1st sound signal X1. Therefore, an acoustically natural distorted sound can be generated whose acoustic characteristics are continuous at the boundaries between the period of the 1st sound signal X1 deformed by the 2nd sound signal X2 (the processing period Z1_a or Z1_R) and the periods before and after it.

In the present embodiment, the stationary period Q1 in which the fundamental frequency f1 and the spectral shape of the 1st sound signal X1 are temporally stable is determined, and the 1st sound signal X1 is deformed by the 2nd sound signal X2 arranged with reference to an end point (the start time T1_S or the end time T1_E) of the stationary period Q1. Therefore, the 1st sound signal X1 is deformed over an appropriate period in accordance with the 2nd sound signal X2, and an acoustically natural distorted sound can be generated.

In the present embodiment, the processing period (Z1_a or Z1_R) of the 1st sound signal X1 is extended in accordance with the time length of the expression period (Z2_a or Z2_R) of the 2nd sound signal X2, so expansion or contraction of the 2nd sound signal X2 is not required. Therefore, the acoustic characteristics (for example, the sound expression) of the reference voice are accurately added to the 1st sound signal X1, and an acoustically natural distorted sound can be generated.

< Modification examples >

Specific modifications to the embodiment described above are illustrated below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate within a range in which they do not contradict each other.

(1) In the foregoing embodiment, the stationary period Q1 of the 1st sound signal X1 is specified from the variation index Δ calculated from the 1st index and the 2nd index, but the method of specifying the stationary period Q1 in accordance with the 1st index and the 2nd index is not limited to this example. For example, the signal analysis unit 21 may specify a 1st provisional period corresponding to the 1st index and a 2nd provisional period corresponding to the 2nd index. The 1st provisional period is, for example, a voiced period in which the 1st index is below a threshold value; that is, a period in which the fundamental frequency f1 is temporally stable is determined as the 1st provisional period. The 2nd provisional period is, for example, a voiced period in which the 2nd index is below a threshold value; that is, a period in which the spectral shape is temporally stable is determined as the 2nd provisional period. The signal analysis unit 21 determines a period in which the 1st provisional period and the 2nd provisional period overlap as the stationary period Q1. That is, a period in which both the fundamental frequency f1 and the spectral shape of the 1st sound signal X1 are temporally stable is determined as the stationary period Q1. As understood from the above description, the calculation of the variation index Δ may be omitted when determining the stationary period Q1. Although the above description focuses on the determination of the stationary period Q1, the same applies to the determination of the stationary period Q2 of the 2nd sound signal X2.
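A frame-wise sketch of this modification, assuming per-frame index arrays and boolean voicing decisions (thresholds and names are illustrative):

```python
import numpy as np

def stationary_mask(idx1, idx2, voiced, th1, th2):
    """Modification (1): frames belonging to the stationary period Q1.

    idx1   -- 1st index per frame (degree of change of the fundamental frequency)
    idx2   -- 2nd index per frame (degree of change of the spectral shape)
    voiced -- boolean voicing decision per frame
    """
    prov1 = voiced & (np.asarray(idx1) < th1)  # 1st provisional period: stable f1
    prov2 = voiced & (np.asarray(idx2) < th2)  # 2nd provisional period: stable spectrum
    return prov1 & prov2                       # overlap of the two provisional periods
```

Runs of consecutive True frames in the returned mask correspond to stationary periods Q1; the same routine applies to the 2nd sound signal X2.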

(2) In the embodiment described above, a period in which both the fundamental frequency f1 and the spectral shape of the 1st sound signal X1 are temporally stable is determined as the stationary period Q1, but a period in which only one of the fundamental frequency f1 and the spectral shape is temporally stable may instead be determined as the stationary period Q1. Similarly, a period in which one of the fundamental frequency f2 and the spectral shape of the 2nd sound signal X2 is temporally stable may be determined as the stationary period Q2.

(3) In the embodiment described above, the spectral envelope outline shape G1 at the synthesis start time Tm_R or the synthesis end time Tm_a of the 1st sound signal X1 is used as the reference spectral envelope outline shape G1_ref, but the time (the 1st time) at which the reference spectral envelope outline shape G1_ref is extracted is not limited to this example. For example, the spectral envelope outline shape G1 at an end point (the start time T1_S or the end time T1_E) of the stationary period Q1 may be used as the reference spectral envelope outline shape G1_ref. However, the 1st time at which the reference spectral envelope outline shape G1_ref is extracted is preferably a time within the stationary period Q1, in which the spectral shape of the 1st sound signal X1 is stable.

The same applies to the reference spectral envelope outline shape G2_ref. That is, in the embodiment described above, the spectral envelope outline shape G2 at the synthesis start time Tm_R or the synthesis end time Tm_a of the 2nd sound signal X2 is used as the reference spectral envelope outline shape G2_ref, but the time (the 2nd time) at which the reference spectral envelope outline shape G2_ref is extracted is not limited to this example. For example, the spectral envelope outline shape G2 at an end point (the start time T2_S or the end time T2_E) of the stationary period Q2 may be used as the reference spectral envelope outline shape G2_ref. However, the 2nd time at which the reference spectral envelope outline shape G2_ref is extracted is preferably a time within the stationary period Q2, in which the spectral shape of the 2nd sound signal X2 is stable.

The 1st time at which the reference spectral envelope outline shape G1_ref is extracted from the 1st sound signal X1 and the 2nd time at which the reference spectral envelope outline shape G2_ref is extracted from the 2nd sound signal X2 may be different times on the time axis.

(4) In the embodiment described above, the 1st sound signal X1 representing the singing voice of the user of the sound processing device 100 is processed, but the voice represented by the 1st sound signal X1 is not limited to the user's singing voice. For example, a 1st sound signal X1 synthesized by a known speech synthesis technique such as unit concatenation or statistical modeling may be processed, and a 1st sound signal X1 read from a recording medium such as an optical disc may also be processed. Likewise, the 2nd sound signal X2 may be obtained by any method.

The sounds represented by the 1st sound signal X1 and the 2nd sound signal X2 are not limited to voice in the narrow sense (that is, speech uttered by a human). For example, the present invention can also be applied to a case in which various sound expressions (for example, performance expressions) are added to a 1st sound signal X1 representing the performance sound of a musical instrument. For example, a performance expression such as vibrato is added, using the 2nd sound signal X2, to a 1st sound signal X1 representing a monotonous performance sound to which no performance expression has been added.

(5) As described above, the functions of the sound processing device 100 according to the embodiment are realized by 1 or more processors executing instructions (a program) stored in a memory. The program may be provided in a form stored in a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium, a preferred example of which is an optical recording medium (optical disc) such as a CD-ROM, but it includes any known recording medium such as a semiconductor recording medium or a magnetic recording medium. The non-transitory recording medium includes any recording medium other than a transitory propagating signal, and does not exclude a volatile recording medium. In a configuration in which a distribution device distributes the program via a communication network, the storage device that stores the program in the distribution device corresponds to the above non-transitory recording medium.

< Appendix >

From the embodiments illustrated above, the following configurations are derived, for example.

A sound processing method according to a preferred aspect of the present invention (1st aspect) deforms a 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal representing a distorted sound in which the 1st sound is deformed in accordance with a 2nd sound, and generates the 3rd sound signal corresponding to the synthesized spectral envelope outline shape, where the 1st difference is the difference between the 1st spectral envelope outline shape and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, and the 2nd difference is the difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing the 2nd sound, whose acoustic characteristics differ from those of the 1st sound, and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal. In the above aspect, the 1st difference and the 2nd difference are synthesized into the 1st spectral envelope outline shape, thereby generating the synthesized spectral envelope outline shape of the distorted sound. Therefore, an acoustically natural distorted sound can be generated whose acoustic characteristics are continuous at the boundaries between the period of the 1st sound signal in which the 2nd sound signal is synthesized and the periods before and after it.

The spectral envelope outline shape is the outline of a spectral envelope. Specifically, it corresponds to an intensity distribution on the frequency axis in which the spectral envelope is smoothed to such a degree that phonemic features (differences between phonemes) and individuality (differences between speakers) are no longer noticeable. For example, the spectral envelope outline shape is expressed by a predetermined number of low-order coefficients among the plurality of coefficients of a mel cepstrum representing the outline of the spectrum.
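As one hypothetical realization, a frame's outline shape can be obtained by keeping only the low-order cepstral coefficients. For brevity the sketch uses a plain real cepstrum on a linear frequency axis, whereas the text above specifies a mel cepstrum, and the order n_keep is illustrative:

```python
import numpy as np

def envelope_outline(mag_spectrum, n_keep=12):
    """Low-order cepstral outline of one magnitude spectrum frame.

    mag_spectrum -- magnitudes of an rfft frame, length n_fft // 2 + 1
    n_keep       -- number of low-order coefficients retained
    """
    log_mag = np.log(np.maximum(mag_spectrum, 1e-10))  # avoid log(0)
    cep = np.fft.irfft(log_mag)                        # real cepstrum of the frame
    return cep[:n_keep]                                # outline shape coefficients
```

The smaller n_keep is, the more strongly phonemic and speaker-specific detail is smoothed away.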

In a preferred example of the 1st aspect (2nd aspect), the position on the time axis of the 2nd sound signal relative to the 1st sound signal is adjusted so that the end point of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, coincides with the end point of a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable; the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and the synthesized spectral envelope outline shape is generated between the 1st sound signal and the adjusted 2nd sound signal. In a preferred example of the 2nd aspect (3rd aspect), the 1st time and the 2nd time are the later of the start point of the 1st stationary period and the start point of the 2nd stationary period. In the above aspects, when the end points of the 1st stationary period and the 2nd stationary period are made to coincide, the later of the two start points is selected as the 1st time and the 2nd time. Therefore, a distorted sound in which the acoustic characteristics of the release portion of the 2nd sound are added to the 1st sound can be generated while the continuity of the acoustic characteristics at the start points of the 1st stationary period and the 2nd stationary period is maintained.

In a preferred example of the 1st aspect (4th aspect), the position on the time axis of the 2nd sound signal relative to the 1st sound signal is adjusted so that the start point of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, coincides with the start point of a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable; the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and the synthesized spectral envelope outline shape is generated between the 1st sound signal and the adjusted 2nd sound signal. In a preferred example of the 4th aspect (5th aspect), the 1st time and the 2nd time are the start point of the 1st stationary period. In the above aspects, when the start points of the 1st stationary period and the 2nd stationary period are made to coincide, the start point of the 1st stationary period (equal to the start point of the 2nd stationary period) is selected as the 1st time and the 2nd time. Therefore, a distorted sound in which the acoustic characteristics in the vicinity of the onset of the 2nd sound are added to the 1st sound can be generated while movement of the start point of the 1st stationary period is suppressed.

In a preferred example of any one of the 2nd to 5th aspects (6th aspect), the 1st stationary period is determined in accordance with a 1st index indicating the degree of change of the fundamental frequency of the 1st sound signal and a 2nd index indicating the degree of change of the spectral shape of the 1st sound signal. According to the above aspect, a period in which both the fundamental frequency and the spectral shape are temporally stable can be determined as the 1st stationary period. For example, a configuration is conceivable in which a variation index corresponding to the 1st index and the 2nd index is calculated and the 1st stationary period is determined in accordance with the variation index. Alternatively, a 1st provisional period may be determined in accordance with the 1st index, a 2nd provisional period may be determined in accordance with the 2nd index, and the 1st stationary period may be determined from the 1st provisional period and the 2nd provisional period.

In a preferred example of any one of the 1st to 6th aspects (7th aspect), in generating the synthesized spectral envelope outline shape, the result of multiplying the 1st difference by a 1st coefficient is subtracted from the 1st spectral envelope outline shape, and the result of multiplying the 2nd difference by a 2nd coefficient is added to it. In the above aspect, the time series of the synthesized spectral envelope outline shape is generated by this subtraction and addition. Therefore, a distorted sound can be generated in which the sound expression of the 1st sound is reduced and the sound expression of the 2nd sound is effectively added.

In a preferred example of any one of the 1st to 7th aspects (8th aspect), in generating the synthesized spectral envelope outline shape, a processing period of the 1st sound signal is extended in accordance with the time length of an expression period of the 2nd sound signal that is applied to the deformation of the 1st sound signal, and the 1st spectral envelope outline shape in the extended processing period is deformed in accordance with the 1st difference in the extended processing period and the 2nd difference in the expression period.

A sound processing device according to a preferred aspect of the present invention (9th aspect) includes a memory and 1 or more processors. By executing instructions stored in the memory, the 1 or more processors deform a 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal representing a distorted sound in which the 1st sound is deformed in accordance with a 2nd sound, and generate the 3rd sound signal corresponding to the synthesized spectral envelope outline shape, where the 1st difference is the difference between the 1st spectral envelope outline shape and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, and the 2nd difference is the difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing the 2nd sound, whose acoustic characteristics differ from those of the 1st sound, and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal.

In a preferred example of the 9th aspect (10th aspect), the position on the time axis of the 2nd sound signal relative to the 1st sound signal is adjusted so that the end point of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, coincides with the end point of a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable; the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and the synthesized spectral envelope outline shape is generated between the 1st sound signal and the adjusted 2nd sound signal. In a preferred example of the 10th aspect (11th aspect), the 1st time and the 2nd time are the later of the start point of the 1st stationary period and the start point of the 2nd stationary period.

In a preferred example of the 9th aspect (12th aspect), the position on the time axis of the 2nd sound signal relative to the 1st sound signal is adjusted so that the start point of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, coincides with the start point of a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable; the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and the synthesized spectral envelope outline shape is generated between the 1st sound signal and the adjusted 2nd sound signal. In a preferred example of the 12th aspect (13th aspect), the 1st time and the 2nd time are the start point of the 1st stationary period.

In a preferred example of any one of the 9th to 13th aspects (14th aspect), the 1 or more processors subtract from the 1st spectral envelope outline shape the result of multiplying the 1st difference by a 1st coefficient, and add to it the result of multiplying the 2nd difference by a 2nd coefficient.

A recording medium according to a preferred aspect of the present invention (15th aspect) is a computer-readable recording medium having recorded thereon a program that causes a computer to execute: a 1st process of deforming a 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound in accordance with a 1st difference, which is the difference between the 1st spectral envelope outline shape and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, and a 2nd difference, which is the difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal representing a distorted sound in which the 1st sound is deformed in accordance with the 2nd sound; and a 2nd process of generating the 3rd sound signal corresponding to the synthesized spectral envelope outline shape.

Description of the reference numerals

100 … sound processing device, 11 … control device, 12 … storage device, 13 … operation device, 14 … sound emitting device, 21 … signal analysis unit, 22 … synthesis processing unit, 31 … attack processing unit, 32 … release processing unit, 33 … voice synthesis unit.
