Sound processing method, sound processing device and recording medium
Created by 大道龙之介 (Daido Ryunosuke) and 嘉山启 (Kayama Kei), 2019-03-08. Abstract: The sound processing device includes a synthesis processing unit that synthesizes a 1st difference and a 2nd difference into a 1st spectral envelope outline shape, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal representing a deformed sound in which a singing voice is deformed in accordance with a reference voice, and generates the 3rd sound signal corresponding to the synthesized spectral envelope outline shape. The 1st difference is a difference between the 1st spectral envelope outline shape of a 1st sound signal representing the singing voice and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal. The 2nd difference is a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing the reference voice and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal.
1. A sound processing method realized by a computer, the method comprising:
deforming a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal; and
generating the 3rd sound signal corresponding to the synthesized spectral envelope outline shape,
wherein the 1st difference is a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, the 2nd difference is a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, and the 3rd sound signal represents a deformed sound obtained by deforming the 1st sound in accordance with the 2nd sound.
2. The sound processing method according to claim 1, further comprising:
adjusting a temporal position of the 2nd sound signal with respect to the 1st sound signal so that end points coincide between a 1st stationary period in which a spectral shape of the 1st sound signal is temporally stable and a 2nd stationary period in which a spectral shape of the 2nd sound signal is temporally stable,
wherein the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and
the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal.
3. The sound processing method according to claim 2, wherein
the 1st time and the 2nd time are the later of the start point of the 1st stationary period and the start point of the 2nd stationary period.
4. The sound processing method according to claim 1, further comprising:
adjusting a temporal position of the 2nd sound signal with respect to the 1st sound signal so that start points coincide between a 1st stationary period in which a spectral shape of the 1st sound signal is temporally stable and a 2nd stationary period in which a spectral shape of the 2nd sound signal is temporally stable,
wherein the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and
the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal.
5. The sound processing method according to claim 4, wherein
the 1st time and the 2nd time are the start point of the 1st stationary period.
6. The sound processing method according to any one of claims 2 to 5, wherein
the 1st stationary period is determined in accordance with a 1st index representing a degree of change in a fundamental frequency of the 1st sound signal and a 2nd index representing a degree of change in the spectral shape of the 1st sound signal.
7. The sound processing method according to any one of claims 1 to 6, wherein,
in generating the synthesized spectral envelope outline shape,
a result of multiplying the 1st difference by a 1st coefficient is subtracted from the 1st spectral envelope outline shape, and a result of multiplying the 2nd difference by a 2nd coefficient is added thereto.
8. The sound processing method according to any one of claims 1 to 7, wherein,
in generating the synthesized spectral envelope outline shape,
a processing period of the 1st sound signal is extended in accordance with the time length of an expression period of the 2nd sound signal to be applied to the deformation of the 1st sound signal, and
the 1st spectral envelope outline shape in the extended processing period is deformed in accordance with the 1st difference in the extended processing period and the 2nd difference in the expression period, thereby generating the synthesized spectral envelope outline shape.
9. A sound processing apparatus comprising a memory and 1 or more processors, wherein
the 1 or more processors execute instructions stored in the memory to thereby:
deform a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal, and
generate the 3rd sound signal corresponding to the synthesized spectral envelope outline shape,
wherein the 1st difference is a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, the 2nd difference is a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, and the 3rd sound signal represents a deformed sound obtained by deforming the 1st sound in accordance with the 2nd sound.
10. The sound processing apparatus according to claim 9, wherein
a temporal position of the 2nd sound signal is adjusted with respect to the 1st sound signal so that end points coincide between a 1st stationary period in which a spectral shape of the 1st sound signal is temporally stable and a 2nd stationary period in which a spectral shape of the 2nd sound signal is temporally stable,
the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and
the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal.
11. The sound processing apparatus according to claim 9, wherein
the 1st time and the 2nd time are the later of the start point of the 1st stationary period and the start point of the 2nd stationary period.
12. The sound processing apparatus according to claim 9, wherein
a temporal position of the 2nd sound signal is adjusted with respect to the 1st sound signal so that start points coincide between a 1st stationary period in which a spectral shape of the 1st sound signal is temporally stable and a 2nd stationary period in which a spectral shape of the 2nd sound signal is temporally stable,
the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and
the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal.
13. The sound processing apparatus according to claim 12, wherein
the 1st time and the 2nd time are the start point of the 1st stationary period.
14. The sound processing apparatus according to any one of claims 9 to 13, wherein
the 1 or more processors subtract a result of multiplying the 1st difference by a 1st coefficient from the 1st spectral envelope outline shape and add thereto a result of multiplying the 2nd difference by a 2nd coefficient.
15. A computer-readable recording medium having recorded thereon a program for causing the computer to execute:
a 1st process of generating a synthesized spectral envelope outline shape of a 3rd sound signal by deforming a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, the 1st difference being a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, the 2nd difference being a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, and the 3rd sound signal representing a deformed sound obtained by deforming the 1st sound in accordance with the 2nd sound; and
a 2nd process of generating the 3rd sound signal corresponding to the synthesized spectral envelope outline shape.
Technical Field
The present invention relates to a technique for processing an audio signal representing audio.
Background
Various techniques have been proposed for adding vocal expression to singing voice (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent Laid-Open Publication No. 2014-2338
Disclosure of Invention
However, the technique of
In order to solve the above problem, a sound processing method according to a preferred embodiment of the present invention deforms a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal, and generates the 3rd sound signal corresponding to the synthesized spectral envelope outline shape, wherein the 1st difference is a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, the 2nd difference is a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, and the 3rd sound signal represents a deformed sound obtained by deforming the 1st sound in accordance with the 2nd sound.
In order to solve the above-described problem, a sound processing device according to a preferred aspect of the present invention includes a memory and 1 or more processors, and includes a synthesis processing unit that, by the 1 or more processors executing instructions stored in the memory, deforms a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal, and generates the 3rd sound signal corresponding to the synthesized spectral envelope outline shape, wherein the 1st difference is a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, the 2nd difference is a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, and the 3rd sound signal represents a deformed sound obtained by deforming the 1st sound in accordance with the 2nd sound.
In order to solve the above problem, a recording medium according to a preferred embodiment of the present invention records a program for causing a computer to execute: a 1st process of generating a synthesized spectral envelope outline shape of a 3rd sound signal by deforming a 1st spectral envelope outline shape in accordance with a 1st difference and a 2nd difference, the 1st difference being a difference between the 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, the 2nd difference being a difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal, and the 3rd sound signal representing a deformed sound obtained by deforming the 1st sound in accordance with the 2nd sound; and a 2nd process of generating the 3rd sound signal corresponding to the synthesized spectral envelope outline shape.
Drawings
Fig. 1 is a block diagram illustrating a configuration of an audio processing device according to an embodiment of the present invention.
Fig. 2 is a block diagram illustrating a functional configuration of the sound processing apparatus.
Fig. 3 is an explanatory diagram of a stationary period in the 1 st sound signal.
Fig. 4 is a flowchart illustrating a specific procedure of the signal analysis processing.
Fig. 5 illustrates the temporal change of the fundamental frequency immediately after the start of utterance of a singing voice.
Fig. 6 illustrates the temporal change of the fundamental frequency immediately before the end of utterance of a singing voice.
Fig. 7 is a flowchart illustrating a specific procedure of the release processing.
Fig. 8 is an explanatory diagram of the release processing.
Fig. 9 is an explanatory diagram of the spectral envelope outline shape.
Fig. 10 is a flowchart illustrating a specific procedure of the attack processing.
Fig. 11 is an explanatory diagram of the attack processing.
Detailed Description
Fig. 1 is a block diagram illustrating the configuration of an audio processing device according to an embodiment of the present invention.
Sound expression is particularly prominent in the portion immediately after the start of utterance in a singing voice, where the volume increases (hereinafter referred to as the "attack portion"), and in the portion immediately before the end of utterance, where the volume decreases (hereinafter referred to as the "release portion"). In consideration of this tendency, the present embodiment adds sound expression particularly to the attack portion and the release portion of the singing voice.
As illustrated in fig. 1, the
The
The
The
Fig. 2 is a block diagram illustrating a functional configuration of the
The
The analysis data D1 is data indicating a plurality of stationary periods Q1 of the 1st sound signal X1. As illustrated in fig. 3, each stationary period Q1 indicated by the analysis data D1 is a variable-length period in which the fundamental frequency f1 and the spectral shape of the 1st sound signal X1 are temporally stable. The analysis data D1 specifies the time T1_S of the start point (hereinafter referred to as "start time") and the time T1_E of the end point (hereinafter referred to as "end time") of each stationary period Q1. The fundamental frequency f1 or the spectral shape (i.e., phoneme) often changes between 2 successive notes in a musical piece. Therefore, each stationary period Q1 is highly likely to be a period corresponding to 1 note in a musical piece.
Similarly, the analysis data D2 is data indicating a plurality of stationary periods Q2 of the 2nd sound signal X2. Each stationary period Q2 is a variable-length period in which the fundamental frequency f2 and the spectral shape of the 2nd sound signal X2 are temporally stable. The analysis data D2 specifies the start time T2_S and the end time T2_E of each stationary period Q2. Like the stationary period Q1, each stationary period Q2 is highly likely to correspond to a period of 1 note in a musical piece.
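As a rough illustration of how such stationary periods might be detected, the sketch below marks a frame as stable when both a fundamental-frequency change index and a spectral-shape change index fall below thresholds, and groups consecutive stable frames into variable-length periods. The function name, the per-frame indices, and the thresholds are all hypothetical; the document does not specify the actual detection algorithm.

```python
# Hypothetical sketch: a frame is "stable" when both the fundamental-frequency
# change (1st index) and the spectral-shape change (2nd index) fall below
# thresholds; consecutive stable frames form a variable-length stationary period.
def stationary_periods(f0_delta, spec_delta, f0_thresh=1.0, spec_thresh=0.1):
    periods = []
    start = None
    for i, (df, ds) in enumerate(zip(f0_delta, spec_delta)):
        stable = abs(df) < f0_thresh and abs(ds) < spec_thresh
        if stable and start is None:
            start = i                      # a stationary period begins
        elif not stable and start is not None:
            periods.append((start, i))     # [start, end) frame indices
            start = None
    if start is not None:
        periods.append((start, len(f0_delta)))
    return periods
```

Each returned pair corresponds to the start time and end time (here, frame indices) that the analysis data D1 or D2 would record for one stationary period.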
Fig. 4 is a flowchart of the process (hereinafter referred to as "signal analysis process") S0 in which the
The
The
The
The
The
The
The
The
Fig. 5 illustrates the temporal change of the fundamental frequency f1 immediately after the start of utterance of a singing voice. As illustrated in fig. 5, a voiced period Va exists immediately before the stationary period Q1. The voiced period Va is a period in which the acoustic characteristics (for example, the fundamental frequency f1 or the spectral shape) of the singing voice vary unstably immediately before the stationary period Q1. For example, focusing on the stationary period Q1 immediately after the start of utterance of the singing voice, the section from the time τ1_A at which the utterance of the singing voice starts to the start time T1_S of the stationary period Q1 corresponds to the voiced period Va. Although the above description focuses on the singing voice, a voiced period Va likewise exists immediately before each stationary period Q2 of the reference voice. In the attack processing S1, the synthesis processing unit 22 (specifically, the attack processing unit 31) adds the sound expression of the attack portion of the 2nd sound signal X2 to the voiced period Va of the 1st sound signal X1 and the subsequent stationary period Q1.
Fig. 6 illustrates the temporal change of the fundamental frequency f1 immediately before the end of utterance of a singing voice. As illustrated in fig. 6, a voiced period Vr exists immediately after the stationary period Q1. The voiced period Vr is a period in which the acoustic characteristics (for example, the fundamental frequency f1 or the spectral shape) of the singing voice vary unstably immediately after the stationary period Q1. For example, focusing on the stationary period Q1 immediately before the end of utterance, the section from the end time T1_E of the stationary period Q1 to the time τ1_R at which the singing voice falls silent corresponds to the voiced period Vr. Although the above description focuses on the singing voice, a voiced period Vr likewise exists immediately after each stationary period Q2 of the reference voice. In the release processing S2, the synthesis processing unit 22 (specifically, the release processing unit 32) adds the sound expression of the release portion of the 2nd sound signal X2 to the voiced period Vr of the 1st sound signal X1 and the immediately preceding stationary period Q1.
< Release processing S2 >
Fig. 7 is a flowchart illustrating the specific contents of the release processing S2 executed by the release processing unit 32.
When the release processing S2 is started, the release processing unit 32 determines, for each stationary period Q1 of the 1st sound signal X1, whether to add the sound expression of the release portion of the 2nd sound signal X2 to that stationary period Q1 (S21). For example, sound expression is not added to a stationary period Q1 that satisfies any of the following conditions:
[Condition Cr1] The time length of the stationary period Q1 is below a prescribed value.
[Condition Cr2] The time length of the silent period immediately after the stationary period Q1 is below a prescribed value.
[Condition Cr3] The time length of the voiced period Vr after the stationary period Q1 exceeds a prescribed value.
It is difficult to add sound expression with natural sound quality to a stationary period Q1 of sufficiently short time length. Therefore, when the time length of the stationary period Q1 is below the prescribed value (condition Cr1), the stationary period Q1 is excluded from the addition of sound expression.
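Read as exclusion criteria, conditions Cr1 to Cr3 can be collected into a single predicate. The function name and thresholds below are hypothetical, and the reading of Cr2 and Cr3 as exclusion conditions is an assumption based on the surrounding text (a note followed almost immediately by the next note leaves no room for a release, and a long voiced period Vr suggests expression is already present).

```python
# Hypothetical predicate for conditions Cr1-Cr3 (thresholds in seconds are
# placeholders; the document only speaks of "prescribed values").
def skip_release_expression(q1_len, silence_len, vr_len,
                            min_q1=0.10, min_silence=0.05, max_vr=0.30):
    if q1_len < min_q1:          # Cr1: stationary period Q1 too short
        return True
    if silence_len < min_silence:  # Cr2: next note follows almost immediately
        return True
    if vr_len > max_vr:          # Cr3: a long release is already present
        return True
    return False
```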
When it is determined that the sound expression of the release portion of the 2nd sound signal X2 is to be added to a stationary period Q1 of the 1st sound signal X1 (S21: YES), the release processing unit 32 executes the processing described below.
The
Fig. 8 shows, on a common time axis, the waveforms and the temporal changes of the fundamental frequency of the 1st sound signal X1, the 2nd sound signal X2, and the deformed 3rd sound signal Y. In fig. 8, the start time T1_S and the end time T1_E of the stationary period Q1 of the singing voice, the end time τ1_R of the voiced period Vr immediately after the stationary period Q1, the start time τ1_A of the voiced period Va corresponding to the note immediately after the stationary period Q1, the start time T2_S and the end time T2_E of the stationary period Q2 of the reference voice, and the end time τ2_R of the voiced period Vr immediately after the stationary period Q2 are known information.
The
< Extension of processing period Z1_R (S24) >
The
As illustrated in fig. 8, the
A reference voice sung by a skilled singer, such as a professional, tends to be given sufficient sound expression over a correspondingly long time, whereas the sound expression of a singing voice sung by a user unaccustomed to singing tends to be temporally insufficient. Reflecting this tendency, as illustrated in fig. 8, the expression period Z2_R of the reference voice is longer than the processing period Z1_R of the singing voice. Therefore, the processing period Z1_R of the singing voice is extended in accordance with the time length of the expression period Z2_R.
The extension of the processing period Z1_R is realized by a mapping that associates an arbitrary time t1 of the 1st sound signal X1 (singing voice) with an arbitrary time t of the deformed 3rd sound signal Y (deformed voice). Fig. 8 illustrates the correspondence between the time t1 of the singing voice (vertical axis) and the time t of the deformed voice (horizontal axis).
The time t1 in the correspondence in fig. 8 is the time of the 1st sound signal X1 corresponding to the time t of the deformed voice. The reference line L indicated by a dashed-dotted line in fig. 8 represents the state in which the 1st sound signal X1 is neither expanded nor contracted (t1 = t). A section in which the slope of the time t1 of the singing voice with respect to the time t of the deformed voice is smaller than that of the reference line L is a section in which the 1st sound signal X1 is stretched. A section in which the slope of the time t1 with respect to the time t is larger than that of the reference line L is a section in which the singing voice is contracted.
The correspondence between the time t1 and the time t is expressed by the nonlinear functions of equations (1a) to (1c) shown below.
[Formula 1]
As illustrated in fig. 8, the time T_R is a predetermined time between the synthesis start time Tm_R and the end time τ1_R of the processing period Z1_R. For example, the later of the midpoint ((T1_S + T1_E)/2) between the start time T1_S and the end time T1_E of the stationary period Q1 and the synthesis start time Tm_R is set as the time T_R. As understood from equation (1a), the portion of the processing period Z1_R before the time T_R is neither expanded nor contracted. That is, the extension of the processing period Z1_R starts from the time T_R.
As understood from equation (1b), the portion of the processing period Z1_R after the time T_R is stretched to a large extent near the time T_R, and is stretched on the time axis such that the degree of stretching decreases as the end time τ1_R approaches. The function η(t) of equation (1b) is a nonlinear function for stretching the processing period Z1_R strongly toward the front on the time axis and reducing the degree of stretching toward the rear. Specifically, for example, a quadratic function of the time t (η(t) = t²) is applied as the function η(t). As described above, in the present embodiment, the processing period Z1_R is stretched on the time axis such that the degree of stretching is smaller at positions closer to the end time τ1_R of the processing period Z1_R. Therefore, the acoustic characteristics in the vicinity of the end time τ1_R of the singing voice are sufficiently maintained in the deformed voice as well. Moreover, near the time T_R, auditory unnaturalness due to stretching tends to be less noticeable than near the end time τ1_R, so even if the degree of stretching is increased near the time T_R as in the above example, the acoustic naturalness of the deformed voice is hardly reduced. In addition, as understood from equation (1c), the period from the end time τ2_R of the expression period Z2_R to the start time τ1_A of the next voiced period Va in the 1st sound signal X1 is shortened on the time axis. Since no speech is present from the end time τ2_R to the start time τ1_A, this portion of the 1st sound signal X1 can simply be deleted.
As described above, the processing period Z1_R of the singing voice is extended to the time length of the expression period Z2_R of the reference voice. The expression period Z2_R of the reference voice, on the other hand, is not expanded or contracted on the time axis. That is, the time t2 of the adjusted 2nd sound signal X2 corresponding to the time t of the deformed voice coincides with the time t (t2 = t). As exemplified above, in the present embodiment, since the processing period Z1_R of the singing voice is extended in accordance with the time length of the expression period Z2_R, there is no need to expand or contract the 2nd sound signal X2. Therefore, the sound expression of the release portion indicated by the 2nd sound signal X2 can be accurately added to the 1st sound signal X1.
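Since the bodies of equations (1a) to (1c) are not reproduced in the text, the following is only a plausible reconstruction of the described mapping: the identity before T_R, a quadratic warp (η(t) = t²) whose slope vanishes at T_R (strong stretching there) and grows toward the end point (weak stretching there), and a simple shift that removes the silent remainder. The variable names follow the description; the exact normalization is an assumption.

```python
def warp_time(t, T_R, tau1_R, tau2_R):
    """Map deformed-voice time t to singing-voice time t1.

    Hypothetical reconstruction of Eqs. (1a)-(1c); T_R < tau1_R < tau2_R,
    where tau1_R is the end of the singing-voice processing period and
    tau2_R the end of the reference-voice expression period."""
    if t <= T_R:
        return t                              # (1a): no expansion before T_R
    if t <= tau2_R:
        # (1b): quadratic warp eta(u) = u**2 -> slope ~0 near T_R (strong
        # stretch), largest slope at the end (stretch fades toward tau1_R)
        u = (t - T_R) / (tau2_R - T_R)
        return T_R + (tau1_R - T_R) * u * u
    return t - (tau2_R - tau1_R)              # (1c): delete the silent remainder
```

With T_R = 1.0, τ1_R = 2.0, τ2_R = 3.0, the mapping is continuous: it is the identity up to t = 1.0 and reaches t1 = τ1_R exactly at t = τ2_R.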
If the processing period Z1_ R is extended in the order shown in the above example, the
< Synthesis of fundamental frequency (S25) >
The
[ formula 2]
F(t)=f1(t1)-λ1(f1(t1)-F1(t1))+λ2(f2(t2)-F2(t2))...(2)
The smoothed fundamental frequency F1(t1) in equation (2) is a frequency obtained by smoothing the time series of the fundamental frequency f1(t1) of the 1st sound signal X1 on the time axis. Similarly, the smoothed fundamental frequency F2(t2) in equation (2) is a frequency obtained by smoothing the time series of the fundamental frequency f2(t2) of the 2nd sound signal X2 on the time axis.
As understood from equation (2), the 2nd term of equation (2) subtracts, from the fundamental frequency f1(t1) of the 1st sound signal X1, the difference between the fundamental frequency f1(t1) and the smoothed fundamental frequency F1(t1) of the singing voice, to a degree corresponding to the coefficient λ1. Likewise, the 3rd term adds the difference between the fundamental frequency f2(t2) and the smoothed fundamental frequency F2(t2) of the reference voice, to a degree corresponding to the coefficient λ2.
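Equation (2) can be transcribed directly; the moving-average smoother below is a hypothetical stand-in, since the document does not specify how f1 and f2 are smoothed.

```python
def synth_f0(f1_t1, F1_t1, f2_t2, F2_t2, lam1, lam2):
    # Equation (2): F(t) = f1(t1) - λ1*(f1(t1) - F1(t1)) + λ2*(f2(t2) - F2(t2))
    return f1_t1 - lam1 * (f1_t1 - F1_t1) + lam2 * (f2_t2 - F2_t2)

def smooth(series, k=3):
    # Hypothetical smoothing: centered moving average over k frames,
    # truncated at the edges of the series.
    half = k // 2
    out = []
    for i in range(len(series)):
        window = series[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out
```

For example, with λ1 = λ2 = 1 the singing voice's own fine fluctuation (f1 − F1) is removed entirely and replaced by that of the reference voice (f2 − F2): `synth_f0(220.0, 218.0, 330.0, 327.0, 1.0, 1.0)` yields 221.0.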
< Synthesis of the spectral envelope outline shape (S26) >
The
The
[ formula 3]
G(t)=G1(t1)-μ1(G1(t1)-G1_ref)+μ2(G2(t2)-G2_ref)...(3)
The symbol G1_ref in equation (3) is a reference spectral envelope outline shape. Of the plurality of spectral envelope outline shapes G1 of the 1st sound signal X1, the 1 spectral envelope outline shape G1 at a specific time is used as the reference spectral envelope outline shape G1_ref (an example of the 1st reference spectral envelope outline shape). Specifically, the reference spectral envelope outline shape G1_ref is the spectral envelope outline shape G1(Tm_R) at the synthesis start time Tm_R (an example of the 1st time) of the 1st sound signal X1. That is, the time at which the reference spectral envelope outline shape G1_ref is extracted is later than both the start time T1_S of the stationary period Q1 and the start time T2_S of the stationary period Q2. The time at which the reference spectral envelope outline shape G1_ref is extracted is not limited to the synthesis start time Tm_R. For example, the spectral envelope outline shape G1 at an arbitrary time within the stationary period Q1 may be used as the reference spectral envelope outline shape G1_ref.
Similarly, the reference spectral envelope outline shape G2_ref in equation (3) is the 1 spectral envelope outline shape G2 at a specific time among the plurality of spectral envelope outline shapes G2 of the 2nd sound signal X2. Specifically, the reference spectral envelope outline shape G2_ref is the spectral envelope outline shape G2(Tm_R) at the synthesis start time Tm_R (an example of the 2nd time) of the 2nd sound signal X2. That is, the time at which the reference spectral envelope outline shape G2_ref is extracted is later than both the start time T1_S of the stationary period Q1 and the start time T2_S of the stationary period Q2. The time at which the reference spectral envelope outline shape G2_ref is extracted is not limited to the synthesis start time Tm_R. For example, the spectral envelope outline shape G2 at an arbitrary time within the stationary period Q2 may be used as the reference spectral envelope outline shape G2_ref.
The coefficient μ1 and the coefficient μ2 of equation (3) are set to non-negative values of 1 or less (0 ≤ μ1 ≤ 1, 0 ≤ μ2 ≤ 1).
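Equation (3) applied per frequency bin can be sketched as follows; the representation of each envelope outline shape as a plain list of bin values is an assumption for illustration.

```python
def synth_envelope(G1_t1, G1_ref, G2_t2, G2_ref, mu1, mu2):
    # Equation (3), evaluated per frequency bin:
    # G(t) = G1(t1) - μ1*(G1(t1) - G1_ref) + μ2*(G2(t2) - G2_ref)
    assert 0.0 <= mu1 <= 1.0 and 0.0 <= mu2 <= 1.0  # coefficients in [0, 1]
    return [g1 - mu1 * (g1 - r1) + mu2 * (g2 - r2)
            for g1, r1, g2, r2 in zip(G1_t1, G1_ref, G2_t2, G2_ref)]
```

With μ1 = μ2 = 0 the singing voice's outline shape G1(t1) passes through unchanged; raising μ1 suppresses the singing voice's deviation from its reference shape, and raising μ2 adds the reference voice's deviation from its own reference shape.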
< Attack processing S1 >
Fig. 10 is a flowchart illustrating the specific contents of the attack processing S1 executed by the attack processing unit 31.
When the attack processing S1 is started, the attack processing unit 31 determines, for each stationary period Q1 of the 1st sound signal X1, whether to add the sound expression of the attack portion of the 2nd sound signal X2 to that stationary period Q1 (S11). For example, sound expression is not added to a stationary period Q1 that satisfies any of the following conditions:
[Condition Ca1] The time length of the stationary period Q1 is below a prescribed value.
[Condition Ca2] The fluctuation width of the smoothed fundamental frequency f1 within the stationary period Q1 exceeds a prescribed value.
[Condition Ca3] The fluctuation width of the smoothed fundamental frequency f1 exceeds a prescribed value within a period of prescribed length including the start point of the stationary period Q1.
[Condition Ca4] The time length of the voiced period Va immediately before the stationary period Q1 exceeds a prescribed value.
[Condition Ca5] The fluctuation width of the fundamental frequency f1 in the voiced period Va immediately before the stationary period Q1 exceeds a prescribed value.
The condition Ca1, like the condition Cr1 described above, reflects the fact that it is difficult to add sound expression with natural sound quality to a stationary period Q1 of sufficiently short time length. When the fundamental frequency f1 varies greatly within the stationary period Q1, there is a high possibility that sufficient sound expression has already been added to the singing voice. Therefore, a stationary period Q1 in which the fluctuation width of the smoothed fundamental frequency f1 exceeds the prescribed value is excluded from the addition of sound expression (condition Ca2). The condition Ca3 is similar to the condition Ca2, but focuses in particular on the period close to the attack portion within the stationary period Q1. Likewise, when the time length of the voiced period Va immediately before the stationary period Q1 is sufficiently long, or when the fundamental frequency f1 varies greatly within the voiced period Va, there is a high possibility that sufficient sound expression has already been added to the singing voice. Therefore, a stationary period Q1 whose immediately preceding voiced period Va exceeds the prescribed time length (condition Ca4) and a stationary period Q1 in which the fluctuation width of the fundamental frequency f1 in the voiced period Va exceeds the prescribed value (condition Ca5) are excluded from the addition of sound expression. When determining that no sound expression is to be added to the stationary period Q1 (S11: NO), the attack processing unit 31 does not add sound expression to that stationary period Q1.
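Conditions Ca1 to Ca5 can likewise be collected into one exclusion predicate; the function name and thresholds are hypothetical placeholders for the "prescribed values" in the text.

```python
# Hypothetical predicate for conditions Ca1-Ca5 (units: seconds for lengths,
# an arbitrary frequency-range measure for fluctuation widths).
def skip_attack_expression(q1_len, f1_range_q1, f1_range_head,
                           va_len, f1_range_va,
                           min_q1=0.10, max_range=30.0, max_va=0.20):
    return (q1_len < min_q1              # Ca1: Q1 too short
            or f1_range_q1 > max_range   # Ca2: f0 already fluctuates over Q1
            or f1_range_head > max_range # Ca3: f0 fluctuates near the start of Q1
            or va_len > max_va           # Ca4: long voiced period Va precedes Q1
            or f1_range_va > max_range)  # Ca5: f0 fluctuates within Va
```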
When determining that the sound expression of the attack portion of the 2nd sound signal X2 is to be added to a stationary period Q1 of the 1st sound signal X1 (S11: YES), the attack processing unit 31 executes the processing described below.
The
The
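The exclusion logic described above can be sketched as a simple gating function. All thresholds below are illustrative placeholders (the source only refers to "predetermined values"), and condition Ca3, which inspects the sub-portion of Q1 near the onset, is omitted for brevity:

```python
# Hypothetical thresholds; the patent only compares each quantity with a
# "predetermined value", so these numbers are illustrative, not from the source.
MIN_Q1_SECONDS = 0.1       # condition Ca1: Q1 must not be too short
MAX_F1_RANGE_CENTS = 50.0  # conditions Ca2/Ca5: pitch-fluctuation limits
MAX_VA_SECONDS = 0.5       # condition Ca4: preceding voiced-period length

def should_add_attack_expression(q1_len, f1_range_in_q1,
                                 va_len, f1_range_in_va):
    """Return True when none of the exclusion conditions Ca1/Ca2/Ca4/Ca5 hold.

    q1_len:         time length of the stationary period Q1 (seconds)
    f1_range_in_q1: fluctuation width of the smoothed f1 inside Q1 (cents)
    va_len:         length of the voiced period Va just before Q1 (seconds)
    f1_range_in_va: fluctuation width of f1 inside Va (cents)
    """
    if q1_len < MIN_Q1_SECONDS:              # Ca1: Q1 too short
        return False
    if f1_range_in_q1 > MAX_F1_RANGE_CENTS:  # Ca2: expression likely present
        return False
    if va_len > MAX_VA_SECONDS:              # Ca4: long preceding voiced period
        return False
    if f1_range_in_va > MAX_F1_RANGE_CENTS:  # Ca5: fluctuating preceding period
        return False
    return True
```

Only when this gate passes (S11: YES) is the attack expression of the 2nd sound signal X2 added to the stationary period Q1.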
< extension of the processing period Z1_A >
As illustrated in fig. 11, in the present embodiment, the processing period Z1_A is extended on the time axis such that the degree of extension becomes smaller at positions closer to the start time τ1_A of the processing period Z1_A. The acoustic characteristics of the singing voice in the vicinity of the start time τ1_A are therefore sufficiently maintained in the distorted sound. On the other hand, the expression period Z2_A of the reference voice is not expanded or contracted on the time axis, so the sound expression of the reference voice represented by the 2nd sound signal X2 can be accurately added to the 1st sound signal X1.
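One way to realize an extension whose degree shrinks toward the start time is a time-warp function whose local slope is 1 (no stretch) at the start of the period. The quadratic warp below is our illustrative choice under that constraint; the patent only requires that the degree of extension be smaller closer to τ1_A:

```python
def warp_map(t_out, T_in, T_out):
    """Map a point t_out on the extended output axis [0, T_out] back to the
    input axis [0, T_in] (with T_out >= T_in). The slope at t_out = 0 is 1,
    so frames near the start are read almost unstretched, and the slope
    decreases toward the end, where the extension is concentrated.
    Monotonic as long as T_out < 2 * T_in.
    """
    a = 2.0 * (T_out - T_in) / (T_out ** 2)  # chosen so warp_map(T_out) == T_in
    return t_out - 0.5 * a * t_out ** 2
```

Reading the outline frames of the processing period at positions `warp_map(t, T_in, T_out)` then yields the extended period while leaving the neighborhood of the start time nearly untouched.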
If the processing period Z1_ a is extended in the order shown in the above example, the sound-originating
Specifically, the sound
The reference spectral envelope outline shape G1_ref applied to expression (3) in the attack processing S1 is the spectral envelope outline shape G1(Tm_A) at the synthesis end time Tm_A (an example of the 1st time) in the 1st sound signal X1. That is, the time at which the reference spectral envelope outline shape G1_ref is extracted coincides with the start time T1_S of the stationary period Q1.
Similarly, the reference spectral envelope outline shape G2_ref applied to expression (3) in the attack processing S1 is the spectral envelope outline shape G2(Tm_A) at the synthesis end time Tm_A (an example of the 2nd time) in the 2nd sound signal X2. That is, the time at which the reference spectral envelope outline shape G2_ref is extracted also coincides with the start time T1_S of the stationary period Q1.
As understood from the above description, the attack
As described above, in the present embodiment, the difference (G1(t1) − G1_ref) between the spectral envelope outline shape G1(t1) of the 1st sound signal X1 and the reference spectral envelope outline shape G1_ref, and the difference (G2(t2) − G2_ref) between the spectral envelope outline shape G2(t2) of the 2nd sound signal X2 and the reference spectral envelope outline shape G2_ref, are synthesized into the spectral envelope outline shape G1(t1) of the 1st sound signal X1. Therefore, an acoustically natural distorted sound can be generated whose acoustic characteristics are continuous at the boundaries between the period of the 1st sound signal X1 deformed in accordance with the 2nd sound signal X2 (the processing period Z1_A or Z1_R) and the periods before and after it.
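Per frame, this synthesis can be sketched directly from the description of the 7th aspect below: the 1st difference scaled by a 1st coefficient is subtracted and the 2nd difference scaled by a 2nd coefficient is added. The names `alpha`/`beta` for the two coefficients are ours, not from the source:

```python
import numpy as np

def synthesize_outline(G1_t1, G1_ref, G2_t2, G2_ref, alpha=1.0, beta=1.0):
    """Combine spectral-envelope outline shapes (vectors of low-order
    cepstral coefficients) in the manner of expression (3): subtract the
    1st difference scaled by a 1st coefficient (alpha) and add the 2nd
    difference scaled by a 2nd coefficient (beta)."""
    d1 = G1_t1 - G1_ref   # 1st difference: singing voice vs. its reference
    d2 = G2_t2 - G2_ref   # 2nd difference: reference voice vs. its reference
    return G1_t1 - alpha * d1 + beta * d2
```

Note that at the 1st and 2nd times both differences vanish, so the synthesized outline there equals G1 itself; this is exactly why the acoustic characteristics stay continuous at the boundaries of the processing period.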
In the present embodiment, a stationary period Q1 in which the fundamental frequency f1 and the spectral shape of the 1st sound signal X1 are temporally stable is determined, and the 1st sound signal X1 is deformed by the 2nd sound signal X2 arranged with reference to an end point of the stationary period Q1 (the start time T1_S or the end time T1_E). Therefore, the 1st sound signal X1 is deformed in accordance with the 2nd sound signal X2 over an appropriate period, and an acoustically natural distorted sound can be generated.
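A minimal sketch of such stationary-period detection follows, marking frames where both the fundamental frequency and the outline coefficients change little from frame to frame. The thresholds and the per-frame change measures are illustrative assumptions; the patent only requires a variation index derived from the degree of change of both quantities:

```python
import numpy as np

def stationary_mask(f0, outlines, f0_tol=0.02, shape_tol=0.1):
    """Mark frames where both the fundamental frequency and the spectral
    envelope outline are temporally stable.

    f0:       (N,) fundamental frequency per frame (Hz, > 0 in voiced frames)
    outlines: (N, K) low-order cepstral outline coefficients per frame
    """
    df0 = np.abs(np.diff(np.log(f0)))                  # pitch change per frame
    dshape = np.linalg.norm(np.diff(outlines, axis=0), axis=1)
    stable = (df0 < f0_tol) & (dshape < shape_tol)     # both quantities stable
    return np.concatenate([[False], stable])           # align to frame index

def runs(mask):
    """Return (start, end) frame-index pairs of contiguous True runs."""
    out, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        elif not m and start is not None:
            out.append((start, i)); start = None
    if start is not None:
        out.append((start, len(mask)))
    return out
```

Each run returned by `runs` is a candidate stationary period; its end points then serve as the reference for placing the 2nd sound signal.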
In the present embodiment, the processing period (Z1_A or Z1_R) of the 1st sound signal X1 is extended in accordance with the time length of the expression period (Z2_A or Z2_R) of the 2nd sound signal X2, so the 2nd sound signal X2 itself need not be expanded or contracted. Therefore, the acoustic characteristics of the reference voice (e.g., its sound expression) are accurately added to the 1st sound signal X1, and an acoustically natural distorted sound can be generated.
< modification example >
Next, specific modifications of the above-described embodiment are described by way of example. Two or more modes arbitrarily selected from the following examples may be combined as appropriate as long as they do not contradict each other.
(1) In the foregoing embodiment, the stationary period Q1 of the 1 st sound signal X1 is specified by the variation index Δ calculated from the 1
(2) In the above-described embodiment, the stationary period Q1 is determined as the period in which both the fundamental frequency f1 and the spectral shape of the 1 st sound signal X1 are temporally stable, but the stationary period Q1 may be determined as the period in which one of the fundamental frequency f1 and the spectral shape of the 1 st sound signal X1 is temporally stable. Similarly, a period in which one of the fundamental frequency f2 and the spectral shape in the 2 nd sound signal X2 is stable in time may be determined as the stationary period Q2.
(3) In the above-described embodiment, the spectral envelope outline shape G1 at the synthesis start time Tm_R or the synthesis end time Tm_A in the 1st sound signal X1 is used as the reference spectral envelope outline shape G1_ref, but the time (the 1st time) at which the reference spectral envelope outline shape G1_ref is extracted is not limited to this example. For example, the spectral envelope outline shape G1 at an end point of the stationary period Q1 (the start time T1_S or the end time T1_E) may be used as the reference spectral envelope outline shape G1_ref. However, the 1st time is preferably a time within the stationary period Q1, in which the spectral shape of the 1st sound signal X1 is stable.
The same applies to the reference spectral envelope outline shape G2_ref. That is, in the above-described embodiment, the spectral envelope outline shape G2 at the synthesis start time Tm_R or the synthesis end time Tm_A in the 2nd sound signal X2 is used as the reference spectral envelope outline shape G2_ref, but the time (the 2nd time) at which the reference spectral envelope outline shape G2_ref is extracted is not limited to this example. For example, the spectral envelope outline shape G2 at an end point of the stationary period Q2 (the start time T2_S or the end time T2_E) may be used as the reference spectral envelope outline shape G2_ref. However, the 2nd time is preferably a time within the stationary period Q2, in which the spectral shape of the 2nd sound signal X2 is stable.
In addition, the 1 st time at which the reference spectral envelope outline shape G1_ ref in the 1 st sound signal X1 is extracted and the 2 nd time at which the reference spectral envelope outline shape G2_ ref in the 2 nd sound signal X2 is extracted may be different times on the time axis.
(4) In the above-described embodiment, the 1 st audio signal X1 indicating the singing voice of the user of the
The sounds represented by the 1st sound signal X1 and the 2nd sound signal X2 are not limited to voice in the narrow sense (i.e., vocal sounds uttered by a human). For example, the present invention can also be applied to add various sound expressions (for example, performance expressions) to a 1st sound signal X1 representing the performance sound of a musical instrument. For instance, a performance expression such as vibrato is added, using the 2nd sound signal X2, to a 1st sound signal X1 representing a monotonous performance sound to which no performance expression has been added.
(5) The function of the
< appendix >
From the embodiments exemplified above, the following configurations are derivable, for example.
A sound processing method according to a preferred aspect (1st aspect) of the present invention deforms a 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal representing a distorted sound in which the 1st sound is distorted in accordance with the 2nd sound, and generates the 3rd sound signal corresponding to the synthesized spectral envelope outline shape. Here, the 1st difference is the difference between the 1st spectral envelope outline shape and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, and the 2nd difference is the difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal. In the above aspect, the 1st difference and the 2nd difference are synthesized into the 1st spectral envelope outline shape, thereby generating the synthesized spectral envelope outline shape of the distorted sound. Therefore, an acoustically natural distorted sound can be generated whose acoustic characteristics are continuous at the boundaries between the period of the 1st sound signal into which the 2nd sound signal is synthesized and the periods before and after it.
The spectral envelope outline shape is a coarse outline of the spectral envelope. Specifically, it corresponds to an intensity distribution on the frequency axis obtained by smoothing the spectral envelope to such an extent that phonological features (differences between phonemes) and individuality (differences between speakers) are no longer noticeable. For example, the spectral envelope outline shape is expressed by a predetermined number of lower-order coefficients among the plurality of mel-cepstral coefficients representing the outline of the spectrum.
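As a concrete illustration, such an outline can be obtained by keeping only the low-order cepstral coefficients of the log-magnitude spectrum. The sketch below omits the mel-frequency warping mentioned in the source, so it is a plain-cepstrum approximation; the function names and the coefficient count are ours:

```python
import numpy as np

def envelope_outline(frame, n_coef=8):
    """Extract a coarse spectral-envelope outline from one windowed audio
    frame as the low-order coefficients of the real cepstrum (mel warping
    omitted for brevity)."""
    spec = np.abs(np.fft.rfft(frame)) + 1e-12    # magnitude spectrum
    cep = np.fft.irfft(np.log(spec), n=len(frame))  # real cepstrum
    return cep[:n_coef]                          # low order = smooth outline

def smoothed_log_spectrum(frame, n_coef=8):
    """Low-quefrency liftering: keep the outline coefficients (and their
    symmetric counterparts) and transform back to a smoothed log spectrum."""
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame)) + 1e-12
    cep = np.fft.irfft(np.log(spec), n=n)
    lifter = np.zeros(n)
    lifter[:n_coef] = 1.0
    lifter[-(n_coef - 1):] = 1.0                 # symmetric counterpart
    return np.fft.rfft(cep * lifter).real
```

Keeping only a handful of coefficients smooths away phoneme- and speaker-dependent fine structure, which is exactly the property the outline shape is meant to have.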
In a preferred example (2nd aspect) of the 1st aspect, the temporal position of the 2nd sound signal relative to the 1st sound signal is adjusted so that the end points of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, and a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable, coincide; the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal. In a preferred example (3rd aspect) of the 2nd aspect, the 1st time and the 2nd time are the later of the start point of the 1st stationary period and the start point of the 2nd stationary period. In this aspect, when the end points of the 1st and 2nd stationary periods are made to coincide, the later of the two start points is selected as the 1st time and the 2nd time. Therefore, a distorted sound in which the acoustic characteristics of the attack portion of the 2nd sound are added to the 1st sound can be generated while maintaining the continuity of the acoustic characteristics at the start points of the 1st and 2nd stationary periods.
In a preferred example (4th aspect) of the 1st aspect, the temporal position of the 2nd sound signal relative to the 1st sound signal is adjusted so that the start points of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, and a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable, coincide; the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal. In a preferred example (5th aspect) of the 4th aspect, the 1st time and the 2nd time are the start point of the 1st stationary period. In this aspect, when the start points of the 1st and 2nd stationary periods are made to coincide, the start point of the 1st stationary period (equivalently, the start point of the 2nd stationary period) is selected as the 1st time and the 2nd time. Therefore, a distorted sound in which the acoustic characteristics in the vicinity of the onset of the 2nd sound are added to the 1st sound can be generated while suppressing movement of the start point of the 1st stationary period.
In a preferred example (6th aspect) of any one of the 2nd to 5th aspects, the 1st stationary period is determined in accordance with a 1st index indicating the degree of change of the fundamental frequency of the 1st sound signal and a 2nd index indicating the degree of change of the spectral shape of the 1st sound signal. According to this aspect, a period in which both the fundamental frequency and the spectral shape are temporally stable can be determined as the 1st stationary period. For example, a configuration is conceivable in which a fluctuation index corresponding to the 1st index and the 2nd index is calculated and the 1st stationary period is determined in accordance with that fluctuation index. Alternatively, a 1st tentative period may be determined from the 1st index, a 2nd tentative period may be determined from the 2nd index, and the 1st stationary period may be determined based on the 1st and 2nd tentative periods.
In a preferred example (7th aspect) of any one of the 1st to 6th aspects, in generating the synthesized spectral envelope outline shape, the result of multiplying the 1st difference by a 1st coefficient is subtracted from the 1st spectral envelope outline shape, and the result of multiplying the 2nd difference by a 2nd coefficient is added to it. In this aspect, the time series of the synthesized spectral envelope outline shape is generated by subtracting the 1st difference scaled by the 1st coefficient from the 1st spectral envelope outline shape and adding the 2nd difference scaled by the 2nd coefficient. Therefore, a distorted sound can be generated in which the sound expression of the 1st sound is reduced and the sound expression of the 2nd sound is effectively added.
In a preferred example (8th aspect) of any one of the 1st to 7th aspects, in generating the synthesized spectral envelope outline shape, the processing period of the 1st sound signal is extended in accordance with the time length of an expression period of the 2nd sound signal that is applied to the deformation of the 1st sound signal, and the 1st spectral envelope outline shape in the extended processing period is deformed in accordance with the 1st difference in the extended processing period and the 2nd difference in the expression period.
A sound processing device according to a preferred aspect (9th aspect) of the present invention includes a memory and one or more processors. By executing instructions stored in the memory, the one or more processors deform a 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal representing a distorted sound in which the 1st sound is distorted in accordance with the 2nd sound, and generate the 3rd sound signal corresponding to the synthesized spectral envelope outline shape. The 1st difference is the difference between the 1st spectral envelope outline shape and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, and the 2nd difference is the difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal.
In a preferred example (10th aspect) of the 9th aspect, the temporal position of the 2nd sound signal relative to the 1st sound signal is adjusted so that the end points of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, and a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable, coincide; the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal. In a preferred example (11th aspect) of the 10th aspect, the 1st time and the 2nd time are the later of the start point of the 1st stationary period and the start point of the 2nd stationary period.
In a preferred example (12th aspect) of the 9th aspect, the temporal position of the 2nd sound signal relative to the 1st sound signal is adjusted so that the start points of a 1st stationary period, in which the spectral shape of the 1st sound signal is temporally stable, and a 2nd stationary period, in which the spectral shape of the 2nd sound signal is temporally stable, coincide; the 1st time is a time within the 1st stationary period, the 2nd time is a time within the 2nd stationary period, and the synthesized spectral envelope outline shape is generated from the 1st sound signal and the adjusted 2nd sound signal. In a preferred example (13th aspect) of the 12th aspect, the 1st time and the 2nd time are the start point of the 1st stationary period.
In a preferred example (14th aspect) of any one of the 9th to 13th aspects, the one or more processors subtract, from the 1st spectral envelope outline shape, the result of multiplying the 1st difference by a 1st coefficient, and add the result of multiplying the 2nd difference by a 2nd coefficient.
A recording medium according to a preferred aspect (15th aspect) of the present invention is a computer-readable recording medium on which is recorded a program that causes a computer to execute: a 1st process of deforming a 1st spectral envelope outline shape of a 1st sound signal representing a 1st sound in accordance with a 1st difference and a 2nd difference, thereby generating a synthesized spectral envelope outline shape of a 3rd sound signal representing a distorted sound in which the 1st sound is distorted in accordance with the 2nd sound, the 1st difference being the difference between the 1st spectral envelope outline shape and a 1st reference spectral envelope outline shape at a 1st time in the 1st sound signal, and the 2nd difference being the difference between a 2nd spectral envelope outline shape of a 2nd sound signal representing a 2nd sound whose acoustic characteristics differ from those of the 1st sound and a 2nd reference spectral envelope outline shape at a 2nd time in the 2nd sound signal; and a 2nd process of generating the 3rd sound signal corresponding to the synthesized spectral envelope outline shape.
Description of the reference numerals
100 … sound processing device, 11 … control device, 12 … storage device, 13 … operation device, 14 … sound emitting device, 21 … signal analysis unit, 22 … synthesis processing unit, 31 … attack processing unit, 32 … release processing unit, 33 … speech synthesis unit.