Speech synthesis method, speech synthesis device, and program

Document No.: 1327786    Publication date: 2020-07-14

Reading note: This technology, "Speech synthesis method, speech synthesis device, and program", was designed and created by 大道龙之介 on 2018-11-28. Abstract: The speech synthesis device includes: an intermediate trained model that generates 2nd control data corresponding to an input including 1st control data that specifies a phoneme; an edit processing unit that changes the 2nd control data in accordance with a 1st instruction from the user; an output trained model that generates, in accordance with an input including the 1st control data and the changed 2nd control data, synthesis data relating to the frequency characteristics of the synthesized sound; and a synthesis processing unit that generates a sound signal corresponding to the synthesis data.

1. A sound synthesis method implemented by a computer, the method comprising:

generating, by an intermediate trained model, 2nd control data corresponding to an input including 1st control data that specifies a phoneme;

changing the 2nd control data in accordance with a 1st instruction from a user;

generating, by an output trained model, synthesis data relating to frequency characteristics of a synthesized sound in accordance with an input including the 1st control data and the changed 2nd control data; and

generating a sound signal corresponding to the synthesis data.

2. The sound synthesis method according to claim 1, wherein

the intermediate trained model is a 1st trained model that generates the 2nd control data in accordance with an input including the 1st control data, and

the 2nd control data is data relating to phonemes of the synthesized sound.

3. The sound synthesis method according to claim 2, further comprising:

generating, by a 2nd trained model, 3rd control data relating to an expression of the synthesized sound in accordance with an input including the 1st control data and the changed 2nd control data; and

changing the 3rd control data in accordance with a 2nd instruction from the user,

wherein, in generating the synthesis data, the synthesis data is generated in accordance with an input including the 1st control data, the changed 2nd control data, and the changed 3rd control data.

4. The sound synthesis method according to claim 1, wherein

the 2nd control data is data relating to phonemes and expressions of the synthesized sound.

5. The sound synthesis method according to any one of claims 1 to 4, further comprising:

changing the synthesis data in accordance with a 3rd instruction from the user,

wherein, in generating the sound signal, the sound signal is generated in accordance with the changed synthesis data.

6. A sound synthesis apparatus comprising:

an intermediate trained model that generates 2nd control data corresponding to an input including 1st control data that specifies a phoneme;

an edit processing unit that changes the 2nd control data in accordance with a 1st instruction from a user;

an output trained model that generates synthesis data relating to frequency characteristics of a synthesized sound in accordance with an input including the 1st control data and the changed 2nd control data; and

a synthesis processing unit that generates a sound signal corresponding to the synthesis data.

7. A program that causes a computer to function as:

an intermediate trained model that generates 2nd control data corresponding to an input including 1st control data that specifies a phoneme;

an edit processing unit that changes the 2nd control data in accordance with a 1st instruction from a user;

an output trained model that generates synthesis data relating to frequency characteristics of a synthesized sound in accordance with an input including the 1st control data and the changed 2nd control data; and

a synthesis processing unit that generates a sound signal corresponding to the synthesis data.

Technical Field

The present invention relates to a technique for synthesizing sound.

Background

Various speech synthesis techniques for synthesizing speech of arbitrary phonemes have been proposed. For example, patent document 1 discloses a technique of synthesizing a singing voice that utters a note sequence designated by a user on an editing screen. The editing screen is a piano roll screen in which a time axis and a pitch axis are set. The user specifies a phoneme (pronounced character), a pitch, and a pronunciation period for each note constituting a piece of music.

Patent document 1: Japanese Patent Laid-Open Publication No. 2016-90916

Disclosure of Invention

However, with the technique of patent document 1, the user can only specify the phoneme, pitch, and pronunciation period of each note, so in practice it is not easy to finely reflect the user's intention or taste in the synthesized sound. In view of the above, it is an object of preferred embodiments of the present invention to generate a synthesized sound that matches the intention or taste of the user.

In order to solve the above problem, a speech synthesis method according to a preferred embodiment of the present invention includes: generating, by an intermediate trained model, 2nd control data corresponding to an input including 1st control data that specifies a phoneme; changing the 2nd control data in accordance with a 1st instruction from a user; generating, by an output trained model, synthesis data relating to frequency characteristics of a synthesized sound in accordance with an input including the 1st control data and the changed 2nd control data; and generating a sound signal corresponding to the synthesis data.

A speech synthesis apparatus according to a preferred embodiment of the present invention includes: an intermediate trained model that generates 2nd control data corresponding to an input including 1st control data that specifies a phoneme; an edit processing unit that changes the 2nd control data in accordance with a 1st instruction from a user; an output trained model that generates synthesis data relating to frequency characteristics of a synthesized sound in accordance with an input including the 1st control data and the changed 2nd control data; and a synthesis processing unit that generates a sound signal corresponding to the synthesis data.

A program according to a preferred embodiment of the present invention causes a computer to function as: an intermediate trained model that generates 2nd control data corresponding to an input including 1st control data that specifies a phoneme; an edit processing unit that changes the 2nd control data in accordance with a 1st instruction from a user; an output trained model that generates synthesis data relating to frequency characteristics of a synthesized sound in accordance with an input including the 1st control data and the changed 2nd control data; and a synthesis processing unit that generates a sound signal corresponding to the synthesis data.

Drawings

Fig. 1 is a block diagram illustrating a configuration of a speech synthesis apparatus according to embodiment 1 of the present invention.

Fig. 2 is a block diagram illustrating a functional configuration of the speech synthesis apparatus.

Fig. 3 is a schematic diagram of an editing screen.

Fig. 4 is a flowchart of the sound synthesis process.

Fig. 5 is a block diagram illustrating a functional configuration of the speech synthesis apparatus according to embodiment 2.

Fig. 6 is a flowchart of the speech synthesis process in embodiment 2.

Detailed Description

< embodiment 1 >

Fig. 1 is a block diagram illustrating a configuration of a speech synthesis apparatus 100 according to embodiment 1 of the present invention. The speech synthesis apparatus 100 synthesizes a voice with arbitrary phonemes (hereinafter referred to as the "synthesized sound"). The speech synthesis apparatus 100 according to embodiment 1 is a singing synthesis apparatus that synthesizes, as the synthesized sound, the voice of a virtual singer singing a piece of music. As illustrated in Fig. 1, the speech synthesis apparatus 100 according to embodiment 1 is realized by a computer system including a control device 11, a storage device 12, an operation device 13, a display device 14, and a sound reproducing device 15. For example, a portable information terminal such as a mobile phone or a smartphone, or a portable or stationary information terminal such as a personal computer, is suitably used as the speech synthesis apparatus 100.

The display device 14 is composed of, for example, a liquid crystal display panel, and displays images as instructed by the control device 11. The operation device 13 is an input device that receives instructions from the user. Specifically, a plurality of operation elements that can be operated by the user, or a touch panel that detects contact with the display surface of the display device 14, is suitably used as the operation device 13.

The control device 11 is a processing circuit such as a CPU (Central Processing Unit), and centrally controls each element constituting the speech synthesis apparatus 100. The control device 11 according to embodiment 1 generates a sound signal V representing the time-domain waveform of the synthesized sound. The sound reproducing device 15 (e.g., a speaker or an earphone) reproduces the sound represented by the sound signal V generated by the control device 11. Note that, for convenience, illustration of a D/A converter that converts the sound signal V generated by the control device 11 from digital to analog and of an amplifier that amplifies the sound signal V is omitted. Although Fig. 1 shows an example in which the sound reproducing device 15 is mounted on the speech synthesis apparatus 100, a sound reproducing device 15 separate from the speech synthesis apparatus 100 may instead be connected to the speech synthesis apparatus 100 by wire or wirelessly.

The storage device 12 is configured by a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or by a combination of a plurality of types of recording media, and stores the program executed by the control device 11 and various data used by the control device 11. Alternatively, a storage device 12 (e.g., cloud storage) separate from the speech synthesis apparatus 100 may be prepared, and the control device 11 may write to and read from that storage device 12 via a communication network such as a mobile communication network or the Internet. That is, the storage device 12 may be omitted from the speech synthesis apparatus 100.

The storage device 12 stores control data C0 representing the musical characteristics of a piece of music. The control data C0 of embodiment 1 is music data in which a pitch, a phoneme, and a sound emission period are specified for each of the plurality of notes constituting the piece. That is, the control data C0 is data for control at the music level (i.e., the level of musical elements); in other words, the control data C0 is data representing a musical score. The pitch is, for example, a MIDI (Musical Instrument Digital Interface) note number. The phoneme is the text to be pronounced by the synthesized sound (i.e., the lyrics of the piece); specifically, the phoneme is a MIDI text event. For example, one syllable is designated as the phoneme of each note. The sound emission period is the period over which one note of the piece is sounded, and is specified by, for example, a start point, an end point, or a duration of the note. The sound emission period may also be specified by MIDI duration data, for example. The control data C0 of embodiment 1 further specifies performance marks indicating musical expressions of the performance of the piece. For example, performance marks such as forte (f), piano (p), crescendo, decrescendo, staccato, accent, or fermata are specified by the control data C0.
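
By way of illustration only, the music-level control data C0 described above could be held in a structure like the following minimal Python sketch; the class and field names are assumptions made for this example, not the data format actually used by the apparatus.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Note:
    pitch: int          # MIDI note number
    phoneme: str        # syllable (lyric) assigned to the note
    start: float        # start point of the sound emission period, in seconds
    duration: float     # duration of the sound emission period, in seconds

@dataclass
class PerformanceMark:
    kind: str           # e.g. "forte", "crescendo"
    start: float        # start of the marked span, in seconds
    end: float          # end of the marked span, in seconds

@dataclass
class ControlDataC0:
    notes: List[Note] = field(default_factory=list)
    marks: List[PerformanceMark] = field(default_factory=list)

# Example: a single note singing the syllable "la" at C4 for half a second.
c0 = ControlDataC0(notes=[Note(pitch=60, phoneme="la", start=0.0, duration=0.5)])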

Fig. 2 is a block diagram illustrating a functional configuration of the control device 11. As illustrated in Fig. 2, the control device 11 executes the program stored in the storage device 12, thereby realizing a plurality of functions (a display control unit 21, an edit processing unit E0, a trained model M1, an edit processing unit E1, a trained model M2, an edit processing unit E2, a trained model M3, an edit processing unit E3, and a synthesis processing unit 22) for generating the sound signal V corresponding to the control data C0. The functions of the control device 11 may be realized by a set of a plurality of devices (i.e., a system), or some or all of the functions of the control device 11 may be realized by a dedicated electronic circuit (e.g., a signal processing circuit).

The display control unit 21 causes the display device 14 to display images. The display control unit 21 according to embodiment 1 causes the display device 14 to display an editing screen that the user refers to in order to give instructions for adjusting the synthesized sound. Fig. 3 is a schematic diagram of the editing screen. As illustrated in Fig. 3, the editing screen is an image including a plurality of editing areas A (A0 to A3) and a plurality of operation units B (B0 to B3). Each of the plurality of operation units B is an image of an operation element that receives an instruction from the user. A common time axis (horizontal axis) is set for the plurality of editing areas A (A0 to A3).

The editing area A0 is an image representing the content of the music-level control data C0 (a so-called piano roll screen). Specifically, in the editing area A0, note images (note bars) representing the notes specified by the control data C0 are arranged in time series on a coordinate plane defined by a time axis and a pitch axis. The position and display length of each note image on the time axis are set in accordance with the sound emission period specified by the control data C0, and the position of each note image on the pitch axis is set in accordance with the pitch specified by the control data C0. The phoneme (specifically, the grapheme) designated by the control data C0 is displayed inside each note image. The performance marks designated by the control data C0 are also displayed in the editing area A0; for example, in Fig. 3, a crescendo and a decrescendo are illustrated as performance marks. The user can give an edit instruction Q0 for the editing area A0 by operating the operation device 13. The edit instruction Q0 is, for example, an instruction to change the conditions (sound emission period, pitch, or phoneme) of a note, or an instruction to change (add or delete) a performance mark.

The editing area A1 represents features at the phoneme level (i.e., elements related to phonemes); it is, for example, an image representing the time series of the plurality of phonemes (vowels or consonants) constituting the synthesized sound. Specifically, in the editing area A1, a phoneme symbol and a pronunciation period are displayed for each of the plurality of phonemes of the synthesized sound. The user can give an edit instruction Q1 for the editing area A1 by operating the operation device 13. The edit instruction Q1 is, for example, an instruction to change the phoneme symbol of a phoneme, or an instruction to change (for example, move, lengthen, or shorten) its pronunciation period.

The editing area A2 represents features at the pronunciation level (i.e., elements related to pronunciation); it is, for example, an image representing the musical expressions given to the synthesized sound. Specifically, in the editing area A2, the periods over which musical expressions are given to the synthesized sound (hereinafter referred to as "expression periods") and the type of expression in each expression period (hereinafter referred to as the "expression type") are displayed. Examples of musical expressions given to the synthesized sound include voice qualities such as a hoarse voice or a breathy voice, and singing techniques such as vibrato or a falling pitch. The user can give an edit instruction Q2 for the editing area A2 by operating the operation device 13. The edit instruction Q2 is, for example, an instruction to change (for example, move, lengthen, or shorten) an expression period, or an instruction to change the expression type in an expression period.

The editing area A3 represents features at the vocoder level (i.e., elements related to the vocoder); it is, for example, an image representing the temporal change in the frequency characteristics of the synthesized sound. Specifically, a graph showing the temporal change in the fundamental frequency F0 of the synthesized sound is displayed in the editing area A3. The user can give an edit instruction Q3 for the editing area A3 by operating the operation device 13. The edit instruction Q3 is, for example, an instruction to change the temporal trajectory of the fundamental frequency F0.

The edit processing unit E0 in Fig. 2 changes the music-level control data C0 in accordance with an edit instruction Q0 given by the user for the editing area A0. Specifically, the edit processing unit E0 changes the conditions (sound emission period, pitch, or phoneme) of a note designated by the control data C0, or a performance mark designated by the control data C0, in accordance with the edit instruction Q0. When an edit instruction Q0 is given, the control data C0 changed by the edit processing unit E0 is supplied to the trained model M1, the trained model M2, and the trained model M3. When no edit instruction Q0 is given, the control data C0 stored in the storage device 12 is supplied to the trained model M1, the trained model M2, and the trained model M3.
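
Purely as an illustration of the role of the edit processing unit E0, the sketch below applies a hypothetical edit instruction Q0 (here limited to changing one note's pitch and sound emission period) to the ControlDataC0 structure sketched earlier; the instruction format and function name are assumptions for this example.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EditInstructionQ0:
    """Hypothetical edit instruction Q0: change the conditions of the note at `index`."""
    index: int
    pitch: Optional[int] = None
    start: Optional[float] = None
    duration: Optional[float] = None

def apply_q0(c0: "ControlDataC0", q0: EditInstructionQ0) -> "ControlDataC0":
    """Change the music-level control data C0 in accordance with an edit instruction Q0."""
    note = c0.notes[q0.index]
    if q0.pitch is not None:
        note.pitch = q0.pitch
    if q0.start is not None:
        note.start = q0.start
    if q0.duration is not None:
        note.duration = q0.duration
    return c0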

The trained model M1 outputs phoneme-level control data C1 corresponding to the music-level control data C0. The control data C1 is data relating to the phonemes of the synthesized sound. Specifically, the control data C1 specifies the time series of the plurality of phonemes corresponding to the phonemes specified by the control data C0. For example, the control data C1 specifies a phoneme symbol (i.e., the type of phoneme) and a pronunciation period for each of the plurality of phonemes constituting the synthesized sound. The pronunciation period of each phoneme is specified by, for example, a start point and an end point or a duration.

The trained model M1 according to embodiment 1 is a statistical prediction model obtained by learning (training) the relationship between the control data C0 and the control data C1 through machine learning (in particular, deep learning) using a plurality of teacher data in which control data C0 and control data C1 are associated with each other. For example, a neural network that outputs control data C1 in response to input of control data C0 is suitably used as the trained model M1. A plurality of coefficients K1 defining the trained model M1 are set by the machine learning and stored in the storage device 12. Therefore, based on the tendency extracted from the plurality of teacher data (the relationship between control data C0 and control data C1), statistically appropriate control data C1 is output from the trained model M1 for unknown control data C0. The display control unit 21 causes the display device 14 to display the editing area A1 in accordance with the control data C1 generated by the trained model M1. That is, the phoneme symbol and pronunciation period specified for each phoneme by the control data C1 are displayed in the editing area A1.
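
As a deliberately simplified illustration of how a trained model such as M1 can be evaluated from its stored coefficients K1, the sketch below applies a small feed-forward network to per-note features and predicts only phoneme durations; the network shape, the feature choice, and the coefficient names are assumptions for this example, not the disclosed architecture (the actual control data C1 also includes phoneme symbols).

import numpy as np

def m1_predict_durations(note_features: np.ndarray, k1: dict) -> np.ndarray:
    """Map music-level note features (from control data C0) to per-phoneme durations (control data C1).

    note_features: shape (num_notes, num_features), e.g. [pitch, note duration, phoneme class] per row.
    k1: trained coefficients, assumed here to contain arrays "W1", "b1", "W2", "b2".
    """
    h = np.tanh(note_features @ k1["W1"] + k1["b1"])       # hidden layer with tanh response function
    return np.maximum(0.0, h @ k1["W2"] + k1["b2"])        # predicted durations, clipped to be non-negative

# Example with random stand-in coefficients (in the apparatus, K1 is read from the storage device 12).
rng = np.random.default_rng(0)
k1 = {"W1": rng.normal(size=(3, 16)), "b1": np.zeros(16),
      "W2": rng.normal(size=(16, 1)), "b2": np.zeros(1)}
print(m1_predict_durations(np.array([[60.0, 0.5, 1.0]]), k1))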

The edit processing unit E1 changes the phoneme-level control data C1 output from the trained model M1 in accordance with an edit instruction Q1 given by the user for the editing area A1. Specifically, the edit processing unit E1 changes the phoneme symbol or pronunciation period specified for a phoneme by the control data C1 in accordance with the edit instruction Q1. The display control unit 21 updates the editing area A1 to content corresponding to the changed control data C1. When an edit instruction Q1 is given, the control data C1 changed by the edit processing unit E1 is supplied to the trained model M2 and the trained model M3; when no edit instruction Q1 is given, the control data C1 output from the trained model M1 is supplied to the trained model M2 and the trained model M3.

The trained model M2 outputs pronunciation-level control data C2 corresponding to input data D2 including the music-level control data C0 and the phoneme-level control data C1. The control data C2 is data relating to the musical expression of the synthesized sound. Specifically, the control data C2 specifies one or more expression periods on the time axis and the expression type in each expression period. Each expression period is specified by, for example, a start point and an end point or a duration.

The trained model M2 according to embodiment 1 is a statistical prediction model obtained by learning (training) the relationship between the input data D2 and the control data C2 through machine learning (in particular, deep learning) using a plurality of teacher data in which input data D2 and control data C2 are associated with each other. For example, a neural network that outputs control data C2 in response to input of input data D2 is suitably used as the trained model M2. A plurality of coefficients K2 defining the trained model M2 are set by the machine learning and stored in the storage device 12. Therefore, based on the tendency extracted from the plurality of teacher data (the relationship between input data D2 and control data C2), statistically appropriate control data C2 is output from the trained model M2 for unknown input data D2. The display control unit 21 causes the display device 14 to display the editing area A2 in accordance with the control data C2 generated by the trained model M2. That is, the expression periods and expression types specified by the control data C2 are displayed in the editing area A2.

The edit processing unit E2 changes the pronunciation-level control data C2 output from the trained model M2 in accordance with an edit instruction Q2 given by the user for the editing area A2. Specifically, the edit processing unit E2 changes an expression period or expression type specified by the control data C2 in accordance with the edit instruction Q2. The display control unit 21 updates the editing area A2 to content corresponding to the changed control data C2. When an edit instruction Q2 is given, the control data C2 changed by the edit processing unit E2 is supplied to the trained model M3; when no edit instruction Q2 is given, the control data C2 output from the trained model M2 is supplied to the trained model M3.

The trained model M3 outputs vocoder-level control data C3 (an example of the synthesis data) corresponding to input data D3 including the music-level control data C0, the phoneme-level control data C1, and the pronunciation-level control data C2. The control data C3 is data relating to the frequency characteristics of the synthesized sound. For example, the control data C3 specifies the time series of the fundamental frequency F0 of the synthesized sound, the time series of the envelope of the harmonic component, and the time series of the envelope of the non-harmonic component. The envelope of the harmonic component is a curve representing the approximate shape of the intensity spectrum (amplitude spectrum or power spectrum) of the harmonic component. The harmonic component is a periodic component consisting of a fundamental component at the fundamental frequency F0 and a plurality of overtone components at frequencies that are integer multiples of the fundamental frequency F0. The envelope of the non-harmonic component, in turn, is a curve representing the approximate shape of the intensity spectrum of the non-harmonic component, which is the aperiodic component (residual component) other than the harmonic component. The envelopes of the harmonic component and the non-harmonic component are each expressed by, for example, a plurality of mel-frequency cepstral coefficients.
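
To make the vocoder-level data concrete, the following sketch shows one possible framewise container for the control data C3, with the fundamental frequency F0 and the two spectral envelopes stored per analysis frame; the field names, frame period, and number of coefficients are assumptions for this example.

import numpy as np
from dataclasses import dataclass

@dataclass
class ControlDataC3:
    """Vocoder-level synthesis data, one entry per analysis frame."""
    frame_period: float        # interval between frames, in seconds (e.g. 0.005)
    f0: np.ndarray             # shape (num_frames,): fundamental frequency in Hz (0 for unvoiced frames)
    harmonic_env: np.ndarray   # shape (num_frames, num_ceps): mel-cepstral envelope of the harmonic component
    aperiodic_env: np.ndarray  # shape (num_frames, num_ceps): mel-cepstral envelope of the non-harmonic component

# Example: 200 frames (one second at a 5 ms frame period) with a flat 220 Hz contour.
num_frames, num_ceps = 200, 40
c3 = ControlDataC3(frame_period=0.005,
                   f0=np.full(num_frames, 220.0),
                   harmonic_env=np.zeros((num_frames, num_ceps)),
                   aperiodic_env=np.zeros((num_frames, num_ceps)))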

The trained model M3 according to embodiment 1 is a statistical prediction model obtained by learning the relationship between the input data D3 and the control data C3 through machine learning (in particular, deep learning) using a plurality of teacher data in which input data D3 and control data C3 are associated with each other. For example, a neural network that outputs control data C3 in response to input of input data D3 is suitably used as the trained model M3. A plurality of coefficients K3 defining the trained model M3 are set by the machine learning and stored in the storage device 12. Therefore, based on the tendency extracted from the plurality of teacher data (the relationship between input data D3 and control data C3), statistically appropriate control data C3 is output from the trained model M3 for unknown input data D3. The display control unit 21 causes the display device 14 to display the editing area A3 in accordance with the control data C3 generated by the trained model M3. That is, the time series of the fundamental frequency F0 specified by the control data C3 is displayed in the editing area A3.

The edit processing unit E3 changes the vocoder-level control data C3 output from the trained model M3 in accordance with an edit instruction Q3 given by the user for the editing area A3. Specifically, the edit processing unit E3 changes the fundamental frequency F0 specified by the control data C3 in accordance with the edit instruction Q3. The display control unit 21 updates the editing area A3 to content corresponding to the changed control data C3. When an edit instruction Q3 is given, the control data C3 changed by the edit processing unit E3 is supplied to the synthesis processing unit 22; when no edit instruction Q3 is given, the control data C3 output from the trained model M3 is supplied to the synthesis processing unit 22.

The synthesis processing unit 22 generates the sound signal V corresponding to the control data C3. Any known speech synthesis technique may be employed for the synthesis processing unit 22 to generate the sound signal V; for example, SMS (Spectral Model Synthesis) is suitable for generating the sound signal V. The sound signal V generated by the synthesis processing unit 22 is supplied to the sound reproducing device 15 and reproduced as sound waves. As understood from the above description, the synthesis processing unit 22 corresponds to a so-called vocoder.
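
The synthesis processing unit 22 is described only as a vocoder using a known technique such as SMS. Purely for illustration, the sketch below synthesizes a waveform from an F0 contour as a sum of harmonic sinusoids plus white noise; it ignores the spectral envelopes and is far simpler than an actual SMS vocoder.

import numpy as np

def synthesize_sketch(f0: np.ndarray, frame_period: float = 0.005, sample_rate: int = 44100,
                      num_harmonics: int = 10, noise_level: float = 0.01) -> np.ndarray:
    """Crude harmonic-plus-noise synthesis of a sound signal V from a framewise F0 contour."""
    hop = int(frame_period * sample_rate)
    f0_per_sample = np.repeat(f0, hop)                               # step-wise upsampling of the F0 contour
    phase = 2.0 * np.pi * np.cumsum(f0_per_sample) / sample_rate     # running phase of the fundamental
    v = np.zeros_like(phase)
    for k in range(1, num_harmonics + 1):                            # harmonic (periodic) component
        v += np.sin(k * phase) / k
    v += noise_level * np.random.default_rng(0).standard_normal(len(v))   # non-harmonic (noise) component
    return 0.5 * v / np.max(np.abs(v))                               # normalize the amplitude

# Example: one second of a flat 220 Hz contour.
signal_v = synthesize_sketch(np.full(200, 220.0))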

When the music-level control data C0 has been changed by the edit processing unit E0 in accordance with an edit instruction Q0, the user operates the operation unit B0 of Fig. 3 using the operation device 13. When the operation unit B0 is operated, generation of the control data C1 by the trained model M1, generation of the control data C2 by the trained model M2, and generation of the control data C3 by the trained model M3 are executed for the changed control data C0.

When the phoneme-level control data C1 has been changed by the edit processing unit E1 in accordance with an edit instruction Q1, the user operates the operation unit B1 using the operation device 13. When the operation unit B1 is operated, the changed control data C1 is supplied to the trained model M2 and the trained model M3, and generation of the control data C2 by the trained model M2 and generation of the control data C3 by the trained model M3 are executed. That is, when the operation unit B1 is operated, the sound signal V is generated using the control data C1 in which the edit instruction Q1 is reflected, without the trained model M1 regenerating the control data C1.

Likewise, when the pronunciation-level control data C2 has been changed by the edit processing unit E2 in accordance with an edit instruction Q2, the user operates the operation unit B2 using the operation device 13. When the operation unit B2 is operated, the changed control data C2 is supplied to the trained model M3, and generation of the control data C3 by the trained model M3 is executed. That is, when the operation unit B2 is operated, the sound signal V is generated using the control data C2 in which the edit instruction Q2 is reflected, without the trained model M1 regenerating the control data C1 or the trained model M2 regenerating the control data C2.

Fig. 4 is a flowchart of the process by which the control device 11 generates the sound signal V (hereinafter referred to as the "sound synthesis process"). The sound synthesis process is executed, for example, in response to an instruction from the user to the speech synthesis apparatus 100, such as an operation of the operation unit B3 (playback) in Fig. 3.

When the sound synthesis process is started, the edit processing unit E0 changes the music-level control data C0 in accordance with an edit instruction Q0 from the user (Sa1). If no edit instruction Q0 is given, the change of the control data C0 is omitted.

The trained model M1 generates the control data C1 relating to the phonemes of the synthesized sound in accordance with the control data C0 (Sa2). The edit processing unit E1 changes the phoneme-level control data C1 in accordance with an edit instruction Q1 from the user (Sa3). If no edit instruction Q1 is given, the change of the control data C1 is omitted.

The trained model M2 generates the control data C2 relating to the musical expression of the synthesized sound in accordance with the input data D2 including the control data C0 and the control data C1 (Sa4). The edit processing unit E2 changes the pronunciation-level control data C2 in accordance with an edit instruction Q2 from the user (Sa5). If no edit instruction Q2 is given, the change of the control data C2 is omitted.

The trained model M3 generates the control data C3 relating to the frequency characteristics of the synthesized sound in accordance with the input data D3 including the control data C0, the control data C1, and the control data C2 (Sa6). The edit processing unit E3 changes the vocoder-level control data C3 in accordance with an edit instruction Q3 from the user (Sa7). If no edit instruction Q3 is given, the change of the control data C3 is omitted. The synthesis processing unit 22 then generates the sound signal V corresponding to the control data C3 (Sa8).
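
The flow of Fig. 4 (steps Sa1 to Sa8) can be summarized in the following sketch, in which the trained models, the edit processing units, and the synthesis processing unit are passed in as placeholder callables and the edit instructions are optional; the function signatures are assumptions for this example.

def sound_synthesis_process(c0, models, editors, edits):
    """Sketch of the sound synthesis process of Fig. 4.

    models:  callables standing in for the trained models M1-M3 and the synthesis processing unit 22.
    editors: callables standing in for the edit processing units E0-E3.
    edits:   the edit instructions Q0-Q3 actually given; a missing key means no instruction.
    """
    if "q0" in edits:
        c0 = editors["e0"](c0, edits["q0"])        # Sa1: change C0
    c1 = models["m1"](c0)                          # Sa2: phoneme-level control data C1
    if "q1" in edits:
        c1 = editors["e1"](c1, edits["q1"])        # Sa3: change C1
    c2 = models["m2"](c0, c1)                      # Sa4: pronunciation-level control data C2 (input data D2)
    if "q2" in edits:
        c2 = editors["e2"](c2, edits["q2"])        # Sa5: change C2
    c3 = models["m3"](c0, c1, c2)                  # Sa6: vocoder-level control data C3 (input data D3)
    if "q3" in edits:
        c3 = editors["e3"](c3, edits["q3"])        # Sa7: change C3
    return models["vocoder"](c3)                   # Sa8: sound signal V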

As described above, in embodiment 1, instructions from the user (the edit instruction Q1 or the edit instruction Q2) are reflected at intermediate stages between the control data C0 and the generation of the control data C3. There is therefore an advantage that a sound signal V of a synthesized sound matching the user's intention or taste can be generated, compared with a configuration in which the user can edit only the control data C0.

In embodiment 1, in particular, the control data C1 relating to the phonemes of the synthesized sound is changed in accordance with the edit instruction Q1 from the user, so a sound signal V of a synthesized sound whose phonemes are adjusted according to the user's intention or preference can be generated. Likewise, the control data C2 relating to the expression of the synthesized sound is changed in accordance with the edit instruction Q2 from the user, so a sound signal V of a synthesized sound whose musical expression is adjusted according to the user's intention or taste can be generated. Further, the control data C3 is changed in accordance with the edit instruction Q3 from the user, so a sound signal V of a synthesized sound whose frequency characteristics are adjusted according to the user's intention or preference can be generated.

< embodiment 2 >

Embodiment 2 of the present invention will now be explained. In the following embodiment, elements whose functions or operations are the same as those in embodiment 1 are denoted by the same reference numerals as in embodiment 1, and detailed description thereof is omitted as appropriate.

Fig. 5 is a block diagram illustrating a functional configuration of the control device 11 in embodiment 2. As illustrated in Fig. 5, the trained model M1, the edit processing unit E1, the trained model M2, and the edit processing unit E2 described in embodiment 1 are replaced in embodiment 2 by a trained model M12 and an edit processing unit E12. The control data C0 edited by the edit processing unit E0 is supplied to the trained model M12.

The trained model M12 outputs phoneme/pronunciation-level control data C12 corresponding to the music-level control data C0. The control data C12 is data relating to the phonemes and musical expressions of the synthesized sound. Specifically, the control data C12 specifies a phoneme symbol and a pronunciation period for each phoneme corresponding to the phonemes specified by the control data C0, as well as the expression periods and expression types for giving expressions to the synthesized sound. That is, the control data C12 of embodiment 2 is a combination of the control data C1 and the control data C2 of embodiment 1.

The trained model M12 according to embodiment 2 is a statistical prediction model obtained by learning the relationship between the control data C0 and the control data C12 through machine learning (in particular, deep learning) using a plurality of teacher data in which control data C0 and control data C12 are associated with each other. For example, a neural network that outputs control data C12 in response to input of control data C0 is suitably used as the trained model M12. A plurality of coefficients defining the trained model M12 are set by the machine learning and stored in the storage device 12. Therefore, based on the tendency extracted from the plurality of teacher data (the relationship between control data C0 and control data C12), statistically appropriate control data C12 is output from the trained model M12 for unknown control data C0. The display control unit 21 causes the display device 14 to display the editing area A1 and the editing area A2 in accordance with the control data C12 generated by the trained model M12.

The edit processing unit E12 changes the phoneme/pronunciation-level control data C12 output from the trained model M12 in accordance with an edit instruction Q1 given by the user for the editing area A1 or an edit instruction Q2 given by the user for the editing area A2. Specifically, the edit processing unit E12 changes the phoneme symbol and pronunciation period specified for a phoneme by the control data C12 in accordance with the edit instruction Q1, and changes an expression period and expression type specified by the control data C12 in accordance with the edit instruction Q2. The display control unit 21 updates the editing area A1 and the editing area A2 to content corresponding to the changed control data C12. When an edit instruction Q1 or an edit instruction Q2 is given, the control data C12 changed by the edit processing unit E12 is supplied to the trained model M3; when neither is given, the control data C12 output from the trained model M12 is supplied to the trained model M3.

The trained model M3 according to embodiment 2 outputs the vocoder-level control data C3 (an example of the synthesis data) corresponding to input data D3 including the music-level control data C0 and the phoneme/pronunciation-level control data C12. The specific operation by which the trained model M3 outputs the control data C3 in response to input of the input data D3 is the same as in embodiment 1. As in embodiment 1, the edit processing unit E3 changes the control data C3 in accordance with an edit instruction Q3 from the user, and the synthesis processing unit 22 generates the sound signal V corresponding to the control data C3.

Fig. 6 is a flowchart of the sound synthesis process in embodiment 2. The sound synthesis process is executed, for example, in response to an instruction from the user to the speech synthesis apparatus 100, such as an operation of the operation unit B3 (playback) in Fig. 3.

When the sound synthesis process is started, the edit processing unit E0 changes the music-level control data C0 in accordance with an edit instruction Q0 from the user (Sb1). If no edit instruction Q0 is given, the change of the control data C0 is omitted.

The trained model M12 generates the control data C12 relating to the phonemes and expressions of the synthesized sound in accordance with the control data C0 (Sb2). The edit processing unit E12 changes the phoneme/pronunciation-level control data C12 in accordance with an edit instruction Q1 or an edit instruction Q2 from the user (Sb3). If neither an edit instruction Q1 nor an edit instruction Q2 is given, the change of the control data C12 is omitted.

The trained model M3 generates the control data C3 relating to the frequency characteristics of the synthesized sound in accordance with the input data D3 including the control data C0 and the control data C12 (Sb4). The edit processing unit E3 changes the vocoder-level control data C3 in accordance with an edit instruction Q3 from the user (Sb5). If no edit instruction Q3 is given, the change of the control data C3 is omitted. The synthesis processing unit 22 then generates the sound signal V corresponding to the control data C3 (Sb6).
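
For comparison with the sketch given for Fig. 4, the flow of Fig. 6 (steps Sb1 to Sb6) differs only in that the single trained model M12 and the edit processing unit E12 take the place of M1, E1, M2, and E2; the same placeholder conventions and assumed signatures apply.

def sound_synthesis_process_2(c0, models, editors, edits):
    """Sketch of the sound synthesis process of Fig. 6."""
    if "q0" in edits:
        c0 = editors["e0"](c0, edits["q0"])        # Sb1: change C0
    c12 = models["m12"](c0)                        # Sb2: phoneme/pronunciation-level control data C12
    if "q1" in edits or "q2" in edits:
        c12 = editors["e12"](c12, edits)           # Sb3: change C12
    c3 = models["m3"](c0, c12)                     # Sb4: vocoder-level control data C3 (input data D3)
    if "q3" in edits:
        c3 = editors["e3"](c3, edits["q3"])        # Sb5: change C3
    return models["vocoder"](c3)                   # Sb6: sound signal V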

In embodiment 2, as in embodiment 1, instructions from the user (the edit instruction Q1 or the edit instruction Q2) are reflected at an intermediate stage between the control data C0 and the generation of the control data C3, so there is an advantage that a sound signal V of a synthesized sound matching the user's intention or taste can be generated, compared with a configuration in which the user can edit only the control data C0. In embodiment 2, the control data C12 relating to the phonemes and expressions of the synthesized sound is changed in accordance with the edit instruction Q1 or the edit instruction Q2 from the user. There is therefore an advantage that a sound signal V of a synthesized sound whose phonemes or expressions are adjusted according to the user's intention or preference can be generated.

< modification example >

Specific modifications to the above-illustrated embodiments are exemplified below.

(1) The speech synthesis apparatus 100 may be realized by a server apparatus that communicates with a terminal apparatus (for example, a mobile phone or a smartphone) via a communication network such as a mobile communication network or the Internet. Specifically, the speech synthesis apparatus 100 generates the sound signal V from the control data C0 received from the terminal apparatus by the sound synthesis process (Fig. 4 or Fig. 6), and transmits the sound signal V to the terminal apparatus. The sound reproducing device 15 of the terminal apparatus reproduces the sound represented by the sound signal V received from the speech synthesis apparatus 100. Alternatively, the control data C3 generated by the edit processing unit E3 of the speech synthesis apparatus 100 may be transmitted to the terminal apparatus, and a synthesis processing unit 22 provided in the terminal apparatus may generate the sound signal V from the control data C3; that is, the synthesis processing unit 22 may be omitted from the speech synthesis apparatus 100. Conversely, control data C0 generated by an edit processing unit E0 mounted on the terminal apparatus may be transmitted to the speech synthesis apparatus 100, and the sound signal V generated in accordance with that control data C0 may be transmitted from the speech synthesis apparatus 100 to the terminal apparatus; that is, the edit processing unit E0 may be omitted from the speech synthesis apparatus 100. In a configuration in which the speech synthesis apparatus 100 is realized by a server apparatus, the display control unit 21 causes the display device 14 of the terminal apparatus to display the editing screen of Fig. 3.

(2) The speech synthesis apparatus 100 according to each of the above embodiments is realized by the cooperation of a computer (specifically, the control device 11) and a program, as described in the embodiments. The program according to each of the above embodiments may be provided stored on a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium, a good example of which is an optical recording medium (optical disc) such as a CD-ROM, but it may be any known recording medium such as a semiconductor recording medium or a magnetic recording medium. A non-transitory recording medium includes any recording medium other than a transitory propagating signal, and does not exclude volatile recording media. The program may also be delivered to a computer by transmission via a communication network. The subject that executes the program is not limited to a CPU; the program may be executed by a processor for neural networks, such as a Tensor Processing Unit or a Neural Engine, or by a DSP (Digital Signal Processor) for signal processing. A plurality of the types of subjects listed above may also cooperate to execute the program.

(3) A trained model is realized by a combination of a program that causes the control device 11 to execute an operation for deriving output B from input A (for example, a program module constituting artificial intelligence software) and a plurality of coefficients applied to that operation. The plurality of coefficients of the trained model are optimized by prior machine learning (in particular, deep learning) using a plurality of teacher data in which inputs A and outputs B are associated with each other. That is, a trained model is a statistical model obtained by learning (training) the relationship between input A and output B. By performing, on an unknown input A, an operation to which the plurality of trained coefficients and a predetermined response function are applied, the control device 11 generates an output B that is statistically appropriate for the input A based on the tendency extracted from the plurality of teacher data (the relationship between inputs A and outputs B).
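
As a toy illustration of the training described here, the sketch below fits the coefficients of a linear stand-in model to teacher data pairs (input A, output B) by gradient descent on a mean-squared-error loss; the model, loss, and optimizer are arbitrary choices for this example (the models in the embodiments are neural networks that additionally apply response functions).

import numpy as np

def train_coefficients(inputs_a: np.ndarray, outputs_b: np.ndarray,
                       steps: int = 1000, lr: float = 0.01) -> dict:
    """Optimize coefficients W, b of a linear model B ≈ A @ W + b on teacher data."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(inputs_a.shape[1], outputs_b.shape[1]))
    b = np.zeros(outputs_b.shape[1])
    n = len(inputs_a)
    for _ in range(steps):
        residual = (inputs_a @ w + b) - outputs_b      # prediction error on the teacher data
        w -= lr * inputs_a.T @ residual / n            # gradient step for the weights
        b -= lr * residual.mean(axis=0)                # gradient step for the bias
    return {"W": w, "b": b}

# Toy teacher data: output B is a linear function of input A plus a constant offset.
a = np.random.default_rng(1).normal(size=(100, 3))
b = a @ np.array([[1.0], [2.0], [-1.0]]) + 0.5
print(train_coefficients(a, b)["W"].round(2))          # approaches [[1.], [2.], [-1.]]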

(4) From the embodiments described above, the following configurations can be derived, for example.

A speech synthesis method according to a preferred aspect of the present invention (1st aspect) generates, by an intermediate trained model, 2nd control data corresponding to an input including 1st control data that specifies a phoneme; changes the 2nd control data in accordance with a 1st instruction from a user; generates, by an output trained model, synthesis data relating to frequency characteristics of a synthesized sound in accordance with an input including the 1st control data and the changed 2nd control data; and generates a sound signal corresponding to the synthesis data. In this aspect, the 1st instruction from the user is reflected at an intermediate stage between the 1st control data and the generation of the synthesis data, so a sound signal representing a synthesized sound matching the user's intention or taste can be generated, compared with a configuration in which the user can edit only the 1st control data.

For example, the trained model M1 of embodiment 1, the trained model M2 of embodiment 1, and the trained model M12 of embodiment 2 are preferred examples of the "intermediate trained model" of the 1st aspect. When the trained model M1 of embodiment 1 is interpreted as the "intermediate trained model", the control data C1 corresponds to the "2nd control data" and the edit instruction Q1 corresponds to the "1st instruction". When the trained model M2 of embodiment 1 is interpreted as the "intermediate trained model", the control data C2 corresponds to the "2nd control data" and the edit instruction Q2 corresponds to the "1st instruction". When the trained model M12 of embodiment 2 is interpreted as the "intermediate trained model", the control data C12 corresponds to the "2nd control data" and the edit instruction Q1 or the edit instruction Q2 corresponds to the "1st instruction". The trained model M3 of embodiment 1 or embodiment 2 is an example of the "output trained model".

In a preferred example of the 1st aspect (2nd aspect), the intermediate trained model is a 1st trained model that generates the 2nd control data in accordance with an input including the 1st control data, and the 2nd control data is data relating to the phonemes of the synthesized sound. In this aspect, the 2nd control data relating to the phonemes of the synthesized sound is changed in accordance with the 1st instruction from the user, so a sound signal of a synthesized sound whose phonemes are adjusted according to the user's intention or preference can be generated. A preferred example of the "1st trained model" of the 2nd aspect is, for example, the trained model M1 of embodiment 1.

In a preferred example of the 2nd aspect (3rd aspect), 3rd control data relating to an expression of the synthesized sound is generated by a 2nd trained model in accordance with an input including the 1st control data and the changed 2nd control data, the 3rd control data is changed in accordance with a 2nd instruction from the user, and, in generating the synthesis data, the synthesis data is generated in accordance with an input including the 1st control data, the changed 2nd control data, and the changed 3rd control data. In this aspect, the 3rd control data relating to the expression of the synthesized sound is changed in accordance with the 2nd instruction from the user, so a sound signal of a synthesized sound whose expression is adjusted according to the user's intention or preference can be generated. A preferred example of the "2nd trained model" of the 3rd aspect is, for example, the trained model M2 of embodiment 1, and a preferred example of the "3rd control data" of the 3rd aspect is, for example, the control data C2 of embodiment 1.

In a preferred example of the 1st aspect (4th aspect), the 2nd control data is data relating to the phonemes and expressions of the synthesized sound. In this aspect, the 2nd control data relating to the phonemes and expressions of the synthesized sound is changed in accordance with the 1st instruction from the user, so a sound signal of a synthesized sound whose phonemes and expressions are adjusted according to the user's intention or preference can be generated. A preferred example of the "intermediate trained model" of the 4th aspect is, for example, the trained model M12 of embodiment 2, and a preferred example of the "1st instruction" of the 4th aspect is, for example, the edit instruction Q1 or the edit instruction Q2 of embodiment 2.

In a preferred example of any one of the 1st to 4th aspects (5th aspect), the synthesis data is changed in accordance with a 3rd instruction from the user, and, in generating the sound signal, the sound signal is generated in accordance with the changed synthesis data. In this aspect, the synthesis data is changed in accordance with the 3rd instruction from the user, so a sound signal of a synthesized sound whose frequency characteristics are adjusted according to the user's intention or preference can be generated. A preferred example of the "3rd instruction" of the 5th aspect is, for example, the edit instruction Q3 of embodiment 1 or embodiment 2.

A speech synthesis apparatus according to a preferred aspect of the present invention (6th aspect) includes: an intermediate trained model that generates 2nd control data corresponding to an input including 1st control data that specifies a phoneme; an edit processing unit that changes the 2nd control data in accordance with a 1st instruction from a user; an output trained model that generates synthesis data relating to frequency characteristics of a synthesized sound in accordance with an input including the 1st control data and the changed 2nd control data; and a synthesis processing unit that generates a sound signal corresponding to the synthesis data. In this aspect, the 1st instruction from the user is reflected at an intermediate stage between the 1st control data and the generation of the synthesis data, so a sound signal representing a synthesized sound matching the user's intention or taste can be generated, compared with a configuration in which the user can edit only the 1st control data.

A program according to a preferred aspect of the present invention (7th aspect) causes a computer to function as: an intermediate trained model that generates 2nd control data corresponding to an input including 1st control data that specifies a phoneme; an edit processing unit that changes the 2nd control data in accordance with a 1st instruction from a user; an output trained model that generates synthesis data relating to frequency characteristics of a synthesized sound in accordance with an input including the 1st control data and the changed 2nd control data; and a synthesis processing unit that generates a sound signal corresponding to the synthesis data. In this aspect, the 1st instruction from the user is reflected at an intermediate stage between the 1st control data and the generation of the synthesis data, so a sound signal representing a synthesized sound matching the user's intention or taste can be generated, compared with a configuration in which the user can edit only the 1st control data.

Description of the reference numerals

100 … speech synthesis apparatus, 11 … control device, 12 … storage device, 13 … operation device, 14 … display device, 15 … sound reproducing device, 21 … display control unit, 22 … synthesis processing unit, E0, E1, E2, E3, E12 … edit processing unit, M1, M2, M3, M12 … trained model, Q0, Q1, Q2, Q3 … edit instruction, A0, A1, A2, A3 … editing area, B0, B1, B2, B3 … operation unit.
