Multilingual text-to-speech synthesis method

Document No.: 1256435    Publication date: 2020-08-21

Abstract: This technology, "Multilingual text-to-speech synthesis method", was created on 2019-01-11 by Taesu Kim (金泰洙) and Younggun Lee (李泳槿). The present disclosure relates to a multilingual text-to-speech synthesis method and system. The multilingual text-to-speech synthesis method comprises: a step of receiving first learning data including a learning text in a first language and learning speech data in the first language corresponding to the learning text in the first language; a step of receiving second learning data including a learning text in a second language and learning speech data in the second language corresponding to the learning text in the second language; and a step of learning similarity information between the phonemes of the first language and the phonemes of the second language based on the first learning data and the second learning data to generate a single artificial neural network text-to-speech synthesis model.

1. A multilingual text-to-speech synthesis method, comprising:

a step of receiving first learning data including a learning text in a first language and learning speech data in the first language corresponding to the learning text in the first language;

a step of receiving second learning data including a learning text in a second language and learning speech data in the second language corresponding to the learning text in the second language; and

a step of generating a single artificial neural network text-to-speech synthesis model by learning similarity information between the phonemes of the first language and the phonemes of the second language based on the first learning data and the second learning data.

2. The multilingual text-to-speech synthesis method according to claim 1, further comprising:

a step of receiving an utterance feature of a speaker associated with the first language;

a step of receiving an input text in a second language; and

a step of generating output speech data for the input text in the second language that imitates the speech of the speaker, by inputting the input text in the second language and the utterance feature of the speaker associated with the first language into the single artificial neural network text-to-speech synthesis model.

3. The method of claim 2, wherein the utterance feature of the speaker associated with the first language is generated by extracting a feature vector from speech data uttered by the speaker in the first language.

4. The multilingual text-to-speech synthesis method of claim 2, further comprising:

a step of receiving an emotion feature; and

a step of generating output speech data for the input text in the second language that imitates the speech of the speaker, by inputting the input text in the second language, the utterance feature of the speaker associated with the first language, and the emotion feature into the single artificial neural network text-to-speech synthesis model.

5. The multilingual text-to-speech synthesis method of claim 2, further comprising:

a step of receiving a prosodic feature; and

a step of generating output speech data for the input text in the second language that imitates the speech of the speaker, by inputting the input text in the second language, the utterance feature of the speaker associated with the first language, and the prosodic feature into the single artificial neural network text-to-speech synthesis model.

6. The method of claim 5, wherein the prosodic feature includes at least one of information on utterance speed, information on utterance accent, information on voice pitch, and information on pause intervals.

7. The multilingual text-to-speech synthesis method according to claim 1, further comprising:

a step of receiving an input speech of a first language;

a step of generating an utterance feature of a speaker associated with the first language by extracting a feature vector from the input speech of the first language;

a step of converting the input speech of the first language into an input text of the first language;

a step of converting the input text of the first language into an input text of a second language; and

a step of generating output speech data in the second language for the input text in the second language that imitates the speech of the speaker, by inputting the input text in the second language and the utterance feature of the speaker associated with the first language into the single artificial neural network text-to-speech synthesis model.

8. The method of claim 1, wherein the learning text in the first language and the learning text in the second language are converted into phoneme sequences using a grapheme-to-phoneme (G2P) conversion algorithm.

9. The method of claim 1, wherein the single artificial neural network text-to-speech synthesis model is generated without receiving, as input, similarity information on at least one of pronunciation and notation between the phonemes of the first language and the phonemes of the second language.

10. A computer-readable storage medium on which a program containing instructions for executing the steps of the multilingual text-to-speech synthesis method according to claim 1 is recorded.

Technical Field

The present disclosure relates to a multilingual text-to-speech synthesis method and system, and more particularly, to a method and apparatus for synthesizing text in a second language into speech in the voice of a speaker who uses a first language, based on the vocal characteristics of that speaker.

Background

In general, text-to-speech (TTS) synthesis is a technique for reproducing desired speech without recording an actual human voice in advance, used in applications that require a human voice, such as broadcast announcements, satellite navigation systems, artificial intelligence assistants, and the like. Typical speech synthesis methods include concatenative TTS, in which speech is cut into very short units such as phonemes and stored in advance, and the phonemes constituting a sentence to be synthesized are concatenated to produce the speech, and parametric TTS, in which the characteristics of speech are expressed as parameters and a vocoder synthesizes speech corresponding to the sentence from the parameters expressing the characteristics of the speech constituting that sentence.

Recently, artificial-neural-network-based speech synthesis methods have been actively studied, and speech synthesized by these methods exhibits much more natural characteristics than that of conventional methods. However, to implement a speech synthesizer with a new voice using an artificial-neural-network-based method, a large amount of data corresponding to that voice is required, and the neural network model must be retrained using the data, which reduces user convenience.

Disclosure of Invention

Technical problem to be solved

The present disclosure provides a method and apparatus that can generate a multilingual text-to-speech machine learning model in an end-to-end manner using only input text and output speech (audio) for multiple languages. Furthermore, the method and apparatus can synthesize speech from text in a manner that reflects a speaker's vocal characteristics, emotional characteristics, and prosodic characteristics.

Technical scheme

A multilingual text-to-speech synthesis method according to the present disclosure comprises: a step of receiving first learning data including a learning text in a first language and learning speech data in the first language corresponding to the learning text in the first language; a step of receiving second learning data including a learning text in a second language and learning speech data in the second language corresponding to the learning text in the second language; and a step of learning similarity information between the phonemes of the first language and the phonemes of the second language based on the first learning data and the second learning data to generate a single artificial neural network text-to-speech synthesis model.

The multilingual text-to-speech synthesis method of an embodiment of the present disclosure further comprises: a step of receiving an utterance feature of a speaker associated with the first language; a step of receiving an input text in the second language; and a step of generating output speech data for the input text of the second language that imitates the speech of the speaker, by inputting the input text of the second language and the utterance feature of the speaker associated with the first language into the single artificial neural network text-to-speech synthesis model.

In a multilingual text-to-speech synthesis method of an embodiment of the present disclosure, utterance features of a speaker related to a first language are generated by extracting feature vectors from speech data uttered by the speaker in the first language.

The multilingual text-to-speech synthesis method of an embodiment of the present disclosure further comprises: a step of receiving an emotion feature; and a step of generating output speech data for the input text of the second language that imitates the speech of the speaker, by inputting the input text of the second language, the utterance feature of the speaker associated with the first language, and the emotion feature into the single artificial neural network text-to-speech synthesis model.

The multilingual text-to-speech synthesis method of an embodiment of the present disclosure further comprises: a step of receiving a prosodic feature; and a step of generating output speech data for the input text of the second language that imitates the speech of the speaker, by inputting the input text of the second language, the utterance feature of the speaker associated with the first language, and the prosodic feature into the single artificial neural network text-to-speech synthesis model.

In the multilingual text-to-speech synthesis method of an embodiment of the present disclosure, the prosodic feature includes at least one of information on utterance speed, information on utterance accent, information on pitch, and information on pause intervals.

The multilingual text-to-speech synthesis method of an embodiment of the present disclosure further comprises: a step of receiving an input speech of the first language; a step of generating an utterance feature of a speaker associated with the first language by extracting a feature vector from the input speech of the first language; a step of converting the input speech of the first language into an input text of the first language; a step of converting the input text of the first language into an input text of the second language; and a step of generating output speech data of the second language for the input text of the second language that imitates the speech of the speaker, by inputting the input text of the second language and the utterance feature of the speaker associated with the first language into the single artificial neural network text-to-speech synthesis model.

In the multilingual text-to-speech synthesis method of an embodiment of the present disclosure, the learning text of the first language and the learning text of the second language are converted into phoneme sequences using a grapheme-to-phoneme (G2P) conversion algorithm.

In the multilingual text-to-speech synthesis method of an embodiment of the present disclosure, the single artificial neural network text-to-speech synthesis model is generated without receiving, as input, similarity information on at least one of pronunciation and notation between the phonemes of the first language and the phonemes of the second language.

Also, a program for implementing the multilingual text-to-speech synthesis method as described above may be recorded on a computer-readable recording medium.

Drawings

Fig. 1 is a diagram showing a case where a speech synthesizer synthesizes English speech using a single artificial neural network text-to-speech synthesis model that has learned multiple languages.

Fig. 2 is a diagram showing a case where a speech synthesizer synthesizes Korean speech using a single artificial neural network text-to-speech synthesis model that has learned multiple languages.

FIG. 3 is a flow diagram illustrating a method of generating a single artificial neural network text-to-speech synthesis model of an embodiment of the present disclosure.

Fig. 4 is a diagram illustrating a machine learning section according to an embodiment of the present disclosure.

Fig. 5 is a diagram showing a case where a speech synthesizer of an embodiment of the present disclosure synthesizes output speech data based on utterance features of a speaker related to a first language and input text of a second language.

Fig. 6 is a diagram showing a case where a speech synthesizer of an embodiment of the present disclosure generates output speech data based on utterance features of a speaker related to a first language, input text of a second language, and emotion features.

Fig. 7 is a diagram showing a case where a speech synthesizer of an embodiment of the present disclosure generates output speech data based on utterance features of a speaker related to a first language, input text of a second language, and prosodic features (prosody features).

Fig. 8 is a diagram showing the structure of a speech translation system according to an embodiment of the present disclosure.

Fig. 9 is a diagram showing a structure of a prosody translator of an embodiment of the present disclosure.

Fig. 10 is a diagram showing the structure of a multilingual text-to-speech synthesizer according to an embodiment of the present disclosure.

Fig. 11 shows the correspondence between International Phonetic Alphabet (IPA) symbols and Korean grapheme-to-phoneme (KoG2P) phonemes, and the correspondence between phonemes having a common pronunciation in English and Korean.

Fig. 12 shows a table listing the English phonemes most similar to each Korean phoneme.

Fig. 13 shows spectrograms illustrating the similarity between speech generated from English phonemes and speech generated from Korean phonemes.

Fig. 14 is a table showing the character error rate (CER) for varying amounts of English data used in training a text-to-speech machine learning model.

FIG. 15 is a block diagram of a text-to-speech synthesis system according to an embodiment of the present disclosure.

Detailed Description

Advantages and features of the disclosed embodiments, and methods of achieving them, will become apparent from the following detailed description taken in conjunction with the accompanying drawings and the embodiments described herein. However, the present disclosure is not limited to the embodiments disclosed below and may be implemented in various forms; these embodiments are provided only so that the present disclosure will be complete and will fully convey the scope of the invention to those of ordinary skill in the art to which the present disclosure pertains.

The terms used in this specification will be briefly described, and then the disclosed embodiments will be described in detail.

The terms used in this specification have been selected, as far as possible, from general terms currently in wide use in consideration of their function in the present disclosure; however, they may vary according to the intention of persons skilled in the relevant art, precedent, the emergence of new technologies, and the like. In certain cases, a term has been arbitrarily selected by the applicant, and in such cases its meaning is described in detail in the corresponding description. Accordingly, the terms used in the present disclosure should be defined based on their meaning and the overall content of the present disclosure, not simply by their names.

In this specification, a singular expression includes the plural unless the context clearly dictates the singular, and a plural expression includes the singular unless the context clearly dictates the plural.

Throughout the specification, when a part is described as "including" a certain component, this means that other components may further be included rather than excluded, unless specifically stated otherwise.

The term "part" used in the specification means a software component or a hardware component, and the term "part" will play a certain role. However, the "section" is not limited to software or hardware. The "" section "" can be formed on an addressable storage medium, and can reproduce one or more of a plurality of processing programs. Thus, for example, the "part" includes a plurality of software components, a plurality of object software components, a plurality of class components, a plurality of task components, and a plurality of programs, a plurality of functions, a plurality of attributes, a plurality of steps, a plurality of subroutines, a plurality of segments of program code, a plurality of drivers, firmware, microcode, circuitry, data, databases, a plurality of data structures, a plurality of tables, a plurality of arrays, and a plurality of variables. The functions provided in the plurality of components and the plurality of "sections" may be combined by a smaller number of components and "sections", or may be separated into additional components and a plurality of "sections".

According to an embodiment of the present disclosure, a "part" may be implemented with a processor and a memory. The term "processor" should be interpreted broadly to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, and the like. In some environments, a "processor" may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), or the like. The term "processor" may also refer to a combination of processing devices, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors combined with a DSP core, or any other such configuration.

The term "memory" should be interpreted broadly, in a manner that includes any electronic component that can store electronic information. The term memory may refer to various types of processor-readable media, such as Random Access Memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic or optical data storage, recorders, and the like. If the processor can read information from or record information in the memory, the memory will be considered in electronic communication with the processor. The memory integrated in the processor will be in electronic communication with the processor.

In the present disclosure, the "first language" may refer to one of the languages used by various countries or peoples, such as Korean, Japanese, Chinese, or English, and the "second language" may refer to a language used by another country or people, different from the first language.

The embodiments are described in detail below with reference to the accompanying drawings in such a manner that a person having ordinary skill in the art to which the disclosure pertains can easily practice the disclosure. In the drawings, parts not related to the description are omitted for clarity of description of the present disclosure.

Fig. 1 is a diagram showing a case where a speech synthesizer 110 synthesizes English speech using a single artificial neural network text-to-speech synthesis model that has learned multiple languages. In the illustrated embodiment, the single artificial neural network text-to-speech synthesis model has been trained on both Korean and English data. The speech synthesizer 110 may receive English text and the utterance features of a Korean speaker. For example, the English text may be "Hello?", and the utterance features of the Korean speaker may be a feature vector extracted from speech data uttered by the Korean speaker in Korean.

The speech synthesizer 110 inputs the received English text and the utterance features of the Korean speaker into the single artificial neural network text-to-speech synthesis model, and synthesizes and outputs speech that says "Hello?" while imitating the voice of that Korean speaker. That is, the speech output by the speech synthesizer 110 may be speech in which the Korean speaker's voice utters "Hello?" in English.

Fig. 2 is a diagram showing a case where a speech synthesizer 210 synthesizes Korean speech using a single artificial neural network text-to-speech synthesis model that has learned multiple languages. In the illustrated example, the single artificial neural network text-to-speech synthesis model has been trained on both Korean and English data. The speech synthesizer 210 may receive Korean text and the utterance features of an American speaker. For example, the Korean text may be a sentence meaning "Hello?", and the utterance features of the American speaker may be a feature vector extracted from speech data uttered by the American speaker in English.

The speech synthesizer 210 inputs the received Korean text and the utterance features of the American speaker into the single artificial neural network text-to-speech synthesis model, and synthesizes and outputs speech that says "Hello?" in Korean while imitating the voice of that American speaker. That is, the speech output by the speech synthesizer 210 may be speech in which the American speaker's voice utters "Hello?" in Korean.

FIG. 3 is a flow diagram illustrating a method of generating a single artificial neural network text-to-speech synthesis model of an embodiment of the present disclosure. The multi-language text-to-speech synthesis system may perform the step of receiving first learning data (step 310), the first learning data including a learning text in a first language and learning speech data in the first language corresponding to the learning text in the first language. The multilingual text-to-speech synthesis system may perform the step of receiving second learning data (step 320), the second learning data including a learning text in a second language and learning speech data in the second language corresponding to the learning text in the second language.

The multi-language text-to-speech synthesis system may perform the step of learning similarity information between phonemes of the first language and phonemes of the second language based on the first learning data and the second learning data to generate a single artificial neural network text-to-speech synthesis model (step 330). The single artificial neural network text-to-speech synthesis model generation method will be described in detail with reference to fig. 4.

Fig. 4 is a diagram illustrating the machine learning part 420 according to an embodiment of the present disclosure. The machine learning part 420 may correspond to the data learning part 1510 in fig. 15. The machine learning part 420 may receive a plurality of learning data pairs 411 of the first language. A learning data pair 411 of the first language may include a learning text of the first language and learning speech data of the first language corresponding to that learning text.

The learning text of the first language may contain at least one word, and the machine learning part 420 may convert it into a phoneme sequence using a grapheme-to-phoneme (G2P) conversion algorithm. The learning speech data of the first language may be data in which the speech of a person reading the learning text of the first language is recorded, or a sound feature or spectrogram extracted from such recorded data. The first learning data may not include a language identifier or language information for the first language.

The machine learning part 420 may receive a plurality of learning data pairs 412 of the second language. A learning data pair 412 of the second language may include a learning text of the second language and learning speech data of the second language corresponding to that learning text. The first language and the second language may be different languages.

The learning text of the second language may contain at least one word, and the machine learning part 420 may convert it into a phoneme sequence using a grapheme-to-phoneme (G2P) conversion algorithm. The learning speech data of the second language may be data in which the speech of a person reading the learning text of the second language is recorded, or a sound feature or spectrogram extracted from such recorded data. The second learning data may not include a language identifier or language information for the second language.
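As an illustration of this conversion step, the sketch below maps a short learning text to a phoneme sequence using a toy lookup table. The table and the grapheme_to_phoneme helper are hypothetical stand-ins for a real G2P algorithm (for example, a CMUdict-based converter for English or KoG2P for Korean) and are not part of the disclosed method.

    # Minimal G2P sketch. The tiny lexicon and helper are hypothetical; a real
    # system would use a full G2P algorithm for each language.
    TOY_LEXICON = {
        "he": ["HH", "IY1"],
        "has": ["HH", "AE1", "Z"],
        "good": ["G", "UH1", "D"],
        "friends": ["F", "R", "EH1", "N", "D", "Z"],
    }

    def grapheme_to_phoneme(text: str) -> list:
        """Convert a learning text into a phoneme sequence via dictionary lookup."""
        phonemes = []
        for word in text.lower().split():
            phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))  # unknown words map to <unk>
        return phonemes

    if __name__ == "__main__":
        print(grapheme_to_phoneme("He has good friends"))
        # ['HH', 'IY1', 'HH', 'AE1', 'Z', 'G', 'UH1', 'D', 'F', 'R', 'EH1', 'N', 'D', 'Z']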

The machine learning part 420 may perform machine learning based on the received plurality of learning data pairs 411 of the first language and plurality of learning data pairs 412 of the second language to generate a single artificial neural network text-to-speech synthesis model 430. In one embodiment, the machine learning part 420 may learn similarity information between the phonemes of the first language and the phonemes of the second language without prior information about the first language and the second language, and thereby generate the single artificial neural network text-to-speech synthesis model 430. For example, the machine learning part 420 may learn the similarity information between the phonemes of the first language and the phonemes of the second language based on the plurality of learning data pairs 411 of the first language and the plurality of learning data pairs 412 of the second language, and thereby generate the single artificial neural network text-to-speech synthesis model, without receiving a language identifier for the first language, a language identifier for the second language, similarity information on pronunciation between the phonemes of the first language and the phonemes of the second language, or similarity information on notation between the phonemes of the first language and the phonemes of the second language.
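The following sketch illustrates, under assumed data formats, how learning data pairs from both languages could be fed to one model without any language identifier. The TextToSpeechModel stub and the pair layout are hypothetical placeholders, not the disclosed architecture.

    import random

    # Hypothetical learning data pairs: (phoneme sequence, acoustic feature frames).
    # No language identifier accompanies either set, mirroring the description above.
    korean_pairs = [(["h0", "ya", "nf"], [[0.1, 0.2], [0.3, 0.4]])]
    english_pairs = [(["HH", "IY1"], [[0.5, 0.6], [0.7, 0.8]])]

    class TextToSpeechModel:
        """Stub standing in for the single artificial neural network TTS model."""
        def train_step(self, phonemes, frames):
            # A real model would compute a spectrogram loss and update its weights;
            # phoneme embeddings shared across languages let it learn similarity.
            return 0.0

    def train(model, pairs_a, pairs_b, steps=10):
        pool = pairs_a + pairs_b          # both languages share one training pool
        for _ in range(steps):
            phonemes, frames = random.choice(pool)
            model.train_step(phonemes, frames)

    train(TextToSpeechModel(), korean_pairs, english_pairs)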

The language identifier may be an identifier indicating one of the languages used by various countries or peoples, such as Korean, Japanese, Chinese, and English. The similarity information on pronunciation may be information that associates phonemes having similar pronunciations across languages, and the similarity information on notation may be information that associates phonemes having similar notations across languages. The similarity information will be described in detail with reference to fig. 11 and 12.

Conventionally, either a separate machine learning model was created for each language by preparing learning data for each language, or a single machine learning model covering multiple languages was created by providing similarity information between the languages together with the per-language learning data. According to an embodiment of the present disclosure, a multilingual text-to-speech synthesis model can be embodied in a single machine learning model without similarity information between the languages being learned. Fig. 4 shows a case where learning data for two languages are received to generate a single artificial neural network text-to-speech synthesis model, but the present disclosure is not limited thereto; learning data for three or more languages may be received to generate a single artificial neural network text-to-speech synthesis model for three or more languages.

In one embodiment, text may be synthesized into speech and output using the single artificial neural network text-to-speech synthesis model 430 generated by the machine learning part 420. A method of synthesizing text into speech and outputting it using the single artificial neural network text-to-speech synthesis model 430 will be described in more detail with reference to fig. 5 to 7.

Fig. 5 is a diagram showing a case where the speech synthesizer 520 synthesizes output speech data 530 based on the utterance features 511 of a speaker associated with the first language and the input text 512 of the second language, according to an embodiment of the present disclosure. The speech synthesizer 520 may correspond to the data recognition part 1520 in fig. 15. The speech synthesizer 520 may receive the single artificial neural network text-to-speech synthesis model generated by the machine learning part 420 of fig. 4 and use it to synthesize output speech data. As shown, the speech synthesizer 520 may receive the utterance features 511 of a speaker associated with the first language and the input text 512 in the second language.

The utterance features 511 of the speaker associated with the first language may be generated by extracting feature vectors from speech data of the speaker uttering in the first language. For example, the vocal features of the speaker may include the timbre or pitch of the speaker, etc. The input text 512 in the second language may include at least one word formed in the second language.

The speech synthesizer 520 may generate the output speech data 530 by inputting the utterance features 511 of the speaker associated with the first language and the input text 512 in the second language into the single artificial neural network text-to-speech synthesis model. The output speech data 530 may be speech data in which the input text 512 of the second language is synthesized into speech, reflecting the utterance features 511 of the speaker associated with the first language. That is, the output speech data 530 may be speech data synthesized so that the corresponding speaker, whose voice is imitated based on the utterance features 511 associated with the first language, utters the input text 512 of the second language. In one embodiment, the output speech data 530 may be output through a speaker or the like.
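A minimal sketch of this synthesis path is given below, assuming a stub model and using the mean of the speaker's mel frames as a crude stand-in for a learned utterance feature; the class and function names are hypothetical.

    import numpy as np

    class SingleTTSModelStub:
        """Stub for the single artificial neural network TTS model (hypothetical)."""
        def generate(self, phonemes, speaker_feature):
            # A real model would return output speech data (e.g., a mel spectrogram).
            return np.zeros((len(phonemes) * 5, 80))

    def extract_utterance_feature(mel_frames):
        """Illustrative utterance feature: the mean of the speaker's mel frames.
        A trained speaker encoder would be used in practice."""
        return mel_frames.mean(axis=0)

    # Dummy first-language (Korean) speech represented as 80-dimensional mel frames.
    korean_speech_mel = np.random.rand(200, 80)
    speaker_vec = extract_utterance_feature(korean_speech_mel)   # utterance features 511
    english_phonemes = ["HH", "AH0", "L", "OW1"]                 # input text 512 ("Hello")

    output_speech = SingleTTSModelStub().generate(english_phonemes, speaker_vec)
    print(output_speech.shape)  # output speech data 530 as a mel spectrogram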

Fig. 6 is a diagram showing a case where the speech synthesizer 620 generates output speech data 630 based on the utterance features 611 of a speaker associated with the first language, the input text 612 of the second language, and an emotional feature 613, according to an embodiment of the present disclosure. The speech synthesizer 620 may correspond to the data recognition part 1520 in fig. 15. The speech synthesizer 620 may receive the single artificial neural network text-to-speech synthesis model generated by the machine learning part 420 of fig. 4 and use it to synthesize the output speech data 630. As shown, the speech synthesizer 620 can receive the utterance features 611 of a speaker associated with the first language, the input text 612 in the second language, and the emotional feature 613. Since the utterance features of the speaker associated with the first language and the input text of the second language have been described with reference to fig. 5, a repetitive description is omitted.

In one embodiment, the emotional feature 613 may represent one of happiness, sadness, anger, fear, trust, disgust, surprise, and anticipation. In another embodiment, the emotional feature 613 may be generated by extracting a feature vector from speech data. The speech synthesizer 620 may generate the output speech data 630 by inputting the utterance features 611 of the speaker associated with the first language, the input text 612 in the second language, and the emotional feature 613 into the single artificial neural network text-to-speech synthesis model.

The output speech data 630 may be speech data in which the input text 612 of the second language is synthesized into speech, reflecting the utterance features 611 of the speaker associated with the first language and the emotional feature 613. That is, the output speech data 630 may be speech data synthesized so that the corresponding speaker, whose voice is imitated based on the utterance features 611 associated with the first language, utters the input text 612 of the second language with the input emotional feature 613. For example, when the emotional feature 613 is anger, the speech synthesizer 620 may generate output speech data 630 in which the corresponding speaker angrily utters the input text 612 of the second language. In one embodiment, the output speech data 630 may be output through a speaker or the like.

Fig. 7 is a diagram showing a case where the speech synthesizer 720 of an embodiment of the present disclosure generates output speech data 730 based on the utterance features 711 of a speaker associated with the first language, the input text 712 of the second language, and prosodic features 713. The speech synthesizer 720 may correspond to the data recognition part 1520 in fig. 15. The speech synthesizer 720 may receive the single artificial neural network text-to-speech synthesis model generated by the machine learning part 420 of fig. 4 and use it to synthesize the output speech data 730. As shown, the speech synthesizer 720 can receive the utterance features 711 of a speaker associated with the first language, the input text 712 of the second language, and the prosodic features 713. Since the utterance features of the speaker associated with the first language and the input text of the second language have been described with reference to fig. 5, a repetitive description is omitted.

The prosodic features 713 may include at least one of information on utterance speed, information on utterance accent, information on pitch, and information on pause intervals (e.g., pauses while reading). In one embodiment, the prosodic features 713 may be generated by extracting a feature vector from speech data. The speech synthesizer 720 may generate the output speech data 730 by inputting the utterance features 711 of the speaker associated with the first language, the input text 712 of the second language, and the prosodic features 713 into the single artificial neural network text-to-speech synthesis model.

The output speech data 730 may be speech data in which the input text 712 of the second language is synthesized into speech, reflecting the utterance features 711 and the prosodic features 713. That is, the output speech data 730 may be speech data synthesized so that the corresponding speaker, whose voice is imitated based on the utterance features 711 associated with the first language, utters the input text 712 of the second language with the input prosodic features 713. For example, the speech synthesizer 720 may generate output speech data 730 in which the corresponding speaker reads the input text 712 of the second language according to the utterance speed, accent, pitch, and pause information included in the prosodic features 713.

Figs. 6 and 7 show cases where the emotional feature 613 or the prosodic features 713 are input to the speech synthesizer together with the utterance features of the speaker associated with the first language and the input text of the second language, but the speech synthesizer may also be configured to receive the input text of the second language together with any one or more of the utterance features of the speaker associated with the first language, the emotional feature, and the prosodic features.
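One plausible way to combine whichever of these features are provided is simple concatenation into a single conditioning vector, as sketched below; the dimensions and the concatenation scheme are illustrative assumptions rather than the scheme fixed by the disclosure.

    import numpy as np

    def build_condition_vector(utterance_feature, emotion_feature=None, prosody_feature=None):
        """Concatenate whichever optional features are provided into one
        conditioning vector. Concatenation is only one plausible combination
        scheme and is not mandated by the disclosure."""
        parts = [utterance_feature]
        if emotion_feature is not None:
            parts.append(emotion_feature)
        if prosody_feature is not None:
            parts.append(prosody_feature)
        return np.concatenate(parts)

    speaker_vec = np.random.rand(64)   # utterance features of the first-language speaker
    emotion_vec = np.random.rand(8)    # e.g., one-hot or learned emotion embedding
    prosody_vec = np.random.rand(4)    # e.g., speed, accent, pitch, pause statistics

    cond = build_condition_vector(speaker_vec, emotion_feature=emotion_vec,
                                  prosody_feature=prosody_vec)
    print(cond.shape)  # (76,)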

Fig. 8 is a diagram showing the structure of a speech translation system 800 according to an embodiment of the present disclosure. The speech translation system 800 may include a speech recognizer 810, a machine translator 820, a speech synthesizer 830, an utterance feature extractor 840, an emotional feature extractor 850, a prosodic feature extractor 860, and a prosody translator 870. The speech synthesizer 830 may correspond to the data recognition part 1520 in fig. 15. As shown, the speech translation system 800 can receive input speech in a first language.

The received input speech of the first language may be passed to the speech recognizer 810, the utterance feature extractor 840, the emotional feature extractor 850, and the prosodic feature extractor 860. The speech recognizer 810 may receive the input speech of the first language and convert it into input text of the first language. The machine translator 820 included in the speech translation system 800 may translate the input text of the first language into input text of a second language and pass it to the speech synthesizer 830.

The utterance feature extractor 840 may generate the utterance features of the speaker who uttered the input speech of the first language by extracting a feature vector from the input speech of the first language. The speech synthesizer 830 may generate output speech data of the second language corresponding to the input text of the second language that imitates the speaker's voice, by inputting the input text of the second language and the utterance features of the speaker associated with the first language into the single artificial neural network text-to-speech synthesis model. In this case, the output speech of the second language may be speech synthesized so as to reflect the vocal characteristics of the speaker who uttered the input speech of the first language.

The emotional feature extractor 850 may extract an emotional feature from the input speech of the first language and pass it to the speech synthesizer 830. The speech synthesizer 830 may generate output speech data of the second language corresponding to the input text of the second language, imitating the speaker's voice and reflecting the emotional feature of the input speech of the first language, by inputting the input text of the second language, the utterance features of the speaker associated with the first language, and the emotional feature into the single artificial neural network text-to-speech synthesis model. In this case, the output speech of the second language may be speech synthesized so as to reflect the vocal characteristics and emotional characteristics of the speaker who uttered the input speech of the first language.

The prosodic feature extractor 860 may extract prosodic features from the input speech of the first language and pass the extracted prosodic features to the prosody translator 870, which translates the prosodic features associated with the first language into prosodic features associated with the second language. That is, the prosody translator 870 may generate information for reflecting the prosodic features extracted from the input speech of the first language in the output speech of the second language.

The speech synthesizer 830 may generate output speech data of the second language corresponding to the input text of the second language, imitating the speaker's voice and reflecting the prosodic features of the input speech of the first language, by inputting the input text of the second language, the utterance features of the speaker associated with the first language, and the translated prosodic features into the single artificial neural network text-to-speech synthesis model. In this case, the output speech of the second language may be speech synthesized so as to reflect the utterance characteristics and prosodic characteristics of the speaker who uttered the input speech of the first language. When prosodic features are reflected, characteristics such as the speech rate, pauses, and emphasis of the input speech of the first language may also be applied to the output speech of the second language.

For example, if there is a word emphasized by the user in the input speech of the first language, the prosody translator 870 may generate information for emphasizing a word of the second language corresponding to the emphasized word of the first language. The speech synthesizer 830 may generate speech in a manner of emphasizing a word of the second language corresponding to the word emphasized in the first language based on the information received from the prosody translator 870.

In one embodiment, speech synthesizer 830 can mimic a speaker's speech by inputting an input text in a second language, utterance features, emotion features, and translated prosodic features of the speaker associated with a first language into a single artificial neural network text-to-speech synthesis model and generate output speech data in the second language corresponding to the input text in the second language reflecting the emotion features and prosodic features of the input speech in the first language. In this case, the output speech of the second language may be speech synthesized in a manner reflecting the vocal characteristics, emotional characteristics, and prosodic characteristics of the speaker who uttered the input speech of the first language.

In the case where the features of the speaker are extracted from the input speech of the first language for synthesizing the translated speech, the output speech of the second language can be generated in a similar speech imitating the voice of the corresponding speaker even without learning the voice of the corresponding speaker in advance. Also, in the case where emotional characteristics of a speaker are extracted from an input speech of a first language, an output speech of a second language can be generated more naturally in a manner of imitating the emotion of the corresponding speaker in the corresponding utterance. Also, in the case where prosodic features of a speaker are extracted from input speech of a first language, output speech of a second language can be generated more naturally mimicking prosody in a respective utterance of a respective speaker.

Fig. 8 illustrates a case where all of the utterance feature, emotion feature, prosody feature, and the like are extracted from the input speech of the first language to synthesize speech, but the present invention is not limited thereto. In other embodiments, at least one of the vocal features, emotional features, and prosodic features may also be extracted from the input speech of other speakers. For example, emotional features and prosodic features may be extracted from input speech in a first language, and vocal features may be extracted from other input speech (e.g., the speech of a celebrity), thereby synthesizing speech. In this case, the synthesized speech reflects the emotion and prosody of the speaker who uttered the input speech of the first language, but can reflect the voice of the speaker who uttered the other input speech (e.g., celebrity).
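The end-to-end flow of fig. 8 can be summarized by the sketch below, in which every component is a hypothetical stub; a real system would plug in an actual speech recognizer, machine translator, feature extractors, prosody translator, and the single artificial neural network text-to-speech synthesis model.

    # Sketch of the fig. 8 pipeline with stub components (all placeholders).
    class SpeechRecognizer:            # 810
        def recognize(self, speech): return "first-language text"

    class MachineTranslator:           # 820
        def translate(self, text): return "second-language text"

    class UtteranceFeatureExtractor:   # 840
        def extract(self, speech): return [0.0] * 64

    class EmotionFeatureExtractor:     # 850
        def extract(self, speech): return [0.0] * 8

    class ProsodyFeatureExtractor:     # 860
        def extract(self, speech): return [0.0] * 4

    class ProsodyTranslator:           # 870
        def translate(self, prosody): return prosody

    class SpeechSynthesizer:           # 830
        def synthesize(self, text, utterance, emotion, prosody):
            return b"output speech data in the second language"

    def translate_speech(first_language_speech):
        text_l1 = SpeechRecognizer().recognize(first_language_speech)
        text_l2 = MachineTranslator().translate(text_l1)
        utterance = UtteranceFeatureExtractor().extract(first_language_speech)
        emotion = EmotionFeatureExtractor().extract(first_language_speech)
        prosody = ProsodyTranslator().translate(
            ProsodyFeatureExtractor().extract(first_language_speech))
        return SpeechSynthesizer().synthesize(text_l2, utterance, emotion, prosody)

    print(translate_speech(b"input speech in the first language"))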

Fig. 9 is a diagram showing the structure of the prosody translator 870 according to an embodiment of the present disclosure. As shown, the prosody translator 870 may include a prosody encoder 910, an attention layer 920, and a prosody decoder 930. The prosody encoder 910 may receive the prosodic features of the first language (the source language) extracted from the input speech of the first language by the prosodic feature extractor.

The received prosodic features of the first language are converted into prosodic features of the second language (the target language) through the prosody encoder 910, the attention layer 920, and the prosody decoder 930. In one example, the prosody translator 870 may be trained as a sequence-to-sequence model to convert the prosodic features of the source language into the prosodic features of the target language. That is, the sequence-to-sequence model may be embodied by combining an encoder-decoder structure based on a recurrent neural network (RNN) (see "Sequence to Sequence Learning with Neural Networks," Ilya Sutskever et al., 2014) with an attention mechanism (see "Neural Machine Translation by Jointly Learning to Align and Translate," Dzmitry Bahdanau et al., 2015, and "Effective Approaches to Attention-based Neural Machine Translation," Minh-Thang Luong et al., 2015).
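A minimal sketch of such a sequence-to-sequence prosody translator is shown below, assuming PyTorch, GRU encoder/decoder layers, and simple dot-product attention; the feature dimensions and the attention variant are illustrative choices, not those mandated by the disclosure.

    import torch
    import torch.nn as nn

    class ProsodyTranslatorSketch(nn.Module):
        """Hypothetical sketch of the prosody encoder / attention / prosody decoder
        (910/920/930) as a GRU sequence-to-sequence model with dot-product attention."""
        def __init__(self, feat_dim=4, hidden_dim=32):
            super().__init__()
            self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)   # 910
            self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)   # 930
            self.proj = nn.Linear(2 * hidden_dim, feat_dim)

        def forward(self, src_prosody, tgt_len):
            enc_out, h = self.encoder(src_prosody)             # (B, Ts, H)
            dec_in = torch.zeros(src_prosody.size(0), 1, src_prosody.size(2))
            outputs = []
            for _ in range(tgt_len):
                dec_out, h = self.decoder(dec_in, h)           # (B, 1, H)
                # Dot-product attention (920) over encoder states.
                scores = torch.bmm(dec_out, enc_out.transpose(1, 2))     # (B, 1, Ts)
                weights = torch.softmax(scores, dim=-1)
                context = torch.bmm(weights, enc_out)                    # (B, 1, H)
                step = self.proj(torch.cat([dec_out, context], dim=-1))  # (B, 1, F)
                outputs.append(step)
                dec_in = step                                            # autoregressive
            return torch.cat(outputs, dim=1)                             # (B, Tt, F)

    src = torch.rand(1, 20, 4)                         # first-language prosody features
    translated = ProsodyTranslatorSketch()(src, tgt_len=15)
    print(translated.shape)                            # torch.Size([1, 15, 4])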

Fig. 10 is a diagram showing the structure of a multilingual text-to-speech synthesizer 1000 according to an embodiment of the present disclosure. As shown, the multilingual text-to-speech synthesizer 1000 may include an encoder 1010, a decoder 1020, and a vocoder 1030. The encoder 1010 may receive input text.

The input text may be composed of multiple languages and may not contain a language identifier or language-related information. For example, the input text may contain a sentence such as "Hello" or "How are you?". The encoder 1010 may separate the received input text into letter units, word units, or phoneme units. Alternatively, the encoder 1010 may receive input text already separated into letter units, word units, or phoneme units.

The encoder 1010 may include at least one embedding layer (e.g., EL language 1, EL language 2, …, EL language N). The at least one embedding layer of the encoder 1010 may convert the input text separated into letter units, word units, or phoneme units into text embedding vectors. The encoder 1010 may use an already-trained machine learning model to convert the separated input text into the text embedding vectors. The encoder may update the machine learning model while performing machine learning, in which case the text embedding vectors for the separated input text may also be updated.

The encoder 1010 may input the text embedding vectors to a deep neural network (DNN) module composed of fully-connected layers. The deep neural network may be a feed-forward layer or a linear layer.

The encoder 1010 may input the output of the deep neural network to a module including at least one of a convolutional neural network (CNN) and a recurrent neural network (RNN). In this case, the module including at least one of a convolutional neural network and a recurrent neural network may receive the output of the deep neural network together with the speaker embedding vector s output from the embedding layer of the decoder 1020. The convolutional neural network may capture local characteristics according to the size of the convolution kernel, and the recurrent neural network may capture long-term dependencies. The module including at least one of a convolutional neural network and a recurrent neural network may output the hidden states h of the encoder 1010.

The embedding layer of the decoder 1020 may perform an operation similar to that of the embedding layer of the encoder 1010. The embedding layer may receive a speaker identifier (ID). For example, the speaker ID may be a one-hot vector. In one embodiment, the speaker ID of "Trump" may be set to "1", the speaker ID of "Zhangyin" to "2", and the speaker ID of "Obama" to "3". The embedding layer of the decoder 1020 may convert the speaker ID into a speaker embedding vector s. To convert the speaker ID into the speaker embedding vector s, the decoder 1020 may use an already-trained machine learning model. The decoder 1020 may update the machine learning model while performing machine learning, in which case the speaker embedding vector s for the speaker ID may also be updated.

The attention module of the decoder 1020 may receive the hidden states h of the encoder from the encoder 1010. The attention module of the decoder 1020 may also receive information from the attention RNN. The information received from the attention RNN may be information about the speech that the decoder 1020 has generated up to the previous time step. The attention module of the decoder 1020 may output a context vector Ct based on the information received from the attention RNN and the hidden states h of the encoder. The hidden states h of the encoder may be information about the input text for which speech is to be generated.

The context vector Ct may be information used to determine from which part of the input text speech should be generated at the current time step. For example, the attention module of the decoder 1020 may output information so that speech is generated based on the front portion of the input text at the beginning of speech generation, and gradually based on later portions of the input text as speech generation proceeds.

As shown in the figure, the decoder 1020 may input the speaker embedding vector s to the attention RNN, the decoder RNN, and the module of the encoder 1010 including at least one of a convolutional neural network and a recurrent neural network, so that the structure of the artificial neural network is configured to decode differently for each speaker. The recurrent neural network of the decoder 1020 may be constructed in an autoregressive manner. That is, the r frames output at the previous time step may be used as the input for the current time step. Since the initial time step 1022 has no previous time step, dummy frames may be input to the deep neural network.

The decoder 1020 may include a deep neural network composed of fully-connected layers. The deep neural network may be a feed-forward layer or a linear layer. The decoder 1020 may also include an attention RNN composed of gated recurrent units (GRUs). The attention RNN is a layer that outputs information to be used in the attention module. Since the attention module has been described above, a detailed description is omitted.

The decoder 1020 may include a decoder RNN composed of residual gated recurrent units. The decoder RNN may receive position information of the input text from the attention module. That is, the position information may be information about which position of the input text the decoder 1020 is converting into speech.

The decoder RNN may receive information from the attention RNN. The information received from the attention RNN may be information about the speech that the decoder has generated up to the previous time step and information about the speech to be generated at the current time step. The decoder RNN may generate the next output speech to follow the speech generated so far. The output speech may take the form of a mel spectrogram and may consist of r frames.

For text-to-speech synthesis, the operations of the deep neural network, the attention RNN, and the decoder RNN may be performed iteratively. For example, the r frames obtained at the initial time step 1022 may become the input for the subsequent time step 1024, and the r frames output at time step 1024 may become the input for the subsequent time step 1026.
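The autoregressive loop described above, in which dummy frames seed the first step and each step's r output frames feed the next step, can be sketched as follows; the decoder step is a stub and the frame sizes are illustrative assumptions.

    import numpy as np

    R_FRAMES, MEL_DIM = 3, 80   # r frames per step and mel dimension (illustrative)

    def decoder_step(prev_frames, step):
        """Stub for one pass through the attention RNN / decoder RNN;
        a real decoder would also attend over the encoder hidden states h."""
        return np.full((R_FRAMES, MEL_DIM), float(step))

    def generate_mel(num_steps):
        frames = np.zeros((R_FRAMES, MEL_DIM))        # dummy frames for the first step
        mel_chunks = []
        for step in range(num_steps):
            frames = decoder_step(frames, step)       # r new frames for this time step
            mel_chunks.append(frames)                 # previous output feeds the next step
        return np.concatenate(mel_chunks, axis=0)     # mel spectrogram for the whole text

    print(generate_mel(5).shape)   # (15, 80): 5 time steps x r frames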

Speech for all units of the text can be generated through the steps described above. The text-to-speech synthesis system may concatenate the mel spectrograms produced at each time step in chronological order to obtain the mel spectrogram for the entire text. The mel spectrogram for the entire text generated at the decoder 1020 may be output to the first vocoder 1030 or the second vocoder 1040.

The first vocoder 1030 may include a Griffin-Lim reconstruction module and a module including at least one of a convolutional neural network and a recurrent neural network. The module of the first vocoder 1030 including at least one of a convolutional neural network and a recurrent neural network may perform the same operations as the corresponding module of the encoder 1010. That is, it may capture local characteristics and long-term dependencies, and may output a linear-scale spectrogram. The first vocoder 1030 may apply the Griffin-Lim algorithm to the linear-scale spectrogram to simulate and output a speech signal corresponding to the input text in the voice corresponding to the speaker ID.

The second vocoder 1040 may obtain the speech signal from the mel spectrogram based on a machine learning model. The machine learning model may be a network trained to predict a speech signal from a mel spectrogram; for example, a model such as WaveNet or WaveGlow may be used. The second vocoder 1040 may be used in place of the first vocoder 1030.
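As an illustration of the first vocoder 1030, the sketch below reconstructs a waveform from a linear-scale magnitude spectrogram with the Griffin-Lim algorithm, assuming the librosa library is available; the spectrogram contents and the STFT parameters are placeholders.

    import numpy as np
    import librosa

    # Placeholder for the linear-scale spectrogram produced by the decoder and
    # the vocoder's CNN/RNN module; parameter values are illustrative assumptions.
    n_fft = 1024
    linear_spectrogram = np.random.rand(1 + n_fft // 2, 120)   # (frequency bins, frames)

    waveform = librosa.griffinlim(linear_spectrogram, n_iter=32, hop_length=256)
    print(waveform.shape)

    # The second vocoder 1040 would instead pass a mel spectrogram through a
    # trained neural model (e.g., WaveNet or WaveGlow) to predict the waveform.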

Such an artificial-neural-network-based multilingual text-to-speech synthesizer 1000 may be trained using a large database in which learning texts in multiple languages and the corresponding learning speech signals exist in pairs. The multilingual text-to-speech synthesizer 1000 may receive a learning text and define a loss function by comparing the output speech signal with the corresponding learning speech signal. The speech synthesizer may learn the loss function through an error back-propagation algorithm, finally obtaining an artificial neural network that outputs the desired speech when an arbitrary text is input.
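A minimal sketch of this training objective is shown below, assuming PyTorch and a toy placeholder network in place of the synthesizer 1000; an L1 loss between predicted and reference mel spectrograms is one common choice, not necessarily the loss used in the disclosure.

    import torch
    import torch.nn as nn

    model = nn.Linear(32, 80)                      # placeholder for the synthesizer 1000
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.L1Loss()                        # spectrogram losses are often L1/L2

    text_features = torch.rand(16, 32)             # encoded learning text (dummy)
    target_mel = torch.rand(16, 80)                # learning speech signal (dummy)

    for _ in range(3):                             # a few error back-propagation steps
        predicted_mel = model(text_features)
        loss = criterion(predicted_mel, target_mel)
        optimizer.zero_grad()
        loss.backward()                            # error back propagation
        optimizer.step()
    print(float(loss))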

The multilingual text-to-speech synthesizer 1000 may synthesize speech that imitates the voice of a particular speaker by using the single artificial neural network text-to-speech synthesis model generated as described above. The multilingual text-to-speech synthesizer 1000 may also synthesize speech in a language different from the native language of a particular speaker in the voice of that speaker. That is, the multilingual text-to-speech synthesizer 1000 may synthesize speech in which a speaker who uses a first language speaks a second language. For example, input Korean text can be synthesized so that it is spoken in Korean in the voice of Trump.

Fig. 11 shows the correspondence between International Phonetic Alphabet (IPA) symbols and Korean grapheme-to-phoneme (KoG2P) phonemes, and the correspondence between phonemes having a common pronunciation in English and Korean. The sounds of different languages can be described by the IPA, a common alphabet system. IPA symbols for the pronunciations of different languages may be used as similarity information. The conversion tables between IPA and CMUdict and between IPA and KoG2P are shown in table 1110. At the IPA level there is no one-to-one correspondence between the phonemes of the first language and the phonemes of the second language, but a subset containing phonemes with a common pronunciation in the first language and the second language may be selected. For example, a subset of phonemes with a common pronunciation in English and Korean is selected as in table 1120.

The first language and the second language may have different writing systems and different pronunciation systems. If the first language and the second language are expressed in the IPA, a common alphabet system, a speech synthesis model could be obtained by normalizing across the languages. However, the IPA merely represents each language with the same set of symbols; it cannot fully represent the similarity in pronunciation or notation between different languages. For example, an IPA symbol used in the first language may not be used in the second language at all. Since a speech synthesis model cannot know which IPA symbol of the second language corresponds to an IPA symbol used only in the first language, when the IPA is used, only language-specific speech synthesis models can be obtained. That is, a speech synthesis model for the first language can process only data for the first language and cannot process data for the second language; conversely, a speech synthesis model for the second language can process only data for the second language and cannot process data for the first language.

Fig. 12 shows a table of the English phonemes most similar to each Korean phoneme. The text-to-speech synthesis system of an embodiment of the present disclosure may calculate, for anchor (anchor) phonemes of multiple languages, cosine distances between phonemes based on a machine learning model. The phoneme embedding vectors obtained from the machine learning model may be used to calculate these cosine distances, and the cosine distance between two phonemes may represent the similarity between them.

Based on the calculated cosine distances between phonemes, the 5 English phoneme embeddings most similar to each Korean phoneme are listed in table 1210. The numbers 0, 1, and 2 following an English phoneme denote "no stress", "primary stress", and "secondary stress", respectively; CMUdict distinguishes stress, whereas the international phonetic alphabet used here does not. The symbols in parentheses are the corresponding international phonetic alphabet symbols.
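A sketch of how such a ranking could be computed is shown below; the phoneme embedding dictionaries are random placeholders standing in for vectors learned by the single text-to-speech synthesis model.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_english_phonemes(korean_anchor, korean_emb, english_emb, k=5):
    """Rank English phoneme embeddings by cosine distance to a Korean anchor phoneme."""
    anchor = korean_emb[korean_anchor]
    ranked = sorted(english_emb.items(), key=lambda kv: cosine_distance(anchor, kv[1]))
    return [name for name, _ in ranked[:k]]

# Random placeholder embeddings; real ones would come from the learned model.
rng = np.random.default_rng(0)
korean_emb = {p: rng.standard_normal(64) for p in ["s0", "kk", "tt", "ph", "nf"]}
english_emb = {p: rng.standard_normal(64) for p in ["S", "Z", "T", "K", "N", "AA1", "IY0"]}
print(most_similar_english_phonemes("s0", korean_emb, english_emb))
```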

Table 1210 confirms that the 5 most similar phoneme embeddings found for each anchor phoneme by the machine learning model of an embodiment of the present disclosure agree with table 1120 of Fig. 11. In other words, even though no similarity information about pronunciation or notation between the phonemes of the first language and the phonemes of the second language, and no language identifier or language information for either language, is received during learning, the machine learning model automatically learns the similar pronunciations or notations across languages. Thus, the text-to-speech synthesis system of an embodiment of the present disclosure can perform text-to-speech synthesis for the multiple learned languages based on a single artificial neural network text-to-speech synthesis model.

Fig. 13 shows spectrograms illustrating the similarity between speech generated from English phonemes and speech generated from Korean phonemes. Spectrogram 1310 is the result of synthesizing the sentence "He has many good friends" into speech from the English phoneme sequence (HH, IY1, HH, AE1, Z, M, EH1, N, IY0, G, UH1, D, F, R, EH1, N, D, Z). Spectrogram 1320 is the result of synthesizing speech from the Korean phoneme sequence (h0, wi, h0, ya, s0, mf, ye, nf, ii, kk, yo, tt, ph, ks, ye, nf, tt, s0) obtained by replacing each phoneme in the English phoneme sequence of the same sentence with its most similar Korean phoneme.

Comparing spectrogram 1310 with spectrogram 1320 shows that the speech synthesized from the English phoneme sequence is similar to the speech synthesized from the Korean phoneme sequence. This confirms that a high-quality speech synthesis result can be obtained even when text in the second language is synthesized into speech using phonemes of the first language. That is, even if text in the second language is synthesized using the utterance features of a speaker who speaks the first language, the result sounds like a native speaker of the first language speaking the second language.
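A comparison of this kind could be reproduced with a sketch like the following, assuming the two synthesized waveforms are available as audio files (the file names are placeholders) and that the librosa library is used to compute the spectrograms.

```python
import librosa
import numpy as np

def mel_spectrogram(path, sr=22050, n_mels=80):
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

mel_en = mel_spectrogram("synth_from_english_phonemes.wav")   # cf. spectrogram 1310
mel_ko = mel_spectrogram("synth_from_korean_phonemes.wav")    # cf. spectrogram 1320

# Compare the overlapping frames; a small mean absolute difference in dB suggests
# the two phoneme sequences yield similar speech.
n_frames = min(mel_en.shape[1], mel_ko.shape[1])
print("mean |dB difference|:", np.abs(mel_en[:, :n_frames] - mel_ko[:, :n_frames]).mean())
```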

Fig. 14 is a table 1410 showing the character error rate (CER) as a function of the amount of English data used in learning the text-to-speech machine learning model. In the present embodiment, the text-to-speech machine learning model is learned while varying the amount of English learning data under the condition that sufficient Korean learning data is available. To quantify speech synthesis quality, table 1410 reports the error rate obtained when a person listens to the synthesized speech output, transcribes it into text, and compares the transcript with the original text.
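The error rate described here can be computed as a character error rate with a standard edit-distance routine; the following sketch and its example strings are illustrative, not the evaluation code of the present disclosure.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance between the transcript and the original text,
    normalized by the length of the reference."""
    ref, hyp = list(reference), list(hypothesis)
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(character_error_rate("he has many good friends", "he has mani good friend"))
```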

According to table 1410, when English text and the utterance features of a Korean speaker are input to the text-to-speech machine learning model to synthesize English speech in that speaker's voice, the character error rate decreases as more hours of English learning data are used. That is, the more English learning data used for machine learning, the lower the error rate of the speech in which the Korean speaker reads the English text.

On the other hand, when Korean text and the utterance features of an English speaker are input to the text-to-speech machine learning model to synthesize Korean speech in that speaker's voice, the character error rate changes little even as the amount of English learning data increases. This is because the amount of Korean data used for machine learning is larger than that of English, so the error rate has already been reduced to a critical (saturation) value. This confirms that the error rate can be sufficiently reduced when the text-to-speech synthesis system performs machine learning with data above a critical amount. It also confirms that when the text-to-speech machine learning model is learned with a large amount of Korean learning data and only a small amount of English learning data, English text can still be synthesized into speech with high quality.

According to the present disclosure, a multi-language text-to-speech machine learning model can be generated end-to-end from only input text and output speech in multiple languages. In a conventional system, expressing several different languages in one linguistic feature set (linguistic feature set) requires a common notation for the languages, such as the international phonetic alphabet, as well as advance information on the similarity between languages. According to the present disclosure, however, no linguistic features (linguistic features) are required, so each language may use its own alphabet, and no advance information on inter-language similarity is needed.

Further, since the model of the present disclosure is learned in an end-to-end manner, there is no need to predict, with a separate model, features (features) required by conventional text-to-speech synthesis such as phoneme duration (phoneme duration), and the text-to-speech synthesis job can be handled by a single neural network (neural network) model. Also, according to the present disclosure, it is possible to adjust how foreign-accented or fluent the generated speech sounds depending on whether the speaker identification code (speaker ID) is used when the text encoder (Text encoder) extracts the text encoding (Text encoding). For example, when generating speech in the first language, a penalty may be imposed during learning if the pronunciation sounds strongly like the second language. A machine learning model employing this penalty can generate speech closer to the pronunciation of the first language.
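One way such a penalty could enter the training objective is sketched below; the auxiliary accent/language classifier and the penalty weight are assumptions introduced for illustration, not components specified in the present disclosure.

```python
import torch.nn.functional as F

def tts_loss_with_accent_penalty(pred_mel, target_mel, accent_logits, penalty_weight=0.1):
    """
    pred_mel, target_mel: (batch, time, mel_dim) speech feature tensors.
    accent_logits: (batch, 2) scores from a hypothetical accent/language classifier,
                   index 0 = first language, index 1 = second language.
    """
    reconstruction = F.l1_loss(pred_mel, target_mel)
    # Mean probability that the generated speech sounds like the second language;
    # penalizing it pushes the model toward first-language pronunciation.
    p_second_language = F.softmax(accent_logits, dim=-1)[:, 1].mean()
    return reconstruction + penalty_weight * p_second_language
```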

FIG. 15 is a block diagram of a text-to-speech synthesis system 1500 according to an embodiment of the disclosure. The text-to-speech synthesis system 1500 of an embodiment may comprise a data learning section 1510 and a data recognition section 1520. The data learning section 1510 may acquire a machine learning model by inputting data. Also, the data recognition unit 1520 may generate an output voice by applying data to the machine learning model. The text-to-speech system 1500, as described above, can include a processor and a memory.

The data learning unit 1510 may learn speech for text. The data learning unit 1510 learns a criterion for which speech to output for a given text, and may also learn a criterion for which speech features to use when outputting the speech. The speech features may include at least one of the pronunciation of phonemes, the user's tone, or accent. The data learning unit 1510 may acquire data for learning and apply the acquired data to the data learning model described later, thereby learning speech from text.

The data recognition unit 1520 may output speech corresponding to a given text. Using the learned data learning model, the data recognition unit 1520 may output speech from a given text. The data recognition unit 1520 may acquire the given text (data) according to a criterion preset through learning, and may output speech based on that data by using the acquired data as an input value to the data learning model. The result value output by the data learning model for that input may also be used to update the data learning model.

At least one of the data learning unit 1510 and the data recognition unit 1520 may be manufactured in the form of at least one hardware chip and mounted on an electronic device. For example, at least one of the data learning unit 1510 and the data recognition unit 1520 may be manufactured as a dedicated hardware chip for artificial intelligence (AI), or may be manufactured as a part of an existing general-purpose processor (e.g., a central processing unit (CPU) or an application processor (application processor)) or a graphics-dedicated processor (e.g., a graphics processing unit (GPU)) and mounted on the various electronic devices described above.

The data learning unit 1510 and the data recognition unit 1520 may also be mounted on separate electronic devices. For example, one of the data learning unit 1510 and the data recognition unit 1520 may be included in the electronic device and the other may be included in a server. The data learning unit 1510 and the data recognition unit 1520 may be connected by wire or wirelessly, so that the model information constructed by the data learning unit 1510 can be supplied to the data recognition unit 1520, and the data input to the data recognition unit 1520 can be supplied to the data learning unit 1510 as additional learning data.

On the other hand, at least one of the data learning unit 1510 or the data recognition unit 1520 may be embodied as a software module. When at least one of the data learning unit 1510 and the data recognition unit 1520 is embodied as a software module (or a program module including instructions), the software module may be stored in a memory or a non-transitory computer-readable recording medium. In this case, at least one software module may be provided by an operating system (OS) or by a predetermined application program. Alternatively, a portion of the at least one software module may be provided by the operating system and the remaining portion may be provided by the predetermined application program.

The data learning unit 1510 according to an embodiment of the present disclosure may include a data acquiring unit 1511, a preprocessing unit 1512, a learning data selecting unit 1513, a model learning unit 1514, and a model evaluating unit 1515.
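A purely structural sketch of how these five components might be wired together is given below; the method bodies are schematic placeholders and do not reflect the implementation of the present disclosure.

```python
class DataLearningUnit:
    """Schematic wiring of components 1511-1515; not the actual implementation."""

    def __init__(self, fetch_fn, train_fn, threshold=0.02):
        self.fetch_fn = fetch_fn      # used by the data acquisition unit 1511
        self.train_fn = train_fn      # used by the model learning unit 1514
        self.threshold = threshold    # used by the model evaluation unit 1515

    def acquire(self):
        # 1511: collect (text, speech) learning pairs
        return self.fetch_fn()

    def preprocess(self, raw_pairs):
        # 1512: convert the acquired data into a predetermined format
        return [(text.strip(), speech) for text, speech in raw_pairs]

    def select(self, pairs):
        # 1513: keep only the data needed for learning (here: non-empty text)
        return [(text, speech) for text, speech in pairs if text]

    def learn(self, pairs):
        # 1514: fit the data learning model on the selected data
        return self.train_fn(pairs)

    def evaluate(self, model, eval_pairs, is_error):
        # 1515: the model passes only if its error ratio stays within the threshold
        errors = sum(1 for text, speech in eval_pairs if is_error(model, text, speech))
        return errors / max(len(eval_pairs), 1) <= self.threshold
```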

The data acquisition unit 1511 can acquire the data necessary for machine learning. Since a large amount of data is necessary for learning, the data acquisition unit 1511 may receive a plurality of texts and the speech corresponding to each text.

The preprocessing unit 1512 may preprocess the acquired data so that it can be used for machine learning. The preprocessing unit 1512 can process the acquired data into a predetermined format so that the model learning unit 1514 described later can use it. For example, the preprocessing unit 1512 may obtain morpheme embeddings by performing morpheme analysis on the text and the speech.

The learning data selection unit 1513 may select the data necessary for learning from the preprocessed data, and the selected data may be provided to the model learning unit 1514. The learning data selection unit 1513 may select the data required for learning from the preprocessed data according to a preset criterion, and may also select data according to a criterion preset by the learning of the model learning unit 1514 described later.

The model learning unit 1514 may learn, based on the learning data, a criterion regarding which speech to output for a given text. Also, the model learning unit 1514 may train, using the learning data, a data learning model that outputs speech from text. In this case, the data learning model may include a pre-constructed model. For example, the data learning model may include a model pre-constructed by receiving basic learning data (e.g., sample images, etc.).

The data learning model may be constructed in consideration of the field in which the learning model is used, the purpose of learning, the computing performance of the apparatus, and the like. The data learning model may include, for example, a neural network (Neural Network)-based model. For example, a model such as a deep neural network (DNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a bidirectional recurrent deep neural network (BRDNN), or a convolutional neural network (CNN) can be used as the data learning model, but the model is not limited thereto.

According to another embodiment, when there are a plurality of pre-constructed data learning models, the model learning unit 1514 may determine the data learning model whose basic learning data has the greatest correlation with the input learning data as the data learning model to be trained. In this case, the basic learning data may be pre-classified according to the type of data, and the data learning models may be pre-constructed according to the type of data. For example, the basic learning data may be pre-classified according to various criteria such as the region in which the learning data was generated, the time at which the learning data was generated, the size of the learning data, the genre of the learning data, the creator of the learning data, and the type of object in the learning data.

The model learning unit 1514 may train the data learning model by using a learning algorithm including, for example, error back propagation or gradient descent (gradient descent).

Also, for example, the model learning unit 1514 may train the data learning model by supervised learning (supervised learning) that uses the learning data as an input value. For example, the model learning unit 1514 may also train the data learning model by unsupervised learning (unsupervised learning), which finds the criterion needed for judging a situation by self-learning the types of data necessary for the judgment without particular supervision. Also, for example, the model learning unit 1514 may train the data learning model by reinforcement learning (reinforcement learning), which uses feedback on whether the result of a judgment based on the learning is correct.

When the data learning model is learned, the model learning unit 1514 may store the learned data learning model. In this case, the model learning part 1514 may store the learned data learning model in a memory of the electronic device including the data recognition part 1520. Alternatively, the model learning unit 1514 may store the learned data learning model in a memory of a server connected to the electronic apparatus in a wired or wireless network.

In this case, for example, the memory storing the learned data learning model may also store instructions or data relating to at least one other structural element of the electronic device together. Also, the memory may store software and/or programs. For example, a program may include a kernel, middleware, an Application Program Interface (API), and/or an application program (or "app"), among others.

The model evaluation unit 1515 may input evaluation data to the data learning model, and may cause the model learning unit 1514 to perform learning again when the result output from the evaluation data fails to satisfy a predetermined criterion. In this case, the evaluation data may include preset data for evaluating the data learning model.

For example, when the number or proportion of inaccurate results produced by the learned data learning model on the evaluation data exceeds a preset threshold, the model evaluation unit 1515 may evaluate that the predetermined criterion is not satisfied. For example, if the predetermined criterion is defined as 2% and the learned data learning model outputs erroneous recognition results for more than 20 of 1000 pieces of evaluation data, the model evaluation unit 1515 may evaluate that the learned data learning model is not suitable.
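The numerical rule in this example can be illustrated with a short check; the function name and the exact handling of the 2% boundary are assumptions.

```python
def passes_evaluation(num_errors, num_eval_samples, criterion=0.02):
    # Hypothetical helper: the model is judged suitable only if its error ratio
    # does not exceed the criterion (2% by default).
    return (num_errors / num_eval_samples) <= criterion

print(passes_evaluation(20, 1000))   # True: exactly at the 2% boundary
print(passes_evaluation(21, 1000))   # False: more than 20 of 1000 results are erroneous
```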

On the other hand, when there are a plurality of learned data learning models, the model evaluation unit 1515 may evaluate whether each learned data learning model satisfies the predetermined criterion, and may determine a model satisfying the predetermined criterion as the final data learning model. In this case, when there are a plurality of models satisfying the predetermined criterion, the model evaluation unit 1515 may determine one model, or a preset number of models, in descending order of evaluation score as the final data learning model.

Meanwhile, at least one of the data acquisition unit 1511, the preprocessing unit 1512, the learning data selection unit 1513, the model learning unit 1514, and the model evaluation unit 1515 in the data learning unit 1510 may be manufactured in the form of at least one hardware chip and mounted on the electronic device. For example, at least one of the data acquisition unit 1511, the preprocessing unit 1512, the learning data selection unit 1513, the model learning unit 1514, and the model evaluation unit 1515 may be manufactured as a dedicated hardware chip for artificial intelligence, or may be manufactured as a part of an existing general-purpose processor (for example, a central processing unit or an application processor) or a graphics-dedicated processor (for example, a graphics processor) to be mounted on the various electronic devices described above.

The data acquisition unit 1511, the preprocessing unit 1512, the learning data selection unit 1513, the model learning unit 1514, and the model evaluation unit 1515 may be mounted in one electronic device, or may be mounted in a plurality of separate electronic devices. For example, some of the data acquisition unit 1511, the preprocessing unit 1512, the learning data selection unit 1513, the model learning unit 1514, and the model evaluation unit 1515 may be mounted on the electronic device, and the rest may be mounted on the server.

At least one of the data acquisition unit 1511, the preprocessing unit 1512, the learning data selection unit 1513, the model learning unit 1514, and the model evaluation unit 1515 may be embodied as a software module. When at least one of them is embodied as a software module (or a program module containing instructions), the software module may be stored in a non-transitory computer-readable recording medium. In this case, at least one software module may be provided by an operating system or by a predetermined application program. Alternatively, a portion of the at least one software module may be provided by the operating system and the remaining portion may be provided by the predetermined application program.

The data identification unit 1520 according to an embodiment of the present disclosure may include a data acquisition unit 1521, a preprocessing unit 1522, an identification data selection unit 1523, an identification result providing unit 1524, and a model update unit 1525.

The data acquisition unit 1521 may acquire the text necessary for outputting speech; conversely, it may acquire the speech necessary for outputting text. The preprocessing unit 1522 can preprocess the acquired data so that it can be used for outputting speech or text. The preprocessing unit 1522 may process the acquired data into a predetermined format so that the recognition result providing unit 1524 described later can use it for outputting the speech or the text.

The recognition data selection unit 1523 may select, from the preprocessed data, the data required for outputting speech or text. The selected data may be supplied to the recognition result providing unit 1524. The recognition data selection unit 1523 may select a part or all of the preprocessed data according to a preset criterion for outputting speech or text, and may also select data according to a criterion preset by the learning of the model learning unit 1514.

The recognition result providing unit 1524 may output speech or text by applying the selected data to the data learning model. The recognition result providing unit 1524 may use the data selected by the recognition data selection unit 1523 as an input value and apply it to the data learning model, and the recognition result may be determined by the data learning model.

The model updating unit 1525 can update the data learning model based on the evaluation of the recognition result supplied from the recognition result supplying unit 1524. For example, the model updating unit 1525 may cause the model learning unit 1514 to update the data learning model by supplying the recognition result supplied from the recognition result supply unit 1524 to the model learning unit 1514.

On the other hand, at least one of the data acquisition unit 1521, the preprocessing unit 1522, the identification data selection unit 1523, the identification result providing unit 1524, and the model update unit 1525 in the data identification unit 1520 may be manufactured in the form of at least one hardware chip and mounted on the electronic device. For example, at least one of the data acquisition unit 1521, the preprocessing unit 1522, the recognition data selection unit 1523, the recognition result providing unit 1524, and the model updating unit 1525 may be manufactured as a dedicated hardware chip for artificial intelligence, or may be manufactured as a part of an existing general-purpose processor (e.g., a central processing unit or an application processor) or a graphics-dedicated processor (e.g., a graphics processor) to be mounted on the various electronic devices described above.

The data acquisition unit 1521, the preprocessing unit 1522, the recognition data selection unit 1523, the recognition result providing unit 1524, and the model update unit 1525 may be mounted on one electronic device, or may be mounted on separate electronic devices. For example, some of the data acquisition unit 1521, the preprocessing unit 1522, the recognition data selection unit 1523, the recognition result providing unit 1524, and the model update unit 1525 may be included in the electronic device, and the rest may be included in a server.

Also, at least one of the data acquisition unit 1521, the preprocessing unit 1522, the recognition data selection unit 1523, the recognition result providing unit 1524, or the model update unit 1525 may be embodied as a software module. When at least one of them is embodied as a software module (or a program module containing instructions), the software module may be stored in a non-transitory computer-readable recording medium. In this case, at least one software module may be provided by an operating system or by a predetermined application program. Alternatively, a portion of the at least one software module may be provided by the operating system and the remaining portion may be provided by the predetermined application program.

The above description has focused on various embodiments. Those skilled in the art to which the invention pertains will appreciate that the invention may be practiced in modified forms without departing from its essential characteristics. The disclosed embodiments are therefore to be considered in all respects as illustrative and not restrictive. The scope of the invention is indicated by the appended claims rather than by the foregoing description, and all differences within the scope and range of equivalents of the claims are to be construed as being included in the present invention.

On the other hand, the embodiments of the present invention can be written as programs executable on a computer, and can be implemented on a general-purpose digital computer that runs such programs using a computer-readable recording medium. The computer-readable recording medium includes magnetic storage media (e.g., read-only memory, floppy disks, hard disks, etc.), optical reading media (e.g., compact disc read-only memories (CD-ROMs), digital versatile discs (DVDs), etc.), and the like.
