Training method of prosodic phrase boundary prediction model and prosodic phrase boundary prediction method

Document number: 513263  Publication date: 2021-05-28  Views: 10  Original language: Chinese

Reading note: This technology, "Training method of prosodic phrase boundary prediction model and prosodic phrase boundary prediction method", was designed and created by 江源, 窦云峰, and 凌震华 on 2021-01-25. Main content: The invention discloses a prosodic phrase boundary prediction model training method and a prosodic phrase boundary prediction method. The training method comprises: acquiring a training text set, wherein each training text in the set comprises training texts of at least two similar language order languages; acquiring text features of each training text, wherein the text features comprise the word face, part of speech, word length, affix, pause probability, word vector, and language flag bit of each word in the training text; and training an initial prosodic phrase boundary prediction model with the text features of each training text and the labels of the training texts to obtain a trained prosodic phrase boundary prediction model. With the method provided by the invention, the accuracy of model prediction can be increased, and the naturalness of later speech synthesis can be further improved.

1. A prosodic phrase boundary prediction model training method comprises the following steps:

acquiring a training text set, wherein each training text in the training text set comprises training texts of at least two similar language order languages;

acquiring text characteristics of each training text, wherein the text characteristics comprise word face, part of speech, word length, affix, pause probability, word vector and language flag of each word in the training text;

training an initial prosodic phrase boundary prediction model by using the text features of each training text and the labels of the training texts to obtain a trained prosodic phrase boundary prediction model, wherein the labels of the training texts are used for representing the pause state of each word in the training texts.

2. The training method of claim 1, wherein the prosodic phrase boundary prediction model comprises a dimension reduction feature model and a DNN network, wherein the dimension reduction feature model is used for performing dimension reduction processing on the training text to obtain a high-order feature vector, and the DNN network is used for outputting a pause state of each word in the training text.

3. The training method of claim 1, wherein training an initial prosodic phrase boundary prediction model using the text features of the training text and the labels of the training text, and obtaining the trained prosodic phrase boundary prediction model comprises:

training an initial dimension reduction feature model by using the text features of the training text to obtain a dimension reduction feature model obtained through training;

inputting the text features of the training text into the dimension reduction feature model, and outputting a high-order feature vector of the training text;

training an initial DNN network by utilizing the high-order characteristic vector of the training text and the label of the training text to obtain a DNN network obtained through training;

and combining the dimension reduction feature model with the DNN to obtain the prosodic phrase boundary prediction model.

4. The training method of claim 3, wherein training an initial dimension-reduced feature model using the text features of the training text, and obtaining the trained dimension-reduced feature model comprises:

inputting the text features of the training text into the initial dimension reduction feature model;

and adjusting the network weight of the initial dimension reduction feature model through an error back-propagation algorithm so that the output-layer node values of the initial dimension reduction feature model approach the input-layer node values, the trained dimension reduction feature model being obtained when the difference between the output-layer node values and the input-layer node values meets a preset condition.

5. The training method of claim 3, wherein training an initial DNN network using higher-order feature vectors of the training text and labels of the training text, resulting in a trained DNN network comprises:

inputting the high-order feature vector of the training text and the label of the training text into the initial DNN network, and outputting the pause state of each word in the training text;

and calculating a cross entropy loss value between the label of the training text and the pause state of each word in the training text, and obtaining the DNN network obtained through training when the cross entropy loss value meets a preset condition.

6. The training method of claim 1, the obtaining training text comprising:

and acquiring the training text through voice data.

7. The training method of claim 1, the probability of a pause for each word being:

wherein N represents the total number of prosodic phrases in the training text; n(x) represents the number of times the word appears in the prosodic phrases of the training text; and tf(x) represents the frequency of occurrence of the word in the prosodic phrases of the training text.

8. A method for prosodic phrase boundary prediction using the prosodic phrase boundary prediction model of any one of claims 1-7, comprising:

acquiring predicted text data, wherein the predicted text data comprises predicted text data of at least two similar language order languages;

processing the predicted text data to obtain text features of the predicted text data, wherein the text features comprise the word face, the part of speech, the word length, the affix, the pause probability, the word vector and the language flag of each word in the predicted text data;

inputting the text features of the predicted text data into the prosodic phrase boundary prediction model, and outputting the pause state of each word in the predicted text data;

and acquiring prosodic phrase boundaries according to the pause state of each word in the predicted text data.

9. A prosodic phrase boundary prediction model training device, comprising:

the first acquisition module is used for acquiring a training text, wherein the training text comprises training texts of at least two similar language order languages;

the second obtaining module is used for obtaining text features of the training text, wherein the text features comprise word faces, word properties, word lengths, affixes, pause probabilities, word vectors and language flags of each word in the training text;

and the training module is used for training an initial prosodic phrase boundary prediction model by using the text features of the training text and the labels of the training text to obtain a trained prosodic phrase boundary prediction model, wherein the labels of the training text are used for representing the pause state of each word in the training text.

10. An electronic device, comprising:

one or more processors;

a memory for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.

Technical Field

The invention belongs to the technical field of speech synthesis, and mainly relates to a prosodic phrase boundary prediction model training method, a prosodic phrase boundary prediction model training device, and an electronic device.

Background

In speech synthesis, prosody prediction on text data has always been an important task in front-end text processing; the correctness of the predicted positions directly affects the naturalness of the synthesized speech and the comprehension of its semantic information. It is therefore important to predict the prosodic boundaries of text data correctly. Conventional prosodic phrase prediction methods generally build a model for a single language, such as a conditional random field (CRF) or maximum entropy (ME) model; after training, prediction results for prosodic phrase boundaries are obtained from the trained model.

Because prosody prediction is performed on texts of a single language, the trained model targets only that language, and the number of samples is small. A prosodic phrase prediction model for one language has no generality for languages with similar language order. For scarce-language texts, too little data makes an effective model structure difficult to establish. In addition, because the features extracted at the text end are too simple, deeper information in the language text cannot be mined, and model training with a neural network is hard to apply, so the prosodic phrase boundaries of the text cannot be predicted effectively. This strongly affects later speech synthesis: the prediction accuracy of prosodic phrases is low, and the naturalness of the synthesized speech is low.

Disclosure of Invention

Technical problem to be solved

In view of the above, the present invention provides a prosodic phrase boundary prediction model training method, a prosodic phrase boundary prediction model training device, and an electronic device, which can at least partially solve the problems in the prior art.

(II) technical scheme

A prosodic phrase boundary prediction model training method comprises the following steps:

acquiring a training text set, wherein each training text in the training text set comprises training texts of at least two similar language order languages;

acquiring text characteristics of each training text, wherein the text characteristics comprise the word face, the part of speech, the word length, the affix, the pause probability, the word vector and the language flag bit of each word in the training text;

and training the initial prosodic phrase boundary prediction model by using the text features of each training text and the labels of the training texts to obtain a trained prosodic phrase boundary prediction model, wherein the labels of the training texts are used for representing the pause state of each word in the training texts.

According to the embodiment of the invention, the prosodic phrase boundary prediction model comprises a dimension reduction feature model and a DNN (deep neural network), wherein the dimension reduction feature model is used for performing dimension reduction processing on the training text to obtain a high-order feature vector, and the DNN is used for outputting the pause state of each word in the training text.

According to an embodiment of the present invention, training an initial prosodic phrase boundary prediction model using text features of a training text and labels of the training text to obtain a trained prosodic phrase boundary prediction model includes:

training an initial dimension reduction feature model by using the text features of the training text to obtain a dimension reduction feature model obtained through training;

inputting the text features of the training text into a dimension reduction feature model, and outputting a high-order feature vector of the training text;

training an initial DNN network by using the high-order characteristic vector of the training text and the label of the training text to obtain a DNN network obtained through training;

and combining the dimension reduction feature model with the DNN to obtain a prosodic phrase boundary prediction model.

According to the embodiment of the invention, training the initial dimension reduction feature model by using the text features of the training text to obtain the trained dimension reduction feature model comprises the following steps:

inputting the text features of the training text into an initial dimension reduction feature model;

and adjusting the network weight of the initial dimension reduction feature model through an error back-propagation algorithm so that the output-layer node values of the initial dimension reduction feature model approach the input-layer node values, the trained dimension reduction feature model being obtained when the difference between the output-layer node values and the input-layer node values meets a preset condition.

According to an embodiment of the present invention, training an initial DNN network using a higher-order feature vector of a training text and a label of the training text to obtain a trained DNN network includes:

inputting the high-order characteristic vector of the training text and the label of the training text into an initial DNN network, and outputting the pause state of each word in the training text;

and calculating a cross entropy loss value between the label of the training text and the pause state of each word in the training text, and obtaining the DNN network obtained through training when the cross entropy loss value meets a preset condition.

According to an embodiment of the present invention, obtaining the training text includes:

and acquiring a training text through voice data.

According to an embodiment of the present invention, the probability of a pause for each word is:

wherein N represents the total number of prosodic phrases in the training text; n(x) represents the number of times the word appears in the prosodic phrases of the training text; and tf(x) represents the frequency of occurrence of the word in the prosodic phrases of the training text.

A method for performing prosodic phrase boundary prediction by using the prosodic phrase boundary prediction model comprises the following steps:

acquiring predicted text data, wherein the predicted text data comprises predicted text data of at least two similar language order languages;

processing the predicted text data to obtain text characteristics of the predicted text data, wherein the text characteristics comprise word faces, word properties, word lengths, affixes, pause probabilities, word vectors and language flags of each word in the predicted text data;

inputting text characteristics of the predicted text data into a prosodic phrase boundary prediction model, and outputting a pause state of each word in the predicted text data;

and acquiring prosodic phrase boundaries according to the pause state of each word in the predicted text data.

A prosodic phrase boundary prediction model training device, comprising:

the first acquisition module is used for acquiring a training text, wherein the training text comprises training texts of at least two similar language sequence languages;

the second acquisition module is used for acquiring the text characteristics of the training text, wherein the text characteristics comprise the word face, the part of speech, the word length, the affix, the pause probability, the word vector and the language flag bit of each word in the training text;

and the training module is used for training the initial prosodic phrase boundary prediction model by using the text features of the training text and the labels of the training text to obtain the trained prosodic phrase boundary prediction model, wherein the labels of the training text are used for representing the pause state of each word in the training text.

An electronic device, comprising:

one or more processors;

a memory for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the prosodic phrase boundary prediction model training method described above.

(III) advantageous effects

According to the prosodic phrase boundary prediction model training method provided by the embodiment of the invention, the model is trained on harder samples: diversified text features (in particular the pause probability feature) and diversified languages (at least two similar language order languages, versus a single language in the prior art), so that the trained model has better predictive power. This solves the problems of existing training methods, which target a single language only: for scarce-language texts, too little data makes an effective model structure difficult to establish, and the features extracted at the text end are too simple to mine deeper information from the language text, so prosodic phrase boundaries cannot be predicted effectively and the synthesis effect suffers. Furthermore, with the training method provided by the embodiment of the invention, the mixed training of multiple languages aids the collection of training corpora for scarce languages; in addition, the selection of multiple features aids the mining of latent information in the text data and is better suited to neural network model training, increasing prediction accuracy. The accuracy of prosodic phrase prediction, the accuracy of prosodic pauses, and the naturalness of later speech synthesis can thus be improved.

Drawings

FIG. 1 schematically shows a flow chart of a prosodic phrase boundary prediction model training method according to an embodiment of the present invention;

FIG. 2 schematically shows a flow diagram of a method of prosodic phrase boundary prediction according to an embodiment of the present invention;

FIG. 3 schematically shows a block diagram of a prosodic phrase boundary prediction model training apparatus according to an embodiment of the present disclosure; and

FIG. 4 schematically illustrates a block diagram of an electronic device for implementing a prosodic phrase boundary prediction model training method according to an embodiment of the present disclosure.

Detailed Description

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

FIG. 1 schematically shows a flowchart of a prosodic phrase boundary prediction model training method according to an embodiment of the present invention. As shown in fig. 1, the prosodic phrase boundary prediction model training method provided by the embodiment of the present invention includes operations S201 to S203.

In operation S201, a training text set is obtained, where each training text in the training text set includes training texts in at least two similar language order languages. Similar language order languages are languages with the same sentence patterns and sentence structures. For example, Kazakh and Mongolian are two languages with similar language order: both are written with Cyrillic letters, have similar word order, and share some vocabulary.

According to the embodiment of the invention, the training text set can be obtained directly as training text in text format, or the training text can be obtained through voice data.

When training text in text format is obtained directly, at least two similar language order languages to be synthesized are first determined, and text data of those languages is collected as the training text; the text can be downloaded online or designed by oneself, provided it contains characters of every language involved. The prosodic phrase boundary positions of the training text are then labeled manually, yielding the labels of the training text, which represent the pause state of each word in the training text.

When the training text is obtained through voice data, the voice data is first recognized, through analysis of the audio's acoustic signals with the Kaldi tool, into text-format training text with prosodic phrase boundaries; the automatically generated boundaries are then slightly corrected by hand to obtain standard prosodic phrase boundaries, and thereby the labels of the training text.

In operation S202, text features of each training text are obtained, where the text features include a word face, a part of speech, a word length, an affix, a pause probability, a word vector, and a language flag of each word in the training text.

In this operation, feature analysis is performed on the text of each manually labeled training text to obtain text features. After the training text is segmented into words, the corresponding text features are extracted with the word as the minimum unit. The pause probability characterizes the pause state of a word: in different sentences, a word may or may not be preceded or followed by a pause duration; within a compound word no pause is allowed; elsewhere, whether a pause occurs depends on the prosodic requirements of the whole sentence. In addition, because languages with similar language order are trained together, a language flag bit must be set to distinguish them; it indicates which language a given training text belongs to. For agglutinative languages, the affix information of the language is extracted as a text feature, since each affix of an agglutinative language expresses only one meaning or serves only one grammatical function. Affixes divide into prefixes, infixes, and suffixes, with prefixes and suffixes being the most common; the embodiment of the invention uses suffix information as the text feature.
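The per-word feature set described above can be sketched as a simple record. The field names, types, and the helper below are illustrative assumptions, not part of the invention, and the word vector is omitted for brevity:

```python
from dataclasses import dataclass

@dataclass
class WordFeatures:
    lang_flag: int     # language flag bit distinguishing the mixed languages
    surface: str       # word face (surface form)
    suffix: str        # affix information (suffix, for agglutinative languages)
    pos: str           # part of speech
    length: int        # word length in characters
    pause_prob: float  # pause probability of the word

def extract_features(word: str, pos: str, lang_flag: int,
                     pause_prob: float, suffix: str = "") -> WordFeatures:
    """Assemble the per-word feature record described in the text."""
    return WordFeatures(lang_flag=lang_flag, surface=word, suffix=suffix,
                        pos=pos, length=len(word), pause_prob=pause_prob)
```

One such record per word, concatenated with the word vector, would form the input to the dimension reduction feature model.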

The calculation method of the pause probability of each word comprises the following steps:

wherein N represents the total number of prosodic phrases in the training text; n(x) represents the number of times the word appears in the prosodic phrases of the training text; and tf(x) represents the frequency of occurrence of the word in the prosodic phrases of the training text.
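The formula itself is not reproduced in this text (it appears as an image in the original document). A minimal sketch under the assumption, implied by the variable definitions above, that tf(x) is the relative frequency n(x)/N:

```python
def pause_probability(word: str, prosodic_phrases: list[list[str]]) -> float:
    """tf(x) = n(x) / N, where N is the total number of prosodic phrases
    in the training text and n(x) is the number of times the word appears
    in those phrases (an assumed reading of the claim-7 definitions)."""
    N = len(prosodic_phrases)                                      # N
    n_x = sum(phrase.count(word) for phrase in prosodic_phrases)   # n(x)
    return n_x / N if N else 0.0                                   # tf(x)
```

In practice this statistic would be computed once over the whole labeled corpus and looked up per word.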

According to the embodiment of the invention, during model training with text features, the mispredicted positions are analyzed: some prediction errors occur between the words immediately before and after a correct prosodic phrase boundary, and some regular prosodic phrase boundaries can be judged from the known prosodic phrase rules of each language, for example: some words require (or forbid) a pause on their left, some require (or forbid) a pause on their right, and some word phrases admit no internal pause.

In operation S203, an initial prosodic phrase boundary prediction model is trained using the text features of each training text and the labels of the training texts, so as to obtain a trained prosodic phrase boundary prediction model.

According to the embodiment of the invention, the prosodic phrase boundary prediction model comprises a dimension reduction feature model and a DNN (deep neural network), wherein the dimension reduction feature model is used for performing dimension reduction processing on the training text to obtain a high-order feature vector, and the DNN is used for outputting the pause state of each word in the training text.

According to the embodiment of the invention, the dimension reduction feature model uses an autoencoder as the front end of the prosodic phrase boundary prediction model's network structure: the autoencoder performs dimension reduction and information fusion on the text features to obtain a high-order feature vector, and a DNN network is connected after it. The autoencoder has stable expressive power for data dimension reduction, can represent the input data distributively, and is strong at extracting essential features from data, so a more abstract feature representation is obtained; at the same time, text feature information of different languages can reference each other, enriching the representation.

The specific training process is as follows:

1. training an initial dimension reduction feature model by using the text features of the training text to obtain a dimension reduction feature model obtained through training;

2. inputting the text features of the training text into a dimension reduction feature model, and outputting a high-order feature vector of the training text;

3. training an initial DNN network by using the high-order characteristic vector of the training text and the label of the training text to obtain a DNN network obtained through training;

4. combining the dimension reduction feature model with the DNN network to obtain the prosodic phrase boundary prediction model.

In the training process, according to an embodiment of the present invention, the training of the initial dimension reduction feature model by using the text features of the training text includes:

inputting the text features of the training text into an initial dimension reduction feature model;

and adjusting the network weight of the initial dimension reduction feature model through an error back-propagation algorithm so that the output-layer node values of the initial dimension reduction feature model approach the input-layer node values; when the difference between the output-layer node values and the input-layer node values meets a preset condition, the trained dimension reduction feature model is obtained, including the optimal combination of the number of layers in the autoencoder and the number of nodes per layer.
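The training loop just described, autoencoder weights adjusted by error back-propagation until the output/input difference meets a preset condition, can be sketched with a minimal single-hidden-layer autoencoder in NumPy. The network sizes, learning rate, and stopping threshold below are illustrative assumptions, not values from the invention:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, hidden_dim=3, lr=0.05, tol=1e-3, max_iter=5000):
    """Single-hidden-layer autoencoder trained by error back-propagation;
    training stops once the mean squared difference between output-layer
    and input-layer node values falls below the preset threshold `tol`."""
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, hidden_dim))   # encoder weights
    W2 = rng.normal(scale=0.1, size=(hidden_dim, d))   # decoder weights
    loss = float("inf")
    for _ in range(max_iter):
        H = np.tanh(X @ W1)   # hidden layer: the reduced, high-order features
        Y = H @ W2            # output layer: reconstruction of the input
        err = Y - X
        loss = float(np.mean(err ** 2))
        if loss < tol:        # preset condition on output/input difference
            break
        # back-propagate the reconstruction error
        # (constant factors are folded into the learning rate)
        gW2 = H.T @ err / n
        gH = (err @ W2.T) * (1.0 - H ** 2)
        gW1 = X.T @ gH / n
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2, loss
```

A trained encoder then yields the high-order feature vector as `np.tanh(X @ W1)`, which feeds the DNN back end.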

In the training process, according to an embodiment of the present invention, training an initial DNN network using a higher-order feature vector of a training text and a label of the training text, and obtaining a trained DNN network includes:

inputting the high-order characteristic vector of the training text and the label of the training text into an initial DNN network, and outputting the pause state of each word in the training text;

and calculating a cross entropy loss value between the label of the training text and the pause state of each word in the training text, and obtaining the DNN network obtained through training when the cross entropy loss value meets a preset condition.
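The cross-entropy stopping criterion can be sketched as follows, assuming one-hot pause-state labels and a normalized probability output from the DNN (both assumptions, since the text does not fix the output encoding):

```python
import numpy as np

def cross_entropy(labels: np.ndarray, probs: np.ndarray,
                  eps: float = 1e-12) -> float:
    """Mean cross-entropy between one-hot pause-state labels and the
    DNN's predicted pause-state distribution, one row per word."""
    probs = np.clip(probs, eps, 1.0)   # guard against log(0)
    return float(-np.mean(np.sum(labels * np.log(probs), axis=1)))
```

Training would stop once this value meets the preset condition.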

According to the prosodic phrase boundary prediction model training method provided by the embodiment of the invention, the model is trained on harder samples: diversified text features (in particular the pause probability feature) and diversified languages (at least two similar language order languages, versus a single language in the prior art), so that the trained model has better predictive power. This solves the problems of existing training methods, which target a single language only: for scarce-language texts, too little data makes an effective model structure difficult to establish, and the features extracted at the text end are too simple to mine deeper information from the language text, so prosodic phrase boundaries cannot be predicted effectively and the synthesis effect suffers. Furthermore, with the training method provided by the embodiment of the invention, the mixed training of multiple languages aids the collection of training corpora for scarce languages; in addition, the selection of multiple features aids the mining of latent information in the text data and is better suited to neural network model training, increasing prediction accuracy. The accuracy of prosodic phrase prediction, the accuracy of prosodic pauses, and the naturalness of later speech synthesis can thus be improved.

An embodiment of the present invention further provides a method for prosodic phrase boundary prediction by using the trained prosodic phrase boundary prediction model, and fig. 2 schematically shows a flowchart of the method for prosodic phrase boundary prediction according to an embodiment of the present invention, as shown in fig. 2, the method includes operations S201 to S204.

In operation S201, predicted text data is obtained, where the predicted text data includes predicted text data of at least two similar language order languages.

In operation S202, the predicted text data is processed to obtain text features of the predicted text data, where the text features include a word face, a part of speech, a word length, an affix, a pause probability, a word vector, and a language flag of each word in the predicted text data.

In operation S203, text features of the predicted text data are input into the prosodic phrase boundary prediction model, and a pause state of each word in the predicted text data is output.

In operation S204, a prosodic phrase boundary is acquired according to a pause state of each word in the predicted text data.

The following exemplifies the method of performing prosodic phrase boundary prediction with the trained prosodic phrase boundary prediction model:

First, predicted text data is obtained, wherein the predicted text data comprises predicted text data of at least two similar language order languages. The predicted text data can be downloaded from the internet or designed by oneself, and the prosodic phrase boundary positions of the predicted text data are labeled manually. For example, the text of one language in the data is Mongolian:

Aмралт тзргзн зогсоол та чигYYрззр очмоор байнуу утас авж YзхYY,

the result after manual labeling is as follows:

Aмралт#тзргзн зогсоол#та чигYYрззр очмоор байнуу утас авж YзхYY,

wherein "#" is a prosodic phrase boundary. Alternatively, the audio corresponding to the predicted text data can be downloaded, the corresponding predicted text recognized automatically with the Kaldi tool through analysis of the audio's acoustic signal, and the prosodic phrase boundaries then labeled on the predicted text data. Predicted text data for the other similar language order languages is obtained in the same way.

Then, text features are extracted from the predicted text data. Taking the Mongolian text above as the data of one language, the text feature extraction result is as follows:

where the first column is the language flag bit, the second column the word face, the third column the suffix (for a word without a suffix, the letter combination after the word's last vowel is used), the fourth column the part of speech, the fifth column the word length, and the sixth column the word's pause probability, with 0 indicating a forbidden pause and 1 a required pause; the pause probability is calculated from rules and statistics over the whole text data.

Then, inputting the text features of the obtained predicted text data into a prosodic phrase boundary prediction model, and outputting the pause state of each word in the predicted text data, wherein the output result is as follows:

Aмралт/1 тзргзн/0 зогсоол/1 та/0 чигYYрззр/0 очмоор/0 байнуу/0 утас авж/0 YзхYY/0,

wherein 0 indicates no pause after the word and 1 indicates a pause.

Finally, prosodic phrase boundaries are obtained according to the pause state of each word in the predicted text data. The result is as follows:

Aмралт#тзргзн зогсоол#та чигYYрззр очмоор байнуу утас авж YзхYY.
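The last two steps above, turning the model's word/flag output into "#"-marked text, can be sketched as follows (the function name and the handling of untagged tokens are assumptions, not from the source):

```python
def insert_boundaries(tagged: str) -> str:
    """Rebuild a '#'-marked sentence from 'word/flag' model output:
    flag 1 places a prosodic phrase boundary after the word,
    flag 0 places an ordinary space."""
    words, seps = [], []
    for token in tagged.split():
        if "/" in token:
            word, _, flag = token.rpartition("/")
        else:
            word, flag = token, "0"  # untagged token: assume no pause
        words.append(word)
        seps.append("#" if flag == "1" else " ")
    if not words:
        return ""
    seps[-1] = ""  # no separator after the final word
    return "".join(w + s for w, s in zip(words, seps))
```

Applied to the model output above, this reproduces the boundary-marked sentence.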

FIG. 3 schematically shows a block diagram of a prosodic phrase boundary prediction model training apparatus according to an embodiment of the present disclosure.

The prosodic phrase boundary prediction model training device 300 may be used to implement the method described with reference to fig. 1.

As shown in fig. 3, the prosodic phrase boundary prediction model training device 300 includes: a first acquisition module 310, a second acquisition module 320, and a training module 330.

The first obtaining module 310 is configured to obtain a training text, where the training text includes training texts in at least two similar language order languages.

The second obtaining module 320 is configured to obtain text features of the training text, where the text features include a word face, a part of speech, a word length, an affix, a pause probability, a word vector, and a language flag of each word in the training text.

The training module 330 is configured to train an initial prosodic phrase boundary prediction model by using the text features of the training text and the labels of the training text, to obtain a trained prosodic phrase boundary prediction model, where the labels of the training text represent the pause state of each word in the training text.
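The train/predict interface implied by modules 310-330 can be sketched with a toy model; the frequency-table "model" below is a deliberately simplified stand-in (the actual model is a trained sequence labeler), and all names are illustrative:

```python
from collections import defaultdict

class PauseFrequencyModel:
    """Simplified stand-in for the prosodic phrase boundary prediction
    model: estimates each word's pause tendency from labeled samples."""

    def __init__(self):
        # word -> [number of pauses observed, total occurrences]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, samples):
        """samples: iterable of (word, pause_flag) pairs from labeled text."""
        for word, flag in samples:
            entry = self.counts[word]
            entry[0] += flag
            entry[1] += 1

    def predict(self, word):
        """Return 1 (pause) if the word paused in at least half of its
        training occurrences, else 0 (no pause)."""
        pauses, total = self.counts.get(word, (0, 0))
        return 1 if total and pauses / total >= 0.5 else 0
```

A real implementation would replace the frequency table with a sequence model that consumes the full feature vector (word face, part of speech, word length, affix, pause probability, word vector, and language flag) rather than the word alone.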

Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.

For example, any plurality of the first obtaining module 310, the second obtaining module 320, and the training module 330 may be combined and implemented in one module/unit/sub-unit, or any one of the modules/units/sub-units may be split into a plurality of modules/units/sub-units. Alternatively, at least part of the functionality of one or more of these modules/units/sub-units may be combined with at least part of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to an embodiment of the present disclosure, at least one of the first obtaining module 310, the second obtaining module 320, and the training module 330 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or may be implemented by any one of three implementations of software, hardware, and firmware, or any suitable combination of any of the three. Alternatively, at least one of the first acquisition module 310, the second acquisition module 320, the training module 330 may be at least partially implemented as a computer program module, which when executed, may perform a corresponding function.

It should be noted that, in the embodiment of the present disclosure, the prosodic phrase boundary prediction model training device part corresponds to the prosodic phrase boundary prediction model training method part in the embodiment of the present disclosure, and the description of the prosodic phrase boundary prediction model training device part specifically refers to the prosodic phrase boundary prediction model training method part, which is not described herein again.

An embodiment of the present disclosure also provides an electronic device, including: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the prosodic phrase boundary prediction model training method.

FIG. 4 schematically illustrates a block diagram of an electronic device for implementing a prosodic phrase boundary prediction model training method according to an embodiment of the present disclosure. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 4, an electronic device 400 according to an embodiment of the present disclosure includes a processor 401 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. Processor 401 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 401 may also include onboard memory for caching purposes. Processor 401 may include a single processing unit or multiple processing units for performing the different actions of the method flows in accordance with embodiments of the present disclosure.

In the RAM 403, various programs and data necessary for the operation of the electronic apparatus 400 are stored. The processor 401, ROM 402 and RAM 403 are connected to each other by a bus 404. The processor 401 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 402 and/or the RAM 403. Note that the programs may also be stored in one or more memories other than the ROM 402 and RAM 403. The processor 401 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.

According to an embodiment of the present disclosure, the electronic device 400 may also include an input/output (I/O) interface 405, the input/output (I/O) interface 405 also being connected to the bus 404. The electronic device 400 may also include one or more of the following components connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read therefrom is installed into the storage section 408 as necessary.

According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The computer program, when executed by the processor 401, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

The present disclosure also provides a computer-readable storage medium, which may be included in the device/apparatus/system described in the above embodiments, or may exist separately without being assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

For example, according to embodiments of the present disclosure, a computer-readable storage medium may include ROM 402 and/or RAM 403 and/or one or more memories other than ROM 402 and RAM 403 described above.

The above embodiments are intended to further illustrate the objects, technical solutions, and advantages of the present invention. It should be understood that the above embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention; any modifications, equivalent replacements, improvements, and the like made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.
