Training set generation method and device, electronic equipment and computer readable medium


Reading note: This technology, "Training set generation method and device, electronic equipment and computer readable medium", was created by 宋伟 (Song Wei) and 张政臣 (Zhang Zhengchen) on 2021-01-20. Its main content is as follows: Embodiments of the present disclosure disclose a training set generation method and apparatus, an electronic device, and a computer readable medium. One specific implementation of the training set generation method includes: acquiring a data set, wherein the data set comprises a text set and a speech set associated with the text set; determining whether text exists in the text set that is not aligned with corresponding speech in the speech set; in response to such text existing, removing from the data set at least one data item whose text and corresponding speech are not aligned, and taking the removed data as a target data set; and determining a training set for a text-speech model according to the target data set. This implementation can accurately and effectively determine the training set by identifying at least one data item in which the text and the corresponding speech are not aligned.

1. A training set generation method, comprising:

obtaining a data set, wherein the data set comprises a text set and a speech set associated with the text set;

determining whether text exists in the text set that is not aligned with corresponding speech in the speech set;

in response to text that is not aligned with its corresponding speech existing in the text set, removing from the data set at least one data item whose text and corresponding speech are not aligned, and taking the removed data as a target data set;

and determining a training set for a text-speech model according to the target data set.

2. The method of claim 1, wherein the determining whether text exists in the text set that is not aligned with corresponding speech in the speech set comprises:

extracting a phoneme sequence corresponding to each text in the text set to obtain a phoneme sequence group;

inputting the phoneme sequence corresponding to each text in the text set, together with the corresponding speech, into a pre-trained hidden Markov model, which outputs information representing whether that text is aligned with the corresponding speech as first information, to obtain a first information set;

generating, according to the phoneme sequence of each text and the corresponding speech, information representing whether that text is aligned with the corresponding speech as second information by using a dynamic time warping algorithm, to obtain a second information set;

and determining, according to the first information set and the second information set, whether text that is not aligned with corresponding speech in the speech set exists in the text set.

3. The method of claim 2, wherein the removing, in response to text that is not aligned with its corresponding speech existing in the text set, of at least one data item whose text and corresponding speech are not aligned from the data set, taking the removed data as a target data set, comprises:

in response to text that is not aligned with its corresponding speech existing in the text set, determining, according to the first information set and the second information set, at least one text in the text set that is not aligned with its corresponding speech as a target text set;

removing, from the data set, at least one data item associated with the target text set, to obtain the target data set.

4. The method of claim 3, wherein the determining, in response to text that is not aligned with its corresponding speech existing in the text set, at least one text in the text set that is not aligned with its corresponding speech as a target text set according to the first information set and the second information set comprises:

in response to text that is not aligned with its corresponding speech existing in the text set, selecting, from the first information set, first information representing that a text and its corresponding speech are not aligned as first target information, to obtain a first target information set;

selecting, from the second information set, second information representing that a text and its corresponding speech are not aligned as second target information, to obtain a second target information set;

and determining, as the target text set, at least one text that appears in both the text set corresponding to the first target information set and the text set corresponding to the second target information set.

5. The method of claim 1, wherein the determining a training set for a text-speech model according to the target data set comprises:

receiving a corrected data set transmitted by an associated terminal according to the target data set, wherein the corrected data set is obtained by correcting the target data set;

and replacing the target data set in the data set with the corrected data set, to obtain a replaced data set serving as the training set for the text-speech model.

6. The method of claim 2, wherein the hidden Markov model is trained according to a forced alignment method.

7. A training set generation apparatus, comprising:

an acquisition unit configured to acquire a data set, wherein the data set includes a text set and a speech set associated with the text set;

a first determining unit configured to determine whether there is text in the text set that is not aligned with corresponding speech in the speech set;

a removing unit configured to, in response to text that is not aligned with its corresponding speech existing in the text set, remove from the data set at least one data item whose text and corresponding speech are not aligned, and take the removed data as a target data set;

a second determining unit configured to determine a training set for a text-speech model according to the target data set.

8. The apparatus of claim 7, wherein the first determining unit is further configured to:

extracting a phoneme sequence corresponding to each text in the text set to obtain a phoneme sequence group;

inputting the phoneme sequence corresponding to each text in the text set, together with the corresponding speech, into a pre-trained hidden Markov model, which outputs information representing whether that text is aligned with the corresponding speech as first information, to obtain a first information set;

generating, according to the phoneme sequence of each text and the corresponding speech, information representing whether that text is aligned with the corresponding speech as second information by using a dynamic time warping algorithm, to obtain a second information set;

and determining, according to the first information set and the second information set, whether text that is not aligned with corresponding speech in the speech set exists in the text set.

9. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-6.

10. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-6.

Technical Field

Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a training set generation method and apparatus, an electronic device, and a computer readable medium.

Background

With the widespread application of artificial intelligence, more and more scenarios use Text-To-Speech (TTS) technology to enhance the interactivity of artificial intelligence applications. TTS technology has advanced greatly, and synthesized speech is now very close to a human voice, but training a TTS model requires a large amount of studio recording data (usually 10 to 20 hours of professional recordings).

Studio recordings are labeled manually, so labeling errors may occur, that is, cases in which the text and the speech do not correspond to each other. Such incorrect audio-text labels have a strong negative impact on model training. At present, erroneous audio-text pairs are usually found by manual verification, which leads to the following technical problem: manual labeling is costly and slow, resulting in low efficiency.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Some embodiments of the present disclosure propose a training set generation method and apparatus, a device, and a computer readable medium to solve the technical problem mentioned in the background section above.

In a first aspect, some embodiments of the present disclosure provide a training set generation method, including: acquiring a data set, wherein the data set includes a text set and a speech set associated with the text set; determining whether text exists in the text set that is not aligned with corresponding speech in the speech set; in response to such text existing, removing from the data set at least one data item whose text and corresponding speech are not aligned, and taking the removed data as a target data set; and determining a training set for a text-speech model according to the target data set.

Optionally, the determining whether text exists in the text set that is not aligned with corresponding speech in the speech set includes: extracting a phoneme sequence corresponding to each text in the text set to obtain a phoneme sequence group; inputting the phoneme sequence corresponding to each text, together with the corresponding speech, into a pre-trained hidden Markov model, which outputs information representing whether that text is aligned with the corresponding speech as first information, to obtain a first information set; generating, according to the phoneme sequence of each text and the corresponding speech, information representing whether that text is aligned with the corresponding speech as second information by using a dynamic time warping algorithm, to obtain a second information set; and determining, according to the first information set and the second information set, whether text that is not aligned with corresponding speech in the speech set exists in the text set.

Optionally, the removing, in response to text that is not aligned with its corresponding speech existing in the text set, of at least one data item whose text and corresponding speech are not aligned from the data set, taking the removed data as a target data set, includes: in response to such text existing, determining, according to the first information set and the second information set, at least one text in the text set that is not aligned with its corresponding speech as a target text set; and removing, from the data set, at least one data item associated with the target text set, to obtain the target data set.

Optionally, the determining, in response to text that is not aligned with its corresponding speech existing in the text set, at least one text in the text set that is not aligned with its corresponding speech as a target text set according to the first information set and the second information set includes: in response to such text existing, selecting, from the first information set, first information representing that a text and its corresponding speech are not aligned as first target information, to obtain a first target information set; selecting, from the second information set, second information representing that a text and its corresponding speech are not aligned as second target information, to obtain a second target information set; and determining, as the target text set, at least one text that appears in both the text set corresponding to the first target information set and the text set corresponding to the second target information set.

Optionally, the determining a training set for a text-speech model according to the target data set includes: receiving a corrected data set transmitted by an associated terminal according to the target data set, wherein the corrected data set is obtained by correcting the target data set; and replacing the target data set in the data set with the corrected data set, to obtain a replaced data set serving as the training set for the text-speech model.

Optionally, the hidden Markov model is trained according to a forced alignment method.

In a second aspect, some embodiments of the present disclosure provide a training set generation apparatus, the apparatus including: an acquisition unit configured to acquire a data set, wherein the data set includes a text set and a speech set associated with the text set; a first determining unit configured to determine whether there is text in the text set that is not aligned with corresponding speech in the speech set; a removing unit configured to, in response to text that is not aligned with its corresponding speech existing in the text set, remove from the data set at least one data item whose text and corresponding speech are not aligned, and take the removed data as a target data set; and a second determining unit configured to determine a training set for a text-speech model according to the target data set.

Optionally, the first determining unit is further configured to: extract a phoneme sequence corresponding to each text in the text set to obtain a phoneme sequence group; input the phoneme sequence corresponding to each text, together with the corresponding speech, into a pre-trained hidden Markov model, which outputs information representing whether that text is aligned with the corresponding speech as first information, to obtain a first information set; generate, according to the phoneme sequence of each text and the corresponding speech, information representing whether that text is aligned with the corresponding speech as second information by using a dynamic time warping algorithm, to obtain a second information set; and determine, according to the first information set and the second information set, whether text that is not aligned with corresponding speech in the speech set exists in the text set.

Optionally, the removing unit is further configured to: in response to text that is not aligned with its corresponding speech existing in the text set, determine, according to the first information set and the second information set, at least one text in the text set that is not aligned with its corresponding speech as a target text set; and remove, from the data set, at least one data item associated with the target text set, to obtain the target data set.

Optionally, the removing unit is further configured to: in response to text that is not aligned with its corresponding speech existing in the text set, select, from the first information set, first information representing that a text and its corresponding speech are not aligned as first target information, to obtain a first target information set; select, from the second information set, second information representing that a text and its corresponding speech are not aligned as second target information, to obtain a second target information set; and determine, as the target text set, at least one text that appears in both the text set corresponding to the first target information set and the text set corresponding to the second target information set.

Optionally, the second determining unit is further configured to: receive a corrected data set transmitted by an associated terminal according to the target data set, wherein the corrected data set is obtained by correcting the target data set; and replace the target data set in the data set with the corrected data set, to obtain a replaced data set serving as the training set for the text-speech model.

Optionally, the hidden Markov model is trained according to a forced alignment method.

In a third aspect, some embodiments of the present disclosure provide an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation of the first aspect.

In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method as described in any implementation of the first aspect.

The above embodiments of the present disclosure have the following beneficial effects: with the training set generation method of some embodiments of the present disclosure, the training set can be accurately and effectively determined by identifying at least one data item in which the text and the corresponding speech are not aligned. Specifically, manual labeling is costly and slow, resulting in low efficiency. Based on this, the training set generation method of some embodiments of the present disclosure first acquires a data set, where the data set includes a text set and a speech set associated with the text set. Then, whether text that is not aligned with corresponding speech in the speech set exists in the text set is determined, for use in subsequently determining the training set. In response to such text existing, at least one data item whose text and corresponding speech are not aligned is removed from the data set, and the removed data is taken as a target data set. Finally, a training set for a text-speech model is determined according to the target data set. For example, the data set with the target data set excluded may be determined as the training set. In this way, the training set generation method can accurately and effectively determine the training set by identifying at least one data item whose text and corresponding speech are not aligned.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.

FIG. 1 is a schematic diagram of an application scenario of a training set generation method of some embodiments of the present disclosure;

FIG. 2 is a flow diagram of some embodiments of a training set generation method according to the present disclosure;

FIG. 3 is a flow diagram of further embodiments of a training set generation method according to the present disclosure;

FIG. 4 is a schematic illustration of text to phoneme conversion for a training set generation method according to some embodiments of the present disclosure;

FIG. 5 is a schematic block diagram of some embodiments of a training set generation apparatus according to the present disclosure;

FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be noted that, for ease of description, only the portions related to the relevant invention are shown in the drawings. The embodiments and the features of the embodiments in the present disclosure may be combined with each other in the absence of conflict.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It should be noted that the modifiers "a", "an", and "the" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 is a schematic diagram of an application scenario of a training set generation method according to some embodiments of the present disclosure.

As shown in fig. 1, an electronic device 101 may first acquire a data set 102, where the data set 102 includes a text set and a speech set associated with the text set. In this application scenario, the data set 102 may include first data 1021, second data 1022, third data 1023, and fourth data 1024. The first data 1021 includes a first text and first speech corresponding to the first text. The second data 1022 includes a second text and second speech corresponding to the second text. The third data 1023 includes a third text and third speech corresponding to the third text. The fourth data 1024 includes a fourth text and fourth speech corresponding to the fourth text. Then, the electronic device 101 determines whether text that is not aligned with corresponding speech in the speech set exists in the text set. Further, in response to such text existing, at least one data item whose text and corresponding speech are not aligned is removed from the data set 102, and the removed data is taken as a target data set 103. In this application scenario, the target data set 103 may include data 1031, identical to the first data 1021, and data 1032, identical to the second data 1022. Finally, a training set 104 for a text-speech model is determined according to the target data set 103. In this application scenario, the training set 104 may include data 1041, identical to the third data 1023, and data 1042, identical to the fourth data 1024.

The electronic device 101 may be hardware or software. When the electronic device is hardware, it may be implemented as a distributed cluster formed by a plurality of servers or terminal devices, or as a single server or a single terminal device. When the electronic device is embodied as software, it may be installed in the hardware devices listed above and may be implemented, for example, as multiple pieces of software or software modules providing distributed services, or as a single piece of software or software module. No specific limitation is made here.

It should be understood that the number of electronic devices in fig. 1 is merely illustrative. There may be any number of electronic devices, as desired for implementation.

With continued reference to fig. 2, a flow 200 of some embodiments of a training set generation method according to the present disclosure is shown. The training set generation method comprises the following steps:

step 201, a data set is acquired.

In some embodiments, the execution body of the training set generation method (e.g., the electronic device 101 shown in fig. 1) may obtain the data set through a wired or wireless connection, where the data set includes a text set and a speech set associated with the text set. Further, the texts in the text set may be in one-to-one correspondence with the speech items in the speech set.
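
The patent does not prescribe a concrete data layout. As a minimal sketch only, the one-to-one text-speech pairing might be represented as follows in Python; the names DataItem and load_data_set and the tab-separated manifest format are illustrative assumptions, not from the source.

from dataclasses import dataclass
from pathlib import Path

@dataclass
class DataItem:
    text: str     # transcript
    speech: Path  # path to the corresponding audio file

def load_data_set(manifest: Path) -> list[DataItem]:
    """Read a 'transcript<TAB>audio_path' manifest into a data set.

    The one-to-one pairing of texts and speech mirrors step 201; the
    manifest format itself is an assumption, not from the source.
    """
    items = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            text, audio = line.split("\t", maxsplit=1)
            items.append(DataItem(text=text, speech=Path(audio)))
    return items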

It should be noted that the wireless connection may include, but is not limited to, a 3G/4G/5G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (Ultra Wideband) connection, and other wireless connections now known or developed in the future.

Step 202, determining whether text that is not aligned with corresponding speech in the speech set exists in the text set.

In some embodiments, the execution body may determine whether text exists in the text set that is not aligned with corresponding speech in the speech set.

As an example, the execution body may receive a manual verification result to determine whether text exists in the text set that is not aligned with corresponding speech in the speech set.

Step 203, in response to text that is not aligned with its corresponding speech existing in the text set, removing from the data set at least one data item whose text and corresponding speech are not aligned, and taking the removed data as a target data set.

In some embodiments, in response to text that is not aligned with its corresponding speech existing in the text set, the execution body may remove from the data set at least one data item whose text and corresponding speech are not aligned, and take the removed data as the target data set.

Step 204, determining a training set for a text-speech model according to the target data set.

In some embodiments, the execution body may determine a training set for a text-speech model according to the target data set. The text-speech model may be a speech-to-text model or a text-to-speech model.

As an example, the execution body may determine the data set with the target data set excluded as the training set for the text-speech model.

In some optional implementations of some embodiments, the determining a training set for a text-speech model according to the target data set may include the following steps, illustrated by the sketch after them:

First, receiving a corrected data set transmitted by an associated terminal according to the target data set, where the corrected data set is obtained by correcting the target data set.

Second, replacing the target data set in the data set with the corrected data set, to obtain a replaced data set serving as the training set for the text-speech model.
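
A minimal sketch of this replacement, assuming the hypothetical DataItem pairing from the earlier sketch and matching items by their audio path (both assumptions, not from the source):

def replace_with_corrections(data_set, target_set, corrected_set):
    """Swap the misaligned target items for their corrected versions.

    data_set, target_set, and corrected_set are lists of the hypothetical
    DataItem from the earlier sketch; matching by audio path assumes each
    speech file appears exactly once in the data set.
    """
    corrections = {item.speech: item for item in corrected_set}
    target_paths = {item.speech for item in target_set}
    training_set = []
    for item in data_set:
        if item.speech in target_paths:
            # Use the corrected text/speech pair sent back by the terminal.
            training_set.append(corrections[item.speech])
        else:
            training_set.append(item)
    return training_set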

The above embodiments of the present disclosure have the following beneficial effects: with the training set generation method of some embodiments of the present disclosure, the training set can be accurately and effectively determined by identifying at least one data item in which the text and the corresponding speech are not aligned. Specifically, manual labeling is costly and slow, resulting in low efficiency. Based on this, the training set generation method of some embodiments of the present disclosure first acquires a data set, where the data set includes a text set and a speech set associated with the text set. Then, whether text that is not aligned with corresponding speech in the speech set exists in the text set is determined, for use in subsequently determining the training set. In response to such text existing, at least one data item whose text and corresponding speech are not aligned is removed from the data set, and the removed data is taken as a target data set. Finally, a training set for a text-speech model is determined according to the target data set. For example, the data set with the target data set excluded may be determined as the training set. In this way, the training set generation method can accurately and effectively determine the training set by identifying at least one data item whose text and corresponding speech are not aligned.

With continued reference to fig. 3, a flow 300 of further embodiments of training set generation methods according to the present disclosure is shown. The training set generation method comprises the following steps:

step 301, a data set is acquired.

Step 302, extracting a phoneme sequence corresponding to each text in the text set to obtain a phoneme sequence group.

In some embodiments, the execution body (e.g., the electronic device 101 shown in fig. 1) may extract a phoneme sequence corresponding to each text in the text set to obtain a phoneme sequence group.

As an example, as shown in fig. 4, the target text 401 may be "普通话" (Mandarin). The phoneme sequence corresponding to the target text 401 is "p, u, t, o, ng, h, u, a".
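
The patent does not specify a grapheme-to-phoneme front end. The sketch below uses the pypinyin package as one plausible stand-in, splitting each syllable into its pinyin initial and final; the exact phoneme inventory, such as the further "ong" -> "o", "ng" split shown in fig. 4, is an assumption.

from pypinyin import Style, lazy_pinyin

def text_to_phonemes(text: str) -> list[str]:
    """Convert Chinese text to a flat phoneme-like sequence.

    Uses pinyin initials and finals as the phoneme inventory; the finer
    split of finals shown in fig. 4 ('ong' -> 'o', 'ng') would need a
    richer lexicon and is omitted here.
    """
    initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, style=Style.FINALS, strict=False)
    phonemes = []
    for initial, final in zip(initials, finals):
        if initial:
            phonemes.append(initial)
        if final:
            phonemes.append(final)
    return phonemes

# e.g. text_to_phonemes("普通话") -> ['p', 'u', 't', 'ong', 'h', 'ua']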

Step 303, inputting the phoneme sequence corresponding to each text in the text set, together with the corresponding speech, into a pre-trained hidden Markov model, so as to output information representing whether that text is aligned with the corresponding speech as first information, to obtain a first information set.

In some embodiments, the execution body may input the phoneme sequence corresponding to each text in the text set, together with the corresponding speech, into a pre-trained Hidden Markov Model (HMM), which outputs information representing whether that text is aligned with the corresponding speech as first information, resulting in the first information set.
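
The patent specifies only that the HMM outputs alignment information. One common realization, assumed here rather than taken from the source, is to run forced alignment and threshold a per-utterance score. In the sketch below, hmm_align is a hypothetical callable standing in for the pre-trained model, and the threshold value is illustrative.

def first_information(phoneme_seqs, speech_files, hmm_align, threshold=-8.0):
    """Return one flag per text: True means the HMM considers it aligned.

    hmm_align is a hypothetical callable (phonemes, audio) -> average
    log-likelihood of the forced alignment; both the scoring interface
    and the threshold are assumptions, not from the source.
    """
    info = []
    for phonemes, audio in zip(phoneme_seqs, speech_files):
        score = hmm_align(phonemes, audio)
        # A low forced-alignment score is read as "text and speech misaligned".
        info.append(score >= threshold)
    return info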

In some optional implementations of some embodiments, the hidden Markov model is trained according to a forced alignment method (MFA).

Step 304, generating, according to the phoneme sequence of each text and the corresponding speech, information representing whether that text is aligned with the corresponding speech as second information by using a dynamic time warping algorithm, to obtain a second information set.

In some embodiments, the execution body may generate, according to the phoneme sequence of each text and the corresponding speech, information representing whether each text in the text set is aligned with the corresponding speech as second information by using Dynamic Time Warping (DTW), so as to obtain the second information set.
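
The DTW recursion itself is standard; the sketch below implements it with NumPy and flags a pair as misaligned when the length-normalized alignment cost exceeds a threshold. How phoneme sequences are mapped to feature vectors (for example, per-phoneme acoustic templates) and the threshold value are assumptions, not given in the source.

import numpy as np

def dtw_cost(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalized DTW cost between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
            # Standard DTW recursion: diagonal match, insertion, deletion.
            acc[i, j] = dist + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return float(acc[n, m]) / (n + m)

def second_information(phoneme_feats, speech_feats, threshold=1.5):
    """True means aligned; feature extraction and threshold are assumptions."""
    return [dtw_cost(p, s) <= threshold for p, s in zip(phoneme_feats, speech_feats)]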

Step 305, determining, according to the first information set and the second information set, whether text that is not aligned with corresponding speech in the speech set exists in the text set.

In some embodiments, the execution body may determine, in various ways according to the first information set and the second information set, whether text that is not aligned with corresponding speech in the speech set exists in the text set.

Step 306, in response to text that is not aligned with its corresponding speech existing in the text set, determining, according to the first information set and the second information set, at least one text in the text set that is not aligned with its corresponding speech as a target text set.

In some embodiments, in response to text that is not aligned with its corresponding speech existing in the text set, the execution body may determine, according to the first information set and the second information set, at least one text in the text set that is not aligned with its corresponding speech as the target text set.

In some optional implementations of some embodiments, the determining, in response to text that is not aligned with its corresponding speech existing in the text set, at least one text that is not aligned with its corresponding speech in the text set as a target text set according to the first information set and the second information set may include the following steps; a sketch of the selection follows them:

the method comprises the steps of responding to the text which is not aligned with corresponding voice in the text set, selecting first information which represents that the text is not aligned with the corresponding voice from the first information set as first target information, and obtaining the first target information set.

Second, selecting, from the second information set, second information representing that a text and its corresponding speech are not aligned as second target information, to obtain a second target information set.

Third, determining, as the target text set, at least one text that appears in both the text set corresponding to the first target information set and the text set corresponding to the second target information set.
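
In other words, a text enters the target text set only when both checks flag it as misaligned, i.e., the intersection of the two selections. A minimal sketch, reusing the True-means-aligned convention of the earlier sketches:

def target_text_set(texts, first_info, second_info):
    """Texts flagged as misaligned by BOTH the HMM and the DTW check."""
    first_targets = {t for t, aligned in zip(texts, first_info) if not aligned}
    second_targets = {t for t, aligned in zip(texts, second_info) if not aligned}
    return first_targets & second_targets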

Step 307, removing, from the data set, at least one data item associated with the target text set, to obtain the target data set.

In some embodiments, the execution body may remove, from the data set, at least one data item associated with the target text set, to obtain the target data set.
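
A sketch of this removal under the same hypothetical DataItem pairing introduced earlier: the removed items form the target data set, and the remaining items are what step 308 can use as the training set.

def split_by_target_texts(data_set, targets):
    """Split the data set on the target texts.

    Returns (target_data_set, remaining_data); data_set is a list of the
    hypothetical DataItem from the earlier sketch, targets a set of texts.
    """
    target_data = [item for item in data_set if item.text in targets]
    remaining = [item for item in data_set if item.text not in targets]
    return target_data, remaining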

Step 308, determining a training set for a text-speech model according to the target data set.

In some embodiments, for the specific implementations and technical effects of steps 301 and 308, reference may be made to steps 201 and 204 in the embodiments corresponding to fig. 2, which are not described again here.

As can be seen from fig. 3, compared with the description of some embodiments corresponding to fig. 2, the flow 300 of the training set generation method in some embodiments corresponding to fig. 3 details the steps of determining whether text that is not aligned with corresponding speech in the speech set exists in the text set, and of obtaining the target data set. Therefore, the solutions described in these embodiments can, by using the hidden Markov model and the dynamic time warping algorithm, more accurately and efficiently determine whether such text exists in the text set and subsequently determine the target data set.

With continued reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a training set generation apparatus. These apparatus embodiments correspond to the method embodiments described above with reference to fig. 2, and the apparatus may be applied in various electronic devices.

As shown in fig. 5, the training set generation apparatus 500 of some embodiments includes: an acquisition unit 501, a first determining unit 502, a removing unit 503, and a second determining unit 504. The acquisition unit 501 is configured to acquire a data set, where the data set includes a text set and a speech set associated with the text set. The first determining unit 502 is configured to determine whether there is text in the text set that is not aligned with corresponding speech in the speech set. The removing unit 503 is configured to, in response to text that is not aligned with its corresponding speech existing in the text set, remove from the data set at least one data item whose text and corresponding speech are not aligned, and take the removed data as a target data set. The second determining unit 504 is configured to determine a training set for a text-speech model according to the target data set.

In some optional implementations of some embodiments, the first determining unit 502 of the training set generation apparatus 500 may be further configured to: extract a phoneme sequence corresponding to each text in the text set to obtain a phoneme sequence group; input the phoneme sequence corresponding to each text, together with the corresponding speech, into a pre-trained hidden Markov model, which outputs information representing whether that text is aligned with the corresponding speech as first information, to obtain a first information set; generate, according to the phoneme sequence of each text and the corresponding speech, information representing whether that text is aligned with the corresponding speech as second information by using a dynamic time warping algorithm, to obtain a second information set; and determine, according to the first information set and the second information set, whether text that is not aligned with corresponding speech in the speech set exists in the text set.

In some optional implementations of some embodiments, the removing unit 503 of the training set generation apparatus 500 may be further configured to: in response to text that is not aligned with its corresponding speech existing in the text set, determine, according to the first information set and the second information set, at least one text in the text set that is not aligned with its corresponding speech as a target text set; and remove, from the data set, at least one data item associated with the target text set, to obtain the target data set.

In some optional implementations of some embodiments, the removing unit 503 of the training set generation apparatus 500 may be further configured to: in response to text that is not aligned with its corresponding speech existing in the text set, select, from the first information set, first information representing that a text and its corresponding speech are not aligned as first target information, to obtain a first target information set; select, from the second information set, second information representing that a text and its corresponding speech are not aligned as second target information, to obtain a second target information set; and determine, as the target text set, at least one text that appears in both the text set corresponding to the first target information set and the text set corresponding to the second target information set.

In some optional implementations of some embodiments, the second determining unit 504 of the training set generation apparatus 500 may be further configured to: receive a corrected data set transmitted by an associated terminal according to the target data set, where the corrected data set is obtained by correcting the target data set; and replace the target data set in the data set with the corrected data set, to obtain a replaced data set serving as the training set for the text-speech model.

In some optional implementations of some embodiments, the hidden Markov model described above is trained according to a forced alignment method.

It can be understood that the units described in the apparatus 500 correspond to the respective steps of the method described with reference to fig. 2. Thus, the operations, features, and resulting advantages described above with respect to the method are also applicable to the apparatus 500 and the units included therein, and are not described again here.

Referring now to fig. 6, shown is a schematic diagram of an electronic device 600 suitable for use in implementing some embodiments of the present disclosure. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 6, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage device 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, and the like; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 608 including, for example, a magnetic tape, a hard disk, and the like; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates the electronic device 600 with various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or multiple devices, as needed.

In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of some embodiments of the present disclosure.

It should be noted that the computer readable medium described above in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (e.g., a communication network) in any form or medium. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The above computer readable medium may be included in the above electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: acquire a data set, where the data set includes a text set and a speech set associated with the text set; determine whether text that is not aligned with corresponding speech in the speech set exists in the text set; in response to such text existing, remove from the data set at least one data item whose text and corresponding speech are not aligned, and take the removed data as a target data set; and determine a training set for a text-speech model according to the target data set.

Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in some embodiments of the present disclosure may be implemented by software or by hardware. The described units may also be provided in a processor, which may be described as: a processor including an acquisition unit, a first determining unit, a removing unit, and a second determining unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the second determining unit may also be described as "a unit for determining a training set for a text-speech model according to the target data set".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It should be appreciated by those skilled in the art that the scope of the invention referred to in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, a technical solution formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.
