Training method and device of voice segmentation model, electronic equipment and storage medium

Document No.: 193311 | Publication date: 2021-11-02

Note: This technique, "Training method and device of voice segmentation model, electronic equipment and storage medium", was designed and created by Zhang Ruiqing, He Zhongjun, Li Zhi, and Wu Hua on 2021-06-30. Its main content is as follows: The application discloses a training method and apparatus for a speech segmentation model, an electronic device, and a storage medium, relating to the field of computer technology, and in particular to artificial intelligence fields such as speech technology, deep learning, and natural language processing. The specific implementation scheme is: obtaining sample speech and a speech segmentation model to be trained; dividing the sample speech into a plurality of sample speech segments; translating the plurality of sample speech segments with a speech translation model to generate a plurality of sample text segments; generating label values for the plurality of sample speech segments according to the plurality of sample text segments and a preset condition; and training the speech segmentation model according to the label values and the plurality of sample speech segments to generate a trained speech segmentation model. This improves the accuracy of the speech segmentation model, and the trained model can provide meaningful speech segments for subsequent simultaneous interpretation, thereby improving the accuracy of simultaneous interpretation.

1. A training method for a speech segmentation model, comprising the following steps:

acquiring sample speech, and acquiring a speech segmentation model to be trained;

dividing the sample speech into a plurality of sample speech segments;

translating the plurality of sample speech segments according to a speech translation model to generate a plurality of sample text segments;

generating label values of the plurality of sample speech segments according to the plurality of sample text segments and a preset condition; and

training the speech segmentation model according to the label values of the plurality of sample speech segments and the plurality of sample speech segments, to generate a trained speech segmentation model.

2. The method of claim 1, wherein dividing the sample speech into a plurality of sample speech segments comprises:

acquiring a time interval;

dividing the sample speech according to the time interval to generate a plurality of speech segments, wherein the plurality of speech segments include the sample speech; and

deleting the sample speech from the plurality of speech segments to obtain the plurality of sample speech segments.

3. The method of claim 1, wherein said translating the plurality of sample speech segments according to a speech translation model to generate a plurality of sample text segments comprises:

inputting the plurality of sample speech segments into the speech translation model; and

translating the plurality of sample speech segments through the speech translation model to generate the plurality of sample text segments, wherein the language of the plurality of sample text segments is different from the language of the plurality of sample speech segments.

4. The method of claim 1, wherein generating the label values of the plurality of sample speech segments according to the plurality of sample text segments and a preset condition comprises:

determining, for each sample text segment of the plurality of sample text segments, whether the sample text segment meets the preset condition, and generating the label value according to the corresponding determination result, wherein:

if the sample text segment meets the preset condition, the label value of the sample speech segment corresponding to the sample text segment is "1"; and

if the sample text segment does not meet the preset condition, the label value of the sample speech segment corresponding to the sample text segment is "0".

5. The method of claim 1, wherein training the speech segmentation model according to the label values of the plurality of sample speech segments and the plurality of sample speech segments comprises:

inputting a sample speech segment into the speech segmentation model to generate a predicted label value;

generating a loss value according to the predicted label value and the label value corresponding to the sample speech segment; and

training the speech segmentation model according to the loss value.

6. A training apparatus for a speech segmentation model, comprising:

an acquisition module configured to acquire sample speech and acquire a speech segmentation model to be trained;

a dividing module configured to divide the sample speech into a plurality of sample speech segments;

a first generation module configured to translate the plurality of sample speech segments according to a speech translation model to generate a plurality of sample text segments;

a second generation module configured to generate label values of the plurality of sample speech segments according to the plurality of sample text segments and a preset condition; and

a training module configured to train the speech segmentation model according to the label values of the plurality of sample speech segments and the plurality of sample speech segments, to generate a trained speech segmentation model.

7. The apparatus of claim 6, wherein the dividing module is specifically configured to:

acquire a time interval;

divide the sample speech according to the time interval to generate a plurality of speech segments, wherein the plurality of speech segments include the sample speech; and

delete the sample speech from the plurality of speech segments to obtain the plurality of sample speech segments.

8. The apparatus of claim 6, wherein the first generation module is specifically configured to:

input the plurality of sample speech segments into the speech translation model; and

translate the plurality of sample speech segments through the speech translation model to generate the plurality of sample text segments, wherein the language of the plurality of sample text segments is different from the language of the plurality of sample speech segments.

9. The apparatus of claim 6, wherein the second generation module is specifically configured to:

determine, for each sample text segment of the plurality of sample text segments, whether the sample text segment meets a preset condition, and generate the label value according to the corresponding determination result, wherein:

if the sample text segment meets the preset condition, the label value of the sample speech segment corresponding to the sample text segment is "1"; and

if the sample text segment does not meet the preset condition, the label value of the sample speech segment corresponding to the sample text segment is "0".

10. The apparatus of claim 6, wherein the training module is specifically configured to:

input a sample speech segment into the speech segmentation model to generate a predicted label value;

generate a loss value according to the predicted label value and the label value corresponding to the sample speech segment; and

train the speech segmentation model according to the loss value.

11. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor, wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of the speech segmentation model according to any one of claims 1-5.

12. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the training method of the speech segmentation model according to any one of claims 1-5.

13. A computer program product comprising a computer program which, when executed by a processor, implements the training method of the speech segmentation model according to any one of claims 1-5.

Technical Field

The application relates to the field of computer technology, in particular to artificial intelligence fields such as speech technology, deep learning, and natural language processing, and specifically to a training method and apparatus for a speech segmentation model, an electronic device, and a storage medium.

Background

As simultaneous interpretation technology matures, simultaneous interpretation equipment has become indispensable for real-time interpretation on international occasions such as international conferences, official visits abroad, negotiations, business activities, and news media events.

Simultaneous interpretation refers to continuously interpreting content into another language for listeners without interrupting the speaker. In the related art, the mainstream approach is to classify the currently input streaming speech once every interval T (usually several hundred milliseconds) to determine whether the current position is a word boundary; if so, the accumulated speech is translated using a fixed policy.

Disclosure of Invention

The application provides a training method and device for a voice segmentation model, electronic equipment and a storage medium.

According to an aspect of the present application, there is provided a training method for a speech segmentation model, including:

acquiring sample speech, and acquiring a speech segmentation model to be trained;

dividing the sample speech into a plurality of sample speech segments;

translating the plurality of sample speech segments according to a speech translation model to generate a plurality of sample text segments;

generating label values of the plurality of sample speech segments according to the plurality of sample text segments and a preset condition; and

training the speech segmentation model according to the label values of the plurality of sample speech segments and the plurality of sample speech segments, to generate a trained speech segmentation model.

According to another aspect of the present application, there is provided a training apparatus for a speech segmentation model, comprising:

an acquisition module configured to acquire sample speech and acquire a speech segmentation model to be trained;

a dividing module configured to divide the sample speech into a plurality of sample speech segments;

a first generation module configured to translate the plurality of sample speech segments according to a speech translation model to generate a plurality of sample text segments;

a second generation module configured to generate label values of the plurality of sample speech segments according to the plurality of sample text segments and a preset condition; and

a training module configured to train the speech segmentation model according to the label values of the plurality of sample speech segments and the plurality of sample speech segments, to generate a trained speech segmentation model.

According to another aspect of the present application, there is provided an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor, wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of the speech segmentation model described in the embodiments of the above aspect.

According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing a computer program for causing a computer to execute the training method of the speech segmentation model according to the embodiments of the above aspect.

According to another aspect of the present application, there is provided a computer program product comprising a computer program, which when executed by a processor, implements a method of training a speech segmentation model as described in an embodiment of the above-mentioned aspect.

It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present application, nor to limit the scope of the present application. Other features of the present application will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1 is a schematic flowchart of a method for training a speech segmentation model according to an embodiment of the present application;

FIG. 2 is a schematic flow chart illustrating another method for training a speech segmentation model according to an embodiment of the present application;

FIG. 3 is a diagram of a plurality of sample speech segments according to an embodiment of the present application;

FIG. 4 is a schematic diagram illustrating comparison of a plurality of sample speech segments and a plurality of sample text segments according to an embodiment of the present application;

FIG. 5 is a schematic flow chart illustrating another method for training a speech segmentation model according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a training apparatus for a speech segmentation model according to an embodiment of the present application; and

FIG. 7 is a block diagram of an electronic device for a method of training a speech segmentation model according to an embodiment of the present application.

Detailed Description

The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details should be considered exemplary only. Those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.

The training method and apparatus for a speech segmentation model and the electronic device according to embodiments of the present application are described below with reference to the accompanying drawings.

Artificial intelligence is the study of using computers to simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and spans both hardware and software. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies include computer vision, speech recognition, natural language processing, deep learning, big data processing, and knowledge graph technologies.

Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language, and is a science integrating linguistics, computer science, and mathematics.

Deep learning is a new research direction in the field of machine learning. It learns the intrinsic patterns and levels of representation of sample data, and the information obtained in this learning process is very helpful for interpreting data such as text, images, and sound. Its ultimate goal is to give machines human-like abilities to analyze and learn, so that they can recognize data such as text, images, and sound. Deep learning is a complex machine learning approach whose results in speech and image recognition far exceed the prior related art.

Speech technology aims to enable computers to listen, see, speak, and feel, and is a development direction of future human-computer interaction, in which speech is expected to become the most promising interaction mode, with more advantages than other interaction modes. For a computer to speak, speech synthesis technology is required, whose core is Text-to-Speech (TTS).

The training method of the speech segmentation model provided in the embodiment of the present application may be executed by an electronic device. The electronic device may be a personal computer (PC), a tablet computer, a palmtop computer, or a mobile phone, which is not limited in this application.

In the embodiment of the application, the electronic device can be provided with a processing component, a storage component and a driving component. Optionally, the driving component and the processing component may be integrated, the storage component may store an operating system, an application program, or other program modules, and the processing component implements the method for training the speech segmentation model provided in the embodiment of the present application by executing the application program stored in the storage component.

Fig. 1 is a schematic flowchart of a training method for a speech segmentation model according to an embodiment of the present application.

The method for training the speech segmentation model in the embodiment of the present application can also be executed by the training apparatus for a speech segmentation model provided in the embodiment of the present application. The apparatus can be configured in an electronic device to divide the acquired sample speech into a plurality of sample speech segments, translate the plurality of sample speech segments according to the speech translation model to generate a plurality of sample text segments, generate the label values of the plurality of sample speech segments according to the plurality of sample text segments and a preset condition, and train the speech segmentation model to be trained according to the label values and the plurality of sample speech segments, so that the accuracy of the speech segmentation model can be improved.

As a possible scenario, the training method of the speech segmentation model in the embodiment of the present application may also be executed on a server side, where the server may be a cloud server, so that the training method can be executed in the cloud.

As shown in fig. 1, the method for training the speech segmentation model may include:

Step 101, obtaining sample speech and obtaining a speech segmentation model to be trained. The sample speech may comprise multiple utterances, and may be Chinese speech, English speech, German speech, or the like, which is not limited here.

In the embodiment of the present application, the sample speech may be obtained in multiple ways: it may be collected from a simultaneous interpretation device during interpretation; it may be created manually, for example recorded with recording equipment according to the requirements of the relevant personnel; or it may be collected from the speech of passers-by. No limitation is imposed here.

It should be noted that the sample speech described in this embodiment may have a complete meaning, such as "The weather is nice today, let's go on a picnic" or "how nice a day".

In this embodiment of the application, the speech segmentation model to be trained may be pre-stored in a storage space of the electronic device for convenient retrieval and use. The storage space is not limited to a physical storage space such as a hard disk; it may also be the storage space of a network drive (cloud storage) connected to the electronic device.

Specifically, after acquiring the sample voice, the electronic device may further acquire (call) the voice segmentation model to be trained from its own storage space.

Step 102, dividing the sample voice into a plurality of sample voice fragments.

In the embodiment of the application, the sample voice can be divided into a plurality of sample voice segments according to a preset sample voice segment extraction algorithm, wherein the sample voice segment extraction algorithm can be calibrated according to actual conditions.

Specifically, after obtaining the sample speech and the speech segmentation model to be trained, the electronic device may divide the sample speech into a plurality of sample speech segments according to a preset sample speech segment extraction algorithm. For example, the sample speech "how nice a day" may be divided into four sample speech segments: "how", "how nice", "how nice a", and "how nice a day".

As one possible scenario, the electronic device may also process (divide) the sample speech by the sample speech segment extraction model to generate a plurality of sample speech segments.

It should be noted that the sample speech segment extraction model described in this embodiment may be trained in advance and pre-stored in the storage space of the electronic device to facilitate retrieval and application.

The sample speech segment extraction model may be trained and generated by a related server. The server may be a cloud server or a computer host, and a communication connection is established between the server and the electronic device capable of executing the training method of the speech segmentation model provided by the embodiment of the present application; the communication connection may be at least one of a wireless network connection and a wired network connection. The server can send the trained sample speech segment extraction model to the electronic device so that the electronic device can call it when needed, greatly reducing the computing load of the electronic device.

Specifically, after obtaining the sample voice and the voice segmentation model to be trained, the electronic device may call the sample voice segment extraction model from its own storage space, and input the sample voice to the sample voice segment extraction model, so as to process the sample voice through the sample voice segment extraction model, thereby obtaining a plurality of sample voice segments output by the sample voice segment extraction model.

As another possible scenario, the electronic device may also use a partitioning tool (e.g., a plug-in) to partition the sample speech into a plurality of sample speech segments.

Further, in order to improve the accuracy of sample speech division, in this embodiment of the application, after obtaining the sample speech and the speech segmentation model to be trained, the electronic device may also perform preprocessing on the sample speech to remove a blank portion in the sample speech, noise in the sample speech, and the like.

Step 103, translating the plurality of sample speech segments according to the speech translation model to generate a plurality of sample text segments.

It should be noted that the speech translation model described in this embodiment may be trained in advance and pre-stored in the storage space of the electronic device to facilitate retrieval and application.

Specifically, after obtaining a plurality of sample voice segments, the electronic device may call up a voice translation model from its own storage space, and may sequentially input the plurality of sample voice segments to the voice translation model, so that the plurality of sample voice segments are translated by the voice translation model to obtain a plurality of sample text segments output by the voice translation model.

Step 104, generating label values of the plurality of sample speech segments according to the plurality of sample text segments and a preset condition. The preset condition may be calibrated according to the actual situation; for example, it may require that the sample text segment be a prefix of the whole-sentence translation result (i.e., the translation result of the complete sample speech).

Step 105, training the speech segmentation model according to the label values of the plurality of sample speech segments and the plurality of sample speech segments to generate a trained speech segmentation model.

Specifically, after obtaining the plurality of sample text segments, the electronic device may generate the label values of the plurality of sample speech segments according to the plurality of sample text segments and the preset condition. It may then input the sample speech segments into the speech segmentation model one by one to generate predicted label values, generate a loss value from each predicted label value and the label value corresponding to the currently input sample speech segment, and train the speech segmentation model according to the loss value, thereby optimizing the model.

In the embodiment of the present application, sample speech and a speech segmentation model to be trained are first obtained, and the sample speech is divided into a plurality of sample speech segments. The plurality of sample speech segments are then translated by the speech translation model to generate a plurality of sample text segments, and label values of the plurality of sample speech segments are generated according to the plurality of sample text segments and a preset condition. Finally, the speech segmentation model is trained according to the label values and the plurality of sample speech segments to generate a trained speech segmentation model. This improves the accuracy of the speech segmentation model, and the trained model can provide meaningful speech segments for subsequent simultaneous interpretation, thereby improving the accuracy of simultaneous interpretation.
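The overall method (steps 101-105) can be sketched as one small pipeline. The sketch below is illustrative only: `divide` and `translate` stand in for the segment-division step and the speech translation model, strings stand in for audio, and none of the names come from the application.

```python
def build_training_set(sample_speech, full_translation, divide, translate):
    """Illustrative pipeline for steps 101-105: divide, translate, label."""
    segments = divide(sample_speech)               # step 102
    texts = [translate(seg) for seg in segments]   # step 103
    labels = [1 if full_translation.startswith(t) else 0
              for t in texts]                      # step 104: prefix condition
    return list(zip(segments, labels))             # training pairs for step 105

# Toy usage: strings stand in for audio so the sketch runs as-is.
toy_translations = {
    "how": "What",
    "how nice": "What beautiful",
    "how nice a": "What nice",
    "how nice a day": "What a beautiful day",
}
pairs = build_training_set(
    "how nice a day",
    "What a beautiful day",
    divide=lambda s: [s[:n] for n in (3, 8, 10, len(s))],
    translate=toy_translations.get,
)
print([label for _, label in pairs])  # → [1, 0, 0, 1]
```

Only the segments whose translations are prefixes of the whole-sentence result ("What" and "What a beautiful day") receive label "1"; the training pairs then feed step 105.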

To clearly illustrate the above embodiment, in an embodiment of the present application, as shown in fig. 2, dividing the sample speech into a plurality of sample speech segments may include:

step 201, obtaining a time interval, wherein the time interval may be calibrated according to an actual situation. It should be noted that the time interval described in this embodiment may be set by the relevant person in advance and stored in the electronic device in advance to be called when needed.

Step 202, dividing the sample voice according to the time interval to generate a plurality of voice segments, wherein the plurality of voice segments include the sample voice.

Step 203, deleting the sample voice in the plurality of voice segments to obtain a plurality of sample voice segments.

Specifically, after acquiring the sample speech, the electronic device may call (acquire) the time interval (several hundred milliseconds) from its own storage space, and then divide the sample speech according to the time interval, for example by capturing the speech content received so far at every time interval, to generate a plurality of speech segments. The electronic device can then delete the sample speech from the plurality of speech segments to obtain a plurality of sample speech segments.

For example, referring to FIG. 3, assume the sample speech is a Chinese utterance, rendered here as "how nice a day". Dividing the sample speech according to the time interval may yield five speech segments, where the fifth (i.e., the last) segment is the complete sample speech "how nice a day". The last speech segment (i.e., the sample speech itself) can then be deleted, resulting in four sample speech segments: "how", "how nice", "how nice a", and "how nice a day".

In this way, the sample speech can be divided into a plurality of sample speech segments, providing sufficient sample data for the subsequent training of the speech segmentation model and improving the training effect.
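The division in steps 201-203 can be sketched as follows, assuming the sample speech is a mono PCM sample array and the time interval is given in milliseconds. The function name and signature are illustrative, not from the application.

```python
def divide_into_prefix_segments(samples, sample_rate, interval_ms):
    """At every time interval, take the speech content received so far;
    the final segment equals the complete sample speech and is deleted,
    leaving only the sample speech segments (steps 201-203)."""
    step = int(sample_rate * interval_ms / 1000)   # samples per interval
    segments = []
    end = step
    while end < len(samples):
        segments.append(samples[:end])             # cumulative prefix
        end += step
    segments.append(samples)                       # the sample speech itself
    return segments[:-1]                           # delete the sample speech

# Toy usage: 1 second of silent "audio" at 8 kHz, divided every 300 ms.
audio = [0.0] * 8000
prefixes = divide_into_prefix_segments(audio, 8000, 300)
print(len(prefixes), len(prefixes[0]))  # → 3 2400
```

Each returned segment is a cumulative prefix of the audio, mirroring how the streaming input grows over time.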

Further, in an embodiment of the present application, translating the plurality of sample speech segments according to the speech translation model to generate a plurality of sample text segments includes inputting the plurality of sample speech segments into the speech translation model and translating them through the speech translation model to generate the plurality of sample text segments, wherein the language of the plurality of sample text segments is different from the language of the plurality of sample speech segments.

Specifically, after obtaining a plurality of sample voice segments, the electronic device may sequentially input the sample voice segments of the plurality of sample voice segments to the voice translation model, and then the voice translation model may sequentially translate the input sample voice segments, thereby obtaining a plurality of sample text segments.

For example, referring to fig. 4, assuming that the sample speech segments are "how, how nice and how nice", respectively, the sample text segments that are correspondingly output by the speech translation model may be "What, What beautiful, What nice, and What a beautiful day".

Therefore, the translation results corresponding to the sample voice fragments can be accurately obtained through the voice translation model.

Further, in an embodiment of the present application, generating the label values of the plurality of sample speech segments according to the plurality of sample text segments and the preset condition may include determining, for each sample text segment of the plurality of sample text segments, whether it meets the preset condition, and generating the label value according to the corresponding determination result. If the sample text segment meets the preset condition, the label value of the sample speech segment corresponding to the sample text segment is "1"; if it does not meet the preset condition, the label value is "0".

Specifically, after obtaining the plurality of sample text segments, the electronic device may determine, for each sample text segment, whether it is a prefix of the whole-sentence translation result (i.e., the translation result of the complete sample speech), and may generate a label value according to the determination result. When the sample text segment is a prefix of the whole-sentence translation result, the label value of the corresponding sample speech segment may be "1"; when it is not, the label value may be "0". In this way, the output of the speech translation model is converted into probabilities: a positive sample is 1 (the probability of being correct is 100%) and a negative sample is 0 (the probability of being correct is 0%), providing the training targets needed for the subsequent training of the speech segmentation model.

For example, referring to FIG. 4, assuming the sample text segments are "What", "What beautiful", "What nice", and "What a beautiful day", the 1st and 4th corresponding sample speech segments are segments with clear semantics, i.e., positive samples whose label value is "1", while the label value corresponding to the 2nd and 3rd sample speech segments is "0".

It should be noted that the preset condition described in this embodiment may include that the sample text segment is a prefix of the whole-sentence speech translation result (i.e., the translation result of the sample speech); this condition serves to determine whether the sample speech segment corresponding to the sample text segment is a speech segment with a definite semantic meaning.

In one embodiment of the present application, as shown in fig. 5, training a speech segmentation model according to the tag values of the plurality of sample speech segments and the plurality of sample speech segments may include:

step 501, a sample speech segment is input into a speech segmentation model to generate a predicted tag value.

Step 502, generating a loss value according to the predicted tag value and the tag value corresponding to the sample voice segment. It should be noted that the loss value in this embodiment can be calculated by a loss value formula.

And 503, training the voice segmentation model according to the loss value.

Specifically, after obtaining the tag values of the sample voice segments, the electronic device may train the voice segmentation model using the sample voice segments and their corresponding tag values. In the training process, a sample voice segment is input into the voice segmentation model to generate a predicted tag value, a loss value is generated from the predicted tag value and the tag value corresponding to that sample voice segment using a related loss-value formula, and the voice segmentation model is trained according to the loss value; the above training operation is repeated until the training based on the sample voice segments is completed. Therefore, the voice segmentation model can be optimized and its accuracy further improved, and meaningful voice fragments can be provided for subsequent simultaneous interpretation through the trained voice segmentation model, so that the accuracy of simultaneous interpretation can be improved.
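As a rough sketch of steps 501 to 503, the fragment below runs training iterations for a stand-in logistic model on one sample segment represented as a feature vector. The patent does not specify the loss-value formula or the model architecture; binary cross-entropy and the logistic model here are assumptions chosen purely for illustration.

```python
import math

def train_step(model, features, label, lr=0.1):
    """One training iteration for a single sample speech segment."""
    # Step 501: input the segment (feature vector) to generate a predicted tag value.
    z = sum(w * x for w, x in zip(model["w"], features)) + model["b"]
    predicted = 1.0 / (1.0 + math.exp(-z))
    # Step 502: generate a loss value from the predicted and true tag values
    # (binary cross-entropy, an assumed loss-value formula).
    eps = 1e-7
    p = min(max(predicted, eps), 1 - eps)
    loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    # Step 503: train (update) the model according to the loss value.
    grad = predicted - label  # d(loss)/dz for sigmoid + cross-entropy
    model["w"] = [w - lr * grad * x for w, x in zip(model["w"], features)]
    model["b"] -= lr * grad
    return loss

model = {"w": [0.0, 0.0], "b": 0.0}
losses = [train_step(model, [1.0, 0.5], label=1) for _ in range(50)]
print(losses[0] > losses[-1])  # repeating the training operation reduces the loss: True
```

In practice the real speech segmentation model would be a neural network over acoustic features, but the 0/1 tag values make the binary-classification shape of the training the same.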

Fig. 6 is a schematic structural diagram of a training apparatus for a speech segmentation model according to an embodiment of the present application.

The training device for the voice segmentation model provided by the embodiment of the application can be configured in electronic equipment. It divides the acquired sample voice into a plurality of sample voice fragments, translates the plurality of sample voice fragments according to the voice translation model to generate a plurality of sample text fragments, generates label values of the plurality of sample voice fragments according to the plurality of sample text fragments and preset conditions, and trains the voice segmentation model to be trained according to the label values of the plurality of sample voice fragments and the plurality of sample voice fragments, so that the accuracy of the voice segmentation model can be improved.

As shown in fig. 6, the training apparatus 600 for the speech segmentation model may include: an acquisition module 610, a partitioning module 620, a first generation module 630, a second generation module 640, and a training module 650.

The obtaining module 610 is configured to obtain sample speech and obtain a speech segmentation model to be trained. The sample speech may be a plurality of speeches, and the sample speech may be a chinese speech, an english speech, or a german speech, etc., which is not limited herein.

In this embodiment, the obtaining module 610 may obtain the sample voice in multiple ways. For example, the sample voice may be collected from a simultaneous interpretation device during simultaneous interpretation; the sample voice may also be created artificially, for example, recorded by a relevant recording device according to the requirements of the relevant personnel, or actively collected from the utterances of passers-by. This is not limited herein.

It should be noted that the sample speech described in this embodiment may have a complete meaning, such as "the weather is good today, let's go for a picnic" or "how nice a day".

In this embodiment of the application, the to-be-trained speech segmentation model may be pre-stored in a storage space of the electronic device, so as to facilitate access and use, where the storage space is not limited to a physical storage space, for example, a hard disk, and the storage space may also be a storage space (cloud storage space) of a network hard disk connected to the electronic device.

Specifically, the obtaining module 610 may further obtain (call) a speech segmentation model to be trained from a storage space of the electronic device after obtaining the sample speech.

The dividing module 620 is used for dividing the sample voice into a plurality of sample voice segments.

In this embodiment of the present application, the dividing module 620 may divide the sample speech into a plurality of sample speech segments according to a preset sample speech segment extraction algorithm, where the sample speech segment extraction algorithm may be calibrated according to an actual situation.

Specifically, after the obtaining module 610 obtains the sample speech and the speech segmentation model to be trained, the dividing module 620 may divide the sample speech into a plurality of sample speech segments according to a preset sample speech segment extraction algorithm. For example, the sample speech "how nice a day" may be divided into the four sample speech segments "how", "how nice", "how nice a", and "how nice a day".
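The cumulative division illustrated above might be sketched as follows. Cutting at a fixed time interval and representing the speech as a flat list of audio samples are assumptions made for illustration; the patent leaves the extraction algorithm to be calibrated according to the actual situation.

```python
def divide_sample_speech(samples, sample_rate, interval_s):
    """Cut a waveform into cumulative prefix segments, each segment one
    time interval longer than the last ("how", "how nice", ...)."""
    step = int(sample_rate * interval_s)
    return [samples[:end] for end in range(step, len(samples) + 1, step)]

# A hypothetical 2-second clip at 4 samples per second, cut every 0.5 s,
# yields 4 cumulative segments, the last being the whole clip.
clip = list(range(8))
segments = divide_sample_speech(clip, sample_rate=4, interval_s=0.5)
print(len(segments), segments[0], segments[-1] == clip)  # 4 [0, 1] True
```

Each segment extends the previous one, which is what lets the later prefix check on the translated text fragments decide whether a cut point lands on a semantically definite boundary.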

As one possible scenario, the partitioning module 620 may also invoke a sample speech segment extraction model to process (partition) the sample speech to generate a plurality of sample speech segments.

It should be noted that the sample speech segment extraction model described in this embodiment may be trained in advance and pre-stored in the storage space of the electronic device to facilitate the retrieval application.

The sample voice fragment extraction model may be trained and generated by a related server, which may be a cloud server or a computer host. A communication connection is established between the server and the electronic device capable of executing the training method of the voice segmentation model provided by the embodiments of the present application; the communication connection may be at least one of a wireless network connection and a wired network connection. The server can send the trained sample speech segment extraction model to the electronic device so that the electronic device can call it when needed, thereby greatly reducing the computing load on the electronic device.

Specifically, after the obtaining module 610 obtains the sample voice and the voice segmentation model to be trained, the dividing module 620 may call a sample voice segment extraction model from a storage space of the electronic device, and input the sample voice to the sample voice segment extraction model, so that the sample voice is processed by the sample voice segment extraction model, and a plurality of sample voice segments output by the sample voice segment extraction model are obtained.

As another possible scenario, the partitioning module 620 may also use a partitioning tool (e.g., a plug-in) to partition the sample speech into a plurality of sample speech segments.

Further, in order to improve the accuracy of sample speech division, in this embodiment of the application, after the obtaining module 610 obtains the sample speech and the speech segmentation model to be trained, the dividing module 620 may also perform preprocessing on the sample speech to remove blank parts in the sample speech, noise in the sample speech, and the like.

The first generation module 630 is configured to translate the plurality of sample speech segments according to a speech translation model to generate a plurality of sample text segments.

It should be noted that the translation model described in this embodiment may be trained in advance and pre-stored in the storage space of the electronic device to facilitate the retrieval of the application.

Specifically, after the dividing module 620 obtains a plurality of sample voice segments, the first generating module 630 may call up a voice translation model from a storage space of the electronic device, and may sequentially input the plurality of sample voice segments to the voice translation model, so that the plurality of sample voice segments are translated by the voice translation model to obtain a plurality of sample text segments output by the voice translation model.

The second generating module 640 is configured to generate tag values of the sample speech segments according to the sample text segments and preset conditions. The preset condition may be calibrated according to an actual situation, for example, the preset condition may include that the sample text segment is a prefix of a whole sentence speech translation result (i.e., a translation result of the sample speech).

The training module 650 is configured to train the speech segmentation model according to the label values of the plurality of sample speech segments and the plurality of sample speech segments to generate a trained speech segmentation model.

Specifically, after the first generating module 630 obtains a plurality of sample text segments, the second generating module 640 may generate tag values of a plurality of sample voice segments according to the plurality of sample text segments and preset conditions, then the training module 650 sequentially inputs the sample voice segments of the plurality of sample voice segments into the voice segmentation model to generate a plurality of predicted tag values, generates a loss value according to the predicted tag values and the tag values corresponding to the currently input sample voice segments, and trains the voice segmentation model according to the loss value, thereby optimizing the voice segmentation model.

In the embodiment of the application, the obtaining module first obtains a sample voice and a voice segmentation model to be trained. The dividing module then divides the sample voice into a plurality of sample voice fragments, the first generating module translates the plurality of sample voice fragments according to the voice translation model to generate a plurality of sample text fragments, and the second generating module generates label values of the plurality of sample voice fragments according to the plurality of sample text fragments and preset conditions. Finally, the training module trains the voice segmentation model according to the label values of the plurality of sample voice fragments and the plurality of sample voice fragments to generate the trained voice segmentation model. Therefore, the accuracy of the voice segmentation model can be improved, and meaningful voice fragments can be provided for subsequent simultaneous interpretation through the trained voice segmentation model, so that the accuracy of simultaneous interpretation can be improved.

In an embodiment of the present application, the dividing module 620 is specifically configured to obtain a time interval and divide the sample speech according to the time interval to generate a plurality of speech segments, where the plurality of speech segments includes a segment identical to the entire sample speech; that segment is deleted from the plurality of speech segments to obtain the plurality of sample speech segments.

In an embodiment of the present application, the first generating module 630 is specifically configured to input a plurality of sample speech segments into the speech translation model, and translate the plurality of sample speech segments through the speech translation model to generate a plurality of sample text segments, wherein the languages of the plurality of sample text segments and the languages of the plurality of sample speech segments are different languages.

In an embodiment of the present application, the second generating module 640 is specifically configured to: respectively judging whether each sample text fragment in the plurality of sample text fragments meets a preset condition, and generating a label value according to a corresponding judgment result; if the sample text segment meets the preset condition, the label value of the sample voice segment corresponding to the sample text segment is '1'; and if the sample text segment does not meet the preset condition, the label value of the sample voice segment corresponding to the sample text segment is '0'.

In an embodiment of the present application, the training module 650 is specifically configured to input the sample speech segment into the speech segmentation model to generate a predicted tag value, generate a loss value according to the predicted tag value and the tag value corresponding to the sample speech segment, and train the speech segmentation model according to the loss value.

It should be noted that the foregoing explanation of the embodiment of the training method for the speech segmentation model is also applicable to the training device for the speech segmentation model of the embodiment, and is not repeated herein.

The training device of the voice segmentation model comprises an obtaining module, a dividing module, a first generating module, a second generating module and a training module. The obtaining module obtains a sample voice and a voice segmentation model to be trained; the dividing module divides the sample voice into a plurality of sample voice fragments; the first generating module translates the plurality of sample voice fragments according to the voice translation model to generate a plurality of sample text fragments; the second generating module generates label values of the plurality of sample voice fragments according to the plurality of sample text fragments and preset conditions; and the training module trains the voice segmentation model according to the label values of the plurality of sample voice fragments and the plurality of sample voice fragments to generate the trained voice segmentation model. Therefore, the accuracy of the voice segmentation model can be improved, and meaningful voice fragments can be provided for subsequent simultaneous interpretation through the trained voice segmentation model, so that the accuracy of simultaneous interpretation can be improved.

There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.

FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the training method of the speech segmentation model. For example, in some embodiments, the training method of the speech segmentation model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708; some or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method of the speech segmentation model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the speech segmentation model.

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service expansibility of traditional physical host and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; this is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
