Voice information processing and verification model training method, device, equipment and medium

Document No.: 972888  Publication date: 2020-11-03

Reading note: this technology, "Voice information processing and verification model training method, device, equipment and medium", was designed by 陈都, 李家魁, 吕安超, *** on 2020-07-01. Main content: The invention provides a method, an apparatus, a device, and a medium for processing voice information and for training a verification model. A first voice element sequence corresponding to voice information collected by an intelligent device is obtained based on a decoding network. If the first voice element sequence matches any pre-configured instruction voice element sequence, a first combination vector is determined based on the first voice element sequence. If a pre-trained verification model determines, according to the first combination vector, that the voice information is correctly recognized, the intelligent device is controlled to execute the operation corresponding to the instruction word of the matched instruction voice element sequence. Because the voice information, after being recognized based on the decoding network, is checked again by the pre-trained verification model to judge whether the decoding is correct, the intelligent device is prevented from executing the operation of a wrong instruction word as a result of inter-word interference or misrecognition, which improves the accuracy of controlling the intelligent device and improves the user experience.

1. A method for processing voice information, the method comprising:

acquiring, based on a decoding network, a first voice element sequence corresponding to voice information collected by an intelligent device;

if the first voice element sequence matches any pre-configured instruction voice element sequence, determining a first combination vector based on the first voice element sequence;

and if a pre-trained verification model determines, according to the first combination vector, that the voice information is correctly recognized, controlling the intelligent device to execute the operation corresponding to the instruction word of the matched instruction voice element sequence.

2. The method of claim 1, further comprising:

if the verification model determines, according to the first combination vector, that the voice information is not correctly recognized, refusing to respond to the voice information.

3. The method of claim 1, wherein the determining a first combination vector based on the first voice element sequence comprises:

determining a first combination vector according to the first voice element sequence and the feature information of the voice information;

wherein the feature information of the voice information includes at least one of: the probability that each voice frame is a mute frame, the probability that the content information contained in each voice frame is each voice element, a feature vector corresponding to each voice frame, the number of voice frames contained in the voice information, and information about the mute frames among the voice frames of the voice information.

4. The method of claim 3, wherein if the feature information of the voice information includes the probability that each voice frame is a mute frame, the probability that the content information contained in each voice frame is each voice element, and the feature vector corresponding to each voice frame, the determining a first combination vector according to the first voice element sequence and the feature information of the voice information comprises:

determining an average feature vector according to the feature vector corresponding to each voice frame;

for each voice frame, determining a probability difference value corresponding to the voice frame according to the probability that the voice frame is a mute frame and the probability that content information contained in the voice frame is each voice element;

and determining the first combination vector according to the first voice element sequence, the average feature vector, and the probability difference value corresponding to each voice frame.

5. The method of claim 4, wherein the determining a probability difference value corresponding to the voice frame according to the probability that the voice frame is a mute frame and the probability that the content information contained in the voice frame is each voice element comprises:

if the voice frame is determined to be a mute frame according to the first voice element sequence, determining a first extreme value among the probability that the voice frame is a mute frame and the probabilities that the content information contained in the voice frame is each voice element, and determining the probability difference value corresponding to the voice frame according to the difference between the first extreme value and the probability that the voice frame is a mute frame; or

if a target voice element corresponding to the voice frame is determined according to the first voice element sequence, determining a second extreme value among the probability that the voice frame is a mute frame and the probabilities that the content information contained in the voice frame is each voice element, and determining the probability difference value corresponding to the voice frame according to the difference between the second extreme value and the probability that the content information contained in the voice frame is the target voice element.

6. A training method for a verification model, the method comprising:

acquiring any voice element sequence sample in a sample set and a corresponding first label, wherein the first label identifies an instruction word corresponding to an instruction voice element sequence matched with the voice element sequence sample, and whether the instruction word is consistent with an instruction word actually contained in the voice sample corresponding to the voice element sequence sample;

determining a second combination vector based on the voice element sequence sample;

and training an original verification model according to the second combination vector and the first label.

7. An apparatus for processing speech information, the apparatus comprising:

a decoding unit, configured to acquire, based on a decoding network, a first voice element sequence corresponding to voice information collected by an intelligent device;

a first processing unit, configured to determine a first combination vector based on the first voice element sequence if the first voice element sequence matches any pre-configured instruction voice element sequence;

and a second processing unit, configured to control the intelligent device to execute the operation corresponding to the instruction word of the matched instruction voice element sequence if a pre-trained verification model determines, according to the first combination vector, that the voice information is correctly recognized.

8. A training apparatus for a verification model, the apparatus comprising:

an acquisition module, configured to acquire any voice element sequence sample in a sample set and a corresponding first label, wherein the first label identifies an instruction word corresponding to an instruction voice element sequence matched with the voice element sequence sample, and whether the instruction word is consistent with an instruction word actually contained in the voice sample corresponding to the voice element sequence sample;

a determining module, configured to determine a second combination vector based on the voice element sequence sample;

and a training module, configured to train an original verification model according to the second combination vector and the first label.

9. An electronic device, characterized in that the electronic device comprises at least a processor and a memory, the processor being adapted to carry out the steps of the method for processing speech information according to any one of claims 1-5, or the steps of the method for training a verification model according to claim 6, when executing a computer program stored in the memory.

10. A computer-readable storage medium, characterized in that it stores a computer program which, when being executed by a processor, carries out the steps of the method for processing speech information according to any one of claims 1 to 5, or the steps of the method for training a verification model according to claim 6.
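
As an illustration of claims 3 to 5, the following is a minimal Python sketch of how a combination vector might be assembled. It rests on several assumptions that the text does not state: the "extreme value" is taken to be the maximum probability; index 0 of each probability row holds the mute-frame probability; the voice element sequence is given as one class index per frame and is encoded as the fraction of frames assigned to each class; and the per-frame probability differences are pooled into their mean and maximum so that the vector has a fixed length. All function and variable names are hypothetical.

```python
from statistics import fmean

SILENCE = 0  # assumed index of the mute-frame probability in each row


def frame_probability_differences(posteriors, frame_labels):
    """Per-frame gap between the extreme (assumed: maximum) probability
    and the probability of the class the decoder assigned to that frame.

    A gap of 0 means the decoder's choice was also the most probable class.
    """
    return [max(row) - row[label] for row, label in zip(posteriors, frame_labels)]


def combination_vector(posteriors, frame_labels, frame_features):
    """Concatenate a sequence encoding, the average feature vector, and
    pooled probability differences (the encoding and pooling are assumptions)."""
    num_classes = len(posteriors[0])
    num_frames = len(frame_labels)
    # Assumed sequence encoding: share of frames assigned to each class.
    label_hist = [frame_labels.count(c) / num_frames for c in range(num_classes)]
    # Average feature vector over all voice frames (claim 4).
    avg_feature = [fmean(column) for column in zip(*frame_features)]
    diffs = frame_probability_differences(posteriors, frame_labels)
    return label_hist + avg_feature + [fmean(diffs), max(diffs)]


if __name__ == "__main__":
    # Three frames, three classes: mute (0) and two hypothetical voice elements.
    demo_posteriors = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.5, 0.3]]
    demo_labels = [SILENCE, 1, 2]
    demo_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(combination_vector(demo_posteriors, demo_labels, demo_features))
```

In this sketch a frame whose assigned class is not the most probable one (the third demo frame) yields a positive difference, which is the kind of signal a verification model could use to detect a doubtful decoding.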

Technical Field

The present invention relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a medium for processing speech information and training a verification model.

Background

With the rapid development of intelligent interaction technology, in fields such as intelligent vehicles and intelligent homes, an intelligent device can be controlled to complete the task corresponding to an instruction word by recognizing collected voice information that contains the instruction word. For example, when collected voice information containing "temperature adjustment" is recognized, an intelligent air conditioner is controlled to adjust the temperature; when collected voice information containing "wind direction" is recognized, the intelligent air conditioner is controlled to adjust the wind direction.

Disclosure of Invention

The invention provides a method, an apparatus, a device, and a medium for processing voice information and training a verification model, in order to solve the problems of inter-word interference and misrecognition in existing voice information processing.

An embodiment of the invention provides a method for processing voice information, the method comprising:

acquiring, based on a decoding network, a first voice element sequence corresponding to voice information collected by an intelligent device;

if the first voice element sequence matches any pre-configured instruction voice element sequence, determining a first combination vector based on the first voice element sequence;

and if a pre-trained verification model determines, according to the first combination vector, that the voice information is correctly recognized, controlling the intelligent device to execute the operation corresponding to the instruction word of the matched instruction voice element sequence.
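
The control flow of the processing method can be sketched in Python as follows. This is only an illustration under stated assumptions: matching is modeled as an exact dictionary lookup, the verification model is any callable that returns a score in [0, 1], and the 0.5 threshold is arbitrary. The names `handle_utterance`, `instruction_table`, `build_vector`, and `verify` are hypothetical, and doing nothing when no instruction sequence matches is also an assumption, since the text only describes the matched case.

```python
def handle_utterance(element_sequence, instruction_table, build_vector, verify,
                     threshold=0.5):
    """Return the instruction word to execute, or None to refuse to respond.

    element_sequence  -- decoded voice element sequence (output of the decoding network)
    instruction_table -- maps pre-configured instruction sequences (tuples) to instruction words
    build_vector      -- callable that turns the sequence into a combination vector
    verify            -- callable standing in for the pre-trained verification model
    """
    # Step 1: check the decoded sequence against the pre-configured instruction sequences.
    instruction_word = instruction_table.get(tuple(element_sequence))
    if instruction_word is None:
        return None  # assumption: nothing to do when no instruction matches

    # Step 2: build the combination vector only for matched sequences.
    vector = build_vector(element_sequence)

    # Step 3: second-stage check; execute only if the model accepts the decoding.
    if verify(vector) >= threshold:
        return instruction_word
    return None  # verification failed: refuse to respond


if __name__ == "__main__":
    table = {("el_1", "el_2"): "adjust temperature"}  # hypothetical voice elements
    # Stub vector builder and stub model, used only to exercise the control flow.
    print(handle_utterance(["el_1", "el_2"], table, lambda seq: [0.9], lambda vec: vec[0]))
    print(handle_utterance(["el_1", "el_2"], table, lambda seq: [0.1], lambda vec: vec[0]))
```

Keeping the verification step behind the match means the verification model only runs on utterances that already look like an instruction.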

An embodiment of the invention also provides a training method for a verification model, the method comprising:

acquiring any voice element sequence sample in a sample set and a corresponding first label, wherein the first label identifies an instruction word corresponding to an instruction voice element sequence matched with the voice element sequence sample, and whether the instruction word is consistent with an instruction word actually contained in the voice sample corresponding to the voice element sequence sample;

determining a second combination vector based on the voice element sequence sample;

and training an original verification model according to the second combination vector and the first label.
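
The training steps can be illustrated with a small Python sketch. The text does not say what kind of model the verification model is, so this example assumes a logistic regression classifier trained by plain stochastic gradient descent; the first label is assumed to be 1 when the matched instruction word agrees with the word actually spoken and 0 otherwise. The one-dimensional training vectors are made-up toy values, and all names are hypothetical.

```python
import math


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_verification_model(vectors, labels, epochs=500, lr=0.5):
    """Fit logistic regression (an assumed model type) on combination vectors.

    vectors -- list of second combination vectors, one per sample
    labels  -- list of first labels: 1 = decoding was correct, 0 = it was not
    """
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def verify(model, vector):
    """Score in [0, 1]; higher means the decoding is more likely correct."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, vector)))


if __name__ == "__main__":
    # Toy data: a single feature, e.g. a pooled probability difference.
    toy_vectors = [[0.02], [0.05], [0.40], [0.55]]
    toy_labels = [1, 1, 0, 0]
    model = train_verification_model(toy_vectors, toy_labels)
    print(verify(model, [0.03]), verify(model, [0.50]))
```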

An embodiment of the invention also provides an apparatus for processing voice information, the apparatus comprising:

a decoding unit, configured to acquire, based on a decoding network, a first voice element sequence corresponding to voice information collected by an intelligent device;

a first processing unit, configured to determine a first combination vector based on the first voice element sequence if the first voice element sequence matches any pre-configured instruction voice element sequence;

and a second processing unit, configured to control the intelligent device to execute the operation corresponding to the instruction word of the matched instruction voice element sequence if a pre-trained verification model determines, according to the first combination vector, that the voice information is correctly recognized.

An embodiment of the invention also provides a training apparatus for a verification model, the apparatus comprising:

an acquisition module, configured to acquire any voice element sequence sample in a sample set and a corresponding first label, wherein the first label identifies an instruction word corresponding to an instruction voice element sequence matched with the voice element sequence sample, and whether the instruction word is consistent with an instruction word actually contained in the voice sample corresponding to the voice element sequence sample;

a determining module, configured to determine a second combination vector based on the voice element sequence sample;

and a training module, configured to train an original verification model according to the second combination vector and the first label.

An embodiment of the present invention further provides an electronic device, where the electronic device at least includes a processor and a memory, and the processor is configured to implement, when executing a computer program stored in the memory, the steps of the method for processing voice information as described above, or implement the steps of the method for training a verification model as described above.

An embodiment of the present invention further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the method for processing the voice information or implements the steps of the method for training the verification model.

In the process of processing voice information, after the voice information is recognized based on the decoding network, it is checked again by the pre-trained verification model to judge whether the decoding is correct. This prevents the intelligent device from executing the operation of a wrong instruction word as a result of inter-word interference or misrecognition, improves the accuracy of controlling the intelligent device, and improves the user experience.

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a voice information processing procedure according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a specific voice information processing flow according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an apparatus for processing voice information according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a training apparatus for a verification model according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of another electronic device according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the attached drawings, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
