Speaker recognition method, speaker recognition device, computer equipment and storage medium

Document No.: 36635  Publication date: 2021-09-24

Reading note: This technology, "Speaker recognition method, speaker recognition device, computer equipment and storage medium", was designed and created by 张之勇 (Zhang Zhiyong) and 王健宗 (Wang Jianzong) on 2021-06-30. Its main content is as follows: the embodiments of this application belong to the field of artificial intelligence and relate to a speaker recognition method, apparatus, computer device, and storage medium, applied in the field of smart cities. The method includes: acquiring a mixed speech and a reference speech of a target speaker; extracting a reference speech representation from the reference speech; inputting the reference speech representation into a hybrid extraction model to obtain an estimation mask of the target speaker from the mixed speech according to the reference speech representation, wherein masks in the estimation mask correspond one-to-one to speech signal points in the mixed speech; multiplying the masks with the corresponding speech signal points to obtain the predicted speech of the target speaker; calculating a probabilistic linear discriminant score of the predicted speech and the reference speech; and when the score falls within a preset score interval, determining that the mixed speech contains the speech of the target speaker. In addition, this application relates to blockchain technology: the reference speech representation may be stored in a blockchain. This application improves the accuracy of speaker recognition.

1. A speaker recognition method, comprising the steps of:

acquiring a mixed speech and a reference speech of a target speaker;

extracting a reference speech representation from the reference speech through a reference extraction model;

inputting the reference speech representation into a hybrid extraction model to instruct the hybrid extraction model to obtain an estimation mask of the target speaker from the mixed speech according to the reference speech representation, wherein masks in the estimation mask correspond one-to-one to speech signal points in the mixed speech;

multiplying the masks in the estimation mask with the corresponding speech signal points in the mixed speech to obtain a predicted speech of the target speaker;

calculating a probabilistic linear discriminant score of the predicted speech and the reference speech;

and when the probabilistic linear discriminant score falls within a preset score interval, determining that the mixed speech contains the speech of the target speaker.

2. The speaker recognition method according to claim 1, wherein the step of acquiring a mixed speech and a reference speech of a target speaker is preceded by:

acquiring a training standard speech, a training reference speech, and a training mixed speech for the target speaker, wherein the training mixed speech is obtained by adding interfering speech to the training standard speech;

and training an initial reference extraction model and an initial hybrid extraction model according to the training standard speech, the training reference speech, and the training mixed speech to obtain the reference extraction model and the hybrid extraction model.

3. The speaker recognition method according to claim 2, wherein the step of training an initial reference extraction model and an initial hybrid extraction model according to the training standard speech, the training reference speech, and the training mixed speech to obtain the reference extraction model and the hybrid extraction model comprises:

extracting a reference speech representation from the training reference speech through the initial reference extraction model;

inputting the reference speech representation into the initial hybrid extraction model to instruct the initial hybrid extraction model to extract an estimation mask and a predicted speech of the target speaker from the training mixed speech according to the reference speech representation;

calculating a joint loss based on the estimation mask, the predicted speech, the training standard speech, and the training mixed speech;

and adjusting the initial reference extraction model and the initial hybrid extraction model according to the joint loss until the joint loss meets a training stop condition, to obtain the reference extraction model and the hybrid extraction model.

4. The speaker recognition method according to claim 3, wherein the step of inputting the reference speech representation into the initial hybrid extraction model to instruct the initial hybrid extraction model to extract an estimation mask and a predicted speech of the target speaker from the training mixed speech according to the reference speech representation comprises:

inputting the reference speech representation into the initial hybrid extraction model to instruct the initial hybrid extraction model to extract a prediction representation of the target speaker from the training mixed speech using the reference speech representation as prior information;

inputting the prediction representation into a mask calculation layer in the initial hybrid extraction model to obtain an estimation mask, wherein masks in the estimation mask correspond one-to-one to speech signal points in the training mixed speech;

and performing point-wise multiplication of the masks in the estimation mask with the corresponding speech signal points in the training mixed speech to obtain the predicted speech of the target speaker.

5. The speaker recognition method according to claim 4, wherein the initial hybrid extraction model comprises a plurality of sequentially connected prediction representation extraction layers, and the step of inputting the reference speech representation into the initial hybrid extraction model to instruct the initial hybrid extraction model to extract a prediction representation of the target speaker from the training mixed speech using the reference speech representation as prior information comprises:

splicing the reference speech representation with the training mixed speech and inputting the result into the first prediction representation extraction layer, wherein the reference speech representation is the prior information, the training mixed speech is the source information, and the prior information is used for instructing the prediction representation extraction layer to extract a prediction representation from the source information;

for each prediction representation extraction layer after the first, splicing the reference speech representation with the prediction representation and inputting the result into the next prediction representation extraction layer, iterating until the last prediction representation extraction layer, wherein the reference speech representation is the prior information, the prediction representation is the source information, and the prior information is used for instructing the prediction representation extraction layer to extract a prediction representation from the source information;

and determining the prediction representation output by the last prediction representation extraction layer as the prediction representation of the target speaker.

6. The speaker recognition method according to claim 3, wherein the step of calculating a joint loss based on the estimation mask, the predicted speech, the training standard speech, and the training mixed speech comprises:

comparing the training standard speech with the training mixed speech to obtain an ideal mask;

calculating a first loss from the estimation mask and the ideal mask;

calculating a second loss from the predicted speech, the training standard speech, and the training mixed speech;

and performing a linear operation on the first loss and the second loss to obtain the joint loss.

7. The speaker recognition method according to claim 1, wherein the step of calculating a probabilistic linear discriminant score of the predicted speech and the reference speech comprises:

extracting X-vector representations of the predicted speech and the reference speech respectively through a probabilistic linear discriminator;

calculating a log-likelihood ratio of the X-vector representations of the predicted speech and the reference speech;

and determining the obtained log-likelihood ratio as the probabilistic linear discriminant score of the predicted speech and the reference speech.

8. A speaker recognition apparatus, comprising:

a speech acquisition module, configured to acquire a mixed speech and a reference speech of a target speaker;

a reference extraction module, configured to extract a reference speech representation from the reference speech through a reference extraction model;

a mask acquisition module, configured to input the reference speech representation into a hybrid extraction model to instruct the hybrid extraction model to obtain an estimation mask of the target speaker from the mixed speech according to the reference speech representation, wherein masks in the estimation mask correspond one-to-one to speech signal points in the mixed speech;

a prediction acquisition module, configured to multiply the masks in the estimation mask with the corresponding speech signal points in the mixed speech to obtain a predicted speech of the target speaker;

a score calculation module, configured to calculate a probabilistic linear discriminant score of the predicted speech and the reference speech;

and a determination module, configured to determine that the mixed speech contains the speech of the target speaker when the probabilistic linear discriminant score falls within a preset score interval.

9. A computer device, comprising a memory and a processor, wherein the memory stores computer readable instructions, and the processor, when executing the computer readable instructions, implements the steps of the speaker recognition method according to any one of claims 1 to 7.

10. A computer readable storage medium having computer readable instructions stored thereon, wherein the computer readable instructions, when executed by a processor, implement the steps of the speaker recognition method according to any one of claims 1 to 7.

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a speaker recognition method, apparatus, computer device, and storage medium.

Background

Speaker recognition and speaker verification are important components of speech processing and forms of biometric technology; their aim is to verify whether a segment of speech was spoken by a target speaker. Speaker recognition has wide applications in security and entertainment. For example, in a smart home, voice control of appliances can be restricted to certain family members to reduce the chance of children operating dangerous appliances.

The cocktail party scenario is one in which multiple speakers talk in a noisy environment. Speaker recognition on a single channel in this scenario is a difficult task: monaural speech carries no spatial information, which makes noise filtering and target-speaker localization hard, and the mixing of an uncertain number of voices in the monaural signal makes speaker recognition especially difficult. Conventional speaker recognition techniques such as deep clustering can handle the cocktail party scenario, but only under certain preconditions, for example that the number of speakers is known in advance or that the audio meets certain type requirements. When these preconditions cannot be satisfied, recognition accuracy drops significantly.

Disclosure of Invention

The embodiments of the present application aim to provide a speaker recognition method, apparatus, computer device, and storage medium, so as to solve the problem of low speaker recognition accuracy.

In order to solve the above technical problem, an embodiment of the present application provides a speaker recognition method, which adopts the following technical solutions:

acquiring a mixed speech and a reference speech of a target speaker;

extracting a reference speech representation from the reference speech through a reference extraction model;

inputting the reference speech representation into a hybrid extraction model to instruct the hybrid extraction model to obtain an estimation mask of the target speaker from the mixed speech according to the reference speech representation, wherein masks in the estimation mask correspond one-to-one to speech signal points in the mixed speech;

multiplying the masks in the estimation mask with the corresponding speech signal points in the mixed speech to obtain a predicted speech of the target speaker;

calculating a probabilistic linear discriminant score of the predicted speech and the reference speech;

and when the probabilistic linear discriminant score falls within a preset score interval, determining that the mixed speech contains the speech of the target speaker.

In order to solve the above technical problem, an embodiment of the present application further provides a speaker recognition apparatus, which adopts the following technical solutions:

a speech acquisition module, configured to acquire a mixed speech and a reference speech of a target speaker;

a reference extraction module, configured to extract a reference speech representation from the reference speech through a reference extraction model;

a mask acquisition module, configured to input the reference speech representation into a hybrid extraction model to instruct the hybrid extraction model to obtain an estimation mask of the target speaker from the mixed speech according to the reference speech representation, wherein masks in the estimation mask correspond one-to-one to speech signal points in the mixed speech;

a prediction acquisition module, configured to multiply the masks in the estimation mask with the corresponding speech signal points in the mixed speech to obtain a predicted speech of the target speaker;

a score calculation module, configured to calculate a probabilistic linear discriminant score of the predicted speech and the reference speech;

and a determination module, configured to determine that the mixed speech contains the speech of the target speaker when the probabilistic linear discriminant score falls within a preset score interval.

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:

acquiring a mixed speech and a reference speech of a target speaker;

extracting a reference speech representation from the reference speech through a reference extraction model;

inputting the reference speech representation into a hybrid extraction model to instruct the hybrid extraction model to obtain an estimation mask of the target speaker from the mixed speech according to the reference speech representation, wherein masks in the estimation mask correspond one-to-one to speech signal points in the mixed speech;

multiplying the masks in the estimation mask with the corresponding speech signal points in the mixed speech to obtain a predicted speech of the target speaker;

calculating a probabilistic linear discriminant score of the predicted speech and the reference speech;

and when the probabilistic linear discriminant score falls within a preset score interval, determining that the mixed speech contains the speech of the target speaker.

In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:

acquiring a mixed speech and a reference speech of a target speaker;

extracting a reference speech representation from the reference speech through a reference extraction model;

inputting the reference speech representation into a hybrid extraction model to instruct the hybrid extraction model to obtain an estimation mask of the target speaker from the mixed speech according to the reference speech representation, wherein masks in the estimation mask correspond one-to-one to speech signal points in the mixed speech;

multiplying the masks in the estimation mask with the corresponding speech signal points in the mixed speech to obtain a predicted speech of the target speaker;

calculating a probabilistic linear discriminant score of the predicted speech and the reference speech;

and when the probabilistic linear discriminant score falls within a preset score interval, determining that the mixed speech contains the speech of the target speaker.

Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects: a mixed speech and a reference speech of a target speaker are acquired, and a reference speech representation characterizing the speaker's voiceprint is extracted from the reference speech; the reference speech representation serves as prior information, based on which the feature information of the target speaker can be accurately located in the mixed speech to obtain an estimation mask; the estimation mask is a prediction of the distribution of the target speaker's speech in the mixed speech, and its masks correspond one-to-one to the speech signal points in the mixed speech, so multiplying the masks with the corresponding speech signal points accurately yields the predicted speech of the target speaker; a probabilistic linear discriminant score of the predicted speech and the reference speech is then calculated for further verification, and when the score falls within a preset credible score interval, the mixed speech is determined to contain the speech of the target speaker, further improving the accuracy of speaker recognition.

Drawings

In order to illustrate the solution of the present application more clearly, the drawings needed for describing the embodiments of the present application are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without inventive effort.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a speaker recognition method according to the present application;

FIG. 3 is a schematic block diagram of one embodiment of a speaker recognition apparatus according to the present application;

FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that the speaker recognition method provided in the embodiments of the present application is generally executed by a server, and accordingly, the speaker recognition apparatus is generally disposed in the server.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow diagram of one embodiment of a speaker recognition method according to the present application is shown. The speaker recognition method comprises the following steps:

In step S201, a mixed speech and a reference speech of a target speaker are acquired.

In this embodiment, the electronic device (e.g., the server shown in fig. 1) on which the speaker recognition method runs may communicate with the terminal through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (Ultra-Wideband) connection, and other wireless connections now known or developed in the future.

The reference speech may be a pre-recorded utterance of the target speaker. The mixed speech may contain the voices of several speakers as well as noise, and may or may not include the target speaker's speech; the task of the present application is to recognize whether the mixed speech contains the speech of the target speaker.

Specifically, the mixed speech is usually collected by the terminal from the real environment through a sound collector; for example, a television may pick up the surrounding sound to obtain the mixed speech. The mixed speech in the present application may be monaural and thus contain no spatial information. After acquiring the mixed speech, the terminal triggers a speaker recognition instruction and sends the instruction and the mixed speech to the server. The server obtains the pre-stored reference speech of the target speaker according to the speaker recognition instruction.

In step S202, a reference speech representation is extracted from the reference speech through a reference extraction model.

The reference speech representation is feature data extracted from the reference speech and contains the voiceprint characteristics of the target speaker.

Specifically, speaker recognition relies on the voiceprint characteristics of the speaker, and the server extracts the reference speech representation from the reference speech through the reference extraction model. The reference extraction model may be a trained neural network, or a whole composed of preset speech processing algorithms.

Typically, the mixed speech is collected and processed in real time, while the reference speech is prepared in advance, so the reference speech representation can also be extracted from the reference speech in advance. After receiving the speaker recognition instruction, the server directly obtains the prepared reference speech representation according to the instruction.

It should be emphasized that, to further ensure the privacy and security of the reference speech representation, the reference speech representation may also be stored in a node of a blockchain.

The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated with each other by cryptographic methods, where each data block contains the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

In step S203, the reference speech representation is input into a hybrid extraction model to instruct the hybrid extraction model to obtain an estimation mask of the target speaker from the mixed speech according to the reference speech representation, wherein masks in the estimation mask correspond one-to-one to speech signal points in the mixed speech.

The estimation mask is a mask estimating the distribution of the target speaker's speech in the mixed speech.

Specifically, the server inputs the reference speech representation, which records the voiceprint characteristics of the target speaker, into the hybrid extraction model. The hybrid extraction model extracts a prediction representation of the target speaker from the mixed speech according to the reference speech representation; the prediction representation is the feature data output by the hybrid extraction model that contains the voiceprint characteristics of the target speaker. The hybrid extraction model may be a trained neural network, or a whole composed of preset speech processing algorithms.

The hybrid extraction model computes the estimation mask from the prediction representation. The estimation mask is composed of masks; each mask is a single number whose value may be 0 or 1.

The mixed speech is composed of discrete speech signal points, each of which corresponds to one mask in the estimation mask. A mask value of 0 indicates that the hybrid extraction model judges that the speech signal point contains no speech information of the target speaker; a mask value of 1 indicates that the model judges that the point does contain speech information of the target speaker.

When the speech signal points of the mixed speech are points in a time-domain waveform, each point carries time and amplitude information and the estimation mask is a vector; when they are points in a time-frequency representation, each point carries time, frequency, and amplitude information, the points form a matrix, and the estimation mask is likewise a matrix.

In step S204, the masks in the estimation mask are multiplied with the corresponding speech signal points in the mixed speech to obtain the predicted speech of the target speaker.

Specifically, each mask in the estimation mask is multiplied with the amplitude of its corresponding speech signal point in the mixed speech; this filters out the speech in the mixed speech that is irrelevant to the target speaker and thereby extracts the predicted speech of the target speaker. The predicted speech is the speech that the hybrid extraction model judges to come from the target speaker.
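To make the masking operation concrete, the following is a minimal sketch in Python/NumPy, assuming a binary mask over a magnitude spectrogram (the function and variable names are illustrative, not from the original disclosure):

```python
import numpy as np

def apply_mask(mixed_mag: np.ndarray, est_mask: np.ndarray) -> np.ndarray:
    """Element-wise product of the estimation mask and the mixed-speech
    magnitudes: points masked with 0 are filtered out, points masked
    with 1 are kept as the target speaker's predicted speech."""
    assert mixed_mag.shape == est_mask.shape  # one-to-one correspondence
    return est_mask * mixed_mag

# Time-frequency case: the mixed speech is a (freq_bins x frames)
# magnitude matrix, and the estimation mask has the same shape.
rng = np.random.default_rng(0)
mixed_mag = np.abs(rng.standard_normal((257, 100)))
est_mask = (rng.random((257, 100)) > 0.5).astype(np.float64)
predicted_mag = apply_mask(mixed_mag, est_mask)
```

In the time-domain case the same element-wise product applies, with the mask and signal as vectors rather than matrices.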

In step S205, a probabilistic linear discriminant score of the predicted speech and the reference speech is calculated.

The probabilistic linear discriminant score measures the similarity between the predicted speech and the reference speech, that is, the probability that the two come from the same speaker.

Specifically, to further verify whether the predicted speech and the reference speech come from the target speaker, a probabilistic linear discriminant score, i.e., a PLDA (Probabilistic Linear Discriminant Analysis) score, of the predicted speech and the reference speech may be calculated. PLDA is a channel compensation algorithm and can be viewed as a probabilistic form of LDA (Linear Discriminant Analysis, widely used in pattern recognition); it provides channel compensation in its computation and can be used to classify input objects and measure their correlation.

Further, step S205 may include: extracting X-vector representations of the predicted speech and the reference speech respectively through a probabilistic linear discriminator; calculating a log-likelihood ratio of the X-vector representations of the predicted speech and the reference speech; and determining the obtained log-likelihood ratio as the probabilistic linear discriminant score of the predicted speech and the reference speech.

The probabilistic linear discriminator may be a model implementing Probabilistic Linear Discriminant Analysis (PLDA).

Specifically, the predicted speech and the reference speech are input into the probabilistic linear discriminator, which extracts an X-vector representation of each. An X-vector representation is a fixed-length speaker embedding vector used as the input feature of the PLDA algorithm.

The log-likelihood ratio measures how similar the predicted speech and the reference speech are. The probabilistic linear discriminator calculates the log-likelihood ratio of the X-vector representation of the predicted speech against that of the reference speech, and uses the log-likelihood ratio as the probabilistic linear discriminant score, i.e., the PLDA score, of the predicted speech and the reference speech.

The higher the probabilistic linear discriminant score, the higher the probability that the predicted speech and the reference speech come from the same speaker. Since the reference speech comes from a specific target speaker, a higher score means a higher probability that the predicted speech also comes from that target speaker.

In this embodiment, X-vector representations are extracted from the speech to calculate a log-likelihood ratio, which serves as the probabilistic linear discriminant score; this further verifies whether the predicted speech and the reference speech come from the target speaker and ensures the accuracy of target speaker recognition.

In step S206, when the probabilistic linear discriminant score falls within a preset score interval, it is determined that the mixed speech contains the speech of the target speaker.

Specifically, a score threshold may be preset. When the probabilistic linear discriminant score is greater than the preset threshold, the score falls within the preset score interval, and it can be determined that the predicted speech and the reference speech come from the same target speaker, that is, the mixed speech contains the speech of the target speaker; the target speaker is thereby recognized in the mixed speech.
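As a concrete illustration of the scoring and decision steps, the following is a minimal sketch of two-covariance PLDA log-likelihood-ratio scoring in Python, assuming the X-vectors have already been extracted and mean-centred and that a PLDA model with between-speaker covariance B and within-speaker covariance W has been trained elsewhere (all names here are illustrative assumptions, not the original implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1: np.ndarray, x2: np.ndarray,
             B: np.ndarray, W: np.ndarray) -> float:
    """Log-likelihood ratio of two mean-centred X-vectors under a
    two-covariance PLDA model: same-speaker hypothesis vs
    different-speaker hypothesis."""
    d = x1.shape[0]
    z = np.concatenate([x1, x2])
    # Same speaker: the pair shares a single latent speaker variable,
    # so the cross-covariance between x1 and x2 is B.
    cov_same = np.block([[B + W, B], [B, B + W]])
    # Different speakers: independent latent speaker variables.
    zero = np.zeros((d, d))
    cov_diff = np.block([[B + W, zero], [zero, B + W]])
    mean = np.zeros(2 * d)
    return (multivariate_normal.logpdf(z, mean=mean, cov=cov_same)
            - multivariate_normal.logpdf(z, mean=mean, cov=cov_diff))

# Decision rule of step S206: the score interval here is a simple
# threshold, which would be tuned on held-out trials.
# is_target = plda_llr(xvec_pred, xvec_ref, B, W) > score_threshold
```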

In this embodiment, a mixed speech and a reference speech of a target speaker are acquired, and a reference speech representation characterizing the speaker's voiceprint is extracted from the reference speech; the reference speech representation serves as prior information, based on which the feature information of the target speaker can be accurately located in the mixed speech to obtain an estimation mask; the estimation mask is a prediction of the distribution of the target speaker's speech in the mixed speech, and its masks correspond one-to-one to the speech signal points in the mixed speech, so multiplying the masks with the corresponding speech signal points accurately yields the predicted speech of the target speaker; a probabilistic linear discriminant score of the predicted speech and the reference speech is then calculated for further verification, and when the score falls within a preset credible score interval, the mixed speech is determined to contain the speech of the target speaker, further improving the accuracy of speaker recognition.

Further, before step S201, the method may further include:

step S207, acquiring training standard voice, training reference voice and training mixed voice aiming at the target speaker, wherein the training mixed voice is obtained by adding interference voice in the training standard voice.

Wherein, the training standard voice and the training reference voice are the pre-recorded voice of the target speaker; the training mixed voice is obtained by adding interference voice to training standard voice, wherein the interference voice comprises voice and noise of speakers except the target speaker.

In particular, the reference extraction model and the hybrid extraction model in the present application may be constructed based on a neural network. When the reference extraction model and the hybrid extraction model are constructed based on the neural network, the reference extraction model and the hybrid extraction model need to be obtained through model training.

The server acquires training standard voice, training reference voice and training mixed voice aiming at the target speaker so as to carry out model training. The training standard speech and the training reference speech are from the same speaker, and the speech content may be different. The number of training reference voices may be small, but the training standard voices should be kept at a large number.

In step S208, an initial reference extraction model and an initial hybrid extraction model are trained according to the training standard speech, the training reference speech, and the training mixed speech to obtain the reference extraction model and the hybrid extraction model.

The initial reference extraction model is a reference extraction model that has not yet been trained; the initial hybrid extraction model is a hybrid extraction model that has not yet been trained.

Specifically, the initial reference extraction model and the initial hybrid extraction model can be trained simultaneously. During training, the training reference speech and the training standard speech play the role of sample labels in supervised learning: the initial reference extraction model extracts a reference speech representation from the training reference speech, and the initial hybrid extraction model extracts the predicted speech of the target speaker from the training mixed speech.

In this embodiment, the training standard speech, training reference speech, and training mixed speech for the target speaker are acquired for model training to obtain the reference extraction model and the hybrid extraction model, which enables speaker recognition based on these models.

Further, step S208 may include: extracting a reference speech representation from the training reference speech through the initial reference extraction model; inputting the reference speech representation into the initial hybrid extraction model to instruct the initial hybrid extraction model to extract an estimation mask and a predicted speech of the target speaker from the training mixed speech according to the reference speech representation; calculating a joint loss based on the estimation mask, the predicted speech, the training standard speech, and the training mixed speech; and adjusting the initial reference extraction model and the initial hybrid extraction model according to the joint loss until the joint loss meets a training stop condition, to obtain the reference extraction model and the hybrid extraction model.

Specifically, the training reference speech is input into the initial reference extraction model, which may comprise several sequentially connected reference representation extraction layers, each composed of a reference representation extraction network and a nonlinear transformation layer.

In the first reference representation extraction layer, the reference representation extraction network processes the input training reference speech, and the processing result is passed through the nonlinear transformation layer to obtain the extraction result of the first layer. The extraction result of the first layer is input into the second reference representation extraction layer, whose reference representation extraction network processes it and passes the result through its nonlinear transformation layer to obtain the extraction result of the second layer. The extraction result of the second layer is input into the next layer, and this process iterates until the last reference representation extraction layer. The extraction result of the last layer undergoes a linear transformation in a linear transformation layer, whose output is fed into a mean pooling layer for mean pooling over time; the output of the mean pooling layer is the reference speech representation output by the initial reference extraction model.
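A minimal PyTorch sketch of this stacked structure follows, assuming BLSTM extraction networks and GeLU nonlinear transformation layers (network choices discussed later in this description); the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class ReferenceExtractionModel(nn.Module):
    """Stacked reference representation extraction layers: each layer is a
    BLSTM (reference representation extraction network) followed by a GeLU
    (nonlinear transformation layer); the last layer's output passes through
    a linear transformation layer and mean pooling over time."""

    def __init__(self, feat_dim=40, hidden=256, num_layers=3, embed_dim=128):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(num_layers):
            self.layers.append(nn.LSTM(in_dim, hidden, batch_first=True,
                                       bidirectional=True))
            in_dim = 2 * hidden  # a BLSTM doubles the feature dimension
        self.act = nn.GELU()
        self.linear = nn.Linear(in_dim, embed_dim)

    def forward(self, ref_feats):            # (batch, frames, feat_dim)
        x = ref_feats
        for blstm in self.layers:
            x, _ = blstm(x)
            x = self.act(x)                  # nonlinear transformation layer
        x = self.linear(x)                   # linear transformation layer
        return x.mean(dim=1)                 # mean pooling -> (batch, embed_dim)
```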

The initial hybrid extraction model is used to extract, from the training mixed speech and according to the reference speech representation, data recording the voiceprint characteristics of the target speaker. After the initial hybrid extraction model processes the training mixed speech, the estimation mask and the predicted speech for the target speaker are obtained.

The model loss can then be calculated. The present application calculates a joint loss based on the estimation mask, the predicted speech, the training standard speech, and the training mixed speech; the joint loss contains both mask-related and speech-related factors and is therefore more accurate.

The server adjusts the model parameters of the initial reference extraction model and the initial hybrid extraction model with the goal of reducing the joint loss, and iteratively trains the adjusted models until the joint loss meets a preset training stop condition, at which point training stops and the reference extraction model and the hybrid extraction model are obtained. The training stop condition may be that the joint loss is less than a preset joint loss threshold.
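A schematic joint-training loop under these rules might look as follows; ref_model, mix_model, train_loader, compute_joint_loss (sketched later in this description), and loss_threshold are all assumed names introduced only for illustration:

```python
import torch

# Both models are updated together, with the goal of reducing the joint loss.
optimizer = torch.optim.Adam(
    list(ref_model.parameters()) + list(mix_model.parameters()), lr=1e-3)

done = False
while not done:
    for ref_feats, mixed_mag, standard_mag, phase_x, phase_y in train_loader:
        ref_repr = ref_model(ref_feats)                    # reference representation
        est_mask, pred_mag = mix_model(mixed_mag, ref_repr)
        loss = compute_joint_loss(est_mask, pred_mag, standard_mag,
                                  mixed_mag, phase_x, phase_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold:                   # training stop condition
            done = True
            break
```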

In this embodiment, the initial reference extraction model extracts a reference speech representation from the training reference speech and inputs it into the initial hybrid extraction model to instruct the latter to extract the estimation mask and the predicted speech for the target speaker; a joint loss is calculated based on the estimation mask, the predicted speech, the training standard speech, and the training mixed speech, and the models are adjusted according to the joint loss, thereby obtaining a reference extraction model and a hybrid extraction model usable for speaker recognition.

Further, the step of inputting the reference speech representation into the initial hybrid extraction model to instruct the initial hybrid extraction model to extract an estimation mask and a predicted speech of the target speaker from the training mixed speech according to the reference speech representation may include:

inputting the reference speech representation into the initial hybrid extraction model to instruct the initial hybrid extraction model to extract a prediction representation of the target speaker from the training mixed speech using the reference speech representation as prior information; inputting the prediction representation into a mask calculation layer in the initial hybrid extraction model to obtain an estimation mask, wherein masks in the estimation mask correspond one-to-one to speech signal points in the training mixed speech; and performing point-wise multiplication of the masks in the estimation mask with the corresponding speech signal points in the training mixed speech to obtain the predicted speech of the target speaker.

Specifically, after the reference speech representation is input into the initial hybrid extraction model, it serves as prior information instructing the model to extract the prediction representation of the target speaker from the training mixed speech. If the training mixed speech contains several speakers, the speaker whose prediction representation is to be extracted is selected by inputting the reference speech representation extracted from that speaker's training reference speech; in other words, different reference speech representations yield different prediction representations from the same training mixed speech.

The mask calculation layer predicts the estimation mask from the prediction representation. The masks in the estimation mask correspond one-to-one to the speech signal points in the training mixed speech; the estimation mask may be a vector or matrix of 0s and 1s, where a mask value of 1 indicates that the initial hybrid extraction model judges the corresponding speech signal point to contain speech information of the target speaker, and a mask value of 0 indicates that it does not.

Based on the estimation mask, speech recognition for the target speaker is transformed from a regression problem into a classification problem. Point-wise multiplication of the masks with the corresponding speech signal points in the training mixed speech reduces the amplitude of points whose mask value is 0 to zero, removing the speech irrelevant to the target speaker and yielding the predicted speech for the target speaker.

In this embodiment, the prediction representation of the target speaker is extracted from the training mixed speech according to the reference speech representation; the prediction representation is fed to the mask calculation layer to obtain the estimation mask, which is the model's prediction of the target speaker's speech distribution; and point-wise multiplication of the masks with the speech signal points of the training mixed speech accurately and quickly extracts the predicted speech of the target speaker.

Further, the initial hybrid extraction model comprises a plurality of sequentially connected prediction representation extraction layers, and the step of inputting the reference speech representation into the initial hybrid extraction model to instruct the initial hybrid extraction model to extract the prediction representation of the target speaker from the training mixed speech using the reference speech representation as prior information may include:

splicing the reference speech representation with the training mixed speech and inputting the result into the first prediction representation extraction layer, wherein the reference speech representation is the prior information, the training mixed speech is the source information, and the prior information is used for instructing the prediction representation extraction layer to extract a prediction representation from the source information;

for each prediction representation extraction layer after the first, splicing the reference speech representation with the prediction representation and inputting the result into the next prediction representation extraction layer, iterating until the last prediction representation extraction layer, wherein the reference speech representation is the prior information, the prediction representation is the source information, and the prior information is used for instructing the prediction representation extraction layer to extract a prediction representation from the source information;

and determining the prediction representation output by the last prediction representation extraction layer as the prediction representation of the target speaker.

Specifically, the initial hybrid extraction model comprises several sequentially connected prediction representation extraction layers, each composed of a prediction representation extraction network and a nonlinear transformation layer.

The reference speech representation is spliced with the training mixed speech and input into the first prediction representation extraction layer, with the reference speech representation as prior information and the training mixed speech as source information. The prediction representation extraction layer processes the source information according to the prior information; the output of the prediction representation extraction network is passed through the nonlinear transformation layer to obtain the prediction representation of the first layer.

The prediction representation of the first layer is spliced with the reference speech representation and input into the second prediction representation extraction layer, where the input prediction representation serves as source information and the reference speech representation as prior information; the second layer's prediction representation extraction network processes the source information, and the result is passed through the nonlinear transformation layer to obtain the prediction representation of the second layer.

The prediction representation of the second layer is input into the next layer, where again the input prediction representation is the source information and the reference speech representation the prior information, and this extraction process iterates until the last prediction representation extraction layer, whose output is taken as the prediction representation of the target speaker.

In one embodiment, the reference representation extraction network may be a BLSTM (Bidirectional Long Short-Term Memory) network or an FSMN (Feedforward Sequential Memory Network); the prediction representation extraction network may likewise be a BLSTM or an FSMN. The FSMN can model long-term dependencies in the signal and processes faster, while the BLSTM generally achieves relatively better quality.

In deployment, the mixed speech is usually input and processed in real time, so the prediction representation extraction network may adopt an FSMN; the reference speech, by contrast, is prepared in advance and the reference speech representation can be computed offline, so the reference representation extraction network may adopt a BLSTM to ensure extraction quality.

The specific structures of the reference representation extraction network and the prediction representation extraction network can be chosen according to the application scenario: if recognition accuracy is paramount, both may use BLSTM networks; if both training time and recognition speed are critical, both may use FSMN networks.

The nonlinear transformation layers in both the reference representation extraction layers and the prediction representation extraction layers may adopt the GeLU (Gaussian Error Linear Unit) activation function.
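The splicing-and-iterating scheme above can be sketched in PyTorch as follows, again assuming BLSTM extraction networks and GeLU activations; the mask calculation layer that would follow (for example, a linear layer with a sigmoid) is indicated but not prescribed by this sketch:

```python
import torch
import torch.nn as nn

class HybridExtractionLayers(nn.Module):
    """Sequentially connected prediction representation extraction layers:
    at every layer the reference speech representation (prior information)
    is spliced with the current source information before extraction."""

    def __init__(self, feat_dim=257, embed_dim=128, hidden=256, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = feat_dim + embed_dim           # source + prior at layer 1
        for _ in range(num_layers):
            self.layers.append(nn.LSTM(in_dim, hidden, batch_first=True,
                                       bidirectional=True))
            in_dim = 2 * hidden + embed_dim     # prediction + prior afterwards
        self.act = nn.GELU()

    def forward(self, mixed_feats, ref_repr):   # (B, T, F), (B, E)
        # Broadcast the prior over every frame of the source information.
        prior = ref_repr.unsqueeze(1).expand(-1, mixed_feats.size(1), -1)
        source = mixed_feats                    # training mixed speech
        for layer in self.layers:
            x = torch.cat([source, prior], dim=-1)  # splice prior + source
            x, _ = layer(x)
            source = self.act(x)                # becomes the next source
        return source                           # prediction representation
```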

In this embodiment, the initial hybrid extraction model includes a plurality of prediction representation extraction layers, each of which performs extraction conditioned on the reference speech representation, ensuring the accuracy of the final prediction representation of the target speaker.

Further, the step of calculating the joint loss based on the estimation mask, the predicted speech, the training standard speech, and the training mixed speech may include: comparing the training standard speech with the training mixed speech to obtain an ideal mask; calculating a first loss from the estimation mask and the ideal mask; calculating a second loss from the predicted speech, the training standard speech, and the training mixed speech; and performing a linear operation on the first loss and the second loss to obtain the joint loss.

Specifically, since the training mixed speech is obtained by adding interfering speech to the training standard speech, the predicted speech in the ideal case should be consistent with the training standard speech. Comparing the training standard speech with the training mixed speech therefore gives the ideal distribution of the target speaker's speech within the training mixed speech; the comparison result is the ideal mask (Ideal Binary Mask, IBM).

The estimation mask is the hybrid extraction model's prediction of the distribution of the target speaker's speech in the training mixed speech, so a mask approximation loss (the first loss) can be calculated from the estimation mask and the ideal mask. In one embodiment, the first loss is calculated as follows:

$$\mathcal{L}_1 = \frac{1}{T} \left\| M - M_{IBM} \right\|^2$$

where $M$ is the estimation mask, $M_{IBM}$ is the ideal mask, and $T$ is the average time length (the duration of a speech processing time frame).

As stated above, the predicted speech in the ideal case should be consistent with the training standard speech. The first loss operates at the mask level and does not take phase into account, so a loss at the speech phase level, i.e., the second loss, can be calculated from the predicted speech, the training standard speech, and the training mixed speech:

$$\mathcal{L}_2 = \frac{1}{T} \left\| M \cdot |Y| - |X| \cos(\theta_x - \theta_y) \right\|^2$$

where $\theta_x$ and $\theta_y$ are the phase angles of the training standard speech and the training mixed speech respectively, $|X|$ and $|Y|$ are their magnitudes, $M$ is the estimation mask, $M \cdot |Y|$ is the predicted speech, and $T$ is the average time length (the duration of a speech processing time frame).

The first loss and the second loss are then combined by a linear operation to obtain the joint loss:

$$\mathcal{L} = \lambda \mathcal{L}_1 + (1 - \lambda) \mathcal{L}_2$$

where $0 < \lambda < 1$.
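Put together, the joint loss can be sketched in PyTorch as below; the 0.5 dominance threshold used to build the ideal binary mask and the default λ are illustrative assumptions, not values fixed by the original disclosure:

```python
import torch

def compute_joint_loss(est_mask, pred_mag, standard_mag, mixed_mag,
                       phase_x, phase_y, lam=0.5):
    """Joint loss following the formulas above: a mask approximation term
    against the ideal binary mask plus a phase-sensitive spectrum term,
    linearly combined with weight lam (0 < lam < 1)."""
    T = est_mask.shape[-1]                      # number of time frames
    # Ideal binary mask: 1 where the training standard speech dominates
    # the training mixed speech (0.5 is an assumed dominance threshold).
    ibm = (standard_mag > 0.5 * mixed_mag).float()
    loss_mask = ((est_mask - ibm) ** 2).sum() / T           # first loss
    # Phase-sensitive target |X| cos(theta_x - theta_y); pred_mag = M * |Y|.
    psa_target = standard_mag * torch.cos(phase_x - phase_y)
    loss_phase = ((pred_mag - psa_target) ** 2).sum() / T   # second loss
    return lam * loss_mask + (1 - lam) * loss_phase
```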

In this embodiment, the first loss is calculated at the mask level and the second loss at the phase level; the joint loss obtained by combining them takes both the mask and the phase into account, which improves the accuracy of the loss calculation and thus the accuracy of the models adjusted according to it.

The present method and apparatus can recognize multiple target speakers simultaneously. To recognize several target speakers, it suffices to prepare the reference speech representation of each target speaker in advance: the hybrid extraction model extracts a predicted speech from the mixed speech for each reference speech representation, and the probabilistic linear discriminant scores are then calculated, as sketched below. During training, a training standard speech, training reference speech, and training mixed speech are prepared separately for each target speaker before model training, so that the trained reference extraction model and hybrid extraction model can recognize each target speaker.
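A schematic multi-speaker recognition loop, with one precomputed reference speech representation per speaker (mix_model, scorer, and threshold are illustrative assumptions):

```python
def recognize_speakers(mixed_mag, ref_reprs, mix_model, scorer, threshold):
    """Return the ids of the target speakers whose speech is detected
    in the mixed speech, one extraction and one score per speaker."""
    detected = []
    for speaker_id, ref_repr in ref_reprs.items():
        est_mask, pred_mag = mix_model(mixed_mag, ref_repr)
        # PLDA score of the predicted speech against that speaker's reference.
        if scorer(pred_mag, speaker_id) > threshold:
            detected.append(speaker_id)
    return detected
```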

The present application can be applied in the field of smart cities to promote their construction. Specifically, it can be applied to identity recognition in intelligent security systems, and to smart home devices in smart communities.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium; when executed, the program may include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or a volatile storage medium such as a Random Access Memory (RAM).

It should be understood that, although the steps in the flowcharts of the figures are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise several sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different times, and whose order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a speaker recognition apparatus, which corresponds to the embodiment of the method shown in fig. 2 and can be applied to various electronic devices.

As shown in fig. 3, the speaker recognition apparatus 300 according to the present embodiment includes: a speech acquisition module 301, a reference extraction module 302, a mask acquisition module 303, a prediction acquisition module 304, a score calculation module 305, and a determination module 306, wherein:

The speech acquisition module 301 is configured to acquire the mixed speech and the reference speech of the target speaker.

The reference extraction module 302 is configured to extract a reference speech characterization from the reference speech through the reference extraction model.

The mask acquisition module 303 is configured to input the reference speech characterization into the mixed extraction model to instruct the mixed extraction model to obtain an estimation mask of the target speaker from the mixed speech according to the reference speech characterization, where the masks in the estimation mask correspond one to one with the speech signal points in the mixed speech.

The prediction acquisition module 304 is configured to correspondingly multiply the masks in the estimation mask with the speech signal points in the mixed speech to obtain the predicted speech of the target speaker.

The score calculation module 305 is configured to calculate a probabilistic linear judgment score of the predicted speech and the reference speech.

The determination module 306 is configured to determine that the mixed speech contains the speech of the target speaker when the probabilistic linear judgment score falls within the preset score interval.

In this embodiment, the mixed speech and the reference speech of the target speaker are acquired, and a reference speech characterization representing the speaker's voiceprint features is extracted from the reference speech. The reference speech characterization serves as prior information, based on which the feature information of the target speaker can be accurately located in the mixed speech to obtain an estimation mask. The estimation mask predicts the distribution of the target speaker's speech within the mixed speech; the masks in the estimation mask correspond one to one with the speech signal points in the mixed speech, so multiplying them together accurately yields the predicted speech of the target speaker. The probabilistic linear judgment score of the predicted speech and the reference speech is then calculated for further confirmation: when the score falls within a preset credible score interval, the mixed speech is determined to contain the speech of the target speaker, further improving the accuracy of speaker recognition.
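The module decomposition above can be read as the following minimal sketch, with the three models injected as callables; the interfaces and the score interval are assumptions, not the patent's exact design.

```python
class SpeakerRecognizer:
    """Minimal sketch mirroring modules 301-306."""

    def __init__(self, reference_model, mixture_model, plda_scorer, score_interval):
        self.reference_model = reference_model   # reference extraction model
        self.mixture_model = mixture_model       # mixed extraction model
        self.plda_scorer = plda_scorer           # probabilistic linear judger
        self.low, self.high = score_interval     # preset score interval

    def contains_speaker(self, mixed_speech, reference_speech):
        ref_repr = self.reference_model(reference_speech)      # module 302
        est_mask = self.mixture_model(mixed_speech, ref_repr)  # module 303
        predicted = est_mask * mixed_speech                    # module 304: point-wise product
        score = self.plda_scorer(predicted, reference_speech)  # module 305
        return self.low <= score <= self.high                  # module 306
```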

In some optional implementations of the present embodiment, the speaker recognition apparatus 300 may further include a training acquisition module and a model training module, wherein:

The training acquisition module is configured to acquire training standard speech, training reference speech and training mixed speech for the target speaker, where the training mixed speech is obtained by adding interference speech to the training standard speech.

The model training module is configured to train the initial reference extraction model and the initial mixed extraction model according to the training standard speech, the training reference speech and the training mixed speech to obtain the reference extraction model and the mixed extraction model.

In this embodiment, training standard speech, training reference speech and training mixed speech for the target speaker are acquired to perform model training, yielding the reference extraction model and the mixed extraction model and thereby ensuring that speaker recognition can be performed according to these models.

In some optional implementations of this embodiment, the model training module may include: a reference extraction submodule, a mixed extraction submodule, a loss calculation submodule and a model adjustment submodule, wherein:

The reference extraction submodule is configured to extract the reference speech characterization from the training reference speech through the initial reference extraction model.

The mixed extraction submodule is configured to input the reference speech characterization into the initial mixed extraction model to instruct the initial mixed extraction model to extract the estimation mask and the predicted speech of the target speaker from the training mixed speech according to the reference speech characterization.

The loss calculation submodule is configured to calculate the joint loss based on the estimation mask, the predicted speech, the training standard speech and the training mixed speech.

The model adjustment submodule is configured to adjust the initial reference extraction model and the initial mixed extraction model according to the joint loss until the joint loss meets the training stop condition, thereby obtaining the reference extraction model and the mixed extraction model.

In this embodiment, the initial reference extraction model extracts the reference speech characterization from the training reference speech and inputs it into the initial mixed extraction model, instructing the latter to extract the estimation mask and the predicted speech of the target speaker. The joint loss is calculated based on the estimation mask, the predicted speech, the training standard speech and the training mixed speech, and the models are adjusted according to this loss, yielding a reference extraction model and a mixed extraction model usable for speaker recognition.
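Under the same assumptions, a single training step over both models might look as follows. It reuses the joint_loss sketch given earlier; the ideal-mask formula and all model interfaces are illustrative, not the patent's exact networks.

```python
def train_step(ref_model, mix_model, optimizer, batch, lam=0.5):
    """One training step for the initial reference and mixed extraction models.

    batch holds the training reference speech plus STFT magnitudes and phases
    of the training mixed and training standard speech (assumed shapes (T, F)).
    """
    ref_speech, mag_y, theta_y, mag_x, theta_x = batch

    # Ideal mask obtained by comparing the standard speech with the mixed speech.
    ideal_mask = (mag_x / mag_y.clamp(min=1e-8)).clamp(max=1.0)

    ref_repr = ref_model(ref_speech)        # reference speech characterization
    est_mask = mix_model(mag_y, ref_repr)   # estimation mask for the target speaker

    # Joint loss as sketched above; adjusting both models according to it,
    # until it meets the training stop condition, yields the final models.
    loss = joint_loss(est_mask, ideal_mask, mag_x, mag_y, theta_x, theta_y, lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```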

In some optional implementations of this embodiment, the mixed extraction submodule may include: a prediction acquisition unit, a prediction input unit and a point multiplication unit, wherein:

The prediction acquisition unit is configured to input the reference speech characterization into the initial mixed extraction model to instruct the initial mixed extraction model to take the reference speech characterization as prior information and extract the prediction characterization of the target speaker from the training mixed speech.

The prediction input unit is configured to input the prediction characterization into a mask calculation layer in the initial mixed extraction model to obtain the estimation mask, where the masks in the estimation mask correspond one to one with the speech signal points in the training mixed speech.

The point multiplication unit is configured to perform corresponding point multiplication of the masks in the estimation mask with the speech signal points in the training mixed speech to obtain the predicted speech of the target speaker.

In this embodiment, the prediction characterization of the target speaker is extracted from the training mixed speech according to the reference speech characterization; feeding the prediction characterization into the mask calculation layer yields the estimation mask, which is the model's prediction of the target speaker's speech distribution. After point multiplication of the masks in the estimation mask with the speech signal points in the training mixed speech, the predicted speech of the target speaker can be extracted accurately and quickly.
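A minimal sketch of such a mask calculation layer followed by the corresponding point multiplication is given below, assuming a sigmoid-activated linear projection; the layer design is an assumption, not the patent's exact network.

```python
import torch
import torch.nn as nn

class MaskLayer(nn.Module):
    """Maps the prediction characterization to a mask in [0, 1] with one
    value per speech signal point (assumed design)."""

    def __init__(self, repr_dim, freq_bins):
        super().__init__()
        self.proj = nn.Linear(repr_dim, freq_bins)

    def forward(self, pred_repr, mixed_mag):
        # pred_repr: (T, repr_dim) prediction characterization per time frame
        # mixed_mag: (T, freq_bins) magnitude of the training mixed speech
        est_mask = torch.sigmoid(self.proj(pred_repr))  # one mask per signal point
        predicted = est_mask * mixed_mag                # corresponding point multiplication
        return est_mask, predicted
```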

In some optional implementations of this embodiment, the initial mixed extraction model includes a plurality of sequentially connected prediction characterization extraction layers, and the prediction acquisition unit may include: a first input subunit, a splicing subunit and a determining subunit, wherein:

The first input subunit is configured to splice the reference speech characterization with the training mixed speech and input the result into the first prediction characterization extraction layer, where the reference speech characterization is the prior information, the training mixed speech is the source information, and the prior information instructs the prediction characterization extraction layer to extract the prediction characterization from the source information.

The splicing subunit is configured, for each prediction characterization extraction layer after the first, to splice the reference speech characterization with the current prediction characterization and input the result into the next prediction characterization extraction layer, iterating until the last prediction characterization extraction layer, where the reference speech characterization is the prior information and the prediction characterization is the source information.

The determining subunit is configured to determine the prediction characterization output by the last prediction characterization extraction layer as the prediction characterization of the target speaker.

In this embodiment, the initial mixed extraction model includes a plurality of prediction characterization extraction layers, each of which performs prediction characterization extraction conditioned on the reference speech characterization, ensuring the accuracy of the final prediction characterization of the target speaker.
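The layer stacking described above might be sketched as follows, assuming simple fully connected layers and a per-frame reference characterization; in practice the reference characterization would typically be a fixed vector tiled across time frames, and the exact layer type is an assumption.

```python
import torch
import torch.nn as nn

class PredictionExtractor(nn.Module):
    """Sketch of sequentially connected prediction characterization extraction
    layers: at every layer the reference characterization (prior information)
    is spliced with the current source information."""

    def __init__(self, ref_dim, src_dim, hidden_dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = ref_dim + src_dim                 # first layer: prior + mixed speech
        for _ in range(num_layers):
            self.layers.append(nn.Linear(in_dim, hidden_dim))
            in_dim = ref_dim + hidden_dim          # later layers: prior + previous output

    def forward(self, ref_repr, mixed_feats):
        # ref_repr: (T, ref_dim), e.g. a fixed reference vector tiled per frame
        # mixed_feats: (T, src_dim) features of the training mixed speech
        source = mixed_feats
        for layer in self.layers:
            source = torch.relu(layer(torch.cat([ref_repr, source], dim=-1)))
        return source   # prediction characterization output by the last layer
```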

In some optional implementations of this embodiment, the loss calculation submodule may include: a speech comparison unit, a first calculation unit, a second calculation unit and a loss operation unit, wherein:

The speech comparison unit is configured to compare the training standard speech with the training mixed speech to obtain the ideal mask.

The first calculation unit is configured to calculate the first loss based on the estimation mask and the ideal mask.

The second calculation unit is configured to calculate the second loss based on the predicted speech, the training standard speech and the training mixed speech.

The loss operation unit is configured to perform a linear operation on the first loss and the second loss to obtain the joint loss.

In this embodiment, the first loss is calculated at the mask level and the second loss at the phase level; the joint loss obtained by combining the two therefore accounts for both the mask and the phase, which improves the accuracy of the loss calculation and ensures the accuracy of the model adjusted according to this loss.

In some optional implementations of this embodiment, the score calculation module 305 may include: a characterization extraction submodule, a likelihood ratio calculation submodule and a likelihood ratio determination submodule, wherein:

The characterization extraction submodule is configured to extract the X-vector characterizations of the predicted speech and the reference speech, respectively, through the probabilistic linear judger (a PLDA model).

The likelihood ratio calculation submodule is configured to calculate the log-likelihood ratio of the X-vector characterizations of the predicted speech and the reference speech.

The likelihood ratio determination submodule is configured to determine the obtained log-likelihood ratio as the probabilistic linear judgment score of the predicted speech and the reference speech.

In this embodiment, X-vector characterizations are extracted from the speech to calculate a log-likelihood ratio, which is used as the probabilistic linear judgment score; this provides a further judgment on whether the predicted speech and the reference speech come from the same target speaker, ensuring the accuracy of target speaker recognition.
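For illustration, PLDA scoring between two x-vectors can be reduced to a bilinear form once the model is trained. The sketch below assumes the matrices P and Q have been precomputed from the PLDA within- and between-speaker covariances; the x-vector extractor itself is a separate network not shown here.

```python
import numpy as np

def plda_llr(x_pred, x_ref, P, Q, const=0.0):
    """Log-likelihood-ratio score between two x-vectors under a trained PLDA model.

    x_pred, x_ref -- x-vector characterizations of the predicted and reference speech
    P, Q          -- matrices precomputed from the PLDA covariances (assumed given)
    const         -- additive constant from the PLDA normalization terms
    """
    return (x_pred @ Q @ x_pred + x_ref @ Q @ x_ref
            + 2.0 * x_pred @ P @ x_ref + const)
```

When the resulting score falls within the preset score interval, the predicted speech and the reference speech are judged to come from the same target speaker.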

In order to solve the above technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of the basic structure of the computer device according to this embodiment.

The computer device 4 comprises a memory 41, a processor 42 and a network interface 43, which are communicatively connected to each other via a system bus. It is noted that only a computer device 4 having components 41-43 is shown, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device can be a desktop computer, a notebook computer, a palmtop computer, a cloud server or another computing device. The computer device can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, a voice control device or the like.

The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various application software, such as computer readable instructions of a speaker recognition method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.

The processor 42 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer-readable instructions stored in the memory 41 or to process data, for example to execute the computer-readable instructions of the speaker recognition method.

The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.

The computer device provided in this embodiment can execute the speaker recognition method of any of the embodiments described above.

In this embodiment, the mixed speech and the reference speech of the target speaker are acquired, and a reference speech characterization representing the speaker's voiceprint features is extracted from the reference speech. The reference speech characterization serves as prior information, based on which the feature information of the target speaker can be accurately located in the mixed speech to obtain an estimation mask. The estimation mask predicts the distribution of the target speaker's speech within the mixed speech; the masks in the estimation mask correspond one to one with the speech signal points in the mixed speech, so multiplying them together accurately yields the predicted speech of the target speaker. The probabilistic linear judgment score of the predicted speech and the reference speech is then calculated for further confirmation: when the score falls within a preset credible score interval, the mixed speech is determined to contain the speech of the target speaker, further improving the accuracy of speaker recognition.

The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions that are executable by at least one processor to cause the at least one processor to perform the steps of the speaker recognition method as described above.

In this embodiment, the mixed speech and the reference speech of the target speaker are acquired, and a reference speech characterization representing the speaker's voiceprint features is extracted from the reference speech. The reference speech characterization serves as prior information, based on which the feature information of the target speaker can be accurately located in the mixed speech to obtain an estimation mask. The estimation mask predicts the distribution of the target speaker's speech within the mixed speech; the masks in the estimation mask correspond one to one with the speech signal points in the mixed speech, so multiplying them together accurately yields the predicted speech of the target speaker. The probabilistic linear judgment score of the predicted speech and the reference speech is then calculated for further confirmation: when the score falls within a preset credible score interval, the mixed speech is determined to contain the speech of the target speaker, further improving the accuracy of speaker recognition.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

It should be understood that the above-described embodiments are merely illustrative and not restrictive, and that the appended drawings show preferred embodiments of the present application without limiting its scope. The present application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their features with equivalents. All equivalent structures made using the contents of the specification and drawings of the present application, applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.
