Speaker recognition system and method of use

Document No.: 157362    Publication date: 2021-10-26

Note: This technology, "Speaker recognition system and method of use", was designed and created by Qiongqiong Wang (王琼琼), Koji Okabe (冈部浩司), and Takafumi Koshinaka (越仲孝文) on 2020-02-05. Its main content is summarized as follows.

A speaker recognition system includes a non-transitory computer-readable medium configured to store instructions. The speaker recognition system further includes a processor connected to the non-transitory computer-readable medium. The processor is configured to execute instructions for extracting acoustic features from each of a plurality of frames in input speech data. The processor is configured to execute instructions for computing a saliency value for each of the plurality of frames using a first neural network (NN) based on the extracted acoustic features, wherein the first NN is a trained NN using speaker posteriors. The processor is configured to execute instructions for extracting speaker features using the saliency value for each of the plurality of frames.

1. A speaker recognition system, comprising:

a non-transitory computer-readable medium configured to store instructions; and

a processor connected to the non-transitory computer-readable medium,

wherein the processor is configured to execute instructions for:

extracting acoustic features from each of a plurality of frames in input speech data;

computing a saliency value for each frame of the plurality of frames using a first Neural Network (NN) based on the extracted acoustic features, wherein the first NN is a trained NN using speaker posteriors; and

extracting speaker features using the saliency values for each frame of the plurality of frames.

2. The speaker recognition system of claim 1, wherein the processor is configured to execute instructions for:

extracting the speaker features using a weighted pooling process that is implemented using the saliency values for each of the plurality of frames.

3. The speaker recognition system of claim 1, wherein the processor is configured to execute instructions for:

training the first NN using the speaker posteriors.

4. The speaker recognition system of claim 3, wherein the processor is configured to execute instructions for:

generating the speaker posteriors using training data and speaker identification information.

5. The speaker recognition system of claim 1, wherein the processor is configured to execute instructions for:

calculating the saliency value for each of the plurality of frames based on a gradient of the speaker posterior with respect to the extracted acoustic features for each of the plurality of frames.

6. The speaker recognition system of claim 1, wherein the processor is configured to execute instructions for:

calculating the saliency value for each frame of the plurality of frames using a first node of the first NN and a second node of the first NN,

wherein a first frame of the plurality of frames output at the first node indicates that the first frame has more useful information than a second frame of the plurality of frames output at the second node.

7. The speaker recognition system of claim 6, wherein the processor is configured to execute instructions for:

calculating the saliency value for each frame of the plurality of frames based on a gradient of the speaker posterior with respect to the extracted acoustic features for each frame of the plurality of frames output at the first node of the first NN.

8. The speaker recognition system of claim 1, wherein the processor is configured to execute instructions for:

outputting an identity of a speaker of the input speech data based on the extracted speaker features.

9. The speaker recognition system of claim 1, wherein the processor is configured to execute instructions for:

matching a speaker of the input speech data with a stored speaker identification based on the extracted speaker features.

10. The speaker recognition system of claim 1, wherein the processor is configured to execute instructions for:

granting access to a computer system in response to the extracted speaker features matching an authorized user.

11. A speaker recognition method, comprising:

receiving input speech data;

extracting acoustic features from each of a plurality of frames in the input speech data;

computing a saliency value for each frame of the plurality of frames using a first Neural Network (NN) based on the extracted acoustic features, wherein the first NN is a trained NN using speaker posteriors; and

extracting speaker features using the saliency values for each frame of the plurality of frames.

12. The speaker recognition method of claim 11, wherein the extracting of the speaker features comprises:

using a weighted pooling process that is implemented using the saliency value for each of the plurality of frames.

13. The speaker recognition method of claim 11, further comprising:

training the first NN using the speaker posteriors.

14. The speaker recognition method of claim 13, further comprising:

generating the speaker posteriors using training data and speaker identification information.

15. The speaker recognition method of claim 11, wherein the calculation of the saliency value for each frame of the plurality of frames is based on:

a gradient of the speaker posterior with respect to the extracted acoustic features for each of the plurality of frames.

16. The speaker recognition method of claim 11, wherein the calculating the saliency value for each of the plurality of frames comprises:

receiving information from a first node of the first NN and from a second node of the first NN,

wherein a first frame of the plurality of frames output at the first node indicates that the first frame has more useful information than a second frame of the plurality of frames output at the second node.

17. The speaker recognition method of claim 16, wherein the calculation of the saliency value for each frame of the plurality of frames is based on:

a gradient of the speaker posterior with respect to the extracted acoustic features for each of the plurality of frames output at the first node of the first NN.

18. The speaker recognition method of claim 11, further comprising:

outputting an identity of a speaker of the input speech data based on the extracted speaker features.

19. The speaker recognition method of claim 11, further comprising:

matching a speaker of the input speech data with a stored speaker identification based on the extracted speaker features.

20. The speaker recognition method of claim 11, further comprising:

granting access to a computer system in response to the extracted speaker features matching an authorized user.

Background

In speaker recognition, a system receives a series of raw features, also referred to as acoustic features, having a variable number of frames. A frame is a period of time for which the features contain data. The raw features are frame-level features, meaning that the information is segmented into time segments. The system is expected to output a speaker identity in a speaker identification scheme, or a target/impostor decision in a speaker verification scheme. Both the speaker identity and the target/impostor decision are determined at the utterance level, meaning that the entire set of information, which may include many frames, is analyzed. To produce such utterance-level outputs from frame-level inputs, some speaker recognition systems use a pooling process over all valid frames. Equal-weighted pooling is typically used, meaning that each frame of the raw features is given the same importance regardless of the quality of the information in the frame.

Speaker recognition methods include i-vector based methods and DNN-based speaker embedding methods. Both methods use equal-weighted pooling for this purpose, obtaining utterance-level speaker recognition results from the frame-level information.

In the i-vector based approach, given a sequence of features y_1, y_2, ..., y_L with L frames, an utterance-level feature x is extracted according to the following formula:

M = μ + Tx,

where the supervector M is obtained by concatenating all of the M_c, and c is the index of the Gaussian component in the GMM-UBM. All frames are treated equally, because the statistics from which each M_c is computed are simply summed over all frames.
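As an illustration of the equal weighting described above, the following is a minimal NumPy sketch of the kind of per-component statistics that feed an i-vector extractor, with every frame entering the sums with weight 1. The function name, argument shapes, and the use of diagonal covariances are illustrative assumptions rather than details taken from the patent.

```python
import numpy as np

def equal_weight_gmm_stats(y, means, covs, weights):
    """Accumulate zeroth- and first-order GMM-UBM statistics over all frames,
    giving every frame the same weight (the equal-weighted summation described
    above). y: (L, D) frames; means: (C, D); covs: (C, D) diagonal; weights: (C,)."""
    L, D = y.shape
    C = means.shape[0]
    log_prob = np.empty((L, C))
    for c in range(C):
        diff = y - means[c]
        log_prob[:, c] = (np.log(weights[c])
                          - 0.5 * np.sum(np.log(2.0 * np.pi * covs[c]))
                          - 0.5 * np.sum(diff ** 2 / covs[c], axis=1))
    # Frame-wise component posteriors gamma_c(y_i), computed with a stable softmax.
    gamma = np.exp(log_prob - log_prob.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Every frame i contributes with weight 1 -- there is no notion of frame saliency.
    N = gamma.sum(axis=0)   # zeroth-order statistics, shape (C,)
    F = gamma.T @ y         # first-order statistics, shape (C, D)
    return N, F
```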

In the DNN-based approach, the average pooling layer assigns the same importance to each frame.

Disclosure of Invention

At least one embodiment of the present disclosure is directed to a neural network that uses a speaker saliency map, such that the speaker saliency of each frame is used to weight the pooling of features from the frame level to the utterance level. Unlike the equal-weighted pooling in the i-vector and DNN-based approaches, the speaker saliency map weights different frames of the raw features differently. Frames that are more beneficial for discriminating between speakers, i.e., for speaker recognition, are given more weight in the pooling process than other frames.

Drawings

Together with the detailed description, the drawings serve to help explain the principles of the speaker recognition system and method of the present invention. The drawings are for purposes of illustration, not limitation, of the present technology.

Fig. 1 is a block diagram of a configuration of a speaker recognition system according to at least one embodiment.

Fig. 2 is a flow diagram of operations performed by a speaker recognition system in accordance with at least one embodiment.

FIG. 3 is a flow diagram of operations for training a speaker recognition system in accordance with at least one embodiment.

FIG. 4 is a flow diagram of operations for extracting speaker characteristics according to at least one embodiment.

Fig. 5 is a block diagram of a configuration of a speaker recognition system according to at least one embodiment.

FIG. 6 is a flow diagram of operations performed by a speaker recognition system in accordance with at least one embodiment.

FIG. 7 is a flow diagram of operations for training a speaker recognition system in accordance with at least one embodiment.

FIG. 8 is a flow diagram of operations for extracting speaker characteristics according to at least one embodiment.

FIG. 9 is a block diagram of a computing device for implementing a speaker recognition system in accordance with at least one embodiment.

Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve understanding of the present example embodiments and alternative example embodiments.

Detailed Description

The embodiments will be described below with reference to the drawings. The following detailed description is merely exemplary in nature and is not intended to limit the disclosure or the application and uses of the disclosure. Furthermore, there is no intention to be bound by any theory presented in the preceding background of the invention or the following detailed description of the invention.

Fig. 1 is a block diagram of a configuration of a speaker recognition system 100 according to at least one embodiment. The speaker recognition system 100 includes a training portion 120 configured to receive and process raw features. The speaker recognition system further includes a speaker feature extraction section 130, the speaker feature extraction section 130 being configured to receive input data and output speaker features based on information from the training section 120.

The training section 120 includes an acoustic feature extractor 102_a configured to extract acoustic features from training data received from the training data storage 101 to determine acoustic information in each of the frames of training data. A speaker discrimination Neural Network (NN) trainer 104 is configured to receive acoustic features from the acoustic feature extractor 102_a and speaker ID information from a speaker ID storage 103. The speaker discrimination NN trainer 104 outputs speaker discrimination NN parameters for storage in the speaker discrimination NN parameter storage 105.

Any type of neural network may be used for the speaker discrimination NN trainer 104, such as a time-delay neural network (TDNN), a convolutional neural network (CNN), a long short-term memory (LSTM) network, or a gated recurrent unit (GRU) network.

The speaker posterior extractor 106 is configured to extract a target speaker posterior for each speech utterance in the training data storage 101 using the speaker discrimination NN parameters stored in the speaker discrimination NN parameter storage 105. The speaker posteriors extracted by the speaker posterior extractor 106 are stored in the speaker posterior storage 107. In at least one embodiment, each speaker posterior extracted by the speaker posterior extractor 106 is a scalar value in the range from 0 to 1.
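A minimal PyTorch sketch of how a target speaker posterior could be obtained from a trained speaker discrimination NN is given below; the model interface, tensor shapes, and function name are assumptions made for illustration, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def extract_speaker_posterior(speaker_nn, features, target_speaker_id):
    """Return the posterior of the target speaker for one utterance.

    speaker_nn: trained speaker discrimination NN mapping a (1, L, D) feature
                tensor to per-speaker logits of shape (1, num_speakers).
    features:   (L, D) acoustic features of the utterance.
    """
    speaker_nn.eval()
    with torch.no_grad():
        logits = speaker_nn(features.unsqueeze(0))   # (1, num_speakers)
        posteriors = F.softmax(logits, dim=-1)       # each entry is in [0, 1]
    return posteriors[0, target_speaker_id].item()   # scalar speaker posterior
```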

The attention NN trainer 108 is configured to receive acoustic features from the acoustic feature extractor 102_a and the corresponding speaker posteriors from the speaker posterior storage 107. The attention NN trainer 108 is configured to train an attention NN and output attention NN parameters. In at least one embodiment, the attention NN has a single output node. The attention NN parameter storage 109 is configured to store the attention NN parameters generated by the attention NN trainer 108.
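The patent does not fix the attention NN architecture or its training objective beyond the use of speaker posteriors and a single output node, so the sketch below assumes a small frame-level feed-forward network and a simple regression loss purely for illustration.

```python
import torch
import torch.nn as nn

class AttentionNN(nn.Module):
    """Frame-level attention NN with a single output node per frame."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                 # one scalar score per frame

    def forward(self, x):                         # x: (L, feat_dim)
        return self.net(x).squeeze(-1)            # (L,) frame scores

def train_attention_nn(model, utterances, posteriors, epochs=10, lr=1e-3):
    """utterances: list of (L_i, D) tensors; posteriors: list of scalars in [0, 1]."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for feats, post in zip(utterances, posteriors):
            # Pool frame scores to an utterance-level prediction and regress it
            # onto the stored speaker posterior (one of several possible choices).
            pred = torch.sigmoid(model(feats).mean())
            loss = loss_fn(pred, torch.tensor(float(post)))
            opt.zero_grad()
            loss.backward()
            opt.step()
```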

Any type of neural network is suitable for the attention NN, for example, a time-delay neural network (TDNN), a convolutional neural network (CNN), a long short-term memory (LSTM) network, or a gated recurrent unit (GRU) network. In at least one embodiment, the neural network type used for the attention NN trainer 108 is the same as the neural network type used for the speaker discrimination NN trainer 104. In at least one embodiment, the neural network type used for the attention NN trainer 108 is a different type of neural network than that used for the speaker discrimination NN trainer 104.

The attention NN parameters stored in the attention NN parameter storage 109 are the result of the training section 120 analyzing the training data from the training data storage 101. The attention NN parameters may be used by the speaker feature extraction section 130 to analyze the input data in order to determine the identity of the speaker of the input data and/or to confirm whether the speaker of the input data is an impostor.

The speaker feature extraction section 130 includes an acoustic feature extractor 102_b configured to extract acoustic features from the input data to identify acoustic features of each of the frames of the input data. The acoustic feature extractor 102_a and the acoustic feature extractor 102_b have the same function. In at least one embodiment, the same device is used for both the acoustic feature extractor 102_a and the acoustic feature extractor 102_b functions. In at least one embodiment, different devices are used to perform the functions of the acoustic feature extractor 102_a and the acoustic feature extractor 102_b.

Acoustic features from the input data are input to a speaker saliency calculator 110. The speaker saliency calculator 110 is configured to calculate the speaker saliency of each frame of the input data using the attention NN parameters stored in the attention NN parameter storage 109. The speaker saliency calculator 110 provides a weighting factor for each frame of the input data. The weighting factors are based on the amount of useful information in each frame of the input data. The weighting factor of at least one frame is different from the weighting factor of at least one other frame. In at least one embodiment, each frame of the input data has a different weighting factor. In at least one embodiment, at least one frame of the input data has the same weighting factor as at least one other frame of the input data. Examples of frames with a large amount of useful information include frames containing continuous speech of long duration, varied wording, or little or no background noise. Examples of frames with a small amount of useful information include frames containing garbled speech, short speech duration, multiple speakers speaking simultaneously, or a large amount of background noise. The speaker saliency calculator 110 assigns higher weights to frames with larger amounts of useful information. In at least one embodiment, each frame of the input data has the same duration. In at least one embodiment, at least one frame of the input data has a different duration than at least one other frame of the input data.

The speaker feature extractor 112 utilizes the saliency values from the saliency calculator 110 during the pooling process to identify speaker features. The speaker feature extractor 112 also receives speaker feature parameters from the speaker feature extractor storage 111 for the pooling process. By including a pooling process within the speaker feature extractor 112, the use of fixed NN parameters is avoided. Thus, the speaker feature extractor 112 is able to accommodate a variety of input data whose different frames have different amounts of available data. In at least one embodiment, the speaker feature is an identity of a speaker of the input data. In at least one embodiment, the speaker feature is an authentication of the speaker based on a comparison of the input data with stored speaker feature parameters.

The speaker feature extractor 112 is any type of feature extractor capable of performing at least one pooling process. In at least one embodiment, the speaker feature extractor 112 is a deep speaker feature extractor. In at least one embodiment, the speaker feature extractor 112 is an i-vector extractor.

The speaker recognition system 100 can provide results with higher accuracy than other methods that do not include the saliency calculator 110. By weighting different frames of data in different ways, frames that include more available data are given a higher importance. Thus, the speaker recognition system 100 can reduce instances of false positives, false negatives, and incorrect identifications of speakers as compared to other systems.

The speaker saliency calculator 110 determines the weights to be applied to different frames of the input data. An input speech utterance x = (x_1, ..., x_L) is input to the attention NN, which outputs a scalar score S for each frame of the input data based on the amount of useful data in the corresponding frame. The gradient of the score with respect to the acoustic features of frame i is

g_i = ∂S/∂x_i,

where x_i is the acoustic feature vector at frame i (i = 1, ..., L); L is the total number of frames in the speech utterance; x is the matrix of the L feature vectors; and W denotes the attention NN parameters trained by the attention NN trainer 108 and stored in the attention NN parameter storage 109. The saliency of frame i is computed as the p-norm of the gradient vector,

w_i = ||g_i||_p = (Σ_j |g_ij|^p)^(1/p),

where g_ij is the j-th element of the gradient g_i, and p is a parameter to be determined. In at least one embodiment, p is positive infinity, and the saliency is the largest element across all dimensions of the gradient vector. Using the attention NN parameters W and the input acoustic features x_i, the saliency of each frame of the input data is calculated.
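The following PyTorch sketch computes per-frame saliency as the p-norm of the gradient of the attention NN score with respect to each frame's features, following the formulas above. It assumes a frame-wise attention NN such as the sketch given earlier, so that the score of frame i depends only on x_i; for a context-dependent attention NN (e.g., a TDNN or LSTM), the gradient would have to be taken accordingly.

```python
import torch

def frame_saliency(attention_nn, x, p=2.0):
    """Per-frame saliency w_i = ||dS/dx_i||_p for an utterance x of shape (L, D).

    attention_nn: trained attention NN returning one scalar score per frame, shape (L,).
    """
    x = x.detach().clone().requires_grad_(True)
    scores = attention_nn(x)                        # (L,) frame scores S
    scores.sum().backward()                         # fills x.grad with g_i = dS/dx_i
    g = x.grad                                      # (L, D) gradient vectors
    if p == float("inf"):
        return g.abs().max(dim=1).values            # largest element over all dimensions
    return g.abs().pow(p).sum(dim=1).pow(1.0 / p)   # p-norm per frame, shape (L,)
```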

In other speaker feature extractors, a statistics pooling layer obtains a fixed-dimension utterance-level representation from the variable-length frame-level feature vectors by an unweighted average,

h = (1/L) Σ_i h_i,

where h_i is the frame-level bottleneck feature, i.e., the output of the layer preceding the pooling layer. In contrast, the speaker feature extractor 112 calculates a weighted average,

h = (Σ_i w_i h_i) / (Σ_i w_i),

where w_i is the saliency determined by the saliency calculator 110. As a result, the speaker feature extractor 112 can increase the importance placed on frames with more information, resulting in faster determination of speaker features with higher accuracy and higher confidence.
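A minimal sketch of the weighted pooling described above, using saliency weights from the saliency calculator; a full statistics-pooling layer would typically also append a (weighted) standard deviation, which is omitted here for brevity.

```python
import torch

def weighted_pooling(h, w, eps=1e-8):
    """Weighted average of frame-level bottleneck features.

    h: (L, D) outputs of the layer preceding the pooling layer.
    w: (L,) saliency weights w_i from the saliency calculator.
    Returns a fixed-dimension utterance-level vector of shape (D,).
    """
    w = w / (w.sum() + eps)                  # normalize the weights to sum to 1
    return (w.unsqueeze(1) * h).sum(dim=0)   # sum_i w_i * h_i
```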

During the saliency calculation, the speaker recognition system 100 assigns higher weights to frames that are more important for speaker recognition. The posterior of the target speaker (or of a set of speaker candidates) for each speech utterance is used to train the attention NN. Thus, the gradient of the attention NN score with respect to a frame represents the contribution of that frame to the target speaker posterior, i.e., the importance of the frame for speaker recognition. With weighted pooling, the extracted speaker features are expected to better identify the speaker. Thus, speaker recognition is expected to be more accurate and to provide higher confidence in the determined speaker features.

FIG. 2 is a flow diagram of operations performed by a speaker recognition system in accordance with at least one embodiment. In at least one embodiment, the operations of FIG. 2 are performed by the speaker recognition system 100 (FIG. 1). In operation A01, an NN is trained. In operation A02, speaker features are extracted based on the training of the NN from operation A01.

In at least one embodiment, NN training is performed for a single iteration. In at least one embodiment, NN training is performed for a plurality of iterations. In at least one embodiment, the updated data is used for NN training before and additionally after speaker feature extraction.

FIG. 3 is a flow diagram of operations for training a speaker recognition system in accordance with at least one embodiment. In at least one embodiment, the operations of FIG. 3 are performed by the training portion 120 of the speaker recognition system 100. In at least one embodiment, the operations of FIG. 3 are details of NN training A01 of FIG. 2. The following describes the use of the training portion 120 as a non-limiting example of the operations of FIG. 3.

In operation B01, the acoustic feature extractor 102_a reads the speech data stored in the training data storage 101. In at least one embodiment, the speech data is standard speech data, such as data from the NIST 2006 Speaker Recognition Evaluation (SRE) or the 2008 SRE. In at least one embodiment, the speech data is speech data provided in advance by the user based on the speaker feature candidates. In at least one embodiment, the speech data is periodically updated as additional speaker feature candidates are added. In at least one embodiment, the acoustic feature extractor 102_a receives speech data via wireless communication. In at least one embodiment, the acoustic feature extractor 102_a receives speech data via a wired connection. In at least one embodiment, the acoustic feature extractor 102_a receives speech data from a server remote from the training portion 120.

In operation B02, the acoustic feature extractor 102_a extracts acoustic features from the speech data.

In operation B03, the speaker discrimination NN trainer 104 reads the speaker IDs stored in the speaker ID storage 103. In at least one embodiment, the speaker IDs are periodically updated when new speaker candidates are included. In at least one embodiment, the speaker IDs are stored in the same device as the speech data. In at least one embodiment, the speaker IDs are stored in a device separate from the device storing the speech data. In at least one embodiment, the speaker discrimination NN trainer 104 receives the speaker IDs via wireless communication. In at least one embodiment, the speaker discrimination NN trainer 104 receives the speaker IDs via a wired connection. In at least one embodiment, the speaker discrimination NN trainer 104 receives the speaker IDs from a server remote from the training portion 120.

In operation B04, the speaker discrimination NN trainer 104 trains the speaker discrimination NN. The speaker discrimination NN trainer 104 trains the speaker discrimination NN by determining parameters of the nodes of the speaker discrimination NN based on the read speaker IDs and the acoustic features extracted from the speech data. In at least one embodiment, the speaker discrimination NN is a TDNN, CNN, LSTM, GRU, or another suitable NN. In at least one embodiment, operation B04 is repeated based on updates to the speaker ID storage 103 and/or updates to the training data storage 101.

In operation B05, the speaker discrimination NN parameters generated by the speaker discrimination NN trainer 104 are stored in the speaker discrimination NN parameter storage 105. In at least one embodiment, the speaker discrimination NN parameters are stored in the same device as the speaker ID and speech data. In at least one embodiment, the speaker discrimination NN parameters are stored in a device separate from the device storing at least one of the speaker ID or the speech data.

In operation B06, the speaker posterior extractor 106 extracts the speaker posteriors of the speech data. The speaker posterior extractor 106 extracts the speaker posteriors based on the acoustic features extracted by the acoustic feature extractor 102_a, using the speaker discrimination NN with the parameters stored in the speaker discrimination NN parameter storage 105. In at least one embodiment, each speaker posterior extracted by the speaker posterior extractor 106 is a scalar value in the range from 0 to 1.

In operation B07, the speaker posteriors from the speaker posterior extractor 106 are stored in the speaker posterior storage 107. In at least one embodiment, the speaker posteriors, the speaker discrimination NN parameters, the speaker IDs, and the speech data are stored in the same device. In at least one embodiment, the speaker posteriors are stored in a device separate from the device storing at least one of the speaker discrimination NN parameters, the speaker IDs, or the speech data.

In operation B08, the attention NN trainer 108 trains the attention NN. The attention NN trainer 108 trains the attention NN using the acoustic features extracted by the acoustic feature extractor 102_a and the stored speaker posteriors from the speaker posterior storage 107. In at least one embodiment, the attention NN is a TDNN, CNN, LSTM, GRU, or another suitable NN. In at least one embodiment, the attention NN is the same type of NN as the speaker discrimination NN. In at least one embodiment, the attention NN is a different type of NN than the speaker discrimination NN.

In operation B09, the attention NN parameters are stored in the attention NN parameter storage 109. In at least one embodiment, the attention NN parameters are stored in the same device as the speaker posteriors, the speaker discrimination NN parameters, the speaker IDs, and the speech data. In at least one embodiment, the attention NN parameters are stored in a device separate from the device storing at least one of the speaker posteriors, the speaker discrimination NN parameters, the speaker IDs, or the speech data.

In at least one embodiment, the order of the operations in FIG. 3 is changed. For example, in at least one embodiment, operation B03 occurs before operation B01. In at least one embodiment, at least one of the operations of FIG. 3 is performed concurrently with another operation. For example, in at least one embodiment, operation B02 is performed concurrently with operation B03. In at least one embodiment, at least one operation is performed prior to the operations in FIG. 3. For example, in at least one embodiment, the speech data is stored in training data storage 101 prior to the operations in FIG. 3. In at least one embodiment, at least one operation is performed after the operations in FIG. 3. For example, in at least one embodiment, it is determined whether to update the speech data or the speaker ID information after operation B09.

FIG. 4 is a flow diagram of operations for extracting speaker characteristics according to at least one embodiment. In at least one embodiment, the operations of FIG. 4 are performed by the speaker feature extraction section 130 of the speaker recognition system 100. In at least one embodiment, the operations of FIG. 4 are details of the speaker feature extraction A02 of FIG. 2. The following describes using the speaker feature extraction section 130 as a non-limiting example of the operations of FIG. 4.

In operation C01, the acoustic feature extractor 102_b reads input speech data from the input data. In at least one embodiment, the input data is received as a live utterance. In at least one embodiment, the input data is stored in a non-transitory recordable medium for analysis. In at least one embodiment, the input data includes more than one utterance.

In operation C02, the acoustic feature extractor 102_b extracts acoustic features from the input speech data. In at least one embodiment, operation C02 and operation B02 (FIG. 3) are performed using the same device. In at least one embodiment, the device used to perform operation C02 is different from the device used to perform operation B02.

In operation C03, the saliency calculator 110 reads the attention NN parameters from the attention NN parameter storage 109. In at least one embodiment, the saliency calculator 110 receives the attention NN parameters via wireless communication. In at least one embodiment, the saliency calculator 110 receives the attention NN parameters via a wired connection. In at least one embodiment, the saliency calculator 110 receives the attention NN parameters from a server remote from the speaker feature extraction section 130.

In operation C04, the saliency calculator 110 calculates saliency of each frame of the input speech data. As described above, according to at least one embodiment, saliency calculator 110 assigns a weight to each frame of input speech data. By computing different weights for different frames of input speech data, the operation of fig. 4 can achieve higher accuracy and higher confidence in extracting speaker features than other methods of speaker recognition.

In operation C05, the speaker feature extractor 112 reads speaker feature extractor data stored in the speaker feature extractor storage 111. In at least one embodiment, the speaker feature extractor 112 receives the speaker feature extractor data via wireless communication. In at least one embodiment, the speaker feature extractor data is stored in the same device as the attention NN parameters, the speaker posteriors, the speaker discrimination NN parameters, the speaker IDs, and the speech data. In at least one embodiment, the speaker feature extractor data is stored in a device separate from the device storing at least one of the attention NN parameters, the speaker posteriors, the speaker discrimination NN parameters, the speaker IDs, or the speech data. In at least one embodiment, the speaker feature extractor 112 receives the speaker feature extractor data via a wired connection. In at least one embodiment, the speaker feature extractor 112 receives the speaker feature extractor data from a server remote from the speaker feature extraction section 130.

In operation C06, the speaker feature extractor 112 extracts speaker features using the weights from the saliency calculator 110 and the speaker feature extractor data from the speaker feature extractor storage 111. The speaker feature extractor 112 extracts speaker features as described above, according to at least one embodiment. In at least one embodiment, the speaker feature is an identity of a speaker of the input data. In at least one embodiment, the speaker feature is an authentication of the speaker based on a comparison of a known speaker ID and a determined identity of the speaker of the input data.
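Where the extracted speaker feature is used for authentication, the comparison with a stored speaker could, for example, be a similarity score against an enrolled embedding. The cosine scoring and threshold below are illustrative assumptions; the patent does not prescribe a particular matching rule.

```python
import torch
import torch.nn.functional as F

def verify_speaker(test_embedding, enrolled_embedding, threshold=0.7):
    """Accept the claimed identity when the extracted speaker feature is close
    enough to the stored one. Both embeddings are 1-D tensors of equal length;
    the threshold value is an assumption for illustration."""
    score = F.cosine_similarity(test_embedding.unsqueeze(0),
                                enrolled_embedding.unsqueeze(0)).item()
    return score >= threshold, score
```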

In at least one embodiment, the order of the operations in FIG. 4 is changed. For example, in at least one embodiment, operation C05 occurs before operation C04. In at least one embodiment, at least one of the operations of FIG. 4 is performed concurrently with another operation. For example, in at least one embodiment, operation C03 is performed concurrently with operation C05. In at least one embodiment, at least one operation is performed prior to the operations in FIG. 4. For example, in at least one embodiment, the input data is stored in a non-transitory computer-readable medium prior to the operations in FIG. 4. In at least one embodiment, at least one operation is performed after the operations in FIG. 4. For example, in at least one embodiment, the external device is controlled based on the speaker characteristics determined by the operations in FIG. 4.

In at least one embodiment, the speaker recognition system 100 and/or the operations of FIGS. 2-4 may be used to control an external device (not shown). For example, where the speaker recognition system 100 is used to authenticate a speaker, access to a computer system or physical location is provided to an authenticated user, while an unauthorized user is denied access to the computer system or physical location. In at least one embodiment, the speaker recognition system 100 is configured to remotely control an external device via wired or wireless communication. In at least one embodiment, the speaker recognition system 100 controls an external device to issue an alert in response to an attempted access by an unauthenticated user. By weighting frames differently based on the useful information within the frames, the risk of unauthorized access to a computer system or physical location is reduced. In addition, by using the weighting scheme of the speaker recognition system 100, erroneous denial of access to authorized users is reduced or prevented.

In at least one embodiment, the speaker recognition system 100 and/or the operations of FIGS. 2-4 may be used to identify a speaker of interest for a user. For example, when a user enjoys a speech, the user can identify the speaker using the speaker recognition system 100 and thereby learn more about the speaker. In at least one embodiment, the speaker recognition system 100 can be used to identify a speaker for the purpose of investigating the speaker. By weighting frames differently based on the useful information within the frames, the accuracy of the identification is improved. In addition, the accuracy of the investigation is improved by using the weighting scheme of the speaker recognition system 100.

Fig. 5 is a block diagram of a configuration of a speaker recognition system 200 according to at least one embodiment. The speaker recognition system 200 includes a training portion 220 configured to receive and process raw features. The speaker recognition system 200 further includes a speaker feature extraction section 230, the speaker feature extraction section 230 being configured to receive input data and output speaker features based on information from the training section 220. Speaker recognition system 200 is similar to speaker recognition system 100 (fig. 1) and like elements have the same reference numbers. Details of like elements from the speaker recognition system 100 are omitted here for the sake of brevity.

The training section 220 is similar to the training section 120 of the speaker recognition system 100 (FIG. 1). In contrast to the training portion 120, the training portion 220 includes a classifier 215, which is configured to receive the speaker posteriors from the speaker posterior storage 107. The classifier 215 classifies the speaker posteriors into classes. In at least one embodiment, the classifier 215 classifies the speaker posteriors into two classes, such as class 0, which relates to frames with useful data, and class 1, which relates to frames lacking useful data. In at least one embodiment, the classifier 215 classifies the speaker posteriors into more than two classes based on the amount of useful data in the frame. The classifier 215 classifies the speaker posteriors based on a comparison with at least one predetermined threshold. The number of predetermined thresholds is based on the number of classes into which the classifier 215 classifies the speaker posteriors.

The attention NN trainer 108 trains the attention NN with the classes from the classifier 215. In at least one embodiment, the attention NN in the speaker recognition system 200 has only two output nodes, corresponding to class 0 and class 1. By comparing the speaker posteriors stored in the speaker posterior storage 107 with a predetermined threshold, the training section 220 can train the attention NN more accurately by emphasizing frames having a large amount of useful information. Therefore, the information provided to the speaker feature extraction section 230 is more accurate than in other methods.

FIG. 6 is a flow diagram of operations performed by a speaker recognition system in accordance with at least one embodiment. In at least one embodiment, the operations of fig. 6 are performed by speaker recognition system 200 (fig. 5). In operation D01, an NN is trained. In operation D02, speaker features are extracted based on training of the NN from operation D01.

In at least one embodiment, NN training is performed for a single iteration. In at least one embodiment, NN training is performed for a plurality of iterations. In at least one embodiment, the updated data is used for NN training before and additionally after speaker feature extraction.

FIG. 7 is a flow diagram of operations for training a speaker recognition system in accordance with at least one embodiment. The operations of FIG. 7 are similar to those of FIG. 3. In contrast to the operations in FIG. 3, FIG. 7 includes an operation E07 for classifying data into classes and an operation E08 for storing the class labels. Operations E01-E06 are similar to operations B01-B06 of FIG. 3, and thus, for the sake of brevity, a description of these operations is omitted. The following describes the use of the training portion 220 as a non-limiting example of the operations of FIG. 7.

In operation E07, the classifier 215 classifies the posteriors into classes. In at least one embodiment, the classifier 215 classifies the posteriors into two classes, such as class 0 for posteriors equal to or above a threshold and class 1 for posteriors below the threshold. In at least one embodiment, the classifier 215 classifies the posteriors into more than two classes. The classification is used to distinguish between frames with a large amount of useful information and frames with little or no useful information.
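A minimal sketch of operation E07 as described above, mapping each posterior to class 0 (at or above the threshold) or class 1 (below it); the threshold value itself is an assumption.

```python
def classify_posteriors(posteriors, threshold=0.5):
    """Map each speaker posterior (a scalar in [0, 1]) to a class label:
    class 0 for posteriors equal to or above the threshold, class 1 otherwise."""
    return [0 if p >= threshold else 1 for p in posteriors]
```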

In operation E08, the classifier 215 stores the class labels. In some embodiments, the classifier 215 stores the class labels as part of the information in the speaker posterior storage 107. In at least one embodiment, the class labels are stored in the same device as the speaker posteriors, the speaker discrimination NN parameters, the speaker IDs, and the speech data. In at least one embodiment, the class labels are stored in a device separate from the device storing at least one of the speaker posteriors, the speaker discrimination NN parameters, the speaker IDs, or the speech data.

In operation E09, the attention NN trainer 108 trains the attention NN. The attention NN trainer 108 trains the attention NN using the class labels from the classifier 215, the acoustic features extracted by the acoustic feature extractor 102_a, and the stored speaker posteriors from the speaker posterior storage 107. In at least one embodiment, the attention NN is a TDNN, CNN, LSTM, GRU, or another suitable NN. In at least one embodiment, the attention NN is the same type of NN as the speaker discrimination NN. In at least one embodiment, the attention NN is a different type of NN than the speaker discrimination NN. By using the class labels to train the attention NN, frames with more useful information are given greater importance. Thus, the trained attention NN may be used more effectively by a saliency calculator (e.g., the saliency calculator 110) to increase the accuracy and confidence of the speaker recognition system.
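A sketch of how an attention NN with two output nodes might be trained from the class labels of operation E09; the pooling of frame-level logits into an utterance-level prediction and the cross-entropy objective are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def train_attention_nn_with_classes(model, utterances, class_labels,
                                    epochs=10, lr=1e-3):
    """model: frame-level NN returning (L, 2) logits, one column per class.
    utterances: list of (L_i, D) feature tensors; class_labels: list of ints in {0, 1}."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, label in zip(utterances, class_labels):
            # Average frame-level logits into a single utterance-level prediction.
            logits = model(feats).mean(dim=0, keepdim=True)   # (1, 2)
            loss = loss_fn(logits, torch.tensor([label]))
            opt.zero_grad()
            loss.backward()
            opt.step()
```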

In operation E10, the attention NN trainer 108 stores the attention NN parameters in the attention NN parameter storage 109. In at least one embodiment, the attention NN parameters are stored in the same device as the class labels, the speaker posteriors, the speaker discrimination NN parameters, the speaker IDs, and the speech data. In at least one embodiment, the attention NN parameters are stored in a device separate from the device storing at least one of the class labels, the speaker posteriors, the speaker discrimination NN parameters, the speaker IDs, or the speech data.

In at least one embodiment, the order of the operations in FIG. 7 is changed. For example, in at least one embodiment, operation E03 occurs before operation E01. In at least one embodiment, at least one of the operations of FIG. 7 is performed concurrently with another operation. For example, in at least one embodiment, operation E02 is performed concurrently with operation E03. In at least one embodiment, at least one operation is performed prior to the operations in FIG. 7. For example, in at least one embodiment, the speech data is stored in training data storage 101 prior to the operations in FIG. 7. In at least one embodiment, at least one operation is performed after the operations in FIG. 7. For example, in at least one embodiment, it is determined whether to update the speech data or the speaker ID information after operation E10.

FIG. 8 is a flow diagram of operations for extracting speaker characteristics according to at least one embodiment. The operation of fig. 8 is similar to that of fig. 4. Operations F01 through F06 are similar to operations C01 through C06 of fig. 4, and thus, for the sake of brevity, a description of these operations is omitted.

In at least one embodiment, the order of the operations in FIG. 8 is changed. For example, in at least one embodiment, operation F05 occurs before operation F04. In at least one embodiment, at least one of the operations of FIG. 8 is performed concurrently with another operation. For example, in at least one embodiment, operation F03 is performed concurrently with operation F05. In at least one embodiment, at least one operation is performed prior to the operations in FIG. 8. For example, in at least one embodiment, the input data is stored in a non-transitory computer-readable medium prior to the operations in FIG. 8. In at least one embodiment, at least one operation is performed after the operations in FIG. 8. For example, in at least one embodiment, the external device is controlled based on the speaker characteristics determined by the operations in fig. 8.

In at least one embodiment, the speaker recognition system 200 and/or the operations of FIGS. 6-8 may be used to control an external device (not shown). For example, where the speaker recognition system 200 is used to authenticate a speaker, access to a computer system or physical location is provided to an authenticated user, while an unauthorized user is denied access to the computer system or physical location. In at least one embodiment, the speaker recognition system 200 is configured to remotely control an external device via wired or wireless communication. In at least one embodiment, the speaker recognition system 200 controls an external device to issue an alert in response to an attempted access by an unauthenticated user. By weighting frames differently based on the useful information within the frames, the risk of unauthorized access to a computer system or physical location is reduced. In addition, by using the weighting scheme of the speaker recognition system 200, erroneous denial of access to authorized users is reduced or prevented.

In at least one embodiment, the speaker recognition system 200 and/or the operations of FIGS. 6-8 may be used to identify a speaker of interest for a user. For example, when a user enjoys a speech, the user can identify the speaker using the speaker recognition system 200 and thereby learn more about the speaker. In at least one embodiment, the speaker recognition system 200 can be used to identify a speaker for the purpose of investigating the speaker. By weighting frames differently based on the useful information within the frames, the accuracy of the identification is improved. In addition, the accuracy of the investigation is improved by using the weighting scheme of the speaker recognition system 200.

FIG. 9 is a block diagram of a computing device for implementing a speaker recognition system in accordance with at least one embodiment. System 900 includes a hardware processor 902 and a non-transitory computer-readable storage medium 904, the non-transitory computer-readable storage medium 904 being encoded with (i.e., storing) parameters 906, i.e., a set of executable instructions for performing the tasks of the speaker recognition system. The computer-readable storage medium 904 is also encoded with instructions 907 regarding interfacing with an external device or other system utilized to implement the speaker recognition system. The processor 902 is electrically coupled to the computer-readable storage medium 904 via a bus 908. The processor 902 is also electrically coupled to an I/O interface 910 through a bus 908. A network interface 912 is also electrically connected to the processor 902 via the bus 908. The network interface 912 connects to the network 914 to enable the processor 902 and the computer-readable storage medium 904 to connect to external elements via the network 914. The processor 902 is configured to execute instructions and use the parameters 906 in the computer-readable storage medium 904 to render the system 900 available for some or all of the operations of the speaker recognition system.

In at least one embodiment, processor 902 is a Central Processing Unit (CPU), a multiprocessor, a distributed processing system, an Application Specific Integrated Circuit (ASIC), and/or a suitable processing unit.

In at least one embodiment, the computer-readable storage medium 904 is an electronic, magnetic, optical, electromagnetic, infrared, and/or semiconductor system (or apparatus or device). The computer-readable storage medium 904 includes, for example, a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a Random Access Memory (RAM), a read-only memory (ROM), a rigid magnetic disk and/or an optical disk. In at least one embodiment using optical disks, the computer-readable storage medium 904 includes a compact disk read only memory (CD-ROM), a compact disk read/write (CD-R/W), and/or a Digital Video Disk (DVD).

In at least one embodiment, storage medium 904 stores parameters 906, which parameters 906 are configured to cause system 900 to operate as a speaker recognition system. In at least one embodiment, the storage medium 904 also stores information needed to perform as a speaker recognition system and information generated during operation, such as training data 916, speaker ID 918, speaker discrimination NN parameters 920, speaker posteriors 922, attention NN parameters 924, input data 926, speaker feature information 928, category information 930, and/or a set of executable instructions to perform the operation of the speaker recognition system.

In at least one embodiment, storage medium 904 stores instructions 907 regarding interfacing with an external device or other system for implementing a speaker recognition system. Instructions 907 enable processor 902 to generate instructions readable by an external device or other system to effectively implement the operation of the speaker recognition system.

The system 900 includes an I/O interface 910. The I/O interface 910 is coupled to external circuitry. In at least one embodiment, the I/O interface 910 includes a keyboard, keypad, mouse, trackball, trackpad, and/or cursor direction keys for communicating information and commands to the processor 902.

The system 900 also includes a network interface 912 coupled to the processor 902. The network interface 912 allows the system 900 to communicate with a network 914, to which one or more other computer systems are connected. The network interface 912 includes wireless network interfaces such as BLUETOOTH, WIFI, WIMAX, GPRS, or WCDMA; or wired network interfaces such as ETHERNET, USB, or IEEE-1394. In at least one embodiment, the speaker recognition system is implemented in two or more systems 900, and information is exchanged between the different systems 900 via the network 914.

One aspect of the present description relates to a speaker recognition system. The speaker recognition system includes a non-transitory computer-readable medium configured to store instructions. The speaker recognition system also includes a processor coupled to the non-transitory computer-readable medium. The processor is configured to execute instructions related to extracting acoustic features from each of a plurality of frames in input speech data. The processor is configured to execute instructions related to computing a saliency value for each of a plurality of frames using a first Neural Network (NN) based on the extracted acoustic features, wherein the first NN is a trained NN using a speaker posterior. The processor is configured to execute instructions related to using the saliency value for each of the plurality of frames to extract speaker features.

One aspect of the present description relates to a speaker recognition method. A speaker recognition method includes receiving input speech data. The speaker recognition method includes extracting acoustic features from each of a plurality of frames in input speech data. The speaker recognition method includes calculating a saliency value of each of a plurality of frames using a first Neural Network (NN) based on extracted acoustic features, wherein the first NN is a trained NN using speaker posteriors. The speaker recognition method includes extracting a speaker feature using the saliency value of each of the plurality of frames.

The foregoing has outlined features of several embodiments so that those skilled in the art may better understand the aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions and alterations herein without departing from the spirit and scope of the present disclosure.

The above-described exemplary embodiments may also be described in whole or in part by the following notations and are not limited to the following notations.

(Supplementary note 1)

A speaker recognition system, comprising:

a non-transitory computer-readable medium configured to store instructions; and

a processor connected to a non-transitory computer readable medium, wherein the processor is configured to execute instructions for:

extracting an acoustic feature from each of a plurality of frames in input speech data;

computing a saliency value for each of a plurality of frames using a first Neural Network (NN) based on the extracted acoustic features, wherein the first NN is a trained NN using speaker posteriors; and

extracting speaker features using the saliency values for each of the plurality of frames.

(Supplementary note 2)

The speaker recognition system according to supplementary note 1, wherein the processor is configured to execute instructions for:

extracting the speaker features using a weighted pooling process that is implemented using the saliency values for each of the plurality of frames.

(Supplementary note 3)

The speaker recognition system according to supplementary note 1, wherein the processor is configured to execute instructions for:

training the first NN using the speaker posteriors.

(Supplementary note 4)

The speaker recognition system of supplementary note 3, wherein the processor is configured to execute instructions for:

a speaker posterior is generated using the training data and the speaker identification information.

(Supplementary note 5)

The speaker recognition system according to supplementary note 1, wherein the processor is configured to execute instructions for:

calculating a saliency value for each of the plurality of frames based on a gradient of the speaker posterior with respect to the extracted acoustic features for each of the plurality of frames.

(Supplementary note 6)

The speaker recognition system according to supplementary note 1, wherein the processor is configured to execute instructions for:

calculating a saliency value for each of the plurality of frames using a first node of the first NN and a second node of the first NN, wherein a first frame of the plurality of frames output at the first node indicates that the first frame has more useful information than a second frame of the plurality of frames output at the second node.

(Supplementary note 7)

The speaker recognition system of supplementary note 6, wherein the processor is configured to execute instructions for:

calculating a saliency value for each of the plurality of frames based on a gradient of the speaker posterior with respect to the extracted acoustic features for each of the plurality of frames output at the first node of the first NN.

(Supplementary note 8)

The speaker recognition system according to supplementary note 1, wherein the processor is configured to execute instructions for:

outputting an identity of a speaker of the input speech data based on the extracted speaker features.

(Supplementary note 9)

The speaker recognition system according to supplementary note 1, wherein the processor is configured to execute instructions for:

matching a speaker of the input speech data with a stored speaker identification based on the extracted speaker features.

(Supplementary note 10)

The speaker recognition system according to supplementary note 1, wherein the processor is configured to execute instructions for:

granting access to a computer system in response to the extracted speaker features matching an authorized user.

(Supplementary note 11)

A speaker recognition method, comprising:

receiving input speech data;

extracting an acoustic feature from each of a plurality of frames in input speech data;

computing a saliency value for each of a plurality of frames using a first Neural Network (NN) based on the extracted acoustic features, wherein the first NN is a trained NN using speaker posteriors; and

extracting speaker features using the saliency values for each of the plurality of frames.

(Supplementary note 12)

The speaker recognition method according to supplementary note 11, wherein extracting speaker features comprises using a weighted pooling process that is implemented using a saliency value for each of a plurality of frames.

(Supplementary note 13)

The speaker recognition method according to supplementary note 11, further comprising training the first NN using the speaker posteriors.

(Supplementary note 14)

The speaker recognition method according to supplementary note 13, further comprising generating the speaker posteriors using the training data and the speaker identification information.

(Supplementary note 15)

The speaker recognition method according to supplementary note 11, wherein calculating the saliency value of each of the plurality of frames is based on:

a gradient of the speaker posterior with respect to the extracted acoustic features for each of the plurality of frames.

(Supplementary note 16)

The speaker recognition method according to supplementary note 11, wherein calculating the saliency value of each of the plurality of frames includes:

receiving information from a first node of a first NN and from a second node of the first NN,

wherein a first frame of the plurality of frames output at the first node indicates that the first frame has more useful information than a second frame of the plurality of frames output at the second node.

(Supplementary note 17)

The speaker recognition method according to supplementary note 16, wherein calculating the saliency value of each of the plurality of frames is based on:

the speaker of each frame of a plurality of frames output at a first node of a first NN is posteriorly graded on the basis of extracted acoustic features.

(Supplementary note 18)

The speaker recognition method according to supplementary note 11, further comprising outputting an identity of a speaker of the input speech data based on the extracted speaker feature.

(Supplementary note 19)

The speaker recognition method according to supplementary note 11, further comprising matching a speaker of the input speech data with the stored speaker identification based on the extracted speaker characteristics.

(Supplementary note 20)

The speaker recognition method according to supplementary note 11, further comprising granting access to a computer system in response to the extracted speaker features matching an authorized user.

This application is based on and claims priority from U.S. Patent Application No. 16/270,597, filed on February 8, 2019, the disclosure of which is incorporated herein in its entirety.
