Voice speaker separation method and device

Document No.: 1339715 · Publication date: 2020-07-17

Description: This technology, "Voice speaker separation method and device", was designed and created by 汪法兵, 李健, and 武卫东 on 2020-02-28. Its main content: The invention provides a voice speaker separation method and device, relating to the technical field of speech recognition. In the embodiment of the invention, the voice segments are filtered by a preset noise filtering parameter before they are clustered. Because transient noise differs markedly from speaker speech, most transient noise can be filtered out by a suitable preset noise filtering parameter, ensuring that the first voice set consists mostly of voice segments of different speakers. This improves the accuracy of subsequent first voice feature extraction and voice segment clustering, correctly separates the voices of different speakers, and improves the robustness of the voice speaker separation technique.

1. A voice speaker separation method, comprising:

acquiring audio data to be processed;

segmenting the audio data according to silent periods to obtain at least one voice segment;

classifying the voice segments that meet a preset noise filtering parameter into a first voice set;

extracting first voice features of the voice segments in the first voice set;

clustering the voice segments in the first voice set according to the first voice features to obtain a clustering result;

and separating the voice segments of different speakers in the first voice set according to the clustering result.

2. The method of claim 1, wherein after obtaining the clustering result, the method further comprises:

classifying the voice segments that do not meet the preset noise filtering parameter into a second voice set;

extracting second voice features of the voice segments in the second voice set;

and separating the voice segments of different speakers in the second voice set according to the second voice features and the clustering result.

3. The method of claim 2, wherein separating the voice segments of different speakers in the second voice set according to the second voice features and the clustering result comprises:

calculating, for each class in the clustering result, a corresponding class vector;

and separating the voice segments of different speakers in the second voice set according to the second voice features and the class vectors.

4. The method of claim 3, wherein separating the voice segments of different speakers in the second voice set according to the second voice features and the class vectors comprises:

calculating the matching degree between each second voice feature and each class vector;

determining a first correspondence between the second voice features and the class vectors according to the matching degrees;

determining a second correspondence between the voice segments in the second voice set and the clustering result according to the first correspondence;

and separating the voice segments of different speakers in the second voice set according to the second correspondence.

5. The method of claim 4, wherein determining the first correspondence between the second voice features and the class vectors according to the matching degrees comprises:

for each second voice feature, determining that the class vector with the highest matching degree has the first correspondence with that second voice feature.

6. An apparatus for voice speaker separation, the apparatus comprising:

a data acquisition module, configured to acquire audio data to be processed;

a data segmentation module, configured to segment the audio data according to silent periods to obtain at least one voice segment;

a parameter filtering module, configured to classify the voice segments that meet a preset noise filtering parameter into a first voice set;

a feature extraction module, configured to extract first voice features of the voice segments in the first voice set;

a data clustering module, configured to cluster the voice segments in the first voice set according to the first voice features to obtain a clustering result;

and a voice separation module, configured to separate the voice segments of different speakers in the first voice set according to the clustering result.

7. The apparatus of claim 6, wherein:

the parameter filtering module is further configured to classify the voice segments that do not meet the preset noise filtering parameter into a second voice set;

the feature extraction module is further configured to extract second voice features of the voice segments in the second voice set;

and the voice separation module is further configured to separate the voice segments of different speakers in the second voice set according to the second voice features and the clustering result.

8. The apparatus of claim 7, wherein the data clustering module comprises:

a vector calculation submodule, configured to calculate, for each class in the clustering result, a corresponding class vector;

and a class designation submodule, configured to separate the voice segments of different speakers in the second voice set according to the second voice features and the class vectors.

9. The apparatus of claim 8, wherein the class designation submodule comprises:

a matching degree calculation unit, configured to calculate the matching degree between each second voice feature and each class vector;

a correspondence determining unit, configured to determine a first correspondence between the second voice features and the class vectors according to the matching degrees;

the correspondence determining unit being further configured to determine a second correspondence between the voice segments in the second voice set and the clustering result according to the first correspondence;

and a speaker designation unit, configured to separate the voice segments of different speakers in the second voice set according to the second correspondence.

10. The apparatus of claim 9, wherein the correspondence determining unit is specifically configured to determine, for each second voice feature, that the class vector with the highest matching degree has the first correspondence with that second voice feature.

Technical Field

The invention relates to the technical field of voice recognition, in particular to a voice speaker separation method and device.

Background

In calls, speech recognition, voiceprint recognition, and other scenarios, it is usually necessary to distinguish the identities of the speakers behind different voice inputs, or to accept only the voice of a specific speaker from among the input voices of multiple speakers. In practical applications, therefore, when multiple voices are input, the voices of different speakers must be separated by a speaker separation technique.

When the signal-to-noise ratio of the input audio is high, the voices of different speakers can be separated by segmenting the speech into fixed-length pieces, extracting features segment by segment, and clustering those features. However, when the signal-to-noise ratio is low and there is considerable background interference noise, such as keyboard tapping, doors opening and closing, and wind, the result of voice feature extraction is degraded, which seriously interferes with the accuracy of voice clustering and reduces the robustness of speaker separation.

Disclosure of Invention

In view of the above, the present invention has been made to provide a voice speaker separation method and apparatus that overcomes or at least partially solves the above-mentioned problems.

According to a first aspect of the present invention, there is provided a voice speaker separation method, the method comprising:

acquiring audio data to be processed;

segmenting the audio data according to silent periods to obtain at least one voice segment;

classifying the voice segments that meet a preset noise filtering parameter into a first voice set;

extracting first voice features of the voice segments in the first voice set;

clustering the voice segments in the first voice set according to the first voice features to obtain a clustering result;

and separating the voice segments of different speakers in the first voice set according to the clustering result.

According to a second aspect of the present invention, there is provided a voice speaker separation apparatus, the apparatus comprising:

a data acquisition module, configured to acquire audio data to be processed;

a data segmentation module, configured to segment the audio data according to silent periods to obtain at least one voice segment;

a parameter filtering module, configured to classify the voice segments that meet a preset noise filtering parameter into a first voice set;

a feature extraction module, configured to extract first voice features of the voice segments in the first voice set;

a data clustering module, configured to cluster the voice segments in the first voice set according to the first voice features to obtain a clustering result;

and a voice separation module, configured to separate the voice segments of different speakers in the first voice set according to the clustering result.

In the embodiment of the invention, the voice segments are filtered by a preset noise filtering parameter before they are clustered. Because transient noise differs markedly from speaker speech, most transient noise can be filtered out by a suitable preset noise filtering parameter, ensuring that the first voice set consists mostly of voice segments of different speakers. This improves the accuracy of subsequent first voice feature extraction and voice segment clustering, correctly separates the voices of different speakers, and improves the robustness of the voice speaker separation technique.

The foregoing is only an overview of the technical solutions of the present invention. To make the technical means of the present invention more clearly understood, and to make the above and other objects, features, and advantages of the present invention more readily apparent, embodiments of the present invention are described below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

fig. 1 is a flowchart illustrating steps of a method for separating voice speakers according to an embodiment of the present invention;

fig. 2 is a flow chart illustrating steps of another method for separating voice speakers according to an embodiment of the present invention;

fig. 3 is a block diagram of a voice speaker separation apparatus according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Fig. 1 is a flowchart illustrating the steps of a voice speaker separation method according to an embodiment of the present invention. As shown in fig. 1, the method may include:

Step 101, acquiring audio data to be processed.

The embodiment of the invention applies to scenarios in which the input voices of different speakers need to be separated. The speakers may be the participants in a multi-person teleconference or video conference, or the target of single-person voiceprint recognition in a multi-person voice environment. The acquired audio data to be processed therefore includes audio data of at least two speakers. In addition, owing to the hardware of the audio acquisition device, the conditions of the acquisition environment, and other influences, the audio data to be processed may also contain noise such as environmental noise and acquisition device noise, which can affect the speaker separation processing performed on it.

Step 102, segmenting the audio data according to silent periods to obtain at least one voice segment.

In the embodiment of the invention, the audio data may first be segmented into at least one voice segment so that, as far as possible, each voice segment contains either the voice of a single speaker or only noise, which improves the accuracy of subsequent clustering. Since speakers typically pause when handing over to one another, the audio data contains recognizable silent periods, and the audio data can be segmented at those silent periods so that each resulting voice segment contains only one speaker. In a specific application, long silent periods can be identified and removed from the audio data by Voice Activity Detection (VAD), a technique also known as speech endpoint detection or speech boundary detection, to obtain the individual voice segments. To improve the accuracy of the segmentation, a minimum analysis window can be used, that is, the smallest practical time slice of the audio data is examined when identifying and removing silent periods, yielding more precise voice segments.
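For illustration only, here is a minimal Python sketch of this segmentation step. The patent does not prescribe a particular VAD implementation, so a simple short-time-energy detector stands in for it, and every parameter value (`frame_ms`, `energy_thresh`, `min_silence_ms`) is an assumed one.

```python
import numpy as np

def segment_by_silence(samples, sr, frame_ms=30, energy_thresh=1e-4,
                       min_silence_ms=300):
    """Split a mono waveform into voice segments at silent periods.

    An energy-based stand-in for the VAD step; a production system
    would use a trained VAD. All thresholds are illustrative.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    # Short-time energy per frame, compared against a silence threshold.
    energies = np.array([np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
                         for i in range(n_frames)])
    voiced = energies > energy_thresh

    # Close a segment only after min_silence_ms of silence, so brief
    # intra-speech pauses do not split one utterance into many.
    min_gap = max(1, min_silence_ms // frame_ms)
    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                segments.append((start * frame_len, (i - gap + 1) * frame_len))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return [samples[a:b] for a, b in segments]
```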

Step 103, classifying the voice segments that meet the preset noise filtering parameter into a first voice set.

In the embodiment of the present invention, a noise filtering parameter may be set in advance, the preset noise filtering parameter being used to filter transient noise out of the voice segments. Transient noise is a common kind of noise in audio data; it is short in duration and wide-band in spectrum, and can severely affect the processing of voice data. Optionally, based on the short duration of transient noise, the durations of different transient noises can be statistically analyzed so as to set a suitable time value as the preset noise filtering parameter; a voice segment whose duration is greater than or equal to that time value is determined to meet the preset noise filtering parameter and is classified into the first voice set, so that the first voice set contains only voice segments from which transient noise has been filtered out. The preset noise filtering parameter only needs to be able to distinguish voice segments of speaker speech from voice segments of transient noise; the wide-band characteristic of transient noise may likewise be exploited by statistically analyzing its wide-band spectrum to set the preset noise filtering parameter.
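A sketch of this duration-based filtering step follows; the function name and the 0.5 s threshold are assumptions standing in for the patent's preset noise filtering parameter, not values from the source.

```python
def split_by_noise_filter(segments, sr, min_dur_s=0.5):
    """Partition segments into a first set (candidate speech) and a
    second set (candidate transient noise) by duration alone.

    The 0.5 s value is an assumed stand-in for the preset noise
    filtering parameter thr; a spectrum-based test could be added.
    """
    first_set, second_set = [], []
    for seg in segments:
        # Transient noise is short-lived, so short segments go to the
        # second set; longer ones are treated as candidate speech.
        (first_set if len(seg) / sr >= min_dur_s else second_set).append(seg)
    return first_set, second_set
```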

Step 104, extracting first voice features of the voice segments in the first voice set.

In the embodiment of the present invention, a first voice feature may be extracted from each voice segment in the first voice set, the first voice feature being information extracted from the segment that characterizes its speech. In practice, the first voice feature can be extracted with existing speaker separation techniques such as the mel-frequency cepstrum. The mel-frequency cepstrum is a linear transform of the log energy spectrum on the nonlinear mel scale of sound frequency; its frequency bands are equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced bands of the normal log cepstrum. The mel-frequency cepstral coefficients obtained for a voice segment in the first voice set form the feature vector corresponding to that segment, that is, the first voice feature.
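As a sketch of this MFCC-based extraction, the snippet below turns each voice segment into a fixed-length feature vector with librosa; the coefficient count and the mean pooling over frames are illustrative choices that the patent does not specify.

```python
import numpy as np
import librosa

def extract_mfcc_feature(segment, sr, n_mfcc=20):
    """Represent one voice segment by the mean of its MFCC frames.

    n_mfcc=20 and mean pooling are assumed choices; embedding
    extractors such as i-vectors or x-vectors could be used instead.
    """
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return np.mean(mfcc, axis=1)  # one fixed-length vector per segment
```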

Optionally, the first voice feature may also include the short-time average energy, short-time average amplitude, short-time average zero-crossing rate, formants, glottal wave, speech rate, and the like of the voice segment, as long as it can characterize different voice segments; the embodiment of the present invention places no particular limitation on the extraction method or the specific type of the first voice feature.

In the embodiment of the present invention, optionally, the extracted feature vector may be further processed by a machine learning algorithm to obtain a corresponding abstract feature vector as the first voice feature, for example an identity vector (i-vector) further extracted from the mean supervector of a GMM (Gaussian Mixture Model), an x-vector extracted by a TDNN (Time-Delay Neural Network), or a d-vector extracted by a DNN (Deep Neural Network).

Step 105, clustering the voice segments in the first voice set according to the first voice features to obtain a clustering result.

In the embodiment of the present invention, the voice segments in the first voice set are clustered according to the first voice features. Optionally, distances between the feature vectors of different voice segments may be calculated, for example distances obtained by scoring the feature vectors with PLDA (Probabilistic Linear Discriminant Analysis), or cosine distances between the feature vectors; the voice segments whose first voice features are within a preset clustering distance of one another are grouped into one class, yielding at least two classes of different voice segments as the clustering result.
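One possible realization of this step, sketched with SciPy's hierarchical agglomerative clustering over cosine distances; the 0.4 distance threshold is an assumed value, and a PLDA score could be substituted for the metric.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_first_set(features, max_cosine_dist=0.4):
    """Cluster first-set feature vectors so that segments within the
    preset clustering distance of one another share a class.

    `features` is a (num_segments, dim) array; the threshold is an
    illustrative assumption.
    """
    X = np.vstack(features)
    Z = linkage(X, method='average', metric='cosine')
    labels = fcluster(Z, t=max_cosine_dist, criterion='distance')
    return labels  # one class label per segment, numbered 1..K
```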

Step 106, separating the voice segments of different speakers in the first voice set according to the clustering result.

In the embodiment of the invention, after the clustering result is determined, the different classes and the voice segments under each class are obtained. Because the clustering is performed on the first voice features of the different voice segments, each resulting class contains only voice segments whose first voice features are identical or highly similar; with transient noise filtered out and its interference eliminated, the clustering result can be considered highly accurate, and each class contains only the voice segments of one speaker. The voice segments under different classes are then the separated voice segments of different speakers.

In addition, in scenarios such as teleconferences and video conferences, a speaking record can optionally be kept whenever a speaker talks. When the number of speakers in the record does not match the number of classes in the clustering result, the clustering result can be considered inaccurate, and feature extraction and clustering are performed on the voice segments again. In a voiceprint recognition scenario, when the first voice features of the voice segments in the different classes do not match the historical voice features of the recognition target, it can be concluded either that the speaker's voice data was not collected or that it was assigned to the wrong class; the speaker can then be prompted to input voice data again, or feature extraction and clustering can be repeated on the voice segments.

In summary, in the embodiment of the present invention, the voice segments are filtered by a preset noise filtering parameter before they are clustered. Because transient noise differs markedly from speaker speech, most transient noise can be filtered out by a suitable preset noise filtering parameter, ensuring that the first voice set consists mostly of voice segments of different speakers. This improves the accuracy of subsequent first voice feature extraction and voice segment clustering, correctly separates the voices of different speakers, and improves the robustness of the voice speaker separation technique.

Fig. 2 is a flowchart illustrating steps of another method for separating voice speakers according to an embodiment of the present invention, where as shown in fig. 2, the method may include:

step 201, acquiring audio data to be processed;

step 202, segmenting the audio data according to silent periods to obtain at least one voice segment;

step 203, classifying the voice segments that meet the preset noise filtering parameter into a first voice set;

step 204, extracting first voice features of the voice segments in the first voice set;

step 205, clustering the voice segments in the first voice set according to the first voice features to obtain a clustering result;

and step 206, separating the voice segments of different speakers in the first voice set according to the clustering result.

In the embodiment of the present invention, for the details of step 201 to step 206, refer to the descriptions of step 101 to step 106; they are not repeated here.

Optionally, after step 205, the method may further include:

and step 2051, classifying the voice segments which do not accord with the preset noise filtering parameters into a second voice set.

In the embodiment of the present invention, in addition to classifying the voice segments that meet the preset noise filtering parameter into the first voice set, the voice segments that do not meet it, for example those whose duration is less than the time value, may be classified into a second voice set, so that the second voice set contains the voice segments of most of the transient noise. Misjudgments are possible, however: besides transient noise, the second voice set may contain a speaker's voice segment that is simply short, or a segment that, owing to a segmentation error, contains both transient noise and speaker speech.

Step 2052, extracting second voice features of the voice segments in the second voice set.

In the embodiment of the present invention, a second voice feature may be extracted from each voice segment in the second voice set; the extraction process is similar to the extraction of the first voice features in step 104 and is not repeated here. Optionally, after the audio data is segmented into voice segments, the voice features of all segments can be extracted first, and the segments then divided into the first and second voice sets by the preset noise filtering parameter, so that the voice features are simultaneously divided into a first feature set corresponding to the first voice set and a second feature set corresponding to the second voice set; this simplifies the processing of the audio data and improves the efficiency of voice speaker separation.

Step 2053, separating the voice segments of different speakers in the second voice set according to the second voice features and the clustering result.

In the embodiment of the present invention, the voice segments of different speakers in the second voice set may be separated according to the clustering result of the first voice set and the second voice features of the voice segments in the second voice set. Optionally, the different classes in the clustering result can be matched against the second voice features to determine the class to which each voice segment in the second voice set is likely to belong, thereby separating the voice segments of different speakers in the second voice set.

Optionally, step 2053 includes:

and step S11, respectively calculating a class vector corresponding to each class according to each class in the clustering result.

In the embodiment of the present invention, when determining whether a class in the clustering result matches a second voice feature of the second voice set, a class vector corresponding to each class may first be calculated. The class vector is the feature vector representing a class; it can be computed from the first voice features of all voice segments in that class, and may optionally be the mean, a weighted average, or the like of those feature vectors.
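A minimal sketch of the class vector computation, taking the mean of each class's first voice features; the weighted-average variant mentioned above would be a one-line change.

```python
import numpy as np

def class_vectors(features, labels):
    """Compute one class vector per class as the mean of the first
    voice features of that class's segments.

    `features` is the (num_segments, dim) array of the first voice
    set and `labels` the class assignment per segment.
    """
    return {k: features[labels == k].mean(axis=0) for k in np.unique(labels)}
```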

Step S12, separating the voice segments of different speakers in the second voice set according to the second voice features and the class vectors.

In the embodiment of the present invention, each second voice feature can be matched against the class vectors to determine which class in the clustering result of the first voice set it matches, so as to separate the voice segments of different speakers in the second voice set.

Optionally, step S12 includes:

and step S21, respectively calculating the matching degree of the second voice feature and the class vector.

In the embodiment of the present invention, the matching degree between every second voice feature and every class vector may be calculated. Optionally, the matching degree may be defined by the distance between the second voice feature vector and the class vector: the smaller the distance, the higher the matching degree, and the larger the distance, the lower the matching degree. The distance may be a PLDA score, a cosine distance, or the like.

Step S22, determining a first correspondence between the second voice features and the class vectors according to the matching degrees.

In the embodiment of the present invention, the first correspondence between the second voice features and the class vectors may be determined from their matching degrees. Optionally, for each second voice feature, the class vectors can be sorted by matching degree and the first correspondence determined from the sorted result; alternatively, for each class vector, the second voice feature with the highest matching degree can be determined to have the first correspondence with it; and so on.

Optionally, step S22 specifically includes: for each second voice feature, determining that the class vector with the highest matching degree has the first correspondence with that second voice feature.

In the embodiment of the present invention, the first correspondence between the second voice features and the class vectors is determined by the matching degree. Optionally, each second voice feature is put in first correspondence with the class vector of highest matching degree, so that every second voice feature is associated with a class vector and all voice segments are taken into account. Alternatively, a matching degree threshold may be set: when even the highest matching degree between a second voice feature and all class vectors fails to reach the threshold, the feature is considered to correspond to transient noise and is not associated with any class vector; when the highest matching degree reaches the threshold, the feature is associated with the class vector. This prevents second voice features that correspond to transient noise from being associated with class vectors and causing voice segment separation errors.
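The sketch below combines steps S21 and S22 under these choices: the matching degree is defined by cosine distance, each second voice feature is put in first correspondence with its best-matching class vector, and an optional threshold rejects weak matches as transient noise. The function and parameter names are assumptions.

```python
import numpy as np

def assign_second_set(second_features, cvecs, reject_dist=None):
    """For each second voice feature, find the class vector with the
    highest matching degree (smallest cosine distance), optionally
    rejecting features whose best match is still too distant.

    `reject_dist` plays the role of the matching degree threshold
    described above; None disables rejection.
    """
    def cosine_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    correspondences = []  # class label per feature, or None if rejected
    for f in second_features:
        dists = {k: cosine_dist(f, c) for k, c in cvecs.items()}
        best = min(dists, key=dists.get)
        if reject_dist is not None and dists[best] > reject_dist:
            correspondences.append(None)  # treated as transient noise
        else:
            correspondences.append(best)  # first correspondence
    return correspondences
```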

Step S23, determining a second correspondence between the voice segments in the second voice set and the clustering result according to the first correspondence.

In the embodiment of the present invention, each second voice feature corresponds to a voice segment in the second voice set, and each class vector corresponds to a class in the clustering result of the first voice set. Therefore, from the first correspondence between the second voice features and the class vectors, the second correspondence between the voice segments in the second voice set and the classes in the clustering result can be obtained directly.

Step S24, separating the voice segments of different speakers in the second voice set according to the second correspondence.

In the embodiment of the present invention, according to the second correspondence between the voice segments in the second voice set and the classes in the clustering result, each voice segment in the second voice set can be assigned to its corresponding class, separating the voice segments of different speakers in the second voice set so that the voice segments of the same speaker fall into the same class. Reclassifying the segments of the second voice set by means of the clustering result of the first voice set allows segments that did not participate in the clustering to be assigned to classes as well, improving the accuracy of voice speaker separation.

In the embodiment of the invention, the voice speaker separation technique by itself only separates the voices of different speakers in the audio data. Optionally, the correspondence between specific speakers and the classes in the clustering result can also be determined, so as to label the voice segments under each class with a speaker identity. Optionally, historical voice features of different speakers may be obtained and matched against the first voice features of the segments in each class to determine which speaker corresponds to which class; the historical voice features are voice features extracted from the speakers' historical voice data.

Optionally, in scenarios such as teleconferences and video conferences, the identity information of the different speakers can be obtained from the conference roster, and their historical voice features retrieved according to that identity information; in a single-person voiceprint recognition scenario, the historical voice features of the recognition target can be obtained directly for matching. Alternatively, in a teleconference or video conference, the class of the voice segments captured while each speaker talks alone can be noted to establish the correspondence between speakers and classes. Depending on the application, a person skilled in the art may use different methods to determine the correspondence between speakers and classes.

In the embodiment of the present invention, optionally, the classes in the clustering result may be labeled with the identity information of the speakers according to the speaker-class correspondence, determining the identity of the speaker to whom the voice segments in each class belong. This makes it convenient to manage the voice segments by class, and when audio data to be processed is subsequently acquired, speaker separation can be performed on it according to the labeled classes, improving the efficiency of voice speaker separation.

A specific example is set forth below to explain in detail how an embodiment of the present invention is implemented:

acquiring audio data x to be processed;

segmenting x using VAD and a minimum analysis window to obtain S1, S2, ..., ST, a total of T segments;

extracting voice features from S1, S2, ..., ST respectively, using an existing speaker separation technique, to obtain the corresponding feature sequence F1, F2, ..., FT;

classifying the voice segments that meet the preset noise filtering parameter thr into a first voice set segments1, and collecting the corresponding first voice features into a set Feat1;

classifying the voice segments that do not meet the preset noise filtering parameter thr into a second voice set segments2, and collecting the corresponding second voice features into a set Feat2;

clustering the voice segments in the first voice set according to the first voice features in Feat1 to obtain K classes as the clustering result;

determining from the K classes that the first voice set contains the voice segments of K speakers;

calculating, from the first voice features corresponding to each of the K classes, the corresponding class vector Ci, i = 1, 2, ..., K;

for each second voice feature Fj in Feat2, calculating its distance to each class vector Ci, and determining that the Ci with the smallest distance has the first correspondence with the second voice feature Fj;

determining that the corresponding class has the second correspondence with the voice segment corresponding to that second voice feature;

assigning each voice segment Sj in the second voice set segments2 to its corresponding one of the K classes, so as to separate the voice segments of different speakers.
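For orientation only, a hypothetical end-to-end driver chaining the sketches introduced earlier in this section; the file name, the mono mixdown, and all default thresholds are assumptions rather than values from the patent.

```python
import numpy as np
import soundfile as sf

x, sr = sf.read("meeting.wav")                  # audio data x to be processed
if x.ndim > 1:
    x = x.mean(axis=1)                          # mix down to mono

segments = segment_by_silence(x, sr)            # S1, S2, ..., ST
segments1, segments2 = split_by_noise_filter(segments, sr)  # thr on duration

feat1 = np.vstack([extract_mfcc_feature(s, sr) for s in segments1])  # Feat1
feat2 = [extract_mfcc_feature(s, sr) for s in segments2]             # Feat2

labels = cluster_first_set(feat1)               # K classes over the first set
cvecs = class_vectors(feat1, labels)            # class vectors Ci, i = 1..K
second_labels = assign_second_set(feat2, cvecs) # correspondences for segments2
```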

In summary, in the embodiment of the present invention, the voice segments are filtered by a preset noise filtering parameter before they are clustered. Because transient noise differs markedly from speaker speech, most transient noise can be filtered out by a suitable preset noise filtering parameter, so that the first voice set consists mostly of voice segments of different speakers, which improves the accuracy of subsequent first voice feature extraction and voice segment clustering. Moreover, the segments containing transient noise and possibly misclassified speaker voice are reclassified using the clustering result of the first voice set, catching what the filter missed and further improving the robustness of the voice speaker separation technique.

Fig. 3 is a block diagram of a voice speaker separation apparatus according to an embodiment of the present invention. As shown in fig. 3, the apparatus 300 may include:

a data obtaining module 301, configured to obtain audio data to be processed;

a data segmenting module 302, configured to perform segmentation processing on the audio data according to a silence period to obtain at least one voice segment;

the parameter filtering module 303 is configured to classify the voice segments meeting the preset noise filtering parameter into a first voice set;

a feature extraction module 304, configured to extract a first speech feature of a speech segment in the first speech set;

a data clustering module 305, configured to cluster the voice segments in the first voice set according to the first voice feature to obtain a clustering result;

and a voice separation module 306, configured to separate voice segments of different speakers in the first voice set according to the clustering result.

Optionally, the parameter filtering module 303 is further configured to classify the voice segments that do not meet the preset noise filtering parameter into a second voice set;

the feature extraction module 304 is further configured to extract second voice features of the voice segments in the second voice set;

and the voice separation module 306 is further configured to separate the voice segments of different speakers in the second voice set according to the second voice features and the clustering result.

Optionally, the data clustering module 305 includes:

a vector calculation submodule, configured to calculate, for each class in the clustering result, a corresponding class vector;

and a class designation submodule, configured to separate the voice segments of different speakers in the second voice set according to the second voice features and the class vectors.

Optionally, the class designation submodule includes:

a matching degree calculation unit, configured to calculate the matching degree between each second voice feature and each class vector;

a correspondence determining unit, configured to determine a first correspondence between the second voice features and the class vectors according to the matching degrees;

the correspondence determining unit being further configured to determine a second correspondence between the voice segments in the second voice set and the clustering result according to the first correspondence;

and a speaker designation unit, configured to separate the voice segments of different speakers in the second voice set according to the second correspondence.

Optionally, the correspondence determining unit is specifically configured to determine, for each second voice feature, that the class vector with the highest matching degree has the first correspondence with that second voice feature.

In summary, in the embodiment of the present invention, the voice segments are filtered by a preset noise filtering parameter before they are clustered. Because transient noise differs markedly from speaker speech, most transient noise can be filtered out by a suitable preset noise filtering parameter, so that the first voice set consists mostly of voice segments of different speakers, which improves the accuracy of subsequent first voice feature extraction and voice segment clustering. Moreover, the segments containing transient noise and possibly misclassified speaker voice are reclassified using the clustering result of the first voice set, catching what the filter missed and further improving the robustness of the voice speaker separation technique.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for the parts that are the same or similar, the embodiments may be referred to one another.

Those skilled in the art will readily appreciate that the above embodiments may be combined in any manner; every such combination is therefore an embodiment of the present invention, although for reasons of space these combinations are not described in detail here.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
