Voice extraction method, device, equipment and medium for target speaker

Document No.: 662569  Publication date: 2021-04-27

Reading note: This technology, "Voice extraction method, device, equipment and medium for target speaker", was designed and created by 张舒婷, 赖众程, 杨念慈, 何利斌, 李会璟, 王小红 and 刘彦国 on 2020-12-23. Its main content is as follows: The application relates to the technical field of artificial intelligence and discloses a method, a device, equipment and a medium for extracting the voice of a target speaker. The method comprises: determining a plurality of first to-be-extracted voice data segments from first to-be-processed voice data in a first direction by using a preset segmentation method; performing segmented extraction on second to-be-processed voice data in a second direction according to the plurality of first to-be-extracted voice data segments to obtain a plurality of second to-be-extracted voice data segments; extracting data of the same time from the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments to obtain a plurality of to-be-extracted voice data segment pairs; and inputting each to-be-extracted voice data segment pair into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, which are then spliced in chronological order to obtain the target voice data of the target speaker. This reduces the cost of business quality evaluation and improves its comprehensiveness.

1. A method for extracting speech for a target speaker, the method comprising:

acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction;

performing segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;

carrying out segmentation extraction on the second voice data to be processed according to the first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;

performing data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted;

and respectively inputting each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

and splicing the target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

2. The method as claimed in claim 1, wherein the step of obtaining the first to-be-processed speech data and the second to-be-processed speech data of the target speaker in the same time period comprises:

acquiring the voice signal of the target speaker in the first direction and the voice signal of the target speaker in the second direction within the same time period;

carrying out segmentation processing on the voice signal in the first direction by adopting a first preset time length to obtain a plurality of segmented voice signal segments in the first direction;

inputting each segmented first-direction voice signal segment into a digital filter to obtain a plurality of filtered first-direction voice signal segments;

performing discrete Fourier transform on each filtered first-direction voice signal segment to obtain a plurality of transformed first-direction voice signal segments;

performing inverse discrete Fourier transform on the plurality of transformed first-direction voice signal segments to obtain noise-reduced first-direction voice data;

carrying out segmentation processing on the voice signal in the second direction by adopting the first preset time length to obtain a plurality of segmented voice signal segments in the second direction;

inputting each segmented second direction voice signal segment into a digital filter to obtain a plurality of filtered second direction voice signal segments;

performing discrete Fourier transform on each filtered second-direction voice signal segment to obtain a plurality of transformed second-direction voice signal segments;

performing inverse discrete Fourier transform on the plurality of transformed second direction voice signal segments to obtain second direction voice data subjected to noise reduction;

pre-emphasis processing is carried out on the first direction voice data after noise reduction, and the first voice data to be processed is obtained;

and pre-emphasis processing is carried out on the second direction voice data after noise reduction, so as to obtain the second voice data to be processed.

3. The method as claimed in claim 1, wherein the step of segmenting the first to-be-processed speech data by using a predetermined segmentation method to obtain a plurality of first to-be-extracted speech data segments comprises:

performing framing processing on the first voice data to be processed by adopting a second preset time length to obtain a plurality of first voice data frames to be processed;

respectively carrying out voice energy calculation on each first voice data frame to be processed to obtain first voice energy corresponding to each of the plurality of first voice data frames to be processed;

extracting, according to a preset number, first voice energies from the beginning of the first voice energies corresponding to the plurality of first voice data frames to be processed, so as to obtain a plurality of first beginning voice energies;

carrying out mean value calculation on the plurality of first beginning voice energies to obtain a first background voice energy corresponding to the plurality of first voice data frames to be processed;

respectively carrying out subtraction calculation on the first voice energy corresponding to each first voice data frame to be processed and the first background voice energy to obtain first voice energy difference values corresponding to the plurality of first voice data frames to be processed;

comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value;

when the first voice energy difference value corresponding to the first voice data frame to be processed is larger than the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a non-mute frame;

when a first voice energy difference value corresponding to the first voice data frame to be processed is smaller than or equal to the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a mute frame;

and performing mute frame deletion processing on the plurality of first voice data frames to be processed by adopting a mute frame quantity threshold and the mute category to obtain a plurality of first voice data sections to be extracted.

4. The method as claimed in claim 3, wherein the step of performing silence frame deletion processing on the first frames of speech data to be processed by using a threshold of number of silence frames and the silence category to obtain the first segments of speech data to be extracted includes:

calculating the number of the mute frames of the first voice data frames to be processed which are continuous according to time to obtain the number of a plurality of first continuous mute frames;

comparing the number of each first continuous mute frame with the threshold of the number of the mute frames respectively;

and when the number of the first continuous mute frames is greater than the mute frame number threshold, deleting the first to-be-processed voice data frames corresponding to all the first continuous mute frame numbers greater than the mute frame number threshold from the plurality of first to-be-processed voice data frames to obtain the plurality of first to-be-extracted voice data segments.

5. The method as claimed in claim 1, wherein the step of extracting the second voice data to be processed in segments according to the plurality of first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted includes:

respectively extracting the start time and the end time of each first voice data segment to be extracted to obtain the first start time and the first end time corresponding to the plurality of first voice data segments to be extracted;

and performing segmentation extraction on the second voice data to be processed by respectively adopting the first start time and the first end time corresponding to each first voice data segment to be extracted to obtain a plurality of second voice data segments to be extracted.

6. The method as claimed in claim 1, wherein the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments comprises:

inputting the first voice data segment to be extracted of the voice data segment pair to be extracted into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result;

inputting the second voice data segment to be extracted of the voice data segment pair to be extracted into the second coding transformation module of the single speaker voice extraction model for coding transformation to obtain a second coding transformation result;

calling the speaker separation learning module of the single speaker voice extraction model to perform speaker separation learning on the first coding transformation result and the second coding transformation result to obtain a target mask matrix;

calling the decoding transformation module of the single speaker voice extraction model to perform decoding transformation on the target mask matrix to obtain the target speaker voice data segment corresponding to the voice data segment to be extracted;

and repeatedly executing the step of inputting the first to-be-extracted voice data segment of the to-be-extracted voice data segment pair into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result, until the target speaker voice data segments corresponding to all the voice data segment pairs to be extracted are obtained.

7. The method as claimed in claim 1, wherein before the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the method comprises:

obtaining a plurality of training samples, the training samples comprising: voice sample data in a first direction, voice sample data in a second direction and voice calibration data;

inputting the voice sample data of the training sample in the first direction into a first to-be-trained coding transformation module of a to-be-trained voice extraction model and inputting the voice sample data of the training sample in the second direction into a second to-be-trained coding transformation module of the to-be-trained voice extraction model, and acquiring single speaker training data output by the to-be-trained voice extraction model, wherein the to-be-trained voice extraction model is a model obtained based on the TasNet network;

inputting the voice calibration data and the single speaker training data into a loss function for calculation to obtain a loss value of the voice extraction model to be trained, updating the parameters of the voice extraction model to be trained according to the loss value, and using the updated voice extraction model to be trained for the next calculation of the single speaker training data;

and repeatedly executing the above steps until the loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and determining the voice extraction model to be trained whose loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the single speaker voice extraction model.

8. An apparatus for speech extraction for a target speaker, the apparatus comprising:

the voice data acquisition module is used for acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction;

the first segmentation processing module is used for performing segmentation processing on the first to-be-processed voice data by adopting a preset segmentation method to obtain a plurality of first to-be-extracted voice data segments;

the second segmentation extraction module is used for performing segmentation extraction on the second voice data to be processed according to the plurality of first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;

the to-be-extracted voice data segment pair determining module is used for performing data extraction on the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments at the same time to obtain a plurality of to-be-extracted voice data segment pairs;

a target speaker voice data segment determining module, configured to respectively input each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

and the target voice data determining module is used for splicing the plurality of target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for extracting speech of a target speaker.

Background

At present, service personnel differ in business quality, and problems such as non-standard service scripts and unfriendly attitudes exist. To improve the service quality of service personnel, business quality is currently evaluated by means of manual spot checks and mystery-visit spot checks, which consume a large amount of manpower and financial resources, so the cost is high; moreover, such spot checks can only reflect the service situation during part of the time, so the resulting business quality assessment is one-sided.

Disclosure of Invention

The application mainly aims to provide a method, a device, equipment and a medium for extracting the voice of a target speaker, so as to solve the technical problems in the prior art that the service industry performs business quality assessment by manual spot checks and mystery-visit spot checks, which is costly and yields a one-sided business quality assessment.

In order to achieve the above object, the present application provides a method for extracting a speech of a target speaker, the method comprising:

acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction;

performing segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;

carrying out segmentation extraction on the second voice data to be processed according to the first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;

performing data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted;

and respectively inputting each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

and splicing the target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

Further, the step of obtaining the first to-be-processed speech data and the second to-be-processed speech data of the target speaker in the same time period includes:

acquiring the voice signal of the target speaker in the first direction and the voice signal of the target speaker in the second direction within the same time period;

carrying out segmentation processing on the voice signal in the first direction by adopting a first preset time length to obtain a plurality of segmented voice signal segments in the first direction;

inputting each segmented first-direction voice signal segment into a digital filter to obtain a plurality of filtered first-direction voice signal segments;

performing discrete Fourier transform on each filtered first-direction voice signal segment to obtain a plurality of transformed first-direction voice signal segments;

performing inverse discrete Fourier transform on the plurality of transformed first-direction voice signal segments to obtain noise-reduced first-direction voice data;

carrying out segmentation processing on the voice signal in the second direction by adopting the first preset time length to obtain a plurality of segmented voice signal segments in the second direction;

inputting each segmented second direction voice signal segment into a digital filter to obtain a plurality of filtered second direction voice signal segments;

performing discrete Fourier transform on each filtered second-direction voice signal segment to obtain a plurality of transformed second-direction voice signal segments;

performing inverse discrete Fourier transform on the plurality of transformed second direction voice signal segments to obtain second direction voice data subjected to noise reduction;

pre-emphasis processing is carried out on the first direction voice data after noise reduction, and the first voice data to be processed is obtained;

and pre-emphasis processing is carried out on the second direction voice data after noise reduction, so as to obtain the second voice data to be processed.

Further, the step of performing segmentation processing on the first to-be-processed speech data by using a preset segmentation method to obtain a plurality of first to-be-extracted speech data segments includes:

performing framing processing on the first voice data to be processed by adopting a second preset time length to obtain a plurality of first voice data frames to be processed;

respectively carrying out voice energy calculation on each first voice data frame to be processed to obtain first voice energy corresponding to each of the plurality of first voice data frames to be processed;

extracting, according to a preset number, first voice energies from the beginning of the first voice energies corresponding to the plurality of first voice data frames to be processed, so as to obtain a plurality of first beginning voice energies;

carrying out mean value calculation on the plurality of first beginning voice energies to obtain a first background voice energy corresponding to the plurality of first voice data frames to be processed;

respectively carrying out subtraction calculation on the first voice energy corresponding to each first voice data frame to be processed and the first background voice energy to obtain first voice energy difference values corresponding to the plurality of first voice data frames to be processed;

comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value;

when the first voice energy difference value corresponding to the first voice data frame to be processed is larger than the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a non-mute frame;

when a first voice energy difference value corresponding to the first voice data frame to be processed is smaller than or equal to the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a mute frame;

and performing mute frame deletion processing on the plurality of first voice data frames to be processed by adopting a mute frame quantity threshold and the mute category to obtain a plurality of first voice data sections to be extracted.

Further, the step of performing mute frame deletion processing on the first to-be-processed voice data frames by using the mute frame number threshold and the mute category to obtain the first to-be-extracted voice data segments includes:

calculating the number of the mute frames of the first voice data frames to be processed which are continuous according to time to obtain the number of a plurality of first continuous mute frames;

comparing the number of each first continuous mute frame with the threshold of the number of the mute frames respectively;

and when the number of the first continuous mute frames is greater than the mute frame number threshold, deleting the first to-be-processed voice data frames corresponding to all the first continuous mute frame numbers greater than the mute frame number threshold from the plurality of first to-be-processed voice data frames to obtain the plurality of first to-be-extracted voice data segments.

Further, the step of extracting the second to-be-processed speech data in a segmented manner according to the plurality of first to-be-extracted speech data segments to obtain a plurality of second to-be-extracted speech data segments includes:

respectively extracting the start time and the end time of each first voice data segment to be extracted to obtain the first start time and the first end time corresponding to the plurality of first voice data segments to be extracted;

and performing segmentation extraction on the second voice data to be processed by respectively adopting the first start time and the first end time corresponding to each first voice data segment to be extracted to obtain a plurality of second voice data segments to be extracted.

Further, the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments includes:

inputting the first voice data segment to be extracted of the voice data segment pair to be extracted into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result;

inputting the second voice data segment to be extracted of the voice data segment pair to be extracted into the second coding transformation module of the single speaker voice extraction model for coding transformation to obtain a second coding transformation result;

calling the speaker separation learning module of the single speaker voice extraction model to perform speaker separation learning on the first coding transformation result and the second coding transformation result to obtain a target mask matrix;

calling the decoding transformation module of the single speaker voice extraction model to perform decoding transformation on the target mask matrix to obtain the target speaker voice data segment corresponding to the voice data segment to be extracted;

and repeatedly executing the step of inputting the first to-be-extracted voice data segment of the to-be-extracted voice data segment pair into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result, until the target speaker voice data segments corresponding to all the voice data segment pairs to be extracted are obtained.

Further, before the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the method includes:

obtaining a plurality of training samples, the training samples comprising: voice sample data in a first direction, voice sample data in a second direction and voice calibration data;

inputting the voice sample data of the training sample in the first direction into a first to-be-trained coding transformation module of a to-be-trained voice extraction model and inputting the voice sample data of the training sample in the second direction into a second to-be-trained coding transformation module of the to-be-trained voice extraction model, and acquiring single speaker training data output by the to-be-trained voice extraction model, wherein the to-be-trained voice extraction model is a model obtained based on the TasNet network;

inputting the voice calibration data and the single speaker training data into a loss function for calculation to obtain a loss value of the voice extraction model to be trained, updating the parameters of the voice extraction model to be trained according to the loss value, and using the updated voice extraction model to be trained for the next calculation of the single speaker training data;

and repeatedly executing the above steps until the loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and determining the voice extraction model to be trained whose loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the single speaker voice extraction model.
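
As a non-limiting illustration of the training procedure described above, the following Python sketch uses a PyTorch-style loop; the model interface, the Adam optimizer and the negative SI-SNR loss are assumptions for illustration and are not specified in this application.

```python
import torch

def train_single_speaker_extractor(model, loader, max_iters=10_000, loss_floor=1e-4):
    """Hedged sketch of the training loop: `model` is assumed to take
    (first_direction, second_direction) waveforms and return an estimated
    single-speaker waveform; `loader` yields (first_dir, second_dir,
    calibration) tensors. Both interfaces are assumptions."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    def neg_si_snr(estimate, target, eps=1e-8):
        # Scale-invariant SNR, a common loss for TasNet-style models (assumption).
        target = target - target.mean(dim=-1, keepdim=True)
        estimate = estimate - estimate.mean(dim=-1, keepdim=True)
        s_target = (torch.sum(estimate * target, dim=-1, keepdim=True)
                    * target / (torch.sum(target ** 2, dim=-1, keepdim=True) + eps))
        noise = estimate - s_target
        si_snr = 10 * torch.log10(
            torch.sum(s_target ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps) + eps)
        return -si_snr.mean()

    iteration = 0
    for first_dir, second_dir, calibration in loader:
        estimate = model(first_dir, second_dir)      # single speaker training data
        loss = neg_si_snr(estimate, calibration)     # compare with calibration data
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                             # update model parameters
        iteration += 1
        # Stop when the loss reaches the first convergence condition or the
        # iteration count reaches the second convergence condition.
        if loss.item() < loss_floor or iteration >= max_iters:
            break
    return model
```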

The present application further proposes a speech extraction device for a target speaker, the device comprising:

the voice data acquisition module is used for acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction;

the first segmentation processing module is used for performing segmentation processing on the first to-be-processed voice data by adopting a preset segmentation method to obtain a plurality of first to-be-extracted voice data segments;

the second segmentation extraction module is used for performing segmentation extraction on the second voice data to be processed according to the plurality of first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;

the to-be-extracted voice data segment pair determining module is used for performing data extraction on the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments at the same time to obtain a plurality of to-be-extracted voice data segment pairs;

a target speaker voice data segment determining module, configured to respectively input each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

and the target voice data determining module is used for splicing the plurality of target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

The present application further proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the method of any one of the above when executing the computer program.

The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.

The method, device, equipment and medium for extracting the voice of a target speaker perform segmentation processing on first to-be-processed voice data in a first direction and second to-be-processed voice data in a second direction of the target speaker within the same time period, and extract data of the same time to obtain a plurality of to-be-extracted voice data segment pairs; each pair is then input into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training; finally, the plurality of target speaker voice data segments are spliced in chronological order to obtain the target voice data of the target speaker. The speaking voice of the target speaker is thus extracted quickly, accurately and automatically, the cost of business quality evaluation is reduced, the comprehensiveness of the evaluation is improved through the complete voice data of the target speaker, and the privacy and safety of the voice data of other speakers are protected.

Drawings

FIG. 1 is a flowchart illustrating a method for extracting speech of a target speaker according to an embodiment of the present application;

FIG. 2 is a block diagram of a voice extraction apparatus for a target speaker according to an embodiment of the present application;

fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.

The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In order to solve the technical problems in the prior art that the service industry performs business quality assessment by manual spot checks and mystery-visit spot checks, which is costly and yields a one-sided business quality assessment, the application provides a voice extraction method for a target speaker, applied in the technical field of artificial intelligence. In the voice extraction method for the target speaker, the voices of the target speaker recorded from different directions are segmented and then input into a single speaker voice extraction model obtained based on TasNet network training for voice extraction, and the extracted voices are spliced in chronological order to obtain voice data containing only the voice of the target speaker. The speaking voice of the target speaker is thus extracted quickly, accurately and automatically, the cost of business quality evaluation is reduced, the comprehensiveness of the evaluation is improved through the complete voice data of the target speaker, and the privacy and safety of the voice data of other speakers are protected.

Referring to fig. 1, an embodiment of the present application provides a method for extracting a speech of a target speaker, where the method includes:

s1: acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction;

s2: performing segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;

s3: carrying out segmentation extraction on the second voice data to be processed according to the first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;

s4: performing data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted;

s5: and respectively inputting each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

s6: and splicing the target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

In this embodiment, the first voice data to be processed in the first direction and the second voice data to be processed in the second direction of the target speaker in the same time period are segmented and the data of the same time are extracted to obtain a plurality of to-be-extracted voice data segment pairs; each pair is then input into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training; finally, the plurality of target speaker voice data segments are spliced in chronological order to obtain the target voice data of the target speaker. The speaking voice of the target speaker is thus extracted quickly, accurately and automatically, the cost of business quality evaluation is reduced, the comprehensiveness of the evaluation is improved through the complete voice data of the target speaker, and the privacy and safety of the voice data of other speakers are protected.

For S1, the first to-be-processed speech data and the second to-be-processed speech data of the target speaker in the same time period may be obtained from the database, or the first to-be-processed speech data and the second to-be-processed speech data of the target speaker in the same time period recorded by the recording apparatus may be directly obtained.

The first voice data to be processed is voice data obtained according to a voice signal in a first direction, the second voice data to be processed is voice data obtained according to a voice signal in a second direction, and the first direction and the second direction are different directions.

The first voice data to be processed and the second voice data to be processed are in the same time period of the target speaker, that is, the voice signal in the first direction and the voice signal in the second direction are simultaneously recorded voice signals, and the recording time length of the voice signal in the first direction is the same as the recording time length of the voice signal in the second direction.

For S2, the first speech data to be processed is divided into a plurality of pieces of small speech data by a preset segmentation method, and each piece of small speech data is taken as a first speech data piece to be extracted.

For S3, the second to-be-processed speech data is divided into a plurality of pieces of small speech data using the same start time and end time as the plurality of pieces of first to-be-extracted speech data, and each piece of small speech data is taken as one piece of second to-be-extracted speech data.

It is to be understood that, the second voice data to be processed may be segmented first, and then the first voice data to be processed may be segmented and extracted by using the segmentation processing result, which is not limited in this respect.

It is understood that, the step S3 can also directly adopt the method of the step S2 to perform the segmentation process, and is not limited herein.

For S4, the pieces of speech data to be extracted with the same start time and end time in the plurality of pieces of first speech data to be extracted and the plurality of pieces of second speech data to be extracted are grouped into a pair of pieces of speech data to be extracted. That is, each of the pair of speech data segments to be extracted includes a first speech data segment to be extracted and a second speech data segment to be extracted. The starting time of the first voice data segment to be extracted in the same voice data segment pair to be extracted is the same as the starting time of the second voice data segment to be extracted, and the ending time of the first voice data segment to be extracted in the same voice data segment pair to be extracted is the same as the ending time of the second voice data segment to be extracted.

For example, in the to-be-extracted speech data segment pair D1, the start time of the first to-be-extracted speech data segment is 1 hour 0 minute 0 second and the end time is 2 hours 0 minute 0 second, and the start time of the second to-be-extracted speech data segment is 1 hour 0 minute 0 second and the end time is 2 hours 0 minute 0 second, which is not limited in this example.
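
As a non-limiting illustration of the segmentation extraction (S3) and same-time pairing (S4) described above, a minimal Python sketch follows; the segment representation (start time, end time and sample array) and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

@dataclass
class Segment:
    start: float         # segment start time in seconds
    end: float           # segment end time in seconds
    samples: np.ndarray  # waveform samples of the segment

def extract_by_times(second_data: np.ndarray, sr: int,
                     first_segments: List[Segment]) -> List[Segment]:
    """Cut the second-direction data at the start/end times of the first-direction segments."""
    out = []
    for seg in first_segments:
        s, e = int(seg.start * sr), int(seg.end * sr)
        out.append(Segment(seg.start, seg.end, second_data[s:e]))
    return out

def pair_segments(first_segments: List[Segment],
                  second_segments: List[Segment]) -> List[Tuple[Segment, Segment]]:
    """Group first/second direction segments that share the same start and end time."""
    by_time = {(s.start, s.end): s for s in second_segments}
    return [(f, by_time[(f.start, f.end)]) for f in first_segments
            if (f.start, f.end) in by_time]
```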

For step S5, the first to-be-extracted speech data segment and the second to-be-extracted speech data segment of each to-be-extracted speech data segment pair are simultaneously input into the single speaker speech extraction model for speech extraction, so as to obtain the plurality of target speaker speech data segments output by the single speaker speech extraction model. That is, a target speaker speech data segment contains only the voice data of the target speaker, and each to-be-extracted speech data segment pair yields one target speaker speech data segment after speech extraction by the single speaker speech extraction model.

The first coding conversion module and the second coding conversion module output data to the speaker separation learning module, and the speaker separation learning module outputs data to the decoding conversion module.

It is understood that the first coding transformation module and the second coding transformation module of the single speaker speech extraction model have the same structure and share the same parameters.

The first coding transformation module is a module obtained by training the Encoder module of a TasNet (Time-domain Audio Separation Network, used here for single-channel real-time speech separation), the second coding transformation module is a module obtained by training the Encoder module of the TasNet network, the speaker separation learning module is a module obtained by training the Separation module of the TasNet network, and the decoding transformation module is a module obtained by training the Decoder module of the TasNet network.

The first coding transformation module is used for coding transformation. The second coding transformation module is used for coding transformation. The speaker separation learning module is used for speaker separation learning. The decoding transformation module is used for performing decoding transformation.
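
The following Python sketch illustrates, in simplified form and only by way of example, how such an encoder-separation-decoder layout can be wired together; the layer types, sizes and the way the mask is applied follow common TasNet-style practice and are assumptions rather than the concrete trained modules of this application.

```python
import torch
import torch.nn as nn

class SingleSpeakerExtractionSketch(nn.Module):
    """Simplified sketch of the module layout described above (encoder ->
    separation -> decoder). All layer choices are illustrative assumptions."""

    def __init__(self, n_filters=256, kernel_size=16, stride=8):
        super().__init__()
        # The first and second coding transformation modules have the same
        # structure and parameters, so one encoder is reused for both inputs.
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # Speaker separation learning module: estimates the target mask matrix
        # from the two coding transformation results.
        self.separator = nn.Sequential(
            nn.Conv1d(2 * n_filters, n_filters, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=1),
            nn.Sigmoid(),
        )
        # Decoding transformation module: maps the masked encoding back to a waveform.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, first_segment: torch.Tensor, second_segment: torch.Tensor) -> torch.Tensor:
        # Inputs are (batch, 1, samples) waveforms of one to-be-extracted segment pair.
        enc_first = self.encoder(first_segment)    # first coding transformation result
        enc_second = self.encoder(second_segment)  # second coding transformation result
        mask = self.separator(torch.cat([enc_first, enc_second], dim=1))
        return self.decoder(enc_first * mask)      # target speaker voice data segment
```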

For S6, each target speaker voice data segment in the plurality of target speaker voice data segments is spliced according to the time sequence, and the spliced voice data is taken as the target voice data of the target speaker. The target voice data of the target speaker contains only the voice data of the target speaker and is the complete voice data of the target speaker in the same time period of step S1, so the comprehensiveness of the business quality evaluation is improved, the voice data of other speakers are removed, and the privacy and safety of other speakers are protected.
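
A minimal sketch of the chronological splicing, assuming each target speaker voice data segment is kept together with its start time:

```python
import numpy as np

def splice_target_segments(segments):
    """Concatenate target-speaker segments in chronological order.

    `segments` is assumed to be a list of (start_time, samples) tuples."""
    ordered = sorted(segments, key=lambda item: item[0])
    return np.concatenate([samples for _, samples in ordered])
```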

In one embodiment, the step of obtaining the first to-be-processed speech data and the second to-be-processed speech data of the target speaker in the same time period includes:

s101: acquiring the voice signal of the target speaker in the first direction and the voice signal of the target speaker in the second direction within the same time period;

s102: carrying out segmentation processing on the voice signal in the first direction by adopting a first preset time length to obtain a plurality of segmented voice signal segments in the first direction;

s103: inputting each segmented first-direction voice signal segment into a digital filter to obtain a plurality of filtered first-direction voice signal segments;

s104: performing discrete Fourier transform on each filtered first-direction voice signal segment to obtain a plurality of transformed first-direction voice signal segments;

s105: performing inverse discrete Fourier transform on the plurality of transformed first-direction voice signal segments to obtain noise-reduced first-direction voice data;

s106: carrying out segmentation processing on the voice signal in the second direction by adopting the first preset time length to obtain a plurality of segmented voice signal segments in the second direction;

s107: inputting each segmented second direction voice signal segment into a digital filter to obtain a plurality of filtered second direction voice signal segments;

s108: performing discrete Fourier transform on each filtered second-direction voice signal segment to obtain a plurality of transformed second-direction voice signal segments;

s109: performing inverse discrete Fourier transform on the plurality of transformed second direction voice signal segments to obtain second direction voice data subjected to noise reduction;

s110: pre-emphasis processing is carried out on the first direction voice data after noise reduction, and the first voice data to be processed is obtained;

s111: and pre-emphasis processing is carried out on the second direction voice data after noise reduction, so as to obtain the second voice data to be processed.

This embodiment performs segment-wise filtering, discrete Fourier transform, inverse discrete Fourier transform and pre-emphasis processing on the voice signals, thereby improving the quality of the obtained first voice data to be processed and second voice data to be processed, and improving the accuracy of the determined target voice data of the target speaker.

For S101, the voice signal in the first direction is a voice signal recorded by the recording device in the first direction to the target speaker. The voice signal of the second direction is the voice signal recorded by the recording device of the second direction to the target speaker.

The sound recording device in the first direction and the sound recording device in the second direction may be arranged independently, or may be integrated on the same electronic device. For example, the recording device in the first direction and the recording device in the second direction are integrated into a smart badge, with the recording device in the first direction facing the mouth of the target speaker and the recording device in the second direction facing the area in front of the target speaker, which is not limited in this example.

For step S102, a first preset duration is adopted to divide the voice signal in the first direction into multiple small voice signals, and each small voice signal is taken as a segmented voice signal segment in the first direction.

Optionally, the first preset time duration is 20 ms.

For S103, the digital filter may be a filter capable of removing additive noise from the prior art, which is not described herein.

For step S104, performing discrete fourier transform on the filtered first-direction speech signal segment, and taking the filtered first-direction speech signal segment after the discrete fourier transform as the transformed first-direction speech signal segment.

And S105, sequencing the plurality of transformed first-direction voice signal segments according to the time sequence, performing inverse discrete Fourier transform on the sequenced plurality of transformed first-direction voice signal segments, and performing inverse discrete Fourier transform to obtain noise-reduced first-direction voice data. The noise-reduced first-direction voice data is clean voice data.
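
By way of a non-limiting illustration of steps S102 to S105 (and, identically, S106 to S109), the following Python sketch shows the per-segment filter, discrete Fourier transform and inverse discrete Fourier transform chain; the concrete digital filter and any frequency-domain noise suppression are not specified in the text, so a simple placeholder filter is used.

```python
import numpy as np
from scipy.signal import lfilter

def denoise_direction(signal: np.ndarray, sr: int, segment_ms: int = 20) -> np.ndarray:
    """Sketch of the per-segment filter -> DFT -> inverse DFT chain.

    A moving-average FIR filter stands in for the "digital filter" purely for
    illustration; it is an assumption, not the filter used in this application."""
    seg_len = int(sr * segment_ms / 1000)
    fir = np.ones(5) / 5.0                        # placeholder digital filter taps
    denoised = []
    for start in range(0, len(signal), seg_len):
        segment = signal[start:start + seg_len]
        filtered = lfilter(fir, [1.0], segment)   # digital filtering
        spectrum = np.fft.fft(filtered)           # discrete Fourier transform
        # (Frequency-domain noise suppression would be applied to `spectrum` here.)
        denoised.append(np.real(np.fft.ifft(spectrum)))  # inverse DFT back to time domain
    return np.concatenate(denoised)               # noise-reduced data in time order
```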

For step S106, the speech signal in the second direction is divided into multiple small speech signals by using a first preset time duration, and each small speech signal is used as a segmented speech signal segment in the second direction.

For S107, the digital filter may be a filter capable of removing additive noise from the prior art, which is not described herein.

For step S108, performing discrete fourier transform on the filtered second direction speech signal segment, and taking the filtered second direction speech signal segment after the discrete fourier transform as the transformed second direction speech signal segment.

And S109, sequencing the plurality of transformed second direction voice signal segments according to the time sequence, performing inverse discrete Fourier transform on the sequenced plurality of transformed second direction voice signal segments, and performing inverse discrete Fourier transform to obtain noise-reduced second direction voice data. The noise-reduced second direction voice data is pure voice data.

For S110, pre-emphasis is a signal processing method that boosts the high-frequency components of the input signal to compensate for the excessive attenuation of high-frequency components during transmission.

And pre-emphasis processing is carried out on the voice data subjected to noise reduction in the first direction by adopting a first-order FIR high-pass digital filter to obtain the first voice data to be processed.

Optionally, the pre-emphasis coefficient α of the first-order FIR high-pass digital filter satisfies 0.9 < α < 1.0.

Optionally, the pre-emphasis coefficient of the first order FIR high-pass digital filter is 0.97.
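
A minimal sketch of the first-order FIR high-pass pre-emphasis used in steps S110 and S111, with the coefficient 0.97 mentioned above:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order FIR high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```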

For step S111, a first-order FIR high-pass digital filter is adopted to perform pre-emphasis processing on the voice data subjected to noise reduction in the second direction, so as to obtain the second voice data to be processed.

Optionally, the pre-emphasis coefficient α of the first-order FIR high-pass digital filter satisfies 0.9 < α < 1.0.

Optionally, the pre-emphasis coefficient of the first order FIR high-pass digital filter is 0.97.

In an embodiment, the step of performing segmentation processing on the first to-be-processed speech data by using a preset segmentation method to obtain a plurality of first to-be-extracted speech data segments includes:

s21: performing framing processing on the first voice data to be processed by adopting a second preset time length to obtain a plurality of first voice data frames to be processed;

s22: respectively carrying out voice energy calculation on each first voice data frame to be processed to obtain first voice energy corresponding to each of the plurality of first voice data frames to be processed;

s23: extracting, according to a preset number, first voice energies from the beginning of the first voice energies corresponding to the plurality of first voice data frames to be processed, so as to obtain a plurality of first beginning voice energies;

s24: carrying out mean value calculation on the plurality of first beginning voice energies to obtain a first background voice energy corresponding to the plurality of first voice data frames to be processed;

s25: respectively carrying out subtraction calculation on the first voice energy corresponding to each first voice data frame to be processed and the first background voice energy to obtain first voice energy difference values corresponding to the plurality of first voice data frames to be processed;

s26: comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value;

s27: when the first voice energy difference value corresponding to the first voice data frame to be processed is larger than the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a non-mute frame;

s28: when a first voice energy difference value corresponding to the first voice data frame to be processed is smaller than or equal to the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a mute frame;

s29: and performing mute frame deletion processing on the plurality of first voice data frames to be processed by adopting a mute frame quantity threshold and the mute category to obtain a plurality of first voice data sections to be extracted.

In this embodiment, framing is performed first, then the mute category of each frame is determined according to the voice energy of that frame, and finally mute frames are deleted according to the mute category to obtain the plurality of first to-be-extracted voice data segments. This reduces the number of voice data segments input into the single speaker voice extraction model, improves the voice extraction efficiency, shortens the mute duration in the finally obtained target voice data of the target speaker, and improves the efficiency of business quality evaluation based on the target voice data of the target speaker.

For S21, the first to-be-processed speech data is divided into multiple frames of speech data using a second preset time duration, and each frame is taken as one first to-be-processed speech data frame. Dividing the data into frames reduces errors in the subsequent deletion of mute frames and thus further improves the accuracy of the target voice data of the target speaker.

Optionally, the second preset time duration is 30 ms.

For S22, performing speech energy calculation on the first to-be-processed speech data frame, and taking the calculated speech energy as the first speech energy corresponding to the first to-be-processed speech data frame.

The first speech energy E_n is calculated as

E_n = \sum_{m=0}^{N-1} [x(m) \cdot w(m)]^2

where x(m) is the m-th sample of the first frame of speech data to be processed, N is the number of samples in the frame, and w(m) is the window function (a rectangular window corresponding to the first frame of speech data to be processed); with this square window, the speech energy equals the sum of the squares of all speech samples in the frame.
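
A minimal Python sketch of the framing and short-time energy calculation described above, assuming the 30 ms frame length mentioned earlier and a rectangular window:

```python
import numpy as np

def frame_energies(signal: np.ndarray, sr: int, frame_ms: int = 30) -> np.ndarray:
    """Split the data into fixed-length frames and return the short-time energy
    of each frame, i.e. the sum of squared samples under a rectangular window."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames.astype(np.float64) ** 2, axis=1)
```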

For S23, extracting a preset number of first to-be-processed voice data frames from the beginning of the plurality of first to-be-processed voice data frames, and using the extracted preset number of first to-be-processed voice data frames as a plurality of background to-be-processed voice data frames; and taking the first speech energy corresponding to each background speech data frame to be processed as a first beginning speech energy.

Optionally, the preset number is 10.

For S24, a mean value calculation is performed on the plurality of first beginning speech energies, and the calculated mean speech energy is taken as the first background speech energy corresponding to the plurality of first to-be-processed speech data frames.

For S25, the first background speech energy is subtracted from the first speech energy corresponding to the first to-be-processed speech data frame to obtain a speech energy difference value, and the obtained speech energy difference value is used as the first speech energy difference value corresponding to the first to-be-processed speech data frame.

For S26, obtaining a speech energy threshold; and comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value.

For S27, when the first speech energy difference value corresponding to a first to-be-processed speech data frame is greater than the speech energy threshold, that frame differs significantly from the background speech corresponding to the first background speech energy, meaning that the target speaker and/or other speakers are speaking at that moment; therefore, the silence category of the first to-be-processed speech data frame corresponding to the first speech energy difference value can be determined to be a non-silence frame.

For S28, when the first speech energy difference value corresponding to a first to-be-processed speech data frame is less than or equal to the speech energy threshold, that frame differs little from the background speech corresponding to the first background speech energy, meaning that neither the target speaker nor other speakers are speaking at that moment; therefore, the silence category of the first to-be-processed speech data frame corresponding to the first speech energy difference value can be determined to be a silence frame.
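
By way of illustration of steps S23 to S28, the following sketch computes the first background speech energy from the beginning frames and classifies each frame as a silence or non-silence frame; the preset number of 10 beginning frames follows the text, while the energy threshold value is an illustrative assumption.

```python
import numpy as np

def classify_silence(energies: np.ndarray, n_background: int = 10,
                     energy_threshold: float = 0.1) -> np.ndarray:
    """Return a boolean array: True where a frame is a silence frame.

    The background energy is the mean of the first `n_background` frame
    energies; a frame is non-silent when its energy exceeds the background by
    more than `energy_threshold` (an illustrative value, not given in the text)."""
    background = energies[:n_background].mean()
    differences = energies - background
    return differences <= energy_threshold
```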

For S29, deleting the first to-be-processed speech data frames of which consecutive silence categories are silence frames that satisfy the threshold of the number of silence frames, and determining the first to-be-extracted speech data segments according to the deleted first to-be-processed speech data frames. That is, the total duration of the voice data of the first to-be-extracted voice data segments is less than or equal to the total duration of the voice data of the first to-be-processed voice data frames.

Optionally, a mute frame number threshold and the mute category are used to perform mute frame deletion processing on the plurality of first to-be-processed voice data frames to obtain a plurality of to-be-combined first to-be-processed voice data frames; adjacent to-be-combined first to-be-processed voice data frames are then combined to obtain the plurality of first to-be-extracted voice data segments. This further reduces the number of voice data segments input into the single speaker voice extraction model and improves the voice extraction efficiency. For example, suppose the first to-be-processed voice data frames, ordered in time, are voice data frame 1 to voice data frame 7. After voice data frame 3 and voice data frame 4 are deleted, the to-be-combined frames are voice data frame 1, voice data frame 2, voice data frame 5, voice data frame 6 and voice data frame 7. Combining adjacent frames in time order then yields two first to-be-extracted voice data segments: the first consists of voice data frame 1 and voice data frame 2, and the second consists of voice data frame 5, voice data frame 6 and voice data frame 7. This example is not limiting.

In an embodiment, the step of performing silence frame deletion processing on the plurality of first to-be-processed speech data frames by using a silence frame number threshold and the silence category to obtain the plurality of first to-be-extracted speech data segments includes:

S291: counting the silence frames that are consecutive in time among the plurality of first to-be-processed speech data frames to obtain a plurality of first consecutive silence frame counts;

S292: comparing each first consecutive silence frame count with the silence frame number threshold;

S293: when a first consecutive silence frame count is greater than the silence frame number threshold, deleting, from the plurality of first to-be-processed speech data frames, the first to-be-processed speech data frames corresponding to every first consecutive silence frame count that is greater than the silence frame number threshold, so as to obtain the plurality of first to-be-extracted speech data segments.

In this embodiment, the plurality of first to-be-extracted speech data segments are obtained by deleting silence frames according to the silence category, which reduces the number of speech data segments input into the single speaker speech extraction model, improves speech extraction efficiency, shortens the silence duration in the finally obtained target speech data of the target speaker, and improves the efficiency of business quality evaluation performed on that target speech data.

For S291, the first to-be-processed speech data frames are sorted in chronological order, and the lengths of the consecutive runs of silence frames among the sorted frames are counted to obtain the plurality of first consecutive silence frame counts.

For example, speech data frames 1 to 7 of the first to-be-processed speech data frames are sorted in chronological order, and the silence category of speech data frames 3, 4 and 6 is silence frame; then a first consecutive silence frame count of 2 is obtained (namely speech data frames 3 and 4) and a second first consecutive silence frame count of 1 is obtained (namely speech data frame 6), which is not limiting in this example.

For S292, each first consecutive silence frame count is compared individually with the silence frame number threshold.

For S293, when a first consecutive silence frame count is greater than the silence frame number threshold, the number of first to-be-processed speech data frames in that run reaches the deletion condition; the first to-be-processed speech data frames corresponding to every first consecutive silence frame count greater than the silence frame number threshold are deleted from the plurality of first to-be-processed speech data frames, and the plurality of first to-be-extracted speech data segments are obtained from the remaining first to-be-processed speech data frames.

When a first consecutive silence frame count is less than or equal to the silence frame number threshold, no processing is needed, which avoids excessive deletion that would change the speech rate of the target speaker in the target speech data.
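Putting S291 to S293 together, a minimal sketch of the run-counting and deletion logic might look as follows; this is a Python illustration with hypothetical names, where the silence labels come from S27/S28 and `max_silence_frames` stands for the silence frame number threshold.

```python
# Sketch of S291-S293: count consecutive silence frames and delete only those
# runs whose length exceeds the silence frame number threshold, so short pauses
# are kept and the target speaker's speech rate is not distorted.
def delete_long_silence_runs(frames, labels, max_silence_frames):
    """frames: time-ordered frames; labels: 'silence'/'non-silence' per frame."""
    kept, run = [], []
    for frame, label in zip(frames, labels):
        if label == "silence":
            run.append(frame)                 # accumulate the current silence run
        else:
            if len(run) <= max_silence_frames:
                kept.extend(run)              # short run: keep it
            run = []
            kept.append(frame)
    if len(run) <= max_silence_frames:        # handle a trailing silence run
        kept.extend(run)
    return kept
```

With the numerical example above (silence runs of frames 3 and 4, and frame 6, and a threshold of 1), frames 3 and 4 are deleted while frame 6 is kept, which matches the two resulting first to-be-extracted segments.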

In an embodiment, the step of performing segmented extraction on the second to-be-processed speech data according to the plurality of first to-be-extracted speech data segments to obtain a plurality of second to-be-extracted speech data segments includes:

S31: extracting the start time and the end time of each first to-be-extracted speech data segment, respectively, to obtain the first start time and the first end time corresponding to each of the plurality of first to-be-extracted speech data segments;

S32: performing segmented extraction on the second to-be-processed speech data by using the first start time and the first end time corresponding to each first to-be-extracted speech data segment, respectively, to obtain the plurality of second to-be-extracted speech data segments.

This embodiment realizes segmented extraction of the second to-be-processed speech data according to the plurality of first to-be-extracted speech data segments, thereby providing a data basis for determining the pairs of to-be-extracted speech data segments.

For S31, extracting a first speech data segment to be extracted from the plurality of first speech data segments to be extracted as a target first speech data segment to be extracted; acquiring the starting time of a target first voice data segment to be extracted as the first starting time corresponding to the target first voice data segment to be extracted, and acquiring the ending time of the target first voice data segment to be extracted as the first ending time corresponding to the target first voice data segment to be extracted; and repeatedly executing the step of extracting one first voice data segment to be extracted from the plurality of first voice data segments to be extracted as the target first voice data segment to be extracted until determining the first start time and the first end time corresponding to the plurality of first voice data segments to be extracted respectively.

For S32, extracting a first speech data segment to be extracted from the plurality of first speech data segments to be extracted as a target first speech data segment to be extracted; performing segmentation extraction from the second voice data to be processed according to a first start time and a first end time corresponding to the target first voice data segment to be extracted, and taking the voice data obtained by the segmentation extraction as a second voice data segment to be extracted corresponding to the target first voice data segment to be extracted; and repeatedly executing the step of extracting one first voice data segment to be extracted from the plurality of first voice data segments to be extracted as a target first voice data segment to be extracted until determining second voice data segments to be extracted corresponding to the plurality of first voice data segments to be extracted.
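For illustration only, the time-based extraction in S31/S32 could be sketched as follows in Python; the sample-rate handling and the `(start, end)` segment representation are assumptions, not part of the original text.

```python
# Sketch of S31/S32: cut the second-direction audio at the same start/end times
# as each first to-be-extracted segment, so both channels stay time-aligned.
def extract_aligned_segments(first_segments, second_audio, sample_rate):
    """first_segments: list of (start_time_s, end_time_s) pairs (S31).
    second_audio: 1-D sequence of second-direction samples.
    Returns the times together with the matching slice of the
    second-direction audio (S32)."""
    second_segments = []
    for start_s, end_s in first_segments:
        start_idx = int(start_s * sample_rate)
        end_idx = int(end_s * sample_rate)
        second_segments.append(((start_s, end_s), second_audio[start_idx:end_idx]))
    return second_segments
```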

In an embodiment, the step of inputting each to-be-extracted speech data segment pair into the single speaker speech extraction model for speech extraction to obtain a plurality of target speaker speech data segments includes:

S51: inputting the first voice data segment to be extracted of the voice data segment pair to be extracted into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result;

S52: inputting the second voice data segment to be extracted of the voice data segment pair to be extracted into the second coding transformation module of the single speaker voice extraction model for coding transformation to obtain a second coding transformation result;

S53: calling the speaker separation learning module of the single speaker voice extraction model to perform speaker separation learning on the first coding transformation result and the second coding transformation result to obtain a target mask matrix;

S54: calling the decoding transformation module of the single speaker voice extraction model to perform decoding transformation on the target mask matrix to obtain the target speaker voice data segment corresponding to the voice data segment pair to be extracted;

S55: repeatedly executing the step of inputting the first voice data segment to be extracted of the voice data segment pair to be extracted into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result, until the target speaker voice data segments corresponding to all the voice data segment pairs to be extracted are determined.

The embodiment realizes that the first voice data segment to be extracted and the second voice data segment to be extracted of the voice data segment pair to be extracted are simultaneously input into the single speaker voice extraction model for voice extraction, thereby realizing the rapid, accurate and automatic extraction of the speaking voice of the target speaker.

For S51, the first to-be-extracted voice data segment of the to-be-extracted voice data segment pair is input into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result; that is, the first coding transformation module of the single speaker voice extraction model is trained with voice signals in the first direction.

For S52, the second to-be-extracted voice data segment of the to-be-extracted voice data segment pair is input into the second coding transformation module of the single speaker voice extraction model for coding transformation to obtain a second coding transformation result; that is, the second coding transformation module of the single speaker voice extraction model is trained with voice signals in the second direction.

For S53, the target mask matrix refers to the mask matrix of the target speaker.

For S54, the decoding transformation module of the single speaker voice extraction model is called to perform decoding transformation on the target mask matrix so as to restore a time-domain signal, obtaining the target speaker voice data segment corresponding to the to-be-extracted voice data segment pair.

For S55, steps S51 to S55 are repeatedly executed until the target speaker voice data segments respectively corresponding to all the to-be-extracted voice data segment pairs are determined.
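A non-authoritative sketch of S51 to S54 follows, assuming PyTorch-style submodules named `encoder_1`, `encoder_2`, `separator` and `decoder` as illustrative stand-ins for the first/second coding transformation, speaker separation learning and decoding transformation modules. Note that, following the usual TasNet convention, the sketch applies the mask to the encoded representation before decoding; the text above describes decoding the target mask matrix, so this is one plausible reading rather than the definitive implementation.

```python
import torch

def extract_target_segment(model, first_seg, second_seg):
    """Run one to-be-extracted segment pair through the single speaker model.
    `model` is assumed to expose encoder_1, encoder_2, separator and decoder."""
    with torch.no_grad():
        enc_1 = model.encoder_1(first_seg)    # S51: first coding transformation
        enc_2 = model.encoder_2(second_seg)   # S52: second coding transformation
        mask = model.separator(enc_1, enc_2)  # S53: target mask matrix
        target = model.decoder(mask * enc_1)  # S54: decode back to a waveform
    return target

# S55: repeat over every segment pair and keep the results in time order.
# target_segments = [extract_target_segment(model, a, b) for a, b in segment_pairs]
```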

In an embodiment, before the step of inputting each to-be-extracted speech data segment pair into the single speaker speech extraction model for speech extraction to obtain a plurality of target speaker speech data segments, the method further includes:

S051: obtaining a plurality of training samples, the training samples comprising: voice sample data in a first direction, voice sample data in a second direction and voice calibration data;

S052: inputting the voice sample data of the training sample in the first direction into a first to-be-trained coding transformation module of a to-be-trained speech extraction model and inputting the voice sample data of the training sample in the second direction into a second to-be-trained coding transformation module of the to-be-trained speech extraction model, and acquiring single speaker training data output by the to-be-trained speech extraction model, wherein the to-be-trained speech extraction model is a model obtained by adapting the TasNet network;

S053: inputting the voice calibration data and the single speaker training data into a loss function for calculation to obtain a loss value of the to-be-trained speech extraction model, updating the parameters of the to-be-trained speech extraction model according to the loss value, and using the updated to-be-trained speech extraction model for the next calculation of the single speaker training data;

S054: repeatedly executing the steps of the method until the loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and determining the to-be-trained speech extraction model whose loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the single speaker speech extraction model.

This embodiment realizes training of the single speaker voice extraction model based on the TasNet network, providing a foundation for the subsequent separation of the single speaker's voice data.

For S051, each training sample includes voice sample data in the first direction, voice sample data in the second direction, and voice calibration data.

The voice calibration data is voice data of a single speaker calibrated for voice sample data in a first direction and voice sample data in a second direction.

The voice sample data in the first direction, the voice sample data in the second direction and the voice calibration data are all voice data in a time domain.

For S052, the speech extraction model to be trained comprises: the system comprises a first to-be-trained encoding transformation module, a second to-be-trained encoding transformation module, a to-be-trained speaker separation learning module and a to-be-trained decoding transformation module; the first to-be-trained code conversion module and the second to-be-trained code conversion module are connected with the to-be-trained speaker separation learning module, and the to-be-trained speaker separation learning module is connected with the to-be-trained decoding conversion module.

Optionally, the first transcoding module to be trained and the second transcoding module to be trained both use the Encoder module of the TasNet network, the speaker Separation learning module to be trained uses the Separation module of the TasNet network, and the decoding transformation module to be trained uses the Decoder module of the TasNet network.

The Encoder module of the TasNet network comprises: a convolutional layer with a 1 × 1 convolution kernel, a regularization layer, and a fully connected layer.
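A minimal PyTorch sketch of an encoder matching this description is given below; the layer sizes are hypothetical and the 1 × 1 convolution / normalization / fully connected layering follows the text above rather than any specific TasNet release.

```python
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Illustrative encoder: 1x1 convolution, normalization, fully connected layer."""
    def __init__(self, in_channels=1, hidden_channels=256, feature_dim=256):
        super().__init__()
        self.conv1x1 = nn.Conv1d(in_channels, hidden_channels, kernel_size=1)
        self.norm = nn.LayerNorm(hidden_channels)
        self.fc = nn.Linear(hidden_channels, feature_dim)

    def forward(self, x):                 # x: (batch, in_channels, time)
        y = self.conv1x1(x)               # (batch, hidden_channels, time)
        y = self.norm(y.transpose(1, 2))  # normalize over the channel dimension
        return self.fc(y)                 # (batch, time, feature_dim)
```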

For S053, the loss function SI-SNR is:

$$s_{target} = \frac{\langle \hat{s}, s \rangle \, s}{\lVert s \rVert^{2}}, \qquad e_{noise} = \hat{s} - s_{target}, \qquad SI\text{-}SNR = 10 \log_{10} \frac{\lVert s_{target} \rVert^{2}}{\lVert e_{noise} \rVert^{2}}$$

wherein $\hat{s}$ represents the single speaker training data, $s$ represents the voice calibration data, $\langle \hat{s}, s \rangle$ denotes the dot product of the vector $\hat{s}$ and the vector $s$, $\log_{10}()$ refers to the logarithm function, $\lVert s \rVert$ refers to the second norm (L2 norm) of the voice calibration data, $\lVert s_{target} \rVert$ is the second norm of $s_{target}$, and $\lVert e_{noise} \rVert$ is the second norm of $e_{noise}$.
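The SI-SNR loss above can be sketched in a few lines of PyTorch; this is a non-authoritative illustration of the formula, where `estimate` stands for the single speaker training data and `reference` for the voice calibration data (zero-mean normalization, often applied in practice, is omitted for brevity).

```python
import torch

def si_snr_loss(estimate, reference, eps=1e-8):
    """Negative SI-SNR, so that minimizing the loss maximizes SI-SNR."""
    # Project the estimate onto the reference: s_target = <s_hat, s> s / ||s||^2
    dot = torch.sum(estimate * reference, dim=-1, keepdim=True)
    s_target = dot * reference / (torch.sum(reference ** 2, dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    )
    return -si_snr.mean()
```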

For S054, steps S052 to S054 are repeated until the loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and the to-be-trained speech extraction model whose loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition is determined as the single speaker speech extraction model.

The first convergence condition means that the loss values obtained in two adjacent calculations satisfy a Lipschitz condition (Lipschitz continuity condition).

The number of iterations refers to the number of times the to-be-trained speech extraction model has been used to calculate the single speaker training data; the count is increased by 1 after each calculation, and training stops when it reaches the second convergence condition.
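Steps S052 to S054 together amount to a conventional training loop. A hedged PyTorch sketch follows, with hypothetical names such as `model`, `optimizer` and `train_samples`, reusing the `si_snr_loss` sketch above; the simple loss-difference check stands in for the Lipschitz-style first convergence condition described above.

```python
def train_single_speaker_model(model, optimizer, train_samples, max_iterations, tol):
    """Sketch of S052-S054: iterate until the loss change is small enough (first
    convergence condition) or the count hits max_iterations (second condition)."""
    prev_loss = None
    for iteration in range(1, max_iterations + 1):
        for first_dir, second_dir, calibration in train_samples:
            estimate = model(first_dir, second_dir)    # S052: forward pass
            loss = si_snr_loss(estimate, calibration)  # S053: loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                           # S053: update parameters
        # Simplified convergence check on the last batch's loss of this iteration.
        if prev_loss is not None and abs(prev_loss - loss.item()) < tol:
            break                                      # S054: first convergence condition
        prev_loss = loss.item()
    return model
```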

Referring to fig. 2, the present application also proposes a speech extraction apparatus for a target speaker, the apparatus comprising:

a voice data obtaining module 100, configured to obtain first to-be-processed voice data and second to-be-processed voice data of a target speaker in the same time period, where the first to-be-processed voice data is voice data obtained according to a voice signal in a first direction, and the second to-be-processed voice data is voice data obtained according to a voice signal in a second direction;

a first segmentation processing module 200, configured to perform segmentation processing on the first to-be-processed speech data by using a preset segmentation method to obtain a plurality of first to-be-extracted speech data segments;

a second segmentation extraction module 300, configured to perform segmented extraction on the second to-be-processed voice data according to the plurality of first to-be-extracted voice data segments to obtain a plurality of second to-be-extracted voice data segments;

a to-be-extracted voice data segment pair determining module 400, configured to perform data extraction on the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments at the same time to obtain a plurality of to-be-extracted voice data segment pairs;

a target speaker voice data segment determining module 500, configured to input each to-be-extracted voice data segment pair into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

and the target voice data determining module 600 is configured to splice the multiple target speaker voice data segments according to a time sequence to obtain the target voice data of the target speaker.

This embodiment performs segmentation processing and same-time data extraction on the first to-be-processed voice data in the first direction and the second to-be-processed voice data in the second direction of the target speaker in the same time period to obtain a plurality of to-be-extracted voice data segment pairs, then inputs each pair into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training, and finally splices the plurality of target speaker voice data segments in time order to obtain the target voice data of the target speaker. This realizes fast, accurate and automatic extraction of the speaking voice of the target speaker, reduces the cost of business quality evaluation, improves the comprehensiveness of the business quality evaluation through the complete voice data of the target speaker, and is beneficial to protecting the privacy and security of the voice data of other speakers.

Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data related to the voice extraction method for a target speaker. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of speech extraction for a targeted speaker. The voice extraction method aiming at the target speaker comprises the following steps: acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction; performing segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted; carrying out segmentation extraction on the second voice data to be processed according to the first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted; performing data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted; and respectively carrying out voice extraction on each voice data segment to be extracted to an input single speaker voice extraction model to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model comprises the following steps: the single speaker voice extraction model is a model obtained based on TasNet network training; and splicing the target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

This embodiment performs segmentation processing and same-time data extraction on the first to-be-processed voice data in the first direction and the second to-be-processed voice data in the second direction of the target speaker in the same time period to obtain a plurality of to-be-extracted voice data segment pairs, then inputs each pair into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training, and finally splices the plurality of target speaker voice data segments in time order to obtain the target voice data of the target speaker. This realizes fast, accurate and automatic extraction of the speaking voice of the target speaker, reduces the cost of business quality evaluation, improves the comprehensiveness of the business quality evaluation through the complete voice data of the target speaker, and is beneficial to protecting the privacy and security of the voice data of other speakers.

An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a method for extracting speech for a target speaker, including the steps of: acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction; performing segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted; carrying out segmentation extraction on the second voice data to be processed according to the first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted; performing data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted; and respectively carrying out voice extraction on each voice data segment to be extracted to an input single speaker voice extraction model to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model comprises the following steps: the single speaker voice extraction model is a model obtained based on TasNet network training; and splicing the target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

The executed voice extraction method for the target speaker performs segmentation processing and same-time data extraction on the first to-be-processed voice data in the first direction and the second to-be-processed voice data in the second direction of the target speaker in the same time period to obtain a plurality of to-be-extracted voice data segment pairs, then inputs each pair into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training, and finally splices the plurality of target speaker voice data segments in time order to obtain the target voice data of the target speaker. This realizes fast, accurate and automatic extraction of the speaking voice of the target speaker, reduces the cost of business quality evaluation, improves the comprehensiveness of the business quality evaluation through the complete voice data of the target speaker, and is beneficial to protecting the privacy and security of the voice data of other speakers.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium, and the program, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.
