Voice extraction method, device, equipment and medium for target speaker

Document No.: 662569  Publication date: 2021-04-27

Reading note: This technology, "Voice extraction method, device, equipment and medium for target speaker", was designed and created by 张舒婷, 赖众程, 杨念慈, 何利斌, 李会璟, 王小红 and 刘彦国 on 2020-12-23. Its main content is as follows: The application relates to the technical field of artificial intelligence and discloses a method, a device, equipment and a medium for extracting the voice of a target speaker. The method comprises: determining a plurality of first to-be-extracted voice data segments from first to-be-processed voice data in a first direction by using a preset segmentation method; performing segmented extraction on second to-be-processed voice data in a second direction according to the plurality of first to-be-extracted voice data segments to obtain a plurality of second to-be-extracted voice data segments; extracting data of the same time from the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments to obtain a plurality of to-be-extracted voice data segment pairs; and inputting each to-be-extracted voice data segment pair into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, which are then spliced in chronological order to obtain the target voice data of the target speaker. This reduces the cost of business quality evaluation and improves its comprehensiveness.

1. A method for extracting speech for a target speaker, the method comprising:

acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction;

performing segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;

carrying out segmentation extraction on the second voice data to be processed according to the first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;

performing data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted;

and respectively inputting each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

and splicing the target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

2. The method as claimed in claim 1, wherein the step of obtaining the first to-be-processed speech data and the second to-be-processed speech data of the target speaker in the same time period comprises:

acquiring the voice signal of the target speaker in the first direction and the voice signal of the target speaker in the second direction within the same time period;

carrying out segmentation processing on the voice signal in the first direction by adopting a first preset time length to obtain a plurality of segmented voice signal segments in the first direction;

inputting each segmented first-direction voice signal segment into a digital filter to obtain a plurality of filtered first-direction voice signal segments;

performing discrete Fourier transform on each filtered first-direction voice signal segment to obtain a plurality of transformed first-direction voice signal segments;

performing inverse discrete Fourier transform on the plurality of transformed first-direction voice signal segments to obtain noise-reduced first-direction voice data;

carrying out segmentation processing on the voice signal in the second direction by adopting the first preset time length to obtain a plurality of segmented voice signal segments in the second direction;

inputting each segmented second direction voice signal segment into a digital filter to obtain a plurality of filtered second direction voice signal segments;

performing discrete Fourier transform on each filtered second-direction voice signal segment to obtain a plurality of transformed second-direction voice signal segments;

performing inverse discrete Fourier transform on the plurality of transformed second direction voice signal segments to obtain second direction voice data subjected to noise reduction;

pre-emphasis processing is carried out on the first direction voice data after noise reduction, and the first voice data to be processed is obtained;

and pre-emphasis processing is carried out on the second direction voice data after noise reduction, so as to obtain the second voice data to be processed.

3. The method as claimed in claim 1, wherein the step of segmenting the first to-be-processed speech data by using a predetermined segmentation method to obtain a plurality of first to-be-extracted speech data segments comprises:

performing framing processing on the first voice data to be processed by adopting a second preset time length to obtain a plurality of first voice data frames to be processed;

respectively carrying out voice energy calculation on each first voice data frame to be processed to obtain first voice energy corresponding to each of the plurality of first voice data frames to be processed;

extracting, according to a preset number, first voice energies from the beginning of the first voice energies corresponding to the plurality of first voice data frames to be processed, so as to obtain a plurality of first beginning voice energies;

carrying out mean value calculation on the plurality of first beginning voice energies to obtain a first background voice energy corresponding to the plurality of first voice data frames to be processed;

respectively carrying out subtraction calculation on the first voice energy corresponding to each first voice data frame to be processed and the first background voice energy to obtain first voice energy difference values corresponding to the plurality of first voice data frames to be processed;

comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value;

when the first voice energy difference value corresponding to the first voice data frame to be processed is larger than the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a non-mute frame;

when a first voice energy difference value corresponding to the first voice data frame to be processed is smaller than or equal to the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a mute frame;

and performing mute frame deletion processing on the plurality of first voice data frames to be processed by adopting a mute frame quantity threshold and the mute category to obtain a plurality of first voice data sections to be extracted.

4. The method as claimed in claim 3, wherein the step of performing silence frame deletion processing on the first frames of speech data to be processed by using a threshold of number of silence frames and the silence category to obtain the first segments of speech data to be extracted includes:

calculating the number of the mute frames of the first voice data frames to be processed which are continuous according to time to obtain the number of a plurality of first continuous mute frames;

comparing the number of each first continuous mute frame with the threshold of the number of the mute frames respectively;

and when the number of the first continuous mute frames is greater than the mute frame number threshold, deleting the first to-be-processed voice data frames corresponding to all the first continuous mute frame numbers greater than the mute frame number threshold from the plurality of first to-be-processed voice data frames to obtain the plurality of first to-be-extracted voice data segments.

5. The method as claimed in claim 1, wherein the step of extracting the second voice data to be processed in segments according to the plurality of first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted includes:

respectively extracting the start time and the end time of each first voice data segment to be extracted to obtain the first start time and the first end time corresponding to the plurality of first voice data segments to be extracted;

and performing segmentation extraction on the second voice data to be processed by respectively adopting the first start time and the first end time corresponding to each first voice data segment to be extracted to obtain a plurality of second voice data segments to be extracted.

6. The method as claimed in claim 1, wherein the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments comprises:

inputting the first voice data segment to be extracted of the voice data segment pair to be extracted into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result;

inputting the second voice data segment to be extracted of the voice data segment pair to be extracted into the second coding transformation module of the single speaker voice extraction model for coding transformation to obtain a second coding transformation result;

calling the speaker separation learning module of the single speaker voice extraction model to perform speaker separation learning on the first coding transformation result and the second coding transformation result to obtain a target mask matrix;

calling the decoding transformation module of the single speaker voice extraction model to perform decoding transformation on the target mask matrix to obtain the target speaker voice data segment corresponding to the voice data segment to be extracted;

and repeatedly executing the step of inputting the first to-be-extracted voice data segment of the to-be-extracted voice data segment pair into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result, until the target speaker voice data segments corresponding to all the voice data segment pairs to be extracted are obtained.

7. The method as claimed in claim 1, wherein before the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the method comprises:

obtaining a plurality of training samples, the training samples comprising: voice sample data in a first direction, voice sample data in a second direction and voice calibration data;

inputting the voice sample data of the training sample in the first direction into a first to-be-trained coding transformation module of a to-be-trained voice extraction model and inputting the voice sample data of the training sample in the second direction into a second to-be-trained coding transformation module of the to-be-trained voice extraction model, and acquiring single speaker training data output by the to-be-trained voice extraction model, wherein the to-be-trained voice extraction model is a model obtained based on the TasNet network;

inputting the voice calibration data and the single speaker training data into a loss function for calculation to obtain a loss value of the voice extraction model to be trained, updating the parameters of the voice extraction model to be trained according to the loss value, and using the updated voice extraction model to be trained for the next calculation of the single speaker training data;

and repeatedly executing the above steps until the loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and determining the voice extraction model to be trained whose loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the single speaker voice extraction model.

8. An apparatus for speech extraction for a target speaker, the apparatus comprising:

the voice data acquisition module is used for acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction;

the first segmentation processing module is used for performing segmentation processing on the first to-be-processed voice data by adopting a preset segmentation method to obtain a plurality of first to-be-extracted voice data segments;

the second segmentation extraction module is used for performing segmentation extraction on the second voice data to be processed according to the plurality of first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;

the to-be-extracted voice data segment pair determining module is used for performing data extraction on the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments at the same time to obtain a plurality of to-be-extracted voice data segment pairs;

a target speaker voice data segment determining module, configured to respectively input each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

and the target voice data determining module is used for splicing the plurality of target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for extracting speech of a target speaker.

Background

At present, service personnel differ in business quality, and problems such as non-standard service scripts and unfriendly attitudes exist. To improve the service quality of service personnel, business quality is currently evaluated by means of manual spot checks and mystery-visit spot checks, which consume a large amount of manpower and financial resources, so the cost is high; moreover, such spot checks can only reflect the service situation during part of the time, so the resulting business quality assessment is one-sided.

Disclosure of Invention

The application mainly aims to provide a method, a device, equipment and a medium for extracting the voice of a target speaker, so as to solve the technical problems in the prior art that the service industry performs business quality assessment by manual spot checks and mystery-visit spot checks, which is costly and yields a one-sided business quality assessment.

In order to achieve the above object, the present application provides a method for extracting a speech of a target speaker, the method comprising:

acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction;

performing segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;

carrying out segmentation extraction on the second voice data to be processed according to the first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;

performing data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted;

and respectively inputting each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

and splicing the target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

Further, the step of obtaining the first to-be-processed speech data and the second to-be-processed speech data of the target speaker in the same time period includes:

acquiring the voice signal of the target speaker in the first direction and the voice signal of the target speaker in the second direction within the same time period;

carrying out segmentation processing on the voice signal in the first direction by adopting a first preset time length to obtain a plurality of segmented voice signal segments in the first direction;

inputting each segmented first-direction voice signal segment into a digital filter to obtain a plurality of filtered first-direction voice signal segments;

performing discrete Fourier transform on each filtered first-direction voice signal segment to obtain a plurality of transformed first-direction voice signal segments;

performing inverse discrete Fourier transform on the plurality of transformed first-direction voice signal segments to obtain noise-reduced first-direction voice data;

carrying out segmentation processing on the voice signal in the second direction by adopting the first preset time length to obtain a plurality of segmented voice signal segments in the second direction;

inputting each segmented second direction voice signal segment into a digital filter to obtain a plurality of filtered second direction voice signal segments;

performing discrete Fourier transform on each filtered second-direction voice signal segment to obtain a plurality of transformed second-direction voice signal segments;

performing inverse discrete Fourier transform on the plurality of transformed second direction voice signal segments to obtain second direction voice data subjected to noise reduction;

pre-emphasis processing is carried out on the first direction voice data after noise reduction, and the first voice data to be processed is obtained;

and pre-emphasis processing is carried out on the second direction voice data after noise reduction, so as to obtain the second voice data to be processed.

Further, the step of performing segmentation processing on the first to-be-processed speech data by using a preset segmentation method to obtain a plurality of first to-be-extracted speech data segments includes:

performing framing processing on the first voice data to be processed by adopting a second preset time length to obtain a plurality of first voice data frames to be processed;

respectively carrying out voice energy calculation on each first voice data frame to be processed to obtain first voice energy corresponding to each of the plurality of first voice data frames to be processed;

extracting, according to a preset number, first voice energies from the beginning of the first voice energies corresponding to the plurality of first voice data frames to be processed, so as to obtain a plurality of first beginning voice energies;

carrying out mean value calculation on the plurality of first beginning voice energies to obtain a first background voice energy corresponding to the plurality of first voice data frames to be processed;

respectively carrying out subtraction calculation on the first voice energy corresponding to each first voice data frame to be processed and the first background voice energy to obtain first voice energy difference values corresponding to the plurality of first voice data frames to be processed;

comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value;

when the first voice energy difference value corresponding to the first voice data frame to be processed is larger than the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a non-mute frame;

when a first voice energy difference value corresponding to the first voice data frame to be processed is smaller than or equal to the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a mute frame;

and performing mute frame deletion processing on the plurality of first voice data frames to be processed by adopting a mute frame quantity threshold and the mute category to obtain a plurality of first voice data sections to be extracted.

Further, the step of performing mute frame deletion processing on the first to-be-processed voice data frames by using the mute frame number threshold and the mute category to obtain the first to-be-extracted voice data segments includes:

calculating the number of the mute frames of the first voice data frames to be processed which are continuous according to time to obtain the number of a plurality of first continuous mute frames;

comparing the number of each first continuous mute frame with the threshold of the number of the mute frames respectively;

and when the number of the first continuous mute frames is greater than the mute frame number threshold, deleting the first to-be-processed voice data frames corresponding to all the first continuous mute frame numbers greater than the mute frame number threshold from the plurality of first to-be-processed voice data frames to obtain the plurality of first to-be-extracted voice data segments.

Further, the step of extracting the second to-be-processed speech data in a segmented manner according to the plurality of first to-be-extracted speech data segments to obtain a plurality of second to-be-extracted speech data segments includes:

respectively extracting the start time and the end time of each first voice data segment to be extracted to obtain the first start time and the first end time corresponding to the plurality of first voice data segments to be extracted;

and performing segmentation extraction on the second voice data to be processed by respectively adopting the first start time and the first end time corresponding to each first voice data segment to be extracted to obtain a plurality of second voice data segments to be extracted.

Further, the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments includes:

inputting the first voice data segment to be extracted of the voice data segment pair to be extracted into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result;

inputting the second voice data segment to be extracted of the voice data segment pair to be extracted into the second coding transformation module of the single speaker voice extraction model for coding transformation to obtain a second coding transformation result;

calling the speaker separation learning module of the single speaker voice extraction model to perform speaker separation learning on the first coding transformation result and the second coding transformation result to obtain a target mask matrix;

calling the decoding transformation module of the single speaker voice extraction model to perform decoding transformation on the target mask matrix to obtain the target speaker voice data segment corresponding to the voice data segment to be extracted;

and repeatedly executing the step of inputting the first to-be-extracted voice data segment of the to-be-extracted voice data segment pair into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result, until the target speaker voice data segments corresponding to all the voice data segment pairs to be extracted are obtained.

Further, before the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the method includes:

obtaining a plurality of training samples, the training samples comprising: voice sample data in a first direction, voice sample data in a second direction and voice calibration data;

inputting the voice sample data of the training sample in the first direction into a first to-be-trained coding transformation module of a to-be-trained voice extraction model and inputting the voice sample data of the training sample in the second direction into a second to-be-trained coding transformation module of the to-be-trained voice extraction model, and acquiring single speaker training data output by the to-be-trained voice extraction model, wherein the to-be-trained voice extraction model is a model obtained based on the TasNet network;

inputting the voice calibration data and the single speaker training data into a loss function for calculation to obtain a loss value of the voice extraction model to be trained, updating the parameters of the voice extraction model to be trained according to the loss value, and using the updated voice extraction model to be trained for the next calculation of the single speaker training data;

and repeatedly executing the above steps until the loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and determining the voice extraction model to be trained whose loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the single speaker voice extraction model.
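
As a non-limiting illustration of the training procedure described above, the following Python sketch uses a PyTorch-style loop; the model interface, the Adam optimizer and the negative SI-SNR loss are assumptions for illustration and are not specified in this application.

```python
import torch

def train_single_speaker_extractor(model, loader, max_iters=10_000, loss_floor=1e-4):
    """Hedged sketch of the training loop: `model` is assumed to take
    (first_direction, second_direction) waveforms and return an estimated
    single-speaker waveform; `loader` yields (first_dir, second_dir,
    calibration) tensors. Both interfaces are assumptions."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    def neg_si_snr(estimate, target, eps=1e-8):
        # Scale-invariant SNR, a common loss for TasNet-style models (assumption).
        target = target - target.mean(dim=-1, keepdim=True)
        estimate = estimate - estimate.mean(dim=-1, keepdim=True)
        s_target = (torch.sum(estimate * target, dim=-1, keepdim=True)
                    * target / (torch.sum(target ** 2, dim=-1, keepdim=True) + eps))
        noise = estimate - s_target
        si_snr = 10 * torch.log10(
            torch.sum(s_target ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps) + eps)
        return -si_snr.mean()

    iteration = 0
    for first_dir, second_dir, calibration in loader:
        estimate = model(first_dir, second_dir)      # single speaker training data
        loss = neg_si_snr(estimate, calibration)     # compare with calibration data
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                             # update model parameters
        iteration += 1
        # Stop when the loss reaches the first convergence condition or the
        # iteration count reaches the second convergence condition.
        if loss.item() < loss_floor or iteration >= max_iters:
            break
    return model
```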

The present application further proposes a speech extraction device for a target speaker, the device comprising:

the voice data acquisition module is used for acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction;

the first segmentation processing module is used for performing segmentation processing on the first to-be-processed voice data by adopting a preset segmentation method to obtain a plurality of first to-be-extracted voice data segments;

the second segmentation extraction module is used for performing segmentation extraction on the second voice data to be processed according to the plurality of first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;

the to-be-extracted voice data segment pair determining module is used for performing data extraction on the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments at the same time to obtain a plurality of to-be-extracted voice data segment pairs;

a target speaker voice data segment determining module, configured to respectively input each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

and the target voice data determining module is used for splicing the plurality of target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

The present application further proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the method of any one of the above when executing the computer program.

The present application also proposes a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.

The method, device, equipment and medium for extracting the voice of a target speaker perform segmentation processing on first to-be-processed voice data in a first direction and second to-be-processed voice data in a second direction of the target speaker within the same time period, and extract data of the same time to obtain a plurality of to-be-extracted voice data segment pairs; each pair is then input into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training; finally, the plurality of target speaker voice data segments are spliced in chronological order to obtain the target voice data of the target speaker. The speaking voice of the target speaker is thus extracted quickly, accurately and automatically, the cost of business quality evaluation is reduced, the comprehensiveness of the evaluation is improved through the complete voice data of the target speaker, and the privacy and safety of the voice data of other speakers are protected.

Drawings

FIG. 1 is a flowchart illustrating a method for extracting speech of a target speaker according to an embodiment of the present application;

FIG. 2 is a block diagram of a voice extraction apparatus for a target speaker according to an embodiment of the present application;

fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.

The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

In order to solve the technical problems in the prior art that the service industry performs business quality assessment by manual spot checks and mystery-visit spot checks, which is costly and yields a one-sided business quality assessment, the application provides a voice extraction method for a target speaker, applied in the technical field of artificial intelligence. In the voice extraction method for the target speaker, the voices of the target speaker recorded from different directions are segmented and then input into a single speaker voice extraction model obtained based on TasNet network training for voice extraction, and the extracted voices are spliced in chronological order to obtain voice data containing only the voice of the target speaker. The speaking voice of the target speaker is thus extracted quickly, accurately and automatically, the cost of business quality evaluation is reduced, the comprehensiveness of the evaluation is improved through the complete voice data of the target speaker, and the privacy and safety of the voice data of other speakers are protected.

Referring to fig. 1, an embodiment of the present application provides a method for extracting a speech of a target speaker, where the method includes:

s1: acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction;

s2: performing segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;

s3: carrying out segmentation extraction on the second voice data to be processed according to the first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;

s4: performing data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted;

s5: and respectively inputting each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

s6: and splicing the target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

In this embodiment, the first voice data to be processed in the first direction and the second voice data to be processed in the second direction of the target speaker in the same time period are segmented and the data of the same time are extracted to obtain a plurality of to-be-extracted voice data segment pairs; each pair is then input into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training; finally, the plurality of target speaker voice data segments are spliced in chronological order to obtain the target voice data of the target speaker. The speaking voice of the target speaker is thus extracted quickly, accurately and automatically, the cost of business quality evaluation is reduced, the comprehensiveness of the evaluation is improved through the complete voice data of the target speaker, and the privacy and safety of the voice data of other speakers are protected.

For S1, the first to-be-processed speech data and the second to-be-processed speech data of the target speaker in the same time period may be obtained from the database, or the first to-be-processed speech data and the second to-be-processed speech data of the target speaker in the same time period recorded by the recording apparatus may be directly obtained.

The first voice data to be processed is voice data obtained according to a voice signal in a first direction, the second voice data to be processed is voice data obtained according to a voice signal in a second direction, and the first direction and the second direction are different directions.

The first voice data to be processed and the second voice data to be processed are in the same time period of the target speaker, that is, the voice signal in the first direction and the voice signal in the second direction are simultaneously recorded voice signals, and the recording time length of the voice signal in the first direction is the same as the recording time length of the voice signal in the second direction.

For S2, the first speech data to be processed is divided into a plurality of pieces of small speech data by a preset segmentation method, and each piece of small speech data is taken as a first speech data piece to be extracted.

For S3, the second to-be-processed speech data is divided into a plurality of pieces of small speech data using the same start time and end time as the plurality of pieces of first to-be-extracted speech data, and each piece of small speech data is taken as one piece of second to-be-extracted speech data.

It is to be understood that, the second voice data to be processed may be segmented first, and then the first voice data to be processed may be segmented and extracted by using the segmentation processing result, which is not limited in this respect.

It is understood that, the step S3 can also directly adopt the method of the step S2 to perform the segmentation process, and is not limited herein.

For S4, the pieces of speech data to be extracted with the same start time and end time in the plurality of pieces of first speech data to be extracted and the plurality of pieces of second speech data to be extracted are grouped into a pair of pieces of speech data to be extracted. That is, each of the pair of speech data segments to be extracted includes a first speech data segment to be extracted and a second speech data segment to be extracted. The starting time of the first voice data segment to be extracted in the same voice data segment pair to be extracted is the same as the starting time of the second voice data segment to be extracted, and the ending time of the first voice data segment to be extracted in the same voice data segment pair to be extracted is the same as the ending time of the second voice data segment to be extracted.

For example, in the to-be-extracted speech data segment pair D1, the start time of the first to-be-extracted speech data segment is 1 hour 0 minute 0 second and the end time is 2 hours 0 minute 0 second, and the start time of the second to-be-extracted speech data segment is 1 hour 0 minute 0 second and the end time is 2 hours 0 minute 0 second, which is not limited in this example.
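
As a non-limiting illustration of the segmentation extraction (S3) and same-time pairing (S4) described above, a minimal Python sketch follows; the segment representation (start time, end time and sample array) and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

@dataclass
class Segment:
    start: float         # segment start time in seconds
    end: float           # segment end time in seconds
    samples: np.ndarray  # waveform samples of the segment

def extract_by_times(second_data: np.ndarray, sr: int,
                     first_segments: List[Segment]) -> List[Segment]:
    """Cut the second-direction data at the start/end times of the first-direction segments."""
    out = []
    for seg in first_segments:
        s, e = int(seg.start * sr), int(seg.end * sr)
        out.append(Segment(seg.start, seg.end, second_data[s:e]))
    return out

def pair_segments(first_segments: List[Segment],
                  second_segments: List[Segment]) -> List[Tuple[Segment, Segment]]:
    """Group first/second direction segments that share the same start and end time."""
    by_time = {(s.start, s.end): s for s in second_segments}
    return [(f, by_time[(f.start, f.end)]) for f in first_segments
            if (f.start, f.end) in by_time]
```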

For step S5, the first to-be-extracted speech data segment and the second to-be-extracted speech data segment of each to-be-extracted speech data segment pair are simultaneously input into the single speaker speech extraction model for speech extraction, so as to obtain the plurality of target speaker speech data segments output by the single speaker speech extraction model. That is, a target speaker speech data segment contains only the voice data of the target speaker, and each to-be-extracted speech data segment pair yields one target speaker speech data segment after speech extraction by the single speaker speech extraction model.

The first coding conversion module and the second coding conversion module output data to the speaker separation learning module, and the speaker separation learning module outputs data to the decoding conversion module.

It is understood that the first coding transformation module and the second coding transformation module of the single speaker speech extraction model have the same structure and share the same parameters.

The first coding transformation module is a module obtained by training the Encoder module of a TasNet (Time-domain Audio Separation Network, used here for single-channel real-time speech separation), the second coding transformation module is a module obtained by training the Encoder module of the TasNet network, the speaker separation learning module is a module obtained by training the Separation module of the TasNet network, and the decoding transformation module is a module obtained by training the Decoder module of the TasNet network.

The first coding transformation module is used for coding transformation. The second coding transformation module is used for coding transformation. The speaker separation learning module is used for speaker separation learning. The decoding transformation module is used for performing decoding transformation.
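
The following Python sketch illustrates, in simplified form and only by way of example, how such an encoder-separation-decoder layout can be wired together; the layer types, sizes and the way the mask is applied follow common TasNet-style practice and are assumptions rather than the concrete trained modules of this application.

```python
import torch
import torch.nn as nn

class SingleSpeakerExtractionSketch(nn.Module):
    """Simplified sketch of the module layout described above (encoder ->
    separation -> decoder). All layer choices are illustrative assumptions."""

    def __init__(self, n_filters=256, kernel_size=16, stride=8):
        super().__init__()
        # The first and second coding transformation modules have the same
        # structure and parameters, so one encoder is reused for both inputs.
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # Speaker separation learning module: estimates the target mask matrix
        # from the two coding transformation results.
        self.separator = nn.Sequential(
            nn.Conv1d(2 * n_filters, n_filters, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=1),
            nn.Sigmoid(),
        )
        # Decoding transformation module: maps the masked encoding back to a waveform.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, first_segment: torch.Tensor, second_segment: torch.Tensor) -> torch.Tensor:
        # Inputs are (batch, 1, samples) waveforms of one to-be-extracted segment pair.
        enc_first = self.encoder(first_segment)    # first coding transformation result
        enc_second = self.encoder(second_segment)  # second coding transformation result
        mask = self.separator(torch.cat([enc_first, enc_second], dim=1))
        return self.decoder(enc_first * mask)      # target speaker voice data segment
```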

For S6, each target speaker voice data segment in the plurality of target speaker voice data segments is spliced according to the time sequence, and the spliced voice data is taken as the target voice data of the target speaker. The target voice data of the target speaker contains only the voice data of the target speaker and is the complete voice data of the target speaker in the same time period of step S1, so the comprehensiveness of the business quality evaluation is improved, the voice data of other speakers are removed, and the privacy and safety of other speakers are protected.
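
A minimal sketch of the chronological splicing, assuming each target speaker voice data segment is kept together with its start time:

```python
import numpy as np

def splice_target_segments(segments):
    """Concatenate target-speaker segments in chronological order.

    `segments` is assumed to be a list of (start_time, samples) tuples."""
    ordered = sorted(segments, key=lambda item: item[0])
    return np.concatenate([samples for _, samples in ordered])
```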

In one embodiment, the step of obtaining the first to-be-processed speech data and the second to-be-processed speech data of the target speaker in the same time period includes:

s101: acquiring the voice signal of the target speaker in the first direction and the voice signal of the target speaker in the second direction within the same time period;

s102: carrying out segmentation processing on the voice signal in the first direction by adopting a first preset time length to obtain a plurality of segmented voice signal segments in the first direction;

s103: inputting each segmented first-direction voice signal segment into a digital filter to obtain a plurality of filtered first-direction voice signal segments;

s104: performing discrete Fourier transform on each filtered first-direction voice signal segment to obtain a plurality of transformed first-direction voice signal segments;

s105: performing inverse discrete Fourier transform on the plurality of transformed first-direction voice signal segments to obtain noise-reduced first-direction voice data;

s106: carrying out segmentation processing on the voice signal in the second direction by adopting the first preset time length to obtain a plurality of segmented voice signal segments in the second direction;

s107: inputting each segmented second direction voice signal segment into a digital filter to obtain a plurality of filtered second direction voice signal segments;

s108: performing discrete Fourier transform on each filtered second-direction voice signal segment to obtain a plurality of transformed second-direction voice signal segments;

s109: performing inverse discrete Fourier transform on the plurality of transformed second direction voice signal segments to obtain second direction voice data subjected to noise reduction;

s110: pre-emphasis processing is carried out on the first direction voice data after noise reduction, and the first voice data to be processed is obtained;

s111: and pre-emphasis processing is carried out on the second direction voice data after noise reduction, so as to obtain the second voice data to be processed.

This embodiment performs segment-wise filtering, discrete Fourier transform, inverse discrete Fourier transform and pre-emphasis processing on the voice signals, thereby improving the quality of the obtained first voice data to be processed and second voice data to be processed, and improving the accuracy of the determined target voice data of the target speaker.

For S101, the voice signal in the first direction is a voice signal recorded by the recording device in the first direction to the target speaker. The voice signal of the second direction is the voice signal recorded by the recording device of the second direction to the target speaker.

The sound recording device in the first direction and the sound recording device in the second direction may be arranged independently, or may be integrated on the same electronic device. For example, the recording device in the first direction and the recording device in the second direction are integrated into a smart badge, with the recording device in the first direction facing the mouth of the target speaker and the recording device in the second direction facing the area in front of the target speaker, which is not limited in this example.

For step S102, a first preset duration is adopted to divide the voice signal in the first direction into multiple small voice signals, and each small voice signal is taken as a segmented voice signal segment in the first direction.

Optionally, the first preset time duration is 20 ms.

For S103, the digital filter may be a filter capable of removing additive noise from the prior art, which is not described herein.

For step S104, performing discrete fourier transform on the filtered first-direction speech signal segment, and taking the filtered first-direction speech signal segment after the discrete fourier transform as the transformed first-direction speech signal segment.

And S105, sequencing the plurality of transformed first-direction voice signal segments according to the time sequence, performing inverse discrete Fourier transform on the sequenced plurality of transformed first-direction voice signal segments, and performing inverse discrete Fourier transform to obtain noise-reduced first-direction voice data. The noise-reduced first-direction voice data is clean voice data.
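
By way of a non-limiting illustration of steps S102 to S105 (and, identically, S106 to S109), the following Python sketch shows the per-segment filter, discrete Fourier transform and inverse discrete Fourier transform chain; the concrete digital filter and any frequency-domain noise suppression are not specified in the text, so a simple placeholder filter is used.

```python
import numpy as np
from scipy.signal import lfilter

def denoise_direction(signal: np.ndarray, sr: int, segment_ms: int = 20) -> np.ndarray:
    """Sketch of the per-segment filter -> DFT -> inverse DFT chain.

    A moving-average FIR filter stands in for the "digital filter" purely for
    illustration; it is an assumption, not the filter used in this application."""
    seg_len = int(sr * segment_ms / 1000)
    fir = np.ones(5) / 5.0                        # placeholder digital filter taps
    denoised = []
    for start in range(0, len(signal), seg_len):
        segment = signal[start:start + seg_len]
        filtered = lfilter(fir, [1.0], segment)   # digital filtering
        spectrum = np.fft.fft(filtered)           # discrete Fourier transform
        # (Frequency-domain noise suppression would be applied to `spectrum` here.)
        denoised.append(np.real(np.fft.ifft(spectrum)))  # inverse DFT back to time domain
    return np.concatenate(denoised)               # noise-reduced data in time order
```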

For step S106, the speech signal in the second direction is divided into multiple small speech signals by using a first preset time duration, and each small speech signal is used as a segmented speech signal segment in the second direction.

For S107, the digital filter may be a filter capable of removing additive noise from the prior art, which is not described herein.

For step S108, performing discrete fourier transform on the filtered second direction speech signal segment, and taking the filtered second direction speech signal segment after the discrete fourier transform as the transformed second direction speech signal segment.

And S109, sequencing the plurality of transformed second direction voice signal segments according to the time sequence, performing inverse discrete Fourier transform on the sequenced plurality of transformed second direction voice signal segments, and performing inverse discrete Fourier transform to obtain noise-reduced second direction voice data. The noise-reduced second direction voice data is pure voice data.

For S110, pre-emphasis is a signal processing method that boosts the high-frequency components of the input signal to compensate for the excessive attenuation of high-frequency components during transmission.

And pre-emphasis processing is carried out on the voice data subjected to noise reduction in the first direction by adopting a first-order FIR high-pass digital filter to obtain the first voice data to be processed.

Optionally, the pre-emphasis coefficient α of the first-order FIR high-pass digital filter satisfies 0.9 < α < 1.0.

Optionally, the pre-emphasis coefficient of the first order FIR high-pass digital filter is 0.97.
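
A minimal sketch of the first-order FIR high-pass pre-emphasis used in steps S110 and S111, with the coefficient 0.97 mentioned above:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order FIR high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```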

For step S111, a first-order FIR high-pass digital filter is adopted to perform pre-emphasis processing on the voice data subjected to noise reduction in the second direction, so as to obtain the second voice data to be processed.

Optionally, the pre-emphasis coefficient α of the first-order FIR high-pass digital filter satisfies 0.9 < α < 1.0.

Optionally, the pre-emphasis coefficient of the first order FIR high-pass digital filter is 0.97.

In an embodiment, the step of performing segmentation processing on the first to-be-processed speech data by using a preset segmentation method to obtain a plurality of first to-be-extracted speech data segments includes:

s21: performing framing processing on the first voice data to be processed by adopting a second preset time length to obtain a plurality of first voice data frames to be processed;

s22: respectively carrying out voice energy calculation on each first voice data frame to be processed to obtain first voice energy corresponding to each of the plurality of first voice data frames to be processed;

s23: extracting, according to a preset number, first voice energies from the beginning of the first voice energies corresponding to the plurality of first voice data frames to be processed, so as to obtain a plurality of first beginning voice energies;

s24: carrying out mean value calculation on the plurality of first beginning voice energies to obtain a first background voice energy corresponding to the plurality of first voice data frames to be processed;

s25: respectively carrying out subtraction calculation on the first voice energy corresponding to each first voice data frame to be processed and the first background voice energy to obtain first voice energy difference values corresponding to the plurality of first voice data frames to be processed;

s26: comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value;

s27: when the first voice energy difference value corresponding to the first voice data frame to be processed is larger than the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a non-mute frame;

s28: when a first voice energy difference value corresponding to the first voice data frame to be processed is smaller than or equal to the voice energy threshold value, determining that the mute category of the first voice data frame to be processed corresponding to the first voice energy difference value is a mute frame;

s29: and performing mute frame deletion processing on the plurality of first voice data frames to be processed by adopting a mute frame quantity threshold and the mute category to obtain a plurality of first voice data sections to be extracted.

In this embodiment, framing is performed first, then the mute category of each frame is determined according to the voice energy of that frame, and finally mute frames are deleted according to the mute category to obtain the plurality of first to-be-extracted voice data segments. This reduces the number of voice data segments input into the single speaker voice extraction model, improves the voice extraction efficiency, shortens the mute duration in the finally obtained target voice data of the target speaker, and improves the efficiency of business quality evaluation based on the target voice data of the target speaker.

For S21, the first to-be-processed speech data is divided into multiple frames of speech data using a second preset time duration, and each frame is taken as one first to-be-processed speech data frame. Dividing the data into frames reduces errors in the subsequent deletion of mute frames and thus further improves the accuracy of the target voice data of the target speaker.

Optionally, the second preset time duration is 30 ms.

For S22, performing speech energy calculation on the first to-be-processed speech data frame, and taking the calculated speech energy as the first speech energy corresponding to the first to-be-processed speech data frame.

The first speech energy E_n is calculated as

E_n = \sum_{m=0}^{N-1} [x(m) \cdot w(m)]^2

where x(m) is the m-th sample of the first frame of speech data to be processed, N is the number of samples in the frame, and w(m) is the window function (a rectangular window corresponding to the first frame of speech data to be processed); with this square window, the speech energy equals the sum of the squares of all speech samples in the frame.
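
A minimal Python sketch of the framing and short-time energy calculation described above, assuming the 30 ms frame length mentioned earlier and a rectangular window:

```python
import numpy as np

def frame_energies(signal: np.ndarray, sr: int, frame_ms: int = 30) -> np.ndarray:
    """Split the data into fixed-length frames and return the short-time energy
    of each frame, i.e. the sum of squared samples under a rectangular window."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames.astype(np.float64) ** 2, axis=1)
```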

For S23, extracting a preset number of first to-be-processed voice data frames from the beginning of the plurality of first to-be-processed voice data frames, and using the extracted preset number of first to-be-processed voice data frames as a plurality of background to-be-processed voice data frames; and taking the first speech energy corresponding to each background speech data frame to be processed as a first beginning speech energy.

Optionally, the preset number is 10.

For S24, a mean value calculation is performed on the plurality of first beginning speech energies, and the calculated mean speech energy is taken as the first background speech energy corresponding to the plurality of first to-be-processed speech data frames.

For S25, the first background speech energy is subtracted from the first speech energy corresponding to the first to-be-processed speech data frame to obtain a speech energy difference value, and the obtained speech energy difference value is used as the first speech energy difference value corresponding to the first to-be-processed speech data frame.

For S26, obtaining a speech energy threshold; and comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value.

For S27, when the first speech energy difference value corresponding to a first to-be-processed speech data frame is greater than the speech energy threshold, that frame differs significantly from the background speech corresponding to the first background speech energy, meaning that the target speaker and/or other speakers are speaking at that moment; therefore, the silence category of the first to-be-processed speech data frame corresponding to the first speech energy difference value can be determined to be a non-silence frame.

For S28, when the first speech energy difference value corresponding to a first to-be-processed speech data frame is less than or equal to the speech energy threshold, that frame differs little from the background speech corresponding to the first background speech energy, meaning that neither the target speaker nor other speakers are speaking at that moment; therefore, the silence category of the first to-be-processed speech data frame corresponding to the first speech energy difference value can be determined to be a silence frame.
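
By way of illustration of steps S23 to S28, the following sketch computes the first background speech energy from the beginning frames and classifies each frame as a silence or non-silence frame; the preset number of 10 beginning frames follows the text, while the energy threshold value is an illustrative assumption.

```python
import numpy as np

def classify_silence(energies: np.ndarray, n_background: int = 10,
                     energy_threshold: float = 0.1) -> np.ndarray:
    """Return a boolean array: True where a frame is a silence frame.

    The background energy is the mean of the first `n_background` frame
    energies; a frame is non-silent when its energy exceeds the background by
    more than `energy_threshold` (an illustrative value, not given in the text)."""
    background = energies[:n_background].mean()
    differences = energies - background
    return differences <= energy_threshold
```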

For S29, deleting the first to-be-processed speech data frames of which consecutive silence categories are silence frames that satisfy the threshold of the number of silence frames, and determining the first to-be-extracted speech data segments according to the deleted first to-be-processed speech data frames. That is, the total duration of the voice data of the first to-be-extracted voice data segments is less than or equal to the total duration of the voice data of the first to-be-processed voice data frames.

Optionally, a mute frame number threshold and the mute category are used to perform mute frame deletion processing on the plurality of first to-be-processed voice data frames to obtain a plurality of to-be-combined first to-be-processed voice data frames; adjacent to-be-combined first to-be-processed voice data frames are then combined to obtain the plurality of first to-be-extracted voice data segments. This further reduces the number of voice data segments input into the single speaker voice extraction model and improves the voice extraction efficiency. For example, suppose the first to-be-processed voice data frames, ordered in time, are voice data frame 1 to voice data frame 7. After voice data frame 3 and voice data frame 4 are deleted, the to-be-combined frames are voice data frame 1, voice data frame 2, voice data frame 5, voice data frame 6 and voice data frame 7. Combining adjacent frames in time order then yields two first to-be-extracted voice data segments: the first consists of voice data frame 1 and voice data frame 2, and the second consists of voice data frame 5, voice data frame 6 and voice data frame 7. This example is not limiting.

In an embodiment, the step of performing silence frame deletion processing on the plurality of first to-be-processed speech data frames by using a silence frame number threshold and the silence category to obtain the plurality of first to-be-extracted speech data segments includes:

S291: counting the silence frames that are consecutive in time among the plurality of first to-be-processed speech data frames to obtain a plurality of first consecutive silence frame counts;

S292: comparing each first consecutive silence frame count with the silence frame number threshold;

S293: when a first consecutive silence frame count is greater than the silence frame number threshold, deleting, from the plurality of first to-be-processed speech data frames, the first to-be-processed speech data frames corresponding to every first consecutive silence frame count that is greater than the silence frame number threshold, so as to obtain the plurality of first to-be-extracted speech data segments.

In this embodiment, the plurality of first to-be-extracted speech data segments are obtained by deleting silence frames according to the silence category, which reduces the number of speech data segments input into the single speaker speech extraction model, improves speech extraction efficiency, shortens the silence duration in the finally obtained target speech data of the target speaker, and improves the efficiency of business quality evaluation performed on that target speech data.

For S291, the first to-be-processed speech data frames are sorted in chronological order, and the lengths of the consecutive runs of silence frames among the sorted frames are counted to obtain the plurality of first consecutive silence frame counts.

For example, speech data frames 1 to 7 of the first to-be-processed speech data frames are sorted in chronological order, and the silence category of speech data frames 3, 4 and 6 is silence frame; then a first consecutive silence frame count of 2 is obtained (namely speech data frames 3 and 4) and a second first consecutive silence frame count of 1 is obtained (namely speech data frame 6), which is not limiting in this example.

For S292, each first consecutive silence frame count is compared individually with the silence frame number threshold.

For S293, when a first consecutive silence frame count is greater than the silence frame number threshold, the number of first to-be-processed speech data frames in that run reaches the deletion condition; the first to-be-processed speech data frames corresponding to every first consecutive silence frame count greater than the silence frame number threshold are deleted from the plurality of first to-be-processed speech data frames, and the plurality of first to-be-extracted speech data segments are obtained from the remaining first to-be-processed speech data frames.

When a first consecutive silence frame count is less than or equal to the silence frame number threshold, no processing is needed, which avoids excessive deletion that would change the speech rate of the target speaker in the target speech data.
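Putting S291 to S293 together, a minimal sketch of the run-counting and deletion logic might look as follows; this is a Python illustration with hypothetical names, where the silence labels come from S27/S28 and `max_silence_frames` stands for the silence frame number threshold.

```python
# Sketch of S291-S293: count consecutive silence frames and delete only those
# runs whose length exceeds the silence frame number threshold, so short pauses
# are kept and the target speaker's speech rate is not distorted.
def delete_long_silence_runs(frames, labels, max_silence_frames):
    """frames: time-ordered frames; labels: 'silence'/'non-silence' per frame."""
    kept, run = [], []
    for frame, label in zip(frames, labels):
        if label == "silence":
            run.append(frame)                 # accumulate the current silence run
        else:
            if len(run) <= max_silence_frames:
                kept.extend(run)              # short run: keep it
            run = []
            kept.append(frame)
    if len(run) <= max_silence_frames:        # handle a trailing silence run
        kept.extend(run)
    return kept
```

With the numerical example above (silence runs of frames 3 and 4, and frame 6, and a threshold of 1), frames 3 and 4 are deleted while frame 6 is kept, which matches the two resulting first to-be-extracted segments.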

In an embodiment, the step of performing segmented extraction on the second to-be-processed speech data according to the plurality of first to-be-extracted speech data segments to obtain a plurality of second to-be-extracted speech data segments includes:

S31: extracting the start time and the end time of each first to-be-extracted speech data segment, respectively, to obtain the first start time and the first end time corresponding to each of the plurality of first to-be-extracted speech data segments;

S32: performing segmented extraction on the second to-be-processed speech data by using the first start time and the first end time corresponding to each first to-be-extracted speech data segment, respectively, to obtain the plurality of second to-be-extracted speech data segments.

This embodiment realizes segmented extraction of the second to-be-processed speech data according to the plurality of first to-be-extracted speech data segments, thereby providing a data basis for determining the pairs of to-be-extracted speech data segments.

For S31, extracting a first speech data segment to be extracted from the plurality of first speech data segments to be extracted as a target first speech data segment to be extracted; acquiring the starting time of a target first voice data segment to be extracted as the first starting time corresponding to the target first voice data segment to be extracted, and acquiring the ending time of the target first voice data segment to be extracted as the first ending time corresponding to the target first voice data segment to be extracted; and repeatedly executing the step of extracting one first voice data segment to be extracted from the plurality of first voice data segments to be extracted as the target first voice data segment to be extracted until determining the first start time and the first end time corresponding to the plurality of first voice data segments to be extracted respectively.

For S32, extracting a first speech data segment to be extracted from the plurality of first speech data segments to be extracted as a target first speech data segment to be extracted; performing segmentation extraction from the second voice data to be processed according to a first start time and a first end time corresponding to the target first voice data segment to be extracted, and taking the voice data obtained by the segmentation extraction as a second voice data segment to be extracted corresponding to the target first voice data segment to be extracted; and repeatedly executing the step of extracting one first voice data segment to be extracted from the plurality of first voice data segments to be extracted as a target first voice data segment to be extracted until determining second voice data segments to be extracted corresponding to the plurality of first voice data segments to be extracted.
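For illustration only, the time-based extraction in S31/S32 could be sketched as follows in Python; the sample-rate handling and the `(start, end)` segment representation are assumptions, not part of the original text.

```python
# Sketch of S31/S32: cut the second-direction audio at the same start/end times
# as each first to-be-extracted segment, so both channels stay time-aligned.
def extract_aligned_segments(first_segments, second_audio, sample_rate):
    """first_segments: list of (start_time_s, end_time_s) pairs (S31).
    second_audio: 1-D sequence of second-direction samples.
    Returns the times together with the matching slice of the
    second-direction audio (S32)."""
    second_segments = []
    for start_s, end_s in first_segments:
        start_idx = int(start_s * sample_rate)
        end_idx = int(end_s * sample_rate)
        second_segments.append(((start_s, end_s), second_audio[start_idx:end_idx]))
    return second_segments
```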

In an embodiment, the step of inputting each to-be-extracted speech data segment pair into the single speaker speech extraction model for speech extraction to obtain a plurality of target speaker speech data segments includes:

S51: inputting the first voice data segment to be extracted of the voice data segment pair to be extracted into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result;

S52: inputting the second voice data segment to be extracted of the voice data segment pair to be extracted into the second coding transformation module of the single speaker voice extraction model for coding transformation to obtain a second coding transformation result;

S53: calling the speaker separation learning module of the single speaker voice extraction model to perform speaker separation learning on the first coding transformation result and the second coding transformation result to obtain a target mask matrix;

S54: calling the decoding transformation module of the single speaker voice extraction model to perform decoding transformation on the target mask matrix to obtain the target speaker voice data segment corresponding to the voice data segment pair to be extracted;

S55: repeatedly executing the step of inputting the first voice data segment to be extracted of the voice data segment pair to be extracted into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result, until the target speaker voice data segments corresponding to all the voice data segment pairs to be extracted are determined.

The embodiment realizes that the first voice data segment to be extracted and the second voice data segment to be extracted of the voice data segment pair to be extracted are simultaneously input into the single speaker voice extraction model for voice extraction, thereby realizing the rapid, accurate and automatic extraction of the speaking voice of the target speaker.

For S51, the first to-be-extracted voice data segment of the to-be-extracted voice data segment pair is input into the first coding transformation module of the single speaker voice extraction model for coding transformation to obtain a first coding transformation result; that is, the first coding transformation module of the single speaker voice extraction model is trained with voice signals in the first direction.

For S52, the second to-be-extracted voice data segment of the to-be-extracted voice data segment pair is input into the second coding transformation module of the single speaker voice extraction model for coding transformation to obtain a second coding transformation result; that is, the second coding transformation module of the single speaker voice extraction model is trained with voice signals in the second direction.

For S53, the target mask matrix refers to the mask matrix of the target speaker.

For S54, the decoding transformation module of the single speaker voice extraction model is called to perform decoding transformation on the target mask matrix so as to restore a time-domain signal, obtaining the target speaker voice data segment corresponding to the to-be-extracted voice data segment pair.

For S55, steps S51 to S55 are repeatedly executed until the target speaker voice data segments respectively corresponding to all the to-be-extracted voice data segment pairs are determined.
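A non-authoritative sketch of S51 to S54 follows, assuming PyTorch-style submodules named `encoder_1`, `encoder_2`, `separator` and `decoder` as illustrative stand-ins for the first/second coding transformation, speaker separation learning and decoding transformation modules. Note that, following the usual TasNet convention, the sketch applies the mask to the encoded representation before decoding; the text above describes decoding the target mask matrix, so this is one plausible reading rather than the definitive implementation.

```python
import torch

def extract_target_segment(model, first_seg, second_seg):
    """Run one to-be-extracted segment pair through the single speaker model.
    `model` is assumed to expose encoder_1, encoder_2, separator and decoder."""
    with torch.no_grad():
        enc_1 = model.encoder_1(first_seg)    # S51: first coding transformation
        enc_2 = model.encoder_2(second_seg)   # S52: second coding transformation
        mask = model.separator(enc_1, enc_2)  # S53: target mask matrix
        target = model.decoder(mask * enc_1)  # S54: decode back to a waveform
    return target

# S55: repeat over every segment pair and keep the results in time order.
# target_segments = [extract_target_segment(model, a, b) for a, b in segment_pairs]
```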

In an embodiment, before the step of inputting each to-be-extracted speech data segment pair into the single speaker speech extraction model for speech extraction to obtain a plurality of target speaker speech data segments, the method further includes:

S051: obtaining a plurality of training samples, the training samples comprising: voice sample data in a first direction, voice sample data in a second direction and voice calibration data;

S052: inputting the voice sample data of the training sample in the first direction into a first to-be-trained coding transformation module of a to-be-trained speech extraction model and inputting the voice sample data of the training sample in the second direction into a second to-be-trained coding transformation module of the to-be-trained speech extraction model, and acquiring single speaker training data output by the to-be-trained speech extraction model, wherein the to-be-trained speech extraction model is a model obtained by adapting the TasNet network;

S053: inputting the voice calibration data and the single speaker training data into a loss function for calculation to obtain a loss value of the to-be-trained speech extraction model, updating the parameters of the to-be-trained speech extraction model according to the loss value, and using the updated to-be-trained speech extraction model for the next calculation of the single speaker training data;

S054: repeatedly executing the steps of the method until the loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and determining the to-be-trained speech extraction model whose loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the single speaker speech extraction model.

This embodiment realizes training of the single speaker voice extraction model based on the TasNet network, providing a foundation for the subsequent separation of the single speaker's voice data.

For S051, each training sample includes voice sample data in the first direction, voice sample data in the second direction, and voice calibration data.

The voice calibration data is voice data of a single speaker calibrated for voice sample data in a first direction and voice sample data in a second direction.

The voice sample data in the first direction, the voice sample data in the second direction and the voice calibration data are all voice data in a time domain.

For S052, the speech extraction model to be trained comprises: the system comprises a first to-be-trained encoding transformation module, a second to-be-trained encoding transformation module, a to-be-trained speaker separation learning module and a to-be-trained decoding transformation module; the first to-be-trained code conversion module and the second to-be-trained code conversion module are connected with the to-be-trained speaker separation learning module, and the to-be-trained speaker separation learning module is connected with the to-be-trained decoding conversion module.

Optionally, the first transcoding module to be trained and the second transcoding module to be trained both use the Encoder module of the TasNet network, the speaker Separation learning module to be trained uses the Separation module of the TasNet network, and the decoding transformation module to be trained uses the Decoder module of the TasNet network.

The Encoder module of the TasNet network comprises: a convolutional layer with a 1 × 1 convolution kernel, a regularization layer, and a fully connected layer.
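A minimal PyTorch sketch of an encoder matching this description is given below; the layer sizes are hypothetical and the 1 × 1 convolution / normalization / fully connected layering follows the text above rather than any specific TasNet release.

```python
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Illustrative encoder: 1x1 convolution, normalization, fully connected layer."""
    def __init__(self, in_channels=1, hidden_channels=256, feature_dim=256):
        super().__init__()
        self.conv1x1 = nn.Conv1d(in_channels, hidden_channels, kernel_size=1)
        self.norm = nn.LayerNorm(hidden_channels)
        self.fc = nn.Linear(hidden_channels, feature_dim)

    def forward(self, x):                 # x: (batch, in_channels, time)
        y = self.conv1x1(x)               # (batch, hidden_channels, time)
        y = self.norm(y.transpose(1, 2))  # normalize over the channel dimension
        return self.fc(y)                 # (batch, time, feature_dim)
```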

For S053, the loss function SI-SNR is:

$$s_{target} = \frac{\langle \hat{s}, s \rangle \, s}{\lVert s \rVert^{2}}, \qquad e_{noise} = \hat{s} - s_{target}, \qquad SI\text{-}SNR = 10 \log_{10} \frac{\lVert s_{target} \rVert^{2}}{\lVert e_{noise} \rVert^{2}}$$

wherein $\hat{s}$ represents the single speaker training data, $s$ represents the voice calibration data, $\langle \hat{s}, s \rangle$ denotes the dot product of the vector $\hat{s}$ and the vector $s$, $\log_{10}()$ refers to the logarithm function, $\lVert s \rVert$ refers to the second norm (L2 norm) of the voice calibration data, $\lVert s_{target} \rVert$ is the second norm of $s_{target}$, and $\lVert e_{noise} \rVert$ is the second norm of $e_{noise}$.
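The SI-SNR loss above can be sketched in a few lines of PyTorch; this is a non-authoritative illustration of the formula, where `estimate` stands for the single speaker training data and `reference` for the voice calibration data (zero-mean normalization, often applied in practice, is omitted for brevity).

```python
import torch

def si_snr_loss(estimate, reference, eps=1e-8):
    """Negative SI-SNR, so that minimizing the loss maximizes SI-SNR."""
    # Project the estimate onto the reference: s_target = <s_hat, s> s / ||s||^2
    dot = torch.sum(estimate * reference, dim=-1, keepdim=True)
    s_target = dot * reference / (torch.sum(reference ** 2, dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    )
    return -si_snr.mean()
```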

For S054, steps S052 to S054 are repeated until the loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and the to-be-trained speech extraction model whose loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition is determined as the single speaker speech extraction model.

The first convergence condition means that the loss values obtained in two adjacent calculations satisfy a Lipschitz condition (Lipschitz continuity condition).

The number of iterations refers to the number of times the to-be-trained speech extraction model has been used to calculate the single speaker training data; the count is increased by 1 after each calculation, and training stops when it reaches the second convergence condition.
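Steps S052 to S054 together amount to a conventional training loop. A hedged PyTorch sketch follows, with hypothetical names such as `model`, `optimizer` and `train_samples`, reusing the `si_snr_loss` sketch above; the simple loss-difference check stands in for the Lipschitz-style first convergence condition described above.

```python
def train_single_speaker_model(model, optimizer, train_samples, max_iterations, tol):
    """Sketch of S052-S054: iterate until the loss change is small enough (first
    convergence condition) or the count hits max_iterations (second condition)."""
    prev_loss = None
    for iteration in range(1, max_iterations + 1):
        for first_dir, second_dir, calibration in train_samples:
            estimate = model(first_dir, second_dir)    # S052: forward pass
            loss = si_snr_loss(estimate, calibration)  # S053: loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                           # S053: update parameters
        # Simplified convergence check on the last batch's loss of this iteration.
        if prev_loss is not None and abs(prev_loss - loss.item()) < tol:
            break                                      # S054: first convergence condition
        prev_loss = loss.item()
    return model
```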

Referring to fig. 2, the present application also proposes a speech extraction apparatus for a target speaker, the apparatus comprising:

a voice data obtaining module 100, configured to obtain first to-be-processed voice data and second to-be-processed voice data of a target speaker in the same time period, where the first to-be-processed voice data is voice data obtained according to a voice signal in a first direction, and the second to-be-processed voice data is voice data obtained according to a voice signal in a second direction;

a first segmentation processing module 200, configured to perform segmentation processing on the first to-be-processed speech data by using a preset segmentation method to obtain a plurality of first to-be-extracted speech data segments;

a second segmentation extraction module 300, configured to perform segmented extraction on the second to-be-processed voice data according to the plurality of first to-be-extracted voice data segments to obtain a plurality of second to-be-extracted voice data segments;

a to-be-extracted voice data segment pair determining module 400, configured to perform data extraction on the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments at the same time to obtain a plurality of to-be-extracted voice data segment pairs;

a target speaker voice data segment determining module 500, configured to input each to-be-extracted voice data segment pair into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;

and the target voice data determining module 600 is configured to splice the multiple target speaker voice data segments according to a time sequence to obtain the target voice data of the target speaker.

This embodiment performs segmentation processing and same-time data extraction on the first to-be-processed voice data in the first direction and the second to-be-processed voice data in the second direction of the target speaker in the same time period to obtain a plurality of to-be-extracted voice data segment pairs, then inputs each pair into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training, and finally splices the plurality of target speaker voice data segments in time order to obtain the target voice data of the target speaker. This realizes fast, accurate and automatic extraction of the speaking voice of the target speaker, reduces the cost of business quality evaluation, improves the comprehensiveness of the business quality evaluation through the complete voice data of the target speaker, and is beneficial to protecting the privacy and security of the voice data of other speakers.

Referring to fig. 3, a computer device, which may be a server and whose internal structure may be as shown in fig. 3, is also provided in the embodiment of the present application. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data related to the voice extraction method for a target speaker. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of speech extraction for a targeted speaker. The voice extraction method aiming at the target speaker comprises the following steps: acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction; performing segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted; carrying out segmentation extraction on the second voice data to be processed according to the first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted; performing data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted; and respectively carrying out voice extraction on each voice data segment to be extracted to an input single speaker voice extraction model to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model comprises the following steps: the single speaker voice extraction model is a model obtained based on TasNet network training; and splicing the target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

This embodiment performs segmentation processing and same-time data extraction on the first to-be-processed voice data in the first direction and the second to-be-processed voice data in the second direction of the target speaker in the same time period to obtain a plurality of to-be-extracted voice data segment pairs, then inputs each pair into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training, and finally splices the plurality of target speaker voice data segments in time order to obtain the target voice data of the target speaker. This realizes fast, accurate and automatic extraction of the speaking voice of the target speaker, reduces the cost of business quality evaluation, improves the comprehensiveness of the business quality evaluation through the complete voice data of the target speaker, and is beneficial to protecting the privacy and security of the voice data of other speakers.

An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a method for extracting speech for a target speaker, including the steps of: acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to a voice signal in a first direction, and the second voice data to be processed is voice data obtained according to a voice signal in a second direction; performing segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted; carrying out segmentation extraction on the second voice data to be processed according to the first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted; performing data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted; and respectively carrying out voice extraction on each voice data segment to be extracted to an input single speaker voice extraction model to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model comprises the following steps: the single speaker voice extraction model is a model obtained based on TasNet network training; and splicing the target speaker voice data segments according to the time sequence to obtain the target voice data of the target speaker.

The executed voice extraction method for the target speaker performs segmentation processing and same-time data extraction on the first to-be-processed voice data in the first direction and the second to-be-processed voice data in the second direction of the target speaker in the same time period to obtain a plurality of to-be-extracted voice data segment pairs, then inputs each pair into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training, and finally splices the plurality of target speaker voice data segments in time order to obtain the target voice data of the target speaker. This realizes fast, accurate and automatic extraction of the speaking voice of the target speaker, reduces the cost of business quality evaluation, improves the comprehensiveness of the business quality evaluation through the complete voice data of the target speaker, and is beneficial to protecting the privacy and security of the voice data of other speakers.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium, and the program, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.
