Voice processing method, device and equipment

Document No.: 193323    Publication date: 2021-11-02

Reading note: This technology, "A voice processing method, device and equipment", was designed and created by Chen Xiaoqiang and Pu Yinhua on 2021-07-28. The main content is as follows: The embodiment of the application provides a voice processing method, apparatus and device, which are applied to a voice system comprising a microphone and a loudspeaker. The method includes: acquiring a first voice signal collected by the microphone in a preset time period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset time period; acquiring, from a buffer, a second voice signal corresponding to the preset time period; determining, according to the first voice signal and the second voice signal, the time delay with which the loudspeaker plays the voice signal in the buffer; calibrating the second voice signal according to the time delay to obtain a third voice signal; and processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal. The accuracy of voice processing is thereby improved.

1. A speech processing method, applied to a speech system, wherein the speech system comprises a microphone and a loudspeaker, and the method comprises the following steps:

acquiring a first voice signal acquired by the microphone in a preset time period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset time period;

acquiring, from a buffer, a second voice signal corresponding to the preset time period;

determining, according to the first voice signal and the second voice signal, a time delay with which the loudspeaker plays the voice signal in the buffer;

calibrating the second voice signal according to the time delay to obtain a third voice signal;

and processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal.

2. The speech processing method according to claim 1, wherein calibrating the second speech signal according to the time delay to obtain a third speech signal comprises:

acquiring sampling parameters of the microphone, wherein the sampling parameters comprise an audio sampling rate, a sampling bit number and a channel number;

determining a preset signal insertion quantity N according to the time delay and the sampling parameter, wherein N is an integer greater than 1;

and adding N preset signals before the second voice signal to obtain the third voice signal.

3. The speech processing method of claim 2, wherein determining the preset signal insertion quantity N according to the time delay and the sampling parameter comprises:

according to the time delay and the sampling parameter, determining the insertion number N of the preset signals by the following formula I:

wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.

4. The speech processing method according to claim 2 or 3, wherein adding N preset signals before the second speech signal to obtain the third speech signal comprises:

determining a starting storage position corresponding to the second voice signal in the buffer;

and adding the N preset signals before the starting storage position in the buffer to obtain the third voice signal, wherein the starting storage position of the third voice signal in the buffer is the storage position of the first of the N preset signals.

5. The speech processing method of any one of claims 1-4, wherein determining, according to the first voice signal and the second voice signal, the time delay with which the loudspeaker plays the voice signal in the buffer comprises:

determining a first voice characteristic corresponding to the first voice signal;

determining a second voice characteristic corresponding to the second voice signal;

and matching the first voice characteristic and the second voice characteristic to obtain the time delay.

6. The speech processing method according to claim 5, wherein matching the first voice characteristic and the second voice characteristic to obtain the time delay comprises:

determining a first position in the first voice characteristic, wherein a matching degree between the part of the first voice characteristic after the first position and the second voice characteristic is greater than or equal to a first threshold;

and determining, as the time delay, the voice playing duration between the starting position of the first voice characteristic and the first position.

7. The speech processing method according to any one of claims 1-6, wherein the speech system is a vehicle-mounted speech system, the method further comprising:

determining a control instruction according to the user voice signal;

and controlling the corresponding vehicle-mounted equipment according to the control instruction.

8. A speech processing device is characterized by comprising a first acquisition module, a second acquisition module, a determination module, a calibration module and a processing module, wherein,

the first acquisition module is used for acquiring a first voice signal acquired by the microphone in a preset time period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset time period;

the second acquisition module is configured to acquire, from a buffer, a second voice signal corresponding to the preset time period;

the determination module is configured to determine, according to the first voice signal and the second voice signal, a time delay with which the loudspeaker plays the voice signal in the buffer;

the calibration module is used for calibrating the second voice signal according to the time delay to obtain a third voice signal;

the processing module is configured to process the first voice signal according to the third voice signal, so as to extract the user voice signal from the first voice signal.

9. A speech processing device, comprising: a processor, a memory,

the memory stores computer-executable instructions;

the processor executes the computer-executable instructions stored by the memory, causing the processor to perform the speech processing method of any of claims 1-7.

10. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, implement the speech processing method of any one of claims 1 to 7.

11. A computer program product, characterized by comprising a computer program which, when executed by a processor, implements the speech processing method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of audio processing technologies, and in particular, to a method, an apparatus, and a device for processing speech.

Background

A user can control vehicle-mounted equipment in a vehicle through voice instructions. When the user issues a voice instruction, the multimedia equipment in the vehicle may at the same time be playing multimedia voice signals such as music or radio broadcasts, so the voice instruction issued by the user cannot be accurately recognized.

The multimedia device usually stores a multimedia voice signal in a buffer first, and then plays the multimedia voice signal from the buffer. In the related art, after a microphone in the vehicle collects a speech signal to be recognized (containing both a speech instruction and the multimedia speech signal), the speech signal to be recognized can be processed using the multimedia signal in the buffer to obtain the speech instruction issued by the user. However, there is usually a certain time delay between the multimedia signal contained in the speech signal collected by the microphone and the multimedia speech signal in the buffer, so the speech instruction cannot be accurately extracted from the speech signal to be recognized, and the accuracy of speech processing is poor.

Disclosure of Invention

The application relates to a voice processing method, a voice processing apparatus and voice processing equipment, which reduce the time delay between a reference signal and a collected voice signal and improve the accuracy of voice processing.

In a first aspect, an embodiment of the present application provides a speech processing method, which is applied to a speech system, where the speech system includes a microphone and a loudspeaker, and the method includes:

acquiring a first voice signal acquired by the microphone in a preset time period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset time period;

acquiring, from a buffer, a second voice signal corresponding to the preset time period;

determining, according to the first voice signal and the second voice signal, a time delay with which the loudspeaker plays the voice signal in the buffer;

calibrating the second voice signal according to the time delay to obtain a third voice signal;

and processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal.

In a possible implementation manner, performing calibration processing on the second voice signal according to the time delay to obtain a third voice signal includes:

acquiring sampling parameters of the microphone, wherein the sampling parameters comprise an audio sampling rate, a sampling bit number and a channel number;

determining a preset signal insertion quantity N according to the time delay and the sampling parameter, wherein N is an integer greater than 1;

and adding N preset signals before the second voice signal to obtain the third voice signal.

In a possible implementation, determining the preset signal insertion number N according to the time delay and the sampling parameter includes:

according to the time delay and the sampling parameter, determining the insertion number N of the preset signals by the following formula I:

wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.

In a possible implementation manner, adding N preset signals before the second speech signal to obtain the third speech signal includes:

determining a starting storage position corresponding to the second voice signal in the buffer;

and adding the N preset signals before the starting storage position in the buffer to obtain the third voice signal, wherein the starting storage position of the third voice signal in the buffer is the storage position of the first of the N preset signals.

In a possible implementation manner, determining, according to the first voice signal and the second voice signal, the time delay with which the loudspeaker plays the voice signal in the buffer includes:

determining a first voice characteristic corresponding to the first voice signal;

determining a second voice characteristic corresponding to the second voice signal;

and matching the first voice characteristic and the second voice characteristic to obtain the time delay.

In a possible implementation manner, matching the first voice characteristic and the second voice characteristic to obtain the time delay includes:

determining a first position in the first voice characteristic, wherein a matching degree between the part of the first voice characteristic after the first position and the second voice characteristic is greater than or equal to a first threshold;

and determining, as the time delay, the voice playing duration between the starting position of the first voice characteristic and the first position.

In one possible embodiment, the speech system is a car-mounted speech system, and the method further includes:

determining a control instruction according to the user voice signal;

and controlling the corresponding vehicle-mounted equipment according to the control instruction.

In a second aspect, an embodiment of the present application provides a speech processing apparatus, including a first acquisition module, a second acquisition module, a determination module, a calibration module, and a processing module, wherein,

the first acquisition module is used for acquiring a first voice signal acquired by the microphone in a preset time period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset time period;

the second acquisition module is configured to acquire, from a buffer, a second voice signal corresponding to the preset time period;

the determination module is configured to determine, according to the first voice signal and the second voice signal, a time delay with which the loudspeaker plays the voice signal in the buffer;

the calibration module is used for calibrating the second voice signal according to the time delay to obtain a third voice signal;

the processing module is configured to process the first voice signal according to the third voice signal, so as to extract the user voice signal from the first voice signal.

In a possible implementation, the calibration module is specifically configured to:

acquiring sampling parameters of the microphone, wherein the sampling parameters comprise an audio sampling rate, a sampling bit number and a channel number;

determining a preset signal insertion quantity N according to the time delay and the sampling parameter, wherein N is an integer greater than 1;

and adding N preset signals before the second voice signal to obtain the third voice signal.

In a possible implementation, the calibration module is specifically configured to:

according to the time delay and the sampling parameter, determining the insertion number N of the preset signals by the following formula I:

wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.

In a possible implementation, the calibration module is specifically configured to:

determining a starting storage position corresponding to the second voice signal in the buffer;

and adding the N preset signals before the starting storage position in the buffer to obtain the third voice signal, wherein the starting storage position of the third voice signal in the buffer is the storage position of the first of the N preset signals.

In a possible implementation, the determination module is specifically configured to:

determining a first voice characteristic corresponding to the first voice signal;

determining a second voice characteristic corresponding to the second voice signal;

and matching the first voice characteristic and the second voice characteristic to obtain the time delay.

In a possible implementation, the determination module is specifically configured to:

determining a first position in the first voice characteristic, wherein a matching degree between the part of the first voice characteristic after the first position and the second voice characteristic is greater than or equal to a first threshold;

and determining, as the time delay, the voice playing duration between the starting position of the first voice characteristic and the first position.

In one possible implementation, the speech processing apparatus further comprises a control module,

the control module is used for determining a control instruction according to the user voice signal; and controlling the corresponding vehicle-mounted equipment according to the control instruction.

In a third aspect, an embodiment of the present application provides a speech processing apparatus, including: a processor, a memory,

the memory stores computer-executable instructions;

the processor executes computer-executable instructions stored by the memory, causing the processor to perform the speech processing method of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the speech processing method according to the first aspect is implemented.

In a fifth aspect, the present application provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the speech processing method according to the first aspect.

According to the voice processing method, apparatus and device provided by the embodiments of the application, a first voice signal collected by a microphone and a second voice signal (reference signal) in a buffer are obtained first; the time delay with which the loudspeaker plays the voice signal in the buffer is determined according to the first voice signal and the second voice signal (reference signal); the second voice signal (reference signal) is calibrated according to the time delay to obtain a third voice signal; and echo cancellation/noise reduction (EC/NR) processing is performed on the first voice signal by using the third voice signal to obtain a pure user voice signal. Because the time delay between the third voice signal and the first voice signal is small, the user voice signal can be accurately extracted from the first voice signal according to the third voice signal, which improves the accuracy of voice processing.

Drawings

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;

fig. 2 is a schematic flowchart of a speech processing method according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of another speech processing method according to an embodiment of the present application;

fig. 4 is a schematic flow chart of a matching process provided in the embodiment of the present application;

fig. 5 is a schematic diagram of adding N preset signals before the starting storage position in the buffer;

fig. 6 is a schematic diagram of a speech processing method according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of another speech processing apparatus according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of a speech processing device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For ease of understanding, an application scenario to which the embodiment of the present application is applied is described below with reference to fig. 1.

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. Referring to fig. 1, a vehicle may be provided with a voice system. The voice system may comprise a microphone, a voice processing module, a voice recognition module and a multimedia player, and the multimedia player may comprise a buffer module and a loudspeaker. The audio signal is buffered in the buffer module of the multimedia player, and when the multimedia player is started, the loudspeaker plays the audio signal in the buffer module.

One or more vehicle-mounted devices may be arranged in the vehicle, and a user can control the vehicle-mounted devices by issuing voice commands. When the user speaks, the voice signal collected by the microphone may include both the voice instruction uttered by the user and the audio signal (background noise) played by the loudspeaker of the multimedia player. The voice processing module performs EC/NR processing on the collected voice signal to obtain the pure voice command issued by the user, and then transmits it to the voice recognition module for recognition.

In the related art, the speech processing module may perform EC/NR processing on the collected speech signal by using a reference signal, where the reference signal is the audio signal buffered in the buffer module of the multimedia player. There is usually a time delay between the reference signal and the background noise in the collected voice signal, so the reference signal cannot support good EC/NR processing of the collected voice signal; as a result, the voice instruction cannot be accurately extracted from the speech signal to be recognized, and the accuracy of voice processing is poor.

In order to obtain a pure voice instruction sent by a user, an embodiment of the present application provides a voice processing method, which first determines a time delay between a reference signal and a voice signal collected by a microphone, calibrates the reference signal according to the time delay, and performs EC/NR processing on the collected voice signal by using the calibrated reference signal. By calibrating the reference signal, the time delay between the calibrated reference signal and the collected voice signal can be reduced, so that the voice command can be accurately extracted from the voice signal to be recognized according to the calibrated reference signal, and the accuracy of voice processing is improved.

The technical means shown in the present application will be described in detail below with reference to specific examples. It should be noted that the following embodiments may exist independently or may be combined with each other, and the description of the same or similar content is not repeated in different embodiments.

Fig. 2 is a flowchart illustrating a speech processing method according to an embodiment of the present application. Referring to fig. 2, the method may include:

s201, acquiring a first voice signal acquired by a microphone in a preset time period.

The execution entity of the embodiment of the application may be a vehicle, or a voice processing apparatus arranged in the vehicle; the voice processing apparatus may be implemented in software, or in a combination of software and hardware.

The first voice signal comprises a user voice signal and a voice signal played by a loudspeaker in a preset time period; the voice signal of the user may be a voice instruction sent by the user, and the voice signal played by the speaker may be a voice signal obtained by playing the audio signal in the buffer memory by the speaker.

The preset period may be a period from the start of the emission of the user voice signal to the end of the user voice signal, and for example, the preset period may be 20 seconds, 1 minute, or the like.

S202, acquiring a second voice signal in a preset time period in the buffer.

The buffer may be a circular buffer, and the buffer may store an audio signal to be played, for example, the audio signal may be a multimedia audio signal.

The second voice signal is an audio signal pre-stored in the buffer. The second speech signal may be a multimedia audio signal, e.g. the second speech signal may be music, radio, etc. The second speech signal may be used as a reference signal for processing the first speech signal.

And S203, determining the time delay of the loudspeaker for playing the voice signal in the buffer memory according to the first voice signal and the second voice signal.

The voice signal has a certain time delay from the buffer to the loudspeaker for playing, and the time delay can be 200 milliseconds, 2 seconds and the like.

The time delay of the speaker for playing the voice signal in the buffer memory can be determined by the following method: determining a first voice characteristic corresponding to the first voice signal; determining a second voice characteristic corresponding to the second voice signal; and matching the first voice characteristic and the second voice characteristic to obtain the time delay.

For example, suppose the duration of the first speech feature is T1 seconds, the duration of the second speech feature is T2 seconds, and the starting time of the first speech feature is 0 seconds. If the portion of the first speech feature after 1.5 s matches the second speech feature, the time delay is determined to be 1.5 s.

And S204, calibrating the second voice signal according to the time delay to obtain a third voice signal.

A preset signal with a certain duration may be added before the second voice signal according to the time delay to obtain a third voice signal. For example, assuming that the time delay is t, a preset signal with a time duration of t may be added before the second voice signal to obtain a third voice signal.

The time delay between the third voice signal and the first voice signal is smaller than the time delay between the second voice signal and the first voice signal.

And S205, processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal.

The first speech signal may be subjected to EC/NR processing using the third speech signal as the reference signal.
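The source does not specify which EC/NR algorithm is applied. As a minimal sketch, the normalized-LMS (NLMS) adaptive echo canceller below illustrates one common way to subtract the reference (the third speech signal) from the microphone capture; the function name, filter length and step size are illustrative assumptions, not values from the patent, and mono float arrays are assumed.

```python
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     filter_len: int = 256, mu: float = 0.5,
                     eps: float = 1e-6) -> np.ndarray:
    """Estimate the echo of `ref` (calibrated reference) contained in `mic`
    with an adaptive FIR filter and subtract it; the residual approximates
    the user voice signal."""
    n = min(len(mic), len(ref))
    w = np.zeros(filter_len)               # adaptive filter taps
    history = np.zeros(filter_len)         # most recent reference samples
    out = np.zeros(n)
    for i in range(n):
        history = np.roll(history, 1)
        history[0] = ref[i]
        echo_est = w @ history             # estimated playback echo
        e = mic[i] - echo_est              # error = microphone minus echo estimate
        out[i] = e
        # normalized LMS tap update
        w += (mu / (eps + history @ history)) * e * history
    return out
```

In practice the echo-cancellation stage would typically be followed by a separate noise-reduction stage; only the EC half is sketched here.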

In the embodiment shown in fig. 2, a first voice signal collected by a microphone and a second voice signal (reference signal) in a buffer are obtained first; determining the time delay of the loudspeaker for playing the voice signal in the buffer memory according to the first voice signal and the second voice signal (reference signal); calibrating the second voice signal (reference signal) according to the time delay to obtain a third voice signal; and carrying out EC/NR processing on the first voice signal by utilizing the third voice signal to obtain a pure user voice signal. Because the time delay between the third voice signal and the first voice signal is small, the user voice signal can be accurately extracted from the first voice signal according to the third voice signal, and the accuracy of voice processing is further improved.

Based on any of the above embodiments, the following describes the speech processing method in detail with reference to the embodiment shown in fig. 3.

Fig. 3 is a flowchart illustrating another speech processing method according to an embodiment of the present application. Referring to fig. 3, the method may include:

s301, acquiring a first voice signal acquired by a microphone in a preset time period.

It should be noted that the execution process of S301 may refer to the execution process of S201, and is not described herein again.

S302, acquiring a second voice signal for the preset time period from the buffer.

It should be noted that the execution process of S302 may refer to the execution process of S202, and is not described herein again.

S303, determining a first voice characteristic corresponding to the first voice signal.

The first speech feature may be a sequence of time domain frames, a sequence of short-term zero-crossing rates, a sequence of spectral centroids, and/or a sequence of mel-frequency cepstral coefficients of the first speech signal, etc.

S304, determining a second voice characteristic corresponding to the second voice signal.

The second speech feature may be a sequence of time domain frames, a sequence of short-term zero-crossing rates, a sequence of spectral centroids, and/or a sequence of mel-frequency cepstral coefficients of the second speech signal, etc.
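As an illustration of one of the candidate features listed above, the sketch below computes a frame-wise short-term zero-crossing-rate sequence; the frame length and hop (960 and 480 samples, i.e. 20 ms and 10 ms at 48 kHz) are assumptions made for the example, not values given in the source.

```python
import numpy as np

def zero_crossing_rate_sequence(signal: np.ndarray,
                                frame_len: int = 960,
                                hop: int = 480) -> np.ndarray:
    """Short-term zero-crossing rate, one value per frame."""
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    zcr = np.empty(n_frames)
    for k in range(n_frames):
        frame = signal[k * hop: k * hop + frame_len]
        # fraction of consecutive sample pairs whose sign changes
        zcr[k] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return zcr
```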

S305, matching the first voice characteristic and the second voice characteristic to obtain time delay.

The matching process may be performed in the following manner: determining a first position in the first voice characteristic, wherein a matching degree between the part of the first voice characteristic after the first position and the second voice characteristic is greater than or equal to a first threshold; and determining, as the time delay, the voice playing duration between the starting position of the first voice characteristic and the first position.

The duration of the first speech feature is the same as the duration of the first speech signal.

Optionally, the second speech feature may be matched against the first speech feature. If the matching degree between the second speech feature and the first speech feature is smaller than the first threshold, the starting position of the first speech feature is pushed backwards by P1, and the matching degree between the second speech feature and the part of the first speech feature after the pushed starting position is determined. If this matching degree is greater than or equal to the first threshold, P1 is determined as the first position; otherwise, the starting position is pushed backwards further by P2, and the above process is repeated until the first position is determined.

The first threshold may be 98%, 100%, etc.

For ease of understanding, the matching process is described in detail below with reference to fig. 4.

Fig. 4 is a schematic flow chart of the matching process according to the embodiment of the present application. Referring to fig. 4, a first speech feature 401 and a second speech feature 402 may be determined, where the duration of the first speech feature is the same as the duration of the first speech signal, and the duration of the second speech feature is the same as the duration of the second speech signal.

A first position may be determined on the first speech feature 401, where a matching degree of the speech feature after the first position in the first speech feature with the second speech feature is greater than a preset threshold. Assuming that the corresponding time at the first position is 1 second and the starting time of the first speech feature is 0 second, the time delay is determined to be 1 second.
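The source does not fix the matching metric or the step sizes P1, P2. The sketch below slides a candidate first position forward in fixed hops and uses a Pearson correlation between the overlapping feature segments as the matching degree; both choices are assumptions made for illustration.

```python
import numpy as np
from typing import Optional

def estimate_delay(first_feat: np.ndarray, second_feat: np.ndarray,
                   hop: int = 1, threshold: float = 0.98,
                   frame_duration_s: float = 0.01) -> Optional[float]:
    """Find the first position at which the tail of the first feature
    sequence matches the second feature sequence, and convert its offset
    into the playback delay in seconds."""
    for offset in range(0, len(first_feat), hop):
        overlap = min(len(first_feat) - offset, len(second_feat))
        if overlap < 2:
            break
        a = first_feat[offset: offset + overlap]
        b = second_feat[:overlap]
        denom = np.std(a) * np.std(b)
        if denom == 0:
            continue
        # matching degree: Pearson correlation of the overlapping segments
        score = np.mean((a - a.mean()) * (b - b.mean())) / denom
        if score >= threshold:
            return offset * frame_duration_s   # voice playing time up to the first position
        # otherwise push the starting position backwards by `hop` frames and retry
    return None   # no position reached the threshold
```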

And S306, acquiring sampling parameters of the microphone.

The sampling parameters include the audio sampling rate, the number of sample bits, and the number of channels.

The audio sampling rate may be the number of times the recording device samples the audio signal in a unit time, and may be, for example, 24000Hz, 48000Hz, or the like. The sampling bit number may be the resolution of sound processed by the sound card, and may be 16bit, 24bit, 32bit, or the like, for example. The number of channels may be the number of channels of sound, for example, may be 1 channel, 2 channels, and so on.

And S307, determining the preset signal insertion quantity N according to the time delay and the sampling parameters.

The preset signal may be zero data; n is an integer greater than 1.

The preset signal insertion number N can be determined by the following formula I:

wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.
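Formula I itself is an image that is not reproduced in this text. If each preset signal is taken to be one zero-valued byte of PCM data, the variables listed above would imply N = d x R x (B/8) x C; the sketch below encodes that reading purely as an assumption.

```python
def preset_insert_count(delay_s: float, sample_rate: int,
                        bits_per_sample: int, channels: int) -> int:
    """Zero-byte count covering delay_s seconds of audio, assuming one
    preset signal is one byte of PCM data (an assumption; formula I is
    not reproduced in the source text)."""
    return int(delay_s * sample_rate * (bits_per_sample // 8) * channels)

# e.g. a 0.8 s delay at 48000 Hz, 16-bit, 2-channel audio -> 153600 zero bytes
n = preset_insert_count(0.8, 48000, 16, 2)
```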

And S308, determining a starting storage position corresponding to the second voice signal in the buffer.

The buffer records the storage time of each voice signal, so the starting storage position corresponding to the second voice signal can be determined in the buffer according to the storage time of the second voice signal.

S309, adding N preset signals before the initial storage position in the cache to obtain a third voice signal.

The initial storage position of the third voice signal in the buffer memory is the storage position of the first preset signal in the N preset signals.

For ease of understanding, the addition of N preset signals before the starting memory location in the buffer will be described in detail below with reference to fig. 5.

Fig. 5 is a diagram illustrating the addition of N preset signals before the starting storage position in the buffer. Referring to fig. 5, the figure includes an image 501 and an image 502. The signal from start position 1 to the end position in image 501 is the second speech signal. N pieces of zero data (preset signals) are inserted before start position 1 of the second voice signal to obtain the third voice signal, which is the signal from start position 2 to the end position in image 502.
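A minimal sketch of this insertion step over a flat byte buffer follows; the ring-buffer bookkeeping used in a real system is not shown, and the function and variable names are illustrative assumptions.

```python
def prepend_zeros(buffer: bytearray, start_pos: int, n: int) -> int:
    """Insert n zero bytes immediately before start_pos (start position 1,
    the first byte of the second voice signal) and return the starting
    storage position of the third voice signal (start position 2, i.e. the
    position of the first inserted zero byte)."""
    buffer[start_pos:start_pos] = bytes(n)   # splice n zero bytes into the buffer
    return start_pos

# usage sketch: calibrate a buffered reference by 153600 zero bytes
# buf = bytearray(pcm_bytes)                 # pcm_bytes: buffered PCM data (assumed)
# new_start = prepend_zeros(buf, 0, 153600)
```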

And S310, processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal.

It should be noted that the execution process of S310 may refer to the execution process of S205, and is not described herein again.

In the embodiment shown in fig. 3, the first voice signal collected by the microphone and the second voice signal (reference signal) in the buffer are obtained first, the voice features of the first voice signal and of the second voice signal (reference signal) are then determined, and the time delay with which the speaker plays the voice signal in the buffer is determined by matching the two sets of features. The number N of preset signals is determined from the time delay and the sampling parameters of the microphone, and the second voice signal (reference signal) is calibrated with the preset signals to obtain the third voice signal. EC/NR processing is then performed on the first voice signal by using the third voice signal to obtain a pure user voice signal. Because the time delay between the third voice signal and the first voice signal is small, the user voice signal can be accurately extracted from the first voice signal according to the third voice signal, which improves the accuracy of voice processing. In addition, after the second voice signal has been calibrated once, only a minor re-calibration is needed periodically (for example, every 3 or 5 minutes), and calibration is not required before every voice recognition, which saves calibration time.

On the basis of any of the above embodiments, the following describes the speech processing method in detail by using a specific example shown in fig. 6. Fig. 6 is a schematic diagram of a speech processing method according to an embodiment of the present application. Referring to fig. 6, the method includes steps 1, 2 and 3.

Step 1: acquire a first voice signal 601 collected by the microphone in a preset time period, and acquire a second voice signal 602 for the preset time period from the buffer; assume the first position determined in the first voice signal is position A. The voice playing time of position A in the first voice signal 601 is 0.8 s and the voice playing time of the starting position of the first voice signal is 0 s; subtracting the two voice playing times gives the time delay, which is 0.8 s.

Step 2: acquire the sampling parameters of the microphone, determine the insertion number N of zero data (preset signals) from the time delay and the sampling parameters through formula I, and calibrate the second speech signal by inserting N pieces of zero data (preset signals) before the start position of the second speech signal, resulting in a third speech signal 603.

Wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.

Step 3: perform EC/NR processing on the first voice signal by using the third voice signal to obtain a pure user voice signal 604.

In the embodiment shown in fig. 6, the time delay and the sampling parameter are used to calibrate the second voice signal, so that the time delay between the third voice signal and the first voice signal is relatively small, and therefore, the third voice signal can perform relatively accurate EC/NR processing on the first voice signal to obtain a pure voice instruction sent by the user, and further, the accuracy of the voice processing is improved. In addition, when the calibration processing is performed on the second speech signal, the signal itself is not subjected to complicated calculation, and therefore, the time for the calibration processing is short.
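Putting the steps of fig. 6 together, the sketch below chains the helper functions from the earlier sketches (all illustrative assumptions, assumed to be in scope, and working on mono float sample arrays rather than the byte-level buffer).

```python
import numpy as np

def extract_user_voice(mic_pcm: np.ndarray, buffer_pcm: np.ndarray,
                       sample_rate: int = 48000) -> np.ndarray:
    """End-to-end sketch: feature extraction, delay estimation,
    reference calibration with leading zeros, then EC/NR."""
    hop = 480                                                   # matches the feature hop used earlier
    f1 = zero_crossing_rate_sequence(mic_pcm, 960, hop)         # first voice feature
    f2 = zero_crossing_rate_sequence(buffer_pcm, 960, hop)      # second voice feature
    delay = estimate_delay(f1, f2, frame_duration_s=hop / sample_rate) or 0.0
    pad = np.zeros(int(delay * sample_rate))                    # zero samples covering the delay
    ref = np.concatenate([pad, buffer_pcm])                     # third voice signal
    return nlms_echo_cancel(mic_pcm, ref)                       # pure user voice signal
```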

Fig. 7 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application. Referring to fig. 7, the speech processing apparatus 10 may comprise a first acquisition module 11, a second acquisition module 12, a determination module 13, a calibration module 14 and a processing module 15, wherein,

the first acquisition module 11 is configured to acquire a first voice signal collected by a microphone within a preset time period, where the first voice signal includes a user voice signal and a voice signal played by a speaker within the preset time period;

the second acquisition module 12 is configured to acquire, from the buffer, a second voice signal corresponding to the preset time period;

the determination module 13 is configured to determine, according to the first voice signal and the second voice signal, a time delay with which the speaker plays the voice signal in the buffer;

the calibration module 14 is configured to perform calibration processing on the second voice signal according to the time delay to obtain a third voice signal;

the processing module 15 is configured to process the first voice signal according to the third voice signal to extract the user voice signal from the first voice signal.

In a possible embodiment, the calibration module 14 is specifically configured to:

acquiring sampling parameters of a microphone, wherein the sampling parameters comprise an audio sampling rate, a sampling bit number and a channel number;

determining the insertion number N of preset signals according to the time delay and sampling parameters, wherein N is an integer greater than 1;

and adding N preset signals before the second voice signal to obtain a third voice signal.

In a possible embodiment, the calibration module 14 is specifically configured to:

according to the time delay and the sampling parameters, determining the insertion number N of the preset signals by the following formula I:

wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.

In a possible embodiment, the calibration module 14 is specifically configured to:

determining a starting storage position corresponding to the second voice signal in the buffer;

and adding the N preset signals before the starting storage position in the buffer to obtain the third voice signal, wherein the starting storage position of the third voice signal in the buffer is the storage position of the first of the N preset signals.

In a possible implementation, the determination module 13 is specifically configured to:

determining a first voice characteristic corresponding to the first voice signal;

determining a second voice characteristic corresponding to the second voice signal;

and matching the first voice characteristic and the second voice characteristic to obtain the time delay.

In a possible implementation, the determination module 13 is specifically configured to:

determining a first position in the first voice characteristic, wherein a matching degree between the part of the first voice characteristic after the first position and the second voice characteristic is greater than or equal to a first threshold;

and determining, as the time delay, the voice playing duration between the starting position of the first voice characteristic and the first position.

Fig. 8 is a schematic structural diagram of another speech processing apparatus according to an embodiment of the present application. On the basis of fig. 7, please refer to fig. 8, the speech processing apparatus 10 further comprises a control module 16, wherein,

the control module 16 is used for determining a control instruction according to the user voice signal; and controlling the corresponding vehicle-mounted equipment according to the control instruction.

The speech processing apparatus 10 provided in the embodiment of the present application can execute the technical solutions shown in the above method embodiments, and the implementation principles and the beneficial effects thereof are similar, and are not described again here.

Fig. 9 is a schematic structural diagram of a speech processing device according to an embodiment of the present application. Referring to fig. 9, the voice processing apparatus 20 may include: memory 21, processor 22. Illustratively, the memory 21, the processor 22, and the various parts are interconnected by a bus 23.

Memory 21 is used to store program instructions;

the processor 22 is configured to execute the program instructions stored in the memory to cause the speech processing apparatus 20 to execute the speech processing method described above.

The speech processing device shown in the embodiment of fig. 9 may execute the technical solutions shown in the above method embodiments, and the implementation principles and beneficial effects thereof are similar and will not be described herein again.

The embodiment of the present application provides a computer-readable storage medium in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the above-mentioned voice processing method is implemented.

Embodiments of the present application may also provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method for processing speech can be implemented.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the application. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
