Voice processing method, device and equipment

Document No.: 193323    Publication date: 2021-11-02

Reading note: This technology, "A voice processing method, device and equipment", was designed and created by Chen Xiaoqiang and Pu Yinhua on 2021-07-28. The main content is as follows: The embodiment of the application provides a voice processing method, apparatus and device, which are applied to a voice system comprising a microphone and a loudspeaker. The method includes: acquiring a first voice signal collected by the microphone in a preset time period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset time period; acquiring, from a buffer, a second voice signal corresponding to the preset time period; determining, according to the first voice signal and the second voice signal, the time delay with which the loudspeaker plays the voice signal in the buffer; calibrating the second voice signal according to the time delay to obtain a third voice signal; and processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal. The accuracy of voice processing is thereby improved.

1. A speech processing method, applied to a speech system, wherein the speech system comprises a microphone and a loudspeaker, and the method comprises the following steps:

acquiring a first voice signal acquired by the microphone in a preset time period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset time period;

acquiring, from a buffer, a second voice signal corresponding to the preset time period;

determining, according to the first voice signal and the second voice signal, a time delay with which the loudspeaker plays the voice signal in the buffer;

calibrating the second voice signal according to the time delay to obtain a third voice signal;

and processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal.

2. The speech processing method according to claim 1, wherein calibrating the second speech signal according to the time delay to obtain a third speech signal comprises:

acquiring sampling parameters of the microphone, wherein the sampling parameters comprise an audio sampling rate, a sampling bit number and a channel number;

determining a preset signal insertion quantity N according to the time delay and the sampling parameter, wherein N is an integer greater than 1;

and adding N preset signals before the second voice signal to obtain the third voice signal.

3. The speech processing method of claim 2, wherein determining the preset signal insertion quantity N according to the time delay and the sampling parameter comprises:

according to the time delay and the sampling parameter, determining the insertion number N of the preset signals by the following formula I:

wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.

4. The speech processing method according to claim 2 or 3, wherein adding N preset signals before the second speech signal to obtain the third speech signal comprises:

determining a starting storage position corresponding to the second voice signal in the buffer;

and adding the N preset signals before the starting storage position in the buffer to obtain the third voice signal, wherein the starting storage position of the third voice signal in the buffer is the storage position of the first of the N preset signals.

5. The speech processing method of any one of claims 1-4, wherein determining, according to the first voice signal and the second voice signal, the time delay with which the loudspeaker plays the voice signal in the buffer comprises:

determining a first voice characteristic corresponding to the first voice signal;

determining a second voice characteristic corresponding to the second voice signal;

and matching the first voice characteristic and the second voice characteristic to obtain the time delay.

6. The speech processing method according to claim 5, wherein matching the first voice characteristic and the second voice characteristic to obtain the time delay comprises:

determining a first position in the first voice characteristic, wherein a matching degree between the part of the first voice characteristic after the first position and the second voice characteristic is greater than or equal to a first threshold;

and determining, as the time delay, the voice playing duration between the starting position of the first voice characteristic and the first position.

7. The speech processing method according to any one of claims 1-6, wherein the speech system is a vehicle-mounted speech system, the method further comprising:

determining a control instruction according to the user voice signal;

and controlling the corresponding vehicle-mounted equipment according to the control instruction.

8. A speech processing device is characterized by comprising a first acquisition module, a second acquisition module, a determination module, a calibration module and a processing module, wherein,

the first acquisition module is used for acquiring a first voice signal acquired by the microphone in a preset time period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset time period;

the second acquisition module is configured to acquire, from a buffer, a second voice signal corresponding to the preset time period;

the determination module is configured to determine, according to the first voice signal and the second voice signal, a time delay with which the loudspeaker plays the voice signal in the buffer;

the calibration module is used for calibrating the second voice signal according to the time delay to obtain a third voice signal;

the processing module is configured to process the first voice signal according to the third voice signal, so as to extract the user voice signal from the first voice signal.

9. A speech processing device, comprising: a processor, a memory,

the memory stores computer-executable instructions;

the processor executes the computer-executable instructions stored by the memory, causing the processor to perform the speech processing method of any of claims 1-7.

10. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, implement the speech processing method of any one of claims 1 to 7.

11. A computer program product, characterized by comprising a computer program which, when executed by a processor, implements the speech processing method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of audio processing technologies, and in particular, to a method, an apparatus, and a device for processing speech.

Background

A user can control vehicle-mounted equipment in a vehicle through voice instructions. When the user issues a voice instruction, the multimedia equipment in the vehicle may at the same time be playing multimedia voice signals such as music or radio broadcasts, so the voice instruction issued by the user cannot be accurately recognized.

The multimedia device usually stores a multimedia voice signal in a buffer first, and then plays the multimedia voice signal from the buffer. In the related art, after a microphone in the vehicle collects a speech signal to be recognized (containing both a speech instruction and the multimedia speech signal), the speech signal to be recognized can be processed using the multimedia signal in the buffer to obtain the speech instruction issued by the user. However, there is usually a certain time delay between the multimedia signal contained in the speech signal collected by the microphone and the multimedia speech signal in the buffer, so the speech instruction cannot be accurately extracted from the speech signal to be recognized, and the accuracy of speech processing is poor.

Disclosure of Invention

The application relates to a voice processing method, a voice processing apparatus and voice processing equipment, which reduce the time delay between a reference signal and a collected voice signal and improve the accuracy of voice processing.

In a first aspect, an embodiment of the present application provides a speech processing method, which is applied to a speech system, where the speech system includes a microphone and a loudspeaker, and the method includes:

acquiring a first voice signal acquired by the microphone in a preset time period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset time period;

acquiring, from a buffer, a second voice signal corresponding to the preset time period;

determining, according to the first voice signal and the second voice signal, a time delay with which the loudspeaker plays the voice signal in the buffer;

calibrating the second voice signal according to the time delay to obtain a third voice signal;

and processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal.

In a possible implementation manner, performing calibration processing on the second voice signal according to the time delay to obtain a third voice signal includes:

acquiring sampling parameters of the microphone, wherein the sampling parameters comprise an audio sampling rate, a sampling bit number and a channel number;

determining a preset signal insertion quantity N according to the time delay and the sampling parameter, wherein N is an integer greater than 1;

and adding N preset signals before the second voice signal to obtain the third voice signal.

In a possible implementation, determining the preset signal insertion number N according to the time delay and the sampling parameter includes:

according to the time delay and the sampling parameter, determining the insertion number N of the preset signals by the following formula I:

wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.

In a possible implementation manner, adding N preset signals before the second speech signal to obtain the third speech signal includes:

determining a starting storage position corresponding to the second voice signal in the buffer;

and adding the N preset signals before the starting storage position in the buffer to obtain the third voice signal, wherein the starting storage position of the third voice signal in the buffer is the storage position of the first of the N preset signals.

In a possible implementation manner, determining, according to the first voice signal and the second voice signal, the time delay with which the loudspeaker plays the voice signal in the buffer includes:

determining a first voice characteristic corresponding to the first voice signal;

determining a second voice characteristic corresponding to the second voice signal;

and matching the first voice characteristic and the second voice characteristic to obtain the time delay.

In a possible implementation manner, matching the first voice characteristic and the second voice characteristic to obtain the time delay includes:

determining a first position in the first voice characteristic, wherein a matching degree between the part of the first voice characteristic after the first position and the second voice characteristic is greater than or equal to a first threshold;

and determining, as the time delay, the voice playing duration between the starting position of the first voice characteristic and the first position.

In one possible embodiment, the speech system is a car-mounted speech system, and the method further includes:

determining a control instruction according to the user voice signal;

and controlling the corresponding vehicle-mounted equipment according to the control instruction.

In a second aspect, an embodiment of the present application provides a speech processing apparatus, including a first acquisition module, a second acquisition module, a determination module, a calibration module, and a processing module, wherein,

the first acquisition module is used for acquiring a first voice signal acquired by the microphone in a preset time period, wherein the first voice signal comprises a user voice signal and a voice signal played by the loudspeaker in the preset time period;

the second acquisition module is configured to acquire, from a buffer, a second voice signal corresponding to the preset time period;

the determination module is configured to determine, according to the first voice signal and the second voice signal, a time delay with which the loudspeaker plays the voice signal in the buffer;

the calibration module is used for calibrating the second voice signal according to the time delay to obtain a third voice signal;

the processing module is configured to process the first voice signal according to the third voice signal, so as to extract the user voice signal from the first voice signal.

In a possible implementation, the calibration module is specifically configured to:

acquiring sampling parameters of the microphone, wherein the sampling parameters comprise an audio sampling rate, a sampling bit number and a channel number;

determining a preset signal insertion quantity N according to the time delay and the sampling parameter, wherein N is an integer greater than 1;

and adding N preset signals before the second voice signal to obtain the third voice signal.

In a possible implementation, the calibration module is specifically configured to:

according to the time delay and the sampling parameter, determining the insertion number N of the preset signals by the following formula I:

wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.

In a possible implementation, the calibration module is specifically configured to:

determining a starting storage position corresponding to the second voice signal in the buffer;

and adding the N preset signals before the starting storage position in the buffer to obtain the third voice signal, wherein the starting storage position of the third voice signal in the buffer is the storage position of the first of the N preset signals.

In a possible implementation, the determination module is specifically configured to:

determining a first voice characteristic corresponding to the first voice signal;

determining a second voice characteristic corresponding to the second voice signal;

and matching the first voice characteristic and the second voice characteristic to obtain the time delay.

In a possible implementation, the determination module is specifically configured to:

determining a first position in the first voice characteristic, wherein a matching degree between the part of the first voice characteristic after the first position and the second voice characteristic is greater than or equal to a first threshold;

and determining, as the time delay, the voice playing duration between the starting position of the first voice characteristic and the first position.

In one possible implementation, the speech processing apparatus further comprises a control module,

the control module is used for determining a control instruction according to the user voice signal; and controlling the corresponding vehicle-mounted equipment according to the control instruction.

In a third aspect, an embodiment of the present application provides a speech processing apparatus, including: a processor, a memory,

the memory stores computer-executable instructions;

the processor executes computer-executable instructions stored by the memory, causing the processor to perform the speech processing method of the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the speech processing method according to the first aspect is implemented.

In a fifth aspect, the present application provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the speech processing method according to the first aspect.

According to the voice processing method, apparatus and device provided by the embodiments of the application, a first voice signal collected by a microphone and a second voice signal (reference signal) in a buffer are obtained first; the time delay with which the loudspeaker plays the voice signal in the buffer is determined according to the first voice signal and the second voice signal (reference signal); the second voice signal (reference signal) is calibrated according to the time delay to obtain a third voice signal; and echo cancellation/noise reduction (EC/NR) processing is performed on the first voice signal by using the third voice signal to obtain a pure user voice signal. Because the time delay between the third voice signal and the first voice signal is small, the user voice signal can be accurately extracted from the first voice signal according to the third voice signal, which improves the accuracy of voice processing.

Drawings

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;

fig. 2 is a schematic flowchart of a speech processing method according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of another speech processing method according to an embodiment of the present application;

fig. 4 is a schematic flow chart of a matching process provided in the embodiment of the present application;

fig. 5 is a schematic diagram of adding N preset signals before the starting storage position in the buffer;

fig. 6 is a schematic diagram of a speech processing method according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of another speech processing apparatus according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of a speech processing device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For ease of understanding, an application scenario to which the embodiment of the present application is applied is described below with reference to fig. 1.

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. Referring to fig. 1, a vehicle may be provided with a voice system. The voice system may comprise a microphone, a voice processing module, a voice recognition module and a multimedia player, and the multimedia player may comprise a buffer module and a loudspeaker. The audio signal is buffered in the buffer module of the multimedia player, and when the multimedia player is started, the loudspeaker plays the audio signal in the buffer module.

One or more vehicle-mounted devices may be arranged in the vehicle, and a user can control the vehicle-mounted devices by issuing voice commands. When the user speaks, the voice signal collected by the microphone may include both the voice instruction uttered by the user and the audio signal (background noise) played by the loudspeaker of the multimedia player. The voice processing module performs EC/NR processing on the collected voice signal to obtain the pure voice command issued by the user, and then transmits it to the voice recognition module for recognition.

In the related art, the speech processing module may perform EC/NR processing on the collected speech signal by using a reference signal, where the reference signal is the audio signal buffered in the buffer module of the multimedia player. There is usually a time delay between the reference signal and the background noise in the collected voice signal, so the reference signal cannot support good EC/NR processing of the collected voice signal; as a result, the voice instruction cannot be accurately extracted from the speech signal to be recognized, and the accuracy of voice processing is poor.

In order to obtain a pure voice instruction sent by a user, an embodiment of the present application provides a voice processing method, which first determines a time delay between a reference signal and a voice signal collected by a microphone, calibrates the reference signal according to the time delay, and performs EC/NR processing on the collected voice signal by using the calibrated reference signal. By calibrating the reference signal, the time delay between the calibrated reference signal and the collected voice signal can be reduced, so that the voice command can be accurately extracted from the voice signal to be recognized according to the calibrated reference signal, and the accuracy of voice processing is improved.

The technical means shown in the present application will be described in detail below with reference to specific examples. It should be noted that the following embodiments may exist independently or may be combined with each other, and the description of the same or similar content is not repeated in different embodiments.

Fig. 2 is a flowchart illustrating a speech processing method according to an embodiment of the present application. Referring to fig. 2, the method may include:

s201, acquiring a first voice signal acquired by a microphone in a preset time period.

The execution entity of the embodiment of the application may be a vehicle, or a voice processing apparatus arranged in the vehicle; the voice processing apparatus may be implemented in software, or in a combination of software and hardware.

The first voice signal comprises a user voice signal and a voice signal played by a loudspeaker in a preset time period; the voice signal of the user may be a voice instruction sent by the user, and the voice signal played by the speaker may be a voice signal obtained by playing the audio signal in the buffer memory by the speaker.

The preset period may be a period from the start of the emission of the user voice signal to the end of the user voice signal, and for example, the preset period may be 20 seconds, 1 minute, or the like.

S202, acquiring a second voice signal in a preset time period in the buffer.

The buffer may be a circular buffer, and the buffer may store an audio signal to be played, for example, the audio signal may be a multimedia audio signal.

The second voice signal is an audio signal pre-stored in the buffer. The second speech signal may be a multimedia audio signal, e.g. the second speech signal may be music, radio, etc. The second speech signal may be used as a reference signal for processing the first speech signal.

And S203, determining the time delay of the loudspeaker for playing the voice signal in the buffer memory according to the first voice signal and the second voice signal.

The voice signal has a certain time delay from the buffer to the loudspeaker for playing, and the time delay can be 200 milliseconds, 2 seconds and the like.

The time delay of the speaker for playing the voice signal in the buffer memory can be determined by the following method: determining a first voice characteristic corresponding to the first voice signal; determining a second voice characteristic corresponding to the second voice signal; and matching the first voice characteristic and the second voice characteristic to obtain the time delay.

For example, suppose the duration of the first speech feature is T1 seconds, the duration of the second speech feature is T2 seconds, and the starting time of the first speech feature is 0 seconds. If the portion of the first speech feature after 1.5 s matches the second speech feature, the time delay is determined to be 1.5 s.

And S204, calibrating the second voice signal according to the time delay to obtain a third voice signal.

A preset signal with a certain duration may be added before the second voice signal according to the time delay to obtain a third voice signal. For example, assuming that the time delay is t, a preset signal with a time duration of t may be added before the second voice signal to obtain a third voice signal.

The time delay between the third voice signal and the first voice signal is smaller than the time delay between the second voice signal and the first voice signal.

And S205, processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal.

The first speech signal may be subjected to EC/NR processing using the third speech signal as the reference signal.
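The source does not specify which EC/NR algorithm is applied. As a minimal sketch, the normalized-LMS (NLMS) adaptive echo canceller below illustrates one common way to subtract the reference (the third speech signal) from the microphone capture; the function name, filter length and step size are illustrative assumptions, not values from the patent, and mono float arrays are assumed.

```python
import numpy as np

def nlms_echo_cancel(mic: np.ndarray, ref: np.ndarray,
                     filter_len: int = 256, mu: float = 0.5,
                     eps: float = 1e-6) -> np.ndarray:
    """Estimate the echo of `ref` (calibrated reference) contained in `mic`
    with an adaptive FIR filter and subtract it; the residual approximates
    the user voice signal."""
    n = min(len(mic), len(ref))
    w = np.zeros(filter_len)               # adaptive filter taps
    history = np.zeros(filter_len)         # most recent reference samples
    out = np.zeros(n)
    for i in range(n):
        history = np.roll(history, 1)
        history[0] = ref[i]
        echo_est = w @ history             # estimated playback echo
        e = mic[i] - echo_est              # error = microphone minus echo estimate
        out[i] = e
        # normalized LMS tap update
        w += (mu / (eps + history @ history)) * e * history
    return out
```

In practice the echo-cancellation stage would typically be followed by a separate noise-reduction stage; only the EC half is sketched here.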

In the embodiment shown in fig. 2, a first voice signal collected by a microphone and a second voice signal (reference signal) in a buffer are obtained first; determining the time delay of the loudspeaker for playing the voice signal in the buffer memory according to the first voice signal and the second voice signal (reference signal); calibrating the second voice signal (reference signal) according to the time delay to obtain a third voice signal; and carrying out EC/NR processing on the first voice signal by utilizing the third voice signal to obtain a pure user voice signal. Because the time delay between the third voice signal and the first voice signal is small, the user voice signal can be accurately extracted from the first voice signal according to the third voice signal, and the accuracy of voice processing is further improved.

Based on any of the above embodiments, the following describes the speech processing method in detail with reference to the embodiment shown in fig. 3.

Fig. 3 is a flowchart illustrating another speech processing method according to an embodiment of the present application. Referring to fig. 3, the method may include:

s301, acquiring a first voice signal acquired by a microphone in a preset time period.

It should be noted that the execution process of S301 may refer to the execution process of S201, and is not described herein again.

S302, acquiring a second voice signal for the preset time period from the buffer.

It should be noted that the execution process of S302 may refer to the execution process of S202, and is not described herein again.

S303, determining a first voice characteristic corresponding to the first voice signal.

The first speech feature may be a sequence of time domain frames, a sequence of short-term zero-crossing rates, a sequence of spectral centroids, and/or a sequence of mel-frequency cepstral coefficients of the first speech signal, etc.

S304, determining a second voice characteristic corresponding to the second voice signal.

The second speech feature may be a sequence of time domain frames, a sequence of short-term zero-crossing rates, a sequence of spectral centroids, and/or a sequence of mel-frequency cepstral coefficients of the second speech signal, etc.
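As an illustration of one of the candidate features listed above, the sketch below computes a frame-wise short-term zero-crossing-rate sequence; the frame length and hop (960 and 480 samples, i.e. 20 ms and 10 ms at 48 kHz) are assumptions made for the example, not values given in the source.

```python
import numpy as np

def zero_crossing_rate_sequence(signal: np.ndarray,
                                frame_len: int = 960,
                                hop: int = 480) -> np.ndarray:
    """Short-term zero-crossing rate, one value per frame."""
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    zcr = np.empty(n_frames)
    for k in range(n_frames):
        frame = signal[k * hop: k * hop + frame_len]
        # fraction of consecutive sample pairs whose sign changes
        zcr[k] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return zcr
```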

S305, matching the first voice characteristic and the second voice characteristic to obtain time delay.

The matching process may be performed in the following manner: determining a first position in the first voice characteristic, wherein a matching degree between the part of the first voice characteristic after the first position and the second voice characteristic is greater than or equal to a first threshold; and determining, as the time delay, the voice playing duration between the starting position of the first voice characteristic and the first position.

The duration of the first speech feature is the same as the duration of the first speech signal.

Optionally, the second speech feature may be matched against the first speech feature. If the matching degree between the second speech feature and the first speech feature is smaller than the first threshold, the starting position of the first speech feature is pushed backwards by P1, and the matching degree between the second speech feature and the part of the first speech feature after the pushed starting position is determined. If this matching degree is greater than or equal to the first threshold, P1 is determined as the first position; otherwise, the starting position is pushed backwards further by P2, and the above process is repeated until the first position is determined.

The first threshold may be 98%, 100%, etc.

For ease of understanding, the matching process is described in detail below with reference to fig. 4.

Fig. 4 is a schematic flow chart of the matching process according to the embodiment of the present application. Referring to fig. 4, a first speech feature 401 and a second speech feature 402 may be determined, where the duration of the first speech feature is the same as the duration of the first speech signal, and the duration of the second speech feature is the same as the duration of the second speech signal.

A first position may be determined on the first speech feature 401, where a matching degree of the speech feature after the first position in the first speech feature with the second speech feature is greater than a preset threshold. Assuming that the corresponding time at the first position is 1 second and the starting time of the first speech feature is 0 second, the time delay is determined to be 1 second.
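The source does not fix the matching metric or the step sizes P1, P2. The sketch below slides a candidate first position forward in fixed hops and uses a Pearson correlation between the overlapping feature segments as the matching degree; both choices are assumptions made for illustration.

```python
import numpy as np
from typing import Optional

def estimate_delay(first_feat: np.ndarray, second_feat: np.ndarray,
                   hop: int = 1, threshold: float = 0.98,
                   frame_duration_s: float = 0.01) -> Optional[float]:
    """Find the first position at which the tail of the first feature
    sequence matches the second feature sequence, and convert its offset
    into the playback delay in seconds."""
    for offset in range(0, len(first_feat), hop):
        overlap = min(len(first_feat) - offset, len(second_feat))
        if overlap < 2:
            break
        a = first_feat[offset: offset + overlap]
        b = second_feat[:overlap]
        denom = np.std(a) * np.std(b)
        if denom == 0:
            continue
        # matching degree: Pearson correlation of the overlapping segments
        score = np.mean((a - a.mean()) * (b - b.mean())) / denom
        if score >= threshold:
            return offset * frame_duration_s   # voice playing time up to the first position
        # otherwise push the starting position backwards by `hop` frames and retry
    return None   # no position reached the threshold
```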

And S306, acquiring sampling parameters of the microphone.

The sampling parameters include the audio sampling rate, the number of sample bits, and the number of channels.

The audio sampling rate may be the number of times the recording device samples the audio signal in a unit time, and may be, for example, 24000Hz, 48000Hz, or the like. The sampling bit number may be the resolution of sound processed by the sound card, and may be 16bit, 24bit, 32bit, or the like, for example. The number of channels may be the number of channels of sound, for example, may be 1 channel, 2 channels, and so on.

And S307, determining the preset signal insertion quantity N according to the time delay and the sampling parameters.

The preset signal may be zero data; n is an integer greater than 1.

The preset signal insertion number N can be determined by the following formula I:

wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.
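Formula I itself is an image that is not reproduced in this text. If each preset signal is taken to be one zero-valued byte of PCM data, the variables listed above would imply N = d x R x (B/8) x C; the sketch below encodes that reading purely as an assumption.

```python
def preset_insert_count(delay_s: float, sample_rate: int,
                        bits_per_sample: int, channels: int) -> int:
    """Zero-byte count covering delay_s seconds of audio, assuming one
    preset signal is one byte of PCM data (an assumption; formula I is
    not reproduced in the source text)."""
    return int(delay_s * sample_rate * (bits_per_sample // 8) * channels)

# e.g. a 0.8 s delay at 48000 Hz, 16-bit, 2-channel audio -> 153600 zero bytes
n = preset_insert_count(0.8, 48000, 16, 2)
```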

And S308, determining a starting storage position corresponding to the second voice signal in the buffer.

The buffer records the storage time of each voice signal, so the starting storage position corresponding to the second voice signal can be determined in the buffer according to the storage time of the second voice signal.

S309, adding N preset signals before the initial storage position in the cache to obtain a third voice signal.

The initial storage position of the third voice signal in the buffer memory is the storage position of the first preset signal in the N preset signals.

For ease of understanding, the addition of N preset signals before the starting memory location in the buffer will be described in detail below with reference to fig. 5.

Fig. 5 is a diagram illustrating the addition of N preset signals before the starting storage position in the buffer. Referring to fig. 5, the figure includes an image 501 and an image 502. The signal from start position 1 to the end position in image 501 is the second speech signal. N pieces of zero data (preset signals) are inserted before start position 1 of the second voice signal to obtain the third voice signal, which is the signal from start position 2 to the end position in image 502.
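A minimal sketch of this insertion step over a flat byte buffer follows; the ring-buffer bookkeeping used in a real system is not shown, and the function and variable names are illustrative assumptions.

```python
def prepend_zeros(buffer: bytearray, start_pos: int, n: int) -> int:
    """Insert n zero bytes immediately before start_pos (start position 1,
    the first byte of the second voice signal) and return the starting
    storage position of the third voice signal (start position 2, i.e. the
    position of the first inserted zero byte)."""
    buffer[start_pos:start_pos] = bytes(n)   # splice n zero bytes into the buffer
    return start_pos

# usage sketch: calibrate a buffered reference by 153600 zero bytes
# buf = bytearray(pcm_bytes)                 # pcm_bytes: buffered PCM data (assumed)
# new_start = prepend_zeros(buf, 0, 153600)
```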

And S310, processing the first voice signal according to the third voice signal so as to extract the user voice signal from the first voice signal.

It should be noted that the execution process of S310 may refer to the execution process of S205, and is not described herein again.

In the embodiment shown in fig. 3, the first voice signal collected by the microphone and the second voice signal (reference signal) in the buffer are obtained first, the voice features of the first voice signal and of the second voice signal (reference signal) are then determined, and the time delay with which the speaker plays the voice signal in the buffer is determined by matching the two sets of features. The number N of preset signals is determined from the time delay and the sampling parameters of the microphone, and the second voice signal (reference signal) is calibrated with the preset signals to obtain the third voice signal. EC/NR processing is then performed on the first voice signal by using the third voice signal to obtain a pure user voice signal. Because the time delay between the third voice signal and the first voice signal is small, the user voice signal can be accurately extracted from the first voice signal according to the third voice signal, which improves the accuracy of voice processing. In addition, after the second voice signal has been calibrated once, only a minor re-calibration is needed periodically (for example, every 3 or 5 minutes), and calibration is not required before every voice recognition, which saves calibration time.

On the basis of any of the above embodiments, the following describes the speech processing method in detail by using a specific example shown in fig. 6. Fig. 6 is a schematic diagram of a speech processing method according to an embodiment of the present application. Referring to fig. 6, the method includes steps 1, 2 and 3.

Step 1: acquire a first voice signal 601 collected by the microphone in a preset time period, and acquire a second voice signal 602 for the preset time period from the buffer; assume the first position determined in the first voice signal is position A. The voice playing time of position A in the first voice signal 601 is 0.8 s and the voice playing time of the starting position of the first voice signal is 0 s; subtracting the two voice playing times gives the time delay, which is 0.8 s.

Step 2: acquire the sampling parameters of the microphone, determine the insertion number N of zero data (preset signals) from the time delay and the sampling parameters through formula I, and calibrate the second speech signal by inserting N pieces of zero data (preset signals) before the start position of the second speech signal, resulting in a third speech signal 603.

Wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.

Step 3: perform EC/NR processing on the first voice signal by using the third voice signal to obtain a pure user voice signal 604.

In the embodiment shown in fig. 6, the time delay and the sampling parameter are used to calibrate the second voice signal, so that the time delay between the third voice signal and the first voice signal is relatively small, and therefore, the third voice signal can perform relatively accurate EC/NR processing on the first voice signal to obtain a pure voice instruction sent by the user, and further, the accuracy of the voice processing is improved. In addition, when the calibration processing is performed on the second speech signal, the signal itself is not subjected to complicated calculation, and therefore, the time for the calibration processing is short.
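Putting the steps of fig. 6 together, the sketch below chains the helper functions from the earlier sketches (all illustrative assumptions, assumed to be in scope, and working on mono float sample arrays rather than the byte-level buffer).

```python
import numpy as np

def extract_user_voice(mic_pcm: np.ndarray, buffer_pcm: np.ndarray,
                       sample_rate: int = 48000) -> np.ndarray:
    """End-to-end sketch: feature extraction, delay estimation,
    reference calibration with leading zeros, then EC/NR."""
    hop = 480                                                   # matches the feature hop used earlier
    f1 = zero_crossing_rate_sequence(mic_pcm, 960, hop)         # first voice feature
    f2 = zero_crossing_rate_sequence(buffer_pcm, 960, hop)      # second voice feature
    delay = estimate_delay(f1, f2, frame_duration_s=hop / sample_rate) or 0.0
    pad = np.zeros(int(delay * sample_rate))                    # zero samples covering the delay
    ref = np.concatenate([pad, buffer_pcm])                     # third voice signal
    return nlms_echo_cancel(mic_pcm, ref)                       # pure user voice signal
```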

Fig. 7 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application. Referring to fig. 7, the speech processing apparatus 10 may comprise a first acquisition module 11, a second acquisition module 12, a determination module 13, a calibration module 14 and a processing module 15, wherein,

the first acquisition module 11 is configured to acquire a first voice signal collected by a microphone within a preset time period, where the first voice signal includes a user voice signal and a voice signal played by a speaker within the preset time period;

the second acquisition module 12 is configured to acquire, from the buffer, a second voice signal corresponding to the preset time period;

the determination module 13 is configured to determine, according to the first voice signal and the second voice signal, a time delay with which the speaker plays the voice signal in the buffer;

the calibration module 14 is configured to perform calibration processing on the second voice signal according to the time delay to obtain a third voice signal;

the processing module 15 is configured to process the first voice signal according to the third voice signal to extract the user voice signal from the first voice signal.

In a possible embodiment, the calibration module 14 is specifically configured to:

acquiring sampling parameters of a microphone, wherein the sampling parameters comprise an audio sampling rate, a sampling bit number and a channel number;

determining the insertion number N of preset signals according to the time delay and sampling parameters, wherein N is an integer greater than 1;

and adding N preset signals before the second voice signal to obtain a third voice signal.

In a possible embodiment, the calibration module 14 is specifically configured to:

according to the time delay and the sampling parameters, determining the insertion number N of the preset signals by the following formula I:

wherein d is the time delay, R is the sampling rate, B is the number of sampling bits, and C is the number of channels.

In a possible embodiment, the calibration module 14 is specifically configured to:

determining a starting storage position corresponding to the second voice signal in the buffer;

and adding the N preset signals before the starting storage position in the buffer to obtain the third voice signal, wherein the starting storage position of the third voice signal in the buffer is the storage position of the first of the N preset signals.

In a possible implementation, the determination module 13 is specifically configured to:

determining a first voice characteristic corresponding to the first voice signal;

determining a second voice characteristic corresponding to the second voice signal;

and matching the first voice characteristic and the second voice characteristic to obtain the time delay.

In a possible implementation, the determination module 13 is specifically configured to:

determining a first position in the first voice characteristic, wherein a matching degree between the part of the first voice characteristic after the first position and the second voice characteristic is greater than or equal to a first threshold;

and determining, as the time delay, the voice playing duration between the starting position of the first voice characteristic and the first position.

Fig. 8 is a schematic structural diagram of another speech processing apparatus according to an embodiment of the present application. On the basis of fig. 7, please refer to fig. 8, the speech processing apparatus 10 further comprises a control module 16, wherein,

the control module 16 is used for determining a control instruction according to the user voice signal; and controlling the corresponding vehicle-mounted equipment according to the control instruction.

The speech processing apparatus 10 provided in the embodiment of the present application can execute the technical solutions shown in the above method embodiments, and the implementation principles and the beneficial effects thereof are similar, and are not described again here.

Fig. 9 is a schematic structural diagram of a speech processing device according to an embodiment of the present application. Referring to fig. 9, the voice processing apparatus 20 may include: memory 21, processor 22. Illustratively, the memory 21, the processor 22, and the various parts are interconnected by a bus 23.

Memory 21 is used to store program instructions;

the processor 22 is configured to execute the program instructions stored in the memory to cause the speech processing apparatus 20 to execute the speech processing method described above.

The speech processing device shown in the embodiment of fig. 9 may execute the technical solutions shown in the above method embodiments, and the implementation principles and beneficial effects thereof are similar and will not be described herein again.

The embodiment of the present application provides a computer-readable storage medium in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the above-mentioned voice processing method is implemented.

Embodiments of the present application may also provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method for processing speech can be implemented.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the application. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
