Virtual sound insulation communication method with audio recognition function


Reading note: the invention "Virtual sound insulation communication method with audio recognition function" was designed and created by Zhu Zhihui on 2021-08-10. Its main content is as follows: the invention discloses a virtual sound insulation communication method with an audio recognition function. The method comprises an audio signal acquisition step, a proximity signal detection step, an acquisition instruction adjustment step, a virtual audio sound insulation step, an output signal quality evaluation step, and an audio recognition model updating step. The audio signal acquisition step acquires the input audio; the proximity signal detection step detects a proximity signal; the acquisition instruction adjustment step adjusts the audio signal acquisition instruction based on the detected proximity signal; the virtual audio sound insulation step outputs a signal after sound insulation processing based on an audio recognition model; and if the conclusion of the output signal quality evaluation step is negative, the audio recognition model updating step is executed. The technical scheme of the invention can adaptively perform sound insulation and noise elimination in different scenes.

1. A virtual sound insulation communication method, the method comprising the steps of:

an audio signal acquisition step, wherein the audio signal acquisition step sends an audio acquisition instruction based on a state monitoring instruction and is used for acquiring an externally input audio signal;

a proximity signal detection step of detecting whether or not a proximity signal exists after the audio signal acquisition step;

if the proximity signal is detected, executing an acquisition instruction adjustment step;

the acquisition instruction adjustment step adjusts, based on the detected proximity signal, the acquisition mode that the audio acquisition instruction of the audio signal acquisition step uses for the externally input audio signal;

a virtual audio sound insulation step, wherein the virtual audio sound insulation step performs sound insulation processing, based on an audio recognition model, on the externally input audio signal acquired in the audio signal acquisition step and outputs the processed signal;

an output signal quality evaluation step of evaluating quality of an output signal of the virtual audio sound insulation step;

if the quality evaluation does not meet the preset condition, executing an audio recognition model updating step;

the audio recognition model updating step updates the audio recognition model used in the virtual audio sound insulation step.

2. The method of claim 1, wherein:

the state monitoring instruction is issued based on the detected interaction state.

3. The method of claim 1, wherein:

the audio signal acquisition step further comprises a preprocessing step;

and the preprocessing step preprocesses the acquired externally input audio signal, wherein the preprocessing comprises signal pre-emphasis, framing, and endpoint detection.

4. The method of claim 3, wherein:

said framing operation is performed with a sliding time window function that is related to the transfer function used in said signal pre-emphasis step.

5. The method of claim 4, wherein:

the pre-emphasis step uses a transfer function of: [expression not reproduced in the source], wherein [the condition on the filter parameter α is not reproduced]; z is the transfer-function variable.

6. The method of claim 5, wherein:

the sliding time window function expression adopted by the framing operation is as follows: [window-function expression not reproduced in the source]; N is the length of each frame.

7. The method of claim 4 or 5, wherein:

in the framing operation, the sequence length D of the overlap between adjacent frame sequences satisfies the following condition: D > N/2, where N is the length of each frame.

8. A virtual sound insulation communication method applied to a mobile terminal including a plurality of proximity sensors and a plurality of sound pickup arrays, the method comprising the steps of:

S901: judging whether the mobile terminal is in an interactive state, wherein the interactive state comprises a voice call or a video call being started;

if yes, go to step S902;

S902: turning on the plurality of pickup arrays;

S903: determining whether at least one of the plurality of proximity sensors detects a proximity signal;

if yes, go to step S904; otherwise, go to step S905;

S904: adjusting the state of the pickup arrays according to the type of the detected proximity signal, and proceeding to step S905;

S905: collecting audio input signals through the pickup arrays, performing sound insulation processing on the audio input signals using an audio AI (artificial intelligence) processor built into the mobile terminal, and outputting the processed signals;

S906: performing quality evaluation on the output signal, and judging whether it meets a preset standard;

if yes, returning to step S903;

if not, sending a feedback signal to the audio AI processor, so that the audio AI processor updates the audio AI recognition model and then returns to step S903.

9. The method of claim 8, wherein:

after the audio input signal is collected in step S905, it is preprocessed and then input to the audio AI processor; the preprocessing comprises the following steps:

S9051: pre-emphasizing the audio input signal with a high-pass filter having a transfer function of: [expression not reproduced in the source], wherein [the condition on the filter parameter α is not reproduced]; z is the transfer-function variable;

S9052: framing the pre-emphasized audio input signal with a sliding time window associated with the high-pass filter;

S9053: denoising the framed sequence using spectral subtraction;

S9054: performing endpoint detection on the denoised sequence to obtain the audio frequency-domain features between every two adjacent endpoints.

10. A computer-readable storage medium having stored thereon computer-executable program instructions which, when executed by a processor in cooperation with a memory, implement the method of any one of claims 1-7.

Technical Field

The invention belongs to the field of intelligent communication, and particularly relates to a virtual sound insulation communication method with an audio recognition function.

Background

As is well known, voice is the most natural and convenient way for humans to communicate; it is one of the most direct modes of human-computer interaction and is widely regarded as central to the next generation of human-computer interaction. With the popularization of embedded mobile devices such as smartphones and tablet computers, and the gradual maturation of core voice technologies and their application environments, voice interaction is being accepted and used by more and more users worldwide.

The inventors have realized that the speech recognition effect required by the user differs across interactive scenarios. For example, an end-to-end voice call in a private setting is only slightly, or even negligibly, affected by noise, whereas in a multi-party call, such as a live video call or a hands-free/speakerphone call, external noise is the dominant interference factor; different call modes and sound pickup modes are therefore required.

In this regard, the Chinese patent application with application number CN201910607790.2 proposes a virtual sound insulation communication method, device, system, electronic device, and storage medium. That virtual sound insulation communication method, based on optical communication, comprises the following steps: determining a public area and a private area separated by light based on optical communication; collecting first voice data; separating sound source voice data of one or more sound sources from the first voice data; determining the sound source position of each sound source according to its voice data; filtering out, from the first voice data, sound source voice data whose sound source position lies in the private area; and carrying out voice communication with the filtered first voice data, so as to realize intelligent sound insulation in audio calls and/or video calls. In addition, the application with publication No. CN107148782A discloses an audio system with configurable zones, which can be configured to output beams of audio representing channels of one or more pieces of sound program content into independent zones based on the positioning of users, audio sources, and/or speaker arrays.

However, the above prior art still fails to address the automatic switching and recognition of the call mode and the pickup mode across different interactive scenarios.

Disclosure of Invention

To solve this technical problem, the invention provides a virtual sound insulation communication method with an audio recognition function. The method comprises an audio signal acquisition step, a proximity signal detection step, an acquisition instruction adjustment step, a virtual audio sound insulation step, an output signal quality evaluation step, and an audio recognition model updating step. The audio signal acquisition step acquires the input audio; the proximity signal detection step detects a proximity signal; the acquisition instruction adjustment step adjusts the audio signal acquisition instruction based on the detected proximity signal; the virtual audio sound insulation step outputs a signal after sound insulation processing based on an audio recognition model; and if the conclusion of the output signal quality evaluation step is negative, the audio recognition model updating step is executed.

The technical scheme of the invention can adaptively perform sound insulation and noise elimination in different scenes, and realizes automatic switching and recognition of the call mode and the sound pickup mode in different interactive scenarios.

The virtual sound insulation method comprises the following steps:

an audio signal acquisition step, wherein the audio signal acquisition step sends an audio acquisition instruction based on a state monitoring instruction and is used for acquiring an externally input audio signal;

a proximity signal detection step of detecting whether or not a proximity signal exists after the audio signal acquisition step;

if the proximity signal is detected, executing an acquisition instruction adjustment step;

the acquisition instruction adjustment step adjusts, based on the detected proximity signal, the acquisition mode that the audio acquisition instruction of the audio signal acquisition step uses for the externally input audio signal;

a virtual audio sound insulation step, wherein the virtual audio sound insulation step performs sound insulation processing, based on an audio recognition model, on the externally input audio signal acquired in the audio signal acquisition step and outputs the processed signal;

an output signal quality evaluation step of evaluating quality of an output signal of the virtual audio sound insulation step;

if the quality evaluation does not meet the preset condition, executing an audio recognition model updating step;

the audio recognition model updating step updates the audio recognition model used in the virtual audio sound insulation step.

In a specific application, the system can be used on an interactive mobile terminal. The interactive mobile terminal comprises at least one human-computer interaction interface, which provides a setting option for configuring the state-correspondence control relation between the proximity detection signals and the pickup arrays; this control relation specifies the control states of the pickup arrays corresponding to different proximity signal types in different scenes.
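As an illustration only, the following minimal Python sketch shows one way such a state-correspondence control relation could be represented and queried on the terminal; the scene names, proximity-signal types, and microphone identifiers are hypothetical placeholders rather than terms prescribed by the invention.

from typing import Dict

# Pickup-array state: which microphones are on (True) or off (False).
PICKUP_STATE = Dict[str, bool]

# scene -> proximity-signal type -> pickup-array control state (all names illustrative)
STATE_CONTROL_RELATION: Dict[str, Dict[str, PICKUP_STATE]] = {
    "voice_call": {
        "none":          {"mic1": True,  "mic2": True,  "mic3": False},
        "top_sensor":    {"mic1": False, "mic2": True,  "mic3": False},
        "side_sensor":   {"mic1": True,  "mic2": False, "mic3": False},
        "bottom_sensor": {"mic1": False, "mic2": False, "mic3": True},
    },
    "video_call": {
        "none":          {"mic1": True,  "mic2": True,  "mic3": True},
        "top_sensor":    {"mic1": False, "mic2": True,  "mic3": True},
    },
}

def pickup_state_for(scene: str, proximity_type: str) -> PICKUP_STATE:
    """Look up the configured pickup-array state for a scene and proximity-signal type."""
    return STATE_CONTROL_RELATION[scene].get(
        proximity_type, STATE_CONTROL_RELATION[scene]["none"])

# e.g. pickup_state_for("voice_call", "top_sensor") -> {"mic1": False, "mic2": True, "mic3": False}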

At this time, the virtual sound insulation communication method according to the present invention may be applied to a mobile terminal including a plurality of proximity sensors and a plurality of sound pickup arrays.

Therefore, based on the above mobile terminal, the method of the present invention can be implemented with the following steps S901 to S906:

S901: judging whether the mobile terminal is in an interactive state, wherein the interactive state comprises a voice call or a video call being started;

if yes, go to step S902;

S902: turning on the plurality of pickup arrays;

S903: determining whether at least one of the plurality of proximity sensors detects a proximity signal;

if yes, go to step S904; otherwise, go to step S905;

S904: adjusting the state of the pickup arrays according to the type of the detected proximity signal, and proceeding to step S905;

S905: collecting audio input signals through the pickup arrays, performing sound insulation processing on the audio input signals using an audio AI (artificial intelligence) processor built into the mobile terminal, and outputting the processed signals;

S906: performing quality evaluation on the output signal, and judging whether it meets a preset standard;

if yes, returning to step S903;

if not, sending a feedback signal to the audio AI processor, so that the audio AI processor updates the audio AI recognition model and then returns to step S903.

The method provided by the second aspect may be implemented automatically on at least one mobile terminal based on the system provided by the first aspect, and the implementation manner may be in the form of programmed instructions and the like.

The advantages and key technical means of the invention at least comprise:

(1) audio recognition processing is carried out by an audio AI processor whose recognition model can be updated, so that various existing noise reduction and recognition technologies can be effectively fused;

(2) a pickup mode switching module receives the proximity detection signals of the proximity sensors and controls the states of the first pickup array and the second pickup array based on these signals, so that automatic switching and recognition of the call mode and the pickup mode are realized in different interactive scenarios.

Further advantages of the invention will be apparent in the detailed description section in conjunction with the drawings attached hereto.

Drawings

To more clearly illustrate the embodiments of the present invention and the technical solutions in the prior art, the drawings needed for the embodiments are briefly described below. The drawings described below represent only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a main flow diagram of a virtual sound insulation communication method according to an embodiment of the present invention

FIG. 2 is a flow chart of a virtual sound insulation communication method implemented based on a mobile terminal

FIG. 3 is a schematic diagram of pre-processing an audio input signal

FIG. 4 is a specific hardware configuration diagram of a mobile terminal implementing the method of the present invention

It should be noted that figs. 1-4 are merely schematic illustrations; they do not represent the locations of actual structures, and the depicted positions and sizes are only relative.

Detailed Description

The invention is further described with reference to the following drawings and detailed description.

In fig. 1, the method comprises the following steps:

an audio signal acquisition step, wherein the audio signal acquisition step sends an audio acquisition instruction based on a state monitoring instruction and is used for acquiring an externally input audio signal;

a proximity signal detection step of detecting whether or not a proximity signal exists after the audio signal acquisition step;

if the proximity signal is detected, executing an acquisition instruction adjustment step;

the acquisition instruction adjustment step adjusts, based on the detected proximity signal, the acquisition mode that the audio acquisition instruction of the audio signal acquisition step uses for the externally input audio signal;

a virtual audio sound insulation step, wherein the virtual audio sound insulation step performs sound insulation processing, based on an audio recognition model, on the externally input audio signal acquired in the audio signal acquisition step and outputs the processed signal;

an output signal quality evaluation step of evaluating quality of an output signal of the virtual audio sound insulation step;

if the quality evaluation does not meet the preset condition, executing an audio recognition model updating step;

the audio recognition model updating step updates the audio recognition model used in the virtual audio sound insulation step.

More specifically, the state monitoring instruction is issued based on the detected interaction state.

Wherein the audio signal acquisition step further comprises a preprocessing step;

and the preprocessing step preprocesses the acquired externally input audio signal, wherein the preprocessing comprises signal pre-emphasis, framing, and endpoint detection.

The framing operation is performed with a sliding time window function that is related to the transfer function used in the signal pre-emphasis step.

The pre-emphasis step uses a transfer function of: [expression not reproduced in the source], wherein [the condition on the filter parameter α is not reproduced]; z is the transfer-function variable.

The sliding time window function expression adopted by the framing operation is as follows: [window-function expression not reproduced in the source]; N is the length of each frame.

In the framing operation, the sequence length D of the overlap between adjacent frame sequences satisfies D > N/2, where N is the length of each frame.
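A minimal NumPy sketch of the pre-emphasis and framing just described is given below. Because the patent's exact transfer function and window expression are not reproduced in this text, the sketch assumes the conventional first-order pre-emphasis filter H(z) = 1 - alpha*z^-1 and a Hamming window; only the constraint that the overlap D of adjacent frames is larger than N/2 is taken directly from the description.

import numpy as np

def pre_emphasize(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    # Assumed standard pre-emphasis y[n] = x[n] - alpha * x[n-1],
    # i.e. H(z) = 1 - alpha * z^-1 (the patent's own expression is not given here).
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal: np.ndarray, frame_len: int, overlap: int) -> np.ndarray:
    # Condition from the description: the overlap D of adjacent frames exceeds N/2.
    assert overlap > frame_len // 2, "overlap D must satisfy D > N/2"
    hop = frame_len - overlap
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)  # placeholder window; the patent's sliding window
                                    # expression is not reproduced in the source text
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames

# Example: 25 ms frames at 16 kHz (N = 400) with overlap D = 240 > N/2 = 200.
x = np.random.randn(16000).astype(np.float32)
frames = frame_signal(pre_emphasize(x), frame_len=400, overlap=240)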

Fig. 2 illustrates a virtual sound insulation communication method applied to a mobile terminal including a plurality of proximity sensors and a plurality of pickup arrays.

In particular, with reference to fig. 2, the method comprises the following steps (a minimal code sketch of this control loop is given after the step list):

S901: judging whether the mobile terminal is in an interactive state, wherein the interactive state comprises a voice call or a video call being started;

if yes, go to step S902;

S902: turning on the plurality of pickup arrays;

S903: determining whether at least one of the plurality of proximity sensors detects a proximity signal;

if yes, go to step S904; otherwise, go to step S905;

S904: adjusting the state of the pickup arrays according to the type of the detected proximity signal, and proceeding to step S905;

S905: collecting audio input signals through the pickup arrays, performing sound insulation processing on the audio input signals using an audio AI (artificial intelligence) processor built into the mobile terminal, and outputting the processed signals;

S906: performing quality evaluation on the output signal, and judging whether it meets a preset standard;

if yes, returning to step S903;

if not, sending a feedback signal to the audio AI processor, so that the audio AI processor updates the audio AI recognition model and then returns to step S903.
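For illustration only, the following Python sketch mirrors the S901-S906 control loop. The terminal, sensor, pickup-array, and AI-processor interfaces used here (is_interacting, read_proximity, adjust_pickup, collect, process, evaluate_quality, update_model) are hypothetical names invented to make the flow concrete; they are not part of the claimed method.

def virtual_sound_insulation_loop(terminal, pickup_arrays, sensors, ai_processor):
    # S901: proceed only while a voice call or video call is active.
    if not terminal.is_interacting():
        return
    # S902: turn on the pickup arrays.
    for array in pickup_arrays:
        array.turn_on()
    while terminal.is_interacting():
        # S903: check whether any proximity sensor reports a proximity signal.
        proximity = None
        for sensor in sensors:
            signal = sensor.read_proximity()
            if signal is not None:
                proximity = signal
                break
        if proximity is not None:
            # S904: adjust the pickup-array state according to the signal type.
            terminal.adjust_pickup(pickup_arrays, proximity)
        # S905: collect audio and apply AI sound-insulation processing.
        audio_in = [array.collect() for array in pickup_arrays if array.is_on()]
        audio_out = ai_processor.process(audio_in)
        terminal.output(audio_out)
        # S906: evaluate output quality; if it fails the preset standard, send
        # feedback so the AI recognition model is updated, then loop back to S903.
        if not terminal.evaluate_quality(audio_out):
            ai_processor.update_model(feedback=audio_out)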

As a further preference, see fig. 3.

After the audio input signal is collected in step S905, it is preprocessed and then input to the audio AI processor; the preprocessing comprises the following steps (a code sketch of steps S9053 and S9054 follows the list):

S9051: pre-emphasizing the audio input signal with a high-pass filter having a transfer function of: [expression not reproduced in the source], wherein [the condition on the filter parameter α is not reproduced]; z is the transfer-function variable;

S9052: framing the pre-emphasized audio input signal with a sliding time window associated with the high-pass filter;

the sliding time window function expression is as follows: [window-function expression not reproduced in the source]; N is the length of each frame;

S9053: denoising the framed sequence using spectral subtraction;

S9054: performing endpoint detection on the denoised sequence to obtain the audio frequency-domain features between every two adjacent endpoints.
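Spectral subtraction and short-time-energy endpoint detection are standard techniques, so the sketch below shows only a generic NumPy version of steps S9053 and S9054; the noise-estimation strategy (using the first few frames as the noise estimate) and the energy threshold are assumptions made for illustration, not details taken from the patent.

import numpy as np

def spectral_subtraction(frames: np.ndarray, noise_frames: int = 5) -> np.ndarray:
    # S9053 (generic form): estimate the noise magnitude spectrum from the first
    # few frames and subtract it from every frame's magnitude spectrum.
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    noise_mag = mag[:noise_frames].mean(axis=0)
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=frames.shape[1], axis=1)

def detect_endpoints(frames: np.ndarray, threshold_ratio: float = 0.1) -> list:
    # S9054 (generic form): mark frames whose short-time energy exceeds a fraction
    # of the maximum frame energy, and return (start, end) frame-index pairs.
    energy = (frames ** 2).sum(axis=1)
    active = energy > threshold_ratio * energy.max()
    endpoints, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            endpoints.append((start, i))
            start = None
    if start is not None:
        endpoints.append((start, len(active)))
    return endpoints

def segment_features(frames: np.ndarray, endpoints: list) -> list:
    # Frequency-domain feature between adjacent endpoints, here the mean magnitude spectrum.
    return [np.abs(np.fft.rfft(frames[s:e], axis=1)).mean(axis=0) for s, e in endpoints]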

In the above embodiments, after the audio input signal is acquired by the pickup arrays and preprocessed, the audio AI recognition model of the audio AI processor performs noise reduction and recognition, including single-microphone noise reduction, dual-microphone noise reduction, near-field speech recognition, far-field speech recognition and noise reduction, and the like. These methods are common in the art and are not described in detail here; reference may be made to the following related technical documents:

Jonghee Han, Sunhyun Yook, Kyoung Won Nam. Comparative evaluation of voice activity detectors in single microphone noise reduction algorithms [J]. Biomed Eng Lett, 2012(2): 255-264.

Yao Jian. Research on Microphone Array Signal Processing Technology [D]. Harbin Engineering University, 2012: 7-32.

Gillespie B W, Malvar H S, Florêncio D A F. Speech dereverberation via maximum-kurtosis subband adaptive filtering [C] // Proceedings of the 2001 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'01), Salt Lake City: IEEE Press, 2001: 3701-3704.

Study on Speech Enhancement Technology in the Dawn Far-Field Speech Recognition System [D]. Chongqing University of Posts and Telecommunications, 2019.

However, as an important algorithmic improvement of the present invention for increasing speech recognition efficiency, the sliding time window function adopted in step S9052 ensures that the overlap length D of adjacent frame sequences satisfies D > N/2, i.e., the overlap is guaranteed to be larger than half of the frame length, thereby balancing recognition accuracy and processing efficiency.

By contrast, the window functions currently in common use for framing speech signals are mainly the conventional rectangular window and the Hamming window; their overlap length is at most half of the frame length, and their parameters are unrelated to the parameter selection of the high-pass filter's transfer function.

In the above embodiment of the present invention, the parameters of the sliding time window function are strongly coupled to the parameter of the high-pass filter's transfer function (i.e., α) and can be changed adaptively, so that the algorithm executes faster.
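Because the patent's own window expression is omitted from this text, the fragment below only illustrates the general idea of such a coupling: the overlap parameter is derived from the pre-emphasis coefficient α rather than being fixed independently. The specific mapping from α to the overlap shown here is a hypothetical example, not the claimed formula.

def overlap_from_alpha(frame_len: int, alpha: float) -> int:
    # Hypothetical coupling for illustration only: the overlap grows with the
    # pre-emphasis coefficient alpha while always remaining larger than N/2.
    return frame_len // 2 + max(1, int(frame_len * 0.25 * alpha))

# e.g. alpha = 0.97 and N = 400 samples give D = 200 + 97 = 297 > N/2.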

Fig. 4 is a specific hardware system configuration diagram of a mobile terminal implementing the method of the present invention.

Referring to fig. 4, the system includes an audio memory, an audio processor, an audio output assembly, and a pickup array.

The system also comprises a plurality of proximity sensors and a pickup mode switching module;

the pickup array is connected with the audio memory;

the plurality of proximity sensors are all connected with the pickup mode switching module;

the pickup mode switching module is connected with the pickup array;

the pickup array comprises a first pickup array and a second pickup array;

Based on the embodiment of fig. 4, the first pickup array includes a first microphone and a second microphone; the second pickup array includes a third microphone; the third microphone is distinct from both the first microphone and the second microphone.

In fig. 4, the audio processor is an audio AI processor including at least one updatable audio AI recognition model. The system further comprises an audio preprocessing module connected to the audio AI processor and the audio memory.

The system also comprises a self-feedback module, wherein the self-feedback module is connected with the audio output assembly;

the self-feedback module performs quality evaluation on the audio output by the audio output assembly and judges whether it meets a preset standard; if not, it sends a feedback signal to the audio AI processor so that the audio AI processor updates the audio AI recognition model.
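A small sketch of this self-feedback loop follows; the quality metric (a crude SNR-style estimate against an assumed noise floor) and the update_model hook are illustrative assumptions used only to show the control flow, not the system's actual evaluation criterion.

import numpy as np

class SelfFeedbackModule:
    def __init__(self, ai_processor, snr_threshold_db: float = 15.0):
        self.ai_processor = ai_processor
        self.snr_threshold_db = snr_threshold_db   # assumed preset standard

    def estimate_snr_db(self, audio: np.ndarray, noise_floor: float = 1e-4) -> float:
        # Crude quality proxy: ratio of output signal power to an assumed noise floor.
        signal_power = float(np.mean(audio ** 2))
        return float(10.0 * np.log10(signal_power / noise_floor + 1e-12))

    def evaluate(self, output_audio: np.ndarray) -> bool:
        # Returns True if the output meets the preset standard; otherwise sends a
        # feedback signal so the AI processor updates its recognition model.
        if self.estimate_snr_db(output_audio) >= self.snr_threshold_db:
            return True
        self.ai_processor.update_model(feedback=output_audio)
        return False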

As a core contribution of the present invention over the prior art, the sound pickup mode switching module receives the proximity detection signals of the proximity sensors and controls the states of the first sound pickup array and the second sound pickup array based on these signals.

The system of fig. 4 is applicable to a mobile terminal; the mobile terminal is provided with a proximity sensor on each of the top edge side and the left and right edge sides; the first pickup array is located at the top edge side portion of the mobile terminal and the second pickup array is located at a bottom edge side portion of the mobile terminal.

At this time, the mobile terminal may be an interactive mobile terminal including at least one human-computer interactive interface, such as a smart phone, a laptop, and the like.

The human-computer interaction interface provides a setting option for configuring the state-correspondence control relation between the proximity detection signals and the pickup arrays; this control relation specifies the control states of the pickup arrays corresponding to different proximity signal types in different scenes.

In one scenario, if the pickup mode switching module does not receive a proximity detection signal from any sensor, the third microphone of the second pickup array is kept in an on state.

In another scenario, if the proximity detection signal is from the first proximity sensor, the pickup mode switching module turns off the first microphone of the first pickup array.

In another scenario, if the proximity detection signal is from the second proximity sensor, the pickup mode switching module turns off the second microphone of the first pickup array.

In another scenario, if the proximity detection signal is from the third proximity sensor, the pickup mode switching module turns off the first microphone and the second microphone of the first pickup array while turning on the third microphone of the second pickup array.

In one scenario, if the pickup mode switching module does not receive a proximity detection signal from any sensor, the third microphone of the second pickup array is kept in an off state while the first microphone and the second microphone of the first pickup array are turned on.
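Purely for illustration, the scenario rules above can be summarized as the following switching function; the first/second/third sensor and microphone names follow the description, while the function interface itself (the turn_on/turn_off methods and the proximity_source argument) is a hypothetical sketch of one possible configuration.

def switch_pickup_mode(proximity_source, mic1, mic2, mic3):
    # proximity_source: None, or "first" / "second" / "third" proximity sensor.
    if proximity_source is None:
        # No proximity signal: depending on the configured scene, either keep the
        # third microphone on, or (as here) keep it off and use the first and second.
        mic1.turn_on()
        mic2.turn_on()
        mic3.turn_off()
    elif proximity_source == "first":
        mic1.turn_off()                  # signal from the first proximity sensor
    elif proximity_source == "second":
        mic2.turn_off()                  # signal from the second proximity sensor
    elif proximity_source == "third":
        mic1.turn_off()                  # signal from the third proximity sensor
        mic2.turn_off()
        mic3.turn_on()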

In practice, by configuring a plurality of proximity sensors and a plurality of pickup arrays on the mobile terminal, the terminal can adaptively sense the current usage scene and switch to the corresponding pickup configuration; audio recognition processing is carried out by an audio AI processor whose recognition model can be updated, so that various existing noise reduction and recognition technologies can be effectively fused; and the pickup mode switching module receives the proximity detection signals of the proximity sensors and controls the states of the first and second pickup arrays accordingly, so that automatic switching and recognition of the call mode and pickup mode are realized in different interactive scenarios.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
