Voice signal processing method, device, system, equipment and storage medium

Document No.: 1414840    Publication date: 2020-03-10

Reading note: This technology, "Voice signal processing method, device, system, equipment and storage medium", was designed and created by Tian Biao, He Zhaowei, and Yu Tao on 2018-08-31. The invention discloses a voice signal processing method, apparatus, system, device, and storage medium. The method comprises: acquiring a real-time image using an image acquisition device, performing face recognition on the real-time image, and detecting the time period during which a target person utters speech according to the face recognition result; performing sound source localization on an audio signal received by a microphone array, and determining azimuth information of the sound source in the audio signal; and performing speech start-stop point analysis according to the time period during which the target person utters speech in the real-time image and the azimuth information of the sound source, and determining the speech start and stop time points in the audio signal. According to the voice signal processing method provided by the embodiments of the invention, voice endpoint detection can be performed on voice signals in a noisy environment with multiple interference sources, improving the anti-interference capability of the system.

1. A speech signal processing method comprising:

acquiring a real-time image by using image acquisition equipment, performing face recognition by using the real-time image, and detecting a time period of voice of a target person according to a face recognition result;

carrying out sound source positioning on an audio signal received by a microphone array, and determining azimuth information of a sound source in the audio signal;

and analyzing the start and stop points of the voice according to the time period of the voice of the target person in the real-time image and the azimuth information of the sound source, and determining the start and stop time points of the voice in the audio signal.

2. The speech signal processing method according to claim 1, wherein said performing face recognition using the real-time image comprises:

detecting whether a human face image exists in the real-time image;

when a face image exists in the real-time image, identifying facial feature points of the face image, and determining feature points of the mouth edge in the face image.

3. The speech signal processing method according to claim 1, wherein the detecting a time period during which the target person utters speech according to the face recognition result comprises:

acquiring feature points of the human mouth edge in the face recognition result, and determining whether mouth opening and closing actions exist according to the feature value change information of the feature points of the human mouth edge;

taking the person with the mouth opening and closing action as the target person; and

and taking the continuous time period of the mouth opening and closing action of the target person in the real-time image as the time period of the voice of the target person.

4. The speech signal processing method according to claim 1, wherein the performing sound source localization on the audio signals received by the microphone array, and determining azimuth information of a sound source in the audio signals comprises:

obtaining, through the sound source localization, azimuth information of the sound source in the audio signal, wherein the azimuth information comprises a horizontal angle, a pitch angle, and a distance of the sound source relative to the microphone array.

5. The speech signal processing method according to claim 1, wherein the performing speech start-stop point analysis according to the time period during which the target person utters speech in the real-time image and the azimuth information of the sound source, and determining the speech start-stop time points in the audio signal, comprises:

determining a sound reception range of the microphone array according to the azimuth information of the sound source, and acquiring an audio signal in the sound reception range;

carrying out voice detection on the audio signals in the reception range, and determining the voice existence probability of the audio signals in the reception range;

and when the voice existence probability of the audio signal in the reception range is greater than a preset probability threshold, carrying out voice start and stop point analysis according to the time period of voice generation of the target person in the real-time image and the azimuth information of the sound source, and determining the voice start and stop time point in the audio signal.

6. The speech signal processing method according to claim 5, wherein said performing speech detection on the audio signal in the reception range and determining the speech existence probability of the audio signal in the reception range comprises:

extracting acoustic features of the audio signal through the voice detection;

comparing the feature value of the acoustic feature with a system threshold for the acoustic feature of a voice signal, and determining whether a voice signal exists in the audio signal according to the comparison result;

and determining the voice existence probability according to whether the voice signal exists in the audio signal.

7. The speech signal processing method according to claim 5, wherein said performing speech detection on the audio signal in the reception range and determining the speech existence probability of the audio signal in the reception range comprises:

determining, using a voice activity detection component, a probability of a voice signal being present in the audio signal, wherein,

the voice activity detection component is obtained by carrying out neural network model training in advance by using voice samples and non-voice samples.

8. The speech signal processing method according to claim 1, wherein the performing speech start-stop point analysis according to the time period during which the target person utters speech in the real-time image and the azimuth information of the sound source, and determining the speech start-stop time points in the audio signal, comprises:

determining a sound receiving range of the microphone array according to the azimuth information of the sound source;

acquiring an audio signal in the sound reception range, and determining the voice start-stop time point of the audio signal in the sound reception range;

and if the voice time period determined by the voice starting and stopping time point is in the time period of the voice of the target person, taking the voice starting and stopping time point of the audio signal in the sound reception range as the voice starting and stopping time point in the audio signal.

9. The speech signal processing method according to claim 8, wherein the determining the voice start-stop time point of the audio signal within the sound reception range comprises:

carrying out audio enhancement processing on the audio signal in the sound reception range;

and determining the voice start and stop point of the audio signal after the audio enhancement processing in the sound reception range.

10. A speech signal processing system comprising:

the image acquisition equipment is used for acquiring a real-time image;

a sound collecting device for receiving an audio signal;

and a data processing device for:

Carrying out face recognition by using the real-time image, and detecting a time period of voice production of a target person according to a face recognition result;

carrying out sound source positioning on an audio signal received by a microphone array, and determining azimuth information of a sound source in the audio signal;

and analyzing the start and stop points of the voice according to the time period of the voice of the target person in the real-time image and the azimuth information of the sound source, and determining the start and stop time points of the voice in the audio signal.

11. A speech signal processing apparatus comprising:

the face recognition module is used for acquiring a real-time image by using an image acquisition device, performing face recognition by using the real-time image, and detecting the time period during which a target person utters speech according to a face recognition result;

the sound source positioning module is used for carrying out sound source positioning on the audio signals received by the microphone array and determining the azimuth information of the sound source in the audio signals;

and the voice endpoint detection module is used for performing speech start-stop point analysis according to the time period during which the target person utters speech in the real-time image and the azimuth information of the sound source, and determining the speech start and stop time points in the audio signal.

12. A speech signal processing apparatus comprising a memory and a processor;

the memory is used for storing executable program codes;

the processor is configured to read executable program code stored in the memory to perform the speech signal processing method of any one of claims 1 to 9.

13. A computer-readable storage medium, comprising instructions which, when executed on a computer, cause the computer to perform the speech signal processing method according to any one of claims 1 to 9.

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a system, a device, and a storage medium for processing a voice signal.

Background

In a voice recognition system, accurate and effective Voice Activity Detection (VAD) reduces the amount of computation, shortens processing time, and removes noise interference from non-speech segments, thereby improving the accuracy of voice recognition. Since a speech signal contains not only useful speech segments but also useless background-noise segments, voice endpoint detection detects the starting point and ending point of speech in a given signal, dividing the signal into two classes: speech segments and silence (background noise) segments.

Disclosure of Invention

The embodiment of the invention provides a voice signal processing method, a voice signal processing device, a voice signal processing system, voice signal processing equipment and a storage medium, which can improve the anti-interference capability of a voice recognition system in a noisy environment with multiple interference sources.

According to an aspect of the embodiments of the present invention, there is provided a speech signal processing method, including:

performing face recognition within the visual range of the image acquisition device, and detecting whether a target person is speaking according to the face recognition result;

performing sound source localization on the received sound signal to be recognized, and determining the voice signal present in a target area by combining the sound source localization result with the detection result of whether the target person is speaking;

and carrying out voice endpoint detection on the voice signals existing in the target area to obtain voice segments to be recognized in the voice signals.

According to another aspect of the embodiments of the present invention, there is provided a voice signal processing apparatus configured to perform the following operations:

performing face recognition within the visual range of the image acquisition device, and detecting whether a target person is speaking according to the face recognition result;

performing sound source localization on the received sound signal to be recognized, and determining the voice signal present in a target area by combining the sound source localization result with the detection result of whether the target person is speaking;

and carrying out voice endpoint detection on the voice signals existing in the target area to obtain voice segments to be recognized in the voice signals.

According to still another aspect of embodiments of the present invention, there is provided a voice signal processing apparatus including: a memory and a processor; the memory is used for storing programs; the processor is used for reading the executable program codes stored in the memory to execute the voice signal processing method.

According to still another aspect of an embodiment of the present invention, there is provided a speech signal processing system including:

the image acquisition equipment is used for acquiring a real-time image;

a sound collecting device for receiving an audio signal;

the data processing equipment is used for carrying out face recognition by utilizing the real-time image and detecting a time period of voice of a target person according to a face recognition result; carrying out sound source positioning on the audio signals received by the microphone array, and determining azimuth information of sound sources in the audio signals; and analyzing the starting and stopping points of the voice according to the time period of the voice of the target person in the real-time image and the azimuth information of the sound source, and determining the starting and stopping time points of the voice in the audio signal.

According to a further aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the speech signal processing method of the above-described aspects.

The voice signal processing method, apparatus, system, device, and storage medium according to the embodiments of the present invention can perform voice endpoint detection on voice signals in a noisy environment with multiple interference sources, improving the anti-interference capability of the system.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below; those skilled in the art can derive other drawings from these drawings without creative effort.

Fig. 1 is a schematic view illustrating an application scenario of a voice signal processing method according to an exemplary embodiment of the present invention;

fig. 2 is a block configuration diagram showing a speech signal processing system according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a scene illustrating sound source localization of a target area by a microphone array according to an embodiment of the present invention;

FIG. 4 is a flow chart illustrating a speech signal processing method according to an embodiment of the present invention;

fig. 5 is a schematic configuration diagram showing a speech signal processing apparatus according to an embodiment of the present invention;

fig. 6 is a schematic diagram showing a hardware configuration of a speech signal processing system according to an embodiment of the present invention;

fig. 7 is a block diagram illustrating an exemplary hardware architecture of a computing device in which the speech signal processing method and apparatus according to the embodiments of the present invention may be implemented.

Detailed Description

Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.

It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

In the embodiments of the present invention, a voice signal processing system, such as a smart speaker, an intelligent voice shopping machine, an intelligent voice ticket vending machine, or an intelligent voice elevator, generally needs to perform voice signal acquisition and processing in a noisy environment with multiple interference sources, or in a real multi-person interaction environment such as a shopping mall, a subway station, or a social venue.

In the embodiments of the invention described below, a microphone array may be used for signal sampling and processing of sound signals arriving from different spatial directions in a noisy environment with multiple interference sources. Each acoustic sensor (e.g., microphone) of a microphone array may be referred to as an array element, and each microphone array comprises at least two array elements. Each array element can be regarded as one sound collection channel, so a microphone array comprising multiple array elements yields a multi-channel sound signal.

The microphone array in the embodiment of the invention can be an array formed by arranging a group of acoustic sensors at different positions in space according to a certain shape rule, and is a device for carrying out spatial sampling on a sound signal which is transmitted in space. The shape arrangement rule formed by arranging the acoustic sensors in the microphone array may be referred to as a topology of the microphone array, and the microphone array may be divided into a linear microphone array, a planar microphone array and a stereo microphone array according to the topology of the microphone array.

As an example, in a linear microphone array, the centers of the array elements lie on the same straight line, as in a horizontal array; in a planar microphone array, the centers of the array elements are distributed on a plane, as in a triangular, circular, T-shaped, L-shaped, or square array; and in a stereo (three-dimensional) microphone array, the centers of the array elements are distributed in three-dimensional space, as in a polyhedral or spherical array.

The speech signal processing method of the embodiments of the present invention does not limit the specific form of the microphone array used. As one example, the microphone array may be a horizontal array, a T-shaped array, an L-shaped array, or a square array.
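As a concrete illustration of these topologies, the element positions of a horizontal (linear) array and of a circular planar array can be generated as follows. This is only a sketch: the element counts and spacings are arbitrary values chosen for illustration, not parameters prescribed by the method.

```python
import numpy as np

def linear_array(n_mics: int, spacing: float) -> np.ndarray:
    """Element centers of a horizontal (linear) array along the X axis,
    centered on the array midpoint. Returns an (n_mics, 3) array in meters."""
    x = (np.arange(n_mics) - (n_mics - 1) / 2.0) * spacing
    return np.stack([x, np.zeros(n_mics), np.zeros(n_mics)], axis=1)

def circular_array(n_mics: int, radius: float) -> np.ndarray:
    """Element centers of a circular (planar) array in the XY plane,
    evenly spaced around the origin."""
    angles = 2 * np.pi * np.arange(n_mics) / n_mics
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles),
                     np.zeros(n_mics)], axis=1)

lin = linear_array(4, 0.05)     # 4 elements at 5 cm spacing: a horizontal array
circ = circular_array(6, 0.04)  # 6 elements on a 4 cm radius circle
```

Each row is one array element's center, so the arrays can serve directly as the element coordinates assumed later when localizing sources relative to the array.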

In the embodiments of the present invention, practical application scenarios of speech signal processing generally include various interference sources such as ambient noise, human-voice interference, reverberation, and echo. Reverberation can be understood as an acoustic phenomenon in which a sound signal is repeatedly reflected and absorbed by obstacles during propagation, so that delayed copies of the signal superimpose on it. Echo, also referred to as acoustic echo, can be understood as a repeated sound signal formed when sound played by the speaker of the speech processing device itself propagates and reflects in space and is picked up again by the microphone, creating noise interference. Together, these interference sources form a strongly interfering, complex, and changeable acoustic environment that degrades the quality of the user speech collected by the speech processing system.

In the embodiment of the present invention, a Multimodal (Multimodal) speech recognition system refers to a computer speech recognition system that performs speech recognition by using a plurality of information fusion methods. As an example, the multi-modal speech recognition system can process traditional audio information and improve the recognition effect of human-computer interaction through visual information of human face and mouth.

The following describes a practical application scenario of the voice signal processing method according to the embodiment of the present invention, taking voice ticket purchasing at a subway station as an example. Fig. 1 is a schematic diagram illustrating an application scenario of a speech signal processing method according to an exemplary embodiment of the present invention.

As shown in fig. 1, the voice ticket purchasing environment of the subway station may include a voice ticket purchasing system 100 and a ticket purchaser 101, and the voice ticket purchasing system 100 may include a display device 102, a voice processing device 103, and an image capturing device 104. The voice ticket purchasing system 100 enables the ticket purchaser 101 to purchase tickets through voice interaction, for example by specifying a station name or a fare, or by fuzzy search of a destination.

In one embodiment, the display device 102 may include a microphone array (not shown in the figure), and the speech processing device 103 may collect the sound signals from the actual ticketing environment in real time by using a plurality of sound collection channels provided by a plurality of array elements in the microphone array.

With continued reference to FIG. 1, in one embodiment, the display device 102 may be a large-screen display device for displaying suggested voice interaction instructions, i.e., example instructions that guide the voice interaction between the ticket purchaser 101 and the voice processing device 103, such as "I want to go to station B", "buy two tickets to station C", and "two tickets at fare A". After the voice processing device 103 has processed the destination in a voice instruction issued by the ticket purchaser 101, the display device 102 may invoke a map service to display recommended subway lines and the station closest to the destination. The display device 102 may also display payment information, so that the voice ticket purchasing system 100 issues the ticket after the ticket purchaser 101 pays according to the displayed payment information.

In an actual ticket buying environment, the sound signal to be recognized, collected by the voice processing device 103 using the microphone array, includes not only the target speech signal from the target sound source but also non-target signals from various interference sources such as ambient noise, human-voice interference, reverberation, and echo within the pickup range of the microphone array. As one example, the ambient noise may include operating noise of subway trains and noise generated by ventilation and air-conditioning equipment; the human-voice interference may be, for example, speech uttered by persons other than the ticket purchaser 101.

In order to pick up effective voice signals in a noisy environment with multiple interference sources and provide a stable voice recognition effect, embodiments of the present invention provide a voice signal processing method, apparatus, system, device, and storage medium, which can perform voice activity detection in noisy environments such as public places with multiple interference sources by combining multimodal information, including computer vision detection, sound source localization information, and speech probability detection, and extract a clean, accurate speech segment for voice recognition.
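The multimodal decision described above — keeping only those audio-detected speech segments that coincide with a visually detected speaking period of the target person — can be sketched as follows. The interval-overlap fusion rule is a simplification chosen for illustration; the embodiments may combine the cues differently.

```python
def interval_overlap(a, b):
    """Return the overlap (in seconds) of two (start, end) intervals, or 0.0."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def fuse_endpoints(vad_segments, mouth_segments):
    """Keep VAD speech segments that overlap any period in which the target
    person's mouth was observed opening and closing."""
    return [v for v in vad_segments
            if any(interval_overlap(v, m) > 0.0 for m in mouth_segments)]

# Two VAD segments; only the first coincides with a mouth-movement period,
# so the second is treated as interference from a non-target source.
kept = fuse_endpoints([(0.2, 1.0), (2.5, 3.1)], [(0.1, 1.3)])
```

In this sketch, the start and stop points of the surviving segments play the role of the speech start-stop time points determined from the audio signal.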

For better understanding of the present invention, the following will describe the speech signal processing method according to the embodiment of the present invention in detail with reference to fig. 2 and 3, and it should be noted that these embodiments are not intended to limit the scope of the present invention.

Fig. 2 shows a block diagram of a speech signal processing system according to an embodiment of the present invention, and fig. 3 is a schematic diagram of a scene in which a microphone array performs sound source localization on a target area according to an embodiment of the present invention.

As shown in fig. 2, the speech signal processing system 200 of the embodiment of the present invention may include a computer vision detection subsystem 210, a sound source three-dimensional information monitoring subsystem 220, a speech probability detection subsystem 230, and a speech endpoint detection subsystem 240.

In one embodiment, the vision-based speech detection refers to: the computer vision detection subsystem 210 performs face recognition in the visual range of the image acquisition equipment, and determines the time period of the voice of the target person according to the face recognition result.

In embodiments of the present invention, computer vision inspection subsystem 210 may include a video/image capture device and a video/image processing module.

In one embodiment, the real-time image is acquired by a video/image acquisition device such as a camera, which captures video or real-time images within its visible range; whether a human face exists in the visible range is detected from the captured video or images, and if a face exists, facial feature point information is extracted. Changes in the feature points along the mouth edge are then determined from the extracted facial feature points to decide whether the mouth is opening and closing; if so, that person is determined to be a target person who is speaking, and the continuous time period during which the target person's mouth opens and closes in the real-time image is taken as the time period during which the target person utters speech.

As an example, the extraction of facial visual feature point information may typically select feature points along the mouth edge and collect mouth feature information from them, such as mouth height, mouth width, mouth shape, jaw position, and jaw speed. Changes in the mouth feature point information are detected from the collected feature information, and the movement of the mouth muscles is estimated from those changes; for example, if the mouth performs opening and closing motions, it can be judged from these motions that the person is speaking.

In the embodiments of the present invention, to improve the efficiency of mouth feature detection, the mouth features need not be detected directly from the entire acquired image. Because a person's mouth is a smaller target than the face, face detection can first be performed on the collected video or images, and mouth feature detection can then be performed within the video or image region where a face was detected, improving both the efficiency and the accuracy of mouth detection.
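A minimal sketch of the mouth-movement cue might look like the following. The landmark indices assume the common 68-point facial-landmark convention (points 60–67 tracing the inner lip contour); that convention and the threshold value are illustrative assumptions, not details fixed by the method.

```python
import numpy as np

# Assumed 68-point landmark convention: points 60-67 are the inner lip contour.
INNER_TOP, INNER_BOTTOM = 62, 66    # midpoints of upper/lower inner lip
CORNER_LEFT, CORNER_RIGHT = 60, 64  # inner mouth corners

def mouth_open_ratio(landmarks: np.ndarray) -> float:
    """Mouth opening height divided by mouth width, from (68, 2) landmarks;
    normalizing by width makes the measure roughly scale-invariant."""
    height = np.linalg.norm(landmarks[INNER_TOP] - landmarks[INNER_BOTTOM])
    width = np.linalg.norm(landmarks[CORNER_LEFT] - landmarks[CORNER_RIGHT])
    return float(height / max(width, 1e-6))

def speaking_mask(per_frame_ratios, open_threshold=0.2):
    """Flag frames whose opening ratio exceeds a (tunable) threshold; runs of
    flagged frames approximate the mouth opening-and-closing time period."""
    return [r > open_threshold for r in per_frame_ratios]
```

Runs of consecutive `True` values in the mask would then be converted into the continuous time periods used in the fusion step.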

In the embodiments of the invention, three-dimensional sound source information can be monitored for sound source targets within a set target area based on the microphone array, realizing spatial monitoring of sound signals arriving from different directions within the target area.

As shown in fig. 3, in one embodiment, the microphone array may take sound signals received from different directions as the sound signals to be recognized, which may include the sound signal from the target sound source as well as signals from interference sources such as noise 1, noise 2, and noise 3 within the pickup range of the microphone array.

In one embodiment, the direction or position of the target sound source in the sound signal to be recognized, which contains background noise, can be determined by spatially localizing the signal using Direction of Arrival (DOA) estimation.

In this embodiment, the direction of arrival indicates the incoming direction of the sound wave reaching a reference array element of the microphone array, i.e., the angle between the propagation direction of the speech signal and the normal direction of the microphone array, relative to the reference array element. In some embodiments, this angle may also be referred to as the Angle of Arrival (AOA) of the speech signal.

In this embodiment, sound signals from different directions may be localized based on DOA estimation. Specifically, the direction of arrival of the beam can be obtained by DOA estimation, and an estimate of the sound source target's position can then be obtained by triangulation using the DOAs estimated at multiple receiving array elements of the microphone array.

In this embodiment, the azimuth information of each sound signal can be determined by performing direction-of-arrival estimation on each channel of the sound signals to be recognized. From the azimuth information of each channel, the sound source positions of the speech signals received by the microphone array can be detected, and sound signals whose angle of arrival falls within a threshold range are taken as candidate sound source targets. As one example, the angle-of-arrival threshold range may be set to 0 to 180 degrees.
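For a single pair of array elements, the time difference of arrival and the corresponding incidence angle can be estimated as sketched below. GCC-PHAT is one common estimator chosen here for illustration (the document does not prescribe a specific one), and the sound speed and sampling values are assumptions.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s at room temperature (assumed)

def gcc_phat_tdoa(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Time delay (s) of `sig` relative to `ref` via GCC with PHAT weighting."""
    n = len(sig) + len(ref)
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs

def angle_from_tdoa(tdoa: float, mic_distance: float) -> float:
    """Incidence angle (degrees from broadside) for a two-element pair."""
    s = np.clip(tdoa * SOUND_SPEED / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```

Repeating this over several element pairs yields the per-channel DOAs that the triangulation step above combines into a position estimate.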

With continued reference to fig. 3, the sound source three-dimensional information monitoring subsystem 220 may detect spatial three-dimensional information for the candidate sound source targets estimated by DOA; the spatial three-dimensional information of a candidate sound source target may include, for example, the horizontal angle, pitch angle, and distance of the target relative to the microphone array.

In one embodiment, a three-dimensional spatial coordinate system of the microphone array may be pre-established. As an example, in this coordinate system the coordinate origin M0 may be the center position of the microphone array in the speech processing device 103, the position of any one array element in the microphone array, or another designated position.

In one embodiment, based on the arrangement order of the array elements and the spacing between adjacent array elements, the offset distance of each array element Mi relative to the coordinate origin M0 may be determined, so as to determine the three-dimensional space coordinates of each array element Mi relative to the origin M0.

In one embodiment, assuming that the candidate sound source target is located at a spatial position point S in three-dimensional space, the three-dimensional space coordinates of the position point S may be represented as S(x0, y0, z0), where x0, y0 and z0 are the coordinate values of the position point S on the X, Y and Z axes of the three-dimensional coordinate system, respectively.

In this embodiment, the three-dimensional space coordinates and the coordinate vector of the spatial position point S satisfy:

x0 = r0·sin θ0·cos φ0,  y0 = r0·sin θ0·sin φ0,  z0 = r0·cos θ0

wherein r0 represents the distance between the spatial position point S(x0, y0, z0) of the candidate sound source target and the coordinate origin M0(0, 0, 0) of the three-dimensional coordinate system; the pitch angle θ0 represents the included angle between the line connecting the spatial point S and the origin M0 and the positive direction of the Z axis; and the horizontal angle φ0 represents the included angle between the line connecting the projection S' of the spatial point S on the XOY plane and the origin M0 and the positive direction of the X axis. The horizontal angle φ0 may take values in the range 0° ≤ φ0 < 360°, and the pitch angle θ0 in the range 0° ≤ θ0 ≤ 90°.

In one embodiment, r0 may be referred to as the distance between the spatial point S and the microphone array, θ0 as the pitch angle between the spatial point S and the microphone array, and φ0 as the horizontal angle between the spatial point S and the microphone array.
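As an illustrative sketch, the spherical-coordinate relation between (r0, θ0, φ0) and the three-dimensional coordinates (x0, y0, z0) of the spatial point S can be expressed in code as follows; the function name is chosen here for illustration only.

```python
import math

def spherical_to_cartesian(r, pitch_deg, horiz_deg):
    """Map the distance r, pitch angle and horizontal angle of a spatial
    point S (relative to the array origin M0) to Cartesian coordinates:
        x = r*sin(theta)*cos(phi), y = r*sin(theta)*sin(phi), z = r*cos(theta)."""
    theta = math.radians(pitch_deg)   # pitch angle, measured from the +Z axis
    phi = math.radians(horiz_deg)     # horizontal angle, measured from the +X axis
    return (r * math.sin(theta) * math.cos(phi),
            r * math.sin(theta) * math.sin(phi),
            r * math.cos(theta))
```

For example, a point at distance 1 with pitch angle 90° and horizontal angle 0° lies on the positive X axis, while a point with pitch angle 0° lies on the positive Z axis.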

In the embodiment of the present invention, after the time difference between the arrivals of a sound signal from the same sound source at two different array elements is detected, the corresponding path-length difference can be calculated from that time difference. Using the path-length differences for pairs of array elements together with the three-dimensional space coordinates of each array element in the microphone array, the three-dimensional space coordinates of the sound source can be derived according to geometric analysis principles, giving the position or direction of the sound source relative to the microphone array.
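A minimal sketch of this geometric principle for a two-element array, assuming far-field sound and the plane-wave relation tau = d·sin(angle)/c; the function names, spacing and sampling rate below are illustrative, not values from the embodiment:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate speed of sound in air at 20 °C

def estimate_tdoa(sig_a, sig_b, fs):
    """Estimate the time difference of arrival (seconds) between two
    microphone channels by locating the peak of their cross-correlation.
    A positive result means sig_b lags (arrives later than) sig_a."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    lag = np.argmax(corr) - (len(sig_a) - 1)
    return lag / fs

def tdoa_to_angle(tau, mic_spacing):
    """Convert a TDOA (seconds) between two array elements spaced
    mic_spacing meters apart into an incidence angle (degrees) relative
    to the array normal, using tau = d * sin(angle) / c."""
    sin_a = np.clip(SPEED_OF_SOUND * tau / mic_spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_a)))
```

With more than two array elements, the pairwise angles (or path-length differences) obtained this way can be combined to triangulate the three-dimensional position of the sound source.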

In the embodiment of the present invention, in order to improve the processing efficiency of the voice interaction device 103, a sound reception range of the microphone array may be preset, that is, a pitch angle range, a horizontal angle range and a distance range relative to the microphone array. Only sound sources within the reception range are responded to and processed, and audio signals outside it are all treated as noise signals. This narrows the search range for the target person and improves the computational efficiency of the voice interaction device 103.

In the embodiment of the invention, the spatial extent of the sound reception area of the microphone array can be determined according to the actual application scenario. In a voice ticketing application scenario, the ticket purchaser 101 is typically located within a relatively fixed area close to the voice ticketing system 100, and sound signals from this area have a higher probability of including the target sound source. Thus, in one embodiment, the set target region satisfies the condition that the coordinate vector of any spatial point R(xi, yi, zi) within the target region satisfies ri ≤ rmax, θi ≤ θmax and φi ≤ φmax. That is, the distance between the spatial point R in the target region and the microphone array is less than or equal to the preset maximum distance rmax, the horizontal angle between the spatial point R and the microphone array is less than or equal to the preset maximum horizontal angle φmax, and the pitch angle between the spatial point R and the microphone array is less than or equal to the preset maximum pitch angle θmax.

As shown in fig. 3, after sound source localization of the sound signals to be recognized based on DOA estimation, the azimuth information of a plurality of sound sources, such as noise 1, noise 2, noise 3 and the ticket purchaser 101, can be determined. In this example, setting a sound reception range for the microphone array can effectively filter out some of the interference sources, for example the noise 3 located outside the reception range.
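The reception-range filtering described above can be sketched as follows; the limit values and the source names are hypothetical, not taken from the embodiment:

```python
def in_reception_range(r, pitch_deg, horiz_deg,
                       r_max=1.5, pitch_max=90.0, horiz_max=120.0):
    """Return True if a localized sound source (distance, pitch angle,
    horizontal angle relative to the microphone array) lies inside the
    preset reception range; sources outside it are treated as noise.
    The default limits are illustrative, application-chosen values."""
    return r <= r_max and pitch_deg <= pitch_max and horiz_deg <= horiz_max

# Example: keep only candidate sources that fall inside the range.
sources = {"buyer": (0.8, 45.0, 60.0),   # near the ticketing system
           "noise3": (3.2, 30.0, 150.0)}  # outside distance and angle limits
kept = {name for name, pos in sources.items() if in_reception_range(*pos)}
```

Only the sources in `kept` would then be passed on for speech presence detection and endpoint analysis.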

In the embodiment of the invention, by monitoring the spatial information of the audio signal within the reception area, the sound source detection range can be reduced, improving both the sound source detection accuracy and the operating efficiency of the voice processing device.

In one embodiment, to improve the accuracy of speech recognition, the speech probability detection subsystem 230 may detect the speech presence probability of the audio signal within the sound reception range. When the speech presence probability of the audio signal is greater than a preset probability threshold, it is determined that a speech signal exists within the reception range, and voice start and stop point analysis is performed on the audio signal within the reception range.

In the embodiment of the present invention, the speech probability detection subsystem 230 may determine the speech presence probability of the target region by analyzing audio signal characteristics, or by a speech detection model. In some embodiments, the speech presence probability of the target region may also be determined by combining the analysis of audio signal characteristics with a speech detection model. The steps of determining the speech presence probability of the target region according to the embodiment of the present invention are described in detail below with specific embodiments.

In the embodiment of the present invention, time-domain analysis and frequency-domain analysis are two different ways of analyzing the audio signal. In brief, the time domain describes the relationship between the speech signal and time, that is, time is taken as the variable and the dynamic change of the speech signal over time is analyzed; the frequency domain describes the relationship between the speech signal and frequency, that is, frequency is taken as the variable and the characteristics of the speech signal at different frequencies are analyzed.

In one embodiment, the probability of speech presence in the target region may be determined by analyzing audio signal characteristics, such as amplitude variations and spectral distribution of the audio signal.

In one embodiment, the amplitude of the audio signal represents the excursion of the signal from its highest to its lowest vibration position. In the time domain, noise segments of the signal usually show small amplitude variation, while segments containing the speaker's speech usually show large amplitude variation. Based on this, an amplitude variation threshold for identifying noise signals may be preset; by extracting the amplitude variation value of the audio signal and comparing it with this threshold, the speech presence probability of audio signals from different directions in the target region can be determined.

In this embodiment, according to the amplitude variation of the time-domain signal, if the audio signal contains a segment whose amplitude variation value is greater than the preset amplitude variation threshold, it is determined that the speech presence probability of the audio signal in the target region is high.
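A minimal sketch of this amplitude-variation test, assuming fixed-length frames and an application-chosen threshold (both are illustrative):

```python
import numpy as np

def amplitude_variation(frame):
    """Peak-to-peak amplitude of one frame of time-domain samples."""
    return float(np.max(frame) - np.min(frame))

def frames_with_speech(signal, frame_len, amp_threshold):
    """Flag frames whose amplitude variation exceeds the preset
    threshold; a flagged frame suggests a higher speech presence
    probability for the corresponding segment."""
    n = len(signal) // frame_len
    return [amplitude_variation(signal[i * frame_len:(i + 1) * frame_len]) > amp_threshold
            for i in range(n)]
```

For instance, a silent frame yields zero amplitude variation and is not flagged, while a frame carrying a clear waveform is flagged.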

In one embodiment, the frequency spectrum of the audio signal may be understood as its frequency distribution. Generally, in the frequency domain, the spectral distribution of a noise signal is relatively uniform, while speech segments containing the speaker's voice show large variation in their spectral distribution. Therefore, for the audio signals received by the microphone array in the target region, whether speech exists can be determined by extracting spectral distribution characteristics and comparing them with a spectral distribution threshold.

As an example, the spectral distribution characteristic may be the variance of the per-frame power values of the audio signal in the frequency domain. That is, the variance of the power values of each frame of the audio signal is extracted and compared with a preset variance threshold; if the frequency-domain signal contains a segment whose per-frame power variance is greater than the variance threshold, it is determined that the speech presence probability of the audio signal is high.
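This per-frame power-variance feature can be sketched as follows; the frame length and variance threshold are illustrative choices:

```python
import numpy as np

def frame_power_variance(frame):
    """Variance of the per-bin power values of one frame's spectrum.
    A flat (noise-like) spectrum yields a small variance; a spectrum
    with pronounced peaks, as in voiced speech, yields a larger one."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    return float(np.var(power))

def speechlike_frames(signal, frame_len, var_threshold):
    """Flag frames whose spectral power variance exceeds the preset
    variance threshold."""
    n = len(signal) // frame_len
    return [frame_power_variance(signal[i * frame_len:(i + 1) * frame_len]) > var_threshold
            for i in range(n)]
```

A pure tone concentrates its power in one bin and produces a large variance, while white noise spreads power evenly and produces a small one.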

In one embodiment, the extracted key features of the audio signal, such as amplitude variation and spectral distribution, may be combined: the feature values of the audio signal are compared with preset system thresholds, and when the comparison indicates a high speech presence probability, it is determined that a speech signal exists in the audio signal of the target region.

In one embodiment, the probability of speech presence of the audio signal from different locations in the target region may also be detected by a speech detection model.

In the embodiment of the invention, a neural network model for distinguishing speech signals from non-speech signals can be pre-constructed, and the model is trained using positive samples labeled as speech and negative samples labeled as non-speech; the trained neural network model usable for voice activity detection may be called a speech detection model. It should be understood that the embodiments of the present invention do not particularly limit the specific form of the neural network model, which may be any neural network such as a deep neural network, a recurrent neural network, or a convolutional neural network.

In one embodiment, positive samples labeled speech may represent sound segments that contain acoustic features of the speech signal, and negative samples labeled non-speech may represent sound segments that do not contain acoustic features of the speech signal.

In the embodiment of the invention, voice activity detection is performed on the audio signals within the sound reception range using the speech detection model; if the speech detection model outputs that the audio signals within the reception range include speech signals, the probability that the audio signals within the reception range include speech signals is determined to be high.
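As an illustrative stand-in for the trained speech detection model (the embodiment does not specify an architecture; the random weights below would in practice be learned from the labeled speech and non-speech samples):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyVADNet:
    """Minimal MLP sketch of a speech/non-speech frame classifier.
    The weights here are random stand-ins for trained parameters."""
    def __init__(self, n_features, n_hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((n_features, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.standard_normal(n_hidden) * 0.1
        self.b2 = 0.0

    def speech_probability(self, features):
        """Map a frame's feature vector to a speech presence probability."""
        h = np.tanh(features @ self.w1 + self.b1)
        return float(sigmoid(h @ self.w2 + self.b2))
```

The output probability would then be compared with the preset probability threshold described above.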

With continued reference to fig. 2, according to the visual detection information of the computer visual detection subsystem 210, the sound source positioning information of the sound source three-dimensional information monitoring subsystem 220, and the voice existence probability analysis result of the voice probability detection subsystem 230, the voice endpoint detection subsystem 240 is utilized to determine the starting time point and the ending time point of the audio signal within the sound reception range.

In one embodiment, the sound source positioning information of the sound source three-dimensional information monitoring subsystem 220 may be utilized to determine a sound reception range of the microphone array, acquire an audio signal within the sound reception range, and determine a voice start-stop time point of the audio signal within the sound reception range through voice endpoint detection; and if the voice time period determined by the voice starting and stopping time point is in the time period of the voice sent by the target person, taking the voice starting and stopping time point of the audio signal in the sound receiving range as the voice starting and stopping time point in the audio signal received by the microphone array.

In one embodiment, in order to improve the accuracy of the voice processing system, the sound source positioning information of the sound source three-dimensional information monitoring subsystem 220 is used for determining the sound receiving range of the microphone array, performing voice detection on the audio signals in the sound receiving range, and determining the voice existence probability of the audio signals in the sound receiving range; when the voice existence probability of the audio signal in the reception range is greater than a preset probability threshold, performing voice start and stop point analysis on the audio signal in the reception range through voice endpoint detection, and determining a voice start and stop time point in the audio signal in the reception range; and if the voice start-stop time point in the audio signal in the sound reception range is in the time period of the voice of the target person, taking the voice start-stop time point of the audio signal in the sound reception range as the voice start-stop time point in the audio signal received by the microphone array.

In one embodiment, voice endpoint detection, also referred to as Voice Activity Detection (VAD), is a method of obtaining voice segments in an audio signal: it determines the start point and end point of the voice segments in the audio signal and extracts those segments, thereby eliminating the interference of silent sections and non-speech signals, reducing the computational load of the speech recognition system, and increasing its response speed.

In the embodiment of the invention, the voice segment obtained by voice endpoint detection can be input into a voice recognition system for voice recognition. In the embodiment of the invention, the voice endpoint detection not only can reduce the calculation amount and shorten the processing time, but also can remove the interference of background noise during silence, and improve the anti-interference performance and the voice recognition performance of the system.
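A minimal energy-based sketch of endpoint detection, returning the first and last frames whose short-time energy exceeds a preset threshold; this is a simplification of the VAD described above, with illustrative frame length and threshold:

```python
import numpy as np

def detect_endpoints(signal, frame_len, energy_threshold):
    """Return (start_frame, end_frame) spanning from the first to the
    last frame whose short-time energy exceeds the threshold, or None
    if no frame does."""
    n = len(signal) // frame_len
    energies = [float(np.sum(signal[i * frame_len:(i + 1) * frame_len] ** 2))
                for i in range(n)]
    active = [e > energy_threshold for e in energies]
    if not any(active):
        return None
    start = active.index(True)
    end = n - 1 - active[::-1].index(True)
    return start, end
```

Only the samples between the detected start and end frames would then be forwarded to the speech recognition engine.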

Fig. 4 shows a flow chart of a speech signal processing method according to an embodiment of the invention. As shown in fig. 4, in one embodiment, a speech signal processing method 400 in the embodiment of the present invention includes the following steps:

Step S410, acquiring a real-time image by using image acquisition equipment, performing face recognition by using the real-time image, and detecting a time period when the target person utters voice according to a face recognition result.

In one embodiment, the performing face recognition by using the real-time image specifically may include:

step S411, detecting whether a face image exists in the real-time image.

Step S412, when the face image exists in the real-time image, the face characteristic points of the face image are identified, and the characteristic points of the edge of the mouth part in the face image are determined.

In an embodiment, the step of detecting a time period during which the target person utters the voice according to the face recognition result may specifically include:

step S413, acquiring feature points of the mouth edge in the face recognition result, and determining whether a mouth opening and closing action exists according to the feature value change information of the feature points of the mouth edge.

In step S414, a person having a mouth opening and closing operation is set as a target person.

In step S415, the duration of the mouth opening and closing action of the target person in the real-time image is used as the time period for the target person to make a voice.

In this embodiment, the detection of the mouth opening and closing is realized through the change information of the mouth feature points in the human face, so that the target person who utters the voice and the time period during which the target person utters the voice are determined.
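A minimal sketch of this mouth opening and closing test, assuming four hypothetical (x, y) feature points on the mouth edge and per-frame timestamps; the opening-ratio threshold is illustrative:

```python
import math

def mouth_aspect_ratio(top, bottom, left, right):
    """Ratio of vertical mouth opening to mouth width, computed from
    four (x, y) feature points on the mouth edge (a hypothetical subset
    of the detected facial feature points). The ratio rises when the
    mouth opens."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return dist(top, bottom) / dist(left, right)

def speech_period(ratios, timestamps, open_threshold=0.3):
    """Treat frames whose ratio exceeds the threshold as mouth-open
    frames; the span from the first to the last such frame is taken as
    the period during which the target person utters voice, or None."""
    open_idx = [i for i, r in enumerate(ratios) if r > open_threshold]
    if not open_idx:
        return None
    return timestamps[open_idx[0]], timestamps[open_idx[-1]]
```

The returned period is what step S415 uses as the time period during which the target person makes a voice.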

Step S420, performing sound source localization on the audio signal received by the microphone array, and determining azimuth information of the sound source in the audio signal.

In an embodiment, step S420 may specifically include: through sound source positioning, azimuth information of a sound source in the audio signal is obtained, wherein the azimuth information comprises a horizontal angle, a pitch angle and a distance of the sound source relative to the microphone array.

In this embodiment, three-dimensional monitoring of the horizontal angle, pitch angle and distance of the sound signal is realized based on the microphone array. This enables spatial monitoring of the target when multiple speakers or other sounding interference sources are present: the reception range is constrained, and sound outside the set angle range (horizontal angle and/or pitch angle) and distance range is treated as noise and receives no recognition response, thereby improving the processing efficiency of the voice signal processing system and its anti-interference performance against interference sources.

Step S430, analyzing the start and stop points of the voice according to the time period of the voice of the target person in the real-time image and the azimuth information of the sound source, and determining the start and stop time points of the voice in the audio signal.

In the embodiment of the invention, in order to improve the processing efficiency and accuracy, speech presence probability detection may be performed on the audio signal within the reception range, and if the speech presence probability is greater than the probability threshold, voice start and stop point analysis is performed on the audio signal within the reception range.

In an embodiment, step S430 may specifically include:

step S431, determining the sound receiving range of the microphone array according to the azimuth information of the sound source, and acquiring an audio signal in the sound receiving range;

step S432, carrying out voice detection on the audio signal in the reception range, and determining the voice existence probability of the audio signal in the reception range;

and step S433, when the voice existence probability of the audio signal in the reception range is larger than a preset probability threshold, performing voice start and stop point analysis according to the time period of voice generation of the target person in the real-time image and the azimuth information of the sound source, and determining the voice start and stop time point in the audio signal.

In an embodiment, step S432 may specifically include:

extracting acoustic features of the audio signal through voice detection; comparing the characteristic value of the acoustic characteristic with a system threshold value of the acoustic characteristic of the voice signal, and determining whether the voice signal exists in the audio signal or not according to the comparison result; and determining the existence probability of the voice according to whether the voice signal exists in the audio signal.

In this step, the sound signal characteristic value may be a key characteristic of the sound signal such as amplitude variation, spectral distribution, and the like.

In another embodiment, step S432 may specifically include:

and determining the probability of the voice signal existing in the audio signal by utilizing a voice activity detection component, wherein the voice activity detection component is obtained by carrying out neural network model training in advance by using a voice sample and a non-voice sample.

In an embodiment, in step S430 or step S433, the step of performing a speech start/stop point analysis according to a time period during which the target person utters speech and the azimuth information of the sound source in the real-time image, and determining a speech start/stop time point in the audio signal may specifically include:

step S11, determining the sound reception range of the microphone array according to the azimuth information of the sound source;

step S12, acquiring audio signals within a sound reception range, and determining the voice start-stop time points of the audio signals within the sound reception range;

Step S13, if the voice time period determined by the voice start and stop time points falls within the time period during which the target person utters voice, the voice start and stop time points of the audio signal within the reception range are taken as the voice start and stop time points in the audio signal.
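Step S13 can be sketched as a simple interval check; the function name is illustrative:

```python
def confirm_voice_segment(audio_start, audio_end, visual_start, visual_end):
    """Accept the audio-derived start/stop time points (seconds) only
    when the voice time period they delimit falls within the period
    during which the target person was seen uttering voice; otherwise
    the audio segment is rejected as interference."""
    if visual_start <= audio_start and audio_end <= visual_end:
        return audio_start, audio_end
    return None
```

This is the cross-check between the visual detection information and the audio endpoint detection result described above.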

In an embodiment, the step of determining the voice start-stop time point of the audio signal in the sound reception range in step S12 may specifically include:

carrying out audio enhancement processing on the audio signal in the reception range; and determining the voice start and stop points of the audio signals after the audio enhancement processing in the sound reception range.

In one embodiment, the audio enhancement processing may include beamforming processing and noise reduction processing.
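As an illustrative sketch of the beamforming part of the audio enhancement, a delay-and-sum beamformer with precomputed integer sample delays (a simplification: practical systems use fractional delays or frequency-domain weighting):

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Align each microphone channel by its integer sample delay toward
    the estimated source direction, then average. Signals from the
    target direction add coherently while off-axis noise partially
    cancels."""
    n = min(len(ch) - d for ch, d in zip(channels, delays_samples))
    aligned = [ch[d:d + n] for ch, d in zip(channels, delays_samples)]
    return np.mean(aligned, axis=0)
```

The delays would be derived from the sound source azimuth information obtained by the localization step.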

According to the voice signal processing method provided by the embodiment of the invention, voice start and stop point analysis is performed on the enhanced audio signal according to the received visual detection information, the sound source positioning information and the speech presence probability analysis, the start and stop time points of the voice in the audio signal are determined, and the corresponding audio segment is sent to the speech recognition engine for recognition, so that the anti-interference performance of the voice processing system against interference sources is effectively improved, and stable interaction performance of the voice signal processing system in a strong interference environment is realized.

Fig. 5 shows a block diagram of a speech signal processing apparatus according to an embodiment of the present invention, and as shown in fig. 5, the speech signal processing apparatus 500 may include:

the face recognition module 510 is configured to acquire a real-time image using an image acquisition device, perform face recognition using the real-time image, and detect a time period when a target person utters a voice according to a face recognition result.

The sound source positioning module 520 is configured to perform sound source positioning on the audio signal received by the microphone array, and determine azimuth information of a sound source in the audio signal.

And the voice endpoint detection module 530 is configured to perform voice start-stop point analysis according to the time period when the target person utters the voice in the real-time image and the azimuth information of the sound source, and determine a voice start-stop time point in the audio signal.

In one embodiment, the face recognition module 510 may include:

and the image detection unit is used for detecting whether the real-time image has the face image.

And the mouth characteristic point determining unit is used for identifying the human face characteristic points of the human face image when the human face image exists in the real-time image and determining the characteristic points of the human mouth edge in the human face image.

In this embodiment, the face recognition module 510 may further include:

and the mouth opening and closing determining unit is used for acquiring the characteristic points of the mouth edge in the face recognition result and determining whether the action exists according to the characteristic value change information of the characteristic points of the mouth edge.

And the target person determining unit is used for taking a person with mouth opening and closing actions as a target person.

And the voice time period determining unit is used for taking the continuous time period of the mouth opening and closing action of the target person in the real-time image as the time period of the voice of the target person.

In one embodiment, the sound source localization module 520 may be specifically configured to:

through sound source positioning, azimuth information of a sound source in the audio signal is obtained, wherein the azimuth information comprises a horizontal angle, a pitch angle and a distance of the sound source relative to the microphone array.

In one embodiment, the voice endpoint detection module 530 may specifically include:

and the sound reception range determining unit is used for determining the sound reception range of the microphone array according to the azimuth information of the sound source and acquiring the audio signal in the sound reception range.

And the voice detection unit is used for carrying out voice detection on the audio signal in the reception range and determining the voice existence probability of the audio signal in the reception range.

The voice endpoint detection module 530 may further be configured to, when the voice existence probability of the audio signal within the reception range is greater than a preset probability threshold, perform voice start-stop point analysis according to a time period during which the target person sends a voice in the real-time image and the azimuth information of the sound source, and determine a voice start-stop time point in the audio signal.

In one embodiment, the voice detection unit may be specifically configured to:

extracting acoustic features of the audio signal through voice detection; comparing the characteristic value of the acoustic characteristic with a system threshold value of the acoustic characteristic of the voice signal, and determining whether the voice signal exists in the audio signal or not according to the comparison result; and determining the existence probability of the voice according to whether the voice signal exists in the audio signal.

In one embodiment, the voice detection unit is specifically configured to:

and determining the probability of the voice signal existing in the audio signal by utilizing a voice activity detection component, wherein the voice activity detection component is obtained by carrying out neural network model training in advance by using a voice sample and a non-voice sample.

In an embodiment, the voice endpoint detection module 530 may be further specifically configured to:

determining the sound receiving range of the microphone array according to the azimuth information of the sound source;

acquiring an audio signal in a reception range, and determining a voice start-stop time point of the audio signal in the reception range;

and if the voice time period determined by the voice starting and stopping time point is in the time period of the voice sent by the target person, taking the voice starting and stopping time point of the audio signal in the sound reception range as the voice starting and stopping time point in the audio signal.

In one embodiment, the voice endpoint detection module 530, when specifically configured to determine the voice start-stop time point of the audio signal within the reception range, may further be configured to: carrying out audio enhancement processing on the audio signal in the reception range; and determining the voice start and stop points of the audio signals after the audio enhancement processing in the sound reception range.

According to the voice signal processing device provided by the embodiment of the invention, the start and stop time points of the target voice can be determined for the enhanced audio stream according to the received visual detection information, the sound source positioning information and the speech presence probability analysis, so that the anti-interference performance of the voice processing system against interference sources in a noisy environment is enhanced, and stable interaction performance of the voice signal processing system in a strong interference environment is realized.

Other details of the speech signal processing apparatus according to the embodiment of the present invention are similar to the speech signal processing method according to the embodiment of the present invention described above with reference to fig. 1 to 4, and are not repeated herein.

Fig. 6 shows a schematic structural diagram of a speech signal processing system according to an embodiment of the present invention. As shown in fig. 6, the speech signal processing system 600 in the embodiment of the present invention may include:

and an image acquisition device 610 for acquiring real-time images.

A sound collecting device 620 for receiving the audio signal.

The data processing device 630 is used for performing face recognition by using the real-time image and detecting a time period of voice uttered by the target person according to a face recognition result; carrying out sound source positioning on the audio signals received by the microphone array, and determining azimuth information of sound sources in the audio signals; and analyzing the starting and stopping points of the voice according to the time period of the voice of the target person in the real-time image and the azimuth information of the sound source, and determining the starting and stopping time points of the voice in the audio signal.

Other details of the speech signal processing system according to the embodiment of the present invention are similar to the speech signal processing method according to the embodiment of the present invention described above with reference to fig. 1 to 4, and are not repeated herein.

Fig. 7 is a block diagram illustrating an exemplary hardware architecture of a computing device capable of implementing a speech signal processing method and apparatus according to an embodiment of the present invention.

As shown in fig. 7, computing device 700 includes an input device 701, an input interface 702, a central processor 703, a memory 704, an output interface 705, and an output device 706. The input interface 702, the central processing unit 703, the memory 704, and the output interface 705 are connected to each other via a bus 710, and the input device 701 and the output device 706 are connected to the bus 710 via the input interface 702 and the output interface 705, respectively, and further connected to other components of the computing device 700. Specifically, the input device 701 receives input information from the outside (e.g., a microphone array or an image pickup device), and transmits the input information to the central processor 703 through the input interface 702; the central processor 703 processes input information based on computer-executable instructions stored in the memory 704 to generate output information, stores the output information temporarily or permanently in the memory 704, and then transmits the output information to the output device 706 through the output interface 705; the output device 706 outputs output information external to the computing device 700 for use by a user.

That is, the computing device shown in fig. 7 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing the computer-executable instructions, may implement the speech signal processing method described in connection with fig. 1 to 4. Here, the processor may communicate with the microphone array used by the speech processing device and, based on the relevant information from the speech processing device, execute the computer-executable instructions to implement the speech signal processing method described in connection with fig. 1 to 4.

In one embodiment, the computing device 700 shown in fig. 7 may be implemented as a speech signal processing device comprising a memory and a processor, wherein the memory is used for storing executable program code, and the processor is used for reading the executable program code stored in the memory to perform the speech signal processing method described above in connection with fig. 1 to 5.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product or a computer-readable storage medium. The computer program product or computer-readable storage medium includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.

It is to be understood that the invention is not limited to the specific arrangements and instrumentalities described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications, and additions, or change the order of the steps, after comprehending the spirit of the present invention.

What has been described above is merely specific embodiments of the present invention, and it can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not described here again. It should be understood that the scope of the present invention is not limited thereto; any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present invention, and these modifications or substitutions should be covered within the scope of the present invention.
