Method and apparatus for performing sound zone localization in spatial region, device and medium

Document No.: 193355 Publication date: 2021-11-02

Reading note: This technology, "Method and apparatus for performing sound zone localization in spatial region, device and medium", was created by 胡玉祥, 朱长宝, 余凯 and 牛建伟 on 2021-08-03. Main content: The embodiments of the present disclosure disclose a method and apparatus, device, and medium for sound zone localization in a spatial region. The method includes: determining images respectively collected from at least one view angle in a set spatial region; determining face information in the set spatial region based on the images respectively collected from the at least one view angle; determining at least one path of mixed voice signal collected in the set spatial region; performing voice separation on the at least one path of mixed voice signal to obtain at least one path of voice separation signal; and determining the sound source position, in the set spatial region, of the wake-up signal corresponding to first wake-up information, based on the first wake-up information corresponding to the at least one voice separation signal and the face information. The embodiments of the present disclosure can improve localization efficiency and accuracy, and in particular improve the accuracy of distinguishing front-row and rear-row sound sources on the same side.

1. A method for sound zone localization in a spatial region, comprising:

determining images respectively collected from at least one view angle in a set space region;

determining face information in the set spatial region based on the images respectively acquired from the at least one view angle;

determining at least one path of mixed voice signals collected in the set space region;

performing voice separation on the at least one path of mixed voice signal to obtain at least one path of voice separation signal;

and determining the sound source position of the wake-up signal corresponding to the first wake-up information in the set spatial region based on the first wake-up information corresponding to the at least one voice separation signal and the face information.

2. The method of claim 1, wherein the determining the face information in the set spatial region based on the images respectively acquired from the at least one view angle comprises:

performing face recognition separately on the images collected from each of the at least one view angle to obtain at least one group of identification information;

and determining the face information in the set space region based on the at least one group of identification information.

3. The method according to claim 1 or 2, wherein the performing voice separation on the at least one mixed voice signal to obtain at least one voice separated signal comprises:

performing voice separation on the at least one path of mixed voice signal by using a voice separation algorithm to obtain at least one path of independent audio signal;

and determining at least one voice separation signal from the at least one independent audio signal.

4. The method according to any one of claims 1 to 3, wherein the determining, based on the first wake-up information corresponding to the at least one voice separation signal and the face information, a sound source position of a wake-up signal corresponding to the first wake-up information in the set spatial region includes:

processing the at least one voice separation signal by using a first neural network to obtain at least one piece of first wake-up information corresponding to the at least one voice separation signal, wherein each voice separation signal corresponds to one piece of first wake-up information, and each piece of first wake-up information indicates whether or not a preset device is woken up;

determining at least one sound production position in the set spatial region based on the face information;

and, in response to the at least one piece of first wake-up information including first wake-up information that wakes up the preset device, processing the at least one path of mixed voice signal and the first wake-up information that wakes up the preset device by using a second neural network, and determining the sound source position from among the at least one sound production position.

5. The method of any of claims 1-4, further comprising:

performing lip movement identification on a first image acquired from a first visual angle in the at least one visual angle by using a third neural network, and determining at least one lip movement information in the first image acquired from the first visual angle;

determining second wake-up information based on the at least one lip movement information and the at least one voice separation signal by using a fourth neural network, wherein the second wake-up information indicates whether or not the preset device is woken up by the voice separation signal corresponding to a face in the first image.

6. The method according to claim 5, wherein the determining, based on the first wake-up information corresponding to the at least one voice separation signal and the face information, a sound source position of a wake-up signal corresponding to the first wake-up information in the set spatial region includes:

determining at least one sound production position in the set space region based on the face information;

and determining the sound source position from the at least one sound production position by using a preset localization rule based on the first wake-up information, the second wake-up information, the at least one lip movement information and the at least one path of mixed voice signal.

7. The method of claim 6, wherein the determining the sound source position from the at least one sound production position by using a preset localization rule based on the first wake-up information, the second wake-up information, the at least one lip movement information, and the at least one path of mixed voice signal comprises:

determining at least one first sound production position corresponding to the first image from the at least one sound production position;

in response to the at least one sound production position including the first sound production position, determining at least one sound source position from the at least one first sound production position based on the second wake-up information and the at least one lip movement information, and/or determining at least one sound source position from at least one second sound production position other than the first sound production position by processing the at least one path of mixed voice signal and the first wake-up information using a fifth neural network;

and in response to the at least one sound production position not including the first sound production position, processing the at least one path of mixed voice signal and the first wake-up information by using a fifth neural network, and determining at least one sound source position from the at least one second sound production position.

8. An apparatus for performing sound zone localization in a spatial region, comprising:

the image acquisition module is used for determining images acquired from at least one view angle in a set space region;

the face information determining module is used for determining face information in the set spatial region based on the images respectively acquired by the image acquisition module from the at least one view angle;

the voice acquisition module is used for determining at least one path of mixed voice signals acquired in the set space region;

the voice separation module is used for carrying out voice separation on at least one path of mixed voice signals collected by the voice collection module to obtain at least one path of voice separation signals;

and the position locating module is used for determining the sound source position, in the set spatial region, of the wake-up signal corresponding to the first wake-up information, based on the first wake-up information corresponding to the at least one path of voice separation signal obtained by the voice separation module and the face information determined by the face information determining module.

9. A computer-readable storage medium, in which a computer program is stored, the computer program being adapted to perform the method for sound zone localization in a spatial region as claimed in any one of the preceding claims 1 to 7.

10. An electronic device, the electronic device comprising:

a processor;

a memory for storing the processor-executable instructions;

the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method for sound zone localization in a spatial region as claimed in any one of claims 1 to 7.

Technical Field

The present disclosure relates to sound source localization technology, and more particularly, to a method and apparatus, device, and medium for sound zone localization in a spatial region.

Background

With the continuous development of intelligent voice interaction technology, more and more intelligent interactive devices are coming into use, for example smart televisions, smart speakers, smart home devices, smart robots, and in-vehicle intelligent interactive devices. Once such a device is woken up by a wake-up word, a user can interact with it by voice and instruct it to perform operations such as playing music or reporting the weather.

After an intelligent interactive device is woken up, the direction of the wake-up word can be determined from the voice signal picked up by the microphone, and speech can then be picked up directionally along that direction to reduce noise interference. However, in an in-vehicle intelligent interaction scenario, for example, sound signals are generally received by a centralized two-microphone array installed at the dome light or at another mounting position in the vehicle. The sound source localization algorithms adopted in the related art have difficulty distinguishing front-row and rear-row sound sources on the same side of the vehicle, and can only distinguish the driver side from the front passenger side.

Disclosure of Invention

The present disclosure is proposed to solve the above technical problems. Embodiments of the present disclosure provide a method and apparatus, device, and medium for sound zone localization within a spatial region.

According to an aspect of the embodiments of the present disclosure, there is provided a method for performing sound zone localization in a spatial region, including:

determining images respectively collected from at least one view angle in a set space region;

determining face information in the set spatial region based on the images respectively acquired from the at least one view angle;

determining at least one path of mixed voice signals collected in the set space region;

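performing voice separation on the at least one path of mixed voice signal to obtain at least one path of voice separation signal;

and determining the sound source position of the wake-up signal corresponding to the first wake-up information in the set spatial region based on the first wake-up information corresponding to the at least one voice separation signal and the face information.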

According to another aspect of the embodiments of the present disclosure, there is provided an apparatus for performing sound zone localization in a spatial region, including:

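the image acquisition module is used for determining images acquired from at least one view angle in a set space region;

the face information determining module is used for determining face information in the set spatial region based on the images respectively acquired by the image acquisition module from the at least one view angle;

the voice acquisition module is used for determining at least one path of mixed voice signals acquired in the set space region;

the voice separation module is used for carrying out voice separation on at least one path of mixed voice signals collected by the voice acquisition module to obtain at least one path of voice separation signals;

and the position locating module is used for determining the sound source position, in the set spatial region, of the wake-up signal corresponding to the first wake-up information, based on the first wake-up information corresponding to the at least one path of voice separation signal obtained by the voice separation module and the face information determined by the face information determining module.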

According to yet another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the method for sound zone localization in a space region according to any of the embodiments.

According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:

a processor;

a memory for storing the processor-executable instructions;

the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method for performing sound zone localization in a spatial region according to any of the embodiments.

Based on the method and apparatus, device, and medium for sound zone localization in a spatial region provided by the embodiments of the present disclosure, face recognition is used to determine the face information of at least one person who may be speaking, and the wake-up information is combined with the face information. This improves localization efficiency and accuracy, and in particular improves the accuracy of distinguishing front-row and rear-row sound sources on the same side.

The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.

Fig. 1-a is a schematic diagram of an optional structure of a sound zone localization system provided in an exemplary embodiment of the present disclosure.

Fig. 1-b is a schematic diagram of an optional microphone positional relationship in a sound zone localization system provided by an exemplary embodiment of the present disclosure.

Fig. 1-c is a schematic diagram of the sound signal obtained by an optional microphone in a sound zone localization system provided by an exemplary embodiment of the present disclosure.

Fig. 2 is a flowchart illustrating a method for performing a sound zone localization in a spatial region according to an exemplary embodiment of the present disclosure.

Fig. 3 is a schematic flow chart of step 202 in the embodiment shown in fig. 2 of the present disclosure.

Fig. 4 is a schematic flow chart of step 204 in the embodiment shown in fig. 2 of the present disclosure.

Fig. 5 is a schematic flow chart of step 205 in the embodiment shown in fig. 2 of the present disclosure.

Fig. 6 is a flowchart illustrating a method for performing sound zone localization in a spatial region according to another exemplary embodiment of the present disclosure.

Fig. 7 is a schematic flow chart of step 205 in the embodiment shown in fig. 6 of the present disclosure.

Fig. 8 is a schematic flow chart of step 2056 in the embodiment shown in fig. 7 of the present disclosure.

Fig. 9 is a schematic structural diagram of an apparatus for performing sound zone localization in a spatial region according to an exemplary embodiment of the present disclosure.

Fig. 10 is a schematic structural diagram of an apparatus for performing sound zone localization in a spatial region according to another exemplary embodiment of the present disclosure.

Fig. 11 is a schematic structural diagram of an apparatus for performing sound zone localization in a spatial region according to still another exemplary embodiment of the present disclosure.

Fig. 12 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.

Detailed Description

Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.

It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.

It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and imply neither any particular technical meaning nor any necessary logical order between them.

It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.

It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.

In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.

It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.

Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.

Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

Overview of the Application

In the process of implementing the present disclosure, the inventors found that centralized two-microphone arrays are mostly used in the current in-vehicle field, and that at least the following problem exists: four sound zones cannot be controlled with two microphones.

Exemplary System

Fig. 1-a is a schematic diagram of an optional structure of a sound zone localization system provided in an exemplary embodiment of the present disclosure. As shown in fig. 1-a, the spatial region in this embodiment is the interior of a vehicle, which is divided into 4 sound zones: the driver seat (1L), the front passenger seat (1R), the seat behind the driver (2L), and the seat behind the front passenger (2R). The in-vehicle sound zone localization system may include, but is not limited to, two cameras and two microphones. One camera is located on the interior roof of the vehicle and may be called camera A; the other camera is located on the A-pillar in the vehicle (that is, at the edge of the front windshield, close to the driver seat) and may be called camera B. The two microphones (C and D) are located on either side of roof camera A, and their positional relationship may be as in the optional example shown in fig. 1-b. As shown in fig. 1-c, a sound wave reaches the diaphragm at a certain angle and causes it to vibrate; a signal conversion device converts the vibration into an electrical signal, which yields the original voice signal collected by the microphone.

In the embodiment shown in fig. 1-a, the face and lip movements of the front passenger, and the faces of the passengers behind the driver and behind the front passenger, are recognized from the image collected by camera A; the face and lip movements of the driver are recognized from the image collected by camera B.

This embodiment implements a two-microphone, four-sound-zone localization algorithm that incorporates multimodal image information. It uses the signals received by the microphone array during the voice wake-up period together with the face and lip movement information of the vehicle occupants, so that front-row and rear-row speakers can be effectively distinguished and four sound zones can be controlled with two microphones. In an in-vehicle application scenario, voice commands from different occupants throughout the vehicle can be responded to accurately, and the sound source position corresponding to a wake-up word or command word can be obtained; for example, when a passenger says "open the window", the vehicle automatically opens the window closest to that passenger.

Exemplary method

Fig. 2 is a flowchart illustrating a method for performing a sound zone localization in a spatial region according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device, as shown in fig. 2, and includes the following steps:

step 201, images respectively collected from at least one view angle in a set space region are determined.

The set spatial region may be a bounded space such as the interior of a vehicle or a room. The at least one view angle means that images of people or objects in the set spatial region are captured from at least one direction; for example, as shown in fig. 1-a, images of the people inside the vehicle are captured from two view angles by cameras A and B, which are mounted on the interior roof and on the A-pillar, respectively.

Step 202, determining face information in a set spatial region based on images respectively acquired from at least one viewing angle.

In an embodiment, the face information contained in the images collected from different view angles may be the same or different. By using at least one view angle, this embodiment captures the face information in the set spatial region more comprehensively and avoids missing face information because of a limited angle. For example, in the embodiment shown in fig. 1-a, the face and lip movement information of the front passenger, and the face information of the passengers behind the driver and behind the front passenger, are recognized from the image collected by camera A; the face and lip movement information of the driver is recognized from the image collected by camera B.

Step 203, determining at least one path of mixed voice signal collected in the set spatial region.

Optionally, at least one microphone may be utilized to collect at least one mixed voice signal, wherein the process of collecting the mixed voice signal by each microphone may be as shown in fig. 1-c; when there are a plurality of microphones, the positional relationship of the microphones may be as shown in an alternative example shown in fig. 1-b.

Step 204, performing voice separation on at least one path of mixed voice signal to obtain at least one path of voice separation signal.

Optionally, the mixed voice signal may be separated by any existing voice separation method, for example a blind source separation algorithm; this embodiment does not limit the specific voice separation method used.

Step 205, determining a sound source position of the wake-up signal corresponding to the first wake-up information in the set spatial region based on the first wake-up information corresponding to the at least one voice separation signal and the face information.

In this embodiment, the first wake-up information may be obtained by performing wake-up recognition on the at least one voice separation signal. Optionally, the first wake-up information indicates whether or not the preset device is woken up. When a wake-up signal that wakes up the preset device exists, this embodiment combines the wake-up information with the face information to perform sound zone localization and determine the sound source position corresponding to the wake-up signal.

According to the method for sound zone localization in a spatial region provided by the embodiments of the present disclosure, face recognition is used to determine the face information of at least one person who may be speaking, and the wake-up information, lip movement, and face information are combined. This improves localization efficiency and accuracy, and in particular improves the accuracy of distinguishing front-row and rear-row sound sources on the same side.

As shown in fig. 3, based on the embodiment shown in fig. 2, step 202 may include the following steps:

step 2021, performing face recognition separately on the images collected from each of the at least one view angle to obtain at least one group of identification information.

Optionally, a face feature extraction method may be applied to perform face recognition on the at least one image, for example by using a face recognition network. Face recognition yields one group of identification information for each of the at least one image, and each group of identification information may include at least one face.

Step 2022, determining the face information in the set spatial region based on at least one group of the identification information.

In this embodiment, since images of the same set spatial region are collected from at least one view angle, the same face may be captured from different view angles; that is, the at least one group of identification information may contain repeated identification information (corresponding to the same person). Repeated identification information therefore needs to be counted only once when the face information in the set spatial region is determined.
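The merging of identification groups from several view angles can be sketched as follows. This is only a minimal illustration under stated assumptions: each recognized face is assumed to already carry a seat label and a detection score, and duplicates are resolved by keeping the higher score. The record layout, the function name `merge_recognition_groups`, and the tie-break rule are hypothetical and are not specified by the disclosure.

```python
from typing import Any, Dict, List

Face = Dict[str, Any]  # hypothetical record: {"seat": str, "score": float}


def merge_recognition_groups(groups: List[List[Face]]) -> Dict[str, Face]:
    """Merge per-view groups of identification information, one entry per seat."""
    merged: Dict[str, Face] = {}
    for group in groups:
        for face in group:
            seat = face["seat"]
            kept = merged.get(seat)
            # The same person seen from two view angles is kept only once;
            # the detection with the higher score wins (an assumed tie-break).
            if kept is None or face["score"] > kept["score"]:
                merged[seat] = face
    return merged


# Hypothetical detections: camera A also catches the driver at a poor angle.
camera_a = [
    {"seat": "1R", "score": 0.95},
    {"seat": "2L", "score": 0.80},
    {"seat": "1L", "score": 0.40},
]
camera_b = [{"seat": "1L", "score": 0.97}]

face_info = merge_recognition_groups([camera_a, camera_b])
print(sorted(face_info))
```

In this sketch the driver seat appears in both groups but only once in the merged face information, with the detection from camera B retained.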

As shown in fig. 4, based on the embodiment shown in fig. 2, step 204 may include the following steps:

step 2041, performing voice separation on at least one path of mixed voice signal by using a voice separation algorithm to obtain at least one path of independent audio signal.

In this embodiment, since several people in the set spatial region may be speaking at the same time, each path of mixed voice signal may contain multiple audio signals. Optionally, the at least one path of mixed voice signal is processed by a voice separation algorithm, such as a blind source separation algorithm, to obtain separated independent audio signals.

Step 2042, at least one voice separation signal is determined from at least one independent audio signal.

In this embodiment, voice separation separates the voice signals uttered by each person, so that the wake-up information can be determined from the separated voice signals, which improves the accuracy of locating the sound source position corresponding to the wake-up signal. For example, as shown in fig. 1-b, a two-microphone array receives two paths of mixed voice signals, and voice separation yields two separated voice signals, audio1 and audio2, each of which may be a voice signal uttered by any speaker in the vehicle. The voice separation algorithm in this embodiment may be any existing method capable of voice separation. For example, in a determined scenario, that is, when the number of microphones equals the number of sound sources, the voice separation algorithm separates N channels of mixed voice signals into N channels of separated signals, where N is the number of microphones.
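The determined case (N microphones, N sources) can be illustrated with a small blind source separation sketch. The disclosure does not name a specific algorithm; symmetric FastICA with a tanh nonlinearity is used here purely as one well-known example. The sketch also assumes an instantaneous (non-reverberant) mixture, which is a strong simplification of a real vehicle cabin, and the synthetic sources and mixing matrix are hypothetical.

```python
import numpy as np


def _sym_decorrelate(w: np.ndarray) -> np.ndarray:
    """Return (W W^T)^(-1/2) W so that the rows of W stay orthonormal."""
    eigval, eigvec = np.linalg.eigh(w @ w.T)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ w


def separate_determined(mixed: np.ndarray, n_iter: int = 200, seed: int = 0) -> np.ndarray:
    """Separate N instantaneous mixtures, shape (N, samples), into N signals."""
    x = mixed - mixed.mean(axis=1, keepdims=True)
    # Whiten the mixtures so that their covariance becomes the identity.
    eigval, eigvec = np.linalg.eigh(x @ x.T / x.shape[1])
    z = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ x
    n = z.shape[0]
    w = _sym_decorrelate(np.random.default_rng(seed).standard_normal((n, n)))
    for _ in range(n_iter):
        g = np.tanh(w @ z)                          # nonlinearity g(y)
        g_prime_mean = (1.0 - g ** 2).mean(axis=1)  # E[g'(y)] per output
        # FastICA fixed-point update followed by symmetric decorrelation.
        w = _sym_decorrelate(g @ z.T / z.shape[1] - g_prime_mean[:, None] * w)
    return w @ z


# Two synthetic heavy-tailed sources mixed by a hypothetical 2x2 matrix.
rng = np.random.default_rng(1)
sources = rng.laplace(size=(2, 20000))
mixing = np.array([[1.0, 0.6], [0.5, 1.0]])
mic = mixing @ sources            # stands in for the two microphone signals
audio = separate_determined(mic)  # rows play the role of audio1 and audio2
```

As with any blind source separation, the order and scale of the separated rows are arbitrary, which is one reason the separated signals alone do not identify which seat a speaker occupies.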

As shown in fig. 5, based on the embodiment shown in fig. 2, step 205 may include the following steps:

step 2051, processing the at least one voice separation signal by using a first neural network to obtain at least one piece of first wake-up information corresponding to the at least one voice separation signal.

Each voice separation signal corresponds to one piece of first wake-up information, and each piece of first wake-up information indicates whether or not the preset device is woken up.

Optionally, before it is used for processing, the first neural network may be trained with audio signals whose wake-up results are known, which improves the accuracy of the wake-up information that the trained network obtains from each voice separation signal. Optionally, the preset device may be, for example, an air conditioner, an in-vehicle head unit, an audio system, a display screen, or any other device that can be controlled by voice.

Step 2052, determining at least one sound production position in the set spatial region based on the face information.

In this embodiment, the face information indicates where people are located in the set spatial region. Since the device can only be woken up by a person, the positions of people determined from the face information are used as the sound production positions. For example, in an in-vehicle application scenario, a face extraction algorithm processes the image information from the two cameras in the vehicle to obtain the face information of the occupants: driver face information (face_1L), front passenger face information (face_1R), face information of the passenger behind the driver (face_2L), and face information of the passenger behind the front passenger (face_2R). The sound production positions can be determined from the face information of these 4 seats; if no face information is detected at a position in the vehicle, that position is excluded from the final localization candidates.
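The selection of sound production positions from the face information can be sketched as a simple filter over the four seats. The dictionary layout and the function name `candidate_positions` are hypothetical; only the seat labels 1L, 1R, 2L, 2R and the rule that a seat without a detected face is excluded come from the description above.

```python
from typing import Dict, List, Optional

SEATS = ("1L", "1R", "2L", "2R")


def candidate_positions(face_info: Dict[str, Optional[object]]) -> List[str]:
    """Return the seats where a face was detected; empty seats are excluded."""
    return [seat for seat in SEATS if face_info.get(seat) is not None]


# Hypothetical face information: only the two left-hand seats are occupied.
face_info = {"1L": "face_1L", "1R": None, "2L": "face_2L", "2R": None}
print(candidate_positions(face_info))
```

With only the left-hand seats occupied, the later localization step has to choose between 1L and 2L rather than among all four zones.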

Step 2053, in response to the at least one piece of first wake-up information containing a first wake-up result that wakes up the preset device, executing step 2054.

In this embodiment, the wake-up signal needs to be localized only when the first wake-up information wakes up the preset device. Optionally, the method may further include: in response to the at least one piece of first wake-up information containing no first wake-up result that wakes up the preset device, not performing sound zone localization.

Step 2054, processing the at least one path of mixed voice signal and the first wake-up information that wakes up the preset device with a second neural network, and determining the sound source position from the at least one sound production position.

Before the second neural network is used to determine the sound source position, the method may further include training the second neural network on sample voice signals with known sound source positions and the wake-up information corresponding to those sample voice signals, which improves the accuracy with which the second neural network determines the sound source position. Besides sound source localization with the second neural network, the sound source position may also be determined with a common signal processing localization algorithm, such as the GCC-PHAT localization algorithm.
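For readers unfamiliar with GCC-PHAT, the following is a minimal textbook sketch of how it estimates the inter-microphone delay. It is not the implementation of this disclosure. To stay dependency-free it uses a naive O(n^2) DFT from the standard library, where a practical system would use an FFT library; the function names and the synthetic test signal are assumptions made for illustration.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, or its inverse."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = n if inverse else 1
    return [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n)) / scale
        for k in range(n)
    ]


def gcc_phat_delay(x1, x2, max_lag):
    """Estimate the delay, in samples, of x1 relative to x2 with GCC-PHAT.

    A positive result means the sound reached microphone 2 first.
    """
    n = len(x1) + len(x2)  # zero-pad so the circular correlation does not wrap
    spec1 = dft(list(x1) + [0.0] * (n - len(x1)))
    spec2 = dft(list(x2) + [0.0] * (n - len(x2)))
    cross = [a * b.conjugate() for a, b in zip(spec1, spec2)]
    # PHAT weighting: discard the magnitude and keep only the phase.
    weighted = [c / (abs(c) + 1e-12) for c in cross]
    cc = dft(weighted, inverse=True)
    # Negative lags live at the end of the circular correlation.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: cc[lag % n].real)


rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(29)]
mic2 = source + [0.0] * 3  # reference channel
mic1 = [0.0] * 3 + source  # same signal arriving 3 samples later
print(gcc_phat_delay(mic1, mic2, max_lag=8))  # 3
```

The estimated delay can then be compared with the delays expected for each candidate sound production position, so that only positions with a detected face compete for the match.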

In this embodiment, one of the at least one sound production position is determined to be the sound source position corresponding to the wake-up signal, which narrows the range of sound source localization and improves its efficiency and accuracy. In an optional example, the voice separation signals audio1 and audio2 are input to the first neural network to obtain single-mode voice wake-up flag information swkp1 and swkp2 as the wake-up results, where 1 indicates wake-up and 0 indicates no wake-up. For example, if swkp1 returns 1, the preset device is woken up by audio1; swkp1, mic1 and mic2 are then input to the sound zone localization module for sound source localization.
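The gating in steps 2053 and 2054 can be sketched as below. The `locate` callback is a hypothetical stand-in for the second neural network (or a signal processing localizer), and returning `None` when no face was detected anywhere is an assumption of this sketch rather than something the text specifies.

```python
def signals_to_localize(wake_flags):
    """Return the indices of separated signals whose first wake-up flag is 1."""
    return [i for i, flag in enumerate(wake_flags) if flag == 1]


def localize_if_woken(wake_flags, mixed_signals, candidates, locate):
    """Run localization only when some flag wakes up the preset device.

    locate is a hypothetical callback that receives the mixed signals, the
    indices of the waking signals and the candidate positions, and returns
    one of the candidates.
    """
    woken = signals_to_localize(wake_flags)
    if not woken or not candidates:
        return None  # nothing woke the device, so no sound zone localization
    position = locate(mixed_signals, woken, candidates)
    if position not in candidates:
        raise ValueError("locator returned a position outside the candidates")
    return position


# swkp1 = 1 and swkp2 = 0, with faces detected at 1L and 2R.
pick_first = lambda mixed, woken, candidates: candidates[0]
print(localize_if_woken([1, 0], ["mic1", "mic2"], ["1L", "2R"], pick_first))  # 1L
```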

As shown in fig. 6, on the basis of the embodiment shown in fig. 2, before performing step 205, the method may further include:

Step 601, performing lip movement recognition on the first image acquired from a first view angle of the at least one view angle with a third neural network, and determining at least one piece of lip movement information in the first image acquired from the first view angle.

Optionally, the third neural network may be any recognition network that can identify whether lips are moving. The lip movement information it outputs may include a lip feature vector, which may specifically include the position coordinates of lip feature points and a lip movement probability. Whether the lips are moving is determined from the lip movement probability, for example by comparing it with a preset probability: 1 is output when the lip movement probability is greater than the preset probability, and 0 is output when it is less than or equal to the preset probability, where 1 indicates that the lips are moving and 0 indicates that they are not. The preset probability can be set according to the application scenario.
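The comparison with the preset probability amounts to a strict threshold. In the sketch below the function name and the default value 0.5 are assumptions; the text only says that the preset probability depends on the application scenario.

```python
def lip_movement_flag(lip_probability, preset_probability=0.5):
    """Return 1 if the lips are judged to be moving, otherwise 0.

    A probability equal to the preset probability counts as not moving.
    """
    return 1 if lip_probability > preset_probability else 0


print(lip_movement_flag(0.8))  # 1
print(lip_movement_flag(0.5))  # 0
```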

Step 602, determining second wake-up information with a fourth neural network based on the at least one piece of lip movement information and the at least one voice separation signal.

The second wake-up information indicates whether the voice separation signal corresponding to a face in the first image wakes up the preset device.

In this embodiment, the fourth neural network performs voice wake-up using the lip movement information and the voice separation signal. When the voice information matches the image information, the wake-up rate in scenes with a low signal-to-noise ratio can be increased; when the voice information does not match the image information, the system is not woken up. In an optional example in which the sound source is localized inside a vehicle, a lip movement extraction algorithm processes the image information from two cameras in the vehicle to obtain lip movement information lip1 of the driver and lip movement information lip2 of the front passenger. A voice separation algorithm processes the voice signals received by the two microphone arrays to obtain two separated voice signals, audio1 and audio2, each of which may be the voice of any speaker in the vehicle. The lip movement information lip1 and lip2 and the two separated voice signals audio1 and audio2 are sent to a multimode voice wake-up module (the fourth neural network) to obtain second wake-up information mwkp1 and mwkp2 for the driver and the front passenger.

As shown in fig. 7, based on the embodiment shown in fig. 6, step 205 may include:

Step 2055, determining at least one sound production position in the set spatial region based on the face information.

For the implementation and effect of this step, refer to step 2052 in the embodiment shown in fig. 5; details are not repeated here.

Step 2056, determining the sound source position from the at least one sound production position with a preset positioning rule based on the first wake-up information, the second wake-up information, the at least one piece of lip movement information and the at least one path of mixed voice signal.

Alternatively, the sound source may be localized with a commonly used signal processing localization algorithm, such as a direction of arrival (DOA) technique.
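As background on DOA, one standard far-field relation for a two-microphone pair converts an inter-microphone delay into an arrival angle: sin(theta) = c * tau / d, with theta measured from the broadside direction. The sketch below illustrates that relation only; the function name, the parameter names and the speed of sound of 343 m/s are assumptions, and the disclosure does not state which DOA method it uses.

```python
import math


def doa_from_delay(delay_s, mic_spacing_m, speed_of_sound=343.0):
    """Far-field arrival angle, in degrees from broadside, for two microphones."""
    ratio = speed_of_sound * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp small numerical overshoot
    return math.degrees(math.asin(ratio))


print(doa_from_delay(0.0, 0.1))  # 0.0, the source is straight ahead of the pair
```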

In this embodiment, the sound source position is determined in combination with the second wake-up information derived from the lip movement information, which improves the localization accuracy for positions whose faces have lip movement information. For example, in a vehicle-mounted application scenario, a lip movement extraction algorithm processes the image information from two cameras in the vehicle to obtain lip movement information lip_1L of the driver and lip movement information lip_1R of the front passenger. A voice separation algorithm (e.g., blind source separation) processes the voice signals received by the two microphone arrays to obtain two separated voice signals, audio1 and audio2, each of which may be the voice of any speaker in the vehicle. Based on lip_1L, lip_1R, audio1 and audio2, multimode voice wake-up flag information mwkp_1L for the driver and mwkp_1R for the front passenger can be obtained. Because voice wake-up uses both the lip movement information and the voice signal, the wake-up rate in scenes with a low signal-to-noise ratio can be improved when the voice information matches the image information, and the preset device is not woken up when they do not match.

As shown in fig. 8, based on the embodiment shown in fig. 7, step 2056 may include:

Step 801, determining at least one first sound production position corresponding to the first image from the at least one sound production position.

In this embodiment, a first sound production position is a position occupied by a person for whom lip movement information is available, for example a front-row position in the set space. In a vehicle-mounted application scenario, as shown in fig. 1-a, the first sound production positions are the driver position 1L and the front passenger position 1R.

Step 802, judging whether the at least one sound production position includes a first sound production position; if so, executing step 803; otherwise, executing step 804.

Step 803, determining at least one sound source position from the at least one first sound production position based on the second wake-up information and the at least one piece of lip movement information, and/or processing the at least one path of mixed voice signal and the first wake-up information with a fifth neural network to determine at least one sound source position from at least one second sound production position other than the first sound production position.

In this embodiment, several wake-up voices may exist at the same time. For example, when a sound source position exists among the first sound production positions, at least one sound source position may also exist among the second sound production positions; in that case, the first wake-up information corresponding to the second sound production positions needs to be processed in addition to the information for the first sound production positions.

Optionally, determining at least one sound source position from the at least one first sound production position based on the second wake-up information and the at least one piece of lip movement information includes:

for each first sound production position of the at least one first sound production position, in response to the second wake-up information corresponding to the first sound production position waking up the preset device, or in response to the lip movement information corresponding to the first sound production position indicating lip movement, determining the first sound production position to be a sound source position.

In this embodiment, when a voice signal exists at a first sound production position for which lip movement information can be obtained, the second wake-up information obtained from that voice signal is checked; if it wakes up the preset device, the first sound production position corresponding to that second wake-up information is a sound source position. Alternatively, if the lip movement information indicates lip movement (speaking requires lip movement), the first sound production position can likewise be determined to be a sound source position.
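The per-position rule above, under which either a waking second wake-up flag or detected lip movement is enough, can be sketched as follows. The function name and the dictionary inputs keyed by seat label are assumptions made for illustration.

```python
def front_sound_sources(first_positions, second_wake, lip_moving):
    """Select first sound production positions that count as sound sources.

    second_wake and lip_moving are hypothetical mappings from a position
    label to a 0/1 flag; a missing entry is treated as 0.
    """
    return [
        pos
        for pos in first_positions
        if second_wake.get(pos, 0) == 1 or lip_moving.get(pos, 0) == 1
    ]


# The driver's multimode wake-up flag is 1; the front passenger is silent and still.
print(front_sound_sources(["1L", "1R"], {"1L": 1, "1R": 0}, {"1L": 1, "1R": 0}))  # ['1L']
```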

Step 804, processing the at least one path of mixed voice signal and the first wake-up information with a fifth neural network, and determining at least one sound source position from the at least one second sound production position.

In this embodiment, a second sound production position is a position that has face information but no lip movement information. For such a sound production position, sound source localization is realized with the fifth neural network or with a common signal processing localization algorithm such as a direction of arrival (DOA) technique. For example, in a vehicle-mounted application scenario, the specific sound source localization rule is shown in table 1 below:

TABLE 1

In this embodiment, the interior of the vehicle is divided into 4 sound zones, which correspond to the driver zone, the front passenger zone, the rear zone behind the driver and the rear zone behind the front passenger. mwkp1 and mwkp2 denote the second wake-up information for the driver position and the front passenger position respectively, where 1 indicates that the preset device is woken up and 0 indicates that it is not.
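The branching of steps 801 to 804 can be sketched as below. This illustrates only the control flow described in the text and does not reconstruct the contents of table 1. The `front_rule` and `rear_locator` callbacks are hypothetical stand-ins for the lip-based rule and for the fifth neural network (or a DOA algorithm), and treating 1L and 1R as the seats covered by the first image is an assumption taken from the vehicle example.

```python
FRONT_ROW = ("1L", "1R")  # assumed: the seats covered by the first image


def localize(sound_positions, front_rule, rear_locator):
    """Dispatch sound production positions to the matching localization branch."""
    first = [p for p in sound_positions if p in FRONT_ROW]  # step 801
    second = [p for p in sound_positions if p not in FRONT_ROW]
    if first:  # step 802
        sources = list(front_rule(first))  # step 803, lip-based rule
        if second:
            sources += list(rear_locator(second))  # step 803, "and/or" branch
        return sources
    return list(rear_locator(second)) if second else []  # step 804


# Faces at 1L and 2R; both stubs accept every position they are given.
accept_all = lambda positions: positions
print(localize(["1L", "2R"], accept_all, accept_all))  # ['1L', '2R']
```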

Any of the methods for sound zone localization in a spatial region provided by embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to a terminal device, a server and the like. Alternatively, any such method may be executed by a processor, for example by the processor calling corresponding instructions stored in a memory. This is not described in detail below.

Exemplary devices

Fig. 9 is a schematic structural diagram of an apparatus for performing sound zone localization in a spatial region according to an exemplary embodiment of the present disclosure. As shown in fig. 9, the apparatus provided in this embodiment includes:

An image acquisition module 91, configured to determine images respectively acquired from at least one view angle within a set spatial region.

A face information determining module 92, configured to determine face information in the set spatial region based on the images respectively acquired from the at least one view angle as determined by the image acquisition module 91.

A voice acquisition module 93, configured to determine at least one path of mixed voice signal acquired in the set spatial region.

A voice separation module 94, configured to perform voice separation on the at least one path of mixed voice signal determined by the voice acquisition module 93 to obtain at least one voice separation signal.

The position locating module 95 is configured to determine, based on the first wake-up information corresponding to the at least one voice separation signal obtained by the voice separation module 94 and the face information determined by the face information determining module 92, a sound source position of a wake-up signal corresponding to the first wake-up information in the set spatial region.

In the apparatus for sound zone localization in a spatial region provided by the embodiment of the present disclosure, face recognition determines at least one piece of face information for a person who may be speaking, and combining the wake-up information with the face information improves localization efficiency and accuracy, including the accuracy of distinguishing front-row from rear-row sound sources on the same side.
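The data flow between modules 91 to 95 can be sketched as a simple composition of callables. The class name, the field names and the stub lambdas below are assumptions made for illustration, and passing the mixed signals to the locating stage follows the description of unit 953 rather than an explicit statement about module 95.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SoundZoneLocalizer:
    """Hypothetical wiring of the modules of fig. 9."""

    acquire_images: Callable[[], Any]  # module 91
    determine_faces: Callable[[Any], Any]  # module 92
    acquire_speech: Callable[[], Any]  # module 93
    separate_speech: Callable[[Any], Any]  # module 94
    locate: Callable[[Any, Any, Any], Any]  # module 95

    def run(self):
        images = self.acquire_images()
        faces = self.determine_faces(images)
        mixed = self.acquire_speech()
        separated = self.separate_speech(mixed)
        return self.locate(separated, faces, mixed)


# Stub modules that only show how data moves from one module to the next.
pipeline = SoundZoneLocalizer(
    acquire_images=lambda: ["front_image", "rear_image"],
    determine_faces=lambda images: ["1L", "2R"],
    acquire_speech=lambda: ["mic1", "mic2"],
    separate_speech=lambda mixed: ["audio1", "audio2"],
    locate=lambda separated, faces, mixed: faces[0],
)
print(pipeline.run())  # 1L
```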

Fig. 10 is a schematic structural diagram of an apparatus for performing sound zone localization in a spatial region according to another exemplary embodiment of the present disclosure. As shown in fig. 10, in the apparatus provided in this embodiment, the face information determining module 92 is specifically configured to perform face recognition on images respectively acquired based on at least one viewing angle, so as to obtain at least one group of recognition information; and determining the face information in the set space region based on at least one group of identification information.

A voice separation module 94, specifically configured to perform voice separation on the at least one mixed voice signal by using a voice separation algorithm to obtain at least one independent audio signal; and determining at least one voice separation signal from the at least one independent audio signal.

A position-location module 95, comprising:

A single-mode voice wake-up unit 951, configured to process the at least one voice separation signal with a first neural network to obtain at least one piece of first wake-up information corresponding to the at least one voice separation signal; each voice separation signal corresponds to one piece of first wake-up information, and each piece of first wake-up information indicates either that the preset device is woken up or that it is not.

A first position determining unit 952, configured to determine at least one sound production position within the set spatial region based on the face information.

A first sound source positioning unit 953, configured to, in response to the at least one piece of first wake-up information containing a first wake-up result that wakes up the preset device, process the at least one path of mixed voice signal and the first wake-up information that wakes up the preset device with a second neural network, and determine the sound source position from the at least one sound production position.

Fig. 11 is a schematic structural diagram of an apparatus for performing sound zone localization in a spatial region according to still another exemplary embodiment of the present disclosure. As shown in fig. 11, the apparatus provided in this embodiment further includes:

A lip movement recognition module 111, configured to perform lip movement recognition on the first image acquired from a first view angle of the at least one view angle with a third neural network, and to determine at least one piece of lip movement information in the first image acquired from the first view angle.

A wake-up recognition module 112, configured to determine second wake-up information with a fourth neural network based on the at least one piece of lip movement information and the at least one voice separation signal; the second wake-up information indicates whether the voice separation signal corresponding to a face in the first image wakes up the preset device.

A position-location module 95, comprising:

A second position determining unit 954, configured to determine at least one sound production position within the set spatial region based on the face information.

A second sound source positioning unit 955, configured to determine a sound source position from the at least one sound production position with a preset positioning rule based on the first wake-up information, the second wake-up information, the at least one piece of lip movement information and the at least one path of mixed voice signal.

Optionally, the second sound source positioning unit 955 is specifically configured to: determine at least one first sound production position corresponding to the first image from the at least one sound production position; in response to the at least one sound production position including a first sound production position, determine at least one sound source position from the at least one first sound production position based on the second wake-up information and the at least one piece of lip movement information, and/or process the at least one path of mixed voice signal and the first wake-up information with a fifth neural network to determine at least one sound source position from at least one second sound production position other than the first sound production position; and in response to the at least one sound production position not including a first sound production position, process the at least one path of mixed voice signal and the first wake-up information with the fifth neural network to determine at least one sound source position from the at least one second sound production position.

Optionally, when determining at least one sound source position from the at least one first sound production position based on the second wake-up information and the at least one piece of lip movement information, the second sound source positioning unit 955 is configured to, for each first sound production position of the at least one first sound production position, determine the first sound production position to be a sound source position in response to the second wake-up information corresponding to the first sound production position waking up the preset device, or in response to the lip movement information corresponding to the first sound production position indicating lip movement.

Exemplary electronic device

Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 12. The electronic device may be either or both of the first device 100 and the second device 200, or a stand-alone device separate from them that may communicate with the first device and the second device to receive the collected input signals therefrom.

FIG. 12 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.

As shown in fig. 12, the electronic device 120 includes one or more processors 121 and a memory 122.

The processor 121 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 120 to perform desired functions.

Memory 122 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 121 to implement the method for sound zone localization in a spatial region of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.

In one example, the electronic device 120 may further include: an input device 123 and an output device 124, which are interconnected by a bus system and/or other form of connection mechanism (not shown).

For example, when the electronic device is the first device 100 or the second device 200, the input device 123 may be the microphone or the microphone array described above for capturing the input signal of the sound source. When the electronic device is a stand-alone device, the input means 123 may be a communication network connector for receiving the acquired input signals from the first device 100 and the second device 200.

The input device 123 may also include, for example, a keyboard, a mouse, and the like.

The output device 124 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 124 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.

Of course, for simplicity, only some of the components of the electronic device 120 relevant to the present disclosure are shown in fig. 12, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 120 may include any other suitable components, depending on the particular application.

Exemplary computer program product and computer-readable storage Medium

In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps of the method for sound zone localization in a spatial region according to the various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.

Program code for carrying out operations of embodiments of the present disclosure may be written for the computer program product in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of the method for sound zone localization in a spatial region according to the various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.

The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.

In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The block diagrams of devices, apparatuses and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses and systems may be connected, arranged and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the phrase "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".

The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
