Target sound source positioning method and system based on joint optimization network


This technology, "Target sound source positioning method and system based on joint optimization network" (一种基于联合优化网络的目标声源定位方法及系统), was designed and created by 刘忆森, 周松斌 and 万智勇 on 2021-09-29. The invention provides a target sound source positioning method and system based on a joint optimization network. The scheme comprises: collecting all target sound source signals through a microphone array placed at the monitoring position to obtain a sound data set; dividing the sound data set into a training set and a verification set; preprocessing the training set and sending it into a preset joint target sound detection and localization network to obtain a target sound source localization model; preprocessing the verification set, sending it into the target sound source localization model, calculating the cosine similarity of the target sound detection module between the training set and the verification set, and determining a target sound source detection threshold; and acquiring sound data in real time and determining the azimuth prediction of the target sound source by using the cosine similarity and the target sound source detection threshold. Through target sound source localization based on a joint optimization network, the scheme only needs to collect high signal-to-noise-ratio sound signals containing the target sound source for training and modelling, and can carry out target sound source detection and end-to-end localization simultaneously.

1. A target sound source positioning method based on a joint optimization network is characterized by comprising the following steps:

collecting all target sound source signals through a microphone array placed at a monitoring position, and storing the collected sound data, together with its coordinate labels, into a sound data set;

dividing the sound data set into a training set and a verification set;

after preprocessing the training set, sending the training set into a preset target sound detection and positioning combined network to obtain a target sound source positioning model, wherein the target sound detection and positioning combined network comprises a time sequence feature extraction module, a target sound detection module and a sound source coordinate regression prediction module;

after preprocessing the verification set, sending the preprocessed verification set into the target sound source positioning model, calculating cosine similarity of the target sound detection module to the training set and the verification set, and determining a target sound source detection threshold;

acquiring sound data in real time, and determining the azimuth prediction of a target sound source by using the cosine similarity and the target sound source detection threshold;

after the training set is preprocessed, the training set is sent to a preset target sound detection and positioning combined network, and a target sound source positioning model is obtained, and the method specifically comprises the following steps:

acquiring training sound data in the training set;

acquiring a window frame length which is 1024;

framing the training sound data according to a second calculation formula to generate windowed framed data;

carrying out short-time Fourier transform on the windowed framing data to obtain a time-frequency energy spectrum and a time-frequency phase spectrum of the sound data;

combining the time-frequency energy spectrum and the time-frequency phase spectrum to be used as feature data of sound data;

classifying a target sound source detection and positioning joint optimization network into a time-frequency feature extraction module, a target sound detection module and a sound source coordinate regression prediction module, wherein the time-frequency feature extraction module is a bidirectional recurrent network, the target sound detection module is a convolutional auto-encoder, and the sound source coordinate regression prediction module is a convolutional network;

predicting the sound source position from the encoding features through the sound source coordinate regression prediction module;

the loss function of the target sound source detection and positioning combined optimization network consists of the loss function of the target sound detection module and the loss function of a sound source coordinate regression prediction module;

simultaneously optimizing the loss function of the target sound detection module and the loss function of the sound source coordinate regression prediction module by adopting a gradient descent method to obtain a target sound source positioning model;

the second calculation formula:

wherein W is the window sequence, n is the n-th point, and h is the window frame length.

2. The joint optimization network-based target sound source positioning method according to claim 1, wherein the collecting of all target sound source signals by a microphone array placed at the monitoring position and the storing of the collected sound data, together with its coordinate labels, into the sound data set specifically comprises:

establishing a rectangular coordinate system by taking the microphone as an origin;

dividing the monitoring location into at least one monitoring region location sub-block;

acquiring central coordinates of all the monitoring area position sub-blocks;

calculating the azimuth angle of the microphone array according to the center coordinates of the position sub-blocks of the monitoring area;

collecting sound source signals of all the sub-blocks at the monitoring area positions, and storing the sound source signals as initial sound source signals;

storing the initial sound source signal at a preset fixed time interval;

acquiring a preset sampling frequency;

acquiring the data length by using a first calculation formula;

storing the initial sound source signal corresponding to the data length as the sound data set;

the first calculation formula is:

N = t0 * fc

wherein N is the data length, t0 is the fixed time interval, and fc is the preset sampling frequency.

3. The method for positioning a target sound source based on a joint optimization network as claimed in claim 2, wherein the dividing the sound data set into a training set and a verification set specifically comprises:

acquiring the sound data set, and dividing the sound data set into a training set and a verification set;

acquiring the target coordinates obtained by mapping all azimuth angles of the microphone array into a unit rectangular coordinate system;

all target coordinates are tagged into the training set and the validation set.

4. The method according to claim 1, wherein the preprocessing is performed on the verification set, and then the preprocessed verification set is sent to the target sound source localization model, and the cosine similarity of the target sound detection module to the training set and the verification set is calculated, and a target sound source detection threshold is determined, which specifically includes:

acquiring the data in the verification set and the data in the training set, and sending the data to the target sound source positioning model;

acquiring a first reconstruction characteristic by utilizing the time-frequency characteristic extraction module in the target sound source detection and positioning joint optimization network

Obtaining a second reconstruction feature using the target sound detection module in the target sound source detection and localization joint optimization network

Obtaining the target sound source detection threshold value by using a third calculation formula;

judging whether the maximum value of the cosine similarity is larger than the target sound source detection threshold value or not;

the third calculation formula is:

th = max_j Cj,  Cj = (1/N)·Σ_{i=1..N} cos(ri, r′j)

wherein max takes the maximum over all Cj, N is the number of training samples, ri is the first reconstruction feature output after the i-th training-set sample passes through the time-frequency feature extraction module, r′j is the second reconstruction feature output after the j-th verification-set sample passes through the time-frequency feature extraction module, cos(·,·) denotes the cosine similarity, and Cj is the mean value of the cosine similarity between the j-th verification sample and all the training-set data.

5. The method according to claim 4, wherein the obtaining of the sound data in real time and the determining of the azimuth prediction of the target sound source by using the cosine similarity and the target sound source detection threshold specifically comprise:

during real-time detection, windowing and framing collected sound data, and performing short-time Fourier transform to obtain characteristic data of the time-frequency energy spectrum and the time-frequency phase spectrum containing the sound data;

acquiring first real-time data of sound data for windowing and framing;

carrying out short-time Fourier transform on the first real-time data to obtain the time-frequency energy spectrum and the time-frequency phase spectrum containing sound data;

the characteristic data of the time-frequency energy spectrum and the time-frequency phase spectrum are obtained, the characteristic data are sent to the target sound source detection and positioning combined optimization network, and cosine similarity is calculated;

comparing the cosine similarity with a preset target sound source detection threshold, and if the cosine similarity is smaller than the target sound source detection threshold, detecting no target sound at this time;

and if the cosine similarity is not smaller than the target sound source detection threshold, detecting the target sound at this time, and taking the output of the sound source coordinate regression prediction module as a target sound source positioning result.

6. A joint optimization network-based target sound source localization system, comprising:

the data acquisition module is used for acquiring all target sound source signals through a microphone array arranged at a monitoring position and storing the target sound source signals and sound data into a sound data set together according to a coordinate label;

the sample dividing module is used for dividing the sound data set into a training set and a verification set;

the model training module is used for sending the training set to a preset target sound detection and positioning combined network after preprocessing the training set to obtain a target sound source positioning model, wherein the target sound detection and positioning combined network comprises a time sequence feature extraction module, a target sound detection module and a sound source coordinate regression prediction module;

the threshold value determining module is used for sending the preprocessed verification set into the target sound source positioning model, calculating the cosine similarity of the target sound detection module to the training set and the verification set, and determining a target sound source detection threshold value;

and the real-time detection module is used for acquiring sound data in real time and determining the azimuth prediction of the target sound source by using the cosine similarity and the target sound source detection threshold.

7. The system for positioning a target sound source based on a joint optimization network as claimed in claim 6, further comprising a model operation submodule for:

the loss function L of the target sound source detection and positioning joint optimization network is composed of the loss function LE of the target sound detection module and the loss function LP of the sound source coordinate regression prediction module; wherein LE is the L2-norm recovery error of the time-frequency features and LP is the L2-norm prediction error of the azimuth coordinates; specifically, the loss function L of the target sound source detection and positioning joint optimization network is calculated as:

L = α·LE + β·LP,  LE = (1/N)·Σ_{i=1..N} ||zi − ẑi||₂²,  LP = (1/N)·Σ_{i=1..N} [(xi − x̂i)² + (yi − ŷi)²]

wherein N is the number of training samples, zi is the time-frequency feature of the i-th training-set sample obtained by the time-frequency feature extraction module, ẑi is the reconstructed time-frequency feature obtained by the target sound detection module, xi and yi are in turn the abscissa and ordinate of the orientation of the i-th sample, α and β are constant coefficients taking values in (0, 1], and x̂i and ŷi are the predicted values of the system for the x-axis and y-axis, respectively.

8. A computer-readable storage medium on which computer program instructions are stored, which, when executed by a processor, implement the method of any one of claims 1-5.

9. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the steps of any of claims 1-5.

Technical Field

The invention relates to the technical field of sound source positioning, in particular to a target sound source positioning method and a target sound source positioning system based on a joint optimization network.

Background

Target sound source detection and localization can be carried out according to the sound source signals received by a sound pickup system, so the technique has many applications in traffic whistle (horn) monitoring, audio and video surveillance, and visual navigation systems for the blind.

Prior to the present technique, conventional sound source localization fell into two types. The first is the class of beamforming algorithms; such algorithms cannot localize a specific target sound, and cannot localize accurately in scenes with complex environmental reverberation and many sound sources. The second is target sound detection and localization based on machine learning, which divides the system into a sound source detection subtask and a sound source localization subtask; its problems are that the training data cannot exhaust all non-target sound source scenarios during model training and that the system is not end-to-end.

Disclosure of Invention

In view of the above problems, the present invention provides a target sound source localization method and system based on a joint optimization network, which only needs to collect high signal-to-noise-ratio sound signals containing the target sound source for training and modelling, and can perform target sound source detection and end-to-end localization simultaneously.

According to a first aspect of the embodiments of the present invention, a target sound source localization method based on a joint optimization network is provided.

In one or more embodiments, preferably, the method for target sound source localization based on a joint optimization network includes:

collecting all target sound source signals through a microphone array placed at a monitoring position, and storing the target sound source signals and sound data into a sound data set together according to a coordinate label;

dividing the sound data set into a training set and a verification set;

after preprocessing the training set, sending the training set into a preset target sound detection and positioning combined network to obtain a target sound source positioning model, wherein the target sound detection and positioning combined network comprises a time sequence feature extraction module, a target sound detection module and a sound source coordinate regression prediction module;

after preprocessing the verification set, sending the preprocessed verification set into the target sound source positioning model, calculating cosine similarity of the target sound detection module to the training set and the verification set, and determining a target sound source detection threshold;

and acquiring sound data in real time, and determining the azimuth prediction of the target sound source by using the cosine similarity and the target sound source detection threshold.

In one or more embodiments, preferably, the collecting of all target sound source signals by a microphone array placed at the monitoring position and the storing of the collected sound data, together with its coordinate labels, into the sound data set specifically includes:

establishing a rectangular coordinate system by taking the microphone as an origin;

dividing the monitoring location into at least one monitoring region location sub-block;

acquiring central coordinates of all the monitoring area position sub-blocks;

calculating the azimuth angle of the microphone array according to the center coordinates of the position sub-blocks of the monitoring area;

collecting sound source signals of all the sub-blocks at the monitoring area positions, and storing the sound source signals as initial sound source signals;

storing the initial sound source signal at a preset fixed time interval;

acquiring a preset sampling frequency;

acquiring the data length by using a first calculation formula;

storing the initial sound source signal corresponding to the data length as the sound data set;

the first calculation formula is:

N = t0 * fc

where N is the data length, t0 is the fixed time interval, and fc is the preset sampling frequency.

In one or more embodiments, preferably, the dividing the sound data set into a training set and a verification set specifically includes:

acquiring the sound data set, and dividing the sound data set into a training set and a verification set;

acquiring the target coordinates obtained by mapping all azimuth angles of the microphone array into a unit rectangular coordinate system;

all target coordinates are tagged into the training set and the validation set.

In one or more embodiments, preferably, after the training set is preprocessed, the training set is sent to a preset target sound detection and positioning joint network to obtain a target sound source positioning model, where the target sound detection and positioning joint network includes a time sequence feature extraction module, a target sound detection module, and a sound source coordinate regression prediction module, and specifically includes:

acquiring training sound data in the training set;

acquiring a window frame length, wherein the window frame length is 1024;

framing the training sound data according to a second calculation formula to generate windowed framed data;

carrying out short-time Fourier transform on the windowed framing data to obtain a time-frequency energy spectrum and a time-frequency phase spectrum of the sound data;

combining the time-frequency energy spectrum and the time-frequency phase spectrum to be used as feature data of sound data;

classifying a target sound source detection and positioning joint optimization network into a time-frequency feature extraction module, a target sound detection module and a sound source coordinate regression prediction module, wherein the time-frequency feature extraction module is a bidirectional recurrent network, the target sound detection module is a convolutional auto-encoder, and the sound source coordinate regression prediction module is a convolutional network;

predicting the sound source position from the encoding features through the sound source coordinate regression prediction module;

the loss function of the target sound source detection and positioning combined optimization network consists of the loss function of the target sound detection module and the loss function of a sound source coordinate regression prediction module;

simultaneously optimizing the loss function of the target sound detection module and the loss function of the sound source coordinate regression prediction module by adopting a gradient descent method to obtain a target sound source positioning model;

the second calculation formula:

where W is the window sequence, n is the n-th point, and h is the window frame length.

In one or more embodiments, preferably, after the preprocessing is performed on the verification set, the preprocessed verification set is sent to the target sound source localization model, the cosine similarity of the target sound detection module to the training set and the verification set is calculated, and a target sound source detection threshold is determined, which specifically includes:

acquiring the data in the verification set and the data in the training set, and sending the data to the target sound source positioning model;

acquiring a first reconstruction characteristic by utilizing the time-frequency characteristic extraction module in the target sound source detection and positioning joint optimization network

Obtaining a second reconstruction feature using the target sound detection module in the target sound source detection and localization joint optimization network

Obtaining the target sound source detection threshold value by using a third calculation formula;

judging whether the maximum value of the cosine similarity is larger than the target sound source detection threshold value or not;

the third calculation formula is:

th = max_j Cj,  Cj = (1/N)·Σ_{i=1..N} cos(ri, r′j)

where max takes the maximum over all Cj, N is the number of training samples, ri is the output of the i-th training-set sample after passing through the time-frequency feature extraction module, r′j is the output of the j-th verification-set sample after passing through the time-frequency feature extraction module, cos(·,·) denotes the cosine similarity, and Cj is the mean value of the cosine similarity between the j-th verification sample and all the training-set data.

In one or more embodiments, preferably, the acquiring the sound data in real time, and determining the azimuth prediction of the target sound source by using the cosine similarity and the target sound source detection threshold specifically includes:

during real-time detection, windowing and framing collected sound data, and performing short-time Fourier transform to obtain characteristic data of the time-frequency energy spectrum and the time-frequency phase spectrum containing the sound data;

acquiring first real-time data of sound data for windowing and framing;

carrying out short-time Fourier transform on the first real-time data to obtain the time-frequency energy spectrum and the time-frequency phase spectrum containing sound data;

the characteristic data of the time-frequency energy spectrum and the time-frequency phase spectrum are obtained, the characteristic data are sent to the target sound source detection and positioning combined optimization network, and cosine similarity is calculated;

comparing the cosine similarity with a preset target sound source detection threshold, and if the cosine similarity is smaller than the target sound source detection threshold, detecting no target sound at this time;

and if the cosine similarity is not smaller than the target sound source detection threshold, detecting the target sound at this time, and taking the output of the sound source coordinate regression prediction module as a target sound source positioning result.

According to a second aspect of the embodiments of the present invention, a target sound source localization system based on a joint optimization network is provided.

In one or more embodiments, preferably, the target sound source localization system based on the joint optimization network includes:

the data acquisition module is used for acquiring all target sound source signals through a microphone array arranged at a monitoring position and storing the target sound source signals and sound data into a sound data set together according to a coordinate label;

the sample dividing module is used for dividing the sound data set into a training set and a verification set;

the model training module is used for sending the training set to a preset target sound detection and positioning combined network after preprocessing the training set to obtain a target sound source positioning model, wherein the target sound detection and positioning combined network comprises a time sequence feature extraction module, a target sound detection module and a sound source coordinate regression prediction module;

the threshold value determining module is used for sending the preprocessed verification set into the target sound source positioning model, calculating the cosine similarity of the target sound detection module to the training set and the verification set, and determining a target sound source detection threshold value;

and the real-time detection module is used for acquiring sound data in real time and determining the azimuth prediction of the target sound source by using the cosine similarity and the target sound source detection threshold.

In one or more embodiments, preferably, the joint optimization network-based target sound source localization system further includes a model operation submodule configured to:

the loss function L of the target sound source detection and positioning joint optimization network is composed of the loss function LE of the target sound detection module and the loss function LP of the sound source coordinate regression prediction module; wherein LE is the L2-norm recovery error of the time-frequency features and LP is the L2-norm prediction error of the azimuth coordinates; specifically, the loss function L of the target sound source detection and positioning joint optimization network is calculated as:

L = α·LE + β·LP,  LE = (1/N)·Σ_{i=1..N} ||zi − ẑi||₂²,  LP = (1/N)·Σ_{i=1..N} [(xi − x̂i)² + (yi − ŷi)²]

where zi is the time-frequency feature of the i-th training-set sample obtained by the time-frequency feature extraction module, ẑi is the reconstructed time-frequency feature obtained by the target sound detection module, xi and yi are in turn the abscissa and ordinate of the orientation of the i-th sample, N is the number of training samples, α and β are constant coefficients taking values in (0, 1], and x̂i and ŷi are the predicted values of the system for the x-axis and y-axis, respectively.

According to a third aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method according to any one of the first aspect of embodiments of the present invention.

According to a fourth aspect of embodiments of the present invention, there is provided an electronic device, comprising a memory and a processor, the memory being configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the steps of any one of the first aspect of embodiments of the present invention.

The technical scheme provided by the embodiment of the invention can have the following beneficial effects:

1) In the embodiment of the invention, feature extraction and sound coordinate prediction for the target sound source are realized by the obtained joint target sound detection and positioning network.

2) In the embodiment of the invention, by providing a method for calculating the target sound decision threshold, a specific threshold can be obtained through training rather than set empirically.

3) In the embodiment of the invention, a rapid target sound source localization result is obtained during real-time monitoring by combining the cosine similarity with the threshold comparison.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a flowchart of a target sound source localization method based on a joint optimization network according to an embodiment of the present invention.

Fig. 2 is a flowchart of collecting all target sound source signals by a microphone array disposed at a monitoring location and storing the target sound source signals in a sound data set together with sound data according to a coordinate tag in a target sound source localization method based on a joint optimization network according to an embodiment of the present invention.

Fig. 3 is a flowchart of dividing the sound data set into a training set and a verification set in a joint optimization network-based target sound source localization method according to an embodiment of the present invention.

Fig. 4 is a flowchart of an embodiment of the present invention, in a target sound source localization method based on a joint optimization network, after preprocessing the training set, sending the training set to a preset target sound detection and localization joint network to obtain a target sound source localization model, where the target sound detection and localization joint network includes a time sequence feature extraction module, a target sound detection module, and a sound source coordinate regression prediction module.

Fig. 5 is a flowchart of sending the preprocessed verification set to the target sound source localization model, calculating cosine similarity of the training set and the verification set by the target sound detection module, and determining a target sound source detection threshold in the target sound source localization method based on the joint optimization network according to an embodiment of the present invention.

Fig. 6 is a flowchart of acquiring sound data in real time and determining an azimuth prediction of a target sound source by using the cosine similarity and the target sound source detection threshold in a target sound source localization method based on a joint optimization network according to an embodiment of the present invention.

Fig. 7 is a block diagram of a target sound source localization system based on a joint optimization network according to an embodiment of the present invention.

Fig. 8 is a schematic diagram of target sound source localization in one embodiment of the present invention.

Fig. 9 is a block diagram of an electronic device in one embodiment of the invention.

Detailed Description

In some of the flows described in the present specification and claims and in the above figures, a number of operations are included that occur in a particular order, but it should be clearly understood that these operations may be performed out of order or in parallel as they occur herein, with the order of the operations being indicated as 101, 102, etc. merely to distinguish between the various operations, and the order of the operations by themselves does not represent any order of performance. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Target sound source detection and localization can be carried out according to the sound source signals received by a sound pickup system, so the technique has many applications in traffic whistle (horn) monitoring, audio and video surveillance, and visual navigation systems for the blind.

Prior to the present technique, conventional sound source localization fell into two types. The first is the class of beamforming algorithms; such algorithms cannot localize a specific target sound, and cannot localize accurately in scenes with complex environmental reverberation and many sound sources. The second is target sound detection and localization based on machine learning, which divides the system into a sound source detection subtask and a sound source localization subtask; its problems are that the training data cannot exhaust all non-target sound source scenarios during model training and that the system is not end-to-end.

The embodiment of the invention provides a target sound source positioning method and system based on a joint optimization network. The scheme can realize target sound source positioning based on a joint optimization network, and can simultaneously carry out target sound source detection and end-to-end system positioning only by acquiring a high signal-to-noise ratio sound signal containing a target sound source for training and modeling.

According to a first aspect of the embodiments of the present invention, a target sound source localization method based on a joint optimization network is provided.

Fig. 1 is a flowchart of a target sound source localization method based on a joint optimization network according to an embodiment of the present invention.

In one or more embodiments, as shown in fig. 1, preferably, the method for target sound source localization based on joint optimization network includes:

s101, collecting all target sound source signals through a microphone array arranged at a monitoring position, and storing the target sound source signals and sound data into a sound data set according to a coordinate label;

s102, dividing the sound data set into a training set and a verification set;

s103, after preprocessing the training set, sending the training set into a preset target sound detection and positioning combined network to obtain a target sound source positioning model, wherein the target sound detection and positioning combined network comprises a time sequence feature extraction module, a target sound detection module and a source coordinate regression prediction module;

s104, preprocessing the verification set, sending the preprocessed verification set into the target sound source positioning model, calculating cosine similarity of the target sound detection module to the training set and the verification set, and determining a target sound source detection threshold;

and S105, acquiring sound data in real time, and determining the azimuth prediction of the target sound source by using the cosine similarity and the target sound source detection threshold.

In the embodiment of the invention, after the microphone array is placed at the position to be monitored, feature extraction is performed on the raw sound feature matrix, and target sound detection and sound source coordinate prediction are carried out jointly; the target sound source position is finally produced by the model that minimizes the loss function.

Fig. 2 is a flowchart of collecting all target sound source signals by a microphone array disposed at a monitoring location and storing the target sound source signals in a sound data set together with sound data according to a coordinate tag in a target sound source localization method based on a joint optimization network according to an embodiment of the present invention.

As shown in fig. 2, in one or more embodiments, preferably, the collecting of all target sound source signals by a microphone array placed at the monitoring position and the storing of the collected sound data, together with its coordinate labels, into the sound data set specifically includes:

s201, establishing a rectangular coordinate system by taking the microphone array as an origin;

s202, dividing the monitoring position into at least one monitoring area position sub-block;

s203, acquiring central coordinates of all the monitoring area position sub-blocks;

s204, calculating the azimuth angle of the microphone array according to the center coordinates of the position sub-blocks of the monitoring area;

s205, collecting sound source signals of all the sub-blocks at the monitoring area positions, and storing the sound source signals as initial sound source signals;

s206, storing the initial sound source signal at a preset fixed time interval;

s207, acquiring a preset sampling frequency;

s208, acquiring the data length by using a first calculation formula;

s209, storing the initial sound source signal corresponding to the data length as the sound data set;

the first calculation formula is:

N = t0 * fc

where N is the data length, t0 is the fixed time interval, and fc is the preset sampling frequency.

In the embodiment of the invention, after the microphone array is placed at the position to be monitored, a rectangular coordinate system is established with the position of the microphone array as the origin of coordinates. The monitoring region is divided into a number of position sub-blocks, the center coordinates of each sub-block are obtained, and the azimuth angle of each sub-block relative to the microphone array is calculated; the target sound source signal in each sub-block is then acquired. The signals are stored at a fixed time interval t0 with a sampling frequency fc, so the length of each collected clip is N = t0 * fc. This yields multi-directional target sound source signal data DT.
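As a minimal illustration of this data-collection step, the following Python sketch (not part of the patent) computes the centers of a hypothetical grid of monitoring sub-blocks, their azimuth angles relative to the array, and the clip length N = t0 * fc; the grid size, monitoring-region extent, sampling frequency and storage interval are assumed values:

import numpy as np

FS = 16_000           # assumed sampling frequency f_c (Hz)
T0 = 0.5              # assumed fixed storage interval t_0 (s)
N = int(T0 * FS)      # data length of each stored clip, N = t_0 * f_c

def subblock_centers(x_range, y_range, n_x, n_y):
    """Centers of an n_x-by-n_y grid of monitoring-region sub-blocks, in a
    rectangular coordinate system with the microphone array at the origin."""
    xs = np.linspace(x_range[0], x_range[1], n_x + 1)
    ys = np.linspace(y_range[0], y_range[1], n_y + 1)
    cx = (xs[:-1] + xs[1:]) / 2
    cy = (ys[:-1] + ys[1:]) / 2
    return [(x, y) for x in cx for y in cy]

centers = subblock_centers((-5.0, 5.0), (1.0, 11.0), 4, 4)   # hypothetical 4 x 4 grid
azimuths = [np.arctan2(y, x) for x, y in centers]            # azimuth of each sub-block center
print(N, len(centers), round(azimuths[0], 3))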

Fig. 3 is a flowchart of dividing the sound data set into a training set and a verification set in a joint optimization network-based target sound source localization method according to an embodiment of the present invention.

As shown in fig. 3, in one or more embodiments, preferably, the dividing the sound data set into a training set and a verification set specifically includes:

s301, acquiring the sound data set, and dividing the sound data set into a training set and a verification set;

s302, obtaining all azimuth angles of the microphone array to be mapped into target coordinates in a unit rectangular coordinate system;

s303, marking all target coordinates into the training set and the verification set.

In the implementation of the invention, the collected sound data DT is divided into a training set T and a verification set V, and all azimuth angles are mapped into a rectangular coordinate system with the microphone as the origin to obtain the corresponding coordinates.

In addition, each training or verification sample pairs a piece of sound data t with its position coordinate label; in this embodiment the training set contains 3600 samples and the verification set contains 2400 samples.
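The label construction and data split described above might be sketched as follows; the mapping of an azimuth angle to unit-circle coordinates and the random shuffling are assumptions, while the 3600/2400 split is taken from the embodiment:

import numpy as np

def azimuth_to_unit_xy(theta):
    """Map an azimuth angle (radians) to target coordinates on the unit circle of a
    rectangular coordinate system centered on the microphone array (assumed mapping)."""
    return float(np.cos(theta)), float(np.sin(theta))

rng = np.random.default_rng(0)
n_train, n_val = 3600, 2400                                  # sizes used in this embodiment
azimuths = rng.uniform(0.0, 2.0 * np.pi, n_train + n_val)    # placeholder azimuth labels
labels = [azimuth_to_unit_xy(a) for a in azimuths]           # coordinate labels (x_i, y_i)

idx = rng.permutation(n_train + n_val)                       # random split into training / verification sets
train_idx, val_idx = idx[:n_train], idx[n_train:]
print(len(train_idx), len(val_idx), labels[0])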

Fig. 4 is a flowchart of an embodiment of the present invention, in a target sound source localization method based on a joint optimization network, after preprocessing the training set, sending the training set to a preset target sound detection and localization joint network to obtain a target sound source localization model, where the target sound detection and localization joint network includes a time sequence feature extraction module, a target sound detection module, and a sound source coordinate regression prediction module.

As shown in fig. 4, in one or more embodiments, preferably, after the training set is preprocessed, the training set is sent to a preset target sound detection and positioning joint network to obtain a target sound source positioning model, where the target sound detection and positioning joint network includes a time sequence feature extraction module, a target sound detection module, and a sound source coordinate regression prediction module, and specifically includes:

s401, acquiring training sound data in the training set;

s402, acquiring the length of a window frame;

wherein the window frame length is 1024;

s403, framing the training voice data according to a second calculation formula to generate windowed framed data;

s404, carrying out short-time Fourier transform on the windowed and framed data to obtain a time-frequency energy spectrum and a time-frequency phase spectrum of the sound data;

s405, combining the time-frequency energy spectrum and the time-frequency phase spectrum to be used as feature data of sound data;

s406, classifying a target sound source detection and positioning joint optimization network into a time-frequency feature extraction module, a target sound detection module and a sound source coordinate regression prediction module, wherein the time-frequency feature extraction module is a bidirectional circulation network, the target sound detection module is a convolution self-encoder, and the sound source coordinate regression prediction module is a convolution network;

s407, predicting the sound source position of the coding features through the sound source coordinate regression prediction module;

s408, enabling the loss function of the target sound source detection and positioning combined optimization network to be composed of the loss function of the target sound detection module and the loss function of the sound source coordinate regression prediction module;

s409, simultaneously optimizing the loss function of the target sound detection module and the loss function of the sound source coordinate regression prediction module by adopting a gradient descent method to obtain a target sound source positioning model;

the second calculation formula:

where W is the window sequence, n is the n-th point, and h is the window frame length.

In the embodiment of the invention, each piece of sound data is windowed and framed with a frame length of 1024, using the window sequence given by the second calculation formula, in which W is the window sequence and n is the n-th point.

A short-time Fourier transform is then applied to the windowed and framed data to obtain the time-frequency energy spectrum Denergy and the time-frequency phase spectrum Dphase of the sound data. The two are combined as the feature data of the sound data and sent into the joint network for training.
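The preprocessing above can be sketched as in the following example; the Hann window and the single-channel input are assumptions (the embodiment only fixes the frame length of 1024), and scipy's STFT stands in for the windowing, framing and short-time Fourier transform:

import numpy as np
from scipy.signal import stft

def extract_features(x, fs, frame_len=1024):
    """Window and frame the signal and take the short-time Fourier transform, returning
    the combined time-frequency energy spectrum D_energy and phase spectrum D_phase."""
    _, _, Z = stft(x, fs=fs, window='hann', nperseg=frame_len)
    energy = np.abs(Z)                         # D_energy
    phase = np.angle(Z)                        # D_phase
    return np.stack([energy, phase], axis=0)   # feature data fed to the joint network

fs = 16_000
clip = np.random.randn(int(0.5 * fs))          # one 0.5 s clip of a single channel
features = extract_features(clip, fs)
print(features.shape)                          # (2, frame_len // 2 + 1, number of frames)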

The target sound source detection and positioning joint optimization network is divided into a time-frequency feature extraction module F, a target sound detection module E and a sound source coordinate regression prediction module P. The time-frequency feature extraction module is a bidirectional recurrent network, the target sound detection module is a convolutional auto-encoder, and the sound source coordinate regression prediction module is a convolutional network.

The time-frequency energy spectrum Denergy and the time-frequency phase spectrum Dphase first enter the time-frequency feature extraction module as input, which strengthens the temporal context of the input data. The resulting feature vector z is then fed into the target sound detection module, where it is compressed into an encoding feature c and decoded by the network into a reconstruction feature. The encoding feature c is also input to the sound source coordinate regression prediction module to predict the sound source direction, as sketched in the example below.
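One possible realization of the three modules is sketched in PyTorch; the layer sizes, the GRU as the bidirectional recurrent network, and the 1-D convolutions are illustrative assumptions, since the embodiment only fixes the module types (bidirectional recurrent network, convolutional auto-encoder, convolutional network):

import torch
import torch.nn as nn

class JointNet(nn.Module):
    """Illustrative joint network: time-frequency feature extraction (bidirectional GRU),
    target sound detection (convolutional auto-encoder) and coordinate regression."""

    def __init__(self, freq_bins=513, hidden=128):
        super().__init__()
        # Time-frequency feature extraction module: bidirectional recurrent network over frames.
        self.rnn = nn.GRU(input_size=2 * freq_bins, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        # Target sound detection module: 1-D convolutional auto-encoder over the feature sequence.
        self.encoder = nn.Sequential(
            nn.Conv1d(2 * hidden, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(64, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(64, 2 * hidden, kernel_size=4, stride=2, padding=1),
        )
        # Sound source coordinate regression prediction module: convolutional head on the encoding c.
        self.regressor = nn.Sequential(
            nn.Conv1d(32, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 2),
        )

    def forward(self, feat):
        # feat: (batch, frames, 2 * freq_bins), energy and phase concatenated per frame;
        # the number of frames is assumed divisible by 4 so the auto-encoder restores the shape.
        z, _ = self.rnn(feat)           # feature vector z: (batch, frames, 2 * hidden)
        z = z.transpose(1, 2)           # channels-first for the convolutions
        c = self.encoder(z)             # encoding feature c
        z_hat = self.decoder(c)         # reconstruction feature, same shape as z
        coords = self.regressor(c)      # predicted azimuth coordinates (x_hat, y_hat)
        return z, z_hat, coords

feat = torch.randn(4, 32, 2 * 513)      # 4 clips, 32 frames each
z, z_hat, coords = JointNet()(feat)
print(z.shape, z_hat.shape, coords.shape)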

The loss function of the target sound source detection and positioning joint optimization network consists of the target sound detection module loss LE and the sound source coordinate regression prediction module loss LP. LE is the L2-norm recovery error of the time-frequency features and LP is the L2-norm prediction error of the azimuth coordinates:

L = α·LE + β·LP,  LE = (1/N)·Σ_{i=1..N} ||zi − ẑi||₂²,  LP = (1/N)·Σ_{i=1..N} [(xi − x̂i)² + (yi − ŷi)²]

where zi is the time-frequency feature of the i-th training-set sample obtained by the time-frequency feature extraction module, ẑi is the reconstructed time-frequency feature obtained by the target sound detection module, (xi, yi) are the orientation coordinates of the i-th sample and (x̂i, ŷi) the predicted coordinates, N is the number of training samples, and α and β are constant coefficients taking values in (0, 1];

and simultaneously optimizing the two losses by adopting a gradient descent method in the training process to obtain a target sound source positioning model M.
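A self-contained sketch of the joint loss and one gradient step is given below; the weighting L = α·LE + β·LP matches the loss described above, while the dummy tensors merely stand in for the outputs of the network modules:

import torch

def joint_loss(z, z_hat, coords_pred, coords_true, alpha=0.5, beta=0.5):
    """L = alpha * L_E + beta * L_P: L2 recovery error of the time-frequency features
    plus L2 prediction error of the azimuth coordinates (alpha, beta assumed equal here)."""
    l_e = torch.mean((z - z_hat) ** 2)
    l_p = torch.mean(torch.sum((coords_pred - coords_true) ** 2, dim=1))
    return alpha * l_e + beta * l_p

# Dummy tensors standing in for the module outputs; requires_grad mimics trainable outputs.
z = torch.randn(4, 256, 32)
z_hat = torch.randn(4, 256, 32, requires_grad=True)
pred = torch.randn(4, 2, requires_grad=True)
true = torch.randn(4, 2)

loss = joint_loss(z, z_hat, pred, true)
loss.backward()        # both terms are optimized simultaneously by gradient descent
print(float(loss))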

Fig. 5 is a flowchart of sending the preprocessed verification set to the target sound source localization model, calculating cosine similarity of the training set and the verification set by the target sound detection module, and determining a target sound source detection threshold in the target sound source localization method based on the joint optimization network according to an embodiment of the present invention.

As shown in fig. 5, in one or more embodiments, preferably, after the preprocessing is performed on the verification set, the preprocessed verification set is sent to the target sound source localization model, the cosine similarity of the target sound detection module to the training set and the verification set is calculated, and a target sound source detection threshold is determined, which specifically includes:

s501, acquiring the data in the verification set and the data in the training set, and sending the data to the target sound source positioning model;

s502, acquiring a first reconstruction feature by utilizing the time-frequency feature extraction module in the target sound source detection and positioning joint optimization network

S503, acquiring a second reconstruction characteristic by using the target sound detection module in the target sound source detection and positioning joint optimization network

S504, obtaining the target sound source detection threshold value by using a third calculation formula;

s505, judging whether the maximum value of the cosine similarity is larger than the target sound source detection threshold value or not;

the third calculation formula is:

th = max_j Cj,  Cj = (1/N)·Σ_{i=1..N} cos(ri, r′j)

where max takes the maximum over all Cj, N is the number of training samples, ri is the output of the i-th training-set sample after passing through the time-frequency feature extraction module, r′j is the output of the j-th verification-set sample after passing through the time-frequency feature extraction module, cos(·,·) denotes the cosine similarity, and Cj is the mean value of the cosine similarity between the j-th verification sample and all the training-set data.

In the embodiment of the invention, the training set T and the verification set V are each sent into the trained joint network, and the reconstruction features are obtained by passing through the time-frequency feature extraction module and the target sound detection module in sequence. The mean value Cj of the cosine similarity between the j-th verification sample and all training-set data is then computed, and the maximum of Cj is taken as the threshold for determining whether the target sound is present.
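Under the assumption that each sample is summarized by a flattened reconstruction-feature vector, the threshold computation can be sketched as follows:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def detection_threshold(train_recons, val_recons):
    """C_j is the mean cosine similarity between the j-th verification reconstruction
    and all training reconstructions; the threshold is the maximum of C_j."""
    C = [np.mean([cosine_similarity(t, v) for t in train_recons]) for v in val_recons]
    return max(C)

# Toy reconstruction features; in practice these come from the trained joint network.
rng = np.random.default_rng(0)
train_recons = rng.normal(size=(100, 64))
val_recons = rng.normal(size=(20, 64))
th = detection_threshold(train_recons, val_recons)
print(round(th, 4))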

fig. 6 is a flowchart of acquiring sound data in real time and determining an azimuth prediction of a target sound source by using the cosine similarity and the target sound source detection threshold in a target sound source localization method based on a joint optimization network according to an embodiment of the present invention.

As shown in fig. 6, in one or more embodiments, preferably, the acquiring sound data in real time, and determining the azimuth prediction of the target sound source by using the cosine similarity and the target sound source detection threshold specifically includes:

s601, during real-time detection, windowing and framing collected sound data, and then performing short-time Fourier transform to obtain characteristic data of the time-frequency energy spectrum and the time-frequency phase spectrum containing the sound data;

s602, acquiring first real-time data of sound data for windowing and framing;

s603, carrying out short-time Fourier transform on the first real-time data to obtain the time-frequency energy spectrum and the time-frequency phase spectrum containing sound data;

s604, obtaining the characteristic data of the time-frequency energy spectrum and the time-frequency phase spectrum, sending the characteristic data to the target sound source detection and positioning joint optimization network, and calculating cosine similarity;

s605, comparing the detected signal with a preset target sound source detection threshold, and if the cosine similarity is smaller than the target sound source detection threshold, detecting no target sound at this time;

and S606, if the cosine similarity is not smaller than the target sound source detection threshold, detecting the target sound at this time, and taking the output of the sound source coordinate regression prediction module as a target sound source positioning result.

In the embodiment of the invention, the acquired 0.5 s of sound data is windowed and framed into frames of length 1024, and a short-time Fourier transform is carried out to obtain the time-frequency spectrum of the sound signal. The modulus and the phase of the time-frequency spectrum are then taken and combined into a feature data matrix containing the time-frequency energy spectrum and the time-frequency phase spectrum of the sound data. The feature data is sent into the pre-trained joint network, and the average cosine similarity ct between the target sound detection module output for this sound data and the target sound detection module outputs for the training data is calculated and compared with the preset threshold th. If ct is less than th, no target sound is detected this time and the output of the sound source coordinate regression prediction module is ignored; if ct is not less than th, the target sound is detected and the output of the sound source coordinate regression prediction module is the target sound source localization result.
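The real-time decision rule then reduces to the comparison sketched below; the inputs are assumed to be the reconstruction feature and the coordinate output already produced by the joint network for the incoming 0.5 s clip, together with the training-set reconstruction features and the threshold th:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def decide(recon_feat, pred_xy, train_recons, th):
    """Return the localization result if the clip is judged to contain the target
    sound, otherwise None (the regression output is ignored)."""
    c_t = np.mean([cosine_similarity(t, recon_feat) for t in train_recons])
    if c_t < th:
        return None              # no target sound detected this time
    return pred_xy               # target detected: regression output is the localization result

# Toy example with random stand-ins for the network outputs.
rng = np.random.default_rng(1)
train_recons = rng.normal(size=(100, 64))
result = decide(rng.normal(size=64), (0.3, 0.8), train_recons, th=0.2)
print(result)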

According to a second aspect of the embodiments of the present invention, a target sound source localization system based on a joint optimization network is provided.

Fig. 7 is a block diagram of a target sound source localization system based on a joint optimization network according to an embodiment of the present invention.

In one or more embodiments, as shown in fig. 7, preferably, the target sound source localization system based on the joint optimization network includes:

the data acquisition module 701 is used for acquiring all target sound source signals through a microphone array placed at a monitoring position and storing the target sound source signals and sound data into a sound data set according to a coordinate tag;

a sample division module 702, configured to divide the sound data set into a training set and a verification set;

the model training module 703 is configured to preprocess the training set and send it to a preset target sound detection and positioning joint network to obtain a target sound source positioning model, where the target sound detection and positioning joint network includes a time sequence feature extraction module, a target sound detection module, and a sound source coordinate regression prediction module;

a threshold determination module 704, configured to send the preprocessed verification set to the target sound source localization model, calculate cosine similarity of the target sound detection module to the training set and the verification set, and determine a target sound source detection threshold;

and the real-time detection module 705 is configured to obtain sound data in real time, and determine the azimuth prediction of the target sound source by using the cosine similarity and the target sound source detection threshold.

In one or more embodiments, preferably, the joint optimization network-based target sound source localization system further includes a model operation submodule 706, configured to:

the loss function L of the target sound source detection and positioning joint optimization network is composed of the loss function LE of the target sound detection module and the loss function LP of the sound source coordinate regression prediction module; wherein LE is the L2-norm recovery error of the time-frequency features and LP is the L2-norm prediction error of the azimuth coordinates; specifically, the loss function L of the target sound source detection and positioning joint optimization network is calculated as:

L = α·LE + β·LP,  LE = (1/N)·Σ_{i=1..N} ||zi − ẑi||₂²,  LP = (1/N)·Σ_{i=1..N} [(xi − x̂i)² + (yi − ŷi)²]

where zi is the time-frequency feature of the i-th training-set sample obtained by the time-frequency feature extraction module, ẑi is the reconstructed time-frequency feature obtained by the target sound detection module, xi and yi are in turn the abscissa and ordinate of the orientation of the i-th sample, N is the number of training samples, α and β are constant coefficients taking values in (0, 1], and x̂i and ŷi are the predicted values of the system for the x-axis and y-axis, respectively.

Fig. 8 is a schematic diagram of target sound source localization in one embodiment of the present invention.

As shown in fig. 8, after the sound feature matrix is obtained, time-frequency feature extraction, sound source coordinate regression prediction and target sound detection are performed, and the losses are jointly optimized through the loss function to obtain the target sound detection and localization result.

According to a third aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method according to any one of the first aspect of embodiments of the present invention.

According to a fourth aspect of the embodiments of the present invention, there is provided an electronic apparatus. Fig. 9 is a block diagram of an electronic device in one embodiment of the invention. The electronic equipment shown in fig. 9 is a general sound source localization arrangement comprising a general computer hardware structure comprising at least a processor 901 and a memory 902. The processor 901 and the memory 902 are connected by a bus 903. The memory 902 is adapted to store instructions or programs executable by the processor 901. Processor 901 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 901 implements the processing of data and the control of other devices by executing instructions stored by the memory 902 to perform the method flows of embodiments of the present invention as described above. The bus 903 connects the above components together, as well as to the display controller 904 and display devices and input/output (I/O) devices 905. Input/output (I/O) devices 905 may be a mouse, keyboard, modem, network interface, touch input device, motion-sensing input device, printer, and other devices known in the art. Typically, the input/output devices 905 are connected to the system through an input/output (I/O) controller 906.

The technical scheme provided by the embodiment of the invention can have the following beneficial effects:

1) In the embodiment of the invention, feature extraction and sound coordinate prediction for the target sound source are realized by the obtained joint target sound detection and positioning network.

2) In the embodiment of the invention, by providing a method for calculating the target sound decision threshold, a specific threshold can be obtained through training rather than set empirically.

3) In the embodiment of the invention, a rapid target sound source localization result is obtained during real-time monitoring by combining the cosine similarity with the threshold comparison.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
