Multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal

Document No.: 1546329 · Publication date: 2020-01-17

Reading note: this technology, "Multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal", was created by 杨俊杰, 杨祖元, 谢胜利, 杨超 and 解元 on 2019-08-30. Its main content is as follows. The invention discloses a multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal; it introduces new mathematical tools and analysis methods, fuses video and audio signal information, and realizes effective estimation of the convolutive aliased speech channel. By means of the video signal of the speaker's mouth region, the method extracts the speaker's mouth-shape feature data through non-negative matrix factorization; it uses a density clustering method to detect the cluster center of the mouth feature data, detects the image frames in which the speaker's mouth is silent, and further extracts all time windows dominated by a single speaker's voicing. According to the locally dominant time-window information, local dominant covariance matrices are computed from the observed speech components in the time-frequency domain, and the dominant eigenvectors are extracted through eigenvalue decomposition, thereby realizing aliased speech channel estimation. Numerical experiments demonstrate the superiority of the proposed method over the currently popular aliased speech channel estimation methods based on single-modality audio.

1. A multi-channel convolution aliasing speech channel estimation algorithm in combination with a video signal, comprising the steps of:

collecting video data of a plurality of speakers and cropping video images of the speakers' mouth regions to form a video database; meanwhile, recording each speaker's voice signal to construct an audio database; and synthesizing a plurality of multi-channel convolutive aliased speech signals from the audio database;

carrying out non-negative matrix factorization on the vectorized representation matrix of the speakers' mouth-region video images to obtain an image feature matrix and an image representation matrix, respectively; and modeling the multi-channel convolutive aliased speech signal mathematically in the time-frequency domain via the short-time Fourier transform;

carrying out density clustering column by column on the image representation matrix of each single speaker to find the maximum-density cluster center; setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center, taking this set as the speaker's mouth-silence data set and its complement as the speaker's voicing data set; and performing union and intersection operations on the silence and voicing data sets of the plurality of speakers to detect the locally dominant set of each single speaker;

according to the locally dominant set of each single speaker, respectively computing the sequences of time-frequency-domain second-order covariance matrices over the corresponding time windows, and extracting the dominant eigenvector from each covariance matrix to form the estimated aliasing channel.

2. The multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal according to claim 1, wherein collecting the video data of a plurality of speakers, cropping video images of the speakers' mouth regions to form a video database, and meanwhile recording each speaker's voice signal to construct an audio database comprises:

recording front-facing speaking videos of a plurality of speakers with a camera, with the speakers pausing briefly after reciting each sentence, and cropping the video images of the speakers' mouth regions to form a video database; and, while recording the video, recording each speaker's clean voice signal with a microphone to construct an audio database.

3. The multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal according to claim 1, wherein the non-negative matrix factorization performed on the vectorized representation matrix of the speaker's mouth-region video images, yielding an image feature matrix and an image representation matrix respectively, is expressed as:

$$V_i = W_i H_i$$

where $V_i$ denotes the vectorized representation matrix of the $i$-th speaker's mouth-region video images, the image feature matrix is $W_i = [w_{i,1}, \dots, w_{i,K}] \in \mathbb{R}_+^{P \times K}$, and the image representation matrix is $H_i = [h_{i,1}, \dots, h_{i,Q}] \in \mathbb{R}_+^{K \times Q}$; here $i$ indexes the speaker, $P$ is the total number of pixels in a video frame, $K$ is the number of columns of the image feature matrix, $Q$ is the number of columns of the image representation matrix, $\mathbb{R}_+$ is the set of non-negative real numbers, and $K \ll Q$; all columns of $H_i$ have unit norm, i.e. $\|h_{i,q}\|_2 = 1$, $q = 1, \dots, Q$.

4. The multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal according to claim 1, wherein the multi-channel convolutive aliased speech signal is modeled mathematically in the time-frequency domain via the short-time Fourier transform as:

$$x_{f,d} = A_f s_{f,d} + e_{f,d}$$

where $A_f$ is the aliasing channel at frequency bin $f$ in the complex field, $s_{f,d}$ is the speech source component at time-frequency point $(f, d)$, and $e_{f,d}$ is Gaussian noise.

5. The multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal according to claim 1, wherein, when density clustering is performed column by column on the image representation matrix of a single speaker, the local density evaluation index $\rho_{iq}$ of the $i$-th speaker is computed as:

$$\rho_{iq} = \sum_{k=1}^{Q} \chi(\phi_{i,qk} - \varepsilon), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases}$$

where $\phi_{i,qk} = \|h_{i,q} - h_{i,k}\|_2$ is the Euclidean distance between columns $h_{i,q}$ and $h_{i,k}$ of the image representation matrix $H_i$, and $\varepsilon$ is a preset Euclidean distance threshold.

6. The multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal according to claim 1, wherein setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center comprises:

setting a distance threshold $\mu$, and marking as $\Phi_i$ the index set of all image representation vector data points whose distance from the maximum-density cluster center is below the threshold.

7. The multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal according to claim 1, wherein the sequences of time-frequency-domain second-order covariance matrices computed over the corresponding time windows are expressed as:

$$\hat{R}_{f,i} = \frac{1}{|g(\Psi_i)|} \sum_{d \in g(\Psi_i)} x_{f,d}\, x_{f,d}^{H}$$

where $g(\Psi_i)$ is the mapping function that converts the locally dominant set $\Psi_i$ of a single speaker into the corresponding set of speech time-frequency frames.

8. The multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal according to claim 1, wherein the dominant eigenvector is the eigenvector corresponding to the largest eigenvalue.

Technical Field

The invention relates to the field of speech signal processing, and in particular to a multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal.

Background

The task of audio speech separation (ASS) is to separate the voice of a target speaker, by means of signal processing, from the mixed speech signal of multiple speakers received by microphones. This is a very challenging problem in the field of signal processing. Before complete speech separation can be achieved, obtaining the aliasing channel information is a key link in the separation problem. In practice, processing the speech problem with the aid of a video signal can overcome the interference of background noise, yield more accurate information about the speakers' speaking states, and remedy the shortcomings of processing mixed speech with single-modality audio signals in noisy, highly reverberant environments.

In actual recording conditions, the speech signal is affected by room reverberation and background noise, and the recorded speech is usually the result of aliased synthesis over multiple fading paths; mathematically, it can be described by a convolutive aliasing model. Owing to high reverberation, strong background noise, and other practical factors, indoor convolutive speech mixing systems are complex and the aliasing channel information is difficult to obtain, which greatly complicates subsequent speech separation. For single-modality audio signals, methods that convert the observed speech signal into the time-frequency domain for batch processing are popular for solving the aliasing channel estimation problem in reverberant and noisy environments, such as the currently popular PARAFAC-SC and Bayes-RisMin algorithms. However, under the high reverberation and high noise of real conditions, the prior art easily suffers from mutual crosstalk between signal sources, and the final aliasing channel estimate is not ideal.

Disclosure of Invention

The invention aims to provide a multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal, which addresses the problem that the estimation performance of existing algorithms for the aliasing channel is not ideal.

In order to realize the task, the invention adopts the following technical scheme:

a multi-channel convolution aliasing speech channel estimation algorithm in combination with a video signal, comprising the steps of:

collecting video data of a plurality of speakers and cropping video images of the speakers' mouth regions to form a video database; meanwhile, recording each speaker's voice signal to construct an audio database; and synthesizing a plurality of multi-channel convolutive aliased speech signals from the audio database;

carrying out non-negative matrix factorization on the vectorized representation matrix of the speakers' mouth-region video images to obtain an image feature matrix and an image representation matrix, respectively; and modeling the multi-channel convolutive aliased speech signal mathematically in the time-frequency domain via the short-time Fourier transform;

carrying out density clustering column by column on the image representation matrix of each single speaker to find the maximum-density cluster center; setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center, taking this set as the speaker's mouth-silence data set and its complement as the speaker's voicing data set; and performing union and intersection operations on the silence and voicing data sets of the plurality of speakers to detect the locally dominant set of each single speaker;

according to the locally dominant set of each single speaker, respectively computing the sequences of time-frequency-domain second-order covariance matrices over the corresponding time windows, and extracting the dominant eigenvector from each covariance matrix to form the estimated aliasing channel.

Furthermore, collecting the video data of a plurality of speakers, cropping video images of the speakers' mouth regions to form a video database, and meanwhile recording each speaker's voice signal to construct an audio database comprises:

recording front-facing speaking videos of a plurality of speakers with a camera, with the speakers pausing briefly after reciting each sentence, and cropping the video images of the speakers' mouth regions to form a video database; and, while recording the video, recording each speaker's clean voice signal with a microphone to construct an audio database.

Further, the non-negative matrix factorization performed on the vectorized representation matrix of the speaker's mouth-region video images, yielding an image feature matrix and an image representation matrix, is expressed as:

$$V_i = W_i H_i$$

where $V_i$ denotes the vectorized representation matrix of the $i$-th speaker's mouth-region video images, the image feature matrix is $W_i = [w_{i,1}, \dots, w_{i,K}] \in \mathbb{R}_+^{P \times K}$, and the image representation matrix is $H_i = [h_{i,1}, \dots, h_{i,Q}] \in \mathbb{R}_+^{K \times Q}$; here $i$ indexes the speaker, $P$ is the total number of pixels in a video frame, $K$ is the number of columns of the image feature matrix, $Q$ is the number of columns of the image representation matrix, $\mathbb{R}_+$ is the set of non-negative real numbers, and $K \ll Q$. All columns of $H_i$ have unit norm, i.e. $\|h_{i,q}\|_2 = 1$, $q = 1, \dots, Q$.

Further, the mathematical modeling of the multi-channel convolutive aliased speech signal in the time-frequency domain via the short-time Fourier transform is expressed as:

$$x_{f,d} = A_f s_{f,d} + e_{f,d}$$

where $A_f$ is the aliasing channel at frequency bin $f$ in the complex field, $s_{f,d}$ is the speech source component at time-frequency point $(f, d)$, and $e_{f,d}$ is Gaussian noise.

Further, when density clustering is performed column by column on the image representation matrix of a single speaker, the local density evaluation index $\rho_{iq}$ of the $i$-th speaker is computed as:

$$\rho_{iq} = \sum_{k=1}^{Q} \chi(\phi_{i,qk} - \varepsilon), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases}$$

where $\phi_{i,qk} = \|h_{i,q} - h_{i,k}\|_2$ is the Euclidean distance between columns $h_{i,q}$ and $h_{i,k}$ of the image representation matrix $H_i$, and $\varepsilon$ is a preset Euclidean distance threshold.

Further, setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center comprises:

setting a distance threshold $\mu$, and marking as $\Phi_i$ the index set of all image representation vector data points whose distance from the maximum-density cluster center is below the threshold.

Further, the sequences of time-frequency-domain second-order covariance matrices computed over the corresponding time windows are expressed as:

$$\hat{R}_{f,i} = \frac{1}{|g(\Psi_i)|} \sum_{d \in g(\Psi_i)} x_{f,d}\, x_{f,d}^{H}$$

where $g(\Psi_i)$ is the mapping function that converts the locally dominant set $\Psi_i$ of a single speaker into the corresponding set of speech time-frequency frames.

Further, the dominant eigenvector is the eigenvector corresponding to the largest eigenvalue.

Compared with the prior art, the invention has the following technical characteristics:

the method detects a local dominant time window of a single speaker in a video image by means of video image detection of a speaker mouth region and introducing a mathematical tool (a non-negative matrix decomposition and density clustering method), and meanwhile constructs a time-frequency domain voice local covariance statistical matrix from an audio signal and extracts a dominant feature vector so as to estimate an aliasing channel; series of experiments prove that the algorithm has better estimation performance than other single audio mode algorithms.

Drawings

FIG. 1 is a diagram of a clean speech signal;

FIG. 2 is a diagram of an aliased speech signal;

FIGS. 3(a) and 3(b) are mouth images of speaker 1 and speaker 2, respectively;

FIG. 4 is a schematic diagram of the density clustering effect on the mouth-image feature data of speaker 1;

FIG. 5 is a schematic diagram of a single speaker local dominance detection effect based on a mouth representation matrix;

FIG. 6 is a schematic flow chart of the method of the present invention.

Detailed Description

The invention provides a multi-channel convolution aliasing speech channel estimation algorithm combined with a video signal. By means of video-based mouth-region state detection, key frames in which the speakers' mouths remain silent are extracted from the video images; on this basis, all time windows in which only one speaker is voicing are detected, and the convolutive aliased speech channel is estimated in combination with the observed audio signal. On the video side, the mouth-region video signals of $N$ speakers are denoted $V_1, \dots, V_N$, where $V_i \in \mathbb{R}^{P \times Q}$ is the vectorized representation of the $i$-th speaker's mouth-region video, $P$ is the total number of pixels in a video frame, $Q$ is the total number of video frames, and $i = 1, \dots, N$. On the audio side, the convolutive speech aliasing system is $x(t) = A * s(t) + e(t)$, where $x(t) \in \mathbb{R}^M$ denotes the observed speech signals collected by $M$ microphones, $A \in \mathbb{R}^{M \times N \times L}$ is the $L$-th order aliasing channel matrix under reverberation, $*$ denotes convolution, $s(t) \in \mathbb{R}^N$ is the clean speech signal, and $e(t) \in \mathbb{R}^M$ is the system noise. The aim of the invention is to estimate the convolutive aliased speech channel $A$ by combining the video and audio signals.
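To make the mixing model concrete, the following is a minimal NumPy sketch of the convolutive aliasing system $x(t) = A * s(t) + e(t)$; the function name, array shapes, and noise level are illustrative assumptions rather than the patent's recording setup.

```python
import numpy as np

def convolutive_mix(S, A, noise_std=0.0, rng=None):
    """Simulate x(t) = A * s(t) + e(t) for M microphones, N sources,
    and an L-tap aliasing (room impulse response) channel.

    S : (N, T) clean speech signals
    A : (M, N, L) convolutive aliasing channel (L-th order)
    Returns X : (M, T) observed microphone signals.
    """
    rng = rng or np.random.default_rng(0)
    M, N, L = A.shape
    _, T = S.shape
    X = np.zeros((M, T))
    for m in range(M):
        for n in range(N):
            # full convolution of source n with channel (m, n), truncated to T
            X[m] += np.convolve(S[n], A[m, n], mode="full")[:T]
    if noise_std > 0:
        X += noise_std * rng.standard_normal(X.shape)  # system noise e(t)
    return X
```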

Step 1: collect video data of a plurality of speakers and crop video images of the speakers' mouth regions to form a video database; meanwhile, record each speaker's voice signal to construct an audio database; and synthesize a plurality of multi-channel convolutive aliased speech signals from the audio database.

First, front-facing speaking videos of a plurality of speakers are recorded with a camera, with the speakers pausing briefly after reciting each sentence, and the video images of the mouth regions are cropped to form a video database; while recording the video, each speaker's clean voice signal is recorded with a microphone to construct an audio database.

In this embodiment, three speech aliasing schemes are synthesized, with the number of microphones $M \in \{2, 3\}$ and the number of speakers $N \in \{2, 3, 4\}$. The recorded speech has a sampling rate of $f_s = 8000$ Hz and an acquisition length of 40 seconds. In addition, the microphone spacing is set to 0.05 m, the speaker spacing to 0.4 m, and the distance between the microphone center and the speaker center to 1.2 m; several reverberation times $\mathrm{RT}_{60}$ are considered (see Table 1), and the room impulse response functions are generated by the image-based RIR algorithm (J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., 65(4), 1979). The videos of the speakers are recorded with a Samsung I9100 mobile phone at a frame rate of fps = 25, each image being 90 × 110 pixels; the short-time Fourier window length is set to 2048.

Step 2: perform non-negative matrix factorization on the vectorized representation matrix of the speakers' mouth-region video images to obtain an image feature matrix and an image representation matrix, thereby extracting features of the mouth-region video images; and model the multi-channel convolutive aliased speech signal mathematically in the time-frequency domain via the short-time Fourier transform.

Because the video image arrays are large, processing directly in the image domain is computationally heavy and increases the algorithm's complexity. The scheme therefore obtains the video image feature information through non-negative matrix factorization, reducing the dimensionality of the mouth-region images.

The vectorized representation matrix $V_i$ of the speaker's mouth-region video images is subjected to non-negative matrix factorization, expressed as:

$$V_i = W_i H_i$$

where the image feature matrix is $W_i = [w_{i,1}, \dots, w_{i,K}] \in \mathbb{R}_+^{P \times K}$ and the image representation matrix is $H_i = [h_{i,1}, \dots, h_{i,Q}] \in \mathbb{R}_+^{K \times Q}$; here $i$ indexes the speaker, $P$ is the total number of pixels in a video frame, $K$ is the number of columns of the image feature matrix, $Q$ is the number of columns of the image representation matrix, $\mathbb{R}_+$ is the set of non-negative real numbers, and $K \ll Q$. All columns of $H_i$ have unit norm, i.e. $\|h_{i,q}\|_2 = 1$, $q = 1, \dots, Q$.
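As an illustration of this factorization, here is a minimal sketch using the standard Lee-Seung multiplicative updates for the Euclidean cost; the patent does not name the particular NMF algorithm, so this update rule, and the function name, are assumptions. The unit-normalized columns of $H_i$ returned here serve as the per-frame mouth features used for the clustering below.

```python
import numpy as np

def nmf_mouth_features(V, K, n_iter=200, eps=1e-9, seed=0):
    """Factor V ~= W @ H with V (P x Q) non-negative, W (P x K), H (K x Q).
    Uses Lee-Seung multiplicative updates for the Euclidean cost
    (an assumed choice; the patent does not name the NMF algorithm)."""
    rng = np.random.default_rng(seed)
    P, Q = V.shape
    W = rng.random((P, K)) + eps
    H = rng.random((K, Q)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # post-process: scale every column of H to unit length; these unit-norm
    # columns are the per-frame mouth features used for density clustering
    Hn = H / (np.linalg.norm(H, axis=0, keepdims=True) + eps)
    return W, H, Hn
```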

The multi-channel convolutive aliased speech signal $x(t)$ is modeled mathematically in the time-frequency domain using the short-time Fourier transform:

there are N signals (N2, 3,4), aliasing occurs when M microphones receive (M2, 3), and aliasing speech signal component x in time-frequency points (f, d)f,dExpressed as:

$$x_{f,d} = A_f s_{f,d} + e_{f,d}$$

where $A_f = [a_{f,1}, \dots, a_{f,N}]$ is the aliasing channel at frequency bin $f$ in the complex field, $s_{f,d}$ is the speech source component at time-frequency point $(f, d)$, and $e_{f,d}$ is Gaussian noise.
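For concreteness, a short sketch of how the time-frequency observations $x_{f,d}$ can be organized with an off-the-shelf STFT; the 2048-sample window matches the experiments above, while the `(F, D, M)` array layout is an implementation assumption.

```python
import numpy as np
from scipy.signal import stft

def observations_tf(X, fs=8000, nperseg=2048):
    """STFT of the M-channel observation X (M x T). Returns an array of
    shape (F, D, M): Xf[f, d] is the M-dimensional observation vector
    x_{f,d} of the model x_{f,d} = A_f s_{f,d} + e_{f,d}."""
    _, _, Z = stft(X, fs=fs, nperseg=nperseg)  # Z has shape (M, F, D)
    return np.transpose(Z, (1, 2, 0))          # reorder to (F, D, M)
```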

Step 3: perform density clustering column by column on the image representation matrix $H_i$ of a single speaker $i$, find the maximum-density cluster center, and set a threshold $\mu$ to obtain the set $\Phi_i$ of data points neighboring that center; $\Phi_i$ serves as the data set in which the mouth of speaker $i$ remains silent, and its complement $\bar{\Phi}_i$ serves as the speaker's voicing data set.

Union and intersection operations are performed on the silence and voicing data sets of the $N$ speakers to detect the locally dominant sets of single speakers, denoted $\Psi_1, \dots, \Psi_N$.

In this step, the local density evaluation index $\rho_{iq}$, $q = 1, \dots, Q$, of the $i$-th speaker is computed as:

$$\rho_{iq} = \sum_{k=1}^{Q} \chi(\phi_{i,qk} - \varepsilon), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0 \end{cases}$$

where $\phi_{i,qk} = \|h_{i,q} - h_{i,k}\|_2$ is the Euclidean distance between columns $h_{i,q}$ and $h_{i,k}$ of the image representation matrix $H_i$, and $\varepsilon$ is a preset Euclidean distance threshold, taken for example from the set of pairwise distances $\{\phi_{i,qk}\}_{q,k=1,\dots,Q}$ as the value at the lowest 6%-8% of the distances (sorted in ascending order). The local density indices $\rho_{i1}, \dots, \rho_{iQ}$ are extracted for each speaker $i = 1, \dots, N$.
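A sketch of this density computation, following the density-peaks counting rule given above; the percentile-based choice of $\varepsilon$ follows the 6%-8% guideline, and the fully vectorized pairwise-distance computation is an implementation choice.

```python
import numpy as np

def local_density(Hn, pct=7.0):
    """Local density rho_q for the unit-norm columns of Hn (K x Q):
    rho_q counts the columns lying closer than a cutoff eps, with eps
    taken from the lowest pct percent of all pairwise distances."""
    Q = Hn.shape[1]
    diff = Hn[:, :, None] - Hn[:, None, :]      # (K, Q, Q)
    phi = np.sqrt((diff ** 2).sum(axis=0))      # pairwise distances phi_{qk}
    eps = np.percentile(phi[np.triu_indices(Q, k=1)], pct)
    rho = (phi < eps).sum(axis=1) - 1           # exclude the point itself
    return rho, phi
```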

The maximum-density cluster center is found, and a distance threshold $\mu$ is set to obtain the mouth-silence data set $\Phi_i$ of the speaker; in this embodiment $\mu \approx 0.3$. The index set of all image representation vector data points whose distance from the maximum-density cluster center is below the threshold is labeled $\Phi_i$ (i.e., the silence data set), and the speaker's voicing data set is marked as its complement, denoted $\bar{\Phi}_i$.

The locally dominant set of a single speaker is then detected through the intersection operation:

$$\Psi_i = \bar{\Phi}_i \cap \Phi_{j_1} \cap \cdots \cap \Phi_{j_{N-1}}, \qquad j_l \in \{1, \dots, i-1, i+1, \dots, N\}, \quad l = 1, \dots, N-1.$$
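The set operations of this step can be sketched as follows; the list-of-sets representation and the function name are illustrative assumptions.

```python
import numpy as np

def dominance_sets(rho_list, phi_list, mu=0.3):
    """For each speaker i: take the maximum-density column as cluster center,
    collect frames within distance mu of it as the silence set Phi_i, and
    intersect voicing/silence sets to obtain the dominance sets Psi_i."""
    Q = phi_list[0].shape[0]
    silence = []
    for rho, phi in zip(rho_list, phi_list):
        c = int(np.argmax(rho))                         # max-density center
        silence.append(set(np.flatnonzero(phi[c] < mu)))
    voicing = [set(range(Q)) - s for s in silence]      # complements
    # speaker i dominates a frame iff i is voicing and all others are silent
    psi = []
    for i in range(len(silence)):
        dom = voicing[i]
        for j, s in enumerate(silence):
            if j != i:
                dom = dom & s
        psi.append(sorted(dom))
    return psi
```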

Step 4: according to the locally dominant set of each single speaker, respectively compute the sequences of time-frequency-domain second-order covariance matrices over the corresponding time windows, and extract the dominant eigenvector from each covariance matrix to form the estimated aliasing channel.

Using the time-frequency-domain aliased speech components $x_{f,d}$ obtained from the modeling in Step 2, the local second-order covariance matrices are constructed as follows:

$$\hat{R}_{f,i} = \frac{1}{|g(\Psi_i)|} \sum_{d \in g(\Psi_i)} x_{f,d}\, x_{f,d}^{H}$$

where $g(\Psi_i)$ is the mapping function that converts the locally dominant set $\Psi_i$ of a single speaker into the corresponding set of speech time-frequency frames.

Eigenvalue decomposition is performed on each local second-order covariance matrix, and the eigenvector corresponding to the largest eigenvalue (the dominant eigenvector) is extracted and denoted $\hat{a}_{f,i}$, thereby constructing the estimated aliasing channel $\hat{A}_f = [\hat{a}_{f,1}, \dots, \hat{a}_{f,N}]$ and realizing aliasing channel estimation.
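As an illustration of Step 4, the following is a minimal NumPy sketch under stated assumptions: the mapping $g(\cdot)$ from video windows to STFT frames is approximated by a fixed, hypothetical `frames_per_window` ratio (the patent does not give the form of $g$), and the dominant eigenvector is taken per frequency bin via a Hermitian eigendecomposition. As is usual for covariance-based channel estimators, a per-column scaling ambiguity remains in the result.

```python
import numpy as np

def estimate_channel(Xf, psi, frames_per_window):
    """Estimate A_f column by column: average x_{f,d} x_{f,d}^H over the
    frames g(Psi_i) for each speaker i, then keep the eigenvector of the
    largest eigenvalue as the estimate of the i-th channel column.

    Xf  : (F, D, M) time-frequency observations
    psi : list of N dominance-window index lists (one per speaker)
    """
    F, D, M = Xf.shape
    A_hat = np.zeros((F, M, len(psi)), dtype=complex)
    for i, windows in enumerate(psi):
        # g(Psi_i): expand each video window into its STFT frame indices
        frames = [d for w in windows
                  for d in range(w * frames_per_window,
                                 min((w + 1) * frames_per_window, D))]
        if not frames:
            continue  # no dominated frames detected for this speaker
        x = Xf[:, frames, :]                                  # (F, |g|, M)
        R = np.einsum('fdm,fdn->fmn', x, x.conj()) / len(frames)
        vals, vecs = np.linalg.eigh(R)    # eigenvalues ascending per bin
        A_hat[:, :, i] = vecs[..., -1]    # dominant eigenvector per f
    return A_hat
```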

The feasibility and superiority of the algorithm are illustrated by three specific simulation experiments, all implemented on a MacBook Air (Intel Core i5 CPU, 1.8 GHz, macOS 10.13.6) with Matlab R2018b. First, the scheme uses the audio and video data set provided by David Dov et al. as the test set (David Dov, Ronen Talmon, and Israel Cohen, "Audio-visual voice activity detection using diffusion maps," IEEE/ACM Trans. Audio, Speech, Lang. Process., 23(4): 732-745, 2015). From this data set, the scheme selects the mouth-movement videos and corresponding voice data of 4 speakers and constructs the video and audio test data set according to Step 1. The clean speech signal waveforms are shown in FIG. 1, and the aliased speech waveforms in FIG. 2. Video captures of the speakers' mouth-region images are shown in FIG. 3; the density clustering center detection of Step 3 is shown in FIG. 4, and the detection of a single speaker's locally dominant time windows in FIG. 5.

In addition, the scheme uses the accuracy of the estimated aliasing channel as the performance criterion, measured by the mean-square error (MSE) between the estimated aliasing channel and the true channel; the smaller the error value, the higher the estimation accuracy.

The scheme considers the convolutive speech aliasing channel estimation problem under different reverberation times $\mathrm{RT}_{60}$ and compares the proposed method with two currently popular audio-only convolutive aliasing channel estimation algorithms, Bayes-RisMin and PARAFAC-SC; the aliasing channel estimation performance is reported in Table 1 below. Clearly, the convolutive aliasing channel estimation algorithm proposed by the scheme performs better.

Table 1. Aliasing channel estimation accuracy (MSE) under different reverberation times $\mathrm{RT}_{60}$.
