Voice tampering detection method based on multi-feature fusion


Abstract: The invention discloses a voice tampering detection method based on multi-feature fusion, used to detect whether a voice file has been formed by splicing. The method comprises the following steps: step S1, framing the voice data to be detected into a plurality of groups of voice data frames; step S2, extracting multi-dimensional features from each group of voice data frames; step S3, constructing an Attention-RNN-based model as a classifier; step S4, feeding the multi-dimensional features extracted in step S2 into the trained classifier to judge whether the current speech frame has been tampered with. By extracting frame-level features, the method effectively mines the differences between preceding and following features in the voice signal; combining multiple features makes the mined voice characteristics richer; and the attention mechanism assigns different importance to local parts of the same sample and automatically learns the characteristics of the time-series signal. (Designed and created by 包永强, 梁瑞宇, 谢跃, 唐闺臣, 王青云, 朱悦, 李明, 2019-09-06.)

1. A voice tampering detection method based on multi-feature fusion, for detecting whether a voice file is formed by splicing, characterized by comprising the following steps:

step S1, framing the voice data to be detected, and dividing the voice data into a plurality of groups of voice data frames;

step S2, extracting multidimensional characteristics from each group of voice data frames;

step S3, constructing an Attention-RNN-based model as a classifier;

step S4, inputting the multidimensional features extracted in step S2 into the classifier trained in step S3, and determining whether the current speech frame has been tampered with.

2. The voice tampering detection method based on multi-feature fusion according to claim 1, characterized in that: in step S3, the Attention-RNN model adopts two RNN layers, the first being a bidirectional RNN layer; an attention layer is then attached, followed by a fully connected Dense layer and a dropout layer for mitigating overfitting; finally the result is passed to a Dense layer and fed to a softmax classifier. Each input is first fed into the bi-RNN, intermediate states are generated from the respective hidden states, and the output is obtained by weighting them.

3. The voice tampering detection method based on multi-feature fusion according to claim 1, characterized in that: in step S2, 67-dimensional speech features are extracted from each speech frame, comprising the following:

speech features 1-11: a chromagram computed from the power spectrogram of the speech signal;

speech features 12-47: Mel cepstral coefficients and their first-order and second-order differences;

speech features 48-49: zero-crossing rate and root mean square energy;

speech features 50-59: spectral centroid, P-order spectral bandwidth, spectral contrast, and roll-off frequency;

speech features 60-62: polynomial coefficients obtained by fitting the spectrogram with a polynomial;

speech features 63-64: chaos correlation dimension and chaos entropy;

speech features 65-67: harmonic energy feature, fundamental-frequency perturbation feature, and speech amplitude perturbation.

4. The voice tampering detection method based on multi-feature fusion according to claim 1, characterized in that: in step S1, each group of voice data frames has a frame length of 512 samples and a frame shift of 256 samples.

Technical Field

The invention relates to the technical field of voice tampering, in particular to a voice tampering detection method based on multi-feature fusion.

Background

The rapid development of digital voice technology has brought it an ever-wider range of applications, but the emergence of powerful voice-editing software undermines the authenticity and security of voice recordings. In special scenarios such as court evidence and historical archive preservation, the authenticity of digital audio material must be guaranteed. Determining whether a voice recording has been tampered with is therefore an urgent problem for the relevant judicial departments.

Digital voice tamper authentication techniques emerged and developed rapidly from the 1990s onward. In 1999, Farid proposed a method for detecting voice-signal tampering using bispectral analysis; Grigoras proposed a detection method that exploits ENF (electric network frequency) information; Yaoqiu et al. proposed a voice resampling tampering detection method based on the expectation-maximization algorithm; Ding et al. proposed a sub-band spectral smoothing method to detect whether a voice signal has been interpolated or spliced; Shaonian et al. proposed a method that uses the background-noise characteristics of digital recording devices to detect whether a voice signal was re-recorded on another device; and Yang et al. proposed a tamper detection method based on frame offsets in MP3-format voice.

With the development of machine learning and deep learning, researchers have proposed a variety of effective recognition models that have achieved great success in sound classification. Applying deep learning algorithms to voice tampering recognition is therefore a promising research direction, yet relatively little research has so far targeted speech tampering recognition.

Disclosure of Invention

The purpose of the invention is as follows: in order to overcome the defects of the prior art, the invention provides a voice tampering detection method based on multi-feature fusion, which can effectively identify and distinguish voice tampering and has good robustness.

The technical scheme is as follows: in order to achieve the purpose, the invention adopts the following technical scheme:

A voice tampering detection method based on multi-feature fusion, for detecting whether a voice file is formed by splicing, characterized by comprising the following steps:

step S1, framing the voice data to be detected, and dividing the voice data into a plurality of groups of voice data frames;

step S2, extracting multidimensional characteristics from each group of voice data frames;

step S3, constructing an Attention-RNN-based model as a classifier;

step S4, inputting the multidimensional features extracted in step S2 into the classifier trained in step S3, and determining whether the current speech frame has been tampered with.

Preferably, in step S3, the Attention-RNN model adopts two RNN layers, the first being a bidirectional RNN layer; an attention layer is then attached, followed by a fully connected Dense layer and a dropout layer for mitigating overfitting; finally the result is passed to a Dense layer and fed to a softmax classifier. Each input is first fed into the bi-RNN, intermediate states are generated from the respective hidden states, and the output is obtained by weighting them.

Preferably, in step S2, 67-dimensional speech features are extracted from each speech frame, comprising the following:

speech features 1-11: a chromagram computed from the power spectrogram of the speech signal;

speech features 12-47: Mel cepstral coefficients and their first-order and second-order differences;

speech features 48-49: zero-crossing rate and root mean square energy;

speech features 50-59: spectral centroid, P-order spectral bandwidth, spectral contrast, and roll-off frequency;

speech features 60-62: polynomial coefficients obtained by fitting the spectrogram with a polynomial;

speech features 63-64: chaos correlation dimension and chaos entropy;

speech features 65-67: harmonic energy feature, fundamental-frequency perturbation feature, and speech amplitude perturbation.

Preferably, in step S1, each group of voice data frames has a frame length of 512 samples and a frame shift of 256 samples.

Beneficial effects: compared with the prior art, the invention has the following beneficial effects:

(1) extracting frame-level features effectively mines the differences between preceding and following features in the voice signal;

(2) combining multiple features makes the mined voice characteristics richer;

(3) the attention mechanism assigns different importance to local parts of the same sample and automatically learns the characteristics of the time-series signal.

Drawings

FIG. 1 is a schematic structural diagram of the Attention-RNN adopted in step S3 of the invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings.

The invention discloses a voice tampering detection method based on multi-feature fusion, for detecting whether a voice file is formed by splicing, comprising the following steps:

step S1, framing the voice data to be detected, and dividing the voice data into a plurality of groups of voice data frames;

step S2, extracting multidimensional characteristics from each group of voice data frames;

step S3, constructing an Attention-RNN-based model as a classifier;

step S4, inputting the multidimensional features extracted in step S2 into the classifier trained in step S3, and determining whether the current speech frame has been tampered with.

In step S3, the Attention-RNN model is used as the classifier as follows:

The model first adopts two RNN layers, the first being a bidirectional RNN layer; an attention layer is then attached, followed by a fully connected Dense layer and a dropout layer for reducing overfitting; finally the result is passed to a Dense layer and fed to a softmax classifier. Each input is first fed into the bi-RNN, intermediate states are generated from the respective hidden states, and the output is obtained by weighting them. The weight coefficients determine the contribution of each input state to the output state; assigning different weights to the output vectors of the bidirectional RNN layer lets the model focus its attention on the important speech features and reduces the influence of irrelevant features.

assuming that the output vector is h and the weight is α, representing the importance of each feature, the combined representation is:

Figure BDA0002193333700000031

wherein, α has the calculation formula:

Figure BDA0002193333700000032

wherein the hidden layer output of the activation function is

uit=tanh(Wwhit+bw) (3)。
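For illustration only (this is not from the patent), equations (1)-(3) could be implemented as a custom Keras attention-pooling layer roughly as sketched below; the layer name, weight shapes, and the TensorFlow/Keras framework choice are all assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

class FrameAttention(layers.Layer):
    """Attention pooling over RNN time steps, following eqs. (1)-(3)."""

    def build(self, input_shape):
        d = int(input_shape[-1])
        # W_w, b_w: projection into the hidden attention space, eq. (3)
        self.W_w = self.add_weight(name="W_w", shape=(d, d), initializer="glorot_uniform")
        self.b_w = self.add_weight(name="b_w", shape=(d,), initializer="zeros")
        # u_w: context vector scored against u_it, eq. (2)
        self.u_w = self.add_weight(name="u_w", shape=(d, 1), initializer="glorot_uniform")

    def call(self, h):
        # h: (batch, time, d) outputs of the bidirectional RNN
        u = tf.tanh(tf.matmul(h, self.W_w) + self.b_w)         # eq. (3)
        alpha = tf.nn.softmax(tf.matmul(u, self.u_w), axis=1)  # eq. (2), softmax over time
        return tf.reduce_sum(alpha * h, axis=1)                # eq. (1), weighted sum
```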

In step S2, 67-dimensional speech features are extracted from each speech frame, comprising the following:

speech features 1-11: a chromagram computed from the power spectrogram of the speech signal;

speech features 12-47: Mel cepstral coefficients and their first-order and second-order differences;

speech features 48-49: zero-crossing rate and root mean square energy;

speech features 50-59: spectral centroid, P-order spectral bandwidth, spectral contrast, and roll-off frequency;

speech features 60-62: polynomial coefficients obtained by fitting the spectrogram with a polynomial;

speech features 63-64: chaos correlation dimension and chaos entropy;

the chaos correlation dimension D (m) is calculated by the formula:

where m represents the embedding dimension of the reconstructed phase space, r is the radius of the hypersphere of the m-dimensional phase space, Cm(r) is the associated integral of the signal in the space;

the chaos entropy is defined as:

Figure BDA0002193333700000041

wherein σ is the maximum Lyapunov exponent, p (i)1,…,iσ) Representing the probability that the signal is in a small space, τ being the time delay;
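As a non-authoritative sketch (the patent gives no implementation), the correlation integral $C_m(r)$ and a slope-based estimate of D(m) could be computed along the following lines; the embedding parameters and radius grid are assumptions:

```python
import numpy as np

def delay_embed(x, m, tau):
    # Reconstruct the m-dimensional phase space by time-delay embedding.
    n = len(x) - (m - 1) * tau
    return np.stack([x[i:i + n] for i in range(0, m * tau, tau)], axis=1)

def correlation_integral(x, m, tau, r):
    # C_m(r): fraction of point pairs closer than radius r in phase space.
    X = delay_embed(np.asarray(x, dtype=float), m, tau)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)
    return float(np.mean(dists[iu] < r))

def correlation_dimension(x, m, tau, radii):
    # Estimate D(m) as the slope of ln C_m(r) against ln r over a radius grid.
    radii = np.asarray(radii, dtype=float)
    c = np.array([correlation_integral(x, m, tau, r) for r in radii])
    ok = c > 0
    slope, _ = np.polyfit(np.log(radii[ok]), np.log(c[ok]), 1)
    return slope
```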

Speech features 65-67 are the harmonic energy feature, the fundamental-frequency perturbation (jitter) feature, and the speech amplitude perturbation (shimmer), defined as follows.

the harmonic energy characteristic formula is as follows:

Figure BDA0002193333700000042

wherein EpAnd EapRespectively are harmonic component energy and noise component energy;

the fundamental frequency disturbance characteristic formula is as follows:

Figure BDA0002193333700000043

wherein, F0iThe fundamental frequency of the ith frame of voice;

the speech amplitude perturbation formula is as follows:

Figure BDA0002193333700000044

wherein A isiThe amplitude of the i frame speech.
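A minimal sketch of these three features, assuming per-frame F0 and amplitude tracks are already available; the function names are mine, and the $10\lg$ form of equation (4) is the reconstruction assumed above:

```python
import numpy as np

def harmonic_energy(ep, eap):
    # Eq. (4): harmonic-to-noise energy ratio in dB (10*lg form assumed).
    return 10.0 * np.log10(ep / eap)

def jitter(f0):
    # Eq. (5): mean absolute difference of consecutive frame F0 values,
    # normalized by the mean F0 across frames.
    f0 = np.asarray(f0, dtype=float)
    return np.mean(np.abs(np.diff(f0))) / np.mean(f0)

def shimmer(amp):
    # Eq. (6): the same perturbation measure applied to per-frame amplitude.
    amp = np.asarray(amp, dtype=float)
    return np.mean(np.abs(np.diff(amp))) / np.mean(amp)
```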

In step S1, the frame length of each group of voice data frames is 512 samples and the frame shift is 256 samples.
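For illustration only, the spectral part of the 67-dimensional feature set could be extracted per frame with librosa roughly as follows, using the frame length of 512 and frame shift of 256 from step S1. The library choice and parameters are assumptions: librosa's chromagram has 12 bins rather than the 11 listed above, and the chaos and perturbation features (numbers 63-67) would be computed separately, e.g. with the sketches given earlier:

```python
import numpy as np
import librosa

def extract_spectral_features(y, sr, n_fft=512, hop=256):
    """Per-frame spectral features approximating numbers 1-62 of the feature list."""
    kw = dict(n_fft=n_fft, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, **kw)    # chromagram from power spectrum
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, **kw)  # Mel cepstral coefficients
    d1 = librosa.feature.delta(mfcc)                          # first-order differences
    d2 = librosa.feature.delta(mfcc, order=2)                 # second-order differences
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, **kw)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, p=2, **kw)  # P-order, P=2
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr, **kw)         # 7 bands by default
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, **kw)
    poly = librosa.feature.poly_features(y=y, sr=sr, order=2, **kw)        # quadratic fit coeffs
    feats = np.vstack([chroma, mfcc, d1, d2, zcr, rms, centroid,
                       bandwidth, contrast, rolloff, poly])
    return feats.T  # shape: (n_frames, n_features)

# y, sr = librosa.load("suspect.wav", sr=None)
# X = extract_spectral_features(y, sr)
```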

The model first adopts two RNN layers, the first being a bidirectional RNN layer; an attention layer is then attached, followed by a fully connected Dense layer and a dropout layer for mitigating overfitting; finally the result is passed to a Dense layer and fed to a softmax classifier.

The attention mechanism (Attention) imitates human visual attention: when we look at something, our attention moves as our gaze moves, which means that visual attention is distributed unevenly over the target. The attention mechanism was first applied to computer vision in neural networks; in recent years, researchers have introduced it into natural language processing and speech. To date, it has achieved great success in text summarization, sequence labeling, and speech recognition. The attention mechanism can assign different importance to local parts of the same sample, automatically learn the characteristics of a time-series signal, and improve the robustness of the model. The model output is the classification probability.

The core of the Attention-RNN network structure is a bidirectional RNN layer followed by an attention layer. As shown in FIG. 1, each input is first fed into the bi-RNN, intermediate states are generated from the respective hidden states, and the output is obtained by weighting them. The weight coefficients determine the contribution of each input state to the output state; assigning different weights to the output vectors of the bidirectional RNN layer lets the model focus attention on the key speech features and reduces the influence of irrelevant features.

The combined representation, the attention weights, and the hidden-layer outputs follow equations (1)-(3) given above.
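To make the overall structure concrete, a minimal Keras sketch of the classifier is given below, reusing the FrameAttention layer defined earlier. This is one plausible reading of the description ("two RNN layers, the first bidirectional"); the LSTM cell type, unit counts, dropout rate, context-window length, and two-class output are assumptions, not values from the patent:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_attention_rnn(time_steps, n_features=67, n_classes=2):
    # bi-RNN -> second RNN -> attention pooling (eqs. 1-3) -> Dense ->
    # dropout -> Dense + softmax, mirroring the structure of FIG. 1.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(time_steps, n_features)),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.LSTM(64, return_sequences=True),
        FrameAttention(),                     # attention layer from the sketch above
        layers.Dense(64, activation="relu"),  # fully connected Dense layer
        layers.Dropout(0.5),                  # mitigate overfitting
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_attention_rnn(time_steps=32)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```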

The accuracy of the voice tampering detection method based on multi-feature fusion and the Attention-RNN reaches 92.6%. Its characteristics are: 1) extracting frame-level features effectively mines the differences between preceding and following features in the voice signal; 2) combining multiple features makes the mined voice characteristics richer; 3) the attention mechanism assigns different importance to local parts of the same sample and automatically learns the characteristics of the time-series signal. The method can therefore effectively detect tampered speech in practical applications.

Model                               Average recognition rate
Support vector machine              81.5%
Standard recurrent neural network   83.4%
Attention-RNN network               92.6%

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
