Synthetic audio detection method, system, mobile terminal and storage medium

Document No.: 1088605 Published: 2020-10-20

Reading note: This technology, "Synthetic audio detection method, system, mobile terminal and storage medium", was created by 李稀敏, 曾志先, 叶志坚 and 肖龙源 on 2020-05-29. Abstract: The invention provides a synthetic audio detection method, system, mobile terminal and storage medium. The method comprises: training a CNN network on real audio samples to obtain a realistic feature converter; controlling the realistic feature converter to perform feature conversion on training set data to obtain realistic features, and inputting the realistic features and the corresponding label information into an LCNN (Light CNN) network for model training to obtain a synthetic audio detection model; inputting the audio to be detected into the realistic feature converter to obtain the features to be detected; and controlling the synthetic audio detection model to detect the features to be detected to obtain a detection result. By training a CNN model on the features of real audio samples, the invention obtains a realistic feature converter that converts given features into features close to those of real speech, which enhances the difference between real speech and synthesized speech in the training set data and improves the accuracy of training the synthetic audio detection model.

1. A method for synthesized audio detection, the method comprising:

acquiring a real audio sample, and training a CNN network according to the real audio sample to obtain a realistic feature converter;

controlling the realistic feature converter to perform feature conversion on training set data to obtain realistic features, and inputting the realistic features and corresponding label information into an LCNN (Light CNN) network to perform model training to obtain a synthetic audio detection model;

inputting the audio to be detected into the realistic feature converter to obtain the features to be detected, and inputting the features to be detected into the trained synthetic audio detection model;

and controlling the synthetic audio detection model to detect the features to be detected to obtain a detection result.

2. The synthetic audio detection method of claim 1 wherein the step of training a CNN network based on the real audio samples comprises:

controlling a convolutional layer in the CNN network to compress the dimension of the real audio sample by adopting a strided convolution mode, and acquiring a convolution result by adopting a ReLU activation function;

and carrying out deconvolution on the convolution result, and acquiring a deconvolution result by adopting the ReLU activation function.

3. The synthetic audio detection method according to claim 1 wherein, prior to the step of inputting the realistic features and corresponding label information into the LCNN network for model training, the method further comprises:

and cutting the realistic features to a preset file length, and zero-padding along the time axis any realistic features shorter than the preset file length until they reach the preset file length.

4. The synthetic audio detection method of claim 1 wherein the method further comprises:

and after the pooling of the max pooling layer in the LCNN network is completed, performing batch normalization on the feature data in the LCNN network.

5. The synthetic audio detection method of claim 4 wherein the step of batch normalizing the feature data in the LCNN network comprises:

calculating the mean and variance of the training data of each training batch in the LCNN network;

normalizing the training data of the corresponding batch according to the mean and the variance to obtain a 0-1 distribution (zero mean, unit variance);

and applying a scale transformation and an offset in the LCNN network according to the distribution.

6. The synthetic audio detection method of claim 3 wherein the realistic feature is an LPS feature, the static dimension of the LPS feature is 863, and the preset file length is 256 frames.

7. The synthetic audio detection method of claim 1 wherein the activation function employed by the LCNN network is an MFM activation function.

8. A synthesized audio detection system, the system comprising:

the converter training module is used for acquiring a real audio sample and training the CNN network according to the real audio sample to obtain a realistic feature converter;

the model training module is used for controlling the realistic feature converter to perform feature conversion on training set data to obtain realistic features, and inputting the realistic features and corresponding label information into the LCNN network to perform model training to obtain a synthetic audio detection model;

the feature processing module is used for inputting the audio to be detected into the realistic feature converter to obtain the features to be detected and inputting the features to be detected into the trained synthetic audio detection model;

and the audio detection module is used for controlling the synthetic audio detection model to detect the features to be detected to obtain a detection result.

9. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor running the computer program to cause the mobile terminal to perform the synthetic audio detection method according to any one of claims 1 to 7.

10. A storage medium, characterized in that it stores a computer program for use in a mobile terminal according to claim 9, which computer program, when being executed by a processor, carries out the steps of the synthetic audio detection method according to any one of claims 1 to 7.

Technical Field

The invention belongs to the technical field of audio detection, and particularly relates to a synthetic audio detection method, a synthetic audio detection system, a mobile terminal and a storage medium.

Background

Modern text-to-speech and voice conversion technologies can generate natural-sounding speech, which threatens the security of speaker recognition systems. Detecting synthetic audio that does not come from a real person is therefore a very important security issue for such systems.

Voiceprint recognition is a technology for determining a speaker's identity from the voice. It is mainly applied in banking, finance, security and similar fields, and is characterized by low cost and high efficiency.

Existing synthetic audio detection methods all require manual selection of sound wave features, and then judge whether the audio to be detected is synthetic by sound wave matching. That is, based on the manually selected sound wave features, the waveform of the audio to be detected is matched against a preset sound wave to obtain the judgment result. Because this matching relies on manually selected features, detection efficiency is low and detection accuracy is poor.

Disclosure of Invention

Embodiments of the present invention provide a synthetic audio detection method, a synthetic audio detection system, a mobile terminal, and a storage medium, and aim to solve the problems of low audio detection efficiency and poor audio detection accuracy of the existing synthetic audio detection method.

The embodiment of the invention is realized in such a way that a synthetic audio detection method comprises the following steps:

acquiring a real audio sample, and training a CNN network according to the real audio sample to obtain a realistic feature converter;

and controlling the synthetic audio detection model to detect the features to be detected to obtain a detection result.

Further, the step of training the CNN network according to the real audio samples includes:

controlling a convolutional layer in the CNN network to compress the dimensionality of the real audio sample by adopting a strided convolution mode, and acquiring a convolution result by adopting a ReLU activation function;

and carrying out deconvolution on the convolution result, and acquiring a deconvolution result by adopting the ReLU activation function.
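The two steps above (strided convolution with ReLU to compress, then deconvolution with ReLU to restore) can be illustrated with a minimal one-dimensional sketch. This is not the patented network: the kernel values, the stride of 2, the toy input and all function names are assumptions chosen only to show how the dimensions shrink and are restored, and no training is performed.

```python
def relu(xs):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in xs]


def strided_conv1d(x, kernel, stride):
    """Valid 1-D convolution (cross-correlation) applied every `stride` samples."""
    k = len(kernel)
    out = []
    for start in range(0, len(x) - k + 1, stride):
        out.append(sum(x[start + j] * kernel[j] for j in range(k)))
    return out


def transposed_conv1d(y, kernel, stride):
    """1-D transposed convolution: each input value scatters a scaled kernel."""
    k = len(kernel)
    out = [0.0] * ((len(y) - 1) * stride + k)
    for i, value in enumerate(y):
        for j in range(k):
            out[i * stride + j] += value * kernel[j]
    return out


def convert(x, enc_kernel, dec_kernel, stride=2):
    """Encoder (strided conv + ReLU) followed by decoder (deconv + ReLU)."""
    hidden = relu(strided_conv1d(x, enc_kernel, stride))
    return relu(transposed_conv1d(hidden, dec_kernel, stride))


# Hypothetical toy input and fixed (untrained) kernels, for shape checking only.
frame = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
hidden = relu(strided_conv1d(frame, [0.5, 0.5], stride=2))
restored = convert(frame, [0.5, 0.5], [1.0, 1.0], stride=2)
```

With a stride of 2 the eight input values are compressed to four, and the transposed convolution brings the output back to eight values. In a real converter the kernels would be learned from real audio samples rather than fixed.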

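Further, before the step of inputting the realistic features and the corresponding label information into the LCNN network for model training, the method further includes:

cutting the realistic features to a preset file length, and zero-padding along the time axis any realistic features shorter than the preset file length until they reach the preset file length.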

Still further, the method further comprises:

and after the pooling of the max pooling layer in the LCNN network is completed, performing batch normalization on the feature data in the LCNN network.

Further, the step of performing batch normalization on the feature data in the LCNN network includes:

calculating the mean and variance of the training data of each training batch in the LCNN network;

normalizing the training data of the corresponding batch according to the mean and the variance to obtain a 0-1 distribution (zero mean, unit variance);

and applying a scale transformation and an offset in the LCNN network according to the distribution.
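The three batch normalization steps above can be sketched for a single feature as follows. The parameter names `gamma` (scale), `beta` (offset) and `eps` (a small constant for numerical stability) follow common usage and are assumptions, not names taken from the source; in a trained network `gamma` and `beta` would be learned.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance of the batch.
    var = sum((x - mean) ** 2 for x in batch) / n
    # Step 2: zero mean, (approximately) unit variance.
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # Step 3: scale transformation and offset.
    return [gamma * x_hat + beta for x_hat in normalized]


values = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(values)
out_mean = sum(out) / len(out)
out_var = sum((x - out_mean) ** 2 for x in out) / len(out)
shifted = batch_norm(values, gamma=2.0, beta=1.0)
```

With the default `gamma=1.0` and `beta=0.0` the output has a mean of about 0 and a variance of about 1; other values move the distribution to a mean of `beta` and a standard deviation of about `gamma`.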

Further, the realistic feature is an LPS feature, the static dimension of the LPS feature is 863, and the preset file length is 256 frames.
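Fitting each feature file to the preset length of 256 frames can be sketched as below, with every frame holding 863 LPS values. The function name `fit_to_length` is hypothetical, and the sketch assumes that files longer than the preset length keep only their first 256 frames; the source does not say whether longer files are truncated or split into several segments.

```python
def fit_to_length(frames, target_len=256):
    """Cut or zero-pad a list of feature frames along the time axis."""
    if not frames:
        raise ValueError("expected at least one frame")
    dim = len(frames[0])
    # Assumption: longer inputs keep only their first `target_len` frames.
    clipped = [list(f) for f in frames[:target_len]]
    # Shorter inputs are padded with all-zero frames at the end.
    padding = [[0.0] * dim for _ in range(target_len - len(clipped))]
    return clipped + padding


short_file = [[1.0] * 863 for _ in range(100)]
long_file = [[1.0] * 863 for _ in range(300)]
fitted_short = fit_to_length(short_file)
fitted_long = fit_to_length(long_file)
```

Both results contain exactly 256 frames of 863 values, so they can be stacked into fixed-size batches for training.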

Furthermore, the activation function adopted by the LCNN network is an MFM activation function.
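For readers unfamiliar with MFM (Max-Feature-Map), the sketch below shows the form commonly used in Light CNN models: the channels are split into two halves and the element-wise maximum of the paired channels is kept, which halves the channel count. The source does not state which MFM variant is used, so this form and the function name `mfm` are assumptions.

```python
def mfm(channels):
    """Max-Feature-Map: split channels into two halves, keep the element-wise max."""
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(channels) // 2
    return [
        [max(a, b) for a, b in zip(first, second)]
        for first, second in zip(channels[:half], channels[half:])
    ]


# Four hypothetical channels with three values each.
feature_maps = [
    [1.0, -2.0, 3.0],
    [0.5, 0.5, 0.5],
    [-1.0, 4.0, 2.0],
    [0.0, 1.0, 0.0],
]
activated = mfm(feature_maps)
```

Here channel 0 is paired with channel 2 and channel 1 with channel 3, so four input channels become two output channels.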

It is another object of an embodiment of the present invention to provide a synthesized audio detection system, including:

the converter training module is used for acquiring a real audio sample and training the CNN network according to the real audio sample to obtain a realistic feature converter;

and the audio detection module is used for controlling the synthetic audio detection model to detect the features to be detected to obtain a detection result.

Another object of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above-mentioned synthesized audio detection method.

It is another object of the embodiments of the present invention to provide a storage medium storing a computer program used in the above-mentioned mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the above-mentioned synthesized audio detection method.

In the embodiment of the invention, a CNN model is trained on the features of real audio samples to obtain the realistic feature converter. The realistic feature converter can convert given features into features close to those of real speech, which enhances the difference between real speech and synthesized speech in the training set data, improves the training accuracy of the synthetic audio detection model, and improves the accuracy of subsequent synthetic audio detection.

Drawings

FIG. 1 is a flow chart of a synthesized audio detection method according to a first embodiment of the present invention;

FIG. 2 is a flow chart of a synthesized audio detection method according to a second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a synthesized audio detection system according to a third embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a mobile terminal according to a fourth embodiment of the present invention.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".

Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.

Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
