Method, apparatus, device and medium for recognizing synthesized speech

Document No.: 989497 · Publication date: 2020-11-06

Reading note: This technique, "Method, apparatus, device and medium for recognizing synthesized speech", was created by Yin Xiang on 2020-07-30. Main content: Embodiments of the present disclosure disclose methods, apparatuses, devices and media for recognizing synthesized speech. One embodiment of the method comprises: acquiring speech to be recognized; recognizing the speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to synthesized speech; and post-processing the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech. This embodiment balances the accuracy and efficiency of recognition, thereby improving the effect of recognizing synthesized speech.

1. A method for recognizing synthesized speech, comprising:

acquiring speech to be recognized;

recognizing the speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to synthesized speech; and

post-processing the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech.

2. The method according to claim 1, wherein the recognizing the speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to synthesized speech comprises:

inputting the speech to be recognized into a pre-trained synthesized speech recognition model to obtain pre-indication information indicating whether the speech to be recognized belongs to synthesized speech, wherein the synthesized speech recognition model characterizes the correspondence between the pre-indication information and the speech to be recognized.

3. The method according to claim 1, wherein the post-processing the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech comprises:

in response to determining that the generated pre-indication information indicates that the speech to be recognized belongs to synthesized speech, post-processing the speech to be recognized to generate indication information indicating whether the speech to be recognized belongs to synthesized speech.

4. The method according to claim 3, wherein the post-processing the speech to be recognized to generate indication information indicating whether the speech to be recognized belongs to synthesized speech comprises:

extracting a target number of phoneme-matched speech slices from the speech to be recognized;

determining similarity between the extracted target number of speech slices; and

generating indication information indicating that the speech to be recognized belongs to synthesized speech in response to determining that the obtained similarity satisfies a preset condition.

5. The method of claim 2, wherein the synthetic speech recognition model is trained by:

acquiring a training sample set, wherein training samples in the training sample set comprise sample speech and sample labeling information, the sample speech comprises real speech and synthesized speech corresponding to the real speech, and the sample labeling information indicates whether the speech belongs to synthesized speech; and

taking the sample speech of the training samples in the training sample set as input, taking the sample labeling information corresponding to the input sample speech as expected output, and training to obtain the synthesized speech recognition model.

6. The method of claim 5, wherein the synthesized speech recognition model comprises a densely connected convolutional network based on a bidirectional gated recurrent unit (BGRU).

7. The method according to one of claims 1-6, wherein the method further comprises:

performing speech recognition on the speech to be recognized to generate a recognition text, in response to determining that the generated indication information indicates that the speech to be recognized does not belong to synthesized speech; and

determining whether to execute an unlocking operation according to whether the recognition text matches preset verification information.

8. An apparatus for recognizing synthesized speech, comprising:

an acquisition unit configured to acquire speech to be recognized;

a pre-recognition unit configured to recognize the speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to synthesized speech; and

a post-processing unit configured to post-process the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech.

9. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

10. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.

Technical Field

Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method, apparatus, device, and medium for recognizing synthesized speech.

Background

Speech synthesis, also known as Text-To-Speech (TTS), is a technology that uses computers and dedicated devices to produce artificial speech imitating a human voice.

As synthesized speech becomes ever closer to real speech, identifying which sounds are real and which are forged has become an important research problem, one that concerns the security of technologies such as voiceprint recognition and voice unlocking.

Disclosure of Invention

The present disclosure presents methods and apparatus for recognizing synthesized speech.

In a first aspect, an embodiment of the present disclosure provides a method for recognizing synthesized speech, the method including: acquiring speech to be recognized; recognizing the speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to synthesized speech; and post-processing the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech.

In some embodiments, the recognizing the speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to synthesized speech includes: inputting the speech to be recognized into a pre-trained synthesized speech recognition model to obtain pre-indication information indicating whether the speech to be recognized belongs to synthesized speech, wherein the synthesized speech recognition model characterizes the correspondence between the pre-indication information and the speech to be recognized.

In some embodiments, the post-processing the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech includes: in response to determining that the generated pre-indication information indicates that the speech to be recognized belongs to synthesized speech, post-processing the speech to be recognized to generate indication information indicating whether the speech to be recognized belongs to synthesized speech.

In some embodiments, the post-processing the speech to be recognized to generate indication information indicating whether the speech to be recognized belongs to synthesized speech includes: extracting a target number of phoneme-matched speech slices from the speech to be recognized; determining similarity between the extracted target number of speech slices; and generating indication information indicating that the speech to be recognized belongs to synthesized speech in response to determining that the obtained similarity satisfies a preset condition.

In some embodiments, the synthesized speech recognition model is trained by: acquiring a training sample set, wherein training samples in the training sample set include sample speech and sample labeling information, the sample speech includes real speech and synthesized speech corresponding to the real speech, and the sample labeling information indicates whether the speech belongs to synthesized speech; and taking the sample speech of the training samples in the training sample set as input, taking the sample labeling information corresponding to the input sample speech as expected output, and training to obtain the synthesized speech recognition model.

In some embodiments, the synthesized speech recognition model includes a densely connected convolutional network (DenseNet) based on a bidirectional gated recurrent unit (BGRU).

In some embodiments, the method further includes: performing speech recognition on the speech to be recognized to generate a recognition text, in response to determining that the generated indication information indicates that the speech to be recognized does not belong to synthesized speech; and determining whether to execute an unlocking operation according to whether the recognition text matches preset verification information.

In a second aspect, an embodiment of the present disclosure provides an apparatus for recognizing synthesized speech, the apparatus including: an acquisition unit configured to acquire speech to be recognized; a pre-recognition unit configured to recognize the speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to synthesized speech; and a post-processing unit configured to post-process the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech.

In a third aspect, an embodiment of the present disclosure provides an electronic device for recognizing synthesized speech, including: one or more processors; and a storage device having one or more programs stored thereon, which, when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments of the method for recognizing synthesized speech described above.

In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium for recognizing synthesized speech, on which a computer program is stored, which, when executed by a processor, implements the method of any of the embodiments of the method for recognizing synthesized speech described above.

In the method and apparatus for recognizing synthesized speech provided by the embodiments of the present disclosure, pre-indication information is generated by first recognizing whether the speech to be recognized belongs to synthesized speech, and post-processing is then performed according to the different cases of the generated pre-indication information to generate the indication information. The two-stage recognition thus balances recognition accuracy and efficiency, improving the overall effect of recognizing synthesized speech.

Drawings

Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram of one embodiment of a method for recognizing synthesized speech according to the present disclosure;

FIG. 3 is a schematic illustration of one application scenario of a method for recognizing synthesized speech according to the present disclosure;

FIG. 4 is a flow diagram of yet another embodiment of a method for recognizing synthesized speech according to the present disclosure;

FIG. 5 is a schematic diagram illustrating one embodiment of an apparatus for recognizing synthesized speech according to the present disclosure;

FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein merely illustrate, and do not limit, the embodiments of the disclosure. It should further be noted that, for convenience of description, only the portions related to the embodiments of the present disclosure are shown in the drawings.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the method for recognizing synthesized speech or the apparatus for recognizing synthesized speech of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or transmit data or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as voice recognition software, video playing software, news information applications, image processing applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.

The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen and supporting speech recognition, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and the like. When they are software, they may be installed in the electronic devices listed above and implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module; no specific limitation is imposed here.

The server 105 may be a server providing various services, such as a background server supporting the speech recognition software run by the terminal devices 101, 102, 103. The background server may analyze (and otherwise process) the received speech to be recognized and feed back a processing result (for example, indication information indicating whether the speech to be recognized belongs to synthesized speech) to the terminal device. Optionally, the server 105 may include a cloud server.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module; no specific limitation is imposed here.

It should further be noted that the method for recognizing synthesized speech provided by the embodiments of the present disclosure may be executed by a server, by a terminal device, or by the server and the terminal device in cooperation with each other. Accordingly, the parts (e.g., units, sub-units, modules, and sub-modules) included in the apparatus for recognizing synthesized speech may all be disposed in the server, may all be disposed in the terminal device, or may be distributed between the server and the terminal device.

It should be understood that the numbers of terminal devices, networks, and servers in fig. 1 are merely illustrative; there may be any number of each, as the implementation requires. When the electronic device on which the method for recognizing synthesized speech runs does not need to transfer data to other electronic devices, the system architecture may include only that electronic device (e.g., a server or terminal device).

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for recognizing synthesized speech according to the present disclosure is shown. The method for recognizing synthesized speech includes the steps of:

Step 201: acquiring the speech to be recognized.

In this embodiment, the execution subject of the method for recognizing synthesized speech may acquire the speech to be recognized locally, or from another electronic device or software module, through a wired or wireless connection. As an example, when the execution subject is a terminal device (e.g., the terminal devices 101, 102, 103 shown in fig. 1), it may acquire the speech to be recognized locally or from an audio input device (e.g., a microphone). As yet another example, when the execution subject is a server (e.g., the server 105 shown in fig. 1), it may acquire the speech to be recognized locally, from another electronic device (e.g., the terminal devices 101, 102, 103 shown in fig. 1), or from a software module (e.g., a software module for acquiring the speech to be recognized).

Step 202, recognizing the speech to be recognized to generate pre-indication information for indicating whether the speech to be recognized belongs to the synthesized speech.

In this embodiment, the execution subject may recognize the speech to be recognized acquired in step 201 in various ways to generate pre-indication information indicating whether the speech to be recognized belongs to synthesized speech. As an example, the execution subject may perform audio analysis on previously acquired real speech data and synthesized speech, and set a threshold according to the statistical results of that analysis. The audio analysis may include generating distributions of pitch, spectral formants, zero-crossing rate, and the like. The execution subject may then perform the same audio analysis on the speech to be recognized acquired in step 201 to obtain an analysis result, and compare the obtained analysis result with the set threshold to generate the pre-indication information indicating whether the speech to be recognized belongs to synthesized speech.
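The threshold-based analysis described above can be sketched as follows. This is a minimal illustration only: the patent does not specify which feature or threshold is used, so the zero-crossing-rate feature, the threshold value, and the function names here are assumptions.

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)

def pre_indication(samples, zcr_threshold=0.25):
    """Compare the analysis result with a threshold set in advance from
    statistics over real and synthesized speech; True flags the clip as
    possibly synthesized (threshold value is illustrative only)."""
    return zero_crossing_rate(samples) > zcr_threshold
```

In practice the same analysis would be run over pitch and formant distributions as well, with one threshold per feature derived from the offline statistics.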

In some optional implementations of this embodiment, the execution subject may input the speech to be recognized acquired in step 201 into a pre-trained synthesized speech recognition model to obtain pre-indication information indicating whether the speech to be recognized belongs to synthesized speech. The synthesized speech recognition model may include various models obtained through machine-learning training that characterize the correspondence between the pre-indication information and the speech to be recognized.

Based on this optional implementation, the execution subject can use a pre-trained synthesized speech recognition model to distinguish whether the speech to be recognized belongs to synthesized speech, improving the accuracy of the discrimination result.

Alternatively, the synthesized speech recognition model may be obtained by the following training steps:

In the first step, a training sample set is acquired.

In these implementations, the execution subject for training the synthesized speech recognition model may acquire the training sample set in various ways. The training samples in the training sample set may include sample speech and sample labeling information. The sample speech may include real speech and synthesized speech corresponding to the real speech, and the sample labeling information may indicate whether the speech belongs to synthesized speech. As an example, the synthesized speech corresponding to the real speech may include speech obtained by synthesis methods such as concatenative, parametric, and end-to-end approaches.

In the second step, the sample speech of the training samples in the training sample set is used as input, the sample labeling information corresponding to the input sample speech is used as expected output, and the synthesized speech recognition model is obtained by training.

In these implementations, the execution subject may use the sample speech of a training sample in the training sample set obtained in the first step as the input of an initial model to obtain an output result corresponding to the input sample speech. The execution subject may then compare the obtained output result with the sample labeling information corresponding to the input sample speech to generate a difference value. According to the obtained difference value, the execution subject may adjust the parameters of the initial model and continue training with the adjusted model as the new initial model. When a training end condition is satisfied, training ends, and the execution subject may determine the trained initial model as the synthesized speech recognition model. The initial model may include various neural network models that can be used to distinguish speech features, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
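The compare-adjust-repeat scheme above can be sketched with a toy stand-in for the initial model. This is purely illustrative: the disclosure's model is a neural network trained with gradient-based updates, whereas the sketch below uses a single-feature perceptron, and all names and values are assumptions.

```python
def train_model(sample_features, sample_labels, lr=0.1, epochs=50):
    """Toy training loop: compare the model output with the labeling
    information, form a difference value, adjust the parameters by it,
    and repeat until the end condition (here a fixed epoch count) holds."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(sample_features, sample_labels):
            output = 1.0 if w * x + b > 0 else 0.0
            diff = y - output      # difference between output and label
            w += lr * diff * x     # parameter adjustment from the difference
            b += lr * diff
    return w, b

def predict(w, b, x):
    """Apply the trained stand-in model to one feature value."""
    return 1.0 if w * x + b > 0 else 0.0
```

The real training would replace the perceptron update with backpropagation over a CNN/RNN, but the control flow (forward pass, difference, parameter adjustment, end condition) is the same.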

Alternatively, the synthesized speech recognition model may include a densely connected convolutional network based on bidirectional gated recurrent units. As an example, the synthesized speech recognition model may be composed of a 3-layer convolutional network, a bidirectional gated recurrent unit and a densely connected network, and an activation function layer, where the activation function may include softmax.

Based on this optional implementation, the densely connected convolutional network based on bidirectional gated recurrent units can better learn long-range prosodic variation of speech during training and is more sensitive to the differences between synthesized and real speech in long-term prosody, so the discrimination accuracy can be improved.

Step 203: post-processing the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech.

In this embodiment, the execution subject may post-process the speech to be recognized based on the pre-indication information generated in step 202 to generate indication information indicating whether the speech to be recognized belongs to synthesized speech. The post-processing can be determined flexibly according to the actual application. As an example, the execution subject may first determine a confidence level for the pre-indication information. The confidence level may be determined, for example, from the degree of difference between the analysis result obtained in step 202 and the set threshold, or may be obtained from the output of a pre-trained classification model (e.g., a softmax layer). Then, in response to determining that the confidence level is smaller than a preset confidence threshold, the execution subject may post-process the speech to be recognized to generate a post-processing result. As an example, the post-processing may include sending the speech to be recognized to a manual review terminal; as yet another example, it may include re-executing the step of generating the pre-indication information. The execution subject may then adjust the pre-indication information according to the generated post-processing result. For example, in response to determining that the post-processing result is consistent with the pre-indication information, the execution subject may determine the pre-indication information as the indication information indicating whether the speech to be recognized belongs to synthesized speech. As another example, in response to determining that the post-processing result is inconsistent with the pre-indication information, the execution subject may adjust the pre-indication information to be consistent with the post-processing result, thereby generating the indication information indicating whether the speech to be recognized belongs to synthesized speech.
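The confidence-gated dispatch described above might be sketched as follows; the function names and the 0.9 threshold are hypothetical, and the post-processing step is abstracted as a pluggable callable (it could be a manual-review hook or a second recognition pass, as the text suggests).

```python
def finalize_indication(pre_indication, confidence, post_process,
                        confidence_threshold=0.9):
    """If the model is confident enough, keep the pre-indication as the
    final indication; otherwise run a post-processing step and follow
    its verdict (equivalently: keep the pre-indication when the
    post-processing result agrees, adjust it when it does not)."""
    if confidence >= confidence_threshold:
        return pre_indication
    return post_process()
```

This is where the two-stage design saves work: high-confidence clips skip the expensive post-processing entirely, while borderline clips get a second look.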

In some optional implementations of this embodiment, in response to determining that the generated pre-indication information indicates that the speech to be recognized belongs to synthesized speech, the execution subject may post-process the speech to be recognized in various ways to generate indication information indicating whether the speech to be recognized belongs to synthesized speech. For example, after determining that the generated pre-indication information indicates that the speech to be recognized belongs to synthesized speech, the execution subject may post-process the speech to be recognized, such as sending it to a target user side for manual review or recognizing it again.

Based on this optional implementation, using pre-indication information indicating that the speech to be recognized belongs to synthesized speech as the condition for post-processing improves the accuracy of judging speech as synthesized.

Optionally, the execution subject may post-process the speech to be recognized to generate indication information indicating whether the speech to be recognized belongs to synthesized speech by the following steps:

In the first step, a target number of phoneme-matched speech slices are extracted from the speech to be recognized.

In these implementations, the execution subject may extract, in various ways, a target number of speech slices with matching phonemes from the speech to be recognized acquired in step 201. The speech slices may include slices of phonemes with similar contexts in different time periods — for example, a slice of the pronunciation of "shang" in "this morning" (早上) around the 1st second and a slice of "shang" in "going to work tomorrow" (上班) around the 6th second, the two words sharing the syllable 上 in the original Chinese example. The target number may be any number specified in advance, or a number determined according to a preset rule.
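The slice-extraction step above can be sketched as follows. The disclosure does not specify how phoneme matches are found; this sketch assumes a phoneme alignment (as produced by a forced aligner) is available, and the data format and function name are assumptions.

```python
def extract_matching_slices(alignment, target_number):
    """Group a phoneme alignment [(phoneme, start_s, end_s), ...] by
    phoneme label and return the first label occurring at least
    target_number times, together with that many time spans."""
    spans = {}
    for phoneme, start, end in alignment:
        spans.setdefault(phoneme, []).append((start, end))
    for phoneme, times in spans.items():
        if len(times) >= target_number:
            return phoneme, times[:target_number]
    return None  # no phoneme repeats often enough in this utterance
```

The returned time spans would then be cut out of the waveform and featurized for the similarity comparison in the second step.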

In the second step, the similarity between the extracted target number of speech slices is determined.

In these implementations, the execution subject may determine the similarity between the target number of speech slices extracted in the first step. The similarity may include similarity between feature vectors characterizing various speech features.

In the third step, indication information indicating that the speech to be recognized belongs to synthesized speech is generated in response to determining that the obtained similarity satisfies a preset condition.

In these implementations, in response to determining that the similarity obtained in the second step satisfies a preset condition, the execution subject may generate, in various ways, indication information indicating that the speech to be recognized belongs to synthesized speech. As an example, the preset condition may include the similarity being greater than a preset similarity threshold, e.g., 50%.
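The second and third steps can be sketched together as follows. Cosine similarity is one plausible choice of measure; the disclosure does not fix the similarity function, so that choice and the function names are assumptions (the 50% threshold is the example from the text).

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

def indicates_synthesized(slice_features, similarity_threshold=0.5):
    """Generate a 'synthesized' indication when every pair of slice
    feature vectors exceeds the similarity threshold, i.e. the repeated
    phoneme sounds suspiciously identical across the utterance."""
    n = len(slice_features)
    return all(cosine_similarity(slice_features[i], slice_features[j])
               > similarity_threshold
               for i in range(n) for j in range(i + 1, n))
```

Any pairwise measure over the slice features (DTW distance, embedding distance) could stand in for cosine similarity without changing the decision structure.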

Based on this optional implementation, the fact that real speech varies with factors such as the speaker's fatigue and mood can be exploited to adjust the generated pre-indication information, so that indication information indicating that the speech to be recognized belongs to synthesized speech is generated when the preset condition is satisfied. This improves the overall accuracy of recognizing synthesized speech.

With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for recognizing synthesized speech according to the present embodiment. In the application scenario of fig. 3, a user 301 says "what's the weather like today" 302. The terminal 303 acquires "what's the weather like today" 302 spoken by the user 301 as the speech to be recognized. The terminal 303 then recognizes the speech to be recognized 302 and generates pre-indication information 304 indicating that the speech to be recognized does not belong to synthesized speech. Then, according to the pre-indication information 304 indicating that the speech to be recognized does not belong to synthesized speech, the terminal 303 may post-process the speech to be recognized 302 to generate indication information 305 indicating that the speech to be recognized does not belong to synthesized speech.

Optionally, the terminal device 303 may also respond after determining that the speech to be recognized 302 does not belong to synthesized speech, for example, by invoking a weather-forecast app to query the current weather (e.g., "cloudy and sunny" 306) and broadcasting it by voice.

In the method provided by the embodiment of the present disclosure, pre-indication information is generated by first recognizing whether the speech to be recognized belongs to synthesized speech, and post-processing is then performed according to the different cases of the generated pre-indication information to generate the indication information. The two-stage recognition thus balances recognition accuracy and efficiency, improving the overall effect of recognizing synthesized speech.

With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for recognizing synthesized speech is illustrated. The process 400 of the method for recognizing synthesized speech includes the steps of:

Step 401: acquiring the speech to be recognized.

In this embodiment, step 401 is substantially the same as step 201 in the corresponding embodiment of fig. 2, and is not described here again.

Step 402, recognizing the speech to be recognized to generate pre-indication information for indicating whether the speech to be recognized belongs to the synthesized speech.

In this embodiment, step 402 is substantially the same as step 202 in the corresponding embodiment of fig. 2, and is not described herein again.

Step 403: post-processing the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech.

In this embodiment, step 403 is substantially the same as step 203 in the corresponding embodiment of fig. 2, and is not described herein again.

Step 404, in response to determining that the generated indication information indicates that the speech to be recognized does not belong to synthesized speech, performing speech recognition on the speech to be recognized to generate a recognition text.

In this embodiment, in response to determining that the indication information generated in step 403 indicates that the speech to be recognized does not belong to synthesized speech, the executing body may perform speech recognition on the speech to be recognized by using various Automatic Speech Recognition (ASR) technologies to generate a recognition text.

Step 405, determining whether to perform an unlocking operation according to whether the recognition text matches preset verification information.

In this embodiment, the executing body may determine whether to perform the unlocking operation according to whether the recognition text generated in step 404 matches preset verification information. The verification information may include a preset password or a dynamic verification code (e.g., a numeric string presented before each unlocking operation). The unlocking operation may include, for example, unlocking the screen, or an operation directly responding to the recognition text (e.g., turning on an alarm). As an example, the executing body may determine to perform the unlocking operation in response to determining that the recognition text generated in step 404 matches the preset verification information. As yet another example, the executing body may determine not to perform the unlocking operation in response to determining that the recognition text generated in step 404 does not match the preset verification information.
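Steps 404 and 405 can be sketched as the following decision function. The trivial "ASR" and synthesis-detector stand-ins are hypothetical placeholders for illustration; a real implementation would call an actual ASR engine and the two-stage recognizer described earlier.

```python
def decide_unlock(speech, is_synthesized, asr, verification_code):
    """Steps 404-405: only speech determined not to be synthesized is
    transcribed, and the transcript must match the verification information
    before the unlocking operation is performed."""
    if is_synthesized(speech):
        return False                          # synthesized speech never unlocks
    transcript = asr(speech)                  # step 404: speech -> recognition text
    return transcript == verification_code    # step 405: match decides unlocking


# Hypothetical stand-ins: a trivial "ASR" and a detector that trusts everything.
fake_asr = lambda s: s.strip()
never_synth = lambda s: False
```

Gating the ASR call behind the synthesis check is also what saves computation: speech flagged as synthesized is rejected before any transcription work is done.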

It should be noted that, in addition to the above, the present embodiment may also include the same or similar features and effects as the embodiment corresponding to fig. 2, which are not described herein again.

As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for recognizing synthesized speech in the present embodiment highlights the steps of performing speech recognition on speech determined not to belong to synthesized speech to generate a recognition text, and determining whether to perform an unlocking operation according to whether the recognition text matches preset verification information. The scheme described in this embodiment can therefore be applied to the field of voice unlocking; by avoiding responses to synthesized speech, it improves security and reduces wasted computing resources and unnecessary energy consumption.

With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for recognizing synthesized speech. The apparatus embodiment corresponds to the method embodiment shown in fig. 2 and, in addition to the features described below, may include the same or corresponding features as that method embodiment and produce the same or corresponding effects. The apparatus can be applied to various electronic devices.

As shown in fig. 5, the apparatus 500 for recognizing synthesized speech of the present embodiment includes: an acquisition unit 501, a pre-recognition unit 502, and a post-processing unit 503. The acquisition unit 501 is configured to acquire speech to be recognized; the pre-recognition unit 502 is configured to recognize the speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to synthesized speech; and the post-processing unit 503 is configured to perform post-processing on the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech.

In the present embodiment, in the apparatus 500 for recognizing synthesized speech: the specific processing of the obtaining unit 501, the pre-identifying unit 502, and the post-processing unit 503 and the technical effects thereof can refer to the related descriptions of step 201, step 202, and step 203 in the embodiment corresponding to fig. 2, which are not described herein again.

In some optional implementations of the present embodiment, the pre-recognition unit 502 may be further configured to: input the speech to be recognized into a pre-trained synthesized speech recognition model to obtain pre-indication information for indicating whether the speech to be recognized belongs to synthesized speech. The synthesized speech recognition model can be used to represent the correspondence between the pre-indication information and the speech to be recognized.

In some optional implementations of this embodiment, the post-processing unit 503 may be further configured to: in response to determining that the generated pre-indication information indicates that the speech to be recognized belongs to synthesized speech, perform post-processing on the speech to be recognized to generate indication information for indicating whether the speech to be recognized belongs to synthesized speech.

In some optional implementations of this embodiment, the post-processing unit 503 may include: an extraction module (not shown), a determination module (not shown), and a generation module (not shown). The extraction module may be configured to extract a target number of phoneme-matched speech slices from the speech to be recognized. The determining module may be configured to determine a similarity between the extracted target number of voice slices. The generating module may be configured to generate indication information indicating that the speech to be recognized belongs to the synthesized speech in response to determining that the obtained similarity satisfies a preset condition.
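The post-processing described above — extracting phoneme-matched slices and checking their mutual similarity against a preset condition — can be illustrated as follows. The feature-vector representation of each slice, the cosine-similarity measure, and the threshold value are assumptions made for illustration, not choices prescribed by the disclosure.

```python
import numpy as np

def post_process(slices, threshold=0.99):
    """Post-processing sketch: if the phoneme-matched slices extracted from
    the speech are nearly identical to one another, flag the speech as
    synthesized (a synthesizer tends to reproduce the same phoneme almost
    identically each time, whereas natural speech varies)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Pairwise similarity over the target number of extracted slices.
    sims = [cosine(slices[i], slices[j])
            for i in range(len(slices)) for j in range(i + 1, len(slices))]
    # Preset condition: mean similarity above the threshold => synthesized.
    return sum(sims) / len(sims) >= threshold
```

Here each slice is assumed to already be a fixed-length feature vector (e.g., averaged spectral features); extracting such vectors from raw audio is a separate step not shown.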

In some optional implementations of this embodiment, the synthesized speech recognition model may be obtained by training through the following steps: acquiring a training sample set; and taking the sample voice of the training sample in the training sample set as input, taking the sample marking information corresponding to the input sample voice as expected output, and training to obtain a synthetic voice recognition model. Wherein the training samples in the training sample set may include sample speech and sample labeling information. The sample speech may include real speech and synthesized speech corresponding to the real speech. The sample labeling information can be used to indicate whether the speech belongs to synthesized speech.
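The training procedure described above — sample speech as input, sample labeling information as expected output — can be sketched with a toy example. The 2-D feature vectors and the logistic-regression classifier below merely stand in for real speech features and the disclosure's synthesized speech recognition model; they are illustrative assumptions, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training sample set: feature vectors paired with sample labeling
# information (0 = real speech, 1 = synthesized speech corresponding to it).
real = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
synth = rng.normal(loc=3.0, scale=1.0, size=(50, 2))
X = np.vstack([real, synth])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Train: sample speech (features) as input, labels as expected output.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
    grad_w = X.T @ (p - y) / len(y)          # gradient of the log loss
    grad_b = float(np.mean(p - y))
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

accuracy = float(np.mean(((1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5) == y))
```

Pairing each real utterance with a synthesized version of the same content, as the disclosure suggests, keeps the two classes matched in content so the model learns synthesis artifacts rather than content differences.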

In some optional implementations of this embodiment, the synthesized speech recognition model may include a densely-connected convolutional network based on bidirectional gated recurrent units (GRUs).

In some optional implementations of this embodiment, the apparatus 500 for recognizing synthesized speech may further include: a recognition unit (not shown in the figure) and a determination unit (not shown in the figure). The recognition unit may be configured to perform speech recognition on the speech to be recognized to generate a recognition text, in response to determining that the generated indication information indicates that the speech to be recognized does not belong to synthesized speech. The determination unit may be configured to determine whether to perform an unlocking operation according to whether the recognition text matches preset verification information.

In the apparatus provided by the above embodiment of the present disclosure, the acquisition unit 501 acquires the speech to be recognized; the pre-recognition unit 502 then recognizes the speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to synthesized speech; finally, the post-processing unit 503 performs post-processing on the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech. The apparatus thus balances recognition accuracy against efficiency, improving the effect of recognizing synthesized speech.

Referring now to fig. 6, a schematic diagram of an electronic device (e.g., the server or terminal device of fig. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In accordance with one or more embodiments of the present disclosure, there is provided a method for recognizing synthesized speech, the method including: acquiring a voice to be recognized; recognizing the voice to be recognized to generate pre-indication information for indicating whether the voice to be recognized belongs to synthesized voice; and post-processing the voice to be recognized based on the pre-indication information to generate indication information for indicating whether the voice to be recognized belongs to the synthesized voice.

According to one or more embodiments of the present disclosure, in a method for recognizing a synthesized speech provided by the present disclosure, the recognizing a speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to the synthesized speech includes: inputting the voice to be recognized into a pre-trained synthesized voice recognition model to obtain pre-indication information for indicating whether the voice to be recognized belongs to the synthesized voice, wherein the synthesized voice recognition model is used for representing the corresponding relation between the pre-indication information and the voice to be recognized.

According to one or more embodiments of the present disclosure, in the method for recognizing synthesized speech provided by the present disclosure, the post-processing the speech to be recognized based on the pre-indication information to generate indication information indicating whether the speech to be recognized belongs to synthesized speech includes: in response to determining that the generated pre-indication information indicates that the speech to be recognized belongs to synthesized speech, performing post-processing on the speech to be recognized to generate indication information for indicating whether the speech to be recognized belongs to synthesized speech.

According to one or more embodiments of the present disclosure, in a method for recognizing a synthesized voice, the post-processing a voice to be recognized to generate indication information indicating whether the voice to be recognized belongs to the synthesized voice includes: extracting target number of voice slices matched with phonemes from the voice to be recognized; determining similarity between the extracted target number of voice slices; and generating indication information for indicating that the voice to be recognized belongs to the synthesized voice in response to the fact that the obtained similarity meets the preset condition.

According to one or more embodiments of the present disclosure, the present disclosure provides a method for recognizing synthesized speech, in which the synthesized speech recognition model is trained by the following steps: acquiring a training sample set, wherein training samples in the training sample set comprise sample voice and sample marking information, the sample voice comprises real voice and synthesized voice corresponding to the real voice, and the sample marking information is used for indicating whether the voice belongs to the synthesized voice; and taking the sample voice of the training sample in the training sample set as input, taking the sample marking information corresponding to the input sample voice as expected output, and training to obtain a synthetic voice recognition model.

According to one or more embodiments of the present disclosure, the present disclosure provides a method for recognizing synthesized speech, wherein the synthesized speech recognition model includes a densely-connected convolutional network based on bidirectional gated recurrent units.

In accordance with one or more embodiments of the present disclosure, the method for recognizing synthesized speech provided by the present disclosure further includes: performing voice recognition on the voice to be recognized to generate a recognition text in response to the fact that the generated indication information is used for indicating that the voice to be recognized does not belong to the synthesized voice; and determining whether to execute unlocking operation or not according to the matching of the identification text and the preset verification information.

In accordance with one or more embodiments of the present disclosure, there is provided an apparatus for recognizing synthesized speech, the apparatus including: an acquisition unit configured to acquire a voice to be recognized; a pre-recognition unit configured to recognize a speech to be recognized to generate pre-indication information indicating whether the speech to be recognized belongs to a synthesized speech; and the post-processing unit is configured to perform post-processing on the voice to be recognized based on the pre-indication information so as to generate indication information for indicating whether the voice to be recognized belongs to the synthesized voice.

In accordance with one or more embodiments of the present disclosure, in an apparatus for recognizing synthesized speech provided by the present disclosure, the pre-recognition unit is further configured to: inputting the voice to be recognized into a pre-trained synthesized voice recognition model to obtain pre-indication information for indicating whether the voice to be recognized belongs to the synthesized voice, wherein the synthesized voice recognition model is used for representing the corresponding relation between the pre-indication information and the voice to be recognized.

In accordance with one or more embodiments of the present disclosure, in an apparatus for recognizing synthesized speech provided by the present disclosure, the post-processing unit is further configured to: and in response to determining that the generated pre-indication information is used for indicating that the voice to be recognized belongs to the synthesized voice, performing post-processing on the voice to be recognized to generate indication information used for indicating whether the voice to be recognized belongs to the synthesized voice.

According to one or more embodiments of the present disclosure, in an apparatus for recognizing synthesized speech provided by the present disclosure, the post-processing unit includes: an extraction module configured to extract a target number of phoneme-matched speech slices from the speech to be recognized; a determination module configured to determine a similarity between the extracted target number of voice slices; and the generating module is configured to generate indication information for indicating that the speech to be recognized belongs to the synthesized speech in response to the fact that the obtained similarity meets the preset condition.

According to one or more embodiments of the present disclosure, in an apparatus for recognizing synthesized speech provided by the present disclosure, the synthesized speech recognition model is trained by the following steps: acquiring a training sample set, wherein training samples in the training sample set comprise sample voice and sample marking information, the sample voice comprises real voice and synthesized voice corresponding to the real voice, and the sample marking information is used for indicating whether the voice belongs to the synthesized voice; and taking the sample voice of the training sample in the training sample set as input, taking the sample marking information corresponding to the input sample voice as expected output, and training to obtain a synthetic voice recognition model.

According to one or more embodiments of the present disclosure, the present disclosure provides an apparatus for recognizing synthesized speech, wherein the synthesized speech recognition model includes a densely-connected convolutional network based on bidirectional gated recurrent units.

According to one or more embodiments of the present disclosure, in an apparatus for recognizing synthesized speech provided by the present disclosure, the apparatus further includes: a recognition unit configured to perform speech recognition on the speech to be recognized to generate a recognition text in response to a determination that the generated indication information indicates that the speech to be recognized does not belong to the synthesized speech; a determination unit configured to determine whether to perform an unlocking operation according to a matching of the recognition text with preset authentication information.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor including an acquisition unit, a pre-recognition unit, and a post-processing unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the acquisition unit may also be described as "a unit that acquires speech to be recognized".

As another aspect, embodiments of the present disclosure also provide a computer-readable medium, which may be included in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a voice to be recognized; recognizing the voice to be recognized to generate pre-indication information for indicating whether the voice to be recognized belongs to synthesized voice; and post-processing the voice to be recognized based on the pre-indication information to generate indication information for indicating whether the voice to be recognized belongs to the synthesized voice.

The foregoing description is merely a description of preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the embodiments of the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure.
