Speech synthesis method, electronic device, and storage medium

Document No.: 139037 | Publication date: 2021-10-22

Reading note: this technology, Speech synthesis method, electronic device, and storage medium, was designed and created by 于鹏伟 on 2021-07-15. Its main content is as follows: the invention discloses a speech synthesis method applied to a server, the method comprising: acquiring a text to be synthesized; acquiring characteristic parameters of a target speaker; and inputting the text to be synthesized and the characteristic parameters of the target speaker into a universal speech synthesis model to obtain synthesized audio with the timbre of the target speaker. The speech synthesis method is executed on the server; when synthesizing audio with the target speaker's timbre, no speech synthesis model dedicated to the target speaker needs to be configured, and only the text to be synthesized and the target speaker's characteristic parameters need to be input into the universal speech synthesis model. The method can therefore synthesize audio corresponding to the text to be synthesized based on the universal speech synthesis model, provided only that the target speaker's characteristic parameters are prepared in advance.

1. A speech synthesis method, applied to a server, the method comprising:

acquiring a text to be synthesized;

acquiring characteristic parameters of a target speaker;

and inputting the text to be synthesized and the characteristic parameters of the target speaker into a universal speech synthesis model to obtain synthesized audio with the timbre of the target speaker.

2. The method of claim 1, wherein acquiring the characteristic parameters of the target speaker comprises: acquiring the characteristic parameters of the target speaker from a preset speaker characteristic parameter library.

3. The method of claim 2, wherein the preset speaker characteristic parameter library comprises a plurality of speaker characteristic parameters and corresponding speaker identity information.

4. The method of claim 3, wherein acquiring the characteristic parameters of the target speaker from the preset speaker characteristic parameter library comprises: acquiring the characteristic parameters of the target speaker from the preset speaker characteristic parameter library according to the identity information of the target speaker.

5. The method of claim 3, wherein the speaker characteristic parameters in the preset speaker characteristic parameter library are obtained through the following steps:

receiving the recorded audio of a speaker;

and performing adaptive training on a universal speech synthesis model according to the recorded audio of the speaker to obtain the speaker characteristic parameters corresponding to the speaker.

6. The method of claim 5, wherein performing adaptive training on a universal speech synthesis model according to the recorded audio of the speaker to obtain the speaker characteristic parameters corresponding to the speaker comprises:

performing adaptive training on the universal speech synthesis model according to the recorded audio of the speaker to obtain a speaker speech synthesis model;

and extracting the speaker characteristic parameters corresponding to the speaker from the speaker speech synthesis model.

7. The method of claim 5, further comprising:

after receiving the recorded audio of a speaker, determining whether the quality of the recorded audio of the speaker meets a preset condition;

if not, sending a reminder to re-record the speaker's audio;

if so, performing the subsequent steps.

8. The method according to any one of claims 5-7, wherein the recorded audio of the speaker is recorded by the speaker through a terminal device and uploaded to the server.

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-8.

10. A storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.

Technical Field

The present invention relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, an electronic device, and a storage medium.

Background

With the continuing popularization of voice-based human-machine conversation technology in intelligent terminals, the function of letting users customize a terminal's timbre has emerged to improve user experience. For example, a user may set the timbre of a smart speaker to the timbre of a person familiar to the user (e.g., the user's wife). To implement this timbre customization function, the user usually records audio of the target speaker on the smart speaker and submits it to the server; the server then trains on the audio recorded by the target speaker to obtain a target speech synthesis model capable of synthesizing speech with the target speaker's timbre.

However, in the process of implementing the present invention, the inventor found that as the number of smart speaker users grows, more and more users customize their devices' timbre, so more and more target speech synthesis models need to be trained and stored on the server side. That is, each person's personalized timbre customization requires a corresponding speech synthesis model on the server side. With millions, or even tens of millions, of intelligent terminals, there may be a corresponding number of speech synthesis models. To guarantee the service quality of speech synthesis under these conditions, a huge number of servers would have to be added, greatly increasing service cost.

Disclosure of Invention

An embodiment of the present invention provides a speech synthesis method, an electronic device, and a storage medium, which are used to solve at least one of the above technical problems.

In a first aspect, an embodiment of the present invention provides a speech synthesis method, which is applied to a server, and the method includes:

acquiring a text to be synthesized;

acquiring characteristic parameters of a target speaker;

and inputting the text to be synthesized and the characteristic parameters of the target speaker into a universal speech synthesis model to obtain synthesized audio with the timbre of the target speaker.

In some embodiments, the obtaining the characteristic parameters of the target speaker comprises: and acquiring the characteristic parameters of the target speaker from a preset speaker characteristic parameter library.

In some embodiments, the preset speaker characteristic parameter library includes a plurality of speaker characteristic parameters and corresponding speaker identity information.

In some embodiments, obtaining the target speaker characteristic parameter from the preset speaker characteristic parameter library includes: and acquiring the characteristic parameters of the target speaker from a preset speaker characteristic parameter library according to the identity information of the target speaker.

In some embodiments, the speaker characteristic parameters in the preset speaker characteristic parameter library are obtained by the following steps:

receiving the recorded audio of a speaker;

and carrying out self-adaptive training on a universal speech synthesis model according to the recorded audio of the speaker to obtain the speaker characteristic parameters corresponding to the speaker.

In some embodiments, adaptively training a generic speech synthesis model based on the recorded audio of the speaker to obtain speaker characteristic parameters corresponding to the speaker comprises:

carrying out self-adaptive training on the universal speech synthesis model according to the recorded audio of the speaker to obtain a speaker speech synthesis model;

speaker characteristic parameters corresponding to the speaker are extracted from the speaker speech synthesis model.

In some embodiments, the speech synthesis method further comprises:

after receiving the recorded audio of a speaker, determining whether the quality of the recorded audio of the speaker meets a preset condition;

if not, sending a reminder to re-record the speaker's audio;

if so, the subsequent steps are performed.

In some embodiments, the recorded audio of the speaker is recorded by the speaker through a terminal device and uploaded to the server.

In a second aspect, an embodiment of the present invention provides a storage medium, in which one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above speech synthesis methods of the present invention.

In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform any of the speech synthesis methods of the present invention described above.

In a fourth aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a storage medium, and the computer program includes program instructions, which, when executed by a computer, cause the computer to execute any one of the above-mentioned speech synthesis methods.

The speech synthesis method is executed on the server; when synthesizing audio with the target speaker's timbre, no speech synthesis model dedicated to the target speaker needs to be configured, and only the text to be synthesized and the target speaker's characteristic parameters need to be input into the universal speech synthesis model. Therefore, the speech synthesis method of the embodiment of the invention can synthesize audio corresponding to the text to be synthesized based on the universal speech synthesis model, provided only that the target speaker's characteristic parameters are prepared in advance.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below depict only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a flow chart of an embodiment of a speech synthesis method of the present invention;

FIG. 2 is a flow chart of another embodiment of the speech synthesis method of the present invention;

FIG. 3 is a flow chart of another embodiment of a speech synthesis method of the present invention;

fig. 4 is a schematic structural diagram of an embodiment of an electronic device according to the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The speech synthesis method of the embodiment of the invention is implemented as an online speech synthesis method. For example, a user conducts a voice human-machine conversation through an intelligent terminal (e.g., a smartphone, smart speaker, smart robot, or smart TV) configured with a voice interaction function, and the intelligent terminal communicates with a server; audio data is synthesized on the server and sent back to the intelligent terminal.

As shown in fig. 1, an embodiment of the present invention provides a speech synthesis method applied to a server, where the method includes:

and S10, acquiring the text to be synthesized.

Illustratively, the text to be synthesized may be text content input by the user through the terminal device; text content recognized by the terminal device from the user's voice input; text content obtained by the server performing speech recognition on audio data sent by the terminal device; or an answer, or promotional content, that the server determines corresponds to the user's question.

And S20, acquiring characteristic parameters of the target speaker.

Illustratively, the target speaker characteristic parameters are the characteristic parameters of the target speaker whose voice is to be reproduced. For example, if the user sets the voice-broadcast sound of the smart speaker to his wife's voice, the target speaker characteristic parameters are the characteristic parameters of the wife's voice. The sound characteristic parameters may be timbre characteristic parameters.

And S30, inputting the text to be synthesized and the characteristic parameters of the target speaker into a universal speech synthesis model to obtain synthesized audio with the timbre of the target speaker.

The speech synthesis method of this embodiment is executed on the server; when synthesizing audio with the target speaker's timbre, no speech synthesis model dedicated to the target speaker needs to be configured, and only the text to be synthesized and the target speaker's characteristic parameters need to be input into the universal speech synthesis model. Therefore, the method can synthesize audio corresponding to the text to be synthesized based on the universal speech synthesis model, provided only that the target speaker's characteristic parameters are prepared in advance.
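For illustration only, steps S10 to S30 can be sketched in code as follows. The class, function names, and parameter values are hypothetical assumptions, not part of the claimed method; the toy "synthesis" merely shows one shared model being conditioned on per-speaker characteristic parameters so that the same text yields different audio for different speakers.

```python
# Hypothetical sketch of steps S10-S30: a single shared ("universal")
# synthesis model is conditioned on per-speaker characteristic parameters,
# so no speaker-specific model has to be deployed per user.

class UniversalSynthesisModel:
    """Stand-in for the shared base model: the output depends on both the
    input text and the speaker characteristic parameters."""
    def synthesize(self, text, speaker_params):
        # Toy "audio": one sample per (character, parameter) pair, so the
        # same text produces different audio for different speakers.
        return [ord(c) * p for c in text for p in speaker_params]

def synthesize_for_speaker(model, text, speaker_params):
    # S10: acquire the text to be synthesized.
    # S20: acquire the target speaker characteristic parameters.
    # S30: feed both into the universal model.
    return model.synthesize(text, speaker_params)

model = UniversalSynthesisModel()
params_a = [0.2, 0.9]   # assumed characteristic parameters, speaker A
params_b = [0.7, 0.1]   # assumed characteristic parameters, speaker B
audio_a = synthesize_for_speaker(model, "hi", params_a)
audio_b = synthesize_for_speaker(model, "hi", params_b)
```

One shared model instance serves both speakers; only the small parameter vector changes between requests.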

In some embodiments, the obtaining the characteristic parameters of the target speaker comprises: and acquiring the characteristic parameters of the target speaker from a preset speaker characteristic parameter library.

In this embodiment, the preset speaker characteristic parameter library includes a plurality of speaker characteristic parameters and the identity information of the corresponding speakers, where the speaker characteristic parameters (i.e., speaker sound characteristic parameters) correspond one-to-one to the speaker identity information. The speaker identity information may be the account information with which the user logs in; thus, whether or not the user uses the same intelligent terminal, the user obtains synthesized audio with the same timbre as long as the same account is used. The speaker identity information may also be the ID information of the intelligent terminal used by the user.

In this embodiment, by constructing the preset speaker characteristic parameter library, the sound characteristic parameters of multiple speakers can be stored, and multiple speakers can reuse the same universal speech synthesis model to synthesize audio with different speaker timbres. This improves the utilization of the speech synthesis model and reduces the server's burden of deploying speech synthesis models.

In some embodiments, obtaining the target speaker characteristic parameter from the preset speaker characteristic parameter library includes: and acquiring the characteristic parameters of the target speaker from a preset speaker characteristic parameter library according to the identity information of the target speaker.

Illustratively, the target speaker identity information may be the account information with which the user logs in; thus, whether or not the user uses the same intelligent terminal, the user obtains synthesized audio with the same timbre as long as the same account is used. The target speaker identity information may also be the ID information of the intelligent terminal used by the user.

When the user communicates with the server through the terminal device, the communication data includes the target speaker identity information. The server only needs to parse the communication data to obtain the corresponding target speaker identity information, and then acquires the corresponding target speaker characteristic parameters from the preset speaker characteristic parameter library according to that identity information.
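For illustration only, the preset speaker characteristic parameter library can be sketched as a simple mapping keyed by identity information (a login account or a terminal ID). The class name, key formats, and parameter values below are hypothetical.

```python
# Hypothetical sketch of the preset speaker characteristic parameter
# library: characteristic parameters keyed one-to-one by speaker identity
# information, as described above.

class SpeakerParameterLibrary:
    def __init__(self):
        self._params = {}  # identity information -> characteristic parameters

    def register(self, identity, characteristic_params):
        self._params[identity] = characteristic_params

    def lookup(self, identity):
        # Returns the target speaker's parameters, or None if this
        # identity has not registered a voice yet.
        return self._params.get(identity)

library = SpeakerParameterLibrary()
library.register("account:alice", [0.2, 0.9, 0.5])       # login account
library.register("device:speaker-42", [0.7, 0.1, 0.3])   # terminal ID

# Server side: identity information parsed from the communication data
# is used to fetch the matching characteristic parameters.
target = library.lookup("account:alice")
```

Because the key is the account (or terminal ID), the same account retrieves the same parameters regardless of which terminal the user speaks from.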

Fig. 2 is a flowchart of another embodiment of the speech synthesis method according to the present invention, in which the speaker characteristic parameters in the preset speaker characteristic parameter library are obtained through the following steps:

and S01, receiving the recorded audio of the speaker.

Illustratively, the recorded audio of the speaker is recorded by the speaker through the terminal device and uploaded to the server.

S02, performing adaptive training on the universal speech synthesis model according to the recorded audio of the speaker to obtain the speaker characteristic parameters corresponding to the speaker.

Illustratively, taking a smartphone equipped with a voice assistant as an example, the steps for a speaker to record audio in this embodiment are as follows:

step 1: the method comprises the following steps that firstly, a setting interface of the smart phone is entered, and a prompt of sound recording is given under a quiet condition.

Step 2: The smartphone detects the surrounding noise. Under normal household noise conditions, the environment is required to be within 55 dB with a signal-to-noise ratio of at least 10 dB; these relatively high requirements allow a home scene to be faithfully reproduced. The main goal is that the collected audio be clean enough to help improve the similarity of the reproduced voice. If noise detection passes, the flow continues; if it fails, the user is prompted to find a quiet environment.

Step 3: The settings interface displays text annotated with pinyin. The text should provide as wide a phonetic coverage as possible and match the actual product; displaying pinyin mainly resolves polyphonic characters. The text the user reads therefore needs to be carefully designed.

Step 4: The settings interface also provides a read-aloud function, which can play the text audio for reference by children, elderly users, users with heavy accents, and others.

Step 5: If the user does not need the read-aloud reference, the user can tap to start recording; the recording flow requires the preset set of 10 utterances to be recorded.

Step 6: After recording, the smartphone checks whether the recorded audio meets the requirements, including the amount of audio, the signal-to-noise ratio, intelligibility, and so on. If the check fails, the user is prompted to return to step 3 and record again.

Step 7: After passing the check, the user taps to upload the audio file; the settings interface shows the upload progress and prompts that the upload succeeded once it finishes.
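For illustration only, the client-side recording flow above can be sketched as a single function. The thresholds follow step 2, while the callback names, prompts, and the toy recording and upload stand-ins are hypothetical assumptions.

```python
# Hypothetical sketch of the client-side recording flow (steps 1-7):
# noise gate, sentence-by-sentence recording, quality check, upload.

MAX_AMBIENT_DB = 55.0   # step 2: ambient noise ceiling
SENTENCES = [f"sentence {i}" for i in range(1, 11)]  # step 3: 10 preset texts

def recording_flow(ambient_db, record_fn, quality_ok_fn, upload_fn):
    # Step 2: environment noise gate.
    if ambient_db > MAX_AMBIENT_DB:
        return "please find a quiet environment"
    # Steps 3-5: record each preset sentence.
    clips = [record_fn(text) for text in SENTENCES]
    # Step 6: on-device quality check (amount, SNR, intelligibility, ...).
    if not all(quality_ok_fn(clip) for clip in clips):
        return "recording failed the check, please record again"
    # Step 7: upload the audio to the server.
    upload_fn(clips)
    return "upload successful"

# Toy callbacks standing in for real recording, checking, and uploading.
uploaded = []
result = recording_flow(
    ambient_db=40.0,
    record_fn=lambda text: f"audio<{text}>",
    quality_ok_fn=lambda clip: True,
    upload_fn=uploaded.extend,
)
```

A noisy environment (over 55 dB) short-circuits the flow before any recording happens, matching the prompt in step 2.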

Fig. 3 is a flowchart illustrating another embodiment of a speech synthesis method according to the present invention, in which adaptive training is performed on a generic speech synthesis model according to the recorded audio of the speaker to obtain speaker characteristic parameters corresponding to the speaker, including:

S021, performing adaptive training on the universal speech synthesis model according to the recorded audio of the speaker to obtain a speaker speech synthesis model.

Illustratively, a small amount of the target speaker's audio is used for adaptive training on top of a base model (i.e., the universal speech synthesis model). Compared with conventional synthesis, this effectively reduces the amount of target-speaker data required and quickly produces the needed model.

S022, extracting speaker characteristic parameters corresponding to the speaker from the speaker voice synthesis model.

Illustratively, the recorded audio uploaded by the user is processed and used as the training corpus for the base model (i.e., the universal speech synthesis model). The training involves only part of the base model's parameters; after training, the "adjusted" parameters are stored in a specified format.
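For illustration only, S021/S022 can be sketched as adapting just a small subset of the base model's parameters and storing only that subset. The split into "shared" versus "adaptable" parameters, the toy training update, and all values are hypothetical assumptions, not the patent's actual training procedure.

```python
# Hypothetical sketch of S021/S022: adapt only part of the base model's
# parameters on the speaker's recordings, then store just that part as
# the speaker characteristic parameters in a specified format (JSON here).
import json

BASE_MODEL = {
    "shared":    {"encoder": [0.5, 0.5, 0.5]},       # frozen for all speakers
    "adaptable": {"speaker_embedding": [0.0, 0.0]},  # tuned per speaker
}

def adapt_to_speaker(base_model, recorded_audio, steps=3, lr=0.5):
    # Toy "adaptive training": nudge only the adaptable parameters toward
    # simple statistics of the recording; shared parameters stay untouched.
    params = list(base_model["adaptable"]["speaker_embedding"])
    target = [sum(recorded_audio) / len(recorded_audio),
              max(recorded_audio)]
    for _ in range(steps):
        params = [p + lr * (t - p) for p, t in zip(params, target)]
    return {"speaker_embedding": params}

def save_speaker_params(params):
    # S022: store the "adjusted" parameters; this blob is far smaller
    # than a full copy of the base model.
    return json.dumps(params)

recording = [0.1, 0.4, 0.3]          # stand-in for the uploaded audio
speaker_params = adapt_to_speaker(BASE_MODEL, recording)
blob = save_speaker_params(speaker_params)
```

Only the per-speaker blob is persisted per user; the shared parameters exist once on the server, which is the storage saving discussed below.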

Usage stage: the user initiates a request to synthesize personalized speech; the voice reproduction service extracts the key information in the request and sends it to the underlying synthesis engine; the synthesis engine passes the speaker characteristic parameters corresponding to the client, the text features, and other information into the base model; and the base model synthesizes the audio, which is finally sent to the user.

The speaker characteristic parameters in the embodiment of the invention occupy far less space than the base model, and storing characteristic parameters, rather than a full speaker model, for each user reduces the storage requirement. In addition, the voice reproduction service can serve the synthesis requests of all users with a single base model, saving server resources (the number of models one server can deploy is limited). Cold-start time for a new speaker is also reduced, since loading speaker characteristic parameters takes far less time than loading a new model.

Training and synthesizing the model on the server side effectively reduces the load on the on-device chip. If the synthesis model were placed locally, the latency from synthesis to local playback would be reduced by 300 ms, but this would require a 53600 MHz chip frequency and 30 MB of memory.

In some embodiments, the speech synthesis method further comprises:

after receiving the recorded audio of a speaker, determining whether the quality of the recorded audio of the speaker meets a preset condition;

if not, sending a reminder to re-record the speaker's audio;

if yes, the following steps are executed: and carrying out self-adaptive training on a universal speech synthesis model according to the recorded audio of the speaker to obtain the speaker characteristic parameters corresponding to the speaker.

In this embodiment, determining whether the quality of the speaker's recorded audio meets the preset condition includes determining whether the signal-to-noise ratio of the recorded audio meets a preset signal-to-noise-ratio condition.

When the quality of the recorded audio does not meet the preset condition, the server sends the intelligent terminal a reminder to re-record the speaker's audio, prompting the user to record again.
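For illustration only, the server-side quality gate can be sketched as a signal-to-noise-ratio check. The 10 dB threshold follows the recording flow described earlier; the SNR estimate, function names, and sample values are deliberately simple hypothetical stand-ins.

```python
# Hypothetical sketch of the quality gate: estimate the signal-to-noise
# ratio of the uploaded recording and either accept it or remind the
# user to re-record.
import math

SNR_THRESHOLD_DB = 10.0  # threshold from the recording flow above

def estimate_snr_db(signal, noise):
    # Ratio of mean signal power to mean noise power, in decibels.
    p_sig = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_sig / p_noise)

def check_recording(signal, noise):
    # Returns a go-ahead, or the reminder to re-record.
    if estimate_snr_db(signal, noise) >= SNR_THRESHOLD_DB:
        return "ok"
    return "please re-record in a quieter environment"

clean = [0.5, -0.5, 0.5, -0.5]       # stand-in speech samples
quiet = [0.01, -0.01, 0.01, -0.01]   # low background noise
loud  = [0.4, -0.4, 0.4, -0.4]       # heavy background noise
```

With quiet background noise the SNR is well above 10 dB and the recording passes; with loud noise it falls below the threshold and triggers the reminder.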

It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the above speech synthesis methods of the present invention.

In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the speech synthesis methods described above.

In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speech synthesis method.

Fig. 4 is a schematic diagram of a hardware structure of an electronic device for performing a speech synthesis method according to another embodiment of the present application, where as shown in fig. 4, the electronic device includes:

one or more processors 410 and a memory 420, with one processor 410 being an example in fig. 4.

The apparatus for performing the speech synthesis method may further include: an input device 430 and an output device 440.

The processor 410, the memory 420, the input device 430, and the output device 440 may be connected by a bus or other means, such as the bus connection in fig. 4.

The memory 420, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the speech synthesis method in the embodiments of the present application. The processor 410 executes various functional applications of the server and data processing by executing nonvolatile software programs, instructions and modules stored in the memory 420, namely, implements the speech synthesis method of the above-described method embodiment.

The memory 420 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the speech synthesis apparatus, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 420 may optionally include memory located remotely from processor 410, which may be connected to the speech synthesis apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 may receive input numeric or character information and generate signals related to user settings and function control of the speech synthesis apparatus. The output device 440 may include a display device such as a display screen.

The one or more modules are stored in the memory 420 and, when executed by the one or more processors 410, perform the speech synthesis method of any of the method embodiments described above.

The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.

(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.

(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.

(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.

(5) And other electronic devices with data interaction functions.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, or by hardware. Based on this understanding, the technical solutions above, in essence or in the part contributing to the related art, may be embodied in the form of a software product stored in a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) that includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods of the embodiments or parts thereof.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
