Method, device and system for constructing personalized speech synthesis model and electronic equipment

Document No.: 600202 · Publication date: 2021-05-04

Description: This technology, "Method, device and system for constructing personalized speech synthesis model and electronic equipment" (个性化语音合成模型构建方法、装置、系统及电子设备), was designed and created by 霍媛圆 and 雷鸣 on 2019-10-29. Its main content is as follows: The application discloses a method, device, and system for constructing a personalized speech synthesis model; a personalized speech synthesis method, device, and system; and an electronic device. The model construction method comprises: dividing a recording text into a plurality of sentence texts; while user recording data is collected, displaying the sentence text currently being read in a first display mode and displaying the text information following it in a second display mode; and sending the collected user recording data to a server, so that the server constructs the user's personalized speech synthesis model from the recording data. This approach controls the pauses between sentences in the user's recording and avoids abnormal pauses in the middle of a sentence, which ensures recording quality and makes it easier to obtain a good sentence-segmentation result from the whole recording. The accuracy of the personalized speech synthesis model can therefore be effectively improved, which in turn improves the naturalness and timbre of the personalized synthesized speech.

1. A personalized speech synthesis model construction method, comprising:

dividing the recording text into a plurality of sentence texts;

while user recording data is collected, displaying the sentence text currently being read in a first display mode, and displaying the text information following the current sentence text in a second display mode;

and sending the collected user recording data to a server, so that the server constructs the user's personalized speech synthesis model according to the user recording data.

2. The method of claim 1,

the first display mode comprises a highlighted display mode;

the second display mode comprises a non-highlighted display mode.

3. The method of claim 1,

the first display mode and the second display mode differ in color, font, and/or font size.

4. The method of claim 1,

the second display mode comprises a recording progress bar mode, so that the user can adjust the reading speed according to the recording progress bar.

5. The method of claim 1,

the text information following the current sentence text comprises: the sentence number the user is currently recording, and/or the number of unread sentences.

6. The method of claim 1, wherein displaying the sentence text currently being read in the first display mode comprises:

determining a display duration for the current sentence text according to its text length;

and displaying the current sentence text in the first display mode for the display duration.

7. The method of claim 6, wherein determining the display duration for the current sentence text according to its text length comprises:

determining a first display duration according to the text length of the current sentence text and a per-character reading duration;

and taking a duration longer than the first display duration as a second display duration of the current sentence text.

8. The method of claim 1, further comprising:

and generating a recording text whose length is less than a length threshold, at least according to characters that users in different regions pronounce differently.

9. The method of claim 1, further comprising:

and filtering out, from the user recording data, speech data that is irrelevant to the recording text.

10. An apparatus for constructing a personalized speech synthesis model, comprising:

a text division unit, configured to divide the recording text into a plurality of sentence texts;

a text display unit, configured to display, while user recording data is collected, the sentence text currently being read in a first display mode and the text information following it in a second display mode;

and a recording data sending unit, configured to send the collected user recording data to a server, so that the server constructs the user's personalized speech synthesis model according to the user recording data.

11. An electronic device, comprising:

a processor;

a memory;

the memory being configured to store a program implementing the personalized speech synthesis model construction method; after the device is powered on and the program is run by the processor, the following steps are performed: dividing the recording text into a plurality of sentence texts; while user recording data is collected, displaying the sentence text currently being read in a first display mode, and displaying the text information following it in a second display mode; and sending the collected user recording data to a server, so that the server constructs the user's personalized speech synthesis model according to the user recording data.

12. A personalized speech synthesis model construction method, comprising:

receiving user recording data sent by a client;

acquiring a recording text corresponding to the user recording data;

and constructing the user's personalized speech synthesis model according to the user recording data and the recording text.

13. The method of claim 12, wherein constructing the user's personalized speech synthesis model according to the user recording data and the recording text comprises:

dividing the user recording data into a plurality of sentence recording data;

determining sentence text corresponding to the sentence recording data;

constructing a network structure of the personalized speech synthesis model;

and learning the personalized speech synthesis model from the set of correspondences between the sentence recording data and the sentence texts.

14. The method of claim 13, wherein the network structure comprises a neural network structure.

15. The method of claim 13, wherein dividing the user recording data into a plurality of sentence recording data comprises:

dividing the user recording data into a plurality of sentence recording data by means of a voice activity detection algorithm.

16. The method of claim 12, further comprising:

acquiring, from the recording text, characters that users in different regions pronounce differently;

acquiring, from the user recording data, recording segment data corresponding to the characters;

and constructing the user's personalized speech synthesis model according to the correspondence among the user recording data, the recording text, the characters, and the recording segment data.

17. The method of claim 12, further comprising:

and filtering out, from the user recording data, speech data that is irrelevant to the recording text.

18. An apparatus for constructing a personalized speech synthesis model, comprising:

a recording data receiving unit, configured to receive user recording data sent by a client;

a recording text acquisition unit, configured to acquire the recording text corresponding to the user recording data;

and a model building unit, configured to construct the user's personalized speech synthesis model according to the user recording data and the recording text.

19. An electronic device, comprising:

a processor;

a memory;

the memory being configured to store a program implementing the personalized speech synthesis model construction method; after the device is powered on and the program is run by the processor, the following steps are performed: receiving user recording data sent by a client; acquiring the recording text corresponding to the user recording data; and constructing the user's personalized speech synthesis model according to the user recording data and the recording text.

20. A system for building a personalized speech synthesis model, comprising:

the personalized speech synthesis model construction apparatus of claim 10; and the personalized speech synthesis model construction apparatus of claim 18.

21. A method for personalized speech synthesis, comprising:

receiving a personalized speech synthesis model construction request for a target user sent by a client, the construction request comprising user recording data corresponding to a first recording text, wherein the user recording data is collected as follows: dividing the first recording text into a plurality of sentence texts; and, while the user recording data is collected, displaying the sentence text currently being read in a first display mode and displaying the text information following it in a second display mode;

constructing the target user's personalized speech synthesis model according to the user recording data;

receiving a personalized speech synthesis request for the target user sent by a client, the synthesis request comprising second recording text information;

and generating the target user's personalized speech data corresponding to the second recording text according to the target user's personalized speech synthesis model.

22. A personalized speech synthesis apparatus, comprising:

a first request receiving unit, configured to receive a personalized speech synthesis model construction request for a target user sent by a client, the construction request comprising user recording data corresponding to a first recording text, wherein the user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts, and, while the user recording data is collected, the sentence text currently being read is displayed in a first display mode and the text information following it in a second display mode;

a model building unit, configured to construct the target user's personalized speech synthesis model according to the user recording data;

a second request receiving unit, configured to receive a personalized speech synthesis request for the target user sent by the client, the synthesis request comprising second recording text information;

and a speech synthesis unit, configured to generate the target user's personalized speech data corresponding to the second recording text according to the target user's personalized speech synthesis model.

23. An electronic device, comprising:

a processor;

a memory;

the memory being configured to store a program implementing the personalized speech synthesis method; after the device is powered on and the program is run by the processor, the following steps are performed: receiving a personalized speech synthesis model construction request for a target user sent by a client, the construction request comprising user recording data corresponding to a first recording text, wherein the user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts, and, while the user recording data is collected, the sentence text currently being read is displayed in a first display mode and the text information following it in a second display mode; constructing the target user's personalized speech synthesis model according to the user recording data; receiving a personalized speech synthesis request for the target user sent by a client, the synthesis request comprising second recording text information; and generating the target user's personalized speech data corresponding to the second recording text according to the target user's personalized speech synthesis model.

24. A method for personalized speech synthesis, comprising:

determining a second recording text of the target user for which speech synthesis is to be performed;

and sending a personalized speech synthesis request for the target user to a server, the synthesis request comprising second recording text information, so that the server generates the target user's personalized speech data corresponding to the second recording text according to the target user's personalized speech synthesis model, wherein the personalized speech synthesis model is constructed as follows: receiving a personalized speech synthesis model construction request for the target user sent by a client, the construction request comprising user recording data corresponding to a first recording text, wherein the user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts, and, while the recording data is collected, the sentence text currently being read is displayed in a first display mode and the text information following it in a second display mode; and constructing the target user's personalized speech synthesis model according to the recording data.

25. A personalized speech synthesis apparatus, comprising:

a recording text determining unit, configured to determine a second recording text of the target user for which speech synthesis is to be performed;

and a request sending unit, configured to send a personalized speech synthesis request for the target user to a server, the synthesis request comprising second recording text information, so that the server generates the target user's personalized speech data corresponding to the second recording text according to the target user's personalized speech synthesis model, wherein the personalized speech synthesis model is constructed as follows: receiving a personalized speech synthesis model construction request for the target user sent by a client, the construction request comprising user recording data corresponding to a first recording text, wherein the user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts, and, while the recording data is collected, the sentence text currently being read is displayed in a first display mode and the text information following it in a second display mode; and constructing the target user's personalized speech synthesis model according to the recording data.

26. An electronic device, comprising:

a processor;

a memory;

the memory being configured to store a program implementing the personalized speech synthesis method; after the device is powered on and the program is run by the processor, the following steps are performed: determining a second recording text of the target user for which speech synthesis is to be performed; and sending a personalized speech synthesis request for the target user to a server, the synthesis request comprising second recording text information, so that the server generates the target user's personalized speech data corresponding to the second recording text according to the target user's personalized speech synthesis model, wherein the personalized speech synthesis model is constructed as follows: receiving a personalized speech synthesis model construction request for the target user sent by the electronic device, the construction request comprising user recording data corresponding to a first recording text, wherein the user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts, and, while the recording data is collected, the sentence text currently being read is displayed in a first display mode and the text information following it in a second display mode; and constructing the target user's personalized speech synthesis model according to the recording data.

27. The electronic device of claim 26, wherein

the device comprises a smart speaker;

the smart speaker comprises a sound collection device, a sound playing device, and a display device;

and the smart speaker is configured to collect the user recording data through the sound collection device, display the first recording text through the display device, and play the personalized speech data through the sound playing device.

28. A personalized speech synthesis system, comprising:

the personalized speech synthesis apparatus of claim 22; and the personalized speech synthesis apparatus of claim 25.

Technical Field

The present application relates to the field of data processing, and in particular to a method, device, and system for constructing a personalized speech synthesis model; a personalized speech synthesis method, device, and system; and an electronic device.

Background

Personalized speech synthesis uses TTS (Text-To-Speech) technology to synthesize speech with the voice, speaking style, and speaking emotion of a specific person, after a number of voice segments of that person have been captured by a recording device.

Personalized speech synthesis involves several recently developed techniques in phonetics, including speech spectrum feature conversion, prosodic feature conversion, personalized speech synthesis model construction, and personalized parameter adaptation. Model construction is one of the core techniques of personalized speech synthesis, and it can be realized in several ways. One way is to train a personalized speech synthesis model directly on the recorded data; this approach is simple and feasible. Another way is to learn the personalized speech synthesis model from training data formed by the correspondence between each recorded sentence and its sentence text; this approach can synthesize speech with high naturalness and good timbre, and has therefore become the commonly used construction technique for personalized speech synthesis models.

However, in the process of implementing the present invention, the inventors found that the prior art has at least the following problem: because a good sentence-segmentation result cannot be obtained from the whole recording, a high-quality personalized speech synthesis model cannot be obtained, and consequently speech with high naturalness and good timbre cannot be synthesized with the model.

Disclosure of Invention

The present application provides a personalized speech synthesis model construction method to address the low accuracy of personalized speech synthesis models in the prior art. The application further provides a personalized speech synthesis model construction device and system; a personalized speech synthesis method, device, and system; and an electronic device.

The application provides a method for constructing a personalized speech synthesis model, which comprises the following steps:

dividing the recording text into a plurality of sentence texts;

while user recording data is collected, displaying the sentence text currently being read in a first display mode, and displaying the text information following the current sentence text in a second display mode;

and sending the collected user recording data to a server, so that the server constructs the user's personalized speech synthesis model according to the user recording data.
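The first of the steps above, dividing the recording text into sentence texts, can be sketched as follows. This is a minimal illustration, assuming sentences end at sentence-final punctuation; the application does not specify the actual splitting rule, so the delimiter set and the function name are illustrative only.

```python
import re

def split_into_sentences(recording_text):
    """Split a recording text into sentence texts at sentence-final
    punctuation (hypothetical rule; handles both Latin and CJK marks)."""
    # Split after ., !, ? or their CJK counterparts, absorbing any
    # following whitespace; keep the punctuation with its sentence.
    parts = re.split(r'(?<=[.!?。！？])\s*', recording_text)
    return [p for p in parts if p]  # drop empty trailing piece

sentences = split_into_sentences(
    "The quick brown fox jumps. Does it land safely? It does!"
)
```

Each resulting sentence text is then presented to the user one at a time during recording.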

Optionally, the first display mode comprises a highlighted display mode;

the second display mode comprises a non-highlighted display mode.

Optionally, the first display mode and the second display mode differ in color, font, and/or font size.

Optionally, the second display mode comprises a recording progress bar mode, so that the user can adjust the reading speed according to the recording progress bar.

Optionally, the text information following the current sentence text comprises: the sentence number the user is currently recording, and/or the number of unread sentences.

Optionally, displaying the sentence text currently being read in the first display mode comprises:

determining a display duration for the current sentence text according to its text length;

and displaying the current sentence text in the first display mode for the display duration.

Optionally, determining the display duration for the current sentence text according to its text length comprises:

determining a first display duration according to the text length of the current sentence text and a per-character reading duration;

and taking a duration longer than the first display duration as a second display duration of the current sentence text.
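This duration rule can be sketched as follows. The per-character reading time and the extra margin are assumed values for illustration only; the application does not specify them.

```python
def display_duration(sentence_text, per_char_seconds=0.3, margin_seconds=1.0):
    """First duration = text length x an assumed per-character reading
    time; the displayed (second) duration is a strictly longer value,
    here the first duration plus a fixed margin. Both constants are
    illustrative, not taken from the application."""
    first_duration = len(sentence_text) * per_char_seconds
    second_duration = first_duration + margin_seconds  # strictly longer
    return second_duration

d = display_duration("Hello world")  # 11 chars -> 11 * 0.3 + 1.0 = 4.3 s
```

Giving the sentence slightly more time than the expected reading time leaves room for a natural pause before the next sentence is shown.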

Optionally, the method further includes:

and generating a recording text whose length is less than a length threshold, at least according to characters that users in different regions pronounce differently.
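As an illustration of this optional step, the sketch below packs region-sensitive characters into a short recording text and pads it with filler sentences while the length threshold allows. The character set, threshold value, and function name are hypothetical; the application does not enumerate them.

```python
# Hypothetical set of characters that users in different regions
# pronounce differently (illustrative entries only).
REGIONAL_CHARS = ["血", "谁", "室", "液"]

def generate_recording_text(filler_sentences, length_threshold=30):
    """Build a recording text shorter than `length_threshold` that
    contains every region-sensitive character, padding with filler
    sentences while room remains."""
    text = "".join(REGIONAL_CHARS)
    for sentence in filler_sentences:
        if len(text) + len(sentence) >= length_threshold:
            break
        text += sentence
    return text

text = generate_recording_text(["今天天气很好。", "我们开始录音。"])
```

Keeping the text short limits the recording burden on the user while still covering the pronunciations the model most needs to observe.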

Optionally, the method further includes:

and filtering out, from the user recording data, speech data that is irrelevant to the recording text.

The present application further provides a personalized speech synthesis model building apparatus, including:

a text division unit, configured to divide the recording text into a plurality of sentence texts;

a text display unit, configured to display, while user recording data is collected, the sentence text currently being read in a first display mode and the text information following it in a second display mode;

and a recording data sending unit, configured to send the collected user recording data to a server, so that the server constructs the user's personalized speech synthesis model according to the user recording data.

The present application further provides an electronic device, comprising:

a processor;

a memory;

the memory being configured to store a program implementing the personalized speech synthesis model construction method; after the device is powered on and the program is run by the processor, the following steps are performed: dividing the recording text into a plurality of sentence texts; while user recording data is collected, displaying the sentence text currently being read in a first display mode, and displaying the text information following it in a second display mode; and sending the collected user recording data to a server, so that the server constructs the user's personalized speech synthesis model according to the user recording data.

The application also provides a method for constructing the personalized speech synthesis model, which comprises the following steps:

receiving user recording data sent by a client;

acquiring a recording text corresponding to the user recording data;

and constructing the user's personalized speech synthesis model according to the user recording data and the recording text.

Optionally, constructing the user's personalized speech synthesis model according to the user recording data and the recording text comprises:

dividing the user recording data into a plurality of sentence recording data;

determining sentence text corresponding to the sentence recording data;

constructing a network structure of the personalized speech synthesis model;

and learning the personalized speech synthesis model from the set of correspondences between the sentence recording data and the sentence texts.
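The pairing that produces this correspondence set can be sketched as follows. The recording placeholders and the function name are illustrative; in a real system each entry would hold actual audio, and the paired set would feed the model training step.

```python
def build_correspondence_set(sentence_recordings, sentence_texts):
    """Pair each segmented sentence recording with its sentence text to
    form the training set from which the personalized model is learned.
    Recordings are represented here as opaque placeholders."""
    if len(sentence_recordings) != len(sentence_texts):
        # A count mismatch means the audio segmentation diverged from
        # the text segmentation, so the pairs would be misaligned.
        raise ValueError("segmentation mismatch: counts differ")
    return list(zip(sentence_recordings, sentence_texts))

pairs = build_correspondence_set(
    ["rec_0.wav", "rec_1.wav"],
    ["First sentence.", "Second sentence."],
)
```

This is why controlled inter-sentence pauses matter: if the audio splits into a different number of segments than the text, the correspondence set cannot be built cleanly.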

Optionally, the network structure comprises a neural network structure.

Optionally, dividing the user recording data into a plurality of sentence recording data comprises:

dividing the user recording data into a plurality of sentence recording data by means of a voice activity detection algorithm.
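A toy, energy-based stand-in for such a voice activity detection step is sketched below: frames whose mean absolute amplitude stays below a threshold for several consecutive frames are treated as pauses between sentences. The application does not name a specific algorithm, so the frame size, threshold, and gap length here are assumptions; a production system would use a proper VAD.

```python
def segment_by_energy(samples, frame_len=160, threshold=0.01, min_gap_frames=3):
    """Return (start, end) sample ranges of speech segments, splitting
    wherever at least `min_gap_frames` consecutive low-energy frames
    occur. Simplified stand-in for a real VAD algorithm."""
    n_frames = len(samples) // frame_len
    active = [
        sum(abs(s) for s in samples[i*frame_len:(i+1)*frame_len]) / frame_len
        >= threshold
        for i in range(n_frames)
    ]
    segments, start, gap = [], None, 0
    for i, a in enumerate(active):
        if a:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:
                # Close the segment at the last active frame.
                segments.append((start * frame_len, (i - gap + 1) * frame_len))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments

# Synthetic signal: speech, a long pause, then speech again.
speech = [0.5] * 800
pause = [0.0] * 800
segments = segment_by_energy(speech + pause + speech)
```

The `min_gap_frames` parameter is what makes controlled inter-sentence pauses valuable: deliberate pauses are long enough to split on, while normal in-sentence gaps are not.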

Optionally, the method further includes:

acquiring, from the recording text, characters that users in different regions pronounce differently;

acquiring, from the user recording data, recording segment data corresponding to the characters;

and constructing the user's personalized speech synthesis model according to the correspondence among the user recording data, the recording text, the characters, and the recording segment data.

Optionally, the method further includes:

and filtering out, from the user recording data, speech data that is irrelevant to the recording text.

The present application further provides a personalized speech synthesis model building apparatus, including:

a recording data receiving unit, configured to receive user recording data sent by a client;

a recording text acquisition unit, configured to acquire the recording text corresponding to the user recording data;

and a model building unit, configured to construct the user's personalized speech synthesis model according to the user recording data and the recording text.

The present application further provides an electronic device, comprising:

a processor;

a memory;

the memory being configured to store a program implementing the personalized speech synthesis model construction method; after the device is powered on and the program is run by the processor, the following steps are performed: receiving user recording data sent by a client; acquiring the recording text corresponding to the user recording data; and constructing the user's personalized speech synthesis model according to the user recording data and the recording text.

The present application further provides a personalized speech synthesis model construction system, including:

the above client-side personalized speech synthesis model construction device; and the above server-side personalized speech synthesis model construction device.

The application also provides a personalized speech synthesis method, which comprises the following steps:

receiving a personalized speech synthesis model construction request for a target user sent by a client, the construction request comprising user recording data corresponding to a first recording text, wherein the user recording data is collected as follows: dividing the first recording text into a plurality of sentence texts; and, while the user recording data is collected, displaying the sentence text currently being read in a first display mode and displaying the text information following it in a second display mode;

constructing the target user's personalized speech synthesis model according to the user recording data;

receiving a personalized speech synthesis request for the target user sent by a client, the synthesis request comprising second recording text information;

and generating the target user's personalized speech data corresponding to the second recording text according to the target user's personalized speech synthesis model.

The present application further provides a personalized speech synthesis device, comprising:

a first request receiving unit, configured to receive a personalized speech synthesis model construction request for a target user sent by a client, the construction request comprising user recording data corresponding to a first recording text, wherein the user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts, and, while the user recording data is collected, the sentence text currently being read is displayed in a first display mode and the text information following it in a second display mode;

a model building unit, configured to construct the target user's personalized speech synthesis model according to the user recording data;

a second request receiving unit, configured to receive a personalized speech synthesis request for the target user sent by the client, the synthesis request comprising second recording text information;

and a speech synthesis unit, configured to generate the target user's personalized speech data corresponding to the second recording text according to the target user's personalized speech synthesis model.

The present application further provides an electronic device, comprising:

a processor;

a memory;

the memory being configured to store a program implementing the personalized speech synthesis method; after the device is powered on and the program is run by the processor, the following steps are performed: receiving a personalized speech synthesis model construction request for a target user sent by a client, the construction request comprising user recording data corresponding to a first recording text, wherein the user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts, and, while the user recording data is collected, the sentence text currently being read is displayed in a first display mode and the text information following it in a second display mode; constructing the target user's personalized speech synthesis model according to the user recording data; receiving a personalized speech synthesis request for the target user sent by a client, the synthesis request comprising second recording text information; and generating the target user's personalized speech data corresponding to the second recording text according to the target user's personalized speech synthesis model.

The application also provides a personalized speech synthesis method, which comprises the following steps:

determining a second recording text of the target user for which speech synthesis is to be performed;

and sending a personalized speech synthesis request for the target user to a server, the synthesis request comprising second recording text information, so that the server generates the target user's personalized speech data corresponding to the second recording text according to the target user's personalized speech synthesis model, wherein the personalized speech synthesis model is constructed as follows: receiving a personalized speech synthesis model construction request for the target user sent by a client, the construction request comprising user recording data corresponding to a first recording text, wherein the user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts, and, while the recording data is collected, the sentence text currently being read is displayed in a first display mode and the text information following it in a second display mode; and constructing the target user's personalized speech synthesis model according to the recording data.

The present application further provides a personalized speech synthesis device, comprising:

the recording text determining unit is used for determining a second recording text to be subjected to voice synthesis of the target user;

a request sending unit, configured to send a personalized speech synthesis request for a target user to a server, where the synthesis request includes second recording text information, so that the server performs the following steps: generating personalized voice data of the target user corresponding to the second recording text according to the personalized voice synthesis model of the target user; the personalized speech synthesis model is constructed in the following way: receiving a personalized speech synthesis model construction request aiming at a target user and sent by a client; the model construction request comprises user recording data corresponding to the first recording text; the user recording data is acquired in the following mode: dividing the first recorded text into a plurality of sentence texts; when recording data is collected, displaying a text of a current reading sentence in a first display mode, and displaying text information after the text of the current reading sentence in a second display mode; and constructing the personalized voice synthesis model of the target user according to the recording data.

The present application further provides an electronic device, comprising:

a processor;

a memory;

the memory is used for storing a program for realizing the personalized speech synthesis method, and after the equipment is powered on and runs the program of the method through the processor, the following steps are executed: determining a second recording text to be subjected to voice synthesis of the target user; sending a personalized voice synthesis request aiming at a target user to a server, wherein the synthesis request comprises second recording text information, so that the server executes the following steps: generating personalized voice data of the target user corresponding to the second recording text according to the personalized voice synthesis model of the target user; the personalized speech synthesis model is constructed in the following way: receiving a personalized speech synthesis model construction request aiming at a target user and sent by the electronic equipment; the model construction request comprises user recording data corresponding to the first recording text; the user recording data is acquired in the following mode: dividing the first recorded text into a plurality of sentence texts; when recording data is collected, displaying a text of a current reading sentence in a first display mode, and displaying text information after the text of the current reading sentence in a second display mode; and constructing the personalized voice synthesis model of the target user according to the recording data.

Optionally, the device includes a smart speaker;

the intelligent sound box comprises: the device comprises a sound acquisition device, a sound playing device and a display device;

the intelligent sound box is specifically used for collecting the user recording data through the sound collection device, displaying the first recording text through the display device, and playing the personalized voice data through the sound playing device.

The present application further provides a personalized speech synthesis system, comprising:

the personalized speech synthesis apparatus described above located at the client side; and the personalized speech synthesis apparatus described above located at the server side.

The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.

The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.

Compared with the prior art, the method has the following advantages:

according to the method for constructing the personalized speech synthesis model, the recording text is divided into a plurality of sentence texts; when user recording data is collected, displaying a text of a current reading sentence in a first display mode, and displaying text information after the text of the current reading sentence in a second display mode; sending the collected user recording data to a server, so that the server constructs an individualized voice synthesis model of the user according to the user recording data; the processing mode can control the pause between sentences in the user recording, avoid abnormal pause in the middle of the sentences, thereby ensuring the user recording quality and being convenient for obtaining better recording sentence division results from the whole recording; therefore, the accuracy of the personalized speech synthesis model can be effectively improved, and the speech naturalness and the tone of the personalized speech synthesis are further improved.

Drawings

FIG. 1 is a flow chart of an embodiment of a method for constructing a personalized speech synthesis model provided by the present application;

FIG. 2 is a schematic diagram of a recorded text display of an embodiment of a method for constructing a personalized speech synthesis model provided in the present application;

FIG. 3 is a schematic diagram of an embodiment of an apparatus for constructing a personalized speech synthesis model provided in the present application;

FIG. 4 is a schematic diagram of an embodiment of an electronic device provided herein;

FIG. 5 is a flow chart of an embodiment of a method for constructing a personalized speech synthesis model provided herein;

FIG. 6 is a schematic diagram of an embodiment of an apparatus for constructing a personalized speech synthesis model provided by the present application;

FIG. 7 is a schematic diagram of an embodiment of an electronic device provided herein;

FIG. 8 is a schematic diagram of an embodiment of a personalized speech synthesis model construction system provided herein;

FIG. 9 is a flow diagram of an embodiment of a personalized speech synthesis method provided herein;

FIG. 10 is a schematic diagram of an embodiment of a personalized speech synthesis apparatus provided herein;

FIG. 11 is a schematic diagram of an embodiment of an electronic device provided herein;

FIG. 12 is a flow diagram of an embodiment of a personalized speech synthesis method provided herein;

FIG. 13 is a schematic diagram of an embodiment of a personalized speech synthesis apparatus provided herein;

FIG. 14 is a schematic diagram of an embodiment of an electronic device provided herein;

FIG. 15 is a schematic diagram of an embodiment of a personalized speech synthesis system provided herein.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.

The application provides a personalized speech synthesis model construction method, a personalized speech synthesis model construction device and a personalized speech synthesis model construction system, and an electronic device. Each of the schemes is described in detail in the following examples.

First embodiment

Please refer to fig. 1, which is a flowchart illustrating an embodiment of the method for constructing a personalized speech synthesis model according to the present application; the execution subject of the method includes, but is not limited to, a terminal device. The terminal devices described in this application include, but are not limited to, mobile communication devices such as mobile phones and smartphones, as well as terminal devices such as personal computers, PADs, and iPads.

The construction method of the personalized speech synthesis model comprises the following steps:

step S101: the recorded text is divided into a plurality of sentence texts.

The recording text, also called a recording script, is the text provided for the user to read aloud while recording in a personalized TTS product.
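Step S101 can be sketched as a simple punctuation-based splitter. This is an illustrative assumption: the source does not prescribe a specific splitting rule, and a production splitter would also need to handle quotations, ellipses, and abbreviations.

```python
import re

def split_into_sentences(recording_text: str) -> list[str]:
    """Split a recording text into sentence texts on terminal punctuation.

    Handles both Chinese full-width marks and Western ones; the
    punctuation stays attached to its sentence.
    """
    parts = re.split(r"(?<=[。！？.!?])\s*", recording_text)
    return [s for s in parts if s.strip()]
```

For example, `split_into_sentences("今天天气很好。我们去公园。")` yields the two sentence texts `["今天天气很好。", "我们去公园。"]`, each of which becomes one display unit in step S103.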

In one example, the method may further comprise the following step: generating a recording text whose length is smaller than a length threshold, at least according to characters that users in different regions pronounce differently. The length threshold can be determined according to the number of such region-sensitive characters included in the recording text; generally, the greater the number of such characters, the lower the length threshold.

Taking a Chinese recording text as an example, southern speakers (e.g., from Zhejiang or Guangdong) often speak Mandarin with an accent: a character such as 'hot' (热), which northern speakers pronounce easily, may be pronounced by Zhejiang speakers with a sound closer to 'hungry' (饿). Characters of this kind are pronounced differently by users from different regions. In this embodiment, including these specific characters in the recording text allows the recording data to contain more data embodying the user's voiceprint characteristics even when the recording text is short, so that rich voiceprint feature data can be obtained more easily. The length of the recording text can therefore be effectively reduced, which is especially valuable in noisy recording environments: a shorter text shortens the recording time, reduces recording interference, and improves recording quality, thereby improving the accuracy of the personalized speech synthesis model. In addition, a shorter recording text reduces the volume of the recording data, which relieves the processing pressure on the server; computing resources on the server side and network resources can thus be saved.
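A minimal sketch of this length-constrained text generation follows. The dictionary `REGIONAL_CHARS` and the greedy selection rule are both illustrative assumptions; the source only requires that the generated text stay under a length threshold while covering region-sensitive characters.

```python
# Hypothetical dictionary of characters pronounced differently across regions.
REGIONAL_CHARS = {"热", "肉", "日", "然"}

def build_recording_text(candidates: list[str], max_length: int) -> str:
    """Greedily pick candidate sentences rich in region-sensitive
    characters while keeping the total text length under max_length."""
    # Rank candidates by how many region-sensitive characters they contain.
    ranked = sorted(
        candidates,
        key=lambda s: sum(c in REGIONAL_CHARS for c in s),
        reverse=True,
    )
    chosen, total = [], 0
    for sentence in ranked:
        if total + len(sentence) <= max_length:
            chosen.append(sentence)
            total += len(sentence)
    return "".join(chosen)
```

Ranking before selection means that when the budget is tight, sentences without region-sensitive characters are the first to be dropped.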

Step S103: when the user recording data is collected, displaying the text of the current reading sentence in a first display mode, and displaying the text information after the text of the current reading sentence in a second display mode.

The quality of the user's recording is crucial to the final effect of personalized TTS, so it is important to help the user complete a recording that meets the requirements. To reduce the difficulty of user interaction, the recording script is generally segmented, with each segment containing several sentences. The personalized speech synthesis model construction method provided by this embodiment of the application divides the recording data by sentence, so each recorded paragraph needs to be split into sentences. In this embodiment, subtitle-style refreshing guides the user to pause reasonably between sentences, so that the back end can segment the paragraph simply and meet the requirements of model production.

In this embodiment, different display modes are adopted for the sentence text which is currently being read by the user and the sentence text information thereafter. And displaying the text of the current reading sentence in a first display mode, and displaying the text information after the text of the current reading sentence in a second display mode.

As shown in fig. 2, in one example, the text information after the current sentence text is the sentence texts still to be read; the first display mode is a highlighted mode, such as yellow highlighting; the second display mode is a non-highlighted mode, such as display without yellow highlighting. In this way, the user is guided to keep a reasonable pause between sentences, and can also anticipate the amount of text still to be read.

In another example, the text information after the current sentence text includes the index of the sentence the user is recording and/or the number of unread sentences; the first display mode and the second display mode differ in color, font, and/or font size. For example, the first display mode may show the current sentence text in red Song typeface at size three, while the second display mode shows the number of unread sentences in black regular-script typeface at size five. This still guides the user to keep a reasonable pause between sentences, avoids abnormal pauses within a sentence, and lets the user anticipate the amount of text still to be read.

In yet another example, the second display mode includes a recording progress bar. The progress bar advances according to data related to the recording progress, such as the number of sentences the user has recorded and the number of unread sentences. This reminds the user of the current recording progress so that the recording speed can be adjusted. It also helps shorten the effective recording time, which is especially valuable in noisy recording environments: a shorter recording time reduces recording interference and improves recording quality, thereby improving the accuracy of the personalized speech synthesis model. In addition, a faster recording reduces the volume of the recording data, which relieves the processing pressure on the server; computing resources on the server side and network resources can thus be saved.
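The recording progress bar can be driven directly from the recorded/unread sentence counts. A minimal text-mode sketch follows; the bar width and output format are illustrative assumptions.

```python
def recording_progress(recorded: int, total: int, width: int = 20) -> str:
    """Render a recording progress bar from sentence counts."""
    done = int(width * recorded / total)
    return "[" + "#" * done + "-" * (width - done) + f"] {recorded}/{total} sentences"
```

Calling `recording_progress(5, 10)` renders a half-filled bar, which the client refreshes each time a sentence recording completes.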

It should be noted that, when the currently read sentence text and the sentence texts to be read are displayed in different display modes, the currently read sentence text may be displayed for a fixed duration to ensure that the user keeps a reasonable pause between sentences. Alternatively, the display duration may be controlled according to the actual length of the current sentence text, which both guarantees a reasonable pause between sentences and avoids displaying a sentence for too long, which would slow down reading and harm the user's recording experience.

In this embodiment, the displaying the text of the current reading sentence in the first display mode may include the following sub-steps: 1) determining the display duration of the text of the current reading sentence according to the text length of the text of the current reading sentence; 2) and displaying the text of the current reading sentence in a first display mode for the display duration. For example, a sentence including 10 words is displayed for a shorter time than a sentence including 15 words.

In a specific implementation, determining the display duration of the current sentence text according to its text length may include the following sub-steps: 1.1) determining a first display duration of the current sentence text according to its text length and the reading duration of a single word; 1.2) taking a duration longer than the first display duration as the (second) display duration of the current sentence text. For example, if each word takes 1 second to read, a sentence containing 10 words is displayed for longer than 10 seconds. The per-word reading duration may be a preset value, or a value determined from the user's current reading speed: when the user speeds up, the per-word duration can be reduced accordingly; when the user tires and slows down, it can be increased.
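The two sub-steps above can be sketched as follows. The 1-second per-word duration and the fixed pause margin are illustrative assumptions; the source only requires the final duration to exceed the first estimate.

```python
def display_duration(sentence: str, per_word_seconds: float = 1.0,
                     pause_margin_seconds: float = 2.0) -> float:
    """Compute how long to display the current sentence text.

    Sub-step 1.1: first duration = text length x per-word reading time.
    Sub-step 1.2: the actual (second) duration must be longer than the
    first, so a pause margin is added, giving the user time to finish
    the sentence and pause naturally before the next one appears.
    """
    first_duration = len(sentence) * per_word_seconds
    return first_duration + pause_margin_seconds
```

With the defaults, a four-character sentence is shown for 6 seconds: 4 seconds of estimated reading plus a 2-second pause margin.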

Two specific embodiments are given below.

In the first mode, the user is guided to read each sentence aloud, and to pause for a sufficiently long time, by highlighting the text sentence by sentence. The display within a paragraph works as follows:

1) If there are N sentences in the current reading paragraph, the display duration of the ith sentence (i = 1..N) on the screen is Ti (longer than the time a typical user needs to read the sentence aloud).

2) Starting from the moment the user begins recording, the text is highlighted sentence by sentence according to each sentence's display duration Ti, guiding the user to read it aloud.

3) The non-highlighted text (the sentences still to be read) is blurred, but the user can still perceive the text before and after the current sentence, maintaining the psychological expectation of reading the subsequent text.

In the second mode, the user is guided to read each sentence aloud, and to pause for a sufficiently long time, by displaying the text one sentence at a time. The display within a paragraph works as follows:

1) If there are N sentences in a paragraph, the display duration of the ith sentence (i = 1..N) is Ti (longer than the time a typical user needs to finish reading the sentence aloud).

2) Starting from the moment the user begins recording, only the ith sentence is displayed at a time, according to each sentence's display duration Ti, guiding the user to read it aloud.

3) A counter display indicates which of the N sentences the user is currently recording, prompting the user with both the index of the current sentence and how many sentences remain to be recorded.

After user recording data used for generating the personalized speech synthesis model is collected, the next step can be carried out, and the collected user recording data is sent to the server side.

Step S105: and sending the collected user recording data to a server so that the server constructs the personalized speech synthesis model of the user according to the user recording data.

The server can divide the user recording data uploaded by the terminal device into multiple sentence recordings through a voice activity detection (VAD) algorithm, use these as training data, and then learn the user's personalized speech synthesis model from the training data.
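The server-side splitting can be illustrated with a simplified energy-threshold VAD. A real system would use a proper VAD (e.g., WebRTC's); the frame size, energy threshold, and silence-run length here are illustrative assumptions.

```python
def split_on_silence(samples: list[float], frame_size: int = 160,
                     energy_threshold: float = 0.01,
                     min_silence_frames: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) sample ranges of speech separated by silence.

    A frame counts as speech when its mean energy exceeds the threshold;
    a run of min_silence_frames silent frames closes the current segment.
    """
    segments, start, silent_run = [], None, 0
    n_frames = len(samples) // frame_size
    for i in range(n_frames):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        energy = sum(x * x for x in frame) / frame_size
        if energy >= energy_threshold:
            if start is None:
                start = i * frame_size  # segment opens on first speech frame
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Close the segment at the last speech frame's boundary.
                segments.append((start, (i - silent_run + 1) * frame_size))
                start, silent_run = None, 0
    if start is not None:
        segments.append((start, n_frames * frame_size))
    return segments
```

Because the display scheme of the first embodiment enforces clear pauses between sentences, even this crude detector recovers one segment per sentence; abnormal mid-sentence pauses would instead split a sentence into two segments.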

In one example, the method may further comprise the steps of: and filtering out voice data which is irrelevant to the recorded text from the user recorded data. By adopting the processing mode, the processing mode of the server can be effectively simplified; therefore, the synchronous processing pressure of the server can be effectively reduced.

In a specific implementation, the step of filtering out the voice data irrelevant to the recording text from the user recording data may be implemented in the following manner: and determining the position of the user, identifying the voice data of each sound source from the user recording data, and extracting the voice data of the real recording user according to the position of each sound source and the position of the user. Since the identification of the recorded data of different sound sources belongs to the mature prior art, it is not described here again.

As can be seen from the foregoing embodiments, in the personalized speech synthesis model construction method provided in the embodiments of the present application, a recording text is divided into a plurality of sentence texts; when user recording data is collected, displaying a text of a current reading sentence in a first display mode, and displaying text information after the text of the current reading sentence in a second display mode; sending the collected user recording data to a server, so that the server constructs an individualized voice synthesis model of the user according to the user recording data; the processing mode can control the pause between sentences in the user recording, avoid abnormal pause in the middle of the sentences, thereby ensuring the user recording quality and being convenient for obtaining better recording sentence division results from the whole recording; therefore, the accuracy of the personalized speech synthesis model can be effectively improved, and the speech naturalness and the tone of the personalized speech synthesis are further improved.

In the foregoing embodiment, a personalized speech synthesis model construction method is provided, and correspondingly, the present application also provides a personalized speech synthesis model construction device. The apparatus corresponds to an embodiment of the method described above.

Second embodiment

Please refer to fig. 3, which is a schematic diagram of an embodiment of a personalized speech synthesis model construction apparatus according to the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

The present application further provides a personalized speech synthesis model building apparatus, including:

a text division unit 301 for dividing the sound recording text into a plurality of sentence texts;

the text display unit 303 is configured to display a text of a current reading sentence in a first display mode and display text information after the text of the current reading sentence in a second display mode when user recording data is collected;

and a recording data sending unit 305, configured to send the collected user recording data to the server, so that the server constructs a personalized speech synthesis model of the user according to the user recording data.

Optionally, the first display mode includes: a highlight display mode; the second display mode includes: non-highlighting display mode.

Optionally, the first display mode and the second display mode have different colors, fonts and/or font sizes.

Optionally, the second display mode includes: and the recording progress bar mode is used for facilitating the user to adjust the recording speed according to the recording progress bar.

Optionally, the text information after the text of the current reading sentence includes: the number of sentences the user is recording, and/or the number of unread sentences.

Optionally, the text display unit 303 is specifically configured to determine a display duration of the text of the current reading sentence according to the text length of the text of the current reading sentence; and displaying the text of the current reading sentence in a first display mode for the display duration.

Optionally, the text display unit 303 is specifically configured to determine a first display duration of the text of the current reading sentence according to the text length and the word reading duration of the text of the current reading sentence; and taking the time length longer than the first display time length as a second display time length of the text of the current reading sentence.

Optionally, the method further includes:

and the recording text generation unit is used for generating a recording text with the text length smaller than the length threshold value at least according to the characters with different pronunciation modes of the users in different areas.

Optionally, the method further includes:

and the voice data filtering unit is used for filtering the voice data which is irrelevant to the recording text from the user recording data.

Third embodiment

Please refer to fig. 4, which is a schematic diagram of an embodiment of an electronic device according to the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

An electronic device of the present embodiment includes: a processor 401 and a memory 402; the memory is used for storing a program for realizing the personalized speech synthesis model building method, and after the equipment is powered on and runs the program of the method through the processor, the following steps are executed: dividing the recording text into a plurality of sentence texts; when user recording data is collected, displaying a text of a current reading sentence in a first display mode, and displaying text information after the text of the current reading sentence in a second display mode; and sending the collected user recording data to a server so that the server constructs the personalized speech synthesis model of the user according to the user recording data.

The electronic device can be a smart sound box, a smart phone and the like.

In one example, the smart speaker includes: the device comprises a sound acquisition device, a sound playing device and a display device; the intelligent sound box is specifically used for collecting the user recording data through a sound collection device, displaying the first recording text through a display device, and playing personalized voice data of the user, which is synthesized according to the model and aims at the target text, through a sound playing device.

Fourth embodiment

Please refer to fig. 5, which is a flowchart illustrating an embodiment of a method for constructing a personalized speech synthesis model according to the present application, wherein an execution body of the method includes a server. The construction method of the personalized speech synthesis model comprises the following steps:

step S501: and receiving user recording data sent by the client.

The client includes, but is not limited to, mobile communication devices such as mobile phones and smartphones, as well as terminal devices such as personal computers, PADs, and iPads.

The user recording data comprises recording data which is acquired by the first method embodiment and ensures normal pause between sentences.

In specific implementation, the client receives a request for constructing a personalized speech synthesis model for a target recording text. The request may include an identification of the target recording text and user recording data corresponding to the target recording text, and may also include a user identifier.

Step S503: and acquiring a recording text corresponding to the user recording data.

To construct the personalized speech synthesis model of the user, not only the recording data of the user needs to be acquired, but also the recording text corresponding to the recording data of the user needs to be acquired.

In specific implementation, the recording text can be obtained by inquiring from the recording text library according to the identifier of the target recording text carried by the request.

Step S505: and constructing an individualized voice synthesis model of the user according to the user recording data and the recording text.

After the user recording data and the corresponding recording text are obtained, the personalized speech synthesis model of the user can be constructed according to the data, and the model can be stored by the user identification.

In this embodiment, step S505 may include the following sub-steps: 1) dividing the user recording data into a plurality of sentence recording data; 2) determining sentence text corresponding to the sentence recording data; 3) constructing a network structure of the personalized speech synthesis model; 4) and learning to obtain the personalized speech synthesis model from the corresponding relation set between the sentence recording data and the sentence text.

1) And dividing the user recording data into a plurality of sentence recording data.

Since the user recording data consists of recordings, collected as in the first method embodiment, with normal pauses between sentences, this embodiment divides the user recording data into multiple sentence recordings through a voice activity detection (VAD) algorithm. This approach is simple to implement, effectively reduces the processing pressure on the server, and helps ensure that high-quality sentence segments are separated out.

2) Sentence text corresponding to the sentence recording data is determined.

After every sentence recording of every segment in the user recording data has been separated out, a set of correspondences between sentence recordings and sentence texts can be generated in combination with a text sentence-splitting technique.
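Building the correspondence set then reduces to pairing the VAD segments with the split sentence texts in order. The mismatch handling below is an illustrative assumption: a count mismatch suggests an abnormal mid-sentence pause, i.e., a recording-quality problem.

```python
def build_correspondence(sentence_recordings: list,
                         sentence_texts: list[str]) -> list[tuple]:
    """Pair each sentence recording with its sentence text by position."""
    if len(sentence_recordings) != len(sentence_texts):
        raise ValueError(
            "segment/sentence count mismatch: the paragraph likely "
            "contains abnormal pauses and should be re-recorded"
        )
    return list(zip(sentence_recordings, sentence_texts))
```

The resulting (recording, text) pairs form the training data from which the personalized speech synthesis model is learned.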

3) And constructing a network structure of the personalized speech synthesis model.

4) And learning to obtain the personalized speech synthesis model from the corresponding relation set between the sentence recording data and the sentence text.

According to the method provided by the embodiment of the application, the personalized speech synthesis model is obtained by learning the corresponding relation set through a machine learning algorithm. The network structure of the personalized speech synthesis model comprises a neural network structure, such as a convolutional neural network and the like. Since the models and the training method thereof belong to the mature prior art, the details are not repeated here.

In this embodiment, the method may further include the steps of: 1) acquiring characters with different pronunciation modes of users in different areas in the recording text; 2) acquiring recording segment data corresponding to the characters in the user recording data; 3) and constructing an individualized voice synthesis model of the user according to the corresponding relation among the user recording data, the recording text, the words and the recording fragment data.

In a specific implementation, characters whose pronunciation differs across regions can be marked in a dictionary; the characters in the recording text are matched against the dictionary to determine which characters in the recording text are region-sensitive. Then, a speech processing algorithm determines which parts of the recording data correspond to which character, yielding the recording fragment data for those characters in the user recording data. Finally, the user's personalized speech synthesis model is constructed according to the correspondences among the user recording data, the recording text, the characters, and the recording fragment data. With this approach, the constructed model contains not only the user's acoustic feature data, such as features related to pitch, intensity, duration, and timbre, but also the user's pronunciation of specific characters.
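The dictionary-matching sub-step can be sketched as follows. The dictionary entries are hypothetical examples, and locating the corresponding audio fragments is left to the speech-processing (alignment) stage described above.

```python
# Hypothetical dictionary marking characters with region-dependent pronunciation.
REGIONAL_PRONUNCIATION = {
    "热": {"standard": "re4", "zhejiang": "closer to e4"},
    "肉": {"standard": "rou4", "zhejiang": "closer to ou4"},
}

def find_regional_chars(recording_text: str) -> list[tuple[int, str]]:
    """Return (position, character) pairs for region-sensitive characters,
    so the matching recording fragments can be aligned to them later."""
    return [(i, c) for i, c in enumerate(recording_text)
            if c in REGIONAL_PRONUNCIATION]
```

Each returned position anchors the subsequent audio alignment, so that the recording fragment for, e.g., '热' can be extracted and associated with the user's regional pronunciation in the model.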

In one example, the method may further comprise the steps of: and filtering out voice data which is irrelevant to the recorded text from the user recorded data. By adopting the processing mode, the processing mode of the client can be effectively simplified; therefore, the computing resources of the client can be effectively saved.

In a specific implementation, the step of filtering out voice data irrelevant to the recording text from the user recording data may be implemented as follows: determine the position of the user, identify the voice data of each sound source in the user recording data, and extract the voice data of the real recording user according to the position of each sound source and the position of the user. Since identifying the recording data of different sound sources belongs to the mature prior art, it is not described here again.
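A toy version of the position-based selection above: given per-source voice data and an estimated position for each sound source, keep only the source closest to the known user position and drop the rest as irrelevant speech. Source separation and localization themselves are assumed to be handled by existing speech-processing techniques, so they are represented here only by pre-computed inputs; all field names are illustrative.

```python
import math

def pick_user_source(sources, user_pos):
    """sources: list of {'pos': (x, y), 'audio': ...}; return the source
    whose estimated position is closest to the user's position."""
    return min(sources, key=lambda src: math.dist(src["pos"], user_pos))

sources = [
    {"pos": (0.2, 0.1), "audio": "user speech"},
    {"pos": (3.0, 4.0), "audio": "television in background"},
]
print(pick_user_source(sources, user_pos=(0.0, 0.0))["audio"])  # user speech
```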

As can be seen from the foregoing embodiments, the method for constructing a personalized speech synthesis model provided in the embodiments of the present application receives user recording data sent by a client, acquires the recording text corresponding to the user recording data, and constructs the personalized speech synthesis model of the user according to the user recording data and the recording text. This processing mode controls the pauses between sentences in the user recording and avoids abnormal pauses in the middle of sentences, thereby ensuring the quality of the user recording and making it possible to obtain better sentence-division results from the whole recording. Therefore, the accuracy of the personalized speech synthesis model can be effectively improved, which in turn improves the naturalness and timbre of the personalized synthesized speech.

In the foregoing embodiment, a personalized speech synthesis model construction method is provided, and correspondingly, the present application also provides a personalized speech synthesis model construction device. The apparatus corresponds to an embodiment of the method described above.

Fifth embodiment

Please refer to fig. 6, which is a schematic diagram of an embodiment of a personalized speech synthesis model construction apparatus according to the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

The present application further provides a personalized speech synthesis model building apparatus, including:

a recording data receiving unit 601, configured to receive user recording data sent by a client;

a recording text obtaining unit 603 configured to obtain a recording text corresponding to the user recording data;

a model building unit 605, configured to build a personalized speech synthesis model of the user according to the user recording data and the recording text.

Optionally, the model building unit 605 is specifically configured to: divide the user recording data into a plurality of sentence recording data; determine the sentence text corresponding to each sentence recording data; construct the network structure of the personalized speech synthesis model; and learn the personalized speech synthesis model from the set of correspondences between the sentence recording data and the sentence texts.

Optionally, the network structure comprises a neural network structure.

Optionally, the model building unit 605 is specifically configured to divide the user recording data into a plurality of sentence recording data through a voice activity detection algorithm.
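The voice activity detection step above can be sketched with a minimal energy-threshold detector. Real VAD algorithms operate on frame energies (or spectral features) computed from PCM audio; here each number stands for one frame's energy, a run of low-energy frames marks an inter-sentence pause, and all parameter names and thresholds are illustrative assumptions.

```python
# Minimal energy-threshold VAD: cut a whole recording into sentence segments
# wherever a sufficiently long run of low-energy (silent) frames occurs.

def split_on_silence(frame_energies, threshold=0.1, min_pause_frames=2):
    segments, current, silence_run = [], [], 0
    for i, e in enumerate(frame_energies):
        if e >= threshold:
            current.append(i)   # voiced frame: extend the current sentence
            silence_run = 0
        else:
            silence_run += 1
            if silence_run >= min_pause_frames and current:
                segments.append((current[0], current[-1]))  # close sentence
                current = []
    if current:
        segments.append((current[0], current[-1]))
    return segments  # (start_frame, end_frame) per sentence recording

energies = [0.5, 0.6, 0.4, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0, 0.0]
print(split_on_silence(energies))  # [(0, 2), (6, 7)]
```

Because the recording procedure enforces pauses only between sentences, even a simple detector like this yields clean sentence boundaries, which is exactly the property the embodiments rely on.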

Optionally, the apparatus further includes:

a specific word acquisition unit, configured to acquire, from the recording text, words that users in different regions pronounce differently;

a recording segment acquisition unit, configured to acquire, from the user recording data, the recording segment data corresponding to the words;

the model building unit 605 being specifically configured to build the personalized speech synthesis model of the user according to the correspondence among the user recording data, the recording text, the words, and the recording segment data.

Optionally, the apparatus further includes:

a voice data filtering unit, configured to filter out, from the user recording data, voice data that is irrelevant to the recording text.

Sixth embodiment

Please refer to fig. 7, which is a schematic diagram of an embodiment of an electronic device according to the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

An electronic device of the present embodiment includes: a processor 701 and a memory 702. The memory is used for storing a program implementing the personalized speech synthesis model construction method; after the device is powered on and runs the program through the processor, the following steps are executed: receiving user recording data sent by a client; acquiring the recording text corresponding to the user recording data; and constructing a personalized speech synthesis model of the user according to the user recording data and the recording text.

Seventh embodiment

Please refer to fig. 8, which is a schematic diagram of an embodiment of a personalized speech synthesis model construction system of the present application. Since the system embodiments are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to the corresponding descriptions of the method embodiments for relevant points. The system embodiments described below are merely illustrative.

The present application further provides a personalized speech synthesis model building system, including:

a client 801, on which the personalized speech synthesis model construction apparatus described in the second embodiment is deployed, the apparatus being configured to: divide a recording text into a plurality of sentence texts; while collecting the user recording data, display the text of the currently read sentence in a first display mode and display the text information following the currently read sentence in a second display mode; and send the collected user recording data to a server, so that the server constructs the personalized speech synthesis model of the user according to the user recording data;

a server 802, on which the personalized speech synthesis model construction apparatus described in the fifth embodiment is deployed, the apparatus being configured to: receive the user recording data sent by the client; and construct the personalized speech synthesis model of the user according to the user recording data.

As can be seen from the foregoing embodiments, in the personalized speech synthesis model construction system provided in the embodiments of the present application, the client divides a recording text into a plurality of sentence texts; while collecting the user recording data, it displays the text of the currently read sentence to the user in a first display mode and displays the text information following the currently read sentence in a second display mode; and it sends the collected user recording data to the server. The server receives the user recording data sent by the client and constructs the personalized speech synthesis model of the user according to the user recording data. This processing mode controls the pauses between sentences in the user recording and avoids abnormal pauses in the middle of sentences, which ensures the quality of the user recording, so that the server can obtain better sentence-division results from the whole recording and thus build a higher-quality personalized speech synthesis model. Therefore, the accuracy of the personalized speech synthesis model can be effectively improved, which in turn improves the naturalness and timbre of the personalized synthesized speech.

Eighth embodiment

Please refer to fig. 9, which is a flowchart of an embodiment of a personalized speech synthesis method according to the present application; the execution body of this method is the server. The personalized speech synthesis method provided by the present application comprises the following steps:

step S901: receiving a personalized speech synthesis model construction request for a target user, sent by a client.

The model construction request comprises user recording data corresponding to a first recording text. The user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts; while the user recording data is collected, the text of the currently read sentence is displayed in a first display mode, and the text information following the currently read sentence is displayed in a second display mode. The model construction request further comprises an identifier of the target user and an identifier of the first recording text.
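The collection procedure above can be sketched in two parts: a punctuation-based splitter that divides the recording text into sentence texts, and the two display modes modeled as plain strings (the current sentence rendered one way, the remaining text another). The `>> ... <<` markers stand in for whatever visual styling a real client would use; the splitting regex handles both Western and CJK sentence-ending punctuation and is an illustrative assumption, not the patent's algorithm.

```python
import re

def split_sentences(text):
    """Divide a recording text into sentence texts at sentence-ending
    punctuation (Western and CJK), dropping empty pieces."""
    return [s.strip() for s in re.split(r"(?<=[.!?。！？])\s*", text) if s.strip()]

def render_prompt(sentences, current_idx):
    """Model the two display modes as strings: the currently read sentence
    in a 'first display mode', the upcoming text in a 'second display mode'."""
    current = f">> {sentences[current_idx]} <<"        # first display mode
    upcoming = " ".join(sentences[current_idx + 1:])   # second display mode
    return current, upcoming

sents = split_sentences("Read this aloud. Pause here! Then continue?")
print(sents)
print(render_prompt(sents, 0))
```

Showing only one sentence prominently at a time is what discourages mid-sentence pauses while allowing natural pauses between sentences.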

Step S903: and constructing the personalized voice synthesis model of the target user according to the user recording data.

Step S905: and receiving a personalized voice synthesis request aiming at a target user and sent by a client.

The synthesis request may comprise an identifier of a second recording text that is stored in the server in advance, or it may comprise the content of the second recording text, which may be text entered by the user.
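The two request variants above (text by identifier versus text by content) can be sketched as a hypothetical JSON payload. The field names and the choice of JSON as a wire format are illustrative assumptions; the patent does not specify a request format.

```python
import json

def build_synthesis_request(target_user_id, second_text=None, text_id=None):
    """Build a synthesis request carrying either the id of a server-stored
    second recording text or its inline content, but not both."""
    if (second_text is None) == (text_id is None):
        raise ValueError("provide exactly one of second_text or text_id")
    payload = {"user_id": target_user_id}
    if text_id is not None:
        payload["recording_text_id"] = text_id   # text stored server-side
    else:
        payload["recording_text"] = second_text  # text entered by the user
    return json.dumps(payload, sort_keys=True)

print(build_synthesis_request("user-42", second_text="Tell me a story."))
```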

Step S907: and generating personalized voice data of the target user corresponding to the second recording text according to the personalized voice synthesis model of the target user.

After the personalized speech synthesis model of the target user is built, the model can be applied to generate, for a second recording text specified by the user, the corresponding personalized speech data. For example, if the second recording text specified by the user is a story, story audio data can be synthesized using the voice characteristics of the user contained in the model, such as the user's timbre, speaking style, and speaking emotion.

As can be seen from the foregoing embodiments, the personalized speech synthesis method provided in the embodiments of the present application receives a personalized speech synthesis model construction request for a target user sent by a client, where the model construction request comprises user recording data corresponding to a first recording text; the user recording data is collected by dividing the first recording text into a plurality of sentence texts and, while the user recording data is collected, displaying the text of the currently read sentence in a first display mode and displaying the text information following the currently read sentence in a second display mode. The method then constructs the personalized speech synthesis model of the target user according to the user recording data; receives a personalized speech synthesis request for the target user sent by the client, where the synthesis request comprises second recording text information; and generates the personalized speech data of the target user corresponding to the second recording text according to the personalized speech synthesis model of the target user. This processing mode controls the pauses between sentences in the user recording and avoids abnormal pauses in the middle of sentences, which ensures the quality of the user recording and makes it possible to obtain better sentence-division results from the whole recording; a high-quality personalized speech synthesis model is built on those sentence-division results, and speech data with the user's voice characteristics is synthesized using the model. Therefore, the naturalness and timbre of the personalized synthesized speech can be effectively improved.

In the foregoing embodiment, a personalized speech synthesis method is provided, and correspondingly, the present application further provides a personalized speech synthesis apparatus. The apparatus corresponds to an embodiment of the method described above.

Ninth embodiment

Please refer to fig. 10, which is a schematic diagram of an embodiment of a personalized speech synthesis apparatus of the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

The present application further provides a personalized speech synthesis apparatus, comprising:

a first request receiving unit 1001, configured to receive a personalized speech synthesis model construction request for a target user, the request being sent by a client, wherein the model construction request comprises user recording data corresponding to a first recording text, and the user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts; while the user recording data is collected, the text of the currently read sentence is displayed in a first display mode, and the text information following the currently read sentence is displayed in a second display mode;

a model building unit 1003, configured to build a personalized speech synthesis model of the target user according to the user recording data;

a second request receiving unit 1005, configured to receive a personalized speech synthesis request for a target user sent by a client; the synthesis request comprises second recording text information;

a speech synthesis unit 1007, configured to generate personalized speech data of the target user corresponding to the second recording text according to the personalized speech synthesis model of the target user.

Tenth embodiment

Please refer to fig. 11, which is a diagram illustrating an embodiment of an electronic device according to the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

An electronic device of the present embodiment includes: a processor 1101 and a memory 1102. The memory is used for storing a program implementing the personalized speech synthesis method; after the device is powered on and runs the program through the processor, the following steps are executed: receiving a personalized speech synthesis model construction request for a target user sent by a client, where the model construction request comprises user recording data corresponding to a first recording text, and the user recording data is collected as follows: the first recording text is divided into a plurality of sentence texts, and while the user recording data is collected, the text of the currently read sentence is displayed in a first display mode and the text information following the currently read sentence is displayed in a second display mode; constructing a personalized speech synthesis model of the target user according to the user recording data; receiving a personalized speech synthesis request for the target user sent by the client, where the synthesis request comprises second recording text information; and generating the personalized speech data of the target user corresponding to the second recording text according to the personalized speech synthesis model of the target user.

Eleventh embodiment

Please refer to fig. 12, which is a flowchart of an embodiment of a personalized speech synthesis method according to the present application; the execution body of this method is a terminal device. The personalized speech synthesis method provided by the present application comprises the following steps:

step S1201: determining the second recording text of the target user to be speech-synthesized;

step S1203: sending a personalized voice synthesis request aiming at a target user to a server, wherein the synthesis request comprises second recording text information, so that the server executes the following steps: generating personalized voice data of the target user corresponding to the second recording text according to the personalized voice synthesis model of the target user; the personalized speech synthesis model is constructed in the following way: receiving a personalized speech synthesis model construction request aiming at a target user and sent by a client; the model construction request comprises user recording data corresponding to the first recording text; the user recording data is acquired in the following mode: dividing the first recorded text into a plurality of sentence texts; when recording data is collected, displaying a text of a current reading sentence in a first display mode, and displaying text information after the text of the current reading sentence in a second display mode; and constructing the personalized voice synthesis model of the target user according to the recording data.

As can be seen from the foregoing embodiments, the personalized speech synthesis method provided in the embodiments of the present application determines the second recording text of the target user to be speech-synthesized, and sends a personalized speech synthesis request for the target user to the server, where the synthesis request comprises the second recording text information, so that the server generates the personalized speech data of the target user corresponding to the second recording text according to the personalized speech synthesis model of the target user. The personalized speech synthesis model is constructed as follows: a personalized speech synthesis model construction request for the target user, sent by a client, is received, where the model construction request comprises user recording data corresponding to a first recording text; the user recording data is collected by dividing the first recording text into a plurality of sentence texts and, while the recording data is collected, displaying the text of the currently read sentence in a first display mode and displaying the text information following it in a second display mode; and the personalized speech synthesis model of the target user is constructed according to the recording data. This processing mode controls the pauses between sentences in the user recording and avoids abnormal pauses in the middle of sentences, which ensures the quality of the user recording and makes it possible to obtain better sentence-division results from the whole recording; a high-quality personalized speech synthesis model is built on those sentence-division results, and speech data with the user's voice characteristics is synthesized using the model. Therefore, the naturalness and timbre of the personalized synthesized speech can be effectively improved.

In the foregoing embodiment, a personalized speech synthesis method is provided, and correspondingly, the present application further provides a personalized speech synthesis apparatus. The apparatus corresponds to an embodiment of the method described above.

Twelfth embodiment

Please refer to fig. 13, which is a schematic diagram of an embodiment of a personalized speech synthesis apparatus of the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

The present application further provides a personalized speech synthesis apparatus, comprising:

a recording text determination unit 1301, configured to determine the second recording text of the target user to be speech-synthesized;

a request sending unit 1303, configured to send a personalized speech synthesis request for a target user to a server, where the synthesis request includes second recorded text information, so that the server performs the following steps: generating personalized voice data of the target user corresponding to the second recording text according to the personalized voice synthesis model of the target user; the personalized speech synthesis model is constructed in the following way: receiving a personalized speech synthesis model construction request aiming at a target user and sent by a client; the model construction request comprises user recording data corresponding to the first recording text; the user recording data is acquired in the following mode: dividing the first recorded text into a plurality of sentence texts; when recording data is collected, displaying a text of a current reading sentence in a first display mode, and displaying text information after the text of the current reading sentence in a second display mode; and constructing the personalized voice synthesis model of the target user according to the recording data.

Thirteenth embodiment

Please refer to fig. 14, which is a diagram illustrating an embodiment of an electronic device according to the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.

An electronic device of the present embodiment includes: a processor 1401 and a memory 1402. The memory is used for storing a program implementing the personalized speech synthesis method; after the device is powered on and runs the program through the processor, the following steps are executed: determining the second recording text of the target user to be speech-synthesized; and sending a personalized speech synthesis request for the target user to the server, the synthesis request comprising the second recording text information, so that the server generates the personalized speech data of the target user corresponding to the second recording text according to the personalized speech synthesis model of the target user. The personalized speech synthesis model is constructed as follows: a personalized speech synthesis model construction request for the target user, sent by a client, is received, the model construction request comprising user recording data corresponding to a first recording text; the user recording data is collected by dividing the first recording text into a plurality of sentence texts and, while the recording data is collected, displaying the text of the currently read sentence in a first display mode and displaying the text information following the currently read sentence in a second display mode; and the personalized speech synthesis model of the target user is constructed according to the recording data.

The electronic device may be a smart speaker, a smartphone, or the like.

In one example, the smart speaker includes a sound collection device, a sound playing device, and a display device. The smart speaker is specifically configured to collect the user recording data through the sound collection device, display the first recording text through the display device, and play the personalized speech data through the sound playing device.

Fourteenth embodiment

Please refer to fig. 15, which is a schematic diagram of an embodiment of a personalized speech synthesis system of the present application. Since the system embodiments are substantially similar to the method embodiments, they are described relatively simply, and reference may be made to the corresponding descriptions of the method embodiments for relevant points. The system embodiments described below are merely illustrative.

The present application additionally provides a personalized speech synthesis system, comprising: a client 1501 and a server 1502.

The client 1501 is deployed with the personalized speech synthesis apparatus described in the twelfth embodiment, the apparatus being configured to determine the second recording text of the target user to be speech-synthesized and to send a personalized speech synthesis request for the target user to the server, the speech synthesis request comprising the second recording text information. Correspondingly, the server 1502 is deployed with the personalized speech synthesis apparatus described in the ninth embodiment, the apparatus being configured to receive the speech synthesis request sent by the client and to generate the personalized speech data corresponding to the second recording text according to the personalized speech synthesis model of the target user. The personalized speech synthesis model is constructed as follows: a personalized speech synthesis model construction request for the target user, sent by a client, is received, the model construction request comprising user recording data corresponding to a first recording text; the user recording data is collected by dividing the first recording text into a plurality of sentence texts and, while the recording data is collected, displaying the text of the currently read sentence in a first display mode and displaying the text information following it in a second display mode; and the personalized speech synthesis model of the target user is constructed according to the recording data.

As can be seen from the foregoing embodiments, in the personalized speech synthesis system provided in the embodiments of the present application, the client determines the second recording text to be speech-synthesized and sends a personalized speech synthesis request for the second recording text to the server, and the server generates the personalized speech data corresponding to the second recording text according to the personalized speech synthesis model. The personalized speech synthesis model is constructed by receiving a model construction request for a first recording text sent by the client, the request comprising user recording data corresponding to the first recording text; the user recording data is collected by dividing the first recording text into a plurality of sentence texts and, while the user recording data is collected, displaying the text of the currently read sentence in a first display mode and displaying the text information following it in a second display mode; the personalized speech synthesis model of the user is then constructed according to the user recording data. This processing mode controls the pauses between sentences in the user recording and avoids abnormal pauses in the middle of sentences, which ensures the quality of the user recording and makes it possible to obtain better sentence-division results from the whole recording; a high-quality personalized speech synthesis model is built on those sentence-division results, and speech data with the user's voice characteristics is synthesized using the model. Therefore, the naturalness and timbre of the personalized synthesized speech can be effectively improved.

Although the present application has been described with reference to the preferred embodiments, these are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the protection scope of the present application should be determined by the appended claims.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
