Speech synthesis method and apparatus, electronic device, and storage medium

Document No.: 1833158    Publication date: 2021-11-12

Note: This technique (语音合成方法、装置、电子设备及存储介质) was designed and created on 2021-07-21 by 郑颖龙, 周昉昉, 叶杭, 赖蔚蔚, 吴广财, 林嘉鑫, 刘佳木, 陈颖璇, 朱泰鹏 and 黄彬系. Abstract: The application discloses a speech synthesis method, a speech synthesis apparatus, an electronic device, and a storage medium, relating to the technical field of speech processing. The method comprises: during voice broadcasting, when input voice of a user is detected, recognizing voice features of the input voice; determining voice parameters for broadcasting voice according to the voice features, the voice parameters being used for generating, for the text information to be broadcasted, voice corresponding to the voice parameters; based on syntactic analysis of the text information to be broadcasted, adding identification information to the text information to be broadcasted to obtain target text information; and generating target voice for broadcasting based on the voice parameters and the target text information. In this way, the corresponding voice parameters can be determined according to the user's voice features, and personalized target voice for that user can be generated based on those parameters, improving the user's voice interaction experience.

1. A method of speech synthesis, the method comprising:

in the voice broadcasting process, when the input voice of a user is detected, recognizing the voice characteristics of the input voice;

determining voice parameters for broadcasting voice according to the voice characteristics, wherein the voice parameters are used for generating, for the text information to be broadcasted, voice corresponding to the voice parameters;

based on the syntactic analysis of the text information to be broadcasted, adding identification information to the text information to be broadcasted to obtain target text information;

and generating target voice for broadcasting based on the voice parameters and the target text information.

2. The method of claim 1, wherein determining voice parameters for broadcasting voice according to the voice characteristics comprises:

determining user attribute information of the user according to the voice characteristics;

and acquiring voice parameters corresponding to the user attribute information as voice parameters for broadcasting voice.

3. The method according to claim 2, wherein the user attribute information includes a user age, and the obtaining the voice parameter corresponding to the user attribute information includes:

acquiring an age interval in which the user age is located as a target age interval;

and acquiring a voice parameter corresponding to the target age interval as a voice parameter for broadcasting voice.

4. The method of claim 1, wherein determining voice parameters for broadcasting voice according to the voice characteristics comprises:

determining emotion information of the user according to the voice characteristics;

and acquiring a voice parameter corresponding to the emotion information as a voice parameter for broadcasting voice.

5. The method of claim 4, wherein before the adding identification information to the text information to be broadcasted based on the syntactic analysis of the text information to be broadcasted to obtain the target text information, the method further comprises:

and when the emotion information meets the set emotion condition, acquiring first text information as the text information to be broadcasted, wherein the first text information is used for adjusting the emotion of the user.

6. The method of claim 1, wherein the identification information comprises a connective word, and the adding identification information to the text information to be broadcasted based on a syntactic analysis of the text information to be broadcasted to obtain target text information comprises:

identifying clauses in the text information to be broadcasted to obtain a plurality of clauses;

acquiring a target clause existing in the plurality of clauses, wherein the number of words in the target clause is greater than a first threshold value;

dividing the target clause into a plurality of clause components based on the syntactic analysis of the target clause;

and adding the connecting words between the adjacent clause components.

7. The method of claim 6, wherein after said dividing the target clause into a plurality of clause components based on a syntactic analysis of the target clause, the method further comprises:

acquiring a target clause component existing in the plurality of clause components, wherein the number of words in the target clause component is greater than a second threshold value, and the second threshold value is smaller than the first threshold value;

and adding pause identifiers between the target clause component and the adjacent clause component, wherein the pause identifiers are used for generating pause voices with specified duration between the voices corresponding to the target clause component and the voices corresponding to the adjacent clause component when the target voices are generated.

8. The method according to any one of claims 1 to 7, wherein before the generating of the target voice for broadcasting based on the voice parameters and the target text information, the method further comprises:

if the target text information contains a plurality of clauses, adding a specified identifier between every two adjacent clauses in the plurality of clauses, wherein the specified identifier is used for generating a breathing sound between voices corresponding to every two adjacent clauses when the target voice is generated.

9. The method according to any one of claims 1 to 7, wherein before the generating of the target voice for broadcasting based on the voice parameters and the target text information, the method further comprises:

and if the generation of the target voice is not finished within the specified duration, acquiring preset voice as the target voice for broadcasting.

10. A speech synthesis apparatus, characterized in that the apparatus comprises:

the voice analysis module is used for identifying voice characteristics of input voice when the input voice of a user is detected;

the parameter determining module is used for determining voice parameters for broadcasting voice according to the voice characteristics, and the voice parameters are used for synthesizing, for the text information to be broadcasted, target voice for broadcasting;

the information adding module is used for adding identification information into the text information to be broadcasted based on the syntactic analysis of the text information to be broadcasted to obtain target text information;

and the voice generation module is used for generating target voice for broadcasting based on the voice parameters and the target text information.

11. An electronic device, comprising:

one or more processors;

a memory;

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-9.

12. A computer-readable storage medium, characterized in that a program code is stored in the computer-readable storage medium, which program code can be called by a processor to perform the method according to any of claims 1-9.

Technical Field

The present application relates to the field of speech processing technologies, and in particular, to a speech synthesis method, apparatus, electronic device, and storage medium.

Background

With the development of artificial intelligence technology, human-machine conversation has begun to enter people's daily lives widely; common scenarios include intelligent customer service robots, smart speakers, chat robots, and the like. The core of human-machine conversation is that, within an established system framework, the machine can automatically understand and analyze the voice input by the user according to data trained or learned in advance, and give a meaningful voice reply.

However, in related techniques, when speech synthesis is performed on the text information to be broadcasted, the input characters are simply matched one by one against a pronunciation library, and the pronunciations of all the characters are concatenated to generate the voice to be broadcasted.

Disclosure of Invention

In view of the above, the present application provides a speech synthesis method, apparatus, electronic device and storage medium.

In a first aspect, an embodiment of the present application provides a speech synthesis method, the method including: during voice broadcasting, when input voice of a user is detected, recognizing voice features of the input voice; determining voice parameters for broadcasting voice according to the voice features, wherein the voice parameters are used for generating, for the text information to be broadcasted, voice corresponding to the voice parameters; based on syntactic analysis of the text information to be broadcasted, adding identification information to the text information to be broadcasted to obtain target text information; and generating target voice for broadcasting based on the voice parameters and the target text information.

In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, including a voice analysis module, a parameter determination module, an information adding module, and a voice generation module. The voice analysis module is used for recognizing voice features of input voice when the input voice of a user is detected; the parameter determination module is used for determining voice parameters for broadcasting voice according to the voice features, the voice parameters being used for synthesizing, for the text information to be broadcasted, target voice for broadcasting; the information adding module is used for adding identification information to the text information to be broadcasted based on syntactic analysis of the text information to be broadcasted to obtain target text information; and the voice generation module is used for generating target voice for broadcasting based on the voice parameters and the target text information.

In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the speech synthesis method provided by the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code may be called by a processor to execute the speech synthesis method provided in the first aspect.

In the solution provided by the application, during voice broadcasting, when input voice of a user is detected, the voice features of the input voice are recognized; voice parameters for broadcasting voice are determined according to the voice features, the voice parameters being used for generating, for the text information to be broadcasted, voice corresponding to the voice parameters; based on syntactic analysis of the text information to be broadcasted, identification information is added to the text information to be broadcasted to obtain target text information; and target voice for broadcasting is generated based on the voice parameters and the target text information. In this way, the corresponding voice parameters can be determined according to the user's voice features, and personalized target voice for that user can be generated based on those parameters, improving the user's voice interaction experience.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present application, and that those skilled in the art can obtain other drawings based on them without creative effort.

Fig. 1 shows a schematic flow chart of a speech synthesis method according to an embodiment of the present application.

Fig. 2 is a flowchart illustrating a speech synthesis method according to another embodiment of the present application.

Fig. 3 is a flowchart illustrating a speech synthesis method according to still another embodiment of the present application.

Fig. 4 is a flowchart illustrating a speech synthesis method according to another embodiment of the present application.

Fig. 5 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application.

Fig. 6 is a block diagram of an electronic device for executing a speech synthesis method according to an embodiment of the present application.

Fig. 7 is a storage unit according to an embodiment of the present application, configured to store or carry program codes for implementing a speech synthesis method according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

In the related voice synthesis technology, the input characters are merely matched one by one against a pronunciation library, and the pronunciations of all the characters are concatenated to generate the voice to be broadcasted. The pitch, speech rate, volume, tone, and timbre of voice generated in this way are uniform and lack variation, so a user easily perceives that the voice is being broadcast or replied automatically by a machine. This degrades the listening experience and causes users to lose patience and seek manual service; the intelligent answering robot thereby loses its fundamental purpose of saving manpower.

In view of the above problems, the inventors propose a voice synthesis method, apparatus, electronic device, and storage medium which, when input voice of a user is detected during voice broadcasting, can determine voice parameters for broadcasting from the voice features of the input voice and generate target voice for broadcasting based on those parameters and the target text information. This is described in detail below.

Referring to fig. 1, fig. 1 is a flowchart illustrating a speech synthesis method according to an embodiment of the present application. The speech synthesis method provided by the embodiment of the present application will be described in detail with reference to fig. 1. The speech synthesis method may include the steps of:

step S110: in the voice broadcasting process, when the input voice of a user is detected, the voice characteristics of the input voice are recognized.

In this embodiment, the voice broadcast may be applied to various scenarios, for example, an intelligent customer service system, an intelligent chat robot, an intelligent question and answer robot, or a telemarketing scenario, which is not limited in this embodiment. The input voice of the user may be a voice spoken by the user to a currently used smart device supporting human-computer interaction, where the smart device may include a smart robot, a smart phone, a smart wearable device (such as a smart watch, a smart headset, and the like), a tablet computer, a notebook computer, and the like, which is not limited in this embodiment.

Optionally, during human-computer voice interaction between the user and the intelligent device, the user may speak first, and the intelligent device broadcasts a corresponding reply to the user's input voice to answer what the user wants to know. For example, in an intelligent customer service system, the user inputs the voice "What time is it now?", and the intelligent device broadcasts the reply "It is 9 a.m. now; is there anything else you need?". Alternatively, the intelligent device may broadcast first, for example "Do you need insurance services?", and the user inputs a reply to that broadcast, for example "If so, what types of insurance services are there?".

Based on this, the intelligent device can monitor for the user's input voice during human-computer voice interaction, that is, during voice broadcasting, and recognize the voice features of the input voice when it is detected, so that a personalized reply voice can subsequently be generated according to those features, improving the user's listening experience. The voice features may include multiple features, such as the tone, pitch, volume, voiceprint features, and speech rate of the input voice, which is not limited in this embodiment.

Step S120: and determining voice parameters for broadcasting voice according to the voice characteristics, wherein the voice parameters are used for generating voice corresponding to the voice parameters aiming at the text information to be broadcasted.

In this embodiment, the voice parameters may include pitch, timbre, speech rate, and the like, which is not limited in this embodiment. Different voice features may correspond to different voice parameters for the broadcast voice, so the broadcast voice generated based on those parameters differs as well.

In some embodiments, voice features such as the pitch and speech rate of the input voice may be used directly as the voice parameters for broadcasting voice. Specifically, if the pitch of the input voice is low and its speech rate slow, the pitch in the voice parameters for broadcasting may correspondingly be low and the speech rate slow, which matches the user's speaking habits. The pitch and speech rate of the input voice can therefore be used as the pitch and speech rate in the voice parameters for the broadcast voice.

In other embodiments, the speech rate in the voice features may be obtained, the speech rate interval in which it falls determined, and the speech rate corresponding to that interval acquired as the speech rate in the voice parameters for broadcasting voice. Mappings between the different speech rate intervals and their corresponding broadcast speech rates may be stored in advance; after the speech rate of the user's input voice is obtained and its interval determined, the broadcast speech rate can be obtained from the mapping. It can be understood that pitch and timbre are determined similarly to the speech rate, with reference to the above process, and the details are not repeated here.
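A minimal sketch of such an interval lookup, in Python; the interval bounds, units (characters per second), and target rates below are illustrative assumptions rather than values from the application:

```python
# Illustrative mapping from a user's measured speech rate (characters/second)
# to the speech rate used for the broadcast voice. Interval bounds and target
# rates are assumed values for this sketch.
SPEECH_RATE_INTERVALS = [
    ((0.0, 3.0), 3.0),   # slow speakers -> slow broadcast speech
    ((3.0, 5.0), 4.5),   # average speakers -> moderate broadcast speech
    ((5.0, 10.0), 5.5),  # fast speakers -> faster broadcast speech
]

def broadcast_speech_rate(user_rate: float) -> float:
    """Return the broadcast speech rate for the interval containing user_rate."""
    for (low, high), target_rate in SPEECH_RATE_INTERVALS:
        if low <= user_rate < high:
            return target_rate
    return 4.5  # fallback: default moderate rate
```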

Step S130: and adding identification information into the text information to be broadcasted based on the syntactic analysis of the text information to be broadcasted to obtain target text information.

In this embodiment, syntactic analysis may also be performed on the text information to be broadcasted, and identification information added to it, so that the broadcast is more interesting and personable. The syntactic analysis may divide the text information into a subject, a predicate, and an object; the identification information may be added between the subject and the predicate, or between the predicate and the object. If the text information to be broadcasted comprises multiple clauses, identification information may be added between adjacent clauses; if it comprises only one clause, identification information may also be added before and after that clause, which is not limited in this embodiment.

The text information to be broadcasted may be reply text determined from the user's input, that is, determined according to the user's input voice. While the current text is being broadcast, if the user is detected to interrupt the voice more than a preset number of times within a preset time period, it is determined that the user is not interested in the content currently being broadcast, and preset inquiry text is used as the target text information; the preset inquiry text prompts the user to speak, so that the system can learn what the user actually wants to know. The text information to be broadcasted may also be preset broadcast text. Counting the user's interruptions works by counting how many times the user speaks during the preset time period of the current broadcast and judging whether that count exceeds the preset number of times. The preset number may be set in advance, or may be taken as the number of questions contained in the current broadcast voice; that is, if the user speaks more times than the current broadcast contains questions, the extra utterances are treated as interruptions.
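A minimal sketch, in Python, of the interruption heuristic just described; the function name and its two inputs are illustrative assumptions:

```python
def user_lost_interest(times_user_spoke: int, questions_in_broadcast: int) -> bool:
    """Count utterances beyond the number of questions in the current
    broadcast as interruptions; if there are any, the preset inquiry text
    should replace the text currently being broadcast."""
    return times_user_spoke > questions_in_broadcast
```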

In some embodiments, the identification information may be interactive text; adding it to the text information to be broadcasted increases the friendliness, interest, and interactivity of the broadcast. For example, for the text information "It is 3 a.m. now", syntactic analysis can determine that "it" is the subject, "is" the predicate, and "3 a.m." the object; the word "already" can be added between the subject and the predicate, and "it is too late, please go to sleep soon" appended at the end to prompt the user and increase interactivity during the broadcast. The resulting target text information is "It is already 3 a.m. now; it is too late, please go to sleep soon".

In other embodiments, the identification information may also be meaningless filler words; after they are added to the text information to be broadcasted, the generated voice contains pauses or verbal fillers such as "um", "this", or "that", making it harder for the user to perceive that the party speaking to them is an auto-answer robot or a voice broadcast.

Step S140: and generating target voice for broadcasting based on the voice parameters and the target text information.

Based on this, after the voice parameters and the target text information are determined, the target text information can be converted into voice through Text To Speech (TTS) technology according to the voice parameters, yielding the target voice for broadcasting. The target voice may be generated by a parametric method based on the voice parameters; that is, the fundamental frequency, formant frequencies, and the like of the target voice are produced by parameter adjustment so that the target voice satisfies the voice parameters. In other words, the speech rate, timbre, pitch, and so on of the generated target voice match the voice parameters.
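As a sketch of this step: the application does not name a concrete synthesizer, so the `engine` object and its `synthesize()` signature below are placeholders standing in for whatever parametric TTS backend is used, not a real library API:

```python
from dataclasses import dataclass

@dataclass
class VoiceParams:
    pitch: float   # relative pitch, 1.0 = neutral
    rate: float    # speech rate, e.g. characters per second
    volume: float  # 0.0 - 1.0
    timbre: str    # named voice/timbre preset

def generate_target_voice(engine, params: VoiceParams, target_text: str) -> bytes:
    """Hand the marked-up target text and the chosen voice parameters to a
    TTS backend; the backend is expected to honor the parameters so the
    generated speech rate, pitch, volume, and timbre match them."""
    return engine.synthesize(
        text=target_text,
        pitch=params.pitch,
        rate=params.rate,
        volume=params.volume,
        voice=params.timbre,
    )
```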

In this embodiment, the corresponding voice parameters can be determined according to the user's voice features, and personalized target voice for the user generated based on those parameters, improving the user's voice interaction experience. Meanwhile, because identification information is added to the text information to be broadcasted, the synthesized target voice can contain pauses, filler words, interactive speech, and the like, and therefore sounds friendlier and more engaging. This makes it harder for the user to perceive that the party speaking to them is an auto-answer robot or a voice broadcast, ensures the automatic voice broadcast proceeds smoothly, reduces recourse to manual service, and saves labor costs.

Referring to fig. 2, fig. 2 is a flowchart illustrating a speech synthesis method according to another embodiment of the present application. The speech synthesis method provided by the embodiment of the present application will be described in detail with reference to fig. 2. The speech synthesis method may include the steps of:

step S210: in the voice broadcasting process, when the input voice of a user is detected, the voice characteristics of the input voice are recognized.

In this embodiment, the specific implementation manner of step S210 may refer to the contents in the foregoing embodiments, and is not described herein again.

Step S220: and determining the user attribute information of the user according to the voice characteristics.

In this embodiment, the user attribute information of the user may be determined according to the voice features of the input voice. The user attribute information may include multiple types, such as age, gender, region, and education level, and the voice features may likewise include multiple types, such as timbre, pitch, volume, voiceprint features, speech rate, and accent, which is not limited in this embodiment. Gender can be determined according to the user's timbre and/or pitch; age can be determined from the user's timbre and/or voiceprint features; the region can be determined according to the user's accent; and the education level can be determined according to the user's age and region.

Step S230: and acquiring voice parameters corresponding to the user attribute information as voice parameters for broadcasting voice.

Based on this, after the user attribute information of the user is acquired, the voice parameters corresponding to the user attribute information may further be acquired as the voice parameters for broadcasting voice. That is to say, different users have different attribute information, so the corresponding voice parameters acquired for them differ, and the target voices generated for broadcasting differ accordingly; in other words, voice generation is personalized for each user during voice interaction.

In some embodiments, if the user attribute information is the user's age, the age interval in which the user's age falls is acquired as the target age interval, and the voice parameter corresponding to the target age interval is acquired as the voice parameter for broadcasting voice. The user's age can be obtained by recognizing the voiceprint features of the input voice. Acquiring the voice parameter corresponding to the target age interval relies on storing, in advance, multiple age intervals and the voice parameter corresponding to each, i.e., a mapping between each age interval and its parameters; after the user's age is obtained, it is determined which of the pre-stored age intervals contains it, that interval is taken as the target age interval, and its voice parameter is looked up from the mapping and used as the voice parameter for broadcasting voice. When presetting these intervals and their parameters, note that very young and elderly users may have a lower education level and may understand things more slowly; the volume in the parameters for the youngest and oldest intervals can therefore be turned up somewhat and the speech rate turned down, ensuring that such users can hear the broadcast clearly and preventing speech that is too fast or too quiet from keeping the user from following the broadcast content in time.

For example, among the pre-stored age intervals, a user age of 20 falls in the interval [19–30 years]; the voice parameter stored in advance for the interval [19–30 years] is therefore acquired as the voice parameter for broadcasting voice.
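A minimal sketch of the pre-stored age-interval mapping; the interval bounds and parameter values are illustrative assumptions, chosen so that the youngest and oldest intervals get higher volume and a lower speech rate, as described above:

```python
# Assumed age intervals and parameter presets; the application stores such a
# mapping in advance but does not give concrete values.
AGE_INTERVAL_PARAMS = {
    (0, 18):  {"volume": 0.9, "rate": 3.5},  # young users: louder, slower
    (19, 30): {"volume": 0.7, "rate": 4.5},
    (31, 55): {"volume": 0.7, "rate": 4.0},
    (56, 99): {"volume": 0.9, "rate": 3.0},  # older users: louder, slower
}

def params_for_age(age: int) -> dict:
    """Return the voice parameters of the target age interval containing age."""
    for (low, high), params in AGE_INTERVAL_PARAMS.items():
        if low <= age <= high:
            return params
    raise ValueError(f"no age interval covers age {age}")
```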

In other embodiments, if the user attribute information is the user's gender, the voice parameter corresponding to that gender is acquired as the voice parameter for broadcasting voice. Specifically, whether the user is male or female can be determined from the frequency of the input voice: because male voices are lower-pitched than female voices, their frequency is understandably also lower, so the frequency of the input voice can be obtained and classified as falling in a low-frequency or a high-frequency region. If the frequency of the input voice falls in the high-frequency region, the user can be determined to be female; if it falls in the low-frequency region, the user can be determined to be male. The frequency threshold between the two regions can be obtained by statistical analysis of voice frequency data from a large number of men and women. Since, for example, women generally speak more slowly and men faster, different voice parameters can be set for each gender: for female users the speech rate can be set relatively slow and the timbre relatively soft, while for male users the speech rate can be set relatively fast and the volume relatively high. Of course, the voice parameters corresponding to different genders may also be set by the user for different application scenarios, which is not limited in this embodiment.
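A sketch of the frequency-region judgment; the threshold is an assumption standing in for the value that would be obtained statistically from many speakers:

```python
# Assumed boundary between typical male and female fundamental frequency;
# in practice it would be derived from statistics over many voices.
FREQ_THRESHOLD_HZ = 165.0

def infer_gender(fundamental_freq_hz: float) -> str:
    """Classify the speaker as 'female' (high-frequency region) or 'male'
    (low-frequency region) from the input voice's fundamental frequency."""
    return "female" if fundamental_freq_hz >= FREQ_THRESHOLD_HZ else "male"
```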

In still other embodiments, if the user attribute is the user's region, the voice parameter corresponding to that region is acquired as the voice parameter for broadcasting voice. The region corresponding to the user's accent, i.e., the region to which the user belongs, may be determined from the voice features; the region may be a country, a province, a city, and so on, and the voice parameters corresponding to different regions may be preset, which is not limited in this embodiment. Because users in different regions have different speaking habits, the local accent can serve as the voice parameter corresponding to the user's region; that is, the accent of the user's region is used as the voice parameter for broadcasting voice, so that the accent of the generated broadcast also matches the accent of the user's region. The voice exchange thereby feels more familiar, and it is harder for the user to perceive that the party speaking to them is an auto-answer robot or a voice broadcast.

For example, if the user's accent is identified as a Sichuan accent, the corresponding region is Sichuan Province, so the user's region can be determined to be Sichuan Province, and the Sichuan accent is used as the voice parameter for broadcasting voice.

In still other embodiments, the user attribute information may include multiple types simultaneously. To further improve the accuracy of the acquired voice parameters, a multidimensional mapping table can be built in advance between the multiple types of user attribute information and the corresponding preset voice parameters; after the user's multiple types of attribute information are obtained, the voice parameters corresponding to them are determined from the table and the attribute information, and used as the voice parameters for broadcasting voice. Specifically, if the attribute information simultaneously comprises gender, age, region, and education level, then each preset voice parameter in the multidimensional mapping table corresponds to a preset gender, a preset age interval, a preset region, and a preset education level. After the current user's gender, age, region, and education level are obtained, they are matched against these presets: the preset gender equal to the current user's gender is taken as the target gender, the preset age interval containing the current user's age as the target age interval, the preset region matching the current user's region as the target region, and the preset education level matching the current user's education level as the target education level. The voice parameters corresponding to the target gender, target age interval, target region, and target education level are then acquired from the multidimensional mapping table as the voice parameters for broadcasting voice.

For example, suppose the current user is female, 24 years old, from Sichuan Province, and holds a bachelor's degree; the preset age intervals are 0–19, 20–39, and 40–80 years; the preset regions cover the 23 provinces of China; and the preset education levels are "bachelor's degree and above" and "below bachelor's degree". Based on the multidimensional mapping table, the target gender is determined to be female, the target age interval 20–39 years, the target region Sichuan Province, and the target education level bachelor's degree and above, so the voice parameters stored in the table for the combination of female gender, the 20–39 age interval, Sichuan Province, and bachelor's degree and above can be acquired as the voice parameters for broadcasting voice.
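A toy version of the multidimensional lookup; every key and parameter value below is an illustrative assumption:

```python
# Toy multidimensional mapping table keyed by
# (gender, age interval, region, education level).
MULTI_DIM_TABLE = {
    ("female", (20, 39), "Sichuan", "bachelor_and_above"):
        {"rate": 4.5, "volume": 0.7, "accent": "Sichuan"},
    ("male", (20, 39), "Sichuan", "bachelor_and_above"):
        {"rate": 5.0, "volume": 0.8, "accent": "Sichuan"},
}

def lookup_params(gender: str, age: int, region: str, education: str) -> dict:
    """Match the user's attributes against the presets and return the stored
    voice parameters; fall back to a default preset when nothing matches."""
    for (g, (low, high), r, e), params in MULTI_DIM_TABLE.items():
        if g == gender and low <= age <= high and r == region and e == education:
            return params
    return {"rate": 4.5, "volume": 0.7, "accent": "standard"}  # default preset
```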

Step S240: and adding identification information into the text information to be broadcasted based on the syntactic analysis of the text information to be broadcasted to obtain target text information.

Step S250: and generating target voice for broadcasting based on the voice parameters and the target text information.

In this embodiment, the detailed implementation of steps S240 to S250 may refer to the content in the foregoing embodiments, and will not be described herein again.

In this embodiment, the user attribute information of the user can be determined according to the voice features of the user's input voice, the corresponding voice parameters determined according to that attribute information, and the target voice for broadcasting generated based on those parameters. Target voices for broadcasting with different voice parameters can thus be generated for different user attribute information; that is, voice generation is personalized for each user during voice interaction, the friendliness of human-machine voice exchange is improved, and the user is less likely to perceive that the party speaking to them is an auto-answer robot or a voice broadcast.

Referring to fig. 3, fig. 3 is a flowchart illustrating a speech synthesis method according to still another embodiment of the present application. The speech synthesis method provided by the embodiment of the present application will be described in detail below with reference to fig. 3. The speech synthesis method may include the steps of:

step S310: in the voice broadcasting process, when the input voice of a user is detected, the voice characteristics of the input voice are recognized.

In this embodiment, the specific implementation manner of step S310 may refer to the contents in the foregoing embodiments, and is not described herein again.

Step S320: and determining emotion information of the user according to the voice characteristics.

In this embodiment, the user emotion information may be information representing the user's emotion, and the user's emotion may include joy, anger, sadness, surprise, fear, confusion, concentration, bewilderment, and the like, which is not limited here.

The voice feature may be the user's tone; that is, the input voice is analyzed to obtain the user's current tone. As a specific implementation, the input voice may be analyzed to obtain parameter information related to the speaking tone, such as volume, pitch, and voice content, and the user's tone determined according to the specific values of that parameter information; the specific way of analyzing the tone is not limited. Based on the user's tone, further analysis can yield the user's emotion information. Of course, the way the emotion information is obtained from the user's tone is likewise not limited.

In some embodiments, if the emotion information of the user includes excited emotion and calm emotion, determining whether the volume of the input voice is greater than a preset volume threshold, and if the volume is greater than the preset volume threshold, determining that the emotion information of the user is excited; and if the volume is less than or equal to the preset volume threshold, judging the emotion of the user to be calm. The preset volume threshold may be preset, or may be adjusted according to different application scenarios, which is not limited in this embodiment.

In other embodiments, if the emotion information of the user includes excited emotion and calm emotion, determining whether the speech rate of the input speech is greater than a preset speech rate threshold, and if the speech rate is greater than the preset speech rate threshold, determining that the emotion information of the user is excited; and if the speech speed is less than or equal to the preset speech speed threshold value, judging the emotion of the user to be calm. The preset speech rate threshold may be preset, or may be adjusted according to different application scenarios, which is not limited in this embodiment.

In still other embodiments, the emotion of the user may also be determined according to a plurality of speech characteristic parameters. Specifically, if the emotion information of the user includes three emotions, namely, extreme excitement, more excitement and calm, whether the speed of the input voice is greater than a preset speed threshold and the volume of the input voice is greater than a preset volume threshold is judged, and if the speed of the input voice is greater than the preset speed threshold and the volume is greater than the preset volume threshold, the emotion information of the user can be judged to be extreme excitement; if the speech rate is greater than the preset speech rate threshold value but the volume is less than or equal to the preset volume threshold value, or the volume is greater than the preset volume threshold value but the speech rate is less than or equal to the preset speech rate threshold value, judging that the emotion information of the user is excited; and if the volume is less than or equal to the preset volume threshold and the speech speed is less than or equal to the preset speech speed threshold, judging that the emotion information of the user is calm.
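The three-way judgment above can be sketched directly; both thresholds are assumed values:

```python
# Assumed thresholds; the application leaves the concrete values open.
VOLUME_THRESHOLD = 0.75  # normalized volume
RATE_THRESHOLD = 5.0     # characters per second

def classify_emotion(volume: float, speech_rate: float) -> str:
    """Both features above their thresholds -> 'extremely excited';
    exactly one above -> 'excited'; neither -> 'calm'."""
    above = (volume > VOLUME_THRESHOLD) + (speech_rate > RATE_THRESHOLD)
    if above == 2:
        return "extremely excited"
    if above == 1:
        return "excited"
    return "calm"
```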

In still other embodiments, multiple voice feature parameters of the user can be input into a pre-trained emotion scoring model to obtain an emotion score. The score is compared with a preset score threshold: if the emotion score is greater than the threshold, the user's emotion information is determined to be excited; if it is less than or equal to the threshold, the emotion information is determined to be calm. The preset score threshold may be preset, or may be adjusted for different application scenarios, which is not limited in this embodiment.

Step S330: and acquiring a voice parameter corresponding to the emotion information as a voice parameter for broadcasting voice.

Based on this, after the user's emotion information is determined, the voice parameter corresponding to the emotion information may be acquired as the voice parameter for broadcasting. To improve the interactivity of the automatic broadcast, the voice parameters for broadcasting voice may change with the user's emotional changes, so that the user feels the auto-answer robot or intelligent customer service is communicating with them attentively. Multiple kinds of emotion information can therefore be preset, each with corresponding voice parameters. For example, if the user's emotion information is excited, the timbre in the voice parameters can be set softer, the pitch lower, and the volume smaller, so that the target voice for broadcasting generated from those parameters sounds gentler to a user who is currently excited.

In some embodiments, when the emotion information meets a set emotion condition, first text information is acquired as the text information to be broadcasted, the first text information being used to adjust the user's emotion. The set emotion condition may be a sad emotion, an excited emotion, or the like, and the corresponding first text information may differ for different set conditions. Specifically, if the set emotion condition is an excited emotion, the first text information may be text for soothing the user, such as "Please don't get excited; if you are not interested in this package, you can look at another package …".

In other embodiments, since users with different attribute information may react differently to the same emotion information, the voice parameters for the voice broadcast may also differ; for example, men and women may react differently to the same thing, a woman perhaps feeling delighted where a man appears calmer. Based on this, after the user's emotion information is determined, the user's attribute information is also determined from the voice features, and the voice parameters corresponding to both the emotion information and the attribute information are determined and used as the voice parameters for the voice broadcast. Specifically, the attribute information may include the user's gender and age, and a multidimensional mapping table may be built in advance over gender, age, and emotion information. The table contains preset voice parameters together with the corresponding preset genders, preset age intervals, and preset emotion information; the preset genders are male and female, the preset age intervals may include multiple intervals, such as 0–14, 15–55, and 56–80 years, and the preset emotion information may include multiple emotions, such as sad, excited, and happy. Based on this, after the current user's age, gender, and emotion information are obtained, they are matched against the preset gender, preset age interval, and preset emotion information in the table: the preset gender equal to the current user's gender is taken as the target gender, the preset age interval containing the current user's age as the target age interval, and the preset emotion information matching the current user's emotion as the target emotion information. The voice parameters corresponding to the target gender, target age interval, and target emotion information in the multidimensional mapping table are then used as the voice parameters for broadcasting.

Step S340: and adding identification information into the text information to be broadcasted based on the syntactic analysis of the text information to be broadcasted to obtain target text information.

Step S350: and generating target voice for broadcasting based on the voice parameters and the target text information.

In this embodiment, the detailed implementation manner of steps S340 to S350 may refer to the content in the foregoing embodiments, and is not described herein again.

In this embodiment, the user's emotion information can be determined according to the voice features of the input voice, the corresponding voice parameters determined according to that emotion information, and the target voice for broadcasting generated based on those parameters. Target voices for broadcasting with different tones and speech rates can thus be generated as the user's emotion information changes; that is, voice generation is personalized for each user during voice interaction, the friendliness of human-machine voice exchange is improved, and the user can hardly perceive that the party speaking to them is an auto-answer robot or a voice broadcast.

Referring to fig. 4, fig. 4 is a flowchart illustrating a speech synthesis method according to another embodiment of the present application. The speech synthesis method provided by the embodiment of the present application will be described in detail below with reference to fig. 4. The speech synthesis method may include the steps of:

step S410: in the voice broadcasting process, when the input voice of a user is detected, the voice characteristics of the input voice are recognized.

Step S420: and determining voice parameters for broadcasting voice according to the voice characteristics, wherein the voice parameters are used for generating voice corresponding to the voice parameters aiming at the text information to be broadcasted.

In this embodiment, the detailed implementation manner of steps S410 to S420 may refer to the content in the foregoing embodiments, and is not described herein again.

Step S430: and identifying clauses in the text information to be broadcasted to obtain a plurality of clauses.

Step S440: and acquiring a target clause existing in the plurality of clauses, wherein the number of words in the target clause is greater than a first threshold value.

In this embodiment, some clauses in the text information to be broadcasted may contain many words; if such clauses are converted into voice directly, the converted voice may sound stiff and unnatural, affecting the user's listening experience. Therefore, the clauses in the text information to be broadcasted can be identified to obtain multiple clauses, the word count of each clause obtained, and each count compared with a first threshold; a clause whose word count exceeds the first threshold is judged to be a long sentence and taken as a target clause. The first threshold may be preset (e.g., 10), or may be adjusted for the specific application scenario.
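A minimal sketch of the long-clause screening; the punctuation-based clause split and the threshold value are assumptions, and for Chinese text a character count would replace the whitespace word count used here:

```python
import re

FIRST_THRESHOLD = 10  # assumed word-count threshold for a "long" clause

def split_clauses(text: str) -> list[str]:
    """Split broadcast text into clauses on common clause-level punctuation."""
    return [c.strip() for c in re.split(r"[,;，；。.!?！？]", text) if c.strip()]

def target_clauses(text: str) -> list[str]:
    """Return the clauses whose word count exceeds the first threshold."""
    return [c for c in split_clauses(text) if len(c.split()) > FIRST_THRESHOLD]
```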

Step S450: the target clause is divided into a plurality of clause components based on a syntactic analysis of the target clause.

Step S460: and adding the connecting words between adjacent clause components to obtain target text information.

Based on this, after a target clause with a high word count is identified, a connective can be added within it so that the converted target voice sounds more like a real person speaking. However, a connective inserted at an arbitrary position could distort the content the text to be broadcasted is intended to express, so the target clause may be parsed and divided into multiple clause components, which may include a subject, a predicate, an object, an attributive, an adverbial, a complement, a head word, and the like. Connectives can then be added between adjacent clause components to obtain the target text information. The connective may be "um", "then", "well", "that", and the like, which is not limited here.

For example, if the target clause is "the smartphone cannot recognize the voice uttered by the user when the network of the smartphone is not good", it can be parsed and divided into multiple clause components, such as the adverbial "when the network of the smartphone is not good", the subject "the smartphone", the predicate "cannot recognize", and the object "the voice uttered by the user". Based on this, connectives can be added between any of the subject, predicate, and object, or a connective can be added only between two designated clause components; for instance, adding only the connective "that" between the adverbial and the subject turns the target clause into "when the network of the smartphone is not good, that, the smartphone cannot recognize the voice uttered by the user", so that the target clause can be converted into more colloquial target voice.
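A sketch of connective insertion; a real system would obtain the clause components from a syntactic parser, so the pre-split `(role, text)` input and the single placement rule below are assumptions:

```python
# Assumed placement rule: insert a connective only between an adverbial
# and the subject that follows it.
CONNECTIVES = {("adverbial", "subject"): "that,"}

def add_connectives(components: list[tuple[str, str]]) -> str:
    """components: (role, text) pairs in sentence order. A connective is
    inserted between adjacent components whose role pair has a rule."""
    out = []
    for i, (role, text) in enumerate(components):
        out.append(text)
        if i + 1 < len(components):
            conn = CONNECTIVES.get((role, components[i + 1][0]))
            if conn:
                out.append(conn)
    return " ".join(out)
```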

In practical applications, one of the clause components may itself contain many words. If such a component is converted into target voice directly, that part of the broadcast may run long without any pause, the user again hears something stiff and concludes that the party in the current conversation is an answering robot, loses patience, and seeks manual customer service.

Based on this, in some embodiments, a target clause component whose word count exceeds a second threshold may be acquired from the clause components, the second threshold being smaller than the first threshold; pause identifiers are then added between the target clause component and its adjacent clause component, the identifiers being used, when the target voice is generated, to produce a pause of a specified duration between the voice corresponding to the target clause component and the voice corresponding to the adjacent component. The pause identifier may be a comma, a period, a semicolon, or a pause mark, which is not limited in this embodiment, and different pause identifiers produce pause voices of different durations. That is, after the clause components are acquired, the word count of each is obtained and compared with the second threshold; any component whose count exceeds it is taken as a target clause component, and pause identifiers are added between it and its adjacent components so that the generated target voice pauses for the specified duration there. The generated target voice thus comes closer to a real person's speaking habit of pausing after a long stretch of speech before continuing with the next content.

For example, still taking the target clause "the smartphone cannot recognize the voice uttered by the user when the network of the smartphone is not good", the adverbial is "when the network of the smartphone is not good" and the subject is "the smartphone". Since the adverbial has a high word count, a pause identifier (such as a comma) can be added between the adverbial and the subject, turning the target clause into "when the network of the smartphone is not good, the smartphone cannot recognize the voice uttered by the user", so that a voice pause of the specified duration is generated between the voice corresponding to the adverbial and the voice corresponding to the subject.
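A sketch of pause-identifier insertion over pre-split clause components; the threshold value and the comma as pause identifier are assumptions:

```python
SECOND_THRESHOLD = 6  # assumed; must be smaller than the first threshold

def add_pause_marks(components: list[str], pause: str = ",") -> str:
    """Append a pause identifier to any clause component whose word count
    exceeds the second threshold, so the synthesizer pauses between it and
    the adjacent component."""
    marked = []
    for i, comp in enumerate(components):
        is_last = i == len(components) - 1
        if not is_last and len(comp.split()) > SECOND_THRESHOLD:
            comp = comp + pause
        marked.append(comp)
    return " ".join(marked)
```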

In some embodiments, if the target text information includes multiple clauses, a specified identifier is added between every two adjacent clauses; when the target voice is generated, the identifier produces a breathing sound between the voices corresponding to every two adjacent clauses. It can be understood that a real person takes audible breaths between clauses when speaking, so, to make the generated target voice closer to real speech, the specified identifier can be added between every two adjacent clauses in the target text information, producing a breathing sound between the voices for every two adjacent clauses when the target voice is generated.
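A sketch of the specified identifier for breathing sounds; the `<breath/>` markup is an assumed convention that the downstream synthesizer would have to understand:

```python
BREATH_MARK = "<breath/>"  # assumed markup mapped to a breathing sound by the TTS

def add_breath_marks(clauses: list[str]) -> str:
    """Join adjacent clauses with the specified identifier so the generated
    target voice contains a breathing sound between every two clauses."""
    return f" {BREATH_MARK} ".join(clauses)
```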

Step S470: and generating target voice for broadcasting based on the voice parameters and the target text information.

In this embodiment, the specific implementation manner of step S470 may refer to the contents in the foregoing embodiments, and is not described herein again.

In some embodiments, if generation of the target voice is not completed within a specified duration, preset voice is acquired as the target voice for broadcasting. The specified duration may be preset, or may be adjusted for the specific application scenario, which is not limited in this embodiment. In practice, a poor network on the user's intelligent device may slow down recognition of the input voice or generation of the target voice, so that generation does not finish within the specified duration; if nothing is broadcast at that point, the conversation may go cold and the user may end the current voice interaction. Preset voice can therefore be acquired as the target voice for broadcasting, such as filler speech like "um", "this", "let me think for a moment", or "one moment, please", to keep the user from growing impatient during a long wait for the generated target voice; if synthesis of the target voice completes after the preset voice has been broadcast, the target voice can then be broadcast to continue the conversation with the user.
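A sketch of the timeout fallback, assuming a blocking `synthesize` callable; the deadline value and the preset audio placeholder are assumptions:

```python
import concurrent.futures

TIMEOUT_SECONDS = 2.0  # assumed "specified duration"
PRESET_AUDIO = b""     # stand-in for pre-synthesized filler speech

def voice_with_fallback(synthesize, params, text):
    """Try to synthesize the target voice within the specified duration;
    fall back to the preset filler voice if synthesis is too slow."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(synthesize, params, text)
    try:
        return future.result(timeout=TIMEOUT_SECONDS)
    except concurrent.futures.TimeoutError:
        # Broadcast the filler now; real code could still await the result
        # and broadcast the finished target voice afterwards.
        return PRESET_AUDIO
    finally:
        pool.shutdown(wait=False)
```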

In some embodiments, the voice quality of the input voice may also be analyzed; when the voice quality is lower than a preset quality threshold, second text information is acquired and used as the target text information, the second text information prompting the user to re-input voice whose quality reaches the preset threshold. The voice quality of the input voice can be determined from its signal-to-noise ratio: when the signal-to-noise ratio is lower than a preset value, the voice quality is judged to be below the preset quality threshold, and the second text information is acquired as the target text information. When the user's voice quality is poor, the user is prompted to speak louder or move away from noise and input the voice again, preventing the situation where the user's voice cannot be recognized because of poor voice quality.
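A sketch of the signal-to-noise gate; the threshold value is an assumption, and the signal and noise powers are taken as already measured (both must be positive):

```python
import math

QUALITY_SNR_DB = 10.0  # assumed threshold below which re-input is requested

def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels; assumes nonzero positive powers."""
    return 10.0 * math.log10(signal_power / noise_power)

def needs_reprompt(signal_power: float, noise_power: float) -> bool:
    """True when the input voice's SNR is below the preset quality threshold,
    in which case the second (re-prompt) text is used as the target text."""
    return snr_db(signal_power, noise_power) < QUALITY_SNR_DB
```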

In this embodiment, when the text information to be broadcasted includes a plurality of clauses, a specified identifier may be added between adjacent clauses to generate ventilation voices between the voices corresponding to every two adjacent clauses, so that the generated target voice is closer to the speech habits of a real person. Connecting words are added between clause components with many words, and pause identifiers are added between long clause components and their adjacent components, so that the generated target voice is more natural and colloquial. This improves the intimacy of voice communication between human and machine and makes it harder for the user to perceive that the object speaking with them is an automatic response robot or a voice broadcast.

Referring to fig. 5, a block diagram of a speech synthesis apparatus 500 according to another embodiment of the present application is shown. The apparatus 500 may comprise: a speech analysis module 510, a parameter determination module 520, an information addition module 530, and a speech generation module 540.

The voice analysis module 510 is configured to, when an input voice of a user is detected, recognize a voice feature of the input voice;

the parameter determining module 520 is configured to determine a voice parameter for broadcasting a voice according to the voice feature, where the voice parameter is used to synthesize, for the text information to be broadcasted, a target voice for broadcasting;

the information adding module 530 is configured to add identification information to the text information to be broadcasted based on syntax analysis of the text information to be broadcasted to obtain target text information;

the voice generating module 540 is configured to generate a target voice for broadcasting based on the voice parameter and the target text information.
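As a minimal sketch (not the claimed apparatus itself), the following Python class shows how the four modules of apparatus 500 could compose into one pipeline. The module internals are assumed placeholders; only the data flow (speech to features to parameters, text to annotated text to target voice) follows the description above.

```python
# Illustrative sketch of the data flow through apparatus 500.

class SpeechSynthesisApparatus:
    def __init__(self, speech_analysis, parameter_determination,
                 information_adding, speech_generation):
        self.speech_analysis = speech_analysis                   # module 510
        self.parameter_determination = parameter_determination   # module 520
        self.information_adding = information_adding             # module 530
        self.speech_generation = speech_generation               # module 540

    def run(self, input_speech, text_to_broadcast):
        features = self.speech_analysis(input_speech)
        voice_params = self.parameter_determination(features)
        target_text = self.information_adding(text_to_broadcast)
        return self.speech_generation(voice_params, target_text)
```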

In some embodiments, the parameter determination module 520 may include: an information determining unit and a parameter acquiring unit. The information determining unit may be configured to determine user attribute information of the user according to the voice feature. The parameter acquiring unit may be configured to acquire a voice parameter corresponding to the user attribute information as a voice parameter for broadcasting a voice.

In this manner, the user attribute information includes a user age, and the parameter acquiring unit may include: an interval acquiring subunit and a parameter acquiring subunit. The interval acquiring subunit may be configured to acquire the age interval in which the user age is located as a target age interval. The parameter acquiring subunit may be configured to acquire a voice parameter corresponding to the target age interval as the voice parameter for broadcasting voice.
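A small sketch of this lookup follows; the intervals and parameter values below are assumptions for illustration, not values from the application.

```python
# Illustrative sketch: map the recognized user age to its age interval and
# fetch the voice parameters configured for that interval.

AGE_INTERVAL_PARAMS = {
    (0, 17):   {"speed": 0.9, "pitch": 1.10, "timbre": "lively"},
    (18, 59):  {"speed": 1.0, "pitch": 1.00, "timbre": "neutral"},
    (60, 120): {"speed": 0.8, "pitch": 0.95, "timbre": "warm"},
}

def voice_params_for_age(user_age):
    """Return the voice parameters of the target age interval."""
    for (low, high), params in AGE_INTERVAL_PARAMS.items():
        if low <= user_age <= high:
            return params
    return AGE_INTERVAL_PARAMS[(18, 59)]  # fall back to the default interval
```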

In other embodiments, the parameter determination module 520 may include: an emotion determining unit and a parameter acquiring unit. The emotion determining unit may be configured to determine emotion information of the user according to the voice feature. The parameter acquiring unit may be configured to acquire a voice parameter corresponding to the emotion information as the voice parameter for broadcasting voice.

In this manner, the speech synthesis apparatus 500 may further include a first obtaining module. The first obtaining module may be configured to, before the identification information is added to the text information to be broadcasted based on the syntactic analysis to obtain the target text information, acquire first text information as the text information to be broadcasted when the emotion information satisfies a set emotion condition, where the first text information is used to adjust the emotion of the user.

In some embodiments, the identification information includes connecting words, and the information adding module 530 may include: an identification unit, a target clause acquiring unit, a clause dividing unit, and an information adding unit. The identification unit may be configured to identify the clauses in the text information to be broadcasted to obtain a plurality of clauses. The target clause acquiring unit may be configured to acquire a target clause existing in the plurality of clauses, where the number of words in the target clause is greater than a first threshold. The clause dividing unit may be configured to divide the target clause into a plurality of clause components based on syntactic analysis of the target clause. The information adding unit may be configured to add the connecting words between adjacent clause components.

In this manner, the speech synthesis apparatus 500 may further include a target component acquiring module. The target component acquiring module may be configured to, after the target clause is divided into the plurality of clause components based on the syntactic analysis of the target clause, acquire a target clause component existing in the plurality of clause components, where the number of words in the target clause component is greater than a second threshold, and the second threshold is smaller than the first threshold. The information adding unit may be specifically configured to add a pause identifier between the target clause component and its adjacent clause component, where the pause identifier is used to generate a voice pause of a specified duration between the voice corresponding to the target clause component and the voice corresponding to the adjacent clause component when the target voice is generated.

In some embodiments, the information adding module may be specifically configured to, before the target voice for broadcasting is generated based on the voice parameter and the target text information, add a specified identifier between every two adjacent clauses if the target text information includes a plurality of clauses, where the specified identifier is used to generate a ventilation voice between the voices corresponding to every two adjacent clauses when the target voice is generated.

In some embodiments, the speech synthesis apparatus 500 may further include a voice acquiring module. The voice acquiring module may be configured to acquire a preset voice as the target voice for broadcasting if generation of the target voice is not completed within the specified duration.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, the coupling or direct coupling or communication connection between the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be electrical, mechanical, or in other forms.

In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.

In summary, in the solutions provided by the embodiments of the present application, the corresponding voice parameters can be determined according to the voice features of the user, and a personalized target voice for the user can be generated based on those voice parameters, which improves the user's voice interaction experience. Meanwhile, identification information is added to the text information to be broadcasted, so that the synthesized target voice can contain pauses, filler words, breath sounds, and the like, and therefore sounds more pleasant and natural. This makes it harder for the user to perceive that the object speaking with them is an automatic response robot or a voice broadcast, ensures the smooth progress of the automatic voice broadcasting process, reduces the need for manual service, and saves labor cost.

An electronic device provided by the present application will be described below with reference to the drawings.

Referring to fig. 6, fig. 6 shows a block diagram of an electronic device 600 according to an embodiment of the present application. The speech synthesis method according to the embodiments of the present application may be executed by the electronic device 600.

The electronic device 600 in the embodiments of the present application may include one or more of the following components: a processor 601, a memory 602, and one or more applications, where the one or more applications may be stored in the memory 602 and configured to be executed by the one or more processors 601, and the one or more applications are configured to perform the methods described in the foregoing method embodiments.

Processor 601 may include one or more processing cores. The processor 601 connects various parts throughout the electronic device 600 using various interfaces and lines, and performs various functions of the electronic device 600 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 602 and calling the data stored in the memory 602. Optionally, the processor 601 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 601 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, the application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem is used to handle wireless communication. It can be understood that the modem may also not be integrated into the processor 601 and may instead be implemented by a separate communication chip.

The memory 602 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 602 may be used to store instructions, programs, code sets, or instruction sets. The memory 602 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device 600 during use (such as the various correspondences described above), and so on.


Referring to fig. 7, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 700 stores program code that can be invoked by a processor to perform the methods described in the foregoing method embodiments.

The computer-readable storage medium 700 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 700 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 700 has storage space for program code 710 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 710 may, for example, be compressed in a suitable form.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
