Speech synthesis method and electronic equipment

Document No.: 1955158 | Publication date: 2021-12-10

Reading note: This technology, "Speech synthesis method and electronic equipment" (语音合成方法及电子设备), was designed and created by Zhang Liang (张亮) on 2021-09-22. Main content: The application discloses a speech synthesis method and an electronic device. The method includes: determining sentence break identifiers in obtained text; determining, based on the sentence break identifiers, a first type of characters and a second type of characters in the text, where the first type of characters has a first positional relationship with the sentence break identifiers and the second type of characters has a second positional relationship with the sentence break identifiers; acquiring first voice information corresponding to the first type of characters and second voice information corresponding to the second type of characters, where the first voice information includes the pronunciation of a first-type character together with a ventilation sound (an audible breath) that precedes the pronunciation in time sequence, and the second voice information includes the pronunciation of a second-type character; and synthesizing target audio corresponding to the text based on the first voice information and the second voice information. Because the generated target audio places ventilation sounds at appropriate positions, the method markedly improves how human-like the target audio sounds and benefits the listener's audio-visual experience.

1. A method of speech synthesis comprising:

determining sentence break identifiers in obtained text;

determining a first type of characters and a second type of characters in the text based on the sentence break identifiers, wherein the first type of characters has a first positional relationship with the sentence break identifiers and the second type of characters has a second positional relationship with the sentence break identifiers;

acquiring first voice information corresponding to the first type of characters and second voice information corresponding to the second type of characters, wherein the first voice information comprises a character pronunciation of the first type of characters and a ventilation sound that precedes the character pronunciation in time sequence, and the second voice information comprises a character pronunciation of the second type of characters; and

synthesizing target audio corresponding to the text based on the first voice information and the second voice information.

2. The method of claim 1, wherein determining the sentence break identifiers in the obtained text comprises:

determining punctuation marks in the text and using the punctuation marks as first sentence break identifiers.

3. The method of claim 2, wherein determining the sentence break identifiers in the obtained text further comprises:

in a case where the number of characters between two adjacent first sentence break identifiers is greater than a first threshold, adding one or more second sentence break identifiers between the two adjacent first sentence break identifiers, so that the number of characters between each pair of adjacent sentence break identifiers, whether first or second, is less than a second threshold.

4. The method of claim 3, wherein adding one or more second sentence break identifiers between two adjacent first sentence break identifiers comprises:

performing semantic analysis on the characters between the two adjacent first sentence break identifiers;

dividing the characters between the two adjacent first sentence break identifiers into a plurality of semantic segments according to a result of the semantic analysis; and

adding a second sentence break identifier between at least one pair of adjacent semantic segments.

5. The method of claim 1, wherein determining the first type of characters and the second type of characters in the text based on the sentence break identifiers comprises:

determining the first character after each sentence break identifier in the text as the first type of characters; and

determining the remaining characters in the text, other than the first type of characters, as the second type of characters.

6. The method of claim 1, wherein acquiring the first voice information corresponding to the first type of characters comprises:

acquiring first voice information whose ventilation sound has a random duration.

7. The method of claim 6, wherein acquiring the first voice information whose ventilation sound has a random duration comprises:

acquiring third voice information comprising the character pronunciation of the first type of characters and a ventilation sound preceding the character pronunciation, and

intercepting, from the third voice information, the character pronunciation of the first type of characters and a ventilation sound of random duration adjacent to the character pronunciation, to form the first voice information whose ventilation sound has a random duration; or

acquiring a ventilation sound of random duration and the character pronunciation of the first type of characters, and

forming the first voice information whose ventilation sound has a random duration based on the ventilation sound of random duration and the character pronunciation of the first type of characters.

8. The method of claim 1, wherein acquiring the first voice information corresponding to the first type of characters comprises:

in a case where a first-type character has a plurality of corresponding pieces of first voice information and the corresponding pieces of first voice information have different ventilation sounds, randomly acquiring one of the corresponding pieces of first voice information.

9. The method of claim 1, wherein acquiring the first voice information corresponding to the first type of characters comprises:

acquiring emotion information corresponding to each first-type character; and

acquiring, based on the emotion information, the first voice information having a ventilation sound that corresponds to the emotion information.

10. An electronic device, comprising:

a first determining module, configured to determine sentence break identifiers in obtained text;

a second determining module, configured to determine a first type of characters and a second type of characters in the text based on the sentence break identifiers, wherein the first type of characters has a first positional relationship with the sentence break identifiers and the second type of characters has a second positional relationship with the sentence break identifiers;

an acquisition module, configured to acquire first voice information corresponding to the first type of characters and second voice information corresponding to the second type of characters, wherein the first voice information comprises a character pronunciation of the first type of characters and a ventilation sound that precedes the character pronunciation in time sequence, and the second voice information comprises a character pronunciation of the second type of characters; and

a synthesis module, configured to synthesize target audio corresponding to the text based on the first voice information and the second voice information.

Technical Field

The present disclosure relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and an electronic device.

Background

When listening to speech-synthesized content for a long time, for example an audiobook produced by speech synthesis, people become more sensitive to how natural the synthesized speech sounds. A voice that is too monotonous and robotic easily causes listening fatigue, which in turn reduces comprehension of the content and the efficiency with which knowledge is retained.

Disclosure of Invention

The present application provides a speech synthesis method and an electronic device. The technical solutions adopted by the embodiments of the application are as follows:

one aspect of the present application provides a speech synthesis method, including:

determining sentence break identifiers in obtained text;

determining a first type of characters and a second type of characters in the text based on the sentence break identifiers, where the first type of characters has a first positional relationship with the sentence break identifiers and the second type of characters has a second positional relationship with the sentence break identifiers;

acquiring first voice information corresponding to the first type of characters and second voice information corresponding to the second type of characters, where the first voice information includes a character pronunciation of the first type of characters and a ventilation sound that precedes the character pronunciation in time sequence, and the second voice information includes a character pronunciation of the second type of characters; and

synthesizing target audio corresponding to the text based on the first voice information and the second voice information.

In some embodiments, determining the sentence break identifiers in the obtained text includes:

determining punctuation marks in the text and using the punctuation marks as first sentence break identifiers.

In some embodiments, determining the sentence break identifiers in the obtained text further includes:

in a case where the number of characters between two adjacent first sentence break identifiers is greater than a first threshold, adding one or more second sentence break identifiers between the two adjacent first sentence break identifiers, so that the number of characters between each pair of adjacent sentence break identifiers, whether first or second, is less than a second threshold.

In some embodiments, adding one or more second sentence break identifiers between two adjacent first sentence break identifiers includes:

performing semantic analysis on the characters between the two adjacent first sentence break identifiers;

dividing the characters between the two adjacent first sentence break identifiers into a plurality of semantic segments according to a result of the semantic analysis; and

adding a second sentence break identifier between at least one pair of adjacent semantic segments.

In some embodiments, determining the first type of characters and the second type of characters in the text based on the sentence break identifiers includes:

determining the first character after each sentence break identifier in the text as the first type of characters; and

determining the remaining characters in the text, other than the first type of characters, as the second type of characters.

In some embodiments, acquiring the first voice information corresponding to the first type of characters includes:

acquiring first voice information whose ventilation sound has a random duration.

In some embodiments, acquiring the first voice information whose ventilation sound has a random duration includes:

acquiring third voice information comprising the character pronunciation of the first type of characters and a ventilation sound preceding the character pronunciation, and

intercepting, from the third voice information, the character pronunciation of the first type of characters and a ventilation sound of random duration adjacent to the character pronunciation, to form the first voice information whose ventilation sound has a random duration; or

acquiring a ventilation sound of random duration and the character pronunciation of the first type of characters, and

forming the first voice information whose ventilation sound has a random duration based on the ventilation sound of random duration and the character pronunciation of the first type of characters.

In some embodiments, acquiring the first voice information corresponding to the first type of characters includes:

in a case where a first-type character has a plurality of corresponding pieces of first voice information and the corresponding pieces of first voice information have different ventilation sounds, randomly acquiring one of the corresponding pieces of first voice information.

In some embodiments, acquiring the first voice information corresponding to the first type of characters includes:

acquiring emotion information corresponding to each first-type character; and

acquiring, based on the emotion information, the first voice information having a ventilation sound that corresponds to the emotion information.

Another aspect of the present application provides an electronic device, including:

a first determining module, configured to determine sentence break identifiers in obtained text;

a second determining module, configured to determine a first type of characters and a second type of characters in the text based on the sentence break identifiers, where the first type of characters has a first positional relationship with the sentence break identifiers and the second type of characters has a second positional relationship with the sentence break identifiers;

an acquisition module, configured to acquire first voice information corresponding to the first type of characters and second voice information corresponding to the second type of characters, where the first voice information includes a character pronunciation of the first type of characters and a ventilation sound that precedes the character pronunciation in time sequence, and the second voice information includes a character pronunciation of the second type of characters; and

a synthesis module, configured to synthesize target audio corresponding to the text based on the first voice information and the second voice information.

In the speech synthesis method of the embodiments of the application, sentence break identifiers in the obtained text are determined, and the first type of characters and the second type of characters in the text are determined based on those identifiers. For the first type of characters, first voice information containing both the character pronunciation and a ventilation sound is acquired; for the second type of characters, second voice information containing only the character pronunciation is acquired. The target audio synthesized from the first and second voice information therefore places ventilation sounds at appropriate positions, which markedly improves how human-like the target audio sounds and improves the listener's audio-visual experience.

Drawings

FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present application;

FIG. 2 is a flowchart of step S1 of the speech synthesis method according to the embodiment of the present application;

FIG. 3 is a flowchart of step S12 of the speech synthesis method according to the embodiment of the present application;

FIG. 4 is a schematic diagram illustrating a scenario of step S31 of a speech synthesis method according to an embodiment of the present application;

FIG. 5 is a block diagram of a first embodiment of an electronic device according to an embodiment of the present application;

FIG. 6 is a block diagram of a second embodiment of an electronic device according to an embodiment of the present application.

Detailed Description

Various aspects and features of the present application are described herein with reference to the drawings.

It will be understood that various modifications may be made to the embodiments of the present application. Accordingly, the foregoing description should not be construed as limiting, but merely as exemplifications of embodiments. Those skilled in the art will envision other modifications within the scope and spirit of the application.

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the application and, together with a general description of the application given above and the detailed description of the embodiments given below, serve to explain the principles of the application.

These and other characteristics of the present application will become apparent from the following description of preferred forms of embodiment, given as non-limiting examples, with reference to the attached drawings.

It should also be understood that, although the present application has been described with reference to some specific examples, a person of skill in the art shall certainly be able to achieve many other equivalent forms of application, having the characteristics as set forth in the claims and hence all coming within the field of protection defined thereby.

The above and other aspects, features and advantages of the present application will become more apparent in view of the following detailed description when taken in conjunction with the accompanying drawings.

Specific embodiments of the present application are described hereinafter with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are merely examples of the application, which can be embodied in various forms. Well-known and/or repeated functions and constructions are not described in detail, to avoid obscuring the application in unnecessary detail. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present application in virtually any appropriately detailed structure.

The specification may use the phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments," which may each refer to one or more of the same or different embodiments in accordance with the application.

Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present application. Referring to fig. 1, the method may include the following steps:

S1, determining sentence break identifiers in the acquired text.

In practice, the text can be acquired in various ways. The electronic device may obtain the text through interaction with another electronic device, from a storage device inside the electronic device, from the result of the electronic device processing data, or in other manners. The specific content of the text is not limited here; for example, it may be an e-book, an article, a poem, a classical text, or even interactive messages.

When a real person reads text aloud, in order to keep the reading fluent, they usually take a breath at a sentence break or at a position having a specific positional relationship with a sentence break, so that the breathing action does not noticeably disturb the pronunciation and rhythm of the reading. Therefore, once the text is acquired, the sentence break identifiers in the text can be determined. A sentence break identifier marks a sentence break position in the textual content.

A sentence break identifier can be formed by characters already present in the text, or it can be added based on a determined sentence break position. For example, when the text is acquired, the sentence break positions of its content may be analyzed and sentence break identifiers generated from the analysis result. Optionally, the sentence break positions can be analyzed in various ways: part-of-speech analysis, semantic analysis, or sentence-pattern analysis may be performed on the content, or a trained self-learning model may be used to analyze the sentence break positions and thereby determine the sentence break identifiers in the text.
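As an illustration of the punctuation-based variant described above, the following Python sketch locates punctuation marks and records their character indices as first sentence break identifiers. The punctuation set and the helper name are assumptions for illustration, not part of the patent.

```python
import re

# Illustrative set of stop marks treated as sentence-break punctuation;
# the patent does not enumerate specific marks.
BREAK_PUNCTUATION = "。！？；，、.!?;,"

def find_first_break_identifiers(text: str) -> list[int]:
    """Return the character indices of punctuation marks that serve as
    first sentence break identifiers (hypothetical helper)."""
    pattern = "[" + re.escape(BREAK_PUNCTUATION) + "]"
    return [m.start() for m in re.finditer(pattern, text)]

# Example: prints [4, 13] (the comma and the full stop).
print(find_first_break_identifiers("各位听众，现在播送天气预报。"))
```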

S2, determining a first type of characters and a second type of characters in the text based on the sentence break identifiers, where the first type of characters has a first positional relationship with the sentence break identifiers and the second type of characters has a second positional relationship with the sentence break identifiers.

A ventilation sound is produced by the breathing action, and it needs to be placed before the pronunciation of the first character that follows the breath. That first character after the breathing position can therefore be determined as a first-type character, and the remaining characters in the text as second-type characters.

The breathing positions can be determined from the sentence break identifiers. For example, a position carrying a sentence break identifier can be configured as a breathing position, that is, the sentence break itself is used as the breathing position, imitating a real person who breathes at sentence breaks; a breathing position can also be configured between two sentence break identifiers, imitating a real person who takes a breath in the middle of reading. The first and second types of characters are then determined from the breathing positions: the first character after a breathing position is a first-type character, and the remaining characters in the text are second-type characters.

Of course, since the breathing positions are determined from the sentence break identifiers and have a specific positional relationship with them, and since the breathing positions directly determine the first and second types of characters, the first positional relationship between the sentence break identifiers and the first type of characters and the second positional relationship between the sentence break identifiers and the second type of characters can also be predetermined. Once the sentence break identifiers in the text are determined, the first type of characters can be determined from the identifiers and the first positional relationship, and the second type of characters from the identifiers and the second positional relationship, which again identifies the first character after each breathing position.
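A minimal sketch of applying the first positional relationship "first character after a break", assuming break positions found as in the earlier sketch; the skipping of consecutive punctuation is an added assumption for robustness.

```python
BREAK_PUNCTUATION = "。！？；，、.!?;,"  # same illustrative set as above

def classify_characters(text: str, break_positions: list[int]) -> list[str]:
    """Label every character as 'first' (a breath is placed before its
    pronunciation) or 'second', based on the break positions."""
    first_indices = set()
    for pos in break_positions:
        nxt = pos + 1
        # Skip any punctuation that immediately follows the break,
        # so the label lands on an actual character.
        while nxt < len(text) and text[nxt] in BREAK_PUNCTUATION:
            nxt += 1
        if nxt < len(text):
            first_indices.add(nxt)
    return ["first" if i in first_indices else "second" for i in range(len(text))]
```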

S3, acquiring first voice information corresponding to the first type of characters and second voice information corresponding to the second type of characters, where the first voice information includes the character pronunciation of a first-type character and a ventilation sound that precedes the pronunciation in time sequence, and the second voice information includes the character pronunciation of a second-type character.

The first voice information is in fact audio containing a character pronunciation and a ventilation sound, and the second voice information is audio containing only a character pronunciation. The audio can be pre-recorded by an announcer or synthesized from recorded audio. The ventilation sound includes at least an inhalation sound and may also include an exhalation sound.

Once the character types are determined, a character-type identifier can be added to each first-type character and each second-type character. The first voice information is then retrieved using the character-type identifier of the first-type characters, and the second voice information using that of the second-type characters. Taking Chinese as an example, the first and second voice information can be looked up from the Chinese character codes together with the character-type identifiers.
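The keying scheme below, which combines a character's code point with its type identifier, is only one possible way to index the recorded audio; the scheme, the dictionary, and the file names are assumptions for illustration.

```python
# Hypothetical audio store keyed by "<code point>:<type>"; values would be
# file paths or waveform arrays in a real system.
voice_store: dict[str, str] = {
    "U+5929:first": "tian_with_breath.wav",
    "U+5929:second": "tian_plain.wav",
}

def voice_key(char: str, char_type: str) -> str:
    """Build a lookup key from the Unicode code point and the character-type
    identifier ('first' or 'second')."""
    return f"U+{ord(char):04X}:{char_type}"

print(voice_store[voice_key("天", "first")])  # -> tian_with_breath.wav
```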

The first voice information may be obtained directly as pre-made audio that already contains both the character pronunciation and the ventilation sound, or the character pronunciation and the ventilation sound may be obtained separately and the first voice information synthesized from them.

S4, synthesizing target audio corresponding to the text based on the first voice information and the second voice information.

Once the first and second voice information have been acquired, they can be synthesized into the target audio according to the order of the characters in the text. Optionally, each character in the text may carry index information identifying its position in the text; the pieces of first and second voice information are then sorted by this index and concatenated into the target audio.
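A simplified sketch of step S4, under the assumption that each piece of first or second voice information has already been loaded as a waveform array and is keyed by the character's index in the text.

```python
import numpy as np

def synthesize_target_audio(clips_by_index: dict[int, np.ndarray]) -> np.ndarray:
    """Concatenate per-character audio clips (first or second voice
    information) in text-index order to form the target audio."""
    ordered = [clips_by_index[i] for i in sorted(clips_by_index)]
    return np.concatenate(ordered)

# Usage with dummy 0.1 s clips at a 16 kHz sample rate:
dummy = {i: np.zeros(1600, dtype=np.float32) for i in range(3)}
target = synthesize_target_audio(dummy)
print(target.shape)  # (4800,)
```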

In summary, the speech synthesis method of the embodiments determines the sentence break identifiers in the acquired text, determines the first and second types of characters from those identifiers, acquires first voice information (character pronunciation plus ventilation sound) for the first type and second voice information (character pronunciation only) for the second type, and synthesizes the target audio from them. Because ventilation sounds are placed at appropriate positions, the target audio sounds markedly more human, which improves the listener's audio-visual experience and helps prevent listening fatigue.

As shown in fig. 2, in some embodiments, step S1, determining the sentence break identifiers in the obtained text, includes:

S11, determining punctuation marks in the text and using the punctuation marks as first sentence break identifiers.

Punctuation marks are symbols that accompany written language to indicate pauses, tone, and the nature and role of words. Punctuation in the text can therefore be detected and used as first sentence break identifiers. The breathing positions are then determined from the positions of the punctuation marks, after which the first and second types of characters are determined; alternatively, the first and second types of characters are determined directly from the predefined positional relationships between each type and the punctuation marks. For example, the position of a punctuation mark may be taken as a breathing position, so that the first character after the punctuation mark is a first-type character. In an alternative embodiment, only the stop marks, that is, the punctuation marks that indicate pauses of different lengths such as commas and full stops, are detected and used as first sentence break identifiers. Because stop marks explicitly indicate pauses, using them as first sentence break identifiers marks the sentence break positions accurately, which in turn makes the placement of ventilation sounds more human-like.

Continuing with fig. 2, in some embodiments, step S1, determining the sentence break identifiers in the obtained text, further includes:

S12, when the number of characters between two adjacent first sentence break identifiers is greater than a first threshold, adding one or more second sentence break identifiers between the two adjacent first sentence break identifiers, so that the number of characters between any pair of adjacent sentence break identifiers is less than a second threshold.

In real scenarios, when a sentence or text segment is too long, a real person often cannot read the whole of it on a single breath. Take the following sentence as an example:

". A central people radio station. Each listener, now broadcasting a weather forecast for the central weather station, distributed at six o' clock in the evening today. "

If a breath is taken only at the positions marked by first sentence break identifiers, the sentence is read in three breaths: after reading "各位听众" ("Dear listeners"), the reader breathes, and must then read the whole of "现在播送中央气象台今天晚上六点钟发布的天气预报" ("we now broadcast the weather forecast issued by the Central Meteorological Observatory at six o'clock this evening") on one breath. Because this segment is long, a real person usually cannot finish it on a single breath. In such a case, a real person breaks the sentence or text segment once or several times and takes a breath, so as to keep the breathing steady and the articulation clear.

To imitate this real behaviour and further improve the personification of the audio, the segment length a real person can comfortably handle can be determined in advance, and a first threshold and a second threshold configured on that basis. The first threshold can be the maximum segment length a reader can manage on one breath, and the second threshold can be smaller than the first; optionally, the second threshold is a comfortable segment length, that is, a length that does not leave a real reader noticeably short of breath.

Because the first sentence break identifiers are formed from punctuation marks, once they are determined it is possible to check whether the number of characters between two adjacent first sentence break identifiers exceeds the first threshold, that is, whether a break is needed in the middle of the sentence or text segment. If the count exceeds the first threshold, the sentence or segment is too long, and one or more second sentence break identifiers can be added between the two adjacent first sentence break identifiers so that every resulting piece contains fewer characters than the second threshold (a sketch of this splitting follows the example below).

Still with "central people broadcasters". Each listener, now broadcasting a weather forecast for the central weather station, distributed at six o' clock in the evening today. "this is an example, and the position of the first sentence marker determined based on the punctuation mark is as follows:

". "first sentence break sign" of the central people broadcasting station. "first sentence-break logo" for each listener, "first sentence-break logo" for now broadcasting weather forecast released by central weather station at six o' clock in the evening today "

Taking the first threshold value as twenty and the second threshold value as fifteen as an example, the number of characters of the character field of ' the weather forecast released by the central weather station at six o ' clock at night now being broadcasted at present ' formed in a separated mode reaches twenty three and is larger than the first threshold value, so that one or more second sentence break marks can be added into the character field, the character field after being re-divided is smaller than fifteen, and only one second sentence break mark needs to be added after calculation. The position of the second sentence break indicator is as follows:

". "first sentence break sign" of the central people broadcasting station. "first sentence break logo" for each listener, "first sentence break logo" for now broadcasting central weather forecast "for second sentence break logo" released at six o' clock in the evening today "

In this way, four ventilation sounds are formed, which is closer to how a real person would read the sentence.
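The sketch referred to above: an even-split fallback that adds second sentence break identifiers until every piece is shorter than the second threshold. The thresholds default to the twenty/fifteen used in the example; the preferred embodiment (steps S121 to S123 below) splits on semantic segments instead.

```python
import math

def split_long_segment(segment: str, first_threshold: int = 20,
                       second_threshold: int = 15) -> list[str]:
    """If a segment between two first sentence break identifiers exceeds
    first_threshold characters, cut it into roughly equal pieces, each
    shorter than second_threshold; otherwise return it unchanged."""
    if len(segment) <= first_threshold:
        return [segment]
    pieces = math.ceil(len(segment) / (second_threshold - 1))
    size = math.ceil(len(segment) / pieces)
    return [segment[i:i + size] for i in range(0, len(segment), size)]

# The 23-character example segment is split into two pieces (12 + 11 chars).
print(split_long_segment("现在播送中央气象台今天晚上六点钟发布的天气预报"))
```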

In an alternative embodiment, shown in fig. 3, step S12, adding one or more second sentence break identifiers between two adjacent first sentence break identifiers, includes:

S121, performing semantic analysis on the characters between the two adjacent first sentence break identifiers;

S122, dividing the characters between the two adjacent first sentence break identifiers into a plurality of semantic segments according to the result of the semantic analysis; and

S123, adding a second sentence break identifier between at least one pair of adjacent semantic segments.

When the text between two punctuation marks forms a long sentence or segment, it can be analyzed semantically and divided into several segments that are relatively self-contained in meaning, and the second sentence break identifiers are added between those segments. A ventilation sound can then be placed at the corresponding positions while semantic continuity is preserved, avoiding abrupt, jarring breath sounds.
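The patent leaves the semantic-analysis step open; as a crude stand-in, the sketch below uses the third-party jieba word segmenter so that a second sentence break identifier never falls inside a word. The max_len value and the length-based grouping are assumptions.

```python
# jieba is a third-party Chinese word segmenter, used here as a crude
# stand-in for the unspecified semantic analysis step.
import jieba

def split_semantic_segments(text: str, max_len: int = 14) -> list[str]:
    """Group whole words into pieces of at most max_len characters, so that
    breaks land on word boundaries rather than inside words. A single word
    longer than max_len is kept whole."""
    segments: list[str] = []
    current = ""
    for word in jieba.cut(text):
        if current and len(current) + len(word) > max_len:
            segments.append(current)
            current = ""
        current += word
    if current:
        segments.append(current)
    return segments

print(split_semantic_segments("现在播送中央气象台今天晚上六点钟发布的天气预报"))
```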

In some embodiments, determining the sentence break identifiers in the obtained text may further include:

determining the number of characters in each text segment, where a text segment is formed by the characters between two adjacent first sentence break identifiers; and

in a case where N consecutive text segments each contain fewer characters than a third threshold, removing one or more first sentence break identifiers so that each newly formed text segment contains more characters than a fourth threshold,

where N is a positive integer greater than or equal to 2.

In real scenarios, punctuation may split the text into several consecutive short segments, for example a series of parallel words separated by enumeration commas. A real person breaks the sentence at these positions but does not take a breath at every break. Therefore, once the first sentence break identifiers are determined, the number of characters in each segment they delimit can be counted; if N consecutive segments each contain fewer characters than the third threshold, one or more first sentence break identifiers are removed and the segments are re-divided so that each re-divided segment contains more characters than the fourth threshold. This avoids a string of breaths within a short time and further improves the degree of personification of the synthesized target audio.
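A sketch of the merge rule under assumed thresholds (N = 2, third threshold 5, fourth threshold 10): runs of consecutive short segments are joined into longer pieces, so a breath is not inserted at every tiny break.

```python
def merge_short_segments(segments: list[str], n: int = 2,
                         third_threshold: int = 5,
                         fourth_threshold: int = 10) -> list[str]:
    """Remove first sentence break identifiers inside runs of N or more
    consecutive short segments by merging the run into longer pieces."""
    out: list[str] = []
    i = 0
    while i < len(segments):
        # Collect a maximal run of consecutive short segments.
        j = i
        while j < len(segments) and len(segments[j]) < third_threshold:
            j += 1
        run = segments[i:j]
        if len(run) >= n:
            merged = ""
            pieces: list[str] = []
            for seg in run:
                merged += seg
                if len(merged) > fourth_threshold:
                    pieces.append(merged)
                    merged = ""
            if merged:
                # Attach any remainder to the last merged piece so pieces
                # stay as long as possible.
                if pieces:
                    pieces[-1] += merged
                else:
                    pieces.append(merged)
            out.extend(pieces)
            i = j
        elif run:               # fewer than N short segments: keep as-is
            out.extend(run)
            i = j
        else:                   # a long segment: keep as-is
            out.append(segments[i])
            i += 1
    return out

# Example: four short parallel items are merged into one segment.
print(merge_short_segments(["苹果", "香蕉", "橘子", "西瓜", "这是一个较长的文字段落示例"]))
```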

It should be noted that the methods described above for determining sentence break identifiers are only examples. In a specific implementation, any of several methods may be used, and the approach should not be construed as limited to the examples above; for instance, a self-learning model may also be used to analyze the sentence break positions of the text content and thereby determine the sentence break identifiers.

In some embodiments, step S2, determining the first type of characters and the second type of characters in the text based on the sentence break identifiers, includes:

determining the first character after each sentence break identifier in the text as a first-type character; and

determining the remaining characters in the text, other than the first-type characters, as second-type characters.

A real person usually chooses to breathe at sentence breaks, which preserves the continuity and fluency of the reading and avoids abrupt pauses. On this basis, the breathing positions can simply coincide with the sentence break positions, so that the first character after each break is a first-type character and the remaining characters are second-type characters. Where both first and second sentence break identifiers have been determined, the first character after either kind of identifier is treated as a first-type character. Optionally, in addition to the characters immediately following the sentence break identifiers, a few other characters may be selected at random as first-type characters: a real person does not breathe exactly at a sentence break every time, so randomly adding a few breathing positions further improves the degree of personification of the target audio. Of course, the proportion of randomly selected breathing positions should remain small, so as not to noticeably harm the continuity and fluency of the target audio.

In some embodiments, step S3, acquiring the first voice information corresponding to the first type of characters, includes:

S31, acquiring first voice information whose ventilation sound has a random duration.

The breath lengths of a real person follow a somewhat random distribution. Acquiring first voice information whose ventilation sound has a random duration means acquiring first voice information that contains the character pronunciation of the first-type character together with a ventilation sound of random duration placed before the pronunciation in time sequence. The ventilation-sound durations in the target audio are thus randomly distributed, imitating a real person's breathing, raising the level of personification of the target audio and avoiding the listening fatigue caused by overly monotonous breath sounds.

In an alternative embodiment, step S31, acquiring the first voice information whose ventilation sound has a random duration, includes:

acquiring third voice information that comprises the character pronunciation of the first-type character and a ventilation sound preceding the pronunciation; and

intercepting, from the third voice information, the character pronunciation of the first-type character together with a ventilation sound of random duration adjacent to the pronunciation, to form the first voice information whose ventilation sound has a random duration.

For the same character, the announcer can record two audio clips. One clip contains no ventilation sound: before recording the pronunciation, the announcer does not take a breath but reads the character directly, forming the audio without a ventilation sound, that is, the second voice information. The other clip contains the complete ventilation sound followed by the character pronunciation, that is, the third voice information; its audio spectrum is shown as part A in fig. 4. Before pronouncing the character, the announcer first takes a breath and then reads, so the clip contains both the full breath and the pronunciation. Of course, the third voice information may also be synthesized from a complete ventilation sound and a character pronunciation, as long as it contains the complete ventilation sound.

After the third voice information is acquired, a random duration is determined with a randomization algorithm. Based on this random duration, part of the ventilation sound is cut away from the start of the third voice information, keeping the character pronunciation and the portion of the ventilation sound adjacent to it, which forms the first voice information with a ventilation sound of random duration. Part B of fig. 4 shows the audio spectrum of first voice information with a longer ventilation sound, and part C shows the spectrum of first voice information with a shorter ventilation sound. Alternatively, the random duration produced by the randomization algorithm may be used directly as the duration of the ventilation sound to keep.
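A sketch of this truncation variant, assuming the third voice information is a waveform array that begins with a full breath of known length; the sample rate, the minimum kept duration, and the uniform random choice are assumptions.

```python
import random
import numpy as np

def clip_random_breath(third_voice: np.ndarray, breath_samples: int,
                       sample_rate: int = 22050,
                       min_keep_seconds: float = 0.05) -> np.ndarray:
    """third_voice starts with a full breath of `breath_samples` samples,
    followed by the character pronunciation. Keep a randomly chosen tail of
    the breath (the part adjacent to the pronunciation) plus the
    pronunciation, forming first voice information whose breath duration
    is random."""
    min_keep = min(int(min_keep_seconds * sample_rate), breath_samples)
    keep = random.randint(min_keep, breath_samples)
    return third_voice[breath_samples - keep:]

# Usage with a dummy clip: 0.4 s breath + 0.3 s pronunciation at 22.05 kHz.
clip = np.zeros(int(0.7 * 22050), dtype=np.float32)
first_voice = clip_random_breath(clip, breath_samples=int(0.4 * 22050))
print(len(first_voice) / 22050)  # somewhere between roughly 0.35 s and 0.7 s
```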

In another alternative embodiment, step S31, acquiring the first voice information whose ventilation sound has a random duration, includes:

acquiring a ventilation sound of random duration and the character pronunciation of the first-type character; and

forming the first voice information whose ventilation sound has a random duration based on the ventilation sound of random duration and the character pronunciation of the first-type character.

The announcer can record ventilation sounds of various durations directly, or ventilation sounds of various durations can be cut from audio the announcer has recorded, and a ventilation-sound database can be built from them. Once the first-type characters are determined, the character pronunciation of each first-type character is obtained, a ventilation sound is drawn at random from the database, placed at the start of the pronunciation, and the two are concatenated to form first voice information with a ventilation sound of random duration. This reduces the amount of audio data that must be stored, and no randomization algorithm is needed to compute a duration; the audio only needs to be picked at random.
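A sketch of this database variant: draw a breath clip at random from a pre-built bank and prepend it to the character pronunciation. The bank, the waveform format, and the matching sample rate are assumptions.

```python
import random
import numpy as np

def compose_first_voice(breath_bank: list[np.ndarray],
                        pronunciation: np.ndarray) -> np.ndarray:
    """Randomly pick a breath of some pre-recorded duration and concatenate
    it in front of the character pronunciation."""
    breath = random.choice(breath_bank)
    return np.concatenate([breath, pronunciation])

# Dummy bank of breaths of 0.1 s, 0.2 s and 0.3 s at 16 kHz.
bank = [np.zeros(int(d * 16000), dtype=np.float32) for d in (0.1, 0.2, 0.3)]
first_voice = compose_first_voice(bank, np.zeros(4800, dtype=np.float32))
```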

In some embodiments, acquiring the first voice information whose ventilation sound has a random duration includes:

determining a third positional relationship of the first-type character relative to the text;

determining, based on the third positional relationship, a correction coefficient for correcting the probability of the random event; and

acquiring, based on the correction coefficient, the first voice information whose ventilation sound has a random duration.

The third positional relationship describes where the first-type character sits relative to the text, for example at the beginning of a paragraph, in the middle of a paragraph, at the beginning of a sentence, or in the middle of a sentence. The correction coefficient biases the random event, so that the outcome remains randomized but its tendency is steered by the coefficient. For example, the coefficient can be used when randomly choosing the duration of the ventilation sound, so that the chosen duration correlates with the third positional relationship.

A real person's breath lengths are randomized but also under a degree of conscious control. For example, between two paragraphs the pause is long, so the reader has already breathed fully before starting the next paragraph and needs only a short, gentle breath before reading on; the correction coefficient can be adjusted so that shorter ventilation sounds are obtained with higher probability. When the first-type character sits in the middle of a paragraph at the beginning of a sentence, the sentence-break pause is clear, and the reader usually takes one full breath rather than several, with a longer duration and an unhurried breathing action; the coefficient can be adjusted so that longer ventilation sounds are obtained with higher probability. When the first-type character sits in the middle of a sentence, the pause is short and the reader usually takes a short, hurried breath; the coefficient can be adjusted so that shorter ventilation sounds are obtained with higher probability.
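One way to realise the correction coefficient is as position-dependent weights on the random choice of breath duration, as sketched below; the weight values, duration ranges, and position labels are illustrative assumptions that follow the tendencies described above.

```python
import random

# Illustrative correction weights: probability of picking a long versus a
# short breath, keyed by the first-type character's position in the text.
POSITION_WEIGHTS = {
    "paragraph_start": {"long": 0.2, "short": 0.8},  # reader already rested
    "sentence_start":  {"long": 0.7, "short": 0.3},  # one full, unhurried breath
    "mid_sentence":    {"long": 0.2, "short": 0.8},  # short, hurried breath
}

def pick_breath_duration(position: str,
                         long_range=(0.4, 0.7),
                         short_range=(0.1, 0.25)) -> float:
    """Randomly choose a breath duration in seconds, biased by position."""
    weights = POSITION_WEIGHTS.get(position, {"long": 0.5, "short": 0.5})
    style = random.choices(list(weights), weights=list(weights.values()))[0]
    low, high = long_range if style == "long" else short_range
    return random.uniform(low, high)

print(pick_breath_duration("mid_sentence"))
```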

In some embodiments, step S3, acquiring the first voice information corresponding to the first type of characters, includes:

in a case where a first-type character has a plurality of corresponding pieces of first voice information and those pieces have different ventilation sounds, randomly acquiring one of them.

For the same character, several different pieces of first voice information can be prepared in advance, each with a different ventilation sound: the breaths may differ in duration or in manner, for example hurried, gentle, with heavier friction noise, or at lower volume. This randomizes both the breath duration and the breath style, imitating a real person's naturally varied breathing and improving the degree of personification of the target audio. Moreover, because the pieces of first voice information are pre-made, synthesis only has to pick one at random and does not need to process the acquired audio, which reduces the amount of data processing during synthesis.
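A sketch of this pre-made-variant approach: each first-type character maps to several ready-made first-voice clips that differ only in their breath, and synthesis simply picks one at random. The file names and the bank layout are hypothetical.

```python
import random

# Hypothetical bank: several pre-made first-voice clips per character,
# differing in breath duration and style (hurried, gentle, quiet).
first_voice_bank: dict[str, list[str]] = {
    "现": ["xian_breath_hurried.wav", "xian_breath_gentle.wav",
           "xian_breath_quiet.wav"],
}

def pick_premade_first_voice(char: str) -> str:
    """Return one of the pre-made clips for this character at random; no
    further audio processing is needed at synthesis time."""
    return random.choice(first_voice_bank[char])

print(pick_premade_first_voice("现"))
```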

In some embodiments, step S3, acquiring the first voice information corresponding to the first type of characters, includes:

acquiring emotion information corresponding to each first-type character; and

acquiring, based on the emotion information, first voice information having a ventilation sound that corresponds to the emotion information.

When a real person reads aloud, breathing is not only for replenishing air; to some extent it is also a means of expressing emotion. A sharp intake of breath produces a strong friction noise and can convey urgency, tension, or strong feeling, while a gentle breath suggests that the reader is calm and composed. Choosing the ventilation sound appropriately therefore not only improves the personification of the target audio but also enriches its expressiveness.

Based on the determined first-type characters, the emotion conveyed by the text around each first-type character can be determined to form emotion information. Optionally, the text may be analyzed with an emotion analysis model, or the emotion information corresponding to each first-type character may be derived from semantic analysis. The emotion information indicates the feeling or mood to be conveyed.

In practice, ventilation sounds that convey various emotions can be recorded in advance and labelled with the emotions they convey, and the ventilation sound matching the emotion information is then retrieved. For example, when a lively or excited mood is to be conveyed, the breath usually needs to be short and urgent.

Alternatively, only a standard ventilation sound is recorded; after the emotion information is obtained, the ventilation sound is processed to adjust its audio parameters, forming a ventilation sound that corresponds to the emotion information. The audio parameters may include, for example, loudness, volume, duration, and frequency. A sharp, urgent breath can be formed, for instance, by raising the volume of the ventilation sound and compressing its length.
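A sketch of this second option, deriving emotion-coloured breaths from a single standard recording by adjusting volume and duration; the gain values, the halving of length, and the emotion labels are assumptions.

```python
import numpy as np

def style_breath(standard_breath: np.ndarray, emotion: str) -> np.ndarray:
    """Adjust the audio parameters of a standard breath recording to match
    the emotion information."""
    if emotion in ("urgent", "tense", "excited"):
        # Sharper breath: keep the later half (shorter) and raise the volume.
        shortened = standard_breath[len(standard_breath) // 2:]
        return np.clip(shortened * 1.5, -1.0, 1.0)
    # Calm or neutral emotion: full length, slightly softer.
    return standard_breath * 0.8

# Usage with a dummy 0.5 s breath at 16 kHz.
sharp = style_breath(np.random.uniform(-0.3, 0.3, 8000).astype(np.float32), "urgent")
```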

Referring to fig. 5, an embodiment of the present application further provides an electronic device, including:

a first determining module 10, configured to determine sentence break identifiers in the obtained text;

a second determining module 20, configured to determine a first type of characters and a second type of characters in the text based on the sentence break identifiers, where the first type of characters has a first positional relationship with the sentence break identifiers and the second type of characters has a second positional relationship with the sentence break identifiers;

an acquisition module 30, configured to acquire first voice information corresponding to the first type of characters and second voice information corresponding to the second type of characters, where the first voice information includes the character pronunciation of a first-type character and a ventilation sound preceding the pronunciation in time sequence, and the second voice information includes the character pronunciation of a second-type character; and

a synthesis module 40, configured to synthesize target audio corresponding to the text based on the first voice information and the second voice information.

In some embodiments, the first determining module 10 is specifically configured to:

determine punctuation marks in the text and use the punctuation marks as first sentence break identifiers.

In some embodiments, the first determining module 10 is further configured to:

in a case where the number of characters between two adjacent first sentence break identifiers is greater than a first threshold, add one or more second sentence break identifiers between the two adjacent first sentence break identifiers, so that the number of characters between any pair of adjacent sentence break identifiers is less than a second threshold.

In some embodiments, the first determining module 10 is specifically configured to:

perform semantic analysis on the characters between two adjacent first sentence break identifiers;

divide the characters between the two adjacent first sentence break identifiers into a plurality of semantic segments according to the result of the semantic analysis; and

add a second sentence break identifier between at least one pair of adjacent semantic segments.

In some embodiments, the second determining module 20 is specifically configured to:

determine the first character after each sentence break identifier in the text as a first-type character; and

determine the remaining characters in the text, other than the first-type characters, as second-type characters.

In some embodiments, the acquisition module 30 is specifically configured to:

acquire first voice information whose ventilation sound has a random duration.

In some embodiments, the acquisition module 30 is specifically configured to:

acquire third voice information comprising the character pronunciation of the first-type character and a ventilation sound preceding the pronunciation, and intercept, from the third voice information, the character pronunciation of the first-type character together with a ventilation sound of random duration adjacent to the pronunciation, to form the first voice information whose ventilation sound has a random duration; or

acquire a ventilation sound of random duration and the character pronunciation of the first-type character, and form the first voice information whose ventilation sound has a random duration based on the ventilation sound of random duration and the character pronunciation of the first-type character.

In some embodiments, the acquisition module 30 is specifically configured to:

in a case where a first-type character has a plurality of corresponding pieces of first voice information and those pieces have different ventilation sounds, randomly acquire one of them.

In some embodiments, the acquisition module 30 is specifically configured to:

acquire emotion information corresponding to each first-type character; and

acquire, based on the emotion information, first voice information having a ventilation sound that corresponds to the emotion information.

Referring to fig. 6, an electronic device according to an embodiment of the present application further includes at least a memory 220 and a processor 210. The memory 220 stores a program, and when executing the program stored in the memory 220, the processor 210 implements the method of any of the embodiments above.

The above embodiments are only exemplary embodiments of the present application, and are not intended to limit the present application, and the protection scope of the present application is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present application and such modifications and equivalents should also be considered to be within the scope of the present application.
