Speech synthesis system evaluation method and device, readable storage medium and terminal equipment

Document No.: 50773. Publication date: 2021-09-28.

Reading note: this technology, "Speech synthesis system evaluation method and device, readable storage medium and terminal equipment", was designed by 苏雪琦 and 王健宗 on 2021-06-25. Main content: The invention belongs to the technical field of natural language processing, and particularly relates to a speech synthesis system evaluation method and device, a computer-readable storage medium and terminal equipment. The method comprises the following steps: inputting a preset text sequence into each of a plurality of speech synthesis systems to be evaluated, and acquiring the output speech sequence of each speech synthesis system; acquiring a reference speech sequence corresponding to the text sequence; calculating, according to a plurality of preset evaluation dimensions, the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence; and selecting the speech synthesis system with the minimum overall deviation distance from the reference speech sequence as the preferred speech synthesis system, and executing a speech synthesis task by using the preferred speech synthesis system. The invention improves evaluation accuracy while also improving evaluation efficiency.

1. A method for speech synthesis system evaluation, comprising:

respectively inputting preset text sequences into a plurality of different voice synthesis systems to be evaluated, and respectively acquiring output voice sequences of the voice synthesis systems;

acquiring a reference voice sequence corresponding to the text sequence;

respectively calculating the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence according to a plurality of preset evaluation dimensions;

and selecting the speech synthesis system with the minimum overall deviation distance from the reference speech sequence as a preferred speech synthesis system, and executing a speech synthesis task by using the preferred speech synthesis system.

2. The method for evaluating speech synthesis systems according to claim 1, wherein the step of calculating the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence according to a plurality of preset evaluation dimensions comprises:

respectively calculating deviation distances of a target voice sequence and the reference voice sequence in each evaluation dimension, wherein the target voice sequence is an output voice sequence of any one voice synthesis system;

and calculating the overall deviation distance between the target speech sequence and the reference speech sequence according to the deviation distances of the target speech sequence and the reference speech sequence in each evaluation dimension.

3. The method for evaluating a speech synthesis system according to claim 2, wherein the evaluation dimensions include feature vector, duration, and sound intensity;

the calculating the deviation distance of the target speech sequence and the reference speech sequence in each evaluation dimension respectively comprises:

calculating the deviation distance of the target voice sequence and the reference voice sequence on the evaluation dimension of the feature vector;

calculating the deviation distance of the target voice sequence and the reference voice sequence on the evaluation dimension of duration;

and calculating the deviation distance of the target voice sequence and the reference voice sequence in the evaluation dimension of the sound intensity.

4. The method for evaluating a speech synthesis system according to claim 3, wherein the calculating a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of feature vectors comprises:

calculating the deviation distance of the target speech sequence and the reference speech sequence in the evaluation dimension of the feature vector according to the following formula:

where $k$ is the syllable index in the reference speech sequence, $1 \le k \le K$, $K$ is the total number of syllables in the reference speech sequence, $P_k^s$ is the feature vector of the $k$-th syllable in the target speech sequence, $P_k^n$ is the feature vector of the $k$-th syllable in the reference speech sequence, $\mathrm{DTW}$ is the dynamic time warping function, $M_k$ is the dimension of the feature vector of the $k$-th syllable of the target speech sequence, $N_k$ is the dimension of the feature vector of the $k$-th syllable of the reference speech sequence, and $D_p$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the feature vector.

5. The method for evaluating a speech synthesis system according to claim 3, wherein the calculating a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of duration comprises:

calculating the deviation distance of the target speech sequence and the reference speech sequence in the evaluation dimension of the duration according to the following formula:

where $k$ is the syllable index in the reference speech sequence, $1 \le k \le K$, $K$ is the total number of syllables in the reference speech sequence, $T_k^s$ is the duration of the $k$-th syllable in the target speech sequence, $T_k^n$ is the duration of the $k$-th syllable in the reference speech sequence, $T_a$ is the average of the durations of the $K$ syllables in the reference speech sequence, and $D_t$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of duration.

6. The method for evaluating a speech synthesis system according to claim 3, wherein the calculating a deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the sound intensity comprises:

calculating the deviation distance of the target speech sequence and the reference speech sequence in the evaluation dimension of the sound intensity according to the following formula:

where $k$ is the syllable index in the reference speech sequence, $1 \le k \le K$, $K$ is the total number of syllables in the reference speech sequence, $E_k^s$ is the sound intensity of the $k$-th syllable in the target speech sequence, $E_k^n$ is the sound intensity of the $k$-th syllable in the reference speech sequence, $E_a$ is the mean of the sound intensities of the $K$ syllables in the reference speech sequence, and $D_e$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of sound intensity.

7. The method for evaluating a speech synthesis system according to any one of claims 2 to 6, wherein the calculating an overall deviation distance between the target speech sequence and the reference speech sequence according to the deviation distances of the target speech sequence and the reference speech sequence in each evaluation dimension comprises:

calculating an overall deviation distance between the target speech sequence and the reference speech sequence according to:

where $n$ is the index of the evaluation dimension, $1 \le n \le N$, $N$ is the total number of evaluation dimensions, $D_n$ is the deviation distance between the target speech sequence and the reference speech sequence in the $n$-th evaluation dimension, $\omega_n$ is the weight of the $n$-th evaluation dimension, and $D$ is the overall deviation distance between the target speech sequence and the reference speech sequence.

8. A speech synthesis system evaluation apparatus, comprising:

the interactive module is used for respectively inputting the preset text sequence into a plurality of different voice synthesis systems to be evaluated and respectively acquiring the output voice sequence of each voice synthesis system;

a reference voice sequence acquisition module for acquiring a reference voice sequence corresponding to the text sequence;

the overall deviation distance calculation module is used for calculating the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence according to a plurality of preset evaluation dimensions;

and the speech synthesis system selection module is used for selecting the speech synthesis system with the minimum overall deviation distance from the reference speech sequence as the preferred speech synthesis system and executing a speech synthesis task by using the preferred speech synthesis system.

9. A computer-readable storage medium storing computer-readable instructions, which when executed by a processor implement the steps of a method for speech synthesis system evaluation according to any one of claims 1 to 7.

10. A terminal device comprising a memory, a processor and computer readable instructions stored in the memory and executable on the processor, characterized in that the processor, when executing the computer readable instructions, implements the steps of the speech synthesis system evaluation method according to any one of claims 1 to 7.

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a speech synthesis system evaluation method and device, a computer readable storage medium and terminal equipment.

Background

With the development of technology, a variety of speech synthesis systems are now available for text-to-speech (TTS) conversion. Different speech synthesis systems often differ considerably in performance, so a suitable system must be selected according to actual conditions to execute a speech synthesis task. At present, however, the quality of a speech synthesis system can only be judged from the subjective impression of users, which is both inefficient and inaccurate.

Disclosure of Invention

In view of this, embodiments of the present invention provide a speech synthesis system evaluation method, apparatus, computer-readable storage medium, and terminal device, so as to solve the problems of low efficiency and low accuracy of the existing speech synthesis system evaluation method.

A first aspect of an embodiment of the present invention provides a speech synthesis system evaluation method, which may include:

respectively inputting preset text sequences into a plurality of different voice synthesis systems to be evaluated, and respectively acquiring output voice sequences of the voice synthesis systems;

acquiring a reference voice sequence corresponding to the text sequence;

respectively calculating the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence according to a plurality of preset evaluation dimensions;

and selecting the speech synthesis system with the minimum overall deviation distance from the reference speech sequence as a preferred speech synthesis system, and executing a speech synthesis task by using the preferred speech synthesis system.
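As a non-limiting illustration, the four steps above can be sketched in Python as follows. The function name `select_preferred_system`, the representation of each speech synthesis system as a callable, and the pluggable `overall_distance` function are all assumptions made for this sketch and are not part of the disclosure.

```python
from typing import Callable, Dict, Sequence

# Hypothetical types: a "system" maps a text sequence to a speech sequence,
# represented here as a plain sequence of numbers.
SpeechSequence = Sequence[float]
SynthesisSystem = Callable[[str], SpeechSequence]


def select_preferred_system(
    text: str,
    systems: Dict[str, SynthesisSystem],
    reference: SpeechSequence,
    overall_distance: Callable[[SpeechSequence, SpeechSequence], float],
) -> str:
    """Return the name of the system whose output is closest to the reference."""
    if not systems:
        raise ValueError("at least one speech synthesis system is required")
    # Feed the same text to every system and score each output
    # against the reference speech sequence.
    distances = {
        name: overall_distance(synthesize(text), reference)
        for name, synthesize in systems.items()
    }
    # The preferred system is the one with the minimum overall deviation distance.
    return min(distances, key=distances.get)
```

Here `overall_distance` stands in for whatever multi-dimensional deviation measure is chosen; passing it in as a parameter keeps the selection step independent of how the distance is computed.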

In a specific implementation of the first aspect, the calculating, according to a plurality of preset evaluation dimensions, an overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence respectively may include:

respectively calculating deviation distances of a target voice sequence and the reference voice sequence in each evaluation dimension, wherein the target voice sequence is an output voice sequence of any one voice synthesis system;

and calculating the overall deviation distance between the target speech sequence and the reference speech sequence according to the deviation distances of the target speech sequence and the reference speech sequence in each evaluation dimension.

In a specific implementation of the first aspect, the evaluation dimensions may include feature vector, duration, and sound intensity;

the calculating the deviation distances of the target speech sequence and the reference speech sequence in each evaluation dimension respectively may include:

calculating the deviation distance of the target voice sequence and the reference voice sequence on the evaluation dimension of the feature vector;

calculating the deviation distance of the target voice sequence and the reference voice sequence on the evaluation dimension of duration;

and calculating the deviation distance of the target voice sequence and the reference voice sequence in the evaluation dimension of the sound intensity.

In a specific implementation of the first aspect, the calculating a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of a feature vector may include:

calculating the deviation distance of the target speech sequence and the reference speech sequence in the evaluation dimension of the feature vector according to the following formula:

where $k$ is the syllable index in the reference speech sequence, $1 \le k \le K$, $K$ is the total number of syllables in the reference speech sequence, $P_k^s$ is the feature vector of the $k$-th syllable in the target speech sequence, $P_k^n$ is the feature vector of the $k$-th syllable in the reference speech sequence, $\mathrm{DTW}$ is the dynamic time warping function, $M_k$ is the dimension of the feature vector of the $k$-th syllable of the target speech sequence, $N_k$ is the dimension of the feature vector of the $k$-th syllable of the reference speech sequence, and $D_p$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the feature vector.

In a specific implementation of the first aspect, the calculating a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of duration may include:

calculating the deviation distance of the target speech sequence and the reference speech sequence in the evaluation dimension of the duration according to the following formula:

where $T_k^s$ is the duration of the $k$-th syllable in the target speech sequence, $T_k^n$ is the duration of the $k$-th syllable in the reference speech sequence, $T_a$ is the average of the durations of the $K$ syllables in the reference speech sequence, and $D_t$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of duration.

In a specific implementation of the first aspect, the calculating a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of the sound intensity may include:

calculating the deviation distance of the target speech sequence and the reference speech sequence in the evaluation dimension of the sound intensity according to the following formula:

where $E_k^s$ is the sound intensity of the $k$-th syllable in the target speech sequence, $E_k^n$ is the sound intensity of the $k$-th syllable in the reference speech sequence, $E_a$ is the mean of the sound intensities of the $K$ syllables in the reference speech sequence, and $D_e$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of sound intensity.

In a specific implementation of the first aspect, the calculating an overall deviation distance between the target speech sequence and the reference speech sequence according to the deviation distances of the target speech sequence and the reference speech sequence in each evaluation dimension may include:

calculating an overall deviation distance between the target speech sequence and the reference speech sequence according to:

where $n$ is the index of the evaluation dimension, $1 \le n \le N$, $N$ is the total number of evaluation dimensions, $D_n$ is the deviation distance between the target speech sequence and the reference speech sequence in the $n$-th evaluation dimension, $\omega_n$ is the weight of the $n$-th evaluation dimension, and $D$ is the overall deviation distance between the target speech sequence and the reference speech sequence.
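As a non-limiting illustration of combining per-dimension deviation distances with per-dimension weights, the following Python sketch assumes that the overall deviation distance is a weighted sum of the per-dimension distances. That form is an assumption suggested by the symbol definitions above, not a reproduction of the formula, and the function name `overall_deviation` is hypothetical.

```python
from typing import Sequence


def overall_deviation(distances: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-dimension deviation distances D_n with weights omega_n.

    Assumption: the combination is a plain weighted sum over the N dimensions.
    """
    if len(distances) != len(weights):
        raise ValueError("one weight is required per evaluation dimension")
    return sum(w * d for w, d in zip(weights, distances))


# Example with three dimensions (feature vector, duration, sound intensity);
# the weights are illustrative values only.
d_overall = overall_deviation([1.0, 2.0, 3.0], [0.5, 0.25, 0.25])
```

Under this assumption, a larger weight makes the corresponding evaluation dimension contribute more to the ranking of the speech synthesis systems.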

A second aspect of an embodiment of the present invention provides a speech synthesis system evaluation apparatus, which may include:

the interactive module is used for respectively inputting the preset text sequence into a plurality of different voice synthesis systems to be evaluated and respectively acquiring the output voice sequence of each voice synthesis system;

a reference voice sequence acquisition module for acquiring a reference voice sequence corresponding to the text sequence;

the overall deviation distance calculation module is used for calculating the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence according to a plurality of preset evaluation dimensions;

and the speech synthesis system selection module is used for selecting the speech synthesis system with the minimum overall deviation distance from the reference speech sequence as the preferred speech synthesis system and executing a speech synthesis task by using the preferred speech synthesis system.

In a specific implementation of the second aspect, the overall deviation distance calculation module may include:

the first calculation submodule is used for respectively calculating the deviation distance between a target voice sequence and the reference voice sequence in each evaluation dimension, and the target voice sequence is an output voice sequence of any one voice synthesis system;

and the second calculation submodule is used for calculating the overall deviation distance between the target speech sequence and the reference speech sequence according to the deviation distances of the target speech sequence and the reference speech sequence in each evaluation dimension.

In a specific implementation of the second aspect, the evaluation dimensions may include feature vector, duration, and sound intensity;

the first calculation sub-module may include:

the first calculation unit is used for calculating the deviation distance of the target voice sequence and the reference voice sequence on the evaluation dimension of the feature vector;

the second calculation unit is used for calculating the deviation distance of the target voice sequence and the reference voice sequence on the evaluation dimension of duration;

and the third calculating unit is used for calculating the deviation distance of the target voice sequence and the reference voice sequence on the evaluation dimension of the sound intensity.

In a specific implementation of the second aspect, the first calculating unit is specifically configured to calculate a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of the feature vector according to the following formula:

where $k$ is the syllable index in the reference speech sequence, $1 \le k \le K$, $K$ is the total number of syllables in the reference speech sequence, $P_k^s$ is the feature vector of the $k$-th syllable in the target speech sequence, $P_k^n$ is the feature vector of the $k$-th syllable in the reference speech sequence, $\mathrm{DTW}$ is the dynamic time warping function, $M_k$ is the dimension of the feature vector of the $k$-th syllable of the target speech sequence, $N_k$ is the dimension of the feature vector of the $k$-th syllable of the reference speech sequence, and $D_p$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the feature vector.

In a specific implementation of the second aspect, the second calculating unit is specifically configured to calculate a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of duration according to the following formula:

where $T_k^s$ is the duration of the $k$-th syllable in the target speech sequence, $T_k^n$ is the duration of the $k$-th syllable in the reference speech sequence, $T_a$ is the average of the durations of the $K$ syllables in the reference speech sequence, and $D_t$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of duration.

In a specific implementation of the second aspect, the third calculating unit is specifically configured to calculate a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of the sound intensity according to the following formula:

where $E_k^s$ is the sound intensity of the $k$-th syllable in the target speech sequence, $E_k^n$ is the sound intensity of the $k$-th syllable in the reference speech sequence, $E_a$ is the mean of the sound intensities of the $K$ syllables in the reference speech sequence, and $D_e$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of sound intensity.

In a specific implementation of the second aspect, the second calculation sub-module is specifically configured to calculate the overall deviation distance between the target speech sequence and the reference speech sequence according to the following formula:

where $n$ is the index of the evaluation dimension, $1 \le n \le N$, $N$ is the total number of evaluation dimensions, $D_n$ is the deviation distance between the target speech sequence and the reference speech sequence in the $n$-th evaluation dimension, $\omega_n$ is the weight of the $n$-th evaluation dimension, and $D$ is the overall deviation distance between the target speech sequence and the reference speech sequence.

A third aspect of the embodiments of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps of any one of the above-mentioned speech synthesis system evaluation methods.

A fourth aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the above methods for evaluating a speech synthesis system when executing the computer program.

A fifth aspect of the embodiments of the present invention provides a computer program product, which, when running on a terminal device, causes the terminal device to execute any of the steps of the above-mentioned speech synthesis system evaluation method.

Compared with the prior art, the embodiment of the invention has the following beneficial effects: a preset text sequence is input into each of a plurality of speech synthesis systems to be evaluated, and the output speech sequence of each speech synthesis system is obtained; a reference speech sequence corresponding to the text sequence is acquired; the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence is calculated according to a plurality of preset evaluation dimensions; and the speech synthesis system with the minimum overall deviation distance from the reference speech sequence is selected as the preferred speech synthesis system and used to execute the speech synthesis task. According to the embodiment of the invention, the quality of a speech synthesis system is evaluated by calculating the overall deviation distance between its output speech sequence and the reference speech sequence, without depending on the subjective impression of users, and the evaluation takes a plurality of different evaluation dimensions into account, so that both evaluation efficiency and evaluation accuracy are improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of an exemplary implementation of an embodiment of the present invention;

FIG. 2 is a flow chart of an embodiment of a method for speech synthesis system evaluation in accordance with an embodiment of the present invention;

FIG. 3 is a schematic flow chart of calculating the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence according to a plurality of preset evaluation dimensions;

FIG. 4 is a block diagram of an embodiment of an evaluation apparatus for a speech synthesis system in an embodiment of the present invention;

FIG. 5 is a schematic block diagram of a terminal device in an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides a speech synthesis system evaluation method that does not depend on the subjective impression of users. It evaluates the quality of a speech synthesis system by calculating the overall deviation distance between the output speech sequence of the speech synthesis system and a reference speech sequence, and takes a plurality of different evaluation dimensions into account during the evaluation, thereby improving both evaluation efficiency and evaluation accuracy.

FIG. 1 is a schematic diagram of a specific implementation environment according to an embodiment of the present invention. The terminal device is the execution subject of the embodiment of the present invention and is used for evaluating the speech synthesis systems; speech synthesis system 1, speech synthesis system 2, …, and speech synthesis system M are the different speech synthesis systems to be evaluated. The specific number of speech synthesis systems may be set according to actual situations and is not particularly limited in this embodiment of the present application. A wired or wireless connection for information exchange exists between the terminal device and each speech synthesis system.

Referring to FIG. 2, an embodiment of a method for evaluating a speech synthesis system according to an embodiment of the present invention may include:

step S201, inputting the preset text sequence into a plurality of different speech synthesis systems to be evaluated, and obtaining an output speech sequence of each speech synthesis system.

The text sequence may be set according to the actual situation. For example, the text sequence may be set to "a gust of wind blows, a dragonfly flowers several times, so that the person is more fascinating, and is like a girl wearing a skirt dancing a beautiful dance". Of course, another text sequence may also be used according to the actual situation, which is not specifically limited in this embodiment of the present application.

The terminal device inputs the text sequence into each speech synthesis system to be evaluated, each speech synthesis system performs TTS processing on the text sequence and outputs a corresponding speech sequence, and the speech sequence is recorded as an output speech sequence. Generally, because TTS processing methods adopted by respective speech synthesis systems are different, output speech sequences acquired by a terminal device from the respective speech synthesis systems are different from one another.

Step S202, a reference voice sequence corresponding to the text sequence is obtained.

In a specific implementation of the embodiment of the present invention, a real person may read the text sequence aloud in advance, and the reading may be recorded to obtain a speech sequence corresponding to the text sequence. This speech sequence is referred to as the reference speech sequence and serves as the reference for evaluating the speech synthesis systems. The smaller the difference between the output speech sequence of a speech synthesis system and the reference speech sequence, the more reliable the speech synthesis system; conversely, the larger the difference, the less reliable the speech synthesis system.

The pre-recorded reference voice sequence can be stored in a designated storage medium, and when the voice synthesis system evaluation is required, the terminal device can acquire the reference voice sequence from the storage medium.

Step S203, calculating the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence according to a plurality of preset evaluation dimensions.

Taking the output speech sequence of any one speech synthesis system (denoted as the target speech sequence) as an example, the process of calculating the overall deviation distance between the target speech sequence and the reference speech sequence may specifically include the process shown in FIG. 3:

step S2031, calculating deviation distances of the target voice sequence and the reference voice sequence in each evaluation dimension respectively.

In a specific implementation of the embodiment of the present invention, a Dynamic Time Warping (DTW) algorithm may be used to calculate the deviation distance. A segment of speech comprises a plurality of syllables, and each syllable carries fundamental frequency information. Both synthesized speech and natural speech generally consist of syllables arranged in sequence, each with its own fundamental frequency sequence; because the syllables differ in length, the lengths of the corresponding fundamental frequency sequences also differ significantly. The Euclidean distance is usually used to compare acoustic parameters, and the distance calculated by the DTW algorithm is the square of the Euclidean distance between the fundamental frequency sequences being compared. The distance calculated by the DTW algorithm is therefore further processed to obtain the fundamental frequency distance between the speech sequences.
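As a non-limiting illustration, a textbook DTW computation over two fundamental frequency sequences of different lengths can be sketched in Python as follows. The squared difference is used as the local cost, in line with the squared Euclidean distance mentioned above; the function name `dtw_distance` and the unconstrained step pattern are assumptions of this sketch, and the further processing of the result is not shown.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Accumulated squared-difference cost of the best DTW alignment of a and b."""
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[len(a)][len(b)]
```

Because the alignment may repeat elements of either sequence, two sequences that differ only in how long a value is held, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, have a DTW distance of zero.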

Although the DTW algorithm can reflect the difference between speech sequences in fundamental frequency and thereby output an objective evaluation result, it cannot judge the differences in duration, pauses and sound intensity that arise when the same pronunciation occurs in different contexts. Therefore, in order to provide a more objective evaluation result, in another specific implementation of the embodiment of the present invention, processing of audio features may be added on the basis of the DTW algorithm, and the evaluation may be performed from multiple dimensions, such as feature vector, duration and sound intensity, so as to further improve the accuracy and reliability of the evaluation result.

Specifically, the target speech sequence and the reference speech sequence may be preprocessed first, and the preprocessing process may include, but is not limited to, framing, pre-emphasis, and endpoint Detection (VAD). Wherein, the function of framing is to divide each syllable from the voice sequence; the pre-emphasis has the effects of filtering low-frequency interference, carrying out frequency spectrum promotion on a high-frequency part, and playing the effects of suppressing random noise and promoting the energy of an unvoiced part; the end point detection is used for identifying and eliminating a long mute period from a sound signal stream to obtain effective voice.
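By way of example only, the three preprocessing operations might be sketched in Python as below. Everything specific in this sketch is an assumption rather than part of the embodiment: the pre-emphasis coefficient of 0.97, the use of conventional fixed-length framing in place of syllable segmentation, the energy-threshold form of endpoint detection, and the function names themselves.

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    head = [signal[0]]
    tail = [signal[i] - alpha * signal[i - 1] for i in range(1, len(signal))]
    return head + tail


def frame_signal(signal, frame_len, hop):
    """Split a signal into fixed-length frames; an incomplete tail is dropped."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def drop_silent_frames(frames, threshold):
    """Energy-based endpoint detection: keep frames at or above the threshold."""
    def mean_energy(frame):
        return sum(x * x for x in frame) / len(frame)

    return [f for f in frames if mean_energy(f) >= threshold]
```

A typical use would frame a signal with `frame_signal(samples, 400, 160)` (25 ms frames with a 10 ms hop at an assumed 16 kHz sampling rate) and then discard low-energy frames with `drop_silent_frames`; a suitable threshold depends on the recording conditions.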

After preprocessing, the calculation of the deviation distance can be performed from various evaluation dimensions such as feature vectors, duration and sound intensity.

When calculating the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the feature vector, the calculation of Mel Frequency Cepstrum Coefficient (MFCC) may be performed on each syllable of the target speech sequence and the reference speech sequence, and the calculation result may be used as the feature vector of each syllable. Then, the deviation distance of the target speech sequence and the reference speech sequence in the evaluation dimension of the feature vector is calculated according to the following formula:

where $k$ is the syllable index in the reference speech sequence, $1 \le k \le K$, $K$ is the total number of syllables in the reference speech sequence, $P_s^k$ is the feature vector of the $k$-th syllable in the target speech sequence, $P_n^k$ is the feature vector of the $k$-th syllable in the reference speech sequence, $\mathrm{DTW}$ is the dynamic time warping function, $M_k$ is the dimension of the feature vector of the $k$-th syllable of the target speech sequence, $N_k$ is the dimension of the feature vector of the $k$-th syllable of the reference speech sequence, and $D_p$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the feature vector.
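The formula itself is not reproduced in this text, so the following Python sketch shows only one plausible shape of such a computation and should not be read as the formula of the embodiment. It assumes that each syllable is represented by a sequence of frame-level MFCC vectors, that $M_k$ and $N_k$ are the lengths of those sequences, and that the per-syllable DTW cost is square-rooted, normalized by $M_k + N_k$, and averaged over the $K$ syllables. The names `dtw_vectors` and `feature_deviation` are hypothetical.

```python
import math


def dtw_vectors(seq_a, seq_b):
    """DTW cost between two non-empty sequences of equal-width vectors."""
    m, n = len(seq_a), len(seq_b)
    if m == 0 or n == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Local cost: squared Euclidean distance between two frame vectors.
            local = sum((x - y) ** 2 for x, y in zip(seq_a[i - 1], seq_b[j - 1]))
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[m][n]


def feature_deviation(target_syllables, reference_syllables):
    """Assumed form: mean over syllables of sqrt(DTW) / (M_k + N_k)."""
    k_total = len(reference_syllables)
    if k_total == 0 or len(target_syllables) != k_total:
        raise ValueError("both inputs must contain the same K >= 1 syllables")
    total = 0.0
    for p_s, p_n in zip(target_syllables, reference_syllables):
        total += math.sqrt(dtw_vectors(p_s, p_n)) / (len(p_s) + len(p_n))
    return total / k_total
```

The normalization by sequence length is one way to keep long syllables from dominating the average; other normalizations are equally conceivable.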

When calculating the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the duration, the duration of each syllable of the target speech sequence and the reference speech sequence may be first calculated, and then the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the duration may be calculated according to the following formula:

where $T_s^k$ is the duration of the $k$-th syllable in the target speech sequence, $T_n^k$ is the duration of the $k$-th syllable in the reference speech sequence, $T_a$ is the average of the durations of the $K$ syllables in the reference speech sequence, and $D_t$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of duration.

When calculating the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the sound intensity, the sound intensity of each syllable of the target speech sequence and the reference speech sequence may be first calculated, and then the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the sound intensity may be calculated according to the following formula:

where $E_s^k$ is the sound intensity of the $k$-th syllable in the target speech sequence, $E_n^k$ is the sound intensity of the $k$-th syllable in the reference speech sequence, $E_a$ is the average of the sound intensities of the $K$ syllables in the reference speech sequence, and $D_e$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of sound intensity.

Step S2032, calculating the overall deviation distance between the target speech sequence and the reference speech sequence according to the deviation distances between the target speech sequence and the reference speech sequence in each evaluation dimension.
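The duration and sound intensity dimensions are both described in terms of per-syllable values and the mean of the reference values. Because the formulas are not reproduced in this text, the Python sketch below merely assumes a mean absolute per-syllable difference normalized by the reference mean ($T_a$ or $E_a$); the function name `normalized_deviation` and this particular form are illustrative assumptions, not the formulas of the embodiment.

```python
def normalized_deviation(target_values, reference_values):
    """Assumed form: (1/K) * sum(|x_s^k - x_n^k|) / mean(reference values)."""
    k_total = len(reference_values)
    if k_total == 0 or len(target_values) != k_total:
        raise ValueError("both inputs must contain the same K >= 1 values")
    reference_mean = sum(reference_values) / k_total
    if reference_mean == 0:
        raise ValueError("reference mean must be non-zero")
    absolute_sum = sum(
        abs(s - n) for s, n in zip(target_values, reference_values)
    )
    return absolute_sum / (k_total * reference_mean)
```

The same function could be applied to a list of syllable durations to obtain a value playing the role of $D_t$, and to a list of syllable sound intensities to obtain a value playing the role of $D_e$. Dividing by the reference mean makes the result unit-free, so the two dimensions become comparable.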

Specifically, the overall deviation distance between the target speech sequence and the reference speech sequence may be calculated according to the following equation:

where $n$ is the index of the evaluation dimension, $1 \le n \le N$, $N$ is the total number of evaluation dimensions, $D_n$ is the deviation distance between the target speech sequence and the reference speech sequence in the $n$-th evaluation dimension, $\omega_n$ is the weight of the $n$-th evaluation dimension, whose specific value can be set according to the actual situation and is not specifically limited in the embodiments of the present application, and $D$ is the overall deviation distance between the target speech sequence and the reference speech sequence.
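The definitions above suggest that the per-dimension distances are combined using the weights $\omega_n$. As a minimal sketch, assuming the combination is a plain weighted sum $D = \sum_{n=1}^{N} \omega_n D_n$ (the equation is not reproduced in this text), the calculation might look as follows in Python. The function name `overall_deviation` and the example weights are hypothetical.

```python
def overall_deviation(deviations, weights):
    """Assumed form: weighted sum of the per-dimension deviation distances."""
    if not deviations or len(deviations) != len(weights):
        raise ValueError("need one weight per evaluation dimension")
    return sum(w * d for w, d in zip(weights, deviations))
```

For example, with distances `[1.0, 2.0, 4.0]` for the feature vector, duration, and sound intensity dimensions and illustrative weights `[0.5, 0.25, 0.25]`, the sketch yields an overall distance of `2.0`.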

According to the process shown in fig. 3, the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence can be obtained by traversing the output speech sequence of each speech synthesis system.

Step S204, selecting the speech synthesis system with the minimum overall deviation distance from the reference speech sequence as the preferred speech synthesis system, and executing speech synthesis tasks by using the preferred speech synthesis system.

The larger the overall deviation distance between the output of a speech synthesis system and the reference speech sequence, the less reliable that system is; the smaller the distance, the more reliable it is. Therefore, the speech synthesis system with the minimum overall deviation distance from the reference speech sequence can be selected as the optimal evaluation result and marked as the preferred speech synthesis system. When a speech synthesis task is subsequently received, the preferred speech synthesis system can be used to perform it.
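The selection in step S204 amounts to taking the minimum over the evaluated systems. A minimal Python sketch, assuming the overall deviation distances have already been collected in a dictionary keyed by a system identifier (the identifiers and the function name `select_preferred_system` are hypothetical):

```python
def select_preferred_system(overall_distances):
    """Return the identifier of the system with the smallest overall distance."""
    if not overall_distances:
        raise ValueError("at least one evaluated system is required")
    return min(overall_distances, key=overall_distances.get)
```

For instance, given `{"tts_a": 0.42, "tts_b": 0.17, "tts_c": 0.30}`, the sketch selects `"tts_b"`. If two systems tie, the one encountered first in the dictionary is returned.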

In summary, in the embodiments of the present invention, a preset text sequence is input into a plurality of different speech synthesis systems to be evaluated, and the output speech sequence of each speech synthesis system is obtained; a reference speech sequence corresponding to the text sequence is acquired; the overall deviation distance between the output speech sequence of each speech synthesis system and the reference speech sequence is calculated according to a plurality of preset evaluation dimensions; and the speech synthesis system with the minimum overall deviation distance from the reference speech sequence is selected as the preferred speech synthesis system and used to execute speech synthesis tasks. According to the embodiments of the present invention, the quality of a speech synthesis system is evaluated by calculating the overall deviation distance between its output speech sequence and the reference speech sequence, without depending on the subjective impression of a user, and the evaluation takes a plurality of different evaluation dimensions into account, so that both the evaluation efficiency and the evaluation accuracy are improved.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

Fig. 4 is a structural diagram of an embodiment of an evaluation apparatus for a speech synthesis system according to an embodiment of the present invention, which corresponds to the speech synthesis system evaluation method according to the above embodiment.

In this embodiment, an apparatus for evaluating a speech synthesis system may include:

the interactive module 401 is configured to input a preset text sequence into a plurality of different speech synthesis systems to be evaluated, and obtain output speech sequences of the speech synthesis systems respectively;

a reference voice sequence obtaining module 402, configured to obtain a reference voice sequence corresponding to the text sequence;

an overall deviation distance calculation module 403, configured to calculate an overall deviation distance between an output speech sequence of each speech synthesis system and the reference speech sequence according to a plurality of preset evaluation dimensions;

and a speech synthesis system selection module 404, configured to select a speech synthesis system with a smallest overall deviation distance from the reference speech sequence as a preferred speech synthesis system, and perform a speech synthesis task using the preferred speech synthesis system.

In a specific implementation of the embodiment of the present invention, the overall deviation distance calculating module may include:

the first calculation submodule is used for respectively calculating the deviation distance between a target speech sequence and the reference speech sequence in each evaluation dimension, the target speech sequence being the output speech sequence of any one speech synthesis system;

and the second calculation submodule is used for calculating the overall deviation distance between the target speech sequence and the reference speech sequence according to the deviation distances between the target speech sequence and the reference speech sequence in each evaluation dimension.

In a specific implementation of the embodiment of the present invention, the evaluation dimensions may include a feature vector, a duration, and a sound intensity;

the first calculation sub-module may include:

the first calculation unit is used for calculating the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the feature vector;

the second calculation unit is used for calculating the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of duration;

and the third calculation unit is used for calculating the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of sound intensity.

In a specific implementation of the embodiment of the present invention, the first calculating unit is specifically configured to calculate a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of a feature vector according to the following formula:

where $k$ is the syllable index in the reference speech sequence, $1 \le k \le K$, $K$ is the total number of syllables in the reference speech sequence, $P_s^k$ is the feature vector of the $k$-th syllable in the target speech sequence, $P_n^k$ is the feature vector of the $k$-th syllable in the reference speech sequence, $\mathrm{DTW}$ is the dynamic time warping function, $M_k$ is the dimension of the feature vector of the $k$-th syllable of the target speech sequence, $N_k$ is the dimension of the feature vector of the $k$-th syllable of the reference speech sequence, and $D_p$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of the feature vector.

In a specific implementation of the embodiment of the present invention, the second calculating unit is specifically configured to calculate a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of duration according to the following formula:

where $T_s^k$ is the duration of the $k$-th syllable in the target speech sequence, $T_n^k$ is the duration of the $k$-th syllable in the reference speech sequence, $T_a$ is the average of the durations of the $K$ syllables in the reference speech sequence, and $D_t$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of duration.

In a specific implementation of the embodiment of the present invention, the third calculating unit is specifically configured to calculate a deviation distance between the target speech sequence and the reference speech sequence in an evaluation dimension of the sound intensity according to the following formula:

where $E_s^k$ is the sound intensity of the $k$-th syllable in the target speech sequence, $E_n^k$ is the sound intensity of the $k$-th syllable in the reference speech sequence, $E_a$ is the average of the sound intensities of the $K$ syllables in the reference speech sequence, and $D_e$ is the deviation distance between the target speech sequence and the reference speech sequence in the evaluation dimension of sound intensity.

In a specific implementation of the embodiment of the present invention, the second calculation submodule is specifically configured to calculate the overall deviation distance between the target speech sequence and the reference speech sequence according to the following formula:

where $n$ is the index of the evaluation dimension, $1 \le n \le N$, $N$ is the total number of evaluation dimensions, $D_n$ is the deviation distance between the target speech sequence and the reference speech sequence in the $n$-th evaluation dimension, $\omega_n$ is the weight of the $n$-th evaluation dimension, and $D$ is the overall deviation distance between the target speech sequence and the reference speech sequence.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, modules and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Fig. 5 shows a schematic block diagram of a terminal device according to an embodiment of the present invention, and for convenience of description, only the relevant parts related to the embodiment of the present invention are shown.

In this embodiment, the terminal device 5 may be a computing device such as a desktop computer, a notebook computer, and a palm computer. The terminal device 5 may include: a processor 50, a memory 51, and computer readable instructions 52 stored in the memory 51 and executable on the processor 50, such as computer readable instructions to perform the speech synthesis system evaluation method described above. The processor 50, when executing the computer readable instructions 52, implements the steps in the above-described various speech synthesis system evaluation method embodiments, such as the steps S201 to S204 shown in fig. 2. Alternatively, the processor 50, when executing the computer readable instructions 52, implements the functions of the modules/units in the above-mentioned device embodiments, such as the functions of the modules 401 to 404 shown in fig. 4.

Illustratively, the computer readable instructions 52 may be partitioned into one or more modules/units that are stored in the memory 51 and executed by the processor 50 to implement the present invention. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, which are used for describing the execution process of the computer-readable instructions 52 in the terminal device 5.

The Processor 50 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used for storing the computer readable instructions and other instructions and data required by the terminal device 5. The memory 51 may also be used to temporarily store data that has been output or is to be output.

Each functional unit in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes a plurality of computer readable instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing computer readable instructions, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
