Audio synthesis method, apparatus, device, and computer-readable storage medium

Document No.: 909794 · Published: 2021-02-26

Note: this invention, "Audio synthesis method, apparatus, device, and computer-readable storage medium", was created by Xu Dong on 2020-11-18. Abstract: The application discloses an audio synthesis method, apparatus, device, and medium: acquire dry vocal audio; acquire original phoneme data corresponding to the dry audio; acquire calibrated phoneme data obtained by repairing errors in the original phoneme data; compare the original phoneme data with the calibrated phoneme data, and determine the phoneme data with the same start-stop times and the same phonemes as cross-validated phoneme data; process the cross-validated phoneme data together with the dry audio to obtain the corresponding cross-validated dry audio; and train a neural network model on the cross-validated phoneme data and cross-validated dry audio, so that audio synthesis can be performed with the trained model. By applying this phoneme cross-validation technique to different types of phoneme data, the application obtains more reliable phoneme results and dry audio, which benefits the training of the neural network model and improves both training efficiency and the quality of the synthesized audio.

1. An audio synthesis method, comprising:

acquiring dry vocal audio;

acquiring original phoneme data corresponding to the dry audio, wherein the original phoneme data comprises start-stop times of the phonemes in the dry audio, each start-stop time comprising a start time and an end time;

acquiring calibrated phoneme data obtained by repairing errors in the original phoneme data;

comparing the original phoneme data with the calibrated phoneme data, and determining the phoneme data having the same start-stop times and the same phonemes as cross-validated phoneme data;

processing the cross-validated phoneme data and the dry audio to obtain cross-validated dry audio corresponding to the cross-validated phoneme data; and

training a neural network model based on the cross-validated phoneme data and the cross-validated dry audio, so as to perform audio synthesis based on the trained neural network model.

2. The method of claim 1, wherein the comparing the original phoneme data with the calibrated phoneme data, and determining the phoneme data having the same start-stop times and the same phonemes as the cross-validated phoneme data comprises:

in the original phoneme data, setting phoneme data whose duration is less than a preset duration as a sil phoneme to obtain screened original phoneme data;

in the calibrated phoneme data, setting phoneme data whose duration is less than the preset duration as the sil phoneme to obtain screened calibrated phoneme data;

in the screened calibrated phoneme data, setting phonemes that have the same start-stop times as phonemes in the screened original phoneme data but differ in content as the sil phoneme to obtain processed calibrated phoneme data; and

determining the processed calibrated phoneme data as the cross-validated phoneme data.

3. The method of claim 2, wherein the determining the processed calibrated phoneme data as the cross-validated phoneme data comprises:

determining adjacent phoneme data in the processed calibrated phoneme data;

if the start-stop times of the adjacent phoneme data are not continuous, adjusting the start-stop times of the adjacent phoneme data in the processed calibrated phoneme data to be continuous, and determining the adjusted calibrated phoneme data as the cross-validated phoneme data; and

if the start-stop times of the adjacent phoneme data are continuous, directly determining the processed calibrated phoneme data as the cross-validated phoneme data.

4. The method of claim 3, wherein the processing the cross-validated phoneme data and the dry audio to obtain the cross-validated dry audio corresponding to the cross-validated phoneme data comprises:

acquiring target start-stop times of phonemes whose content is sil in the cross-validated phoneme data; and

in the dry audio, setting the dry-audio content having the same start-stop times as the target start-stop times to silence, and taking the adjusted dry audio as the cross-validated dry audio.

5. The method of claim 4, wherein the setting the dry-audio content having the same start-stop times as the target start-stop times to silence comprises:

determining the dry-audio content having the same start-stop times as the target start-stop times;

dividing the dry-audio content into start-segment, middle-segment, and end-segment dry-audio content according to its generation order;

performing fade-out processing on the start-segment dry-audio content, and taking the fade-out result as the muted start-segment dry-audio content;

directly setting the middle-segment dry-audio content to silence; and

performing fade-in processing on the end-segment dry-audio content, and taking the fade-in result as the muted end-segment dry-audio content.

6. The method of claim 5, wherein the performing fade-out processing on the start-segment dry-audio content comprises:

multiplying the audio of the start-segment dry-audio content by a preset cos function to obtain the fade-out result;

and the performing fade-in processing on the end-segment dry-audio content comprises:

multiplying the audio of the end-segment dry-audio content by a preset sin function to obtain the fade-in result.

7. The method of any one of claims 1 to 6, wherein the acquiring dry vocal audio comprises:

acquiring dry audio in the WAV audio format.

8. An audio synthesis apparatus, comprising:

a dry audio acquisition module, configured to acquire dry vocal audio;

an original phoneme acquisition module, configured to acquire original phoneme data corresponding to the dry audio, wherein the original phoneme data comprises start-stop times of the phonemes in the dry audio, each start-stop time comprising a start time and an end time;

a calibrated phoneme acquisition module, configured to acquire calibrated phoneme data obtained by repairing errors in the original phoneme data;

a cross-validated phoneme acquisition module, configured to compare the original phoneme data with the calibrated phoneme data and determine the phoneme data having the same start-stop times and the same phonemes as cross-validated phoneme data;

a cross-validated dry audio acquisition module, configured to process the cross-validated phoneme data and the dry audio to obtain cross-validated dry audio corresponding to the cross-validated phoneme data; and

a model training module, configured to train a neural network model based on the cross-validated phoneme data and the cross-validated dry audio, so as to perform audio synthesis based on the trained neural network model.

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the audio synthesis method of any one of claims 1 to 7.

10. A computer-readable storage medium for storing a computer program which, when executed by a processor, implements the audio synthesis method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of audio synthesis technology, and more particularly, to an audio synthesis method, apparatus, device, and computer-readable storage medium.

Background

Currently, when songs are recorded, the user's dry vocal, that is, the pure human voice, is captured. Phonemes are the smallest units of sound in human language that distinguish meaning. Using the lyric text as prior information, speech analysis can recover the start-stop time at which the user sings each phoneme, yielding a phoneme result corresponding to the dry audio; this result can be used to train a neural network model for synthesis and serve automatic audio-synthesis scenarios. The process requires high-precision phoneme start-stop times, dry audio, and a suitable data-processing method. However, the applicant found at least the following problems in existing audio synthesis: the phoneme start-stop times are not accurate enough, and the quality of the synthesized audio is low.

In view of the above, how to improve the quality of the synthesized audio is a problem to be solved by those skilled in the art.

Disclosure of Invention

In view of the above, an object of the present application is to provide an audio synthesis method, apparatus, device, and computer-readable storage medium that can improve the quality of synthesized audio. The specific scheme is as follows:

in a first aspect, the present application discloses an audio synthesis method, including:

acquiring dry vocal audio;

acquiring original phoneme data corresponding to the dry audio, wherein the original phoneme data comprises start-stop times of the phonemes in the dry audio, each start-stop time comprising a start time and an end time;

acquiring calibrated phoneme data obtained by repairing errors in the original phoneme data;

comparing the original phoneme data with the calibrated phoneme data, and determining the phoneme data having the same start-stop times and the same phonemes as cross-validated phoneme data;

processing the cross-validated phoneme data and the dry audio to obtain cross-validated dry audio corresponding to the cross-validated phoneme data; and

training a neural network model based on the cross-validated phoneme data and the cross-validated dry audio, so as to perform audio synthesis based on the trained neural network model.

Optionally, the comparing the original phoneme data with the calibrated phoneme data, and determining the phoneme data having the same start-stop times and the same phonemes as the cross-validated phoneme data includes:

in the original phoneme data, setting phoneme data whose duration is less than a preset duration as a sil phoneme to obtain screened original phoneme data;

in the calibrated phoneme data, setting phoneme data whose duration is less than the preset duration as the sil phoneme to obtain screened calibrated phoneme data;

in the screened calibrated phoneme data, setting phonemes that have the same start-stop times as phonemes in the screened original phoneme data but differ in content as the sil phoneme to obtain processed calibrated phoneme data; and

determining the processed calibrated phoneme data as the cross-validated phoneme data.

Optionally, the determining the processed calibrated phoneme data as the cross-validated phoneme data includes:

determining adjacent phoneme data in the processed calibrated phoneme data;

if the start-stop times of the adjacent phoneme data are not continuous, adjusting the start-stop times of the adjacent phoneme data in the processed calibrated phoneme data to be continuous, and determining the adjusted calibrated phoneme data as the cross-validated phoneme data; and

if the start-stop times of the adjacent phoneme data are continuous, directly determining the processed calibrated phoneme data as the cross-validated phoneme data.

Optionally, the processing the cross-validated phoneme data and the dry audio to obtain the cross-validated dry audio corresponding to the cross-validated phoneme data includes:

acquiring target start-stop times of phonemes whose content is sil in the cross-validated phoneme data; and

in the dry audio, setting the dry-audio content having the same start-stop times as the target start-stop times to silence, and taking the adjusted dry audio as the cross-validated dry audio.

Optionally, the setting the dry-audio content having the same start-stop times as the target start-stop times to silence includes:

determining the dry-audio content having the same start-stop times as the target start-stop times;

dividing the dry-audio content into start-segment, middle-segment, and end-segment dry-audio content according to its generation order;

performing fade-out processing on the start-segment dry-audio content, and taking the fade-out result as the muted start-segment dry-audio content;

directly setting the middle-segment dry-audio content to silence; and

performing fade-in processing on the end-segment dry-audio content, and taking the fade-in result as the muted end-segment dry-audio content.

Optionally, the performing fade-out processing on the start-segment dry-audio content includes:

multiplying the audio of the start-segment dry-audio content by a preset cos function to obtain the fade-out result;

and the performing fade-in processing on the end-segment dry-audio content includes:

multiplying the audio of the end-segment dry-audio content by a preset sin function to obtain the fade-in result.

Optionally, the acquiring dry vocal audio includes:

acquiring dry audio in the WAV audio format.

In a second aspect, the present application discloses an audio synthesis apparatus, comprising:

a dry audio acquisition module, configured to acquire dry vocal audio;

an original phoneme acquisition module, configured to acquire original phoneme data corresponding to the dry audio, wherein the original phoneme data comprises start-stop times of the phonemes in the dry audio, each start-stop time comprising a start time and an end time;

a calibrated phoneme acquisition module, configured to acquire calibrated phoneme data obtained by repairing errors in the original phoneme data;

a cross-validated phoneme acquisition module, configured to compare the original phoneme data with the calibrated phoneme data and determine the phoneme data having the same start-stop times and the same phonemes as cross-validated phoneme data;

a cross-validated dry audio acquisition module, configured to process the cross-validated phoneme data and the dry audio to obtain cross-validated dry audio corresponding to the cross-validated phoneme data; and

a model training module, configured to train a neural network model based on the cross-validated phoneme data and the cross-validated dry audio, so as to perform audio synthesis based on the trained neural network model.

In a third aspect, the present application discloses an electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the audio synthesis method as described in any of the above.

In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program which, when executed by a processor, implements an audio synthesis method as described in any one of the above.

In the application, after the dry audio, the original phoneme data, and the calibrated phoneme data are obtained, neither the calibrated phoneme data nor the original phoneme data is applied directly to train the neural network model. Instead, the original phoneme data is first compared with the calibrated phoneme data, and the phoneme data with the same start-stop times and the same phonemes is determined as cross-validated phoneme data. Because the cross-validated phoneme data consists of the entries on which the original and calibrated phoneme data agree in both timing and content, it is the most accurate phoneme data available from the two sources; accurate cross-validated phoneme data can therefore be obtained. Correspondingly, once the dry audio corresponding to the cross-validated phoneme data is determined as the cross-validated dry audio, accurate cross-validated dry audio is obtained as well. If the neural network model is subsequently trained on this cross-validated phoneme data and cross-validated dry audio, their high accuracy yields a model with high audio-synthesis accuracy, and audio synthesized with the trained model is of high quality. In addition, because the volume of the cross-validated phoneme data and cross-validated dry audio is small, the application also accelerates the training of the neural network model and thereby improves the efficiency of audio synthesis. In short, the application processes different types of phoneme data through a phoneme cross-validation technique to obtain more reliable phoneme results and dry audio, which benefits the training of the neural network model and improves both training efficiency and the quality of the synthesized audio.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a system framework to which the audio synthesis scheme provided herein is applicable;

FIG. 2 is a flow chart of an audio synthesis method provided herein;

FIG. 3 is a flow chart of a specific audio synthesis method provided by the present application;

FIG. 4 is a flow chart of a specific audio synthesis method provided by the present application;

FIG. 5 is a flow chart of a specific audio synthesis method provided by the present application;

FIG. 6 is a flow chart of a specific audio synthesis method provided by the present application;

FIG. 7 is a schematic diagram of original phoneme data, calibrated phoneme data, and cross-validated phoneme data;

FIG. 8 is a schematic structural diagram of an audio synthesizing apparatus according to the present application;

FIG. 9 is a block diagram of an electronic device provided in the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings. It is obvious that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments given herein without creative effort fall within the protection scope of the present application.

Currently, when songs are recorded, the user's dry vocal, i.e., the pure voice without music, is produced; the dry vocal uses audio as its carrier and contains the user's singing information. With the lyric text as prior information, speech analysis can obtain the start-stop time at which the user sings each phoneme, i.e., a phoneme result corresponding to the dry audio. This phoneme result can be used to train a neural network model for synthesis and serve automatic audio-synthesis scenarios. The process requires high-precision phoneme start-stop times, dry audio, and a suitable data-processing method; without them, the quality of the synthesized audio is low. It should be noted that, in the audio synthesis process, phonemes refer to the smallest sound units in human language that distinguish meaning. To overcome the above technical problem, the present application provides an audio synthesis method capable of improving the quality of synthesized audio.

In the audio synthesis scheme of the present application, the system framework adopted may be as shown in FIG. 1 and may specifically include: a background server 01 and a number of clients 02 that establish communication connections with the background server 01.

In the present application, the background server 01 is configured to execute the steps of the audio synthesis method, including: acquiring dry vocal audio; acquiring original phoneme data corresponding to the dry audio, wherein the original phoneme data comprises the start-stop times of the phonemes in the dry audio, each start-stop time comprising a start time and an end time; acquiring calibrated phoneme data obtained by repairing errors in the original phoneme data; comparing the original phoneme data with the calibrated phoneme data, and determining the phoneme data with the same start-stop times and the same phonemes as cross-validated phoneme data; processing the cross-validated phoneme data and the dry audio to obtain the corresponding cross-validated dry audio; and training the neural network model based on the cross-validated phoneme data and the cross-validated dry audio, so as to perform audio synthesis based on the trained model.

Further, the background server 01 may also maintain a dry audio database, an original phoneme database, a calibrated phoneme database, a cross-validated phoneme database, and a cross-validated dry audio database. The dry audio database stores various dry vocal recordings, such as the dry vocals of pop singers, rock singers, opera performers, and so on. The original phoneme database may store the phonemes and start-stop times extracted from the dry audio. The calibrated phoneme database may store the data obtained by repairing the errors in the original phoneme data. The cross-validated phoneme database may store the phonemes on which the original and calibrated phoneme data agree in both start-stop time and content, and the cross-validated dry audio database may store the dry audio corresponding to those cross-validated phonemes. It can be understood that after the audio synthesis scheme of the present application has trained the neural network model and synthesized audio, the corresponding data can be saved in these databases. Thus, when the background server 01 receives an audio synthesis request from a client 02 for a given neural network model, it can fetch the cross-validated phoneme data directly from the cross-validated phoneme database and the corresponding cross-validated dry audio from the cross-validated dry audio database, instead of re-executing the whole pipeline of acquiring the dry audio; acquiring the original phoneme data corresponding to the dry audio, including the start-stop times of its phonemes; acquiring the calibrated phoneme data obtained by repairing errors in the original phoneme data; comparing the two to determine the cross-validated phoneme data; and determining, in the dry audio, the dry audio corresponding to the cross-validated phoneme data as the cross-validated dry audio. This saves a great deal of time. In addition, to avoid the limitation that a single set of cross-validated phoneme data and cross-validated dry audio would impose on the neural network model, different types of cross-validated phoneme data and cross-validated dry audio can be selected to train the model multiple times, yielding a model with better applicability and further improving the applicability of audio synthesis.

Of course, the databases may also be hosted on a third-party service server that specifically collects data uploaded by service endpoints. In this way, when the background server 01 needs a database, it can obtain the corresponding data by issuing a database call request to the service server.

In the present application, the background server 01 may respond to audio synthesis requests from one or more clients 02. It can be understood that the audio synthesis requests initiated by different clients 02 may target the same dry audio or different dry audio.

Fig. 2 is a flowchart of an audio synthesis method according to an embodiment of the present application. Referring to fig. 2, the audio synthesis method includes:

step S11: and acquiring the dry sound audio.

In this embodiment, the dry audio refers to a dry vocal waveform file recorded by a user; its content and audio format may be determined according to actual needs. For example, the audio format of the dry audio may be MP3, MP4, MIDI (Musical Instrument Digital Interface), WAV (Waveform Audio File Format), or the like.

It should be noted that with lossy coding schemes such as MP3, the audio actually read out is shifted in time at the beginning and end depending on the decoder, which changes the dry-vocal waveform. To avoid this and guarantee that the dry-vocal waveform is unchanged, dry audio in the WAV format may be acquired.

Step S12: acquiring original phoneme data corresponding to the dry audio, the original phoneme data comprising the start-stop times of the phonemes in the dry audio.

In this embodiment, a phoneme is the smallest sound unit in human language that distinguishes meaning; combining phonemes in a specific order produces the dry vocal. In the audio synthesis process, the phoneme information in the dry audio can therefore be analyzed and the analysis result used for subsequent synthesis. That is, after the dry audio is obtained, the original phoneme data corresponding to it can be acquired; specifically, the dry audio and the phonemes can be aligned by a speech recognition technique commonly used in the art to obtain the original phoneme data. The original phoneme data describes the start-stop time of each phoneme, where a start-stop time comprises a start time and an end time.

For ease of understanding, assume that the phonemes in the present application are expressed in International Phonetic Alphabet (IPA) form. The IPA is a system of phonetic notation devised by the International Phonetic Association on the basis of Latin letters, serving as a standardized way of transcribing spoken sounds. Taking Chinese characters as an example, when the character for "I" is pronounced, the two IPA sounds "u" and "o" are actually uttered in succession; the corresponding Pinyin is "wo". That is, a phoneme in this scheme refers to an IPA unit such as "u" or "o". Accordingly, the original phoneme data consists of IPA units together with their start and end times. For example, if the phoneme "p" starts at 10 ms and ends at 30 ms, and the phoneme "a" starts at 30 ms and ends at 60 ms, the original phoneme data may be expressed in the form info = [10,30,p; 30,60,a;], i.e., as a sequence of [start time, end time, phoneme;] entries, where a semicolon marks the end of one phoneme's data. Of course, the original phoneme data may also be represented in other ways, and the present application is not limited in this respect.
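As a concrete illustration, the [start time, end time, phoneme;] layout above can be held in an ordinary list of entries. The following minimal Python sketch (variable names are illustrative, not part of the disclosure) stores and walks the example data:

```python
# A minimal sketch of the phoneme-data layout described above:
# each entry is [start_ms, end_ms, phoneme]. Names are illustrative.
raw_phonemes = [
    [10, 30, "p"],  # phoneme "p" sung from 10 ms to 30 ms
    [30, 60, "a"],  # phoneme "a" sung from 30 ms to 60 ms
]

for start, end, phoneme in raw_phonemes:
    print(f"{phoneme}: {start} ms -> {end} ms (duration {end - start} ms)")
```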

Step S13: acquiring calibrated phoneme data obtained by repairing errors in the original phoneme data.

In this embodiment, although the original phoneme data can be obtained with speech recognition technology, it may contain errors, such as wrong phoneme types or wrong start-stop times, so its accuracy may be poor. If the neural network model were trained directly on the original phoneme data, the model's audio-synthesis accuracy would also be poor. Therefore, the errors in the original phoneme data are repaired by post-processing, and the result is taken as the calibrated phoneme data.

It should be noted that the post-processing techniques used on the original phoneme data may include fundamental-frequency curve extraction, audio energy extraction, phoneme alignment, and the like; for details of these techniques, reference may be made to the prior art, and they are not repeated here.

Step S14: comparing the original phoneme data with the calibrated phoneme data, and determining the phoneme data with the same start-stop times and the same phonemes as the cross-validated phoneme data.

In this embodiment, although the calibrated phoneme data is obtained by repairing the errors in the original phoneme data, it may still contain errors. Training the neural network model on the calibrated phoneme data would improve synthesis accuracy compared with training on the original phoneme data, but could not guarantee the highest accuracy. To make the model's synthesis accuracy as high as possible, after the calibrated phoneme data is obtained, the original phoneme data is compared with it, and the phoneme data with the same start-stop times and the same phonemes is determined as cross-validated phoneme data. Since a cross-validated phoneme is one on which the original and calibrated data agree in both timing and content, the cross-validated phoneme data is the most accurate phoneme data available from the two sources. Training the neural network model on it therefore gives the highest achievable synthesis accuracy and, in turn, safeguards the quality of the synthesized audio.

It should be noted that because the cross-validated phoneme data consists only of the entries on which the original and calibrated phoneme data agree, its volume is smaller than that of either source. Training the neural network model on the cross-validated phoneme data therefore also speeds up training and further improves the efficiency of audio synthesis.
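The comparison in step S14 amounts to keeping only the entries on which the two sources agree exactly. A minimal Python sketch, assuming both inputs use the [start_ms, end_ms, phoneme] representation introduced above (the function name is hypothetical):

```python
def cross_validate(original, calibrated):
    """Keep entries whose start time, end time and phoneme all match."""
    calibrated_set = {tuple(entry) for entry in calibrated}
    return [entry for entry in original if tuple(entry) in calibrated_set]

original = [[10, 30, "p"], [30, 60, "a"], [60, 90, "t"]]
calibrated = [[10, 30, "p"], [30, 60, "o"], [60, 90, "t"]]
print(cross_validate(original, calibrated))  # [[10, 30, 'p'], [60, 90, 't']]
```

Note that this simple intersection can leave gaps in time; the embodiments below refine it with sil-masking and continuity adjustment.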

Step S15: processing the cross-validated phoneme data and the dry audio to obtain the cross-validated dry audio corresponding to the cross-validated phoneme data.

Step S16: training the neural network model based on the cross-validated phoneme data and the cross-validated dry audio, so as to perform audio synthesis based on the trained neural network model.

In this embodiment, training a neural network model for audio synthesis requires not only phoneme data but also audio. Therefore, after the cross-validated phoneme data is obtained, it is processed together with the dry audio to obtain the corresponding cross-validated dry audio; for example, within the dry audio, the portion corresponding to the cross-validated phoneme data is determined as the cross-validated dry audio. The neural network model can then be trained on the cross-validated phoneme data and cross-validated dry audio, and audio synthesis performed with the trained model.

It should be noted that to synthesize audio with the trained neural network model, it suffices to obtain the phoneme information of the audio to be synthesized, input it into the trained model, and receive the audio the model outputs. The type of the neural network model may be chosen according to actual needs; it may, for example, be a CNN (Convolutional Neural Network), a DNN (Deep Neural Network), a WaveRNN, or the like. After the audio is synthesized, it can be played or stored, for example played through a user terminal, or saved to a local disk or a database.

In the application, after the dry audio, the original phoneme data, and the calibrated phoneme data are obtained, neither the calibrated nor the original phoneme data is applied directly to train the neural network model. Instead, the two are first compared, and the phoneme data with the same start-stop times and the same phonemes is determined as cross-validated phoneme data. Since the cross-validated phoneme data consists of the entries on which the original and calibrated phoneme data agree in both timing and content, it is the most accurate phoneme data available from the two sources, so accurate cross-validated phoneme data can be obtained. Correspondingly, determining the dry audio corresponding to the cross-validated phoneme data as the cross-validated dry audio yields accurate cross-validated dry audio as well. If the neural network model is subsequently trained on this cross-validated phoneme data and cross-validated dry audio, their high accuracy gives the model high audio-synthesis accuracy, and audio synthesized with the trained model is of high quality. In addition, because the volume of the cross-validated phoneme data and cross-validated dry audio is small, training of the neural network model is accelerated and the efficiency of audio synthesis further improved. In short, the application processes different types of phoneme data through the phoneme cross-validation technique to obtain more reliable phoneme results and dry audio, which benefits the training of the neural network model and improves both training efficiency and the quality of the synthesized audio.

Fig. 3 is a flowchart of a specific audio synthesis method according to an embodiment of the present application. Referring to fig. 3, the audio synthesis method includes:

step S21: and acquiring the dry sound audio.

Step S22: original phoneme data corresponding to the dry sound audio is obtained, and the original phoneme data comprises starting and stopping times of phonemes in the dry sound audio.

Step S23: and acquiring calibration phoneme data obtained after error recovery is carried out on the original phoneme data.

Step S24: and in the original phoneme data, setting the phoneme data meeting the preset invalid rule as a sil phoneme to obtain the screened original phoneme data.

Step S25: and in the calibration phoneme data, setting the phoneme data meeting the preset invalid rule as a sil phoneme to obtain the screened calibration phoneme data.

Step S26: and comparing the screened original phoneme data with the screened calibration phoneme data, and determining the phoneme data with the same starting time and the same ending time and the same phoneme as the cross-checking phoneme data.

In this embodiment, the original phoneme data and the calibrated phoneme data may contain invalid phoneme data. If invalid phoneme data were carried into the cross-validated phoneme data, it would also appear in the audio synthesized by the trained neural network model and degrade its quality; the invalid phoneme data is therefore handled before the comparison.

It can be understood that if the invalid phoneme data were simply deleted from the original and calibrated phoneme data, the remaining phoneme data would have gaps in time, and the audio synthesized by the trained neural network model could then have temporal gaps as well, harming its quality. To avoid this, when comparing the original phoneme data with the calibrated phoneme data to determine the cross-validated phoneme data, the phoneme data meeting the preset invalidation rule is set to the sil phoneme in the original phoneme data, yielding the screened original phoneme data; likewise, the phoneme data meeting the rule is set to the sil phoneme in the calibrated phoneme data, yielding the screened calibrated phoneme data; the screened original and screened calibrated phoneme data are then compared, and the phoneme data with the same start-stop times and the same phonemes is determined as the cross-validated phoneme data. It should be noted that "sil" denotes a phoneme whose corresponding dry audio at that time is silence.

In practical applications, the preset invalidation rule may be chosen according to actual needs; for example, it may be that the duration is less than a preset duration. Suppose the phoneme data contains a fragment with start-stop time [t, t+5ms] whose phoneme is not "sil", where t is some moment. From the characteristics of real pronunciation, a vocal fragment lasting only 5 ms is implausible, so the phoneme data of this period can be judged invalid and its phoneme changed to "sil".
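A minimal sketch of this duration-based invalidation rule, again over [start_ms, end_ms, phoneme] entries; the 20 ms threshold is an illustrative assumption, since the preset duration is left to the implementer:

```python
MIN_DURATION_MS = 20  # assumed preset duration; not specified by the patent

def mark_invalid_as_sil(phonemes, min_duration=MIN_DURATION_MS):
    """Replace implausibly short non-sil phonemes with the sil phoneme."""
    return [
        [start, end, "sil" if (end - start) < min_duration and p != "sil" else p]
        for start, end, p in phonemes
    ]

print(mark_invalid_as_sil([[100, 105, "a"], [105, 160, "n"]]))
# [[100, 105, 'sil'], [105, 160, 'n']]
```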

In this embodiment, the invalid phoneme data in the original and calibrated phoneme data is thus eliminated while the temporal continuity of the phoneme data is preserved, which in turn preserves the temporal continuity of the audio synthesized by the trained neural network model and safeguards the quality of audio synthesis.

Step S27: processing the cross-validated phoneme data and the dry audio to obtain the cross-validated dry audio corresponding to the cross-validated phoneme data.

Step S28: training the neural network model based on the cross-validated phoneme data and the cross-validated dry audio, so as to perform audio synthesis based on the trained neural network model.

Fig. 4 is a flowchart of a specific audio synthesis method according to an embodiment of the present application. Referring to fig. 4, the audio synthesis method includes:

step S31: and acquiring the dry sound audio.

Step S32: original phoneme data corresponding to the dry sound audio is obtained, and the original phoneme data comprises starting and stopping times of phonemes in the dry sound audio.

Step S33: and acquiring calibration phoneme data obtained after error recovery is carried out on the original phoneme data.

Step S34: and in the original phoneme data, setting the phoneme data meeting the preset invalid rule as a sil phoneme to obtain the screened original phoneme data.

Step S35: and in the calibration phoneme data, setting the phoneme data meeting the preset invalid rule as a sil phoneme to obtain the screened calibration phoneme data.

Step S36: in the screened calibration phoneme data, setting phoneme data with the same start-stop time as the screened original phoneme data and different phonemes as a sil phoneme to obtain processed calibration phoneme data; and determining the processed calibration phoneme data as the cross-checking phoneme data.

In this embodiment, when comparing the screened original phoneme data with the screened calibrated phoneme data to determine the cross-validated phoneme data, directly extracting only the matching entries from the original or calibrated phoneme data could yield cross-validated phoneme data that is discontinuous in time. Training the neural network model on temporally discontinuous data would leave the model handling time gaps poorly, and the quality of the audio it synthesizes could not be guaranteed. To avoid this, in the screened calibrated phoneme data, any phoneme whose start-stop time matches an entry in the screened original phoneme data but whose phoneme differs is set to the sil phoneme, yielding the processed calibrated phoneme data; the processed calibrated phoneme data is then determined as the cross-validated phoneme data. That is, with the help of the sil phoneme, entries on which the original and calibrated phoneme data agree in timing but disagree in content are invalidated rather than deleted, which guarantees that the cross-validated phoneme data remains continuous in time.

It should be noted that in a specific application scenario the roles can be swapped: in the screened original phoneme data, the phoneme data whose start-stop times match the screened calibrated phoneme data but whose phonemes differ may be set to the sil phoneme to obtain processed original phoneme data, and the processed original phoneme data may then be determined as the cross-validated phoneme data.

That is, in this embodiment, invalid phoneme data in the original and calibrated phoneme data is set to the sil phoneme, and phoneme data in the screened calibrated phoneme data whose timing matches the screened original phoneme data but whose phoneme differs is likewise set to the sil phoneme; the processed calibrated phoneme data so obtained is determined as the cross-validated phoneme data. The final cross-validated phoneme data contains no timing-matched but content-mismatched entries and is continuous in time, so a neural network model subsequently trained on it can handle temporally continuous phoneme data and produce temporally continuous synthesized audio.
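A minimal sketch of this sil-masking step, under the same [start_ms, end_ms, phoneme] representation assumed earlier (the function name is hypothetical):

```python
def mask_mismatches(screened_original, screened_calibrated):
    """Set calibrated entries that match in timing but not in phoneme to sil."""
    original_by_time = {(s, e): p for s, e, p in screened_original}
    processed = []
    for start, end, phoneme in screened_calibrated:
        if original_by_time.get((start, end), phoneme) != phoneme:
            phoneme = "sil"  # timing agrees, content does not: invalidate
        processed.append([start, end, phoneme])
    return processed

print(mask_mismatches([[0, 30, "p"], [30, 60, "a"]],
                      [[0, 30, "p"], [30, 60, "o"]]))
# [[0, 30, 'p'], [30, 60, 'sil']]
```

Unlike deletion, masking keeps every time slot occupied, so the result stays continuous in time.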

Step S37: processing the cross-validated phoneme data and the dry audio to obtain the cross-validated dry audio corresponding to the cross-validated phoneme data.

Step S38: training the neural network model based on the cross-validated phoneme data and the cross-validated dry audio, so as to perform audio synthesis based on the trained neural network model.

Fig. 5 is a flowchart of a specific audio synthesis method according to an embodiment of the present application. Referring to fig. 5, the audio synthesis method includes:

step S401: and acquiring the dry sound audio.

Step S402: original phoneme data corresponding to the dry sound audio is obtained, and the original phoneme data comprises starting and stopping times of phonemes in the dry sound audio.

Step S403: and acquiring calibration phoneme data obtained after error recovery is carried out on the original phoneme data.

Step S404: and in the original phoneme data, setting the phoneme data meeting the preset invalid rule as a sil phoneme to obtain the screened original phoneme data.

Step S405: and in the calibration phoneme data, setting the phoneme data meeting the preset invalid rule as a sil phoneme to obtain the screened calibration phoneme data.

Step S406: and in the screened calibration phoneme data, setting the phoneme data with the same starting and ending time as the screened original phoneme data and different contents as the sil phoneme to obtain the processed calibration phoneme data.

Step S407: determining adjacent phoneme data in the processed calibrated phoneme data.

Step S408: judging whether the start-stop times of the adjacent phoneme data are continuous; if not, executing step S409; if so, executing step S410.

Step S409: adjusting the start-stop times of the adjacent phoneme data to be continuous, determining the adjusted calibrated phoneme data as the cross-validated phoneme data, and proceeding to step S411.

Step S410: directly determining the processed calibrated phoneme data as the cross-validated phoneme data, and proceeding to step S411.

In this embodiment, if the start-stop times of adjacent phoneme data in the original and calibrated phoneme data are not continuous, the resulting cross-validated phoneme data may also contain discontinuous start-stop times, making it discontinuous in time. To guarantee temporal continuity, when determining the processed calibrated phoneme data as the cross-validated phoneme data, the adjacent phoneme data in the processed calibrated phoneme data is located and its start-stop times are checked for continuity: if they are not continuous, the start-stop times of the adjacent phoneme data are adjusted to be continuous within the processed calibrated phoneme data, and the adjusted calibrated phoneme data is determined as the cross-validated phoneme data; if they are continuous, the processed calibrated phoneme data is directly determined as the cross-validated phoneme data.

It should be noted that to judge whether the start-stop times of adjacent phoneme data are continuous, one may check whether the end time of the earlier entry equals the start time of the later entry; if they are not equal, the start-stop times are judged discontinuous. Correspondingly, to adjust the start-stop times to be continuous in the processed calibrated phoneme data, either the end time of the earlier entry may be set to the start time of the later entry, or the start time of the later entry may be set to the end time of the earlier entry.
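The adjustment might look like the following sketch, which implements the first of the two strategies just mentioned (snapping the earlier entry's end time to the later entry's start time):

```python
def make_continuous(phonemes):
    """Close gaps between neighbouring [start_ms, end_ms, phoneme] entries."""
    adjusted = [list(entry) for entry in phonemes]
    for prev, nxt in zip(adjusted, adjusted[1:]):
        if prev[1] != nxt[0]:  # gap between neighbouring entries
            prev[1] = nxt[0]   # snap previous end to next start
    return adjusted

print(make_continuous([[0, 28, "sil"], [30, 60, "a"]]))
# [[0, 30, 'sil'], [30, 60, 'a']]
```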

In this embodiment, by locating adjacent phoneme data in the processed calibrated phoneme data, checking whether its start-stop times are continuous, and adjusting them to be continuous when they are not, the temporal continuity of the cross-validated phoneme data is ensured. Temporally continuous cross-validated phoneme data can thus be supplied to the training process, which helps the neural network model output temporally continuous synthesized audio and safeguards its quality.

Step S411: processing the cross-validated phoneme data and the dry audio to obtain the cross-validated dry audio corresponding to the cross-validated phoneme data.

Step S412: training the neural network model based on the cross-validated phoneme data and the cross-validated dry audio, so as to perform audio synthesis based on the trained neural network model.

Fig. 6 is a flowchart of a specific audio synthesis method according to an embodiment of the present application. Referring to fig. 6, the audio synthesis method includes:

step S501: and acquiring the dry sound audio.

Step S502: original phoneme data corresponding to the dry sound audio is obtained, and the original phoneme data comprises starting and stopping times of phonemes in the dry sound audio.

Step S503: and acquiring calibration phoneme data obtained after error recovery is carried out on the original phoneme data.

Step S504: and in the original phoneme data, setting the phoneme data meeting the preset invalid rule as a sil phoneme to obtain the screened original phoneme data.

Step S505: and in the calibration phoneme data, setting the phoneme data meeting the preset invalid rule as a sil phoneme to obtain the screened calibration phoneme data.

Step S506: and in the screened calibration phoneme data, setting the phoneme data with the same starting and ending time as the screened original phoneme data and different contents as the sil phoneme to obtain the processed calibration phoneme data.

Step S507: determining adjacent phoneme data in the processed calibrated phoneme data.

Step S508: judging whether the start-stop times of the adjacent phoneme data are continuous; if not, executing step S509; if so, executing step S510.

Step S509: adjusting the start-stop times of the adjacent phoneme data to be continuous, determining the adjusted calibrated phoneme data as the cross-validated phoneme data, and proceeding to step S511.

Step S510: directly determining the processed calibrated phoneme data as the cross-validated phoneme data, and proceeding to step S511.

Step S511: acquiring the target start-stop times of the phonemes whose content is sil in the cross-validated phoneme data; and determining, in the dry audio, the dry-audio content with the same start-stop times as the target start-stop times.

Step S512: dividing the dry-audio content into start-segment, middle-segment, and end-segment dry-audio content according to its generation order.

Step S513: performing fade-out processing on the start-segment dry-audio content and taking the fade-out result as its muted result; directly setting the middle-segment dry-audio content to silence.

Step S514: performing fade-in processing on the end-segment dry-audio content, taking the fade-in result as its muted result, and taking the adjusted dry audio as the cross-validated dry audio.

In this embodiment, the training process of the neural network model must be supplied with cross-validated dry audio, and that audio should be continuous in time. To obtain the cross-validated dry audio corresponding to the cross-validated phoneme data, the target start-stop times of the phonemes whose content is sil in the cross-validated phoneme data are acquired; in the dry audio, the dry-audio content with the same start-stop times as the target start-stop times is set to silence, and the adjusted dry audio is taken as the cross-validated dry audio. The cross-validated dry audio thus remains continuous in time, with no gaps, which helps the neural network model output temporally continuous synthesized audio and safeguards its quality.

In practical applications, when the dry-audio content matching the target start-stop times is set to silence directly, the signal changes abruptly at the start and end of the muted region; these data jumps would produce audible artifacts in the cross-validated dry audio. To avoid this, the dry-audio content is divided, according to its generation order, into start-segment, middle-segment, and end-segment dry-audio content; the start segment is faded out and the fade-out result taken as its muted result; the middle segment is set to silence directly; and the end segment is faded in, with the fade-in result taken as its muted result.

In a specific application scenario, the fade-out may be implemented by multiplying the audio of the start-segment dry-audio content by a preset cos function, and the fade-in by multiplying the audio of the end-segment dry-audio content by a preset sin function. The specific forms of the preset cos and sin functions may be determined according to actual needs; for example, both may include an attenuation-strength exponent. Taking the cos function as an example, it may be w = y·cos(a·t), where y denotes the audio of the start-segment dry-audio content, w the fade-out result, a the attenuation strength controlling the fade-out, and t time.
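Putting the segmentation, fades, and hard mute together, a minimal NumPy sketch could look as follows; the fade length and the attenuation strength a are illustrative assumptions, not values given by the patent:

```python
import numpy as np

def mute_with_fades(samples, sr, fade_ms=10, a=150.0):
    """Silence a dry-audio region: cos fade-out, hard mute, sin fade-in."""
    n_fade = min(int(sr * fade_ms / 1000), len(samples) // 2)
    out = samples.astype(np.float64)
    t = np.arange(n_fade) / sr
    out[:n_fade] *= np.cos(a * t)             # start segment: w = y * cos(a*t)
    out[n_fade:len(out) - n_fade] = 0.0       # middle segment: hard mute
    out[len(out) - n_fade:] *= np.sin(a * t)  # end segment: ramp back up
    return out

sr = 44100
region = np.random.randn(sr // 2)  # 0.5 s of dry audio matching a sil phoneme
muted = mute_with_fades(region, sr)
```

With these values, cos(a·t) falls from 1 to about 0.07 over the 10 ms fade and sin(a·t) rises symmetrically, so the muted region joins the surrounding audio without an abrupt jump.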

Step S515: training the neural network model based on the cross-validated phoneme data and the cross-validated dry audio, so as to perform audio synthesis based on the trained neural network model.

The following describes the technical solution of the present application by taking the audio synthesis process of a music client APP as an example.

The method comprises the steps that a music client APP obtains the dry sound audio input by a user;

the method comprises the steps that a music client APP carries out audio and phoneme alignment processing on dry sound audio through a voice recognition technology to obtain original phoneme data corresponding to the dry sound audio, the original phoneme data comprise starting and ending times of phonemes in the dry sound audio, the original phoneme data are assumed to be shown in FIG. 7, wherein a black waveform represents the dry sound audio, sil, I, ts, a and the like represent the phoneme data, and a vertical line between two phoneme data represents a spacing mark;

The music client APP performs error recovery on the original phoneme data through a post-processing technique to obtain the calibration phoneme data, also shown in FIG. 7.

In the original phoneme data, the music client sets phoneme data whose duration is less than the preset duration to the sil phoneme, obtaining the screened original phoneme data.

In the calibration phoneme data, the music client likewise sets phoneme data whose duration is less than the preset duration to the sil phoneme, obtaining the screened calibration phoneme data.

In the screened calibration phoneme data, the music client sets phoneme data whose start-stop times are the same as in the screened original phoneme data but whose phonemes differ to the sil phoneme, obtaining the processed calibration phoneme data.

The music client then determines adjacent phoneme data in the processed calibration phoneme data.

The music client judges whether the start-stop times of the adjacent phoneme data are continuous. If they are not, it adjusts the start-stop times to be continuous and determines the adjusted calibration phoneme data as the cross-checking phoneme data; if they are continuous, it determines the processed calibration phoneme data directly as the cross-checking phoneme data. The resulting cross-checking phoneme data are shown in FIG. 7.
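
Using the illustrative PhonemeSpan records from above, these screening and comparison steps might be sketched as follows (MIN_DUR is an assumption; the patent only speaks of a "preset duration"). The sketch conservatively turns any span on which the two sources disagree into sil, matching the rule that only phoneme data with the same start-stop times and the same phoneme are kept:

```python
MIN_DUR = 0.03  # illustrative "preset duration" in seconds; the patent leaves the value open

def screen_short(spans):
    """Set phoneme data whose duration is less than the preset duration to sil."""
    return [PhonemeSpan("sil", s.start, s.end) if s.end - s.start < MIN_DUR else s
            for s in spans]

def cross_check(original, calibrated):
    """Keep spans on which the screened original and screened calibration data
    agree in both start-stop times and phoneme; everything else becomes sil."""
    # Exact float matching suffices for this sketch only; real timestamps
    # from two alignment passes would need a tolerance.
    orig_by_time = {(s.start, s.end): s.phoneme for s in screen_short(original)}
    checked = []
    for s in screen_short(calibrated):
        same = orig_by_time.get((s.start, s.end)) == s.phoneme
        checked.append(PhonemeSpan(s.phoneme if same else "sil", s.start, s.end))
    # Make the start-stop times of adjacent spans continuous where gaps remain.
    for prev, cur in zip(checked, checked[1:]):
        if cur.start != prev.end:
            cur.start = prev.end
    return checked
```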

The music client then obtains the target start-stop times of the phonemes whose content is sil in the cross-checking phoneme data, and determines, in the dry sound audio, the dry sound content whose start-stop times are the same as the target start-stop times.

The music client divides that dry sound content, in temporal order, into a starting segment, a middle segment, and an ending segment; multiplies the audio of the starting segment by the preset cos function and takes the fade-out result as its muted form; sets the middle segment directly to silence; multiplies the audio of the ending segment by the preset sin function and takes the fade-in result as its muted form; and takes the adjusted dry sound audio as the cross-checking dry sound.

Finally, the music client trains the neural network model based on the cross-checking phoneme data and the cross-checking dry sound, and performs audio synthesis based on the trained neural network model.
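
Tying the sketches above together, a hypothetical end-to-end pass over one clip could look like this (soundfile is an assumed WAV I/O library; the file names and `calibrated`, which is built the same way as `original` but from the error-recovered phonemes, are also assumptions):

```python
import soundfile as sf  # assumed WAV I/O library; any float-array reader works

audio, sr = sf.read("dry_vocal.wav", dtype="float32")  # hypothetical input file
checked = cross_check(original, calibrated)
for span in checked:
    if span.phoneme == "sil":
        audio = mute_span_with_fades(audio, sr, span.start, span.end)
sf.write("cross_check_dry_vocal.wav", audio, sr)  # the cross-checking dry sound
```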

Referring to FIG. 8, an embodiment of the present application correspondingly discloses an audio synthesis apparatus, applied to a background server, which includes:

a dry sound audio acquisition module 11, configured to acquire the dry sound audio;

an original phoneme obtaining module 12, configured to obtain the original phoneme data corresponding to the dry sound audio, where the original phoneme data include the start-stop times of the phonemes in the dry sound audio, and the start-stop times include start times and end times;

a calibration phoneme obtaining module 13, configured to obtain the calibration phoneme data produced by performing error recovery on the original phoneme data;

a cross-checking phoneme obtaining module 14, configured to compare the original phoneme data with the calibration phoneme data and determine phoneme data with the same start-stop times and the same phoneme as the cross-checking phoneme data;

a cross-checking dry sound acquisition module 15, configured to process the cross-checking phoneme data and the dry sound audio to obtain the cross-checking dry sound corresponding to the cross-checking phoneme data;

and a model training module 16, configured to train the neural network model based on the cross-checking phoneme data and the cross-checking dry sound, so as to perform audio synthesis based on the trained neural network model.

It can be seen that, in the present application, after the dry sound audio, the original phoneme data, and the calibration phoneme data are obtained, neither the calibration phoneme data nor the original phoneme data is applied directly to train the neural network model. Instead, the original phoneme data and the calibration phoneme data are compared, and the phoneme data with the same start-stop times and the same phoneme are determined as the cross-checking phoneme data. Because the cross-checking phoneme data are precisely the phoneme data on which the original and calibration phoneme data agree, they are the most accurate phoneme data among the two sources; that is, the present application obtains accurate cross-checking phoneme data. Accordingly, once the dry sound corresponding to the cross-checking phoneme data is determined in the dry sound audio as the cross-checking dry sound, an accurate cross-checking dry sound is obtained as well. If the neural network model is subsequently trained on the cross-checking phoneme data and the cross-checking dry sound, their high accuracy yields high accuracy in the model's audio synthesis, so audio synthesized with the trained neural network model is of high quality. In addition, because the cross-checking phoneme data and the cross-checking dry sound are small in data volume, the present application also accelerates the training of the neural network model and thereby improves the efficiency of audio synthesis. In other words, the present application technically processes different types of phoneme data through the phoneme cross-checking technique to obtain more reliable phoneme results and dry sound audio, which facilitates the training of the neural network model and improves both the training efficiency and the sound quality of the synthesized audio.

In some specific embodiments, the cross-checking phoneme obtaining module 14 specifically includes:

an original phoneme screening unit, configured to set, in the original phoneme data, phoneme data meeting a preset invalid rule to the sil phoneme, obtaining the screened original phoneme data;

a calibration phoneme screening unit, configured to set, in the calibration phoneme data, phoneme data meeting the preset invalid rule to the sil phoneme, obtaining the screened calibration phoneme data;

and a cross-checking phoneme determining unit, configured to compare the screened original phoneme data with the screened calibration phoneme data and determine phoneme data with the same start-stop times and the same phoneme as the cross-checking phoneme data.

In some embodiments, the preset invalid rule includes the phoneme duration being less than a preset duration.

In some embodiments, the cross-checking phoneme determining unit is specifically configured to: in the screened calibration phoneme data, set phoneme data whose start-stop times are the same as in the screened original phoneme data but whose phonemes differ to the sil phoneme, obtaining the processed calibration phoneme data; and determine the processed calibration phoneme data as the cross-checking phoneme data.

In some embodiments, the cross-checking phoneme determining unit is specifically configured to: determine adjacent phoneme data in the processed calibration phoneme data; judge whether the start-stop times of the adjacent phoneme data are continuous; if not, adjust the start-stop times of the adjacent phoneme data in the processed calibration phoneme data to be continuous, and determine the adjusted calibration phoneme data as the cross-checking phoneme data; and if so, determine the processed calibration phoneme data directly as the cross-checking phoneme data.

In some specific embodiments, the cross-checking dry sound acquisition module 15 specifically includes:

a target start-stop time obtaining unit, configured to obtain the target start-stop times of the phonemes whose content is sil in the cross-checking phoneme data;

and a cross-checking dry sound determining unit, configured to set, in the dry sound audio, the dry sound content whose start-stop times are the same as the target start-stop times to silence, and to take the adjusted dry sound audio as the cross-checking dry sound.

In some embodiments, the cross-checking dry sound determining unit is specifically configured to: determine the dry sound content whose start-stop times are the same as the target start-stop times; divide the dry sound content, in temporal order, into a starting segment, a middle segment, and an ending segment; fade out the starting segment and take the fade-out result as its muted form; set the middle segment directly to silence; and fade in the ending segment and take the fade-in result as its muted form.

In some embodiments, the cross-checking dry sound determining unit is specifically configured to: multiply the audio of the starting segment by the preset cos function to obtain the fade-out result; and multiply the audio of the ending segment by the preset sin function to obtain the fade-in result.

In some embodiments, the dry sound audio acquisition module 11 specifically includes:

a dry sound audio acquisition unit, configured to acquire dry sound audio in the WAV format.

Further, an embodiment of the present application also provides an electronic device. FIG. 9 is a schematic structural diagram of an electronic device 20 according to an exemplary embodiment; nothing in the figure should be taken as limiting the scope of the present application.

The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps of the audio synthesis method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may specifically be a server.

In this embodiment, the power supply 23 provides the working voltage for each hardware device on the electronic device 20; the communication interface 24 creates a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows may be any protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 obtains external input data or outputs data externally, and its specific interface type may be selected according to the application requirements, which is likewise not specifically limited herein.

In addition, the memory 22, as a carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like; the resources stored on it may include an operating system 221, a computer program 222, data 223, and the like, and the storage may be transient or permanent.

The operating system 221 manages and controls each hardware device and the computer program 222 on the electronic device 20, enabling the processor 21 to operate on and process the data 223 in the memory 22; it may be Windows Server, Netware, Unix, Linux, or the like. In addition to the computer program that performs the audio synthesis method disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for other specific tasks. The data 223 may include the various data collected by the electronic device 20.

Further, an embodiment of the present application also discloses a computer-readable storage medium, in which a computer program is stored; when the computer program is loaded and executed by a processor, the steps of the audio synthesis method disclosed in any of the foregoing embodiments are implemented.

The computer-readable storage media involved in this application include random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the art.

It is further noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
