Media data processing method and device, storage medium and computer equipment

Document No.: 812512    Publication date: 2021-03-26

Note: this technology, "Media data processing method and device, storage medium and computer equipment", was designed and created by 张乐雨 and 张慧敏 on 2020-12-10. Its main content is as follows: the application discloses a media data processing method and device, a storage medium, and computer equipment. The method includes: receiving source media data, where the source media data comprises video data and source audio data; performing voice translation on the source audio data to obtain translated text data, and translating the translated text data to obtain translated text data of a target language; acquiring text semantic parameters corresponding to the translated text data, and adjusting preset sound synthesis parameters based on the text semantic parameters; performing sound synthesis on the translated text data according to the adjusted sound synthesis parameters to obtain audio data corresponding to the target language; and synthesizing the audio data corresponding to the target language with the video data to obtain synthesized media data. The application makes media data suitable for viewers with different language habits, preserves sound characteristics that better match the emotion of the source media data, and improves the user's viewing experience.

1. A method for media data processing, comprising:

receiving source media data, wherein the source media data comprises video data and source audio data;

performing voice translation on the source audio data to obtain translated text data, and translating the translated text data to obtain translated text data of a target language;

acquiring text semantic parameters corresponding to the translated text data, and adjusting preset sound synthesis parameters based on the text semantic parameters;

performing voice synthesis on the translated text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

and synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data.

2. The method according to claim 1, wherein translating the translated text data to obtain translated text data in a target language specifically comprises:

assembling the translated text data according to an input parameter assembling rule corresponding to a preset translation line to obtain translation input data corresponding to the translated text data;

calling the preset translation line, inputting the translation input data into the preset translation line for translation, and obtaining translation output data;

and analyzing the translation output data according to an output parameter analysis rule corresponding to a preset translation line to obtain the translation text data.

3. The method of claim 2, wherein prior to said invoking said preset translation line, said method further comprises:

obtaining a verification seed corresponding to the preset translation line, and generating a verification token corresponding to the verification seed according to a token generation rule;

and verifying the preset translation line by using the verification token, and determining that the preset translation line is in an invokable state if the verification passes.

4. The method according to claim 1, wherein the obtaining of the text semantic parameter corresponding to the translation text data specifically includes:

dividing the translated text data according to a text structure corresponding to the translated text data to obtain a plurality of sentences corresponding to the translated text data;

and respectively acquiring semantic parameters corresponding to each sentence, and determining text semantic parameters corresponding to the translated text data according to the semantic parameters corresponding to each sentence.

5. The method according to claim 1, wherein the receiving source media data specifically comprises:

receiving the source media data sent by a video publishing terminal;

the synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data specifically includes:

acquiring a playing language corresponding to a video playing terminal, and acquiring audio data corresponding to the playing language from audio data corresponding to the target language;

synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data;

and sending the playing media data to the video playing terminal.

6. The method according to claim 5, wherein the synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data specifically comprises:

acquiring translation text data corresponding to the playing language;

and synthesizing the translation text data and the audio data corresponding to the playing language with the video data to obtain the playing media data.

7. The method according to claim 5 or 6, wherein the acquiring the playing language corresponding to the video playing terminal specifically includes:

determining the playing language of the video playing terminal according to the geographical position of the video playing terminal; or,

determining the playing language of the video playing terminal according to a common language corresponding to the video playing terminal; or,

and analyzing the playing language indicated by the playing instruction according to the playing instruction sent by the video playing terminal.

8. A media data processing apparatus, comprising:

the source data receiving module is used for receiving source media data, wherein the source media data comprise video data and source audio data;

the audio data translation module is used for performing voice translation on the source audio data to obtain translated text data and translating the translated text data to obtain translated text data of a target language;

the sound parameter adjusting module is used for acquiring text semantic parameters corresponding to the translated text data and adjusting preset sound synthesis parameters based on the text semantic parameters;

the voice synthesis module is used for carrying out voice synthesis on the translation text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

and the media data synthesis module is used for synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data.

9. A storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the media data processing method of any one of claims 1 to 7.

10. A computer device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the media data processing method of any one of claims 1 to 7 when executing the computer program.

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a media data processing method and apparatus, a storage medium, and a computer device.

Background

With the continuous development of communication technology, users not only use intelligent terminal devices such as mobile phones, tablet computers and desktop computers to make calls or look up information, but also apply them to an ever wider range of functions.

In the current video watching process, a video producer sends recorded audio and video data to a video server, and the video server forwards the recorded video to the terminals of video watchers for playing. However, those watchers may come from anywhere in the world and may not fully understand the language in the audio and video uploaded by the producer, so the viewing experience is poor and it is difficult to grow the play count of the video platform.

Disclosure of Invention

In view of this, the present application provides a media data processing method and apparatus, a storage medium, and a computer device.

According to an aspect of the present application, there is provided a media data processing method including:

receiving source media data, wherein the source media data comprises video data and source audio data;

performing voice translation on the source audio data to obtain translated text data, and translating the translated text data to obtain translated text data of a target language;

acquiring text semantic parameters corresponding to the translated text data, and adjusting preset sound synthesis parameters based on the text semantic parameters;

performing voice synthesis on the translated text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

and synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data.

Optionally, the translating the translated text data to obtain the translated text data of the target language specifically includes:

assembling the translated text data according to an input parameter assembling rule corresponding to a preset translation line to obtain translation input data corresponding to the translated text data;

calling the preset translation line, inputting the translation input data into the preset translation line for translation, and obtaining translation output data;

and analyzing the translation output data according to an output parameter analysis rule corresponding to a preset translation line to obtain the translation text data.

Optionally, before the calling the preset translation line, the method further includes:

obtaining a verification seed corresponding to the preset translation line, and generating a verification token corresponding to the verification seed according to a token generation rule;

and verifying the preset translation line by using the verification token, and determining that the preset translation line is in an invokable state if the verification passes.

Optionally, the obtaining of the text semantic parameter corresponding to the translation text data specifically includes:

dividing the translated text data according to a text structure corresponding to the translated text data to obtain a plurality of sentences corresponding to the translated text data;

and respectively acquiring semantic parameters corresponding to each sentence, and determining text semantic parameters corresponding to the translated text data according to the semantic parameters corresponding to each sentence.

Optionally, the receiving source media data specifically includes:

receiving the source media data sent by a video publishing terminal;

the synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data specifically includes:

acquiring a playing language corresponding to a video playing terminal, and acquiring audio data corresponding to the playing language from audio data corresponding to the target language;

synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data;

and sending the playing media data to the video playing terminal.

Optionally, the synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data specifically includes:

acquiring translation text data corresponding to the playing language;

and synthesizing the translation text data and the audio data corresponding to the playing language with the video data to obtain the playing media data.

Optionally, the acquiring a playing language corresponding to the video playing terminal specifically includes:

determining the playing language of the video playing terminal according to the geographical position of the video playing terminal; or,

determining the playing language of the video playing terminal according to a common language corresponding to the video playing terminal; or,

and analyzing the playing language indicated by the playing instruction according to the playing instruction sent by the video playing terminal.

According to another aspect of the present application, there is provided a media data processing apparatus including:

the source data receiving module is used for receiving source media data, wherein the source media data comprise video data and source audio data;

the audio data translation module is used for performing voice translation on the source audio data to obtain translated text data and translating the translated text data to obtain translated text data of a target language;

the sound parameter adjusting module is used for acquiring text semantic parameters corresponding to the translated text data and adjusting preset sound synthesis parameters based on the text semantic parameters;

the voice synthesis module is used for carrying out voice synthesis on the translation text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

and the media data synthesis module is used for synthesizing the audio data corresponding to the target language and the video data to obtain synthesized media data.

Optionally, the audio data translation module specifically includes:

the input data assembling unit is used for assembling the translated text data according to an input parameter assembling rule corresponding to a preset translation line to obtain translation input data corresponding to the translated text data;

the translation data output unit is used for calling the preset translation line, inputting the translation input data into the preset translation line for translation, and obtaining translation output data;

and the translation text analysis unit is used for analyzing the translation output data according to an output parameter analysis rule corresponding to a preset translation line to obtain the translation text data.

Optionally, the apparatus further comprises:

the verification token generation module is used for acquiring a verification seed corresponding to the preset translation line before the preset translation line is called, and generating a verification token corresponding to the verification seed according to a token generation rule;

and the line verification module is used for verifying the preset translation line by using the verification token and determining that the preset translation line is in an invokable state if the verification passes.

Optionally, the sound parameter adjusting module specifically includes:

the sentence dividing unit is used for dividing the translation text data according to a text structure corresponding to the translation text data to obtain a plurality of sentences corresponding to the translation text data;

and the semantic parameter determining unit is used for respectively acquiring the semantic parameters corresponding to each sentence and determining the text semantic parameters corresponding to the translated text data according to the semantic parameters corresponding to each sentence.

Optionally, the source data receiving module is specifically configured to: receiving the source media data sent by a video publishing terminal;

the media data synthesis module specifically comprises:

the playing language acquisition unit is used for acquiring a playing language corresponding to the video playing terminal and acquiring audio data corresponding to the playing language from the audio data corresponding to the target language;

the playing data synthesis unit is used for synthesizing the audio data corresponding to the playing language and the video data to obtain playing media data;

and the playing data sending unit is used for sending the playing media data to the video playing terminal.

Optionally, the playing data synthesizing unit specifically includes:

a playing text acquisition subunit, configured to acquire translation text data corresponding to the playing language;

and the playing data synthesizing subunit is used for synthesizing the translation text data and the audio data corresponding to the playing language and the video data to obtain the playing media data.

Optionally, the playing language obtaining unit specifically includes:

the first language acquisition subunit is used for determining the playing language of the video playing terminal according to the geographical position of the video playing terminal; or,

the second language obtaining subunit is configured to determine the playing language of the video playing terminal according to a common language corresponding to the video playing terminal; or,

and the third language acquisition subunit is used for analyzing the playing language indicated by the playing instruction according to the playing instruction sent by the video playing terminal.

According to yet another aspect of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described media data processing method.

According to yet another aspect of the present application, there is provided a computer device comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, the processor implementing the above media data processing method when executing the program.

By means of the above technical scheme, in the media data processing method and device, storage medium, and computer device of the present application, after source media data is received, voice translation is first performed on the source audio data contained in the source media data to obtain the corresponding translated text data; the translated text is then translated from the source language into translated text data of a target language. Sound synthesis parameters are adjusted according to the text semantic parameters corresponding to the translated text, the translated text data is synthesized into audio data of the corresponding target language based on the adjusted sound synthesis parameters, and the target-language audio data is assembled with the video data contained in the source media data to obtain synthesized media data. Compared with the prior-art approach of directly playing live video, the present application can convert source media data into media data in multiple languages, making it convenient for users with different language habits to watch. It can also obtain the text semantic parameters corresponding to the translated text data of the source audio data, determine sound synthesis parameters from them, and perform sound synthesis with those parameters, so that the synthesized sound better matches the emotion expressed by the source audio data. This improves the audiovisual similarity between the synthesized media data and the source media data, improves the user's viewing experience, and also helps increase the play count of the video platform.

The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a schematic flow chart illustrating a media data processing method according to an embodiment of the present application;

fig. 2 is a schematic flow chart illustrating another media data processing method provided by an embodiment of the present application;

fig. 3 is a schematic structural diagram of a media data processing apparatus according to an embodiment of the present application;

fig. 4 shows a schematic structural diagram of another media data processing device provided in the embodiment of the present application.

Detailed Description

The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

In this embodiment, a media data processing method is provided, as shown in fig. 1, the method including:

Step 101, receiving source media data, wherein the source media data comprises video data and source audio data;

the media data processing method provided by the embodiment of the application can be used for processing media data recorded by a main broadcast in a live broadcast terminal device in a live broadcast platform and can also be used for processing media data uploaded by a video uploading party in a video platform. In the above embodiment, the live broadcast platform server receives source media data, where the source media data includes video data and audio data, and the language type corresponding to the audio data is a language used by a main broadcast.

Step 102, performing voice translation on source audio data to obtain translated text data, and translating the translated text data to obtain translated text data of a target language;

In this embodiment, after the source audio data is received, it is first subjected to voice translation to obtain translated text data corresponding to the source audio data; that is, speech recognition converts the voice data into text data. Further, to realize language conversion of the media data, the translated text data obtained by voice translation is translated into a target language to obtain the translated text data of that language; for example, the text may be translated from Chinese into English, Japanese, and so on.
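The two-stage pipeline of step 102 can be sketched as follows. This is a minimal sketch, not the method itself: it assumes the third-party speech_recognition package for the recognition stage, and translate_text is a hypothetical placeholder for the preset translation line described later in this document.

```python
import speech_recognition as sr  # third-party ASR package (assumed available)

def speech_to_text(audio_path: str, source_language: str = "zh-CN") -> str:
    """Voice translation: transcribe the source audio data into text."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)  # read the whole audio file
    return recognizer.recognize_google(audio, language=source_language)

def translate_text(text: str, target_language: str) -> str:
    """Hypothetical placeholder for the preset translation line;
    a concrete HTTP sketch appears later in this document."""
    return text  # a real call would return target-language text

translated_text = speech_to_text("source_audio.wav")  # illustrative file name
target_text = translate_text(translated_text, target_language="en")
```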

Step 103, acquiring text semantic parameters corresponding to the translated text data, and adjusting preset sound synthesis parameters based on the text semantic parameters;

In this embodiment, to ensure that the processed media data exhibits a natural voice effect and to avoid a stiff, robotic sound, text semantic parameters corresponding to the translated text data are obtained after the translated text data is produced. The text semantic parameters describe the semantic information expressed by the source media data; for example, if the source media data expresses the author's happiness, that happy emotion can be captured by the text semantic parameters of the translated text. The preset sound synthesis parameters can then be adjusted based on the text semantic parameters, so that the adjusted parameters reflect the text semantics through characteristics of the sound; the sound synthesis parameters specifically include sound fluctuation amplitude, fundamental frequency, speech rate, volume, sentence interval duration, and so on. For example, when the emotion is happy, the speech rate is faster and the intervals between sentences are shorter.
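As an illustration of how a text semantic parameter might drive the adjustment, the sketch below maps a single sentiment score onto the synthesis parameters listed above. The score range, preset values, and scaling factors are illustrative assumptions only; the method does not prescribe them.

```python
from typing import Optional

def adjust_synthesis_params(semantic_score: float,
                            preset: Optional[dict] = None) -> dict:
    """Adjust preset sound synthesis parameters from a sentiment score
    assumed to lie in [-10, +10] (negative = sad, positive = happy)."""
    preset = preset or {
        "rate_wpm": 180,          # speech rate, words per minute
        "pitch_hz": 120,          # fundamental frequency
        "volume": 0.8,            # 0.0 .. 1.0
        "sentence_pause_s": 0.6,  # interval between sentences
    }
    k = max(-1.0, min(1.0, semantic_score / 10.0))  # normalize to [-1, 1]
    return {
        "rate_wpm": preset["rate_wpm"] * (1 + 0.15 * k),            # faster when happy
        "pitch_hz": preset["pitch_hz"] * (1 + 0.10 * k),            # slightly higher pitch
        "volume": min(1.0, preset["volume"] * (1 + 0.2 * abs(k))),  # louder at strong emotion
        "sentence_pause_s": preset["sentence_pause_s"] * (1 - 0.3 * k),  # shorter pauses when happy
    }

print(adjust_synthesis_params(7))  # e.g., a fairly positive text
```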

Step 104, carrying out voice synthesis on the translated text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

In this embodiment, the translated text data is subjected to sound synthesis according to the adjusted sound synthesis parameters; that is, text-to-speech processing is applied to the translated text data using synthesis parameters imbued with the text semantic information, yielding audio data corresponding to the target language. The source audio data in the source language is thereby converted into audio data of the target language.

Step 105, synthesizing the audio data corresponding to the target language with the video data to obtain synthesized media data.

In this embodiment, after the audio data corresponding to the target language is generated, it is assembled with the video data contained in the source media data to obtain the synthesized media data. The source media data is thus converted from the source language into synthesized media data of the target language, so that users with different language habits can understand the content expressed by the video, which improves their viewing experience and increases the play count of the video platform.
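One common way to perform this assembly is to remux the streams with the ffmpeg command-line tool. The sketch below assumes ffmpeg is installed; the file names are illustrative.

```python
import subprocess

def mux_audio_video(video_in: str, audio_in: str, media_out: str) -> None:
    """Replace the source audio track with the synthesized target-language audio."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in,               # source media data (video stream kept)
        "-i", audio_in,               # synthesized target-language audio
        "-map", "0:v", "-map", "1:a",
        "-c:v", "copy",               # do not re-encode the video
        "-shortest",                  # stop at the shorter stream
        media_out,
    ], check=True)

mux_audio_video("source.mp4", "target_language.wav", "synthesized.mp4")
```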

By applying the technical scheme of this embodiment, after source media data is received, voice translation is performed on the source audio data contained in it to obtain the corresponding translated text data; the translated text is then translated from the source language into translated text data of a target language, and the sound synthesis parameters are adjusted according to the text semantic parameters corresponding to the translated text, so that the translated text data is synthesized into audio data of the corresponding target language based on the adjusted parameters and assembled with the video data contained in the source media data to obtain synthesized media data. Compared with the prior-art approach of directly playing live video, this embodiment can convert source media data into media data in multiple languages, which makes it convenient for users with different language habits to watch. It also determines sound synthesis parameters from the text semantic parameters of the translated text data and performs sound synthesis with them, so that the synthesized sound better matches the emotion expressed by the source audio data, improving the audiovisual similarity between the synthesized and source media data, the user's viewing experience, and the play count of the video platform.

Further, as a refinement and an extension of the specific implementation of the foregoing embodiment, in order to fully illustrate the specific implementation process of the present embodiment, another media data processing method is provided, as shown in fig. 2, and the method includes:

Step 201, receiving source media data sent by a video publishing terminal, wherein the source media data includes video data and source audio data;

In this embodiment, when the anchor terminal performs a live broadcast, it records content to obtain source media data that includes video data and audio data, and sends the source media data to a live broadcast server, which receives it.

Step 202, performing voice translation on source audio data to obtain translated text data, and translating the translated text data to obtain translated text data of a target language;

Step 203, acquiring text semantic parameters corresponding to the translated text data, and adjusting preset sound synthesis parameters based on the text semantic parameters;

Step 204, carrying out voice synthesis on the translated text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

For steps 202 to 204, refer to the corresponding descriptions of steps 102 to 104, which are not repeated here. Specifically, text-to-speech (TTS) technology can be used for the synthesis: it converts text information generated by a computer or input from outside into intelligible, fluent speech output.
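As one concrete possibility (not prescribed by the method), the offline pyttsx3 engine can render text with an adjusted rate and volume. pyttsx3 exposes only rate, volume, and voice properties; fundamental frequency and sentence pauses would require a fuller TTS engine.

```python
import pyttsx3  # offline TTS engine (assumed installed)

def synthesize(text: str, params: dict, out_path: str = "target_language.wav") -> None:
    """Render translated text to audio using adjusted synthesis parameters."""
    engine = pyttsx3.init()
    engine.setProperty("rate", int(params["rate_wpm"]))    # speech rate
    engine.setProperty("volume", float(params["volume"]))  # 0.0 .. 1.0
    engine.save_to_file(text, out_path)                    # render to file
    engine.runAndWait()                                    # flush the queue
```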

Step 205, acquiring a playing language corresponding to the video playing terminal, and acquiring audio data corresponding to the playing language from the audio data corresponding to the target language;

In step 205, the live broadcast server needs to process the source media data sent by the live video broadcast end and forward the processed data to the video playing terminal, so it must determine which language to convert the source media data into. In this embodiment, the target language may include multiple languages: the playing language corresponding to the video playing terminal is obtained, and the audio data corresponding to that playing language is found among the audio data of the multiple target languages, so that this audio data is used to synthesize the media data, which makes it convenient for people with different language habits to watch the live video.

In the above embodiment, specifically, the playing language of the video playing terminal is determined according to the geographic location of the video playing terminal; or it is determined according to a common language corresponding to the video playing terminal; or it is parsed from a playing instruction sent by the video playing terminal.

In this embodiment, the playing language may be determined from the location of the video playing terminal; for example, if the terminal is located in Japan, where the common language is Japanese, the playing language can be determined to be Japanese. Alternatively, the playing language carried in a live viewing request received by the live server from the video playing terminal may be parsed according to the playing instruction. Alternatively, the playing language may be determined directly from the common language corresponding to the video playing terminal, such as the language selected the last time a video was watched.
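A minimal sketch of resolving the playing language via these three routes is shown below; the request field names, the precedence order, and the geography-to-language mapping are all assumptions for illustration.

```python
GEO_LANGUAGE = {"JP": "ja", "CN": "zh", "US": "en"}  # illustrative mapping

def resolve_playing_language(request: dict) -> str:
    """Determine the playing language for a video playing terminal."""
    # 1. A playing instruction carried in the viewing request wins.
    if request.get("requested_language"):
        return request["requested_language"]
    # 2. Otherwise use the terminal's common language (e.g., last selection).
    if request.get("last_selected_language"):
        return request["last_selected_language"]
    # 3. Otherwise infer from the terminal's geographic location.
    return GEO_LANGUAGE.get(request.get("country_code", ""), "en")

print(resolve_playing_language({"country_code": "JP"}))  # -> "ja"
```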

Step 206, synthesizing the audio data corresponding to the playing language with the video data to obtain playing media data;

specifically, translation text data corresponding to a playing language is acquired; and synthesizing the translation text data corresponding to the playing language, the audio data and the video data to obtain playing media data.

In the above embodiment, the translated text data corresponding to the playing language is used as subtitle data, and the translated text data, the audio data, and the video data are synthesized to obtain the playing media data, so that both the sound and the subtitles of the synthesized playing media data match the viewing user's language habits, further improving the user's viewing experience.
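Sketched below, the subtitle track (the translated text saved as an .srt file) can be muxed alongside the playing-language audio and the video, again assuming ffmpeg; mov_text is the subtitle codec MP4 containers accept. File names are illustrative.

```python
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "source.mp4",            # video stream
    "-i", "playing_language.wav",  # audio in the playing language
    "-i", "subtitles.srt",         # translated text data as subtitles
    "-map", "0:v", "-map", "1:a", "-map", "2:s",
    "-c:v", "copy", "-c:s", "mov_text",
    "playing_media.mp4",
], check=True)
```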

Step 207, sending the playing media data to the video playing terminal.

In the above embodiment, after the playing media data is synthesized, the playing media data is sent to the video playing terminal for the user to watch.

It should be noted that, in a live broadcast scenario, to ensure video playing quality, the live broadcast server generally buffers the video for a period of time before sending it to the video playing terminal. For example, the video is buffered for 30 seconds, and the buffered video is then segmented every 15 seconds to obtain source media data; each segment undergoes the playing-language conversion separately, so that the video received by the video playing terminal does not stall repeatedly and playing quality is ensured.
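The buffering-and-segmentation step could look like the following, assuming ffmpeg's segment muxer and a file holding the buffered stream; the 15-second figure comes from the example above.

```python
import subprocess

# Cut the buffered stream into 15-second pieces; each piece is then put
# through the playing-language conversion independently.
subprocess.run([
    "ffmpeg", "-y", "-i", "buffered_stream.ts",
    "-f", "segment", "-segment_time", "15",
    "-c", "copy",          # split without re-encoding
    "segment_%03d.ts",
], check=True)
```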

In any embodiment of the present application, translating the translated text data in step 102 and step 202 to obtain the translated text data of the target language specifically includes:

Step 102-1, assembling the translated text data according to an input parameter assembling rule corresponding to a preset translation line to obtain translation input data corresponding to the translated text data;

Step 102-2, calling the preset translation line, inputting the translation input data into the preset translation line for translation, and obtaining translation output data;

Step 102-3, parsing the translation output data according to an output parameter parsing rule corresponding to the preset translation line to obtain the translation text data.

In the above embodiment, the input parameter assembling rule corresponding to the preset translation line is first obtained, and the translated text data to be translated is assembled according to that rule to obtain the translation input data, which serves as the input parameter of the preset translation line. The preset translation line is then called and the translation input data is fed into it for translation, producing the output parameter, namely the translation output data. Further, to obtain translation text data that the computer can recognize, the translation output data is parsed according to the output parameter parsing rule corresponding to the preset translation line, finally yielding the translation text data. In this way the translated text data is translated into the translation text data through the translation line, converting the text data from the source language to the target language. The preset translation line may be a translation interface exposed by various terminals or browsers, such as a Baidu translation interface or a Google translation interface, or it may be an interface to a preset translation database.
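A sketch of steps 102-1 to 102-3 against a generic HTTP translation interface follows; the endpoint URL and field names stand in for whatever assembling and parsing rules the chosen translation line defines, and are not any real provider's API.

```python
import requests

TRANSLATION_LINE_URL = "https://translation.example.com/v1/translate"  # hypothetical

def translate_via_line(translated_text: str, target_language: str) -> str:
    # Step 102-1: assemble per the input parameter assembling rule.
    translation_input = {"q": translated_text, "to": target_language}
    # Step 102-2: call the preset translation line.
    response = requests.post(TRANSLATION_LINE_URL, json=translation_input, timeout=10)
    response.raise_for_status()
    # Step 102-3: parse per the output parameter parsing rule.
    return response.json()["translation"]  # hypothetical response field
```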

In some application scenarios, certain translation interfaces define a call verification rule in advance, and verification must be performed before the interface is called in order to avoid wasting resources on malicious calls. In the above embodiment, before step 102-2, the method further includes:

Step 102-4, acquiring a verification seed corresponding to the preset translation line, and generating a verification token corresponding to the verification seed according to a token generation rule;

Step 102-5, verifying the preset translation line by using the verification token, and if the verification passes, determining that the preset translation line is in an invokable state.

In the above embodiment, a verification seed corresponding to the preset translation line is obtained, and a verification token is generated by encrypting the verification seed according to the token generation rule agreed in advance with the preset translation line. Before the preset translation line is called, verification is performed with this token; once verification passes, the preset translation line is determined to be in an invokable state. The line can only be called in the invokable state; otherwise it cannot be called, which prevents the preset translation line from being called maliciously and wasting translation line resources, thereby improving translation efficiency. For example, for a Google translation interface, a verification seed is obtained and a verification token is generated from the seed and the timestamp of the current time using a preset encryption algorithm, so that the translation interface request can be verified.
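The document leaves the token generation rule abstract; purely for illustration, the sketch below assumes HMAC-SHA256 over the verification seed and a current timestamp.

```python
import hashlib
import hmac
import time

def make_verification_token(seed: str) -> dict:
    """Generate a verification token from the seed and the current time
    (HMAC-SHA256 is an assumed choice of 'preset encryption algorithm')."""
    timestamp = str(int(time.time()))
    digest = hmac.new(seed.encode(), timestamp.encode(), hashlib.sha256).hexdigest()
    return {"timestamp": timestamp, "token": digest}
```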

In any embodiment of the present application, the obtaining of the text semantic parameter corresponding to the translated text data in step 103 and step 203 specifically includes:

Step 103-1, segmenting the translated text data according to the text structure corresponding to the translated text data to obtain a plurality of sentences corresponding to the translated text data;

Step 103-2, obtaining semantic parameters corresponding to each sentence respectively, and determining text semantic parameters corresponding to the translated text data according to the semantic parameters corresponding to each sentence.

In the foregoing embodiment, the translated text may be segmented at the punctuation marks in the text (e.g., periods, question marks, exclamation marks) following the text structure of the translated text data, converting it into a plurality of sentences. After sentence extraction is complete, feature words are extracted from each segmented sentence; these feature words characterize the emotion implied by the sentence and may include, for example, conjunctions and negation words. Each sentence is then syntactically analyzed: the weights of the word segments before and after the conjunctions of each sentence are determined, and polarity inversion or double-negation identification is performed on negation words. A score for the sentence is determined comprehensively from the emotional vocabulary in it and the syntactic analysis result; this score represents the semantic parameter of the sentence. The lower the score, the more negative the emotion the sentence characterizes; the higher the score, the more positive. For example, a score of -10 indicates an extremely negative emotion (e.g., violence, anger); a score of -2 indicates a relatively negative emotion (e.g., low mood); a score of 0 indicates a neutral emotion; and a score of +7 indicates a fairly positive emotion (e.g., very happy). The text semantic parameter corresponding to the translated text data is then determined from the semantic parameters of the individual sentences, for example by taking their average as the text semantic parameter, which prevents an outsized semantic difference in a single sentence from causing excessive emotional fluctuation in the finally synthesized sound.
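A toy version of this segmentation and scoring is sketched below; the emotion lexicon, the negation handling, and the averaging are illustrative stand-ins for a real syntactic and sentiment analysis.

```python
import re

EMOTION_WORDS = {"happy": 5, "great": 4, "sad": -4, "angry": -8}  # illustrative
NEGATORS = {"not", "never", "no"}

def sentence_score(sentence: str) -> int:
    """Score one sentence from feature words, with polarity inversion."""
    words = sentence.lower().split()
    score = sum(EMOTION_WORDS.get(w, 0) for w in words)
    if any(w in NEGATORS for w in words):
        score = -score  # polarity inversion on negation words
    return score

def text_semantic_parameter(translated_text: str) -> float:
    """Average the per-sentence semantic parameters over the whole text."""
    sentences = [s for s in re.split(r"[.!?。！？]+", translated_text) if s.strip()]
    scores = [sentence_score(s) for s in sentences]
    return sum(scores) / len(scores) if scores else 0.0

print(text_semantic_parameter("I am happy. This is not sad!"))
```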

Further, as a specific implementation of the method in fig. 1, an embodiment of the present application provides a media data processing apparatus, as shown in fig. 3, the apparatus includes:

a source data receiving module 31, configured to receive source media data, where the source media data includes video data and source audio data;

the audio data translation module 32 is configured to perform voice translation on the source audio data to obtain translated text data, and translate the translated text data to obtain translated text data of a target language;

the sound parameter adjusting module 33 is configured to acquire a text semantic parameter corresponding to the translated text data, and adjust a preset sound synthesis parameter based on the text semantic parameter;

the voice synthesis module 34 is configured to perform voice synthesis on the translated text data according to the adjusted voice synthesis parameters to obtain audio data corresponding to the target language;

and a media data synthesizing module 35, configured to synthesize the audio data and the video data corresponding to the target language to obtain synthesized media data.

In a specific application scenario, as shown in fig. 4, optionally, the audio data translation module 32 specifically includes:

an input data assembling unit 321, configured to assemble the translated text data according to an input parameter assembling rule corresponding to a preset translation line, to obtain translation input data corresponding to the translated text data;

the translation data output unit 322 is configured to invoke a preset translation line, input the translation input data into the preset translation line, and perform translation to obtain translation output data;

and the translation text analysis unit 323 is configured to analyze the translation output data according to an output parameter analysis rule corresponding to a preset translation line to obtain translation text data.

In a specific application scenario, as shown in fig. 4, optionally, the apparatus further includes:

the verification token generation module 36 is configured to, before calling the preset translation line, obtain a verification seed corresponding to the preset translation line, and generate a verification token corresponding to the verification seed according to a token generation rule;

and the line verification module 37 is configured to verify the preset translation line by using the verification token, and if the verification passes, determine that the preset translation line is in an invokable state.

In a specific application scenario, as shown in fig. 4, optionally, the sound parameter adjusting module 33 specifically includes:

a sentence dividing unit 331, configured to divide the translated text data according to a text structure corresponding to the translated text data, so as to obtain a plurality of sentences corresponding to the translated text data;

the semantic parameter determining unit 332 is configured to obtain a semantic parameter corresponding to each sentence, and determine a text semantic parameter corresponding to the translated text data according to the semantic parameter corresponding to each sentence.

In a specific application scenario, as shown in fig. 4, optionally, the source data receiving module 31 is specifically configured to: receiving source media data sent by a video publishing terminal;

the media data synthesizing module 35 specifically includes:

a playing language obtaining unit 351, configured to obtain a playing language corresponding to the video playing terminal, and obtain audio data corresponding to the playing language from the audio data corresponding to the target language;

a playing data synthesizing unit 352, configured to synthesize audio data and video data corresponding to a playing language to obtain playing media data;

and the play data sending unit 353 is configured to send the play media data to the video play terminal.

Optionally, the playing data synthesizing unit 352 specifically includes:

a playing text acquiring subunit 3521, configured to acquire the translated text data corresponding to the playing language;

and a play data synthesizing subunit 3522, configured to synthesize the translated text data corresponding to the play language, the audio data, and the video data to obtain play media data.

Optionally, the playing language obtaining unit 351 specifically includes:

the first language obtaining subunit 3511 is configured to determine a playing language of the video playing terminal according to the geographic location of the video playing terminal; or,

a second language obtaining subunit 3512, configured to determine a playing language of the video playing terminal according to a common language corresponding to the video playing terminal; or,

the third language obtaining subunit 3513 is configured to parse the playing language indicated by the playing instruction according to the playing instruction sent by the video playing terminal.

It should be noted that other corresponding descriptions of the functional units related to the media data processing apparatus provided in the embodiment of the present application may refer to the corresponding descriptions in the methods in fig. 1 to fig. 2, and are not described herein again.

Based on the methods shown in fig. 1 to 2, correspondingly, the present application further provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the media data processing method shown in fig. 1 to 2.

Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the implementation scenarios of the present application.

Based on the method shown in fig. 1 to fig. 2 and the virtual device embodiment shown in fig. 3 to fig. 4, in order to achieve the above object, the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, and the like, where the computer device includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the media data processing method as described above with reference to fig. 1 to 2.

Optionally, the computer device may also include a user interface, a network interface, a camera, Radio Frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and so forth. The user interface may include a Display screen (Display), an input unit such as a keypad (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a bluetooth interface, WI-FI interface), etc.

It will be appreciated by those skilled in the art that the present embodiment provides a computer device architecture that is not limiting of the computer device, and that may include more or fewer components, or some components in combination, or a different arrangement of components.

The storage medium may further include an operating system and a network communication module. An operating system is a program that manages and maintains the hardware and software resources of a computer device, supporting the operation of information handling programs, as well as other software and/or programs. The network communication module is used for realizing communication among components in the storage medium and other hardware and software in the entity device.

Through the above description of the embodiments, those skilled in the art can clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, to: receive source media data; perform voice translation on the source audio data contained in it to obtain the corresponding translated text data; translate the translated text from the source language into translated text data of a target language; adjust sound synthesis parameters according to the text semantic parameters corresponding to the translated text; synthesize the translated text data into audio data of the corresponding target language based on the adjusted parameters; and assemble the target-language audio data with the video data contained in the source media data to obtain synthesized media data. Compared with the prior-art approach of directly playing live video, the present application can convert source media data into media data in multiple languages, making it convenient for users with different language habits to watch; it can also determine sound synthesis parameters from the text semantic parameters of the translated text data and perform sound synthesis with them, so that the synthesized sound better matches the emotion expressed by the source audio data, improving the audiovisual similarity between the synthesized and source media data, the user's viewing experience, and the play count of the video platform.

Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.
