Method and terminal for configuring subtitles through simultaneous translation of videos

Document No.: 1173045    Publication date: 2020-09-18

Reading note: This technology, "Method and terminal for configuring subtitles through simultaneous translation of videos" (一种视频同声翻译配置字幕方法及终端), was designed and created by 郭铭煌, 周春林, 蔡权权, 黄智全, 颜林明 and 刘奕成 on 2020-05-06. Its main content is as follows: The invention provides a method and a terminal for configuring subtitles through simultaneous translation of videos. The method includes: receiving a target language operation instruction selected by a user, and extracting the image signal and the audio signal from a video file; sending the target language and the audio signal to a cloud server or third-party speech-to-text translation software for synchronous translation, and receiving back target subtitles displayed as text in the target language; matching the target subtitles with the image signal in real time based on time synchronization information; and, when the image signal or the audio signal falls out of synchronization with the target subtitles, receiving a user-selected audio-video delay operation instruction to delay the output time of the image signal or the audio signal. By implementing the above method, the terminal solves the problem that a video file has no subtitles or that the subtitles are not in a language familiar to the user, ensures that the translated target subtitles remain synchronized with the video images, and improves the user's experience while watching videos.

1. A method for configuring subtitles through simultaneous translation of videos, characterized by comprising the following steps:

receiving a target language operation instruction selected by a user, and extracting an image signal and an audio signal in a video file;

sending the target language and the audio signal to a cloud server or third-party speech-to-text translation software;

the cloud server or third-party speech-to-text translation software synchronously translates the extracted audio signal and then returns target subtitles displayed as text in the target language;

matching the target subtitles with the image signal in real time based on time synchronization information, and synchronously displaying the target subtitles on a display interface;

and when the image signal or the audio signal falls out of synchronization with the target subtitles during playback, receiving an audio-video delay operation instruction selected by a user to delay the output time of the image signal or the audio signal so that the image signal or the audio signal is synchronously matched with the target subtitles.

2. The method for configuring subtitles through simultaneous translation of videos according to claim 1, wherein matching the target subtitles with the image signal in real time specifically comprises:

acquiring at least one audio data frame included in the audio signal, and determining a time stamp of the audio data frame;

acquiring the image signal corresponding to the time stamp of the audio data frame and an image data frame included in the image signal;

and combining the target subtitles corresponding to the audio data frames with the image data frames based on the time stamps of the audio data frames and performing real-time matching.

3. The method for configuring subtitles through simultaneous translation of videos according to claim 1, wherein receiving the target language operation instruction selected by a user and extracting the image signal and the audio signal comprises:

setting a floating button on the display interface;

receiving a target language operation instruction selected by a user clicking the floating button;

extracting the image signal and the audio signal based on the selected target language.

4. The method for configuring subtitles through simultaneous translation of videos according to claim 3, wherein the step of synchronously displaying the target subtitles on the display interface comprises:

the floating button is converted into a floating window, and the floating window is used for displaying the target subtitle on the display interface;

and receiving a subtitle style operation instruction, an audio-video delay operation instruction or a volume adjustment operation instruction selected by the user by clicking the floating window, wherein the subtitle style comprises subtitle color, subtitle transparency and font size.

5. The method for configuring subtitles through simultaneous translation of videos according to claim 4, wherein the floating window can be operated by gestures to move the display position of the floating window on the display interface or to zoom the display scale of the floating window.

6. A terminal, characterized in that the terminal comprises:

the storage module is used for storing a video playing program and a program for configuring subtitles through simultaneous translation of videos;

the configuration module is used for receiving a target language operation instruction selected by a user, receiving an audio-video delay operation instruction selected by the user, and displaying the configured subtitles;

the extraction module is used for extracting image signals and audio signals in the video file;

the sending module is used for sending the target language and the audio signal to a cloud server or third-party speech-to-text translation software for translation, and receiving the target subtitles, displayed as text in the target language, returned by the cloud server or the third-party speech-to-text translation software;

the synchronization module is used for matching the image signal with the target subtitles in real time based on time synchronization information;

and the audio-visual module is used for displaying the image signal and playing the audio signal.

7. The terminal of claim 6, wherein the synchronization module is further configured to:

acquiring at least one audio data frame included in the audio signal, and determining a time stamp of the audio data frame;

acquiring the image signal corresponding to the time stamp of the audio data frame and an image data frame included in the image signal;

and combining the target subtitles corresponding to the audio data frames with the image data frames based on the time stamps of the audio data frames and performing real-time matching.

8. The terminal of claim 6, wherein the configuration module further comprises:

the floating button module is used for receiving the target language operation instruction selected by the user;

and the floating window module is used for displaying the target caption.

9. The terminal of claim 8, wherein the floating window module is further configured to:

and receiving a subtitle style operation instruction, an audio-video delay operation instruction or a volume adjustment operation instruction selected by the user, wherein the subtitle style comprises subtitle color, subtitle transparency and font size.

10. The terminal of claim 9, wherein the floating window module is further configured to:

receiving a gesture operation instruction of a user, and moving the display position of the floating window on a display interface or zooming the display scale of the floating window.

Technical Field

The invention relates to the technical field of multimedia, and in particular to a method and a terminal for configuring subtitles through simultaneous translation of videos.

Background

With the development of the internet and intelligent terminals, people increasingly use intelligent terminals such as mobile phones, tablet computers, desktop computers and smart televisions to watch films and videos. However, some foreign film and television works or videos have no subtitles, or the subtitles are not in a language familiar to the user, resulting in a poor viewing experience.

Thus, there is still a need for improvement and development of the prior art.

Disclosure of Invention

In view of the above situation, it is desirable to provide a method and a terminal for configuring subtitles through simultaneous translation of videos so as to solve the above problems.

The invention provides a method for configuring subtitles by simultaneous translation of videos, which comprises the following steps:

receiving a target language operation instruction selected by a user, and extracting an image signal and an audio signal in a video file;

sending the target language and the audio signal to a cloud server or third-party speech-to-text translation software;

the cloud server or third-party speech-to-text translation software synchronously translates the extracted audio signal and then returns target subtitles displayed as text in the target language;

matching the target subtitles with the image signal in real time based on time synchronization information, and synchronously displaying the target subtitles on a display interface;

and when the image signal or the audio signal falls out of synchronization with the target subtitles during playback, receiving an audio-video delay operation instruction selected by a user to delay the output time of the image signal or the audio signal so that the image signal or the audio signal is synchronously matched with the target subtitles.

Further, matching the target caption with the image signal in real time specifically includes:

acquiring at least one audio data frame included in the audio signal, and determining a time stamp of the audio data frame;

acquiring the image signal corresponding to the time stamp of the audio data frame and an image data frame included in the image signal;

and combining the target subtitles corresponding to the audio data frames with the image data frames based on the time stamps of the audio data frames and performing real-time matching.

Further, receiving the target language operation instruction selected by the user, and extracting the image signal and the audio signal specifically includes:

setting a floating button on the display interface;

receiving a target language operation instruction selected by a user clicking the floating button;

extracting the image signal and the audio signal based on the selected target language.

Further, synchronously displaying the target subtitles on the display interface specifically includes:

the floating button is converted into a floating window, and the floating window is used for displaying the target subtitle on the display interface;

and receiving a subtitle style operation instruction, an audio-video delay operation instruction or a volume adjustment operation instruction selected by the user by clicking the floating window, wherein the subtitle style comprises subtitle color, subtitle transparency and font size.

Further, the floating window can be operated by gestures to move the display position of the floating window on the display interface or to zoom the display scale of the floating window.

Further, the present invention also provides a terminal, including:

the storage module is used for storing a video playing program and a program for configuring subtitles through simultaneous translation of videos;

the configuration module is used for receiving a target language operation instruction selected by a user, receiving an audio-video delay operation instruction selected by the user, and displaying the configured subtitles;

the extraction module is used for extracting image signals and audio signals in the video file;

the sending module is used for sending the target language and the audio signal to a cloud server or third-party speech-to-text translation software for translation, and receiving the target subtitles, displayed as text in the target language, returned by the cloud server or the third-party speech-to-text translation software;

the synchronization module is used for matching the image signal with the target subtitles in real time based on time synchronization information;

and the audio-visual module is used for displaying the image signal and playing the audio signal.

Further, the synchronization module is further configured to:

acquiring at least one audio data frame included in the audio signal, and determining a time stamp of the audio data frame;

acquiring the image signal corresponding to the time stamp of the audio data frame and an image data frame included in the image signal;

and combining the target subtitles corresponding to the audio data frames with the image data frames based on the time stamps of the audio data frames and performing real-time matching.

Further, the configuration module further comprises:

the floating button module is used for receiving the target language operation instruction selected by the user;

and the floating window module is used for displaying the target caption.

Further, the floating window module is further configured to:

and receiving a subtitle style operation instruction, an audio-video delay operation instruction or a volume adjustment operation instruction selected by the user, wherein the subtitle style comprises subtitle color, subtitle transparency and font size.

Further, the floating window module is further configured to:

receiving a gesture operation instruction of a user, and moving the display position of the floating window on a display interface or zooming the display scale of the floating window.

The invention provides a method and a terminal for configuring subtitles through simultaneous translation of videos. During video playback, the extracted audio signal is sent, according to the language requirement of the user, to a cloud server or third-party speech-to-text translation software, converted into target subtitles in the language required by the user, and displayed synchronously with the video images. When the image signal falls out of synchronization with the target subtitles, the output time of the image signal or the audio signal is delayed according to an audio-video delay operation instruction selected by the user, so that the target subtitles, the image signal and the audio signal are played synchronously. This solves the problem in the prior art that a video file has no subtitles or the subtitles are not in a language familiar to the user, ensures that the translated target subtitles stay synchronized with the image signal and the audio signal, and improves the user's experience while watching the video.

Drawings

Fig. 1 is a flowchart of a method for configuring subtitles in video simultaneous interpretation according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a display interface on which the floating button is set in an embodiment of the invention.

FIG. 3 is a schematic diagram of the display interface after the floating button is clicked in an embodiment of the invention.

FIG. 4 is a schematic diagram of the display interface in which the floating button is converted into the floating window in an embodiment of the present invention.

FIG. 5 is a diagram of a display interface after clicking a floating window in an embodiment of the invention.

Fig. 6 is a schematic view of a display interface after the display position of the floating window is changed by gesture operation in an embodiment of the present invention.

FIG. 7 is a schematic diagram of a display interface after zooming the display scale of the floating window through gesture operation in an embodiment of the present invention.

Fig. 8 is a flowchart of step S40 in the method for configuring subtitles by simultaneous video translation according to an embodiment of the present invention.

Fig. 9 is a block diagram of a terminal in an embodiment of the invention.

Description of the main elements

11 display interface

12 floating button

13 floating window

100 terminal

110 storage module

120 configuration module

121 floating button module

122 floating window module

130 extraction module

140 sending module

150 synchronization module

160 audiovisual module

The following detailed description will further illustrate the invention in conjunction with the above-described figures.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

Referring to fig. 1, the present invention provides a method for configuring subtitles through simultaneous translation of videos, which includes the following steps:

and step S10, receiving the target language operation instruction selected by the user, and extracting the image signal and the audio signal in the video file.

In this embodiment, receiving the target language operation instruction selected by the user, and extracting the image signal and the audio signal specifically includes:

As shown in figs. 2-3, taking a mobile terminal such as a mobile phone as an example, a floating button 12 is set on the display interface 11. When the floating button 12 is clicked, it expands to display a plurality of selectable target languages; an operation instruction in which the user selects a target language by clicking the floating button 12 is received, and the image signal and the audio signal are extracted based on the selected target language.

Further, the floating button 12 may be set to be always displayed on the display interface 11 whether or not a video file is playing, or to be displayed on the display interface 11 only while a video file is playing.
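By way of a non-limiting illustration only, the "target language operation instruction" of step S10 could be modeled as a language value handed to the terminal by a callback that the floating button invokes when the user taps an entry of its expanded menu. The following minimal Kotlin sketch assumes this structure; the names TargetLanguage, FloatingButtonController and onLanguageSelected are illustrative and do not come from the patent.

```kotlin
// Hypothetical model of the "target language operation instruction" of step S10.
// All names are illustrative; the patent does not prescribe an API.

enum class TargetLanguage { CHINESE, ENGLISH, FRENCH, GERMAN, JAPANESE, KOREAN }

class FloatingButtonController(
    // Called when the user picks a language from the expanded floating button (fig. 3).
    private val onLanguageSelected: (TargetLanguage) -> Unit
) {
    // Languages shown when the floating button 12 expands.
    private val menu: List<TargetLanguage> = TargetLanguage.values().toList()

    // Simulates the user clicking one of the expanded menu entries.
    fun click(index: Int) {
        val language = menu.getOrNull(index) ?: return
        onLanguageSelected(language)   // the "operation instruction" handed to the terminal
    }
}

fun main() {
    val controller = FloatingButtonController { language ->
        println("Target language selected: $language -> start extracting image/audio signals")
    }
    controller.click(1)   // e.g. the user taps "ENGLISH"
}
```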

Further, the video file may be a video shot in real time, a locally stored video, or a video played online by a video player. The audio signal may be extracted and translated in its entirety from a downloaded or cached video file in advance, implementing translation before playback, or extracted and translated while the video file is being cached and played, implementing translation during playback. The target language may be any one of Chinese, English, French, German, Japanese, Korean, and the like.
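The patent leaves the extraction mechanism open. On an Android-style mobile terminal, one plausible way to pull the audio track and its presentation timestamps out of a locally stored video file is android.media.MediaExtractor; the sketch below is an assumption-laden illustration (local file path, audio track only, "translate before playing" case) and not part of the claimed method.

```kotlin
import android.media.MediaExtractor
import android.media.MediaFormat
import java.nio.ByteBuffer

// Sketch only: reads compressed audio samples and their timestamps (in microseconds)
// from a local video file. Real code would also handle the image track, errors,
// and the "translate while playing" case described above.
fun extractAudioSamples(videoPath: String): List<Pair<Long, ByteArray>> {
    val extractor = MediaExtractor()
    extractor.setDataSource(videoPath)

    // Find the first audio track in the container.
    val audioTrack = (0 until extractor.trackCount).firstOrNull { i ->
        extractor.getTrackFormat(i).getString(MediaFormat.KEY_MIME)?.startsWith("audio/") == true
    } ?: run { extractor.release(); return emptyList() }

    extractor.selectTrack(audioTrack)
    val buffer = ByteBuffer.allocate(1 shl 16)
    val samples = mutableListOf<Pair<Long, ByteArray>>()

    // Each sample keeps its presentation timestamp, later used as the
    // time synchronization information of step S40.
    while (true) {
        val size = extractor.readSampleData(buffer, 0)
        if (size < 0) break
        buffer.rewind()
        val bytes = ByteArray(size)
        buffer.get(bytes, 0, size)
        buffer.clear()
        samples += extractor.sampleTime to bytes
        extractor.advance()
    }
    extractor.release()
    return samples
}
```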

Step S20, sending the target language and the audio signal to a cloud server or third-party speech-to-text translation software.

Step S30, after the cloud server or the third-party speech-to-text translation software synchronously translates the extracted audio signal, returning the target subtitles displayed as text in the target language.

In this embodiment, the communication mode for sending the target language and the audio signal to the cloud server and for returning the target subtitles from the cloud server may be any one of Bluetooth, Wi-Fi, 5G or a mobile communication network. The third-party speech-to-text translation software may be third-party translation software downloaded and installed in advance, or an offline speech-to-text translation package downloaded in advance.
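The patent does not define the cloud interface. As one hedged illustration of what the step S20/S30 exchange might look like, the sketch below POSTs the target language and base64-encoded audio to a hypothetical endpoint and reads the response back; the URL, JSON field names and response format are all invented for illustration and are not part of the patent.

```kotlin
import java.net.HttpURLConnection
import java.net.URL
import java.util.Base64

// One timed subtitle segment that a translation service might return (assumed shape).
data class SubtitleSegment(val startUs: Long, val endUs: Long, val text: String)

// Sketch of steps S20/S30: send the target language plus base64-encoded audio to a
// cloud speech-to-text translation endpoint and return the raw response body.
fun requestTargetSubtitles(targetLanguage: String, audio: ByteArray): String {
    val payload = """
        {"targetLanguage":"$targetLanguage",
         "audioBase64":"${Base64.getEncoder().encodeToString(audio)}"}
    """.trimIndent()

    val connection = URL("https://example.com/api/translate").openConnection() as HttpURLConnection
    connection.requestMethod = "POST"
    connection.doOutput = true
    connection.setRequestProperty("Content-Type", "application/json")
    connection.outputStream.use { it.write(payload.toByteArray()) }

    // The service is assumed to reply with a JSON array of segments such as
    // {"startUs":0,"endUs":1500000,"text":"..."}, which the terminal would parse
    // into SubtitleSegment objects for step S40.
    return connection.inputStream.bufferedReader().use { it.readText() }
}
```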

Step S40, matching the target subtitles with the image signal in real time based on the time synchronization information, and synchronously displaying the target subtitles on the display interface.

In this embodiment, synchronously displaying the target subtitles on the display interface specifically includes:

As shown in figs. 4-7, on the display interface 11 the floating button 12 is converted into a floating window 13, and the floating window 13 is used for displaying the target subtitles on the display interface 11. By clicking the floating window 13, the floating window 13 expands to display a plurality of further selectable operation instructions, for example a subtitle style operation instruction, an audio-video delay operation instruction or a volume adjustment operation instruction. The subtitle style of the target subtitles comprises subtitle color, subtitle transparency and font size. For example, the user may select red as the final output color of the target subtitles; the subtitle transparency may be chosen according to the user's needs; the font size may be chosen according to the user's needs or zoomed directly by gesture operation; and other subtitle parameters or related settings may also be selected.

Further, the display position of the floating window 13 on the display interface 11 may be moved by gesture operation, or the display scale of the floating window 13 may be zoomed, for example by dragging the floating window 13 to a new position on the display interface 11, or by zooming the display scale of the floating window 13 to adjust the length or font size of the target subtitles.
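As a small, hypothetical illustration of the subtitle style options mentioned above (color, transparency, font size), the Kotlin sketch below applies a user-chosen style to an Android TextView that could sit inside the floating window 13; the SubtitleStyle class and its default values are assumptions, not part of the patent.

```kotlin
import android.graphics.Color
import android.util.TypedValue
import android.widget.TextView

// Illustrative container for the subtitle style options selectable from the floating window.
data class SubtitleStyle(
    val colorHex: String = "#FF0000",   // e.g. the user picks red as the output color
    val transparency: Float = 0.8f,     // 0.0 = fully transparent, 1.0 = fully opaque
    val fontSizeSp: Float = 18f         // font size in scale-independent pixels
)

// Applies the chosen style to the TextView inside the floating window 13.
fun applySubtitleStyle(subtitleView: TextView, style: SubtitleStyle) {
    subtitleView.setTextColor(Color.parseColor(style.colorHex))
    subtitleView.alpha = style.transparency
    subtitleView.setTextSize(TypedValue.COMPLEX_UNIT_SP, style.fontSizeSp)
}
```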

Further, matching the target subtitles with the image signal in real time specifically includes the steps shown in fig. 8:

s41, acquiring at least one audio data frame included in the audio signal, and determining the time stamp of the audio data frame;

s42, acquiring the image signal corresponding to the time stamp of the audio data frame and the image data frame included in the image signal;

S43, combining the target subtitles corresponding to the audio data frames with the image data frames based on the time stamps of the audio data frames and performing real-time matching.

In this embodiment, the time stamp of an audio data frame generally includes a start time and an end time. The target subtitles corresponding to the audio data frames are then combined with the image data frames based on these start and end times and matched in real time, so that the target subtitles and the image signal are played synchronously during playback.
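The patent describes steps S41-S43 only at the level of timestamps. A minimal, non-authoritative Kotlin sketch of that matching is given below: each translated subtitle segment carries the start and end timestamps of the audio data frames it was produced from, and an image data frame is paired with the segment whose interval contains the frame's timestamp. All class and function names are illustrative assumptions.

```kotlin
// Illustrative types for steps S41-S43; timestamps are in microseconds.
data class SubtitleSegment(val startUs: Long, val endUs: Long, val text: String)
data class ImageFrame(val timestampUs: Long)
data class RenderedFrame(val frame: ImageFrame, val subtitle: String?)

// S43: combine each image data frame with the target subtitles whose audio
// time-stamp interval [startUs, endUs] contains the frame's timestamp.
fun matchSubtitlesToFrames(
    frames: List<ImageFrame>,
    subtitles: List<SubtitleSegment>
): List<RenderedFrame> {
    val sorted = subtitles.sortedBy { it.startUs }
    return frames.map { frame ->
        val segment = sorted.firstOrNull { frame.timestampUs in it.startUs..it.endUs }
        RenderedFrame(frame, segment?.text)   // null means no subtitle for this frame
    }
}

fun main() {
    val subtitles = listOf(
        SubtitleSegment(0, 2_000_000, "Hello"),
        SubtitleSegment(2_000_000, 4_000_000, "World")
    )
    val frames = listOf(ImageFrame(500_000), ImageFrame(2_500_000), ImageFrame(5_000_000))
    matchSubtitlesToFrames(frames, subtitles).forEach {
        println("${it.frame.timestampUs} us -> ${it.subtitle}")
    }
}
```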

Step S50, when the image signal or the audio signal is not synchronized with the target subtitles during playback, receiving a user-selected audio-video delay operation instruction to delay the output time of the image signal or the audio signal, so that the image signal or the audio signal is synchronously matched with the target subtitles.

In this embodiment, when the image signal or the audio signal is not synchronized with the target subtitles during playback, the user clicks the floating window 13 and selects the audio-video delay operation instruction. The audio-video delay operation instruction may offer a choice of delay times, for example 1 s, 5 s or 10 s, displayed on the floating window 13; after the delay time selected by the user is received, the image signal or the audio signal is played with that delay. The audio-video delay operation instruction may also offer a choice of advance times, for example 1 s, 5 s or 10 s, displayed on the floating window 13; after the advance time selected by the user is received, the image signal or the audio signal is played correspondingly earlier. If, after the user selects a specific delay time, the target subtitles are now played earlier than the image signal, an advance time selected by the user is received, and the user can repeatedly alternate between delay times and advance times until the target subtitles, the image signal and the audio signal are played synchronously.
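Step S50 amounts to applying a signed, user-adjustable offset to the presentation time of the image or audio signal. The hedged Kotlin sketch below models that adjustment; the 1 s, 5 s and 10 s choices follow the example above, while the class and its accumulation behaviour are assumptions rather than the patent's implementation.

```kotlin
// Sketch of the audio-video delay adjustment of step S50.
// A positive offset delays output, a negative offset advances it; repeated user
// selections accumulate until subtitles, image and audio play in sync.
class AvDelayController {
    var offsetUs: Long = 0L
        private set

    // Delay/advance options offered by the floating window, in microseconds (1 s, 5 s, 10 s).
    val choicesUs = listOf(1_000_000L, 5_000_000L, 10_000_000L)

    fun delayBy(amountUs: Long) { offsetUs += amountUs }     // play later
    fun advanceBy(amountUs: Long) { offsetUs -= amountUs }   // play earlier

    // Presentation time actually used when outputting an image or audio frame.
    fun adjustedTimestampUs(originalUs: Long): Long = originalUs + offsetUs
}

fun main() {
    val controller = AvDelayController()
    controller.delayBy(controller.choicesUs[1])    // user picks a 5 s delay
    controller.advanceBy(controller.choicesUs[0])  // subtitles now lead, user advances by 1 s
    println("net offset: ${controller.offsetUs / 1_000_000.0} s")                     // 4.0 s
    println("frame at 10 s is shown at ${controller.adjustedTimestampUs(10_000_000L) / 1_000_000.0} s")
}
```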

The invention provides a method for configuring subtitles through simultaneous translation of videos. During video playback, the extracted audio signal is sent, according to the language requirement of the user, to a cloud server or third-party speech-to-text translation software, converted into target subtitles in the language required by the user, and displayed synchronously with the video images. When the image signal falls out of synchronization with the target subtitles, the output time of the image signal or the audio signal is delayed according to an audio-video delay operation instruction selected by the user, so that the target subtitles, the image signal and the audio signal are played synchronously. This solves the problem in the prior art that a video file has no subtitles or the subtitles are not in a language familiar to the user, ensures that the translated target subtitles stay synchronized with the image signal and the audio signal, and improves the user's experience while watching the video.

Referring to fig. 9, as an implementation of the method shown in the above figures, the present invention provides a terminal 100, where the terminal 100 includes a storage module 110, a configuration module 120, an extraction module 130, a sending module 140, a synchronization module 150, and an audiovisual module 160. Fig. 9 shows only some of the modules of the terminal 100; it is to be understood that not all of the shown modules are required to be implemented, and more or fewer modules may be implemented instead.

In this embodiment, the terminal 100 may be implemented in various forms, such as a mobile terminal, e.g., a mobile phone, a tablet computer, a notebook computer, a palm computer, etc., and a fixed terminal, e.g., a digital television, a desktop computer, etc.

The storage module 110 is configured to store a video playing program and a program for configuring subtitles through simultaneous translation of videos.

In this embodiment, the storage module 110 may be an internal storage unit of the terminal 100, such as a hard disk or the memory of a mobile phone, or an external storage device of the terminal 100, such as a plug-in hard disk, a smart card, a secure digital card or a flash memory card; it may also include both the internal storage unit and the external storage device.

The configuration module 120 is configured to receive a target language operation instruction selected by the user, receive an audio-video delay operation instruction selected by the user, and display the configured subtitles.

In this embodiment, the configuration module 120 further includes a floating button module 121 and a floating window module 122, where the floating button module 121 and the floating window module 122 are respectively displayed on the display interface 11 of the audiovisual module 160 in the form of a floating button 12 and a floating window 13. Wherein:

the floating button module 121 is configured to receive the target language operation instruction selected by the user;

the floating window module 122 is configured to display the target subtitle.

In this embodiment, the floating window module 122 is further configured to receive the subtitle style operation instruction, the audio-video delay operation instruction or the volume adjustment operation instruction selected by the user. The subtitle style includes subtitle color, subtitle transparency and font size. For example, the user may select red as the final output color of the target subtitles; the subtitle transparency may be chosen according to the user's needs; the font size may be chosen according to the user's needs or zoomed directly by gesture operation; and other subtitle parameters or related settings may also be selected.

Further, the floating window module 122 is further configured to receive a gesture operation from the user and to move the display position of the floating window 13 on the display interface 11 or zoom the display scale of the floating window 13, for example by dragging the floating window 13 to a new position on the display interface 11, or by zooming the display scale of the floating window 13 to adjust the length or font size of the target subtitles.

The extracting module 130 is configured to extract an image signal and an audio signal in a video file.

The sending module 140 is configured to send the target language and the audio signal to a cloud server or third-party speech-to-text translation software for translation, and to receive the target subtitles, displayed as text in the target language, returned by the cloud server or the third-party speech-to-text translation software.

In this embodiment, the communication mode between the sending module 140 and the cloud server may be any one of Bluetooth, Wi-Fi, 5G or a mobile communication network.

The synchronization module 150 is configured to match the image signal with the target subtitle in real time based on time synchronization information.

In this embodiment, the synchronization module 150 is further configured to: acquire at least one audio data frame included in the audio signal and determine a time stamp of the audio data frame; acquire the image signal corresponding to the time stamp of the audio data frame and an image data frame included in the image signal; and combine the target subtitles corresponding to the audio data frames with the image data frames based on the time stamps of the audio data frames and perform real-time matching.

Further, when the image signal and the target subtitles are not synchronized during playback, the floating window module 122 receives the audio-video delay operation instruction selected by the user. The audio-video delay operation instruction may offer a choice of delay times, for example 1 s, 5 s or 10 s; after the delay time selected by the user is received, the image signal or the audio signal is played with that delay. The audio-video delay operation instruction may also offer a choice of advance times, for example 1 s, 5 s or 10 s; after the advance time selected by the user is received, the image signal or the audio signal is played correspondingly earlier. If, after the user selects a specific delay time, the target subtitles are now played earlier than the image signal, an advance time selected by the user is received, and the user can repeatedly alternate between delay times and advance times until the target subtitles, the image signal and the audio signal are played synchronously.

The audiovisual module 160 is configured to display the image signal and play the audio signal.
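The module split described above (storage 110, configuration 120, extraction 130, sending 140, synchronization 150, audiovisual 160) can be pictured as a set of interfaces wired together by the terminal. The Kotlin sketch below is only an architectural illustration under that assumption; none of the interface names or signatures come from the patent.

```kotlin
// Purely illustrative interfaces mirroring the modules of terminal 100 (fig. 9);
// none of these names or signatures come from the patent.
interface StorageModule { fun loadVideo(path: String): ByteArray }                                    // 110
interface ConfigurationModule { fun targetLanguage(): String; fun avOffsetUs(): Long }                // 120
interface ExtractionModule { fun extractAudio(video: ByteArray): ByteArray }                          // 130
interface SendingModule { fun translate(language: String, audio: ByteArray): List<String> }           // 140
interface SynchronizationModule { fun match(subtitles: List<String>, offsetUs: Long): List<String> }  // 150
interface AudiovisualModule { fun render(subtitles: List<String>) }                                   // 160

// The terminal chains the modules for one video file: load -> extract -> translate -> sync -> render.
class Terminal(
    private val storage: StorageModule,
    private val config: ConfigurationModule,
    private val extraction: ExtractionModule,
    private val sending: SendingModule,
    private val sync: SynchronizationModule,
    private val audiovisual: AudiovisualModule
) {
    fun play(path: String) {
        val video = storage.loadVideo(path)
        val audio = extraction.extractAudio(video)
        val subtitles = sending.translate(config.targetLanguage(), audio)
        audiovisual.render(sync.match(subtitles, config.avOffsetUs()))
    }
}
```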

The terminal provided by the invention sends the extracted audio signal, according to the language requirement of the user, to a cloud server or third-party speech-to-text translation software during video playback, converts it into target subtitles in the language required by the user, and displays them synchronously with the video images. When the image signal falls out of synchronization with the target subtitles, the output time of the image signal or the audio signal is delayed according to the audio-video delay operation instruction selected by the user, so that the target subtitles, the image signal and the audio signal are played synchronously. This solves the problem in the prior art that a video file has no subtitles or the subtitles are not in a language familiar to the user, ensures that the translated target subtitles stay synchronized with the image signal and the audio signal, and improves the user's experience while watching the video.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit of the technical solutions of the present invention.
