AI translation method and system for automatically converting dialect into caption

Document No.: 1099031    Publication date: 2020-09-25

Note: This invention, "AI translation method and system for automatically converting dialect into subtitles" (一种将方言自动转换为字幕的AI翻译方法和系统), was designed and created by 白志勇 on 2020-05-26.

Abstract: The invention discloses an AI translation method and system for automatically converting dialects into subtitles, comprising the following steps: acquiring a video stream and an audio stream, and preprocessing the audio stream; generating speech features of the preprocessed audio stream, comparing them with a dialect speech feature library, and translating the audio stream to obtain Mandarin; and converting the translated Mandarin into TTS speech and text and adding the text to the video stream as subtitles. The invention designs a translation system for regional dialects and mainland Mandarin, so that regional dialects can be translated into Mandarin and converted into TTS speech and subtitles, making proofreading by reviewers easier.

1. An AI translation method for automatically converting dialects into subtitles, characterized by comprising the steps of:

S1, acquiring a video stream and an audio stream, and preprocessing the audio stream;

S2, generating speech features of the preprocessed audio stream, comparing the speech features with a dialect speech feature library, and translating the audio stream to obtain Mandarin;

S3, converting the translated Mandarin into TTS speech and text, and adding the text to the video stream as subtitles.

2. The AI translation method according to claim 1, wherein the pre-processing of the audio stream in step S1 specifically includes the following steps:

carrying out voice activity detection, speech enhancement and speech separation on the audio stream, separating the human voice from the audio stream, and recording the time axis of the human voice in the audio stream;

and dotting on the time axis of the video stream according to the human voice starting point and the human voice ending point in the time axis.

3. The AI translation method according to claim 2, wherein in step S3, text is added as subtitles to the video stream according to the dotting position on the time axis of the video stream.

4. The AI translation method according to claim 2, further comprising the steps of:

S4, removing the audio stream, and adding the TTS speech to the video stream according to the dotting positions on the time axis of the video stream.

5. The AI translation method according to claim 1, wherein in step S2 the dialect speech feature library is sorted according to usage frequency; after the speech features of the preprocessed audio stream are generated, they are compared with the dialect speech feature library in that order, and the audio stream is translated to obtain Mandarin.

6. An AI translation system for automatically converting dialects into subtitles, comprising

The audio and video processing unit is used for acquiring a video stream and an audio stream and preprocessing the audio stream;

the audio translation unit is used for generating speech features for the preprocessed audio stream, comparing the speech features with a dialect speech feature library, and translating the audio stream to obtain Mandarin;

and the conversion unit is used for converting the translated Mandarin into TTS speech and text and sending the text to the audio and video processing unit, which adds the text to the video stream as subtitles.

7. The AI translation system for automatically converting dialects into subtitles according to claim 6, wherein the audio/video processing unit preprocesses the audio stream, specifically: it performs voice activity detection, speech enhancement and speech separation on the audio stream, separates the human voice from the audio stream, records the time axis of the human voice in the audio stream, and dots the time axis of the video stream according to the human voice start point and end point.

8. The AI translation system according to claim 7, wherein the audio/video processing unit adds the text to the video stream as subtitles according to the dotting positions on the time axis of the video stream.

9. The AI translation system according to claim 7, wherein the audio/video processing unit is further configured to remove the audio stream and add the TTS speech to the video stream according to the dotting positions on its time axis.

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to an AI translation method and system for automatically converting dialects into subtitles.

Background

At present, reporters record video and audio of interviewees during interviews, and subtitles must be added to the video afterwards so that audiences can easily follow and understand what the interviewee says. Interviews, however, sometimes involve regional dialects from all over the country, which editors and reviewers may be unable to understand accurately enough to match subtitles to the speech.

Disclosure of Invention

The invention provides an AI translation method and system for automatically converting dialects into subtitles, which solve the prior-art problem that interviews sometimes involve regional dialects from all over the country, so that editors and reviewers may be unable to accurately understand the speech and match subtitles to it.

The technical scheme of the invention is realized as follows:

an AI translation method for automatically converting dialects into subtitles, comprising the following steps:

S1, acquiring a video stream and an audio stream, and preprocessing the audio stream;

S2, generating speech features of the preprocessed audio stream, comparing the speech features with a dialect speech feature library, and translating the audio stream to obtain Mandarin;

S3, converting the translated Mandarin into TTS speech and text, and adding the text to the video stream as subtitles.

As a preferred embodiment of the present invention, the pre-processing of the audio stream in step S1 specifically includes the following steps:

carrying out voice activity detection, speech enhancement and speech separation on the audio stream, separating the human voice from the audio stream, and recording the time axis of the human voice in the audio stream;

and dotting on the time axis of the video stream according to the human voice starting point and the human voice ending point in the time axis.

As a preferred embodiment of the present invention, in step S3, text is added to the video stream as subtitles according to the dotting position on the time axis of the video stream.

As a preferred embodiment of the present invention, the method further comprises the following steps:

S4, removing the audio stream, and adding the TTS speech to the video stream according to the dotting positions on the time axis of the video stream.

In step S2, the dialect speech feature library is sorted according to usage frequency; after the speech features of the preprocessed audio stream are generated, they are compared with the dialect speech feature library in that order, and the audio stream is translated to obtain Mandarin.

An AI translation system for automatically converting dialects into subtitles, comprising

The audio and video processing unit is used for acquiring a video stream and an audio stream and preprocessing the audio stream;

the audio translation unit is used for generating speech features for the preprocessed audio stream, comparing the speech features with a dialect speech feature library, and translating the audio stream to obtain Mandarin;

and the conversion unit is used for converting the translated Mandarin into TTS speech and text and sending the text to the audio and video processing unit, which adds the text to the video stream as subtitles.

As a preferred embodiment of the present invention, the audio/video processing unit preprocesses the audio stream, specifically: it performs voice activity detection, speech enhancement and speech separation on the audio stream, separates the human voice from the audio stream, records the time axis of the human voice in the audio stream, and dots the time axis of the video stream according to the human voice start point and end point.

As a preferred embodiment of the present invention, the audio/video processing unit adds the text to the video stream as subtitles according to the dotting positions on the time axis of the video stream.

As a preferred embodiment of the present invention, the audio/video processing unit is further configured to remove the audio stream, and add the TTS speech to the video stream according to the dotting position on the time axis of the video stream.

The invention has the following beneficial effects: a translation system is designed for regional dialects and mainland Mandarin, so that regional dialects can be translated into Mandarin and converted into TTS speech and subtitles, which makes proofreading by reviewers easier.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic flow diagram of an embodiment of the AI translation method for automatically converting dialects into subtitles according to the present invention;

FIG. 2 is a schematic block diagram of an embodiment of the AI translation system for automatically converting dialects into subtitles according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that the terms "vertical", "upper", "lower", "horizontal", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.

In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "disposed," "mounted," "connected," and "connected" are to be construed broadly and may, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

As shown in fig. 1, the present invention provides an AI translation method for automatically converting dialects into subtitles, comprising the following steps:

S1, acquiring a video stream and an audio stream, and preprocessing the audio stream. The video stream and the audio stream can be an independent video stream captured by a camera and an audio stream captured by a microphone, with the camera and the microphone working simultaneously so that the time axes of the two streams remain aligned. The video stream and the audio stream may also be separated from a video file, whose format may be rm, rmvb, mpeg1-4, mov, mtv, dat, wmv, avi, 3gp, amv, dmv, flv, etc.
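For illustration, the following is a minimal demuxing sketch in Python (file names are hypothetical, and it assumes the ffmpeg command-line tool is available, which the patent does not specify) that splits a recorded file into a silent video stream and a mono 16 kHz audio stream:

import subprocess

def demux(video_file: str, video_out: str, audio_out: str) -> None:
    """Split a recording into a silent video stream and a mono 16 kHz audio stream."""
    # Copy the video track without re-encoding and drop the audio track (-an).
    subprocess.run(["ffmpeg", "-y", "-i", video_file, "-an", "-c:v", "copy", video_out],
                   check=True)
    # Extract the audio track as 16 kHz mono WAV, a common input format for speech processing.
    subprocess.run(["ffmpeg", "-y", "-i", video_file, "-vn", "-ac", "1", "-ar", "16000", audio_out],
                   check=True)

# Example (hypothetical paths): demux("interview.avi", "interview_video.mp4", "interview_audio.wav")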

S2, generating speech features of the preprocessed audio stream: speech feature vectors are extracted through the Fourier transform (discrete Fourier transform) of the audio stream and time-frequency operations such as filter banks, windowing and smoothing, and cepstral analysis; the features are then compared with a dialect speech feature library, and the audio stream is translated to obtain Mandarin.
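A minimal sketch of such a feature-extraction chain (framing and windowing, short-time Fourier transform, mel filter bank, cepstral analysis), assuming the librosa library; the patent does not prescribe a particular toolkit or feature type, and MFCCs are used here only as a common example:

import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return one cepstral feature vector per 10 ms frame of the pre-processed audio stream."""
    signal, _ = librosa.load(wav_path, sr=sr, mono=True)
    # 25 ms windowed frames with a 10 ms hop: short-time DFT -> mel filter bank -> log -> DCT.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return mfcc.T  # shape: (frames, 13), time along the first axis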

In a specific operation process, which dialect is being spoken can be judged by obtaining the speech feature vectors of common words, such as "you", "me", "everyone" and "hello", and comparing them with the speech feature vectors of the corresponding words in the dialect speech feature library.

The dialect speech feature library stores speech segments of different dialects and their corresponding speech feature vectors.
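The comparison against the dialect speech feature library could, for example, be implemented as a similarity search over the feature vectors of such common words. The sketch below is only an illustration; the library layout, the cosine-similarity scoring and the acceptance threshold are assumptions, not details taken from the patent:

import numpy as np

def identify_dialect(word_features: dict, dialect_library: dict, threshold: float = 0.8):
    """word_features: {"hello": vector, ...} extracted from the audio stream.
    dialect_library: {"Cantonese": {"hello": vector, ...}, ...} reference vectors per dialect."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    best_dialect, best_score = None, 0.0
    for dialect, reference in dialect_library.items():
        shared = [w for w in word_features if w in reference]
        if not shared:
            continue
        score = float(np.mean([cosine(word_features[w], reference[w]) for w in shared]))
        if score > best_score:
            best_dialect, best_score = dialect, score
    # Report a dialect only when the match is convincing; otherwise treat the speech as Mandarin.
    return best_dialect if best_score >= threshold else None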

S3, converting the translated Mandarin into TTS speech and text, and adding the text to the video stream as subtitles.

As a preferred embodiment of the present invention, the pre-processing of the audio stream in step S1 specifically includes the following steps:

Performing voice activity detection, speech enhancement and speech separation operations on the audio stream. Voice activity detection filters out noise-only segments and keeps only the speech segments (which may still contain noise); speech enhancement suppresses the remaining noise and enhances the speech; speech separation removes non-target speakers so that only the target speaker's voice data are kept. These preprocessing operations yield cleaner target-speaker speech, which is separated from the audio stream, and the time axis of the human voice in the audio stream is recorded. The time axis of the video stream is then dotted according to the human voice start point and end point on that time axis.
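As an illustration of how the human voice start and end points could be obtained for dotting, the following is a simplified energy-based voice activity detector; a production system would combine a trained VAD with the speech enhancement and separation steps described above:

import numpy as np

def find_voice_segments(signal: np.ndarray, sr: int, frame_ms: int = 30, energy_ratio: float = 2.0):
    """Return (start_seconds, end_seconds) pairs used to dot the time axis of the video stream."""
    frame_len = int(sr * frame_ms / 1000)
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, frame_len)]
    energies = np.array([float(np.mean(f.astype(float) ** 2)) for f in frames])
    threshold = energy_ratio * np.median(energies)  # crude estimate of the noise floor
    voiced = energies > threshold

    segments, start = [], None
    for i, is_voiced in enumerate(voiced):
        t = i * frame_ms / 1000.0
        if is_voiced and start is None:
            start = t                       # human voice start point
        elif not is_voiced and start is not None:
            segments.append((start, t))     # human voice end point
            start = None
    if start is not None:
        segments.append((start, len(signal) / sr))
    return segments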

As a preferred embodiment of the present invention, in step S3, text is added to the video stream as subtitles according to the dotting position on the time axis of the video stream.
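One way to attach the translated text at the dotting positions is to emit a standard SRT subtitle file keyed to the recorded start and end points. A minimal sketch, assuming the segments produced during preprocessing and one translated Mandarin line per segment:

def write_srt(segments, texts, srt_path: str) -> None:
    """segments: list of (start_s, end_s) dotting positions; texts: translated Mandarin lines."""
    def stamp(seconds: float) -> str:
        total_ms = int(round(seconds * 1000))
        h, rem = divmod(total_ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(srt_path, "w", encoding="utf-8") as f:
        for i, ((start, end), text) in enumerate(zip(segments, texts), start=1):
            f.write(f"{i}\n{stamp(start)} --> {stamp(end)}\n{text}\n\n")

# The SRT file can then be rendered onto the video stream, e.g.
#   ffmpeg -i interview_video.mp4 -vf subtitles=captions.srt output.mp4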

As a preferred embodiment of the present invention, the method further comprises the following steps:

S4, removing the audio stream and adding the TTS speech to the video stream according to the dotting positions on the time axis of the video stream. Removing the original audio stream also helps protect the privacy of the interviewee.
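A sketch of this step using the ffmpeg command-line tool (file names are hypothetical): the original audio stream is dropped and the synthesized TTS track, already aligned to the dotting positions, is muxed in its place:

import subprocess

def replace_audio_with_tts(video_file: str, tts_audio: str, output_file: str) -> None:
    """Drop the original audio stream and mux the TTS speech track into the video stream."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_file,   # original recording
        "-i", tts_audio,    # synthesized Mandarin speech aligned to the dotting positions
        "-map", "0:v:0",    # keep only the video stream of the first input
        "-map", "1:a:0",    # take the audio stream from the TTS file
        "-c:v", "copy",     # do not re-encode the video
        output_file,
    ], check=True)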

In step S2, the dialect speech feature library is sorted according to usage frequency; after the speech features of the preprocessed audio stream are generated, they are compared with the dialect speech feature library in that order, and the audio stream is translated to obtain Mandarin.

In step S2, the audio stream may be translated by a one-to-one comparison of the speech feature vectors, or by training a dialect speech recognition model and inputting the audio stream into the model.

As a preferred embodiment of the invention, the comparison range can be further narrowed by acquiring the geographical position of the interviewee.
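As an illustration of such narrowing (the region-to-dialect mapping below is purely hypothetical, and a real system would obtain the region from a positioning or geocoding service):

# Hypothetical mapping from administrative region to the dialects spoken there.
REGION_DIALECTS = {
    "Guangdong": ["Cantonese", "Hakka", "Teochew"],
    "Sichuan": ["Sichuanese"],
    "Shanghai": ["Shanghainese"],
}

def candidate_dialects(region: str, dialect_library: dict) -> dict:
    """Restrict the dialect speech feature library to dialects plausible for the given region."""
    allowed = set(REGION_DIALECTS.get(region, dialect_library.keys()))
    return {name: features for name, features in dialect_library.items() if name in allowed}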

As shown in FIG. 2, the present invention further provides an AI translation system for automatically converting dialects into subtitles, comprising

The audio and video processing unit is used for acquiring a video stream and an audio stream and preprocessing the audio stream. The video stream and the audio stream can be an independent video stream captured by a camera and an audio stream captured by a microphone, with the camera and the microphone working simultaneously so that the time axes of the two streams remain aligned. The video stream and the audio stream may also be separated from a video file, whose format may be rm, rmvb, mpeg1-4, mov, mtv, dat, wmv, avi, 3gp, amv, dmv, flv, etc.

The audio translation unit is used for generating speech features for the preprocessed audio stream: speech feature vectors are extracted through the Fourier transform (discrete Fourier transform) of the audio stream and time-frequency operations such as filter banks, windowing and smoothing, and cepstral analysis; the features are then compared with a dialect speech feature library, and the audio stream is translated to obtain Mandarin.

In a specific operation process, which dialect is being spoken can be judged by obtaining the speech feature vectors of common words, such as "you", "me", "everyone" and "hello", and comparing them with the speech feature vectors of the corresponding words in the dialect speech feature library.

The dialect speech feature library stores speech segments of different dialects and their corresponding speech feature vectors.

And the conversion unit is used for converting the translated Mandarin into TTS speech and text and sending the text to the audio and video processing unit, which adds the text to the video stream as subtitles.
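Read together, the three units could be organized as in the following structural sketch; the class and method names are illustrative only, and the placeholder bodies would delegate to routines like those sketched in the method embodiment above:

class AudioVideoProcessingUnit:
    """Acquires the streams, preprocesses the audio, and writes subtitles or TTS audio back."""
    def acquire_streams(self, source):
        ...  # e.g. demux a recording into a video stream and an audio stream

    def preprocess(self, audio):
        ...  # VAD, speech enhancement, speech separation; returns clean speech and dotting segments

    def add_subtitles(self, video, texts, segments):
        ...  # render the translated text at the dotted positions


class AudioTranslationUnit:
    """Extracts speech features, matches them against the dialect library, translates to Mandarin."""
    def translate(self, audio, dialect_library):
        ...


class ConversionUnit:
    """Converts the translated Mandarin into TTS speech and caption text."""
    def convert(self, mandarin_text):
        ...  # returns (tts_audio, caption_texts)


def run_pipeline(source, dialect_library):
    av, translator, converter = AudioVideoProcessingUnit(), AudioTranslationUnit(), ConversionUnit()
    video, audio = av.acquire_streams(source)                       # S1
    clean_audio, segments = av.preprocess(audio)                    # S1 (preprocessing and dotting)
    mandarin = translator.translate(clean_audio, dialect_library)   # S2
    tts_audio, texts = converter.convert(mandarin)                  # S3
    return av.add_subtitles(video, texts, segments)                 # S3/S4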

As a preferred embodiment of the present invention, the audio/video processing unit preprocesses the audio stream by performing voice activity detection, speech enhancement and speech separation operations on it. Voice activity detection filters out noise-only segments and keeps only the speech segments (which may still contain noise); speech enhancement suppresses the remaining noise and enhances the speech; speech separation removes non-target speakers so that only the target speaker's voice data are kept. These preprocessing operations yield cleaner target-speaker speech, which is separated from the audio stream, and the time axis of the human voice in the audio stream is recorded. The time axis of the video stream is then dotted according to the human voice start point and end point on that time axis.

As a preferred embodiment of the present invention, the audio/video processing unit adds the text to the video stream as subtitles according to the dotting positions on the time axis of the video stream. The start and end positions of each subtitle correspond to the start and end positions of the human voice.

As a preferred embodiment of the present invention, the audio/video processing unit is further configured to remove the audio stream and add the TTS speech to the video stream according to the dotting positions on the time axis of the video stream. Removing the original audio stream also helps protect the privacy of the interviewee.

As a preferred embodiment of the present invention, the system further includes a positioning unit, which further narrows the comparison range by acquiring the geographical location of the interviewee.

The invention has the following beneficial effects: a translation system is designed for regional dialects and mainland Mandarin, so that regional dialects can be translated into Mandarin and converted into TTS speech and subtitles, which makes proofreading by reviewers easier.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the system and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
