Audio data labeling method and device

Document No.: 685255    Publication date: 2021-04-30

Reading note: this technology, "Audio data labeling method and device" (一种音频数据标注方法及装置), was devised by 陈孝良, 冯大航, 许孝先, and 常乐 on 2020-12-29. Abstract: The invention provides an audio data labeling method and device. The method includes: extracting audio data from a video file, where the video file includes subtitles; and labeling the audio data according to the subtitles of the video file. The audio data labeling method provided by the invention improves both the efficiency of audio data collection and the efficiency of audio data labeling.

1. An audio data labeling method, comprising:

extracting audio data from a video file, wherein the video file comprises subtitles; and

labeling the audio data according to the subtitles of the video file.

2. The method of claim 1, wherein said labeling the audio data according to the subtitles of the video file comprises:

in a case where the video file comprises a text subtitle file, segmenting the audio data according to the start time and end time of each subtitle segment in the text subtitle file, to obtain an audio segment corresponding to each subtitle segment; and

labeling the audio segment corresponding to each subtitle segment according to the subtitle segment.

3. The method of claim 1, wherein said labeling the audio data according to the subtitles of the video file comprises:

in a case where the subtitles of the video file are synthesized into image data of the video file, recognizing the subtitle in each frame of image of the image data;

determining a start time and an end time of each recognized subtitle, wherein the start time of a subtitle is the time corresponding to the first frame, and the end time of the subtitle is the time corresponding to the last frame, of the consecutive frames in the image data that contain the subtitle;

segmenting the audio data according to the start time and end time corresponding to each subtitle, to obtain an audio segment corresponding to each subtitle; and

labeling the audio segment corresponding to each subtitle according to the subtitle.

4. The method of claim 3, wherein said recognizing the subtitle in each frame of image of the image data comprises:

determining a subtitle region of each frame of image in the image data; and

recognizing characters in the subtitle region of each frame of image to obtain the subtitle of that frame of image.

5. The method of claim 4, wherein said determining a subtitle region of each frame of image in the image data comprises:

determining an image region at a target position in each frame of image of the image data as the subtitle region of that frame of image;

or

locating text regions of each frame of image in the image data; and

determining a target text region among the text regions of each frame of image as the subtitle region of that frame of image, wherein a pixel value of the target text region meets a first preset condition, or a format of characters in the target text region meets a second preset condition.

6. An audio data labeling apparatus, comprising:

an extraction module configured to extract audio data from a video file, wherein the video file comprises subtitles; and

a labeling module configured to label the audio data according to the subtitles of the video file.

7. The apparatus of claim 6, wherein the labeling module comprises:

a first segmentation unit configured to, in a case where the video file comprises a text subtitle file, segment the audio data according to the start time and end time of each subtitle segment in the text subtitle file, to obtain an audio segment corresponding to each subtitle segment; and

a first labeling unit configured to label the audio segment corresponding to each subtitle segment according to the subtitle segment.

8. The apparatus of claim 6, wherein the labeling module comprises:

an identifying unit configured to, in a case where the subtitles of the video file are synthesized into image data of the video file, recognize the subtitle in each frame of image of the image data;

a determining unit configured to determine a start time and an end time of each recognized subtitle, wherein the start time of a subtitle is the time corresponding to the first frame, and the end time of the subtitle is the time corresponding to the last frame, of the consecutive frames in the image data that contain the subtitle;

a second segmentation unit configured to segment the audio data according to the start time and end time corresponding to each subtitle, to obtain an audio segment corresponding to each subtitle; and

a second labeling unit configured to label the audio segment corresponding to each subtitle according to the subtitle.

9. The apparatus of claim 8, wherein the identifying unit comprises:

a determining subunit configured to determine a subtitle region of each frame of image in the image data; and

a recognizing subunit configured to recognize characters in the subtitle region of each frame of image to obtain the subtitle of that frame of image.

10. The apparatus of claim 9, wherein the determining subunit is specifically configured to:

determine an image region at a target position in each frame of image of the image data as the subtitle region of that frame of image;

or

locate text regions of each frame of image in the image data; and

determine a target text region among the text regions of each frame of image as the subtitle region of that frame of image, wherein a pixel value of the target text region meets a first preset condition, or a format of characters in the target text region meets a second preset condition.

11. An audio data labeling apparatus, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the audio data labeling method according to any one of claims 1 to 5.

12. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the steps of the audio data labeling method according to any one of claims 1 to 5.

Technical Field

The present invention relates to the field of audio technologies, and in particular to an audio data labeling method and device.

Background

An important factor in speech recognition is the audio data used for training: different recording devices, different scenes, regional accents, and the like all have a great influence on the recognition result. Collecting audio data is a labor-intensive task; at present it is usually done by manually recording audio or by gathering existing audio from the network. Labeling the collected audio data is also time-consuming: for a segment of audio, an annotator often has to listen repeatedly to transcribe the corresponding text, and even then the accuracy of the labels is not very high. The prior art therefore suffers from low efficiency in audio data labeling.

Disclosure of Invention

Embodiments of the present invention provide an audio data labeling method and device, so as to solve the problem of low audio data labeling efficiency in the prior art.

To solve the above technical problem, the present invention is implemented as follows:

In a first aspect, an embodiment of the present invention provides an audio data labeling method. The method comprises the following steps:

extracting audio data from a video file, wherein the video file comprises subtitles; and

labeling the audio data according to the subtitles of the video file.

In a second aspect, an embodiment of the present invention further provides an audio data labeling apparatus. The audio data labeling apparatus comprises:

an extraction module configured to extract audio data from a video file, wherein the video file comprises subtitles; and

a labeling module configured to label the audio data according to the subtitles of the video file.

In a third aspect, an embodiment of the present invention further provides an audio data labeling apparatus, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the audio data labeling method described above.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the audio data labeling method described above.

In the embodiments of the present invention, audio data is extracted from a video file that includes subtitles, and the audio data is labeled according to the subtitles of the video file. Extracting audio data from video files for training improves the efficiency of audio data collection, and labeling the extracted audio data based on the subtitles of the video file improves the efficiency of audio data labeling.

Drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is a flowchart of an audio data labeling method according to an embodiment of the present invention;

FIG. 2 is a flowchart of an audio data labeling method according to another embodiment of the present invention;

FIG. 3 is a block diagram of an audio data labeling device according to an embodiment of the present invention; and

FIG. 4 is a block diagram of an audio data labeling device according to another embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

An embodiment of the present invention provides an audio data labeling method. Referring to FIG. 1, FIG. 1 is a flowchart of an audio data labeling method according to an embodiment of the present invention. As shown in FIG. 1, the method comprises the following steps:

Step 101: extract audio data from a video file, wherein the video file includes subtitles.

In this embodiment, there may be one or more video files that include subtitles. "Including subtitles" may mean that the video file comes with a text subtitle file, or that subtitles are synthesized into the image frames of the video file. The text subtitle file may contain each subtitle of the video file together with its time, and the time of each subtitle may include a start time and an end time. For example, if the start time of subtitle A is the 1st second and its end time is the 3rd second, subtitle A is displayed from the 1st to the 3rd second of the video file.

Extracting the audio data from the video file may mean converting the video file into audio data, for example with software such as FFmpeg (Fast Forward MPEG).
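As an illustration only, this extraction step might be scripted in Python as follows; the file names, the 16 kHz mono PCM output format, and the use of the FFmpeg command-line tool via subprocess are assumptions for the sketch, not part of the claimed method:

```python
import subprocess

def extract_audio(video_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Extract the audio track of a video file as mono 16-bit PCM WAV."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,         # input video file
            "-vn",                    # drop the video stream
            "-acodec", "pcm_s16le",   # 16-bit PCM, a common ASR training format
            "-ar", str(sample_rate),  # resample, e.g. to 16 kHz
            "-ac", "1",               # downmix to mono
            wav_path,
        ],
        check=True,  # raise if ffmpeg fails
    )

# hypothetical file names, for illustration
extract_audio("movie.mp4", "movie.wav")
```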

Step 102: label the audio data according to the subtitles of the video file.

In this embodiment, the audio data may be labeled based on the subtitles of the video file. For example, the subtitles may be used directly as labels of the corresponding audio data; alternatively, the subtitles may serve as reference labels of the corresponding audio data, the reference labels may then be corrected, and the corrected labels may be used as the final labels of the audio data.

The audio data labeling method provided by this embodiment of the present invention extracts audio data from a video file that includes subtitles and labels the audio data according to the subtitles of the video file. Extracting audio data from video files for training improves the efficiency of audio data collection, and labeling the extracted audio data based on the subtitles improves the efficiency of audio data labeling.

Optionally, the labeling the audio data according to the subtitles of the video file may include:

in a case where the video file includes a text subtitle file, segmenting the audio data according to the start time and end time of each subtitle segment in the text subtitle file, to obtain an audio segment corresponding to each subtitle segment; and

labeling the audio segment corresponding to each subtitle segment according to the subtitle segment.

In this embodiment, the text subtitle file may include each subtitle of the video file together with its start time and end time, so the audio data can be segmented directly according to those times to obtain an audio segment for each subtitle, and each audio segment can then be labeled based on its subtitle, for example by using the subtitle directly as the label of the audio segment, or as its reference label.

For example, suppose the text subtitle file includes subtitles A1 and A2, where subtitle A1 starts at the 1st second and ends at the 3rd second, and subtitle A2 starts at the 4th second and ends at the 8th second. Segmenting the audio data based on the start and end times of subtitle A1 yields audio segment a1, covering the 1st to the 3rd second; segmenting based on subtitle A2 yields audio segment a2, covering the 4th to the 8th second. Subtitle A1 can then be used as the label of audio segment a1, and subtitle A2 as the label of audio segment a2.
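A minimal sketch of this embodiment, assuming the text subtitle file is in SRT format and slicing the audio with the pydub library (both are assumptions; the patent names neither a subtitle format nor an audio library):

```python
import re
from pydub import AudioSegment  # pip install pydub (uses ffmpeg internally)

TIME_RE = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def to_ms(stamp: str) -> int:
    """Convert an SRT timestamp such as 00:00:01,000 to milliseconds."""
    h, m, s, ms = map(int, TIME_RE.match(stamp).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def label_audio_from_srt(wav_path: str, srt_path: str):
    """Cut one audio segment per subtitle and pair it with the subtitle text."""
    audio = AudioSegment.from_wav(wav_path)
    labeled = []
    with open(srt_path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")  # one SRT block per subtitle
    for block in blocks:
        lines = block.splitlines()
        start, end = (to_ms(t) for t in lines[1].split(" --> "))
        text = " ".join(lines[2:])
        labeled.append((audio[start:end], text))  # (audio segment, label)
    return labeled
```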

In this embodiment of the present invention, when the video file includes a text subtitle file, the audio data is segmented according to the start and end times of each subtitle segment in that file, and each resulting audio segment is labeled according to its subtitle segment. This improves both the efficiency and the accuracy of audio data labeling. In addition, segmenting the audio data by the start and end times of each subtitle removes silent portions of the audio, improving the quality of the audio data used for speech recognition training.

Optionally, the labeling the audio data according to the subtitles of the video file includes:

in a case where the subtitles of the video file are synthesized into image data of the video file, recognizing the subtitle in each frame of image of the image data;

determining a start time and an end time of each recognized subtitle, wherein the start time of a subtitle is the time corresponding to the first frame, and the end time of the subtitle is the time corresponding to the last frame, of the consecutive frames in the image data that contain the subtitle;

segmenting the audio data according to the start time and end time corresponding to each subtitle, to obtain an audio segment corresponding to each subtitle; and

labeling the audio segment corresponding to each subtitle according to the subtitle.

In this embodiment, when subtitles are synthesized into the image data, the subtitle in each frame of image can be recognized, for example by an OCR (Optical Character Recognition) system.

Optionally, to further improve the accuracy of subtitle recognition, the OCR system may be obtained by training a neural network model on multiple images containing subtitles together with a label for each image. The neural network model may include, but is not limited to, CRNN (Convolutional Recurrent Neural Network), CTC-CNN (a Convolutional Neural Network trained with Connectionist Temporal Classification loss), and the like.

Optionally, a subtitle region in each frame of image may be determined first, and the characters in that region may then be recognized as the subtitle of the frame; alternatively, the characters in each frame may be recognized directly, and those whose format matches a preset text format may be taken as the subtitle of the frame.

The text format may include at least one of a font, a color, a font size, and the like, and the preset text format may include at least one of a preset font (e.g., regular script), a preset text color (e.g., white or black), and a preset font size (e.g., the Chinese No. 4 or small No. 4 size). Note that the preset text format may be set reasonably according to actual requirements, which is not limited in this embodiment.

Once the subtitle of each frame has been recognized, this embodiment can label the audio data accordingly. Specifically, the start and end time of each recognized subtitle can be determined from the per-frame results: if N consecutive frames in the image data all yield the same first subtitle (N being a positive integer), the time of the first of those N frames can be taken as the start time of that subtitle, and the time of the last of those N frames as its end time. The audio data can then be segmented according to the start and end times of each recognized subtitle to obtain the corresponding audio segments, and each segment can be labeled based on its subtitle, for example by using the subtitle directly as the label or as a reference label.
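The grouping of consecutive identical per-frame OCR results into timed subtitle segments could look like the following sketch, where `frame_texts` and the frame-rate-based timing are illustrative assumptions:

```python
def subtitle_segments(frame_texts, fps):
    """Group runs of consecutive frames with the same recognized subtitle into
    (text, start_seconds, end_seconds) tuples.

    frame_texts: one OCR result per frame, '' where no subtitle was found.
    The end time is taken at the frame boundary after the last matching frame,
    which is convenient for slicing the audio.
    """
    segments = []
    current, start = "", 0
    for i, text in enumerate(frame_texts):
        if text != current:
            if current:  # a subtitle run just ended
                segments.append((current, start / fps, i / fps))
            current, start = text, i
    if current:  # flush a run that lasts until the final frame
        segments.append((current, start / fps, len(frame_texts) / fps))
    return segments

# e.g. at 25 fps, frames 25..74 all reading "hello" yield ("hello", 1.0, 3.0)
```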

In this embodiment of the present invention, when subtitles are synthesized into the image data, the subtitle in each frame is recognized and the audio data is labeled according to the recognized subtitles, which improves both the efficiency and the accuracy of audio data labeling. In addition, segmenting the audio data by the start and end times of each subtitle removes silent portions of the audio, improving the quality of the audio data used for speech recognition training.

Optionally, the recognizing the subtitle in each frame of image of the image data includes:

determining a subtitle region of each frame of image in the image data; and

recognizing the characters in the subtitle region of each frame of image to obtain the subtitle of that frame.

In this embodiment, determining the subtitle region of each frame means determining the image region where the subtitle is located in each frame. Optionally, an image region selected by the user may be used as the subtitle region of every frame: for example, before the subtitles in the image data are recognized, at least one frame containing a synthesized subtitle may be displayed, a box-selection input from the user on that frame may be received, and the image region at the selected position in each frame may be taken as that frame's subtitle region. Alternatively, a text region of each frame that satisfies a preset condition may be used as its subtitle region; the preset condition may be set reasonably according to the actual situation, for example that the pixel values of the text region fall within a preset range, or that the font of the characters in the text region is a preset font.

In this embodiment of the present invention, by determining the subtitle region of each frame of the image data, the characters in that region can be recognized directly, for example by an OCR system, and used as the subtitle of the frame. This filters out interfering characters elsewhere in the image and thus improves the accuracy of subtitle recognition.

Optionally, the determining a subtitle region of each frame of image in the image data includes:

determining an image region at a target position in each frame of image of the image data as the subtitle region of that frame of image;

or

locating text regions of each frame of image in the image data; and

determining a target text region among the text regions of each frame of image as the subtitle region of that frame of image, wherein a pixel value of the target text region meets a first preset condition, or a format of characters in the target text region meets a second preset condition.

In one embodiment, the image region at a target position in each frame of the image data may be determined as that frame's subtitle region. The target position may be an image position selected by the user before the subtitles are recognized; for example, at least one frame containing a synthesized subtitle may be displayed, a box-selection input from the user on that frame may be received, and the selected image position may be used as the target position. Alternatively, the target position may be a preset image position.

In practice, the position of a video's subtitles is often fixed, for example in the lower, upper, left, or right area of the picture, and the subtitle position is usually the same in every frame of the same video. Therefore, this embodiment may first determine a target position and then take the image region at that position in each frame of the image data as the frame's subtitle region.
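For illustration, cropping a fixed target position from every frame might be done with OpenCV as below; the assumption that subtitles occupy the bottom 20% of the frame is an example, not a limitation of the method:

```python
import cv2  # pip install opencv-python

def crop_subtitle_region(frame, bottom_fraction=0.2):
    """Return the image region at the target position -- here assumed to be
    the bottom 20% of the frame, where subtitles are commonly rendered."""
    h = frame.shape[0]
    return frame[int(h * (1 - bottom_fraction)):, :]

cap = cv2.VideoCapture("movie.mp4")  # hypothetical input file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    region = crop_subtitle_region(frame)
    # ... pass `region` to the OCR system for subtitle recognition ...
cap.release()
```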

By determining the image region at the target position in each frame as the subtitle region, this embodiment improves the efficiency of recognizing subtitles from images while preserving recognition accuracy.

In another embodiment, at least one text region of each frame in the image data may be located, for example by an OCR system, and a target text region among them may be determined as the frame's subtitle region. The target text region may be a text region whose pixel values satisfy a first preset condition or whose text format satisfies a second preset condition.

The pixel values of a text region satisfying the first preset condition may mean that the pixel value of every pixel in the region falls within a preset range, or that the average pixel value of the region falls within that range. The preset pixel value range may be set reasonably according to the actual situation, for example [200, 255].
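A sketch of one possible form of the first preset condition, here checking the mean pixel value of a candidate text region against the example range [200, 255] (the exact form of the condition is left open by the patent):

```python
import numpy as np

def meets_first_preset_condition(region, lo=200, hi=255):
    """One assumed form of the first preset condition: the average pixel
    value of the candidate text region lies in a preset range, here the
    example range [200, 255] (i.e. near-white subtitle text)."""
    return lo <= float(np.mean(region)) <= hi
```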

The format of the characters in a text region satisfying the second preset condition may include at least one of: the characters are in a preset font, in a preset color, or in a preset font size. The preset font, color, and font size may all be set reasonably according to the actual situation; for example, the preset font may be regular script, the preset color may be white or black, and the preset font size may be the Chinese No. 4 or small No. 4 size.

In this embodiment, by locating at least one text region of each frame and determining the target text region among them as the subtitle region, the flexibility of recognizing subtitles from images is improved while recognition accuracy is preserved.

Optionally, labeling the audio segment corresponding to each subtitle may include filtering the subtitle and labeling the corresponding audio segment with the filtered result. The filtering may remove annotation words from the subtitle, for example words enclosed in parentheses, so as to improve the accuracy of the audio labels.
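One possible implementation of this filtering step, assuming annotation words are marked by Western or Chinese brackets (the bracket styles are an assumption):

```python
import re

def filter_subtitle(text: str) -> str:
    """Strip bracketed annotation words (e.g. sound-effect notes) from a
    subtitle before using it as a label; the bracket styles covered here
    are assumptions."""
    return re.sub(r"[(（\[【].*?[)）\]】]", "", text).strip()

print(filter_subtitle("(door slams) Hello there"))  # -> "Hello there"
```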

Optionally, labeling the audio segment corresponding to each subtitle includes: inputting a first subtitle and the audio segment corresponding to the first subtitle into a speech recognition engine to obtain a speech recognition result, where the first subtitle serves as the reference answer for its audio segment and may be any subtitle; comparing the speech recognition result with the first subtitle; and, when the similarity between the speech recognition result and the first subtitle is greater than a preset value, using the speech recognition result as the label of the audio segment corresponding to the first subtitle.

In this embodiment, the first subtitle may be any recognized subtitle, or any subtitle in the text subtitle file. The preset value may be set reasonably according to the actual situation, for example 90% or 95%.

In this embodiment, each subtitle and its corresponding audio segment are input into the speech recognition engine together, so the engine can perform speech recognition on each audio segment with the help of its subtitle, which improves the accuracy of the recognition result. In addition, using the recognition result as the label only when its similarity to the corresponding subtitle exceeds the preset value improves the accuracy of the audio data labels.
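A sketch of the comparison step, using difflib's SequenceMatcher ratio as a stand-in for the unspecified similarity metric and 90% as the example preset value:

```python
from difflib import SequenceMatcher

def accept_asr_label(asr_text: str, subtitle: str, threshold: float = 0.9):
    """Compare the speech recognition result with the first subtitle and keep
    the recognition result as the final label only when the similarity exceeds
    the preset value. SequenceMatcher's ratio is a stand-in for whatever
    similarity measure an implementation actually uses."""
    similarity = SequenceMatcher(None, asr_text, subtitle).ratio()
    return asr_text if similarity > threshold else None
```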

Referring to FIG. 2, FIG. 2 is a flowchart of an audio data labeling method according to another embodiment of the present invention. As shown in FIG. 2, the method may include the following steps:

Step 201: collect a video file with subtitles, and extract the audio data from the video file.

For example, the audio data can be obtained by converting the video file with software such as FFmpeg, as in the sketch given earlier.

Step 202: determine whether the video file includes a text subtitle file.

If the video file includes a text subtitle file, perform step 203; otherwise, perform step 204.

Step 203: segment the audio data according to the times of the subtitles in the text subtitle file, and use the subtitles as reference labels.

In this step, the audio data may be segmented according to the start and end times of each subtitle segment in the text subtitle file to obtain the corresponding audio segments, and each subtitle segment may be used as the reference label of its audio segment.

In practice, subtitles are usually edited, and some spoken fillers are removed in the process; filler words such as "uh" or "um", for example, usually do not appear in subtitles. In such cases the subtitle cannot be used directly as the label. This embodiment therefore uses the subtitle as a reference label, so that an annotator can label each audio segment with the help of its reference label, which improves labeling efficiency.

Step 204: select the subtitle region.

In practice, the subtitle position is often fixed for a given video file or series of video files, so the position can be selected in advance, and for each frame of the video file the image region at that position is cropped out as the subtitle region.

Step 205: recognize the subtitles with the OCR system.

In this step, the subtitle region of each frame of the video file may be input to the OCR system in sequence for recognition, yielding the subtitle of each frame.

Step 206: segment the audio data according to the OCR recognition results, and use the recognition results as reference labels.

In this step, when the OCR recognition results are identical over a period of time, that recognition result can be used as the reference label of the audio segment corresponding to that period.

Therefore, the audio data labeling method provided by this embodiment of the present invention improves both the efficiency and the accuracy of audio data labeling.

Referring to FIG. 3, FIG. 3 is a structural diagram of an audio data labeling device according to an embodiment of the present invention. As shown in FIG. 3, the audio data labeling apparatus 300 includes:

an extraction module 301 configured to extract audio data from a video file, wherein the video file comprises subtitles; and

a labeling module 302 configured to label the audio data according to the subtitles of the video file.

Optionally, the labeling module includes:

a first segmentation unit configured to, in a case where the video file comprises a text subtitle file, segment the audio data according to the start time and end time of each subtitle segment in the text subtitle file, to obtain an audio segment corresponding to each subtitle segment; and

a first labeling unit configured to label the audio segment corresponding to each subtitle segment according to the subtitle segment.

Optionally, the labeling module includes:

an identifying unit configured to, in a case where the subtitles of the video file are synthesized into image data of the video file, recognize the subtitle in each frame of image of the image data;

a determining unit configured to determine a start time and an end time of each recognized subtitle, wherein the start time of a subtitle is the time corresponding to the first frame, and the end time of the subtitle is the time corresponding to the last frame, of the consecutive frames in the image data that contain the subtitle;

a second segmentation unit configured to segment the audio data according to the start time and end time corresponding to each subtitle, to obtain an audio segment corresponding to each subtitle; and

a second labeling unit configured to label the audio segment corresponding to each subtitle according to the subtitle.

Optionally, the identifying unit includes:

a determining subunit configured to determine a subtitle region of each frame of image in the image data; and

a recognizing subunit configured to recognize the characters in the subtitle region of each frame of image to obtain the subtitle of that frame.

Optionally, the determining subunit is specifically configured to:

determine an image region at a target position in each frame of image of the image data as the subtitle region of that frame of image;

or

locate text regions of each frame of image in the image data; and

determine a target text region among the text regions of each frame of image as the subtitle region of that frame of image, wherein a pixel value of the target text region meets a first preset condition, or a format of characters in the target text region meets a second preset condition.

The audio data labeling apparatus 300 according to this embodiment of the present invention can implement each process of the above method embodiments; details are not repeated here to avoid repetition.

The audio data labeling device 300 of this embodiment of the present invention includes an extraction module 301 configured to extract audio data from a video file, wherein the video file comprises subtitles, and a labeling module 302 configured to label the audio data according to the subtitles of the video file. Extracting audio data from video files for training improves the efficiency of audio data collection, and labeling the extracted audio data based on the subtitles of the video file improves the efficiency of audio data labeling.

Referring to FIG. 4, FIG. 4 is a block diagram of an audio data labeling device according to another embodiment of the present invention. As shown in FIG. 4, the audio data labeling apparatus 400 includes a processor 401, a memory 402, and a computer program stored on the memory 402 and executable on the processor; the components of the audio data labeling apparatus 400 are coupled together through a bus interface 403. When executed by the processor 401, the computer program implements the following steps:

extracting audio data from a video file, wherein the video file comprises subtitles; and

labeling the audio data according to the subtitles of the video file.

Optionally, the computer program, when executed by the processor 401, further implements the following steps:

in a case where the video file comprises a text subtitle file, segmenting the audio data according to the start time and end time of each subtitle segment in the text subtitle file, to obtain an audio segment corresponding to each subtitle segment; and

labeling the audio segment corresponding to each subtitle segment according to the subtitle segment.

Optionally, the computer program, when executed by the processor 401, further implements the following steps:

in a case where the subtitles of the video file are synthesized into image data of the video file, recognizing the subtitle in each frame of image of the image data;

determining a start time and an end time of each recognized subtitle, wherein the start time of a subtitle is the time corresponding to the first frame, and the end time of the subtitle is the time corresponding to the last frame, of the consecutive frames in the image data that contain the subtitle;

segmenting the audio data according to the start time and end time corresponding to each subtitle, to obtain an audio segment corresponding to each subtitle; and

labeling the audio segment corresponding to each subtitle according to the subtitle.

Optionally, the computer program, when executed by the processor 401, further implements the following steps:

determining a subtitle region of each frame of image in the image data; and

recognizing characters in the subtitle region of each frame of image to obtain the subtitle of that frame of image.

Optionally, the computer program, when executed by the processor 401, further implements the following steps:

determining an image region at a target position in each frame of image of the image data as the subtitle region of that frame of image;

or

locating text regions of each frame of image in the image data; and

determining a target text region among the text regions of each frame of image as the subtitle region of that frame of image, wherein a pixel value of the target text region meets a first preset condition, or a format of characters in the target text region meets a second preset condition.

An embodiment of the present invention further provides an audio data labeling apparatus, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor. When executed by the processor, the computer program implements each process of the above audio data labeling method embodiments and achieves the same technical effects; details are not repeated here to avoid repetition.

An embodiment of the present invention further provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements each process of the above audio data labeling method embodiments and achieves the same technical effects; details are not repeated here to avoid repetition. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

It should be noted that, in this document, the terms "comprise", "comprising", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a ..." does not preclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Through the above description of the embodiments, a person skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and including instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.

While the present invention has been described with reference to the embodiments shown in the drawings, the invention is not limited to those embodiments, which are illustrative rather than restrictive. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the invention as defined in the appended claims.
