Method, apparatus, computer device and storage medium for processing video data

Document No.: 1315139  Publication date: 2020-07-10

Note: This technique, "Method, apparatus, computer device and storage medium for processing video data", was created by Wang Chenxu (王晨旭) on 2020-03-25. The present application relates to a method, apparatus, computer device and storage medium for processing video data. The method comprises: performing text detection on a current frame of a current video to obtain the text regions of the current frame; acquiring the speech-line region of the current video, the speech-line region being determined from the text regions of a plurality of video sequence frames of the current video; acquiring region information for the overlapping region of each text region and the speech-line region; calculating the ratios of the height of the overlapping region to the heights of the corresponding text region and speech-line region, yielding a first ratio and a second ratio; and when either ratio exceeds a preset ratio, taking that text region as the speech-line region. Because speech-line regions differ between videos, locating the speech-line region by the height ratios between a frame's text regions and a speech-line region adaptively determined from multiple frames of the same video improves positioning accuracy.

1. A method of processing video data, the method comprising:

acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text area of the current frame;

acquiring a speech-line region of the current video, wherein the speech-line region of the current video is determined according to text regions of a plurality of video sequence frames of the current video;

acquiring region information of an overlapping region of each text region and the speech-line region, wherein the region information comprises a height;

respectively calculating the ratio of the height of each overlapping region to the height of the corresponding text region and to the height of the corresponding speech-line region to obtain a first ratio and a second ratio;

and when either of the first ratio and the second ratio is greater than a first preset threshold, taking the region information of the text region as the region information of the current speech-line region.

2. The method of claim 1, wherein the obtaining the current frame in the current video is preceded by:

acquiring a plurality of original video sequence frames corresponding to the current video, and performing text detection on each original video sequence frame to obtain a corresponding video sequence frame, wherein the pixel values of a text region and a non-text region of each video sequence frame are respectively 1 and 0;

accumulating the video sequence frames to obtain a heat map;

binarizing the heat map to obtain a binary image, wherein the binary image comprises at least one closed area, the pixel value of the closed area is 1, and the closed area is a cross area or a non-cross area;

when the closed regions contain a cross region, performing region division on the cross region according to a preset division rule to obtain non-crossing regions, repeating until every closed region is a non-crossing region;

and carrying out region screening on the non-crossed region according to a preset speech area constraint rule to obtain the speech area.

3. The method according to claim 2, wherein the region screening of the non-intersecting region according to a preset speech-line region constraint rule to obtain the speech-line region comprises:

judging whether the height of each non-intersecting region is smaller than a preset height threshold, and setting the region identifier of a non-intersecting region to a first identifier when its height is smaller than the preset height threshold;

judging whether the width of each non-intersecting region is smaller than a first preset width threshold or larger than a second preset width threshold, wherein the second preset width threshold is larger than the first preset width threshold, and setting the region identifier of a non-intersecting region to the first identifier when its width is smaller than the first preset width threshold or larger than the second preset width threshold; judging whether the starting pixel of each non-intersecting region is located to the right of a preset position, and setting the region identifier of a non-intersecting region to the first identifier when its starting pixel is located to the right of the preset position;

when the region identifiers of the remaining non-intersecting regions in the binary image are set to a second identifier, extracting the region features of the non-intersecting regions corresponding to the second identifier, and determining the format type of each such region according to its region features, wherein the format type comprises left-aligned and center-symmetric;

and when the number of non-intersecting regions corresponding to the first identifier is 0, screening out the speech-line region according to the format types of the non-intersecting regions corresponding to the second identifier.

4. The method according to claim 3, wherein, when the number of non-intersecting regions corresponding to the first identifier is 0, the screening out of the speech-line region according to the format types of the non-intersecting regions corresponding to the second identifier comprises:

when the speech-line format of a non-intersecting region corresponding to the second identifier is either left-aligned or center-symmetric, taking that non-intersecting region as the speech-line region;

when the non-intersecting regions corresponding to the second identifier contain both regions that are left-aligned or center-symmetric and regions that are neither, setting the pixel values in the regions that are neither left-aligned nor center-symmetric to 0;

when the speech-line format of none of the non-intersecting regions corresponding to the second identifier is left-aligned or center-symmetric and the number of such regions is greater than 1, calculating the region variance of each non-intersecting region corresponding to the second identifier, and taking the non-intersecting region with the maximum region variance as the speech-line region.

5. The method according to claim 3, wherein the extracting region features of the non-intersecting region corresponding to the second identifier and determining the format type of the non-intersecting region according to the region features comprises:

equally dividing the non-crossed region corresponding to the second identifier into a plurality of sub-regions from left to right, and calculating the pixel mean value of each sub-region;

when the pixel mean value corresponding to the non-crossed region corresponding to the second identifier decreases progressively from left to right, setting the line format corresponding to the non-crossed region corresponding to the second identifier as the left alignment;

dividing the non-intersecting region corresponding to the second identifier down the middle to obtain a left region and a right region, equally dividing each of the left region and the right region into a plurality of sub-regions from left to right, and calculating the pixel mean value of each sub-region; and when the sub-region pixel mean values of the left region increase from left to right and those of the right region decrease from left to right, setting the speech-line format corresponding to the non-intersecting region corresponding to the second identifier to center symmetry.

6. The method of claim 3, further comprising:

and when the number of non-crossing regions corresponding to the first identifier is not 0, setting the pixel values in the non-crossing regions corresponding to the first identifier to 0 to obtain a second heat map, and repeating from the binarization step on the second heat map until the region information of the speech-line region of the second heat map is obtained.

7. The method of claim 2, further comprising:

dividing a cross area in the heat map into an upper area and a lower area to obtain a first area and a second area;

calculating a column mean of the first region and the second region;

finding the number of maxima in the column means of the first region and the second region;

when the number of maxima in the column means of both the first region and the second region is 1, setting the crossing type of the cross region to vertical crossing;

when the number of maxima in the column means of the first region or the second region is greater than 1, setting the crossing type of the cross region to left-right crossing;

the step of performing region division on the cross region according to a preset division rule to obtain a non-crossing region comprises: performing region division on the cross region according to the preset division rule corresponding to its crossing type to obtain the corresponding non-crossing regions.

8. The method according to claim 7, wherein the performing area division on the intersection region corresponding to the intersection type according to a preset division rule corresponding to the intersection type to obtain a corresponding non-intersection region comprises:

when the crossing type is vertical crossing, performing corner detection on the cross region to obtain corner information for each corner, classifying the corners according to the row coordinates in the corner information to obtain a category identifier for each corner, constructing a rectangular region from the corner points sharing the same category identifier, and taking each constructed rectangular region as a non-crossing region;

and when the crossing type is left-right crossing, calculating the column means of the cross region, and splitting the region into left and right parts at the column of the minimum value lying between the maxima of the column means to obtain the corresponding non-crossing regions.

9. An apparatus for processing video data, the apparatus comprising:

the text detection module is used for acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text region of the current frame;

the speech-line region acquisition module is used for acquiring a speech-line region of the current video, wherein the speech-line region of the current video is determined according to text regions of a plurality of video sequence frames of the current video;

the overlapping region acquisition module is used for acquiring region information of an overlapping region of each text region and the speech-line region, the region information comprising a height;

the ratio calculation module is used for respectively calculating the ratio of the height of each overlapping region to the height of the corresponding text region and to the height of the corresponding speech-line region to obtain a first ratio and a second ratio;

and the region screening module is used for taking the region information of the text region as the region information of the current speech-line region when either of the first ratio and the second ratio is greater than a first preset threshold.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 8 are implemented when the computer program is executed by the processor.

11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing video data, a computer device, and a storage medium.

Background

In application scenarios such as text search and text translation, the speech-line (subtitle) text regions in a video need to be extracted first for subsequent recognition and translation, for example by OCR. Current text detection algorithms directly detect text information in a preset video area and take the text found there as the target text. However, a video contains many text areas besides the one holding the target text, so direct detection may return regions containing non-speech-line text, reducing the accuracy of the detection result.

Disclosure of Invention

In order to solve the above technical problem, the present application provides a method, an apparatus, a computer device, and a storage medium for processing video data.

In a first aspect, the present application provides a method of processing video data, comprising:

acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text region of the current frame;

acquiring a speech-line region of the current video, wherein the speech-line region of the current video is determined according to text regions of a plurality of video sequence frames of the current video;

acquiring region information of an overlapping region of each text region and the speech-line region, wherein the region information comprises a height;

respectively calculating the ratio of the height of each overlapping region to the height of the corresponding text region and to the height of the corresponding speech-line region to obtain a first ratio and a second ratio;

and when either of the first ratio and the second ratio is greater than a first preset threshold, taking the region information of the text region as the region information of the current speech-line region.

In a second aspect, the present application provides an apparatus for processing video data, comprising:

the text detection module is used for acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text region of the current frame;

the speech-line region acquisition module is used for acquiring a speech-line region of the current video, wherein the speech-line region of the current video is determined according to text regions of a plurality of video sequence frames of the current video;

the overlapping region acquisition module is used for acquiring region information of an overlapping region of each text region and the speech-line region, the region information comprising a height;

the ratio calculation module is used for respectively calculating the ratio of the height of each overlapping region to the height of the corresponding text region and to the height of the corresponding speech-line region to obtain a first ratio and a second ratio;

and the region screening module is used for taking the region information of the text region as the region information of the current speech-line region when either of the first ratio and the second ratio is greater than a first preset threshold.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text region of the current frame;

acquiring a speech-line region of the current video, wherein the speech-line region of the current video is determined according to text regions of a plurality of video sequence frames of the current video;

acquiring region information of an overlapping region of each text region and the speech-line region, wherein the region information comprises a height;

respectively calculating the ratio of the height of each overlapping region to the height of the corresponding text region and to the height of the corresponding speech-line region to obtain a first ratio and a second ratio;

and when either of the first ratio and the second ratio is greater than a first preset threshold, taking the region information of the text region as the region information of the current speech-line region.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text region of the current frame;

acquiring a speech-line region of the current video, wherein the speech-line region of the current video is determined according to text regions of a plurality of video sequence frames of the current video;

acquiring region information of an overlapping region of each text region and the speech-line region, wherein the region information comprises a height;

respectively calculating the ratio of the height of each overlapping region to the height of the corresponding text region and to the height of the corresponding speech-line region to obtain a first ratio and a second ratio;

and when either of the first ratio and the second ratio is greater than a first preset threshold, taking the region information of the text region as the region information of the current speech-line region.

The method, apparatus, computer device, and storage medium for processing video data comprise: acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text region of the current frame; acquiring a speech-line region of the current video, wherein the speech-line region is determined according to text regions of a plurality of video sequence frames of the current video; acquiring region information of an overlapping region of each text region and the speech-line region, wherein the region information comprises a height; respectively calculating the ratio of the height of each overlapping region to the height of the corresponding text region and to the height of the corresponding speech-line region to obtain a first ratio and a second ratio; and when either ratio is greater than a first preset threshold, taking the region information of the text region as the region information of the current speech-line region. Because the speech-line regions of different videos differ, locating the speech-line region from the height ratios between a frame's text regions and a speech-line region adaptively determined from multiple frames of the same video improves the positioning accuracy.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.

FIG. 1 is a diagram of an application environment for a method of processing video data in one embodiment;

FIG. 2 is a flow diagram illustrating a method for processing video data in one embodiment;

FIG. 3 is a diagram illustrating corner detection results in one embodiment;

FIG. 4 is a block diagram of an apparatus for processing video data according to one embodiment;

FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

FIG. 1 is a diagram of an application environment for a method of processing video data in one embodiment. Referring to fig. 1, the method of processing video data is applied to a system for processing video data. The system for processing video data includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.

In one embodiment, as shown in FIG. 2, a method of processing video data is provided. The embodiment is mainly illustrated by applying the method to the terminal 110 (or the server 120) in fig. 1. Referring to fig. 2, the method for processing video data specifically includes the following steps:

step S201, obtaining a current frame in a current video, and performing text detection on the current frame to obtain at least one text region of the current frame.

Specifically, the current video refers to the video data being processed; it may be a television series, an episode of a television series, a movie, or other video data. The current frame is the video frame being processed; when a text region is detected in the current frame, the text region is obtained, that is, its region information, including but not limited to the size and position of the region. Text detection may use any common algorithm, such as a deep-learning method, a sliding-window method, a connected-component method, or the stroke width transform.

In one embodiment, the text region may or may not exist in the video frame, and for the video frame in which the text information is not detected, the next video frame is acquired, and the text detection is performed on the next video frame.

Step S202, acquiring a speech-line region of the current video.

In this embodiment, the speech-line region of the current video is determined from the text regions of a plurality of video sequence frames of the current video.

Specifically, the text regions of a plurality of video frames in the current video are screened: the video sequence frames are accumulated, and region screening is performed on the accumulated frame to obtain the speech-line region. The information corresponding to the speech-line region is the speech-line region information, which comprises the size of the region and its position; the position can be represented by row and column coordinates. Because the speech-line region information is adaptively determined from the text regions detected in a plurality of video sequence frames of the current video, a more accurate speech-line region can be obtained. The video sequence frames may be the sequence frames of the entire video or of part of it. The speech lines may be movie subtitles, television subtitles, song lyrics, and so on. There may be one or more speech-line regions.

In step S203, region information of the overlapping region of each text region and the speech-line region is acquired.

In this embodiment, the region information comprises a height.

Step S204, respectively calculating the ratio of the height of each overlapping region to the height of the corresponding text region and to the height of the corresponding speech-line region to obtain a first ratio and a second ratio.

In step S205, when either of the first ratio and the second ratio is greater than the first preset threshold, the region information of the text region is used as the region information of the current speech-line region.

Specifically, the overlapping region is the region where a text region and the speech-line region overlap. The ratio of the height of the overlapping region to the height of the text region gives the first ratio, and the ratio of the height of the overlapping region to the height of the corresponding speech-line region gives the second ratio. When both ratios are smaller than the first preset threshold, the overlap in the height direction between the text region and the speech-line region is small, and the text region is treated as a non-speech-line region. Conversely, when either ratio is greater than the first preset threshold, the text region is taken as the speech-line region, and its region information is used as the region information of the current speech-line region.
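As an illustration, the overlap-ratio test described above can be sketched in Python. The function name, the (top, bottom) box representation, and the default threshold of 0.5 are assumptions for the example, not values fixed by the method:

```python
def classify_text_region(text_box, line_box, threshold=0.5):
    """Return True when text_box should be treated as the speech-line region.

    Boxes are (top, bottom) vertical extents in pixel rows; `threshold` is
    the first preset threshold of the method (0.5 is an assumed value).
    """
    top = max(text_box[0], line_box[0])
    bottom = min(text_box[1], line_box[1])
    overlap_h = max(0, bottom - top)  # height of the overlapping region
    if overlap_h == 0:
        return False
    first_ratio = overlap_h / (text_box[1] - text_box[0])   # vs. text region height
    second_ratio = overlap_h / (line_box[1] - line_box[0])  # vs. speech-line height
    return first_ratio > threshold or second_ratio > threshold
```

For example, a text box spanning rows 100-140 overlapping a speech-line box at rows 110-150 yields a first ratio of 30/40 = 0.75, so the text region is accepted as the speech-line region.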

The method for processing video data comprises: acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text region of the current frame; acquiring a speech-line region of the current video, wherein the speech-line region is determined according to text regions of a plurality of video sequence frames of the current video; acquiring region information of an overlapping region of each text region and the speech-line region, wherein the region information comprises a height; respectively calculating the ratio of the height of each overlapping region to the height of the corresponding text region and to the height of the corresponding speech-line region to obtain a first ratio and a second ratio; and when either ratio is greater than a first preset threshold, taking the region information of the text region as the region information of the speech-line region. Because the speech-line regions of different videos differ, locating a frame's speech-line region against a speech-line region adaptively determined from multiple frames of the same video improves the positioning accuracy and thus the detection accuracy of the speech-line region.

In one embodiment, before acquiring the current frame in the current video, the method further includes:

step S301, a plurality of original video sequence frames corresponding to the current video are obtained, and text detection is performed on each original video sequence frame to obtain a corresponding video sequence frame.

In the present specific embodiment, the pixel values of the text region and the non-text region of each video sequence frame are 1 and 0, respectively.

Step S302, accumulating each video sequence frame to obtain a heat map.

And step S303, binarizing the heat map to obtain a binary image.

In this embodiment, the binary image includes at least one closed region, the pixel value of the closed region is 1, and the closed region is an intersection region or a non-intersection region.

Specifically, an original video sequence frame is a video frame obtained by framing the current video, and a video sequence frame is the detection image obtained by performing text detection on all or part of the original video sequence frames; each video sequence frame is an original binary image. The framing interval can be determined according to the display duration of a single speech line: it can be larger for video data whose lines are displayed longer and smaller for video data whose lines are displayed more briefly, or it can be a preset fixed duration; the specific setting can be customized as required. For example, if the same speech line is displayed for 1 or 3 seconds, a similar interval of 1 or 3 seconds can be chosen when framing; alternatively, framing can use a uniform interval and the framed video sequence frames can then be sampled according to the display duration of the lines. Each video sequence frame contains a text region and a non-text region, whose pixel values are 1 and 0, respectively.
All video sequence frames are superimposed to obtain an accumulated video frame, which is a grayscale image: accumulating the 0 values of non-text regions does not change the pixel values, while accumulating the 1 values of text regions yields a heat map. Because the text regions differ between the video sequence frames, the contour of the heat map is blurred, so threshold segmentation is applied to the heat map to obtain a binary image. The binary image comprises one or more closed regions, each with pixel value 1. A cross region is a closed region that is not a standard rectangular region, i.e., different closed regions intersect; a non-crossing region is a region that does not intersect any other closed region. The crossing types of a cross region include up-down crossing and left-right crossing.
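The accumulation and threshold segmentation described above can be sketched with plain Python lists. The function names and the representation of frames as 0/1 row lists are illustrative assumptions:

```python
def build_heat_map(frames):
    """Accumulate per-frame binary text masks (1 = text pixel) into a heat map."""
    h, w = len(frames[0]), len(frames[0][0])
    heat = [[0] * w for _ in range(h)]
    for frame in frames:
        for r in range(h):
            for c in range(w):
                heat[r][c] += frame[r][c]
    return heat

def binarize(heat, threshold):
    """Threshold segmentation: pixels hit in more than `threshold` frames become 1."""
    return [[1 if v > threshold else 0 for v in row] for row in heat]
```

Pixels of a stable subtitle row accumulate high counts across frames, while transient text (credits, on-screen signs) stays below the threshold and is removed by the binarization.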

And step S304, when the closed regions contain a cross region, performing region division on the cross region according to a preset division rule to obtain non-crossing regions, repeating until every closed region is a non-crossing region.

Step S305, performing region screening on the non-crossed region according to a preset speech area constraint rule to obtain a speech area.

Specifically, if all closed regions are non-crossing regions, region screening is performed according to a preset speech-line region constraint rule, and the non-crossing regions satisfying the rule are taken as speech-line regions. Otherwise, if the closed regions include a cross region, the cross region is divided into a plurality of non-crossing regions. All cross regions in the heat map are divided until every closed region in the heat map is a non-crossing region. The division can depend on the crossing type of the region: different crossing types can use different division methods, and both the crossing types and the division method for each type can be customized. For example, the crossing types can be defined as up-down crossing and left-right crossing, with different division rules for each: up-down cross regions can be reconstructed from detected corner information, while left-right cross regions can be divided according to the pixel distribution of the region. Region screening is then performed on the non-crossing regions in the heat map to obtain the speech-line regions. The screening uses a preset speech-line region constraint rule, which includes but is not limited to one or more of a height constraint, a width constraint, a pixel distribution constraint, a region position constraint, and the like. The non-crossing regions satisfying the preset speech-line region constraint rule are taken as speech-line regions.
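For the left-right crossing case, splitting by the column-mean profile might look like the following simplified sketch. It splits at the lowest interior column mean, a simplifying assumption standing in for "the minimum lying between the maxima"; the function name is also assumed:

```python
def split_left_right(region):
    """Split a left-right cross region (list of 0/1 rows) into two parts
    at the column whose mean is lowest strictly inside the region."""
    h = len(region)
    col_means = [sum(region[r][c] for r in range(h)) / h
                 for c in range(len(region[0]))]
    # the valley between the two text blocks: lowest interior column mean
    interior = col_means[1:-1]
    split_col = 1 + interior.index(min(interior))
    left = [row[:split_col] for row in region]
    right = [row[split_col:] for row in region]
    return left, right
```

For a region whose column means form two peaks separated by a gap of low values, the split lands in that gap, yielding two non-crossing candidate regions.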

In one embodiment, after the speech-line regions are obtained, the heat map containing only the speech-line regions is binarized to obtain a corresponding binarized segmented image, and the binarized segmented image is denoised using morphology, where the morphological operations include erosion and dilation and may be applied multiple times. During denoising, a heatmask is generated in which the hot-spot regions are set to 255 and the remaining regions to 0.
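The binarize-then-denoise step can be sketched as follows. This is a minimal NumPy sketch: the 3x3 `erode`/`dilate` helpers stand in for library morphology routines (such as OpenCV's), and the binarization threshold is an assumed parameter, not a value from the embodiment.

```python
import numpy as np

def dilate(img):
    """3x3 binary dilation via shifted maxima (a minimal stand-in for a
    library dilation; out-of-bounds pixels are treated as 0)."""
    p = np.pad(img, 1)
    out = np.zeros_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out = np.maximum(out, p[1 + dy:1 + dy + img.shape[0],
                                    1 + dx:1 + dx + img.shape[1]])
    return out

def erode(img):
    """3x3 binary erosion (the dual of dilation; out-of-bounds treated as 1)."""
    p = np.pad(img, 1, constant_values=1)
    out = np.ones_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out = np.minimum(out, p[1 + dy:1 + dy + img.shape[0],
                                    1 + dx:1 + dx + img.shape[1]])
    return out

def heatmask(heatmap, thresh):
    """Binarize the heat map, then one erosion followed by one dilation
    (a morphological opening) to drop speckle noise; hot pixels become
    255, the rest 0."""
    binary = (heatmap >= thresh).astype(np.uint8)
    return dilate(erode(binary)) * 255
```

Opening (erosion then dilation) removes isolated hot pixels while keeping the interior of larger hot-spot blocks, which matches the denoising purpose described above.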

In one embodiment, step S305 includes:

step S3051, judging whether the height of each non-intersecting region is smaller than a preset height threshold, and setting the region identifier of a non-intersecting region as a first identifier when its height is smaller than the preset height threshold.

Step S3052, judging whether the width of each non-intersecting region is smaller than a first preset width threshold or larger than a second preset width threshold, where the second preset width threshold is larger than the first, and setting the region identifier of a non-intersecting region as the first identifier when its width is smaller than the first preset width threshold or larger than the second preset width threshold.

Step S3053, judging whether the start pixel of each non-intersecting region is located to the right of the preset position, and setting the region identifier of a non-intersecting region as the first identifier when its start pixel is located to the right of the preset position.

Step S3054, setting the region identifiers of the remaining non-intersecting regions of the binary image as second identifiers, extracting the region features of the non-intersecting regions corresponding to the second identifier, and determining the format type of each region according to its region features, the format types including left alignment and center symmetry.

Step S3055, when the number of the non-intersecting regions corresponding to the first identifier is 0, screening out the speech-line regions according to the format types of the non-intersecting regions corresponding to the second identifier.

Specifically, in step S3051, step S3052, and step S3053, a height threshold, width thresholds, and position information are respectively used to screen the non-intersecting regions. When a non-intersecting region violates any of the constraints, that is, its height is smaller than the preset height threshold, its width is smaller than the first preset width threshold, its width is larger than the second preset width threshold, or its start position lies to the right of the preset position, the region is not a speech-line region; all such regions are marked with the first identifier. Non-intersecting regions satisfying all four conditions are marked with the second identifier. The region features of the non-intersecting regions corresponding to the second identifier are then extracted. The region feature can be the distribution of pixel values, which includes decreasing from left to right, decreasing from the middle toward both sides, and the like. A non-intersecting region whose distribution decreases from left to right has a left-aligned speech-line format, and one whose distribution decreases from the middle toward both sides has a center-symmetric speech-line format. If the non-intersecting regions in the heat map satisfy the three conditions on height, width and position, region screening is performed according to the speech-line format of the regions corresponding to the second identifier, and the non-intersecting regions matching a speech-line format are taken as speech-line regions.
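The three marking rules of steps S3051 to S3053 amount to a simple predicate per region. A hedged sketch follows, in which all thresholds and the region representation (a dict of `height`, `width`, `x_start`) are illustrative assumptions, not values from the embodiment:

```python
def screen_regions(regions, min_h, w_lo, w_hi, x_max):
    """Apply the height / width / start-position constraints described
    above. Regions violating any rule get identifier 1 (rejected);
    the rest get identifier 2 (speech-line candidates)."""
    for r in regions:
        bad = (r["height"] < min_h                 # too short (S3051)
               or r["width"] < w_lo                # too narrow (S3052)
               or r["width"] > w_hi                # too wide (S3052)
               or r["x_start"] > x_max)            # starts too far right (S3053)
        r["id"] = 1 if bad else 2
    return regions
```

Only the regions left with identifier 2 go on to the format-type analysis of step S3054.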

In one embodiment, step S3055, comprises:

step S30551, when the speech-line formats of the non-intersecting regions corresponding to the second identifier all belong to either left alignment or center symmetry, the non-intersecting regions whose speech-line format is left-aligned or center-symmetric are taken as speech-line regions.

Step S30552, when the non-intersecting region corresponding to the second identifier includes the non-intersecting region in the speech-line format that is either left-aligned or center-symmetric and the non-intersecting region that is not in the speech-line format that is either left-aligned or center-symmetric, setting the pixel value in the non-intersecting region that is not in the speech-line format that is either left-aligned or center-symmetric to 0.

Step S30553, when the speech-line format of the non-intersecting region corresponding to the second identifier does not belong to any format of left alignment or central symmetry, and the number of the non-intersecting regions is greater than 1, calculating a regional variance of the non-intersecting region corresponding to each second identifier, and taking the non-intersecting region corresponding to the maximum value in the regional variances as the speech-line region.

Specifically, when the number of the non-intersecting regions corresponding to the first identifier is 0, it is determined whether the speech-line format of each non-intersecting region corresponding to the second identifier is left-aligned or center-symmetric. If all formats belong to left alignment or center symmetry, the non-intersecting regions corresponding to the second identifier are taken as speech-line regions. If only some of them are left-aligned or center-symmetric while the rest belong to neither format, the non-intersecting regions whose format belongs to neither are deleted, that is, their pixel values are set to 0. When none of the speech-line formats of the non-intersecting regions corresponding to the second identifier belongs to left alignment or center symmetry and the number of such regions is greater than one, the regional variance of each of these regions is calculated, and the region with the maximum variance is taken as the speech-line region. The region with the greatest variance is retained because a larger variance indicates greater variability between pixels, and the pixels in a speech-line region vary strongly between text strokes and background, so its variance is larger than that of other regions.
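The variance fallback of step S30553 can be sketched as follows; the helper name `pick_by_variance` and the use of the plain per-region pixel variance are assumptions for illustration:

```python
import numpy as np

def pick_by_variance(regions):
    """When no candidate matches a speech-line layout, fall back to the
    region with the largest pixel variance: subtitle text alternates
    strongly between strokes and background, so its variance is high.
    Returns the index of the winning region."""
    variances = [float(np.var(r)) for r in regions]
    return int(np.argmax(variances))
```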

In one embodiment, step S3054, includes:

step S30541, equally dividing the non-intersecting region corresponding to the second identifier into a plurality of sub-regions from left to right, and calculating a pixel mean value of each sub-region.

Step S30542, when the pixel mean values corresponding to the non-intersecting regions corresponding to the second identifier decrease from left to right, setting the lines format corresponding to the non-intersecting regions corresponding to the second identifier to be left-aligned.

Step S30543, dividing the non-intersecting region corresponding to the second identifier from the middle to obtain a left region and a right region, equally dividing each of them into a plurality of sub-regions from left to right, and calculating the pixel mean value of each sub-region; when the pixel mean values of the sub-regions of the left region increase from left to right and those of the right region decrease from left to right, setting the speech-line format of the non-intersecting region corresponding to the second identifier as center symmetry.

Specifically, the non-intersecting region corresponding to the second identifier is divided from left to right into a plurality of rectangular sub-regions of the same area. The pixel mean of each sub-region is calculated, that is, the pixel values of all pixels in the sub-region are summed and the sum is divided by the number of pixels in the sub-region. It is then judged whether the pixel means decrease from left to right; if so, the speech-line format of the region is left-aligned. A region that is not left-aligned is divided from the middle into a left region and a right region, each of which is equally divided into rectangular sub-regions of the same area. The pixel mean of each sub-region in the left and right regions is computed, and it is judged whether the means of the left region increase from left to right and those of the right region decrease from left to right, that is, whether the region decreases from the middle toward both sides; if so, the speech-line format of the region is center-symmetric. By determining the speech-line format of the non-intersecting regions corresponding to the second identifier, it can be decided whether each of them is a speech-line region, since the speech-line format of a genuine speech-line region is generally left-aligned or center-symmetric.
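A minimal sketch of the format test described above, assuming the region is a NumPy array; the band count of four and the non-strict monotonicity test are illustrative choices, not values from the embodiment:

```python
import numpy as np

def column_band_means(region, n):
    """Split the region into n roughly equal-width vertical bands and
    return the mean pixel value of each band."""
    bands = np.array_split(region, n, axis=1)
    return [float(b.mean()) for b in bands]

def detect_format(region, n=4):
    """Classify the layout: band means decreasing from left to right
    -> 'left' (left-aligned); increasing on the left half while
    decreasing on the right half -> 'center' (center-symmetric);
    otherwise 'other'."""
    m = column_band_means(region, n)
    if all(a >= b for a, b in zip(m, m[1:])):
        return "left"
    mid = region.shape[1] // 2
    lm = column_band_means(region[:, :mid], n // 2)
    rm = column_band_means(region[:, mid:], n // 2)
    if (all(a <= b for a, b in zip(lm, lm[1:]))
            and all(a >= b for a, b in zip(rm, rm[1:]))):
        return "center"
    return "other"
```

Regions classified as `"other"` would fall through to the variance fallback described earlier.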

In an embodiment, the method for processing video data further includes:

step S306, when the number of the non-intersecting regions corresponding to the first identifier is not 0, setting the pixel values in the non-intersecting regions corresponding to the first identifier to 0 to obtain a second heat map, and binarizing the second heat map, repeating the process until the region information of the speech-line region of the second heat map is obtained.

Specifically, when the number of the non-intersecting regions corresponding to the first identifier is not 0, it indicates that the heat map still contains regions that are not speech-line candidates. The non-intersecting regions corresponding to the first identifier are deleted, that is, their pixel values are set to 0, to obtain a new heat map (the second heat map). The new heat map is binarized, that is, segmented again, and the process repeats until the number of non-intersecting regions corresponding to the first identifier is 0, at which point step S3055 is performed to obtain the region information of the speech-line region of the second heat map.

In an embodiment, the method for processing video data further includes:

step S307, dividing the intersection area in the heat map into an upper area and a lower area, and obtaining a first area and a second area.

In step S308, a column mean of the first region and the second region is calculated.

In step S309, the number of maximum values in the column mean of the first area and the second area is searched.

In step S310, when the number of the maximum values of the column mean of the first region and the column mean of the second region is 1, the crossing type of the crossing region is set to be vertical crossing.

In step S311, when both the maximum value of the column mean of the first region and the maximum value of the column mean of the second region are greater than 1, the type of the intersection region is set to be left-right intersection.

In this embodiment, step S304 includes: and carrying out region division on the crossed region corresponding to the crossed type according to a preset division rule corresponding to the crossed type to obtain a corresponding non-crossed region.

Specifically, the intersecting region in the heat map is divided into an upper region and a lower region, yielding a first region and a second region corresponding to the upper and lower parts respectively. The column mean of each column of the first region and the second region is calculated, that is, the pixel values in each column are averaged; for example, if the first column of the first region contains the 4 pixel values {10, 12, 14, 16}, the column mean of that column is 13. The maximum values of the column means are then searched in the first region and the second region respectively. If the column means of both the first region and the second region contain more than one maximum value, two peaks appear in the column means and the intersecting region belongs to the left-right intersection type; if the column means of the first region and the second region each contain exactly one maximum value, only one peak appears in each and the region belongs to the up-down (vertical) intersection type. After the intersection type is determined, the method for dividing the intersecting region is selected accordingly: the preset division rule corresponding to the intersection type is obtained and applied to the intersecting region to obtain a plurality of non-intersecting regions.
The preset division rules corresponding to up-down and left-right intersections can be customized; for example, the rule for up-down intersecting regions may construct rectangular regions from the detected corner points, while the rule for left-right intersections may determine the boundaries of the rectangular regions from the pixel distribution.
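The peak-counting classification of steps S307 to S311 can be sketched as follows; the strict-local-maximum definition of a peak is an assumption, since the embodiment does not pin down how maxima are counted:

```python
import numpy as np

def count_peaks(values):
    """Count strict local maxima in a 1-D sequence (endpoints count
    when they exceed their single in-bounds neighbor)."""
    n = len(values)
    peaks = 0
    for i in range(n):
        left = values[i - 1] if i > 0 else -np.inf
        right = values[i + 1] if i < n - 1 else -np.inf
        if values[i] > left and values[i] > right:
            peaks += 1
    return peaks

def cross_type(region):
    """Split the crossing region into top and bottom halves; more than
    one peak in both halves' column means -> left-right crossing,
    otherwise vertical (up-down) crossing."""
    mid = region.shape[0] // 2
    top = region[:mid].mean(axis=0)      # column means, upper half
    bottom = region[mid:].mean(axis=0)   # column means, lower half
    if count_peaks(top) > 1 and count_peaks(bottom) > 1:
        return "left-right"
    return "vertical"
```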

In one embodiment, step S304 includes: when the crossing type is vertical (up-down) crossing, performing corner detection on the intersecting region to obtain corner information of each corner, classifying the corners according to the row coordinates in the corner information to obtain a category identifier for each corner, constructing rectangular regions from the corner points sharing the same category identifier, and taking the constructed rectangular regions as non-intersecting regions; and when the crossing type is left-right crossing, calculating the column means of the intersecting region and splitting the region left and right at the column holding the minimum value between the maximum values of the column means, to obtain the corresponding non-intersecting regions.

Specifically, the corner detection algorithm may adopt any common corner detection algorithm, including but not limited to the Harris corner detection algorithm, improved Harris variants, the Moravec corner detection algorithm, the FAST corner detection algorithm, and the like. A corner detection algorithm is applied to the intersecting regions of vertical crossing type in the heat map to obtain the corresponding corner information, and the region is divided according to the corner positions. For example, the two corner points located at the top serve as corners of a first rectangular region, the two corner points located at the bottom serve as corners of a second rectangular region, and each corner point located in the middle is assigned to the first or second rectangular region according to the intersecting region; three corner points that are not collinear are selected to construct each of the first and second rectangles. As shown in fig. 3, corner points 401, 402 and 403 are used to construct the first rectangle, and corner points 403, 404 and 405 are used to construct the second rectangle.
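A simplified sketch of building the two rectangles from already-detected corner points (corner detection itself, such as Harris, is assumed to have run beforehand); grouping corners by the vertical midpoint is an illustrative stand-in for the row-coordinate classification described above:

```python
def split_vertical_cross(corners):
    """Given detected corner points (x, y) of a vertically crossing
    region, group them into a top set and a bottom set by the vertical
    midpoint of all corners, and return one bounding rectangle
    (x0, y0, x1, y1) per set."""
    ys = sorted(c[1] for c in corners)
    split = (ys[0] + ys[-1]) / 2          # vertical midpoint of corner spread
    top_pts, bottom_pts = [], []
    for x, y in corners:
        (top_pts if y <= split else bottom_pts).append((x, y))

    def bbox(pts):
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        return (min(xs), min(ys), max(xs), max(ys))

    return bbox(top_pts), bbox(bottom_pts)
```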

When the crossing type is left-right crossing, the column mean of each column in the intersecting region is calculated, the position of the minimum value between two adjacent maximum values of the column means is searched, and the region is divided into a left region and a right region with the column holding that minimum value as the boundary. Left-right intersecting regions are divided using the distribution of column means because of how such intersections arise: the centers of the two text blocks are text-dense, so their column means in the heat map are large, while the gap between the blocks produces a valley of small column means. The column-mean method is simple to implement and highly accurate.
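The valley-split for left-right crossings can be sketched as follows, assuming the region is a NumPy array and using strict local maxima as peaks (an illustrative choice):

```python
import numpy as np

def split_left_right(region):
    """Split a left-right crossing region at the column holding the
    minimum of the column means between its two peaks (the low valley
    separating the two text blocks)."""
    col_means = region.mean(axis=0)
    # strict interior local maxima of the column-mean curve
    peaks = [i for i in range(1, len(col_means) - 1)
             if col_means[i] > col_means[i - 1]
             and col_means[i] > col_means[i + 1]]
    lo, hi = peaks[0], peaks[-1]
    cut = lo + int(np.argmin(col_means[lo:hi + 1]))  # valley column
    return region[:, :cut], region[:, cut:]
```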

In a specific embodiment, the method for processing video data includes:

and accumulating the text detection area parts in all the image frames corresponding to all the current videos to generate a speech area original heatmap (horigin). The heatmap is consistent with the image frame size. Judging whether the proportion of the image frames with the text area in all the image frames exceeds a threshold value P, determining whether the video has no lines, directly filtering all the line areas and returning the line areas for the video without the lines, and otherwise, continuously judging Horigin.

Horigin is normalized and then binarized, edges are extracted, and the plurality of different closed regions bounded by the edges are segmented. If intersecting regions exist, their crossing types are determined, the types including vertical (up-down) crossing and left-right crossing. To judge the crossing type, the intersecting region is divided into an upper region and a lower region and the maximum values of the column means in the two regions are counted: if both regions have a plurality of maximum values, the crossing type of the intersecting region is left-right crossing; if each region has 1 maximum value, the crossing type is up-down crossing.

For up-down intersecting regions, Harris corner detection is applied, the region to which each corner belongs is determined from the corner positions, and the bounding (circumscribed) rectangles of the two corner groups are computed as the divided regions, that is, a plurality of non-intersecting regions.

For left-right intersecting regions, the region is divided into an upper part and a lower part and the maximum values of the curves formed by the column means are calculated for each part; if either curve has more than one maximum value, the region is divided at the position of the minimum value between the maximum values to obtain a plurality of non-intersecting regions.

A region list D1 is built from the non-intersecting regions that satisfy the three conditions on height, width and start position. Each region in D1 is equally divided into N sub-regions from left to right and the pixel mean of each sub-region is computed. If the means decrease from left to right, the speech-line format of the region is left-aligned and the region is added to a left-aligned region list L. To test for center symmetry, the region is divided into a left part and a right part and the alignment feature is computed for each part; if the left part increases from left to right while the right part decreases, the region is considered center-symmetric and is added to a center-symmetric region list. Regions whose format belongs to neither list are deleted by setting their pixel values to 0; if more than one candidate remains without a matching format, the regional variance of each candidate is calculated and the region with the maximum variance is retained as the speech-line region. The heat map containing only the retained speech-line regions is the final heat map Hfinal.

Hfinal is binarized, and after multiple erosion and dilation operations a heatmask is generated in which the hot-spot regions are 255 and the remaining regions are 0.

A current frame of the current video is acquired and text detection is performed on it to obtain the text regions of the current frame. For each text region, the intersection with the heatmask is computed, and it is checked whether both the ratio of the intersection height to the text region height and the ratio of the intersection height to the heatmask region height are smaller than a threshold; if both are smaller, the text region is a non-speech-line region, and otherwise it is a speech-line region.
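The final per-text-region decision can be sketched with vertical extents only; the box representation `(y0, y1)` and the 0.5 threshold are illustrative assumptions:

```python
def is_subtitle(text_box, mask_box, thresh=0.5):
    """Classify a detected text box by the height of its overlap with
    the adaptive heatmask box: if the overlap height divided by either
    box's height exceeds the threshold, treat the text box as subtitle
    text. Boxes are (y0, y1) vertical extents."""
    overlap = min(text_box[1], mask_box[1]) - max(text_box[0], mask_box[0])
    if overlap <= 0:
        return False                      # no vertical overlap at all
    r1 = overlap / (text_box[1] - text_box[0])   # ratio vs. text box height
    r2 = overlap / (mask_box[1] - mask_box[0])   # ratio vs. mask box height
    return r1 > thresh or r2 > thresh
```

Checking either ratio keeps short text boxes fully inside a tall mask band (and vice versa) from being missed.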

The above method for processing video data generates a heatmap by accumulating the text regions of all image frames, divides each hot-spot region of the heatmap, and computes the left-aligned and center-symmetric features of each region. It can accurately locate the region where the speech lines are situated and adapts to a variety of speech-line layouts such as left-aligned and center-aligned subtitles.

FIG. 2 is a flow diagram illustrating a method for processing video data in one embodiment. It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not performed in a strictly limited order and may be performed in other orders. Moreover, at least a portion of the steps in fig. 2 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time but may be performed at different times, and that are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 4, there is provided an apparatus 200 for processing video data, comprising:

the text detection module 201 is configured to obtain a current frame in a current video, and perform text detection on the current frame to obtain at least one text region of the current frame.

A speech area obtaining module 202, configured to obtain a speech area of a current video, where the speech area of the current video is determined according to text areas of multiple video sequence frames of the current video.

An overlap region obtaining module 203, configured to obtain region information of the overlap region between each text region and the speech-line region, where the region information includes the height of the overlap region.

The ratio calculation module 204 is configured to calculate a ratio of the height of each overlapping region to the height of the corresponding text region and the height of the corresponding speech region, respectively, to obtain a first ratio and a second ratio.

The region screening module 205 is configured to, when any one of the first ratio and the second ratio is greater than a first preset threshold, use the region information of the text region as the region information of the current speech-line region.

In an embodiment, the apparatus 200 for processing video data further includes:

the sequence frame acquisition module is used for acquiring a plurality of original video sequence frames corresponding to the current video, performing text detection on each original video sequence frame to obtain a corresponding video sequence frame, wherein the pixel values of a text region and a non-text region of each video sequence frame are respectively 1 and 0.

And the image accumulation module is used for accumulating all video sequence frames to obtain the heat map.

And the binarization module is used for binarizing the heat map to obtain a binary image, wherein the binary image comprises at least one closed area, the pixel value of the closed area is 1, and the closed area is a cross area or a non-cross area.

And the intersection region segmentation module is used for performing region division on the intersection region according to a preset division rule to obtain a non-intersection region when the intersection region is included in the closed region until the closed region is the non-intersection region.

And the speech area screening module is used for carrying out area screening on the non-crossed area according to a preset speech area constraint rule to obtain a speech area.

In one embodiment, the speech-line region screening module is specifically configured to: judge whether the height of each non-intersecting region is smaller than a preset height threshold, and set the region identifier of the non-intersecting region as a first identifier when it is; judge whether the width of each non-intersecting region is smaller than a first preset width threshold or larger than a second preset width threshold, the second being larger than the first, and set the region identifier as the first identifier when it is; judge whether the start pixel of each non-intersecting region is located to the right of the preset position, and set the region identifier as the first identifier when it is; set the region identifiers of the remaining non-intersecting regions of the binary image as second identifiers, extract the region features of the non-intersecting regions corresponding to the second identifier, and determine the format type of each region from its features, the format types including left alignment and center symmetry; and when the number of the non-intersecting regions corresponding to the first identifier is 0, screen out the speech-line regions according to the format types of the non-intersecting regions corresponding to the second identifier.

In one embodiment, the speech-line region screening module is specifically configured to: when the number of the non-intersecting regions corresponding to the first identifier is 0 and the speech-line formats of the non-intersecting regions corresponding to the second identifier all belong to left alignment or center symmetry, take the non-intersecting regions whose speech-line format is left-aligned or center-symmetric as speech-line regions; when the number of the non-intersecting regions corresponding to the first identifier is 0 and the non-intersecting regions corresponding to the second identifier include both regions whose format is left-aligned or center-symmetric and regions whose format is neither, set the pixel values in the regions whose format is neither to 0; and when the number of the non-intersecting regions corresponding to the first identifier is 0, none of the speech-line formats of the non-intersecting regions corresponding to the second identifier belongs to left alignment or center symmetry, and the number of such regions is greater than 1, calculate the regional variance of each of these regions and take the region with the maximum variance as the speech-line region.

In one embodiment, the speech-line region screening module is specifically configured to: equally divide the non-intersecting region corresponding to the second identifier into a plurality of sub-regions from left to right and calculate the pixel mean value of each sub-region; when the pixel mean values decrease from left to right, set the speech-line format of the region to left-aligned; otherwise divide the region from the middle into a left region and a right region, equally divide each into a plurality of sub-regions from left to right and calculate the pixel mean value of each sub-region, and when the pixel mean values of the sub-regions of the left region increase from left to right while those of the right region decrease from left to right, set the speech-line format of the non-intersecting region corresponding to the second identifier as center symmetry.

In an embodiment, the speech-line region screening module is specifically configured to, when the number of the non-intersecting regions corresponding to the first identifier is not 0, set the pixel value in the non-intersecting region corresponding to the first identifier to 0 to obtain a second heat map, and execute binarization on the second heat map until region information of the speech-line region of the second heat map is obtained.

In an embodiment, the apparatus 200 for processing video data further includes:

the cross area data processing module is used for dividing the cross area in the heat map into an upper area and a lower area to obtain a first area and a second area; calculating a column mean value of the first area and the second area; searching the number of maximum values in the column mean values of the first area and the second area; when the number of the maximum values of the column mean values of the first area and the second area is 1, setting the crossing type of the crossing area as vertical crossing; and when the maximum value number of the column mean value of the first area and the maximum value number of the column mean value of the second area are both greater than 1, setting the type of the cross area as left-right cross.

The cross region division module is specifically configured to perform region division on a cross region corresponding to the cross type according to a preset division rule corresponding to the cross type to obtain a corresponding non-cross region.

In one embodiment, the crossing region dividing module is specifically configured to, when the crossing type is vertical crossing, perform corner detection on the crossing region to obtain corner information for each corner, classify the corners according to the row coordinates in the corner information to obtain a category identifier for each corner, construct a rectangular region from the corners with the same category identifier, and take the constructed rectangular region as a non-intersecting region; and, when the crossing type is left-right crossing, calculate the column mean values of the crossing region and split the region into left and right parts at the column where the minimum value between the maxima of the column mean values is located, to obtain the corresponding non-intersecting regions.

FIG. 5 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 (or the server 120) in fig. 1. As shown in fig. 5, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected via a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement a method of processing video data. The internal memory may also have stored therein a computer program that, when executed by the processor, causes the processor to perform a method of processing video data. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, the apparatus for processing video data provided herein may be implemented in the form of a computer program that is executable on a computer device such as that shown in fig. 5. The memory of the computer device may store various program modules constituting the apparatus for processing video data, such as a text detection module 201, a speech-line region acquisition module 202, an overlap region acquisition module 203, a ratio calculation module 204, and a region filtering module 205 shown in fig. 4. The computer program constituted by the respective program modules causes the processor to execute the steps in the method of processing video data of the embodiments of the present application described in the present specification.

For example, the computer device shown in fig. 5 may perform, by the text detection module 201 in the apparatus for processing video data shown in fig. 4, acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text region of the current frame. The computer device may perform the step of acquiring the speech area of the current video through the speech area acquisition module 202, where the speech area of the current video is determined according to text areas of a plurality of video sequence frames of the current video. The computer device may perform acquiring region information of an overlapping region of each text region and the speech region, the region information including a height, by the overlapping region acquisition module 203. The computer device may perform the calculation of the ratio of the height of each overlapping region to the height of the corresponding text region and the corresponding speech region through the ratio calculation module 204 to obtain the first ratio and the second ratio. The computer device may perform, by the region filtering module 205, when any one of the first ratio and the second ratio is greater than a first preset threshold, using the region information of the text region as the region information of the current speech-line region.

In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text area of the current frame; acquiring a speech-line region of a current video, wherein the speech-line region of the current video is determined according to text regions of a plurality of video sequence frames of the current video; acquiring region information of an overlapping region of each text region and a speech-line region, wherein the region information comprises height; respectively calculating the ratio of the height of each overlapped region to the height of the corresponding text region and the corresponding line region to obtain a first ratio and a second ratio; and when any ratio of the first ratio and the second ratio is larger than a first preset threshold, taking the region information of the text region as the region information of the current speech-line region.
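The overlap-ratio test described above can be sketched as follows. This is a minimal illustration rather than the patented implementation; the function name, the `(top, bottom)` box representation, and the `0.8` default standing in for the first preset threshold are all assumptions, since the text does not give concrete values.

```python
def is_speech_line_region(text_box, speech_box, threshold=0.8):
    """Return True if text_box likely belongs to the speech-line region.

    Boxes are (top, bottom) vertical extents in pixel rows. The height of
    the overlap is compared against both the text region height (first
    ratio) and the speech-line region height (second ratio); exceeding the
    threshold on either ratio marks the text region as a speech-line region.
    """
    overlap_top = max(text_box[0], speech_box[0])
    overlap_bottom = min(text_box[1], speech_box[1])
    overlap_h = max(0, overlap_bottom - overlap_top)
    first_ratio = overlap_h / (text_box[1] - text_box[0])       # vs. text region
    second_ratio = overlap_h / (speech_box[1] - speech_box[0])  # vs. speech-line region
    return first_ratio > threshold or second_ratio > threshold
```

Because the comparison uses "any one of the two ratios", a short text box fully inside a taller adaptive speech-line region still passes via the first ratio.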

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a plurality of original video sequence frames corresponding to the current video, and performing text detection on each original video sequence frame to obtain a corresponding video sequence frame, wherein the pixel values of the text regions and the non-text regions of each video sequence frame are 1 and 0, respectively; accumulating the video sequence frames to obtain a heat map; binarizing the heat map to obtain a binary image, the binary image comprising at least one closed region whose pixel value is 1, each closed region being a crossing region or a non-intersecting region; when a closed region contains a crossing region, performing region division on the crossing region according to a preset division rule until every closed region is a non-intersecting region; and performing region screening on the non-intersecting regions according to a preset speech-line region constraint rule to obtain the speech-line region.
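The accumulate-then-binarize step might be sketched as below, assuming the per-frame text masks are NumPy arrays with text pixels set to 1. Normalizing by the frame count and the `0.5` binarization threshold are illustrative assumptions; the patent only specifies accumulation followed by binarization.

```python
import numpy as np

def build_speech_heat_map(frame_masks, bin_thresh=0.5):
    """Accumulate per-frame binary text masks into a heat map, then
    binarize it so that only pixels that are textual in a large fraction
    of frames (the stable subtitle band) survive as closed regions."""
    heat = np.zeros_like(frame_masks[0], dtype=np.float64)
    for mask in frame_masks:      # each mask: text pixels = 1, others = 0
        heat += mask
    heat /= len(frame_masks)      # normalize to [0, 1]
    binary = (heat >= bin_thresh).astype(np.uint8)
    return heat, binary
```

Pixels that contain text only transiently (e.g. captions in one scene) fall below the threshold and disappear, which is what makes the resulting speech-line region adaptive to the particular video.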

In one embodiment, performing region screening on the non-intersecting regions according to a preset speech-line region constraint rule to obtain the speech-line region includes: judging whether the height of each non-intersecting region is smaller than a preset height threshold, and setting the region identifier of a non-intersecting region to a first identifier when its height is smaller than the preset height threshold; judging whether the width of each non-intersecting region is smaller than a first preset width threshold or larger than a second preset width threshold, the second preset width threshold being larger than the first preset width threshold, and setting the region identifier of a non-intersecting region to the first identifier when its width is smaller than the first preset width threshold or larger than the second preset width threshold; judging whether the starting pixel of each non-intersecting region lies to the right of a preset position, and setting the region identifier of a non-intersecting region to the first identifier when its starting pixel lies to the right of the preset position; setting the region identifiers of the remaining non-intersecting regions corresponding to the binary image to a second identifier, extracting the region features of the non-intersecting regions corresponding to the second identifier, and determining the format type of each such non-intersecting region according to its region features, the format types including left-aligned and center-symmetric; and, when the number of non-intersecting regions corresponding to the first identifier is 0, screening out the speech-line region according to the format types of the non-intersecting regions corresponding to the second identifier.
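The identifier assignment above is a sequence of simple geometric checks; a minimal sketch follows. The dict-based region representation and all four threshold parameters are placeholders for the preset values, which the patent leaves unspecified.

```python
def screen_regions(regions, min_h, min_w, max_w, max_start_col):
    """Assign identifier 1 (rejected) or 2 (candidate) to each region.

    Each region is a dict with 'top', 'bottom', 'left', 'right' pixel
    coordinates. A region too short, too narrow, too wide, or starting
    too far to the right fails the speech-line constraint rule.
    """
    for r in regions:
        height = r['bottom'] - r['top']
        width = r['right'] - r['left']
        if (height < min_h or width < min_w or width > max_w
                or r['left'] > max_start_col):
            r['id'] = 1   # fails a constraint: not a speech-line candidate
        else:
            r['id'] = 2   # passes all constraints: candidate region
    return regions
```

Only regions left with identifier 2 proceed to the format-type analysis (left-aligned vs. center-symmetric).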

In one embodiment, when the number of non-intersecting regions corresponding to the first identifier is 0, screening out the speech-line region according to the format types of the non-intersecting regions corresponding to the second identifier includes: when the speech-line format of a non-intersecting region corresponding to the second identifier is either left-aligned or center-symmetric, taking that non-intersecting region as a speech-line region; when the non-intersecting regions corresponding to the second identifier contain both regions belonging to either the left-aligned or center-symmetric format and regions belonging to neither format, setting the pixel values in the regions belonging to neither format to 0; and when none of the speech-line formats of the non-intersecting regions corresponding to the second identifier is left-aligned or center-symmetric and the number of such non-intersecting regions is greater than 1, calculating the regional variance of each non-intersecting region corresponding to the second identifier and taking the non-intersecting region with the maximum regional variance as the speech-line region.
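The variance fallback in the last clause might look like the following minimal sketch, assuming the candidate regions are NumPy pixel arrays. The "text regions have high contrast" rationale in the comment is my gloss on why variance is a sensible tiebreaker, not a claim from the patent.

```python
import numpy as np

def pick_by_variance(regions):
    """When no candidate matches a known speech-line format, fall back to
    the region with the largest pixel-value variance (subtitle text over
    video tends to produce high-contrast, high-variance pixels)."""
    variances = [float(np.var(r)) for r in regions]
    return regions[int(np.argmax(variances))]
```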

In one embodiment, extracting the region features of the non-intersecting regions corresponding to the second identifier and determining the format type of each non-intersecting region according to its region features includes: dividing the non-intersecting region corresponding to the second identifier equally into a plurality of sub-regions from left to right, and calculating a pixel mean value for each sub-region; when the pixel mean values decrease from left to right, setting the speech-line format of that non-intersecting region to left-aligned; otherwise, dividing the non-intersecting region from the middle to obtain a left region and a right region, dividing each of them equally into a plurality of sub-regions from left to right, and calculating the pixel mean value of each sub-region; when the pixel mean values of the sub-regions of the left region increase from left to right and those of the right region decrease from left to right, setting the speech-line format of that non-intersecting region to center-symmetric.
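The sub-region pixel-mean test above can be sketched as follows. For brevity this sketch uses per-column means in place of the equal sub-region means the text describes, and treats the trends as non-strict monotonicity; both simplifications are assumptions.

```python
import numpy as np

def classify_format(region):
    """Classify a heat-map region as 'left-aligned', 'center-symmetric',
    or 'other' from the left-to-right trend of its column pixel means."""
    col_means = region.mean(axis=0)   # per-column pixel means

    def non_increasing(v):
        return all(a >= b for a, b in zip(v, v[1:]))

    def non_decreasing(v):
        return all(a <= b for a, b in zip(v, v[1:]))

    # Left-aligned subtitles: density (hence heat) falls off to the right.
    if non_increasing(col_means):
        return 'left-aligned'
    # Center-symmetric subtitles: heat rises toward the middle, then falls.
    mid = len(col_means) // 2
    if non_decreasing(col_means[:mid]) and non_increasing(col_means[mid:]):
        return 'center-symmetric'
    return 'other'
```

Intuitively, in the accumulated heat map short subtitle lines end early, so a left-aligned track is densest at the left edge, while a centered track is densest in the middle.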

In one embodiment, the processor, when executing the computer program, further performs the steps of: when the number of non-intersecting regions corresponding to the first identifier is not 0, setting the pixel values in the non-intersecting regions corresponding to the first identifier to 0 to obtain a second heat map, and performing the binarization step and the subsequent steps on the second heat map until the region information of the speech-line region of the second heat map is obtained.

In one embodiment, the processor, when executing the computer program, further performs the steps of: dividing a crossing region in the heat map into an upper area and a lower area to obtain a first area and a second area; calculating the column mean values of the first area and the second area; counting the number of maxima in the column mean values of each area; when the numbers of maxima in the column mean values of the first area and the second area are both 1, setting the crossing type of the crossing region to vertical crossing; when the numbers of maxima are both greater than 1, setting the crossing type to left-right crossing; and performing region division on the crossing region according to a preset division rule to obtain non-intersecting regions, which includes: performing region division on the crossing region corresponding to the crossing type according to a preset division rule corresponding to the crossing type to obtain the corresponding non-intersecting regions.
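The crossing-type decision might be sketched as below. Splitting at the middle row and counting only interior local maxima are illustrative choices; the patent does not specify how the halves are formed or how maxima are detected.

```python
import numpy as np

def count_peaks(col_means):
    """Count interior local maxima in a 1-D sequence of column means."""
    return sum(
        1 for i in range(1, len(col_means) - 1)
        if col_means[i] > col_means[i - 1] and col_means[i] > col_means[i + 1]
    )

def classify_crossing(region):
    """Split a crossing region into upper/lower halves and infer the
    crossing type from the number of column-mean maxima in each half:
    one peak per half suggests a single column of text stacked vertically;
    several peaks per half suggest side-by-side text blocks."""
    mid = region.shape[0] // 2
    upper, lower = region[:mid], region[mid:]
    peaks_u = count_peaks(upper.mean(axis=0))
    peaks_l = count_peaks(lower.mean(axis=0))
    if peaks_u == 1 and peaks_l == 1:
        return 'vertical'
    if peaks_u > 1 and peaks_l > 1:
        return 'left-right'
    return 'unknown'
```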

In one embodiment, performing region division on the crossing region corresponding to the crossing type according to the preset division rule corresponding to the crossing type to obtain the corresponding non-intersecting regions includes: when the crossing type is vertical crossing, performing corner detection on the crossing region to obtain corner information for each corner, classifying the corners according to the row coordinates in the corner information to obtain a category identifier for each corner, constructing a rectangular region from the corners with the same category identifier, and taking the constructed rectangular region as a non-intersecting region; and when the crossing type is left-right crossing, calculating the column mean values of the crossing region, and splitting the region into left and right parts at the column where the minimum value between the maxima of the column mean values is located, to obtain the corresponding non-intersecting regions.
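The left-right split can be sketched as follows: find the column-mean maxima, then cut at the column holding the minimum between the outermost maxima. Restricting peaks to interior points and cutting at a single column are assumptions of this sketch, and it presumes at least two maxima exist (as the left-right crossing type implies).

```python
import numpy as np

def split_left_right(region):
    """Split a left-right crossing region at the column holding the
    minimum column mean between the outermost column-mean maxima."""
    col_means = region.mean(axis=0)
    # locate interior local maxima of the column means
    peaks = [i for i in range(1, len(col_means) - 1)
             if col_means[i] > col_means[i - 1] and col_means[i] > col_means[i + 1]]
    lo, hi = peaks[0], peaks[-1]
    # the valley between the two text blocks is the natural cut point
    split_col = lo + int(np.argmin(col_means[lo:hi + 1]))
    return region[:, :split_col], region[:, split_col:]
```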

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring a current frame in a current video, and performing text detection on the current frame to obtain at least one text area of the current frame; acquiring a speech-line region of a current video, wherein the speech-line region of the current video is determined according to text regions of a plurality of video sequence frames of the current video; acquiring region information of an overlapping region of each text region and a speech-line region, wherein the region information comprises height; respectively calculating the ratio of the height of each overlapped region to the height of the corresponding text region and the corresponding line region to obtain a first ratio and a second ratio; and when any ratio of the first ratio and the second ratio is larger than a first preset threshold, taking the region information of the text region as the region information of the current speech-line region.

In one embodiment, the computer program, when executed by the processor, further performs the steps of: acquiring a plurality of original video sequence frames corresponding to the current video, and performing text detection on each original video sequence frame to obtain a corresponding video sequence frame, wherein the pixel values of the text regions and the non-text regions of each video sequence frame are 1 and 0, respectively; accumulating the video sequence frames to obtain a heat map; binarizing the heat map to obtain a binary image, the binary image comprising at least one closed region whose pixel value is 1, each closed region being a crossing region or a non-intersecting region; when a closed region contains a crossing region, performing region division on the crossing region according to a preset division rule until every closed region is a non-intersecting region; and performing region screening on the non-intersecting regions according to a preset speech-line region constraint rule to obtain the speech-line region.

In one embodiment, performing region screening on the non-intersecting regions according to a preset speech-line region constraint rule to obtain the speech-line region includes: judging whether the height of each non-intersecting region is smaller than a preset height threshold, and setting the region identifier of a non-intersecting region to a first identifier when its height is smaller than the preset height threshold; judging whether the width of each non-intersecting region is smaller than a first preset width threshold or larger than a second preset width threshold, the second preset width threshold being larger than the first preset width threshold, and setting the region identifier of a non-intersecting region to the first identifier when its width is smaller than the first preset width threshold or larger than the second preset width threshold; judging whether the starting pixel of each non-intersecting region lies to the right of a preset position, and setting the region identifier of a non-intersecting region to the first identifier when its starting pixel lies to the right of the preset position; setting the region identifiers of the remaining non-intersecting regions corresponding to the binary image to a second identifier, extracting the region features of the non-intersecting regions corresponding to the second identifier, and determining the format type of each such non-intersecting region according to its region features, the format types including left-aligned and center-symmetric; and, when the number of non-intersecting regions corresponding to the first identifier is 0, screening out the speech-line region according to the format types of the non-intersecting regions corresponding to the second identifier.

In one embodiment, when the number of non-intersecting regions corresponding to the first identifier is 0, screening out the speech-line region according to the format types of the non-intersecting regions corresponding to the second identifier includes: when the speech-line format of a non-intersecting region corresponding to the second identifier is either left-aligned or center-symmetric, taking that non-intersecting region as a speech-line region; when the non-intersecting regions corresponding to the second identifier contain both regions belonging to either the left-aligned or center-symmetric format and regions belonging to neither format, setting the pixel values in the regions belonging to neither format to 0; and when none of the speech-line formats of the non-intersecting regions corresponding to the second identifier is left-aligned or center-symmetric and the number of such non-intersecting regions is greater than 1, calculating the regional variance of each non-intersecting region corresponding to the second identifier and taking the non-intersecting region with the maximum regional variance as the speech-line region.

In one embodiment, extracting the region features of the non-intersecting regions corresponding to the second identifier and determining the format type of each non-intersecting region according to its region features includes: dividing the non-intersecting region corresponding to the second identifier equally into a plurality of sub-regions from left to right, and calculating a pixel mean value for each sub-region; when the pixel mean values decrease from left to right, setting the speech-line format of that non-intersecting region to left-aligned; otherwise, dividing the non-intersecting region from the middle to obtain a left region and a right region, dividing each of them equally into a plurality of sub-regions from left to right, and calculating the pixel mean value of each sub-region; when the pixel mean values of the sub-regions of the left region increase from left to right and those of the right region decrease from left to right, setting the speech-line format of that non-intersecting region to center-symmetric.

In one embodiment, the computer program, when executed by the processor, further performs the steps of: when the number of non-intersecting regions corresponding to the first identifier is not 0, setting the pixel values in the non-intersecting regions corresponding to the first identifier to 0 to obtain a second heat map, and performing the binarization step and the subsequent steps on the second heat map until the region information of the speech-line region of the second heat map is obtained.

In one embodiment, the computer program, when executed by the processor, further performs the steps of: dividing a crossing region in the heat map into an upper area and a lower area to obtain a first area and a second area; calculating the column mean values of the first area and the second area; counting the number of maxima in the column mean values of each area; when the numbers of maxima in the column mean values of the first area and the second area are both 1, setting the crossing type of the crossing region to vertical crossing; when the numbers of maxima are both greater than 1, setting the crossing type to left-right crossing; and performing region division on the crossing region according to a preset division rule to obtain non-intersecting regions, which includes: performing region division on the crossing region corresponding to the crossing type according to a preset division rule corresponding to the crossing type to obtain the corresponding non-intersecting regions.

In one embodiment, performing region division on the crossing region corresponding to the crossing type according to the preset division rule corresponding to the crossing type to obtain the corresponding non-intersecting regions includes: when the crossing type is vertical crossing, performing corner detection on the crossing region to obtain corner information for each corner, classifying the corners according to the row coordinates in the corner information to obtain a category identifier for each corner, constructing a rectangular region from the corners with the same category identifier, and taking the constructed rectangular region as a non-intersecting region; and when the crossing type is left-right crossing, calculating the column mean values of the crossing region, and splitting the region into left and right parts at the column where the minimum value between the maxima of the column mean values is located, to obtain the corresponding non-intersecting regions.

Those skilled in the art will appreciate that all or a portion of the processes in the methods of the embodiments described above may be implemented by computer programs stored in a non-volatile computer-readable storage medium which, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).

It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
