Automatic layout document label generation method

Document No.: 169317 | Publication date: 2021-10-29 | Language: Chinese

Reading note: This technique, an automatic layout document label generation method, was designed and created by 黄鑫玮, 龚泽挚, 应翔 and 寇喜超 on 2021-07-19. Abstract: The invention discloses an automatic layout document annotation generation method aimed at document-image analysis and recognition tasks; it can quickly generate large numbers of complex document images with detailed and accurate annotation information, supporting the development of related algorithms. The invention provides a solution for automatic layout document annotation generation that avoids the tedium and error-proneness of manual annotation while supplying detailed and accurate annotations; in addition, the invention can synthesize document images in multiple languages, greatly enriching the datasets related to document recognition and analysis and providing unified data support for the development of multilingual algorithms.

1. An automatic layout document label generation method comprises the following steps:

(1) randomly generating a page layout template of the document according to a default configuration file and user input information;

(2) rendering different areas in the page according to the page layout template and recording the related annotation information, wherein the types of area rendering comprise background rendering, text rendering, image and graph rendering, table rendering and formula rendering; the annotation information comprises the bounding boxes of all logical areas in the page and the string information inside them, the bounding boxes of the logical areas comprising a text bounding box, a formula bounding box, a table bounding box, an image bounding box and a graph bounding box, and the string information comprising text strings and LaTeX formula codes;

(3) adding noise and deformation to the document image, updating the annotation information accordingly, and saving two parts: one part being the original document image and its annotation information, the other being the document image with added noise and deformation and its annotation information;

(4) repeating the above steps until the number of generated document images reaches a set value, so as to form a document image training data set;

(5) merging the annotation information in the data set according to annotation type and uniformly saving it as annotation files in specific formats.

2. The automatic layout document label generation method of claim 1, wherein: the specific process of randomly generating the document page layout template in step (1) is as follows: first, boundary text areas such as headers and footers are allocated; next, the numbers of columns, titles and formulas of the page are randomly generated within set ranges, and the position and size of each title and formula in the page are determined cyclically in sequence; the remaining page is then randomly divided into several areas according to the already determined areas and the column count, and each area is randomly assigned a category using the preset text, image, graph and table proportions as probabilities, so that the document content better matches real documents while retaining a degree of randomness; when the generated document image is in Chinese, Japanese or a similar language, a vertical title area is split off from a text box area meeting specific conditions; finally, other attributes such as line spacing and reading direction are added to the areas of each category.

3. The automatic layout document label generation method of claim 1, wherein: the background rendering randomly selects a background image from a background library to serve as the overall background of the document page; the background library is composed of images that are nearly solid-colored or are simple combinations of a few colors, and when a random background is not used the background defaults to white.

4. The automatic layout document label generation method of claim 1, wherein: the text rendering is divided into paragraph rendering and title rendering, and the specific implementation process is as follows:

a. randomly selecting a font from a font library according to the language and the text type and setting a certain size;

b. selecting a text file to be sampled from a text corpus;

c. if the paragraph is rendered, random paragraph allocation is carried out according to the size of the whole text area, and the allocation information comprises the number of paragraphs and the number of lines in each paragraph; if the title is rendered, only one paragraph exists and the number of text lines is determined by the size of the area and the limit value of the line number;

sampling the text to be placed from the text file, with a line as the basic unit, according to the allocated paragraph information, while ensuring that the number of sampled characters is greater than or equal to the minimum character count N, so that no blank appears in the area and sampling efficiency is ensured;

d. if the color difference value T between the foreground and the background is set to be larger than 0, selecting a background image as the background of the text area, simultaneously selecting a color from a color list as the color of the text, and enabling the color difference value D between the color and the background image to be larger than or equal to T; otherwise, the text is defaulted to black;

e. if a paragraph is rendered, adding a specified number of spaces at the head of the paragraph to imitate paragraph indentation; if a title is rendered, prepending, with a certain probability, format text such as "Figure" or "Table" according to the type;

f. and rendering each character according to the reading direction and the text sequence to form a text segment.

5. The automatic layout document label generation method of claim 4, wherein: the number of paragraphs and the number of lines in each paragraph during the paragraph rendering in step c are calculated by the following formulas;

rh = rh - ((nLine - 1) × (cSize + lSpace) + cSize + pSpace)

nLine = randint(Lmin, Lmax)

Lmax = (rh - cSize) / (cSize + lSpace) + 1

wherein: when the reading direction is horizontal, rh is the remaining height of the text region; when the reading direction is vertical, rh is the remaining width of the text region; nLine is the number of text lines in a paragraph, cSize is the font size, lSpace is the line spacing, pSpace is the paragraph spacing, Lmax is the maximum number of lines that can be allocated to the current paragraph, Lmin is the minimum number of lines in a paragraph, and randint() is a random function; the initial rh is the height of the text region, and the formulas are iterated, each iteration yielding a new rh, until Lmax is smaller than Lmin; Lmax is then recalculated from the final rh, and if the current Lmax is an integer greater than 0, nLine is recalculated from this Lmax for one further paragraph, thereby completing the filling of the area and determining the number of paragraphs and the number of lines in each paragraph;

the number of text lines in the title rendering process is determined by the following formula;

mLine = min(randint(1, Hmax), lCap)

lCap = (Rh - cSize) / (cSize + lSpace) + 1

wherein: when the reading direction is horizontal, Rh is the height of the entire area; when the reading direction is vertical, Rh is the width of the entire area; mLine is the number of text lines of the title, Hmax is a set parameter value representing the maximum number of lines of a title, and lCap is the maximum number of lines the area can accommodate.

6. The automatic layout document label generation method of claim 4, wherein: the minimum number of characters N in the step c is calculated and determined by the following formula;

N = W / cSize × mLine

wherein: w is the width of the text region, cSize is the font size, and mLine is the number of text lines within the text region.

7. The automatic layout document label generation method of claim 1, wherein: the specific implementation mode of the image and graphic rendering is as follows: firstly, dividing a filling area into a title area and a picture area, and performing text rendering in the title area; then randomly selecting a picture from the image or the graph library, and ensuring that the size of the picture meets the following standard so as to prevent the picture from generating serious distortion; finally, fusing the selected image to a picture area in the page by using Poisson fusion;

thresh1 < wp / wr < thresh2

thresh1 < hp / hr < thresh2

wherein: wp and hp are respectively the width and height of the selected picture, wr and hr are respectively the width and height of the picture area, thresh1 is a threshold greater than 0.5 and less than 1, and thresh2 is a threshold greater than 1 and less than 1.5;

if the currently selected picture does not meet the standard, deleting the picture from the image or the graphic library, and re-randomly selecting the picture until the picture meeting the standard is found or the maximum selection times are reached; when the maximum selection times are reached and no picture meeting the standard is found, selecting a most appropriate picture from all the traversed pictures, wherein the picture is determined by the following formula;

dis=|wp/wr-1|+λ|hp/hr-1|

wherein: dis is the distance between the picture size and the area size, the minimum distance indicates that the corresponding picture is most suitable for filling the picture area, and λ is a weight parameter.

8. The automatic layout document label generation method of claim 1, wherein: the table rendering comprises two cases, table generation and table image fusion; table image fusion is used to enrich table styles and is implemented in the same way as image and graph rendering; table generation is realized as follows: the table area is first uniformly divided according to the font size and the user-defined cell spacing to obtain the initial numbers of table rows and columns, the column count is then reduced by a random proportion while ensuring that the reduced count remains greater than or equal to 2; the table is then split longitudinally and transversely according to the column and row counts, and if the number of rows or columns is greater than 2, two adjacent rows or columns are randomly merged.

9. The automatic layout document label generation method of claim 1, wherein: the formula rendering is substantially the same as the image and graph rendering, except that the formula image is generated automatically from a randomly obtained LaTeX formula code; the specific process is as follows: first, a formula code is randomly selected from a LaTeX source-code library; a canvas of fixed size and a font of fixed size are then set; the formula is rendered centered in the canvas; finally, the canvas is cropped to the rendered text range.

10. The automatic layout document label generation method of claim 1, wherein: the annotation types in step (5) are divided into two kinds, one for object detection tasks and one for recognition tasks; the annotation information for the object detection task comprises bounding-box coordinates and the categories of the objects inside the boxes, and is saved in the two common formats of XML and JSON to match the current mainstream object detection annotation formats, so that the generated data can be conveniently applied to actual projects; the annotation information for the recognition task is mainly the string information inside the bounding boxes and is saved as text files.

Technical Field

The invention belongs to the technical field of image synthesis, and particularly relates to an automatic layout document label generation method.

Background

The digitization of physical documents relies on a series of image analysis and recognition techniques applied on document images, and the purpose of the digitization is to achieve the digital description of document contents, which mainly comprises texts, images, tables, formulas, graphs and logical relations among all parts.

With the rapid development of deep learning, more and more traditional techniques are being replaced by neural-network-based algorithms, and deep-learning-based document digitization algorithms have sprung up like bamboo shoots after rain, showing quite good results. Unlike traditional algorithm development based on hand-crafted features, automatic feature-learning algorithms based on deep learning usually require large amounts of labeled data for model fitting. At present, however, most existing document-image datasets were built for the analysis of English documents, and because annotation information is difficult to obtain, these datasets often suffer from inaccurate annotations, a single annotation category, small data size, and similar shortcomings, which greatly limits their usefulness.

The technology for automatically generating data is an important means for making up for the shortage of training data in algorithm development, a typical example is the technology for automatically synthesizing scene text detection and recognition data, and the technology greatly promotes the development progress of text detection and recognition algorithms based on deep learning.

The training data generation method and system described in Chinese patent application No. 202011378838.6 is a data generation technique for document images, but it only describes, in general terms, the module structure of the designed document generation system and the relationships between its modules; it does not cover a concrete image generation method, such as how text and picture materials are placed or how the format library and image element library are obtained; moreover, its synthesized documents lack formulas, an important document object category.

Chinese patent application No. 202110084710.7 proposes a machine-learning-based training data generation system and method for general OCR in the field of character recognition; it synthesizes vertical and horizontal text using different corpora, background images, fonts and colors, and adds various noises and deformations to provide rich OCR training data. However, the number of characters rendered in its synthesized images is far smaller than the number of characters in a document text line, so the images can only be applied to scene character recognition, and the scope of application is therefore quite limited.

Disclosure of Invention

In view of the above, the present invention provides an automatic layout document annotation generation method, which can rapidly generate a large number of document images for the tasks of analyzing and identifying the document images, and provides support for the development of related algorithms with detailed and accurate annotation information.

An automatic layout document label generation method comprises the following steps:

(1) randomly generating a page layout template of the document according to a default configuration file and user input information;

the configuration file is a set of parameters and comprises the number range of formulas and titles in the page, the proportion of different types of areas, the language of the text, a color list, the color difference value of the background and the foreground, and the width and the height of the document; the user input information is used for updating and setting the parameters; the rich layout parameter information can provide a solid foundation for the layout diversity and the content richness of the document image.

The page layout template is expressed as a plurality of rectangular frames distributed in a page and area attributes in the rectangular frames, the area attributes comprise content types and content attributes, the content types comprise texts, titles, formulas, images and tables, and the content attributes comprise line spacing, paragraph spacing, languages, area names, area upper left corner coordinates, area width and height, boundary blank width and reading direction; the reading direction refers to the direction of correctly reading the text in a certain area, and comprises vertical reading and horizontal reading so as to realize different reading forms of the text such as Chinese, Japanese and the like; the page layout template basically covers all object types and attributes thereof in the existing document, thereby ensuring the universality of the document layout.
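As an illustration, the region attributes listed above can be collected into a simple data structure. This is only a sketch: the field names, defaults and units are assumptions for exposition, not the patent's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """One rectangular frame of the page layout template."""
    category: str                 # "text", "title", "formula", "image", "graph" or "table"
    x: int                        # upper-left corner coordinates, in pixels
    y: int
    width: int
    height: int
    language: str = "en"
    reading: str = "horizontal"   # "vertical" for CJK vertical titles
    line_spacing: int = 6         # illustrative defaults
    para_spacing: int = 12
    margin: int = 8               # boundary blank width
```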

(2) Rendering different areas in the page according to the page layout template and recording the related annotation information, wherein the types of area rendering comprise background rendering, text rendering, image and graph rendering, table rendering and formula rendering; the annotation information comprises the bounding boxes of all logical areas in the page and the string information inside them, the bounding boxes of the logical areas comprising a text bounding box, a formula bounding box, a table bounding box, an image bounding box and a graph bounding box, and the string information comprising text strings and LaTeX formula codes;

the text surrounding box is divided into a character surrounding box, a word surrounding box, a text line surrounding box and a paragraph surrounding box so as to describe text information in detail; the text character string is composed of two-dimensional word character strings, the first dimension is a plurality of text line records, the second dimension is a word character string list in a single text line, text line character string integration and single character splitting can be conveniently carried out, text pictures corresponding to words and text lines can also be conveniently obtained through frame coordinates, and data enhancement operations such as perspective transformation, distortion transformation, noise addition and the like are carried out, so that support is provided for text recognition training.

(3) Adding noise and deformation to the document image, updating the annotation information accordingly, and saving two parts: one part being the original document image and its annotation information, the other being the document image with added noise and deformation and its annotation information;

the added noise is mainly Gaussian noise and salt and pepper noise, and the added deformation comprises perspective transformation and distortion transformation, wherein the distortion transformation is to realize the bending effect of the image by utilizing a curve model, and the curve model comprises a cubic curve, a sine curve and the like and is used for simulating the curved surface deformation of books and magazines.

(4) Repeating the above steps until the number of generated document images reaches a set value, so as to form a document image training data set;

(5) Merging the annotation information in the data set according to annotation type and uniformly saving it as annotation files in specific formats.

Further, the specific process of randomly generating the document page layout template in step (1) is as follows: first, boundary text areas such as headers and footers are allocated; next, the numbers of columns, titles and formulas of the page are randomly generated within set ranges, and the position and size of each title and formula in the page are determined cyclically in sequence; the remaining page is then randomly divided into several areas according to the already determined areas and the column count, and each area is randomly assigned a category using the preset text, image, graph and table proportions as probabilities, so that the document content better matches real documents while retaining a degree of randomness; when the generated document image is in Chinese, Japanese or a similar language, a vertical title area is split off from a text box area meeting specific conditions; finally, other attributes such as line spacing and reading direction are added to the areas of each category.
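The layout-division process above can be sketched roughly as follows. This is a heavily simplified illustration (fixed footer height, no title or formula placement, illustrative size constants), not the patent's actual algorithm.

```python
import random

def divide_page(width, height, rng=None):
    """Reserve a footer band, pick a column count, then stack randomly
    sized regions in each column. Returns (x, y, w, h) rectangles."""
    rng = rng or random.Random(0)
    footer_h = 40                                 # illustrative constant
    body_h = height - footer_h
    n_cols = rng.randint(1, 3)
    col_w = width // n_cols
    regions = [(0, height - footer_h, width, footer_h)]   # footer area
    for c in range(n_cols):
        y = 0
        while body_h - y >= 80:                   # minimum region height
            h = rng.randint(80, max(81, (body_h - y) // 2 + 80))
            h = min(h, body_h - y)
            regions.append((c * col_w, y, col_w, h))
            y += h
    return regions
```

Each returned rectangle would then be assigned a category (text, image, graph, table) and the remaining attributes described above.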

Further, the background rendering randomly selects a background image from a background library to serve as the overall background of the document page; the background library is composed of images that are nearly solid-colored or are simple combinations of a few colors, and when a random background is not used the background defaults to white.

Further, the text rendering is divided into paragraph rendering and title rendering, and the specific implementation process is as follows:

a. randomly selecting a font from a font library according to the language and the text type and setting a certain size;

the text type comprises a title and a paragraph, so that different text forms are distinguished, and other text attributes such as a font, a size and the like are determined; the font library contains font files in different languages and different styles and is sorted by language and style for selection.

b. Selecting a text file to be sampled from a text corpus;

the text corpus comprises text data in different fields such as novel, news, literature, encyclopedia and the like, and is divided into English, Japanese, Chinese and the like for storage in different languages. The text sources in various forms ensure the diversity of text regions and provide rich and reliable data support for document text detection and identification.

c. If the paragraph is rendered, random paragraph allocation is carried out according to the size of the whole text area, and the allocation information comprises the number of paragraphs and the number of lines in each paragraph; if the title is rendered, only one paragraph exists and the number of text lines is determined by the size of the area and the limit value of the line number;

sampling the text to be placed from the text file, with a line as the basic unit, according to the allocated paragraph information, while ensuring that the number of sampled characters is greater than or equal to the minimum character count N, so that no blank appears in the area and sampling efficiency is ensured;
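A hedged sketch of this line-based sampling: draw a chunk from a corpus string at a random offset, retry until at least N characters are obtained, then split the chunk into lines. Real line widths would be computed from font metrics; the fixed `chars_per_line` here is a simplifying assumption.

```python
import random

def sample_lines(text, n_lines, chars_per_line, n_min, rng=None):
    """Sample up to n_lines fixed-width lines from `text`, retrying from
    new random offsets until at least n_min characters are drawn, so the
    region is filled without blanks."""
    rng = rng or random.Random(0)
    need = max(n_min, 1)
    for _ in range(10):                          # bounded number of retries
        start = rng.randrange(max(1, len(text) - need))
        chunk = text[start:start + n_lines * chars_per_line]
        if len(chunk) >= need:
            # split the sampled chunk into "lines" for placement
            return [chunk[i:i + chars_per_line]
                    for i in range(0, len(chunk), chars_per_line)][:n_lines]
    return []
```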

d. if the color difference value T between the foreground and the background is set to be larger than 0, selecting a background image as the background of the text area, simultaneously selecting a color from a color list as the color of the text, and enabling the color difference value D between the color and the background image to be larger than or equal to T; otherwise, the text is defaulted to black;

e. if a paragraph is rendered, adding a specified number of spaces at the head of the paragraph to imitate paragraph indentation; if a title is rendered, prepending, with a certain probability, format text such as "Figure" or "Table" according to the type;

f. and rendering each character according to the reading direction and the text sequence to form a text segment.

Further, the number of paragraphs and the number of lines in each paragraph during the paragraph rendering in step c are calculated by the following formulas;

rh = rh - ((nLine - 1) × (cSize + lSpace) + cSize + pSpace)

nLine = randint(Lmin, Lmax)

Lmax = (rh - cSize) / (cSize + lSpace) + 1

wherein: when the reading direction is horizontal, rh is the remaining height of the text region; when the reading direction is vertical, rh is the remaining width of the text region; nLine is the number of text lines in a paragraph, cSize is the font size, lSpace is the line spacing, pSpace is the paragraph spacing, Lmax is the maximum number of lines that can be allocated to the current paragraph, Lmin is the minimum number of lines in a paragraph, and randint() is a random function; the initial rh is the height of the text region, and the formulas are iterated, each iteration yielding a new rh, until Lmax is smaller than Lmin; Lmax is then recalculated from the final rh, and if the current Lmax is an integer greater than 0, nLine is recalculated from this Lmax for one further paragraph, thereby completing the filling of the area and determining the number of paragraphs and the number of lines in each paragraph.
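The iterative allocation described by the three formulas can be sketched as follows, assuming floor division for Lmax; parameter names mirror the formulas, and the treatment of the final short paragraph follows the recalculation step described above.

```python
import random

def allocate_paragraphs(region_h, c_size, l_space, p_space, l_min=2, rng=None):
    """Split a text region of height region_h (or width, for vertical
    reading) into paragraphs; returns the line count of each paragraph."""
    rng = rng or random.Random(0)
    lines = []
    rh = region_h
    while True:
        # Lmax = (rh - cSize) / (cSize + lSpace) + 1 (floor division assumed)
        l_max = (rh - c_size) // (c_size + l_space) + 1
        if l_max < l_min:
            break
        n_line = rng.randint(l_min, l_max)       # nLine = randint(Lmin, Lmax)
        lines.append(n_line)
        # rh = rh - ((nLine - 1)(cSize + lSpace) + cSize + pSpace)
        rh -= (n_line - 1) * (c_size + l_space) + c_size + p_space
    # if the leftover space still fits at least one line, add one last
    # short paragraph of Lmax lines to fill the region
    if l_max > 0:
        lines.append(l_max)
    return lines
```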

The number of text lines in the title rendering process is determined by the following formula;

mLine = min(randint(1, Hmax), lCap)

lCap = (Rh - cSize) / (cSize + lSpace) + 1

wherein: when the reading direction is horizontal, Rh is the height of the entire area; when the reading direction is vertical, Rh is the width of the entire area; mLine is the number of text lines in the text area, Hmax is a set parameter value representing the maximum number of lines of a title, and lCap is the maximum number of lines the area can accommodate.

The paragraph division and line-count determination combine parameters such as font size, line spacing, paragraph spacing and reading direction with the spatial size of the text region, which not only achieves diverse division and filling of text regions but also ensures the completeness and efficiency of text filling.

Further, the minimum number of characters N in the step c is determined by the following formula;

N = W / cSize × mLine

wherein: w is the width of the text region, cSize is the font size, and mLine is the number of text lines within the text region.

Further, the specific implementation manner of the image and graphic rendering is as follows: firstly, dividing a filling area into a title area and a picture area, and performing text rendering in the title area; then randomly selecting a picture from the image or the graph library, and ensuring that the size of the picture meets the following standard so as to prevent the picture from generating serious distortion; finally, fusing the selected image to a picture area in the page by using Poisson fusion;

thresh1 < wp / wr < thresh2

thresh1 < hp / hr < thresh2

wherein: wp and hp are respectively the width and height of the selected picture, wr and hr are respectively the width and height of the picture area, thresh1 is a threshold greater than 0.5 and less than 1, and thresh2 is a threshold greater than 1 and less than 1.5;

to find a suitable picture while maintaining efficiency, if the currently selected picture does not meet the standard it is removed from the image or graph library and another is selected at random, until a conforming picture is found or the maximum number of selections is reached; when the maximum is reached without finding a conforming picture, the most suitable picture among all those examined is chosen, as determined by the following formula;

dis=|wp/wr-1|+λ|hp/hr-1|

wherein: dis is the distance between the picture size and the region size; the picture with the minimum distance is the most suitable for filling the picture region, and λ is a weight parameter that adjusts the relative influence of width and height on dis. Using the ratios of the picture's side lengths to those of the region to be filled as the evaluation standard ensures that the selected image's width and height are as close as possible to those of the region, reducing image distortion, which matters especially when the filled pictures are table or formula pictures.
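A sketch of this picture-selection procedure: accept the first candidate whose width and height ratios both fall inside the thresholds, otherwise fall back to the candidate minimising dis. The threshold defaults and the library-as-list-of-sizes representation are assumptions for illustration.

```python
import random

def pick_picture(region_wh, library, t1=0.8, t2=1.25, max_tries=10,
                 lam=1.0, rng=None):
    """Pick a (width, height) from `library` whose ratios to the region
    lie in (t1, t2); if none does within max_tries, return the candidate
    minimising dis = |wp/wr - 1| + lam * |hp/hr - 1|."""
    rng = rng or random.Random(0)
    wr, hr = region_wh
    pool = list(library)
    seen = []
    for _ in range(min(max_tries, len(pool))):
        wp, hp = pool.pop(rng.randrange(len(pool)))   # remove tried candidates
        seen.append((wp, hp))
        if t1 < wp / wr < t2 and t1 < hp / hr < t2:
            return (wp, hp)
    # no candidate met the thresholds: take the closest by dis
    return min(seen, key=lambda s: abs(s[0] / wr - 1) + lam * abs(s[1] / hr - 1))
```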

Further, table rendering comprises two cases: table generation and table image fusion. Table image fusion is used to enrich table styles and is implemented in the same way as image and graph rendering. Table generation is realized as follows: the table area is first uniformly divided according to the font size and the user-defined cell spacing to obtain the initial numbers of table rows and columns; the column count is then reduced by a random proportion while ensuring that the reduced count remains greater than or equal to 2; the table is then split longitudinally and transversely according to the column and row counts, and if the number of rows or columns is greater than 2, two adjacent rows or columns are randomly merged.
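The grid part of table generation can be sketched as below; the shrink range and merge probability are illustrative assumptions, and rendering of the actual cell borders is omitted.

```python
import random

def make_table_grid(w, h, c_size, cell_gap, rng=None):
    """Divide a table region into rows/columns by font size plus cell
    spacing, randomly shrink the column count (keeping >= 2), and decide
    whether to merge two adjacent rows or columns."""
    rng = rng or random.Random(0)
    rows = max(2, h // (c_size + cell_gap))
    cols = max(2, w // (c_size + cell_gap))
    cols = max(2, int(cols * rng.uniform(0.4, 1.0)))   # random shrink, >= 2 kept
    merge_cols = cols > 2 and rng.random() < 0.5       # fuse two adjacent columns
    merge_rows = rows > 2 and rng.random() < 0.5       # fuse two adjacent rows
    return rows, cols, merge_rows, merge_cols
```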

Further, formula rendering is implemented substantially the same way as image and graph rendering, except that the formula image is generated automatically from a randomly obtained LaTeX formula code. The specific process is as follows: first, a formula code is randomly selected from a LaTeX source-code library; a canvas of fixed size and a font of fixed size are then set; the formula is rendered centered in the canvas; finally, the canvas is cropped to the rendered text range. Formula pictures obtained in this way not only make the formula's bounding box more accurate but also provide the formula's LaTeX source-code annotation for formula parsing.
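The final cropping step ("cutting the canvas according to the rendered text range") can be illustrated on a canvas stored as a 2D list of grayscale values; the actual LaTeX rasterisation is outside this sketch.

```python
def crop_to_content(canvas, bg=255):
    """Crop a rendered canvas to the tight bounding box of its
    non-background pixels, which makes the formula's bounding-box
    annotation exact. Returns (cropped_canvas, (x0, y0, x1, y1))."""
    ys = [y for y, row in enumerate(canvas) if any(v != bg for v in row)]
    xs = [x for x in range(len(canvas[0]))
          if any(row[x] != bg for row in canvas)]
    if not ys:                       # empty canvas: nothing was rendered
        return [], (0, 0, 0, 0)
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    cropped = [row[x0:x1 + 1] for row in canvas[y0:y1 + 1]]
    return cropped, (x0, y0, x1, y1)
```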

Further, the annotation types in step (5) are divided into two kinds: one for object detection tasks and one for recognition tasks. The annotation information for the object detection task comprises bounding-box coordinates and the categories of the objects inside the boxes, and is saved in the two common formats of XML and JSON to match the current mainstream object detection annotation formats, so that the generated data can be conveniently applied to actual projects; the annotation information for the recognition task is mainly the string information inside the bounding boxes and is saved as text files.
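A minimal sketch of saving detection annotations as JSON; the exact field names are assumptions, since the text only specifies that bounding-box coordinates and object categories are stored in XML and JSON.

```python
import json

def export_detection_labels(regions, path=None):
    """Serialise bounding boxes and categories to a JSON string; an XML
    writer would mirror the same fields. `regions` is a list of dicts
    with keys x, y, w, h, category (hypothetical field names)."""
    payload = {"annotations": [
        {"bbox": [r["x"], r["y"], r["w"], r["h"]], "category": r["category"]}
        for r in regions]}
    text = json.dumps(payload, indent=2)
    if path:                                   # optionally write to disk
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
    return text
```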

The invention provides a solution for automatic layout document annotation generation that avoids the tedium and error-proneness of manual annotation while supplying detailed and accurate annotation information; in addition, the invention can synthesize document images in multiple languages, greatly enriching the datasets related to document recognition and analysis and providing unified data support for the development of multilingual algorithms.

Drawings

FIG. 1 is a schematic overall flow chart of the method of the present invention.

FIG. 2 is a schematic diagram of the layout division process of the method of the present invention.

FIG. 3 is a schematic view of a region rendering process of the method of the present invention.

FIG. 4 is an example of a page layout template generated by the present invention.

FIG. 5 is an example of a document image generated after rendering according to the present invention.

FIG. 6 is an example of generating a visualization image with annotation information according to the present invention.

FIG. 7 is an example of a deformed document image generated by the present invention.

Fig. 8 is an example of bounding box annotation information in JSON format.

Fig. 9 is an example of bounding box annotation information in XML format.

Detailed Description

In order to describe the present invention more specifically, the technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

As shown in fig. 1, the method for generating an automatic layout document label of the present invention comprises the following steps:

(1) Randomly generate document page layout information according to the configuration file and user input information.

The specific parameter settings in this embodiment are as follows: the number of formulas and headings per page ranges from 1 to 3; the appearance ratio of the different region categories (text, image, graphic, table) is 4:2:1:2; the text language is random; the color difference between background and foreground is 0; and the document width and height are 960 and 1280, respectively. Fig. 2 shows the page layout division process, whose main purpose is to divide the page and determine the attributes of each region.

Fig. 4 shows a two-column layout division result generated under the above parameters, containing tables, body text, headings, formulas, and borderless text regions such as headers and footers; since the text language is Japanese, a vertically read heading is present. In the figure, the boundary text region at the footer position is determined first, then the positions of 1 horizontal heading and 1 formula are determined, after which the remaining space is divided into 5 rectangular regions and the category of each region is determined according to the 4:2:1:2 ratio of text, image, graphic, and table, so the probability of text is 4/(4+2+1+2) = 4/9. Since the text language randomly chosen for this synthesis is Japanese, a vertical heading box is split off from a text box of suitable length; finally the region attributes of all regions are set, where the reading direction attribute of that vertical heading box is vertical and the other attributes are defaults.
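The weighted category draw described above can be sketched as follows (the function and dictionary names are my own; only the 4:2:1:2 weights come from the embodiment):

```python
import random

# appearance weights from the embodiment: text : image : graphic : table = 4:2:1:2
CATEGORY_WEIGHTS = {"text": 4, "image": 2, "graphic": 1, "table": 2}

def assign_categories(n_regions, seed=None):
    """Draw a category for each of the rectangular regions remaining after the
    boundary text, heading and formula positions have been fixed."""
    rng = random.Random(seed)
    cats, weights = zip(*CATEGORY_WEIGHTS.items())
    return [rng.choices(cats, weights=weights)[0] for _ in range(n_regions)]
```

With these weights, each region is text with probability 4/9, matching the 4/(4+2+1+2) computation above.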

(2) Render the different regions in the page and record the related annotation information according to the page layout information.

Fig. 3 shows the overall region rendering process, which is divided into picture rendering and text rendering according to the material; a picture may be generated, or cropped from a public data set or from other document images. In the present invention, formulas and tables are generated, because they contain important information that needs to be recognized when the document is analyzed, and generation allows this information to be recorded. Graphics are mainly obtained by cropping public data sets such as POD and PubLayNet, and images are mainly obtained from object recognition data sets such as VOC and COCO.

Fig. 5 shows the rendered image, where the text font size cSize ranges from 25 to 35, and the parameters min, lSpace, and pSpace take the values 3, 0.2 × cSize, and 0.5 × cSize, respectively. The text in the upper right corner and the vertical heading text are read vertically; the footer text at the bottom is small compared with the body text, and its position and word count are random within its region. The two headings appear in a bold, enlarged font, and the line-count limit of a heading is set to 3, i.e., each of the two headings spans at most 3 lines.
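One plausible reading of the spacing rule (lSpace between lines, an extra pSpace between paragraphs) can be sketched as a line-layout helper; the function name and the exact way the gaps combine are my assumptions:

```python
def layout_lines(region_h, c_size=30, n_paragraph_lines=(3, 2)):
    """Compute the y offset of each text line in a region of height region_h,
    with line spacing lSpace = 0.2*cSize and paragraph spacing pSpace = 0.5*cSize."""
    l_space, p_space = 0.2 * c_size, 0.5 * c_size
    ys, y = [], 0.0
    for p, n_lines in enumerate(n_paragraph_lines):
        if p > 0:
            y += p_space                      # extra gap between paragraphs
        for _ in range(n_lines):
            if ys:
                y += l_space                  # gap between consecutive lines
            if y + c_size > region_h:         # stop once the region is full
                return ys
            ys.append(y)
            y += c_size
    return ys
```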

The filling of the two table regions is realized by table-picture fusion and is divided into a header part and a table body part; the two rows of characters above the table form the table header. Caption texts containing keywords such as "Table" appear with a certain probability, and these keywords do not appear in the header part itself. The thresholds thresh1 and thresh2 used to judge whether the table picture size meets the standard are set to 0.8 and 1.2, respectively.
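The text does not spell out what the size criterion compares; one plausible interpretation (assumed here, not confirmed by the source) is that the ratio between the table picture's aspect ratio and the target region's aspect ratio must lie within [thresh1, thresh2]:

```python
def table_size_ok(table_wh, region_wh, thresh1=0.8, thresh2=1.2):
    """Assumed criterion: the table picture 'meets the standard' when its
    aspect ratio is within [thresh1, thresh2] of the target region's."""
    ratio = (table_wh[0] / table_wh[1]) / (region_wh[0] / region_wh[1])
    return thresh1 <= ratio <= thresh2
```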

The annotation information in fig. 5 includes the bounding boxes of characters, words, text lines, paragraphs, tables, and formulas, the character strings inside the characters, words, and text lines, and the LaTeX code strings of the formulas. Fig. 6 is a visual display showing all the bounding box information. In Japanese and Chinese, a word is represented as a continuous string of characters delimited by punctuation, with the punctuation itself excluded.

(3) Add noise and deformation to the document image, save it, and change the corresponding annotation information accordingly.

Fig. 7 shows the image after deformation has been added; the bending deformation uses a cubic curve model, and three-dimensional rotation and translation operations are also applied. The annotation information can be converted as required: for example, an annotated rectangular box can be recorded in the form of a segmentation map and the deformation transform applied to that map, or the rectangular box can be represented as a polygon to which the corresponding deformation transform is applied.
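The polygon route can be sketched as below: a box is densified into a polygon, and a cubic curve (coefficients here are illustrative; the invention's actual model parameters are not given) shifts each point's y coordinate as a function of x:

```python
import numpy as np

def densify_box(x, y, w, h, step=20):
    """Represent a rectangle as a dense polygon so the bend stays accurate."""
    top = [(xi, y) for xi in range(x, x + w + 1, step)]
    bottom = [(xi, y + h) for xi in range(x + w, x - 1, -step)]
    return top + bottom

def bend_polygon(points, a3=0.0, a2=0.0, a1=0.0, a0=0.0):
    """Cubic curve bend: shift each point's y by a3*x^3 + a2*x^2 + a1*x + a0."""
    pts = np.asarray(points, dtype=float)
    x = pts[:, 0]
    dy = a3 * x**3 + a2 * x**2 + a1 * x + a0
    return np.column_stack([x, pts[:, 1] + dy])
```

Applying the same transform to both the image warp and the annotation polygons keeps the labels consistent with the deformed picture.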

(4) Repeat the above steps until the number of generated images reaches the set value.

(5) Merge the annotation information according to the annotation type and save it as an annotation file in a specific format.

Finally, the annotation information of all pictures is merged: the bounding box information is stored in JSON and XML files, and the character string information is stored in text files. Fig. 8 and fig. 9 show part of the annotation information in the JSON and XML files, respectively; the annotated object is the table bounding box in the upper left corner of fig. 5. In the JSON annotation, segmentation is a polygon represented by a point set; iscrowd equal to 0 means the current object is a single object and segmentation is a polygon, whereas iscrowd equal to 1 means the current object is a group of objects and segmentation is in RLE (run-length encoding) format; image_id is the number of the document image; area is the area of the segmentation; bbox is the coordinate information of the rectangle, given as the upper left coordinate, width, and height; category_id is the object's category, where 3 denotes a table; and id is the number of this table among all annotations. Compared with JSON, the XML annotation is simpler: name is the category name of the object, bndbox gives the upper left and lower right coordinates of the object, and difficult indicates whether the object is hard to recognize, with 0 generally meaning it is not.
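The JSON fields described above follow the COCO annotation layout; a minimal builder for one such record (the function name is my own, and `area` is approximated by the box area rather than the exact polygon area):

```python
import json

def table_annotation(image_id, ann_id, bbox, polygon):
    """Build one COCO-style annotation record; category_id 3 denotes 'table'."""
    x, y, w, h = bbox
    return {
        "segmentation": [polygon],     # iscrowd == 0 → polygon point list
        "iscrowd": 0,                  # single object (1 would mean a group, RLE mask)
        "image_id": image_id,          # number of the document image
        "area": w * h,                 # approximated here by the box area
        "bbox": [x, y, w, h],          # upper left corner, width, height
        "category_id": 3,              # 3 = table
        "id": ann_id,                  # number of this annotation among all labels
    }
```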

The embodiments described above are presented to enable a person having ordinary skill in the art to make and use the invention. Various modifications to the above embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the exercise of inventive faculty. Therefore, the present invention is not limited to the above embodiments; improvements and modifications made by those skilled in the art based on the disclosure of the present invention shall fall within the protection scope of the present invention.
