DOM tree-based full-type text replacement method, system, device and storage medium

Document No.: 1505360; Published: 2020-02-07

Abstract: This technology, a DOM tree-based full-type text replacement method, system, device and storage medium, was created by 杜卫红, 谢立欧, 蒋立民, 郑永乐 and 詹锦州 on 2019-10-11. The invention provides a method, a system, a device and a storage medium for full-type text replacement based on a DOM tree. The method comprises: scanning the DOM tree of a website, acquiring the static files and images of the website, and obtaining a font mapping relation file from the static files; converting each font unit in the font file into a picture; performing image recognition on the pictures converted from the font file to obtain the actual character in each picture, establishing a mapping relation between the website characters and the actual characters, and extracting the content of the website pictures; and acquiring the source code of the website, converting the characters in the source code into actual characters according to the mapping relation, pre-screening the website pictures, performing text recognition on the data pictures of the website, and extracting the effective information in the pictures, thereby achieving full text replacement. The beneficial effects of the invention are that each font unit is converted into a picture using multithreading, the mapping relation between real characters and source-code characters is established through image recognition, and the character recognition model is trained extensively, so that the content displayed by the website can be acquired accurately.

1. A DOM tree based full-type text replacement method is characterized by comprising the following steps:

step 1: scanning a DOM tree of a website, acquiring a static file and an image of the website, and obtaining a font mapping relation file from the static file of the website;

step 2: converting each font unit in the font file into a picture;

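step 3: performing image recognition on the picture converted from the font file, acquiring actual characters in the picture, establishing a mapping relation between the website characters and the actual characters and extracting the content of the website picture;

step 4: acquiring the source code of the website, converting the characters in the source code into actual characters according to the mapping relation in the step 3, pre-screening website pictures, filtering out useless pictures of the website, performing text recognition on data pictures of the website, and extracting effective information in the pictures, thereby achieving full text replacement.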

2. The full-type text replacement method according to claim 1, wherein:

in the step 1, scanning a DOM tree of the website to obtain a static file and an image of the website, and obtaining a font mapping relation file in an svg format from the static file of the website;

in step 2, each font unit in the font file is converted into the png format picture.

3. The full-type text replacement method according to claim 1, wherein in the step 4, a source code in a website is obtained, characters in the source code are converted into actual characters according to the mapping relation in the step 3, website pictures are pre-screened, useless pictures of the website are filtered through image fuzzy search, text recognition is performed on data pictures of the website, effective information in the pictures is extracted, and therefore full-text replacement is performed;

in the step 3, the pictures after the font file conversion are preprocessed, the characters of the font library corresponding to the image file are identified by utilizing deep learning, the characters with the highest similarity are obtained, and then multi-round learning optimization is carried out, so that the actual characters in the pictures are obtained, the mapping relation between the website characters and the actual characters is established, and the content of the website pictures is extracted;

in the step 4, text recognition is performed on the data picture of the website, and the implementation manner of extracting the effective information in the picture is as follows: preprocessing a website picture, performing text recognition on the website picture, judging the picture to be a data picture when the ratio of the size of characters in the picture to the size of the picture exceeds a set threshold value, preprocessing the data picture, removing existing irrelevant elements, analyzing characters of the data picture, judging the characters to be non-important characters if the ratio of the transparency of the characters to the average transparency of the characters exceeds the set threshold value, and filtering the characters.

4. The full-type text replacement method according to claim 3, further comprising step 5: setting a timer, and executing the step 1 at regular intervals;

in the step 1, transmitting and storing the static file and the font mapping relation file to an OSS server;

in the step 3, the irrelevant elements comprise a watermark and an interference line;

in the step 3, the preprocessing of the picture after the font file conversion comprises the steps of extracting a character area, graying, denoising, binaryzation, character segmentation and normalization processing of the image characters;

in the step 4, the preprocessing of the website picture includes:

step 4.1: carrying out image color fusion on the character picture to change the color picture into a grayscale picture, wherein the R, G and B components in the RGB (red, green and blue) model of the character picture are averaged with different weights chosen according to correlation and other indicators;

step 4.2: smoothing the image by using an OTSU maximum inter-class variance method;

step 4.3: carrying out dimension reduction processing on the image to eliminate noise on the image;

step 4.4: carrying out inclination correction on the font;

step 4.5: cutting the adhered character body, and splicing broken characters;

the implementation manner of the step 4.5 is as follows: sending the character image into a convolution network to extract characteristic values to obtain n vectors, and then sending the n vectors into an LSTM network; obtaining an m-dimensional vector, and calculating a corresponding position font through an optimized softmax function; the optimized function is the multiplication by a filter omega after the softmax function.

5. A DOM tree based full-type text replacement system, comprising:

a scanning module: used for scanning the DOM tree of a website, acquiring the static files and images of the website, and obtaining a font mapping relation file from the static files of the website;

a conversion module: used for converting each font unit in the font file into a picture;

an identification module: used for performing image recognition on the pictures converted from the font file, acquiring the actual characters in the pictures, establishing a mapping relation between the website characters and the actual characters, and extracting the content of the website pictures;

a processing module: used for acquiring the source code of the website, converting the characters in the source code into actual characters according to the mapping relation of the identification module, pre-screening the website pictures, filtering out useless pictures of the website, performing text recognition on the data pictures of the website, and extracting the effective information in the pictures, thereby achieving full text replacement.

6. The full-type text replacement system according to claim 5, wherein:

in the scanning module, scanning a DOM tree of a website to obtain a static file and an image of the website, and obtaining a font mapping relation file in an svg format from the static file of the website;

and in the conversion module, converting each font unit in the font file into the png format picture.

7. The full-type text replacement system according to claim 5, wherein in the processing module, source codes in a website are acquired, characters in the source codes are converted into actual characters according to the mapping relation of the recognition module, website pictures are pre-screened, useless pictures of the website are filtered through image fuzzy search, text recognition is performed on data pictures of the website, effective information in the pictures is extracted, and therefore full-text replacement is performed;

in the identification module, preprocessing is carried out on the pictures after the font files are converted, the characters of a font library corresponding to the image files are identified by utilizing deep learning, the characters with the highest similarity are obtained, and then multi-round learning optimization is carried out, so that the actual characters in the pictures are obtained, the mapping relation between the website characters and the actual characters is established, and the content of the website pictures is extracted;

in the processing module, text recognition is carried out on the data pictures of the website, and the implementation mode of extracting effective information in the pictures is as follows: preprocessing a website picture, performing text recognition on the website picture, judging the picture to be a data picture when the ratio of the size of characters in the picture to the size of the picture exceeds a set threshold value, preprocessing the data picture, removing existing irrelevant elements, analyzing characters of the data picture, judging the characters to be non-important characters if the ratio of the transparency of the characters to the average transparency of the characters exceeds the set threshold value, and filtering the characters.

8. The full-type text replacement system according to claim 7, further comprising a timing module: used for setting a timer and causing the scanning module to execute scanning at regular intervals;

transmitting and storing the static file and the font mapping relation file in the scanning module to an OSS server;

in the identification module, the extraneous elements include a watermark and an interference line;

in the identification module, preprocessing the picture after font file conversion comprises extracting a character area, graying, denoising, binaryzation, character segmentation and normalization processing on image characters;

in the processing module, the website pictures are preprocessed through the first processing module to the fifth processing module,

a first processing module: used for carrying out image color fusion on the character picture to change the color picture into a grayscale picture, wherein the R, G and B components in the RGB (red, green and blue) model of the character picture are averaged with different weights chosen according to correlation and other indicators;

a second processing module: used for smoothing the image by using the OTSU maximum inter-class variance method;

a third processing module: used for performing dimension reduction on the image to eliminate noise;

a fourth processing module: used for correcting the inclination of the characters;

a fifth processing module: used for cutting apart adhered characters and splicing broken characters;

the fifth processing module is implemented in the following manner: sending the character image into a convolution network to extract characteristic values to obtain n vectors, and then sending the n vectors into an LSTM network; obtaining an m-dimensional vector, and calculating a corresponding position font through an optimized softmax function; the optimized function is the multiplication by a filter omega after the softmax function.

9. A DOM tree based full-type text replacement apparatus, comprising: memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the full-type text replacement method of any one of claims 1-4 when invoked by the processor.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the full-type text replacement method of any of claims 1-4.

Technical Field

The invention relates to the technical field of networks, in particular to a method, a system, a device and a storage medium for replacing full-type texts based on a DOM tree.

Background

With the advancement of science and technology, networks become a part of life and work of people, and contents such as characters in pictures on websites are difficult to extract, so that data acquisition is influenced, great troubles are brought to users, and a solution is needed urgently.

At present, a website may render the characters in a web page by calling a custom font file, so that the characters in the page source are only font codes and the real data cannot be obtained by fetching the source code alone. Websites also convert characters into pictures and add interference such as watermarks, which further increases the difficulty of data acquisition.

Disclosure of Invention

The invention provides a DOM tree-based full-type text replacement method, which comprises the following steps:

step 1: scanning a DOM tree of a website, acquiring a static file and an image of the website, and obtaining a font mapping relation file from the static file of the website;

step 2: converting each font unit in the font file into a picture;

As a further improvement of the present invention, in step 1, scanning a DOM tree of a website to obtain a static file and an image of the website, and obtaining a font mapping relationship file in an svg format from the static file of the website;

in step 2, each font unit in the font file is converted into the png format picture.

As a further improvement of the present invention, in step 4, a source code in the website is obtained, characters in the source code are converted into actual characters according to the mapping relationship in step 3, website pictures are pre-screened, useless pictures of the website are filtered through image fuzzy search, text recognition is performed on data pictures of the website, effective information in the pictures is extracted, and therefore full text replacement is performed;

As a further improvement of the present invention, the full-type text replacement method further comprises the step 5: setting a timer, and executing the step 1 at regular time;

in the step 1, transmitting and storing the static file and the font mapping relation file to an OSS server;

in the step 3, the irrelevant elements comprise a watermark and an interference line;

in the step 4, the preprocessing of the website picture includes:

step 4.2: smoothing the image by using an OTSU maximum inter-class variance method;

step 4.3: carrying out dimension reduction processing on the image to eliminate noise on the image;

step 4.4: carrying out inclination correction on the font;

step 4.5: cutting the adhered character body, and splicing broken characters;

The invention also discloses a DOM tree-based full-type text replacement system, which comprises:

a scanning module: used for scanning the DOM tree of a website, acquiring the static files and images of the website, and obtaining a font mapping relation file from the static files of the website;

a conversion module: used for converting each font unit in the font file into a picture;

As a further improvement of the invention, in the scanning module, scanning a DOM tree of a website to obtain a static file and an image of the website, and obtaining a font mapping relationship file in an svg format from the static file of the website;

and in the conversion module, converting each font unit in the font file into the png format picture.

As a further improvement of the invention, in the processing module, a source code in a website is obtained, characters in the source code are converted into actual characters according to the mapping relation of the identification module, website pictures are pre-screened, useless pictures of the website are filtered through image fuzzy search, text identification is carried out on data pictures of the website, effective information in the pictures is extracted, and therefore full text replacement is carried out;

As a further improvement of the present invention, the full-type text replacement system further comprises a timing module: used for setting a timer and causing the scanning module to execute scanning at regular intervals;

transmitting and storing the static file and the font mapping relation file in the scanning module to an OSS server;

in the identification module, the extraneous elements include a watermark and an interference line;

in the processing module, the website pictures are preprocessed through the first processing module to the fifth processing module,

a second processing module: used for smoothing the image by using the OTSU maximum inter-class variance method;

a third processing module: used for performing dimension reduction on the image to eliminate noise;

a fourth processing module: used for correcting the inclination of the characters;

a fifth processing module: used for cutting apart adhered characters and splicing broken characters;

The invention also provides a device for replacing the full-type text based on the DOM tree, which comprises: a memory, a processor, and a computer program stored on the memory, the computer program configured to, when invoked by the processor, perform the steps of the full-type text replacement method of the present invention.

The present invention also provides a computer readable storage medium having stored thereon a computer program configured to, when invoked by a processor, perform the steps of the full-type text replacement method of the present invention.

The invention has the following beneficial effects: each font unit is converted into a picture using multithreading, the mapping relation between real characters and source-code characters is established through image recognition, and the character recognition model is trained extensively, so that the content displayed by the website can be acquired accurately. In addition, the timer keeps the website's font files up to date, which maintains the robustness and reusability of the invention.

Drawings

FIG. 1 is a flow chart of a method of the present invention;

FIG. 2 is a system architecture diagram of the present invention.

Detailed Description

As shown in FIG. 1, the invention discloses a DOM tree based full-type text replacement method, which comprises the following steps:

step 1: scanning a DOM tree of a website, acquiring a static file and an image of the website, and obtaining a font mapping relation file from the static file of the website;

step 2: converting each font unit in the font file into a picture;

step 3: performing image recognition on the picture converted from the font file, acquiring actual characters in the picture, establishing a mapping relation between the website characters and the actual characters and extracting the content of the website picture;

step 4: acquiring the source code of the website, converting the characters in the source code into actual characters according to the mapping relation in step 3, pre-screening the website pictures, filtering out useless pictures, performing text recognition on the data pictures of the website, and extracting the effective information in the pictures, thereby achieving full text replacement. The pictures of the website are screened by an image fuzzy search technique to select the pictures that contain website data; interference information such as watermarks is removed by an image recognition technique, and the characters and other useful information in the pictures are extracted.

The full-type text replacement of the present invention refers to converting the content displayed on the page into text with a uniform format, including but not limited to pictures and tables. Meanwhile, the DOM tree-based full-type text replacement method is also suitable for the APP of the mobile terminal, and the same steps are carried out on the page of the APP, so that the text content with the uniform coding format is obtained.

In the step 1, scanning a DOM tree of the website to obtain a static file and an image of the website, and obtaining a font mapping relation file in an svg format from the static file of the website;

in step 2, each font unit in the font file is converted into the png format picture.
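Steps 1 and 2 can be illustrated with a minimal sketch, assuming the font mapping relation file is an SVG font whose `<glyph>` elements carry `unicode` and `d` attributes. The function name `split_svg_font` and the `units_per_em` default are hypothetical. The sketch only splits the font into one standalone SVG document per font unit; rasterizing each document to png would be done by an external renderer, which is not shown.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def split_svg_font(svg_font_text, units_per_em=1000):
    """Return {code point: standalone SVG document} for each glyph of an SVG font.

    units_per_em is an assumed em size; a real font declares it on <font-face>.
    """
    root = ET.fromstring(svg_font_text)
    glyph_docs = {}
    for glyph in root.iter(f"{{{SVG_NS}}}glyph"):
        char = glyph.get("unicode")
        outline = glyph.get("d")
        if not char or len(char) != 1 or not outline:
            continue  # skip ligatures and glyphs without an outline
        # SVG font outlines are y-up, so flip the y axis for normal rendering.
        glyph_docs[ord(char)] = (
            f'<svg xmlns="{SVG_NS}" viewBox="0 0 {units_per_em} {units_per_em}">'
            f'<path transform="translate(0,{units_per_em}) scale(1,-1)" d="{outline}"/>'
            "</svg>"
        )
    return glyph_docs
```

Keying the result by code point keeps the link between each picture and the character code that appears in the page source, which the later mapping step relies on.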

In the step 4, the source code in the website is obtained, the characters in the source code are converted into actual characters according to the mapping relation in the step 3, website pictures are pre-screened, useless pictures of the website are filtered through image fuzzy search, text recognition is carried out on data pictures of the website, effective information in the pictures is extracted, and therefore full text replacement is achieved.
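The character conversion in step 4 can be sketched as a lookup over the page source, assuming the mapping relation has already been built as a dictionary from obfuscated code points to actual characters. The function name and the decision to handle hexadecimal character references (`&#x...;`) are assumptions made for illustration.

```python
import re

_HEX_ENTITY = re.compile(r"&#x([0-9a-fA-F]+);")


def replace_obfuscated_text(source, mapping):
    """Replace obfuscated characters in page source with their actual characters.

    mapping: {obfuscated code point (int): actual character (str)}
    """
    def decode_entity(match):
        code_point = int(match.group(1), 16)
        # Decode only the references covered by the mapping; keep the rest as-is.
        return mapping.get(code_point, match.group(0))

    decoded = _HEX_ENTITY.sub(decode_entity, source)
    # str.translate handles obfuscated characters that appear literally.
    return decoded.translate(mapping)
```

For example, with `mapping = {0xE001: "1", 0xE002: "9"}`, the source `<b>&#xe001;\ue002</b>` becomes `<b>19</b>`, while unrelated references such as `&amp;` are left untouched.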

In the step 3, the pictures after the font file conversion are preprocessed, the characters of the font library corresponding to the image file are identified by utilizing deep learning, the characters with the highest similarity are obtained, and then multi-round learning optimization is carried out, so that the actual characters in the pictures are obtained, the mapping relation between the website characters and the actual characters is established, and the content of the website pictures is extracted;

The full-type text replacement method further comprises step 5: setting a timer and executing step 1 at regular intervals, so that the static files of the website are acquired periodically. Once the website updates its font file, the system rebuilds the font mapping relation, which keeps the system highly available.
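A minimal sketch of the timer in step 5 follows. Detecting a font update by comparing SHA-256 digests is an assumption, since the source does not say how an update is detected; `FontWatcher`, `fetch_font` and `rebuild_mapping` are hypothetical names supplied by the caller.

```python
import hashlib
import threading


class FontWatcher:
    """Periodically re-fetch the font file and rebuild the mapping when it changes."""

    def __init__(self, fetch_font, rebuild_mapping, interval_seconds):
        self._fetch_font = fetch_font            # callable() -> bytes
        self._rebuild_mapping = rebuild_mapping  # callable(bytes) -> None
        self._interval = interval_seconds
        self._last_digest = None
        self._timer = None

    def check_once(self):
        """Return True if the font changed and the mapping was rebuilt."""
        data = self._fetch_font()
        digest = hashlib.sha256(data).hexdigest()
        if digest == self._last_digest:
            return False
        self._last_digest = digest
        self._rebuild_mapping(data)
        return True

    def start(self):
        """Run one check now and schedule the next one."""
        self.check_once()
        self._timer = threading.Timer(self._interval, self.start)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()
```

Skipping the rebuild when the digest is unchanged avoids repeating glyph conversion and recognition on every tick.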

The invention also has the following characteristics:

1. Multithreaded operation meets the TPS and concurrency requirements of the system, accelerates the text replacement of images, and improves the operating efficiency of the system.

2. The image files to be recognized are distributed through RabbitMQ, and a highly available distributed architecture is built with Keepalived and HAProxy, which greatly reduces the operation time.
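The multithreaded recognition described above can be sketched on a single machine with a thread pool. This is only an in-process stand-in: it does not use RabbitMQ, Keepalived or HAProxy, and the `recognize` callable (the trained recognition model) is assumed to be supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_all(glyph_images, recognize, max_workers=8):
    """Recognize every glyph picture concurrently.

    glyph_images: {code point: image bytes}
    recognize:    callable(image bytes) -> actual character (hypothetical model call)
    Returns {code point: actual character}, i.e. the mapping relation.
    """
    code_points = list(glyph_images)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        results = pool.map(recognize, (glyph_images[cp] for cp in code_points))
        return dict(zip(code_points, results))
```

In a distributed deployment, the same per-image task would be published to a message queue instead of a local pool.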

In the step 1, the static file and the font mapping relationship file are transmitted and stored to the OSS server, the font mapping file generated every day is stored by the OSS server, and the static file is acquired and stored to the OSS server, so that the access pressure to the website is reduced, and the text replacement speed is increased.

The invention uses different preprocessing logic for recognizing characters in a font mapping file and for recognizing characters in an image. Generally speaking, the characters in a website font mapping file are clean and complete, whereas the characters in an image carry many interference lines and are often incomplete, so image characters must go through the sub-steps of character region extraction, graying, noise reduction, binarization, character segmentation and normalization before they can be recognized.

In the step 3, the irrelevant elements comprise a watermark and an interference line;

in the step 4, the preprocessing of the website picture includes:

Step 4.1: carrying out image color fusion on the character picture to change the color picture into a grayscale picture. In the RGB model of the character picture, the three components R (red), G (green) and B (blue) are averaged with different weights chosen according to correlation and other indicators.

Step 4.2: smoothing the image by using the OTSU maximum inter-class variance method.

Step 4.3: performing dimension reduction on the image to eliminate noise.

Step 4.4: correcting the inclination of the characters.

Step 4.5: cutting apart adhered characters and splicing broken characters.
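Steps 4.1 and 4.2 can be sketched in plain Python as follows. Two assumptions are made. First, the source does not give the RGB weights, so the common luma weights 0.299, 0.587 and 0.114 are used as placeholders. Second, Otsu's method as commonly defined selects the global threshold that maximizes the between-class variance, so the sketch uses it to binarize the grayscale picture.

```python
def to_gray(pixels, weights=(0.299, 0.587, 0.114)):
    """Weighted average of R, G, B. pixels: rows of (r, g, b) tuples in 0..255."""
    wr, wg, wb = weights
    return [[round(wr * r + wg * g + wb * b) for r, g, b in row] for row in pixels]


def otsu_threshold(gray):
    """Return the threshold t that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold):
    """Pixels above the threshold become 255, all others 0."""
    return [[255 if value > threshold else 0 for value in row] for row in gray]
```

A production pipeline would normally use an image library for these operations; the pure-Python version is meant only to make the arithmetic explicit.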

The implementation manner of step 4.5 is as follows: the character image is fed into a convolutional network to extract feature values, yielding n vectors, which are then fed into an LSTM network to obtain an m-dimensional vector. The character at each position is then computed by an optimized softmax function (sm-Ω), that is, the softmax function followed by multiplication by a filter Ω. Ω addresses the poor accuracy of the softmax function when its output approaches 0 or 1. In picture processing, linear weights cause precision errors to concentrate, with high probability, near 0 and 1. The sm-Ω function therefore uses nonlinear weights so that the precision error is distributed evenly over the probability interval, which improves the confidence of the model.
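The final scoring stage can be sketched as below. The source does not define Ω, so the sketch treats it as a caller-supplied function of the softmax probability and multiplies each probability by it. This interpretation, and the names `sm_omega` and `predict`, are assumptions; the convolutional and LSTM stages are not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sm_omega(logits, omega):
    """Softmax followed by multiplication with a filter omega.

    omega: callable(probability) -> weight. Its form is hypothetical here.
    """
    return [p * omega(p) for p in softmax(logits)]


def predict(logits, omega, alphabet):
    """Return (character, score) for the highest sm-omega score."""
    scores = sm_omega(logits, omega)
    best = max(range(len(scores)), key=scores.__getitem__)
    return alphabet[best], scores[best]
```

Passing `omega=lambda p: 1.0` reduces the function to plain softmax, which gives a baseline against which any nonlinear choice of Ω can be compared.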

The invention applies cooperative optimization for CPUs with a small number of cores, so that it retains its original performance in many scenarios where CPU usage is very high.

The invention also applies deep learning to character recognition and establishes a model that can cope with different typefaces. Because the model has a wide range of adaptation and covers a comprehensive set of fonts, it can perform high-precision character recognition.

Text information and pictures are obtained by scanning the DOM tree. Irrelevant elements such as watermarks and interference lines are filtered out of the pictures by picture character recognition to obtain the characters in the website pictures, and the picture information and character information are then restored into the content of the website through the DOM tree.

In summary, the invention solves the problem of font replacement on a website. It scans the DOM tree of the website and extracts its characters and pictures, converts the font file of the website into processable image files by image transcoding, obtains the character in each image by image recognition, and establishes a mapping relation between the pictures and the font file. The mapping relation is used to resolve the obfuscated fonts in the website and obtain the real data. The font file is acquired at regular intervals, and the image characters are recognized by extensively trained models, so that a high-precision mapping relation is established. In practice, a data website is crawled with a script framework to obtain the obfuscated font files and picture files; the font files in svg format are converted into png pictures, which are easier to process; image recognition is performed on the pictures and the correspondence between the font file and the recognition results is returned; the website source code is then acquired, the font-obfuscated characters are converted into correct text through the correspondence, and the text and pictures in the source code are replaced with the correct text and the image recognition results.

The invention also discloses a DOM tree-based full-type text replacement system, which comprises:

a scanning module: used for scanning the DOM tree of a website, acquiring the static files and images of the website, and obtaining a font mapping relation file from the static files of the website;

a conversion module: used for converting each font unit in the font file into a picture;

In the scanning module, scanning a DOM tree of a website to obtain a static file and an image of the website, and obtaining a font mapping relation file in an svg format from the static file of the website;

and in the conversion module, converting each font unit in the font file into the png format picture.

In the processing module, the source codes in the website are obtained, characters in the source codes are converted into actual characters according to the mapping relation of the identification module, website pictures are pre-screened, useless pictures of the website are filtered through image fuzzy search, text identification is carried out on data pictures of the website, effective information in the pictures is extracted, and therefore full-text replacement is achieved. The method comprises the steps of screening pictures of a website by an image fuzzy search technology, screening the pictures containing website data, removing interference information such as watermarks of the website by an image identification technology, and extracting characters and other useful information of the pictures.

In the identification module, preprocessing is carried out on the pictures after the font files are converted, the characters of a font library corresponding to the image files are identified by utilizing deep learning, the characters with the highest similarity are obtained, and then multi-round learning optimization is carried out, so that the actual characters in the pictures are obtained, the mapping relation between the website characters and the actual characters is established, and the content of the website pictures is extracted;

The full-type text replacement system further includes a timing module, which is used for setting a timer and causing the scanning module to execute scanning at regular intervals.

Transmitting and storing the static file and the font mapping relation file in the scanning module to an OSS server;

in the identification module, the extraneous elements include a watermark and an interference line;

in the processing module, the website pictures are preprocessed through a first processing module to a fifth processing module.

A first processing module: carrying out image color fusion on the character picture to change the color picture into a grayscale picture. In the RGB model of the character picture, the three components R (red), G (green) and B (blue) are averaged with different weights chosen according to correlation and other indicators.

A second processing module: smoothing the image by using the OTSU maximum inter-class variance method.

A third processing module: performing dimension reduction on the image to eliminate noise.

A fourth processing module: correcting the inclination of the characters.

A fifth processing module: cutting apart adhered characters and splicing broken characters.

The fifth processing module is implemented as follows: the character image is fed into a convolutional network to extract feature values, yielding n vectors, which are then fed into an LSTM network to obtain an m-dimensional vector. The character at each position is then computed by an optimized softmax function (sm-Ω), that is, the softmax function followed by multiplication by a filter Ω. Ω addresses the poor accuracy of the softmax function when its output approaches 0 or 1. In picture processing, linear weights cause precision errors to concentrate, with high probability, near 0 and 1. The sm-Ω function therefore uses nonlinear weights so that the precision error is distributed evenly over the probability interval, which improves the confidence of the model.

The invention establishes a set of immediate and efficient replacement system through font files and pictures of the website, replaces the text returned by the website with the text displayed by the website, and converts the pictures of the website into the extracted text, thereby realizing the purpose of data acquisition.

As shown in FIG. 2, the production end includes a DOM tree parser, an SVG picture converter, a picture cloud storage, and the like. The overall process is unchanged: after the DOM tree of a web page is parsed, the font file is converted into pictures, and the pictures are stored in the cloud storage system. The serialized web page text becomes a main message, and the links to the web page's font picture files and pictures become sub-messages, which the production end publishes. In the distributed system, a highly available load-balancing environment is built with Keepalived and HAProxy, and the nodes process different messages concurrently. The consumption end extracts the text of the pictures and font pictures referenced by the messages, obtains the ID of the main message after asynchronous processing, and delivers the extracted text to the main message for text replacement, which completes the full text replacement of the web page. The website template is saved to the cloud storage system, which improves the efficiency of the system.
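The main-message and sub-message flow can be sketched with an in-process queue standing in for the message broker. The class names, the `placeholder` field used to locate each picture in the page text, and the `extract_text` callable are all hypothetical; the source does not specify the message layout.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class MainMessage:
    page_text: str  # serialized web page text
    message_id: str = field(default_factory=lambda: uuid.uuid4().hex)


@dataclass
class SubMessage:
    main_id: str      # ID of the main message this picture belongs to
    placeholder: str  # marker in page_text to be replaced (assumed convention)
    image_url: str    # link to the stored picture


def produce(page_text, images, bus):
    """Publish one sub-message per picture and return the main message."""
    main = MainMessage(page_text)
    for placeholder, url in images.items():
        bus.put(SubMessage(main.message_id, placeholder, url))
    return main


def consume(mains, bus, extract_text):
    """Drain the queue and apply each extracted text to its main message.

    mains: {message_id: MainMessage}; returns {message_id: replaced text}.
    """
    texts = {mid: main.page_text for mid, main in mains.items()}
    while True:
        try:
            sub = bus.get_nowait()
        except queue.Empty:
            return texts
        extracted = extract_text(sub.image_url)
        texts[sub.main_id] = texts[sub.main_id].replace(sub.placeholder, extracted)
```

Carrying the main message ID in every sub-message is what lets independent consumer nodes process pictures in any order and still deliver the result to the right page.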

The invention also discloses a device for replacing the full-type text based on the DOM tree, which comprises: a memory, a processor, and a computer program stored on the memory, the computer program configured to, when invoked by the processor, perform the steps of the full-type text replacement method of the present invention.

The invention also discloses a computer readable storage medium storing a computer program configured to implement the steps of the full-type text replacement method of the invention when invoked by a processor.

The invention converts each font unit into a picture using multithreading, establishes the mapping relation between real characters and source-code characters through image recognition, and trains the character recognition model extensively, so that the content displayed by the website can be acquired accurately. In addition, the timer keeps the website's font files up to date, which maintains the robustness and reusability of the invention.

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
