Text document-based character recognition method, device, equipment and storage medium

Document No.: 169622  Publication date: 2021-10-29  Views: 41  Original language: Chinese

Reading note: This technology, "Text document-based character recognition method, device, equipment and storage medium", was designed and created by 曾博, 王燕蒙, and 王少军 on 2021-07-27. Its main content is as follows: The invention relates to the field of artificial intelligence, and discloses a text document-based character recognition method, device, equipment and storage medium, which are used for improving text recognition efficiency. The text document-based character recognition method comprises the following steps: receiving a text document to be recognized, and preprocessing the text document to obtain a standardized text image; based on a preset optical character recognition engine, performing character recognition on the standardized text image to obtain initial text information corresponding to the standardized text image; segmenting the standardized text image to obtain a plurality of text image segments; performing vectorization processing on the plurality of text image segments and the initial text information to obtain initial image vectors and initial text vectors; and acquiring the labeling information corresponding to the initial text vectors, and determining the target text information corresponding to each initial image vector according to the labeling information. In addition, the invention also relates to blockchain technology, and the target text information can be stored in a blockchain node.

1. A text document-based character recognition method, characterized in that the text document-based character recognition method comprises the following steps:

receiving a text document to be recognized, and preprocessing the text document to obtain a standardized text image;

based on a preset optical character recognition engine, performing character recognition on the standardized text image to obtain initial text information corresponding to the standardized text image;

segmenting the standardized text image according to the initial text information corresponding to the standardized text image to obtain a plurality of text image segments;

inputting the plurality of text image segments into a trained image feature extraction model for image vectorization to obtain initial image vectors corresponding to the text image segments, and performing text vectorization on the initial text information to obtain initial text vectors corresponding to the initial text information;

and acquiring labeling information corresponding to the initial text vectors from a preset text image information base, and determining target text information corresponding to each initial image vector according to the labeling information.

2. The method of claim 1, wherein the receiving a text document to be recognized and preprocessing the text document to obtain a standardized text image comprises:

receiving a text document to be recognized, and converting the text document into an image format to obtain an initialized text image;

carrying out binarization processing on the initialized text image to obtain a black-and-white image;

traversing the grayscale value of each pixel in the black-and-white image, and performing noise reduction on each pixel's grayscale value to obtain a noise-reduced image;

and correcting the noise-reduced image according to a preset image correction algorithm to obtain a standardized text image.

3. The method of claim 1, wherein the optical character recognition engine comprises a bidirectional long short-term memory (BiLSTM) recurrent neural network model, and the performing character recognition on the standardized text image based on the preset optical character recognition engine to obtain initial text information corresponding to the standardized text image comprises:

inputting the text image into an input layer of the bidirectional long short-term memory recurrent neural network model for matrixing to obtain a first feature matrix of the text image;

inputting the first feature matrix into a coding layer of the bidirectional long short-term memory recurrent neural network model for feature extraction to obtain a second feature matrix;

inputting the second feature matrix into a decoding layer of the bidirectional long short-term memory recurrent neural network model for feature decoding to obtain a third feature matrix;

inputting the third feature matrix into a fully connected layer of the bidirectional long short-term memory recurrent neural network model for feature classification to obtain a text feature classification label corresponding to the text image;

and setting the text feature classification label as an index, and searching a preset text dictionary to obtain initial text information corresponding to the standardized text image.

4. The method of claim 1, wherein the image feature extraction model comprises a bidirectional encoder BERT model, and the inputting the plurality of text image segments into the trained image feature extraction model for image vectorization to obtain initial image vectors corresponding to the text image segments comprises:

inputting the text image segments into a convolution layer of the bidirectional encoder BERT model for feature extraction to obtain first feature vectors corresponding to the text image segments;

inputting each first feature vector into an excitation layer of the bidirectional encoder BERT model for nonlinear mapping to obtain a plurality of second feature vectors;

and inputting each second feature vector into a pooling layer of the bidirectional encoder BERT model for dimension reduction to obtain an initial image vector corresponding to each text image segment.

5. The method of claim 1, wherein the performing text vectorization processing on the initial text information to obtain an initial text vector corresponding to the initial text information comprises:

performing word segmentation processing on the initial text information based on a preset word segmentation algorithm to obtain a word segmentation result;

based on a preset one-hot encoding algorithm, performing sparse vectorization on the word segmentation result to obtain a sparse vector corresponding to the initial text information;

and mapping the sparse vector corresponding to the initial text information into a dense vector based on a preset word embedding algorithm to obtain the initial text vector corresponding to the initial text information.

6. The method of claim 1, wherein before the receiving a text document to be recognized and preprocessing the text document to obtain a standardized text image, the method further comprises:

obtaining a sample file in a text document format, and converting the sample file into an image format to obtain a sample image;

extracting sample text information in the sample file, and performing word segmentation processing on the sample text information to obtain a word segmentation result;

based on the word segmentation result, performing segmentation processing on the sample image to obtain a plurality of sample image segments;

performing text vectorization processing on the word segmentation result to obtain a sample text vector, and performing image vectorization processing on the plurality of sample image segments to obtain a sample image vector;

and according to the sample text information corresponding to the sample text vector, performing sequence labeling on the sample image vectors to obtain labeling information corresponding to each sample image segment, and generating a text image information base.

7. The method of claim 6, wherein the acquiring labeling information corresponding to the initial text vector from the preset text image information base and determining target text information corresponding to each initial image vector according to the labeling information comprises:

searching the text image information base for a target text vector corresponding to the initial text vector, and acquiring labeling information corresponding to the target text vector;

acquiring, from the text image information base, target image vectors corresponding to the initial image vectors according to the labeling information corresponding to the target text vector, and respectively judging whether the similarity between each initial image vector and its corresponding target image vector is smaller than a preset threshold value;

and if the similarity between each initial image vector and its corresponding target image vector is smaller than the preset threshold value, extracting the target text information corresponding to each initial image vector from the labeling information corresponding to the target image vector.

8. A text document-based character recognition apparatus, comprising:

the receiving module is used for receiving a text document to be recognized and preprocessing the text document to obtain a standardized text image;

the recognition module is used for carrying out character recognition on the standardized text image based on a preset optical character recognition engine to obtain initial text information corresponding to the standardized text image;

the segmentation module is used for segmenting the standardized text image according to the initial text information corresponding to the standardized text image to obtain a plurality of text image segments;

the vectorization module is used for inputting the text image segments into a trained image feature extraction model for image vectorization to obtain initial image vectors corresponding to the text image segments, and performing text vectorization on the initial text information to obtain initial text vectors corresponding to the initial text information;

and the determining module is used for acquiring the labeling information corresponding to the initial text vectors from a preset text image information base and determining the target text information corresponding to each initial image vector according to the labeling information.

9. A text document-based character recognition device, comprising: a memory and at least one processor, the memory having instructions stored therein;

the at least one processor invokes the instructions in the memory to cause the text document-based character recognition device to perform the text document-based character recognition method according to any one of claims 1-7.

10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the text document-based character recognition method according to any one of claims 1-7.

Technical Field

The invention relates to the field of machine learning, and in particular to a text document-based character recognition method, device, equipment, and storage medium.

Background

Text documents include bill documents, contract documents, academic documents, and the like, and valuable data can be extracted from them through character recognition to enrich information databases.

Most existing text document recognition technologies intelligently recognize text documents based on Optical Character Recognition (OCR) and then correct the recognition results through a natural language model, so as to achieve the purpose of text recognition. However, in the prior art, the accuracy of text document recognition is often limited by the capability of the model, and errors during recognition easily propagate into errors during correction; therefore, the accuracy of existing text document recognition methods still needs to be improved.

Disclosure of Invention

The invention provides a text document-based character recognition method, device, equipment, and storage medium, which are used for improving the accuracy of text document recognition.

A first aspect of the invention provides a text document-based character recognition method, which comprises the following steps:

receiving a text document to be recognized, and preprocessing the text document to obtain a standardized text image;

based on a preset optical character recognition engine, performing character recognition on the standardized text image to obtain initial text information corresponding to the standardized text image;

segmenting the standardized text image according to the initial text information corresponding to the standardized text image to obtain a plurality of text image segments;

inputting the plurality of text image segments into a trained image feature extraction model for image vectorization to obtain initial image vectors corresponding to the text image segments, and performing text vectorization on the initial text information to obtain initial text vectors corresponding to the initial text information;

and acquiring labeling information corresponding to the initial text vectors from a preset text image information base, and determining target text information corresponding to each initial image vector according to the labeling information.

Optionally, in a first implementation manner of the first aspect of the present invention, the receiving a text document to be recognized, and preprocessing the text document to obtain a standardized text image includes:

receiving a text document to be recognized, and converting the text document into an image format to obtain an initialized text image;

carrying out binarization processing on the initialized text image to obtain a black-and-white image;

traversing the grayscale value of each pixel in the black-and-white image, and performing noise reduction on each pixel's grayscale value to obtain a noise-reduced image;

and correcting the noise-reduced image according to a preset image correction algorithm to obtain a standardized text image.
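The binarization and per-pixel noise-reduction steps above can be sketched as follows. This is a minimal NumPy illustration only: the fixed threshold of 128 and the 3×3 median filter are illustrative assumptions (the source leaves both unspecified), and the final image-correction step is omitted.

```python
import numpy as np

def binarize(gray: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Map each grayscale pixel to black (0) or white (255).

    The fixed threshold of 128 is an illustrative assumption; the source
    leaves the binarization rule unspecified (Otsu's method is a common choice).
    """
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)

def median_denoise(img: np.ndarray) -> np.ndarray:
    """Traverse every pixel and replace it with the median of its 3x3
    neighborhood -- one simple reading of the per-pixel noise-reduction step."""
    padded = np.pad(img, 1, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = np.median(padded[y:y + 3, x:x + 3])
    return out

# Usage: a noisy 4x4 grayscale patch with isolated dark speckles
gray = np.array([[200, 210,  40, 215],
                 [205, 220, 225, 210],
                 [ 30, 215, 220, 225],
                 [210, 205, 215,  50]], dtype=np.uint8)
bw = binarize(gray)          # pure black-and-white image
clean = median_denoise(bw)   # speckle noise removed
```

Here the three isolated dark pixels survive binarization as black specks, and the median filter removes them.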

Optionally, in a second implementation manner of the first aspect of the present invention, the optical character recognition engine includes a bidirectional long short-term memory (BiLSTM) recurrent neural network model, and the performing character recognition on the standardized text image based on a preset optical character recognition engine to obtain initial text information corresponding to the standardized text image includes:

inputting the text image into an input layer of the bidirectional long short-term memory recurrent neural network model for matrixing to obtain a first feature matrix of the text image;

inputting the first feature matrix into a coding layer of the bidirectional long short-term memory recurrent neural network model for feature extraction to obtain a second feature matrix;

inputting the second feature matrix into a decoding layer of the bidirectional long short-term memory recurrent neural network model for feature decoding to obtain a third feature matrix;

inputting the third feature matrix into a fully connected layer of the bidirectional long short-term memory recurrent neural network model for feature classification to obtain a text feature classification label corresponding to the text image;

and setting the text feature classification label as an index, and searching a preset text dictionary to obtain initial text information corresponding to the standardized text image.
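The final step of this pipeline, setting each classification label as an index into a preset text dictionary, can be sketched as follows. The dictionary contents and label values are toy assumptions, not taken from the source.

```python
# Minimal sketch of the final lookup step: the classification labels produced
# by the fully connected layer index into a preset text dictionary.
# The dictionary entries below are illustrative assumptions.
TEXT_DICTIONARY = {0: "invoice", 1: "contract", 2: "total", 3: "date"}

def labels_to_text(classification_labels):
    """Set each text-feature classification label as an index and look up the
    preset text dictionary to recover the initial text information."""
    return [TEXT_DICTIONARY[label] for label in classification_labels]

# Usage: labels emitted for one standardized text image
initial_text = labels_to_text([1, 3, 2])
```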

Optionally, in a third implementation manner of the first aspect of the present invention, the image feature extraction model includes a bidirectional encoder BERT model, and the inputting the plurality of text image segments into the trained image feature extraction model for image vectorization to obtain initial image vectors corresponding to the text image segments includes:

inputting the text image segments into a convolution layer of the bidirectional encoder BERT model for feature extraction to obtain first feature vectors corresponding to the text image segments;

inputting each first feature vector into an excitation layer of the bidirectional encoder BERT model for nonlinear mapping to obtain a plurality of second feature vectors;

and inputting each second feature vector into a pooling layer of the bidirectional encoder BERT model for dimension reduction to obtain an initial image vector corresponding to each text image segment.
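The convolution, excitation, and pooling stages described above can be sketched in miniature. Everything here is a toy assumption: a 1-D convolution stands in for the convolution layer, ReLU for the unspecified nonlinear mapping, and non-overlapping max pooling for the dimension-reduction step; the real model's layer shapes and parameters are not given in the source.

```python
import numpy as np

def conv1d(x, kernel):
    """Feature extraction: valid 1-D convolution (toy stand-in for the
    model's convolution layer)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def relu(x):
    """Excitation layer: ReLU as an assumed nonlinear mapping."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Pooling layer: non-overlapping max pooling for dimension reduction."""
    trimmed = x[: len(x) - len(x) % size]
    return trimmed.reshape(-1, size).max(axis=1)

def image_segment_to_vector(pixels, kernel=np.array([1.0, -1.0])):
    first = conv1d(pixels, kernel)   # first feature vector
    second = relu(first)             # second feature vector
    return max_pool(second)          # initial image vector (reduced dimension)

vec = image_segment_to_vector(np.array([0.0, 1.0, 3.0, 2.0, 5.0]))
```

The output vector is shorter than the intermediate feature vector, mirroring the dimension reduction the pooling layer is said to perform.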

Optionally, in a fourth implementation manner of the first aspect of the present invention, the performing text vectorization processing on the initial text information to obtain an initial text vector corresponding to the initial text information includes:

performing word segmentation processing on the initial text information based on a preset word segmentation algorithm to obtain a word segmentation result;

based on a preset one-hot encoding algorithm, performing sparse vectorization on the word segmentation result to obtain a sparse vector corresponding to the initial text information;

and mapping the sparse vector corresponding to the initial text information into a dense vector based on a preset word embedding algorithm to obtain the initial text vector corresponding to the initial text information.
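The three-stage text vectorization above (word segmentation, one-hot sparse vector, dense embedding) can be sketched as follows. The toy vocabulary, the random embedding table, and the 4-dimensional embedding size are all illustrative assumptions; a real system would use a trained tokenizer and a learned embedding table.

```python
import numpy as np

# Toy vocabulary standing in for a word segmentation result (assumption).
vocab = {"total": 0, "amount": 1, "due": 2}

def one_hot(tokens, vocab_size):
    """Sparse vectorization: one 1-of-N row per segmented word."""
    m = np.zeros((len(tokens), vocab_size))
    for row, tok in enumerate(tokens):
        m[row, vocab[tok]] = 1.0
    return m

def embed(sparse, embedding_table):
    """Word embedding: map each sparse one-hot row to a dense vector
    (a matrix product with the embedding table is a row lookup)."""
    return sparse @ embedding_table

rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), 4))   # assumed 4-dim embedding table
tokens = ["total", "due"]                  # word segmentation result
sparse = one_hot(tokens, len(vocab))       # sparse vectors
dense = embed(sparse, table)               # initial text vectors
```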

Optionally, in a fifth implementation manner of the first aspect of the present invention, before the receiving a text document to be recognized and preprocessing the text document to obtain a standardized text image, the text document-based character recognition method further includes:

obtaining a sample file in a text document format, and converting the sample file into an image format to obtain a sample image;

extracting sample text information in the sample file, and performing word segmentation processing on the sample text information to obtain a word segmentation result;

based on the word segmentation result, performing segmentation processing on the sample image to obtain a plurality of sample image segments;

performing text vectorization processing on the word segmentation result to obtain a sample text vector, and performing image vectorization processing on the plurality of sample image segments to obtain a sample image vector;

and according to the sample text information corresponding to the sample text vector, performing sequence labeling on the sample image vectors to obtain labeling information corresponding to each sample image segment, and generating a text image information base.
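Building the text image information base can be sketched as a keyed store that links each sample text vector to the labeling information for its image segment. The key scheme (hashable tuples) and the record layout are illustrative assumptions; the source does not specify a storage structure.

```python
# Minimal sketch of generating the text image information base: each sample
# text vector keys the labeling information that links a sample image segment
# to its sample text. Key scheme and record layout are assumptions.
def build_information_base(sample_text_vectors, sample_image_vectors, sample_texts):
    base = {}
    for text_vec, image_vec, text in zip(
        sample_text_vectors, sample_image_vectors, sample_texts
    ):
        # Sequence labeling: record which image segment carries which text.
        base[tuple(text_vec)] = {"image_vector": image_vec, "text": text}
    return base

# Usage: two toy samples
base = build_information_base(
    [(0.1, 0.9), (0.8, 0.2)],
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    ["invoice number", "total amount"],
)
```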

Optionally, in a sixth implementation manner of the first aspect of the present invention, the acquiring, from a preset text image information base, labeling information corresponding to the initial text vector, and determining, according to the labeling information, target text information corresponding to each initial image vector includes:

searching the text image information base for a target text vector corresponding to the initial text vector, and acquiring labeling information corresponding to the target text vector;

acquiring, from the text image information base, target image vectors corresponding to the initial image vectors according to the labeling information corresponding to the target text vector, and respectively judging whether the similarity between each initial image vector and its corresponding target image vector is smaller than a preset threshold value;

and if the similarity between each initial image vector and its corresponding target image vector is smaller than the preset threshold value, extracting the target text information corresponding to each initial image vector from the labeling information corresponding to the target image vector.
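The threshold test above can be sketched as follows. Since the source accepts a match when the "similarity" is *smaller* than the threshold, this sketch reads that measure as a distance (Euclidean here), where smaller means more alike; the distance function, the threshold value, and the record layout are all illustrative assumptions.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_target_text(initial_image_vectors, targets, threshold=0.5):
    """For each initial image vector, accept the target text only when the
    vector lies within `threshold` of its target image vector -- reading the
    source's "similarity ... smaller than a preset threshold" as a distance
    test. Threshold and record layout are illustrative assumptions."""
    results = []
    for vec, target in zip(initial_image_vectors, targets):
        if euclidean_distance(vec, target["image_vector"]) < threshold:
            results.append(target["text"])
        else:
            results.append(None)  # no target text extracted for this vector
    return results

# Usage: the first vector is close to its target, the second is not
targets = [
    {"image_vector": (1.0, 0.0), "text": "total"},
    {"image_vector": (0.0, 1.0), "text": "date"},
]
texts = match_target_text([(0.9, 0.1), (1.0, 0.0)], targets)
```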

A second aspect of the present invention provides a text document-based character recognition apparatus, comprising:

the receiving module is used for receiving a text document to be recognized and preprocessing the text document to obtain a standardized text image;

the recognition module is used for carrying out character recognition on the standardized text image based on a preset optical character recognition engine to obtain initial text information corresponding to the standardized text image;

the segmentation module is used for segmenting the standardized text image according to the initial text information corresponding to the standardized text image to obtain a plurality of text image segments;

the vectorization module is used for inputting the text image segments into a trained image feature extraction model for image vectorization to obtain initial image vectors corresponding to the text image segments, and performing text vectorization on the initial text information to obtain initial text vectors corresponding to the initial text information;

and the determining module is used for acquiring the labeling information corresponding to the initial text vectors from a preset text image information base and determining the target text information corresponding to each initial image vector according to the labeling information.

Optionally, in a first implementation manner of the second aspect of the present invention, the receiving module is specifically configured to:

receiving a text document to be recognized, and converting the text document into an image format to obtain an initialized text image;

carrying out binarization processing on the initialized text image to obtain a black-and-white image;

traversing the grayscale value of each pixel in the black-and-white image, and performing noise reduction on each pixel's grayscale value to obtain a noise-reduced image;

and correcting the noise-reduced image according to a preset image correction algorithm to obtain a standardized text image.

Optionally, in a second implementation manner of the second aspect of the present invention, the optical character recognition engine includes a bidirectional long short-term memory (BiLSTM) recurrent neural network model, and the recognition module is specifically configured to:

inputting the text image into an input layer of the bidirectional long short-term memory recurrent neural network model for matrixing to obtain a first feature matrix of the text image;

inputting the first feature matrix into a coding layer of the bidirectional long short-term memory recurrent neural network model for feature extraction to obtain a second feature matrix;

inputting the second feature matrix into a decoding layer of the bidirectional long short-term memory recurrent neural network model for feature decoding to obtain a third feature matrix;

inputting the third feature matrix into a fully connected layer of the bidirectional long short-term memory recurrent neural network model for feature classification to obtain a text feature classification label corresponding to the text image;

and setting the text feature classification label as an index, and searching a preset text dictionary to obtain initial text information corresponding to the standardized text image.

Optionally, in a third implementation manner of the second aspect of the present invention, the image feature extraction model includes a bidirectional encoder BERT model, and the vectorization module is configured to:

inputting the text image segments into a convolution layer of the bidirectional encoder BERT model for feature extraction to obtain first feature vectors corresponding to the text image segments;

inputting each first feature vector into an excitation layer of the bidirectional encoder BERT model for nonlinear mapping to obtain a plurality of second feature vectors;

and inputting each second feature vector into a pooling layer of the bidirectional encoder BERT model for dimension reduction to obtain an initial image vector corresponding to each text image segment.

Optionally, in a fourth implementation manner of the second aspect of the present invention, the vectorization module is further configured to:

performing word segmentation processing on the initial text information based on a preset word segmentation algorithm to obtain a word segmentation result;

based on a preset one-hot encoding algorithm, performing sparse vectorization on the word segmentation result to obtain a sparse vector corresponding to the initial text information;

and mapping the sparse vector corresponding to the initial text information into a dense vector based on a preset word embedding algorithm to obtain the initial text vector corresponding to the initial text information.

Optionally, in a fifth implementation manner of the second aspect of the present invention, the text document-based character recognition apparatus further includes:

the system comprises a sample acquisition module, a sample analysis module and a sample analysis module, wherein the sample acquisition module is used for acquiring a sample file in a text document format and converting the sample file into an image format to obtain a sample image;

the sample word segmentation module is used for extracting sample text information in the sample file and carrying out word segmentation processing on the sample text information to obtain a word segmentation result;

the sample segmentation module is used for performing segmentation processing on the sample image based on the word segmentation result to obtain a plurality of sample image segments;

the sample vectorization module is used for performing text vectorization processing on the word segmentation result to obtain a sample text vector, and performing image vectorization processing on the plurality of sample image segments to obtain a sample image vector;

and the sample labeling module is used for performing sequence labeling on the sample image vectors according to the sample text information corresponding to the sample text vectors to obtain labeling information corresponding to each sample image segment and generate a text image information base.

Optionally, in a sixth implementation manner of the second aspect of the present invention, the determining module is specifically configured to:

searching the text image information base for a target text vector corresponding to the initial text vector, and acquiring labeling information corresponding to the target text vector;

acquiring, from the text image information base, target image vectors corresponding to the initial image vectors according to the labeling information corresponding to the target text vector, and respectively judging whether the similarity between each initial image vector and its corresponding target image vector is smaller than a preset threshold value;

and if the similarity between each initial image vector and its corresponding target image vector is smaller than the preset threshold value, extracting the target text information corresponding to each initial image vector from the labeling information corresponding to the target image vector.

A third aspect of the present invention provides a text document-based character recognition device, comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the text document-based character recognition device to perform the text document-based character recognition method described above.

A fourth aspect of the present invention provides a computer-readable storage medium having instructions stored therein, which, when run on a computer, cause the computer to execute the above-mentioned text document-based character recognition method.

In the technical scheme provided by the invention, a text document to be recognized is received and preprocessed to obtain a standardized text image; based on a preset optical character recognition engine, character recognition is performed on the standardized text image to obtain initial text information corresponding to the standardized text image; the standardized text image is segmented according to the initial text information to obtain a plurality of text image segments; the plurality of text image segments are input into a trained image feature extraction model for image vectorization to obtain initial image vectors corresponding to the text image segments, and text vectorization is performed on the initial text information to obtain initial text vectors corresponding to the initial text information; and labeling information corresponding to the initial text vectors is acquired from a preset text image information base, and target text information corresponding to each initial image vector is determined according to the labeling information. In the embodiment of the invention, the server preprocesses the text document to be recognized to obtain a text image, and performs primary text recognition on the text image based on an optical character recognition engine to obtain initial text information; the server then segments the text image and performs vectorization processing on the segmented text image segments and the initial text information to obtain initial text vectors and initial image vectors; finally, the server acquires the labeling information of the initial image vectors according to the initial text vectors to obtain the target text information.

Drawings

FIG. 1 is a diagram of an embodiment of a text document-based character recognition method according to an embodiment of the present invention;

FIG. 2 is a diagram of another embodiment of the text document-based character recognition method according to an embodiment of the present invention;

FIG. 3 is a diagram of an embodiment of a text document-based character recognition apparatus according to an embodiment of the present invention;

FIG. 4 is a diagram of another embodiment of the text document-based character recognition apparatus according to an embodiment of the present invention;

FIG. 5 is a diagram of an embodiment of a text document-based character recognition device according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention provide a text document-based character recognition method, device, equipment, and storage medium, which are used for improving the accuracy of text recognition.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

For convenience of understanding, a specific flow of the embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a text document-based character recognition method in the embodiment of the present invention includes:

101. receiving a text document to be recognized, and preprocessing the text document to obtain a standardized text image;

It is to be understood that the execution subject of the present invention may be a text document based word recognition apparatus, a terminal, or a server, which is not limited herein. The embodiments of the present invention are described with a server as the execution subject.

In this embodiment, the text document to be recognized is a document in a text format, such as a Portable Document Format (PDF) document, a WORD document, or a spreadsheet (EXCEL) document. To improve the accuracy of character recognition in text documents, the present invention combines text recognition and image recognition technologies. The text documents to be recognized may comprise a plurality of documents: the server supports users in uploading text documents to be recognized in batches and preprocesses them in batches, making text recognition more efficient.

In this embodiment, to obtain a standardized text image, the server performs a series of preprocessing operations on the text document to be recognized. The server first recognizes the document format of the text document and pages the text document according to the recognized format; it then converts the text document into an image format according to the paging result, with each page corresponding to one text image. The paging result includes page number information that increments page by page, which the server can later use to splice the text recognition results in order, thereby ensuring the order of the text.

In this embodiment, the standardized text image is a text image that conforms to a preset format, such as image size, color, angle, and format. By standardizing the text document, the server can reduce the computational cost of subsequent models and improve character recognition efficiency.

102. Based on a preset optical character recognition engine, performing character recognition on the standardized text image to obtain initial text information corresponding to the standardized text image;

In this embodiment, an Optical Character Recognition (OCR) engine is an intelligent engine that analyzes and recognizes image files of text data to obtain text and layout information. The preset OCR engine uses a neural network model for character recognition, with a network structure combining a convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) recurrent neural network. Specifically, the server inputs the text image into the initialized CNN to extract image features, serializes the image features through the LSTM, and finally obtains the initial text information in the text image by classifying and labeling the sequence. This greatly improves character recognition efficiency and also improves the generalization capability of the model.
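The CNN, Bi-LSTM, and sequence-labeling stages described above can be sketched with trivial stand-in functions. All function names, the toy "image", and the label table below are hypothetical illustrations of the data flow, not the actual networks:

```python
def cnn_features(image_columns):
    """Stub CNN: one feature vector per image column (here, the column itself)."""
    return [tuple(col) for col in image_columns]

def bilstm_serialize(features):
    """Stub Bi-LSTM: pair each step with its left and right context."""
    n = len(features)
    return [(features[max(i - 1, 0)], features[i], features[min(i + 1, n - 1)])
            for i in range(n)]

def classify(sequence, label_table):
    """Stub classifier: map each context window to a character label."""
    return "".join(label_table.get(step, "?") for step in sequence)

# Tiny 2-column "image" and a hand-built label table, purely for illustration.
image = [(0, 1), (1, 0)]
feats = cnn_features(image)
seq = bilstm_serialize(feats)
table = {seq[0]: "h", seq[1]: "i"}
print(classify(seq, table))  # prints "hi"
```

In the real engine, each stage is a trained network layer; the point of the sketch is only the pipeline shape: features per image column, a serialized context-aware sequence, then per-step classification into characters.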

In this embodiment, the optical character recognition engine further introduces an attention mechanism. Specifically, after the server inputs the text image into the initialized CNN and extracts the image features, the attention model computes the attention weights of the new state based on the current and previous states of the recurrent neural network (RNN). The server then inputs the CNN features and weights into the RNN, and obtains the result, i.e., the initial text information, by encoding and decoding them. Because the text image is derived from a text document and has no complex image background, the character recognition accuracy of the preset OCR engine is high, providing a data guarantee for subsequent sequence labeling.

103. Segmenting the standardized text image according to initial text information corresponding to the standardized text image to obtain a plurality of text image segments;

In this embodiment, to improve character recognition efficiency, the server divides the text image into a plurality of text image segments. Specifically, the server cuts the text image according to the sentence-break information in the initial text information to obtain a text image segment for each sentence. For example, the server recognizes the division points of the text image according to the sentence-ending punctuation in the initial text information, and divides the standardized text image at those points to obtain the text image segments corresponding to different sentences. The server may also cut the text image according to the paragraph information in the initial text information, which is not specifically limited herein.
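The sentence-based cutting can be sketched in a few lines, under the simplifying (and hypothetical) assumption that each recognized character occupies one fixed-width column of the image:

```python
def split_points(initial_text, char_width=10, delimiters=".!?。！？"):
    """Pixel x-coordinates at which to cut the text image,
    one cut after each sentence-ending mark in the OCR output."""
    return [(i + 1) * char_width
            for i, ch in enumerate(initial_text) if ch in delimiters]

def segment(image_width, points):
    """Turn split points into (start, end) column ranges, one per sentence."""
    edges = [0] + points
    if not points or points[-1] != image_width:
        edges.append(image_width)
    return list(zip(edges, edges[1:]))

points = split_points("Hi. Ok.", char_width=10)
print(points)               # [30, 70]
print(segment(70, points))  # [(0, 30), (30, 70)]
```

A real implementation would take the character bounding boxes reported by the OCR engine instead of assuming fixed-width columns, but the cut-at-delimiter logic is the same.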

104. Inputting the plurality of text image segments into a trained image feature extraction model for image vectorization processing to obtain initial image vectors corresponding to the text image segments, and performing text vectorization processing on the initial text information to obtain initial text vectors corresponding to the initial text information;

In this embodiment, the image feature extraction model is a VisualBERT model trained on the basis of the bidirectional encoding (BERT) model, a BERT-based multi-modal application. During training, the server combines images and texts and can perform sentence-image relationship prediction to generate a vectorized representation of each image, i.e., an initial image vector. Word embedding vectors are used for the vectorization of text information: based on the relationships between words, they map words to a lower-dimensional vector space, producing an initial text vector for subsequent text recognition.

105. Acquiring the labeling information corresponding to the initial text vectors from a preset text image information base, and determining the target text information corresponding to each initial text vector according to the labeling information.

In this embodiment, the server obtains the labeling information corresponding to the initial text vector from the preset text image information base and searches for the target image vector corresponding to the initial text vector. It then looks up, according to the labeling information, the text information corresponding to the initial text vector in a preset text dictionary, and finally determines from that text information the target text information corresponding to the target image vector, i.e., the specific characters in the text image segment.

Further, the server stores the target text information in a blockchain database, which is not limited herein.

In the embodiment of the present invention, the server preprocesses the text document to be recognized to obtain a text image, and performs a first pass of text recognition on the text image based on an optical character recognition engine to obtain initial text information. The server then segments the text image, and vectorizes the segmented text image segments and the initial text information to obtain initial image vectors and an initial text vector. Finally, the server acquires the labeling information of the initial image vectors according to the initial text vector to obtain the target text information.

Referring to fig. 2, another embodiment of the text document-based character recognition method according to the embodiment of the present invention includes:

201. receiving a text document to be identified, and preprocessing the text document to obtain a standardized text image;

Specifically, the server receives the text document to be recognized and converts it into an image format to obtain an initialized text image. The server performs binarization processing on the initialized text image to obtain a black-and-white image, then traverses the gray value of each pixel in the black-and-white image and performs noise reduction on these gray values to obtain a noise-reduced image. Finally, the server corrects the noise-reduced image according to a preset image correction algorithm to obtain a standardized text image.

In this optional embodiment, the server converts the text document to be recognized into a standardized text image through a series of preprocessing operations for subsequent text image recognition. The preprocessing process includes image binarization, noise reduction, and correction, finally yielding a standardized text image that can be used for image text recognition.

Further, the server obtains a sample file in a text document format, and converts the sample file into an image format to obtain a sample image; the server extracts sample text information in the sample file and performs word segmentation processing on the sample text information to obtain word segmentation results; the server carries out segmentation processing on the sample image based on the word segmentation result to obtain a plurality of sample image fragments; the server performs text vectorization processing on the word segmentation result to obtain a sample text vector, and performs image vectorization processing on the plurality of sample image segments to obtain a sample image vector; and the server carries out sequence annotation on the sample image vectors according to the sample text information corresponding to the sample text vectors to obtain annotation information corresponding to each sample image segment and generate a text image information base.
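The library-construction loop above can be sketched as follows. The vectorization functions are trivial stubs standing in for the word-embedding and image-feature models, and all names are illustrative:

```python
def text_vector(words):
    """Stub text embedding: a deterministic toy vector per word list."""
    return tuple(hash(w) % 97 for w in words)

def image_vector(segment):
    """Stub image features: row sums of the image segment."""
    return tuple(sum(row) for row in segment)

def build_info_base(samples):
    """samples: list of (word segmentation result, matching image segments).
    Produces one labeled record per (word, image segment) pair."""
    base = []
    for words, image_segments in samples:
        for word, seg in zip(words, image_segments):
            base.append({
                "text": word,
                "text_vector": text_vector([word]),
                "image_vector": image_vector(seg),
                "label": word,  # sequence label: the ground-truth text
            })
    return base

base = build_info_base([(["hello", "world"], [[[1, 2]], [[3, 4]]])])
print([entry["label"] for entry in base])  # ['hello', 'world']
```

Each record ties a sample text vector, a sample image vector, and labeling information together, which is exactly what the later retrieval steps (text vector to label, label to image vector) rely on.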

In this optional embodiment, the server obtains sample files in the text document format in batches to generate a text image information base. The text image information base contains a large volume of sample text information, the sample text vectors corresponding to the sample text information, sample images, sample image segments, the sample image vectors corresponding to the sample image segments, and the labeling information corresponding to the sample image segments. The text image information base can be used for text information retrieval by image vector, so as to recognize image text.

202. Based on a preset optical character recognition engine, performing character recognition on the standardized text image to obtain initial text information corresponding to the standardized text image;

Specifically, the server inputs the text image into the input layer of a bidirectional long short-term memory (Bi-LSTM) recurrent neural network model for matrixing to obtain a first feature matrix of the text image; inputs the first feature matrix into the encoding layer of the model for feature extraction to obtain a second feature matrix; inputs the second feature matrix into the decoding layer of the model for feature decoding to obtain a third feature matrix; and inputs the third feature matrix into the fully connected layer of the model for feature classification to obtain a text feature classification label corresponding to the text image. The server then uses the text feature classification label as an index to search a preset text dictionary and obtain the initial text information corresponding to the standardized text image.

In this optional embodiment, the bidirectional long short-term memory recurrent neural network model (Bi-LSTM) combines the long short-term memory model (LSTM) with the recurrent neural network model (RNN) and adopts a bidirectional encoding mode, which gives it excellent performance on image and character recognition. The server first inputs the text image into the input layer of the model for matrixing, converting the text image into a numeric matrix to obtain the first feature matrix. The server then inputs the first feature matrix into the encoding layer for feature extraction, that is, performs convolution calculations on it with multiple convolution kernels to obtain the second feature matrix, and inputs the second feature matrix into the decoding layer for feature decoding to obtain the third feature matrix of the forward result. For the reverse pass, the server reverses the first feature matrix and passes it through the encoding and decoding layers to obtain the third feature matrix of the reverse result. Finally, the server inputs the third feature matrices of the forward and reverse results into the fully connected layer for feature classification to obtain the text feature classification label corresponding to the text image, so that the server can look up the initial text information corresponding to the text image through this label.

203. Segmenting the standardized text image according to initial text information corresponding to the standardized text image to obtain a plurality of text image segments;

the execution process of step 203 is similar to the execution process of step 103, and detailed description thereof is omitted here.

204. Inputting the plurality of text image segments into a trained image feature extraction model for image vectorization processing to obtain initial image vectors corresponding to the text image segments, and performing text vectorization processing on the initial text information to obtain initial text vectors corresponding to the initial text information;

Specifically, the server inputs the plurality of text image segments into the convolution layer of a bidirectional encoding BERT model for feature extraction to obtain a first feature vector corresponding to each text image segment; inputs each first feature vector into the excitation layer of the model for nonlinear mapping to obtain a plurality of second feature vectors; and inputs each second feature vector into the pooling layer of the model for dimension reduction to obtain the initial image vector corresponding to each text image segment.
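The pooling layer's dimension-reduction role can be illustrated directly with a toy average-pooling function (a stand-in for illustration, not the model's actual pooling):

```python
def mean_pool(vector, window=2):
    """Average-pool a feature vector with a fixed window,
    shrinking its length by the window factor."""
    return [sum(vector[i:i + window]) / window
            for i in range(0, len(vector), window)]

features = [2, 4, 6, 8]
print(mean_pool(features))  # [3.0, 7.0]: half the original dimension
```

In the BERT-style model this happens per channel over much larger feature maps, but the effect is the same: a shorter vector that preserves the aggregate feature signal at lower computational cost.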

In this optional embodiment, the image feature extraction model includes a bidirectional coding BERT model, the bidirectional coding BERT model includes a convolution layer, an excitation layer, and a pooling layer, and the server obtains an initial image vector corresponding to each text image segment after sequentially passing through each network layer of the bidirectional coding BERT model, and the initial image vector is used to represent a feature sequence of each text image segment.

Optionally, the server may also input the plurality of text image segments and the initial text information into a trained bimodal ViLBERT model, where the text image segments constitute one modality and the initial text information the other. The encoding layer of the ViLBERT model encodes each text image segment and its corresponding initial text information separately, obtaining an encoding result for each text image segment and an encoding result for its corresponding initial text information. After both modalities are encoded, the server performs co-attention calculations on the encoding results through the co-attention network of the ViLBERT model: each encoding result computes attention using its own Query together with the Value and Key of the other modality's encoding result, yielding the initial image vector corresponding to each text image segment and the initial text vector corresponding to the initial text information.

Specifically, the server performs word segmentation processing on the initial text information based on a preset word segmentation algorithm to obtain a word segmentation result; performs sparse vectorization on the word segmentation result based on a preset one-hot encoding algorithm to obtain a sparse vector corresponding to the initial text information; and maps that sparse vector into a dense vector based on a preset word embedding algorithm to obtain the initial text vector corresponding to the initial text information.
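Setting word segmentation aside (it is language-specific), the last two steps can be sketched with a toy vocabulary: one-hot encoding yields the sparse vector, and an embedding-table lookup yields the dense vector. The vocabulary and table values below are made up for illustration:

```python
def one_hot(word, vocab):
    """Sparse vector: all zeros except a 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def embed(sparse_vec, embedding_table):
    """Dense vector: the table row selected by the one-hot index."""
    return embedding_table[sparse_vec.index(1)]

vocab = ["text", "image", "vector"]
table = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]  # hypothetical learned weights

sparse = one_hot("image", vocab)
print(sparse)                # [0, 1, 0]
print(embed(sparse, table))  # [0.5, 0.5]
```

The sparse vector grows with vocabulary size, while the dense vector's length is fixed by the embedding dimension, which is why the mapping reduces dimensionality for realistic vocabularies.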

In this optional embodiment, the preset word segmentation algorithms include the statistical N-gram language model, the conditional random field (CRF) word segmentation algorithm, and the shortest-path word segmentation algorithm. One-hot encoding converts words into sparse vectors that a machine can process, and the server then maps these sparse vectors into lower-dimensional dense vectors based on a word embedding algorithm, obtaining the initial text vectors that represent the initial text information.

205. Searching a target text vector corresponding to the initial text vector in a text image information base, and acquiring marking information corresponding to the target text vector;

In this embodiment, the server searches the sample text vectors of the text image information base for the target text vector corresponding to the initial text vector, thereby obtaining the labeling information corresponding to the initial text vector. The server looks up a preset text dictionary with the labeling information to obtain the text information corresponding to the target text vector, i.e., the text information corresponding to the initial text vector. Through the correspondence between the target text vector and the target image vector, the server can then determine the target text information corresponding to the target image vector.

206. Acquiring target image vectors corresponding to the initial image vectors in a text image information base according to the labeling information corresponding to the target text vectors, and respectively judging whether the similarity between the target image vectors corresponding to the initial image vectors and the initial image vectors is smaller than a preset threshold value;

In this embodiment, the server reads the corresponding target image vector from the sample image vectors of the text image information base via the target text vector. The server then calculates the similarity between the target image vector and the initial image vector and judges whether it is smaller than the preset threshold value. If the similarity is smaller than the preset threshold, the target text information corresponding to the target image vector can be used as the final recognition result; otherwise it cannot, and the server continues to match target image vectors in the text image information base until all sample image vectors have been tried or a matching target image vector is found.
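The matching test can be sketched as follows. Since the text treats a smaller score as a better match, a distance-style measure (1 minus cosine similarity) is used here as the "similarity"; the threshold value is a hypothetical choice:

```python
from math import sqrt

def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for identical directions, up to 2.0 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def accept_match(initial_vec, target_vec, threshold=0.1):
    """True if the target image vector is close enough to the initial
    image vector for its text information to be the final result."""
    return cosine_distance(initial_vec, target_vec) < threshold

print(accept_match([1.0, 0.0], [0.99, 0.01]))  # near-identical: True
print(accept_match([1.0, 0.0], [0.0, 1.0]))    # orthogonal: False
```

Any vector distance (Euclidean, cosine) works here so long as the same measure and threshold are applied consistently across all candidate sample image vectors.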

207. And if the similarity between the target image vector corresponding to each initial image vector and each initial image vector is smaller than a preset threshold value, extracting the target text information corresponding to each initial image vector from the labeling information corresponding to the target image vector.

In this embodiment, through the labeling information corresponding to the target image vector, the server can find in a preset text dictionary the target text information carried by that labeling information, i.e., the specific characters in the text image segment. Optionally, after extracting the target text information, the server can compare it with the initial text information. When differing sentence pairs occur, the server performs natural language processing on them, determines the target sentence in each pair that conforms to the natural language expression mode, and outputs the target sentence to the user as the final recognition result.

In the embodiment of the invention, the server acquires the target text vector in the text image information base according to the initial text vector, acquires the target image vector according to the target text vector, and then determines the target text information by judging the difference between the initial image vector and the target image vector.

Having described the text document based character recognition method in the embodiment of the present invention above, the following describes a text document based character recognition apparatus in the embodiment of the present invention with reference to fig. 3. An embodiment of the text document based character recognition apparatus in the embodiment of the present invention includes:

the receiving module 301 is configured to receive a text document to be identified, and preprocess the text document to obtain a standardized text image;

the recognition module 302 is configured to perform character recognition on the standardized text image based on a preset optical character recognition engine to obtain initial text information corresponding to the standardized text image;

a segmentation module 303, configured to segment the standardized text image according to initial text information corresponding to the standardized text image to obtain a plurality of text image segments;

a vectorization module 304, configured to input the multiple text image segments into a trained image feature extraction model for image vectorization processing, so as to obtain initial image vectors corresponding to the text image segments, and perform text vectorization processing on the initial text information, so as to obtain initial text vectors corresponding to the initial text information;

the determining module 305 is configured to obtain labeling information corresponding to the initial text vector from a preset text image information base, and determine target text information corresponding to each initial text vector according to the labeling information.

Further, the target text information is stored in the blockchain database, which is not limited herein.

In the embodiment of the present invention, the server preprocesses the text document to be recognized to obtain a text image, and performs a first pass of text recognition on the text image based on an optical character recognition engine to obtain initial text information. The standardized text image is then segmented, and the segmented text image segments and the initial text information are vectorized to obtain initial image vectors and an initial text vector. Finally, the labeling information of the initial image vectors is acquired according to the initial text vector to obtain the target text information.

Referring to fig. 4, another embodiment of the text document-based character recognition apparatus according to the embodiment of the present invention includes:

the receiving module 301 is configured to receive a text document to be identified, and preprocess the text document to obtain a standardized text image;

the recognition module 302 is configured to perform character recognition on the standardized text image based on a preset optical character recognition engine to obtain initial text information corresponding to the standardized text image;

a segmentation module 303, configured to segment the standardized text image according to initial text information corresponding to the standardized text image to obtain a plurality of text image segments;

a vectorization module 304, configured to input the multiple text image segments into a trained image feature extraction model for image vectorization processing, so as to obtain initial image vectors corresponding to the text image segments, and perform text vectorization processing on the initial text information, so as to obtain initial text vectors corresponding to the initial text information;

the determining module 305 is configured to obtain labeling information corresponding to the initial text vector from a preset text image information base, and determine target text information corresponding to each initial text vector according to the labeling information.

Optionally, the receiving module 301 is specifically configured to:

receiving a text document to be identified, and converting the text document into an image format to obtain an initialized text image;

carrying out binarization processing on the initialized text image to obtain a black-and-white image;

traversing the gray value of each pixel point in the black-and-white image, and performing noise reduction processing on the gray value of each pixel point to obtain a noise reduction image;

and correcting the noise-reduced image according to a preset image correction algorithm to obtain a standardized text image.

Optionally, the optical character recognition engine includes a bidirectional long-and-short-term memory recurrent neural network model, and the recognition module 302 is specifically configured to:

inputting the text image into an input layer of the bidirectional long-time memory recurrent neural network model for matrixing to obtain a first feature matrix of the text image;

inputting the first feature matrix into a coding layer of the bidirectional long-time memory recurrent neural network model for feature extraction to obtain a second feature matrix;

inputting the second feature matrix into a decoding layer of the bidirectional long-time memory recurrent neural network model for feature decoding to obtain a third feature matrix;

inputting the third feature matrix into a full-connection layer of the bidirectional long-time memory recurrent neural network model for feature classification to obtain a text feature classification label corresponding to the text image;

and setting the text feature classification label as an index, and searching a preset text dictionary to obtain initial text information corresponding to the standardized text image.

Optionally, the image feature extraction model includes a bidirectional coding BERT model, and the vectorization module 304 is configured to:

inputting the text image segments into a convolution layer of the bidirectional coding BERT model for feature extraction to obtain a first feature vector corresponding to each text image segment;

inputting each first feature vector into an excitation layer of the bidirectional coding BERT model for nonlinear mapping to obtain a plurality of second feature vectors;

and inputting each second feature vector into a pooling layer of the bidirectional coding BERT model for dimension reduction processing to obtain an initial image vector corresponding to each text image segment.

Optionally, the vectorization module 304 is further configured to:

performing word segmentation processing on the initial text information based on a preset word segmentation algorithm to obtain a word segmentation result;

based on a preset one-hot coding algorithm, carrying out sparse vectorization processing on the word segmentation result to obtain a sparse vector corresponding to the initial text information;

and mapping the sparse vector corresponding to the initial text information into a dense vector based on a preset word embedding algorithm to obtain the initial text vector corresponding to the initial text information.

Optionally, the text document-based word recognition apparatus further includes:

the sample obtaining module 306 is configured to obtain a sample file in a text document format, and convert the sample file into an image format to obtain a sample image;

the sample word segmentation module 307 is configured to extract sample text information in the sample file, and perform word segmentation processing on the sample text information to obtain a word segmentation result;

a sample segmentation module 308, configured to perform segmentation processing on the sample image based on the word segmentation result to obtain a plurality of sample image segments;

a sample vectorization module 309, configured to perform text vectorization on the word segmentation result to obtain a sample text vector, and perform image vectorization on the multiple sample image segments to obtain a sample image vector;

and the sample labeling module 310 is configured to perform sequence labeling on the sample image vectors according to the sample text information corresponding to the sample text vectors, obtain labeling information corresponding to each sample image segment, and generate a text image information base.

Optionally, the determining module 305 is specifically configured to:

searching a target text vector corresponding to the initial text vector in the text image information base, and acquiring marking information corresponding to the target text vector;

acquiring target image vectors corresponding to the initial image vectors in the text image information base according to the labeling information corresponding to the target text vectors, and respectively judging whether the similarity between the target image vectors corresponding to the initial image vectors and the initial image vectors is smaller than a preset threshold value;

and if the similarity between the target image vector corresponding to each initial image vector and each initial image vector is smaller than a preset threshold value, extracting the target text information corresponding to each initial image vector from the labeling information corresponding to the target image vector.

In the embodiment of the invention, the server acquires the target text vector in the text image information base according to the initial text vector, acquires the target image vector according to the target text vector, and then determines the target text information by judging the difference between the initial image vector and the target image vector.

Figs. 3 and 4 above describe the text document based word recognition apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities; the text document based word recognition device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.

Fig. 5 is a schematic structural diagram of a text document based word recognition device 500 according to an embodiment of the present invention, which may include one or more processors (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) for storing applications 533 or data 532. The memory 520 and the storage media 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the text document based word recognition device 500. Still further, the processor 510 may be configured to communicate with the storage medium 530 to execute a series of instruction operations in the storage medium 530 on the text document based word recognition device 500.

The text document based word recognition device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, and FreeBSD. Those skilled in the art will appreciate that the device architecture illustrated in Fig. 5 does not constitute a limitation of text document based word recognition devices, which may include more or fewer components than those illustrated, a combination of some components, or a different arrangement of components.

The invention also provides a text document based word recognition device, which comprises a memory and a processor. Computer-readable instructions are stored in the memory; when executed by the processor, the computer-readable instructions cause the processor to execute the steps of the text document based word recognition method in the above embodiments.

The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, having instructions stored therein which, when run on a computer, cause the computer to perform the steps of the text document based word recognition method.

Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.

The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each of which contains information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
