Unstructured data document processing method and related equipment

Document No.: 1832043    Publication date: 2021-11-12

Reading note: This technology, "非结构化数据文档处理方法及相关设备" (Unstructured data document processing method and related equipment), was created by 张耀宏, 李艾玲, 魏宁霞, 张华�, 贺桂萍, 党引, 刘莉莉, 刘畅, 陈晓双, 周旭东 and 陆春 on 2021-05-31. Its main content is as follows: The disclosure provides an unstructured data document processing method and related equipment. The method comprises the following steps: performing character recognition on an unstructured data document by using a deep-learning-based character recognition model to obtain text content; extracting key information from the text content by using an information extraction algorithm; and converting the key information into structured data to be stored or output. This unstructured data document processing method realizes the extraction and conversion of unstructured data through digital means, which can reduce the workload of manual data processing and save human resource costs.

1. An unstructured data document processing method, comprising:

performing character recognition on the unstructured data document by using a deep-learning-based character recognition model to obtain text content;

extracting key information from the text content by adopting an information extraction algorithm;

and converting the key information into structured data to be stored or output.

2. The method of claim 1, wherein performing character recognition on the unstructured data document by using the character recognition model comprises:

detecting a character area in the unstructured data document through a text detection model;

and performing character recognition on the character area through the character recognition model.

3. The method of claim 2, wherein the text detection model comprises one of a Faster R-CNN model, a fully convolutional network (FCN) model, and a connectionist text proposal network (CTPN) model.

4. The method of claim 2, wherein the character recognition model comprises:

a combination of a convolutional neural network (CNN), a recurrent neural network (RNN), and connectionist temporal classification (CTC); or

a combination of the CNN, a Seq2Seq model, and an attention mechanism.

5. The method of any one of claims 1 to 4, wherein extracting key information from the textual content using an information extraction algorithm comprises:

extracting information entities from the text content as the key information through character-pattern-based extraction, grammar-pattern-based extraction, or semantic-pattern-based extraction.

6. The method of any one of claims 1 to 4, wherein extracting key information from the textual content using an information extraction algorithm comprises:

extracting entity relationships from the text content as the key information through an extraction model based on supervised learning or an extraction model based on distantly supervised learning.

7. The method of any one of claims 1 to 4, wherein extracting key information from the textual content using an information extraction algorithm comprises:

extracting information entities from the text content as the key information through a deep-learning-based extraction model.

8. The method of claim 7, wherein the deep-learning-based extraction model comprises a combination of a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF).

9. An unstructured-data-document processing apparatus, comprising:

the character recognition module is used for performing character recognition on the unstructured data document by using a deep-learning-based character recognition model to obtain text content;

the information extraction module is used for extracting key information from the text content by adopting an information extraction algorithm;

and the conversion module is used for converting the key information into structured data to be stored or output.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable by the processor, the processor implementing the method of any one of claims 1 to 8 when executing the computer program.

Technical Field

The present disclosure relates to the field of data processing, and in particular, to a method and related device for processing an unstructured data document.

Background

Character recognition is a branch of the computer vision research field and an application of pattern recognition and artificial intelligence; it uses optical and computer technology to recognize characters printed or handwritten on paper and convert them into a form that can be accepted by a computer and understood by people.

Information extraction creates a structured representation of selected information from text and then stores the converted structured or semi-structured information in a database for user query or further analysis.

However, conventional character recognition technology still suffers from insufficient recognition accuracy when the image quality is poor.

Disclosure of Invention

In view of the above, the present disclosure is directed to an unstructured data document processing method and related apparatus.

Based on the above purpose, the unstructured data document processing method provided by the present disclosure includes:

performing character recognition on the unstructured data document by using a deep-learning-based character recognition model to obtain text content;

extracting key information from the text content by adopting an information extraction algorithm;

and converting the key information into structured data to be stored or output.

Based on the same inventive concept, the present disclosure also provides an unstructured data document processing apparatus, comprising:

the character recognition module is used for performing character recognition on the unstructured data document by using a deep-learning-based character recognition model to obtain text content;

the information extraction module is used for extracting key information from the text content by adopting an information extraction algorithm;

and the conversion module is used for converting the key information into structured data to be stored or output.

Based on the same inventive concept, the present disclosure also provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable by the processor, wherein the processor implements the unstructured data document processing method provided by the present disclosure when executing the computer program.

From the above, it can be seen that the unstructured data document processing method and related equipment provided by the present disclosure can effectively analyze and apply massive unstructured data, bring unstructured data comprehensively into the audit scope, provide valuable audit information for auditors, provide powerful data support for audit concerns, and improve the breadth and depth of informatized audit work. Meanwhile, the extraction and conversion of unstructured data are achieved through digital means, which can reduce the workload of manual data processing, save human resource costs, and further improve audit efficiency.

Drawings

In order to more clearly illustrate the technical solutions in the present disclosure or in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description are merely embodiments of the present disclosure, and that those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram of a method of unstructured data document processing according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of text recognition in an embodiment of the present disclosure;

FIG. 3 is a flow chart of information extraction using a supervised learning based extraction model in an embodiment of the present disclosure;

FIG. 4 is a flowchart of information extraction using an extraction model based on deep learning according to an embodiment of the present disclosure;

FIG. 5 is an architecture diagram of a deep-learning-based extraction model according to an embodiment of the present disclosure;

FIG. 6 is a block diagram of an unstructured data document processing apparatus of an embodiment of the present disclosure;

fig. 7 is a block diagram of an electronic device of an embodiment of the disclosure.

Detailed Description

For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.

It is to be noted that technical terms or scientific terms used in the embodiments of the present disclosure should have a general meaning as understood by those having ordinary skill in the art to which the present disclosure belongs, unless otherwise defined. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items.

Optical Character Recognition (OCR) refers to the process of examining characters printed on paper with an electronic device, determining their shapes by detecting dark and light patterns, and then translating the shapes into computer text using a character recognition method. For printed characters, it is a technology that optically converts the characters in a paper document into an image file of a black-and-white dot matrix, and converts the characters in the image into a text format through recognition software for further editing and processing by word-processing software. How to correct errors or use auxiliary information to improve recognition accuracy is the most important issue in OCR.

Conventional OCR recognition faces the following problems. First, image quality is poor: in many scenarios the text pictures to be recognized are of poor quality, with serious interfering lines, skew, dim lighting or over-exposure. Second, the accuracy requirement is particularly high: in some scenarios, users have especially high accuracy requirements for numerical values; for example, the required accuracy for characters such as tax rate, amount and currency in a text is 100%. Third, the content to be recognized is complex and diverse: the text may contain different fonts and colors, English letters and digits resembling decimal points, special characters, special connecting symbols, and numerical content that is easily missed, all of which make recognition difficult. Fourth, the variety of languages is wide: with the development of globalization, OCR technology should be able to recognize a variety of languages and character sets.

Meanwhile, the conversion from unstructured data to structured data can be realized through OCR recognition technology and information extraction technology. However, when faced with a huge amount of unstructured audit data, how to store, query, analyze, mine and utilize these massive information resources is critical. On the one hand, whether unstructured data is processed relates to the comprehensiveness and integrity of the audit content and directly affects the quality of internal audit. On the other hand, whether unstructured data can be processed effectively directly affects audit efficiency and audit effect.

In order to solve the above problems, the applicant of the present disclosure proposes an unstructured data document processing method and related equipment, which first uses OCR technology to recognize the uploaded unstructured data as text content, then uses information extraction technology to extract, according to preset rules, the text content corresponding to the target document from the recognized text, and finally extracts key information from that text content according to preset regular expressions, converts the obtained result into a table and outputs it to a database. The extraction and conversion of unstructured data are realized through digital means, which can reduce the workload of manual data processing, save human resource costs, and further improve audit efficiency.

As an alternative embodiment, referring to fig. 1, the unstructured-data document processing method provided by the present disclosure includes:

Step S101, performing character recognition on the unstructured data document by using a deep-learning-based character recognition model to obtain text content.

In this step, various unstructured data documents are first uploaded manually or automatically; documents can be uploaded individually or in batches. For example, construction contract documents are uploaded, and the document format is at least one of multiple formats including doc, docx and pdf.

In this step, the content of the whole document is obtained and recognized by using OCR technology. OCR technology includes conventional OCR technology and deep-learning-based OCR technology; the latter uses the capability of the model algorithm to automatically detect the area where the characters in the text are located, distinguish the characters from the background, and obtain the category and position information of the text.
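By way of illustration only, the following sketch uses the open-source pytesseract and pdf2image packages as a stand-in for the deep-learning OCR model described above; the file name and language code are assumptions for the example, not part of the present disclosure.

```python
# Minimal sketch of step S101: recognize an uploaded document as text content.
from pdf2image import convert_from_path   # assumed available; converts PDF pages to images
from PIL import Image
import pytesseract

def recognize_document(path: str) -> str:
    """Return the recognized text content of an uploaded unstructured document."""
    if path.lower().endswith(".pdf"):
        pages = convert_from_path(path)            # one PIL image per page
    else:
        pages = [Image.open(path)]
    # Recognize each page; 'chi_sim' covers simplified Chinese contract text.
    return "\n".join(pytesseract.image_to_string(p, lang="chi_sim") for p in pages)

text_content = recognize_document("construction_contract.pdf")  # hypothetical file
```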

Step S102, extracting key information from the text content by using an information extraction algorithm.

In this step, corresponding regular expressions are set in advance or corpus labeling is performed according to the key information to be extracted. When processing the text content recognized in step S101, the text content corresponding to the document where the key information is located is first identified and extracted, and then the key information is extracted from the extracted text content.

Taking construction contract documents as an example, the content corresponding to construction-contract-type documents is first extracted from the recognized text content; then, with information such as the contract name, project name and construction period as the key information, regular expressions are written or corpus labeling is performed for the key information, and the key information is extracted from the content of the construction-contract-type documents accordingly.
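By way of illustration only, the following sketch shows how such regular-expression-based key information extraction might look in Python; the field names and patterns are assumptions for the example, not the rules actually used by the present disclosure.

```python
import re

# Illustrative regular expressions for key contract fields (assumed patterns).
KEY_INFO_PATTERNS = {
    "contract_name":       re.compile(r"合同名称[:：]\s*(\S+)"),
    "project_name":        re.compile(r"工程名称[:：]\s*(\S+)"),
    "construction_period": re.compile(r"工期[:：]\s*(\d+\s*(?:天|日历天))"),
}

def extract_key_info(contract_text: str) -> dict:
    """Match each pattern against the recognized contract text."""
    result = {}
    for field, pattern in KEY_INFO_PATTERNS.items():
        match = pattern.search(contract_text)
        result[field] = match.group(1) if match else None
    return result

sample = "合同名称：某大楼施工总承包合同 工程名称：某大楼 工期：365 日历天"
print(extract_key_info(sample))
```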

Step S103, converting the key information into structured data to be stored or output.

In this step, the extracted unstructured data is converted into structured data that can be stored directly in a database, providing more data support for the business system; the data can also be exported in the form of a table, so that staff can perform operations such as offline editing or data analysis.

Still taking the construction contract document as an example, the extracted key information is converted into a table and output to a foreground page; at the same time, the text content of the document where the key information is located, that is, the contract text to which the key information belongs, is output to the foreground page together for processing by staff. The extracted key content is also written into a database for storage, so that staff can obtain the key information of a specified construction contract through queries.
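By way of illustration only, the following sketch shows one possible realization of step S103 using the Python standard library (sqlite3 and csv); the table name, column names and file paths are assumptions for the example.

```python
import csv
import sqlite3

def store_and_export(key_info: dict, db_path: str = "audit.db",
                     csv_path: str = "contracts.csv") -> None:
    """Persist extracted key information and export it as a table."""
    # Store in a relational database so the business system can query it.
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS contract_info
                    (contract_name TEXT, project_name TEXT, construction_period TEXT)""")
    conn.execute("INSERT INTO contract_info VALUES (?, ?, ?)",
                 (key_info.get("contract_name"), key_info.get("project_name"),
                  key_info.get("construction_period")))
    conn.commit()
    conn.close()
    # Export as a table (CSV) for offline editing or data analysis.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(key_info.keys()))
        writer.writeheader()
        writer.writerow(key_info)
```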

As an alternative embodiment, referring to fig. 2, performing character recognition on the unstructured data document in step S101 further includes:

step S201, a text detection model is used to detect the area where the text is located.

In this step, the text detection model includes one of a Faster R-CNN model, an FCN model, and a CTPN model.

The Faster R-CNN model integrates candidate box selection, feature extraction, classification and detection box regression into one network: it generates candidate regions for the image, extracts features, judges the feature category and corrects the positions of the candidate boxes. It can classify character regions and locate their positions accurately: the classification of each candidate region is computed from the region feature map, and the final accurate position of the detection box is obtained through a further region regression.

The FCN model replaces the fully connected network after the ROI pooling layer with a position-sensitive distributed convolutional network, which solves the problem in the Faster R-CNN model that the structure after the ROI pooling layer has to be run once for every sample region and is therefore time-consuming; feature sharing is achieved throughout the whole network, and the contradiction between the translation invariance required for object classification and the translation variance required for object detection is resolved.

The CTPN model is currently the most widely used text detection model. Its basic assumption is that a single character is easier to detect than a text line with a higher degree of heterogeneity, so R-CNN-like detection is first performed on single characters; a bidirectional LSTM is then added to the detection network so that the sequence of detection results provides the context features of the text, and multiple characters can be combined to obtain a text line.

Step S202, character recognition is performed on the area by using the character recognition model.

In this step, the character recognition model is a combination of a CNN model, an RNN model and a CTC model, or a combination of the CNN model, a Seq2Seq model and an attention mechanism. Both combinations employ a CNN encoder to extract the essential features of the image.

The combination of the CNN, RNN and CTC models is currently a popular character recognition model and can be used to recognize longer text sequences. The CNN features are used as input, and a bidirectional LSTM performs sequence processing, which greatly improves character recognition efficiency and also improves the generalization ability of the model. A feature map is first obtained and classified, and the result is then translated through CTC to obtain the output.

The combination of the CNN model, the Seq2Seq model and the attention mechanism introduces an attention mechanism, taking the CNN features as input: the attention model computes the attention weight of the new state from the state of the RNN model and the attention weight of the previous state, the CNN features and the weights are then fed into the RNN model, and the result is obtained through encoding and decoding.
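By way of illustration only, the following PyTorch sketch shows a minimal realization of the first combination (CNN + bidirectional LSTM + CTC); the layer sizes and character-set size are assumptions for the example, not the configuration of the present disclosure.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN feature extractor + bidirectional LSTM + per-frame class scores,
    trained with CTC loss (a common realization of the CNN+RNN+CTC combination)."""
    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_height = img_height // 4
        self.rnn = nn.LSTM(128 * feat_height, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, num_classes)          # class scores per time step

    def forward(self, images):                             # images: (B, 1, H, W)
        feats = self.cnn(images)                           # (B, C, H/4, W/4)
        b, c, h, w = feats.shape
        feats = feats.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one frame per image column
        seq, _ = self.rnn(feats)                           # (B, W/4, 512)
        return self.fc(seq).log_softmax(dim=-1)            # CTC expects log-probabilities

# Training would use nn.CTCLoss on (T, B, num_classes) inputs; decoding collapses
# repeated labels and removes the blank symbol to obtain the text line.
model = CRNN(num_classes=5000)                             # e.g. character set size + blank (assumed)
logits = model(torch.randn(2, 1, 32, 128))                 # -> (2, 32, 5000)
```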

As an alternative embodiment, the step S102 of obtaining the key information from the text content by using information extraction technology may be implemented by several methods, including: extraction based on character patterns, extraction based on grammar patterns, extraction based on semantic patterns, and extraction models based on supervised learning, distantly supervised learning and deep learning, respectively.

Extraction based on character patterns is the most direct extraction method: the character patterns expressing a specific relation are written as a group of regular expressions, and data extraction is then realized by matching the input text. This approach requires high similarity between the text and the pattern, and is therefore often used to extract content with a fixed description pattern or text generated from a fixed template. However, the predefined character pattern fails if the text content of the template to be matched changes even slightly.

Extraction based on grammar patterns describes the extraction pattern by introducing the grammatical information (including lexical and syntactic information) contained in the text, which noticeably enhances the expressive power of the pattern and further improves the accuracy and recall of the extraction. This approach is more expressive than extraction based on character patterns while still ensuring the correctness of pattern matching. Moreover, grammar patterns depend only on human grammatical knowledge, and most people can construct such patterns easily, so the acquisition cost of grammar patterns is relatively low. Grammatical patterns are also ubiquitous in all types of languages and are applicable to different types of text.

Extraction based on semantic patterns introduces semantic elements such as concepts on the basis of extraction based on grammar patterns, thereby expressing the scope to which a pattern applies more accurately and enhancing the descriptive power of the pattern.

As an alternative embodiment, taking contract-type documents as an example, the contract signing date, contract name and amount are extracted. First, for the information entities to be extracted, corresponding patterns are constructed according to the way the text expresses them; because the descriptions in the document texts analyzed for extracting contract-related entity information have strong commonality, the accuracy and coverage of extraction based on character patterns can meet the requirements. The input content is matched against the constructed character patterns, and the corresponding information entities are stored.

In extraction based on character patterns, the input text is treated as a character sequence, character patterns are constructed, and the character patterns of a type of relation are expressed as a group of regular expressions. Table 1 gives examples of information entities and the corresponding regular expressions:

TABLE 1 information entities and corresponding regular expressions

As an alternative embodiment, referring to fig. 3, the information extraction using the extraction model based on supervised learning provided by the present disclosure includes:

step S301, data preparation.

In this step, all the data to be analyzed is collected as the basic analysis data.

Step S302, labeling the corpus.

In this step, a number of representative files are selected from the collected files, the key information to be extracted from them (such as contract name, contract number and payment method) is labeled manually, and training data is thus provided for the subsequent construction of the text extraction model.

Step S303, training a text extraction model.

In this step, an information extraction model is constructed by using machine-learning-based information extraction technology on the corpus data labeled in the previous step.

Step S304, model-based information extraction.

In this step, the trained information extraction model is used to extract information from all collected files, the unstructured text data is converted into structured data, and the key analysis indicators are extracted for subsequent analysis.
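By way of illustration only, the following sketch shows one possible concrete realization of steps S302 to S304 using the third-party sklearn-crfsuite package; the tiny labeled corpus, the token features and the tag names are assumptions for the example.

```python
import sklearn_crfsuite  # third-party CRF package, assumed available

# Step S302: a hypothetical hand-labeled corpus — one feature dict per token
# and one BIO tag per token, here for the "contract name" entity.
X_train = [[{"word": "合同"}, {"word": "名称"}, {"word": "："},
            {"word": "某"}, {"word": "工程"}, {"word": "施工"}, {"word": "合同"}]]
y_train = [["O", "O", "O", "B-CONTRACT", "I-CONTRACT", "I-CONTRACT", "I-CONTRACT"]]

# Step S303: train the extraction model on the labeled corpus.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

# Step S304: run the trained model over every collected file (one sentence shown).
X_new = [[{"word": w} for w in ["合同", "名称", "：", "某", "大楼", "施工", "合同"]]]
print(crf.predict(X_new))
```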

As an alternative embodiment, the present disclosure provides a deep-learning-based extraction model, including an extraction model based on RNN, CNN or an attention mechanism. Referring to fig. 4, information extraction using the deep-learning-based extraction model includes:

step S401, corpus participles are trained.

In this step: Chinese word segmentation is a basic step of Chinese text processing; word segmentation must be performed first in natural language processing, and the segmentation quality directly affects the effect of the model.

Step S402, constructing features.

In this step, features are constructed from the word segmentation results and parts of speech obtained in step S401, including: the length of the word; the part of speech of the word; whether the word is a number; whether the word is English; the one or two words before and after the word; the parts of speech of the one or two words before and after the word; and the lengths of the one or two words before and after the word.
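By way of illustration only, the following sketch constructs the features listed above in Python, assuming the third-party jieba package for word segmentation and part-of-speech tagging; the sample sentence is an assumption for the example.

```python
import jieba.posseg as pseg  # third-party Chinese segmentation + POS tagging, assumed available

def build_features(text: str):
    """Build the per-word features listed above from segmented words and POS tags."""
    pairs = list(pseg.cut(text))                     # (word, part-of-speech) pairs
    words = [p.word for p in pairs]
    tags = [p.flag for p in pairs]
    feats = []
    for i, (w, t) in enumerate(zip(words, tags)):
        f = {
            "length": len(w),                        # length of the word
            "pos": t,                                # part of speech of the word
            "is_digit": w.isdigit(),                 # whether the word is a number
            "is_english": w.isascii() and w.isalpha(),  # whether the word is English
        }
        for offset in (-2, -1, 1, 2):                # the one or two words before and after
            j = i + offset
            key = f"word[{offset:+d}]"
            if 0 <= j < len(words):
                f[key] = words[j]
                f[key + ".pos"] = tags[j]
                f[key + ".length"] = len(words[j])
            else:
                f[key] = "<PAD>"
        feats.append(f)
    return feats

print(build_features("合同编号 HT-2021-001 付款方式为分期付款"))
```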

Step S403, labeling the corpus.

In this step, each word in the bid document is labeled by using the BIEO labeling method. Taking the project name as an example, the first word of the project name in the text is labeled B, the final word is labeled E, the middle words are labeled I, and the remaining words are labeled O.
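By way of illustration only, a BIEO labeling of a hypothetical segmented sentence might look as follows, with the project name "某大楼工程" spanning the B, I and E labels.

```python
# One BIEO label per segmented word (the sentence and segmentation are hypothetical).
words  = ["工程", "名称", "：", "某", "大楼", "工程", "，", "工期", "365", "天"]
labels = ["O",   "O",   "O", "B", "I",   "E",   "O", "O",  "O",  "O"]
assert len(words) == len(labels)
```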

Step S404, training a model.

In this step, a long short-term memory (LSTM) recurrent neural network is used to build and train the model. Referring to fig. 5, the resulting model includes:

an input layer: a group of words is produced after the input text is segmented, and each word is represented as a vector; the vector of each word is looked up from a pre-trained word2vec model and passed to the next layer.

a BiLSTM layer, which comprises two LSTMs, one processing the input sequence forward and one processing it in reverse, so that the model can simultaneously consider the features extracted by the forward pass and by the backward pass, that is, both past and future features. The input word vector sequence (w0, w1, w2, ..., wn) is fed to each time step of the bidirectional LSTM; the hidden state sequence (h0, h1, ..., hn) output by the forward LSTM and the hidden state sequence (r0, r1, ..., rn) output by the backward LSTM are concatenated position by position to obtain the complete hidden state sequence. The two LSTMs are then connected to the same linear layer, which reduces the dimension of the hidden state vectors to k, where k is the number of labels in the corpus tag set, giving the automatically extracted sentence features, recorded as C = (c0, c1, ..., cn); each element is the score of a word being classified to the j-th label. If softmax were applied at this point, each position would be classified into one of the k classes independently, so the label information of neighboring positions could not be used when labeling each position; the result is therefore input to a CRF layer for labeling.

a CRF layer, which performs sentence-level sequence labeling on the result of the layer above. The BiLSTM layer learns the context information and outputs the result to the CRF layer through a hidden layer. The CRF layer takes as input the score of each word for each class and, through sequence labeling, selects the sequence with the highest predicted score as the best answer. The input corpus is processed using the BIO notation, where B denotes the start position of an entity, I denotes the middle or end position of an entity, and O denotes a non-entity. For example, when labeling the contract content, the beginning can be labeled B-CONTRACT and I-CONTRACT denotes the middle or end position of the contract content entity; the beginning of the payment method can be labeled B-PAY, and I-PAY denotes the middle or end position of the payment method.
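By way of illustration only, the following PyTorch sketch shows a minimal BiLSTM-CRF model of the kind described above, assuming the third-party pytorch-crf package for the CRF layer; the vocabulary size, tag count and random inputs are assumptions for the example.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party pytorch-crf package, assumed available

class BiLSTMCRF(nn.Module):
    """Embedding -> bidirectional LSTM -> linear layer (k = number of labels) -> CRF."""
    def __init__(self, vocab_size: int, num_tags: int, emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # pre-trained word2vec vectors can be loaded here
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_tags)         # per-word score for each label
        self.crf = CRF(num_tags, batch_first=True)        # sentence-level sequence labeling

    def loss(self, word_ids, tags):
        emissions = self.fc(self.lstm(self.emb(word_ids))[0])
        return -self.crf(emissions, tags)                 # negative log-likelihood

    def decode(self, word_ids):
        emissions = self.fc(self.lstm(self.emb(word_ids))[0])
        return self.crf.decode(emissions)                 # best label sequence per sentence

model = BiLSTMCRF(vocab_size=5000, num_tags=5)            # sizes are illustrative assumptions
ids  = torch.randint(0, 5000, (2, 7))
tags = torch.randint(0, 5, (2, 7))
print(model.loss(ids, tags).item(), model.decode(ids))
```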

Step S405, model prediction.

In this step, a new bid document is input into the established model and its data is labeled; the words labeled B, I and E form the project name to be found.
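By way of illustration only, the following sketch collects the words predicted as B, I or E into the project name; the sample words and labels are assumptions for the example.

```python
def extract_entity(words, labels):
    """Join the words whose predicted labels are B, I or E into the project name."""
    return "".join(w for w, t in zip(words, labels) if t in ("B", "I", "E"))

# Hypothetical prediction for a new bid document fragment.
print(extract_entity(["项目", "名称", "：", "某", "大楼", "工程"],
                     ["O", "O", "O", "B", "I", "E"]))   # -> 某大楼工程
```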

As an optional embodiment, after the text content of the unstructured data document is obtained in step S101, a preset text comparison algorithm may further be used to compare two or more unstructured data documents: the text content of the different unstructured data documents obtained through character recognition is input into a computer program generated according to the text comparison algorithm, and the computer program is executed to compare the text content and output a comparison result.
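By way of illustration only, the following sketch uses the Python standard library difflib as a stand-in for the preset text comparison algorithm; the sample strings are assumptions for the example.

```python
import difflib

def compare_documents(text_a: str, text_b: str):
    """Compare the recognized text of two unstructured data documents."""
    ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()   # overall similarity
    diff = difflib.unified_diff(text_a.splitlines(), text_b.splitlines(),
                                fromfile="document_a", tofile="document_b", lineterm="")
    return ratio, "\n".join(diff)

similarity, report = compare_documents("合同金额：100万元", "合同金额：120万元")
print(similarity)
print(report)
```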

As an optional embodiment, after the text content of the unstructured data document is obtained in step S101, the text content may further be checked against preset audit keywords, where the audit keywords include the contract name, the contract amount, or the contract amount recorded in Chinese capital numerals or Arabic numerals. When checking the text content, in addition to checking the text content of a single document against the selected audit keywords, the text content corresponding to multiple documents can also be checked against each other; for example, whether the contract amount and the name of the winning bidder recorded in the contract and in the bid-winning notice of the same project are consistent, or whether the amounts recorded in the invoices and in the contract are consistent, and so on. An audit rule is configured according to the audit keywords: for an unstructured data document whose text content conforms to the audit keywords, the audit result is output and the text content and the corresponding unstructured data document are exported; otherwise, a result indicating that the audit is not passed is output.
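By way of illustration only, the following sketch shows a simple consistency check between the key information extracted from a contract and from the corresponding bid-winning notice; the field names and audit rule are assumptions for the example.

```python
def audit_consistency(contract: dict, award_notice: dict,
                      keys=("contract_amount", "winning_bidder")):
    """Check whether fields extracted from the contract and from the bid-winning
    notice of the same project are consistent, per the configured audit keywords."""
    mismatches = {k: (contract.get(k), award_notice.get(k))
                  for k in keys if contract.get(k) != award_notice.get(k)}
    if mismatches:
        return {"passed": False, "mismatches": mismatches}
    # Audit passed: the caller may export the text content and source documents.
    return {"passed": True, "mismatches": {}}

print(audit_consistency({"contract_amount": "100万元", "winning_bidder": "某建筑公司"},
                        {"contract_amount": "120万元", "winning_bidder": "某建筑公司"}))
```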

The unstructured data document processing method provided by the present disclosure can effectively analyze and apply unstructured data, bring unstructured data comprehensively into the audit scope, provide valuable audit information for auditors, provide powerful data support for audit concerns, and improve the breadth and depth of informatized audit work. Meanwhile, the extraction and conversion of unstructured data are achieved through digital means, which can reduce the workload of manual data processing, save human resource costs, and further improve audit efficiency.

It should be noted that the method of the embodiments of the present disclosure may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may only perform one or more steps of the method of the embodiments of the present disclosure, and the devices may interact with each other to complete the method.

It should be noted that the above describes some embodiments of the disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Based on the same inventive concept, the disclosure also provides an unstructured data document processing device corresponding to any of the above embodiments.

Referring to fig. 6, the unstructured-data-document processing apparatus includes:

the character recognition module 601 is configured to perform character recognition on the unstructured data document by using a character recognition model based on deep learning to obtain character contents.

An information extraction module 602, configured to extract key information from the text content by using an information extraction algorithm.

And a conversion module 603, configured to convert the key information into structured data for storage or output.

For convenience of description, the above apparatus is described as being divided into various modules by function. Of course, when implementing the present disclosure, the functions of the various modules may be implemented in one or more pieces of software and/or hardware.

The apparatus of the foregoing embodiment is used to implement the corresponding unstructured data document processing method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.

Based on the same inventive concept, corresponding to any of the above embodiments, the present disclosure further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the program, the unstructured-data document processing method according to any of the above embodiments is implemented.

Fig. 7 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.

The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.

The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.

The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.

The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).

Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.

It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.

The electronic device of the foregoing embodiment is used to implement the corresponding unstructured data document processing method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.

Based on the same inventive concept, corresponding to any of the above-described embodiment methods, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the unstructured-data-document processing method according to any of the above embodiments.

Computer-readable media of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the unstructured data document processing method described in any of the above embodiments, and have the beneficial effects of corresponding method embodiments, which are not described herein again.

It should be noted that the embodiments of the present disclosure can be further described in the following ways:

an unstructured data document processing method, comprising: performing character recognition on the unstructured data document by using a deep-learning-based character recognition model to obtain text content; extracting key information from the text content by using an information extraction algorithm; and converting the key information into structured data to be stored or output.

Optionally, performing character recognition on the unstructured data document by using the character recognition model includes: detecting a character area in the unstructured data document through a text detection model; and performing character recognition on the character area through the character recognition model.

Optionally, the text detection model includes one of a Faster R-CNN model, a fully convolutional network (FCN) model, and a connectionist text proposal network (CTPN) model.

Optionally, the character recognition model includes: a combination of a convolutional neural network (CNN), a recurrent neural network (RNN) and connectionist temporal classification (CTC); or a combination of the CNN, a Seq2Seq model and an attention mechanism.

Optionally, extracting key information from the text content by using an information extraction algorithm includes: extracting information entities from the text content as the key information through character-pattern-based extraction, grammar-pattern-based extraction, or semantic-pattern-based extraction.

Optionally, extracting key information from the text content by using an information extraction algorithm includes: extracting entity relationships from the text content as the key information through an extraction model based on supervised learning or an extraction model based on distantly supervised learning.

Optionally, extracting key information from the text content by using an information extraction algorithm includes: extracting information entities from the text content as the key information through a deep-learning-based extraction model.

Optionally, the deep-learning-based extraction model comprises a combination of a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF).

Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the present disclosure, the technical features in the above embodiments or in different embodiments may also be combined, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present disclosure as described above, which are not provided in detail for the sake of brevity.

In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the present disclosure, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present disclosure are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that the embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.

While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.

The disclosed embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like that may be made within the spirit and principles of the embodiments of the disclosure are intended to be included within the scope of the disclosure.
