Legal information extraction model, method, system, device and auxiliary system

Document No.: 1170283 Publication date: 2020-09-18

Reading note: this technology, "Legal information extraction model, method, system, device and auxiliary system" (法律信息抽取模型及方法及系统及装置及辅助系统), was designed on 2020-08-11 by 翁洋 (Weng Yang), 李鑫 (Li Xin), 王竹 (Wang Zhu) and other inventors who requested that their names not be disclosed. Abstract: The invention discloses a legal information extraction model, method, system, device and auxiliary system, and relates to the field of natural language processing. The method comprises: defining the entity types to be marked in referee documents (裁判文书, i.e. court judgment documents); marking the entity types in a number of selected referee documents; obtaining a training set based on a legal word segmentation data set and the entity recognition data set; establishing a referee document legal information extraction model and training it with the training set; and inputting a referee document from which legal information is to be extracted into the trained model, which outputs the legal information extraction result for that document. The model structure comprises a word embedding layer, a shared-private information extractor, a task-specific CRF layer and a task discriminator. Taking publicly available referee documents as its starting point, the invention ultimately extracts the relevant important legal information elements they contain.

1. A referee document legal information extraction model, comprising: a word embedding layer, a shared-private information extractor, a task-specific CRF layer and a task discriminator; the word embedding layer is used for converting the words in a sentence into word vectors; the shared-private information extractor is composed of Bi-LSTMs and comprises two private information extractors and one shared information extractor, wherein one private information extractor is used for learning the boundaries in the word segmentation task, the other private information extractor is used for learning the boundaries in the entity recognition task, and the shared information extractor is used for learning the boundaries shared by the word segmentation task and the entity recognition task; the task-specific CRF layer is connected to the output representations of the two private Bi-LSTM information extractors and is used for outputting the label representations corresponding to the word segmentation task and the entity recognition task; the task discriminator is the lower layer input of the shared information extractor, and through adversarial training between the task discriminator and the shared information extractor, the shared information extractor learns the boundary features common to the word segmentation task and the entity recognition task.

2. A legal information extraction method, comprising:

defining entity types to be marked in the referee document;

marking entity types in a plurality of referee documents based on the defined entity types to obtain a marked entity recognition data set;

obtaining a legal word segmentation data set, and obtaining a training set based on the legal word segmentation data set and the entity recognition data set;

establishing a referee document legal information extraction model, and training the referee document legal information extraction model by using a training set to obtain a trained referee document legal information extraction model;

inputting a referee document with legal information to be extracted into a trained referee document legal information extraction model, and outputting a legal information extraction result in the referee document;

wherein the structure of the referee document legal information extraction model comprises: a word embedding layer, a shared-private information extractor, a task-specific CRF layer and a task discriminator; the word embedding layer is used for converting the words in a sentence into word vectors; the shared-private information extractor is composed of Bi-LSTMs and comprises two private information extractors and one shared information extractor, wherein one private information extractor is used for learning the boundaries in the word segmentation task, the other private information extractor is used for learning the boundaries in the entity recognition task, and the shared information extractor is used for learning the boundaries shared by the word segmentation task and the entity recognition task; the task-specific CRF layer is connected to the output representations of the two private Bi-LSTM information extractors and is used for outputting the label representations corresponding to the word segmentation task and the entity recognition task; the task discriminator is the lower layer input of the shared information extractor, and through adversarial training between the task discriminator and the shared information extractor, the shared information extractor learns the boundary features common to the word segmentation task and the entity recognition task.

3. The legal information extraction method of claim 2, wherein the entities in the referee document are marked in the form of BIO, B denotes the beginning of the entity, I denotes the middle character of the entity, and O denotes a character irrelevant to the entity.

4. The legal information extraction method of claim 2, wherein the legal word segmentation data set and the entity recognition data set are each divided into a training set, a cross validation set and a test set; the training set is used for training the referee document legal information extraction model, the cross validation set is used for validating the model, and the test set is used for testing the model.

5. The legal information extraction method of claim 2, wherein, when training the referee document legal information extraction model, each sentence in the legal word segmentation data set and the entity recognition data set is input into the word embedding layer for word embedding, and each word obtains a pre-trained word vector.

6. The legal information extraction method of claim 2, wherein each character in the word segmentation task is labeled with one of B, E, M and S, where B represents the beginning of a word, E represents the end of a word, M represents the middle of a word, and S represents a single-character word.

7. The legal information extraction method of claim 2, wherein, when training the referee document legal information extraction model, the word segmentation task and the entity recognition task are trained in turn while an adversarial loss function and updated parameter settings are introduced, and the optimal model is finally obtained through parameter tuning.

8. A legal information extraction system, comprising:

the definition unit is used for defining the entity types needing to be marked in the referee document;

the marking unit is used for marking the entity types in the plurality of referee documents based on the defined entity types to obtain a marked entity recognition data set;

the training set obtaining unit is used for obtaining a legal word segmentation data set and obtaining a training set based on the legal word segmentation data set and the entity recognition data set;

the model establishing and training unit is used for establishing a referee document legal information extraction model, and training the referee document legal information extraction model by using a training set to obtain a trained referee document legal information extraction model;

the legal information extraction unit is used for inputting the referee document with legal information to be extracted into the trained referee document legal information extraction model and outputting a legal information extraction result in the referee document;

wherein the structure of the referee document legal information extraction model comprises: a word embedding layer, a shared-private information extractor, a task-specific CRF layer and a task discriminator; the word embedding layer is used for converting the words in a sentence into word vectors; the shared-private information extractor is composed of Bi-LSTMs and comprises two private information extractors and one shared information extractor, wherein one private information extractor is used for learning the boundaries in the word segmentation task, the other private information extractor is used for learning the boundaries in the entity recognition task, and the shared information extractor is used for learning the boundaries shared by the word segmentation task and the entity recognition task; the task-specific CRF layer is connected to the output representations of the two private Bi-LSTM information extractors and is used for outputting the label representations corresponding to the word segmentation task and the entity recognition task; the task discriminator is the lower layer input of the shared information extractor, and through adversarial training between the task discriminator and the shared information extractor, the shared information extractor learns the boundary features common to the word segmentation task and the entity recognition task.

9. A legal information extraction apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method as claimed in any one of claims 2 to 7 when executing the computer program.

10. A legal case trial auxiliary system, the system comprising:

the judicial case library is used for storing the referee documents and the corresponding legal information;

the legal information extraction system is connected with the judicial case library and is used for extracting corresponding legal information from the referee documents and storing the extracted legal information and the corresponding referee documents in the judicial case library;

the query unit is used by a user to query the referee documents and the corresponding legal information in the judicial case library;

the display unit is used for displaying the information retrieved by the query unit;

wherein, the legal information extraction system includes: the definition unit is used for defining the entity types needing to be marked in the referee document;

the marking unit is used for marking the entity types in the plurality of referee documents based on the defined entity types to obtain a marked entity recognition data set;

the training set obtaining unit is used for obtaining a legal word segmentation data set and obtaining a training set based on the legal word segmentation data set and the entity recognition data set;

the model establishing and training unit is used for establishing a referee document legal information extraction model, and training the referee document legal information extraction model by using a training set to obtain a trained referee document legal information extraction model;

the legal information extraction unit is used for inputting the referee document with legal information to be extracted into the trained referee document legal information extraction model and outputting a legal information extraction result in the referee document;

wherein the structure of the referee document legal information extraction model comprises: a word embedding layer, a shared-private information extractor, a task-specific CRF layer and a task discriminator; the word embedding layer is used for converting the words in a sentence into word vectors; the shared-private information extractor is composed of Bi-LSTMs and comprises two private information extractors and one shared information extractor, wherein one private information extractor is used for learning the boundaries in the word segmentation task, the other private information extractor is used for learning the boundaries in the entity recognition task, and the shared information extractor is used for learning the boundaries shared by the word segmentation task and the entity recognition task; the task-specific CRF layer is connected to the output representations of the two private Bi-LSTM information extractors and is used for outputting the label representations corresponding to the word segmentation task and the entity recognition task; the task discriminator is the lower layer input of the shared information extractor, and through adversarial training between the task discriminator and the shared information extractor, the shared information extractor learns the boundary features common to the word segmentation task and the entity recognition task.

Technical Field

The invention relates to the field of natural language processing, in particular to a legal information extraction model, a legal information extraction method, a legal information extraction system, a legal information extraction device, a legal information extraction medium and a legal case trial auxiliary system in a referee document.

Background

A referee document is a legally binding document that a national court issues to the parties after trying a case, based on the specific circumstances of the case and the parties' claims or matters in dispute. Referee documents contain a large number of legal information elements, and building a case base from this legal information would facilitate many subsequent trial processes. Conventional methods for extracting legal information elements from referee documents either have legal experts summarize rules and continually refine a regular-expression (rule) engine, or convert the information extraction task into a named entity recognition task. The sequence labeling approach, however, suffers from problems common in the judicial domain, such as incomplete matching of word meanings, so the accuracy of legal element extraction is low. In addition, each specific entity type requires a large amount of manual labeling, and the entity recognition task depends heavily on the quality and quantity of the labeled data.

Disclosure of Invention

In order to solve the problem of low accuracy of legal information extraction in the trial process of the people's courts, the invention takes publicly available referee documents as its starting point and ultimately extracts the relevant important legal information elements they contain.

In order to achieve the above object, the present invention provides a referee document legal information extraction model, comprising: a word embedding layer, a shared-private information extractor, a task-specific CRF layer and a task discriminator; the word embedding layer is used for converting the words in a sentence into word vectors; the shared-private information extractor is composed of Bi-LSTMs and comprises two private information extractors and one shared information extractor, wherein one private information extractor is used for learning the boundaries in the word segmentation task, the other private information extractor is used for learning the boundaries in the entity recognition task, and the shared information extractor is used for learning the boundaries shared by the word segmentation task and the entity recognition task; the task-specific CRF layer is connected to the output representations of the two private Bi-LSTM information extractors and is used for outputting the label representations corresponding to the word segmentation task and the entity recognition task; the task discriminator is the lower layer input of the shared information extractor, and through adversarial training between the task discriminator and the shared information extractor, the shared information extractor learns the boundary features common to the word segmentation task and the entity recognition task. With this referee document legal information extraction model, the preset relevant information content can be extracted automatically, and the accuracy of information extraction is improved.
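The text does not say how the task-specific CRF layer turns scores into labels. A common choice for CRF layers is Viterbi decoding, and the following is a minimal sketch under that assumption. The function name `viterbi_decode` and the `emissions` and `transitions` score tables are hypothetical and stand in for whatever the Bi-LSTM features and the learned CRF parameters would provide.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the highest-scoring label sequence for one sentence.

    emissions[t][label] : score of `label` at position t (assumed input)
    transitions[a][b]   : score of moving from label a to label b (assumed input)
    """
    if not emissions:
        return []
    labels = list(emissions[0])
    # best[t][label] = best score of any path that ends in `label` at position t
    best = [dict(emissions[0])]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label to recover the full path.
    last = max(labels, key=lambda label: best[-1][label])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Giving an impossible transition such as O followed by I a large negative score keeps the decoder from emitting an entity that has no beginning, which is one reason a CRF layer is often preferred over labeling each position independently.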

In order to achieve the above object, the present invention further provides a legal information extraction method, including:

defining entity types to be marked in the referee document;

marking the entity types in the selected referee documents based on the defined entity types to obtain a marked entity recognition data set;

obtaining a public legal word segmentation data set, and obtaining a training set based on the legal word segmentation data set and an entity recognition data set;

establishing a referee document legal information extraction model, and training the referee document legal information extraction model by using a training set to obtain a trained referee document legal information extraction model;

inputting a referee document with legal information to be extracted into a trained referee document legal information extraction model, and outputting a legal information extraction result in the referee document;

wherein the structure of the referee document legal information extraction model comprises: a word embedding layer, a shared-private information extractor, a task-specific CRF layer and a task discriminator; the word embedding layer is used for converting the words in a sentence into word vectors; the shared-private information extractor is composed of Bi-LSTMs and comprises two private information extractors and one shared information extractor, wherein one private information extractor is used for learning the boundaries in the word segmentation task, the other private information extractor is used for learning the boundaries in the entity recognition task, and the shared information extractor is used for learning the boundaries shared by the word segmentation task and the entity recognition task; the task-specific CRF layer is connected to the output representations of the two private Bi-LSTM information extractors and is used for outputting the label representations corresponding to the word segmentation task and the entity recognition task; the task discriminator is the lower layer input of the shared information extractor, and through adversarial training between the task discriminator and the shared information extractor, the shared information extractor learns the boundary features common to the word segmentation task and the entity recognition task.

Preferably, in the method, the entity in the referee document is marked in the form of BIO, B represents the beginning of the entity, I represents the middle character of the entity, and O represents a character irrelevant to the entity.
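The BIO scheme can be illustrated with a short sketch that converts annotated entity spans into per-character tags and back. The text only names the tags B, I and O; the type suffix (for example `B-PLAINTIFF`), the span format and both function names are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type), hypothetical format


def spans_to_bio(length: int, spans: List[Span]) -> List[str]:
    """Tag every character of a sentence of `length` characters."""
    tags = ["O"] * length
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from a BIO tag sequence."""
    spans: List[Span] = []
    start: Optional[int] = None
    etype = ""
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, etype))
            start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans
```

For the nine-character sentence 原告张三诉被告李四, marking 张三 as a hypothetical `PLAINTIFF` entity and 李四 as a hypothetical `DEFENDANT` entity gives the spans `(2, 4, "PLAINTIFF")` and `(7, 9, "DEFENDANT")`; every other character is tagged O.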

Preferably, the legal word segmentation data set and the entity recognition data set are each divided into a training set, a cross validation set and a test set; the training set is used for training the referee document legal information extraction model, the cross validation set is used for validating the model, and the test set is used for testing the model.
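A three-way split of this kind might look like the sketch below. The text gives no proportions, so the 80/10/10 default, the fixed shuffle seed and the function name `split_dataset` are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    samples: Sequence[T],
    train: float = 0.8,   # assumed ratio, not taken from the text
    valid: float = 0.1,   # assumed ratio, not taken from the text
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once, then cut into training, validation and test parts."""
    if not 0 < train < 1 or not 0 <= valid < 1 or train + valid >= 1:
        raise ValueError("train and valid must leave a non-empty test share")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )
```

The same function would be applied separately to the word segmentation data and to the entity recognition data, so that each task keeps its own held-out sentences.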

Preferably, in the method, when the referee document legal information extraction model is trained, each sentence in the legal word segmentation data set and the entity recognition data set is input into the word embedding layer for word embedding, and each word obtains a pre-trained word vector.

Preferably, in the method, each character in the word segmentation task is labeled with one of B, E, M and S, where B represents the beginning of a word, E represents the end of a word, M represents the middle of a word, and S represents a single-character word.
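As a small illustration of this labeling, the sketch below converts a segmented sentence into per-character B, M, E and S labels and rebuilds the words from them. The function names are hypothetical.

```python
from typing import List


def words_to_bmes(words: List[str]) -> List[str]:
    """Label every character of an already segmented sentence."""
    labels: List[str] = []
    for word in words:
        if len(word) == 1:
            labels.append("S")
        elif len(word) >= 2:
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels


def bmes_to_words(chars: str, labels: List[str]) -> List[str]:
    """Rebuild words by cutting after every E or S label."""
    words: List[str] = []
    current = ""
    for ch, label in zip(chars, labels):
        current += ch
        if label in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # keep an unfinished trailing word rather than drop characters
        words.append(current)
    return words
```

For the segmentation 原告 / 张三 / 诉 / 人民法院, the two-character words each receive B then E, the single character 诉 receives S, and the four-character word receives B, M, M, E.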

Preferably, in the method, when the referee document legal information extraction model is trained, the word segmentation task and the entity recognition task are trained in turn while an adversarial loss function and updated parameter settings are introduced, and the optimal model is finally obtained through parameter tuning.
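Training the two tasks in turn can be pictured as a round-robin over each task's batches, with the adversarial term added to the task loss at every step. The sketch below shows only that scheduling and weighting logic. The task names, the weight `adv_weight` and its default value, and both function names are assumptions, and the text does not specify how the adversarial loss itself is computed.

```python
from itertools import cycle
from typing import Dict, Iterator, Sequence, Tuple


def alternate_batches(
    batches_by_task: Dict[str, Sequence[object]],
    steps: int,
) -> Iterator[Tuple[str, object]]:
    """Yield (task, batch) pairs, switching task on every step.

    Each task's batch list must be non-empty and is restarted when exhausted.
    """
    iterators = {task: cycle(batches) for task, batches in batches_by_task.items()}
    for _, task in zip(range(steps), cycle(batches_by_task)):
        yield task, next(iterators[task])


def total_loss(task_loss: float, adversarial_loss: float, adv_weight: float = 0.05) -> float:
    """Combine the CRF loss of the current task with the weighted adversarial loss."""
    return task_loss + adv_weight * adversarial_loss
```

In a full training loop, each yielded batch would pass through the shared extractor and the private extractor for its task, and `adv_weight` would be one of the parameters adjusted during tuning.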

Corresponding to the method, the invention also provides a legal information extraction system, which comprises:

the definition unit is used for defining the entity types needing to be marked in the referee document;

the marking unit is used for marking the entity types in the selected referee documents based on the defined entity types to obtain a marked entity recognition data set;

the training set obtaining unit is used for obtaining a public legal word segmentation data set and obtaining a training set based on the legal word segmentation data set and the entity recognition data set;

the model establishing and training unit is used for establishing a referee document legal information extraction model, and training the referee document legal information extraction model by using a training set to obtain a trained referee document legal information extraction model;

the legal information extraction unit is used for inputting the referee document with legal information to be extracted into the trained referee document legal information extraction model and outputting a legal information extraction result in the referee document;

wherein the structure of the referee document legal information extraction model comprises: a word embedding layer, a shared-private information extractor, a task-specific CRF layer and a task discriminator; the word embedding layer is used for converting the words in a sentence into word vectors; the shared-private information extractor is composed of Bi-LSTMs and comprises two private information extractors and one shared information extractor, wherein one private information extractor is used for learning the boundaries in the word segmentation task, the other private information extractor is used for learning the boundaries in the entity recognition task, and the shared information extractor is used for learning the boundaries shared by the word segmentation task and the entity recognition task; the task-specific CRF layer is connected to the output representations of the two private Bi-LSTM information extractors and is used for outputting the label representations corresponding to the word segmentation task and the entity recognition task; the task discriminator is the lower layer input of the shared information extractor, and through adversarial training between the task discriminator and the shared information extractor, the shared information extractor learns the boundary features common to the word segmentation task and the entity recognition task.

The invention also provides a legal information extraction device, which comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the processor realizes the steps of the legal information extraction method when executing the computer program.

The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the legal information extraction method.

The invention also provides a legal case trial auxiliary system based on the legal information extraction system, which comprises:

the judicial case library is used for storing the referee documents and the corresponding legal information;

the legal information extraction system is connected with the judicial case library and is used for extracting corresponding legal information from the referee documents and storing the extracted legal information and the corresponding referee documents in the judicial case library;

the query unit is used by a user to query the referee documents and the corresponding legal information in the judicial case library;

and the display unit is used for displaying the information retrieved by the query unit.

Through the legal case trial auxiliary system, legal workers such as judges can quickly query the legal information they need, which helps them complete the trial of cases quickly.
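The interplay of the judicial case library, the query unit and the display unit can be sketched with an in-memory store. This is an illustration only: the class and field names, the exact-match query and the text layout produced by `display` are all assumptions, and the text does not describe any storage or query technology.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CaseRecord:
    doc_id: str
    text: str
    legal_info: Dict[str, List[str]]  # entity type -> extracted values


@dataclass
class JudicialCaseLibrary:
    records: Dict[str, CaseRecord] = field(default_factory=dict)

    def store(self, record: CaseRecord) -> None:
        """Save a referee document together with its extracted legal information."""
        self.records[record.doc_id] = record

    def query(self, entity_type: str, value: str) -> List[CaseRecord]:
        """Query unit: find documents whose extracted information contains the value."""
        return [
            r for r in self.records.values()
            if value in r.legal_info.get(entity_type, [])
        ]


def display(records: List[CaseRecord]) -> str:
    """Display unit: render one line per matching document."""
    lines = []
    for r in records:
        info = "; ".join(f"{k}={','.join(v)}" for k, v in sorted(r.legal_info.items()))
        lines.append(f"{r.doc_id}: {info}")
    return "\n".join(lines)
```

In this picture, the legal information extraction system would call `store` once per processed document, passing the entities it extracted as `legal_info`.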

One or more technical schemes provided by the invention at least have the following technical effects or advantages:

The invention enables the preset relevant information content to be extracted automatically once a referee document is input into the referee document legal information extraction model. By adopting adversarial transfer learning, the accuracy of information extraction is improved, and the introduction of the word segmentation task reduces the dependence on the amount of entity recognition data.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention;

FIG. 1 is a flow chart of the referee document information extraction method based on adversarial transfer learning;

FIG. 2 is a schematic diagram of the composition of the legal information extraction system.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflicting with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described and thus the scope of the present invention is not limited by the specific embodiments disclosed below.
