Method for HTML-based intelligent information extraction

Document No.: 1287555 Publication date: 2020-08-28

Reading note: This technology, a method for HTML-based intelligent information extraction, was designed by 佘俊, 周宇鹏, 余少锋, 麻建超, 廖崇阳, 柳本林 and 罗勇 on 2020-01-15. Main content: The invention belongs to the technical field of information processing and relates in particular to a method for HTML-based intelligent information extraction. According to a plurality of entity seeds in an entity seed set, the method extracts a plurality of entity candidates and a plurality of attribute candidates from a target corpus, establishes an association relationship between entities and attributes, and determines at least one target entity from the plurality of entities. Finally, the extracted target entity is stored in a target entity set and the extracted target attribute is stored in a target attribute set associated with the target entity; adjacent lines that are semantically associated are merged into paragraphs, and a line with no semantic association to its adjacent lines forms a paragraph on its own, yielding structured text. A key information form containing keywords is established; key information is obtained through features and written into the key information form, completing key information extraction. The structured text obtained by the method makes the information available for analysis and statistics and gives research work an analysis environment covering the full data.

1. A method for HTML-based intelligent information extraction, characterized by: extracting a plurality of entity candidates and a plurality of attribute candidates from a target corpus according to a plurality of entity seeds in an entity seed set, wherein the entity seed set is composed of a plurality of entity seeds belonging to a target category; establishing an association relationship between entities and attributes according to a plurality of entities and the plurality of attribute candidates, wherein the plurality of entities comprise the plurality of entity seeds and the plurality of entity candidates; and determining, according to the association relationship between the entities and the attributes, at least one target entity from the plurality of entities and at least one target attribute from the plurality of attribute candidates, wherein the determining comprises:

scoring each entity in the plurality of entities and each attribute candidate in the plurality of attribute candidates according to the association coefficient of each entity and each attribute in the association relationship between the entities and the attributes; determining the at least one target entity from the plurality of entities according to the scoring results of the plurality of entities; determining the at least one target attribute from the plurality of attribute candidates according to the scoring results of the plurality of attribute candidates; finally, storing the extracted target entity in a target entity set, storing the extracted target attribute in a target attribute set associated with the target entity, merging adjacent semantically associated lines into paragraphs, and making each line that has no semantic association with its adjacent lines a paragraph of its own, to obtain the structured text; establishing a key information form containing keywords; and obtaining key information through features and writing the key information into the key information form to complete key information extraction.

2. The method of claim 1, wherein determining at least one target entity from the plurality of entities and at least one target attribute from the plurality of attribute candidates according to the association relationship between the entities and the attributes comprises: scoring each entity in the plurality of entities and each attribute candidate in the plurality of attribute candidates according to the association coefficient of each entity and each attribute in the association relationship between the entities and the attributes; determining the at least one target entity from the plurality of entities according to the scoring results of the plurality of entities; and determining the at least one target attribute from the plurality of attribute candidates according to the scoring results of the plurality of attribute candidates.

Technical Field

The invention belongs to the technical field of information processing, and particularly relates to a method for HTML-based intelligent information extraction.

Background

With the rapid development of electronic technology and the arrival of the big data era, more and more data are stored in information systems in the form of Hypertext Markup Language (HTML). HTML is processed into structured text through natural language processing technology, and extracting entities and attributes from the HTML text is an important step in converting unstructured text into structured text.

Large amounts of unstructured text are currently handled by manual reading and manual understanding, which involves a heavy workload and subjective interpretation. How to convert unstructured data into structured data that a computer can understand, and how to extract key information from it quickly and accurately, has therefore become an urgent technical problem. In the process of converting unstructured text into structured text, entity extraction and attribute extraction are generally carried out in two separate stages. In a typical implementation, entity candidates are first extracted from a given unstructured text according to the entity seeds in an entity seed set of a given target category. The similarity between each entity candidate and the entity seeds is calculated from the context of the entity candidate in the given corpus, and the entity candidates whose similarity to the entity seeds exceeds a preset similarity are taken as target entities. Attribute candidates are then extracted from the given corpus according to the attribute seeds in a given attribute seed set. The similarity between each attribute candidate and the attribute seeds is calculated from the context of the attribute candidate in the given corpus, and the attribute candidates whose similarity to the attribute seeds exceeds the preset similarity are taken as target attributes. Because the preset similarity has to be set manually, semantic drift often occurs during information extraction.
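The prior-art selection step described above can be illustrated with a minimal sketch. It assumes a bag-of-words context window and cosine similarity; the source names neither, and `context_vector`, `cosine`, `select_by_threshold` and the `window` parameter are hypothetical names introduced only for illustration.

```python
import math
from collections import Counter


def context_vector(corpus, term, window=2):
    """Count the tokens found within `window` positions of each occurrence of `term`."""
    vec = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_by_threshold(corpus, seeds, candidates, threshold):
    """Keep candidates whose best context similarity to any seed exceeds `threshold`."""
    seed_vecs = [context_vector(corpus, s) for s in seeds]
    return [
        c for c in candidates
        if max(cosine(context_vector(corpus, c), sv) for sv in seed_vecs) > threshold
    ]


corpus = [
    "the capital of france is paris",
    "the capital of germany is berlin",
    "i ate an apple today",
]
print(select_by_threshold(corpus, ["france"], ["germany", "apple"], 0.5))
print(select_by_threshold(corpus, ["france"], ["germany", "apple"], 0.8))
```

In this toy corpus the candidate "germany" is kept at a threshold of 0.5 and dropped at 0.8, which shows how much the outcome depends on the manually chosen value.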

In addition, in the HTML information extraction stage, the prior art outputs a structured information group according to given page preprocessing and extraction rule settings, so as to facilitate query and analysis. However, this approach does not extract key information: what is pushed to the user is still a complete file, so key information cannot be obtained quickly and accurately.

Disclosure of Invention

In order to solve the problems of semantic drift and the inability to extract key information during information extraction, the invention provides a method for HTML-based intelligent information extraction.

The technical scheme adopted by the invention is as follows:

extracting a plurality of entity candidates and a plurality of attribute candidates from a target corpus according to a plurality of entity seeds in an entity seed set, wherein the entity seed set is composed of a plurality of entity seeds belonging to a target category;

establishing an association relationship between entities and attributes according to a plurality of entities and the plurality of attribute candidates, wherein the plurality of entities comprise the plurality of entity seeds and the plurality of entity candidates;

and determining, according to the association relationship between the entities and the attributes, at least one target entity from the plurality of entities and at least one target attribute from the plurality of attribute candidates.
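The extraction and association steps above can be sketched as follows. The source does not say how candidates are found or how association coefficients are computed, so this sketch assumes a single hypothetical text pattern, "<attribute> of <entity>", and uses raw co-occurrence counts as coefficients. `TextCollector`, `PATTERN` and `extract_and_associate` are illustrative names, not part of the described method.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


# Assumed pattern for illustration only: "<attribute> of <entity>".
PATTERN = re.compile(r"\b(\w+) of (\w+)\b")


def extract_and_associate(html, seeds):
    """Return entity candidates, attribute candidates and co-occurrence counts."""
    parser = TextCollector()
    parser.feed(html)
    pairs = [
        (attr.lower(), ent.lower())
        for chunk in parser.chunks
        for attr, ent in PATTERN.findall(chunk)
    ]
    seeds = {s.lower() for s in seeds}
    # Attributes seen with a seed entity become attribute candidates.
    attribute_candidates = {a for a, e in pairs if e in seeds}
    # Non-seed entities seen with one of those attributes become entity candidates.
    entity_candidates = {e for a, e in pairs if a in attribute_candidates} - seeds
    # Association coefficient (assumed): number of co-occurrences of each pair.
    association = defaultdict(int)
    for a, e in pairs:
        if a in attribute_candidates and (e in seeds or e in entity_candidates):
            association[(e, a)] += 1
    return entity_candidates, attribute_candidates, dict(association)


html = (
    "<ul><li>The capital of France is Paris.</li>"
    "<li>The population of France grew.</li>"
    "<li>The capital of Germany is Berlin.</li>"
    "<li>The top of Mountain is cold.</li></ul>"
)
entities, attributes, association = extract_and_associate(html, ["France"])
print(entities, attributes, association)
```

Because entity candidates and attribute candidates are both derived from the same entity seeds, no separate attribute seed set is needed in this sketch, which mirrors the joint extraction the text describes.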

According to the association coefficient of each entity and each attribute in the association relationship between the entities and the attributes, each entity in the plurality of entities and each attribute candidate in the plurality of attribute candidates is scored;

the at least one target entity is determined from the plurality of entities according to the scoring results of the plurality of entities;

the at least one target attribute is determined from the plurality of attribute candidates according to the scoring results of the plurality of attribute candidates;

finally, the extracted target entity is stored in a target entity set, the extracted target attribute is stored in a target attribute set associated with the target entity, adjacent semantically associated lines are merged into paragraphs, and each line that has no semantic association with its adjacent lines becomes a paragraph of its own, yielding the structured text; a key information form containing keywords is established; and key information is obtained through features and written into the key information form to complete key information extraction.
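The scoring and storage steps above can be sketched as follows. The source gives no scoring formula, so this is one possible reading under stated assumptions: an attribute is scored by its summed coefficients with the seed entities, an entity by its coefficients weighted by those attribute scores, and targets are the `k` highest-scoring items. `score`, `top_k` and `store_targets` are hypothetical names.

```python
def score(association, seeds):
    """Score attributes and entities from {(entity, attribute): coefficient}."""
    # Every attribute starts at 0 so attributes unrelated to any seed stay at 0.
    attr_scores = {a: 0.0 for (_, a) in association}
    for (e, a), coef in association.items():
        if e in seeds:
            attr_scores[a] += coef
    entity_scores = {}
    for (e, a), coef in association.items():
        entity_scores[e] = entity_scores.get(e, 0.0) + coef * attr_scores[a]
    return entity_scores, attr_scores


def top_k(scores, k):
    """Return the k highest-scoring names, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def store_targets(association, target_entities, target_attributes):
    """Build the target entity set and a target attribute set per target entity."""
    entity_set = set(target_entities)
    attribute_sets = {
        e: {a for (e2, a) in association if e2 == e and a in target_attributes}
        for e in entity_set
    }
    return entity_set, attribute_sets


association = {
    ("france", "capital"): 2.0,
    ("france", "population"): 1.0,
    ("germany", "capital"): 1.0,
    ("germany", "population"): 1.0,
    ("everest", "height"): 3.0,
}
entity_scores, attr_scores = score(association, {"france"})
targets_e = top_k(entity_scores, 2)
targets_a = top_k(attr_scores, 2)
print(store_targets(association, targets_e, targets_a))
```

Here "everest" scores 0 because its only attribute never co-occurs with a seed entity, so it is excluded without a separately tuned similarity threshold.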


In the method for HTML-based intelligent information extraction provided by the invention, a plurality of entity candidates and a plurality of attribute candidates are extracted from a target corpus according to a plurality of entity seeds in an entity seed set, and the association relationship between entities and attributes is established according to the entities and the attribute candidates. At least one target entity is then determined from the entities, and at least one target attribute from the attribute candidates, according to that association relationship. Because the target entity and the target attribute are both determined from the association relationship between entities and attributes, rather than from a manually set similarity, semantic drift during information extraction is avoided. The method also acquires key information quickly and accurately through features, which greatly reduces the time spent extracting data manually, improves research efficiency and accuracy, and creates value for the analysis process. The structured text obtained by the method makes the information available for analysis and statistics and gives research work an analysis environment covering the full data.
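The paragraph-merging and key-information steps summarized above can be sketched as follows. The source does not define "semantic association" or the features used, so both are assumptions here: two adjacent lines count as associated when they share a word longer than three letters, and each form field is filled by a regular-expression feature. `content_words`, `merge_lines`, `fill_form` and the sample `features` are hypothetical.

```python
import re


def content_words(line):
    """Lower-cased alphabetic words longer than three letters."""
    return {w for w in re.findall(r"[a-z]+", line.lower()) if len(w) > 3}


def shares_content_word(prev, cur):
    """Assumed stand-in for semantic association between adjacent lines."""
    return bool(content_words(prev) & content_words(cur))


def merge_lines(lines, is_associated=shares_content_word):
    """Merge each line into the previous paragraph when it is associated with its last line."""
    paragraphs = []
    for line in lines:
        if paragraphs and is_associated(paragraphs[-1][-1], line):
            paragraphs[-1].append(line)
        else:
            paragraphs.append([line])
    return [" ".join(p) for p in paragraphs]


def fill_form(paragraphs, features):
    """Fill a keyword form; each value is the first capture group matched by that keyword's feature."""
    form = {keyword: None for keyword in features}
    for keyword, pattern in features.items():
        for paragraph in paragraphs:
            match = re.search(pattern, paragraph)
            if match:
                form[keyword] = match.group(1)
                break
    return form


lines = [
    "The contract was signed on 2020-01-15.",
    "The contract covers a total amount of 5000 USD.",
    "Weather today is sunny.",
]
features = {"date": r"(\d{4}-\d{2}-\d{2})", "amount": r"amount of (\d+) USD"}
paragraphs = merge_lines(lines)
print(paragraphs)
print(fill_form(paragraphs, features))
```

The association test is passed in as a parameter so that a stronger measure, such as an embedding-based similarity, could replace the word-overlap heuristic without changing the merging logic.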

Detailed Description
