Semantic element extraction method for XBRL field ontology

Document No.: 1087412 Publication date: 2020-10-20

Reading note: this technique, a semantic element extraction method for an XBRL field ontology (一种面向xbrl领域本体的语义基元提取方法), was created by 潘定, 叶迪 and 梁倬骞 on 2020-07-14. Abstract: The invention discloses a semantic element extraction method for an XBRL field ontology, comprising the following steps: step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary; step 2, performing word segmentation, stop-word removal and de-duplication on the texts; step 3, constructing a directed network graph of accounting terms; step 4, after the network graph has been built from the accounting dictionary, calculating the PageRank value of each node with MATLAB R2016a as the basis for semantic element extraction. The method addresses a shortcoming of existing approaches that attempt semantic element extraction with currently popular machine learning algorithms: although those approaches effectively reduce labor and time costs, the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.

1. A semantic element extraction method for an XBRL field ontology, comprising the following steps:

step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary;

step 2, performing word segmentation, stop-word removal and de-duplication on the texts;

step 3, constructing a directed network graph of accounting terms;

step 4, after the network graph has been built from the accounting dictionary, calculating the PageRank value of each node with MATLAB R2016a as the basis for semantic element extraction;

step 5, merging the semantic elements based on the synonym forest.

2. The XBRL field ontology-oriented semantic primitive extraction method according to claim 1, wherein in step 1 the definition texts of the accounting terms are manually extracted and collated, and compiled in Excel.

3. The XBRL field ontology-oriented semantic primitive extraction method according to claim 1, wherein step 2 specifically comprises segmenting words with the Python jieba package, importing 4 accounting terms from the accounting dictionary into a custom dictionary, then establishing a stop-word list, and de-duplicating the words in the definition text of each term.

4. The XBRL field ontology-oriented semantic primitive extraction method according to claim 1, wherein in step 3 the terms and the words of their segmented definition texts are taken as nodes, and each term has directed edges pointing to the words of its definition text; that is, if another word B appears in the definition text of a term A, there is a directed edge between A and B that points from A to B.

5. The XBRL field ontology-oriented semantic primitive extraction method according to claim 1, wherein in step 4 the semantic primitives are the node with the largest PageRank value in each loop (cycle) and the leaf nodes of the non-loop parts.

6. The XBRL field ontology-oriented semantic primitive extraction method according to claim 1, wherein in step 5 words among the extracted semantic elements that have similar forms but different definitions are merged.

7. The XBRL field ontology-oriented semantic primitive extraction method according to claim 2, wherein Excel is used to organize the accounting dictionary in a structured form.

Technical Field

The invention relates to the technical field of XBRL field ontologies, in particular to a semantic element extraction method for XBRL field ontologies.

Background

A domain ontology is a formal specification of a shared conceptual model in a specific domain. It reflects the knowledge structure of the domain by representing concepts and the relations between them, and it helps improve human-computer interaction and information exchange between machines. For the financial reporting domain, the XBRL field ontology is a system of financial reporting terms and related instances built on the principles of sharing and formalization, and is therefore also called a formal ontology. The required classification standard (taxonomy) can be generated automatically from the XBRL field ontology, and reasoning and checking over financial data are supported, so research on the XBRL field ontology is valuable. At present, however, no systematic and complete ontology has been built in the financial reporting field, and ontology-based financial reporting research mostly concentrates on discussion and simple verification of the theoretical process rather than on a working system. The main reasons are that no professional concept system guides the use of tags in the XBRL field, and that the semantics of concepts in XBRL financial reports are weak, which hinders the production and data sharing of XBRL financial reports.

Because the XBRL field currently lacks a standardized knowledge description, making XBRL financial information machine-readable remains difficult, which limits how widely XBRL is used and how far it can develop. Existing work attempts to solve the difficulty of semantic primitive extraction with currently popular machine learning algorithms. Although this effectively reduces labor and time costs, the extracted terms contain a large amount of noise, their domain characteristics are not prominent, and their validity cannot be verified.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention provides a semantic element extraction method for an XBRL field ontology. It addresses the problem that existing attempts to extract semantic elements with currently popular machine learning algorithms, although they effectively reduce labor and time costs, yield terms that contain a large amount of noise, lack prominent domain characteristics, and cannot be verified for validity.

To achieve this purpose, the invention adopts the following technical scheme: a semantic element extraction method for an XBRL field ontology, comprising the following steps:

step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary;

step 2, performing word segmentation, stop-word removal and de-duplication on the texts;

step 3, constructing a directed network graph of accounting terms;

step 4, after the network graph has been built from the accounting dictionary, calculating the PageRank value of each node with MATLAB R2016a as the basis for semantic element extraction;

step 5, merging the semantic elements based on the synonym forest.

Preferably, in step 1, the definition texts of the accounting terms are manually extracted and collated, and compiled in Excel.

Preferably, step 2 specifically comprises segmenting words with the Python jieba package, importing 4 accounting terms from the accounting dictionary into a custom dictionary, then establishing a stop-word list, and de-duplicating the words in the definition text of each term.

Preferably, in step 3, the terms and the words of their segmented definition texts are taken as nodes, and each term has directed edges pointing to the words of its definition text; that is, if another word B appears in the definition text of a term A, there is a directed edge between A and B that points from A to B.

Preferably, in step 4, the semantic primitives are the node with the largest PageRank value in each loop (cycle) and the leaf nodes of the non-loop parts.

Preferably, in step 5, words among the extracted semantic elements that have similar forms but different definitions are merged.

Preferably, Excel is used to organize the accounting dictionary in a structured form.

Advantageous effects

The invention provides a semantic element extraction method for an XBRL field ontology. The method has the following beneficial effects:

In this semantic element extraction method for an XBRL field ontology, the semantic elements are merged based on the synonym forest, which largely guarantees the expressive efficiency of the semantic elements: the largest range of domain knowledge is expressed with the smallest set of semantic elements. This solves the problem that existing machine-learning-based extraction, although it effectively reduces labor and time costs, yields terms with a large amount of noise, weak domain characteristics, and no means of verifying their validity.

Drawings

FIG. 1 is a flowchart of the semantic element extraction method oriented to XBRL domain ontology according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, the present invention provides a technical solution: a semantic element extraction method for an XBRL field ontology, comprising the following steps:

step 1, extracting and collating the definition texts of accounting terms from an accounting dictionary;

step 2, performing word segmentation, stop-word removal and de-duplication on the texts;

step 3, constructing a directed network graph of accounting terms;

step 4, after the network graph has been built from the accounting dictionary, calculating the PageRank value of each node with MATLAB R2016a as the basis for semantic primitive extraction;

step 5, merging the semantic elements based on the synonym forest.

Further, in step 1, the definition texts of the accounting terms are manually extracted and collated, and compiled in Excel.
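As an illustration of how the collated table might be consumed by the later steps, the sketch below reads a two-column table of terms and definition texts. It assumes the Excel sheet has been exported to CSV with the column names `term` and `definition`; the export, the column names, the function name `load_definitions` and the sample rows are all hypothetical and do not come from the source.

```python
import csv
import io


def load_definitions(csv_text):
    """Return {term: definition text} from a two-column CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    definitions = {}
    for row in reader:
        term = row["term"].strip()
        if term:  # skip blank rows left over from manual editing
            definitions[term] = row["definition"].strip()
    return definitions


# Hypothetical sample rows, not taken from any real accounting dictionary.
SAMPLE_CSV = (
    "term,definition\n"
    "asset,resource controlled by an entity\n"
    "entity,an accounting unit\n"
)
definitions = load_definitions(SAMPLE_CSV)
```

Keeping the table as a plain mapping from term to definition text makes it easy to pass each definition to the segmentation step.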

Further, step 2 specifically comprises segmenting words with the Python jieba package, importing 4 accounting terms from the accounting dictionary into a custom dictionary, then establishing a stop-word list, and de-duplicating the words in the definition text of each term.
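The preprocessing of step 2 can be sketched roughly as follows. To keep the example runnable without third-party packages, the segmenter is passed in as a parameter and plain whitespace splitting stands in for it; with Chinese text one would presumably pass `jieba.lcut` after loading the custom dictionary with `jieba.load_userdict`. The function name `preprocess`, the stop-word list and the sample sentence are assumptions made for illustration.

```python
def preprocess(definition, tokenize, stop_words):
    """Segment a definition, drop stop words, and remove duplicates in order."""
    seen = set()
    words = []
    for word in tokenize(definition):
        word = word.strip()
        if not word or word in stop_words or word in seen:
            continue
        seen.add(word)
        words.append(word)
    return words


# Whitespace splitting stands in for a real segmenter here. For Chinese text,
# a function such as jieba.lcut could be passed as `tokenize` instead.
STOP_WORDS = {"a", "an", "the", "of", "by", "that", "is"}  # hypothetical list
words = preprocess(
    "a resource controlled by an entity that is a resource",
    str.split,
    STOP_WORDS,
)
```

De-duplication preserves first-occurrence order, so each definition word contributes at most one edge when the graph is built.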

Further, in step 3, the terms and the words of their segmented definition texts are taken as nodes, and each term has directed edges pointing to the words of its definition text; that is, if another word B appears in the definition text of a term A, there is a directed edge between A and B that points from A to B.
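A minimal sketch of this construction, using an adjacency mapping from each node to the set of nodes it points to, might look like the following. The function name `build_graph`, the choice to ignore a term that appears in its own definition, and the sample data are assumptions for illustration only.

```python
def build_graph(segmented):
    """Map each node to the set of nodes it points to (term -> definition words)."""
    graph = {}
    for term, words in segmented.items():
        targets = graph.setdefault(term, set())
        for word in words:
            if word != term:  # assumption: ignore self-references
                targets.add(word)
                graph.setdefault(word, set())  # definition words are nodes too
    return graph


# Hypothetical segmented definitions: term -> de-duplicated definition words.
segmented = {
    "asset": ["resource", "entity"],
    "entity": ["unit"],
    "unit": ["entity"],
}
graph = build_graph(segmented)
```

In this toy graph `entity` and `unit` define each other and so form a loop, while `resource` has no definition of its own and ends up as a leaf node.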

Further, in step 4, the semantic primitives are the node with the largest PageRank value in each loop (cycle) and the leaf nodes of the non-loop parts.
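The source computes the PageRank values in MATLAB R2016a and does not state the parameters used. As a stand-in, the sketch below is a plain power-iteration PageRank in Python; the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from nodes without outgoing edges, and the sample graph are all assumptions rather than details from the source.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on {node: set of successor nodes}."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no outgoing edges) is spread evenly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in graph[v]:
                new_rank[w] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


# Hypothetical graph: each term points to the words used in its definition.
graph = {
    "asset": {"resource", "entity"},
    "resource": set(),
    "entity": {"unit"},
    "unit": {"entity"},
}
ranks = pagerank(graph)
```

Because edges point from a term to the words that define it, words used in many definitions accumulate rank, which matches the idea of treating highly ranked nodes as candidate primitives.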

Furthermore, in step 5, words among the extracted semantic elements that have similar forms but different definitions are merged.
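The source does not describe how the synonym forest is queried. One possible reading is that each word maps to a thesaurus class, and primitives that share a class are merged into one representative. The sketch below follows that reading; the class codes, the mapping `SYNONYM_CLASS`, the rule of keeping the alphabetically first word, and the sample words are hypothetical.

```python
def merge_by_synonym_class(primitives, synonym_class):
    """Group primitives by thesaurus class and keep one representative each."""
    groups = {}
    for word in sorted(primitives):
        # Words missing from the thesaurus form their own single-word group.
        key = synonym_class.get(word, word)
        groups.setdefault(key, []).append(word)
    # Assumption: the alphabetically first word represents the group.
    return {members[0]: members for members in groups.values()}


# Hypothetical class codes; a real synonym forest uses its own coding scheme.
SYNONYM_CLASS = {"enterprise": "C01", "company": "C01", "firm": "C01", "cash": "C02"}
merged = merge_by_synonym_class(
    {"enterprise", "company", "cash", "ledger"},
    SYNONYM_CLASS,
)
```

Returning the full member list for each representative keeps the merge auditable, so a reviewer can check which words were folded together.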

Further, Excel is used for structured arrangement of an accounting dictionary.

In this embodiment, step 1 extracts and collates the definition texts of accounting terms from an accounting dictionary. The definition texts are manually extracted and collated, then compiled in Excel, which is used to organize the accounting dictionary in a structured form.

Step 2 performs word segmentation, stop-word removal and de-duplication on the texts. Specifically, words are segmented with the Python jieba package, 4 accounting terms from the accounting dictionary are imported into a custom dictionary, a stop-word list is then established, and the words in the definition text of each term are de-duplicated.

Step 3 constructs the directed network graph of accounting terms. The terms and the words of their segmented definition texts are taken as nodes, and each term has directed edges pointing to the words of its definition text: if another word B appears in the definition text of a term A, there is a directed edge between A and B that points from A to B.

Step 4: after the network graph has been built from the accounting dictionary, the PageRank value of each node is calculated with MATLAB R2016a as the basis for semantic element extraction. The semantic elements are the node with the largest PageRank value in each loop (cycle) and the leaf nodes of the non-loop parts.
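The selection rule of step 4 can be sketched as follows, under one interpretation that the source does not spell out: a "loop" is taken to be a strongly connected component containing a cycle, and a "leaf node" is a node outside any loop with no outgoing edges. Components are found with Tarjan's algorithm. The function names, the sample graph and the PageRank values in `ranks` are hypothetical.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; returns a list of components (lists of nodes)."""
    index, low, on_stack, stack, components = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in sorted(graph):
        if v not in index:
            visit(v)
    return components


def select_primitives(graph, ranks):
    """Top-ranked node of every loop, plus every node without outgoing edges."""
    primitives = set()
    for component in strongly_connected_components(graph):
        is_loop = len(component) > 1 or component[0] in graph[component[0]]
        if is_loop:
            primitives.add(max(component, key=lambda v: ranks[v]))
        elif not graph[component[0]]:
            primitives.add(component[0])  # leaf node outside any loop
    return primitives


graph = {
    "asset": {"resource", "entity"},
    "resource": set(),
    "entity": {"unit"},
    "unit": {"entity"},
}
# Hypothetical PageRank values, chosen only to illustrate the rule.
ranks = {"asset": 0.05, "resource": 0.10, "entity": 0.45, "unit": 0.40}
primitives = select_primitives(graph, ranks)
```

Here `entity` is kept as the top-ranked node of the `entity`/`unit` loop and `resource` is kept as a leaf, while `asset` is dropped because it is fully defined by other nodes. The recursive traversal is fine for small graphs; a large dictionary graph may need an iterative version to stay within Python's recursion limit.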

Step 5 merges the semantic elements based on the synonym forest: words among the extracted semantic elements that have similar forms but different definitions are merged.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
