News event activity name extraction method based on deep learning

Document No.: 1964138. Publication date: 2021-12-14.

Abstract: This technology, a news event activity name extraction method based on deep learning (一种基于深度学习的新闻事件活动名称抽取方法), was created by 杨瀚 (Yang Han), 朱婷婷 (Zhu Tingting) and 温序铭 (Wen Xuming) on 2021-11-16. The invention discloses a deep learning-based news event activity name extraction method comprising the following steps: S1, collecting news text data, labeling the news event activity names in it, and constructing a news event activity name data set; S2, constructing a news event activity name extraction model by using a pre-trained model and a deep learning method, and training the model with the news event activity name data set; S3, using the model trained in step S2 to predict on input news text and obtain the news event activity names it contains. The invention extracts news event activity names completely, accurately and efficiently.

1. A news event name extraction method based on deep learning is characterized by comprising the following steps:

S1, collecting news text data, labeling the news event names in the news text data, and constructing a news event name data set;

S2, constructing a news event name extraction model by using a pre-trained model and a deep learning method, and training the news event name extraction model with the news event name data set;

S3, predicting on the input news text with the news event name extraction model trained in step S2 to obtain the news event names contained in the news text.

2. The deep learning-based news event name extraction method as claimed in claim 1, wherein the step S1 comprises the sub-steps of:

S11, splitting the collected news text data into sentences at Chinese sentence-ending punctuation and recording the number K of news texts after splitting, wherein K is a positive integer;

S12, making N copies of the K sentence-split news texts and distributing them to N mutually independent labeling systems for data labeling, wherein N is a positive integer;

S13, establishing an evaluation center service that collects the labeled data of the N labeling systems, evaluates their labeling quality, and returns data with labeling disputes to the labeling systems until the disputes are eliminated, and generating the news event activity name data set once the preset conditions are met.

3. The deep learning-based news event name extraction method of claim 1, wherein constructing the news event name extraction model in step S2 includes constructing: a text character encoding layer, a text word-segmentation encoding layer, a text word encoding layer, a text feature fusion layer and an event activity name extraction layer.

4. The deep learning-based news event name extraction method as claimed in claim 1, wherein the step S3 comprises the sub-steps of:

S31, splitting the collected news text data into sentences at Chinese sentence-ending punctuation, and inputting the split news text data into the news event name extraction model;

S32, using the news event name extraction model to obtain the candidate set of event activity names contained in the news text, $E = \{e_1, e_2, \dots, e_R\}$, wherein $R$ is the number of candidate event activity names and $e_r$ is the r-th event activity name;

S33, post-processing the news event names to obtain the event name prediction result contained in the input news text data.

5. The deep learning-based news event name extraction method as claimed in claim 2, wherein in step S12, after the N copies are distributed to the N mutually independent labeling systems, N news practitioners perform the data labeling.

6. The deep learning-based news event name extraction method as claimed in claim 2, wherein the step S13 includes the sub-steps of:

S131, setting a dispute decision threshold $\alpha$ and a data quality approval threshold $\beta$;

S132, for the labeled data of the same text content from the N labeling systems, if the N labeling systems produce M different labeling results for the same data, wherein M is a positive integer and the number of systems giving the i-th labeling result is $m_i$, $i = 1, 2, \dots, M$, calculating the labeling dispute decision weight $w$ as follows:

$$w = \frac{\max(m_1, m_2, \dots, m_M)}{N}$$

wherein $\max$ is the function that returns the maximum value;

S133, judging whether the labeling result of the current news text is in dispute, as follows:

$w \ge \alpha$: no dispute; $w < \alpha$: dispute;

S134, processing according to the judgment of step S133: if the labeling result of the current news text is in dispute, the labels of all N systems for the current news text are cleared and the text is returned to the N systems for re-labeling; if the labeling result is not in dispute, the text is recorded as dispute-free labeled text data, and the number of dispute-free labeled texts is counted and denoted $K_{\mathrm{nd}}$;

S135, repeating steps S132 to S134 for all K news texts, and calculating the proportion $\rho$ of dispute-free labeled texts in the total number of texts as follows:

$$\rho = \frac{K_{\mathrm{nd}}}{K};$$

S136, if the result of S135 satisfies the condition $\rho \ge \beta$, exporting the K news texts and their optimal labeling results as the news event activity name data set, wherein the optimal labeling result of a news text is defined as the i-th labeling result whose count $m_i$ is the largest among the M labeling results of the N labeling systems; if the result of S135 does not satisfy the condition $\rho \ge \beta$, repeating steps S132 to S135 until the condition is satisfied, and then exporting the K news texts and their optimal labeling results as the news event activity name data set.
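
The evaluation flow of steps S131 to S136 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the threshold names `alpha` and `beta`, the form of the weight (largest agreeing count divided by N), and the use of "greater than or equal" in both comparisons are assumptions made here, because the formula images are not preserved in the text.

```python
from collections import Counter

def dispute_weight(labels):
    """Share of the N labeling systems that agree on the most common result.

    Assumed form of the dispute decision weight: max(m_i) / N.
    """
    counts = Counter(labels)              # m_i for each distinct labeling result
    return max(counts.values()) / len(labels)

def evaluate(annotations, alpha, beta):
    """annotations holds one list of N labels per news text (K lists in total).

    Returns (approved, ratio, best, disputed):
      approved  True if the dispute-free ratio reaches beta (assumed >=)
      ratio     dispute-free texts divided by K
      best      most frequent label per text, None where disputed
      disputed  indices of texts to send back for re-labeling
    """
    best, disputed = [], []
    for idx, labels in enumerate(annotations):
        if dispute_weight(labels) >= alpha:   # assumed: >= means no dispute
            best.append(Counter(labels).most_common(1)[0][0])
        else:
            best.append(None)
            disputed.append(idx)
    ratio = (len(annotations) - len(disputed)) / len(annotations)
    return ratio >= beta, ratio, best, disputed

# Made-up labels from N = 3 systems for K = 3 texts.
annotations = [["A", "A", "A"], ["A", "B", "A"], ["A", "B", "C"]]
approved, ratio, best, disputed = evaluate(annotations, alpha=0.6, beta=0.5)
```

In this toy run the third text is the only one returned for re-labeling, since no label is shared by at least 60% of the systems.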

7. A deep learning based news event name extraction method as claimed in claim 3, wherein constructing the text character encoding layer comprises the sub-steps of: performing character-level encoding of the text with the pre-trained model BERT, converting each character j of the input news text into a real-valued vector $u_j$ of a set dimension, the dimension being denoted p.

8. A deep learning based news event name extraction method as claimed in claim 3, wherein constructing the text participle coding layer comprises the sub-steps of:

S2A1, segmenting the input news text into words and tagging each character according to the BMES rule;

S2A2, defining a word-segmentation encoding matrix $W \in \mathbb{R}^{4 \times p}$, whose first row is the segmentation code corresponding to B in the BMES rule, whose second row is the segmentation code corresponding to M, whose third row is the segmentation code corresponding to E, and whose fourth row is the segmentation code corresponding to S;

S2A3, using the word-segmentation encoding matrix $W$ to convert each character j of the input news text into a real-valued vector $h_j$ of dimension p, and constructing the word-segmentation embedding matrix $H \in \mathbb{R}^{L \times p}$, wherein the j-th row of H is $h_j$ and L is the number of characters of the input news text.
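
Steps S2A1 to S2A3 amount to tagging each character and then looking up one row of a 4-row matrix per tag. The sketch below shows this in plain Python under stated assumptions: the words are already segmented (a segmenter such as jieba would normally produce them), the encoding matrix `W` is filled with random placeholder values rather than trained parameters, and all function names are hypothetical.

```python
import random

# Row order of the encoding matrix as stated in step S2A2: B, M, E, S.
TAG_ROW = {"B": 0, "M": 1, "E": 2, "S": 3}

def bmes_tags(words):
    """Tag every character of a pre-segmented sentence with B, M, E or S."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                  # single-character word or punctuation
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def segmentation_embedding(tags, W):
    """Build H: row j is the row of W selected by the tag of character j."""
    return [W[TAG_ROW[tag]] for tag in tags]

p = 4                                          # illustrative embedding dimension
rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(p)] for _ in range(4)]  # placeholder values

words = ["北京", "国际电影节", "开幕", "。"]       # made-up segmented sentence
tags = bmes_tags(words)
H = segmentation_embedding(tags, W)            # L rows, each of dimension p
```

Because every character with the same tag shares one row of `W`, the layer adds only 4 × p parameters regardless of sentence length.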

9. A deep learning based news event name extraction method as claimed in claim 8, wherein constructing the text word coding layer comprises the sub-steps of:

S2B1, counting the number L of characters of the input news text;

S2B2, constructing an initial word-segmentation representation matrix $Q \in \mathbb{R}^{L \times L}$, with every element of Q initialized to 0;

S2B3, constructing a character embedding matrix $U \in \mathbb{R}^{L \times p}$, wherein the j-th row of U is $u_j$;

S2B4, updating the word-segmentation representation matrix to $\tilde{Q}$;

S2B5, calculating the word embedding matrix $V$ of the input news text according to the following formula:

$$V = \tilde{Q} \otimes U$$

wherein $\otimes$ denotes matrix multiplication.

10. The deep learning-based news event name extraction method of claim 9, wherein constructing a text feature fusion layer comprises the sub-steps of:

S2C1, concatenating in order the word-segmentation embedding matrix $H$ obtained in step S2A3, the character embedding matrix $U$ obtained in step S2B3 and the word embedding matrix $V$ obtained in step S2B5 to obtain a three-dimensional text representation matrix $T$;

S2C2, constructing a convolutional neural network layer $C$ and applying a convolution operation to $T$ to obtain the convolved fused three-dimensional text representation matrix $T_C$;

S2C3, constructing a max-pooling layer P and applying max pooling to $T_C$ along the second dimension to obtain the fused text representation matrix $T_P$;

S2C4, constructing a context semantic fusion layer that applies a bidirectional long short-term memory (BiLSTM) neural network to the fused text representation matrix $T_P$ to perform context semantic fusion, obtaining the context text representation matrix $Z$.
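
The tensor shapes in steps S2C1 and S2C3 can be made concrete with a small plain-Python sketch. It assumes the three L × p matrices are stacked into an L × 3 × p layout, so that the "second dimension" indexes the three views. The convolution of S2C2 and the BiLSTM of S2C4 are deliberately left out, since they need a deep learning framework. The function names and numbers are illustrative only.

```python
def stack_features(H, U, V):
    """Stack three L x p matrices into an L x 3 x p nested list (assumed layout)."""
    return [[h, u, v] for h, u, v in zip(H, U, V)]

def max_pool_dim2(T):
    """Max-pool over the second dimension: L x 3 x p becomes L x p."""
    return [[max(column) for column in zip(*views)] for views in T]

# Toy matrices with L = 2 characters and p = 2 dimensions.
H = [[1, 0], [0, 2]]   # word-segmentation embeddings
U = [[0, 3], [5, 1]]   # character embeddings
V = [[2, 2], [4, 4]]   # word embeddings

T = stack_features(H, U, V)
pooled = max_pool_dim2(T)
```

For each character and each feature position, pooling keeps the strongest of the three views, which is why the result returns to a two-dimensional L × p shape before the context fusion layer.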

11. The deep learning-based news event name extraction method as claimed in claim 9, wherein the step S2B4 comprises the sub-steps of:

S2B41, initializing the word-initial character statistic [initial value not reproduced in the source text];

S2B42, if the BMES tag of the l-th character of the input news text is S, setting [assignment not reproduced in the source text];

if the BMES tag of the l-th character of the input news text is B, setting [two assignments not reproduced in the source text];

if the BMES tag of the l-th character of the input news text is M or E, setting [assignment and its index range not reproduced in the source text];

S2B43, performing step S2B42 for each character of the input news text in turn, starting from the first character, thereby completing the update of the word-segmentation representation matrix, which is denoted $\tilde{Q}$.
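
Because the assignments in step S2B42 are not preserved in the text, the following Python sketch shows only one plausible reading, offered as an assumption rather than as the patented rule: a character tagged S or B gets a 1 on the diagonal, B also records the word start `s`, and a character tagged M or E gets 1s in its row from column `s` up to itself. Under this reading, multiplying by the character embedding matrix gives each character the sum of the character vectors of its word seen so far. All names are hypothetical.

```python
def build_q(tags):
    """Build an L x L word-segmentation representation matrix from BMES tags.

    Assumed update rule, not taken from the source text.
    """
    L = len(tags)
    Q = [[0] * L for _ in range(L)]
    s = 0                                  # index of the current word's first character
    for l, tag in enumerate(tags):
        if tag == "S":
            Q[l][l] = 1
        elif tag == "B":
            s = l
            Q[l][l] = 1
        else:                              # "M" or "E": link back to the word start
            for i in range(s, l + 1):
                Q[l][i] = 1
    return Q

def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

tags = ["B", "M", "E", "S"]                # one three-character word, one single
U = [[1, 0], [0, 1], [2, 2], [5, 5]]       # toy character embeddings, p = 2
Q = build_q(tags)
V = matmul(Q, U)                           # word embedding matrix, L x p
```

Other readings, such as averaging over the whole word, would change only `build_q`; the product in step S2B5 stays the same.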

12. The deep learning based news event name extraction method of claim 10, wherein constructing the event activity name extraction layer comprises the sub-steps of: using a CRF (conditional random field) algorithm with the context text representation matrix $Z$ as input to obtain the prediction of event activity names.

13. The deep learning based news event name extraction method of claim 4, wherein the post-processing in step S33 includes the sub-steps of:

S331, if the current event activity name $e_r$ contains only one character, deleting the current event activity name from the event activity name candidate set;

if the first or last character of the current event activity name $e_r$ is one of an enumeration comma, a comma, a semicolon, a colon, a period, an exclamation mark, a question mark or an ellipsis, deleting the corresponding punctuation at the head and tail and keeping the remaining text in the candidate set as the event activity name;

if the BMES tag of the first character of the current event activity name $e_r$ is neither B nor S, deleting the current event activity name from the event activity name candidate set;

if the BMES tag of the last character of the current event activity name $e_r$ is neither E nor S, deleting the current event activity name from the event activity name candidate set;

S332, executing step S331 in turn for each $e_r$, $r = 1, 2, \dots, R$, to obtain the revised event activity name candidate set, which is taken as the final prediction result.
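
The post-processing rules of step S331 can be sketched as a filter over candidate spans. This is an illustration under assumptions: candidates are given as hypothetical (start, end) character offsets with the end exclusive, the punctuation set uses full-width Chinese marks, and a second length check after trimming is added here so that a name reduced to one character is also dropped.

```python
PUNCT = set("、，；：。！？…")   # assumed full-width forms of the listed marks

def post_process(sentence, tags, spans):
    """Filter candidate spans; tags holds the BMES tag of each character."""
    kept = []
    for start, end in spans:
        if end - start <= 1:                     # rule 1: single character
            continue
        while start < end and sentence[start] in PUNCT:    # rule 2: trim head
            start += 1
        while end > start and sentence[end - 1] in PUNCT:  # rule 2: trim tail
            end -= 1
        if end - start <= 1:                     # added check after trimming
            continue
        if tags[start] not in ("B", "S"):        # rule 3: must start a word
            continue
        if tags[end - 1] not in ("E", "S"):      # rule 4: must end a word
            continue
        kept.append(sentence[start:end])
    return kept

sentence = "北京国际电影节开幕。"                  # made-up example
tags = list("BEBMMMEBES")                        # BMES tags for that sentence
spans = [(0, 7), (3, 7), (7, 10), (9, 10), (0, 6)]
result = post_process(sentence, tags, spans)
```

Here the span starting inside a word, the span ending inside a word and the lone punctuation mark are all discarded, while a trailing period is trimmed from an otherwise valid name.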

Technical Field

The invention relates to the field of extraction of news text content information, in particular to a method for extracting news event activity names based on deep learning.

Background

In the new media era, the volume of news received every day is growing explosively, and grasping news content quickly has become an urgent task. Automatically extracting news event activity names lets users conveniently review current hot events and can also support business scenarios such as recommendation, deduplication and event ranking lists, so it has significant practical value.

At present, news event activity name extraction is handled as a sequence labeling task, which is similar to entity recognition but harder. Practical production faces several difficulties. First, in Chinese word segmentation, a segmentation error can cut a word in the wrong place and leave the extracted event activity name incomplete. Second, event activity names are often longer than entities, so the capture and propagation of context information over longer text must be handled, otherwise extraction is easily incomplete. Third, event activity names tend to have more complex syntactic features, and their structure is more complex and varied than that of entities.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, provides a news event activity name extraction method based on deep learning to solve the problems in the background, and has the advantages of completeness, accuracy and high efficiency in extracting news event activity names.

The purpose of the invention is realized by the following scheme:

a news event activity name extraction method based on deep learning comprises the following steps:

S1, collecting news text data, labeling the news event names in the news text data, and constructing a news event name data set;

S2, constructing a news event name extraction model by using a pre-trained model and a deep learning method, and training the news event name extraction model with the news event name data set;

S3, predicting on the input news text with the news event name extraction model trained in step S2 to obtain the news event names contained in the news text.

Further, step S1 includes the sub-steps of:

Further, constructing the news event activity name extraction model in step S2 includes constructing: a text character encoding layer, a text word-segmentation encoding layer, a text word encoding layer, a text feature fusion layer and an event activity name extraction layer.

Further, step S3 includes the sub-steps of:

S31, splitting the collected news text data into sentences at Chinese sentence-ending punctuation, and inputting the split news text data into the news event name extraction model;

S33, post-processing the news event names to obtain the event name prediction result contained in the input news text data.

Further, in step S12, after the N copies are distributed to the N mutually independent labeling systems, N news practitioners perform the data labeling.

Further, step S13 includes the sub-steps of:

S131, setting a dispute decision threshold $\alpha$ and a data quality approval threshold $\beta$;

S132, for the labeled data of the same text content from the N labeling systems, if the N labeling systems produce M different labeling results for the same data, wherein M is a positive integer and the number of systems giving the i-th labeling result is $m_i$, $i = 1, 2, \dots, M$, calculating the labeling dispute decision weight $w$ as follows:

$$w = \frac{\max(m_1, m_2, \dots, m_M)}{N}$$

wherein $\max$ is the function that returns the maximum value;

S133, judging whether the labeling result of the current news text is in dispute, as follows:

$w \ge \alpha$: no dispute; $w < \alpha$: dispute;

S134, processing according to the judgment of step S133: if the labeling result of the current news text is in dispute, the labels of all N systems for the current news text are cleared and the text is returned to the N systems for re-labeling; if the labeling result is not in dispute, the text is recorded as dispute-free labeled text data, and the number of dispute-free labeled texts is counted and denoted $K_{\mathrm{nd}}$;

S135, repeating steps S132 to S134 for all K news texts, and calculating the proportion $\rho$ of dispute-free labeled texts in the total number of texts as follows:

$$\rho = \frac{K_{\mathrm{nd}}}{K};$$

S136, if the result of S135 satisfies the condition $\rho \ge \beta$, exporting the K news texts and their optimal labeling results as the news event activity name data set, wherein the optimal labeling result of a news text is defined as the i-th labeling result whose count $m_i$ is the largest among the M labeling results of the N labeling systems; if the result of S135 does not satisfy the condition $\rho \ge \beta$, repeating steps S132 to S135 until the condition is satisfied, and then exporting the K news texts and their optimal labeling results as the news event activity name data set.

Further, constructing the text character encoding layer comprises the sub-steps of: performing character-level encoding of the text with the pre-trained model BERT, converting each character j of the input news text into a real-valued vector $u_j$ of a set dimension, the dimension being denoted p.

Further, constructing the text word-segmentation encoding layer comprises the sub-steps of:

S2A1, segmenting the input news text into words and tagging each character according to the BMES rule;

Further, constructing the text word encoding layer comprises the sub-steps of:

S2B1, counting the number L of characters of the input news text;

S2B2, constructing an initial word-segmentation representation matrix $Q \in \mathbb{R}^{L \times L}$, with every element of Q initialized to 0;

S2B3, constructing a character embedding matrix $U \in \mathbb{R}^{L \times p}$, wherein the j-th row of U is $u_j$;

S2B4, updating the word-segmentation representation matrix to $\tilde{Q}$;

S2B5, calculating the word embedding matrix $V$ of the input news text according to the following formula:

$$V = \tilde{Q} \otimes U$$

wherein $\otimes$ denotes matrix multiplication.

Further, constructing a text feature fusion layer comprises the sub-steps of:

S2C2, constructing a convolutional neural network layer $C$ and applying a convolution operation to the three-dimensional text representation matrix $T$ to obtain the convolved fused three-dimensional text representation matrix $T_C$;

S2C3, constructing a max-pooling layer P and applying max pooling to $T_C$ along the second dimension to obtain the fused text representation matrix $T_P$;

Further, step S2B4 includes the sub-steps of:

S2B41, initializing the word-initial character statistic [initial value not reproduced in the source text];

S2B42, if the BMES tag of the l-th character of the input news text is S, setting [assignment not reproduced in the source text];

if the BMES tag of the l-th character of the input news text is B, setting [two assignments not reproduced in the source text];

if the BMES tag of the l-th character of the input news text is M or E, setting [assignment and its index range not reproduced in the source text];

Further, constructing the event activity name extraction layer comprises the sub-steps of: using a CRF (conditional random field) algorithm with the context text representation matrix $Z$ as input to obtain the prediction of event activity names.

Further, the post-processing in step S33 includes the sub-steps of:

S331, if the current event activity name $e_r$ contains only one character, deleting the current event activity name from the event activity name candidate set;

S332, executing step S331 in turn for each $e_r$, $r = 1, 2, \dots, R$, and taking the revised event activity name candidate set as the final prediction result.

The invention has the beneficial effects that:

The embodiments of the invention solve the problems raised in the Background and extract news event names completely, accurately and efficiently.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a frame diagram of a news event name extraction method based on deep learning in an embodiment of the present invention.

Fig. 2 is a flowchart of steps for constructing an automatic evaluation center service according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a process of extracting news event names by using a deep learning-based news event name extraction system.

Detailed Description

All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.

As shown in fig. 1, a deep learning-based news event name extraction method includes the steps of:

S1, collecting news text data, labeling the news event names in the news text data, and constructing a news event name data set;

S2, constructing a news event name extraction model by using a pre-trained model and a deep learning method, and training the news event name extraction model with the news event name data set;

S3, predicting on the input news text with the news event name extraction model trained in step S2 to obtain the news event names contained in the news text.

In other alternative embodiments of the present invention, it should be further explained that step S1 includes the sub-steps of:

S11, splitting the collected news text data into sentences at Chinese sentence-ending punctuation, and recording the number K of news texts after splitting, wherein K is a positive integer; the news text data can be collected from the Internet, broadcast television, and newspapers and magazines; the Chinese sentence-ending punctuation includes: the period ("。"), the exclamation mark ("！"), the question mark ("？") and the ellipsis ("……");
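
A minimal Python sketch of the sentence splitting in step S11 follows. It assumes the full-width forms of the four punctuation marks and keeps consecutive end marks (such as the two-character ellipsis) attached to the sentence they close. The function name and the sample text are made up for illustration.

```python
END_MARKS = set("。！？…")   # period, exclamation mark, question mark, ellipsis

def split_sentences(text):
    """Split Chinese text into sentences at sentence-ending punctuation."""
    sentences, buffer = [], []
    for i, ch in enumerate(text):
        buffer.append(ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # Cut after an end mark unless the next character is also an end mark,
        # so that "……" and similar runs stay with their sentence.
        if ch in END_MARKS and nxt not in END_MARKS:
            sentences.append("".join(buffer).strip())
            buffer = []
    tail = "".join(buffer).strip()
    if tail:                                   # text without a closing mark
        sentences.append(tail)
    return [s for s in sentences if s]

news = "第十届北京国际电影节今日开幕。你去了吗？太精彩了……明天继续"
sentences = split_sentences(news)
K = len(sentences)                             # number of texts after splitting
```

The same routine could serve step S31 at prediction time, so that training and inference see text split in the same way.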

In other alternative embodiments of the present invention, it should be further explained that the constructing the news event name extraction model in step S2 includes constructing: the system comprises a text character coding layer, a text word segmentation coding layer, a text word coding layer, a text characteristic fusion layer and an event activity name extraction layer.

In other alternative embodiments of the present invention, it should be further explained that step S3 includes the sub-steps of:

S31, splitting the collected news text data into sentences at Chinese sentence-ending punctuation, and inputting the split news text data into the news event name extraction model; the Chinese sentence-ending punctuation includes: the period ("。"), the exclamation mark ("！"), the question mark ("？") and the ellipsis ("……");

S33, post-processing the news event names to obtain the event name prediction result contained in the input news text data.

In another alternative embodiment of the present invention, it should be further explained that, in step S12, after the N copies are distributed to the N mutually independent labeling systems, N news practitioners perform the data labeling.

In other alternative embodiments of the present invention, it should be further explained that, as shown in fig. 2, step S13 includes the sub-steps of:

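S131, setting a dispute decision threshold $\alpha$ and a data quality approval threshold $\beta$;

S132, for the labeled data of the same text content from the N labeling systems, if the N labeling systems produce M different labeling results for the same data, wherein M is a positive integer and the number of systems giving the i-th labeling result is $m_i$, $i = 1, 2, \dots, M$, calculating the labeling dispute decision weight $w$ as follows:

$$w = \frac{\max(m_1, m_2, \dots, m_M)}{N}$$

wherein $\max$ is the function that returns the maximum value;

S133, judging whether the labeling result of the current news text is in dispute, as follows:

$w \ge \alpha$: no dispute; $w < \alpha$: dispute;

S134, processing according to the judgment of step S133: if the labeling result of the current news text is in dispute, the labels of all N systems for the current news text are cleared and the text is returned to the N systems for re-labeling; if the labeling result is not in dispute, the text is recorded as dispute-free labeled text data, and the number of dispute-free labeled texts is counted and denoted $K_{\mathrm{nd}}$;

S135, repeating steps S132 to S134 for all K news texts, and calculating the proportion $\rho$ of dispute-free labeled texts in the total number of texts as follows:

$$\rho = \frac{K_{\mathrm{nd}}}{K};$$

S136, if the result of S135 satisfies the condition $\rho \ge \beta$, exporting the K news texts and their optimal labeling results as the news event activity name data set, wherein the optimal labeling result of a news text is defined as the i-th labeling result whose count $m_i$ is the largest among the M labeling results of the N labeling systems; if the result of S135 does not satisfy the condition $\rho \ge \beta$, repeating steps S132 to S135 until the condition is satisfied, and then exporting the K news texts and their optimal labeling results as the news event activity name data set.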

In other optional embodiments of the present invention, it should be further explained that constructing the text character encoding layer includes the sub-steps of: performing character-level encoding of the text with the pre-trained model BERT, converting each character j of the input news text into a real-valued vector $u_j$ of a set dimension, the dimension being denoted p.

In other alternative embodiments of the present invention, it should be further explained that, as shown in fig. 3, constructing the text word-segmentation encoding layer includes the sub-steps of:

S2A1, segmenting the input news text into words and tagging each character according to the BMES rule; the open-source toolkit jieba can be used for the segmentation in this step; under the "BMES" rule, the first character of a multi-character word is tagged "B", its middle characters are tagged "M" and its last character is tagged "E", and a word or punctuation mark consisting of a single character is tagged "S";

In other optional embodiments of the present invention, it should be further explained that constructing the text word encoding layer includes the sub-steps of:

S2B1, counting the number L of characters of the input news text;

S2B2, constructing an initial word-segmentation representation matrix $Q \in \mathbb{R}^{L \times L}$, with every element of Q initialized to 0;

S2B3, constructing a character embedding matrix $U \in \mathbb{R}^{L \times p}$, wherein the j-th row of U is $u_j$;

S2B4, updating the word-segmentation representation matrix to $\tilde{Q}$;

S2B5, calculating the word embedding matrix $V$ of the input news text according to the following formula:

$$V = \tilde{Q} \otimes U$$

wherein $\otimes$ denotes matrix multiplication.

In other alternative embodiments of the present invention, it should be further explained that, as shown in fig. 3, the building of the text feature fusion layer includes the sub-steps of:

S2C2, constructing a convolutional neural network layer $C$ and applying a convolution operation to the three-dimensional text representation matrix $T$ to obtain the convolved fused three-dimensional text representation matrix $T_C$;

S2C3, constructing a max-pooling (Maxpool) layer P and applying max pooling to $T_C$ along the second dimension to obtain the fused text representation matrix $T_P$;

In other alternative embodiments of the present invention, it should be further explained that step S2B4 includes the sub-steps of:

S2B41, initializing the word-initial character statistic [initial value not reproduced in the source text];

S2B42, if the BMES tag of the l-th character of the input news text is S, setting [assignment not reproduced in the source text];

if the BMES tag of the l-th character of the input news text is B, setting [two assignments not reproduced in the source text];

if the BMES tag of the l-th character of the input news text is M or E, setting [assignment and its index range not reproduced in the source text];

In other alternative embodiments of the present invention, it should be further explained that constructing the event activity name extraction layer includes the sub-steps of: using a CRF (conditional random field) algorithm with the context text representation matrix $Z$ as input to obtain the prediction of event activity names.

In other alternative embodiments of the present invention, it should be further explained that the post-processing in step S33 includes the sub-steps of:

S331, if the current event activity name $e_r$ contains only one character, deleting the current event activity name from the event activity name candidate set;

S332, executing step S331 in turn for each $e_r$, $r = 1, 2, \dots, R$, and taking the revised event activity name candidate set as the final prediction result.

Other embodiments than the above examples may be devised by those skilled in the art based on the foregoing disclosure, or by adapting and using knowledge or techniques of the relevant art, and features of various embodiments may be interchanged or substituted and such modifications and variations that may be made by those skilled in the art without departing from the spirit and scope of the present invention are intended to be within the scope of the following claims.

The functionality of the present invention, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product stored in a storage medium, which causes a computer device (which may be a personal computer, a server, or a network device) and corresponding software to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, or an optical disk; test or actual data used in the program implementation reside in read-only memory (ROM), random access memory (RAM), and the like.
