Chinese document extractive summarization method

Document No.: 907633    Publication date: 2021-02-26

Note: This technology, "Chinese document extractive summarization method", was designed and created by You Xindong, Lü Xueqiang, Li Bao'an, and Sun Shaoqi on 2020-12-15. Abstract: This application discloses a Chinese document extractive summarization method, comprising: text vectorization; identification of elementary discourse units; and extraction of summary sentences. The text vectorization includes performing sentence segmentation, word segmentation, and identifier insertion on the input text, and vectorizing the text with the pretrained Chinese BERT model. In the Chinese document extractive summarization method provided by the embodiments of this application, BERT performs the text vectorization, better capturing the semantics of long-text context and improving the accuracy of information extraction; on the basis of identifying the elementary discourse units of long Chinese text, the elementary discourse units serve as the extraction objects, reducing the redundancy of summary extraction; finally, a Transformer-based neural extraction model extracts the elementary discourse units, improving the accuracy of summary-sentence extraction.

1. A Chinese document extractive summarization method, characterized by comprising the following steps:

vectorizing the text;

identifying elementary discourse units;

extracting summary sentences.

2. The method of claim 1, wherein the text vectorization comprises: performing sentence segmentation, word segmentation, and identifier insertion on the input text, and vectorizing the text with the pretrained Chinese BERT model.

3. The method of claim 1, wherein the text vectorization comprises: for a text D = {S_1, S_2, …, S_i} with i sentences, preprocessing the input text with two special tokens: inserting a [CLS] token before each sentence (in BERT, [CLS] aggregates features from a sentence or a pair of sentences) and a [SEP] token after each sentence, the [CLS] and [SEP] tokens marking the sequential structure of the text; then performing word-segmentation preprocessing on the text with the StanfordNLP toolkit;

defining V_t as the token embedding of the text, encoded by the BERT preprocessing model, which embeds each character;

defining V_s as the segment embedding of the text, used to distinguish the sentences of the text: when the sentence index i is odd, sentence S_i is assigned the embedding E_A, and correspondingly, when i is even, S_i is assigned E_B;

defining V_p as the position embedding of the text: the segmented text has n characters in total, and [E_1, E_2, …, E_n] represents the order of each character;

distinguishing the position of each sentence by [CLS] and [SEP], each sentence vector being represented as T_i = [V_t; V_s; V_p], i ∈ D, completing the vectorization of the text.

4. The method of claim 1, wherein the identifying elementary discourse units comprises:

splitting the input text into sentences with the Berkeley NLP tool, taking the Chinese full-sentence punctuation marks as boundaries, to obtain the full sentences of the text;

splitting each full sentence further according to the comma-clause principle, taking commas as boundaries, to obtain its clauses, while distinguishing the left and right of each clause and marking them as left clause and right clause;

segmenting each obtained clause into words with the Jieba word segmentation tool;

performing part-of-speech tagging on each segmented word with the Jieba part-of-speech tool;

and identifying the clauses against the elementary discourse unit identification rule base, according to the clause segmentation rules, to obtain the elementary discourse units of each sentence.

5. The method of claim 1, wherein the extracting summary sentences comprises:

providing a Transformer-based neural extraction model, extracting elementary discourse units, and finally generating the summary;

the neural extraction model stacks three Transformer layers, each Transformer being a fully attention-based Seq2Seq model composed of an Encoder and a Decoder, the Transformer consisting of 6 encoders and 6 decoders.

6. The method of claim 5, wherein the extracting summary sentences further comprises: applying positional encodings to the input text T_i using equation (1),

P_i = PE(T_i)    (1)

forming the encoder input vector X_i by equation (2), which combines the input encodings and the positional encodings,

X_i = [T_i, P_i]    (2)

initializing three matrices W_q, W_k, W_v and multiplying each with X_i via equation (3) to obtain Q, K, and V,

Q = X_i × W_q,  K = X_i × W_k,  V = X_i × W_v    (3)

computing the attention for the current input state using equation (4),

Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (4)

computing further with the Multi-Head Attention mechanism, which projects Q, K, and V through h different linear transformations via equations (5) and (6) and finally concatenates the different attention results;

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (5)

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O    (6)

then subjecting the multi-head output Z to an add-and-normalize calculation, followed by a feed-forward network computed by equation (7),

FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2    (7)

and finally performing add-and-normalize once more, completing the whole Encoder layer.

7. The method of claim 6, wherein the extracting summary sentences further comprises: the Decoder layer has the same structure as the Encoder layer, except that a Masked Multi-Head Attention sublayer is added at the front; masking is added to the first Multi-Head Attention sublayer in the Decoder to ensure that the prediction at position i depends only on outputs at positions less than i, so that predicting position i cannot access future information;

after the Decoder, a linear layer produces the output, which is classified via equation (8) to judge whether the current input unit is extracted,

where 1 indicates extraction and 0 indicates no extraction.

8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the method of any one of claims 1-7.

9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method of any one of claims 1-7.

Technical Field

The present application relates to the technical field of text processing, and in particular to a Chinese document extractive summarization method.

Background

With the rapid development of the internet, people face vast amounts of textual information such as news, documents, and reports. The traditional reading mode requires people to read the whole text themselves and summarize the core content, which is inefficient and costly, so obtaining the main summary of a long text quickly and accurately is an urgent problem. Text summarization is one of the most important research areas in natural language processing and has become a research hotspot with the rise of deep learning, but summarizing long Chinese texts faces greater challenges: long text-summary corpora are scarce, summary information is extracted inaccurately, target summaries are redundant, and summary sentences are missed.

Text summarization methods can be divided into extractive and abstractive (generative) summarization. Extractive methods select salient sentences from a document according to the features of its words and sentences and combine them into a summary, the importance of a sentence depending on its statistical and linguistic features. Abstractive methods aim to understand the main concepts in a given document and then express them in concise natural language. The two approaches have their respective advantages and disadvantages.

Extractive summarization guarantees, to the greatest extent, that summary content comes from the original text. Its task targets are well suited to text carriers such as scientific literature, legal documents, and medical reports, improving the accuracy of summary content and preventing inaccurate or even erroneous information from being generated. However, extractive summarization has a significant drawback: its unit of extraction is a sentence of the text, so a wrong extraction decision loses summary content, and the extracted summary carries considerable redundancy. Because Chinese expression is loose, not everything a sentence expresses is important enough to serve as a component of the summary; the redundancy problem of sentence-level extractive summarization is therefore serious, since within a sentence often only part of the content should be extracted while the remainder does not belong in the summary. Beyond these inherent drawbacks, summarization research on long Chinese texts faces a further challenge: in the current public mainstream corpora, the texts and summaries are mostly short, and Chinese long text-summary corpora are insufficient in number.

Disclosure of Invention

The purpose of the present application is to provide a Chinese document extractive summarization method. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key or critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description presented later.

According to one aspect of the embodiments of the present application, a Chinese document extractive summarization method is provided, comprising:

vectorizing the text;

identifying elementary discourse units;

extracting summary sentences.

Further, the text vectorization includes: performing sentence segmentation, word segmentation, and identifier insertion on the input text, and vectorizing the text with the pretrained Chinese BERT model.

Further, the text vectorization includes:

for a text D = {S_1, S_2, …, S_i} with i sentences, the input text is preprocessed with two special tokens: a [CLS] token is inserted before each sentence (in BERT, [CLS] aggregates features from a sentence or a pair of sentences) and a [SEP] token after each sentence, the [CLS] and [SEP] tokens marking the sequential structure of the text; word-segmentation preprocessing is then performed on the text with the StanfordNLP toolkit;

V_t is defined as the token embedding of the text, encoded by the BERT preprocessing model, which embeds each character;

V_s is defined as the segment embedding of the text, used to distinguish the sentences of the text: when the sentence index i is odd, sentence S_i is assigned the embedding E_A, and correspondingly, when i is even, S_i is assigned E_B;

V_p is defined as the position embedding of the text: the segmented text has n characters in total, and [E_1, E_2, …, E_n] represents the order of each character;

the position of each sentence is distinguished by [CLS] and [SEP], and each sentence vector is represented as T_i = [V_t; V_s; V_p], i ∈ D, completing the vectorization of the text.

Further, the identifying elementary discourse units includes:

splitting the input text into sentences with the Berkeley NLP tool, taking the Chinese full-sentence punctuation marks as boundaries, to obtain the full sentences of the text;

splitting each full sentence further according to the comma-clause principle, taking commas as boundaries, to obtain its clauses, while distinguishing the left and right of each clause and marking them as left clause and right clause;

segmenting each obtained clause into words with the Jieba word segmentation tool;

performing part-of-speech tagging on each segmented word with the Jieba part-of-speech tool;

and identifying the clauses against the elementary discourse unit identification rule base, according to the clause segmentation rules, to obtain the elementary discourse units of each sentence.

Further, the summary sentence extraction includes:

providing a Transformer-based neural extraction model, extracting elementary discourse units, and finally generating the summary;

the neural extraction model stacks three Transformer layers, each Transformer being a fully attention-based Seq2Seq model composed of an Encoder and a Decoder, the Transformer consisting of 6 encoders and 6 decoders.

Further, the extracting summary sentences further includes: applying positional encodings to the input text T_i using equation (1),

P_i = PE(T_i)    (1)

forming the encoder input vector X_i by equation (2), which combines the input encodings and the positional encodings,

X_i = [T_i, P_i]    (2)

initializing three matrices W_q, W_k, W_v and multiplying each with X_i via equation (3) to obtain Q, K, and V,

Q = X_i × W_q,  K = X_i × W_k,  V = X_i × W_v    (3)

computing the attention for the current input state using equation (4),

Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (4)

computing further with the Multi-Head Attention mechanism, which projects Q, K, and V through h different linear transformations via equations (5) and (6) and finally concatenates the different attention results;

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (5)

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O    (6)

then subjecting the multi-head output Z to an add-and-normalize calculation, followed by a feed-forward network computed by equation (7),

FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2    (7)

and finally performing add-and-normalize once more, completing the whole Encoder layer.

Further, the extracting summary sentences further includes: the Decoder layer has the same structure as the Encoder layer, except that a Masked Multi-Head Attention sublayer is added at the front; masking is added to the first Multi-Head Attention sublayer in the Decoder to ensure that the prediction at position i depends only on outputs at positions less than i, so that predicting position i cannot access future information;

after the Decoder, a linear layer produces the output, which is classified via equation (8) to judge whether the current input unit is extracted,

where 1 indicates extraction and 0 indicates no extraction.

According to another aspect of the embodiments of the present application, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the method described above.

According to another aspect of the embodiments of the present application, a non-transitory computer-readable storage medium is provided, having a computer program stored thereon, the program being executed by a processor to implement the method described above.

The technical solutions provided by one aspect of the embodiments of the present application can have the following beneficial effects:

In the Chinese document extractive summarization method provided by the embodiments of this application, BERT performs the text vectorization, better capturing the semantics of long-text context and improving the accuracy of information extraction; on the basis of identifying the elementary discourse units of long Chinese text, the elementary discourse units serve as the extraction objects, reducing the redundancy of summary extraction; finally, the Transformer neural extraction model extracts the elementary discourse units, improving the accuracy of summary-sentence extraction.

Additional features and advantages of the application will be set forth in the description that follows, and in part will be obvious from the description, or may be learned by practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description, the claims, and the appended drawings.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments described in the present application, and that those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 shows the overall framework of the BETES method according to one embodiment of the present application;

FIG. 2 shows the text vector generation model of one embodiment of the present application;

FIG. 3 shows the rapid identification model for elementary discourse units according to one embodiment of the present application;

FIG. 4 shows the structure of the extraction model of an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

One embodiment of the present application provides a Chinese document extractive summarization method, which may be abbreviated as BETES.

Text vectorization is an extremely important task in natural language processing: a computer cannot read text directly, so vectorization digitizes high-dimensional text and encodes it into low-dimensional vectors. Early text vectorization methods were based on statistics, such as one-hot encoding, TF-IDF, and N-grams; these are costly and yield discrete, sparse representations whose dimensionality is too high and which cannot represent polysemy. To improve model accuracy and reduce cost, Google proposed Word2Vec in 2013, which remains one of the most commonly used word embedding models.

Besides Word2Vec, GloVe is another common word vector model. GloVe can be regarded as a global Word2Vec with the objective function and weighting function replaced; it is easier to parallelize, faster, and easy to train on large-scale corpora.

Although Word2Vec and GloVe work well, they cannot be dynamically optimized for a specific task, and context features of long texts are difficult for them to obtain. The advent of the BERT pretraining model solves this problem well: BERT captures longer-distance dependencies more efficiently and represents context better, so text vectorization with BERT captures the semantics of text better and performs better when applied to downstream tasks.

Elementary discourse unit identification

Many scholars at home and abroad have given their own definitions of elementary discourse units, and the different theoretical views are not entirely consistent. Chinese grammar and expression are more complex and richer, and the form of Chinese elementary discourse units has not yet been fully defined.

Identifying elementary discourse units in the analysis of sentences is a fundamental and important research direction in natural language processing. Identification methods fall roughly into two categories. The first analyzes discourse and sentence structure with linguistic knowledge, determines the form of the elementary discourse units, and then identifies them by establishing rules. The second, currently mainstream, labels a Chinese discourse topic structure corpus (CDTC) and trains an automatic recognition model by deep learning; although recognition accuracy improves, a large amount of corpus data must be labeled, which costs much labor and time and is not conducive to direct and flexible use in other fields.

Extractive summarization methods

With the development of supervised methods such as machine learning and deep learning, supervised approaches are applied more and more to summary generation; they can be divided roughly into three types: conditional random field methods, machine learning methods, and neural network methods. Applied to summary extraction, machine learning can learn extraction features automatically from the characteristics of a corpus, extract summary sentences more accurately, and handle the task in finer detail. However, machine learning has its own shortcomings: when the data volume is large, its algorithms are inefficient, and when the data are diverse, the extraction method lacks flexibility and cannot form a good extraction method for complex data.

In recent years, the deep learning methods of neural networks have become the mainstream approach to text summarization, and summary generation methods are rapidly updated and iterated with the development of neural networks. Natural language processing work should closely follow the latest neural networks, through which better results can be achieved on specific tasks.

Chinese long-document extractive summarization method (BETES)

To address the problems in summary extraction for long Chinese texts, and to improve the accuracy of summary-sentence extraction while reducing the redundancy of the extracted summary, the BETES method is proposed. It mainly comprises three parts:

The text vectorization part: the input text is preprocessed with operations such as sentence segmentation, word segmentation, and identifier insertion, and the text is vectorized with the pretrained Chinese BERT model.

The elementary discourse unit identification part: a rule-based identification model is designed that can automatically identify the elementary discourse units of long Chinese scientific literature.

The summary extraction model: a Transformer-based neural extraction model is designed to automatically extract the generated elementary discourse units, which are then fused to generate the final summary. The overall framework of the BETES method is shown in FIG. 1.

1. Text vector generation model

For long Chinese text data, following the idea of the BERT pretraining model, three kinds of encoding are required: token embedding, segment embedding, and position embedding. The three text encodings are summed to finally obtain the encoding of each sentence within the whole document; the structure is shown in FIG. 2.

For a text D = {S_1, S_2, …, S_i} with i sentences, the input text is first preprocessed with two special tokens: a [CLS] token is inserted before each sentence (in the original BERT, [CLS] aggregates features from a sentence or a pair of sentences), and a [SEP] token is inserted after each sentence; together these tokens mark the sequential structure of a piece of text. The text is then word-segmented with the StanfordNLP toolkit. The vectorized representation of the long Chinese text is completed through three parts of work:

1) First, V_t is defined as the token embedding of the text, encoded by the BERT preprocessing model, which embeds each character;

2) V_s is defined as the segment embedding of the text, used to distinguish the sentences of the text: when the sentence index i is odd, sentence S_i is assigned the embedding E_A, and correspondingly, when i is even, S_i is assigned E_B;

3) V_p is defined as the position embedding of the text: the segmented text has n characters in total, and [E_1, E_2, …, E_n] is defined to represent the order of each character.

Finally, [CLS] and [SEP] distinguish the position of each sentence, and each sentence vector is represented as T_i = [V_t; V_s; V_p], i ∈ D, completing the vectorization of the text. A minimal sketch of this step follows.

2. Elementary discourse unit identification model

At present, elementary discourse units are generally identified by labeling a corpus, training a deep learning recognition model, and recognizing the units with that model. Although the deep-learning approach improves recognition accuracy, it consumes a great deal of labor and time. On balance, the object of this embodiment is long Chinese scientific literature, whose language and sentence patterns are relatively fixed, and the goal is to obtain the elementary discourse units of the text quickly and integrate them flexibly into the overall summary extraction framework. A rapid identification model for elementary discourse units is therefore proposed for the practical needs of this work; the flow chart is shown in FIG. 3.

The flow of the Chinese long-text elementary discourse unit (EDU) identification algorithm (Algorithm 1) is as follows:

1) For the input text, use the Berkeley NLP tool to split the text into sentences, taking the common Chinese full-sentence punctuation marks (periods, question marks, exclamation marks, etc.) as boundaries, to obtain the full sentences of the text.

2) According to the comma-clause principle, split each full sentence further, taking commas as boundaries, to obtain the clauses of the sentence; at the same time, distinguish the left and right of each clause and mark them as left clause and right clause.

3) Segment each obtained clause into words with the Jieba word segmentation tool.

4) Perform part-of-speech tagging on each segmented word with the Jieba part-of-speech tool.

5) Not all clauses segmented at comma boundaries qualify as elementary discourse units, so rules are needed to judge them. For the sentence patterns of Chinese scientific literature, an EDU identification rule base (some rules are shown in Table 1) is formulated according to the clause segmentation rules, and the clauses are checked against it to finally obtain the elementary discourse units of a sentence. A sketch of this pipeline appears after this list.
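A minimal sketch of the five-step pipeline follows, assuming the jieba package is installed; a simple regular expression stands in for the Berkeley NLP sentence splitter, and `is_edu` is a hypothetical placeholder for the Table 1 rule base, which the source does not reproduce in full.

```python
import re
import jieba.posseg as pseg  # jieba word segmentation + part-of-speech tagging

SENT_END = r"(?<=[。？！])"   # full-sentence boundaries: period, ?, !

def is_edu(words) -> bool:
    # Placeholder for the Table 1 rule base, e.g. keep clauses containing a
    # verb; the real rules are specific to Chinese scientific literature.
    return any(flag.startswith("v") for _, flag in words)

def identify_edus(text: str):
    edus = []
    for sent in filter(None, re.split(SENT_END, text)):   # step 1: sentences
        clauses = [c for c in sent.split("，") if c]       # step 2: comma clauses
        for idx, clause in enumerate(clauses):
            side = "left" if idx == 0 else "right"         # left/right clause mark
            words = list(pseg.cut(clause))                 # steps 3-4: words + POS
            if is_edu(words):                              # step 5: rule check
                edus.append((clause, side))
    return edus

print(identify_edus("本文提出一种抽取式摘要方法，提高了摘要句抽取的准确率。"))
```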

Table 1. Elementary discourse unit identification rules

3. Summary sentence extraction model

Summary extraction work is generally viewed as ranking the sentences of the original text by score and selecting the highest-scoring combination of sentences as the final summary. The object of this work is to extract finer-grained elementary discourse units, so the model extracts the elementary discourse units ranked highest by score, which is the same in principle and method as extracting summary sentences.

For extracting summary sentences from text, the mainstream approach is deep learning: a neural network model is trained to extract sentences automatically. Currently common methods, whether LSTM or GRU, mostly take recurrent neural networks as the baseline. In the field of machine translation the superiority of the Transformer neural network has been fully verified, and Transformer-based networks have also been used in text summarization on English datasets with clearly improved results. Therefore, for long Chinese texts, a Transformer-based neural extraction model is proposed that extracts the generated elementary discourse units to finally produce the summary.

The extraction model of this embodiment is a stack of three Transformer layers, each Transformer being a fully attention-based Seq2Seq model whose structure is composed of an Encoder and a Decoder. How the Transformer implements extraction of elementary discourse units is described below; the structure is shown in FIG. 4.

The Transformer consists of 6 encoders and 6 decoders. The encoder structure is shown on the left of FIG. 4; the input object is a long text, so positional encodings are first applied to the input text T_i using equation (1),

P_i = PE(T_i)    (1)
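The source does not spell out which positional encoding PE(·) is used; the following is a minimal sketch assuming the fixed sinusoidal scheme of the original Transformer.

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # P_i = PE(T_i): one d_model-dimensional vector per input position.
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions use cosine
    return pe
```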

For the encoder input vector X_i, equation (2) combines the input encodings and positional encodings into the input vector of the encoder,

X_i = [T_i, P_i]    (2)

Three matrices W_q, W_k, W_v are initialized and each is multiplied with X_i via equation (3) to obtain Q, K, and V,

Q = X_i × W_q,  K = X_i × W_k,  V = X_i × W_v    (3)

For the current input state, the attention is calculated using equation (4),

Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (4)

the Multi-Head Attention mechanism is used for further calculation, and Multi-Head Attention projects h different linear transformation pairs Q, K and V through formulas (5) and (6), and finally different Attention results are spliced together.

MultiHead(Q,K,V)=Concat(head1,K,headh)WO r (6)

Then the multi-head output Z undergoes an add-and-normalize calculation, followed by a feed-forward network computed by equation (7),

FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2    (7)

Finally, add-and-normalize is performed once more, completing the whole Encoder layer; a sketch of one such layer follows.
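The encoder layer just described, equations (3) through (7), can be sketched as follows; d_model = 768 and h = 8 are illustrative assumptions, and nn.MultiheadAttention handles the W_q/W_k/W_v projections and the output matrix W^O internally.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 768, h: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # FFN(Z) = max(0, Z W_1 + b_1) W_2 + b_2, equation (7)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z, _ = self.attn(x, x, x)          # multi-head self-attention, eqs. (4)-(6)
        x = self.norm1(x + z)              # add & normalize
        x = self.norm2(x + self.ffn(x))    # feed-forward, then add & normalize
        return x

out = EncoderLayer()(torch.randn(2, 16, 768))   # (batch, seq_len, d_model)
```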

For the Decoder layer, shown on the right of FIG. 4, the structure is the same as that of the Encoder layer, except that a Masked Multi-Head Attention sublayer is added at the front: masking must be added to the first Multi-Head Attention sublayer in the Decoder to ensure that the prediction at position i depends only on outputs at positions less than i, so that predicting position i cannot access future information. Feed-forward denotes the feed-forward sublayer.

After the Decoder, a linear layer produces the output, which is classified via equation (8) to judge whether the current input unit is extracted,

where 1 means the unit is extracted and 0 means it is not.
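Equation (8) itself is not reproduced in the source; the sketch below assumes a sigmoid binary classifier over the decoder output, together with the causal mask that blocks attention to future positions.

```python
import torch
import torch.nn as nn

def causal_mask(n: int) -> torch.Tensor:
    # True above the diagonal: position i may not attend to positions > i.
    return torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

class ExtractHead(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)   # the final 1-dim linear layer

    def forward(self, decoder_out: torch.Tensor) -> torch.Tensor:
        # Probability that each input unit should be extracted.
        return torch.sigmoid(self.linear(decoder_out)).squeeze(-1)

mask = causal_mask(16)                           # pass as attn_mask to attention
probs = ExtractHead()(torch.randn(2, 16, 768))
extracted = probs > 0.5                          # 1 = extract, 0 = do not extract
```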

Experiments were performed in another example of the present application. For long Chinese texts, a Chinese long text-summary dataset was first constructed. For the experimental setup, nine groups of comparison experiments along three dimensions were designed to demonstrate the effectiveness of the BETES method, and the ROUGE metric was chosen as the evaluation index.

Dataset

In the field of automatic Chinese text summarization, text-summary datasets are scarce. There are two current mainstream datasets: the Harbin Institute of Technology LCSTS short-text news summarization dataset built from Sina Weibo, and the NLPCC (2015, 2017, 2018) Chinese news summarization datasets. In these common corpora, however, both the original texts and the generated summaries are short; Table 2 is a statistical analysis of the data lengths of three corpora.

Table 2. Corpus length statistics

Dataset      Average text length    Average summary length
LCSTS        112                    14
NLPCC2017    1036                   45
CSL          235                    20

As the table shows, the texts and summaries of the current mainstream summarization datasets are very short; even the NLPCC dataset is only of medium length. A Chinese long-text summarization dataset is currently missing: texts meeting a certain length requirement are hard to obtain, and manual construction is difficult and consumes a great deal of manpower and material resources. In view of this deficiency of Chinese long-text summary corpora, a Chinese long text-summary dataset was constructed; its construction process is as follows.

First, ten directions were selected within the field of artificial intelligence. The data sources are the CNKI and Wanfang literature websites; long Chinese scientific documents were obtained by a combination of manual and automatic methods, and summaries were then constructed for the documents.

To construct the summaries of the scientific documents: summary sentences have specific expression patterns, so the sentences suitable for building a summary were extracted using the expression patterns of innovation-point sentences in scientific literature summarized in [24]; meanwhile, the original abstract of each document served as a reference, the extracted sentences and the original abstract were screened and combined, and manual filtering and screening assisted in constructing the summary of each document.

After screening and processing, the constructed Chinese long text-summary dataset contains 3208 scientific documents; statistical analysis gives an average text length of 3802 and an average summary length of 145. The corpus used in the embodiments of the present application is published at the following address:

https://drive.google.com/file/d/1tfml9zC37WoTRfaNL6efjrrmcRbMizmq

Comparison experiments

With the methods introduced above, comparison experiments were carried out in three dimensions: first, a comparison of text vectorization methods; second, a comparison of the elementary-discourse-unit approach; and finally, a comparison of model methods.

Text vectorization experiments: the current mainstream Chinese text vectorization methods were selected, using the Word2Vec and GloVe word vector models, to verify the effectiveness of the text vectorization proposed in this embodiment. Word2Vec and GloVe use models pretrained on Chinese Wikipedia with word vectors set to 300 dimensions, while the comparison experiment adopts the BERT-Base-Chinese pretrained model. In the choice of extraction model, to save training cost, only a simple linear classifier was trained, extracting summary sentences by binary classification.

Extraction model comparison experiments: for the model comparison, this embodiment selects mainstream summary extraction methods and sets up four groups of comparison experiments, as follows:

1) Reinforcement-learning extraction model: the novel extraction method proposed by Narayan et al. [25] is used. GloVe first vectorizes the text; sentence selection is then conceptualized as scoring the sentences, the ROUGE evaluation metric is globally optimized through a reinforcement learning objective, and the highest-scoring sentences are selected.

2) Bidirectional LSTM neural extraction model: training an extraction model with a Bi-LSTM as the neural network is also a current mainstream extractive summarization method. The Bi-LSTM summary-sentence extraction method proposed by Xiao [26] is used: GloVe performs the text vectorization, and a Bi-LSTM network is built and trained to extract summary sentences from the text.

3) Bi-LSTM + Attention extraction model: on the basis of the previous experiment, the text vectorization is kept the same while an attention mechanism is added to the LSTM; this is a fairly common neural network model in the NLP field at present.

4) The BETES model method: BERT performs the text vectorization and a Bert + Transformer extraction model is used. For the rigor of the experiment, to eliminate the influence of BERT text vectorization and compare only the difference between models, Bert + Bi-LSTM is added, replacing the original GloVe text vectorization with BERT, ensuring fairness in the same dimension.

Elementary discourse unit comparison experiment: beyond the text vectorization comparison, to verify the effectiveness of the BETES method proposed in this embodiment on the summarization task, the elementary discourse unit is used as the finer-grained extraction object. On the basis of experiment 4, BERT performs the text vectorization, the elementary discourse units of the text are obtained with the EDU identification algorithm, and the Transformer-based neural extraction model is adopted to verify the final effect of the BETES method proposed here.

Experimental procedure

The experiments were conducted in a PyTorch 1.4.0 environment, using 4 Tesla V100 GPUs for parallel training. The methods of the comparison experiments are described in detail below.

In comparative experiment 1, a simple machine learning method is adopted: only one linear classifier, a SoftMax classifier, is set, and after text vectorization each sentence is classified to identify whether it can be selected as a summary sentence; a sketch follows.
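A minimal sketch of this baseline, assuming 768-dimensional sentence vectors; in PyTorch the softmax is folded into the cross-entropy loss, and the data here are illustrative stand-ins.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(768, 2)            # single linear (SoftMax) classifier
criterion = nn.CrossEntropyLoss()         # applies log-softmax internally
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

# sent_vecs: one vector per sentence; labels: 1 = summary sentence, 0 = not.
sent_vecs = torch.randn(32, 768)
labels = torch.randint(0, 2, (32,))
loss = criterion(classifier(sent_vecs), labels)
loss.backward()
optimizer.step()
```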

In comparative experiment 2, since the dataset of the original method is English, a GloVe word vector model is uniformly used for text vectorization; the word vectors are set to 200 dimensions, the batch size is 20, each batch is trained for 20 rounds, the learning rate is 0.001, and the loss function is the reinforcement-learning-based loss proposed in that paper.

In comparative experiment 3, a bidirectional LSTM is adopted, with 300 hidden units in the LSTM; the batch size of the training data is 128, the learning rate is 0.0001, a binary cross-entropy loss function is adopted, each batch is trained for 50 rounds, the last layer of the multi-layer perceptron is 100-dimensional, and finally the probability of extracting each sentence is calculated through a 1-dimensional linear layer.

In comparative experiment 4, dropout in the BERT model is 0.1, the number of hidden units is 768, the number of hidden layers is 12, and the GELU activation function is adopted; the neural network is two stacked Transformer layers, the learning rate is 0.002, and after 10000 training steps, the last layer is a 1-dimensional linear layer that yields the probability of extracting each sentence. These settings are collected below.
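For reference, the stated hyperparameters of comparative experiment 4 collected as plain constants; the exact trainer wiring is not given in the source, so this is a hedged summary rather than a training script.

```python
BERT_CONFIG = dict(
    dropout=0.1,            # dropout in the BERT model
    hidden_size=768,        # hidden layer units
    num_hidden_layers=12,   # hidden layers
    hidden_act="gelu",      # activation function
)
EXTRACTOR_CONFIG = dict(
    transformer_layers=2,   # Transformer layers stacked on BERT
    learning_rate=0.002,
    train_steps=10000,
    output_dim=1,           # final 1-dim linear layer -> extraction probability
)
```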

In the elementary discourse unit comparison experiment, building on comparative experiment 4, after the text vectorization is obtained, the elementary discourse units are automatically identified with the EDU identification method provided by this embodiment; the EDUs then serve as input, training proceeds with the Bert + Transformer extraction method, and extraction of the elementary discourse units is finally realized.

Experimental results and evaluation

ROUGE is used as the evaluation index for the experimental results of this embodiment. ROUGE is an evaluation method designed for the summary generation task and is now the most widely used evaluation index; the F1 values of ROUGE-1, ROUGE-2, and ROUGE-L are computed to evaluate the results of each comparison experiment. The experimental results are shown in Table 3:

Table 3. Experimental results

The experimental results show that, in the choice of text vectorization model and with the extraction method held constant, the difference between the Word2Vec and GloVe word vector models is very small, while text vectorization with BERT improves markedly over both. For the summary-sentence extraction model, with the influence of text vectorization eliminated, using Bert + Transformer as the extraction model works better than the current mainstream summary extraction models. In the final comparison experiment, after the best extraction model is determined, taking the elementary discourse unit as the finer-grained extraction object further improves summary-sentence extraction, which demonstrates the effectiveness of the BETES method.
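For reference, a minimal sketch of computing the ROUGE-1/2/L F1 values reported in Table 3, assuming the `rouge-score` package; its tokenizer is English-oriented, so Chinese text is usually pre-segmented into space-separated tokens first, and the strings here are illustrative.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score(
    "reference summary tokens separated by spaces",   # gold summary
    "extracted summary tokens separated by spaces",   # model output
)
for name, s in scores.items():
    print(name, round(s.fmeasure, 4))   # the F1 value of each ROUGE metric
```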

The optimal extraction model obtained in the comparison experiments, Bert + Transformer, is used to extract summaries from Chinese scientific documents; at the same time, the BETES method of this embodiment is used for comparison. The comparison results of the examples are shown in Table 5:

Table 5. Example comparison results

As can be seen from Table 5, when the best extraction model Bert + Transformer is used to extract summary sentences from a long document, the number of sentences the model extracts is fixed: when the number is large, the summary becomes redundant and the training load of the model increases; when the number is small, the needed sentences may not be extracted, so summary content is lost and the accuracy of the final summary suffers. The BETES method takes the elementary discourse unit as a finer-grained extraction object; in the extraction process, much redundant information is left unextracted, which reduces the redundancy of the extracted summary while extracting more of the needed information and preventing loss of the final summary content, finally demonstrating the effectiveness of the method of this embodiment.

This embodiment takes summary extraction from long Chinese texts as its research object and proposes the BETES method. A Chinese long text-summary corpus is constructed based on rules and manually assisted screening; the BERT preprocessing model performs the text vectorization, better capturing the semantics of long-text context and improving the accuracy of information extraction; on the basis of identifying the elementary discourse units of long Chinese text, the elementary discourse units serve as the extraction objects, reducing the redundancy of summary extraction; finally, the Transformer neural extraction model extracts the elementary discourse units, improving the accuracy of summary-sentence extraction. The BETES method improves accuracy and reduces redundancy in the extractive summarization of long Chinese texts, and its ROUGE scores are superior to mainstream summary extraction methods.

The above embodiments only express implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within its scope of protection. Therefore, the protection scope of the present application shall be subject to the appended claims.
