Method for rewriting abstract of Chinese patent

Document No.: 907632  Publication date: 2021-02-26

Reading note: This technology, "Method for rewriting abstract of Chinese patent" (中文专利摘要改写方法), was designed and created by Lü Xueqiang, You Xindong, and Dong Zhi'an on 2020-12-15. Its main content is as follows: the application discloses a Chinese patent abstract rewriting method comprising document preprocessing, sentence distributed representation, and sentence extraction. In the method provided by the embodiments of the application, a patent term dictionary is introduced; key sentences of the patent specification text are extracted by a reinforcement-learning-based sentence extraction method; candidate abstract sentences are generated by a Transformer deep neural network text generation method; the original patent abstract information is then fused in; and the rewritten abstract is obtained through semantic deduplication and reordering. This achieves end-to-end patent abstract rewriting with excellent performance on the ROUGE-1, ROUGE-2, and ROUGE-L metrics, clearly surpassing other sequence generation baseline methods, which helps reduce the cost of manual rewriting and improves the efficiency of patent data processing.

1. A Chinese patent abstract rewriting method, characterized by comprising the following step: document preprocessing.

2. The method of claim 1, wherein the Chinese patent abstract rewriting method further comprises:

sentence distributed representation;

sentence extraction.

3. The method of claim 2, wherein the document preprocessing comprises: performing word segmentation and part-of-speech tagging on sentences of the patent documents using a word segmentation tool.

4. The method of claim 2, wherein the sentence distributed representation comprises:

the final vector representation of the sentence is calculated using Doc2Vec.

5. The method of claim 2, wherein the sentence extraction comprises:

sentence representations of the document are learned using Doc2Vec, and a pointer network extracts sentences based on those representations.

6. The method of claim 2, wherein the sentence extraction comprises: defining the hidden states of the encoder and the decoder as (e_1, …, e_n) and (d_1, …, d_m), respectively; training a pointer network with an LSTM structure to cyclically extract key sentences represented by Doc2Vec; the extraction probability is calculated as

P(j_i | j_1, …, j_{i-1}) = softmax(u^t)

where, for the LSTM, at each output step t, d_t is the output of the decoder LSTM and w and v are training parameters; at each step the decoder applies an attention mechanism: it first attends to the encoder states e_j to obtain a context vector, and softmax normalizes the score vector u^t into an output distribution over the input dictionary, yielding the extraction probability.

7. The method of claim 2, wherein the sentence extraction comprises:

rewriting the extracted document sentences into abstract sentences using a generation network; the method uses a Transformer model and adds a copy mechanism to directly copy unknown words.

8. The method of claim 2, wherein the sentence extraction comprises:

optimizing each sub-module separately using maximum likelihood estimation, training an extractor to select important sentences and a generator to produce the rewritten summary; reinforcement learning is then applied to train the complete model end-to-end.

9. The method of claim 8, wherein the sentence extraction comprises:

formulating sentence selection as a classification problem;

selecting sentences with a greedy algorithm so as to maximize the global summary-level ROUGE score: based on the sentence-level ROUGE-L_recall(k_t, l_t) score, each manual abstract sentence l_t is matched to exactly the one candidate sentence k_t from the document with the maximum score; these candidate sentences are marked with pseudo training labels, and the extractor is then trained by minimizing the cross-entropy loss;

combining the key sentences and abstract sentences obtained from the pseudo training labels into pairs {(key sentence, abstract), …, (key sentence, abstract)};

the generation network is a Transformer model trained to minimize the cross-entropy loss of the decoder language model at each generation step, L(θ_abs) = −Σ_m log P_{θ_abs}(w_m | w_1, …, w_{m−1}), where θ_abs is the set of training parameters of the generator and w_m is the m-th generated word;

constructing a Markov Decision Process (MDP): at each extraction step t, the agent observes the current state s_t = (K, k_{t−1}) and takes an extraction action a_t: π(s_t) = P(k_t | k_1, …, k_{t−1}) to extract a document sentence k_t; the generator then rewrites this extracted sentence k_t and feeds back a reward r(t) = ROUGE-L_F1(T(k_t), l_t), where T is the generator;

the total reward accumulated over the whole reinforcement learning process is R(τ) = Σ_{t=1..N} r(t), where θ_π denotes the network parameters of π(s), τ = {τ_1, τ_2, …, τ_N}, and N is the number of extractions;

training the extractor with policy-based reinforcement learning;

defining a state value function V^π(s) to evaluate the reward earned by an extraction action, and defining a baseline reward b that is used to evaluate the advantage function A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t);

taking the overall expectation, R(τ_n) is maximized using the following policy gradient: ∇_{θ_π} J = E[∇_{θ_π} log π(a_t | s_t) · A^π(s_t, a_t)];

training the critic by minimizing the squared loss L_critic = (V^π(s_t) − R_t)^2;

learning the number of sentences to extract;

in the reinforcement learning training stage, adding a stop vector V_EOE with the same dimensions as the sentence representation; the reward for the extraction action that executes V_EOE is set to the ROUGE-1 F1 score.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the method of any one of claims 1-9.

Technical Field

The application relates to the technical field of text processing, in particular to a method for rewriting a Chinese patent abstract.

Background

Patent documents are one of the most effective carriers of technical information. Through detailed and rigorous analysis, competitive-intelligence analysts can extract a large amount of useful information from them, allowing an enterprise to exploit published patent information and realize its distinct economic value. Manually rewriting patent abstracts is an important technical means of acquiring patent information. As the number of patent applications keeps growing, the cost of manual rewriting rises accordingly, and automatically rewriting patent abstracts with automatic text summarization technology becomes increasingly important. Existing automatic summarization methods suffer from sentence redundancy and low accuracy when handling multi-sentence summarization and rewriting, and cannot meet the requirements of deep processing of patent data.

Disclosure of Invention

The application aims to provide a Chinese patent abstract rewriting method. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

According to an aspect of an embodiment of the present application, there is provided a method for rewriting a Chinese patent abstract, including:

preprocessing a document;

sentence distributed representation;

sentence extraction.

Further, the document preprocessing comprises:

and performing word segmentation and part-of-speech tagging on sentences of the patent documents by using a word segmentation tool.

Further, the sentence distributed representation comprises:

the final vector representation of the sentence is calculated using Doc2Vec.

Further, the sentence extraction includes:

sentence representations of the document are learned using Doc2Vec, and a pointer network extracts sentences based on those representations.

Further, the sentence extraction includes: the hidden states of the encoder and the decoder are defined as (e_1, …, e_n) and (d_1, …, d_m), respectively; a pointer network with an LSTM structure is trained to cyclically extract key sentences represented by Doc2Vec; the extraction probability is calculated as

P(j_i | j_1, …, j_{i-1}) = softmax(u^t)

For the LSTM, at each output step t, d_t is the output of the decoder LSTM, and w and v are training parameters; at each step the decoder applies an attention mechanism: it first attends to the encoder states e_j to obtain a context vector, and softmax normalizes the score vector u^t into an output distribution over the input dictionary, yielding the extraction probability.

Further, the sentence extraction includes:

rewriting the extracted document sentences into abstract sentences using a generation network; the method uses a Transformer model and adds a copy mechanism to directly copy unknown words.

Further, the sentence extraction includes:

optimizing each sub-module separately using maximum likelihood estimation, training an extractor to select important sentences and a generator to produce the rewritten summary; reinforcement learning is then applied to train the complete model end-to-end.

Further, the sentence extraction includes:

formulating sentence selection as a classification problem;

selecting sentences with a greedy algorithm so as to maximize the global summary-level ROUGE score: based on the sentence-level ROUGE-L_recall(k_t, l_t) score, each manual abstract sentence l_t is matched to exactly the one candidate sentence k_t from the document with the maximum score; these candidate sentences are marked with pseudo training labels, and the extractor is then trained by minimizing the cross-entropy loss;

combining the key sentences and abstract sentences obtained from the pseudo training labels into pairs {(key sentence, abstract), …, (key sentence, abstract)};

the generation network is a Transformer model trained to minimize the cross-entropy loss of the decoder language model at each generation step, L(θ_abs) = −Σ_m log P_{θ_abs}(w_m | w_1, …, w_{m−1}), where θ_abs is the set of training parameters of the generator and w_m is the m-th generated word;

constructing a Markov Decision Process (MDP): at each extraction step t, the agent observes the current state s_t = (K, k_{t−1}) and takes an extraction action a_t: π(s_t) = P(k_t | k_1, …, k_{t−1}) to extract a document sentence k_t; the generator then rewrites this extracted sentence k_t and feeds back a reward r(t) = ROUGE-L_F1(T(k_t), l_t), where T is the generator;

total reward accumulated throughout reinforcement learning processθπAs a network parameter of pi(s) and thetaπ={τ1,τ2,K,πNN is the number of times of extraction;

training the extractor with policy-based reinforcement learning;

defining a state value function V^π(s) to evaluate the reward earned by an extraction action, and defining a baseline reward b that is used to evaluate the advantage function A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t);

for overall expectation, the gradient maximization of R (τ) is achieved using the following strategyn):

training the critic by minimizing the squared loss L_critic = (V^π(s_t) − R_t)^2;

learning the number of sentences to extract;

in the reinforcement learning training stage, adding a stop vector V_EOE with the same dimensions as the sentence representation; the reward for the extraction action that executes V_EOE is set to the ROUGE-1 F1 score.

According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the method described above.

The technical scheme provided by one aspect of the embodiment of the application can have the following beneficial effects:

according to the Chinese patent abstract rewriting method provided by the embodiment of the application, the patent term dictionary is introduced, the key sentences of the patent specification text are extracted through a sentence extraction method based on reinforcement learning, the candidate abstract is generated through a Transformer deep neural network text generation method, the original abstract information of the patent is finally fused, the rewritten abstract is obtained through semantic deduplication and sequencing, end-to-end patent abstract rewriting is achieved, the performance is excellent on the evaluation standards of ROUGE-1, ROUGE-2 and ROUGE-L, the method is obviously superior to other sequence generation reference methods, the cost of manual rewriting is reduced, and the working efficiency of patent data processing is improved.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description, the claims, and the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for rewriting a chinese patent abstract according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The reinforcement learning (RL) mechanism of the embodiments of the present application connects an extractor (abstract extraction) and a generator (abstract generation) for end-to-end training. The embodiments use sentence-level rewards to optimize the extractor while keeping the ML (maximum likelihood) trained generative decoder fixed, gaining the benefits of both. The embodiments first use an extractor agent to select key sentences, and then use a Transformer model to rewrite the extracted sentences in turn. To overcome the non-differentiability of the extraction step and train the extractor, the embodiments use an Actor-Critic policy gradient with a sentence-level metric reward to connect the two neural networks and learn sentence importance from document-abstract pairs that carry no sentence-level labels. The sentence-level reinforcement learning of the embodiments takes the word-sentence hierarchy into account, which better models language structure and makes parallelization possible. The rewriting network is a simple encode-align-decode model, trained on pseudo document-abstract sentence pairs obtained by an automatic matching criterion. The method thus combines the advantages of generation methods that simply rewrite sentences and that generate new words from a full vocabulary, while using the extraction step to improve the quality, speed, and stability of the overall model. Experimental results show that the method performs well on all ROUGE evaluation metrics on a real data set.

The embodiments of the application provide RLCPAR, a patent abstract rewriting method based on the original abstract and the specification. It combines patent characteristics with a patent term vocabulary and, building on automatic summarization technology, generates a new abstract by fusing the patent abstract with the specification content, thereby rewriting the patent abstract. An automatic summarization model that connects extractive and generative components through reinforcement learning is proposed, combining the advantages of extractive and generative methods. The model achieves good results on all metrics across multiple versions of a Chinese medicinal material patent abstract data set, effectively exploits the hierarchical structure of words and sentences, and fuses word features with sentence-level semantic information. Finally, for texts in the Chinese medicinal material patent field, the abstracts generated by RLCPAR are highly semantically correlated with the manual abstracts.

RLCPAR: chinese patent abstract rewriting based on reinforcement learning

The embodiments of the application treat summarizing a given long text document as the task of selecting ordered key sentences, which are then combined into a multi-sentence abstract. The extractor sequentially extracts important sentences from the document, and the generator then rewrites this subset of key sentences into a summary. RLCPAR is formed by connecting the two sub-modules with a reinforcement learning mechanism.

Preprocessing

Based on a patent term vocabulary and a Chinese herbal medicine dictionary, the embodiments of the application use the jieba word segmentation tool to perform word segmentation and part-of-speech tagging on sentences, as shown in Table 1 and in the sketch following the table.

Table 1 sentence preprocessing example
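
A minimal sketch of this preprocessing step is shown below; the dictionary file names and the example sentence are illustrative assumptions, not taken from the patent:

```python
# Preprocessing sketch: word segmentation + POS tagging with jieba, after
# loading domain dictionaries (patent terms, Chinese herbal medicine).
# The dictionary file names below are hypothetical placeholders.
import jieba
import jieba.posseg as pseg

jieba.load_userdict("patent_terms.txt")          # one entry per line: word [freq] [POS]
jieba.load_userdict("herbal_medicine_dict.txt")

def preprocess(sentence: str):
    """Return (word, POS tag) pairs for one patent sentence."""
    return [(word, flag) for word, flag in pseg.cut(sentence)]

print(preprocess("本发明公开了一种中药组合物及其制备方法"))
```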

Distributed representation of sentences

Doc2Vec is a distributed document representation method that directly converts sentences or paragraphs into fixed-dimension vectors. The embodiments of the present application use Doc2Vec to compute the final vector representation of each sentence. This representation captures not only the relationships between words but also the relationships between sentences and documents.
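
The following sketch shows how such sentence vectors could be computed with gensim's Doc2Vec; the corpus contents and hyperparameters are illustrative assumptions, not the patent's settings:

```python
# Sentence distributed representation sketch using gensim's Doc2Vec (4.x API).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each pre-segmented sentence becomes a TaggedDocument with a unique tag.
corpus = [
    TaggedDocument(words=["一种", "中药", "组合物"], tags=["doc0_sent0"]),
    TaggedDocument(words=["制备", "方法", "包括", "以下", "步骤"], tags=["doc0_sent1"]),
]

model = Doc2Vec(corpus, vector_size=128, window=5, min_count=1, epochs=40)

v_seen = model.dv["doc0_sent0"]                 # vector for a training sentence
v_new = model.infer_vector(["一种", "制备", "工艺"])  # vector for an unseen sentence
print(v_seen.shape, v_new.shape)                # (128,), (128,)
```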

Extraction model

The key sentence extraction module can be viewed as sequentially extracting key sentences from the document: the embodiments of the application use Doc2Vec to learn sentence representations of the document and a Pointer Network to extract sentences based on those representations.

For convenience of notation, the embodiments of the present application define the hidden states of the encoder and the decoder as (e_1, …, e_n) and (d_1, …, d_m). A pointer network with an LSTM structure is trained to cyclically extract the key sentences represented by Doc2Vec. The extraction probability is calculated as:

P(j_i | j_1, …, j_{i-1}) = softmax(u^t) (2)

For the LSTM, at each output step t, d_t is the output of the decoder LSTM, and w and v are training parameters. At each step the decoder applies an attention mechanism: it first attends to the encoder states e_j to obtain a context vector, and softmax then normalizes the score vector u^t (of length n) into an output distribution over the input dictionary, which gives the extraction probability. Specifically, the extraction probability of already-extracted sentences is forcibly set to zero, which prevents the model from selecting repeated sentences and avoids redundancy. Because this operation is non-differentiable, it can only be trained through reinforcement learning. At each extraction step, the model effectively classifies all sentences of the document.
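
The sketch below shows one plausible form of this pointer scoring step in PyTorch, using the common additive-attention formulation u_j = v^T tanh(w_1 e_j + w_2 d_t); the exact scoring function, dimensions, and masking details are assumptions based on the description above:

```python
# Pointer-network extraction sketch: score every encoder sentence state
# against the decoder state d_t, mask already-extracted sentences to zero
# probability, and softmax into an extraction distribution.
import torch
import torch.nn as nn

class PointerScorer(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(hidden, hidden, bias=False)
        self.w2 = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, enc_states, dec_state, extracted_mask):
        # enc_states: (n, hidden); dec_state: (hidden,); extracted_mask: (n,) bool
        u = self.v(torch.tanh(self.w1(enc_states) + self.w2(dec_state))).squeeze(-1)
        u = u.masked_fill(extracted_mask, float("-inf"))  # forbid repeated sentences
        return torch.softmax(u, dim=-1)  # P(j_i | j_1, ..., j_{i-1}) over n sentences

scorer = PointerScorer(hidden=256)
probs = scorer(torch.randn(12, 256), torch.randn(256),
               torch.zeros(12, dtype=torch.bool))
print(probs.sum())  # ~1.0
```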

Generating networks

The generation network rewrites the extracted document sentences into concise abstract sentences. The embodiments of the application use a standard Transformer model and add a copy mechanism to directly copy out-of-vocabulary (OOV) words.
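
The following sketch shows a copy mechanism of the common pointer-generator kind, which mixes the generator's vocabulary distribution with copy probabilities over source tokens so OOV words can be emitted directly; the p_gen gating form is an assumption, not necessarily the patent's exact design:

```python
# Copy-mechanism sketch: blend generation and copying in one distribution
# over an extended vocabulary that includes source-only (OOV) tokens.
import torch

def copy_distribution(vocab_logits, attn_weights, src_ids, p_gen, extended_vocab_size):
    # vocab_logits: (V,) generator scores; attn_weights: (src_len,) attention
    # over source tokens; src_ids: (src_len,) ids in the extended vocabulary
    # (OOV source words receive ids >= V); p_gen: gate in (0, 1).
    vocab_dist = torch.softmax(vocab_logits, dim=-1)
    final = torch.zeros(extended_vocab_size)
    final[: vocab_dist.size(0)] = p_gen * vocab_dist
    # Scatter-add the copy probability mass onto the source token ids.
    final.scatter_add_(0, src_ids, (1.0 - p_gen) * attn_weights)
    return final

dist = copy_distribution(torch.randn(50), torch.softmax(torch.randn(7), -1),
                         torch.randint(0, 55, (7,)), 0.8, 55)
print(dist.sum())  # ~1.0
```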

The sequential nature of RNNs makes it difficult to take full advantage of modern fast computing devices such as TPUs and GPUs, which excel at parallel rather than sequential processing. Convolutional Neural Networks (CNNs) are less strictly sequential, but the number of steps required to combine information from distant parts of the input still grows with the distance.

In contrast, the Transformer is a neural network architecture based on the self-attention mechanism that is particularly suited to language understanding tasks. It works as follows: at each step, it applies a self-attention mechanism that directly captures the relationships between all words in the sentence, regardless of their positions. More specifically, to compute the next representation of a given word, the current word is compared with every other word in the sentence. These comparisons yield an attention score for each word in the sentence, and the attention scores determine how much each other word contributes to the next representation of the current word. The attention scores are then used to form a weighted average of all word representations, which is fed into a fully connected network to generate a new representation of the current word.
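
The sketch below illustrates the scaled dot-product self-attention this paragraph describes; the dimensions and the feed-forward layer are illustrative choices:

```python
# Self-attention sketch: compare every word with every other word, turn the
# comparison scores into attention weights, take the weighted average of all
# word representations, and feed it through a fully connected network.
import math
import torch
import torch.nn as nn

def self_attention(x, wq, wk, wv):
    # x: (seq_len, d_model)
    q, k, v = wq(x), wk(x), wv(x)
    scores = q @ k.transpose(0, 1) / math.sqrt(k.size(-1))  # (seq, seq) comparisons
    weights = torch.softmax(scores, dim=-1)                 # attention scores
    return weights @ v                                      # weighted average

d = 64
wq, wk, wv = (nn.Linear(d, d) for _ in range(3))
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
new_repr = ffn(self_attention(torch.randn(10, d), wq, wk, wv))  # (10, 64)
```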

Reinforced learning

Since the extractor of the embodiments performs non-differentiable sentence extraction, the embodiments apply a standard policy gradient method to bridge back-propagation and form an end-to-end trainable computational graph. However, simply training the entire model end-to-end from a randomly initialized network is not feasible. When randomly initialized, the extractor would often select irrelevant sentences, so the generator would struggle to learn to rewrite abstractively. Conversely, without a trained generator, the extractor would receive noisy rewards, leading to a poor estimate of the policy gradient and a suboptimal policy. Therefore, the embodiments first optimize each sub-module using maximum likelihood (ML) estimation: the extractor is trained to select important sentences, and a generator is used to produce the rewritten summary. Finally, reinforcement learning is applied to train the complete model end-to-end, as shown in FIG. 1.

An extractor: in the above extraction model, the embodiments of the present application formulate sentence selection as a classification. However, the original data set is not annotated for the importance of each sentence. Thus, embodiments of the application select sentences using a greedy algorithm to maximize the global summary level of the ROUGE evaluation value, according to the individual sentence level ROUGE-Lrecall(kt,lt) Maximum score for each artificial abstract sentence ltExactly matching 1 candidate sentence k from a documentt. These candidate sentences are labeled with pseudo-training labels and then the decimator is trained with minimal cross-entropy loss.

The generator: the embodiments combine the key sentences and abstract sentences obtained from the pseudo training labels into pairs {(key sentence, abstract), …, (key sentence, abstract)}. The generation network is a Transformer model trained to minimize the cross-entropy loss of the decoder language model at each generation step, L(θ_abs) = −Σ_m log P_{θ_abs}(w_m | w_1, …, w_{m−1}), where θ_abs is the set of training parameters of the generator and w_m is the m-th generated word.

Reinforcement learning is mainly used to solve the problem that discrete text is non-differentiable during training. The following briefly describes how the policy gradient technique is applied to optimize RLCPAR. To turn the extractor into a reinforcement learning agent, the embodiments construct a Markov Decision Process (MDP): at each extraction step t, the agent observes the current state s_t = (K, k_{t−1}) and takes an extraction action a_t: π(s_t) = P(k_t | k_1, …, k_{t−1}) to extract a document sentence k_t; the generator then rewrites this extracted sentence k_t and feeds back a reward:

r(t) = ROUGE-L_F1(T(k_t), l_t)

where T is the generator. The total reward accumulated over the whole reinforcement learning process is R(τ) = Σ_{t=1..N} r(t), where θ_π denotes the network parameters of π(s), τ = {τ_1, τ_2, …, τ_N}, and N is the number of extractions. The extractor can then be trained with policy-based reinforcement learning. The extractor uses ROUGE-L_recall because the embodiments aim to extract sentences containing as much information as possible for rewriting; ROUGE-L_F1 is more suitable for the generator, because the generator should rewrite the extracted sentence k_t as compactly as possible while ensuring the semantics are not distorted.

The embodiments use the Advantage Actor-Critic (A2C) algorithm, a synchronous version of the classic A3C algorithm, to optimize the extractor, and define a state value function V^π(s) to evaluate the reward obtained by an extraction action. A baseline reward b is also defined and used to evaluate the advantage function:

A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t)

Taking the overall expectation, the embodiments maximize R(τ_n) using the following policy gradient:

∇_{θ_π} J = E[∇_{θ_π} log π(a_t | s_t) · A^π(s_t, a_t)]

The critic is trained by minimizing the squared loss:

L_critic = (V^π(s_t) − R_t)^2

if the decimator selects a good sentence, the matching ROUGE value will be high after the generator has overwritten, thus encouraging this action to be taken. If an incorrect sentence is selected, while the generator still generates its rewritten version, the summary does not match the ground truth, and a lower route score is a penalty for this behavior. Embodiments of the present application use reinforcement learning as sentence extraction guidance without changing the language model of the generator, while applying reinforcement learning at the word level, which may tend to earn high scores at the expense of language fluency.

Learning the number of sentences to extract. In a typical reinforcement learning setting such as a game, episodes are usually terminated by the environment. In text summarization, by contrast, the extractor does not know in advance how many summary sentences to generate for a given article. The embodiments make an important, simple, and intuitive change to solve this problem: a "stop" action is added to the policy's action space. In the reinforcement learning training stage, the embodiments add a stop vector V_EOE with the same dimensions as the sentence representation. The pointer network decoder treats V_EOE as one of the extraction candidates, which naturally leads to a stop action in the stochastic policy. The embodiments set the reward for the extraction action that executes V_EOE to the ROUGE-1 F1 score (a better measure of bag-of-words information); for any extraneous, unwanted extraction steps, the extractor receives zero reward. The model is thus encouraged to keep extracting while standard abstract sentences remain (to accumulate intermediate rewards) and learns to stop by optimizing the global ROUGE and avoiding extra extractions. Overall, this modification allows dynamic decisions based on the number of sentences in the input document, without tuning a fixed number of steps, and enables data-driven adaptation to any particular data set or application.

Existing summary generation systems tend to produce repeated and redundant words and phrases on long documents. To alleviate this, a coverage mechanism and, at test time, beam search combined with trigram blocking can be employed. RLCPAR performs well without these measures, because its summary sentences are generated from mutually exclusive document sentences, which naturally avoids redundancy. Nevertheless, with a simple reranking strategy, the embodiments further improve summary quality by removing some cross-sentence repetition: at the sentence level, the same beam-search trigram blocking is applied to avoid redundancy. The embodiments keep all k sentence candidates produced by beam search, where k is the beam size. Then all k^n combinations of the n generated summary sentences are reranked to produce a usefully diversified list; the fewer the repeated n-grams, the better the resulting rewritten abstract.
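
The following sketch illustrates this sentence-level reranking; enumerating all k^n combinations is only practical for small beam sizes, and the repeated-trigram count is the repetition measure described above:

```python
# Reranking sketch: among all combinations of beam candidates for the n
# summary sentences, prefer the combination with the fewest repeated
# trigrams across sentences.
from itertools import product

def trigrams(tokens):
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def repeated_trigram_count(sentences):
    seen, repeats = set(), 0
    for sent in sentences:
        for tri in trigrams(sent):
            repeats += tri in seen
            seen.add(tri)
    return repeats

def rerank(beam_candidates):
    # beam_candidates: n lists, each holding k tokenized candidate sentences;
    # product(...) enumerates all k**n combinations, as described above.
    return min(product(*beam_candidates), key=repeated_trigram_count)
```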

For example, in one embodiment of the present application, a "Chinese medicine" patent abstract rewriting data set is constructed, comprising 11400 patent specifications and 11400 manually rewritten patent abstracts provided by a patent company. The full text of each patent specification and patent abstract undergoes word segmentation, sentence segmentation, and stop-word filtering to form a (document, manual abstract) set. The data set is split into a training set of 9000, a validation set of 1200, and a test set of 1200. Each training sample consists of a patent application number, a manual abstract, the original abstract, and the specification, where the manual abstract adds the preparation method to the original abstract.

For all data sets, the evaluation criteria used are ROUGE-1, ROUGE-2, and ROUGE-L. ROUGE is an automatic summary evaluation method.

Parameter settings of the Chinese patent abstract rewriting model: the input is limited to a maximum length of 100 words; sentences exceeding the maximum length are truncated to the first 100 words, and shorter ones are padded with <PAD>. The program is implemented with the PyTorch deep learning framework; the RNN encoder and decoder are composed of LSTM units, with an attention mechanism, a coverage mechanism, and a pointer generator added. The hidden layers of the decoder and encoder are set to 256. When the validation loss does not fall below the current minimum within 5 rounds, training stops early. Decoding uses the Beam Search algorithm with Beam Size set to 4. The number of sentences decoded to form the abstract is 3 to 6, with each sentence at most 100 words and at least 6 words long.

The experimental environment: a Linux server with two Intel Xeon E5-2603 v4 processors and an NVIDIA Tesla K40M GPU.

The main parameter settings in the model are shown in Table 2 below.

Table 2 parameter settings in the model

The original abstract is a summary of the specification; it contains the important information of the specification plus knowledge added by humans. In an embodiment of the application, the original abstract and the specification are concatenated as the full input to the model, so that the rewritten abstract retains the content of the original abstract while new sentences are extracted from the specification text. This greatly improves the completeness and fluency of the rewritten abstract, and the result of the RLCPAR method surpasses the original abstract.

The abstract rewritten by RLCPAR is abstractive. In one embodiment of the present application, the abstractiveness score is calculated as the proportion of novel n-grams in the generated abstract, that is, n-grams absent from the input document. One potential reason for high novelty is that, when trained on single sentence pairs, the model learns to delete more document words in order to rewrite a single simple sentence into the kind of compact sentence found in human summaries, thereby increasing n-gram novelty. The rewritten abstract is simpler than the original, its information is more complete, it contains the four basic elements of invention name, raw material composition, preparation process, and efficacy, and the machine-rewritten abstract is closer to the result rewritten by human experts.
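
The sketch below illustrates this n-gram novelty measure; the default n=2 is an illustrative choice:

```python
# Novelty sketch: fraction of n-grams in the generated abstract that never
# appear in the input document, as described above.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated, document, n=2):
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen - ngrams(document, n)) / len(gen)
```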

Overall, RLCPAR does not encode every sentence of the long input document sequence; instead it uses a human-inspired coarse-to-fine approach, first extracting all salient sentences and then decoding (rewriting) them, a process that can be parallelized. This also avoids almost all redundancy problems, since the model has already selected non-redundant salient sentences for summary generation. For additional benefit, the embodiments also fuse in the original abstract, which ensures the completeness of the rewritten abstract.

An electronic device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the Chinese patent abstract rewriting method of any one of the above embodiments.

According to the reinforcement-learning-based Chinese patent abstract rewriting method of the embodiments of the application, a patent term dictionary is introduced; key sentences of the patent specification text are extracted by a reinforcement-learning-based sentence extraction method; candidate abstract sentences are generated by a Transformer deep neural network text generation method; the original patent abstract information is then fused in; and the rewritten abstract is obtained through semantic deduplication and reordering. This achieves end-to-end patent abstract rewriting with excellent performance on the ROUGE-1, ROUGE-2, and ROUGE-L metrics, clearly surpassing other sequence generation baseline methods, which reduces the cost of manual rewriting and improves the efficiency of patent data processing.

The embodiments of the application adopt a Chinese patent abstract rewriting model that attends to the hierarchical structure of words and sentences. RLCPAR both extracts and rewrites: the extractor alone performs key sentence extraction, and the generator is then applied to the extracted sentence set to produce the abstract. The generator of RLCPAR rewrites each sentence and generates new words from a large vocabulary, so every word in the final abstract is regenerated; RLCPAR is therefore classified as a text generation method. The embodiments demonstrate how the best model selects key sentences and then rewrites them, and the reader can see how the generator concisely rewrites the extracted sentences while staying faithful to the ground truth. The method integrates the advantages of key sentence extraction and text generation to rewrite patent abstracts, and obtains good results on abstract data in the traditional Chinese medicine patent field. It reduces the cost of manual rewriting and improves the efficiency of patent data processing.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.

It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in a strict order unless explicitly stated in the present embodiment, and may be performed in other orders. Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.

The above-mentioned embodiments only express the embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.
