Chinese question automatic generation method and device based on domain terms and key sentences

Document No.: 1889937    Publication date: 2021-11-26

Reading note: This technique, "A Chinese question automatic generation method and device based on domain terms and key sentences", was created by 赵军, 董勤伟, 查显光, 吴俊�, 赵新冬, 戴威 and 于聪聪 on 2021-09-01. Abstract: The invention discloses a Chinese question generation method and device based on domain terms and key sentences. The method comprises: building a dependency syntax structure for the sentences of an input document and generating candidate domain terms according to dependency syntax rules; evaluating and ranking the generated candidate domain terms, and extracting a specified number of domain terms based on the ranking result; representing each sentence of the input document by the TF-IDF values of its words, computing sentence importance with the T-TextRank algorithm, and extracting a specified number of key sentences based on the importance ranking; and finally generating Chinese multiple-choice question stems, Chinese fill-in-the-blank question stems and Chinese question-and-answer question stems from the extracted domain terms and key sentences. The domain terms and key sentences extracted by the method greatly improve the importance of the generated questions, and the method has broad application prospects.

1. A Chinese question generation method based on domain terms and key sentences is characterized by comprising the following steps:

extracting domain terms and key sentences in the document based on dependency syntax analysis;

generating a plurality of types of questions based on the extracted domain terms and the key sentences;

wherein extracting domain terms in the document based on dependency parsing includes:

establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules;

evaluating and ranking the generated candidate domain terms;

extracting a specified number of domain terms based on the ranking result;

extracting key sentences in the document based on the dependency syntax analysis comprises the following steps:

calculating TF-IDF values of words in the input document;

calculating similarity between sentences in the document based on the TF-IDF values;

calculating the importance of the sentences based on the similarity between sentences and ranking them;

and extracting a specified number of key sentences based on the importance ranking of the sentences.

2. The method for generating Chinese questions based on domain terms and key sentences according to claim 1, wherein the dependency syntax structure is built by any one of the following:

the Stanford dependency parser, the neural-network-based dependency parser in the HanLP toolkit, or a beam-search dependency parser based on the ArcEager transition system.

3. The method of claim 1, wherein the dependency syntax rules are:

(dep)?+(amod|nn)*+(nsubj|dobj);

wherein ? indicates zero or one occurrence, * indicates one or more occurrences, dep denotes a dependency relation, amod an adjectival modifier, nn a noun compound modifier, nsubj a nominal subject, and dobj a direct object.

4. The method of claim 1, wherein the evaluating and ranking the generated candidate domain terms comprises:

filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, a candidate being deleted when its part of speech satisfies any one of the following rules:

a. the term ends with a numeral, preposition, conjunction or localizer;

b. the term is not a noun;

c. the term contains a delimiter or symbol;

filtering the candidate domain terms based on grammatical filtering rules, a candidate being retained when it satisfies either of the following patterns:

d. noun + noun;

e. adjective or noun + noun;

calculating a score for each filtered candidate domain term:

wherein s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the total number of candidate domain terms with frequency 1, C is the total number of all extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter;

the candidate domain terms are ranked by score from largest to smallest.

5. The method of claim 1, wherein the TF-IDF value of a word in the input document is calculated as:

word_p = (c_n / N) × log(m / e_p)

wherein word_p represents the TF-IDF value of the word p, c_n is the number of times the word p appears in the document, N is the total number of words in the document, m represents the number of sentences in the document, and e_p represents the number of sentences containing the word p.

6. The method of claim 5, wherein the similarity between sentences in the document is calculated based on the TF-IDF values as:

w_ij = Σ_p (word_ip × word_jp) / ( sqrt(Σ_p word_ip²) × sqrt(Σ_p word_jp²) )

wherein w_ij represents the similarity between sentence S_i and sentence S_j, word_ip represents the TF-IDF value of the word p in sentence S_i, and word_jp represents the TF-IDF value of the word p in sentence S_j.

7. The method of claim 6, wherein calculating the importance of the sentences based on the similarity between sentences and ranking them comprises:

representing each sentence as a node, the bidirectional full connections between sentences forming a graph;

iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:

WS(S_i) = (1 − d) + d × Σ_{S_j ∈ In(S_i)} [ w_ij / Σ_{S_k ∈ Out(S_j)} w_jk ] × WS(S_j)

wherein WS(S_i) represents the importance of sentence S_i, d is a preset damping coefficient, In(S_i) denotes all nodes pointing to node S_i, and Out(S_j) denotes all nodes pointed to by node S_j;

the converged importance values of the sentences are ranked from largest to smallest.

8. The method for generating Chinese questions based on domain terms and key sentences according to claim 1, wherein generating multiple types of questions based on the extracted domain terms and key sentences includes generating Chinese multiple-choice question stems, generating Chinese fill-in-the-blank question stems and generating Chinese question-and-answer question stems;

generating a Chinese multiple-choice question stem comprises the following steps:

obtaining the extracted key sentence list, matching within it using the domain terms in the domain term library as key information, selecting sentences that contain a term serving as the subject or object as question stems, and taking the corresponding domain term as the correct option of the multiple-choice question;

generating multiple-choice distractors based on at least one of the following strategies:

segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as distractors;

selecting domain terms sharing an affix with the correct option as distractors;

obtaining word vectors for the domain terms by training a word2vec model, and selecting distractors based on the cosine similarity between domain-term word vectors;

selecting domain terms from the domain term library whose frequency of occurrence in the document is similar to that of the correct option as distractors;

generating a Chinese fill-in-the-blank question stem comprises the following steps:

matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences that contain a term serving as the subject or object as question stems, and replacing the corresponding domain term with a horizontal line to produce the Chinese fill-in-the-blank stem;

generating a Chinese question-and-answer question stem comprises the following steps:

generating a term-explanation question stem when a key sentence contains a domain term together with at least one of the following definitional cue words: "refers to", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", "is used for" and "is called";

generating a factual question stem when a key sentence contains at least one of the following causal connectives: "because", "so" and "therefore", by replacing the part of the sentence expressing the cause with a question word.

9. A Chinese question generation device based on domain terms and key sentences, characterized by comprising:

the extraction module is used for extracting the domain terms and the key sentences in the document based on the dependency syntax analysis;

and,

the generating module is used for generating multi-type questions based on the extracted domain terms and the key sentences;

the extraction module comprises a first extraction module and a second extraction module;

the first extraction module being configured to,

establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules;

evaluating and ranking the generated candidate domain terms;

extracting a specified number of domain terms based on the ranking result;

the second extraction module being configured to,

calculating TF-IDF values of words in the input document;

calculating similarity between sentences in the document based on the TF-IDF value;

calculating the importance of the sentences based on the similarity between sentences and ranking them;

and extracting a specified number of key sentences based on the importance ranking of the sentences.

10. The apparatus of claim 9, wherein the first extraction module is specifically configured to,

filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, a candidate being deleted when its part of speech satisfies any one of the following rules:

a. the term ends with a numeral, preposition, conjunction or localizer;

b. the term is not a noun;

c. the term contains a delimiter or symbol;

filtering the candidate domain terms based on grammatical filtering rules, a candidate being retained when it satisfies either of the following patterns:

d. noun + noun;

e. adjective or noun + noun;

calculating a score for each filtered candidate domain term:

wherein s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the total number of candidate domain terms with frequency 1, C is the total number of all extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter;

the candidate domain terms are ranked by score from largest to smallest.

11. The apparatus of claim 9, wherein the second extraction module is specifically configured to,

the TF-IDF value of a word in the input document is calculated as:

word_p = (c_n / N) × log(m / e_p)

wherein word_p represents the TF-IDF value of the word p, c_n is the number of times the word p appears in the document, N is the total number of words in the document, m represents the number of sentences in the document, and e_p represents the number of sentences containing the word p.

12. The apparatus of claim 11, wherein the second extraction module is specifically configured to,

representing each sentence as a node, the bidirectional full connections between sentences forming a graph;

iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:

WS(S_i) = (1 − d) + d × Σ_{S_j ∈ In(S_i)} [ w_ij / Σ_{S_k ∈ Out(S_j)} w_jk ] × WS(S_j)

wherein WS(S_i) represents the importance of sentence S_i, d is a preset damping coefficient, In(S_i) denotes all nodes pointing to node S_i, Out(S_j) denotes all nodes pointed to by node S_j, w_ij represents the similarity between sentence S_i and sentence S_j, and w_jk represents the similarity between sentence S_j and sentence S_k;

the converged importance values of the sentences are ranked from largest to smallest.

13. The device for generating Chinese questions based on domain terms and key sentences according to claim 9, wherein the generating module comprises a first generating module, a second generating module and a third generating module;

the first generating module being configured to,

obtain the extracted key sentence list, match within it using the domain terms in the domain term library as key information, select sentences that contain a term serving as the subject or object as question stems, take the corresponding domain term as the correct option of the multiple-choice question,

and generate multiple-choice distractors based on at least one of the following strategies:

segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as distractors;

selecting domain terms sharing an affix with the correct option as distractors;

obtaining word vectors for the domain terms by training a word2vec model, and selecting distractors based on the cosine similarity between domain-term word vectors;

selecting domain terms from the domain term library whose frequency of occurrence in the document is similar to that of the correct option as distractors;

the second generating module being configured to,

match in the key sentence library using the domain terms in the domain term library as key information, select sentences that contain a term serving as the subject or object as question stems, and replace the corresponding domain term with a horizontal line to produce the Chinese fill-in-the-blank stem;

the third generating module being configured to,

generate a term-explanation question stem when a key sentence contains a domain term together with at least one of the following definitional cue words: "refers to", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", "is used for" and "is called";

and generate a factual question stem when a key sentence contains at least one of the following causal connectives: "because", "so" and "therefore", by replacing the part of the sentence expressing the cause with a question word.

Technical Field

The invention belongs to the technical field of information extraction, and particularly relates to a Chinese question automatic generation method based on domain terms and key sentences.

Background

In recent years, knowledge assessment and performance assessment have become crucial to educational institutions and enterprises, and assessment in the form of questionnaires of test questions is an effective evaluation strategy. However, conventional manual question writing requires a great deal of manpower and time, and research on automatic question generation aims to change this situation. Automatic question generation technology uses information technology to screen and extract important knowledge from documents and automatically generate questions, replacing the traditional mode of manually writing questions for a test question bank.

Existing solutions include the automatic generation of Chinese factual questions using grammar rule templates, proposed by Liu et al. in 2016 (Liu M, Rus V, Liu L. Automatic Chinese factual question generation [J]. IEEE Transactions on Learning Technologies, 2016, 10(2): 1-1.), and question generation using relative pronouns and adverbs (Khullar P, Rachna K, Hase M, et al. Automatic question generation using relative pronouns and adverbs [C]// Proceedings of ACL 2018, Student Research Workshop. 2018: 153-.).

However, these automatic question generation methods all pose questions by selecting nouns from sentences with question templates. Unlike English, Chinese has no natural delimiters between words, so documents in specific domains often suffer from word segmentation errors that degrade question generation. Traditional question generation techniques split apart domain vocabulary, for example in the human resource strategy domain, which not only lowers the quality of the generated questions but also fails to fully examine domain knowledge points. Yet generating questions about such domain knowledge is more valuable for employee assessment and student learning.

Disclosure of Invention

The invention aims to provide a Chinese question automatic generation method and device based on domain terms and key sentences, in which the extracted domain terms and key sentences can greatly improve the importance of the generated questions.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

the invention provides a Chinese problem generation method based on domain terms and key sentences, which comprises the following steps:

extracting domain terms and key sentences in the document based on dependency syntax analysis;

generating a plurality of types of questions based on the extracted domain terms and the key sentences;

wherein extracting domain terms in the document based on dependency parsing includes:

establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules;

evaluating and ranking the generated candidate domain terms;

extracting a specified number of domain terms based on the ranking result;

extracting key sentences in the document based on the dependency syntax analysis comprises the following steps:

calculating TF-IDF values of words in the input document;

calculating similarity between sentences in the document based on the TF-IDF values;

calculating the importance of the sentences based on the similarity between sentences and ranking them;

and extracting a specified number of key sentences based on the importance ranking of the sentences.

Further, the dependency syntax structure is built in any one of the following ways:

the Stanford dependency parser, the neural-network-based dependency parser in the HanLP toolkit, or a beam-search dependency parser based on the ArcEager transition system.

Further, the dependency syntax rule is:

(dep)?+(amod|nn)*+(nsubj|dobj);

wherein ? indicates zero or one occurrence, * indicates one or more occurrences, dep denotes a dependency relation, amod an adjectival modifier, nn a noun compound modifier, nsubj a nominal subject, and dobj a direct object.

Further, evaluating and ranking the generated candidate domain terms includes:

filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, a candidate being deleted when its part of speech satisfies any one of the following rules:

a. the term ends with a numeral, preposition, conjunction or localizer;

b. the term is not a noun;

c. the term contains a delimiter or symbol;

filtering the candidate domain terms based on grammatical filtering rules, a candidate being retained when it satisfies either of the following patterns:

d. noun + noun;

e. adjective or noun + noun;

calculating a score for each filtered candidate domain term:

wherein s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the total number of candidate domain terms with frequency 1, C is the total number of all extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter;

the candidate domain terms are ranked by score from largest to smallest.

Further, the TF-IDF value of a word in the input document is calculated as:

word_p = (c_n / N) × log(m / e_p)

wherein word_p represents the TF-IDF value of the word p, c_n is the number of times the word p appears in the document, N is the total number of words in the document, m represents the number of sentences in the document, and e_p represents the number of sentences containing the word p.
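As a non-authoritative sketch, this TF-IDF computation can be implemented as follows, assuming pre-tokenized sentences and a natural logarithm (the log base is not fixed by the text; the `tf_idf` helper and the toy document are illustrative):

```python
import math

def tf_idf(sentences):
    """Compute word_p = (c_n / N) * log(m / e_p) for every word p.

    sentences: list of tokenized sentences (lists of words) forming the document.
    Returns a dict mapping each word to its TF-IDF value.
    """
    words = [w for s in sentences for w in s]
    N = len(words)                       # total number of words in the document
    m = len(sentences)                   # number of sentences in the document
    scores = {}
    for p in set(words):
        c_n = words.count(p)             # occurrences of word p in the document
        e_p = sum(1 for s in sentences if p in s)  # sentences containing p
        scores[p] = (c_n / N) * math.log(m / e_p)
    return scores

doc = [["domain", "term", "extraction"],
       ["key", "sentence", "extraction"],
       ["question", "generation"]]
scores = tf_idf(doc)
```

Note that a word appearing in every sentence gets an IDF of log(1) = 0, so it contributes nothing to the sentence representation.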

Further, the similarity between sentences in the document is calculated based on the TF-IDF values as:

w_ij = Σ_p (word_ip × word_jp) / ( sqrt(Σ_p word_ip²) × sqrt(Σ_p word_jp²) )

wherein w_ij represents the similarity between sentence S_i and sentence S_j, word_ip represents the TF-IDF value of the word p in sentence S_i, and word_jp represents the TF-IDF value of the word p in sentence S_j.
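This similarity can be sketched as cosine similarity over the TF-IDF sentence vectors — an assumption, since the original formula is published as an image; the `sentence_similarity` helper and the toy vectors are illustrative:

```python
import math

def sentence_similarity(vec_i, vec_j):
    """Cosine similarity (w_ij) between two sentences, each represented as a
    {word: TF-IDF value} dictionary."""
    shared = set(vec_i) & set(vec_j)
    dot = sum(vec_i[p] * vec_j[p] for p in shared)
    norm_i = math.sqrt(sum(v * v for v in vec_i.values()))
    norm_j = math.sqrt(sum(v * v for v in vec_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

s1 = {"domain": 0.5, "term": 0.3}
s2 = {"domain": 0.5, "sentence": 0.4}
w_12 = sentence_similarity(s1, s2)
```

A sentence compared with itself yields similarity 1, and sentences sharing no words yield 0, which gives the edge weights needed for the sentence graph below.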

Further, calculating the importance of the sentences based on the similarity between sentences and ranking them comprises:

representing each sentence as a node, the bidirectional full connections between sentences forming a graph;

iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:

WS(S_i) = (1 − d) + d × Σ_{S_j ∈ In(S_i)} [ w_ij / Σ_{S_k ∈ Out(S_j)} w_jk ] × WS(S_j)

wherein WS(S_i) represents the importance of sentence S_i, d is a preset damping coefficient, In(S_i) denotes all nodes pointing to node S_i, and Out(S_j) denotes all nodes pointed to by node S_j;

the converged importance values of the sentences are ranked from largest to smallest.

Further, multiple types of questions are generated based on the extracted domain terms and key sentences, including generating Chinese multiple-choice question stems, generating Chinese fill-in-the-blank question stems and generating Chinese question-and-answer question stems;

generating a Chinese multiple-choice question stem comprises the following steps:

obtaining the extracted key sentence list, matching within it using the domain terms in the domain term library as key information, selecting sentences that contain a term serving as the subject or object as question stems, and taking the corresponding domain term as the correct option of the multiple-choice question;

generating multiple-choice distractors based on at least one of the following strategies:

segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as distractors;

selecting domain terms sharing an affix with the correct option as distractors;

obtaining word vectors for the domain terms by training a word2vec model, and selecting distractors based on the cosine similarity between domain-term word vectors;

selecting domain terms from the domain term library whose frequency of occurrence in the document is similar to that of the correct option as distractors;
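Two of the four distractor strategies (shared affix, similar document frequency) can be sketched as below; the helper names, the toy power-domain term library and the frequency window are all illustrative assumptions, and the part-of-speech and word2vec strategies are omitted for brevity:

```python
def affix_distractors(correct, term_library):
    """Select domain terms sharing a prefix or suffix character with the
    correct option (a simple reading of 'same affix')."""
    return [t for t in term_library
            if t != correct and (t[0] == correct[0] or t[-1] == correct[-1])]

def frequency_distractors(correct, term_freq, window=2):
    """Select domain terms whose document frequency lies within `window`
    of the correct option's frequency."""
    target = term_freq[correct]
    return [t for t, f in term_freq.items()
            if t != correct and abs(f - target) <= window]

library = ["变压器", "变电站", "发电机", "断路器"]
term_freq = {"变压器": 10, "变电站": 9, "发电机": 3, "断路器": 11}
```

In practice the strategies would be combined, e.g. taking the union of candidates and sampling three distractors per question.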

generating a Chinese fill-in-the-blank question stem comprises the following steps:

matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences that contain a term serving as the subject or object as question stems, and replacing the corresponding domain term with a horizontal line to produce the Chinese fill-in-the-blank stem;
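A minimal sketch of the blank-generation step, omitting the subject/object check (which needs the dependency parse); the helper name and the example sentences are illustrative:

```python
def fill_in_blank_stems(key_sentences, term, blank="____"):
    """From key sentences containing the domain term, produce fill-in-the-blank
    stems by replacing the term with a horizontal line."""
    return [s.replace(term, blank) for s in key_sentences if term in s]

sentences = ["变压器是一种利用电磁感应原理改变交流电压的装置。",
             "本文介绍继电保护的基本要求。"]
stems = fill_in_blank_stems(sentences, "变压器")
```

Sentences that do not contain the term are skipped rather than emitted unchanged.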

generating a Chinese question-and-answer question stem comprises the following steps:

generating a term-explanation question stem when a key sentence contains a domain term together with at least one of the following definitional cue words: "refers to", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", "is used for" and "is called";

generating a factual question stem when a key sentence contains at least one of the following causal connectives: "because", "so" and "therefore", by replacing the part of the sentence expressing the cause with a question word.
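The cue-word dispatch can be sketched as plain substring checks; the Chinese cue lists below are back-translations of the English terms above and are illustrative, not the patent's exact lists:

```python
# Back-translated definitional cue words and causal connectives (assumptions).
DEFINITION_CUES = ["指的是", "是一种", "又称", "定义为", "简称", "也叫", "用于", "称为"]
CAUSAL_CUES = ["因为", "所以", "因此"]

def stem_type(sentence, domain_terms):
    """Decide which question-and-answer stem a key sentence yields:
    'definition' when it holds a definitional cue word plus a domain term,
    'fact' when it holds a causal connective, otherwise None."""
    if (any(c in sentence for c in DEFINITION_CUES)
            and any(t in sentence for t in domain_terms)):
        return "definition"
    if any(c in sentence for c in CAUSAL_CUES):
        return "fact"
    return None
```

The definition check is tested first, so a sentence matching both cue types is treated as a term-explanation stem.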

The invention also provides a Chinese question generating device based on the domain terms and the key sentences, which comprises the following steps:

the extraction module is used for extracting the domain terms and the key sentences in the document based on the dependency syntax analysis;

and,

the generating module is used for generating multi-type questions based on the extracted domain terms and the key sentences;

the extraction module comprises a first extraction module and a second extraction module;

the first extraction module being configured to,

establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules;

evaluating and ranking the generated candidate domain terms;

extracting a specified number of domain terms based on the ranking result;

the second extraction module being configured to,

calculating TF-IDF values of words in the input document;

calculating similarity between sentences in the document based on the TF-IDF value;

calculating the importance of the sentences based on the similarity between sentences and ranking them;

and extracting a specified number of key sentences based on the importance ranking of the sentences.

Further, the first extraction module is specifically configured to,

filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, a candidate being deleted when its part of speech satisfies any one of the following rules:

a. the term ends with a numeral, preposition, conjunction or localizer;

b. the term is not a noun;

c. the term contains a delimiter or symbol;

filtering the candidate domain terms based on grammatical filtering rules, a candidate being retained when it satisfies either of the following patterns:

d. noun + noun;

e. adjective or noun + noun;

calculating a score for each filtered candidate domain term:

wherein s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the total number of candidate domain terms with frequency 1, C is the total number of all extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter;

the candidate domain terms are ranked by score from largest to smallest.

Further, the second extraction module is specifically configured to,

the TF-IDF value of a word in the input document is calculated as follows:

word_p = (c_n / N) × log(m / e_p)

wherein word_p represents the TF-IDF value of the word p, c_n is the number of times the word p appears in the document, N is the total number of words in the document, m represents the number of sentences in the document, and e_p represents the number of sentences containing the word p.

Further, the second extraction module is specifically configured to,

representing each sentence as a node, the bidirectional full connections between sentences forming a graph;

iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:

WS(S_i) = (1 − d) + d × Σ_{S_j ∈ In(S_i)} [ w_ij / Σ_{S_k ∈ Out(S_j)} w_jk ] × WS(S_j)

wherein WS(S_i) represents the importance of sentence S_i, d is a preset damping coefficient, In(S_i) denotes all nodes pointing to node S_i, Out(S_j) denotes all nodes pointed to by node S_j, w_ij represents the similarity between sentence S_i and sentence S_j, and w_jk represents the similarity between sentence S_j and sentence S_k;

the converged importance values of the sentences are ranked from largest to smallest.

Further, the generating module comprises a first generating module, a second generating module and a third generating module;

the first generating module being configured to,

obtain the extracted key sentence list, match within it using the domain terms in the domain term library as key information, select sentences that contain a term serving as the subject or object as question stems, take the corresponding domain term as the correct option of the multiple-choice question,

and generate multiple-choice distractors based on at least one of the following strategies:

segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as distractors;

selecting domain terms sharing an affix with the correct option as distractors;

obtaining word vectors for the domain terms by training a word2vec model, and selecting distractors based on the cosine similarity between domain-term word vectors;

selecting domain terms from the domain term library whose frequency of occurrence in the document is similar to that of the correct option as distractors;

the second generating module being configured to,

match in the key sentence library using the domain terms in the domain term library as key information, select sentences that contain a term serving as the subject or object as question stems, and replace the corresponding domain term with a horizontal line to produce the Chinese fill-in-the-blank stem;

the third generating module being configured to,

generate a term-explanation question stem when a key sentence contains a domain term together with at least one of the following definitional cue words: "refers to", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", "is used for" and "is called";

and generate a factual question stem when a key sentence contains at least one of the following causal connectives: "because", "so" and "therefore", by replacing the part of the sentence expressing the cause with a question word.

The invention achieves the following beneficial effects:

the method and the device extract the key sentences and the domain terms based on the dependency syntax information and realize automatic generation of multiple question types based on the domain terms. The core algorithm of the invention has good expandability and can be completely applied to the automatic generation of problems in specific fields; the field terms and key sentences extracted by the extraction method can greatly improve the importance of problem generation, and have wide application prospect.

Drawings

FIG. 1 is a flow chart of the method for automatically generating Chinese questions based on domain terms and key sentences according to the present invention;

FIG. 2 is a flow diagram illustrating the extraction of domain terms based on dependency parsing in one embodiment of the invention;

FIG. 3 is a flow diagram illustrating evaluation of candidate domain terms in one embodiment of the present invention;

FIG. 4 is a schematic flow chart of extracting key sentences based on the T-TextRank algorithm according to an embodiment of the present invention;

FIG. 5 is a flow diagram for automatically generating a multi-type Chinese problem based on extracted domain terms and key sentences in an embodiment of the present invention.

Detailed Description

The invention is further described below. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

The invention provides a Chinese question generation method based on domain terms and key sentences, which extracts key sentences and domain terms based on dependency syntax information and automatically generates multiple question types from them.

Dependency Parsing (DP), one of the key technologies of natural language processing, aims to reveal the syntactic structure of a sentence and determine the dependency relations between its words by analyzing the dependencies between language units. The dependency parser may be any tool capable of obtaining word-to-word dependencies, for example the Stanford Parser, the neural-network-based dependency parser in the HanLP toolkit, or a beam-search dependency parser based on the ArcEager transition system.

The invention discloses a Chinese question generation method based on domain terms and key sentences, which comprises the following steps:

extracting domain terms and key sentences based on dependency syntax analysis;

and generating multi-type topics based on the extracted domain terms and the key sentences.

One embodiment of the present invention takes the neural-network-based dependency parser in the Hanlp toolkit as an example to perform Chinese question generation based on domain terms and key sentences; the specific implementation process is shown in fig. 1 and includes:

at step S1, domain terms are extracted based on the dependency parsing.

In particular, as shown in figure 2,

step S101, building a dependency syntax structure on an input sentence based on the neural network-based dependency syntax analyzer in the Hanlp toolkit.

Step S102, generating candidate domain terms according to the following dependency syntax rule; Table 1 interprets the dependency relations used.

(dep)?+(amod|nn)*+(nsubj|dobj)

Table 1 Dependency relation interpretation

Symbol	Interpretation
?	Zero or one occurrence
*	Zero or more occurrences
dep	Unspecified dependency relation
amod	Adjectival modifier
nn	Noun compound modifier
nsubj	Nominal subject
dobj	Direct object
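As an illustrative sketch (not part of the patent), the rule above can be applied to the per-word dependency labels produced by any of the parsers mentioned earlier. The function name `candidate_spans` and the list-of-labels input format are assumptions, and `?`/`*` follow standard regex semantics (zero-or-one / zero-or-more):

```python
def candidate_spans(labels):
    """Return (start, end) word-index spans whose dependency labels match
    the rule (dep)? (amod|nn)* (nsubj|dobj)."""
    spans = []
    n = len(labels)
    for start in range(n):
        i = start
        if i < n and labels[i] == "dep":                 # optional (dep)? prefix
            i += 1
        while i < n and labels[i] in ("amod", "nn"):     # (amod|nn)* modifiers
            i += 1
        if i < n and labels[i] in ("nsubj", "dobj"):     # must end at a subject/object
            spans.append((start, i + 1))
    return spans
```

Overlapping spans are all kept as candidates here; the filtering and scoring steps below decide which survive.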

Step S103, evaluating and ranking the candidate domain terms by rules, word frequency, and affixes.

In particular, as shown with reference to FIG. 3,

step S103-1, filtering non-terms in the candidate domain terms based on the part-of-speech filtering rules.

The segmentation result and the part-of-speech tag of each word are obtained from the dependency syntax structure, and candidate words that cannot be terms are filtered by checking their parts of speech; for example, words containing personal pronouns are deleted. The part-of-speech filtering rules for non-terms are shown in Table 2; a candidate domain term is deleted when its part of speech satisfies any one of the rules.

TABLE 2 Part-of-speech filtering rules for non-terms

Rule No.	Rule description
1	The word ends with a numeral, preposition, conjunction, or locative word
2	The candidate is not a noun
3	The candidate contains delimiters or symbols

Step S103-2, filtering the candidate domain terms based on the grammar filtering rules. The grammar filtering rules are shown in Table 3; a candidate domain term is retained when it satisfies any one of the rules.

TABLE 3 Grammar filtering rules

Rule No.	Rule description
1	Noun + noun
2	(Adjective or noun) + noun
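A minimal sketch of the two filtering stages (steps S103-1 and S103-2). The coarse part-of-speech tags ("n" noun, "a" adjective, "m" numeral, "p" preposition, "c" conjunction, "f" locative) and the representation of a candidate as a list of (word, tag) pairs are illustrative assumptions:

```python
BAD_ENDINGS = {"m", "p", "c", "f"}        # numeral / preposition / conjunction / locative
SYMBOLS = set("，。、；：,.;:()（）")        # delimiters and symbols for rule 3

def passes_pos_filter(candidate):
    """Delete a candidate when any non-term rule of Table 2 fires."""
    words, tags = zip(*candidate)
    # Rule 1: ends with a numeral, preposition, conjunction, or locative word.
    if tags[-1] in BAD_ENDINGS:
        return False
    # Rule 2 (interpreted here as: the head word is not a noun).
    if tags[-1] != "n":
        return False
    # Rule 3: contains delimiters or symbols.
    if any(ch in SYMBOLS for w in words for ch in w):
        return False
    return True

def passes_grammar_filter(candidate):
    """Keep a candidate only if a Table 3 pattern matches:
    noun + noun, or (adjective or noun) + noun."""
    tags = [t for _, t in candidate]
    return len(tags) >= 2 and tags[-1] == "n" and all(t in ("n", "a") for t in tags[:-1])
```

Note that Table 3's rule 1 (noun + noun) is a special case of rule 2, so one check covers both.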

In step S103-3, the score of the candidate domain term is calculated by the multi-factor evaluator.

The multi-factor evaluator computes the score as shown in formula (1), taking into account both the statistical word frequency and the prefix/suffix of the phrase, which yields three cases: the candidate contains a hot-term affix, contains a non-term affix, or contains neither.

In formula (1), f_word is the frequency of the current word (candidate domain term), f^(i) is the product of the number of words with frequency i and the frequency i, C_1 is the total number of words with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter (set to 2 by experiment).

Step S104, sorting the candidate domain terms by score in descending order and extracting the specified number of domain terms.
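Step S104 amounts to a simple descending sort; the sketch below assumes a mapping from candidate term to its multi-factor score:

```python
def top_k_terms(scored, k):
    """Sort candidates by multi-factor score, descending, and keep the top k.
    `scored` maps candidate term -> score."""
    return [t for t, _ in sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:k]]
```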

And step S2, extracting key sentences based on the T-TextRank algorithm.

In particular, as shown with reference to FIG. 4,

Step S201, preprocessing the input document, including sentence segmentation, word segmentation, and stop-word removal.

In step S202, TF-IDF (Term Frequency-Inverse Document Frequency) calculation of a word is performed, and a sentence is represented by a feature vector.

Assume the Chinese document is D and contains m sentences, so that D = {S_1, S_2, …, S_m}. Each sentence can be represented by a feature word vector:

S_i = {word_i1, word_i2, …, word_iN},

where N is the number of words in the whole document and word_in is the TF-IDF value of word n in sentence S_i.

The TF-IDF calculation is shown in formula (2):

word_in = (c_n / N) × log(m / e_n),    (2)

where c_n is the number of times word n appears in the document, N is the total number of words in the document, and e_n is the number of sentences containing word n.

In step S203, the sentence similarity is calculated using cosine similarity, as shown in formula (3):

w_ij = (S_i · S_j) / (‖S_i‖ ‖S_j‖).    (3)
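Steps S202 and S203 can be sketched directly from the formula-(2) and formula-(3) definitions; the function names and the tokenized-sentence input format are assumptions:

```python
import math
from collections import Counter

def sentence_vectors(sentences):
    """sentences: list of token lists (one per sentence, stop words removed).
    Returns one TF-IDF vector per sentence, following formula (2):
    tf = c_n / N (N = total words in the document), idf = log(m / e_n)."""
    m = len(sentences)
    N = sum(len(s) for s in sentences)
    vocab = sorted({w for s in sentences for w in s})
    df = Counter(w for s in sentences for w in set(s))   # e_n: sentences containing word n
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append([tf[w] / N * math.log(m / df[w]) for w in vocab])
    return vectors

def cosine(u, v):
    """Formula (3): cosine similarity of two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```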

Step S204, ranking sentence importance with the T-TextRank algorithm, calculated as shown in formula (4). Each sentence is represented as a node S, and the sentences of the document are fully connected bidirectionally to form a graph. The initial value of each node weight WS is 1/m, and the initial weight of each edge is the sentence similarity w_ij:

WS(S_i) = (1 − d) + d · Σ_{S_j ∈ In(S_i)} [ w_ji / Σ_{S_k ∈ Out(S_j)} w_jk ] · WS(S_j),    (4)

where d is the damping coefficient, typically 0.85, In(S_i) denotes all nodes pointing to node S_i, and Out(S_j) denotes all nodes that node S_j points to. Formula (4) converges after several iterations.
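A sketch of the formula-(4) iteration on a fully connected similarity graph; the function signature, convergence tolerance, and iteration cap are assumptions, and the similarity matrix is taken as given:

```python
def textrank(sim, d=0.85, tol=1e-6, max_iter=200):
    """Iterate formula (4). sim[i][j] is the similarity between sentences i and j."""
    m = len(sim)
    ws = [1.0 / m] * m                        # initial node weight 1/m
    for _ in range(max_iter):
        new = []
        for i in range(m):
            s = 0.0
            for j in range(m):                # all other nodes point to i (full graph)
                if j == i:
                    continue
                out_j = sum(sim[j][k] for k in range(m) if k != j)
                if out_j > 0:
                    s += sim[j][i] / out_j * ws[j]
            new.append((1 - d) + d * s)
        converged = max(abs(a - b) for a, b in zip(new, ws)) < tol
        ws = new
        if converged:
            break
    return ws
```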

Step S205, sorting sentences by the converged WS values of the T-TextRank algorithm in descending order and extracting the specified number of key sentences.

Note that in the present invention the input of step S1 is processed in units of sentences, while the input of step S2 is the whole text.

In step S3, multi-type Chinese questions are automatically generated based on the extracted domain terms and key sentences.

Specifically, as shown in fig. 5, three types of Chinese questions are generated based on the extracted domain terms and key sentences.

S301, generating Chinese choice questions based on the domain terms and the key sentences. Specifically,

step S301-1, generating Chinese choice question stem. First, an extracted key sentence list is obtained, matching is performed in the key sentence list using a domain term in a domain term base as key information, a sentence containing a term and serving as a subject or an object component is selected as a stem, and a corresponding domain term portion is used for question making.

Step S301-2, generating choice-question interference items (distractors) by combining different linguistic features. The generation strategies are shown in Table 4.

TABLE 4 Choice-question interference item generation strategies

Strategy No.	Strategy description
1	Segment the domain terms and select terms with the same part of speech as the correct option
2	Select domain terms sharing an affix with the correct option
3	Train a word2vec model and select terms by cosine similarity of word vectors to the correct option
4	Select terms whose occurrence frequency in the document is similar to that of the correct option

S302, generating the Chinese fill-in-the-blank question stem based on the domain terms and the key sentences.

Specifically, the domain terms in the domain term library are used as key information to match in the key sentence library; sentences that contain a term acting as a subject or object component are selected as question stems, and the corresponding domain term is replaced with an underline.

S303, generating Chinese question and answer question stems based on the domain terms and the key sentences.

When a key sentence contains a domain term together with at least one of the following definitional feature words: "means", "is", "is a class of", "also known as", "is defined as", "abbreviated as", "that is", "is used for", or "is called", a noun-explanation question stem is generated.

When a key sentence contains one of the causal connectives "because", "so", or "therefore", the part of the sentence expressing the cause is replaced with a question word, generating a fact question stem.
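An illustrative sketch of step S303 for Chinese input; the cue lists are a subset of the feature words named above, and the question templates ("什么是…？", "为什么…？") are assumptions, not the patent's exact wording:

```python
DEFINITION_CUES = ("是指", "定义为", "简称", "又称", "称为")   # definitional feature words

def make_qa_stem(sentence, terms):
    """Generate a noun-explanation stem from a definitional key sentence,
    or a fact stem from a causal one; return None if neither cue matches."""
    for cue in DEFINITION_CUES:
        if cue in sentence:
            for t in terms:
                if t in sentence:
                    return f"什么是{t}？"          # "What is <term>?"
    # Causal sentences: ask "why" about the result clause.
    for cue in ("所以", "因此"):                    # "so" / "therefore"
        if cue in sentence:
            return "为什么" + sentence.split(cue, 1)[1].strip("，。 ") + "？"
    if "因为" in sentence:                          # "because": result precedes the cue
        return "为什么" + sentence.split("因为", 1)[0].strip("，。 ") + "？"
    return None
```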

Another embodiment of the present invention provides a chinese question generating apparatus based on domain terms and key sentences, including:

the extraction module is used for extracting the domain terms and the key sentences in the document based on the dependency syntax analysis;

and,

the generating module is used for generating multi-type questions based on the extracted domain terms and the key sentences;

the extraction module comprises a first extraction module and a second extraction module;

specifically, the first extraction module is used for,

establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules;

evaluating and ranking the generated candidate domain terms;

extracting a specified number of domain terms based on the sorting result;

the second extraction module is used for extracting the first extraction module,

calculating TF-IDF values of words in the input document;

calculating similarity between sentences in the document based on the TF-IDF value;

calculating the importance of the sentences based on the similarity between the sentences and sequencing;

and extracting a specified number of key sentences based on the importance sorting result of the sentences.

In an embodiment of the present invention, the first extraction module is specifically configured to,

filtering non-terms from the candidate domain terms based on the part-of-speech filtering rules, and deleting a candidate domain term when its part of speech satisfies any one of the following rules:

a. the word ends with a numeral, preposition, conjunction, or locative word;

b. a non-noun;

c. containing delimiters or symbols;

filtering the candidate domain terms based on the grammatical filtering rules, and retaining when the candidate domain terms satisfy any one of the following rules:

d. noun + noun;

e. (adjective or noun) + noun;

calculating a score for the filtered candidate domain terms:

wherein s represents the score, f_word is the frequency of the current candidate domain term, f^(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the total number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter;

the scores for the candidate domain terms are ranked from large to small.

In the embodiment of the present invention, the second extraction module is specifically configured to,

the TF-IDF value of the word in the input document is calculated as follows:

wherein, wordpTF-IDF value, c, representing the word pnIs the number of times a word p appears in the document, N is the total number of words in the document, m represents the number of sentences in the document, epRepresenting the number of sentences containing the word p.

In the embodiment of the present invention, the second extraction module is specifically configured to,

sentences are expressed as nodes, and the two-way full connection between the sentences forms a graph;

the importance of the sentence is iteratively calculated using the T-TextRank algorithm as follows until convergence:

wherein WS (S)i) Representing a sentence SiD is the set damping coefficient, In (S)i) Indicating a pointing node SiAll nodes of, Out (S)j) Represents a node SjAll nodes pointed to, wijRepresenting a sentence SiAnd sentence SjSimilarity of (2), wjkRepresenting a sentence SjAnd sentence SkThe similarity of (2);

in the embodiment of the invention, the generation module comprises a first generation module, a second generation module and a third generation module;

wherein the first generation module is used for generating,

obtaining the extracted key sentence list, matching in the key sentence list using the domain terms in the domain term library as key information, selecting sentences that contain a term acting as a subject or object component as question stems, and taking the corresponding domain term content as the correct option of the choice question;

generating a choice topic interference term based on at least one of the following strategies:

segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as interference items;

selecting domain terms sharing an affix with the correct option as interference items;

obtaining word vectors of the domain terms by training a word2vec model, and selecting domain terms as interference items based on the cosine similarity of their word vectors to the correct option;

selecting domain terms from the domain term library whose occurrence frequency in the document is similar to that of the correct option as interference items;
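One of the Table-4 style strategies above (affix match, with a frequency-similarity fallback) can be sketched as follows; the function name and the tie-breaking order are assumptions:

```python
def distractors(answer, term_bank, freq, k=3):
    """Prefer terms sharing an affix with the correct answer (same first or
    last character), then fall back to terms of similar document frequency.
    `freq` maps term -> occurrence count in the document."""
    pool = [t for t in term_bank if t != answer]
    same_affix = [t for t in pool if t[0] == answer[0] or t[-1] == answer[-1]]
    rest = sorted((t for t in pool if t not in same_affix),
                  key=lambda t: abs(freq.get(t, 0) - freq.get(answer, 0)))
    return (same_affix + rest)[:k]
```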

the second generating module is used for generating,

matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences that contain a term acting as a subject or object component as question stems, and replacing the corresponding domain term content with an underline to generate the Chinese fill-in-the-blank question stem;

the third generation module is used for generating,

when the key sentence contains at least one of the following feature words: "means", "is", "is a class of", "also known as", "is defined as", "abbreviated as", "that is", "is used for", or "is called", and also contains a domain term, generating the noun-explanation question stem;

when the key sentence contains at least one of the following causal connectives: "because", "so", or "therefore", replacing the content of the sentence expressing the cause with a question word, generating the fact question stem.

It is to be noted that the apparatus embodiment corresponds to the method embodiment, and the implementation manners of the method embodiment are all applicable to the apparatus embodiment and can achieve the same or similar technical effects, so that the details are not described herein.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
