Chinese question automatic generation method and device based on domain terms and key sentences

Document No.: 1889937    Publication date: 2021-11-26

Reading note: This technique, "A Chinese question automatic generation method and device based on domain terms and key sentences", was created by 赵军, 董勤伟, 查显光, 吴俊�, 赵新冬, 戴威 and 于聪聪 on 2021-09-01. Abstract: The invention discloses a Chinese question generation method and device based on domain terms and key sentences. The method comprises: building a dependency syntax structure for the sentences of an input document and generating candidate domain terms according to dependency syntax rules; evaluating and ranking the generated candidate domain terms, and extracting a specified number of domain terms based on the ranking result; representing each sentence of the input document by the TF-IDF values of its words, computing sentence importance with the T-TextRank algorithm, and extracting a specified number of key sentences based on the importance ranking; and finally generating Chinese multiple-choice question stems, Chinese fill-in-the-blank question stems and Chinese question-and-answer question stems from the extracted domain terms and key sentences. The domain terms and key sentences extracted by the method greatly improve the importance of the generated questions, and the method has broad application prospects.

1. A Chinese question generation method based on domain terms and key sentences is characterized by comprising the following steps:

extracting domain terms and key sentences in the document based on dependency syntax analysis;

generating a plurality of types of questions based on the extracted domain terms and the key sentences;

wherein extracting domain terms in the document based on dependency parsing includes:

establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules;

evaluating and ranking the generated candidate domain terms;

extracting a specified number of domain terms based on the ranking result;

extracting key sentences in the document based on the dependency syntax analysis comprises the following steps:

calculating TF-IDF values of words in the input document;

calculating similarity between sentences in the document based on the TF-IDF values;

calculating the importance of the sentences based on the similarity between sentences and ranking them;

and extracting a specified number of key sentences based on the importance ranking of the sentences.

2. The method for generating Chinese questions based on domain terms and key sentences according to claim 1, wherein the dependency syntax structure is built by any one of the following:

the Stanford dependency parser, the neural-network-based dependency parser in the HanLP toolkit, or a beam-search dependency parser based on the ArcEager transition system.

3. The method of claim 1, wherein the dependency syntax rules are:

(dep)?+(amod|nn)*+(nsubj|dobj);

wherein ? indicates zero or one occurrence, * indicates one or more occurrences, dep denotes a dependency relation, amod an adjectival modifier, nn a noun compound modifier, nsubj a nominal subject, and dobj a direct object.

4. The method of claim 1, wherein the evaluating and ranking the generated candidate domain terms comprises:

filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, a candidate being deleted when its part of speech satisfies any one of the following rules:

a. the term ends with a numeral, preposition, conjunction or localizer;

b. the term is not a noun;

c. the term contains a delimiter or symbol;

filtering the candidate domain terms based on grammatical filtering rules, a candidate being retained when it satisfies either of the following patterns:

d. noun + noun;

e. adjective or noun + noun;

calculating a score for each filtered candidate domain term:

wherein s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the total number of candidate domain terms with frequency 1, C is the total number of all extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter;

the candidate domain terms are ranked by score from largest to smallest.

5. The method of claim 1, wherein the TF-IDF value of a word in the input document is calculated as:

word_p = (c_n / N) × log(m / e_p)

wherein word_p represents the TF-IDF value of the word p, c_n is the number of times the word p appears in the document, N is the total number of words in the document, m represents the number of sentences in the document, and e_p represents the number of sentences containing the word p.

6. The method of claim 5, wherein the similarity between sentences in the document is calculated based on the TF-IDF values as:

w_ij = Σ_p (word_ip × word_jp) / ( sqrt(Σ_p word_ip²) × sqrt(Σ_p word_jp²) )

wherein w_ij represents the similarity between sentence S_i and sentence S_j, word_ip represents the TF-IDF value of the word p in sentence S_i, and word_jp represents the TF-IDF value of the word p in sentence S_j.

7. The method of claim 6, wherein calculating the importance of the sentences based on the similarity between sentences and ranking them comprises:

representing each sentence as a node, the bidirectional full connections between sentences forming a graph;

iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:

WS(S_i) = (1 − d) + d × Σ_{S_j ∈ In(S_i)} [ w_ij / Σ_{S_k ∈ Out(S_j)} w_jk ] × WS(S_j)

wherein WS(S_i) represents the importance of sentence S_i, d is a preset damping coefficient, In(S_i) denotes all nodes pointing to node S_i, and Out(S_j) denotes all nodes pointed to by node S_j;

the converged importance values of the sentences are ranked from largest to smallest.

8. The method for generating Chinese questions based on domain terms and key sentences according to claim 1, wherein generating multiple types of questions based on the extracted domain terms and key sentences includes generating Chinese multiple-choice question stems, generating Chinese fill-in-the-blank question stems and generating Chinese question-and-answer question stems;

generating a Chinese multiple-choice question stem comprises the following steps:

obtaining the extracted key sentence list, matching within it using the domain terms in the domain term library as key information, selecting sentences that contain a term serving as the subject or object as question stems, and taking the corresponding domain term as the correct option of the multiple-choice question;

generating multiple-choice distractors based on at least one of the following strategies:

segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as distractors;

selecting domain terms sharing an affix with the correct option as distractors;

obtaining word vectors for the domain terms by training a word2vec model, and selecting distractors based on the cosine similarity between domain-term word vectors;

selecting domain terms from the domain term library whose frequency of occurrence in the document is similar to that of the correct option as distractors;

generating a Chinese fill-in-the-blank question stem comprises the following steps:

matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences that contain a term serving as the subject or object as question stems, and replacing the corresponding domain term with a horizontal line to produce the Chinese fill-in-the-blank stem;

generating a Chinese question-and-answer question stem comprises the following steps:

generating a term-explanation question stem when a key sentence contains a domain term together with at least one of the following definitional cue words: "refers to", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", "is used for" and "is called";

generating a factual question stem when a key sentence contains at least one of the following causal connectives: "because", "so" and "therefore", by replacing the part of the sentence expressing the cause with a question word.

9. A Chinese question generation device based on domain terms and key sentences, characterized by comprising:

the extraction module is used for extracting the domain terms and the key sentences in the document based on the dependency syntax analysis;

and,

the generating module is used for generating multi-type questions based on the extracted domain terms and the key sentences;

the extraction module comprises a first extraction module and a second extraction module;

the first extraction module being configured to,

establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules;

evaluating and ranking the generated candidate domain terms;

extracting a specified number of domain terms based on the ranking result;

the second extraction module being configured to,

calculating TF-IDF values of words in the input document;

calculating similarity between sentences in the document based on the TF-IDF value;

calculating the importance of the sentences based on the similarity between sentences and ranking them;

and extracting a specified number of key sentences based on the importance ranking of the sentences.

10. The apparatus of claim 9, wherein the first extraction module is specifically configured to,

filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, a candidate being deleted when its part of speech satisfies any one of the following rules:

a. the term ends with a numeral, preposition, conjunction or localizer;

b. the term is not a noun;

c. the term contains a delimiter or symbol;

filtering the candidate domain terms based on grammatical filtering rules, a candidate being retained when it satisfies either of the following patterns:

d. noun + noun;

e. adjective or noun + noun;

calculating a score for each filtered candidate domain term:

wherein s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the total number of candidate domain terms with frequency 1, C is the total number of all extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter;

the candidate domain terms are ranked by score from largest to smallest.

11. The apparatus of claim 9, wherein the second extraction module is specifically configured to,

the TF-IDF value of a word in the input document is calculated as:

word_p = (c_n / N) × log(m / e_p)

wherein word_p represents the TF-IDF value of the word p, c_n is the number of times the word p appears in the document, N is the total number of words in the document, m represents the number of sentences in the document, and e_p represents the number of sentences containing the word p.

12. The apparatus of claim 11, wherein the second extraction module is specifically configured to,

representing each sentence as a node, the bidirectional full connections between sentences forming a graph;

iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:

WS(S_i) = (1 − d) + d × Σ_{S_j ∈ In(S_i)} [ w_ij / Σ_{S_k ∈ Out(S_j)} w_jk ] × WS(S_j)

wherein WS(S_i) represents the importance of sentence S_i, d is a preset damping coefficient, In(S_i) denotes all nodes pointing to node S_i, Out(S_j) denotes all nodes pointed to by node S_j, w_ij represents the similarity between sentence S_i and sentence S_j, and w_jk represents the similarity between sentence S_j and sentence S_k;

the converged importance values of the sentences are ranked from largest to smallest.

13. The device for generating Chinese questions based on domain terms and key sentences according to claim 9, wherein the generating module comprises a first generating module, a second generating module and a third generating module;

the first generating module being configured to,

obtain the extracted key sentence list, match within it using the domain terms in the domain term library as key information, select sentences that contain a term serving as the subject or object as question stems, take the corresponding domain term as the correct option of the multiple-choice question,

and generate multiple-choice distractors based on at least one of the following strategies:

segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as distractors;

selecting domain terms sharing an affix with the correct option as distractors;

obtaining word vectors for the domain terms by training a word2vec model, and selecting distractors based on the cosine similarity between domain-term word vectors;

selecting domain terms from the domain term library whose frequency of occurrence in the document is similar to that of the correct option as distractors;

the second generating module being configured to,

match in the key sentence library using the domain terms in the domain term library as key information, select sentences that contain a term serving as the subject or object as question stems, and replace the corresponding domain term with a horizontal line to produce the Chinese fill-in-the-blank stem;

the third generating module being configured to,

generate a term-explanation question stem when a key sentence contains a domain term together with at least one of the following definitional cue words: "refers to", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", "is used for" and "is called";

and generate a factual question stem when a key sentence contains at least one of the following causal connectives: "because", "so" and "therefore", by replacing the part of the sentence expressing the cause with a question word.

Technical Field

The invention belongs to the technical field of information extraction, and particularly relates to a Chinese question automatic generation method based on domain terms and key sentences.

Background

In recent years, knowledge assessment and performance assessment have become crucial to educational institutions and enterprises, and assessment in the form of questionnaires of test questions is an effective evaluation strategy. However, conventional manual question writing requires a great deal of manpower and time, and research on automatic question generation aims to change this situation. Automatic question generation technology uses information technology to screen and extract important knowledge from documents and automatically generate questions, replacing the traditional mode of manually writing questions for a test question bank.

Existing solutions include the automatic generation of Chinese factual questions using grammar rule templates, proposed by Liu et al. in 2016 (Liu M, Rus V, Liu L. Automatic Chinese factual question generation [J]. IEEE Transactions on Learning Technologies, 2016, 10(2): 1-1.), and question generation using relative pronouns and adverbs (Khullar P, Rachna K, Hase M, et al. Automatic question generation using relative pronouns and adverbs [C]// Proceedings of ACL 2018, Student Research Workshop. 2018: 153-.).

However, these automatic question generation methods all pose questions by selecting nouns from sentences with question templates. Unlike English, Chinese has no natural delimiters between words, so documents in specific domains often suffer from word segmentation errors that degrade question generation. Traditional question generation techniques split apart domain vocabulary, for example in the human resource strategy domain, which not only lowers the quality of the generated questions but also fails to fully examine domain knowledge points. Yet generating questions about such domain knowledge is more valuable for employee assessment and student learning.

Disclosure of Invention

The invention aims to provide a Chinese question automatic generation method and device based on domain terms and key sentences, in which the extracted domain terms and key sentences can greatly improve the importance of the generated questions.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

the invention provides a Chinese problem generation method based on domain terms and key sentences, which comprises the following steps:

extracting domain terms and key sentences in the document based on dependency syntax analysis;

generating a plurality of types of questions based on the extracted domain terms and the key sentences;

wherein extracting domain terms in the document based on dependency parsing includes:

establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules;

evaluating and ranking the generated candidate domain terms;

extracting a specified number of domain terms based on the ranking result;

extracting key sentences in the document based on the dependency syntax analysis comprises the following steps:

calculating TF-IDF values of words in the input document;

calculating similarity between sentences in the document based on the TF-IDF values;

calculating the importance of the sentences based on the similarity between sentences and ranking them;

and extracting a specified number of key sentences based on the importance ranking of the sentences.

Further, the dependency syntax structure is built in any one of the following ways:

the Stanford dependency parser, the neural-network-based dependency parser in the HanLP toolkit, or a beam-search dependency parser based on the ArcEager transition system.

Further, the dependency syntax rule is:

(dep)?+(amod|nn)*+(nsubj|dobj);

wherein ? indicates zero or one occurrence, * indicates one or more occurrences, dep denotes a dependency relation, amod an adjectival modifier, nn a noun compound modifier, nsubj a nominal subject, and dobj a direct object.

Further, evaluating and ranking the generated candidate domain terms includes:

filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, a candidate being deleted when its part of speech satisfies any one of the following rules:

a. the term ends with a numeral, preposition, conjunction or localizer;

b. the term is not a noun;

c. the term contains a delimiter or symbol;

filtering the candidate domain terms based on grammatical filtering rules, a candidate being retained when it satisfies either of the following patterns:

d. noun + noun;

e. adjective or noun + noun;

calculating a score for each filtered candidate domain term:

wherein s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the total number of candidate domain terms with frequency 1, C is the total number of all extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter;

the candidate domain terms are ranked by score from largest to smallest.

Further, the TF-IDF value of a word in the input document is calculated as:

word_p = (c_n / N) × log(m / e_p)

wherein word_p represents the TF-IDF value of the word p, c_n is the number of times the word p appears in the document, N is the total number of words in the document, m represents the number of sentences in the document, and e_p represents the number of sentences containing the word p.
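As a non-authoritative sketch, this TF-IDF computation can be implemented as follows, assuming pre-tokenized sentences and a natural logarithm (the log base is not fixed by the text; the `tf_idf` helper and the toy document are illustrative):

```python
import math

def tf_idf(sentences):
    """Compute word_p = (c_n / N) * log(m / e_p) for every word p.

    sentences: list of tokenized sentences (lists of words) forming the document.
    Returns a dict mapping each word to its TF-IDF value.
    """
    words = [w for s in sentences for w in s]
    N = len(words)                       # total number of words in the document
    m = len(sentences)                   # number of sentences in the document
    scores = {}
    for p in set(words):
        c_n = words.count(p)             # occurrences of word p in the document
        e_p = sum(1 for s in sentences if p in s)  # sentences containing p
        scores[p] = (c_n / N) * math.log(m / e_p)
    return scores

doc = [["domain", "term", "extraction"],
       ["key", "sentence", "extraction"],
       ["question", "generation"]]
scores = tf_idf(doc)
```

Note that a word appearing in every sentence gets an IDF of log(1) = 0, so it contributes nothing to the sentence representation.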

Further, the similarity between sentences in the document is calculated based on the TF-IDF values as:

w_ij = Σ_p (word_ip × word_jp) / ( sqrt(Σ_p word_ip²) × sqrt(Σ_p word_jp²) )

wherein w_ij represents the similarity between sentence S_i and sentence S_j, word_ip represents the TF-IDF value of the word p in sentence S_i, and word_jp represents the TF-IDF value of the word p in sentence S_j.
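This similarity can be sketched as cosine similarity over the TF-IDF sentence vectors — an assumption, since the original formula is published as an image; the `sentence_similarity` helper and the toy vectors are illustrative:

```python
import math

def sentence_similarity(vec_i, vec_j):
    """Cosine similarity (w_ij) between two sentences, each represented as a
    {word: TF-IDF value} dictionary."""
    shared = set(vec_i) & set(vec_j)
    dot = sum(vec_i[p] * vec_j[p] for p in shared)
    norm_i = math.sqrt(sum(v * v for v in vec_i.values()))
    norm_j = math.sqrt(sum(v * v for v in vec_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

s1 = {"domain": 0.5, "term": 0.3}
s2 = {"domain": 0.5, "sentence": 0.4}
w_12 = sentence_similarity(s1, s2)
```

A sentence compared with itself yields similarity 1, and sentences sharing no words yield 0, which gives the edge weights needed for the sentence graph below.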

Further, calculating the importance of the sentences based on the similarity between sentences and ranking them comprises:

representing each sentence as a node, the bidirectional full connections between sentences forming a graph;

iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:

WS(S_i) = (1 − d) + d × Σ_{S_j ∈ In(S_i)} [ w_ij / Σ_{S_k ∈ Out(S_j)} w_jk ] × WS(S_j)

wherein WS(S_i) represents the importance of sentence S_i, d is a preset damping coefficient, In(S_i) denotes all nodes pointing to node S_i, and Out(S_j) denotes all nodes pointed to by node S_j;

the converged importance values of the sentences are ranked from largest to smallest.

Further, multiple types of questions are generated based on the extracted domain terms and key sentences, including generating Chinese multiple-choice question stems, generating Chinese fill-in-the-blank question stems and generating Chinese question-and-answer question stems;

generating a Chinese multiple-choice question stem comprises the following steps:

obtaining the extracted key sentence list, matching within it using the domain terms in the domain term library as key information, selecting sentences that contain a term serving as the subject or object as question stems, and taking the corresponding domain term as the correct option of the multiple-choice question;

generating multiple-choice distractors based on at least one of the following strategies:

segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as distractors;

selecting domain terms sharing an affix with the correct option as distractors;

obtaining word vectors for the domain terms by training a word2vec model, and selecting distractors based on the cosine similarity between domain-term word vectors;

selecting domain terms from the domain term library whose frequency of occurrence in the document is similar to that of the correct option as distractors;
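Two of the four distractor strategies (shared affix, similar document frequency) can be sketched as below; the helper names, the toy power-domain term library and the frequency window are all illustrative assumptions, and the part-of-speech and word2vec strategies are omitted for brevity:

```python
def affix_distractors(correct, term_library):
    """Select domain terms sharing a prefix or suffix character with the
    correct option (a simple reading of 'same affix')."""
    return [t for t in term_library
            if t != correct and (t[0] == correct[0] or t[-1] == correct[-1])]

def frequency_distractors(correct, term_freq, window=2):
    """Select domain terms whose document frequency lies within `window`
    of the correct option's frequency."""
    target = term_freq[correct]
    return [t for t, f in term_freq.items()
            if t != correct and abs(f - target) <= window]

library = ["变压器", "变电站", "发电机", "断路器"]
term_freq = {"变压器": 10, "变电站": 9, "发电机": 3, "断路器": 11}
```

In practice the strategies would be combined, e.g. taking the union of candidates and sampling three distractors per question.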

generating a Chinese fill-in-the-blank question stem comprises the following steps:

matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences that contain a term serving as the subject or object as question stems, and replacing the corresponding domain term with a horizontal line to produce the Chinese fill-in-the-blank stem;
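A minimal sketch of the blank-generation step, omitting the subject/object check (which needs the dependency parse); the helper name and the example sentences are illustrative:

```python
def fill_in_blank_stems(key_sentences, term, blank="____"):
    """From key sentences containing the domain term, produce fill-in-the-blank
    stems by replacing the term with a horizontal line."""
    return [s.replace(term, blank) for s in key_sentences if term in s]

sentences = ["变压器是一种利用电磁感应原理改变交流电压的装置。",
             "本文介绍继电保护的基本要求。"]
stems = fill_in_blank_stems(sentences, "变压器")
```

Sentences that do not contain the term are skipped rather than emitted unchanged.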

generating a Chinese question-and-answer question stem comprises the following steps:

generating a term-explanation question stem when a key sentence contains a domain term together with at least one of the following definitional cue words: "refers to", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", "is used for" and "is called";

generating a factual question stem when a key sentence contains at least one of the following causal connectives: "because", "so" and "therefore", by replacing the part of the sentence expressing the cause with a question word.
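The cue-word dispatch can be sketched as plain substring checks; the Chinese cue lists below are back-translations of the English terms above and are illustrative, not the patent's exact lists:

```python
# Back-translated definitional cue words and causal connectives (assumptions).
DEFINITION_CUES = ["指的是", "是一种", "又称", "定义为", "简称", "也叫", "用于", "称为"]
CAUSAL_CUES = ["因为", "所以", "因此"]

def stem_type(sentence, domain_terms):
    """Decide which question-and-answer stem a key sentence yields:
    'definition' when it holds a definitional cue word plus a domain term,
    'fact' when it holds a causal connective, otherwise None."""
    if (any(c in sentence for c in DEFINITION_CUES)
            and any(t in sentence for t in domain_terms)):
        return "definition"
    if any(c in sentence for c in CAUSAL_CUES):
        return "fact"
    return None
```

The definition check is tested first, so a sentence matching both cue types is treated as a term-explanation stem.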

The invention also provides a Chinese question generating device based on the domain terms and the key sentences, which comprises the following steps:

the extraction module is used for extracting the domain terms and the key sentences in the document based on the dependency syntax analysis;

and,

the generating module is used for generating multi-type questions based on the extracted domain terms and the key sentences;

the extraction module comprises a first extraction module and a second extraction module;

the first extraction module being configured to,

establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules;

evaluating and ranking the generated candidate domain terms;

extracting a specified number of domain terms based on the ranking result;

the second extraction module being configured to,

calculating TF-IDF values of words in the input document;

calculating similarity between sentences in the document based on the TF-IDF value;

calculating the importance of the sentences based on the similarity between sentences and ranking them;

and extracting a specified number of key sentences based on the importance ranking of the sentences.

Further, the first extraction module is specifically configured to,

filtering non-terms out of the candidate domain terms based on part-of-speech filtering rules, a candidate being deleted when its part of speech satisfies any one of the following rules:

a. the term ends with a numeral, preposition, conjunction or localizer;

b. the term is not a noun;

c. the term contains a delimiter or symbol;

filtering the candidate domain terms based on grammatical filtering rules, a candidate being retained when it satisfies either of the following patterns:

d. noun + noun;

e. adjective or noun + noun;

calculating a score for each filtered candidate domain term:

wherein s represents the score, f_word is the frequency of the current candidate domain term, f(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the total number of candidate domain terms with frequency 1, C is the total number of all extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter;

the candidate domain terms are ranked by score from largest to smallest.

Further, the second extraction module is specifically configured to,

the TF-IDF value of a word in the input document is calculated as follows:

word_p = (c_n / N) × log(m / e_p)

wherein word_p represents the TF-IDF value of the word p, c_n is the number of times the word p appears in the document, N is the total number of words in the document, m represents the number of sentences in the document, and e_p represents the number of sentences containing the word p.

Further, the second extraction module is specifically configured to,

representing each sentence as a node, the bidirectional full connections between sentences forming a graph;

iteratively calculating the importance of each sentence with the T-TextRank algorithm until convergence:

WS(S_i) = (1 − d) + d × Σ_{S_j ∈ In(S_i)} [ w_ij / Σ_{S_k ∈ Out(S_j)} w_jk ] × WS(S_j)

wherein WS(S_i) represents the importance of sentence S_i, d is a preset damping coefficient, In(S_i) denotes all nodes pointing to node S_i, Out(S_j) denotes all nodes pointed to by node S_j, w_ij represents the similarity between sentence S_i and sentence S_j, and w_jk represents the similarity between sentence S_j and sentence S_k;

the converged importance values of the sentences are ranked from largest to smallest.

Further, the generating module comprises a first generating module, a second generating module and a third generating module;

the first generating module being configured to,

obtain the extracted key sentence list, match within it using the domain terms in the domain term library as key information, select sentences that contain a term serving as the subject or object as question stems, take the corresponding domain term as the correct option of the multiple-choice question,

and generate multiple-choice distractors based on at least one of the following strategies:

segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as distractors;

selecting domain terms sharing an affix with the correct option as distractors;

obtaining word vectors for the domain terms by training a word2vec model, and selecting distractors based on the cosine similarity between domain-term word vectors;

selecting domain terms from the domain term library whose frequency of occurrence in the document is similar to that of the correct option as distractors;

the second generating module being configured to,

match in the key sentence library using the domain terms in the domain term library as key information, select sentences that contain a term serving as the subject or object as question stems, and replace the corresponding domain term with a horizontal line to produce the Chinese fill-in-the-blank stem;

the third generating module being configured to,

generate a term-explanation question stem when a key sentence contains a domain term together with at least one of the following definitional cue words: "refers to", "is a kind of", "is also known as", "is defined as", "is short for", "is also called", "is used for" and "is called";

and generate a factual question stem when a key sentence contains at least one of the following causal connectives: "because", "so" and "therefore", by replacing the part of the sentence expressing the cause with a question word.

The invention achieves the following beneficial effects:

the method and the device extract the key sentences and the domain terms based on the dependency syntax information and realize automatic generation of multiple question types based on the domain terms. The core algorithm of the invention has good expandability and can be completely applied to the automatic generation of problems in specific fields; the field terms and key sentences extracted by the extraction method can greatly improve the importance of problem generation, and have wide application prospect.

Drawings

FIG. 1 is a flow chart of the method for automatically generating Chinese questions based on domain terms and key sentences according to the present invention;

FIG. 2 is a flow diagram illustrating the extraction of domain terms based on dependency parsing in one embodiment of the invention;

FIG. 3 is a flow diagram illustrating evaluation of candidate domain terms in one embodiment of the present invention;

FIG. 4 is a schematic flow chart of extracting key sentences based on the T-TextRank algorithm according to an embodiment of the present invention;

FIG. 5 is a flow diagram for automatically generating a multi-type Chinese problem based on extracted domain terms and key sentences in an embodiment of the present invention.

Detailed Description

The invention is further described below. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

The invention provides a Chinese question generation method based on domain terms and key sentences, which extracts key sentences and domain terms based on dependency syntax information and automatically generates multiple question types from them.

Dependency Parsing (DP), one of the key technologies of natural language processing, aims to reveal the syntactic structure of a sentence and determine the dependency relations between its words by analyzing the dependencies between language units. The dependency parser may be any tool capable of obtaining word-to-word dependencies, for example the Stanford Parser, the neural-network-based dependency parser in the HanLP toolkit, or a beam-search dependency parser based on the ArcEager transition system.

The invention discloses a Chinese question generation method based on domain terms and key sentences, which comprises the following steps:

extracting domain terms and key sentences based on dependency syntax analysis;

and generating multi-type topics based on the extracted domain terms and the key sentences.

One embodiment of the present invention takes the neural-network-based dependency parser in the Hanlp toolkit as an example to perform Chinese question generation based on domain terms and key sentences; the specific implementation process is shown in fig. 1 and includes:

at step S1, domain terms are extracted based on the dependency parsing.

In particular, as shown in figure 2,

step S101, building a dependency syntax structure on an input sentence based on the neural network-based dependency syntax analyzer in the Hanlp toolkit.

Step S102, generating candidate domain terms according to the following dependency syntax rule; Table 1 interprets the dependency relations used.

(dep)?+(amod|nn)*+(nsubj|dobj)

Table 1 Dependency relation interpretation

Symbol	Interpretation
?	Zero or one occurrence
*	Zero or more occurrences
dep	Unspecified dependency relation
amod	Adjectival modifier
nn	Noun compound modifier
nsubj	Nominal subject
dobj	Direct object
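As an illustrative sketch (not part of the patent), the rule above can be applied to the per-word dependency labels produced by any of the parsers mentioned earlier. The function name `candidate_spans` and the list-of-labels input format are assumptions, and `?`/`*` follow standard regex semantics (zero-or-one / zero-or-more):

```python
def candidate_spans(labels):
    """Return (start, end) word-index spans whose dependency labels match
    the rule (dep)? (amod|nn)* (nsubj|dobj)."""
    spans = []
    n = len(labels)
    for start in range(n):
        i = start
        if i < n and labels[i] == "dep":                 # optional (dep)? prefix
            i += 1
        while i < n and labels[i] in ("amod", "nn"):     # (amod|nn)* modifiers
            i += 1
        if i < n and labels[i] in ("nsubj", "dobj"):     # must end at a subject/object
            spans.append((start, i + 1))
    return spans
```

Overlapping spans are all kept as candidates here; the filtering and scoring steps below decide which survive.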

Step S103, evaluating and ranking the candidate domain terms by rules, word frequency, and affixes.

In particular, as shown with reference to FIG. 3,

step S103-1, filtering non-terms in the candidate domain terms based on the part-of-speech filtering rules.

The segmentation result and the part-of-speech tag of each word are obtained from the dependency syntax structure, and candidate words that cannot be terms are filtered by checking their parts of speech; for example, words containing personal pronouns are deleted. The part-of-speech filtering rules for non-terms are shown in Table 2; a candidate domain term is deleted when its part of speech satisfies any one of the rules.

TABLE 2 Part-of-speech filtering rules for non-terms

Rule No.	Rule description
1	The word ends with a numeral, preposition, conjunction, or locative word
2	The candidate is not a noun
3	The candidate contains delimiters or symbols

Step S103-2, filtering the candidate domain terms based on the grammar filtering rules. The grammar filtering rules are shown in Table 3; a candidate domain term is retained when it satisfies any one of the rules.

TABLE 3 Grammar filtering rules

Rule No.	Rule description
1	Noun + noun
2	(Adjective or noun) + noun
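A minimal sketch of the two filtering stages (steps S103-1 and S103-2). The coarse part-of-speech tags ("n" noun, "a" adjective, "m" numeral, "p" preposition, "c" conjunction, "f" locative) and the representation of a candidate as a list of (word, tag) pairs are illustrative assumptions:

```python
BAD_ENDINGS = {"m", "p", "c", "f"}        # numeral / preposition / conjunction / locative
SYMBOLS = set("，。、；：,.;:()（）")        # delimiters and symbols for rule 3

def passes_pos_filter(candidate):
    """Delete a candidate when any non-term rule of Table 2 fires."""
    words, tags = zip(*candidate)
    # Rule 1: ends with a numeral, preposition, conjunction, or locative word.
    if tags[-1] in BAD_ENDINGS:
        return False
    # Rule 2 (interpreted here as: the head word is not a noun).
    if tags[-1] != "n":
        return False
    # Rule 3: contains delimiters or symbols.
    if any(ch in SYMBOLS for w in words for ch in w):
        return False
    return True

def passes_grammar_filter(candidate):
    """Keep a candidate only if a Table 3 pattern matches:
    noun + noun, or (adjective or noun) + noun."""
    tags = [t for _, t in candidate]
    return len(tags) >= 2 and tags[-1] == "n" and all(t in ("n", "a") for t in tags[:-1])
```

Note that Table 3's rule 1 (noun + noun) is a special case of rule 2, so one check covers both.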

In step S103-3, the score of the candidate domain term is calculated by the multi-factor evaluator.

The multi-factor evaluator computes the score as shown in formula (1), taking into account both the statistical word frequency and the prefix/suffix of the phrase, which yields three cases: the candidate contains a hot-term affix, contains a non-term affix, or contains neither.

In formula (1), f_word is the frequency of the current word (candidate domain term), f^(i) is the product of the number of words with frequency i and the frequency i, C_1 is the total number of words with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter (set to 2 by experiment).

Step S104, sorting the candidate domain terms by score in descending order and extracting the specified number of domain terms.
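Step S104 amounts to a simple descending sort; the sketch below assumes a mapping from candidate term to its multi-factor score:

```python
def top_k_terms(scored, k):
    """Sort candidates by multi-factor score, descending, and keep the top k.
    `scored` maps candidate term -> score."""
    return [t for t, _ in sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:k]]
```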

And step S2, extracting key sentences based on the T-TextRank algorithm.

In particular, as shown with reference to FIG. 4,

Step S201, preprocessing the input document, including sentence segmentation, word segmentation, and stop-word removal.

In step S202, TF-IDF (Term Frequency-Inverse Document Frequency) calculation of a word is performed, and a sentence is represented by a feature vector.

Assume the Chinese document is D and contains m sentences, so that D = {S_1, S_2, …, S_m}. Each sentence can be represented by a feature word vector:

S_i = {word_i1, word_i2, …, word_iN},

where N is the number of words in the whole document and word_in is the TF-IDF value of word n in sentence S_i.

The TF-IDF calculation is shown in formula (2):

word_in = (c_n / N) × log(m / e_n),    (2)

where c_n is the number of times word n appears in the document, N is the total number of words in the document, and e_n is the number of sentences containing word n.

In step S203, the sentence similarity is calculated using cosine similarity, as shown in formula (3):

w_ij = (S_i · S_j) / (‖S_i‖ ‖S_j‖).    (3)
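Steps S202 and S203 can be sketched directly from the formula-(2) and formula-(3) definitions; the function names and the tokenized-sentence input format are assumptions:

```python
import math
from collections import Counter

def sentence_vectors(sentences):
    """sentences: list of token lists (one per sentence, stop words removed).
    Returns one TF-IDF vector per sentence, following formula (2):
    tf = c_n / N (N = total words in the document), idf = log(m / e_n)."""
    m = len(sentences)
    N = sum(len(s) for s in sentences)
    vocab = sorted({w for s in sentences for w in s})
    df = Counter(w for s in sentences for w in set(s))   # e_n: sentences containing word n
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append([tf[w] / N * math.log(m / df[w]) for w in vocab])
    return vectors

def cosine(u, v):
    """Formula (3): cosine similarity of two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```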

Step S204, ranking sentence importance with the T-TextRank algorithm, calculated as shown in formula (4). Each sentence is represented as a node S, and the sentences of the document are fully connected bidirectionally to form a graph. The initial value of each node weight WS is 1/m, and the initial weight of each edge is the sentence similarity w_ij:

WS(S_i) = (1 − d) + d · Σ_{S_j ∈ In(S_i)} [ w_ji / Σ_{S_k ∈ Out(S_j)} w_jk ] · WS(S_j),    (4)

where d is the damping coefficient, typically 0.85, In(S_i) denotes all nodes pointing to node S_i, and Out(S_j) denotes all nodes that node S_j points to. Formula (4) converges after several iterations.
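A sketch of the formula-(4) iteration on a fully connected similarity graph; the function signature, convergence tolerance, and iteration cap are assumptions, and the similarity matrix is taken as given:

```python
def textrank(sim, d=0.85, tol=1e-6, max_iter=200):
    """Iterate formula (4). sim[i][j] is the similarity between sentences i and j."""
    m = len(sim)
    ws = [1.0 / m] * m                        # initial node weight 1/m
    for _ in range(max_iter):
        new = []
        for i in range(m):
            s = 0.0
            for j in range(m):                # all other nodes point to i (full graph)
                if j == i:
                    continue
                out_j = sum(sim[j][k] for k in range(m) if k != j)
                if out_j > 0:
                    s += sim[j][i] / out_j * ws[j]
            new.append((1 - d) + d * s)
        converged = max(abs(a - b) for a, b in zip(new, ws)) < tol
        ws = new
        if converged:
            break
    return ws
```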

Step S205, sorting sentences by the converged WS values of the T-TextRank algorithm in descending order and extracting the specified number of key sentences.

Note that in the present invention the input of step S1 is processed in units of sentences, while the input of step S2 is the whole text.

In step S3, multi-type Chinese questions are automatically generated based on the extracted domain terms and key sentences.

Specifically, as shown in fig. 5, three types of Chinese questions are generated based on the extracted domain terms and key sentences.

S301, generating Chinese choice questions based on the domain terms and the key sentences. Specifically,

step S301-1, generating Chinese choice question stem. First, an extracted key sentence list is obtained, matching is performed in the key sentence list using a domain term in a domain term base as key information, a sentence containing a term and serving as a subject or an object component is selected as a stem, and a corresponding domain term portion is used for question making.

Step S301-2, generating choice-question interference items (distractors) by combining different linguistic features. The generation strategies are shown in Table 4.

TABLE 4 Choice-question interference item generation strategies

Strategy No.	Strategy description
1	Segment the domain terms and select terms with the same part of speech as the correct option
2	Select domain terms sharing an affix with the correct option
3	Train a word2vec model and select terms by cosine similarity of word vectors to the correct option
4	Select terms whose occurrence frequency in the document is similar to that of the correct option

S302, generating the Chinese fill-in-the-blank question stem based on the domain terms and the key sentences.

Specifically, the domain terms in the domain term library are used as key information to match in the key sentence library; sentences that contain a term acting as a subject or object component are selected as question stems, and the corresponding domain term is replaced with an underline.

S303, generating Chinese question and answer question stems based on the domain terms and the key sentences.

When a key sentence contains a domain term together with at least one of the following definitional feature words: "means", "is", "is a class of", "also known as", "is defined as", "abbreviated as", "that is", "is used for", or "is called", a noun-explanation question stem is generated.

When a key sentence contains one of the causal connectives "because", "so", or "therefore", the part of the sentence expressing the cause is replaced with a question word, generating a fact question stem.
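An illustrative sketch of step S303 for Chinese input; the cue lists are a subset of the feature words named above, and the question templates ("什么是…？", "为什么…？") are assumptions, not the patent's exact wording:

```python
DEFINITION_CUES = ("是指", "定义为", "简称", "又称", "称为")   # definitional feature words

def make_qa_stem(sentence, terms):
    """Generate a noun-explanation stem from a definitional key sentence,
    or a fact stem from a causal one; return None if neither cue matches."""
    for cue in DEFINITION_CUES:
        if cue in sentence:
            for t in terms:
                if t in sentence:
                    return f"什么是{t}？"          # "What is <term>?"
    # Causal sentences: ask "why" about the result clause.
    for cue in ("所以", "因此"):                    # "so" / "therefore"
        if cue in sentence:
            return "为什么" + sentence.split(cue, 1)[1].strip("，。 ") + "？"
    if "因为" in sentence:                          # "because": result precedes the cue
        return "为什么" + sentence.split("因为", 1)[0].strip("，。 ") + "？"
    return None
```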

Another embodiment of the present invention provides a chinese question generating apparatus based on domain terms and key sentences, including:

the extraction module is used for extracting the domain terms and the key sentences in the document based on the dependency syntax analysis;

and,

the generating module is used for generating multi-type questions based on the extracted domain terms and the key sentences;

the extraction module comprises a first extraction module and a second extraction module;

specifically, the first extraction module is used for,

establishing a dependency syntax structure for sentences in an input document, and generating candidate domain terms according to dependency syntax rules;

evaluating and ranking the generated candidate domain terms;

extracting a specified number of domain terms based on the sorting result;

the second extraction module is used for extracting the first extraction module,

calculating TF-IDF values of words in the input document;

calculating similarity between sentences in the document based on the TF-IDF value;

calculating the importance of the sentences based on the similarity between the sentences and sequencing;

and extracting a specified number of key sentences based on the importance sorting result of the sentences.

In an embodiment of the present invention, the first extraction module is specifically configured to,

filtering non-terms from the candidate domain terms based on the part-of-speech filtering rules, and deleting a candidate domain term when its part of speech satisfies any one of the following rules:

a. the word ends with a numeral, preposition, conjunction, or locative word;

b. a non-noun;

c. containing delimiters or symbols;

filtering the candidate domain terms based on the grammatical filtering rules, and retaining when the candidate domain terms satisfy any one of the following rules:

d. noun + noun;

e. (adjective or noun) + noun;

calculating a score for the filtered candidate domain terms:

wherein s represents the score, f_word is the frequency of the current candidate domain term, f^(i) is the product of the number of candidate domain terms with frequency i and the frequency i, C_1 is the total number of candidate domain terms with frequency 1, C is the total number of extracted candidate domain terms, n is the maximum frequency, and a is a hyperparameter;

the scores for the candidate domain terms are ranked from large to small.

In the embodiment of the present invention, the second extraction module is specifically configured to,

the TF-IDF value of the word in the input document is calculated as follows:

wherein, wordpTF-IDF value, c, representing the word pnIs the number of times a word p appears in the document, N is the total number of words in the document, m represents the number of sentences in the document, epRepresenting the number of sentences containing the word p.

In the embodiment of the present invention, the second extraction module is specifically configured to,

sentences are expressed as nodes, and the two-way full connection between the sentences forms a graph;

the importance of the sentence is iteratively calculated using the T-TextRank algorithm as follows until convergence:

wherein WS (S)i) Representing a sentence SiD is the set damping coefficient, In (S)i) Indicating a pointing node SiAll nodes of, Out (S)j) Represents a node SjAll nodes pointed to, wijRepresenting a sentence SiAnd sentence SjSimilarity of (2), wjkRepresenting a sentence SjAnd sentence SkThe similarity of (2);

in the embodiment of the invention, the generation module comprises a first generation module, a second generation module and a third generation module;

wherein the first generation module is used for generating,

obtaining the extracted key sentence list, matching in the key sentence list using the domain terms in the domain term library as key information, selecting sentences that contain a term acting as a subject or object component as question stems, and taking the corresponding domain term content as the correct option of the choice question;

generating a choice topic interference term based on at least one of the following strategies:

segmenting the domain terms and selecting domain terms with the same part of speech as the correct option as interference items;

selecting domain terms sharing an affix with the correct option as interference items;

obtaining word vectors of the domain terms by training a word2vec model, and selecting domain terms as interference items based on the cosine similarity of their word vectors to the correct option;

selecting domain terms from the domain term library whose occurrence frequency in the document is similar to that of the correct option as interference items;
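One of the Table-4 style strategies above (affix match, with a frequency-similarity fallback) can be sketched as follows; the function name and the tie-breaking order are assumptions:

```python
def distractors(answer, term_bank, freq, k=3):
    """Prefer terms sharing an affix with the correct answer (same first or
    last character), then fall back to terms of similar document frequency.
    `freq` maps term -> occurrence count in the document."""
    pool = [t for t in term_bank if t != answer]
    same_affix = [t for t in pool if t[0] == answer[0] or t[-1] == answer[-1]]
    rest = sorted((t for t in pool if t not in same_affix),
                  key=lambda t: abs(freq.get(t, 0) - freq.get(answer, 0)))
    return (same_affix + rest)[:k]
```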

the second generating module is used for generating,

matching in the key sentence library using the domain terms in the domain term library as key information, selecting sentences that contain a term acting as a subject or object component as question stems, and replacing the corresponding domain term content with an underline to generate the Chinese fill-in-the-blank question stem;

the third generation module is used for generating,

when the key sentence contains at least one of the following feature words: "means", "is", "is a class of", "also known as", "is defined as", "abbreviated as", "that is", "is used for", or "is called", and also contains a domain term, generating the noun-explanation question stem;

when the key sentence contains at least one of the following causal connectives: "because", "so", or "therefore", replacing the content of the sentence expressing the cause with a question word, generating the fact question stem.

It is to be noted that the apparatus embodiment corresponds to the method embodiment, and the implementation manners of the method embodiment are all applicable to the apparatus embodiment and can achieve the same or similar technical effects, so that the details are not described herein.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
