Topic list generation method and device based on part-of-speech structure and computer equipment

Document serial number: 1087400  Publication date: 2020-10-20  Views: 8  Language: Chinese

Reading note: this technique, "Topic list generation method and device based on part-of-speech structure and computer equipment" (基于词性结构的主题列表生成方法、装置和计算机设备), was designed and created by 柳明辉 and 徐国强 on 2020-06-24. Main content: The application relates to the technical field of artificial intelligence, and discloses a topic list generation method, apparatus, computer device and storage medium based on a part-of-speech structure. The method comprises: obtaining a corpus to be analyzed; performing word segmentation on the corpus to be analyzed to obtain a word sequence; inputting the word sequence into a preset part-of-speech structure analysis model to obtain a part-of-speech sequence; acquiring a plurality of specified phrases; inputting the plurality of specified phrases into a preset probabilistic topic model to obtain a plurality of topics; generating a frequency matrix; calling a parameter matrix; calculating a sorting matrix Y; summing the elements of each horizontal row of the sorting matrix Y to obtain n horizontal-row sum values; and arranging the n topics in descending order of the n horizontal-row sum values to obtain a topic list, and outputting the topic list. This improves the trend sensitivity, pertinence and accuracy of corpus analysis. In addition, the application relates to blockchain technology: the probabilistic topic model may be stored in a blockchain.

1. A topic list generating method based on a part of speech structure is characterized by comprising the following steps:

obtaining a corpus to be analyzed, wherein the corpus to be analyzed comprises four parts: an abstract, a foreword, a body text and a conclusion;

performing word segmentation on the corpus to be analyzed to obtain a word sequence; inputting the word sequence into a preset part-of-speech structure analysis model to obtain a part-of-speech sequence output by the part-of-speech structure analysis model, wherein the part-of-speech sequence is labeled with the word components corresponding to the words; the part-of-speech structure analysis model is a neural-network model trained with preset training data, and the training data consists of texts whose word-formation parts of speech are labeled in advance;

extracting at least one specified continuous part-of-speech structure from the part-of-speech sequence, and acquiring a plurality of specified phrases corresponding to the specified continuous part-of-speech structure in the word sequence, wherein the continuous part-of-speech structure consists of the word-formation parts of speech of two consecutive words;

inputting the plurality of specified phrases into a preset probabilistic topic model, so as to obtain a plurality of topics output by the probabilistic topic model;

generating a frequency matrix of the numbers of times the phrases corresponding to the plurality of topics appear in the four parts (the abstract, the foreword, the body text and the conclusion):

A = \begin{pmatrix} A_{11} & A_{12} & A_{13} & A_{14} \\ A_{21} & A_{22} & A_{23} & A_{24} \\ \vdots & \vdots & \vdots & \vdots \\ A_{n1} & A_{n2} & A_{n3} & A_{n4} \end{pmatrix};

calling a preset parameter matrix

B = \begin{pmatrix} B_{11} & B_{12} & B_{13} & B_{14} \\ B_{21} & B_{22} & B_{23} & B_{24} \\ \vdots & \vdots & \vdots & \vdots \\ B_{n1} & B_{n2} & B_{n3} & B_{n4} \end{pmatrix},

wherein B_{i1}, B_{i2}, B_{i3} and B_{i4} are the four parameters related to the i-th topic (i = 1, 2, …, n), correspond to the abstract, the foreword, the body text and the conclusion respectively, and satisfy B_{i1} > B_{i4} > B_{i3} > B_{i2};

according to the formula

Y = A \circ B,

calculating the sorting matrix Y, wherein \circ denotes the multiplication of the elements at the same position in the two matrices;

summing the elements of each horizontal row of the sorting matrix Y to obtain n horizontal-row sum values respectively corresponding to the n topics;

and arranging the n topics in descending order of the n horizontal-row sum values to obtain a topic list, and outputting the topic list.

2. The method for generating a topic list based on a part-of-speech structure according to claim 1, wherein before the step of performing word segmentation on the corpus to be analyzed to obtain a word sequence and inputting the word sequence into the preset part-of-speech structure analysis model, the method further comprises:

acquiring pre-collected sample data, and dividing the sample data into training data and verification data according to a preset proportion, wherein the sample data comprises training texts and the word-formation part-of-speech labels corresponding to the training texts;

inputting the training data into a preset neural network model for training, so as to obtain an intermediate model;

verifying the intermediate model with the verification data, and determining whether the verification passes;

and if the verification passes, marking the intermediate model as the part-of-speech structure analysis model.

3. The method for generating a topic list based on a part-of-speech structure according to claim 1, wherein the step of inputting the plurality of specified phrases into the preset probabilistic topic model to obtain the plurality of topics output by the probabilistic topic model comprises:

calling a plurality of preset topic word sets, wherein each topic word set comprises a topic name and a plurality of dedicated phrases corresponding to the topic name;

judging whether all the specified phrases belong to the plurality of topic word sets;

and if all the specified phrases belong to the plurality of topic word sets, acquiring the topic names of the topic word sets to which the specified phrases belong, and outputting the plurality of topic names.

4. The method for generating a topic list based on a part-of-speech structure according to claim 3, wherein after the step of judging whether all the specified phrases belong to the plurality of topic word sets, the method further comprises:

if not all of the specified phrases belong to the plurality of topic word sets, dividing the specified phrases into a first type of specified phrase and a second type of specified phrase, wherein the first type of specified phrase belongs to the topic word sets and the second type of specified phrase does not;

calculating a plurality of similarity values between the second type of specified phrase and a plurality of preset reference phrases according to a preset similarity calculation method;

judging whether the similarity values are all smaller than a preset similarity threshold;

and if the similarity values are all smaller than the preset similarity threshold, outputting the topic names corresponding to the first type of specified phrase.

5. The method for generating a topic list based on a part-of-speech structure according to claim 1, wherein the step of arranging the n topics in descending order of the n horizontal-row sum values to obtain a topic list and outputting the topic list comprises:

calling a preset first-level parameter value, a preset second-level parameter value, …, and a preset (m-1)-th-level parameter value, wherein the first-level to (m-1)-th-level parameter values increase sequentially, and m is an integer greater than 1 and smaller than n;

dividing the n horizontal-row sum values into m levels, wherein the sum values at the first level are all smaller than the first-level parameter value, the sum values at the second level are all smaller than the second-level parameter value, …, the sum values at the (m-1)-th level are all smaller than the (m-1)-th-level parameter value, and the sum values at the m-th level are all larger than the (m-1)-th-level parameter value;

and arranging the n topics in descending order according to the m levels to obtain a layered topic list, and outputting the layered topic list.

6. A topic list generation apparatus based on a part-of-speech structure, comprising:

a corpus obtaining unit, configured to obtain a corpus to be analyzed, wherein the corpus to be analyzed comprises four parts: an abstract, a foreword, a body text and a conclusion;

a part-of-speech sequence obtaining unit, configured to perform word segmentation on the corpus to be analyzed to obtain a word sequence, and input the word sequence into a preset part-of-speech structure analysis model to obtain a part-of-speech sequence output by the part-of-speech structure analysis model, wherein the part-of-speech sequence is labeled with the word components corresponding to the words; the part-of-speech structure analysis model is a neural-network model trained with preset training data, and the training data consists of texts whose word-formation parts of speech are labeled in advance;

a specified phrase obtaining unit, configured to extract at least one specified continuous part-of-speech structure from the part-of-speech sequence, and acquire, in the word sequence, a plurality of specified phrases corresponding to the specified continuous part-of-speech structure, wherein the continuous part-of-speech structure consists of the word-formation parts of speech of two consecutive words;

a topic obtaining unit, configured to input the plurality of specified phrases into a preset probabilistic topic model, so as to obtain a plurality of topics output by the probabilistic topic model;

a frequency matrix generating unit, configured to generate a frequency matrix

A = \begin{pmatrix} A_{11} & A_{12} & A_{13} & A_{14} \\ A_{21} & A_{22} & A_{23} & A_{24} \\ \vdots & \vdots & \vdots & \vdots \\ A_{n1} & A_{n2} & A_{n3} & A_{n4} \end{pmatrix}

of the numbers of times the phrases corresponding to the plurality of topics appear in the four parts (the abstract, the foreword, the body text and the conclusion), wherein A_{11}, A_{12}, A_{13} and A_{14} are the numbers of times the phrases corresponding to the first topic appear in the abstract, the foreword, the body text and the conclusion respectively; A_{21}, A_{22}, A_{23} and A_{24} are the corresponding numbers for the second topic; and A_{n1}, A_{n2}, A_{n3} and A_{n4} are the corresponding numbers for the n-th topic, the topics being n in total;

a parameter matrix calling unit, configured to call a preset parameter matrix

B = \begin{pmatrix} B_{11} & B_{12} & B_{13} & B_{14} \\ B_{21} & B_{22} & B_{23} & B_{24} \\ \vdots & \vdots & \vdots & \vdots \\ B_{n1} & B_{n2} & B_{n3} & B_{n4} \end{pmatrix},

wherein B_{i1}, B_{i2}, B_{i3} and B_{i4} are the four parameters related to the i-th topic (i = 1, 2, …, n), correspond to the abstract, the foreword, the body text and the conclusion respectively, and satisfy B_{i1} > B_{i4} > B_{i3} > B_{i2};

a sorting matrix calculation unit, configured to calculate the sorting matrix Y according to the formula Y = A \circ B, wherein \circ denotes the multiplication of the elements at the same position in the two matrices;

a horizontal-row sum calculation unit, configured to sum the elements of each horizontal row of the sorting matrix Y to obtain n horizontal-row sum values respectively corresponding to the n topics;

and a descending order arrangement unit, configured to arrange the n topics in descending order of the n horizontal-row sum values to obtain a topic list, and output the topic list.

7. The apparatus of claim 6, further comprising:

a sample data dividing unit, configured to acquire pre-collected sample data and divide the sample data into training data and verification data according to a preset proportion, wherein the sample data comprises training texts and the word-formation part-of-speech labels corresponding to the training texts;

an intermediate model obtaining unit, configured to input the training data into a preset neural network model for training, so as to obtain an intermediate model;

an intermediate model verification unit, configured to verify the intermediate model with the verification data and determine whether the verification passes;

and an intermediate model marking unit, configured to mark the intermediate model as the part-of-speech structure analysis model if the verification passes.

8. The apparatus according to claim 6, wherein the topic obtaining unit comprises:

a topic word set calling subunit, configured to call a plurality of preset topic word sets, wherein each topic word set comprises a topic name and a plurality of dedicated phrases corresponding to the topic name;

a topic word set judging subunit, configured to judge whether all the specified phrases belong to the plurality of topic word sets;

and a topic name output subunit, configured to, if all the specified phrases belong to the plurality of topic word sets, acquire the topic names of the topic word sets to which the specified phrases belong and output the plurality of topic names.

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.

Technical Field

The present application relates to the field of computers, and in particular, to a topic list generation method and apparatus based on a part-of-speech structure, a computer device, and a storage medium.

Background

Corpus analysis may employ a topic model. In the conventional scheme, the basic unit of the topic model is the single word: each word serves as the basis of analysis, and a plurality of corresponding topics are obtained. This analysis has a drawback: insufficient trend analysis. A single word can hardly express a trend. For example, in a corpus generally related to phase transition, the term "phase transition point" may occur in several places; it is collected during analysis, and the corresponding topic is determined to be "phase transition". However, the topic "phase transition" does not fully reflect the characteristics of the corpus: if the corpus actually analyzes the factors influencing the decrease of the phase transition point, the more accurate topic would be "decrease of the phase transition point". Moreover, the conventional topic model places no restriction on where words appear, i.e., a word at any position contributes equally to the topic, so the pertinence and accuracy of conventional corpus analysis are also insufficient. In conclusion, the trend sensitivity, pertinence and accuracy of conventional corpus analysis need to be improved.

Disclosure of Invention

The main purpose of the present application is to provide a topic list generation method, apparatus, computer device and storage medium based on a part-of-speech structure, which aim to improve the trend sensitivity, pertinence and accuracy of corpus analysis.

In order to achieve the above object, the present application provides a topic list generation method based on a part-of-speech structure, including the following steps:

obtaining a corpus to be analyzed, wherein the corpus to be analyzed comprises four parts: an abstract, a foreword, a body text and a conclusion;

performing word segmentation on the corpus to be analyzed to obtain a word sequence; inputting the word sequence into a preset part-of-speech structure analysis model to obtain a part-of-speech sequence output by the part-of-speech structure analysis model, wherein the part-of-speech sequence is labeled with the word components corresponding to the words; the part-of-speech structure analysis model is a neural-network model trained with preset training data, and the training data consists of texts whose word-formation parts of speech are labeled in advance;

extracting at least one specified continuous part-of-speech structure from the part-of-speech sequence, and acquiring a plurality of specified phrases corresponding to the specified continuous part-of-speech structure in the word sequence, wherein the continuous part-of-speech structure consists of the word-formation parts of speech of two consecutive words;
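The extraction step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the tag names and the example pattern ("n", "v") are invented assumptions, since the method leaves the specified continuous part-of-speech structure open.

```python
def extract_specified_phrases(words, pos_tags, pattern=("n", "v")):
    """Return two-word phrases whose consecutive POS tags match `pattern`.

    `pattern` is a specified continuous part-of-speech structure: the
    word-formation parts of speech of two consecutive words.
    """
    phrases = []
    for i in range(len(words) - 1):
        if (pos_tags[i], pos_tags[i + 1]) == pattern:
            phrases.append((words[i], words[i + 1]))
    return phrases

# Hypothetical word sequence with its part-of-speech sequence.
words = ["temperature", "drops", "phase", "changes"]
tags = ["n", "v", "n", "v"]
print(extract_specified_phrases(words, tags))
# [('temperature', 'drops'), ('phase', 'changes')]
```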

inputting the plurality of specified phrases into a preset probabilistic topic model, so as to obtain a plurality of topics output by the probabilistic topic model;

generating a frequency matrix of the numbers of times the phrases corresponding to the plurality of topics appear in the four parts (the abstract, the foreword, the body text and the conclusion):

A = \begin{pmatrix} A_{11} & A_{12} & A_{13} & A_{14} \\ A_{21} & A_{22} & A_{23} & A_{24} \\ \vdots & \vdots & \vdots & \vdots \\ A_{n1} & A_{n2} & A_{n3} & A_{n4} \end{pmatrix},

wherein A_{11}, A_{12}, A_{13} and A_{14} are the numbers of times the phrases corresponding to the first topic appear in the abstract, the foreword, the body text and the conclusion respectively; A_{21}, A_{22}, A_{23} and A_{24} are the corresponding numbers for the second topic; and A_{n1}, A_{n2}, A_{n3} and A_{n4} are the corresponding numbers for the n-th topic, the topics being n in total;

calling a preset parameter matrix

B = \begin{pmatrix} B_{11} & B_{12} & B_{13} & B_{14} \\ B_{21} & B_{22} & B_{23} & B_{24} \\ \vdots & \vdots & \vdots & \vdots \\ B_{n1} & B_{n2} & B_{n3} & B_{n4} \end{pmatrix},

wherein B_{i1}, B_{i2}, B_{i3} and B_{i4} are the four parameters related to the i-th topic (i = 1, 2, …, n), correspond to the abstract, the foreword, the body text and the conclusion respectively, and satisfy B_{i1} > B_{i4} > B_{i3} > B_{i2};

according to the formula

Y = A \circ B,

the sorting matrix Y is calculated, wherein \circ denotes the multiplication of the elements at the same position in the two matrices;

summing the elements of each horizontal row of the sorting matrix Y to obtain n horizontal-row sum values respectively corresponding to the n topics;

and arranging the n topics in descending order of the n horizontal-row sum values to obtain a topic list, and outputting the topic list.
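The ranking steps above (element-wise product, row sums, descending sort) can be sketched with NumPy. The concrete numbers are invented for illustration; in the method itself A comes from phrase counts and B is a preset parameter matrix with each row satisfying B_{i1} > B_{i4} > B_{i3} > B_{i2}.

```python
import numpy as np

# Frequency matrix A: one row per topic, columns = counts in
# abstract, foreword, body text, conclusion (illustrative values).
A = np.array([[3, 1, 8, 2],
              [1, 0, 5, 1],
              [6, 2, 9, 4]])

# Parameter matrix B: same shape; each row weights the four parts,
# with abstract > conclusion > body text > foreword.
B = np.array([[0.9, 0.2, 0.4, 0.6],
              [0.9, 0.2, 0.4, 0.6],
              [0.9, 0.2, 0.4, 0.6]])

Y = A * B                      # element-wise (Hadamard) product -> sorting matrix Y
row_sums = Y.sum(axis=1)       # one horizontal-row sum value per topic
order = np.argsort(-row_sums)  # topic indices in descending order of sum
print(order)                   # [2 0 1]
```

The topic list is then the topics reordered by `order`.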

Further, before the step of performing word segmentation on the corpus to be analyzed to obtain a word sequence and inputting the word sequence into the preset part-of-speech structure analysis model, the method further comprises:

acquiring pre-collected sample data, and dividing the sample data into training data and verification data according to a preset proportion, wherein the sample data comprises training texts and the word-formation part-of-speech labels corresponding to the training texts;

inputting the training data into a preset neural network model for training, so as to obtain an intermediate model;

verifying the intermediate model with the verification data, and determining whether the verification passes;

and if the verification passes, marking the intermediate model as the part-of-speech structure analysis model.
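The division of sample data at a preset proportion can be sketched as follows; the 0.8 ratio and the fixed seed are illustrative assumptions, not values fixed by the method.

```python
import random

def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle pre-collected (text, labels) samples and split them into
    training data and verification data at the given proportion."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sample data: training texts paired with POS label sequences.
samples = [("text_%d" % i, "labels_%d" % i) for i in range(10)]
train, valid = split_samples(samples)
print(len(train), len(valid))  # 8 2
```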

Further, the step of inputting the plurality of specified phrases into the preset probabilistic topic model to obtain the plurality of topics output by the probabilistic topic model comprises:

calling a plurality of preset topic word sets, wherein each topic word set comprises a topic name and a plurality of dedicated phrases corresponding to the topic name;

judging whether all the specified phrases belong to the plurality of topic word sets;

and if all the specified phrases belong to the plurality of topic word sets, acquiring the topic names of the topic word sets to which the specified phrases belong, and outputting the plurality of topic names.
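The lookup above can be sketched with plain dictionaries; the topic names and phrase sets below are invented placeholders for the preset topic word sets.

```python
# Hypothetical preset topic word sets: topic name -> dedicated phrases.
topic_word_sets = {
    "phase transition": {"phase change", "transition point"},
    "temperature": {"temperature drop", "heat loss"},
}

def topics_for_phrases(phrases):
    """Return the names of every topic word set that contains at least
    one of the specified phrases (insertion order preserved)."""
    names = []
    for name, word_set in topic_word_sets.items():
        if any(p in word_set for p in phrases):
            names.append(name)
    return names

print(topics_for_phrases(["phase change", "temperature drop"]))
# ['phase transition', 'temperature']
```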

Further, after the step of judging whether all the specified phrases belong to the plurality of topic word sets, the method comprises:

if not all of the specified phrases belong to the plurality of topic word sets, dividing the specified phrases into a first type of specified phrase and a second type of specified phrase, wherein the first type of specified phrase belongs to the topic word sets and the second type of specified phrase does not;

calculating a plurality of similarity values between the second type of specified phrase and a plurality of preset reference phrases according to a preset similarity calculation method;

judging whether the similarity values are all smaller than a preset similarity threshold;

and if the similarity values are all smaller than the preset similarity threshold, outputting the topic names corresponding to the first type of specified phrase.
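The fallback check above can be sketched as follows. The method does not fix the similarity calculation, so word-overlap (Jaccard) similarity stands in here as an assumption, and the reference phrases and threshold are invented.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two phrases (a stand-in for
    the unspecified preset similarity calculation method)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

reference_phrases = ["phase change", "temperature drop"]  # hypothetical
threshold = 0.5                                           # hypothetical

def below_threshold(phrase):
    """True if the phrase's similarity to every reference phrase is
    below the preset threshold."""
    return all(jaccard(phrase, r) < threshold for r in reference_phrases)

print(below_threshold("stock price"))   # True: dissimilar to every reference
print(below_threshold("phase change"))  # False: identical to a reference
```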

Further, the step of arranging the n topics in descending order of the n horizontal-row sum values to obtain a topic list and outputting the topic list comprises:

calling a preset first-level parameter value, a preset second-level parameter value, …, and a preset (m-1)-th-level parameter value, wherein the first-level to (m-1)-th-level parameter values increase sequentially, and m is an integer greater than 1 and smaller than n;

dividing the n horizontal-row sum values into m levels, wherein the sum values at the first level are all smaller than the first-level parameter value, the sum values at the second level are all smaller than the second-level parameter value, …, the sum values at the (m-1)-th level are all smaller than the (m-1)-th-level parameter value, and the sum values at the m-th level are all larger than the (m-1)-th-level parameter value;

and arranging the n topics in descending order according to the m levels to obtain a layered topic list, and outputting the layered topic list.
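The division into m levels can be sketched with `bisect`; the boundary values below are illustrative assumptions for the preset level parameter values (here m = 4, so there are m-1 = 3 boundaries).

```python
import bisect

# Hypothetical preset level parameter values, increasing sequentially.
level_params = [3.0, 6.0, 9.0]

def level_of(row_sum):
    """Map a horizontal-row sum value to its level:
    level 1: < 3.0, level 2: < 6.0, level 3: < 9.0, level 4: >= 9.0."""
    return bisect.bisect_right(level_params, row_sum) + 1

row_sums = [2.5, 7.3, 11.8, 5.0]           # illustrative sum values
print([level_of(s) for s in row_sums])     # [1, 3, 4, 2]
```

Topics can then be grouped by level and listed from the highest level down.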

The present application provides a topic list generation apparatus based on a part-of-speech structure, comprising:

a corpus obtaining unit, configured to obtain a corpus to be analyzed, wherein the corpus to be analyzed comprises four parts: an abstract, a foreword, a body text and a conclusion;

a part-of-speech sequence obtaining unit, configured to perform word segmentation on the corpus to be analyzed to obtain a word sequence, and input the word sequence into a preset part-of-speech structure analysis model to obtain a part-of-speech sequence output by the part-of-speech structure analysis model, wherein the part-of-speech sequence is labeled with the word components corresponding to the words; the part-of-speech structure analysis model is a neural-network model trained with preset training data, and the training data consists of texts whose word-formation parts of speech are labeled in advance;

a specified phrase obtaining unit, configured to extract at least one specified continuous part-of-speech structure from the part-of-speech sequence, and acquire, in the word sequence, a plurality of specified phrases corresponding to the specified continuous part-of-speech structure, wherein the continuous part-of-speech structure consists of the word-formation parts of speech of two consecutive words;

a topic obtaining unit, configured to input the plurality of specified phrases into a preset probabilistic topic model, so as to obtain a plurality of topics output by the probabilistic topic model;

a frequency matrix generating unit, configured to generate a frequency matrix

A = \begin{pmatrix} A_{11} & A_{12} & A_{13} & A_{14} \\ A_{21} & A_{22} & A_{23} & A_{24} \\ \vdots & \vdots & \vdots & \vdots \\ A_{n1} & A_{n2} & A_{n3} & A_{n4} \end{pmatrix}

of the numbers of times the phrases corresponding to the plurality of topics appear in the four parts (the abstract, the foreword, the body text and the conclusion), wherein A_{11}, A_{12}, A_{13} and A_{14} are the numbers of times the phrases corresponding to the first topic appear in the abstract, the foreword, the body text and the conclusion respectively; A_{21}, A_{22}, A_{23} and A_{24} are the corresponding numbers for the second topic; and A_{n1}, A_{n2}, A_{n3} and A_{n4} are the corresponding numbers for the n-th topic, the topics being n in total;

a parameter matrix calling unit, configured to call a preset parameter matrix

B = \begin{pmatrix} B_{11} & B_{12} & B_{13} & B_{14} \\ B_{21} & B_{22} & B_{23} & B_{24} \\ \vdots & \vdots & \vdots & \vdots \\ B_{n1} & B_{n2} & B_{n3} & B_{n4} \end{pmatrix},

wherein B_{i1}, B_{i2}, B_{i3} and B_{i4} are the four parameters related to the i-th topic (i = 1, 2, …, n), correspond to the abstract, the foreword, the body text and the conclusion respectively, and satisfy B_{i1} > B_{i4} > B_{i3} > B_{i2};

a sorting matrix calculation unit, configured to calculate a sorting matrix Y according to the formula Y = A \circ B, wherein \circ denotes the multiplication of the elements at the same position in the two matrices;

a horizontal-row sum calculation unit, configured to sum the elements of each horizontal row of the sorting matrix Y to obtain n horizontal-row sum values respectively corresponding to the n topics;

and a descending order arrangement unit, configured to arrange the n topics in descending order of the n horizontal-row sum values to obtain a topic list, and output the topic list.

Further, the apparatus comprises:

a sample data dividing unit, configured to acquire pre-collected sample data and divide the sample data into training data and verification data according to a preset proportion, wherein the sample data comprises training texts and the word-formation part-of-speech labels corresponding to the training texts;

an intermediate model obtaining unit, configured to input the training data into a preset neural network model for training, so as to obtain an intermediate model;

an intermediate model verification unit, configured to verify the intermediate model with the verification data and determine whether the verification passes;

and an intermediate model marking unit, configured to mark the intermediate model as the part-of-speech structure analysis model if the verification passes.

Further, the topic obtaining unit comprises:

a topic word set calling subunit, configured to call a plurality of preset topic word sets, wherein each topic word set comprises a topic name and a plurality of dedicated phrases corresponding to the topic name;

a topic word set judging subunit, configured to judge whether all the specified phrases belong to the plurality of topic word sets;

and a topic name output subunit, configured to, if all the specified phrases belong to the plurality of topic word sets, acquire the topic names of the topic word sets to which the specified phrases belong and output the plurality of topic names.

The present application provides a computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of any of the above methods when executing the computer program.

The present application provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any of the above.

The topic list generation method and apparatus based on a part-of-speech structure, the computer device and the storage medium obtain a corpus to be analyzed, wherein the corpus to be analyzed comprises four parts: an abstract, a foreword, a body text and a conclusion; perform word segmentation on the corpus to be analyzed to obtain a word sequence; input the word sequence into a preset part-of-speech structure analysis model to obtain a part-of-speech sequence output by the part-of-speech structure analysis model, wherein the part-of-speech sequence is labeled with the word components corresponding to the words; extract at least one specified continuous part-of-speech structure from the part-of-speech sequence, and acquire a plurality of specified phrases corresponding to the specified continuous part-of-speech structure in the word sequence; input the plurality of specified phrases into a preset probabilistic topic model to obtain a plurality of topics output by the probabilistic topic model; generate a frequency matrix of the numbers of times the phrases corresponding to the plurality of topics appear in the four parts; call a preset parameter matrix; calculate a sorting matrix Y; sum the elements of each horizontal row of the sorting matrix Y to obtain n horizontal-row sum values respectively corresponding to the n topics; and arrange the n topics in descending order of the n horizontal-row sum values to obtain a topic list, and output the topic list. This improves the trend sensitivity, pertinence and accuracy of corpus analysis.

Drawings

Fig. 1 is a schematic flowchart of a topic list generation method based on a part-of-speech structure according to an embodiment of the present application;

fig. 2 is a schematic block diagram illustrating a structure of a topic list generation apparatus based on a part-of-speech structure according to an embodiment of the present application;

fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.

The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Referring to fig. 1, an embodiment of the present application provides a topic list generation method based on a part-of-speech structure, including the following steps:

S1, obtaining a corpus to be analyzed, wherein the corpus to be analyzed comprises four parts, namely an abstract, an introduction, a body and a conclusion;

S2, performing word segmentation on the corpus to be analyzed to obtain a word sequence; inputting the word sequence into a preset part-of-speech structure analysis model to obtain a part-of-speech sequence output by the part-of-speech structure analysis model, wherein the part-of-speech sequence is labeled with the sentence components corresponding to the words; the part-of-speech structure analysis model is based on a neural network and is trained with preset training data, the training data consisting of texts whose word formation parts of speech are labeled in advance;

S3, extracting at least one specified continuous part-of-speech structure from the part-of-speech sequence, and acquiring a plurality of specified phrases in the word sequence corresponding to the specified continuous part-of-speech structure; the continuous part-of-speech structure consists of the word formation parts of speech corresponding to two consecutive words;

S4, inputting the specified phrases into a preset probabilistic topic model to obtain a plurality of topics output by the probabilistic topic model;

S5, generating a frequency matrix

A = | A11 A12 A13 A14 |
    | A21 A22 A23 A24 |
    | ...             |
    | An1 An2 An3 An4 |

of the number of times the phrases corresponding to the plurality of topics appear in each of the four parts, namely the abstract, the introduction, the body and the conclusion, wherein A11, A12, A13 and A14 are the numbers of times the phrases corresponding to the first topic appear in the abstract, the introduction, the body and the conclusion respectively; A21, A22, A23 and A24 are the numbers of times the phrases corresponding to the second topic appear in the abstract, the introduction, the body and the conclusion respectively; and An1, An2, An3 and An4 are the numbers of times the phrases corresponding to the n-th topic appear in the abstract, the introduction, the body and the conclusion respectively, there being n topics in total;

S6, calling a preset parameter matrix

B = | B11 B12 B13 B14 |
    | B21 B22 B23 B24 |
    | ...             |
    | Bn1 Bn2 Bn3 Bn4 |

wherein B11, B12, B13 and B14 are four parameters related to the first topic and correspond to the abstract, the introduction, the body and the conclusion respectively, with B11 greater than B14, B14 greater than B13 and B13 greater than B12; B21, B22, B23 and B24 are four parameters related to the second topic and correspond to the abstract, the introduction, the body and the conclusion respectively, with B21 greater than B24, B24 greater than B23 and B23 greater than B22; and Bn1, Bn2, Bn3 and Bn4 are four parameters related to the n-th topic and correspond to the abstract, the introduction, the body and the conclusion respectively, with Bn1 greater than Bn4, Bn4 greater than Bn3 and Bn3 greater than Bn2;

S7, calculating a sorting matrix Y according to the formula

Y = A ∘ B

wherein ∘ denotes element-wise (Hadamard) multiplication, i.e. the multiplication of the elements at the same position in the two matrices;

S8, summing the elements of each row of the sorting matrix Y to obtain n row sum values respectively corresponding to the n topics;

and S9, arranging the n topics in descending order according to the n row sum values to obtain a topic list, and outputting the topic list.

Through this design the method and the device improve the trend sensitivity, pertinence and accuracy of the corpus analysis: the continuous part-of-speech structure makes trend analysis possible, while the parameter matrix distinguishes the sources of the words that contribute to each topic, improving pertinence and accuracy.

As described in step S1, the corpus to be analyzed is obtained, where the corpus to be analyzed includes four parts, i.e., an abstract, an introduction, a body and a conclusion. The corpus to be analyzed may be any feasible corpus, such as professional literature; the application is particularly suitable for topic analysis of professional literature. The contributions of words in the four parts to a topic differ. In particular, the abstract is a condensation of the whole corpus in which the wording is most deliberate, so the words appearing there contribute most to the topic. By similar reasoning, the words of the conclusion contribute the second most, the body the third, and the introduction the least. This property is the basis on which the subsequent topic analysis of the application improves pertinence and accuracy.

As described in step S2, word segmentation is performed on the corpus to be analyzed to obtain a word sequence; the word sequence is input into a preset part-of-speech structure analysis model to obtain a part-of-speech sequence output by the model, wherein the part-of-speech sequence is labeled with the sentence components corresponding to the words; the part-of-speech structure analysis model is based on a neural network and is trained with preset training data, the training data consisting of texts whose word formation parts of speech are labeled in advance. The word segmentation can be performed in any feasible manner, for example with the open-source JIEBA word segmentation tool. The part-of-speech structure analysis model in effect labels the word formation parts of speech of the word sequence, thereby obtaining the part-of-speech sequence. Any feasible model may be adopted, for example an existing part-of-speech tagging model, which is not described here again. The word formation parts of speech include: subjects, objects, predicates, attributives, adverbials, and the like.

As described in step S3, at least one specified continuous part-of-speech structure is extracted from the part-of-speech sequence, and a plurality of specified phrases in the word sequence corresponding to the specified continuous part-of-speech structure are acquired; the continuous part-of-speech structure consists of the word formation parts of speech corresponding to two consecutive words. The specified continuous part-of-speech structures include, for example, the verb-object relationship, the preposition-object relationship, the attributive-head relationship, the adverbial-head structure, and so on; the at least one specified continuous part-of-speech structure may be only one of these, or several, or even all of them. By contrast, general topic analysis takes single words as its basis, whereas the present application is based on the plurality of specified phrases corresponding to the specified continuous part-of-speech structure, which makes trend analysis of the topic possible. For example, general topic analysis can only obtain a topic such as "phase change" or "phase change point", whereas the present application can obtain the topic "phase change point decreases", thereby realizing trend analysis.
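The phrase-extraction step above can be sketched in a few lines of Python. This is an illustrative sketch only: the English tag names, the toy word sequence and the structure set are assumptions for illustration, not the application's actual tag inventory.

```python
# Illustrative sketch of steps S2-S3: given a word sequence and the
# corresponding part-of-speech sequence, collect every two-word phrase whose
# consecutive POS pair matches a specified continuous part-of-speech
# structure. The tag names and the toy sentence are assumptions.

def extract_phrases(words, pos_tags, specified_structures):
    """Return the two-word phrases whose consecutive POS pair is specified."""
    phrases = []
    for i in range(len(words) - 1):
        if (pos_tags[i], pos_tags[i + 1]) in specified_structures:
            phrases.append((words[i], words[i + 1]))
    return phrases

words = ["phase-change-point", "decreases", "slowly"]
pos_tags = ["subject", "predicate", "adverbial"]
structures = {("subject", "predicate")}   # one specified continuous structure
print(extract_phrases(words, pos_tags, structures))
# prints [('phase-change-point', 'decreases')]
```

A subject-predicate pair is what captures trend phrases such as "phase-change-point decreases", which a single-word analysis would split apart.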

As described in step S4, the specified phrases are input into a preset probabilistic topic model to obtain a plurality of topics output by the probabilistic topic model. The probabilistic topic model may be any feasible model, such as latent Dirichlet allocation or hierarchical latent Dirichlet allocation. The plurality of topics output by the probabilistic topic model are obtained on the basis of the plurality of specified phrases. It should be noted that the probabilistic topic model adopted in the present application may ignore the number of times a phrase occurs; that is, a phrase need only occur once for the topic corresponding to that phrase to be output.

As described in step S5 above, a frequency matrix A is generated of the number of times the phrases corresponding to the plurality of topics appear in each of the four parts, namely the abstract, the introduction, the body and the conclusion, wherein A11, A12, A13 and A14 are the numbers of times the phrases corresponding to the first topic appear in the abstract, the introduction, the body and the conclusion respectively; A21, A22, A23 and A24 are the corresponding counts for the second topic; and An1, An2, An3 and An4 are the corresponding counts for the n-th topic, there being n topics in total. Each element in the frequency matrix is thus tied to the number of occurrences of the phrases corresponding to one topic. Since the frequency matrix has four columns, the positions where the phrases appear are distinguished, namely whether they appear in the abstract, the introduction, the body or the conclusion, so that the contributions of phrases in different positions can be distinguished.
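Building the frequency matrix amounts to counting phrase occurrences per section. A minimal sketch, with an invented phrase and four invented section texts:

```python
# Sketch of step S5: build the n x 4 frequency matrix A, where row i counts
# how often the phrases of topic i appear in the abstract, introduction,
# body and conclusion. The phrase and the four section texts are invented.

def frequency_matrix(topic_phrases, sections):
    """topic_phrases: one phrase list per topic; sections: the four parts."""
    return [[sum(sec.count(p) for p in phrases) for sec in sections]
            for phrases in topic_phrases]

sections = [
    "phase change point decreases",           # abstract
    "background on phase change",             # introduction
    "the phase change point decreases when",  # body
    "phase change point decreases overall",   # conclusion
]
A = frequency_matrix([["phase change point decreases"]], sections)
print(A)  # prints [[1, 0, 1, 1]]
```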

As described in step S6, the preset parameter matrix

B = | B11 B12 B13 B14 |
    | B21 B22 B23 B24 |
    | ...             |
    | Bn1 Bn2 Bn3 Bn4 |

is called, wherein B11, B12, B13 and B14 are four parameters related to the first topic and correspond to the abstract, the introduction, the body and the conclusion respectively, with B11 greater than B14, B14 greater than B13 and B13 greater than B12; B21, B22, B23 and B24 are four parameters related to the second topic with the same correspondence and ordering; and Bn1, Bn2, Bn3 and Bn4 are four parameters related to the n-th topic with the same correspondence and ordering. The elements of the parameter matrix can be obtained in any feasible manner, for example by statistics over pre-collected corpora of known topics: phrase analysis yields the numbers of times the phrases corresponding to a topic appear in the abstract, the introduction, the body and the conclusion, and dividing the count for each part by the total count gives the element value at the corresponding position in the parameter matrix. The abstract, introduction, body and conclusion differ in importance to the corpus as a whole: the abstract matters more than the conclusion, the conclusion more than the body, and the body more than the introduction, which is why Bn1 is set greater than Bn4, Bn4 greater than Bn3, and Bn3 greater than Bn2.

As stated in step S7 above, the sorting matrix Y is calculated according to the formula

Y = A ∘ B

wherein ∘ denotes element-wise multiplication. The resulting sorting matrix Y is also an n × 4 matrix, and the element at each position of Y is the product of the elements at the corresponding positions of the frequency matrix and the parameter matrix. The sorting matrix Y therefore reflects the contributions of the phrases to the topics.

As described in step S8, the elements of each row of the sorting matrix Y are summed to obtain n row sum values respectively corresponding to the n topics. Each row sum value reflects how strongly the topic corresponding to that row influences the corpus to be analyzed: the larger the row sum value, the more likely the corresponding topic is the main topic of the corpus to be analyzed.

As described in step S9, the n topics are sorted in descending order according to the n row sum values to obtain a topic list, and the topic list is output. In the list obtained after the descending sort, the topic at the head is the one that best reflects the corpus to be analyzed, and so on down to the last topic. The topic list thus reflects the contributions of phrases of different provenance to the topics, improving the pertinence and accuracy of the topic analysis.
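The ranking computation of steps S7 to S9 can be sketched end to end: element-wise product, row sums, descending sort. All numbers are invented for illustration.

```python
# Sketch of steps S7-S9: element-wise multiply the frequency matrix A with
# the parameter matrix B (the Hadamard product), sum each row of the sorting
# matrix Y, and sort the topics by row sum in descending order.

def rank_topics(A, B, topics):
    Y = [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    row_sums = [round(sum(row), 4) for row in Y]
    return sorted(zip(topics, row_sums), key=lambda t: t[1], reverse=True)

A = [[2, 0, 3, 1],   # topic 1: counts in abstract/introduction/body/conclusion
     [0, 1, 5, 0]]   # topic 2
B = [[0.5, 0.1, 0.2, 0.3],
     [0.5, 0.1, 0.2, 0.3]]
print(rank_topics(A, B, ["topic 1", "topic 2"]))
# prints [('topic 1', 1.9), ('topic 2', 1.1)]
```

Note how topic 2 occurs more often in total (six times versus six weighted differently) yet ranks lower, because its occurrences sit in the lightly weighted body rather than the abstract.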

Further, the present application also relates to blockchain technology, and the probabilistic topic model can be stored in a blockchain. A blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, each data block containing information on a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

In one embodiment, before the step S2 of performing word segmentation on the corpus to be analyzed to obtain a word sequence and inputting the word sequence into a preset part-of-speech structure analysis model to obtain a part-of-speech sequence output by the model (the part-of-speech sequence being labeled with the sentence components corresponding to the words, and the model being based on a neural network and trained with preset training data consisting of texts whose word formation parts of speech are labeled in advance), the method comprises the following steps:

S11, obtaining pre-collected sample data, and dividing the sample data into training data and verification data according to a preset ratio, wherein the sample data comprises training texts and the word formation part-of-speech labels corresponding to the training texts;

S12, inputting the training data into a preset neural network model for training to obtain an intermediate model;

S13, verifying the intermediate model with the verification data, and judging whether the verification passes;

and S14, if the verification passes, recording the intermediate model as the part-of-speech structure analysis model.

As described above, the intermediate model is thereby recorded as the part-of-speech structure analysis model. The preset ratio may be any feasible ratio, for example from 9:1 to 99:1, and can be adjusted according to the actual number of samples. The neural network may be any feasible model, such as a long short-term memory model or a deep convolutional generative adversarial network. The sample data is divided into training data and verification data according to the preset ratio, and the training data is input into the preset neural network model for training to obtain an intermediate model. Training may be carried out in any feasible manner, for example with stochastic gradient descent. The intermediate model is then verified with the verification data; if the verification passes, the trained intermediate model is qualified for the part-of-speech structure analysis task and is therefore recorded as the part-of-speech structure analysis model. Further, the part-of-speech structure analysis model may be stored in a preset blockchain: for example, it is sent from one blockchain node to the other auditing nodes in the blockchain, and after the other auditing nodes pass the audit, the model is stored in the public ledger of the blockchain in the form of a new block.
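The sample-splitting part of steps S11 to S14 can be sketched directly; model training and verification are omitted, since the step text leaves the network architecture open. The 9:1 ratio and the toy samples are illustrative assumptions.

```python
# Sketch of step S11: split pre-collected sample data into training and
# verification data by a preset ratio (9:1 here). Model training itself is
# omitted; only the split that the step specifies is shown.
import random

def split_samples(samples, train_fraction=0.9, seed=42):
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

samples = [("text %d" % i, "labels %d" % i) for i in range(100)]
train, valid = split_samples(samples)
print(len(train), len(valid))  # prints 90 10
```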

In one embodiment, the step S4 of inputting the specified phrases into a preset probabilistic topic model to obtain multiple topics output by the probabilistic topic model includes:

S401, calling a plurality of preset topic word sets, wherein each topic word set comprises a topic name and a plurality of special phrases corresponding to the topic name;

S402, judging whether all the specified phrases belong to the plurality of topic word sets;

and S403, if all the specified phrases belong to the plurality of topic word sets, acquiring the topic names of the topic word sets to which the specified phrases belong, and outputting the plurality of topic names.

As described above, the plurality of specified phrases are input into the preset probabilistic topic model to obtain the plurality of topics output by it. Each topic word set comprises a plurality of corresponding special phrases; for example, the topic "phase change point decreases" includes "temperature decreases", "crystallization point decreases", and the like. A corpus generally involves several topics, so a plurality of topic word sets are preset, and when a specified phrase is determined to belong to a certain topic word set, the topic name corresponding to that set is a topic related to the corpus. It is therefore judged whether all the specified phrases belong to the plurality of topic word sets; if so, the topic names of the topic word sets to which they belong are acquired and output. A plurality of topic names are thus output preliminarily while ignoring phrase frequency, which improves processing efficiency and realizes trend analysis; because the frequency matrix is used subsequently to subdivide the phrase contributions, the accuracy and pertinence of the topic analysis remain assured.
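The set-membership check of steps S401 to S403 reduces to dictionary lookups. The topic names and phrase lists below are invented for illustration:

```python
# Sketch of steps S401-S403: check whether every specified phrase belongs to
# some preset topic word set and, if so, output the matching topic names.

topic_word_sets = {
    "phase change point decreases": {"temperature decreases",
                                     "crystallization point decreases"},
    "pressure rises": {"pressure increases"},
}

def topics_for(specified_phrases):
    names = []
    for phrase in specified_phrases:
        hits = [name for name, words in topic_word_sets.items()
                if phrase in words]
        if not hits:
            return None   # a phrase matched no topic word set (step S4021 case)
        for name in hits:
            if name not in names:
                names.append(name)
    return names

print(topics_for(["temperature decreases", "pressure increases"]))
# prints ['phase change point decreases', 'pressure rises']
```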

In one embodiment, after the step S402 of determining whether all the specified phrases belong to the plurality of topic word sets, the method includes:

S4021, if not all of the specified phrases belong to the plurality of topic word sets, dividing the specified phrases into first-class specified phrases and second-class specified phrases, wherein the first-class specified phrases belong to the topic word sets and the second-class specified phrases do not;

S4022, calculating, according to a preset similarity calculation method, a plurality of similarity values between the second-class specified phrases and a plurality of preset reference phrases;

S4023, judging whether the similarity values are all smaller than a preset similarity threshold;

and S4024, if the similarity values are all smaller than the preset similarity threshold, outputting the topic names corresponding to the first-class specified phrases.

As described above, if the similarity values are all smaller than the preset similarity threshold, the topic names corresponding to the first-class specified phrases are output. If not all the specified phrases belong to the plurality of topic word sets, then under the usual judgment the phrases that belong to no topic word set contribute nothing to the corpus. In practice, however, the topic word sets may have been drawn up inaccurately, i.e., some of those phrases can in fact contribute to a topic. Therefore, to further improve the accuracy of the topic analysis, a plurality of similarity values between the second-class specified phrases and a plurality of preset reference phrases are calculated according to a preset similarity calculation method; if the similarity values are all smaller than the preset similarity threshold, the second-class specified phrases indeed contribute nothing, and only the topic names corresponding to the first-class specified phrases are output. Further, if not all the similarity values are smaller than the preset similarity threshold, the specified reference phrases whose similarity to a second-class specified phrase is not smaller than the threshold are acquired, the topic names corresponding to those reference phrases are acquired, and the topic names corresponding to the first-class specified phrases are output together with the topic names corresponding to the specified reference phrases. The reference phrases are phrases extracted from the different topic word sets, in any feasible manner, such as random extraction.
The similarity calculation method may likewise be any feasible method: for example, by querying a preset word vector library, the plurality of reference phrases are mapped to a plurality of first vectors and a second-class specified phrase is mapped to a second vector, and the similarity between the first vectors and the second vector is computed with cosine similarity, thereby obtaining the plurality of similarity values between the second-class specified phrase and the preset plurality of reference phrases. The accuracy of the topic analysis is improved accordingly.
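The cosine-similarity variant just mentioned can be sketched with a toy word vector table. The two-dimensional vectors and the phrase names are invented for illustration; a real word vector library would supply high-dimensional embeddings.

```python
# Sketch of step S4022 under the cosine-similarity method the description
# mentions: phrases are mapped to vectors through a word vector table and
# compared with cosine similarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vector_table = {
    "melting point decreases": [1.0, 0.2],   # a second-class specified phrase
    "temperature decreases":   [0.9, 0.3],   # reference phrases
    "pressure increases":      [-0.1, 1.0],
}

query = vector_table["melting point decreases"]
for ref in ("temperature decreases", "pressure increases"):
    print(ref, round(cosine(query, vector_table[ref]), 3))
```

With a threshold of, say, 0.8, the first reference phrase would exceed it and pull in its topic, while the second would not.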

In one embodiment, the step S9 of sorting the n subjects in descending order according to the n horizontal line summation values to obtain a subject list, and outputting the subject list includes:

S901, calling preset first-level, second-level, ..., and (m-1)-th-level parameter values, wherein the first-level parameter value, the second-level parameter value, ..., and the (m-1)-th-level parameter value increase in that order;

S902, dividing the n row sum values into m levels, wherein the row sum values in the first level are smaller than the first-level parameter value, the row sum values in the second level are smaller than the second-level parameter value, ..., the row sum values in the (m-1)-th level are smaller than the (m-1)-th-level parameter value, and the row sum values in the m-th level are larger than the (m-1)-th-level parameter value;

and S903, arranging the n topics in descending order by these m levels to obtain a layered topic list, and outputting the layered topic list.

As described above, the n topics are sorted in descending order according to the n row sum values to obtain a topic list, and the topic list is output. A traditional scheme outputs multiple topics sorted simply by their contributions to the corpus; in some cases, however, the differences in contribution between topics are so small that the contributions are effectively equivalent, i.e., the corpus has several parallel topics, which the topic list output by the traditional scheme cannot reflect. Therefore, the preset first-level to (m-1)-th-level parameter values are called; the n row sum values are divided into m levels; and the n topics are arranged in descending order by these m levels to obtain a layered topic list, which is output. Layered output thus blurs insignificant differences between topics within the same level and is conducive to accurate topic analysis.
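The layering of steps S901 to S903 is a binning of row sums against ascending thresholds. A minimal sketch, with invented thresholds and row sums:

```python
# Sketch of steps S901-S903: bin the n row sum values into m levels using
# m-1 ascending level parameter values, then emit the topics level by level
# in descending order.
import bisect

def layer_topics(topics, row_sums, level_params):
    """level_params: ascending first- to (m-1)-th-level parameter values."""
    levels = {}
    for topic, s in zip(topics, row_sums):
        level = bisect.bisect_right(level_params, s) + 1  # 1-based level
        levels.setdefault(level, []).append(topic)
    # highest level first; topics within one level count as equivalent
    return [levels[k] for k in sorted(levels, reverse=True)]

topics = ["t1", "t2", "t3", "t4"]
row_sums = [0.4, 1.9, 2.0, 0.2]
level_params = [0.5, 1.5]        # m = 3 levels
print(layer_topics(topics, row_sums, level_params))
# prints [['t2', 't3'], ['t1', 't4']]
```

Topics t2 and t3 land in the same top level even though their row sums differ slightly, which is exactly the parallel-topic blurring the embodiment describes.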

The topic list generation method based on a part-of-speech structure obtains a corpus to be analyzed, wherein the corpus to be analyzed comprises four parts, namely an abstract, an introduction, a body and a conclusion; performs word segmentation on the corpus to be analyzed to obtain a word sequence; inputs the word sequence into a preset part-of-speech structure analysis model to obtain a part-of-speech sequence output by the model, wherein the part-of-speech sequence is labeled with the sentence components corresponding to the words; extracts at least one specified continuous part-of-speech structure from the part-of-speech sequence, and acquires a plurality of specified phrases in the word sequence corresponding to the specified continuous part-of-speech structure; inputs the specified phrases into a preset probabilistic topic model to obtain a plurality of topics output by the probabilistic topic model; generates a frequency matrix of the number of times the phrases corresponding to the topics appear in each of the four parts; calls a preset parameter matrix; calculates a sorting matrix Y; sums the elements of each row of the sorting matrix Y to obtain n row sum values respectively corresponding to the n topics; and arranges the n topics in descending order according to the n row sum values to obtain a topic list, and outputs the topic list. The trend sensitivity, pertinence and accuracy of the corpus analysis are thereby improved.

Referring to fig. 2, an embodiment of the present application provides a topic list generation apparatus based on a part of speech structure, including:

the corpus analyzing device comprises a to-be-analyzed corpus acquiring unit 10, a to-be-analyzed corpus acquiring unit, a analyzing unit and a analyzing unit, wherein the to-be-analyzed corpus comprises four parts, namely an abstract, a foresight, a text and a final sentence;

a part-of-speech sequence obtaining unit 20, configured to perform word segmentation on the corpus to be analyzed, so as to obtain a word sequence; inputting the word sequence into a preset part-of-speech structure analysis model so as to obtain a part-of-speech sequence output by the part-of-speech structure analysis model, wherein the part-of-speech sequence is labeled with word components corresponding to words; the part-of-speech structure analysis model is based on a neural network and is formed by training by using preset training data, and the training data consists of texts with word-forming parts of speech labeled in advance;

a plurality of specified word group obtaining units 30, configured to extract at least one specified continuous part-of-speech structure from the part-of-speech sequence, and obtain a plurality of specified word groups corresponding to the specified continuous part-of-speech structure in the word sequence; the continuous part-of-speech structure consists of word formation parts of speech corresponding to two continuous words;

a plurality of topic acquisition units 40, configured to input the plurality of specified phrases into a preset probabilistic topic model, so as to obtain a plurality of topics output by the probabilistic topic model;

a frequency matrix generating unit 50, configured to generate a frequency matrix A of the number of times the phrases corresponding to the plurality of topics appear in each of the four parts, namely the abstract, the introduction, the body and the conclusion, wherein A11, A12, A13 and A14 are the numbers of times the phrases corresponding to the first topic appear in the abstract, the introduction, the body and the conclusion respectively; A21, A22, A23 and A24 are the corresponding counts for the second topic; and An1, An2, An3 and An4 are the corresponding counts for the n-th topic, there being n topics in total;

a parameter matrix calling unit 60, configured to call the preset parameter matrix B, wherein B11, B12, B13 and B14 are four parameters related to the first topic and correspond to the abstract, the introduction, the body and the conclusion respectively, with B11 greater than B14, B14 greater than B13 and B13 greater than B12; B21, B22, B23 and B24 are four parameters related to the second topic with the same correspondence and ordering; and Bn1, Bn2, Bn3 and Bn4 are four parameters related to the n-th topic with the same correspondence and ordering;

a sorting matrix calculation unit 70, configured to calculate the sorting matrix Y according to the formula Y = A ∘ B, wherein ∘ denotes the multiplication of the elements at the same position in the two matrices;

a row sum calculation unit 80, configured to sum the elements of each row of the sorting matrix Y to obtain n row sum values respectively corresponding to the n topics;

a descending order arrangement unit 90, configured to arrange the n topics in descending order according to the n row sum values to obtain a topic list, and output the topic list.

The operations respectively executed by the units or the sub-units correspond to the steps of the topic list generation method based on the part-of-speech structure in the foregoing embodiment one by one, and are not described herein again.

In one embodiment, the apparatus comprises:

the system comprises a sample data dividing unit, a word formation part and a verification unit, wherein the sample data dividing unit is used for acquiring pre-collected sample data and dividing the sample data into training data and verification data according to a preset proportion, and the sample data comprises a training text and word formation part-of-speech labels corresponding to the training text;

the intermediate model obtaining unit is used for inputting the training data into a preset neural network model for training so as to obtain an intermediate model;

the intermediate model verification unit is used for verifying the intermediate model by using the verification data and judging whether the verification result is passed;

and the intermediate model marking unit is used for marking the intermediate model as a part-of-speech structure analysis model if the verification result is that the verification is passed.

The operations respectively executed by the units or the sub-units correspond to the steps of the topic list generation method based on the part-of-speech structure in the foregoing embodiment one by one, and are not described herein again.

In one embodiment, the plurality of topic acquisition units include:

the topic word set calling subunit is used for calling a plurality of preset topic word sets, and each topic word set comprises a topic name and a plurality of special phrases corresponding to the topic name;

a topic word set judging subunit, configured to judge whether all the specified phrases belong to the multiple topic word sets;

and a topic name output subunit, configured to acquire, if all the specified phrases belong to the plurality of topic word sets, the plurality of topic names in the plurality of topic word sets to which the specified phrases belong, and output the plurality of topic names.

The operations respectively executed by the units or the sub-units correspond to the steps of the topic list generation method based on the part-of-speech structure in the foregoing embodiment one by one, and are not described herein again.
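The membership check performed by the subunits above can be sketched as follows. The topic word sets, topic names and phrases are hypothetical stand-ins; only the lookup logic reflects the embodiment.

```python
# Hypothetical topic word sets: each topic name maps to the special
# phrases belonging to it (all names and phrases are illustrative).
topic_word_sets = {
    "finance": {"interest rate", "stock price", "bond yield"},
    "health": {"blood pressure", "heart rate"},
}

def topics_for(specified_phrases):
    """Return the topic names whose word sets contain the given
    specified phrases, or None if some phrase belongs to no set."""
    names = set()
    for phrase in specified_phrases:
        owners = [name for name, words in topic_word_sets.items()
                  if phrase in words]
        if not owners:          # phrase outside every topic word set
            return None
        names.update(owners)
    return names

print(topics_for(["stock price", "heart rate"]))  # both phrases covered
print(topics_for(["stock price", "weather"]))     # one phrase uncovered
```

When every phrase is covered the topic names are output directly; the uncovered case falls through to the similarity handling described next in the embodiment.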

In one embodiment, the apparatus comprises:

a specified phrase dividing unit, configured to divide the specified phrases into first-type specified phrases and second-type specified phrases if not all of the specified phrases belong to the plurality of topic word sets, wherein the first-type specified phrases belong to the topic word sets and the second-type specified phrases do not belong to the topic word sets;

a similarity calculation unit, configured to calculate a plurality of similarity values between the second-type specified phrases and a plurality of preset reference phrases according to a preset similarity calculation method;

a similarity threshold judgment unit, configured to judge whether the similarity values are all smaller than a preset similarity threshold;

and a topic name output unit, configured to output the topic name corresponding to the first-type specified phrases if the similarity values are all smaller than the preset similarity threshold.

The operations respectively executed by the units or the sub-units correspond to the steps of the topic list generation method based on the part-of-speech structure in the foregoing embodiment one by one, and are not described herein again.
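The threshold test above can be sketched as follows. The patent names only "a preset similarity calculation method", so `difflib.SequenceMatcher` is used here purely as a stand-in, and the 0.6 threshold and reference phrases are hypothetical.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.6  # illustrative; the patent's threshold is preset

def similarity(a, b):
    """Stand-in similarity measure; any preset method could be used."""
    return SequenceMatcher(None, a, b).ratio()

def all_below_threshold(second_type_phrase, reference_phrases):
    """True when every similarity value between the second-type
    specified phrase and the reference phrases is below the preset
    threshold, i.e. the phrase resembles no known reference phrase."""
    return all(similarity(second_type_phrase, r) < SIMILARITY_THRESHOLD
               for r in reference_phrases)

refs = ["interest rate", "exchange rate"]
print(all_below_threshold("qwzx", refs))            # unlike every reference
print(all_below_threshold("interest rates", refs))  # close to a reference
```

When every similarity value falls below the threshold, the second-type phrases carry no recoverable topic signal, so only the topic names of the first-type phrases are output.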

In one embodiment, the descending order arrangement unit 90 includes:

a hierarchical parameter value calling subunit, configured to call a preset first hierarchical parameter value, second hierarchical parameter value, ..., and (m-1)th hierarchical parameter value, wherein the first hierarchical parameter value, the second hierarchical parameter value, ..., and the (m-1)th hierarchical parameter value increase sequentially in numerical value, and m is an integer greater than 1 and less than n;

a horizontal row sum value dividing subunit, configured to divide the n horizontal row sum values into m levels, wherein the horizontal row sum values at the first level are all smaller than the first hierarchical parameter value, the horizontal row sum values at the second level are all smaller than the second hierarchical parameter value, ..., and the horizontal row sum values at the (m-1)th level are all smaller than the (m-1)th hierarchical parameter value;

and a descending order arrangement subunit, configured to arrange the n topics in descending order according to the m levels, so as to obtain a hierarchical topic list, and output the hierarchical topic list.

The operations respectively executed by the units or the sub-units correspond to the steps of the topic list generation method based on the part-of-speech structure in the foregoing embodiment one by one, and are not described herein again.
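The level division above can be sketched as follows. The parameter values and row sums are made up, and the boundary handling (a sum equal to a parameter value falling into the higher level) is an assumption the patent leaves open.

```python
from bisect import bisect_right

def divide_into_levels(row_sums, level_params):
    """Assign each horizontal row sum value to one of m levels using
    m-1 sequentially increasing hierarchical parameter values: a sum
    below the first parameter falls in level 1, below the second in
    level 2, and so on; sums at or above the last parameter value
    fall in level m."""
    return [bisect_right(level_params, s) + 1 for s in row_sums]

# Hypothetical parameter values for m = 3 levels (so m - 1 = 2 cut-offs).
level_params = [10, 20]
row_sums = [5, 14, 21, 8]
levels = divide_into_levels(row_sums, level_params)
print(levels)
```

The topics are then arranged in descending order by level to produce the hierarchical topic list.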

The topic list generating device based on the part-of-speech structure acquires a corpus to be analyzed, wherein the corpus to be analyzed comprises four parts, namely an abstract, a foreword, a body text and a conclusion; performs word segmentation on the corpus to be analyzed to obtain a word sequence; inputs the word sequence into a preset part-of-speech structure analysis model, so as to obtain a part-of-speech sequence output by the part-of-speech structure analysis model, wherein the words in the part-of-speech sequence are labeled with their corresponding part-of-speech components; extracts at least one specified continuous part-of-speech structure from the part-of-speech sequence, and acquires a plurality of specified phrases corresponding to the specified continuous part-of-speech structure in the word sequence; inputs the plurality of specified phrases into a preset probabilistic topic model, so as to obtain a plurality of topics output by the probabilistic topic model; generates frequency matrixes of the phrases corresponding to the plurality of topics appearing in each of the four parts; calls a preset parameter matrix; calculates a sorting matrix Y; sums the elements in each horizontal row of the sorting matrix Y, so as to obtain n horizontal row sum values respectively corresponding to the n topics; and arranges the n topics in descending order according to the n horizontal row sum values, so as to obtain a topic list, and outputs the topic list. The trend awareness, pertinence and accuracy of corpus analysis are thereby improved.

Referring to fig. 3, an embodiment of the present invention further provides a computer device, which may be a server and whose internal structure may be as shown in the figure. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing the data used by the topic list generation method based on the part-of-speech structure. The network interface of the computer device is used for communicating with an external terminal through a network connection. When executed by the processor, the computer program implements the topic list generation method based on the part-of-speech structure.

The processor executes the topic list generation method based on the part-of-speech structure, wherein the steps included in the method correspond one-to-one to the steps of the topic list generation method based on the part-of-speech structure in the foregoing embodiment, and are not described herein again.

It will be understood by those skilled in the art that the structures shown in the drawings are only block diagrams of some of the structures associated with the embodiments of the present application and do not constitute a limitation on the computer apparatus to which the embodiments of the present application may be applied.

The computer device acquires a corpus to be analyzed, wherein the corpus to be analyzed comprises four parts, namely an abstract, a foreword, a body text and a conclusion; performs word segmentation on the corpus to be analyzed to obtain a word sequence; inputs the word sequence into a preset part-of-speech structure analysis model, so as to obtain a part-of-speech sequence output by the part-of-speech structure analysis model, wherein the words in the part-of-speech sequence are labeled with their corresponding part-of-speech components; extracts at least one specified continuous part-of-speech structure from the part-of-speech sequence, and acquires a plurality of specified phrases corresponding to the specified continuous part-of-speech structure in the word sequence; inputs the plurality of specified phrases into a preset probabilistic topic model, so as to obtain a plurality of topics output by the probabilistic topic model; generates frequency matrixes of the phrases corresponding to the plurality of topics appearing in each of the four parts; calls a preset parameter matrix; calculates a sorting matrix Y; sums the elements in each horizontal row of the sorting matrix Y, so as to obtain n horizontal row sum values respectively corresponding to the n topics; and arranges the n topics in descending order according to the n horizontal row sum values, so as to obtain a topic list, and outputs the topic list. The trend awareness, pertinence and accuracy of corpus analysis are thereby improved.

An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored thereon, and when the computer program is executed by a processor, the method for generating a topic list based on a part-of-speech structure is implemented, where steps included in the method are respectively in one-to-one correspondence with steps of executing the method for generating a topic list based on a part-of-speech structure in the foregoing embodiment, and are not described herein again.

Through the computer-readable storage medium of the present application, a corpus to be analyzed is acquired, wherein the corpus to be analyzed comprises four parts, namely an abstract, a foreword, a body text and a conclusion; word segmentation is performed on the corpus to be analyzed to obtain a word sequence; the word sequence is input into a preset part-of-speech structure analysis model, so as to obtain a part-of-speech sequence output by the part-of-speech structure analysis model, wherein the words in the part-of-speech sequence are labeled with their corresponding part-of-speech components; at least one specified continuous part-of-speech structure is extracted from the part-of-speech sequence, and a plurality of specified phrases corresponding to the specified continuous part-of-speech structure in the word sequence are acquired; the plurality of specified phrases are input into a preset probabilistic topic model, so as to obtain a plurality of topics output by the probabilistic topic model; frequency matrixes of the phrases corresponding to the plurality of topics appearing in each of the four parts are generated; a preset parameter matrix is called; a sorting matrix Y is calculated; the elements in each horizontal row of the sorting matrix Y are summed, so as to obtain n horizontal row sum values respectively corresponding to the n topics; and the n topics are arranged in descending order according to the n horizontal row sum values, so as to obtain a topic list, and the topic list is output. The trend awareness, pertinence and accuracy of corpus analysis are thereby improved.

Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.

The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, each data block containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing the relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.
