Article subject extraction method and device based on artificial intelligence and computer-readable storage medium

Document No.: 1544888; Publication date: 2020-01-17

Reading note: this technology, "Article subject extraction method and device based on artificial intelligence and computer-readable storage medium", was designed by Chen Yifeng, Zhou Junhong, and Wang Wei on 2019-09-02. Abstract: The invention relates to artificial intelligence technology and discloses an article theme extraction method based on artificial intelligence, which comprises: receiving a text data set; performing word segmentation and merging operations on the text data set to obtain a word text set; converting the word text set into a word matrix set through an encoding operation; inputting the word matrix set into a word vector conversion model for training to obtain a word vector set; performing a dimension reduction operation on the word vector set and inputting the result into a convolutional neural network model for training; and converting text data input by a user into word vectors and inputting them into the trained convolutional neural network model to obtain and output the article theme. The invention also provides an article theme extraction device based on artificial intelligence and a computer-readable storage medium. The invention can extract article themes accurately and efficiently.

1. An article theme extraction method based on artificial intelligence is characterized by comprising the following steps:

receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;

the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;

inputting the word vector set into a convolutional neural network model after the dimensionality reduction operation to obtain a training value, judging the size of the training value and a preset threshold value, continuing training the convolutional neural network model if the training value is larger than the preset threshold value, and finishing the training of the convolutional neural network model if the training value is smaller than the preset threshold value;

and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.

2. The artificial intelligence based article subject matter extraction method of claim 1, wherein the merging operation comprises:

traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;

presetting words with the occurrence frequency more than or equal to two times in the plurality of paragraphs as a hypothesis subject, and constructing a conditional probability model of each sentence in the plurality of paragraphs and the hypothesis subject;

and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.

3. The method for extracting an article theme based on artificial intelligence as claimed in claim 2, wherein the conditional probability model is:

Figure FDA0002188686590000011

wherein y_1, …, y_N are the hypothesis subjects and y_i is the i-th of them, N is the number of hypothesis subjects, D is the paragraph and j is the number of the paragraph, s is a sentence in the paragraph, P(y_i | s) is the probability that the hypothesis subject y_i is the subject of sentence s, and s(i, y_i) denotes that the hypothesis subject of sentence i is y_i.

4. An artificial intelligence based article theme extraction method as claimed in any one of claims 1 to 3, wherein the encoding operation comprises:

numbering each word in the word text set by a number to obtain the maximum number;

creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;

and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.

5. The artificial intelligence based article subject matter extraction method of claim 4, wherein the dimension reduction operation comprises:

calculating the covariance of each word vector in the word vector set;

and removing the word vectors with the absolute values larger than the preset covariance threshold value in the covariance to obtain a word vector set after dimension reduction.

6. An artificial intelligence based article subject matter extraction apparatus, the apparatus comprising a memory and a processor, the memory having stored thereon an artificial intelligence based article subject matter extraction program executable on the processor, the artificial intelligence based article subject matter extraction program when executed by the processor implementing the steps of:

receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;

the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;

inputting the word vector set into a convolutional neural network model after the dimensionality reduction operation to obtain a training value, judging the size of the training value and a preset threshold value, continuing training the convolutional neural network model if the training value is larger than the preset threshold value, and finishing the training of the convolutional neural network model if the training value is smaller than the preset threshold value;

and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.

7. The artificial intelligence based article subject matter extraction apparatus of claim 6, wherein the merging operation comprises:

traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;

presetting words with the occurrence frequency more than or equal to two times in the plurality of paragraphs as a hypothesis subject, and constructing a conditional probability model of each sentence in the plurality of paragraphs and the hypothesis subject;

and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.

8. The artificial intelligence based article theme extraction device of claim 7, wherein the conditional probability model is:

Figure FDA0002188686590000031

wherein y_1, …, y_N are the hypothesis subjects and y_i is the i-th of them, N is the number of hypothesis subjects, D is the paragraph and j is the number of the paragraph, s is a sentence in the paragraph, P(y_i | s) is the probability that the hypothesis subject y_i is the subject of sentence s, and s(i, y_i) denotes that the hypothesis subject of sentence i is y_i.

9. An artificial intelligence based article theme extraction apparatus as claimed in any one of claims 6 to 8, wherein the encoding operation comprises:

numbering each word in the word text set by a number to obtain the maximum number;

creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;

and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.

10. A computer-readable storage medium having an artificial intelligence based article subject matter extraction program stored thereon, the artificial intelligence based article subject matter extraction program being executable by one or more processors to implement the steps of the artificial intelligence based article subject matter extraction method according to any one of claims 1 to 5.

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a device for extracting article themes based on artificial intelligence and a computer-readable storage medium.

Background

At present, the themes of most articles are extracted by industry professionals. For example, enterprise development reports are read and studied manually and then summarized so that senior leaders can make decisions, and academic reports are summarized and simplified by the relevant personnel for others to study. This approach is time-consuming and labor-intensive. Article themes can also be extracted with the traditional naive Bayes algorithm, but that algorithm consumes substantial computing resources and produces themes with a high error rate, so it cannot meet practical requirements.

Disclosure of Invention

The invention provides an article subject extraction method and device based on artificial intelligence and a computer readable storage medium, and mainly aims to perform intelligent subject extraction according to articles input by a user.

In order to achieve the above object, the invention provides an article theme extraction method based on artificial intelligence, which comprises the following steps:

receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;

the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;

inputting the word vector set into a convolutional neural network model after the dimensionality reduction operation to obtain a training value, judging the size of the training value and a preset threshold value, continuing training the convolutional neural network model if the training value is larger than the preset threshold value, and finishing the training of the convolutional neural network model if the training value is smaller than the preset threshold value;

and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.

Optionally, the merging operation includes:

traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;

presetting words with the occurrence frequency more than or equal to two times in the plurality of paragraphs as a hypothesis subject, and constructing a conditional probability model of each sentence in the plurality of paragraphs and the hypothesis subject;

and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.

Optionally, the conditional probability model is:

Figure BDA0002188686600000021

wherein y_1, …, y_N are the hypothesis subjects and y_i is the i-th of them, N is the number of hypothesis subjects, D is the paragraph and j is the number of the paragraph, s is a sentence in the paragraph, P(y_i | s) is the probability that the hypothesis subject y_i is the subject of sentence s, and s(i, y_i) denotes that the hypothesis subject of sentence i is y_i.

Optionally, the encoding operation comprises:

numbering each word in the word text set by a number to obtain the maximum number;

creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;

and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.

Optionally, the dimension reduction operation comprises:

calculating the covariance of each word vector in the word vector set;

and removing the word vectors with the absolute values larger than the preset covariance threshold value in the covariance to obtain a word vector set after dimension reduction.

In order to achieve the above object, the present invention further provides an artificial intelligence-based article theme extraction device, including a memory and a processor, wherein the memory stores an artificial intelligence-based article theme extraction program executable on the processor, and the artificial intelligence-based article theme extraction program implements the following steps when executed by the processor:

receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;

the word text set is converted into a word matrix set after being subjected to coding operation, and the word matrix set is input into a word vector conversion model to be trained to obtain a word vector set;

inputting the word vector set into a convolutional neural network model after the dimensionality reduction operation to obtain a training value, judging the size of the training value and a preset threshold value, continuing training the convolutional neural network model if the training value is larger than the preset threshold value, and finishing the training of the convolutional neural network model if the training value is smaller than the preset threshold value;

and receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.

Optionally, the merging operation includes:

traversing each text data in the text data set, and dividing the text data according to paragraphs to obtain a plurality of paragraphs;

presetting words with the occurrence frequency more than or equal to two times in the plurality of paragraphs as a hypothesis subject, and constructing a conditional probability model of each sentence in the plurality of paragraphs and the hypothesis subject;

and constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.

Optionally, the conditional probability model is:

Figure BDA0002188686600000031

wherein y_1, …, y_N are the hypothesis subjects and y_i is the i-th of them, N is the number of hypothesis subjects, D is the paragraph and j is the number of the paragraph, s is a sentence in the paragraph, P(y_i | s) is the probability that the hypothesis subject y_i is the subject of sentence s, and s(i, y_i) denotes that the hypothesis subject of sentence i is y_i.

Optionally, the encoding operation comprises:

numbering each word in the word text set by a number to obtain the maximum number;

creating a coding matrix with the same dimension as the maximum number, sequentially traversing sentences in the word text set, and mapping the sentences to the coding matrix;

and processing the coding matrix according to the number of each word in the word text set to obtain a word matrix set.

In addition, to achieve the above object, the present invention also provides a computer readable storage medium having an artificial intelligence based article subject matter extraction program stored thereon, the artificial intelligence based article subject matter extraction program being executable by one or more processors to implement the steps of the artificial intelligence based article subject matter extraction method as described above.

The invention first performs word segmentation and merging operations on the text data set to obtain a word text set, which prevents erroneous words from influencing the theme of the whole article. It then performs an encoding operation and word vector conversion on the word text set to obtain a word vector set; the encoding operation and word vector conversion reduce the word dimensionality while amplifying the characteristic attributes. Therefore, the article theme extraction method and device based on artificial intelligence and the computer-readable storage medium can output accurate article themes.

Drawings

Fig. 1 is a schematic flowchart of an article subject extraction method based on artificial intelligence according to an embodiment of the present invention;

fig. 2 is a schematic internal structural diagram of an article theme extraction device based on artificial intelligence according to an embodiment of the present invention;

fig. 3 is a block diagram illustrating an article subject extraction program based on artificial intelligence in an article subject extraction device based on artificial intelligence according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides an article subject extraction method based on artificial intelligence. Referring to fig. 1, a flow chart of an article subject extraction method based on artificial intelligence according to an embodiment of the present invention is shown. The method may be performed by an apparatus, which may be implemented by software and/or hardware.

In this embodiment, the article subject extraction method based on artificial intelligence includes:

S1. Receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set.

Preferably, the text data set includes multiple types of text, such as news, social, academic, government development planning, enterprise investment, and the like.

Cleaning removes stop words, Arabic letters, and other abnormal words from the text data set, because abnormal words with no practical meaning degrade the text classification result. Stop words are words that have no practical meaning and no influence on text analysis but occur frequently, such as common pronouns and prepositions. Specifically, an abnormal-word table is constructed in advance, the words in the text data set are traversed in order, and any word that matches an entry in the abnormal-word table is removed, until the traversal is complete.
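The table-driven cleaning pass described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the table contents `ABNORMAL_WORDS` and the function name `clean_words` are hypothetical, and a real table would be built in advance from the target corpus.

```python
# Hypothetical abnormal-word table: stop words and other tokens with no
# practical meaning. The entries here are invented for illustration.
ABNORMAL_WORDS = {"the", "of", "and", "a", "in", "123"}


def clean_words(words):
    """Traverse the words in order and drop any that appear in the table."""
    return [w for w in words if w.lower() not in ABNORMAL_WORDS]


cleaned = clean_words(["The", "theme", "of", "a", "report", "123"])
print(cleaned)
```

Using a set for the table keeps each membership check constant-time, so the pass stays linear in the number of words.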

Word segmentation splits the text in the text data set into individual words. It is essential because written Chinese has no explicit separator between words. Preferably, word segmentation may be performed with the jieba word segmentation library, which is available for programming languages such as Python and Java. The library is built on the part-of-speech characteristics of Chinese: it converts the number of occurrences of each word in the text data set into a frequency, searches for the maximum-probability path by dynamic programming, and finds the most probable segmentation combination based on word frequency. For example, the text data set contains the following text segment: when a person understands an exchange with a regime, they can ask themselves and the disc in reality because in their eyes, they do not really do anything with them until they have an equivalent exchange with a regime. After being processed by the jieba library, it becomes: when a person understands an exchange with a regime, they can ask themselves and the disc in reality because in their eyes, they do not really do anything with them until they have an equivalent exchange with a regime. The spaces between words represent the segmentation produced by the jieba library.
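The idea of finding a maximum-probability path by dynamic programming over word frequencies can be illustrated with a small self-contained sketch. It does not call any real segmentation library; the toy dictionary `WORD_FREQ`, its counts, and the function name `segment` are assumptions made purely for illustration.

```python
import math

# Hypothetical toy dictionary mapping words to corpus counts.
WORD_FREQ = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "生": 8,
             "的": 100, "起源": 30}
TOTAL = sum(WORD_FREQ.values())
MAX_WORD_LEN = max(len(w) for w in WORD_FREQ)


def segment(text):
    """Return the segmentation with the highest total log-probability."""
    n = len(text)
    # best[i] = (score of the best segmentation of text[i:], end of first word)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + MAX_WORD_LEN) + 1):
            word = text[i:j]
            # Unknown single characters get a count of 1 so a path always exists.
            freq = WORD_FREQ.get(word, 1 if j == i + 1 else 0)
            if freq > 0:
                candidates.append((math.log(freq / TOTAL) + best[j][0], j))
        best[i] = max(candidates)
    # Follow the stored end indices to read off the chosen words.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


print(" ".join(segment("研究生命的起源")))
```

With these counts the path 研究 / 生命 / 的 / 起源 scores higher than 研究生 / 命 / 的 / 起源, which is why frequency-based scoring resolves the ambiguity at "研究生".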

Further, since the subjects of the sentences may be the same, the merging is to merge the sentences having the same subject, so as to achieve the purpose of greatly reducing the words in the text data set. Preferably, said combining comprises: traversing each text in the text data set, dividing the text according to paragraphs to obtain a plurality of paragraphs, presetting words with the occurrence frequency more than or equal to two times in each paragraph as a hypothesis subject, constructing a conditional probability model of each sentence and the hypothesis subject in each paragraph, constructing a log-likelihood function, optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence, merging a plurality of sentences with the same subject into one sentence, and finishing the merging operation.

Specifically, the conditional probability model is:

Figure BDA0002188686600000051

wherein y_1, …, y_N are the hypothesis subjects and y_i is the i-th of them, N is the number of hypothesis subjects, D is the paragraph and j is the number of the paragraph (for example, D_1 is the first paragraph of the text), s is a sentence in the paragraph, P(y_i | s) is the probability that the hypothesis subject y_i is the subject of sentence s, and s(i, y_i) denotes that the hypothesis subject of sentence i is y_i.

Preferably, the log-likelihood function is:

Figure BDA0002188686600000061

wherein argmax selects, among all the hypothesis subjects, the hypothesis subject for which the partial derivative of the conditional probability model is largest.
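The merging flow (pick hypothesis subjects, score each sentence against them, merge sentences that share a subject) can be sketched as below. The conditional probability model and log-likelihood function of the invention are given only as figures, so this sketch substitutes a stand-in: an add-one smoothed relative frequency of the subject within the sentence. That stand-in, the function name `merge_by_subject`, and the sample paragraph are all assumptions.

```python
import math
from collections import Counter


def merge_by_subject(sentences):
    """Merge the sentences of one paragraph (lists of words) by subject."""
    counts = Counter(word for sentence in sentences for word in sentence)
    # Hypothesis subjects: words occurring at least twice in the paragraph.
    subjects = sorted(w for w, c in counts.items() if c >= 2)
    if not subjects:
        return [list(s) for s in sentences]
    result, index_of = [], {}
    for sentence in sentences:
        in_sentence = Counter(sentence)
        # Stand-in for log P(y | s): smoothed relative frequency of y in s.
        denom = len(sentence) + len(subjects)
        log_p = {y: math.log((in_sentence[y] + 1) / denom) for y in subjects}
        subject = max(subjects, key=log_p.get)
        if in_sentence[subject] == 0:
            result.append(list(sentence))  # no hypothesis subject occurs here
        elif subject in index_of:
            result[index_of[subject]].extend(sentence)  # same subject: merge
        else:
            index_of[subject] = len(result)
            result.append(list(sentence))
    return result


paragraph = [["cat", "sleeps"], ["dog", "barks"],
             ["cat", "eats"], ["birds", "sing"]]
merged = merge_by_subject(paragraph)
print(merged)
```

Sentences in which no hypothesis subject occurs are left unmerged in this sketch; the source does not say how such sentences are handled.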

S2. Converting the word text set into a word matrix set through an encoding operation, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set.

Preferably, the encoding takes the form of one-hot encoding. One-hot encoding numbers each word in the word text set to obtain a maximum number, creates an encoding matrix whose dimension equals the maximum number, traverses each sentence in the word text set in order, maps each sentence to the encoding matrix, and performs one-hot encoding according to the mapping result and the number of each word in the word text set, which completes the encoding operation and yields the word matrix set. For example, suppose the word text set is: it is true that people can ask themselves and discs true when they understand and have a body exchange. After the words are numbered, the text becomes: When in use(1) Human being(2) Understand that(3) And(4) body system(5) Of switching(6) At the time of flight(7) They are(8) Can be used for(9) Will be provided with(10) Is true(11) Oneself with(12) Out of the disc(13) That is to say that(14) Reality(15). The maximum number is therefore 15, and a 15-dimensional encoding matrix is created. Further, if the traversed sentence is: this is true, the code is [0, 0, 0, 0, 0, 0, 0, 0, 1].
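A minimal sketch of the numbering and one-hot steps follows. The function names `build_vocabulary`, `one_hot`, and `encode` and the sample sentences are hypothetical; the sketch assumes words are numbered from 1 in order of first appearance, which the source does not specify.

```python
def build_vocabulary(sentences):
    """Number each distinct word from 1 in order of first appearance."""
    numbers = {}
    for sentence in sentences:
        for word in sentence:
            numbers.setdefault(word, len(numbers) + 1)
    return numbers


def one_hot(word, numbers):
    """Return a vector whose length is the maximum number, with a single 1."""
    vector = [0] * len(numbers)
    vector[numbers[word] - 1] = 1
    return vector


def encode(sentences, numbers):
    """Map every sentence to a matrix with one one-hot row per word."""
    return [[one_hot(w, numbers) for w in s] for s in sentences]


sentences = [["people", "understand", "exchange"], ["people", "are", "real"]]
numbers = build_vocabulary(sentences)
word_matrices = encode(sentences, numbers)
print(numbers)
print(one_hot("real", numbers))
```

Here the maximum number equals the vocabulary size, so every row has one position per distinct word and exactly one entry set to 1.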

Preferably, the word vector conversion model includes assuming a weight relationship between a word matrix in the word matrix set and a word vector in the word vector set, and calculating the weight based on the weight relationship to complete a conversion process from the word matrix set to the word vector set.

Specifically, the weight relationship is:

d = {(t_1, w_1), (t_2, w_2), ..., (t_i, w_i), ..., (t_n, w_n)}

where d is the word matrix set, t_1, t_2, ..., t_n are the word matrices in the word matrix set, such as [0, 0, 0, 0, 0, 0, 0, 0, 1] above, and w_1, w_2, ..., w_n are the weights of the corresponding word matrices.

Further, the weight calculation method comprises:

Figure BDA0002188686600000062

wherein f_i is the number of occurrences of a word matrix in the word matrix set, N is the total number of texts in the text data set, N_j is the total number of words in the text data set, N_i is the number of occurrences of the word i in the text data set, and F_m is a weighting factor whose value is generally less than 1.

S3. Performing a dimension reduction operation on the word vector set, inputting the reduced word vector set into a convolutional neural network model for training to obtain a training value, and comparing the training value with a preset threshold: if the training value is greater than the preset threshold, training of the convolutional neural network model continues; if the training value is less than the preset threshold, training of the convolutional neural network model is finished.

Preferably, the dimension reduction operation includes calculating the covariance of each word vector in the word vector set, and removing the word vectors whose covariance has an absolute value greater than a preset covariance threshold, to obtain the word vector set after dimension reduction.

Further, the covariance is:

Figure BDA0002188686600000071

wherein x_i and x_j are word vectors in the word vector set, n is the number of word vectors in the word vector set, and cov(x_i, x_j) is the covariance between x_i and x_j. A covariance cov(x_i, x_j) greater than 0 indicates a positive correlation, and a covariance less than 0 indicates a negative correlation.
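One way to read the covariance-based reduction is sketched below. The covariance formula of the invention is given only as a figure, so this sketch assumes the usual sample covariance over vector components (with an n - 1 denominator) and assumes a vector is dropped when its covariance with any already-kept vector exceeds the threshold in absolute value. The names `covariance` and `reduce_vectors` and the sample data are hypothetical.

```python
def covariance(x, y):
    """Sample covariance of two equal-length vectors (n - 1 denominator)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def reduce_vectors(vectors, threshold):
    """Keep a vector only if |cov| with every kept vector is within threshold."""
    kept = []
    for v in vectors:
        if all(abs(covariance(v, k)) <= threshold for k in kept):
            kept.append(v)
    return kept


vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0]]
reduced = reduce_vectors(vectors, threshold=1.0)
print(reduced)
```

In this example the second vector is a scaled copy of the first (covariance 2.0), so it is removed as redundant, while the third is uncorrelated with the first and is kept.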

In a preferred embodiment of the present invention, the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, a fully-connected layer, and an output layer, the input layer receives the word vector set, and the convolutional layer, the pooling layer, and the fully-connected layer are trained in combination with an activation function to obtain a training value and output the training value through the output layer.

In a preferred embodiment of the present invention, the activation function may comprise a Softmax function, and the loss function is a least squares function. The Softmax function is:

Figure BDA0002188686600000072

wherein O_j is the output value of the j-th neuron of the fully-connected layer, I_j is the input value of the j-th neuron of the output layer, t is the total number of neurons in the output layer, and e is Euler's number, an infinite non-repeating decimal;
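For reference, the Softmax function in its standard form maps t input values I_1, ..., I_t to outputs O_j = e^(I_j) / (e^(I_1) + ... + e^(I_t)). The sketch below computes that standard form; the function name `softmax` and the sample inputs are illustrative, and the maximum is subtracted before exponentiating purely for numerical stability, which does not change the result.

```python
import math


def softmax(inputs):
    """Return the standard softmax of a non-empty list of numbers."""
    shift = max(inputs)  # avoids overflow in exp for large inputs
    exps = [math.exp(value - shift) for value in inputs]
    total = sum(exps)
    return [value / total for value in exps]


outputs = softmax([2.0, 1.0, 0.1])
print(outputs)
```

The outputs are positive, sum to 1, and preserve the ordering of the inputs, so they can be read as a probability distribution over the output neurons.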

the least squares function L(s) is:

wherein s is the training value, k is the number of word vectors in the word vector set after dimension reduction, y_i is an element of the word vector set, and y'_i is the predicted value of the convolutional neural network model.
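The threshold-based stopping rule can be illustrated with a deliberately small example. The formula for L(s) does not survive in this text, so the sketch assumes the common sum-of-squared-errors form over y_i and y'_i. It also replaces the convolutional neural network with a one-parameter linear model trained by gradient descent, simply to show the loop; the names `least_squares` and `train`, the learning rate, and the data are assumptions.

```python
def least_squares(targets, predictions):
    """Sum of squared differences between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions))


def train(xs, ys, threshold, learning_rate=0.01, max_steps=10_000):
    """Fit y' = w * x, stopping once the training value is below threshold."""
    w = 0.0
    training_value = least_squares(ys, [w * x for x in xs])
    for _ in range(max_steps):
        if training_value < threshold:
            break  # training value below the threshold: training is finished
        # Gradient of the sum of squared errors with respect to w.
        gradient = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys))
        w -= learning_rate * gradient
        training_value = least_squares(ys, [w * x for x in xs])
    return w, training_value


w, training_value = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], threshold=1e-6)
print(w, training_value)
```

While the training value stays above the threshold the loop keeps updating the model, which mirrors the continue-or-finish decision described for the convolutional neural network.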

S4. Receiving text data input by a user, converting the text data input by the user into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.

For example, if an article input by a user describes the ancient literary inquisition, the trained convolutional neural network model outputs the theme of the article as follows: the article describes the ancient literary inquisition and exposes the feudal system's cruel persecution of literati, expressing the author's deep sympathy for intellectuals and strong anger at these crimes.

The invention also provides an article theme extraction device based on artificial intelligence. Fig. 2 is a schematic diagram illustrating an internal structure of an article theme extraction device based on artificial intelligence according to an embodiment of the present invention.

In the present embodiment, the article theme extraction device 1 based on artificial intelligence may be a PC (personal computer), a terminal device such as a smart phone, a tablet computer, or a mobile computer, or may be a server. The article subject extraction device 1 based on artificial intelligence at least comprises a memory 11, a processor 12, a communication bus 13 and a network interface 14.

The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may be an internal storage unit of the artificial intelligence based article subject matter extracting apparatus 1 in some embodiments, for example, a hard disk of the artificial intelligence based article subject matter extracting apparatus 1. The memory 11 may also be an external storage device of the article theme extracting apparatus 1 based on artificial intelligence in other embodiments, such as a plug-in hard disk provided on the article theme extracting apparatus 1 based on artificial intelligence, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 11 may also include both an internal storage unit and an external storage device of the artificial intelligence based article subject matter extracting apparatus 1. The memory 11 can be used not only to store application software installed in the artificial intelligence-based article theme extraction device 1 and various types of data, such as the code of the artificial intelligence-based article theme extraction program 01, but also to temporarily store data that has been output or is to be output.

The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip in some embodiments, and is used for executing program codes stored in the memory 11 or Processing data, such as executing the artificial intelligence-based article theme extraction program 01.

The communication bus 13 is used to realize connection communication between these components.

The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), typically used to establish a communication link between the apparatus 1 and other electronic devices.

Optionally, the apparatus 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. Among them, the display may also be appropriately referred to as a display screen or a display unit for displaying information processed in the artificial intelligence based article theme extraction apparatus 1 and for displaying a visualized user interface.

While fig. 2 shows only the artificial intelligence based article subject matter extraction apparatus 1 having the components 11-14 and the artificial intelligence based article subject matter extraction program 01, those skilled in the art will appreciate that the configuration shown in fig. 2 does not constitute a limitation of the artificial intelligence based article subject matter extraction apparatus 1, which may include fewer or more components than shown, combine certain components, or use a different arrangement of components.

In the embodiment of the apparatus 1 shown in fig. 2, the memory 11 stores an artificial intelligence-based article theme extraction program 01; the processor 12 implements the following steps when executing the artificial intelligence based article theme extraction program 01 stored in the memory 11:

Step one: receive a text data set, and perform operations including word segmentation and merging on the text data set to obtain a word text set.

Preferably, the text data set includes multiple types of text, such as news, social, academic, government development planning, enterprise investment, and the like.

The cleaning removes stop words, Arabic letters and other abnormal words from the text data set, because abnormal words that carry no real meaning degrade the text classification result. Stop words are words that have no practical meaning and no influence on text analysis but occur very frequently, such as common pronouns and prepositions. Specifically, an abnormal-word table is constructed in advance, the words in the text data set are traversed in order, and any word that matches an entry in the abnormal-word table is removed, until the traversal is complete.
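The table-driven cleaning described above can be sketched in a few lines of Python. This is only an illustration: the function name `clean_words`, the sample table `ABNORMAL_WORDS`, and the assumption that the text is already available as a list of tokens are all hypothetical and not taken from the specification.

```python
def clean_words(words, abnormal_words):
    """Drop every word that appears in the pre-built abnormal-word table."""
    table = set(abnormal_words)  # set membership keeps the traversal linear
    return [w for w in words if w not in table]


# Hypothetical abnormal-word table: a few stop words plus a stray numeral.
ABNORMAL_WORDS = ["the", "of", "a", "123"]

tokens = ["the", "theme", "of", "a", "news", "article", "123"]
cleaned = clean_words(tokens, ABNORMAL_WORDS)
print(cleaned)  # ['theme', 'news', 'article']
```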

Word segmentation splits the text in the text data set into individual words. It is essential because written Chinese has no explicit separator between words. Preferably, the word segmentation of the present invention may use the jieba word segmentation library, which is available for programming languages such as Python and Java. The library is developed around the part-of-speech characteristics of Chinese: it converts the number of occurrences of each word in the text data set into a frequency, searches for the maximum-probability path by dynamic programming, and finds the best segmentation combination based on word frequency. For example, the text data set contains the following text segment: "When a person understands an exchange with a regime, they can ask themselves and the disc in reality because in their eyes, they do not really do anything with them until they have an equivalent exchange with a regime." After processing by the jieba library, the output is the same text with a blank inserted between adjacent words, where the blanks mark the segmentation result.
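The frequency-based, maximum-probability-path idea described above can be illustrated with a small self-contained sketch. It is not the segmentation library's own code: the function `segment`, the toy frequency dictionary, and the rule that an unknown single character gets a count of 1 are assumptions made for this example.

```python
import math


def segment(text, word_freq):
    """Return the segmentation of text with the highest total log-probability."""
    total = sum(word_freq.values())
    max_len = max(len(w) for w in word_freq)
    n = len(text)
    # best[i] = (score of the best segmentation of text[i:], end of its first word)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = text[i:j]
            # Unknown single characters get a count of 1 so a path always exists.
            freq = word_freq.get(word, 1 if j == i + 1 else 0)
            if freq:
                candidates.append((math.log(freq / total) + best[j][0], j))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:  # follow the stored split points from left to right
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


# Hypothetical word counts standing in for a real frequency dictionary.
FREQ = {"研究": 5, "研究生": 2, "生命": 4, "命": 1, "起源": 3}
words = segment("研究生命起源", FREQ)
print(" ".join(words))  # 研究 生命 起源
```

The dynamic program works from the end of the string backwards, so each position is scored once and the ambiguous prefix "研究生" loses to "研究" followed by "生命" because the latter path has the higher combined frequency.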

Further, since several sentences may share the same subject, the merging operation merges sentences that have the same subject, which greatly reduces the number of words in the text data set. Preferably, the merging comprises: traversing each text in the text data set and dividing it into paragraphs; taking the words that occur two or more times in each paragraph as hypothesis subjects; constructing a conditional probability model of each sentence and the hypothesis subjects in each paragraph; constructing a log-likelihood function; optimizing the conditional probability model based on the log-likelihood function to obtain the subject of each sentence; and merging the sentences with the same subject into one sentence to complete the merging operation.

Specifically, the conditional probability model is:

Figure BDA0002188686600000101

where $y_1, \dots, y_N$ are the hypothesis subjects and $y_i$ is the $i$-th of them, $N$ is the number of hypothesis subjects, $D_j$ is a paragraph with $j$ its number (e.g., $D_1$ is the first paragraph of the text), $s$ is a sentence within the paragraph, $P(y_i \mid s)$ is the probability that the hypothesis subject $y_i$ is the subject of sentence $s$, and $s(i, y_i)$ indicates that the hypothesis subject of sentence $i$ is $y_i$.

Preferably, the log-likelihood function is:

where argmax selects, among all hypothesis subjects, the one corresponding to the maximum partial derivative of the conditional probability model.
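The overall flow of the merging operation (pick hypothesis subjects, assign one subject per sentence, merge sentences that share it) can be sketched as follows. Note the substitution: instead of the conditional probability model and log-likelihood optimization of the specification, this sketch scores each hypothesis subject simply by its count in the paragraph. The function name `merge_by_subject`, the sample paragraph, and the choice to drop repeated words when merging are all assumptions.

```python
from collections import Counter


def merge_by_subject(paragraph):
    """paragraph: list of sentences (lists of words) -> {subject: merged words}."""
    counts = Counter(word for sentence in paragraph for word in sentence)
    # Hypothesis subjects: words that occur two or more times in the paragraph.
    hypotheses = [w for w, c in counts.items() if c >= 2]
    merged = {}
    for sentence in paragraph:
        present = [w for w in hypotheses if w in sentence]
        # Stand-in score: the paragraph count replaces the probability P(y_i | s).
        subject = max(present, key=lambda w: counts[w]) if present else None
        bucket = merged.setdefault(subject, [])
        for word in sentence:
            if word not in bucket:  # assumed: repeated words are kept once
                bucket.append(word)
    return merged


# Hypothetical, already segmented paragraph.
paragraph = [
    ["model", "reads", "text"],
    ["model", "learns", "weights"],
    ["model", "is", "trained"],
    ["text", "was", "cleaned"],
]
merged = merge_by_subject(paragraph)
print(merged)
```

Sentences without any hypothesis subject are collected under the key `None`, so no text is lost by the merge.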

Step two: encode the word text set to convert it into a word matrix set, and input the word matrix set into a word vector conversion model for training to obtain a word vector set.

Preferably, the encoding is one-hot encoding. The one-hot encoding numbers each word in the word text set to obtain a maximum number, creates an encoding matrix whose dimension equals the maximum number, and then traverses each sentence in the word text set in turn, mapping each sentence onto the encoding matrix according to the number of each word, which completes the encoding operation and yields the word matrix set. For example, if the word text set is: "it is true that people can ask themselves and discs true when they understand and have a body exchange", the words are numbered as follows: When in use (1), Human being (2), Understand that (3), And (4), body system (5), Of switching (6), At the time of flight (7), They are (8), Can be used for (9), Will be provided with (10), Is true (11), Oneself with (12), Out of the disc (13), That is to say that (14), Reality (15). The maximum number is 15, so a 15-dimensional encoding matrix is created. Further, if the traversed sentence is "this is true", its code is [0, 0, 0, 0, 0, 0, 0, 0, 1].
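A minimal sketch of the numbering and one-hot mapping is given below. It assumes that each word becomes one row of the sentence's matrix and that words are numbered from 1 in order of first appearance; the helper names `build_vocabulary` and `one_hot` and the sample sentences are hypothetical.

```python
def build_vocabulary(sentences):
    """Number each distinct word from 1 upward in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def one_hot(sentence, vocab):
    """Return one row per word, with a single 1 at the position of its number."""
    size = len(vocab)  # the maximum number equals the vocabulary size
    rows = []
    for word in sentence:
        row = [0] * size
        row[vocab[word] - 1] = 1
        rows.append(row)
    return rows


# Hypothetical word text set of two segmented sentences.
sentences = [["they", "can", "show", "the", "real", "self"], ["that", "is", "real"]]
vocab = build_vocabulary(sentences)
matrix = one_hot(sentences[1], vocab)
print(len(vocab))  # 8
print(matrix[0])   # [0, 0, 0, 0, 0, 0, 1, 0]
```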

Preferably, the word vector conversion model includes assuming a weight relationship between a word matrix in the word matrix set and a word vector in the word vector set, and calculating the weight based on the weight relationship to complete a conversion process from the word matrix set to the word vector set.

Specifically, the weight relationship is:

$$d = \{(t_1, w_1), (t_2, w_2), \dots, (t_i, w_i), \dots, (t_n, w_n)\}$$

where $d$ is the word matrix set, $t_1, t_2, \dots, t_n$ are the word matrices in the word matrix set, such as the [0, 0, 0, 0, 0, 0, 0, 0, 1] above, and $w_1, w_2, \dots, w_n$ are the weights of the corresponding word matrices.

Further, the weight calculation method comprises:

Figure BDA0002188686600000111

where $f_i$ represents the number of occurrences of a word matrix in the word matrix set, $N$ is the total number of texts in the text data set, $N_j$ represents the total number of words in the text data set, $N_i$ represents the number of occurrences of the word $i$ in the text data set, and $F_m$ is a weighting factor that is generally less than 1.

Step three: after the dimension reduction operation, input the word vector set into a convolutional neural network model for training to obtain a training value, and compare the training value with a preset threshold; if the training value is greater than the preset threshold, continue training the convolutional neural network model, and if the training value is less than the preset threshold, the training of the convolutional neural network model is complete.
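The stopping rule in this step can be sketched independently of the network itself. In the illustration below, `train_step` is a hypothetical callable that performs one round of training and returns the training value; the `max_rounds` safety limit and the choice to stop when the value equals the threshold are assumptions, since the specification only covers the greater-than and less-than cases.

```python
def train_until_below(train_step, threshold, max_rounds=1000):
    """Repeat train_step() while its training value stays above the threshold."""
    value = train_step()
    rounds = 1
    while value > threshold and rounds < max_rounds:
        value = train_step()
        rounds += 1
    return value, rounds


# Stand-in for real training: a fixed sequence of decreasing training values.
values = iter([0.8, 0.4, 0.2, 0.1, 0.05])
final_value, rounds = train_until_below(lambda: next(values), threshold=0.15)
print(final_value, rounds)  # 0.1 4
```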

Preferably, the dimension reduction operation includes calculating the covariance of the word vectors in the word vector set, and removing the word vectors whose covariance has an absolute value greater than a preset covariance threshold, to obtain the word vector set after dimension reduction.

Further, the covariance is:

Figure BDA0002188686600000112

where $x_i$ and $x_j$ are word vectors of the word vector set, $n$ is the number of word vectors in the set, and $\mathrm{cov}(x_i, x_j)$ is the covariance between $x_i$ and $x_j$. If the calculated covariance $\mathrm{cov}(x_i, x_j)$ is not 0, a value greater than 0 indicates a positive correlation and a value less than 0 indicates a negative correlation.
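One possible reading of the covariance-based reduction is sketched below. Several points are assumptions rather than statements of the specification: the covariance is the sample covariance taken over the components of two vectors with a divisor of one less than the component count, vectors are processed in order, and of two strongly covarying vectors the later one is dropped. The names `covariance` and `reduce_vectors` are hypothetical.

```python
def covariance(x, y):
    """Sample covariance of two equal-length vectors (divisor len(x) - 1, assumed)."""
    size = len(x)
    mean_x = sum(x) / size
    mean_y = sum(y) / size
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (size - 1)


def reduce_vectors(vectors, threshold):
    """Keep a vector only if |cov| with every already kept vector is within threshold."""
    kept = []
    for vector in vectors:
        if all(abs(covariance(vector, k)) <= threshold for k in kept):
            kept.append(vector)
    return kept


# Hypothetical word vectors: the second is a scaled copy of the first.
vectors = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0]]
reduced = reduce_vectors(vectors, threshold=1.0)
print(covariance(vectors[0], vectors[1]))  # 2.0
print(reduced)  # [[1.0, 2.0, 3.0], [1.0, 0.0, 1.0]]
```

The second vector is removed because its covariance with the first (2.0) exceeds the threshold, while the third vector is uncorrelated with the first and is kept.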

In a preferred embodiment of the present invention, the convolutional neural network model includes an input layer, a convolutional layer, a pooling layer, a fully-connected layer, and an output layer, the input layer receives the word vector set, and the convolutional layer, the pooling layer, and the fully-connected layer are trained in combination with an activation function to obtain a training value and output the training value through the output layer.
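To make the layer order concrete, the following sketch runs a single forward pass through a convolutional layer, a pooling layer and a fully connected layer in plain Python. It is a toy under stated assumptions, not the model of the invention: the kernels slide over consecutive word vectors, ReLU is used inside the convolution, pooling is a global maximum, the fully connected layer's raw scores are returned without the output-layer activation, and all weights are made-up numbers.

```python
def conv1d(sequence, kernel):
    """Convolutional layer: slide one kernel over consecutive word vectors."""
    width = len(kernel)
    features = []
    for start in range(len(sequence) - width + 1):
        total = 0.0
        for offset in range(width):
            vector, weights = sequence[start + offset], kernel[offset]
            total += sum(v * w for v, w in zip(vector, weights))
        features.append(max(0.0, total))  # ReLU (assumed; not stated in the text)
    return features


def forward(sequence, kernels, dense_weights, dense_biases):
    """Input layer -> convolution -> max pooling -> fully connected scores."""
    pooled = [max(conv1d(sequence, kernel)) for kernel in kernels]  # pooling layer
    return [
        sum(p * w for p, w in zip(pooled, row)) + bias
        for row, bias in zip(dense_weights, dense_biases)
    ]


# Hypothetical input: three 2-dimensional word vectors and two kernels of width 2.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
scores = forward(sequence, kernels, [[1.0, 0.0], [0.0, 0.5]], [0.0, 0.0])
print(scores)  # [2.0, 1.0]
```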

In a preferred embodiment of the present invention, the activation function may comprise a Softmax function, and the loss function is a least squares function. The Softmax function is:

Figure BDA0002188686600000121

where $O_j$ represents the output value of the $j$-th neuron of the fully connected layer, $I_j$ represents the input value of the $j$-th neuron of the output layer, $t$ represents the total number of neurons of the output layer, and $e$ is the base of the natural logarithm, an irrational number (a non-terminating, non-repeating decimal).

The least-squares loss function L(s) is:

Figure BDA0002188686600000122

where $s$ is the training value, $k$ is the number of word vectors in the word vector set after dimension reduction, $y_i$ is a word vector of the word vector set, and $y'_i$ is the corresponding predicted value of the convolutional neural network model.
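The two functions named above can be illustrated in Python. The Softmax below is the standard definition (each exponential divided by the sum of all exponentials), with the usual subtraction of the maximum input for numerical stability. The loss is written as a plain sum of squared differences, which is an assumption: the exact scaling used in the specification's formula is not reproduced here. Both function names are hypothetical.

```python
import math


def softmax(inputs):
    """Standard Softmax: exp(I_j) divided by the sum of exp(I_k) over all neurons."""
    shift = max(inputs)  # subtracting the maximum avoids overflow, result unchanged
    exps = [math.exp(value - shift) for value in inputs]
    total = sum(exps)
    return [e / total for e in exps]


def least_squares_loss(targets, predictions):
    """Sum of squared differences between targets y_i and predictions y'_i (assumed form)."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions))


probabilities = softmax([2.0, 2.0])
loss = least_squares_loss([1.0, 0.0], probabilities)
print(probabilities)  # [0.5, 0.5]
print(loss)           # 0.5
```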

Step four: receive text data input by a user, convert the text data into word vectors, input the word vectors into the trained convolutional neural network model, and obtain and output the article theme.

For example, if an article input by a user that describes the literary inquisitions of ancient times is received, the trained convolutional neural network model outputs the theme of the article as follows: by describing the ancient literary inquisitions, the article exposes the cruel persecution of literati under the feudal system, and expresses the author's deep sympathy for the intellectuals and strong indignation at that persecution.

Alternatively, in other embodiments, the article theme extraction program based on artificial intelligence can be further divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.

For example, fig. 3 is a schematic diagram of the program modules of the artificial intelligence-based article theme extraction program in an embodiment of the artificial intelligence-based article theme extraction apparatus of the present invention. In this embodiment, the program may be divided into a data receiving module 10, a word vector solving module 20, a model training module 30, and an article theme output module 40. Illustratively:

The data receiving module 10 is configured to receive a text data set and perform word segmentation and merging operations on the text data set to obtain a word text set.

The word vector solving module 20 is configured to encode the word text set to convert it into a word matrix set, and input the word matrix set into a word vector conversion model for training to obtain a word vector set.

The model training module 30 is configured to perform the dimension reduction operation on the word vector set, input the result into a convolutional neural network model for training to obtain a training value, and compare the training value with a preset threshold; if the training value is greater than the preset threshold, the convolutional neural network model continues training, and if the training value is less than the preset threshold, the training of the convolutional neural network model is complete.

The article theme output module 40 is configured to receive text data input by a user, convert the text data into word vectors, input the word vectors into the trained convolutional neural network model, and obtain and output the article theme.

The functions or operation steps implemented when the data receiving module 10, the word vector solving module 20, the model training module 30, the article theme output module 40 and other program modules are executed are substantially the same as those of the above embodiments, and are not repeated here.

Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, on which an artificial intelligence-based article subject matter extraction program is stored, where the artificial intelligence-based article subject matter extraction program is executable by one or more processors to implement the following operations:

receiving a text data set, and performing word segmentation and merging operations on the text data set to obtain a word text set;

encoding the word text set to convert it into a word matrix set, and inputting the word matrix set into a word vector conversion model for training to obtain a word vector set;

performing the dimension reduction operation on the word vector set, inputting the result into a convolutional neural network model for training to obtain a training value, and comparing the training value with a preset threshold, wherein if the training value is greater than the preset threshold the convolutional neural network model continues training, and if the training value is less than the preset threshold the training of the convolutional neural network model is complete;

receiving text data input by a user, converting the text data into word vectors, inputting the word vectors into the trained convolutional neural network model, and obtaining and outputting the article theme.

It should be noted that the above numbering of the embodiments of the present invention is merely for description and does not represent the merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. The term "comprising" is used to specify the presence of stated features, integers, steps, operations, elements, components, or groups.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
