Topic analysis method, device and storage medium

Document No.: 1414311 Publication date: 2020-03-10

Reading note: This technology, "Topic analysis method, device and storage medium" (一种话题分析方法、装置和存储介质), was designed and created by Geng Xueqin (耿雪芹), Wang Xiaobin (王晓斌), Jiao Mengshu (焦梦姝) and Huang Sanwei (黄三伟) on 2020-01-20. Its main content is as follows. The invention discloses a topic analysis method comprising: acquiring text corpora to be processed, and acquiring the word segmentation result and corresponding parts of speech for each text corpus to be processed; obtaining the filtered text corpora; analyzing the word segmentation result and corresponding parts of speech of each filtered text corpus through dependency parsing to obtain the grammatical components of the segmented words and the dependency relationships among them, and to obtain the dependency pairs corresponding to each text corpus; obtaining the topic corresponding to each text corpus according to combined sentence pattern structures and the dependency pairs; and acquiring similar topics and ranking them according to the number of similar topics. The invention also discloses a topic analysis device and a storage medium. By applying syntactic analysis on top of word segmentation, the method analyzes the grammatical structure of each text sentence and the dependency relationships among the segmented words, and then extracts fluent, accurate topics according to several preset combined sentence patterns common in Chinese, so that topics can be analyzed from massive amounts of text.

1. A topic analysis method, characterized in that the method comprises:

acquiring text corpora to be processed, and acquiring the word segmentation result and the corresponding parts of speech for each text corpus to be processed;

filtering the text corpus to be processed according to the word segmentation result, and acquiring the filtered text corpus;

analyzing the word segmentation result and the corresponding parts of speech of each filtered text corpus through dependency parsing, to obtain the grammatical components of the segmented words and the dependency relationships among the segmented words, and to obtain the dependency pairs corresponding to each text corpus;

obtaining topics corresponding to each text corpus according to the combined sentence pattern structure and the dependency pairs;

acquiring similar topics, and ranking them according to the number of similar topics;

wherein acquiring similar topics and ranking them according to the number of similar topics comprises:

for each topic, calculating similarity values between that topic and the other acquired topics;

merging similar topics according to the similarity values;

merging topics according to their document id distributions, wherein two topics are merged into one topic if the number of ids shared by the document id lists of the two topics exceeds a preset number;

and ranking the merged topics, selecting a target number of topics according to frequency, and outputting the selected topics.
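The merging and ranking steps at the end of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not name a similarity measure, so a character-set Jaccard score stands in for it; the names `char_jaccard`, `merge_and_rank`, `sim_threshold`, `min_shared_ids` and `top_n` are hypothetical; and "frequency" is taken to mean the number of distinct documents covered by a merged topic.

```python
from itertools import combinations


def char_jaccard(a: str, b: str) -> float:
    """Assumed similarity measure: Jaccard overlap of the two character sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def merge_and_rank(topics, sim_threshold=0.6, min_shared_ids=2, top_n=10):
    """topics maps topic text -> set of document ids containing that topic.

    Two topics are merged when they are similar enough, or when the number
    of document ids they share exceeds min_shared_ids. Returns a list of
    (representative topic, document count) sorted by descending count.
    """
    # Union-find, so that merging is transitive (a design choice of this sketch).
    parent = {t: t for t in topics}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(topics, 2):
        similar = char_jaccard(a, b) >= sim_threshold
        shared = len(topics[a] & topics[b]) > min_shared_ids
        if similar or shared:
            parent[find(a)] = find(b)

    groups = {}
    for t in topics:
        groups.setdefault(find(t), []).append(t)

    ranked = []
    for members in groups.values():
        ids = set().union(*(topics[m] for m in members))
        # Representative: the member seen in the most documents.
        label = max(members, key=lambda m: (len(topics[m]), m))
        ranked.append((label, len(ids)))

    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Using union-find makes the merge transitive: if topic A merges with B and B with C, all three end up in one group even when A and C are not directly similar. Whether the patent intends transitive merging is not stated.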

2. The topic analysis method of claim 1, wherein the step of obtaining the corpus of texts to be processed and obtaining the segmentation result corresponding to each corpus of texts to be processed comprises:

performing sentence division processing on the text corpus according to punctuation marks;

and performing word segmentation processing on each text corpus to be processed to obtain word segmentation results.

3. The topic analysis method according to claim 2, wherein the step of filtering the text corpus to be processed according to the word segmentation result and obtaining the filtered text corpus comprises:

calculating the document frequency of each segmented word in the word segmentation result, and arranging the segmented words in descending order of document frequency;

taking the top-ranked segmented words as topic keywords;

acquiring the text corpus to be filtered out, wherein the text corpus to be filtered out is a text corpus that does not contain any of the topic keywords;

and removing the text corpus to be filtered from the text corpus to be processed to obtain the filtered text corpus.
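The filtering in claim 3 can be illustrated with a short sketch. It assumes each text has already been segmented into a list of words; the function name `filter_by_topic_keywords` and the cutoff parameter `top_k` are hypothetical, since the claim does not say how many top-ranked words are kept as keywords.

```python
from collections import Counter


def filter_by_topic_keywords(segmented, top_k=2):
    """segmented is a list of token lists, one list per text.

    Returns (keywords, kept): the top_k words by document frequency, and
    the texts that contain at least one of those keywords.
    """
    df = Counter()
    for tokens in segmented:
        # Document frequency counts a word at most once per text.
        df.update(set(tokens))

    # Descending document frequency; ties broken alphabetically for determinism.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    keywords = {word for word, _ in ranked[:top_k]}

    kept = [tokens for tokens in segmented if keywords & set(tokens)]
    return keywords, kept
```

For example, with the texts `[["phone", "launch"], ["phone", "price"], ["weather", "rain"], ["launch", "event"]]` and `top_k=2`, the keywords are `phone` and `launch`, and the weather text is dropped.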

4. The topic analysis method according to claim 2 or 3, wherein the step of performing sentence division processing on the text corpus according to punctuation marks comprises:

randomly assigning numbers to all documents in the text corpus, deleting preset punctuation marks from all documents in the text corpus to obtain target text sentences, and marking each text sentence with its document number;

adopting punctuation marks to segment the target text sentences, counting the frequency of the segmented sentences, and marking the document number where the segmented sentences are located as the document id of the sentences, wherein the punctuation marks at least comprise: comma, semicolon, period, question mark, exclamation mark, ellipsis;

and taking the segmented text corpus, marked with the frequency and the document id, as the text corpus to be processed.
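A possible reading of the sentence division in claim 4 is sketched below. The claim does not list which "preset" punctuation marks are deleted, so this sketch assumes quotation marks and brackets; the names `number_documents`, `split_corpus`, `PRESET_REMOVE` and `SPLIT_PATTERN` are hypothetical.

```python
import random
import re
from collections import defaultdict

# Assumption: the "preset punctuation" to delete is quotation marks and brackets.
PRESET_REMOVE = re.compile(r'[“”‘’《》()（）\[\]【】"]')
# Sentence delimiters named in the claim: comma, semicolon, period,
# question mark, exclamation mark, ellipsis (ASCII and full-width forms).
SPLIT_PATTERN = re.compile(r"[,;.?!，；。？！…]+")


def number_documents(texts, seed=None):
    """Randomly assign the numbers 0..n-1 to the documents."""
    ids = list(range(len(texts)))
    random.Random(seed).shuffle(ids)
    return dict(zip(ids, texts))


def split_corpus(documents):
    """documents maps document number -> raw text.

    Returns a dict: sentence -> {"freq": count, "doc_ids": set of numbers}.
    """
    stats = defaultdict(lambda: {"freq": 0, "doc_ids": set()})
    for doc_id, text in documents.items():
        cleaned = PRESET_REMOVE.sub("", text)
        for sentence in SPLIT_PATTERN.split(cleaned):
            sentence = sentence.strip()
            if sentence:  # skip empty pieces left by trailing punctuation
                stats[sentence]["freq"] += 1
                stats[sentence]["doc_ids"].add(doc_id)
    return dict(stats)
```

The resulting mapping carries both pieces of bookkeeping the claim asks for: how often each segmented sentence occurs and which documents it came from.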

5. The topic analysis method according to claim 2 or 3, wherein the step of performing the segmentation processing on each text corpus to be processed to obtain a segmentation result comprises:

performing word segmentation processing on each text corpus to be processed;

removing stop words, special symbols, letters and emoticons in the word segmentation processing result;

and obtaining word segmentation results.
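The cleaning step of claim 5 might look like the sketch below. It assumes the text has already been segmented by some Chinese word segmentation tool (the claim does not name one) and only shows the removal of stop words, special symbols, letters and emoticons. The stop-word list is a tiny illustrative sample, and keeping only tokens made of CJK ideographs or digits is an assumed rule, not one stated in the source.

```python
import re

# Illustrative sample only; a real stop-word list would be much longer.
STOP_WORDS = {"的", "了", "是", "在", "和"}

# Assumed rule: keep tokens made up entirely of CJK ideographs or digits.
# This drops Latin letters, symbols and emoji in a single check.
KEEP_PATTERN = re.compile(r"[\u4e00-\u9fff0-9]+")


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Remove stop words and tokens containing letters, symbols or emoticons."""
    return [
        tok for tok in tokens
        if tok not in stop_words and KEEP_PATTERN.fullmatch(tok)
    ]
```

One consequence of this rule is that mixed tokens such as product names containing Latin letters are discarded entirely, which matches the claim's wording but may be stricter than some applications want.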

6. The topic analysis method according to claim 1, wherein the step of analyzing the word segmentation result and the corresponding parts of speech of each filtered text corpus through dependency parsing, to obtain the grammatical components of the segmented words and the dependency relationships among the segmented words and to obtain the dependency pairs corresponding to each text corpus, comprises:

analyzing the word segmentation result and the corresponding parts of speech of each filtered text corpus through dependency parsing to obtain the grammatical components and the dependency relationships among the segmented words, wherein the grammatical components comprise a subject, a predicate, an object, an attributive, an adverbial and a complement, and the dependency relationship is a phrase relationship;

determining the segmented words that form a dependency relationship as a dependency pair;

wherein the dependency parsing uses a graph-based analysis method, a transition-based analysis method, or a deep-learning-based analysis method.
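To make the notion of a dependency pair in claim 6 concrete, the sketch below starts from an already parsed sentence and lists each head and dependent together with their relation. The parser itself is out of scope here. The `Token` layout (1-based index, head index with 0 for the root) and the relation labels `SBV`, `VOB`, `ATT` and `HED` follow a common convention for Chinese dependency treebanks and are assumptions; the source does not specify a label set.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    idx: int    # 1-based position in the sentence
    word: str   # segmented word
    pos: str    # part-of-speech tag
    head: int   # index of the head word, 0 for the root
    rel: str    # dependency relation to the head


def dependency_pairs(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (head word, dependent word, relation) for every non-root token."""
    by_idx = {t.idx: t for t in tokens}
    pairs = []
    for t in tokens:
        if t.head == 0:
            continue  # the root (core word) has no head to pair with
        pairs.append((by_idx[t.head].word, t.word, t.rel))
    return pairs


# Hand-written parse of "公司 发布 新 手机" (the company releases a new phone).
example = [
    Token(1, "公司", "n", 2, "SBV"),  # subject of 发布
    Token(2, "发布", "v", 0, "HED"),  # core word
    Token(3, "新", "a", 4, "ATT"),    # attributive of 手机
    Token(4, "手机", "n", 2, "VOB"),  # object of 发布
]
```

Calling `dependency_pairs(example)` yields three pairs: 发布/公司 with `SBV`, 手机/新 with `ATT`, and 发布/手机 with `VOB`.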

7. The topic analysis method of claim 6, wherein the step of obtaining the topic corresponding to each text corpus according to the combined sentence structure and the dependency pair comprises:

obtaining the core word of each text sentence according to the dependency parsing;

determining the segmented words whose dependency relationship with the core word is a subject-predicate relationship or a verb-object relationship;

combining the determined segmented words in a preset order to obtain a topic stem, wherein the preset order is: a subject-predicate-object order combined with other relationships, wherein the other relationships are: the combination of the subject-predicate relation word, the core word and the verb-object relation word;

and filling the determined topic stem to obtain the topic.

8. The topic analysis method of claim 7, wherein the step of populating the determined topic stems to obtain topics comprises:

filling by finding words whose dependency relationship with the subject, the predicate or the object is an attributive relationship or an adverbial relationship;

and keeping, as topics, the filling results whose length falls within a preset length interval.
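Claims 7 and 8 together describe building a topic stem around the core word and then filling it with modifiers. The sketch below is one hedged interpretation: it takes a pre-parsed sentence, assumes treebank-style labels (`HED` for the core word, `SBV` for subject-predicate, `VOB` for verb-object, `ATT` and `ADV` for attributive and adverbial modifiers), places each modifier directly before the word it modifies, and measures length in characters. The names `extract_topic`, `min_len` and `max_len` are hypothetical, as are the default bounds.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    idx: int    # 1-based position in the sentence
    word: str   # segmented word
    pos: str    # part-of-speech tag
    head: int   # index of the head word, 0 for the root
    rel: str    # dependency relation to the head


FILL_RELS = ("ATT", "ADV")  # assumed labels for attributive / adverbial


def with_modifiers(head: Token, tokens: List[Token]) -> List[str]:
    """Return the head word preceded by its attributive/adverbial modifiers."""
    mods = [t for t in tokens if t.head == head.idx and t.rel in FILL_RELS]
    return [t.word for t in sorted(mods, key=lambda t: t.idx)] + [head.word]


def extract_topic(tokens: List[Token], min_len: int = 4,
                  max_len: int = 20) -> Optional[str]:
    """Build a subject-predicate-object stem, fill it, and apply a length filter."""
    core = next((t for t in tokens if t.rel == "HED"), None)
    if core is None:
        return None
    subjects = [t for t in tokens if t.head == core.idx and t.rel == "SBV"]
    objects = [t for t in tokens if t.head == core.idx and t.rel == "VOB"]

    words: List[str] = []
    for part in subjects + [core] + objects:  # subject, predicate, object order
        words.extend(with_modifiers(part, tokens))

    topic = "".join(words)
    return topic if min_len <= len(topic) <= max_len else None


# Hand-written parse of "公司 昨天 发布 新 手机".
example = [
    Token(1, "公司", "n", 3, "SBV"),
    Token(2, "昨天", "nt", 3, "ADV"),
    Token(3, "发布", "v", 0, "HED"),
    Token(4, "新", "a", 5, "ATT"),
    Token(5, "手机", "n", 3, "VOB"),
]
```

For the example parse, the stem is 公司 + 发布 + 手机, and filling adds 昨天 before 发布 and 新 before 手机, giving 公司昨天发布新手机. With a tighter `max_len` the same sentence is rejected by the length filter.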

9. A topic analysis apparatus, characterized in that the apparatus comprises a processor, and a memory connected with the processor through a communication bus, wherein:

the memory is used for storing a topic analysis program;

the processor is used for executing the topic analysis program to implement the topic analysis steps of any one of claims 1 to 8.

10. A storage medium, in particular a computer readable storage medium, storing one or more programs, which are executable by one or more processors to cause the one or more processors to perform the topic analysis steps of any one of claims 1 to 8.

Technical Field

The present invention relates to the field of topic analysis and processing, and in particular, to a topic analysis method, device and storage medium.

Background

With the rapid development of information technology, the internet has become a main channel for people to acquire and publish information. Because network information is vast in volume, wide in source and fast in spread, it is increasingly difficult for ordinary netizens to find the information they want quickly and accurately. Therefore, how to quickly, accurately and comprehensively analyze and extract the hot topics that netizens care about from massive network information has become a very active research direction.

At present, network topics are still expressed mainly as text, and the techniques for discovering topics in text remain limited to the lexical level: topic-related information is found by identifying keywords, hot words, co-occurring words, sensitive words, sentiment words, entity words and the like. In addition, most existing topic analysis algorithms are based on clustering, which gathers texts on the same topic into one class. However, analysis at the word level alone often yields only local information and cannot capture complete semantic information. Moreover, an article sometimes has not just one topic but also related sub-topics; that is, topics and articles are not in a one-to-one relationship. Because a clustering algorithm assumes that each text has only one topic, it cannot fully summarize the core content of the whole text.

Disclosure of Invention

In view of the above, the present invention aims to provide a topic analysis method, device and storage medium that apply syntactic analysis on top of word segmentation to analyze the grammatical structure of a text sentence and the dependency relationships among the segmented words, and then extract fluent, accurate topics according to several preset combined sentence patterns common in Chinese, so that topics can be analyzed from massive amounts of text.

To achieve the above purpose, the technical solution of the invention is implemented as follows: the invention provides a topic analysis method, which comprises the following steps: