Method for quickly extracting advertisement text topics by approximate search based on word vectors

Document No.: 1556949  Publication date: 2020-01-21

Summary: This technology, a method for quickly extracting advertisement text topics by approximate search based on word vectors (基于词向量进行近似搜索快速提取广告文本主题的方法), was designed by 李新, 李征宇, 邵品贤 and 吴小刚 on 2019-09-10. The invention discloses a method for quickly extracting advertisement text topics by approximate search based on word vectors, comprising the following steps: in the first step, using the Jieba word segmentation tool and an existing stop-word list, the words in the advertisement title that match entries in the stop-word list are removed, that is, the stop words are removed from the advertisement title; the Chinese words in the corpus are extracted to serve as a dictionary, and the advertisement text is segmented into words using the dictionary. The invention is easy to operate. It reduces the search complexity for a single query word in the GPU-DMM generative model from O(N) to O(log N), which accelerates the whole advertisement text topic extraction process and greatly increases extraction speed. The entire pipeline, offline processing and unsupervised training included, can be completed within a few hours, which meets the large data volume and near-real-time requirements of the Internet advertising industry and allows user interest tags to be updated daily or hourly.

1. A method for quickly extracting advertisement text topics by approximate search based on word vectors, characterized in that the method comprises the following steps: in the first step, using a word segmentation tool and an existing stop-word list, the words in the advertisement title that match entries in the stop-word list are removed, that is, the stop words are removed from the advertisement title; the Chinese words in a corpus are extracted to serve as a dictionary, and the advertisement text is segmented into words using the dictionary;

in the second step, a word vector index is built from the word vectors in the corpus using a random projection algorithm;

in the third step, after the index has been built, the word segmentation results of the advertisement text are read and the word vector of each segmented word is looked up; the nearest-neighbor word vectors of each query word are quickly retrieved from the index by an approximate nearest neighbor (ANN) search algorithm, the similarity between two word vectors is computed as their cosine similarity, and more than fifty similar words are obtained as the basic data for the advertisement text topic model;

in the fourth step, text topics and the words under each text topic are generated by combining the GPU-DMM model with the basic data for the advertisement text topic model obtained in the third step;

in the fifth step, a user interest tag is determined from the advertisement texts clicked by the user and the text topics generated in the fourth step, and is stored in a real-time key-value tag system such as Redis; when the user next visits the website, if the tag meets the targeted delivery requirements of one or more advertisers, the user is regarded as a target user of those advertisers and the advertisements preset by those advertisers are delivered.

2. The method for quickly extracting advertisement text topics by approximate search based on word vectors as claimed in claim 1, characterized in that: building the word vector index with the random projection algorithm comprises an indexing stage and a query stage; in the indexing stage, a series of binary trees is constructed, each binary tree node is split by a random hyperplane, and during a search the random hyperplane determines whether the left or the right subtree is searched; in the query stage, the root node of each binary tree is inserted into a priority queue, the binary trees are then searched through the priority queue until K candidate nodes have been found, duplicate candidates are removed, the distance from each candidate to the query node is computed, and the candidates are sorted to obtain the top N nodes.

3. The method for quickly extracting advertisement text topics by approximate search based on word vectors as claimed in claim 2, characterized in that: the random hyperplane is obtained by randomly selecting two points from a node and dividing all points in the node by the hyperplane equidistant from those two points; if two points are close enough, no hyperplane is likely to separate them.

Technical Field

The invention relates to a method for extracting advertisement text topics, and in particular to a method for quickly extracting advertisement text topics by approximate search based on word vectors.

Background

In Internet advertisement recommendation, the topic of an advertisement text is first extracted from the advertisement texts that a user has clicked or browsed in order to determine the user's interest tags; if those tags meet an advertiser's interest-targeting requirements, the advertiser's advertisement is delivered to the user. Current methods for extracting advertisement text topics include LDA and GPU-DMM.

LDA (Latent Dirichlet Allocation) is a document topic generation model with a three-layer structure of words, topics and documents. As a generative model, it treats each word of an article as the result of a process of "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability". The document-to-topic distribution is multinomial, and so is the topic-to-word distribution. When LDA estimates document topics, the joint probability distribution can be calculated by the following formula:

(The formula is an image in the original publication and is not reproduced in this copy.)

wherein the symbols, which are also images in the original publication, denote in order:

the number of times that document d is assigned topic k; the larger this count, the more likely the document uses topic k;

the Dirichlet hyperparameter of the document-topic distribution, which acts as a smoothing term;

the number of occurrences of word w of document d under topic k;

the Dirichlet hyperparameter of the topic-word distribution, which likewise acts as a smoothing term;

the probability that document d belongs to topic k;

the word probability distribution of the k-th topic.

It can be seen from the above formula that LDA's topic extraction relies on the information provided by words of the same topic occurring together. Advertisement text, however, is often a single sentence, so co-occurrence counts for words of the same topic are extremely sparse. Under these conditions a traditional document topic generation model struggles to produce discriminative document-topic distributions, and the topics it generates rarely have semantic coherence, which becomes a bottleneck for accurately extracting topics from advertisement titles.
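The formula itself survives here only as an image reference. As a point of orientation, and as an assumption rather than a recovery of the original figure, the six quantities defined above match the estimate commonly used in collapsed Gibbs sampling for LDA. The symbol names below (n, α, β, θ, φ, K, V) are conventional choices and are not taken from the source.

```latex
p(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w})
  \;\propto\; \hat{\theta}_{d,k}\,\hat{\varphi}_{k,w}
  = \frac{n_{d}^{(k)} + \alpha_k}{\sum_{k'=1}^{K}\bigl(n_{d}^{(k')} + \alpha_{k'}\bigr)}
    \cdot
    \frac{n_{k}^{(w)} + \beta_w}{\sum_{w'=1}^{V}\bigl(n_{k}^{(w')} + \beta_{w'}\bigr)}
```

Here K is the number of topics, V the vocabulary size, and the counts exclude the word at the current position i. In this form the sparsity problem is visible directly: for a one-sentence advertisement almost every count is zero, so the estimate is dominated by the smoothing terms.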

GPU-DMM assumes that each document is generated by a single topic, which is more reasonable than LDA for such short texts. Document topics and the words under each topic are generated on this basis. To make better use of the information provided by co-occurring words of the same topic, a GPU (Generalized Pólya Urn) sampling process is usually combined with the DMM (Dirichlet Multinomial Mixture): for each word generated during the DMM process, the probability of selecting that word and the words similar to it in a large-scale corpus is raised. This strengthens the semantic association between the sampled topic and the similar words, and raises the probability that words which almost never co-occur in advertisement titles but are semantically similar appear under the same topic, so the final document-topic distribution is more accurate.

Finding the similar words requires a brute-force search over the word vectors in the corpus. Existing large-scale open-source corpora generally contain at least millions of words and their word vectors. In Internet feed advertising, depending on the scale of the business, the number of advertisement texts clicked by users is generally in the millions and the vocabulary that appears in them is in the hundreds of thousands. Therefore, when topics are extracted with the GPU-DMM model, searching the corpus by brute force for the similar words of every word takes computation on the order of billions of operations, which makes it hard to meet the Internet industry's requirement that user interests be computed quickly and in real time.
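To make the cost of the brute-force search concrete, the following is a minimal sketch of such a baseline: every query word is compared against every corpus vector by cosine similarity. The function names and the toy two-dimensional vectors are hypothetical and serve only as illustration; real word vectors have far more dimensions and entries.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def brute_force_similar(query_word, vectors, k):
    """Scan every corpus vector: N similarity computations per query word."""
    q = vectors[query_word]
    scored = ((cosine(q, v), w) for w, v in vectors.items() if w != query_word)
    return [w for _, w in heapq.nlargest(k, scored)]


# Hypothetical toy "word vectors" standing in for a large corpus.
toy_vectors = {
    "phone": [1.0, 0.1],
    "mobile": [0.9, 0.2],
    "handset": [0.8, 0.3],
    "banana": [0.0, 1.0],
}

print(brute_force_similar("phone", toy_vectors, 2))
```

Each call touches all N corpus vectors, so repeating it for every word in the advertisement vocabulary multiplies the two sizes together.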

Disclosure of Invention

In view of the shortcomings described in the Background, the invention provides a method for quickly extracting advertisement text topics by approximate search based on word vectors. Its purpose is to solve the slow extraction of advertisement text topics in the prior art.

The object of the invention is achieved as follows:

A method for quickly extracting advertisement text topics by approximate search based on word vectors, characterized in that the method comprises the following steps: in the first step, using a word segmentation tool and an existing stop-word list, the words in the advertisement title that match entries in the stop-word list are removed, that is, the stop words are removed from the advertisement title; the Chinese words in a corpus are extracted to serve as a dictionary, and the advertisement text is segmented into words using the dictionary;

in the second step, a word vector index is built from the word vectors in the corpus using a random projection algorithm;

in the third step, after the index has been built, the word segmentation results of the advertisement text are read and the word vector of each segmented word is looked up; the nearest-neighbor word vectors of each query word are quickly retrieved from the index by an approximate nearest neighbor (ANN) search algorithm, the similarity between two word vectors is computed as their cosine similarity, and more than fifty similar words are obtained as the basic data for the advertisement text topic model;

in the fourth step, text topics and the words under each text topic are generated by combining the GPU-DMM model with the basic data for the advertisement text topic model obtained in the third step;

in the fifth step, a user interest tag is determined from the advertisement texts clicked by the user and the text topics generated in the fourth step, and is stored in a real-time key-value tag system such as Redis; when the user next visits the website, if the tag meets the targeted delivery requirements of one or more advertisers, the user is regarded as a target user of those advertisers and the advertisements preset by those advertisers are delivered.
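How the similar words from the third step feed the GPU-DMM model in the fourth step can be pictured with a small sketch of the promotion idea described in the Background: when a word is assigned to a topic, the word and its similar words all gain weight under that topic. This is not the full sampler. The names `promote`, `PROMOTION_WEIGHT` and `similar` are hypothetical, and the weight value is an arbitrary assumption.

```python
from collections import defaultdict

PROMOTION_WEIGHT = 0.3  # assumed value, chosen only for illustration


def promote(topic_word_counts, topic, word, similar_words, weight=PROMOTION_WEIGHT):
    """Add one count for `word` under `topic`, plus a fractional count for each similar word."""
    topic_word_counts[topic][word] += 1.0
    for neighbour in similar_words.get(word, ()):
        topic_word_counts[topic][neighbour] += weight


# topic -> word -> (possibly fractional) count
counts = defaultdict(lambda: defaultdict(float))

# Similar words as they might come out of the third step (toy data).
similar = {"phone": ["mobile", "handset"]}

promote(counts, topic=0, word="phone", similar_words=similar)
print(dict(counts[0]))
```

Because "mobile" and "handset" receive weight even though they never appeared in the title, semantically related words can end up under the same topic despite sparse co-occurrence.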

Building the word vector index with the random projection algorithm comprises an indexing stage and a query stage. In the indexing stage, a series of binary trees is constructed; each binary tree node is split by a random hyperplane, and during a search the random hyperplane determines whether the left or the right subtree is searched. In the query stage, the root node of each binary tree is inserted into a priority queue, the binary trees are then searched through the priority queue until K candidate nodes have been found, duplicate candidates are removed, the distance from each candidate to the query node is computed, and the candidates are sorted to obtain the top N nodes.

The random hyperplane is obtained by randomly selecting two points from a node and dividing all points in the node by the hyperplane equidistant from those two points; if two points are close enough, no hyperplane is likely to separate them.

The beneficial effects of the invention are as follows:

The method is easy to operate. It reduces the search complexity for a single query word in the GPU-DMM generative model from O(N) to O(log N), which accelerates the whole advertisement text topic extraction process and greatly increases extraction speed. The entire pipeline, offline processing and unsupervised training included, can be completed within a few hours, which meets the large data volume and near-real-time requirements of the Internet advertising industry and allows user interest tags to be updated daily or hourly. In addition, the method maps the topics extracted from the feed advertisement texts clicked by a user to user interest tags as the basis for precise advertisement delivery, and its extraction accuracy is high.
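A rough back-of-the-envelope comparison shows what the drop from O(N) to O(log N) per query word means at the scales mentioned in the Background. The two sizes below are assumed round numbers for "million-level" and "hundred-thousand-level"; the estimate ignores the number of trees, leaf sizes and constant factors.

```python
import math

corpus_size = 1_000_000  # assumed: "million-level" corpus vocabulary
query_words = 100_000    # assumed: "hundred-thousand-level" advertisement vocabulary


def brute_force_cost(n_corpus, n_queries):
    """Every query word is compared against every corpus vector."""
    return n_queries * n_corpus


def tree_search_cost(n_corpus, n_queries):
    """Every query word descends one balanced binary tree of depth about log2(N)."""
    return n_queries * math.ceil(math.log2(n_corpus))


brute = brute_force_cost(corpus_size, query_words)
indexed = tree_search_cost(corpus_size, query_words)
print(f"brute force : {brute:.1e} comparisons")
print(f"tree search : {indexed:.1e} node visits")
```

Under these assumptions the indexed search does several orders of magnitude less work, which is consistent with the claim that the whole pipeline can run within hours.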

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

As shown in FIG. 1, the invention discloses a method for quickly extracting advertisement text topics by approximate search based on word vectors, comprising the following steps: in the first step, using the Jieba word segmentation tool and an existing stop-word list, the words in the advertisement title that match entries in the stop-word list are removed, that is, the stop words are removed from the advertisement title; the Chinese words in the corpus are extracted to serve as a dictionary; based on a prefix tree, the dictionary is used to scan the word graph and generate a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence; dynamic programming is then used to find the maximum-probability path, that is, the most probable segmentation based on word frequency, and the advertisement text is segmented accordingly;

in the second step, a word vector index is built from the word vectors in the corpus using a random projection algorithm;

in the third step, after the index has been built, the word segmentation results of the advertisement text are read and the word vector of each segmented word is looked up; the nearest-neighbor word vectors of each query word are quickly retrieved from the index by an approximate nearest neighbor (ANN) search algorithm, the similarity between two word vectors is computed as their cosine similarity, and more than fifty similar words are obtained as the basic data for the advertisement text topic model;

in the fourth step, text topics and the words under each text topic are generated by combining the GPU-DMM model with the basic data for the advertisement text topic model obtained in the third step;

in the fifth step, a user interest tag is determined from the advertisement texts clicked by the user and the text topics generated in the fourth step, and is stored in a real-time key-value tag system such as Redis; when the user next visits the website, if the tag meets the targeted delivery requirements of one or more advertisers, the user is regarded as a target user of those advertisers and the advertisements preset by those advertisers are delivered.
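The dictionary-based segmentation of the first step (word graph, DAG, maximum-probability path by dynamic programming) can be sketched as follows. This is a simplified illustration, not the segmentation tool's actual code: a plain dict lookup stands in for the prefix tree, and the dictionary, frequencies and stop-word list are hypothetical toy data.

```python
import math

# Hypothetical toy dictionary: word -> frequency.
WORD_FREQ = {"手机": 50, "手": 5, "机": 5, "壳": 8, "手机壳": 30,
             "的": 100, "新款": 20, "新": 10, "款": 6}
STOP_WORDS = {"的"}  # hypothetical stop-word list
TOTAL = sum(WORD_FREQ.values())


def build_dag(sentence):
    """For each start index, list every end index that forms a dictionary word."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in WORD_FREQ]
        dag[start] = ends or [start]  # unknown character: treat it as a single word
    return dag


def best_path(sentence, dag):
    """Dynamic programming from right to left over log word probabilities."""
    n = len(sentence)
    route = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(WORD_FREQ.get(sentence[start:end + 1], 1)) - math.log(TOTAL)
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence):
    """Segment along the maximum-probability path, then drop stop words."""
    route = best_path(sentence, build_dag(sentence))
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return [w for w in words if w not in STOP_WORDS]


print(segment("新款的手机壳"))
```

With these toy frequencies the path through "手机壳" scores higher than "手机" followed by "壳", so the longer word is kept, and the stop word "的" is filtered out of the result.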

Building the word vector index with the random projection algorithm comprises an indexing stage and a query stage. In the indexing stage, a series of binary trees is constructed; each binary tree node is split by a random hyperplane, and during a search the hyperplane determines whether the left or the right subtree is searched. In effect, during index construction every intermediate node defines a random hyperplane, and each leaf node stores n data points that lie close to one another. In the query stage, the root node of each binary tree is inserted into a priority queue, the binary trees are then searched through the priority queue until K candidate nodes have been found, duplicate candidates are removed, the distance from each candidate to the query node is computed, and the candidates are sorted to obtain the top N nodes.

The random hyperplane is obtained by randomly selecting two points from a node and dividing all points in the node by the hyperplane equidistant from those two points; if two points are close enough, no hyperplane is likely to separate them.
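The indexing and query stages described above can be sketched in pure Python as a small forest of random projection trees. This is a minimal illustration under stated assumptions, not the invention's implementation: `LEAF_SIZE`, the priority rule (distance to the nearest hyperplane crossed on the way) and the toy vectors are all hypothetical choices, and `search_k` plays the role of K in the text.

```python
import heapq
import itertools
import math
import random

LEAF_SIZE = 2  # assumed maximum number of points stored in a leaf


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_distance(u, v):
    return 1.0 - dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def build_tree(ids, vectors, rng):
    """Indexing stage: split with the hyperplane equidistant from two random points."""
    if len(ids) <= LEAF_SIZE:
        return {"leaf": ids}
    a, b = (vectors[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = -dot(normal, [(x + y) / 2 for x, y in zip(a, b)])
    left = [i for i in ids if dot(normal, vectors[i]) + offset <= 0]
    right = [i for i in ids if dot(normal, vectors[i]) + offset > 0]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": ids}
    return {"normal": normal, "offset": offset,
            "left": build_tree(left, vectors, rng),
            "right": build_tree(right, vectors, rng)}


def query(forest, vectors, q, top_n, search_k):
    """Query stage: shared priority queue over all trees, then dedupe and rank."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(-math.inf, next(counter), tree) for tree in forest]
    heapq.heapify(heap)
    candidates = []
    while heap and len(candidates) < search_k:
        priority, _, node = heapq.heappop(heap)
        if "leaf" in node:
            candidates.extend(node["leaf"])
            continue
        margin = dot(node["normal"], q) + node["offset"]
        near, far = ((node["right"], node["left"]) if margin > 0
                     else (node["left"], node["right"]))
        heapq.heappush(heap, (priority, next(counter), near))
        # The far side is explored later, the further q lies from the hyperplane.
        heapq.heappush(heap, (max(priority, abs(margin)), next(counter), far))
    unique = set(candidates)  # remove duplicates found in several trees
    return sorted(unique, key=lambda i: cosine_distance(q, vectors[i]))[:top_n]


rng = random.Random(0)
vectors = {  # hypothetical toy word vectors
    "phone": [1.0, 0.1], "mobile": [0.9, 0.2], "handset": [0.8, 0.3],
    "banana": [0.0, 1.0], "apple": [0.1, 0.9], "pear": [0.2, 0.8],
}
forest = [build_tree(list(vectors), vectors, rng) for _ in range(3)]
print(query(forest, vectors, [1.0, 0.15], top_n=3, search_k=100))
```

A smaller `search_k` stops the search earlier and trades recall for speed, while building more trees raises the chance that true neighbours split apart in one tree are still found in another.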
