Enterprise portrait method based on label layering and deepening modeling

Document No.: 830111 Published: 2021-03-30

Reading note: This technology, "Enterprise portrait method based on label layering and deepening modeling" (一种基于标签分层延深建模的企业画像方法), was designed by Li Xiang (李翔), Ding Xingshuo (丁行硕), Wang Yuanyuan (王媛媛), Zhu Quanyin (朱全银), Gao Shangbing (高尚兵), Wang Liuyang (王留洋), Ma Jialin (马甲林), Zhang Kewen (张柯文) and Cheng Jieyi (成洁怡) on 2020-11-19. Main content: The invention discloses an enterprise portrait method based on label layered deepening modeling. First, the fuzzy labels of enterprises are counted and screened to pick out labels, such as "wholesale industry" and "retail industry", that cannot fully summarize an enterprise's characteristics, and a Bert model classifies and deepens the screened labels according to the enterprise business scope and the enterprise labels. Next, the enterprise name, enterprise introduction and business scope information are integrated, features are expanded using a pre-built enterprise lexicon, and TextRank, TF-IDF and an LDA topic model each extract keywords from the integrated information; the processed keywords serve as deeper enterprise labels. Finally, the modeling method is applied to an enterprise portrait system to improve how precisely the labels summarize an enterprise. The method is generally applicable to label deepening modeling and label extraction problems, fully considers the hierarchical relationship of label deepening, and can effectively improve the accuracy of the labels and of the enterprise portrait system.

1. An enterprise portrait method based on label layered deepening modeling, characterized by comprising the following specific steps:

(1) deduplicating the enterprise label data set D and the enterprise multi-source data set D1 and removing their null entries, to obtain the cleaned enterprise data sets D2 and D3;

(2) counting and screening data set D2, screening out the label data set that cannot fully summarize enterprise characteristics and defining it as D4, and compiling the full label set as the basis for deepening;

(3) building a Bert model, taking data set D4 as the model input, and after semantic learning performing the classification deepening of the first layer of labels with a softmax layer;

(4) integrating the enterprise name, enterprise introduction and business scope information in data set D3, extracting keywords with TextRank, TF-IDF and an LDA topic model respectively, then processing the extracted keywords and using the processed words as the next-layer deepened labels;

(5) applying the above label deepening method to an enterprise portrait system to improve the accuracy of the labels and of the enterprise portrait system.

2. The enterprise portrait method based on label layered deepening modeling according to claim 1, wherein the specific method for obtaining the enterprise data sets D2 and D3 in step (1) is as follows:

(1.1) define Text as a single multi-source information record to be cleaned, and define id, content1, content2 and content3 as the enterprise serial number, enterprise name, enterprise introduction and enterprise business scope respectively, satisfying the relation $Text = \{id, content1, content2, content3\}$;

(1.2) define Text1 as a single enterprise business-scope record to be cleaned, and define id, content3 and label as the enterprise serial number, enterprise business scope and enterprise label respectively, satisfying the relation $Text1 = \{id, content3, label\}$;

(1.3) define D as the data set to be cleaned for first-layer label deepening and D1 as the data set to be cleaned for next-layer label deepening, where $D = \{Text1_1, Text1_2, \ldots, Text1_a, \ldots, Text1_{len(D)}\}$, $Text1_a$ is the a-th enterprise label record to be cleaned in D, $D1 = \{Text_1, Text_2, \ldots, Text_{a1}, \ldots, Text_{len(D1)}\}$, $Text_{a1}$ is the a1-th enterprise multi-source record to be cleaned in D1, $len(D)$ is the number of texts in D with $a \in [1, len(D)]$, and $len(D1)$ is the number of texts in D1 with $a1 \in [1, len(D1)]$;

(1.4) after deduplicating the texts in data set D and removing null entries, obtain the cleaned first-layer enterprise data set $D2 = \{T_1, T_2, \ldots, T_b, \ldots, T_{len(D2)}\}$, where $T_b$ is the b-th enterprise label record to be processed in D2, $len(D2)$ is the number of texts in D2, and $b \in [1, len(D2)]$;

(1.5) after deduplicating the texts in data set D1 and removing null entries, obtain the cleaned next-layer enterprise data set $D3 = \{T_1, T_2, \ldots, T_{b1}, \ldots, T_{len(D3)}\}$, where $T_{b1}$ is the b1-th enterprise multi-source record to be processed in D3, $len(D3)$ is the number of texts in D3, and $b1 \in [1, len(D3)]$.

3. The enterprise portrait method based on label layered deepening modeling according to claim 1, wherein the specific method in step (2) for screening out the label data set that cannot fully summarize enterprise characteristics, defining it as D4, and compiling the full label set as the basis for deepening is as follows:

(2.1) screen data set D2 to pick out the records whose labels, such as "wholesale industry" and "retail industry", cannot fully summarize enterprise characteristics but can be deepened with other labels, and define this set as $D4 = \{T_1, T_2, \ldots, T_c, \ldots, T_{len(D4)}\}$; $D5 = \{T_1, T_2, \ldots, T_d, \ldots, T_{len(D5)}\}$ denotes the remaining data set, n is the number of label categories in D4, and list4 is the label set of D4;

(2.2) count the labels in data set D5 and use all of them as the basis for deepening, where m is the number of label categories in D5 and list5 is the label set of D5;

(2.3) use the label set list5 as the target labels for classification deepening;

(2.4) use the first-layer data set D5 as the training set, and classify and deepen data set D4 according to the label set list5.

4. The enterprise portrait method based on label layered deepening modeling according to claim 1, wherein the specific method in step (3) for performing the classification deepening of the first layer of labels with a softmax layer is as follows:

(3.1) build a Bert model and train it on the D5 training set;

(3.2) process data set D4 so that each text content $T_c$ to be processed is fixed to a uniform length $L_{max}$;

(3.3) define a loop variable i with initial value 1;

(3.4) if $i \le len(D4)$, go to step (3.5); otherwise go to step (3.9);

(3.5) define $len(T_i)$ as the length of the i-th text; if $len(T_i) + 2 \le L_{max}$, pad the text with 0 and go to the next step; otherwise keep the first $L_{max}$ units of the text and go to the next step;

(3.6) $i = i + 1$;

(3.7) feed each text into the Token Embeddings layer of the BERT model, with the output denoted V1; at the same time obtain the text information and position information from the Segment Embeddings layer and the Position Embeddings layer, with the outputs denoted V2 and V3;

(3.8) add the three outputs V1, V2 and V3 to obtain V, use the vector V as the input to the BERT model, and obtain in the last layer of neurons the word vector sequence $s_i = \{V(W_1), V(W_2), \ldots, V(W_f), \ldots, V(W_{L_{max}})\}$, where $V(W_f)$ is the f-th vector representation combining the text information;

(3.9) end the loop and output the word vector sequence $S = \{s_1, s_2, s_3, \ldots, s_f, \ldots, s_{len(D3)}\}$;

(3.10) apply the softmax function to the vector sequence to perform document classification prediction, obtaining the class probability vector $P = \{p_1, p_2, \ldots, p_g, \ldots, p_h\}$, where $p_g$ is the probability that the text belongs to the g-th class and h is the total number of classes;

(3.11) find the maximum value in the vector P and output the class corresponding to it, which is the label classification deepening result y.

5. The enterprise portrait method based on label layered deepening modeling according to claim 1, wherein the specific method in step (4) for using the processed words as the next-layer deepened labels is as follows:

(4.1) the cleaned data set from step (1.5) is $D3 = \{T_1, T_2, \ldots, T_{b1}, \ldots, T_{len(D3)}\}$ with $T = \{id, content1, content2, content3\}$, where id, content1, content2 and content3 are the enterprise serial number, enterprise name, enterprise introduction and enterprise business scope respectively;

(4.2) define D6 as the data set to be integrated and $len(D6)$ as the number of texts to be integrated in D6, with $D6 = \{T_1, T_2, \ldots, T_a, \ldots, T_{len(D6)}\}$;

(4.3) integrate the enterprise name, enterprise introduction and business scope information; the integrated enterprise text is content4, satisfying $T1 = \{id, content4\}$ and $D7 = \{T1_1, T1_2, \ldots, T1_a, \ldots, T1_{len(D7)}\}$, where T1 is a single integrated text and D7 is the integrated enterprise data set;

(4.4) collect the words that distort the extraction result and build a stop-word dictionary;

(4.5) build an enterprise dictionary by collecting specialized vocabulary from the enterprise domain;

(4.6) extract keywords from all nouns in the integrated enterprise data set D7 using TextRank to obtain the result set K1;

(4.7) extract keywords from all nouns in the integrated enterprise data set D7 using TF-IDF to obtain the result set K2;

(4.8) finally, extract keywords from all nouns in the integrated enterprise data set D7 using an LDA topic model to obtain the result set K3;

(4.9) sort and merge the extracted keyword sets K1, K2 and K3 to obtain the keyword set $K = \{W_1, W_2, \ldots, W_i, \ldots, W_{len(D7)}\}$, where $W_i$ is the keyword set of a single enterprise and $i < len(D7)$;

(4.10) use the extracted keywords $W_i$ as the further deepened labels;

(4.11) compile the obtained labels and attach all of them to the enterprise according to their hierarchical relationship.

6. The enterprise portrait method based on label layered deepening modeling according to claim 1, wherein the specific method in step (5) for applying the label deepening method to an enterprise portrait system to improve the accuracy of the labels and of the enterprise portrait system is as follows:

(5.1) the enterprise portrait system comprises a preprocessing module, a label classification deepening module, a keyword extraction deepening module, a label integration module and a portrait display module;

(5.2) input the text of the enterprise to be deepened, and have the preprocessing module preprocess the text to remove noise;

(5.3) pass the preprocessed enterprise text to the label classification deepening module to perform the classification deepening of labels;

(5.4) integrate the enterprise name, enterprise introduction and business scope information, and further enrich the label content in the keyword extraction deepening module;

(5.5) integrate all the deepened labels in the label integration module and attach all of them to the enterprise;

(5.6) generate the enterprise portrait information and display the label information through the portrait display module.

Technical Field

The invention belongs to the technical field of enterprise portrait and natural language processing, and particularly relates to an enterprise portrait method based on label layered deepening modeling.

Background

The layered deepening of labels described in the invention is of considerable importance to portrait technology. When faced with the portrait labeling problem, researchers usually choose classification matching, but that approach has obvious defects: it ignores the shallow-to-deep hierarchical relationship among labels, the resulting labels cannot accurately summarize the characteristics of an enterprise, and the labels cannot be modeled any deeper. Combining neural networks with natural language processing can address the label deepening modeling problem well and thereby improve the accuracy of the labels and of the portrait system.

The existing research foundations of Li Xiang, Zhu Quanyin et al. include: X. Li, Z. Wang, S. Gao, R. Hu, Q. Zhu and L. Wang, "An Intelligent Context-Aware Management Framework for Cold Chain Logistics Distribution," in IEEE Transactions on Intelligent Transportation Systems, doi: 10.1109/TITS.2018.2889069; X. Li, Z. Wang, L. Wang, R. Hu and Q. Zhu, "A Multi-Dimensional Context-Aware Recommendation Approach Based on Improved Random Forest Algorithm," in IEEE Access, vol. 6, pp. 45071-45085, 2018, doi: 10.1109/ACCESS.2018.2865436; Li, X., Wang, Z., Hu, R. et al. Recommendation algorithm based on improved spectral clustering and transfer learning. Pattern Analysis and Applications 22, 633-647 (2019); Li Xiang, Zhu Quanyin. Collaborative filtering recommendation based on joint clustering and rating matrix sharing [J]. Journal of Frontiers of Computer Science and Technology, 2014, 8(6): 751-; and the patents applied for, published and granted to Li Xiang, Zhu Quanyin et al.: invention patent No. ZL 2017100546758.1, 2020.02.07; Zhu Quanyin, Li Xiang, Hu Rongling et al., an incremental-learning multi-level binary classification method for science and technology news, invention patent No. ZL 201510642902.X, 2018.08.10; Zhu Quanyin, Shao Wujie, Li Xiang et al., a multi-level multi-classification method for science and technology news titles, invention patent No. ZL 201610114278.0, 2019.04.19; invention patent No. ZL201210325368.6, 2016.06.08; invention patent No. ZL 201610565749.X, 2019.06.11.

Enterprise portrait:

An enterprise portrait is a product of the big-data era that grew out of the user portrait: a labeled enterprise model is abstracted from an enterprise's basic information, and the enterprise's information is presented comprehensively in charts. Enterprise portrait labels are built from the most basic statistical labels and from rule-based labels generated by enterprise user behavior; data mining is then used to predict and judge certain attributes of the enterprise and to uncover potentially valuable information, and together these labels form the enterprise portrait label system. An enterprise portrait can vividly show the overall strength of an enterprise, and the portrait information can serve as an important basis when enterprises cooperate on projects. It can also reduce competition among enterprises and help them pursue benefits while avoiding harm. For governments, knowing enterprise information facilitates enterprise supervision.

Yang Ling Yun, Yang Wen Feng, a method and system for enterprise portrait: CN111666377A, 2020.06.03. This invention collects identification information of enterprises and analyzes and processes the label data to build an enterprise portrait; although it provides a way to construct an enterprise portrait, it does not study the labels in any greater depth. A method and system for creating an enterprise portrait: CN108572967A, 2018.09.25. This invention acquires enterprise portrait data, classifies it, and then matches the classified data with enterprise information; although it can divide enterprise labels, the classified labels have limited summarizing ability and cannot accurately describe the characteristics of an enterprise. A method for establishing an enterprise portrait based on a regression model: CN105512245A, 2016.4.20. This invention builds an enterprise portrait with a regression model and makes full use of the latent semantic information of the text to compensate for the shortcomings of traditional enterprise portraits, but it only expands and extracts feature words and does not consider the shallow-to-deep progressive relationship among labels.

The above inventions are effective in their respective fields, but traditional enterprise portraits have the following problems: 1. the label definitions of traditional enterprise portraits are fuzzy and cannot fully describe the characteristics of an enterprise, which lowers label accuracy; 2. traditional enterprise portraits do not model the labels from shallow to deep, so keywords that better fit an enterprise's characteristics cannot be extracted. To address these problems, the invention provides an enterprise portrait method and system based on label layered deepening modeling. The method first counts and screens the fuzzy labels of enterprises, picks out the labels that cannot fully summarize enterprise characteristics, and uses a Bert model to classify and deepen the screened labels according to the enterprise business scope and the compiled labels; it then integrates the information, expands the features using a pre-built enterprise lexicon, and extracts keywords with several algorithms to serve as deeper enterprise labels. The method is generally applicable to label deepening modeling and label extraction problems, fully considers the hierarchical relationship of label deepening, and can effectively improve the accuracy of the labels and of the enterprise portrait system.

Disclosure of Invention

Purpose of the invention: to address the problems in the prior art, the invention provides an enterprise portrait method based on label layered deepening modeling that can accurately depict the characteristics of an enterprise, make up for the shortcomings of traditional enterprise portraits, and improve efficiency in practical use.

The technical scheme is as follows: in order to solve the technical problem, the invention provides an enterprise portrait method based on label layered deepening modeling, which comprises the following specific steps:

(1) deduplicating the enterprise label data set D and the enterprise multi-source data set D1 and removing their null entries, to obtain the cleaned enterprise data sets D2 and D3;

(2) counting and screening data set D2, screening out the label data set that cannot fully summarize enterprise characteristics and defining it as D4, and compiling the full label set as the basis for deepening;

(3) building a Bert model, taking data set D4 as the model input, and after semantic learning performing the classification deepening of the first layer of labels with a softmax layer;

(4) integrating the enterprise name, enterprise introduction and business scope information in data set D3, extracting keywords with TextRank, TF-IDF and an LDA topic model respectively, then processing the extracted keywords and using the processed words as the next-layer deepened labels;

(5) applying the above label deepening method to an enterprise portrait system to improve the accuracy of the labels and of the enterprise portrait system.

Further, the specific method for obtaining the enterprise data sets D2 and D3 in the step (1) is as follows:

(1.1) define Text as a single multi-source information record to be cleaned, and define id, content1, content2 and content3 as the enterprise serial number, enterprise name, enterprise introduction and enterprise business scope respectively, satisfying the relation $Text = \{id, content1, content2, content3\}$;

(1.2) define Text1 as a single enterprise business-scope record to be cleaned, and define id, content3 and label as the enterprise serial number, enterprise business scope and enterprise label respectively, satisfying the relation $Text1 = \{id, content3, label\}$;

(1.3) define D as the data set to be cleaned for first-layer label deepening and D1 as the data set to be cleaned for next-layer label deepening, where $D = \{Text1_1, Text1_2, \ldots, Text1_a, \ldots, Text1_{len(D)}\}$, $Text1_a$ is the a-th enterprise label record to be cleaned in D, $D1 = \{Text_1, Text_2, \ldots, Text_{a1}, \ldots, Text_{len(D1)}\}$, $Text_{a1}$ is the a1-th enterprise multi-source record to be cleaned in D1, $len(D)$ is the number of texts in D with $a \in [1, len(D)]$, and $len(D1)$ is the number of texts in D1 with $a1 \in [1, len(D1)]$;

(1.4) after deduplicating the texts in data set D and removing null entries, obtain the cleaned first-layer enterprise data set $D2 = \{T_1, T_2, \ldots, T_b, \ldots, T_{len(D2)}\}$, where $T_b$ is the b-th enterprise label record to be processed in D2, $len(D2)$ is the number of texts in D2, and $b \in [1, len(D2)]$;

(1.5) after deduplicating the texts in data set D1 and removing null entries, obtain the cleaned next-layer enterprise data set $D3 = \{T_1, T_2, \ldots, T_{b1}, \ldots, T_{len(D3)}\}$, where $T_{b1}$ is the b1-th enterprise multi-source record to be processed in D3, $len(D3)$ is the number of texts in D3, and $b1 \in [1, len(D3)]$.

Further, the specific method in step (2) for screening out the label data set that cannot fully summarize enterprise characteristics, defining it as D4, and compiling the full label set as the basis for deepening is as follows:

(2.1) screen data set D2 to pick out the records whose labels, such as "wholesale industry" and "retail industry", cannot fully summarize enterprise characteristics but can be deepened with other labels, and define this set as $D4 = \{T_1, T_2, \ldots, T_c, \ldots, T_{len(D4)}\}$; $D5 = \{T_1, T_2, \ldots, T_d, \ldots, T_{len(D5)}\}$ denotes the remaining data set, n is the number of label categories in D4, and list4 is the label set of D4;

(2.2) count the labels in data set D5 and use all of them as the basis for deepening, where m is the number of label categories in D5 and list5 is the label set of D5;

(2.3) use the label set list5 as the target labels for classification deepening;

(2.4) use the first-layer data set D5 as the training set, and classify and deepen data set D4 according to the label set list5.

Further, the specific method for performing classification deepening on the first-layer labels by using the softmax layer in the step (3) is as follows:

(3.1) build a Bert model and train it on the D5 training set;

(3.2) process data set D4 so that each text content $T_c$ to be processed is fixed to a uniform length $L_{max}$;

(3.3) define a loop variable i with initial value 1;

(3.4) if $i \le len(D4)$, go to step (3.5); otherwise go to step (3.9);

(3.5) define $len(T_i)$ as the length of the i-th text; if $len(T_i) + 2 \le L_{max}$, pad the text with 0 and go to the next step; otherwise keep the first $L_{max}$ units of the text and go to the next step;

(3.6) $i = i + 1$;

(3.7) feed each text into the Token Embeddings layer of the BERT model, with the output denoted V1; at the same time obtain the text information and position information from the Segment Embeddings layer and the Position Embeddings layer, with the outputs denoted V2 and V3;

(3.8) add the three outputs V1, V2 and V3 to obtain V, use the vector V as the input to the BERT model, and obtain in the last layer of neurons the word vector sequence $s_i = \{V(W_1), V(W_2), \ldots, V(W_f), \ldots, V(W_{L_{max}})\}$, where $V(W_f)$ is the f-th vector representation combining the text information;

(3.9) end the loop and output the word vector sequence $S = \{s_1, s_2, s_3, \ldots, s_f, \ldots, s_{len(D3)}\}$;

(3.10) apply the softmax function to the vector sequence to perform document classification prediction, obtaining the class probability vector $P = \{p_1, p_2, \ldots, p_g, \ldots, p_h\}$, where $p_g$ is the probability that the text belongs to the g-th class and h is the total number of classes;

(3.11) find the maximum value in the vector P and output the class corresponding to it, which is the label classification deepening result y.
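Steps (3.5), (3.8), (3.10) and (3.11) can be sketched in plain Python as follows. This is only an illustrative sketch, not the patented implementation: the special-token ids `CLS_ID` and `SEP_ID`, the helper names, and the choice to truncate to $L_{max} - 2$ tokens so that the two special tokens still fit are all assumptions, and the `logits` passed to `predict` stand in for whatever scores a trained classifier head would produce.

```python
import math
from typing import List

PAD_ID = 0    # padding value named in step (3.5)
CLS_ID = 101  # assumed id of the [CLS] token (hypothetical)
SEP_ID = 102  # assumed id of the [SEP] token (hypothetical)


def pad_or_truncate(token_ids: List[int], l_max: int) -> List[int]:
    """Return exactly l_max ids: [CLS] + tokens + [SEP], zero-padded or truncated."""
    if len(token_ids) + 2 <= l_max:
        ids = [CLS_ID] + token_ids + [SEP_ID]
        return ids + [PAD_ID] * (l_max - len(ids))
    # Too long: keep the leading tokens so the total length is still l_max
    # (assumption: the two special tokens count toward l_max).
    return [CLS_ID] + token_ids[: l_max - 2] + [SEP_ID]


def add_embeddings(v1, v2, v3):
    """Element-wise sum V = V1 + V2 + V3 of three equally shaped matrices."""
    return [[a + b + c for a, b, c in zip(r1, r2, r3)]
            for r1, r2, r3 in zip(v1, v2, v3)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax giving the probability vector P."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: List[float], labels: List[str]) -> str:
    """Return the label whose probability in P is largest (result y)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]


padded = pad_or_truncate([7, 8, 9], 8)
truncated = pad_or_truncate(list(range(1, 11)), 6)
v = add_embeddings([[1, 2]], [[10, 20]], [[100, 200]])
y = predict([0.2, 2.5, -1.0],
            ["software industry", "catering industry", "logistics industry"])
```

Subtracting the maximum logit before exponentiating does not change the resulting probabilities; it only avoids overflow for large scores.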

Further, the specific method in step (4) for using the processed words as the next-layer deepened labels is as follows:

(4.1) the cleaned data set from step (1.5) is $D3 = \{T_1, T_2, \ldots, T_{b1}, \ldots, T_{len(D3)}\}$ with $T = \{id, content1, content2, content3\}$, where id, content1, content2 and content3 are the enterprise serial number, enterprise name, enterprise introduction and enterprise business scope respectively;

(4.2) define D6 as the data set to be integrated and $len(D6)$ as the number of texts to be integrated in D6, with $D6 = \{T_1, T_2, \ldots, T_a, \ldots, T_{len(D6)}\}$;

(4.3) integrate the enterprise name, enterprise introduction and business scope information; the integrated enterprise text is content4, satisfying $T1 = \{id, content4\}$ and $D7 = \{T1_1, T1_2, \ldots, T1_a, \ldots, T1_{len(D7)}\}$, where T1 is a single integrated text and D7 is the integrated enterprise data set;

(4.4) collect the words that distort the extraction result and build a stop-word dictionary;

(4.5) build an enterprise dictionary by collecting specialized vocabulary from the enterprise domain;

(4.6) extract keywords from all nouns in the integrated enterprise data set D7 using TextRank to obtain the result set K1;

(4.7) extract keywords from all nouns in the integrated enterprise data set D7 using TF-IDF to obtain the result set K2;

(4.8) finally, extract keywords from all nouns in the integrated enterprise data set D7 using an LDA topic model to obtain the result set K3;

(4.9) sort and merge the extracted keyword sets K1, K2 and K3 to obtain the keyword set $K = \{W_1, W_2, \ldots, W_i, \ldots, W_{len(D7)}\}$, where $W_i$ is the keyword set of a single enterprise and $i < len(D7)$;

(4.10) use the extracted keywords $W_i$ as the further deepened labels;

(4.11) compile the obtained labels and attach all of them to the enterprise according to their hierarchical relationship.
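As a rough illustration of steps (4.4), (4.7) and (4.9), the sketch below computes TF-IDF keywords over already tokenized texts and merges several keyword lists for one enterprise. It rests on several assumptions: the texts are pre-tokenized, the IDF uses a smoothed form $\log\frac{1+N}{1+df}$ chosen here for convenience, the merge is a simple order-preserving union, and the TextRank and LDA results (`k1`, `k3`) are hypothetical placeholder lists rather than real extractor output.

```python
import math
from collections import Counter
from typing import List, Set


def tfidf_keywords(docs: List[List[str]], stop_words: Set[str],
                   top_k: int) -> List[List[str]]:
    """Return the top_k highest TF-IDF tokens of every tokenized document."""
    filtered = [[w for w in doc if w not in stop_words] for doc in docs]
    n_docs = len(filtered)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(w for doc in filtered for w in set(doc))
    results = []
    for doc in filtered:
        tf = Counter(doc)
        scores = {
            w: (count / len(doc)) * math.log((1 + n_docs) / (1 + doc_freq[w]))
            for w, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_k])
    return results


def merge_keywords(*keyword_lists: List[str]) -> List[str]:
    """Union of several ranked keyword lists, keeping first-seen order."""
    merged, seen = [], set()
    for keywords in keyword_lists:
        for w in keywords:
            if w not in seen:
                seen.add(w)
                merged.append(w)
    return merged


docs = [
    ["company", "sells", "steel", "pipes", "steel"],
    ["company", "develops", "software", "software", "cloud"],
]
K2 = tfidf_keywords(docs, stop_words={"company"}, top_k=2)

k1 = ["steel", "trading"]  # hypothetical TextRank output for enterprise 0
k3 = ["pipes", "metal"]    # hypothetical LDA output for enterprise 0
W_0 = merge_keywords(k1, K2[0], k3)
```

With this smoothed IDF, a token that occurs in every document scores zero, so generic words that survive the stop-word dictionary still sink to the bottom of the ranking.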

Further, the specific method in step (5) for applying the label deepening method to an enterprise portrait system to improve the accuracy of the labels and of the enterprise portrait system is as follows:

(5.1) the enterprise portrait system comprises a preprocessing module, a label classification deepening module, a keyword extraction deepening module, a label integration module and a portrait display module;

(5.2) input the text of the enterprise to be deepened, and have the preprocessing module preprocess the text to remove noise;

(5.3) pass the preprocessed enterprise text to the label classification deepening module to perform the classification deepening of labels;

(5.4) integrate the enterprise name, enterprise introduction and business scope information, and further enrich the label content in the keyword extraction deepening module;

(5.5) integrate all the deepened labels in the label integration module and attach all of them to the enterprise;

(5.6) generate the enterprise portrait information and display the label information through the portrait display module.
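The five modules of steps (5.1) to (5.6) could be wired together roughly as below. This is a structural sketch only: `EnterprisePortrait`, `build_portrait` and the three toy callables are hypothetical names introduced for illustration, and the toy classifier and extractor merely stand in for the trained Bert model and the TextRank, TF-IDF and LDA extractors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EnterprisePortrait:
    enterprise_id: str
    first_layer_label: str  # from the label classification deepening module
    deepened_labels: List[str] = field(default_factory=list)  # from keyword extraction

    def display(self) -> str:
        """Portrait display module: render the labels in hierarchical order."""
        return (f"{self.enterprise_id}: {self.first_layer_label} > "
                + ", ".join(self.deepened_labels))


def build_portrait(record: Dict[str, str],
                   preprocess: Callable[[str], str],
                   classify: Callable[[str], str],
                   extract_keywords: Callable[[str], List[str]]) -> EnterprisePortrait:
    """Run preprocessing, classification deepening, keyword deepening, integration."""
    scope = preprocess(record["content3"])
    integrated = preprocess(" ".join(
        [record["content1"], record["content2"], record["content3"]]))
    return EnterprisePortrait(record["id"], classify(scope),
                              extract_keywords(integrated))


# Toy stand-ins for the real modules (assumptions, for illustration only).
def toy_preprocess(text: str) -> str:
    return " ".join(text.split())  # collapse stray whitespace


def toy_classify(text: str) -> str:
    return "software industry" if "software" in text else "unclassified"


def toy_extract(text: str) -> List[str]:
    words = text.split()
    return [w for w in ["cloud", "software"] if w in words]


record = {"id": "1", "content1": "Acme Tech",
          "content2": "  builds cloud   software ",
          "content3": "software development"}
portrait = build_portrait(record, toy_preprocess, toy_classify, toy_extract)
summary = portrait.display()
```

Passing the modules in as callables keeps the integration step independent of how each module is implemented, so a trained classifier can replace the toy one without touching `build_portrait`.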

By adopting the technical scheme, the invention has the following beneficial effects:

Based on an existing enterprise text label data set, the invention uses Bert and keyword extraction to perform label layered deepening modeling. Specifically, a Bert model performs the first-layer classification deepening on the enterprise business scope data set; the integrated data set is then further extracted and deepened by combining several extraction algorithms; finally, through label integration, the labels depict enterprise characteristics accurately, while label modeling is sped up, practitioners' working time is shortened, and the operating efficiency of the enterprise portrait system is improved.

Drawings

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a flow diagram of data cleansing in an exemplary embodiment;

FIG. 3 is a flow chart of statistical screening data in an embodiment;

FIG. 4 is a flowchart of Bert model classification deepening in an embodiment;

FIG. 5 is a flowchart of keyword extraction deepening in an embodiment;

FIG. 6 is a flowchart of applying the method to an enterprise portrait system in an embodiment.

Detailed Description

The present invention is further illustrated by the following specific examples in conjunction with the national standards of engineering. It should be understood that these examples are intended only to illustrate the invention and not to limit its scope; after reading the present invention, modifications of various equivalent forms made by those skilled in the art fall within the scope defined by the appended claims.

As shown in FIGS. 1-6, the enterprise portrait method based on label layered deepening modeling according to the present invention includes the following steps:

Step 1: deduplicate the enterprise label data set D and the enterprise multi-source data set D1 and remove their null entries, obtaining the cleaned enterprise data sets D2 and D3. The specific method is as follows:

Step 1.1: define Text as a single multi-source information record to be cleaned, and define id, content1, content2 and content3 as the enterprise serial number, enterprise name, enterprise introduction and enterprise business scope respectively, satisfying the relation $Text = \{id, content1, content2, content3\}$;

Step 1.2: define Text1 as a single enterprise business-scope record to be cleaned, and define id, content3 and label as the enterprise serial number, enterprise business scope and enterprise label respectively, satisfying the relation $Text1 = \{id, content3, label\}$;

Step 1.3: define D as the data set to be cleaned for first-layer label deepening and D1 as the data set to be cleaned for next-layer label deepening, where $D = \{Text1_1, Text1_2, \ldots, Text1_a, \ldots, Text1_{len(D)}\}$, $Text1_a$ is the a-th enterprise label record to be cleaned in D, $D1 = \{Text_1, Text_2, \ldots, Text_{a1}, \ldots, Text_{len(D1)}\}$, $Text_{a1}$ is the a1-th enterprise multi-source record to be cleaned in D1, $len(D)$ is the number of texts in D with $a \in [1, len(D)]$, and $len(D1)$ is the number of texts in D1 with $a1 \in [1, len(D1)]$;

Step 1.4: after deduplicating the texts in data set D and removing null entries, obtain the cleaned first-layer enterprise data set $D2 = \{T_1, T_2, \ldots, T_b, \ldots, T_{len(D2)}\}$, where $T_b$ is the b-th enterprise label record to be processed in D2, $len(D2)$ is the number of texts in D2, and $b \in [1, len(D2)]$;

Step 1.5: after deduplicating the texts in data set D1 and removing null entries, obtain the cleaned next-layer enterprise data set $D3 = \{T_1, T_2, \ldots, T_{b1}, \ldots, T_{len(D3)}\}$, where $T_{b1}$ is the b1-th enterprise multi-source record to be processed in D3, $len(D3)$ is the number of texts in D3, and $b1 \in [1, len(D3)]$.
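The cleaning in Step 1 might look like the following minimal sketch. The text does not specify how duplicates or null entries are detected, so the sketch assumes that a record is null when any required field is missing or blank, and that two records are duplicates when their required fields match after trimming (the id is ignored); the function name `clean` and the sample records are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


def clean(records: List[Record], required: List[str]) -> List[Record]:
    """Drop records with a blank required field, then drop duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Null removal: every required field must be present and non-blank.
        if any(not (rec.get(f) or "").strip() for f in required):
            continue
        # Deduplication: identical required fields count as one record.
        key = tuple(rec[f].strip() for f in required)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


# D holds Text1-style records {id, content3, label} from Step 1.2.
D = [
    {"id": "1", "content3": "wholesale of building materials", "label": "wholesale industry"},
    {"id": "2", "content3": "wholesale of building materials", "label": "wholesale industry"},
    {"id": "3", "content3": "", "label": "retail industry"},
    {"id": "4", "content3": "software development", "label": "software industry"},
]
D2 = clean(D, required=["content3", "label"])
kept_ids = [rec["id"] for rec in D2]
```

The same function could be applied to D1 with `required=["content1", "content2", "content3"]` to obtain D3, under the same assumptions.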

Step 2: counting and screening a data set D2, screening a tag data set which can not completely summarize the characteristics of the enterprise, defining the tag data set as D4, and counting all tag sets as a deepening basis, wherein the specific method comprises the following steps:

step 2.1: screening the D2 data set to screen out the data set which can not completely summarize the characteristics of enterprises such as wholesale industry, retail industry and the like but can be deepened by other labels, and defining D4 ═ T1,T2,…,Tc,…,Tlen(D4)D5 ═ T1,T2,…,Td,…,Tlen(D5)The rest data sets are represented, the number of label categories of D4 is n, and list4 represents a label set of D4;

Step 2.2: Count data set D5 and collect all of its labels as the basis for deepening, where m is the number of label categories in D5 and list5 is the label set of D5;

Step 2.3: Use the label set list5 as the target labels for label classification deepening;

Step 2.4: Use the first-layer data set D5 as the training set, and classify and deepen data set D4 according to the label set list5.
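The split described in Steps 2.1 to 2.4 can be sketched as follows. The source names only wholesale and retail as examples of fuzzy labels and does not state the screening criterion, so the hand-written set `FUZZY_LABELS`, the function name `split_by_label` and the sample records are assumptions.

```python
from collections import Counter

# Assumption: fuzzy labels are listed by hand; the source only gives
# wholesale and retail as examples.
FUZZY_LABELS = {"wholesale", "retail"}


def split_by_label(d2, fuzzy_labels):
    """Return (D4, D5, list4, list5) in the sense of Steps 2.1 to 2.3."""
    d4 = [rec for rec in d2 if rec["label"] in fuzzy_labels]
    d5 = [rec for rec in d2 if rec["label"] not in fuzzy_labels]
    list4 = sorted({rec["label"] for rec in d4})   # n = len(list4)
    list5 = sorted({rec["label"] for rec in d5})   # m = len(list5)
    return d4, d5, list4, list5


D2 = [
    {"id": 1, "content3": "sale of building materials", "label": "wholesale"},
    {"id": 2, "content3": "software development", "label": "software"},
    {"id": 3, "content3": "sale of fresh fruit", "label": "retail"},
    {"id": 4, "content3": "cargo transport", "label": "logistics"},
    {"id": 5, "content3": "app development", "label": "software"},
]
D4, D5, list4, list5 = split_by_label(D2, FUZZY_LABELS)
label_counts = Counter(rec["label"] for rec in D5)  # label statistics of D5
```

D5 with its labels then serves as the training set, and each record in D4 is to be reassigned one of the labels in list5.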

Step 3: Construct a BERT model, take data set D4 as the input of the model and, after semantic learning, use a softmax layer to perform classification deepening of the first-layer labels. The specific method is as follows:

Step 3.1: Build the BERT model and train it on the D5 training set;

Step 3.2: Process data set D4 so that each text $T_c$ to be processed is fixed to a uniform length $L_{max}$;

Step 3.3: Define a loop variable i and assign it the initial value 1;

Step 3.4: If $i \le \mathrm{len}(D4)$, go to step 3.5; otherwise go to step 3.9;

Step 3.5: Define $\mathrm{len}(T_i)$ as the length of the i-th text; if $\mathrm{len}(T_i) + 2 \le L_{max}$, pad the text with 0 and go to the next step; otherwise truncate the text to its first $L_{max}$ units and go to the next step;

Step 3.6: Set $i = i + 1$;

Step 3.7: Feed each text into the Token Embeddings layer of the BERT model, whose output is denoted V1; the Segment Embeddings layer and the Position Embeddings layer extract the text (segment) information and the position information, and their outputs are denoted V2 and V3;

Step 3.8: Add the three outputs V1, V2 and V3 to obtain the result V, take the vector V as the input of the BERT model, and obtain the word vector sequence $s_i = \{V(W_1), V(W_2), \ldots, V(W_f), \ldots, V(W_{L_{max}})\}$ from the neurons of the last layer, where $V(W_f)$ is the f-th vector representation of the combined text information;

Step 3.9: End the loop and output the word vector sequence $S = \{s_1, s_2, s_3, \ldots, s_f, \ldots, s_{\mathrm{len}(D3)}\}$;
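The length fixing of Step 3.5 and the embedding sum of Step 3.8 can be illustrated with plain Python lists. Several assumptions are made here: the "+2" is read as reserving room for the [CLS] and [SEP] tokens, truncation keeps the first $L_{max} - 2$ tokens so that the final sequence has exactly $L_{max}$ ids (the source only says that the front $L_{max}$ units are kept), and the special-token ids, function names and toy vectors are illustrative.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # assumed special-token ids


def fix_length(token_ids, l_max):
    """Pad with 0 or truncate so that the result has exactly l_max ids."""
    if len(token_ids) + 2 > l_max:
        token_ids = token_ids[: l_max - 2]      # keep the front of the text
    ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    return ids + [PAD_ID] * (l_max - len(ids))  # complement with 0


def sum_embeddings(v1, v2, v3):
    """Element-wise V = V1 + V2 + V3 for one sequence of vectors."""
    return [
        [a + b + c for a, b, c in zip(t, s, p)]
        for t, s, p in zip(v1, v2, v3)
    ]


short = fix_length([7, 8, 9], l_max=8)            # padded with 0
long_ = fix_length(list(range(1, 21)), l_max=8)   # truncated

# Toy 2-dimensional token, segment and position vectors for two positions.
V1 = [[0.5, 0.25], [1.0, 2.0]]
V2 = [[1.0, 1.0], [1.0, 1.0]]
V3 = [[0.0, 0.5], [0.5, 0.0]]
V = sum_embeddings(V1, V2, V3)
```

In a real BERT implementation the three embedding tables are learned parameters and the sum is computed inside the model; the sketch only shows the shape of the operation.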

Step 3.10: Use the softmax function to perform classification prediction on the vector sequence, obtaining the classification probability prediction vector $P = \{p_1, p_2, \ldots, p_g, \ldots, p_h\}$, where $p_g$ is the probability that the text belongs to the g-th class and h is the total number of classes;

Step 3.11: Find the maximum value in the vector P and output the result corresponding to that maximum, which is the label classification deepening result y.
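Steps 3.10 and 3.11 amount to a softmax followed by an arg-max over the h classes. A minimal sketch, in which the label names and the raw scores are hypothetical and the function names are not from the source:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deepen_label(logits, list5):
    """Return (y, P): the most probable label from list5 and the vector P."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return list5[best], probs


list5 = ["logistics", "software", "building materials"]  # hypothetical labels
logits = [0.2, 1.1, 3.4]                                  # hypothetical scores
y, P = deepen_label(logits, list5)
```

Here h equals len(list5), and y is the label whose probability $p_g$ is largest.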

Step 4: Integrate the enterprise name, enterprise profile and business scope information in data set D3, extract keywords with the TextRank, TF-IDF and LDA topic models respectively, process the extracted keywords, and use the processed words as the labels of the next, deeper layer. The specific method is as follows:

Step 4.1: The cleaned data set from step 1.5 is $D3 = \{T_1, T_2, \ldots, T_{b1}, \ldots, T_{\mathrm{len}(D3)}\}$ with $T = \{id, content1, content2, content3\}$, where id, content1, content2 and content3 are the enterprise serial number, enterprise name, enterprise profile and enterprise business scope, respectively;

Step 4.2: Define D6 as the data set to be integrated and len(D6) as the number of texts to be integrated in D6, with $D6 = \{T_1, T_2, \ldots, T_a, \ldots, T_{\mathrm{len}(D6)}\}$;

Step 4.3: Integrate the enterprise name, enterprise profile and business scope information; the integrated enterprise text is content4, so that $T1 = \{id, content4\}$ and $D7 = \{T1_1, T1_2, \ldots, T1_a, \ldots, T1_{\mathrm{len}(D7)}\}$, where T1 is a single integrated text and D7 is the integrated enterprise data set;

Step 4.4: Collect the words that distort the extraction result and build a stop-word dictionary;

Step 4.5: Build an enterprise dictionary by collecting the professional vocabulary of the enterprise domain;

Step 4.6: Use TextRank to extract keywords from all nouns in the integrated enterprise data set D7, obtaining the extraction result set K1;

Step 4.7: Then use TF-IDF to extract keywords from all nouns in the integrated enterprise data set D7, obtaining the extraction result set K2;

Step 4.8: Finally, use the LDA topic model to extract keywords from all nouns in the integrated enterprise data set D7, obtaining the extraction result set K3;

Step 4.9: Sort and merge the extracted keyword sets K1, K2 and K3 to obtain the keyword set $K = \{W_1, W_2, \ldots, W_i, \ldots, W_{\mathrm{len}(D7)}\}$, where $W_i$ is the keyword set of a single enterprise and $i < \mathrm{len}(D7)$;

Step 4.10: Use the extracted keywords $W_i$ as the labels of the next, deeper layer;

Step 4.11: Count the labels obtained and attach all labels to each enterprise according to their hierarchical relationship.
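The TF-IDF extraction of Step 4.7 and the merge of Step 4.9 can be sketched in pure Python. The sketch assumes that each integrated text has already been segmented into nouns, uses a smoothed inverse document frequency, and merges by taking the union in first-seen order; the stop words, the sample texts, and the TextRank and LDA outputs are hypothetical, and the source does not specify the merge rule.

```python
import math
from collections import Counter

STOP_WORDS = {"limited", "company", "and", "of"}   # hypothetical stop words


def tfidf_keywords(docs, top_k=3):
    """Return the top_k TF-IDF terms for each tokenised document."""
    docs = [[w for w in doc if w not in STOP_WORDS] for doc in docs]
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    n = len(docs)
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            w: (c / len(doc)) * math.log((1 + n) / (1 + df[w]))
            for w, c in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:top_k])
    return result


def merge_keywords(*ranked_lists):
    """Union of several ranked keyword lists, keeping first-seen order."""
    merged = []
    for ranked in ranked_lists:
        for word in ranked:
            if word not in merged:
                merged.append(word)
    return merged


# Hypothetical tokenised content4 texts from D7.
D7 = [
    ["software", "software", "development", "cloud", "services", "company"],
    ["cargo", "transport", "and", "warehousing", "services"],
    ["mobile", "app", "development", "limited"],
]
K2 = tfidf_keywords(D7, top_k=2)

K1_first = ["cloud", "services"]        # hypothetical TextRank output
K3_first = ["development", "software"]  # hypothetical LDA output
W1 = merge_keywords(K1_first, K2[0], K3_first)
```

For Chinese enterprise text, a word segmenter loaded with the enterprise dictionary of Step 4.5 would produce the token lists that this sketch takes as given.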

Step 5: Apply the label deepening method described above to an enterprise portrait system in order to improve the accuracy of the labels and of the portrait system. The specific method is as follows:

Step 5.1: The enterprise portrait system comprises a preprocessing module, a label classification deepening module, a keyword extraction deepening module, a label integration module and a portrait display module;

Step 5.2: Input the text of the enterprise to be deepened; the preprocessing module preprocesses the text to remove noise;

Step 5.3: Pass the preprocessed enterprise text to the label classification deepening module to perform label classification deepening;

Step 5.4: Integrate the enterprise name, enterprise profile and business scope information, and further enrich the label content in the keyword extraction deepening module;

Step 5.5: Integrate all deepened labels in the label integration module and attach all labels to the enterprise;

Step 5.6: Generate the enterprise portrait information and display the label information through the portrait display module.
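The way the modules of Step 5 hand data to one another can be sketched as a small pipeline. Every function here is a hypothetical stand-in: the classification stub replaces the trained BERT model with a keyword rule, the extraction stub replaces TextRank, TF-IDF and LDA with a dictionary lookup, and the display module is omitted. The layer names in the result are also assumptions.

```python
def preprocess(text):
    """Preprocessing module: collapse whitespace as a stand-in for denoising."""
    return " ".join(text.split())


def classify_label(text, fuzzy_label):
    """Label classification deepening module (stub).

    A trained BERT classifier would go here; this keyword rule is only a
    placeholder so that the pipeline runs end to end.
    """
    return "building materials" if "cement" in text else fuzzy_label


def extract_keywords(text, vocabulary):
    """Keyword extraction deepening module (stub): dictionary lookup."""
    return [word for word in vocabulary if word in text]


def build_portrait(enterprise, vocabulary):
    """Label integration module: assemble the labels by hierarchy level."""
    scope = preprocess(enterprise["content3"])
    merged = preprocess(" ".join(
        [enterprise["content1"], enterprise["content2"], scope]))
    return {
        "id": enterprise["id"],
        "layer0": enterprise["label"],                        # original label
        "layer1": classify_label(scope, enterprise["label"]),
        "layer2": extract_keywords(merged, vocabulary),
    }


enterprise = {
    "id": 1,
    "content1": "Hypothetical Trading Co.",
    "content2": "A regional supplier of cement   and steel.",
    "content3": "wholesale of cement,  steel and glass",
    "label": "wholesale",
}
portrait = build_portrait(enterprise, vocabulary=["cement", "steel", "timber"])
```

The resulting dictionary keeps the original fuzzy label alongside the two deeper layers, which is one possible way to preserve the hierarchical relationship mentioned in Step 4.11.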

Table 1 description of variables

The above description is only an example of the present invention and is not intended to limit the present invention. All equivalents which come within the spirit of the invention are therefore intended to be embraced therein. Details not described herein are well within the skill of those in the art.
