Evaluation expert recommendation method based on pictograph-semantic dual-feature space mapping

Document No.: 169364    Published: 2021-10-29

Reading note: this technology, "Evaluation expert recommendation method based on pictograph-semantic dual-feature space mapping", was created by 杨政尹春林朱华苏蒙潘侃 on 2021-08-10. Its main content is as follows. The application relates to the technical field of expert recommendation and provides a review expert recommendation method based on pictograph-semantic dual-feature space mapping. A RoBERTa model first produces a hierarchical representation of the text, and a Bi-LSTM + CRF model then performs named entity recognition on the project text and the expert text. The named entities are mapped to feature vectors through the pictograph-semantic dual feature space, and Euclidean distance and cosine similarity are computed on the feature vectors to obtain matching scores. The matching scores are combined by weighted summation into a comprehensive matching score, and the expert with the highest comprehensive matching score is taken as the review expert for the project text. The application proposes an entity matching strategy based on semantic-pictographic dual-feature space mapping that matches projects and experts effectively and accurately in an automated way, thereby reducing the labor cost of review work, improving the reliability of review results, and raising overall review efficiency.

1. A review expert recommendation method based on pictograph-semantic dual-feature space mapping, characterized by comprising the following steps:

acquiring abstract information of an electric power science and technology project application form;

performing named entity recognition on the abstract information of the electric power science and technology project application form to obtain electric power project entities, wherein the electric power project entities comprise a use method entity and a related field entity;

crawling personal homepage information of an electric power expert and abstract information of the expert's published papers;

performing named entity recognition on the personal homepage information of the electric power expert and the abstract information of the published papers to obtain electric power expert entities, wherein the electric power expert entities comprise a skilled technical entity and a research direction entity;

performing pictographic mapping on the use method entity to obtain a pictographic use method entity, and performing pictographic mapping on the related field entity to obtain a pictographic related field entity;

performing pictographic mapping on the skilled technical entity to obtain a pictographic skilled technical entity, and performing pictographic mapping on the research direction entity to obtain a pictographic research direction entity;

performing semantic mapping on the pictographic use method entity to obtain a use method feature vector, and performing semantic mapping on the pictographic related field entity to obtain a related field feature vector;

performing semantic mapping on the pictographic skilled technical entity to obtain a skilled technical feature vector, and performing semantic mapping on the pictographic research direction entity to obtain a research direction feature vector;

calculating a comprehensive matching score from the use method feature vector, the related field feature vector, the skilled technical feature vector and the research direction feature vector; and

determining the review expert according to the ranking of all the comprehensive matching scores.

2. The review expert recommendation method based on pictograph-semantic dual-feature space mapping according to claim 1, wherein named entity recognition is performed using a RoBERTa pre-trained model and a BiLSTM + CRF model.

3. The review expert recommendation method based on pictograph-semantic dual-feature space mapping according to claim 2, wherein the specific method for named entity recognition using the RoBERTa pre-trained model and the BiLSTM + CRF model is:

acquiring text information;

segmenting the text information into words to obtain a word set;

mapping the word set to vectors with the RoBERTa pre-trained model to obtain a word vector set; and

training the BiLSTM + CRF model on the word vector set to obtain the named entities of the text information.

4. The review expert recommendation method based on pictograph-semantic dual-feature space mapping according to claim 3, wherein the specific method for obtaining the research direction entity and the skilled technical entity is:

acquiring keywords of the review expert information;

crawling the expert's personal homepage information and the abstract information of the expert's published papers according to the keywords;

performing named entity recognition on the expert's personal homepage information with the RoBERTa pre-trained model and the BiLSTM + CRF model to obtain the research direction entity; and

performing named entity recognition on the abstract information of the published papers with the RoBERTa pre-trained model and the BiLSTM + CRF model to obtain the skilled technical entity.

5. The review expert recommendation method based on pictograph-semantic dual-feature space mapping according to claim 1, wherein the specific method for calculating the comprehensive matching score is:

performing Euclidean distance calculation on the related field feature vector and the research direction feature vector in the pictographic feature space to obtain a first pictographic matching score;

performing cosine similarity calculation on the related field feature vector and the research direction feature vector in the semantic feature space to obtain a first semantic matching score;

performing Euclidean distance calculation on the use method feature vector and the research method feature vector in the pictographic feature space to obtain a second pictographic matching score;

performing cosine similarity calculation on the use method feature vector and the research method feature vector in the semantic feature space to obtain a second semantic matching score;

summing the first pictographic matching score and the first semantic matching score to obtain a research direction matching score;

summing the second pictographic matching score and the second semantic matching score to obtain a research method matching score; and

performing weighted summation on the research direction matching score and the research method matching score to obtain the comprehensive matching score.

6. The review expert recommendation method based on pictograph-semantic dual-feature space mapping according to claim 5, characterized in that the Euclidean distance calculation is performed by the following formula (the formula itself is not reproduced in this text):

wherein D is the entity similarity score at the field (direction) level, F is the set of related field entities, R is the set of research direction entities, and the remaining four symbols of the formula denote, in order, the pictographic-space embedding of an entity in set F, the pictographic-space embedding of an entity in set R, the semantic-space embedding of an entity in set F, and the semantic-space embedding of an entity in set R.

7. The review expert recommendation method based on pictograph-semantic dual-feature space mapping according to claim 6, characterized in that the cosine similarity calculation is performed by the following formula (the formula itself is not reproduced in this text):

wherein T is the entity similarity score at the method (technology) level, O is the set of use method entities, L is the set of skilled technical entities, and the remaining four symbols of the formula denote, in order, the pictographic-space embedding of an entity in set O, the pictographic-space embedding of an entity in set L, the semantic-space embedding of an entity in set O, and the semantic-space embedding of an entity in set L.

8. The review expert recommendation method based on pictograph-semantic dual-feature space mapping according to claim 7, characterized in that the comprehensive matching score is calculated as follows:

score = k × D + (1 - k) × T

where score is the comprehensive matching score and k is the weight.

9. The review expert recommendation method based on pictograph-semantic dual-feature space mapping according to claim 8, wherein a greedy algorithm is used to determine the value of k.

10. The review expert recommendation method based on pictograph-semantic dual-feature space mapping according to claim 9, wherein the value of k is set to 0.3.

Technical Field

The application relates to the technical field of expert recommendation, in particular to a review expert recommendation method based on pictograph-semantic dual-feature space mapping.

Background

As the national power grid vigorously promotes theoretical innovation in areas such as ultra-high-voltage AC/DC power grids, smart grids and the third industrial revolution, the number of applications for innovative electric power science and technology projects has grown substantially and continues to rise.

As a result, reviewing electric power science and technology project applications has become a difficult and burdensome task: everything from the formal review to the content quality review must be completed with high quality and high efficiency. The most important step in the review process is the expert's assessment of the content quality of the application. An accurate evaluation can be obtained only when the technologies and fields the review expert has mastered match the content of the application, so the reliability of the evaluation result is directly tied to the degree of matching.

At present, however, review experts are mostly matched to project applications either by manual, random assignment or by recommendation from people with deep expertise. Because manual operation is inevitably subjective, the current way of matching review experts to project applications makes the labor cost of review work too high, the reliability of review results poor, and the overall review efficiency low.

Disclosure of Invention

To overcome the defects of the prior art, the application aims to provide a review expert recommendation method based on pictograph-semantic dual-feature space mapping, so as to solve at least one of the following technical problems: the excessive labor cost of review work, the weak reliability of review results, and the low overall review efficiency.

To achieve the above object, the present application provides a review expert recommendation method based on pictograph-semantic dual-feature space mapping, which specifically includes the following.

Abstract information of an electric power science and technology project application form is acquired.

Named entity recognition is performed on the abstract information of the electric power science and technology project application form to obtain electric power project entities, which comprise a use method entity and a related field entity.

Personal homepage information of an electric power expert and abstract information of the expert's published papers are crawled.

Named entity recognition is performed on the personal homepage information of the electric power expert and the abstract information of the published papers to obtain electric power expert entities, which comprise a skilled technical entity and a research direction entity.

Pictographic mapping is performed on the use method entity to obtain a pictographic use method entity, and on the related field entity to obtain a pictographic related field entity.

Pictographic mapping is performed on the skilled technical entity to obtain a pictographic skilled technical entity, and on the research direction entity to obtain a pictographic research direction entity.

Semantic mapping is performed on the pictographic use method entity to obtain a use method feature vector, and on the pictographic related field entity to obtain a related field feature vector.

Semantic mapping is performed on the pictographic skilled technical entity to obtain a skilled technical feature vector, and on the pictographic research direction entity to obtain a research direction feature vector.

A comprehensive matching score is calculated from the use method feature vector, the related field feature vector, the skilled technical feature vector and the research direction feature vector.

The review expert is determined according to the ranking of all the comprehensive matching scores.

Further, named entity recognition is performed with a RoBERTa pre-trained model and a BiLSTM + CRF model.

Further, the specific method for named entity recognition with the RoBERTa pre-trained model and the BiLSTM + CRF model comprises the following steps.

Text information is acquired.

The text information is segmented into words to obtain a word set.

The word set is mapped to vectors with the RoBERTa pre-trained model to obtain a word vector set.

The BiLSTM + CRF model is trained on the word vector set to obtain the named entities of the text information.

Further, the specific method for obtaining the research direction entity and the skilled technical entity comprises the following steps.

Keywords of the review expert information are acquired.

The expert's personal homepage information and the abstract information of the expert's published papers are crawled according to the keywords.

Named entity recognition is performed on the expert's personal homepage information with the RoBERTa pre-trained model and the BiLSTM + CRF model to obtain the research direction entity.

Named entity recognition is performed on the abstract information of the published papers with the RoBERTa pre-trained model and the BiLSTM + CRF model to obtain the skilled technical entity.

Further, the specific method for calculating the comprehensive matching score is as follows.

Euclidean distance calculation is performed on the related field feature vector and the research direction feature vector in the pictographic feature space to obtain a first pictographic matching score.

Cosine similarity calculation is performed on the related field feature vector and the research direction feature vector in the semantic feature space to obtain a first semantic matching score.

Euclidean distance calculation is performed on the use method feature vector and the research method feature vector in the pictographic feature space to obtain a second pictographic matching score.

Cosine similarity calculation is performed on the use method feature vector and the research method feature vector in the semantic feature space to obtain a second semantic matching score.

The first pictographic matching score and the first semantic matching score are summed to obtain a research direction matching score.

The second pictographic matching score and the second semantic matching score are summed to obtain a research method matching score.

Weighted summation is performed on the research direction matching score and the research method matching score to obtain the comprehensive matching score.

Further, the Euclidean distance calculation is performed by the following formula (the formula itself is not reproduced in this text):

wherein D is the entity similarity score at the field (direction) level, F is the set of related field entities, R is the set of research direction entities, and the remaining four symbols of the formula denote, in order, the pictographic-space embedding of an entity in set F, the pictographic-space embedding of an entity in set R, the semantic-space embedding of an entity in set F, and the semantic-space embedding of an entity in set R.

Further, the cosine similarity calculation is performed by the following formula (likewise not reproduced in this text):

Further, the comprehensive matching score is calculated as follows:

score = k × D + (1 - k) × T

where score is the comprehensive matching score, D and T are the entity similarity scores at the field (direction) level and the method (technology) level respectively, and k is the weight.

Further, a greedy algorithm is used to determine the value of k.

Further, the value of k is set to 0.3.

In the method, a RoBERTa pre-trained model first produces a hierarchical representation of the text, and a Bi-LSTM + CRF model then performs named entity recognition on the electric power project text and the electric power expert text. The named entities are mapped to feature vectors through the pictograph-semantic dual feature space, Euclidean distance and cosine similarity are computed on the resulting feature vectors to obtain the matching scores, and the matching scores are combined by weighted summation into a comprehensive matching score. Finally, the expert with the highest comprehensive matching score is taken as the review expert for the electric power project text. The application thus provides a strategy for matching project text entities with domain expert entities based on semantic-pictographic dual-feature space mapping. It matches projects and domain experts effectively and accurately in an automated way, which reduces the labor cost of review work, improves the reliability of review results, and raises overall review efficiency.

Drawings

In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic flowchart of a review expert recommendation method based on pictograph-semantic dual-feature space mapping according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of a method for calculating the comprehensive matching score according to an embodiment of the present application;

FIG. 3(a) is a schematic diagram of the recognition result for the related field entities of an application form according to an embodiment of the present application;

FIG. 3(b) is a schematic diagram of the recognition result for the use method entities of an application form according to an embodiment of the present application;

FIG. 4 is a schematic diagram of the heterogeneous matching process based on dual-feature space mapping according to an embodiment of the present application;

FIG. 5 is a schematic diagram comparing the matching effects of electric power project entities and electric power expert entities according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be fully and clearly described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In order to facilitate understanding of the technical solutions of the present application, some concepts related to the present application are first described below.

The RoBERTa model is an improved Chinese pre-trained model based on BERT. Compared with the original BERT, RoBERTa enlarges the batch size, introduces a dynamic masking mechanism, expands the training corpus, and removes the NSP (next sentence prediction) term from the loss function. Specifically, the batch size is increased from 256 to 8000; 10 different masking patterns are used, so that samples in different epochs are not always masked in the same fixed way; and the training data grows from 13 GB to 160 GB.
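The dynamic masking idea can be illustrated with a small sketch. This is not the RoBERTa implementation: the 15% masking rate, the `[MASK]` token string and the function name `dynamic_mask` are assumptions made for illustration. The only point is that a fresh mask is sampled for every epoch instead of being fixed once during preprocessing.

```python
import random

MASK = "[MASK]"  # assumed mask token string


def dynamic_mask(tokens, rng, rate=0.15):
    """Return a copy of tokens with a freshly sampled subset replaced by MASK.

    rate=0.15 is an assumed masking rate, not a value taken from the text.
    """
    if not tokens:
        return []
    n_mask = max(1, int(len(tokens) * rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


tokens = "the transformer fault is located by the pulse current method".split()
rng = random.Random(0)
# A new mask pattern is sampled for each of the 10 epochs;
# the original token list is never modified.
epochs = [dynamic_mask(tokens, rng) for _ in range(10)]
```

With static masking, the same positions would be hidden in every epoch; here each call draws its own positions.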

Specifically, the input of the RoBERTa model is composed of a word vector, a sentence vector and a position vector. The word vector includes the encoding vectors of the classification symbol and of the separator; the sentence vector is an encoding vector used to distinguish different sentences; the position vector is an encoding vector of the position of each word in the sentence. The model output is the word embedding matrix obtained after all words of the sentence have been encoded by the self-attention encoder.

The recurrent neural network (RNN) is the most widely used neural network for sequence learning tasks. The bidirectional long short-term memory network (Bi-LSTM) is a variant of the RNN with bidirectional temporal characteristics and a special gating structure, and it effectively mitigates the vanishing and exploding gradient problems.

Conditional Random Fields (CRF) are a commonly used sequence labeling algorithm.

Matching experts to project applications can be regarded as heterogeneous data matching: data from different sources is first optimized by preprocessing and then matched against each other to obtain a reasonable output. Preprocessing the project application data relies on named entity recognition, that is, identifying and classifying proper names in the text such as the names of people, places and organizations. Preprocessing the expert data is done by extracting it from web pages with crawler technology. Once the data on both sides has been processed, heterogeneous matching can produce a more accurate and efficient result than manual recommendation.

Specifically, in identifying the named entities of the power project application, the following concepts are first defined:

(1) Use method entity: a method used in the application, for example the zero-sequence harmonic component principle, the thermal step current method, or the electromagnetic coupling principle.

(2) Related field entity: a field to which the application relates, for example reactive power compensation, power transformation engineering, or economical power transmission.

Referring to fig. 1, a schematic flow chart of a review expert recommendation method based on pictograph-semantic dual feature space mapping provided in an embodiment of the present application is shown. The embodiment of the application provides a review expert recommendation method based on pictograph-semantic dual-feature space mapping, which specifically comprises the following steps:

Step S1: Acquire abstract information of the electric power science and technology project application form.

Step S2: Perform named entity recognition on the abstract information of the electric power science and technology project application form to obtain electric power project entities, which comprise a use method entity and a related field entity.

Step S3: Crawl the personal homepage information of an electric power expert and the abstract information of the expert's published papers.

Step S4: Perform named entity recognition on the personal homepage information of the electric power expert and the abstract information of the published papers to obtain electric power expert entities, which comprise a skilled technical entity and a research direction entity.

Further, named entity recognition is performed with a RoBERTa pre-trained model and a BiLSTM + CRF model. Specifically, the feature extractor of the RoBERTa model is a bidirectional Transformer. Each Transformer unit consists of a self-attention layer, a feed-forward network, and a residual-and-normalization layer (Add & Normalization); this structure makes full use of context information and captures longer-range dependencies.
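To show why self-attention can relate distant positions directly, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: the query, key and value projections are taken to be the identity (Q = K = V = x), there is a single head, and the feed-forward and Add & Normalization sublayers are omitted. The names `softmax` and `self_attention` are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with identity projections.

    x is a list of d-dimensional vectors, one per position. Every output
    vector is a weighted average of ALL input vectors, so position 0 can
    attend to the last position in a single step.
    """
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = self_attention(x)
```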

In the embodiment of the present application, the BiLSTM model outputs a matrix of scores for each word and each label, called the "emission matrix" A. Specifically, the hidden-layer vector of each word is passed through a linear layer (the final classification step of the BiLSTM, which maps the hidden state to scores), and the resulting values are the scores of that word for each label.

In addition, the embodiment of the present application uses a linear-chain CRF model to learn the dependencies among the labels in the sequence, that is, to predict the label sequence corresponding to the input sequence.
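How an emission matrix and label-to-label dependencies combine at prediction time can be sketched with Viterbi decoding, the standard way to find the best label sequence in a linear-chain CRF. The sketch below is illustrative only: the label set, the emission scores and the transition scores are made-up values, and a trained model would learn them.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label index sequence.

    emissions[t][j]:   score of label j at position t (the emission matrix).
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]


LABELS = ["O", "B-METHOD", "I-METHOD"]  # hypothetical label set
# O -> I-METHOD is heavily penalised: an I- tag must follow a B- or I- tag.
transitions = [
    [0.0, 0.0, -10.0],  # from O
    [0.0, -1.0, 1.0],   # from B-METHOD
    [0.0, -1.0, 1.0],   # from I-METHOD
]
emissions = [
    [2.0, 0.5, 0.1],  # word 1
    [0.2, 1.0, 1.2],  # word 2: I-METHOD scores highest in isolation
    [0.1, 0.3, 2.0],  # word 3
]
tags = [LABELS[i] for i in viterbi(emissions, transitions)]
```

Picking the best label per word independently would tag word 2 as `I-METHOD` right after an `O`, which is not a valid BIO sequence. With the transition scores taken into account, the decoded sequence is `O, B-METHOD, I-METHOD`.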

Furthermore, the specific method for named entity recognition with the RoBERTa pre-trained model and the BiLSTM + CRF model comprises the following steps:

Step S411: Acquire text information.

Step S412: Segment the text information into words to obtain a word set.

Step S413: Map the word set to vectors with the RoBERTa pre-trained model to obtain a word vector set.

Step S414: Train the Bi-LSTM + CRF model on the word vector set to obtain the named entities of the text information.

Specifically, in the embodiment of the present application, the original input is encoded by the RoBERTa model, which outputs word vectors; the word vectors are used as the input of the BiLSTM + CRF model, and the named entities are then obtained from the BiLSTM + CRF model.

Furthermore, the specific method for obtaining the research direction entity and the skilled technical entity is as follows:

Step S421: Acquire keywords of the review expert information.

Step S422: Crawl the expert's personal homepage information and the abstract information of the expert's published papers according to the keywords.

Step S423: Perform named entity recognition on the expert's personal homepage information with the RoBERTa pre-trained model and the BiLSTM + CRF model to obtain the research direction entity.

Step S424: Perform named entity recognition on the abstract information of the published papers with the RoBERTa pre-trained model and the BiLSTM + CRF model to obtain the skilled technical entity.

Specifically, if the number of electric power expert texts is small, the research direction entities of the crawled expert texts may be screened manually. Moreover, because the research method entities of the electric power expert texts are comparable to the use method entities of the electric power projects, the embodiment of the present application identifies the research method entities of the expert texts with the same model that identifies the use method entities of the project texts.

Step S5: Perform pictographic mapping on the use method entity to obtain a pictographic use method entity, and on the related field entity to obtain a pictographic related field entity.

Step S6: Perform pictographic mapping on the skilled technical entity to obtain a pictographic skilled technical entity, and on the research direction entity to obtain a pictographic research direction entity.

Step S7: Perform semantic mapping on the pictographic use method entity to obtain a use method feature vector, and on the pictographic related field entity to obtain a related field feature vector.

Step S8: Perform semantic mapping on the pictographic skilled technical entity to obtain a skilled technical feature vector, and on the pictographic research direction entity to obtain a research direction feature vector.

Step S9: Calculate a comprehensive matching score from the use method feature vector, the related field feature vector, the skilled technical feature vector and the research direction feature vector.

Further, FIG. 2 is a schematic flowchart of the method for calculating the comprehensive matching score according to the embodiment of the present application. In the embodiment of the present application, the specific method for obtaining the comprehensive matching score is as follows:

Step S91: Perform Euclidean distance calculation on the related field feature vector and the research direction feature vector in the pictographic feature space to obtain a first pictographic matching score.

Further, the Euclidean distance calculation is performed by the following formula (the formula itself is not reproduced in this text):

Step S92: Perform cosine similarity calculation on the related field feature vector and the research direction feature vector in the semantic feature space to obtain a first semantic matching score.

Further, the cosine similarity calculation is performed by the following formula (likewise not reproduced in this text):

Step S93: Perform Euclidean distance calculation on the use method feature vector and the research method feature vector in the pictographic feature space to obtain a second pictographic matching score.

Step S94: Perform cosine similarity calculation on the use method feature vector and the research method feature vector in the semantic feature space to obtain a second semantic matching score.

Step S95: Sum the first pictographic matching score and the first semantic matching score to obtain a research direction matching score.

Step S96: Sum the second pictographic matching score and the second semantic matching score to obtain a research method matching score.

Step S97: Perform weighted summation on the research direction matching score and the research method matching score to obtain the comprehensive matching score.

Further, the comprehensive matching score is calculated as follows:

score = k × D + (1 - k) × T

In the formula, score is the comprehensive matching score, and k is a hyper-parameter, namely the weight that represents the importance of matching at the field (direction) level.
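Steps S91 to S97 and the weighted sum can be sketched as follows. Because the formulas for D and T are not reproduced in this text, two parts of the sketch are assumptions rather than the method of the application: the Euclidean distance d is turned into a matching score with 1 / (1 + d), and the score over two entity sets is the average, over the project entities, of the best-matching expert entity. The dictionary layout with `pic` and `sem` keys and all function names are likewise hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pair_score(p, e):
    """Pictographic score plus semantic score for one project/expert entity pair.

    ASSUMPTION: the distance d becomes a similarity via 1 / (1 + d).
    """
    pictographic = 1.0 / (1.0 + euclidean(p["pic"], e["pic"]))
    semantic = cosine(p["sem"], e["sem"])
    return pictographic + semantic


def level_score(project_entities, expert_entities):
    """ASSUMPTION: average over project entities of the best expert match."""
    best = [max(pair_score(p, e) for e in expert_entities) for p in project_entities]
    return sum(best) / len(best)


def composite_score(d_score, t_score, k=0.3):
    """score = k * D + (1 - k) * T, with k = 0.3 as in the embodiment."""
    return k * d_score + (1 - k) * t_score


# Toy embeddings: "pic" is the pictographic vector, "sem" the semantic vector.
fields = [{"pic": [0.0, 0.0], "sem": [1.0, 0.0]}]
directions = [
    {"pic": [0.0, 0.0], "sem": [1.0, 0.0]},
    {"pic": [3.0, 4.0], "sem": [0.0, 1.0]},
]
methods = [{"pic": [0.0, 0.0], "sem": [1.0, 0.0]}]
techniques = [{"pic": [3.0, 4.0], "sem": [1.0, 0.0]}]

D = level_score(fields, directions)    # field (direction) level
T = level_score(methods, techniques)   # method (technology) level
score = composite_score(D, T)
```

In this toy data the project field coincides with the expert's first research direction in both spaces, so D reaches its maximum of 2 under the assumed scoring, while the method entities agree only in the semantic space.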

Further, in the embodiment of the present application, a greedy algorithm is used to determine the value of k. After repeated verification, k = 0.3 was found to be the most suitable value and is used in the embodiment.
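The text does not describe the greedy algorithm itself, so the following is only one possible reading: a greedy hill climb over a grid of candidate k values, where each step moves to the better neighbouring value and the search stops when neither neighbour improves. The `objective` function is hypothetical; in practice it would be some validation metric of the recommendation quality obtained with a given k.

```python
def greedy_k(objective, start=5, steps=10):
    """Greedy hill climbing over k in {0/steps, 1/steps, ..., steps/steps}.

    objective(k) is a hypothetical validation metric (higher is better).
    """
    i = start
    best = objective(i / steps)
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j <= steps]
        j = max(neighbours, key=lambda n: objective(n / steps))
        value = objective(j / steps)
        if value <= best:
            return i / steps  # no neighbour improves: stop here
        i, best = j, value


# Toy objective peaking at k = 0.3, standing in for a validation metric.
k = greedy_k(lambda k: -(k - 0.3) ** 2)
```

A greedy search of this kind only finds a local optimum, which is one reason the value would still need the repeated verification mentioned above.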

Step S10: Determine the review expert according to the ranking of all the comprehensive matching scores. Specifically, the final comprehensive matching scores are sorted in descending order, and the expert with the highest comprehensive matching score is selected as the review expert for the electric power science and technology project application.
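Step S10 amounts to a descending sort followed by taking the top entry. A minimal sketch, in which the expert names and scores are made up and the `top_n` parameter is an illustrative extension for recommending more than one expert:

```python
def recommend(scores, top_n=1):
    """Sort experts by comprehensive matching score, highest first, keep top_n."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]


scores = {"expert_A": 1.12, "expert_B": 1.42, "expert_C": 0.87}  # made-up scores
reviewer = recommend(scores)
```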

The review expert recommendation method based on pictograph-semantic dual-feature space mapping provided by the embodiment of the present application is explained in detail below through a specific embodiment.

In a specific embodiment of the present application, for the electric power project text data, 2000 documents are selected from an electric power science and technology project declaration database as the corpus. The research topics mainly include high voltage and insulation technology, electrical machines and apparatus, and power systems and automation. The abstracts of the project applications are segmented into words, stop words are removed, and the named entities are labeled. Because the method of this specific embodiment does not handle long sequences well, the abstracts are split into sentences at full stops. The preprocessed data set is also constructed so that the ratio of sentences that contain the required named entity labels to sentences that do not is 8:1, with about 10000 sentences in total.
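The sentence splitting and the 8:1 ratio can be sketched as below. This is an illustration under assumptions, not the preprocessing code of the embodiment: splitting on both the Chinese and the Western full stop is an assumption (and would also break decimals such as "0.3"), and the random down-sampling of entity-free sentences with a fixed seed is one simple way to reach the ratio.

```python
import random
import re


def split_sentences(abstract):
    """Split at Chinese or Western full stops, dropping empty pieces."""
    return [s.strip() for s in re.split(r"[。.]", abstract) if s.strip()]


def balance(with_entity, without_entity, ratio=8, seed=0):
    """Keep all entity sentences and about 1/ratio as many entity-free ones."""
    keep = min(len(with_entity) // ratio, len(without_entity))
    rng = random.Random(seed)
    return with_entity + rng.sample(without_entity, keep)


sentences = split_sentences("A pulse current method is used. The fault is located.")
dataset = balance([f"pos{i}" for i in range(16)], [f"neg{i}" for i in range(5)])
```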

For data set division, in the specific embodiment of the present application, the 10000 electric power project texts are divided into a training set, a validation set, and a test set at a ratio of 8:1:1.
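The 8:1:1 division can be sketched as follows. The shuffling, the fixed random seed, and the function name `split_dataset` are assumptions; the text states only the ratio.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=42):
    """Shuffle the items and split them into train/validation/test by ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Integers stand in for the 10000 sentences.
train, val, test = split_dataset(range(10000))
```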

For data labeling, the embodiment of the present application adopts the classic BIO three-tag labeling scheme: for each entity, the first word is labeled "B-entity name", each subsequent word is labeled "I-entity name", and words that do not belong to any required entity are labeled O.
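The BIO scheme can be illustrated with a short sketch. The label name `METHOD`, the token list, and the span format (start index, end index exclusive, label) are hypothetical choices made for this example.

```python
def bio_tags(tokens, entities):
    """Produce BIO tags for a token list.

    entities: list of (start, end, label) token spans, end exclusive.
    """
    tags = ["O"] * len(tokens)  # tokens outside any required entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"  # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # subsequent words of the entity
    return tags


# "脉冲 电流 法" (pulse current method) is marked as one use method entity.
tokens = ["采用", "脉冲", "电流", "法", "检测", "局部", "放电"]
entities = [(1, 4, "METHOD")]
tags = bio_tags(tokens, entities)
```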

In the word embedding module based on hierarchical representation, the specific embodiment of the present application uses a pre-trained RoBERTa model to map words into 1024-dimensional vectors, which are then introduced into the training of the named entity recognition model BiLSTM + CRF.

Specifically, fig. 3(a) and fig. 3(b) are schematic diagrams of the recognition results for the use method entities and the related field entities of the project applications, respectively, provided by the embodiment of the present application. As can be seen from fig. 3(a), reasonable use method entities such as the pulse current method, the equivalent circuit mathematical model, and ensemble learning are identified. As can be seen from fig. 3(b), related field entities such as transformer overhaul, power transformation project, and clean power sharing are identified. In summary, after the RoBERTa pre-trained model is added to the BiLSTM + CRF model, the embodiment of the present application can effectively extract the relevant entities from the electric power project text.

For the electric power expert text data, to collect the experts' relevant entities, the specific embodiment of the present application selects 8 laboratories under 3 major laboratories of a university's school of electrical engineering. From each laboratory, one professor (doctoral supervisor), one associate professor (doctoral supervisor), and one associate professor (master's supervisor) are selected, giving 24 experts in total for information extraction. The whole process is divided into crawling the adept technical entities and crawling the research direction entities, as follows:

(1) Search CNKI (China National Knowledge Infrastructure) using the expert's name and school as keywords, crawl the abstracts of the published articles, perform named entity recognition, and extract the methods used in the published articles as the expert's adept technical entities.

(2) Crawl the research directions listed on the expert's homepage using crawler technology, perform word segmentation, and then screen for research direction entities (the workload of this part is small, so manual screening is adopted). The screening results are taken as the expert's research direction entities. Part of the screening results is shown in table 1.

Table 1. Entity screening results for expert data (partial)

As can be seen from table 1, the research direction entities from the expert homepages and the use method entities from the published papers are, to a certain extent, comparable with the two types of entities from the electric power science and technology project applications, which provides a basis for the subsequent entity matching process.

Through the above processing, the preprocessing results for the two kinds of heterogeneous data, namely the electric power science and technology project applications and the expert data, are obtained: four types of entities that are related to one another. The embodiment of the present application then uses pictograph-semantic dual-feature space matching to match the four types of entities. The specific matching process is shown in fig. 4.

As can be seen from FIG. 4, to match these four types of entities, the embodiment of the present application adopts stroke-based pictographic space mapping and sequence-information-based semantic space mapping to map the entities into feature vectors. In the matching process, the related field entities of the electric power science and technology project application are compared for similarity with the experts' research direction entities at one level, and the use method entities of the project application are compared with the experts' adept technical entities at another level. The specific process is as follows:

(1) Map the four types of entities into 512-dimensional feature vectors through the cw2vec model at the pictographic level and the RoBERTa model at the semantic level, respectively.

(2) Perform Euclidean distance and cosine similarity calculations over all pairings of the related field entities with the research direction entities, and of the use method entities with the adept technical entities, in the pictographic feature space and the semantic feature space respectively, and take the sum of the highest values of the two as the entity matching score.

(3) Perform a weighted combination of the matching scores between the electric power science and technology project application and the expert at the domain (direction) level and the method (technology) level to obtain the comprehensive matching score. The matching score weight at the domain (direction) level is set to 0.3, and the weight at the method (technology) level is set to 0.7.

(4) Take the expert with the highest comprehensive matching score as the review expert for the electric power science and technology project application.
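Step (2) above can be sketched in plain Python under several stated assumptions. The text does not say how a Euclidean distance (smaller is closer) is combined with a cosine similarity (larger is closer), so this sketch converts the distance d into a similarity 1 / (1 + d). It also adopts one possible reading of "the sum of the highest values of the two": within each feature space, the best Euclidean-based value and the best cosine value over all entity pairs are added, and the two spaces are then summed. The function names, the dictionary keys, and the 2-dimensional toy vectors (standing in for the 512-dimensional ones) are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def euclidean_similarity(u, v):
    """Assumption: map Euclidean distance d to a similarity 1 / (1 + d)."""
    return 1.0 / (1.0 + math.dist(u, v))


def space_score(project_vecs, expert_vecs):
    """Best Euclidean-based value plus best cosine value over all pairs."""
    pairs = [(p, e) for p in project_vecs for e in expert_vecs]
    best_euclidean = max(euclidean_similarity(p, e) for p, e in pairs)
    best_cosine = max(cosine_similarity(p, e) for p, e in pairs)
    return best_euclidean + best_cosine


def entity_match_score(project, expert):
    """Sum the per-space scores over the pictographic and semantic spaces.

    project / expert: {"pictograph": [vectors], "semantic": [vectors]}.
    """
    return sum(
        space_score(project[space], expert[space])
        for space in ("pictograph", "semantic")
    )


# Toy 2-dimensional vectors for one project and two experts.
project = {"pictograph": [[1.0, 0.0]], "semantic": [[0.0, 1.0]]}
expert_same = {"pictograph": [[1.0, 0.0]], "semantic": [[0.0, 1.0]]}
expert_far = {"pictograph": [[0.0, 1.0]], "semantic": [[1.0, 0.0]]}

score_same = entity_match_score(project, expert_same)
score_far = entity_match_score(project, expert_far)
```

The score would be computed once for the domain (direction) level and once for the method (technology) level, and the two results would then be weighted as in step (3).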

In order to verify the effectiveness of the pictograph-semantic dual-feature space mapping matching algorithm, three groups of comparison experiments are performed in the specific embodiment of the present application: semantic space mapping + cosine similarity matching, pictographic space mapping + cosine similarity matching, and pictograph-semantic dual-feature space mapping + cosine similarity matching. The matching results for the electric power project entities and the electric power expert entities are shown in fig. 5.

As can be seen from fig. 5, the embodiment of the present application performs multi-scale representation learning on the preprocessed heterogeneous data, namely the electric power science and technology project applications and the related electric power experts, and the accuracy of matching the 2000 electric power project documents with the 24 electric power expert texts reaches a maximum of 0.85. The result shows that the pictographic space and the semantic space capture the pictographic-level and semantic-level information of words, that the two feature spaces are strongly complementary, and that entity matching is more thorough than when the entity information is mapped through a single feature space.

In summary, compared with the prior art, the embodiment of the present application has the following features:

(1) The ideas of named entity recognition and entity matching are applied to heterogeneous data, achieving an end-to-end matching method in which the whole process requires no manual participation.

(2) The pre-trained RoBERTa model is introduced into the training of the named entity recognition model BiLSTM + CRF, which greatly improves training efficiency and accuracy.

(3) In entity matching, the idea of pictograph-semantic dual-feature space matching is introduced, achieving a more accurate matching effect.

(4) The method generalizes well and can be used for expert recommendation for project applications in other industries, as long as the corresponding documents are provided.

The present application provides a review expert recommendation method based on pictograph-semantic dual-feature space mapping, which specifically comprises the following steps:

Acquiring abstract information of the electric power science and technology project application.

Crawling the personal homepage information of the electric power experts and the abstract information of their published papers.

Determining the review expert according to the ranking of all comprehensive matching scores.

According to the above technical scheme, the review expert recommendation method based on pictograph-semantic dual-feature space mapping first uses a pre-trained RoBERTa model to represent the texts hierarchically, and then uses a Bi-LSTM + CRF model to perform named entity recognition on the electric power project texts and the electric power expert texts. The named entities are then mapped into feature vectors through the pictograph-semantic dual-feature space, and Euclidean distance and cosine similarity are calculated on the feature vectors to obtain the matching scores. The matching scores are combined by weighted summation to obtain the comprehensive matching score, and finally the expert with the highest comprehensive matching score is taken as the review expert for the electric power project text. The present application proposes an entity matching strategy for project texts and domain experts based on semantic-pictographic dual-feature space mapping, which intelligently achieves effective and accurate matching between projects and domain experts. This reduces the labor cost of review work, enhances the reliability of review results, and improves overall review efficiency.

The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the presently disclosed embodiments and implementations thereof without departing from the spirit and scope of the present disclosure, and these fall within the scope of the present disclosure. The protection scope of this application is subject to the appended claims.
