Text representation vector generation method and device, storage medium and electronic equipment

Document No.: 272980  Publication date: 2021-11-19

Reading note: This technique, "Text representation vector generation method and device, storage medium and electronic equipment" (一种文本表示向量生成方法、装置、存储介质及电子设备), was created by Li Rumei, Yan Yuanmeng, Wang Sirui, Zhang Fuzheng, and Wu Wei on 2021-07-15. Abstract: The application provides a text representation vector generation method and device, a storage medium, and electronic equipment. It relates to the technical field of information processing and aims to generate text representation vectors efficiently and with high quality. The method comprises: acquiring a text to be processed; converting the text to be processed into a corresponding encoding vector; and inputting the encoding vector into a text representation vector generation model to obtain the representation vector corresponding to the text to be processed. The text representation vector generation model is obtained by training a preset model, and its training data comprises at least: encoding vectors corresponding to sample texts and their enhanced encoding vectors, where two encoding vectors derived from the same sample text are mutually positive samples, and two encoding vectors derived from different sample texts are mutually negative samples.

1. A method for generating text representation vectors, the method comprising:

acquiring a text to be processed;

converting the text to be processed into a corresponding encoding vector;

inputting the encoding vector into a text representation vector generation model to obtain a representation vector corresponding to the text to be processed;

wherein the text representation vector generation model is obtained by training a preset model, and the training data of the text representation vector generation model comprises at least: encoding vectors corresponding to sample texts and their enhanced encoding vectors, wherein two encoding vectors derived from the same sample text are mutually positive samples, and two encoding vectors derived from different sample texts are mutually negative samples.

2. The method of claim 1, wherein the training data of the text representation vector generation model comprises at least one piece of training data obtained by:

obtaining a sample text and converting the sample text into a corresponding encoding vector;

performing enhancement processing on the encoding vector corresponding to the sample text to obtain at least one enhanced encoding vector;

forming an encoding vector pair from the encoding vector corresponding to the sample text and its enhanced encoding vector, to obtain a piece of training data.

3. The method of claim 2, wherein for each sample text, the enhancement processing performed on the encoding vector corresponding to the sample text includes at least one of:

for each sample text, shuffling the position numbers in the encoding vector that indicate where each word occurs in the sample text;

for each sample text, randomly deleting entire rows or entire columns of elements in the encoding vector corresponding to the sample text;

for each sample text, randomly deleting at least one element in the encoding vector corresponding to the sample text.

4. A method according to any one of claims 1 to 3, wherein training the text representation vector generation model comprises at least the following steps:

respectively inputting the encoding vector corresponding to the sample text and its enhanced encoding vector into the preset model, to obtain two predicted representation vectors output by the preset model;

determining a contrastive learning loss function value according to the similarity between two predicted representation vectors based on the same sample text and the difference between two predicted representation vectors based on different sample texts;

training the preset model based on the contrastive learning loss function value, with the objectives of maximizing the similarity between two predicted representation vectors based on the same sample text and maximizing the difference between two predicted representation vectors based on different sample texts.

5. The method of claim 1, wherein the training data of the text representation vector generation model further comprises training data obtained by:

acquiring a sample text and its associated text, wherein the associated text carries a pre-annotated label indicating whether its meaning is the same as that of the sample text;

respectively converting the sample text and the associated text into corresponding encoding vectors;

forming an encoding vector pair from the encoding vectors corresponding to the sample text and an associated text with the same meaning, to obtain a piece of positive-sample training data;

forming an encoding vector pair from the encoding vectors corresponding to the sample text and an associated text with a different meaning, to obtain a piece of negative-sample training data.

6. The method of claim 5, wherein the training process of the text representation vector generation model further comprises the steps of:

respectively inputting the encoding vectors corresponding to the sample text and the associated text into the preset model, to obtain two predicted representation vectors output by the preset model;

determining a supervised loss function value according to the similarity between the two predicted representation vectors output by the preset model and the label carried by the associated text;

training the preset model based on the supervised loss function value.

7. The method of claim 6, wherein after determining the supervised loss function value, the method further comprises:

generating an adversarial perturbation according to the supervised loss function value;

adding the adversarial perturbation to the encoding vector corresponding to the sample text to obtain an enhanced encoding vector.

8. The method according to claim 1, wherein after obtaining the representation vector corresponding to the text to be processed, the method further comprises:

determining the semantic similarity of two texts to be processed according to the similarity between their respective representation vectors; or

comparing the representation vector corresponding to the text to be processed with the representation vectors corresponding to the texts in a text library, to output a retrieval result corresponding to the text to be processed.

9. An apparatus for generating text representation vectors, the apparatus comprising:

the acquisition module is used for acquiring a text to be processed;

the conversion module is used for converting the text to be processed into a corresponding encoding vector;

the generating module is used for inputting the encoding vector into a text representation vector generation model to obtain a representation vector corresponding to the text to be processed;

wherein the text representation vector generation model is obtained by training a preset model, and the training data of the text representation vector generation model comprises at least: encoding vectors corresponding to sample texts and their enhanced encoding vectors, wherein two encoding vectors derived from the same sample text are mutually positive samples, and two encoding vectors derived from different sample texts are mutually negative samples.

10. A computer-readable storage medium having stored thereon a computer program which, when executed, implements the text representation vector generation method according to any one of claims 1-8.

11. An electronic device, comprising:

a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor implements the text representation vector generation method according to any one of claims 1-8 when executing the program.

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a text representation vector generation method and apparatus, a storage medium, and an electronic device.

Background

Natural Language Processing (NLP) is an important direction in the fields of information processing technology and computer science. It studies how to make a computer understand the meaning of natural language text, which is called natural language understanding, and how to express a given intention, thought, or the like through natural language text, which is called natural language generation.

Text representation vector learning plays an important role in the NLP field: the success of many NLP tasks depends on training high-quality sentence representation vectors. Existing text representation vector generation methods require a large number of training samples and perform complex computation after the model to generate a text representation vector, so they consume substantial computing resources and are slow. An efficient, high-quality text representation vector generation method is therefore urgently needed.

Disclosure of Invention

In view of the foregoing, embodiments of the present invention provide a text representation vector generation method, apparatus, storage medium, and electronic device that overcome the foregoing problems or at least partially solve them.

In a first aspect of the embodiments of the present invention, a method for generating a text representation vector is provided, where the method includes:

acquiring a text to be processed;

converting the text to be processed into a corresponding encoding vector;

inputting the encoding vector into a text representation vector generation model to obtain a representation vector corresponding to the text to be processed;

wherein the text representation vector generation model is obtained by training a preset model, and its training data includes at least: encoding vectors corresponding to sample texts and their enhanced encoding vectors, where two encoding vectors derived from the same sample text are mutually positive samples, and two encoding vectors derived from different sample texts are mutually negative samples.

Optionally, the training data of the text representation vector generation model includes at least one piece of training data obtained by the following steps:

obtaining a sample text and converting the sample text into a corresponding encoding vector;

performing enhancement processing on the encoding vector corresponding to the sample text to obtain at least one enhanced encoding vector;

forming an encoding vector pair from the encoding vector corresponding to the sample text and its enhanced encoding vector, to obtain a piece of training data.

Optionally, for each sample text, the enhancement processing performed on the encoding vector corresponding to the sample text includes at least one of:

for each sample text, shuffling the position numbers in the encoding vector that indicate where each word occurs in the sample text;

for each sample text, randomly deleting entire rows or entire columns of elements in the encoding vector corresponding to the sample text;

for each sample text, randomly deleting at least one element in the encoding vector corresponding to the sample text.

Optionally, the training method for the text representation vector generation model includes at least the following steps:

respectively inputting the encoding vector corresponding to the sample text and its enhanced encoding vector into the preset model, to obtain two predicted representation vectors output by the preset model;

determining a contrastive learning loss function value according to the similarity between two predicted representation vectors based on the same sample text and the difference between two predicted representation vectors based on different sample texts;

training the preset model based on the contrastive learning loss function value, with the objectives of maximizing the similarity between two predicted representation vectors based on the same sample text and maximizing the difference between two predicted representation vectors based on different sample texts.

Optionally, the training data of the text representation vector generation model further includes training data obtained by the following steps:

acquiring a sample text and its associated text, where the associated text carries a pre-annotated label indicating whether its meaning is the same as that of the sample text;

respectively converting the sample text and the associated text into corresponding encoding vectors;

forming an encoding vector pair from the encoding vectors corresponding to the sample text and an associated text with the same meaning, to obtain a piece of positive-sample training data;

forming an encoding vector pair from the encoding vectors corresponding to the sample text and an associated text with a different meaning, to obtain a piece of negative-sample training data.

Optionally, the training process of the text representation vector generation model further includes the following steps:

respectively inputting the encoding vectors corresponding to the sample text and the associated text into the preset model, to obtain two predicted representation vectors output by the preset model;

determining a supervised loss function value according to the similarity between the two predicted representation vectors output by the preset model and the label carried by the associated text;

training the preset model based on the supervised loss function value.

Optionally, after determining the supervised loss function value, the method further includes:

generating an adversarial perturbation according to the supervised loss function value;

adding the adversarial perturbation to the encoding vector corresponding to the sample text to obtain an enhanced encoding vector.

Optionally, after obtaining the representation vector corresponding to the text to be processed, the method further includes:

determining the semantic similarity of two texts to be processed according to the similarity between their respective representation vectors; or

comparing the representation vector corresponding to the text to be processed with the representation vectors corresponding to the texts in a text library, to output a retrieval result corresponding to the text to be processed.

In a second aspect of the embodiments of the present invention, there is provided a text representation vector generation apparatus, including:

the acquisition module is used for acquiring a text to be processed;

the conversion module is used for converting the text to be processed into a corresponding encoding vector;

the generating module is used for inputting the encoding vector into a text representation vector generation model to obtain a representation vector corresponding to the text to be processed;

wherein the text representation vector generation model is obtained by training a preset model, and its training data includes at least: encoding vectors corresponding to sample texts and their enhanced encoding vectors, where two encoding vectors derived from the same sample text are mutually positive samples, and two encoding vectors derived from different sample texts are mutually negative samples.

In a third aspect of the embodiments of the present invention, a computer-readable storage medium is provided, on which a computer program is stored; when executed, the computer program implements the steps of the text representation vector generation method disclosed in the embodiments of the present application.

In a fourth aspect of the embodiments of the present invention, there is provided an electronic device, including: a processor, a memory, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the text representation vector generation method disclosed in the embodiments of the present application.

The embodiment of the invention has the following advantages:

in this embodiment, a text to be processed may be acquired; the text to be processed is converted into a corresponding encoding vector; and the encoding vector is input into a text representation vector generation model to obtain the representation vector corresponding to the text to be processed. The text representation vector generation model is obtained by training a preset model, and its training data includes at least: encoding vectors corresponding to sample texts and their enhanced encoding vectors, where two encoding vectors derived from the same sample text are mutually positive samples, and two encoding vectors derived from different sample texts are mutually negative samples. Consequently, during training of the text representation vector generation model, every encoding vector other than a given vector's single positive sample serves as its negative sample, so the model can be trained with only a small amount of training data, which effectively saves computing resources. The representation vector of the text to be processed can be generated directly by the trained text representation vector generation model, with no further complex computation required after the model, improving the efficiency of generating text representation vectors.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a flowchart illustrating the steps of a method for training a text representation vector generation model according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a preset model according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating the steps of a text representation vector generation method according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a text representation vector generation apparatus according to an embodiment of the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.

To solve the problems in the related art that text representation vector generation methods consume substantial computing resources and run slowly, the applicant proposes: training a text representation vector generation model with the encoding vectors corresponding to sample texts and their enhanced encoding vectors as training data, where two encoding vectors based on the same sample text are mutually positive samples and two encoding vectors based on different sample texts are mutually negative samples, and then using the trained text representation vector generation model to generate the representation vector of a text directly.

To train the text representation vector generation model, training data must first be acquired. The training data of the text representation vector generation model includes at least one piece of training data obtained according to the following steps:

step S110: a sample text is obtained and converted into a corresponding code vector.

To train a general-purpose text representation vector generation model, texts in a general text database can be used directly as sample texts; to train a text representation vector generation model for a specific field, texts from an unlabeled corpus matching the data distribution of that field can be used as sample texts. The sample text is converted into its corresponding encoding vector through an embedding operation.
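As a hedged illustration, the embedding step might look like the following minimal sketch; the checkpoint name `bert-base-chinese` and the use of the HuggingFace `transformers` API are assumptions, since the patent does not name a concrete implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the patent does not specify one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

text = "sample text to be converted"
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # The "encoding vector": a word-embedding lookup at the embedding layer,
    # before the BERT encoder stack runs (the full model additionally adds
    # position and segment embeddings inside its embeddings module).
    encoding_vector = model.get_input_embeddings()(enc["input_ids"])
print(encoding_vector.shape)  # (1, seq_len, hidden_size)
```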

It can be understood that, once a general-purpose text representation vector generation model has been trained, a domain-specific model can be obtained simply by retraining the general-purpose model on domain-specific sample texts, fine-tuning its parameters to convert it into a text representation vector generation model for that field.

Step S120: enhancement processing is performed on the encoding vector corresponding to the sample text to obtain at least one enhanced encoding vector.

The enhanced encoding vector is generated implicitly at the embedding layer, yielding at least one enhanced encoding vector for each sample text.

Step S130: the encoding vector corresponding to the sample text and its enhanced encoding vector form an encoding vector pair, giving a piece of training data.

The encoding vector corresponding to the sample text and its enhanced encoding vector form an encoding vector pair that is used as training data. The encoding vector and the enhanced encoding vector obtained from the same sample text are mutually positive samples, and encoding vectors obtained from different sample texts are mutually negative samples.

Optionally, as an embodiment, for each sample text, the enhancement processing performed on the encoding vector corresponding to the sample text includes at least one of:

1. for each sample text, shuffling the position numbers in the encoding vector that indicate where each word occurs in the sample text;

2. for each sample text, randomly deleting entire rows or entire columns of elements in the encoding vector corresponding to the sample text;

3. for each sample text, randomly deleting at least one element in the encoding vector corresponding to the sample text.

For each sample text, at least one of these three methods can be applied.

Method 1: disorder the word order of the sample text. Specifically, take the position ids (position identifiers) of each word at the embedding layer and then apply a shuffle operation to the position ids.

Method 2: randomly select tokens of the sample text and set the entire embedding row of each selected token to zero; or randomly select embedding feature dimensions and set the selected dimensions to zero across all tokens.

Method 3: randomly set elements in the encoding vector corresponding to the sample text to zero.
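A minimal PyTorch sketch of the three operations, assuming the encoding vector is a `(batch, seq_len, hidden)` embedding tensor; the deletion ratio `rate` is a hypothetical parameter not specified in the patent.

```python
import torch

def shuffle_positions(position_ids: torch.Tensor) -> torch.Tensor:
    # Method 1: shuffle the position ids so that word-order information
    # carried by the encoding vector is scrambled.
    perm = torch.randperm(position_ids.size(-1))
    return position_ids[..., perm]

def cutoff(token_embeddings: torch.Tensor, rate: float = 0.1,
           dim: int = 1) -> torch.Tensor:
    # Method 2: zero entire rows (dim=1, whole tokens) or entire columns
    # (dim=2, whole feature dimensions) of a (batch, seq_len, hidden) tensor.
    out = token_embeddings.clone()
    drop = torch.rand(out.size(dim)) < rate
    if dim == 1:
        out[:, drop, :] = 0.0
    else:
        out[:, :, drop] = 0.0
    return out

def element_dropout(token_embeddings: torch.Tensor,
                    rate: float = 0.1) -> torch.Tensor:
    # Method 3: zero individual elements at random.
    mask = (torch.rand_like(token_embeddings) >= rate).float()
    return token_embeddings * mask
```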

By adopting the technical solution of this embodiment of the application, the enhanced encoding vector can be generated implicitly from the encoding vector corresponding to the sample text at the embedding layer, which ensures semantic consistency between the encoding vector corresponding to the sample text and its enhanced encoding vector.

After the training data is obtained, the preset model is trained with it. Optionally, referring to FIG. 1, which shows a flowchart of the steps of a training method for a text representation vector generation model in an embodiment of the present invention, the training method may specifically include the following steps:

step S210: and respectively inputting the coding vector corresponding to the sample text and the enhanced coding vector thereof into the preset model to obtain two prediction expression vectors output by the preset model.

The preset model is a BERT-like language model; FIG. 2 shows a schematic diagram of its structure. The preset model includes a data enhancement module, a BERT (a language representation model) encoding layer, an average pooling layer, and a loss function module. The data enhancement module performs enhancement processing on the encoding vector corresponding to the sample text; the BERT encoding layer performs interactive computation on the encoding vectors; the average pooling layer takes the average over the last two (or a preset number of) layers as the predicted representation vector of each text; and the loss function module establishes a contrastive learning loss function from the predicted representation vectors of the texts.

Because the data input into the preset model already consists of the encoding vector corresponding to the sample text and its enhanced encoding vector, the data enhancement module does not need to enhance the encoding vector again; the vectors pass directly through the BERT encoding layer for interactive computation and are then averaged by the average pooling layer.
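A sketch of this encode-and-pool path under the same assumptions as above (HuggingFace `transformers`, assumed checkpoint): averaging the last two hidden layers follows the description, while the masked mean over tokens is an implementation assumption.

```python
import torch
from transformers import AutoModel

# output_hidden_states=True exposes every encoder layer for pooling.
model = AutoModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

def predict_representation(input_ids: torch.Tensor,
                           attention_mask: torch.Tensor) -> torch.Tensor:
    out = model(input_ids=input_ids, attention_mask=attention_mask)
    # hidden_states is a tuple of (num_layers + 1) tensors, (batch, seq, hidden).
    last_two = torch.stack(out.hidden_states[-2:])   # (2, batch, seq, hidden)
    token_avg = last_two.mean(dim=0)                 # average the two layers
    mask = attention_mask.unsqueeze(-1).float()
    # Masked mean over tokens -> one representation vector per text.
    return (token_avg * mask).sum(dim=1) / mask.sum(dim=1)
```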

Step S220: a contrastive learning loss function value is determined according to the similarity between two predicted representation vectors based on the same sample text and the difference between two predicted representation vectors based on different sample texts.

in order to optimize the parameters of the preset model, the comparative learning loss function value needs to be calculated. Specifically, the comparative learning loss function can be established by the following formula:

$$L_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(r_i, r_j)/\tau\right)}{\sum_{k=1,\, k \neq i}^{N} \exp\left(\mathrm{sim}(r_i, r_k)/\tau\right)}$$

where $L_{i,j}$ denotes the contrastive learning loss function value; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; $N$ denotes the number of training samples input into the preset model at the same time; $\tau$ is the temperature that controls the difficulty of separating samples, and is generally set to 0.1; $i, j, k \in \{1, 2, \ldots, N\}$; and $r_i$, $r_j$, $r_k$ denote the $i$-th, $j$-th, and $k$-th text representation vectors, respectively.
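A PyTorch sketch of this loss, under the assumption that the batch is arranged as N original vectors followed by their N enhanced counterparts (a standard in-batch-negatives layout; the patent does not fix one):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(reps: torch.Tensor, aug_reps: torch.Tensor,
                     tau: float = 0.1) -> torch.Tensor:
    # reps, aug_reps: (N, hidden) predicted representation vectors of N sample
    # texts and their enhanced counterparts; row i of each is a positive pair,
    # and every other vector in the batch serves as a negative sample.
    z = torch.cat([reps, aug_reps], dim=0)        # (2N, hidden)
    z = F.normalize(z, dim=-1)                    # so dot product = cosine sim
    sim = z @ z.t() / tau                         # pairwise similarities / tau
    sim.fill_diagonal_(float("-inf"))             # exclude self-similarity
    n = reps.size(0)
    # The positive of row i is row i + n, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```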

Step S230: the preset model is trained based on the contrastive learning loss function value, with the objectives of maximizing the similarity between two predicted representation vectors based on the same sample text and maximizing the difference between two predicted representation vectors based on different sample texts.

The purpose of establishing the contrastive learning loss function is to let each encoding vector input into the preset model find the encoding vector that is its positive sample. During training, the contrastive learning loss function guides the text representation vectors output by the preset model so that those corresponding to positive samples are as similar as possible in the representation space, while those corresponding to negative samples differ as much as possible.

When training converges or a preset number of training iterations is reached, training finishes and a trained preset model is obtained; this trained preset model is the text representation vector generation model.

By adopting the technical solution of this embodiment of the invention, the preset model can be trained directly on the training data without annotation, reducing the cost of labeling data. For each encoding vector, every other encoding vector except its single positive sample is a negative sample, so the parameters of the preset model can be optimized effectively without a large amount of training data; practical application shows that a good training result can be obtained with only 1000 pieces of training data. No other complex computation is required after BERT encoding, which saves computing resources and improves efficiency both during training and when the trained model is used to generate text representation vectors. Because the training data consists of the encoding vectors corresponding to sample texts and their enhanced encoding vectors, the trained text representation vector generation model outputs results whose semantics are expressed accurately.

After the trained text representation vector generation model is obtained, it can be used to generate the representation vector of a text. Referring to FIG. 3, which shows a flowchart of the steps of a text representation vector generation method according to an embodiment of the present invention, the method may specifically include the following steps:

step S310: acquiring a text to be processed;

step S320: converting the text to be processed into a corresponding encoding vector;

step S330: inputting the coding vector into a text representation vector generation model to obtain a representation vector corresponding to the text to be processed; the text expression vector generation model is obtained by training a preset model, and the training data of the text expression vector generation model at least comprises: and the coding vectors corresponding to the sample texts and the enhanced coding vectors thereof, wherein two coding vectors based on the same sample text are positive samples, and two coding vectors based on different sample texts are negative samples.

It can be understood that, when the trained text representation vector generation model is used to generate the representation vector of a text to be processed, there is no need to obtain an enhanced encoding vector of the text or to calculate a loss function. The encoding vector of the text to be processed is computed interactively by the BERT encoding layer, and the average pooling layer then takes the average over the last two (or a preset number of) layers as the text representation vector. Specifically: after the text to be processed is acquired, it is converted into a corresponding encoding vector through an embedding operation; after the encoding vector is input into the text representation vector generation model, the BERT encoding layer performs interactive computation on it, and the average pooling layer takes the average over the last two (or a preset number of) layers of the BERT encoding layer as the text representation vector of the text to be processed.

Accordingly, relative to the structure of the preset model, the data enhancement module and the loss function module can be omitted from the trained text representation vector generation model.
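At inference time the same encode-and-pool path is all that runs; a brief usage sketch, assuming `model` and `predict_representation` from the earlier training sketch are in scope along with the same assumed checkpoint:

```python
import torch
from transformers import AutoTokenizer

# Assumes `predict_representation` from the earlier sketch; no augmentation
# and no loss computation are needed at inference time.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def text_to_vector(text: str) -> torch.Tensor:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        rep = predict_representation(enc["input_ids"], enc["attention_mask"])
    return rep.squeeze(0)  # (hidden_size,) representation vector
```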

By adopting the technical solution of this embodiment of the application, the representation vector of the text to be processed is obtained through the text representation vector generation model, and no complex data processing is required after the BERT encoding layer, so the method computes quickly and saves computing resources. Because the model is trained on the encoding vectors corresponding to sample texts and their enhanced encoding vectors, the representation vector obtained for the text to be processed expresses its semantics accurately.

The above embodiments disclose an unsupervised method for training the text representation vector generation model; a supervised training method may also be adopted. Optionally, as an embodiment, the training data of the text representation vector generation model further includes training data obtained according to the following steps:

step S410: acquiring a sample text and an associated text thereof, wherein the associated text carries a label marked in advance to represent whether the meaning of the associated text is the same as that of the sample text;

step S420: respectively converting the sample text and the associated text into corresponding coding vectors;

step S430: forming a coding vector pair by the sample text and the coding vectors corresponding to the associated texts with the same meanings as the sample text to obtain a positive sample training data;

step S440: and forming a coding vector pair by the sample text and the coding vectors corresponding to the associated texts with different meanings to obtain a piece of negative sample training data.

To train the text representation vector generation model in a supervised manner, the associated text of each sample text is first annotated to indicate whether its meaning is the same as that of the sample text. The sample text and its associated text are then converted into corresponding encoding vectors through an embedding operation; an encoding vector pair formed from a sample text and an associated text with the same meaning is used as positive-sample training data, and an encoding vector pair formed from a sample text and an associated text with a different meaning is used as negative-sample training data.

After the training data is obtained, the text representation vector generation model can be trained with it. Optionally, as an embodiment, the training process of the text representation vector generation model further includes the following steps:

step S510: respectively inputting the coding vectors corresponding to the sample text and the associated text into the preset model to obtain two prediction expression vectors output by the preset model;

step S520: determining a supervised loss function value according to the similarity between two prediction expression vectors output by the preset model and the label carried by the associated text;

step S530: training a preset model based on the supervised loss function value.

The preset model in this embodiment of the invention has the same structure as the preset model described above; after the encoding vector of a text is input, it outputs the predicted representation vector of the text. The supervised loss function value is determined from the similarity between the two predicted representation vectors and the label carried by the associated text, and this value guides the optimization of the preset model's parameters so that the output text representation vectors accurately express the meaning of the texts.
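The patent does not pin down the exact supervised objective; one plausible sketch treats the temperature-scaled cosine similarity as a logit against the binary same-meaning label (reusing the 0.1 temperature above is itself an assumption):

```python
import torch
import torch.nn.functional as F

def supervised_loss(rep_a: torch.Tensor, rep_b: torch.Tensor,
                    labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # rep_a, rep_b: (N, hidden) predicted representation vectors of sample
    # texts and their associated texts; labels: (N,) with 1 = same meaning,
    # 0 = different meaning (the human-annotated label carried by the pair).
    cos = F.cosine_similarity(rep_a, rep_b, dim=-1)
    # Hypothetical choice: binary cross-entropy on the scaled similarity.
    return F.binary_cross_entropy_with_logits(cos / tau, labels.float())
```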

Optionally, as an embodiment, after determining the supervised loss function value, the method further comprises:

step S610: generating counterdisturbance according to the supervised loss function value;

step S620: and adding the counterdisturbance to the coding vector corresponding to the sample text to obtain an enhanced coding vector.

An adversarial perturbation is generated through gradient inversion according to the supervised loss function value, and the perturbation is added to the encoding vector corresponding to the sample text to obtain an enhanced encoding vector. Here, generating the adversarial perturbation through gradient inversion means: the loss function value is propagated backward layer by layer through backpropagation, and each network layer multiplies the transmitted error by a negative number, so that the training objectives of the preceding and following networks are opposite, achieving an adversarial effect.
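The patent describes the perturbation mechanism only loosely; a common concrete realization is an FGSM-style step on the embedding, sketched below with a hypothetical step size `epsilon` (named techniques and parameters here are assumptions, not the patent's exact procedure):

```python
import torch

def adversarial_augment(token_embeddings: torch.Tensor,
                        loss: torch.Tensor,
                        epsilon: float = 1e-3) -> torch.Tensor:
    # token_embeddings must require grad and participate in `loss`.
    # Perturb in the direction that increases the supervised loss, so the
    # perturbed copy serves as a harder enhanced encoding vector.
    grad, = torch.autograd.grad(loss, token_embeddings, retain_graph=True)
    perturbation = epsilon * grad.sign()
    return (token_embeddings + perturbation).detach()
```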

By adopting the technical solution of this embodiment of the invention, the adversarial perturbation is generated through gradient inversion, so the parameters of the preset model are optimized through adversarial training; the resulting preset model is more robust and its outputs are more accurate. A text representation vector generation model trained in this way can therefore generate more accurate text representation vectors.

Optionally, the supervised and unsupervised training methods provided in the embodiments of the present invention may be used in combination: the contrastive learning loss function and the supervised loss function are established simultaneously, and the parameters of the preset model are jointly optimized with both, yielding a higher-quality text representation vector generation model.
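A minimal sketch of the joint objective; the weighting coefficient is an assumption, as the patent only states that the two losses are combined:

```python
import torch

def joint_loss(contrastive: torch.Tensor, supervised: torch.Tensor,
               weight: float = 1.0) -> torch.Tensor:
    # Hypothetical weighted sum of the two loss values computed above.
    return contrastive + weight * supervised
```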

Optionally, as an embodiment, after the representation vector corresponding to the text to be processed is obtained, the method further includes: determining the semantic similarity of two texts to be processed according to the similarity between their respective representation vectors; or comparing the representation vector corresponding to the text to be processed with the representation vectors corresponding to the texts in a text library, to output a retrieval result corresponding to the text to be processed.

After the representation vectors of the texts to be processed are obtained, the semantic similarity of two texts can be determined from the similarity of their representation vectors, which plays an important role in duplicate-text detection.

Alternatively, the representation vector corresponding to the text to be processed can be compared with the representation vectors corresponding to the texts in a text library, thereby recalling the retrieval results corresponding to the text to be processed.
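Both applications reduce to cosine comparisons over representation vectors; a sketch of each (the library tensor shape and the value of `k` are assumptions):

```python
import torch
import torch.nn.functional as F

def semantic_similarity(vec_a: torch.Tensor, vec_b: torch.Tensor) -> float:
    # Similarity of two texts = cosine similarity of their representation vectors.
    return F.cosine_similarity(vec_a.unsqueeze(0), vec_b.unsqueeze(0)).item()

def retrieve_top_k(query_vec: torch.Tensor, library: torch.Tensor, k: int = 5):
    # library: (M, hidden) precomputed representation vectors of the text library.
    scores = F.normalize(library, dim=-1) @ F.normalize(query_vec, dim=-1)
    values, indices = torch.topk(scores, k=min(k, library.size(0)))
    return indices, values  # indices and scores of the k most similar texts
```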

It is understood that there are other application scenarios once the representation vector of a text has been obtained; they are not enumerated here.

It should be noted that, for simplicity of description, the method embodiments are described as a series or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the acts involved are not necessarily all required by the present invention.

FIG. 4 is a schematic structural diagram of a text representation vector generation apparatus according to an embodiment of the present invention. As shown in FIG. 4, the text representation vector generation apparatus includes an acquisition module, a conversion module, and a generating module, where:

the acquisition module is used for acquiring a text to be processed;

the conversion module is used for converting the text to be processed into a corresponding encoding vector;

the generating module is used for inputting the encoding vector into a text representation vector generation model to obtain a representation vector corresponding to the text to be processed; the text representation vector generation model is obtained by training a preset model, and its training data includes at least: encoding vectors corresponding to sample texts and their enhanced encoding vectors, where two encoding vectors derived from the same sample text are mutually positive samples, and two encoding vectors derived from different sample texts are mutually negative samples.

Optionally, as an embodiment, the training data of the text representation vector generation model includes at least one piece of training data obtained according to the following steps:

obtaining a sample text and converting the sample text into a corresponding encoding vector;

performing enhancement processing on the encoding vector corresponding to the sample text to obtain at least one enhanced encoding vector;

forming an encoding vector pair from the encoding vector corresponding to the sample text and its enhanced encoding vector, to obtain a piece of training data.

Optionally, as an embodiment, for each sample text, the enhancement processing performed on the encoding vector corresponding to the sample text includes at least one of:

for each sample text, shuffling the position numbers in the encoding vector that indicate where each word occurs in the sample text;

for each sample text, randomly deleting entire rows or entire columns of elements in the encoding vector corresponding to the sample text;

for each sample text, randomly deleting at least one element in the encoding vector corresponding to the sample text.

Optionally, as an embodiment, the training method for the text representation vector generation model includes at least the following steps:

respectively inputting the encoding vector corresponding to the sample text and its enhanced encoding vector into the preset model, to obtain two predicted representation vectors output by the preset model;

determining a contrastive learning loss function value according to the similarity between two predicted representation vectors based on the same sample text and the difference between two predicted representation vectors based on different sample texts;

training the preset model based on the contrastive learning loss function value, with the objectives of maximizing the similarity between two predicted representation vectors based on the same sample text and maximizing the difference between two predicted representation vectors based on different sample texts.

Optionally, as an embodiment, the training data of the text representation vector generation model further includes training data obtained according to the following steps:

acquiring a sample text and its associated text, where the associated text carries a pre-annotated label indicating whether its meaning is the same as that of the sample text;

respectively converting the sample text and the associated text into corresponding encoding vectors;

forming an encoding vector pair from the encoding vectors corresponding to the sample text and an associated text with the same meaning, to obtain a piece of positive-sample training data;

forming an encoding vector pair from the encoding vectors corresponding to the sample text and an associated text with a different meaning, to obtain a piece of negative-sample training data.

Optionally, as an embodiment, the training process of the text representation vector generation model further includes the following steps:

respectively inputting the encoding vectors corresponding to the sample text and the associated text into the preset model, to obtain two predicted representation vectors output by the preset model;

determining a supervised loss function value according to the similarity between the two predicted representation vectors output by the preset model and the label carried by the associated text;

training the preset model based on the supervised loss function value.

Optionally, as an embodiment, after determining the supervised loss function value, the method further comprises:

generating an adversarial perturbation according to the supervised loss function value;

adding the adversarial perturbation to the encoding vector corresponding to the sample text to obtain an enhanced encoding vector.

Optionally, as an embodiment, after obtaining the representation vector corresponding to the text to be processed, the method further includes:

determining the semantic similarity of two texts to be processed according to the similarity between their respective representation vectors; or

comparing the representation vector corresponding to the text to be processed with the representation vectors corresponding to the texts in a text library, to output a retrieval result corresponding to the text to be processed.

By adopting the technical solution of this embodiment of the application, the preset model can be trained to obtain the trained text representation vector generation model. The training data consists of the encoding vectors corresponding to sample texts and their enhanced encoding vectors, so the output of the trained model expresses semantics accurately. For each encoding vector, every other encoding vector except its single positive sample is a negative sample, so a large amount of training data is not needed during training. The text representation vector generation model has the same structure as the preset model, and no complex computation is needed after the interactive computation of the BERT encoding layer, which saves computing resources and effectively increases speed. The representation vector of the text to be processed generated by the model can then be used for various applications, such as duplicate checking and recalling similar texts.

It should be noted that the device embodiments are similar to the method embodiments, so their description is brief; for relevant details, refer to the method embodiments.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed, the computer program implements the text representation vector generation method according to any of the above embodiments.

An embodiment of the present invention further provides an electronic device, including: a processor, a memory, and a computer program stored on the memory and executable on the processor, where the processor, when executing the program, implements the text representation vector generation method disclosed in any of the above embodiments.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, electronic devices and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The text representation vector generation method and apparatus, storage medium, and electronic device provided by the present application have been introduced in detail above. Specific examples are used herein to explain the principles and implementation of the application, and the description of the embodiments is only intended to help in understanding the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.
