Training sample enhancement method, system, device and storage medium

Document No.: 907639    Publication date: 2021-02-26

Note: This technology, Training sample enhancement method, system, device and storage medium, was designed and created by 杨森, 罗超, 胡泓, 李巍 and 邹宇 on 2020-12-08. Its main content is as follows: The invention provides a training sample enhancement method, system, device and storage medium, wherein the method comprises: collecting corpus data, and extracting candidate samples from the corpus data; obtaining a sentence vector of each candidate sample; respectively calculating the similarity between the sentence vector of a seed sample and the sentence vector of each candidate sample; and selecting enhanced samples from the candidate samples according to the calculated similarities and adding them to the training set. The invention mines sample data similar to the seed samples directly from the original corpus and automatically expands the training sample set based on similarity calculation, reducing the need for manually labeled samples. The whole process is automatic and requires no manual intervention, which saves labor; moreover, a model trained on the enhanced training sample set is more robust, less affected by noise, and more accurate in application.

1. A training sample enhancement method is characterized by comprising the following steps:

collecting corpus data, and extracting candidate samples from the corpus data;

obtaining a sentence vector of each candidate sample;

respectively calculating the similarity between the sentence vector of a seed sample and the sentence vector of each candidate sample;

and selecting enhanced samples from the candidate samples according to the calculated similarities, and adding the enhanced samples into a training set.

2. The method for enhancing training samples according to claim 1, further comprising the following steps after extracting the candidate samples from the corpus data:

carrying out normalization processing on the candidate samples according to a preset normalization rule.

3. The method of claim 1, wherein the obtaining the sentence vector of the candidate sample comprises obtaining the sentence vector of the candidate sample by at least one of:

obtaining a first sentence vector of the candidate sample based on doc2vec;

obtaining a second sentence vector of the candidate sample according to the word vector of each word in the candidate sample;

and obtaining a third sentence vector of the candidate sample according to the word vector of each word in the candidate sample.

4. The method of claim 1, wherein the obtaining the sentence vector of the candidate samples comprises:

obtaining a first sentence vector of the candidate sample based on doc2vec, obtaining a second sentence vector of the candidate sample according to the word vector of each word in the candidate sample, and obtaining a third sentence vector of the candidate sample according to the word vector of each word in the candidate sample;

and splicing the first sentence vector, the second sentence vector and the third sentence vector to obtain the sentence vector of the candidate sample.

5. The method according to claim 3 or 4, wherein obtaining the second sentence vector of the candidate sample according to the word vector of each word in the candidate sample comprises the following step:

weighting the word vector of each word, obtained by word2vec, with the TF-IDF algorithm to obtain the second sentence vector of the candidate sample.

6. The method for enhancing training samples according to claim 1, further comprising the following step after obtaining the sentence vector of the candidate sample:

normalizing the sentence vectors of the candidate samples so that their lengths are consistent.

7. The method for enhancing training samples according to claim 1, wherein respectively calculating the similarity between the sentence vector of the seed sample and the sentence vector of each candidate sample comprises the following step:

respectively calculating the cosine similarity between the sentence vector of the seed sample and the sentence vector of each candidate sample, and taking it as the similarity of the corresponding pair of sentence vectors.

8. The method according to claim 1, wherein selecting the enhanced samples from the candidate samples according to the calculated similarity comprises selecting, as enhanced samples, the candidate samples whose similarity with the seed sample is greater than a preset similarity threshold.

9. The method for enhancing training samples according to claim 1, further comprising the following steps after adding the enhanced samples into the training set:

judging whether the number of samples in the training set is greater than or equal to a preset sample number threshold;

if not, taking the samples in the training set as new seed samples, respectively calculating the similarity between the sentence vectors of the new seed samples and the sentence vector of each candidate sample, selecting new enhanced samples from the candidate samples according to the calculated similarities, and adding the new enhanced samples into the training set.

10. A training sample enhancement system for implementing the training sample enhancement method of any one of claims 1 to 9, the system comprising:

the corpus collection module is used for collecting corpus data and extracting candidate samples from the corpus data;

a vector obtaining module, configured to obtain a sentence vector of each candidate sample;

the similarity calculation module is used for calculating the similarity between the sentence vector of the seed sample and the sentence vector of each candidate sample;

and the sample enhancement module is used for selecting enhanced samples from the candidate samples according to the calculated similarities and adding the enhanced samples into a training set.

11. The training sample enhancement system of claim 10, wherein the vector obtaining module obtains the sentence vector of the candidate sample as follows:

the vector obtaining module obtains a first sentence vector of the candidate sample based on doc2vec, obtains a second sentence vector of the candidate sample according to the word vector of each word in the candidate sample, and obtains a third sentence vector of the candidate sample according to the word vector of each word in the candidate sample;

and the vector obtaining module splices the first sentence vector, the second sentence vector and the third sentence vector to obtain the sentence vector of the candidate sample.

12. A training sample enhancement apparatus, comprising:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of the training sample enhancement method of any of claims 1 to 9 via execution of the executable instructions.

13. A computer-readable storage medium storing a program, which when executed by a processor implements the steps of the training sample enhancement method of any one of claims 1 to 9.

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a training sample enhancement method, system, device, and storage medium.

Background

Current similarity algorithms fall mainly into supervised and unsupervised approaches. Supervised methods judge text similarity, or compute a similarity score, with a supervised model such as a naive Bayes classifier. They require a certain amount of labeled corpora, so construction cost is high; because the training corpus usually cannot be made very large, the model generalizes poorly and is cumbersome in practical use; and the distance-calculation step has high complexity. Unsupervised methods compute the distance or similarity between texts directly, for example with the Euclidean distance. Their characteristics are: no labeled corpora are needed, and very large datasets can be used for feature engineering or parameter estimation; many of these methods depend little on the language, can handle mixed multilingual scenarios, and keep the distance-calculation step simple.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a training sample enhancement method, system, device and storage medium that automatically expand a training sample set based on similarity, thereby reducing manual labeling cost and saving labor.

The embodiment of the invention provides a training sample enhancement method, which comprises the following steps:

collecting corpus data, and extracting candidate samples from the corpus data;

obtaining a sentence vector of each candidate sample;

respectively calculating the similarity between the sentence vector of a seed sample and the sentence vector of each candidate sample;

and selecting enhanced samples from the candidate samples according to the calculated similarities, and adding the enhanced samples into a training set.

In some embodiments, after extracting the candidate sample from the corpus data, the method further includes the following steps:

carrying out normalization processing on the candidate samples according to a preset normalization rule.

In some embodiments, the obtaining the sentence vector of the candidate sample includes obtaining the sentence vector of the candidate sample by at least one of the following methods:

obtaining a first sentence vector of the candidate sample based on doc2vec;

obtaining a second sentence vector of the candidate sample according to the word vector of each word in the candidate sample;

and obtaining a third sentence vector of the candidate sample according to the word vector of each word in the candidate sample.

In some embodiments, the obtaining the sentence vector of the candidate sample includes the following steps:

obtaining a first sentence vector of the candidate sample based on doc2vec, obtaining a second sentence vector of the candidate sample according to the word vector of each word in the candidate sample, and obtaining a third sentence vector of the candidate sample according to the word vector of each word in the candidate sample;

and splicing the first sentence vector, the second sentence vector and the third sentence vector to obtain the sentence vector of the candidate sample.

In some embodiments, the obtaining a second sentence vector of the candidate sample according to the word vector of each word in the candidate sample includes:

weighting the word vector of each word, obtained by word2vec, with the TF-IDF algorithm to obtain the second sentence vector of the candidate sample.

In some embodiments, after obtaining the sentence vector of the candidate sample, the method further includes the following steps:

normalizing the sentence vectors of the candidate samples so that their lengths are consistent.

In some embodiments, the separately calculating the similarity between the sentence vector of the seed sample and the sentence vector of each of the candidate samples includes the following steps:

respectively calculating the cosine similarity between the sentence vector of the seed sample and the sentence vector of each candidate sample, and taking it as the similarity of the corresponding pair of sentence vectors.

In some embodiments, selecting enhanced samples from the candidate samples according to the calculated similarity includes selecting, as enhanced samples, the candidate samples whose similarity with the seed sample is greater than a preset similarity threshold.

In some embodiments, after the adding into the training set, the method further includes the following steps:

judging whether the number of samples in the training set is greater than or equal to a preset sample number threshold value or not;

if not, taking the samples in the training set as new seed samples, respectively calculating the similarity between the sentence vectors of the new seed samples and the sentence vector of each candidate sample, selecting new enhanced samples from the candidate samples according to the calculated similarities, and adding the new enhanced samples into the training set.

The embodiment of the present invention further provides a training sample enhancement system, configured to implement the training sample enhancement method, where the system includes:

the corpus collection module is used for collecting corpus data and extracting candidate samples from the corpus data;

a vector obtaining module, configured to obtain a sentence vector of each candidate sample;

the similarity calculation module is used for calculating the similarity between the sentence vector of the seed sample and the sentence vector of each candidate sample;

and the sample enhancement module is used for selecting enhanced samples from the candidate samples according to the calculated similarities and adding the enhanced samples into a training set.

In some embodiments, the vector obtaining module obtains the sentence vector of the candidate sample as follows:

the vector obtaining module obtains a first sentence vector of the candidate sample based on doc2vec, obtains a second sentence vector of the candidate sample according to the word vector of each word in the candidate sample, and obtains a third sentence vector of the candidate sample according to the word vector of each word in the candidate sample;

and the vector obtaining module splices the first sentence vector, the second sentence vector and the third sentence vector to obtain the sentence vector of the candidate sample.

An embodiment of the present invention further provides a training sample enhancement apparatus, including:

a processor;

a memory having stored therein executable instructions of the processor;

wherein the processor is configured to perform the steps of the training sample enhancement method via execution of the executable instructions.

An embodiment of the present invention further provides a computer-readable storage medium storing a program which, when executed by a processor, implements the steps of the training sample enhancement method.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

The training sample enhancement method, system, device and storage medium of the invention have the following beneficial effects:

the invention mines sample data similar to the seed samples directly from the original corpus and automatically expands the training sample set based on similarity calculation. This reduces the need for manually labeled samples; the whole process is automatic and requires no manual intervention, saving labor. Moreover, a model trained on the enhanced training sample set is more robust, less affected by noise, and more accurate in application.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.

FIG. 1 is a flow chart of a training sample enhancement method according to an embodiment of the invention;

FIG. 2 is a flow chart of obtaining a sentence vector of alternative samples according to an embodiment of the present invention;

FIG. 3 is a flowchart of determining whether the sample size meets the requirements according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a training sample enhancement system according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a training sample enhancing apparatus according to an embodiment of the present invention;

fig. 6 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the steps. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

As shown in fig. 1, an embodiment of the present invention provides a training sample enhancement method, including the following steps:

s100: collecting corpus data, and extracting alternative samples from the corpus data;

to expand the training data for general scenarios, this embodiment collects data from each scenario in the field and adds open-source, high-quality data, for example from sources such as Wikipedia and news;

s200: obtaining a sentence vector of the alternative sample;

s300: respectively calculating the similarity between the sentence vector of the seed sample and the sentence vector of each alternative sample;

s400: and selecting an enhanced sample from the alternative samples according to the calculated similarity, and adding the enhanced sample into a training set.

The invention aims to mine sample data similar to the seed samples directly from the original corpus and then add the mined sample data into the training sample set, thereby enhancing the existing training samples. First, corpus data is collected as candidate samples in step S100; sentence vectors of the candidate samples are then obtained in step S200; the similarity between the sentence vectors of the candidate samples and the seed samples is calculated in step S300; and the training sample set is automatically expanded in step S400 based on the similarity calculation. This reduces the need for manually labeled samples; the whole process is automatic and requires no manual intervention, saving labor. Moreover, a model trained on the enhanced training sample set is more robust, less affected by noise, and more accurate in application.

The training sample set obtained by this training sample enhancement method can be used to train machine learning models, such as convolutional neural networks (CNN), recurrent neural networks (RNN), and Transformers. The expanded and enhanced training sample set contains richer training samples. Because the enhanced samples screened by similarity are highly similar to the seed samples, each enhanced sample can be labeled based on the label of the seed sample it resembles; the labeling is therefore more accurate, the machine learning model is trained more accurately, and a better-performing model is obtained.

In this embodiment, the step S100: after extracting the candidate sample from the corpus data, the method further comprises a step of performing data preprocessing on the candidate sample, and specifically, the data preprocessing comprises the following steps:

carrying out normalization processing on the candidate samples according to a preset normalization rule.

The normalization rules may include rules for unifying case, unifying punctuation, converting traditional Chinese characters to simplified ones, removing stop words, removing low-frequency words, and so on, so that the processed candidate samples have a unified format.
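As an illustrative sketch of such rules (the stop-word list and punctuation map below are assumptions for demonstration, not the patent's actual rule set; traditional-to-simplified conversion would need a character mapping table and is omitted):

```python
import re

# Hypothetical stop-word list; a real system would load its own.
STOP_WORDS = {"the", "a", "an"}

def normalize(text, low_freq_words=()):
    """Apply a preset set of normalization rules to one candidate sample."""
    text = text.lower()  # unify case
    # Unify a few full-width punctuation marks (illustrative subset).
    text = text.replace("！", "!").replace("，", ",").replace("。", ".")
    tokens = re.findall(r"[\w']+", text)  # crude tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]          # remove stop words
    tokens = [t for t in tokens if t not in set(low_freq_words)]  # remove low-frequency words
    return " ".join(tokens)

print(normalize("The Hotel，was GREAT！"))  # → "hotel was great"
```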

In this embodiment, the step S200: obtaining the sentence vector of the candidate sample includes obtaining the sentence vector by one or more of the following methods:

obtaining a first sentence vector of the candidate sample based on doc2vec. doc2vec is an unsupervised algorithm, proposed on the basis of the word2vec algorithm, that converts text into vectors. Among its advantages, it accepts sentences of different lengths as training samples without requiring a fixed sentence length. doc2vec differs from word2vec in that a paragraph vector is added at the input layer, which helps predict the next word from a context sample taken from the paragraph;

obtaining a second sentence vector of the candidate sample according to the word vector of each word in the candidate sample. Specifically, the word vectors may be obtained with the word2vec algorithm, an algorithm that represents words as vectors;

and obtaining a third sentence vector of the candidate sample according to the word vector of each word in the candidate sample. Specifically, these word vectors may also be obtained with the word2vec algorithm, or from a preset mapping between words and word vectors.

All three are optional ways to obtain the sentence vector of a candidate sample. Further, the sentence vector of the corresponding seed sample may be obtained in one or more of the same ways. In one embodiment, to obtain a richer sentence vector, the three ways are combined: the first, second, and third sentence vectors obtained by the three ways are spliced, and the spliced vector is used as the sentence vector of the candidate sample.

Specifically, as shown in fig. 2, the step S200: obtaining a sentence vector of the candidate sample comprises the following steps:

S210: obtaining a first sentence vector of the candidate sample based on doc2vec;

S220: obtaining a second sentence vector of the candidate sample according to the word vector of each word in the candidate sample; specifically, the word vectors may be obtained with the word2vec algorithm;

S230: obtaining a third sentence vector of the candidate sample according to the word vector of each word in the candidate sample; specifically, these word vectors may also be obtained with the word2vec algorithm, or from a preset mapping between words and word vectors;

S240: splicing the first sentence vector, the second sentence vector and the third sentence vector to obtain the sentence vector of the candidate sample.

In step S240, the order in which the three sentence vectors are spliced is not limited: for example, they may be spliced in the order first, second, third; or first, third, second; or second, third, first.

Further, when the sentence vectors of the candidate samples are obtained through steps S210 to S240, the sentence vector of the seed sample is obtained in the same manner, and the splicing order of the first, second and third sentence vectors in the seed sample's sentence vector must be identical to the splicing order used for the candidate samples.
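The splicing described above is plain vector concatenation; a minimal sketch, with toy vectors standing in for real doc2vec and word2vec outputs:

```python
def splice_sentence_vector(v1, v2, v3):
    """Concatenate the first (doc2vec-based), second, and third sentence
    vectors into one sentence vector. The order is fixed but arbitrary,
    as long as seed and candidate samples use the same order."""
    return list(v1) + list(v2) + list(v3)

# Toy vectors standing in for real model outputs.
v = splice_sentence_vector([0.1, 0.2], [0.3], [0.4, 0.5])
print(v)  # → [0.1, 0.2, 0.3, 0.4, 0.5]
```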

Further, considering that different words have different degrees of importance in the text, if the second sentence vector is uniformly constructed with consistent weights, the semantic expression of the text may not be accurate enough. Based on this, the step S220: obtaining a second sentence vector of the alternative sample according to the word vector of each word in the alternative sample, comprising the following steps:

weighting the word vector of each word, obtained by word2vec, with the TF-IDF algorithm to obtain the second sentence vector of the candidate sample. TF-IDF (term frequency-inverse document frequency) is a weighting technique used in information retrieval and data mining: TF is the term frequency and IDF is the inverse document frequency. Here, TF-IDF evaluates how important a word is to a candidate sample: a word's importance increases in proportion to the number of times it appears in the candidate sample, but decreases in inverse proportion to its frequency across the entire corpus.
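A minimal sketch of this weighting, assuming the word vectors, document frequencies and corpus size are precomputed; the helper name and the smoothed-IDF formula are illustrative choices, not the patent's exact formula:

```python
import math
from collections import Counter

def tfidf_weighted_sentence_vector(tokens, word_vectors, doc_freq, n_docs):
    """TF-IDF-weighted average of word2vec word vectors, giving the
    second sentence vector of a candidate sample."""
    tf = Counter(tokens)
    dim = len(next(iter(word_vectors.values())))
    sent, total = [0.0] * dim, 0.0
    for word, count in tf.items():
        if word not in word_vectors:
            continue  # out-of-vocabulary words contribute nothing
        idf = math.log(n_docs / (1 + doc_freq.get(word, 0)))  # smoothed IDF
        w = (count / len(tokens)) * idf                       # TF * IDF weight
        total += w
        for i, x in enumerate(word_vectors[word]):
            sent[i] += w * x
    return [x / total for x in sent] if total else sent

# Toy vocabulary: "clean" is rarer in the corpus than "hotel",
# so it receives more weight in the sentence vector.
wv = {"hotel": [1.0, 0.0], "clean": [0.0, 1.0]}
vec = tfidf_weighted_sentence_vector(["hotel", "clean"], wv,
                                     {"hotel": 5, "clean": 1}, 100)
```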

In this embodiment, in consideration of the influence of sentences of different lengths on the final result, the step S200: after the sentence vector of the alternative sample is obtained, the method further comprises the following steps:

normalizing the sentence vectors of the candidate samples so that their lengths are consistent, thereby reducing the negative influence of differing sentence lengths.
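This length normalization is, in effect, scaling each sentence vector to unit L2 norm; a minimal sketch:

```python
import math

def l2_normalize(vec):
    """Scale a sentence vector to unit length so that all candidate
    sentence vectors have a consistent length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

print(l2_normalize([3.0, 4.0]))  # → [0.6, 0.8]
```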

In the present invention, the seed samples are the initial samples in the training set, i.e., the samples to be enhanced by data expansion. When computing the similarity between the seed samples and all candidate samples, direct computation has time complexity O(n × m), where n is the number of seed samples and m is the number of candidate samples (i.e., the amount of collected data, generally over one million). If there are too many seed samples, this is very time-consuming. Therefore, to reduce the workload of the training sample enhancement process, the collected candidate samples can be vectorized in advance and stored for direct later use, and the operations between vectors can be converted into matrix operations, which greatly reduces computation time.
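The matrix formulation can be sketched in pure Python; with real data this would be a single matrix product in a numerics library. The function name is illustrative, and the rows of both matrices are assumed already L2-normalized (consistent with the length normalization described earlier), so the similarity matrix is simply the product of the seed matrix and the transposed candidate matrix:

```python
def batch_cosine(seed_matrix, cand_matrix):
    """Compute all seed-vs-candidate cosine similarities at once.
    Rows are assumed L2-normalized, so similarity is a plain dot product."""
    return [[sum(a * b for a, b in zip(s, c)) for c in cand_matrix]
            for s in seed_matrix]

seeds = [[1.0, 0.0]]               # one normalized seed vector
cands = [[1.0, 0.0], [0.0, 1.0]]   # two precomputed candidate vectors
print(batch_cosine(seeds, cands))  # → [[1.0, 0.0]]
```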

Further, the step S300: respectively calculating the similarity between the sentence vector of the seed sample and the sentence vector of each alternative sample, comprising the following steps:

respectively calculating the cosine similarity between the sentence vector of the seed sample and the sentence vector of each candidate sample, and taking it as the similarity of the corresponding pair of sentence vectors. Cosine similarity measures the similarity of two vectors by the cosine of the angle between them. Compared with Euclidean-distance-based similarity methods such as those used in k-means and DBSCAN, cosine similarity is better suited to distance computation in high dimensions.

Choosing cosine similarity to calculate the similarity between the sentence vectors of the seed sample and the candidate samples is only a preferred embodiment, not a limitation of the scope of the invention. In other alternative embodiments, step S300: respectively calculating the similarity between the sentence vector of the seed sample and the sentence vector of each candidate sample may instead use Euclidean-distance-based similarity methods such as those used in k-means and DBSCAN, and these also fall within the protection scope of the invention.
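A minimal implementation of the cosine similarity used here, for vectors of any length:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))  # ≈ 0.7071
```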

In this embodiment, in order to ensure the accuracy of the augmented training samples, i.e., that the similarity between the enhanced samples and the seed samples is sufficiently high, step S400: selecting enhanced samples from the candidate samples according to the calculated similarity selects, as enhanced samples, the candidate samples whose similarity with the seed sample is greater than a preset similarity threshold. The value of this threshold can be chosen and adjusted as needed. A higher threshold raises the similarity requirement between the enhanced samples and the seed samples, but may slow down sample expansion; a lower threshold speeds up expansion and quickly yields a large number of enhanced samples, but some of them may have low similarity to the seed samples.
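As a sketch, the threshold-based selection of step S400 might look like the following; the 0.9 threshold is an illustrative value, not one fixed by the patent:

```python
def select_enhanced_samples(similarities, candidates, threshold=0.9):
    """Keep the candidates whose similarity to the seed sample exceeds
    a preset threshold (0.9 here is illustrative)."""
    return [cand for sim, cand in zip(similarities, candidates)
            if sim > threshold]

print(select_enhanced_samples([0.95, 0.42, 0.91], ["a", "b", "c"]))  # → ['a', 'c']
```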

In this embodiment, in order to ensure the accuracy of the expanded samples and avoid harming later model training through inaccurate sample labels, only enhanced samples above a high threshold are retained. If a single round of similarity calculation does not yield enough training samples, the similarity calculation can be repeated using the similar samples obtained in the previous round as seeds, until enough samples are collected.

Specifically, as shown in fig. 3, in this embodiment, the step S400: selecting an enhanced sample from the alternative samples according to the calculated similarity, and adding the enhanced sample into a training set, wherein the method further comprises the following steps:

s510: judging whether the number of samples in the training set is greater than or equal to a preset sample number threshold value or not;

if not, S520: taking the samples in the training set as new seed samples, S530: respectively calculating the similarity between the sentence vector of the new seed sample and the sentence vector of each candidate sample, and then continuing to step S540: selecting a new enhanced sample from the alternative samples according to the calculated similarity, adding the new enhanced sample into a training set, and after the new enhanced sample is added into the training set, continuing to perform judgment in step S510 to determine whether the number of samples in the current training set is greater than or equal to a preset sample number threshold value until the number of samples in the current training set meets the requirement;

if so, S550: a sufficient number of training samples have been collected, and the current training sample enhancement procedure ends.
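The iterative loop of steps S510 through S550 can be sketched as follows; `most_similar` is a toy stand-in for the similarity screening of steps S530 and S540 (a real system would compare sentence vectors as the text describes), and all names and values are illustrative.

```python
# Sketch of the iterative expansion loop (steps S510-S550). Samples are
# repeatedly mined from the candidate pool, with each round's additions
# becoming seeds for the next round, until the target size is reached.

def expand_training_set(seeds, pool, most_similar, target_size):
    train = list(seeds)
    while len(train) < target_size:                     # S510
        # S520/S530/S540: current training samples act as the new seeds
        new = [s for s in most_similar(train, pool) if s not in train]
        if not new:       # nothing else clears the threshold: stop early
            break
        train.extend(new)
    return train                                        # S550

# Toy screening rule: "similar" means lengths differ by at most one.
def most_similar(seeds, pool):
    return [p for p in pool if any(abs(len(p) - len(s)) <= 1 for s in seeds)]

print(expand_training_set(["abc"], ["abcd", "abcde", "zzzzzzzzz"], most_similar, 3))
```

Note how "abcde" is only reachable in the second round, once "abcd" has joined the seed set; this mirrors the patent's round-by-round expansion.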

As shown in fig. 4, an embodiment of the present invention further provides a training sample enhancement system, which is used to implement the training sample enhancement method, and the system includes:

the corpus collection module M100, used for collecting corpus data and extracting candidate samples from the corpus data; to support data expansion for general scenarios, the corpus collection module M100 collects data from each scenario in the field and adds open-source, high-quality data from sources such as Wikipedia and news;

a vector obtaining module M200, used for obtaining the sentence vector of each candidate sample; specifically, the vector obtaining module M200 may also obtain the sentence vector of a seed sample, and the sentence vector of the seed sample has the same format as that of the candidate sample and may be obtained in the same manner;

a similarity calculation module M300, configured to calculate similarities between the sentence vectors of the seed samples and the sentence vectors of the candidate samples, respectively;

and the sample enhancement module M400, used for selecting enhanced samples from the candidate samples according to the calculated similarities and adding them to the training set.

The invention aims to mine sample data similar to the seed samples directly from the original corpus and add the mined data to the training sample set, thereby enhancing the existing training samples. First, the corpus collection module M100 collects corpus data as candidate samples; the vector obtaining module M200 then obtains the sentence vectors of the candidate samples; the similarity calculation module M300 calculates the similarity between the sentence vectors of the candidate samples and those of the seed samples; and the sample enhancement module M400 automatically expands the training sample set based on the similarity calculation. This reduces the need for manually labeled samples. The whole process is automatic and requires no manual intervention, saving manpower; moreover, a model trained with the enhanced training sample set is more robust, less affected by noise, and more accurate in application.

The training sample set obtained by the training sample enhancement system can be used to train machine learning models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Transformers. The expanded and enhanced training sample set contains richer training samples; since the enhanced samples screened by similarity are highly similar to the seed samples, each enhanced sample can be labeled with the label of its similar seed sample. The labeling is therefore more accurate, the machine learning model is trained more accurately, and a better-performing model is obtained.

In this embodiment, the training sample enhancement system may further include a data preprocessing module, used for normalizing the candidate samples according to preset normalization rules after the corpus collection module M100 collects the corpus data and extracts the candidate samples. The normalization rules may include case unification, punctuation unification, traditional-to-simplified Chinese conversion, stop-word removal, low-frequency-word removal, and the like, so that the processed candidate samples have a unified format.
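A minimal sketch of such a preprocessing step, covering case unification, punctuation unification, and stop-word removal; the stop-word set and punctuation map below are placeholder values, not the patent's actual rules.

```python
# Toy normalization pass: unify case, unify full-width punctuation to
# ASCII, then drop stop words. STOPWORDS and PUNCT_MAP are illustrative.
import re

STOPWORDS = {"the", "a", "an"}
PUNCT_MAP = str.maketrans({"，": ",", "。": ".", "！": "!", "？": "?"})

def normalize(text):
    text = text.lower().translate(PUNCT_MAP)   # case + punctuation unification
    tokens = re.findall(r"\w+", text)          # strip remaining punctuation
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(normalize("The Room，was Clean！"))  # -> "room was clean"
```

Running every candidate sample through one such function gives the unified format the module requires.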

In this embodiment, the vector obtaining module M200 may obtain the sentence vector of a candidate sample in any one or more of the following manners:

obtaining a first sentence vector of the candidate sample based on doc2vec;

obtaining a second sentence vector of the candidate sample from the word vector of each word in the candidate sample; specifically, the word vectors may be obtained with the word2vec algorithm;

and obtaining a third sentence vector of the candidate sample from the character vector of each character in the candidate sample; specifically, the character vectors may also be obtained with the word2vec algorithm, or based on a preset mapping between characters and character vectors.
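The word-level and character-level sentence vectors can be sketched with toy embedding tables as follows. This assumes the third vector is character-level (a natural reading, since the second is word-level), and a trained doc2vec model would supply the first vector, stubbed out here; all tables and dimensions are illustrative stand-ins for real word2vec outputs.

```python
# Sketch of the second (word-level) and third (character-level) sentence
# vectors as averages over toy embedding tables, then concatenated with a
# placeholder for the doc2vec-based first vector.
import numpy as np

word_vecs = {"clean": np.array([1.0, 0.0]), "room": np.array([0.0, 1.0])}
char_vecs = {c: np.full(2, float(i)) for i, c in enumerate("clenarom")}

def avg_vector(units, table, dim=2):
    """Average the embeddings of the known units; zeros if none are known."""
    hits = [table[u] for u in units if u in table]
    return np.mean(hits, axis=0) if hits else np.zeros(dim)

sentence = "clean room"
v1 = np.zeros(2)                                      # doc2vec placeholder
v2 = avg_vector(sentence.split(), word_vecs)          # word-level vector
v3 = avg_vector(list(sentence.replace(" ", "")), char_vecs)  # char-level
sent_vec = np.concatenate([v1, v2, v3])               # spliced sentence vector
print(sent_vec.shape)  # (6,)
```

The concatenation at the end corresponds to the splicing the module performs, so the final sentence vector carries document-level, word-level, and character-level information side by side.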

In this embodiment, in order to obtain rich sentence vectors, the three manners may be fused, that is, the first, second, and third sentence vectors obtained in the three manners are concatenated. In other words, the vector obtaining module M200 preferably obtains the sentence vector of the candidate sample through the following steps:

the vector obtaining module M200 obtains the first sentence vector of the candidate sample based on doc2vec, the second sentence vector from the word vector of each word in the candidate sample, and the third sentence vector from the character vector of each character in the candidate sample;

the vector obtaining module M200 concatenates the first, second, and third sentence vectors to obtain the sentence vector of the candidate sample.

When the vector obtaining module M200 concatenates the three sentence vectors, their order is not limited: for example, they may be concatenated in the order first-second-third, first-third-second, second-third-first, and so on.

Further, different words carry different degrees of importance in a text; if the second sentence vector were constructed with uniform weights, the semantic representation of the text might not be accurate enough. Therefore, the vector obtaining module M200 obtains the second sentence vector as follows: the word vectors obtained by word2vec are weighted based on the TF-IDF algorithm to obtain the second sentence vector of the candidate sample.
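A sketch of the TF-IDF weighting: each word vector is scaled by the word's TF-IDF weight before the weighted average is taken, so important words contribute more to the sentence vector. The toy corpus, vectors, and smoothed IDF formula below are illustrative assumptions, not the patent's exact formulation.

```python
# TF-IDF-weighted average of word vectors (toy data throughout).
import math
import numpy as np

corpus = [["clean", "room"], ["clean", "lobby"], ["noisy", "room"]]
word_vecs = {"clean": np.array([1.0, 0.0]), "room": np.array([0.0, 1.0]),
             "lobby": np.array([1.0, 1.0]), "noisy": np.array([0.5, 0.5])}

def idf(word):
    """Smoothed inverse document frequency over the toy corpus."""
    df = sum(word in doc for doc in corpus)
    return math.log(len(corpus) / (1 + df)) + 1

def weighted_sentence_vector(tokens):
    """Weight each word2vec-style vector by TF-IDF, then average."""
    weights = np.array([tokens.count(t) / len(tokens) * idf(t)
                        for t in tokens])
    vecs = np.stack([word_vecs[t] for t in tokens])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

print(weighted_sentence_vector(["clean", "room"]))
```

In practice the IDF statistics would come from the full collected corpus (e.g. via scikit-learn's TfidfVectorizer) rather than a three-document toy.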

In this embodiment, the similarity calculation module M300 calculates the similarity between the sentence vector of the seed sample and the sentence vector of each candidate sample by computing their cosine similarity. In other alternative embodiments, other similarity algorithms may also be used, all of which fall within the scope of the present invention.
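The cosine similarity between a seed sentence vector and each candidate sentence vector can be computed as follows; the vectors are made up for illustration.

```python
# Cosine similarity between a seed vector and candidate vectors.
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

seed = np.array([1.0, 2.0, 0.0])
candidates = [np.array([2.0, 4.0, 0.0]),    # same direction -> similarity 1
              np.array([-2.0, 1.0, 0.0])]   # orthogonal     -> similarity 0

print([round(cosine(seed, c), 3) for c in candidates])  # [1.0, 0.0]
```

A value near 1 indicates near-identical direction in the embedding space, which is why thresholding on this score selects candidates semantically close to the seed.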

In this embodiment, the sample enhancement module M400 selects enhanced samples from the candidate samples according to the calculated similarity by selecting the candidate samples whose similarity to the seed sample is greater than a preset similarity threshold. The value of the preset similarity threshold can be chosen and adjusted as needed: a higher threshold raises the similarity requirement between the enhanced samples and the seed samples but may slow down sample expansion, while a lower threshold speeds up expansion and quickly yields many enhanced samples, though some of them may be less similar to the seed samples.

An embodiment of the invention also provides a training sample enhancement device, comprising a processor and a memory storing instructions executable by the processor, wherein the processor is configured to perform the steps of the training sample enhancement method by executing the executable instructions.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "platform."

An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 5. The electronic device 600 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 5, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 that connects the various system components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.

Wherein the storage unit stores program code executable by the processing unit 610 to cause the processing unit 610 to perform steps according to various exemplary embodiments of the present invention described in the training sample enhancement method section above in this specification. For example, the processing unit 610 may perform the steps as shown in fig. 1.

The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.

The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

In the training sample enhancement device, the program in the memory is executed by the processor to implement the steps of the training sample enhancement method; therefore, the device can also achieve the technical effects of the training sample enhancement method.

An embodiment of the present invention further provides a computer-readable storage medium, which is used for storing a program, and when the program is executed by a processor, the method for enhancing a training sample is implemented. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present invention described in the training sample enhancement method section above of this specification, when the program product is executed on the terminal device.

Referring to fig. 6, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be executed on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages, such as the "C" programming language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

The program in the computer storage medium implements the steps of the training sample enhancement method when executed by the processor, and therefore, the computer storage medium can also achieve the technical effects of the training sample enhancement method.

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
