Method and device for establishing term recognition model and method and device for recognizing terms

Document No.: 1831803    Publication date: 2021-11-12

Reading note: this technique, "Method and device for establishing term recognition model and method and device for recognizing terms", was created by 顾淑琴, 张昱琪, 施杨斌 and 陆军 on 2020-04-27. Abstract: The invention discloses a method and a device for establishing a term recognition model, and a term recognition method and device, relates to the technical field of artificial intelligence, and can solve the problem of the low accuracy of existing term recognition. The method mainly comprises: acquiring a first training set; deleting a certain proportion of the term labels in the first training set to obtain a second training set; training a first term recognition model on the first training set and a second term recognition model on the second training set; according to the loss differences obtained by performing term recognition on a specific data set with the first and second term recognition models respectively, screening labeled data meeting the term-labeling quality requirement out of the specific data set as a third training set; and performing model training with the third training set to obtain the final term recognition model. The method is mainly applicable to scenarios in which a term recognition model is built on a neural network.

1. A method for establishing a term recognition model is characterized by comprising the following steps:

acquiring a first training set, wherein the first training set comprises labeled data obtained by performing term labeling on a first corpus based on a preset automatic labeling method;

deleting a certain proportion of term labels in the first training set to obtain a second training set;

performing model training based on the first training set to obtain a first term recognition model, and performing model training based on the second training set to obtain a second term recognition model;

according to the loss difference obtained by performing term recognition on a specific data set with the first term recognition model and the second term recognition model respectively, screening labeled data meeting the term labeling quality requirement out of the specific data set as a third training set; the specific data set is labeled data obtained by performing term labeling on a second corpus with the preset automatic labeling method; the first corpus and the second corpus belong to the same corpus;

and carrying out model training by using the third training set to obtain a final required term recognition model.

2. The method according to claim 1, wherein the selecting labeled data satisfying term labeling quality requirements from a specific data set as a third training set according to loss difference obtained by term recognition on the specific data set by using the first term recognition model and the second term recognition model respectively comprises:

respectively using the first term recognition model and the second term recognition model to perform term recognition on the original sentence corresponding to the labeling data in the specific data set, and labeling the recognized terms; the original sentence is a sentence before term labeling is carried out based on the preset automatic labeling method;

calculating the labeling loss of the first term recognition model and the labeling loss of the second term recognition model respectively aiming at the same original sentence, and calculating the difference between the labeling loss of the first term recognition model and the labeling loss of the second term recognition model to obtain the loss difference;

and screening out the marked data with the loss difference larger than a preset threshold value from the specific data set as the third training set.

3. The method of claim 2, wherein calculating the annotation loss of the first term recognition model and the annotation loss of the second term recognition model for the same original sentence respectively comprises:

calculating term labeling results of the first term recognition model and term labeling results in the specific data set by using a preset loss function aiming at the same original sentence to obtain a labeling loss of the first term recognition model;

and calculating the term labeling result of the second term recognition model and the term labeling result in the specific data set by using the preset loss function aiming at the same original sentence to obtain the labeling loss of the second term recognition model.

4. The method according to claim 1, wherein if the term dictionary used by the preset automatic labeling method is updated, the method further comprises:

based on the updated term dictionary, carrying out term annotation on the corpus again by utilizing the preset automatic annotation method;

screening out the labeling data meeting the term labeling quality requirement from the labeling data subjected to the term labeling again;

and updating the finally required term recognition model based on the screened annotation data.

5. The method of claim 1, wherein deleting a percentage of term labels in the first training set to obtain a second training set comprises:

randomly deleting a certain proportion of the term labels in the first training set to obtain the second training set;

or, determining a domain to which each term in the first training set belongs, and randomly deleting the term labels in a certain proportion for each domain respectively to obtain the second training set.

6. The method according to any one of claims 1-5, wherein the preset automatic labeling method is a remote supervision method.

7. A method for term recognition, the method comprising:

acquiring user data information;

identifying the commodity name in the user data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model according to any one of claims 1 to 6;

marking out a commodity name in the user data information;

and determining user preference by analyzing the user data information marked with the commodity name, and recommending the commodity to the user according to the user preference.

8. A method for term recognition, the method comprising:

acquiring data information including a person name generated in a preset platform;

identifying the name of the person in the data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model according to any one of claims 1 to 6;

marking out a person name in the data information;

and analyzing the data information of the marked name according to a preset name analysis rule to obtain statistical information aiming at the name.

9. A method for term recognition, the method comprising:

acquiring medical data information;

identifying a medical name in the medical data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model according to any one of claims 1 to 6;

highlighting the medical name in the medical data information.

10. An apparatus for building a term recognition model, the apparatus comprising:

an acquisition unit, configured to acquire a first training set, wherein the first training set comprises labeled data obtained by performing term labeling on a first corpus based on a preset automatic labeling method;

a deleting unit, configured to delete a certain proportion of term labels in the first training set to obtain a second training set;

the first training unit is used for carrying out model training on the basis of the first training set to obtain a first term recognition model and carrying out model training on the basis of the second training set to obtain a second term recognition model;

a screening unit, configured to screen, according to the loss differences obtained by performing term recognition on a specific data set with the first term recognition model and the second term recognition model respectively, labeled data meeting the term labeling quality requirement out of the specific data set as a third training set; the specific data set is labeled data obtained by performing term labeling on a second corpus with the preset automatic labeling method; the first corpus and the second corpus belong to the same corpus;

and the second training unit is used for carrying out model training by utilizing the third training set to obtain a final required term recognition model.

11. A term recognition apparatus, the apparatus comprising:

an acquisition unit configured to acquire user data information;

an identifying unit configured to identify a commodity name in the user data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model according to any one of claims 1 to 6;

the marking unit is used for marking the commodity name in the user data information;

the determining unit is used for determining the user preference by analyzing the user data information marked with the commodity name;

and the recommending unit is used for recommending commodities to the user according to the user preference.

12. A term recognition apparatus, the apparatus comprising:

an acquisition unit, configured to acquire data information, generated in a preset platform, that includes a person name;

the identification unit is used for identifying the name of the person in the data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model according to any one of claims 1 to 6;

the marking unit is used for marking the name of the person in the data information;

and the analysis unit is used for analyzing the data information of the marked name according to a preset name analysis rule to obtain statistical information aiming at the name.

13. A term recognition apparatus, the apparatus comprising:

an acquisition unit for acquiring medical data information;

the identification unit is used for identifying the medical name in the medical data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model according to any one of claims 1 to 6;

and the output unit is used for highlighting the medical name in the medical data information.

14. A storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method according to any one of claims 1 to 9.

15. An electronic device, comprising a storage medium and a processor;

the processor is adapted to execute instructions;

the storage medium adapted to store a plurality of instructions;

the instructions are adapted to be loaded by the processor and to perform the method of any of claims 1 to 9.

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a device for establishing a term recognition model and a method and a device for recognizing terms.

Background

Terms are the designations used in a particular subject field to denote its concepts. For example, the apparel field has terms such as "dress", "high-heeled shoes", and "hat". Term recognition is of research significance in natural language processing and has broad application prospects, particularly in machine translation and cross-language information retrieval.

Current term recognition methods mainly fall into manual recognition and automatic recognition. For automatic recognition, the corpus is first term-labeled according to a term dictionary to obtain labeled data; part of the labeled data is then randomly selected as training samples to train a term recognition model; finally, the model performs term recognition on the sentences to be recognized. However, because a term dictionary is a term set accumulated from experience, it cannot label all terms in the corpus, so a training sample obtained by randomly selecting part of the labeled data contains a certain amount of error, which lowers the term recognition accuracy of the trained model.

Disclosure of Invention

In view of the above, the present invention provides a method and an apparatus for establishing a term recognition model, and a method and an apparatus for recognizing terms, which aim to solve the problem of the low accuracy of existing term recognition.

In a first aspect, the present invention provides a method for building a term recognition model, where the method includes:

acquiring a first training set, wherein the first training set comprises labeled data obtained by performing term labeling on a first corpus based on a preset automatic labeling method;

deleting a certain proportion of term labels in the first training set to obtain a second training set;

performing model training based on the first training set to obtain a first term recognition model, and performing model training based on the second training set to obtain a second term recognition model;

according to the loss difference obtained by performing term recognition on a specific data set with the first term recognition model and the second term recognition model respectively, screening labeled data meeting the term labeling quality requirement out of the specific data set as a third training set; the specific data set is labeled data obtained by performing term labeling on a second corpus with the preset automatic labeling method; the first corpus and the second corpus belong to the same corpus;

and carrying out model training by using the third training set to obtain a final required term recognition model.

Optionally, the screening, according to the loss difference obtained by respectively performing term recognition on a specific data set by using the first term recognition model and the second term recognition model, the labeling data meeting the term labeling quality requirement from the specific data set as a third training set includes:

respectively using the first term recognition model and the second term recognition model to perform term recognition on the original sentence corresponding to the labeling data in the specific data set, and labeling the recognized terms; the original sentence is a sentence before term labeling is carried out based on the preset automatic labeling method;

calculating the labeling loss of the first term recognition model and the labeling loss of the second term recognition model respectively aiming at the same original sentence, and calculating the difference between the labeling loss of the first term recognition model and the labeling loss of the second term recognition model to obtain the loss difference;

and screening out the marked data with the loss difference larger than a preset threshold value from the specific data set as the third training set.

Optionally, for the same original sentence, respectively calculating the annotation loss of the first term recognition model and the annotation loss of the second term recognition model includes:

calculating term labeling results of the first term recognition model and term labeling results in the specific data set by using a preset loss function aiming at the same original sentence to obtain a labeling loss of the first term recognition model;

and calculating the term labeling result of the second term recognition model and the term labeling result in the specific data set by using the preset loss function aiming at the same original sentence to obtain the labeling loss of the second term recognition model.
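The preset loss function itself is not specified by the invention; as one hedged concretization, the per-sentence labeling loss can be computed as token-level cross-entropy between a model's predicted label distributions and the labels in the specific data set. All names in the sketch below are illustrative assumptions, not the patent's implementation:

```python
import math

# Hypothetical per-sentence labeling loss, assuming a B/I/O label set and
# token-level cross-entropy as the (unspecified) preset loss function.
LABELS = ["O", "B", "I"]

def annotation_loss(probs: list[list[float]], gold: list[str]) -> float:
    """Mean negative log-probability the model assigns to the dataset labels.

    probs[i] is the model's probability distribution over LABELS for the
    i-th character; gold[i] is the label from the specific data set.
    """
    total = 0.0
    for p, g in zip(probs, gold):
        total -= math.log(p[LABELS.index(g)])
    return total / len(gold)

# The loss difference for one sentence would then be
#   annotation_loss(probs_model1, gold) - annotation_loss(probs_model2, gold)
```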

Optionally, if the term dictionary used by the preset automatic labeling method is updated, the method further includes:

based on the updated term dictionary, carrying out term annotation on the corpus again by utilizing the preset automatic annotation method;

screening out the labeling data meeting the term labeling quality requirement from the labeling data subjected to the term labeling again;

and updating the finally required term recognition model based on the screened annotation data.

Optionally, the deleting a certain proportion of term labels in the first training set to obtain a second training set includes:

randomly deleting a certain proportion of the term labels in the first training set to obtain the second training set;

or, determining a domain to which each term in the first training set belongs, and randomly deleting the term labels in a certain proportion for each domain respectively to obtain the second training set.

Optionally, the preset automatic labeling method is a remote supervision method.

In a second aspect, the present invention provides a method for recognizing a term, the method comprising:

acquiring user data information;

identifying the commodity name in the user data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model in the first aspect;

marking out a commodity name in the user data information;

and determining user preference by analyzing the user data information marked with the commodity name, and recommending the commodity to the user according to the user preference.

In a third aspect, the present invention provides a method for identifying a term, including:

acquiring data information including a person name generated in a preset platform;

identifying the name of the person in the data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model in the first aspect;

marking out a person name in the data information;

and analyzing the data information of the marked name according to a preset name analysis rule to obtain statistical information aiming at the name.

In a fourth aspect, the present invention provides a method for recognizing a term, including:

acquiring medical data information;

identifying a medical name in the medical data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model in the first aspect;

highlighting the medical name in the medical data information.

In a fifth aspect, the present invention provides an apparatus for building a term recognition model, the apparatus comprising:

an acquisition unit, configured to acquire a first training set, wherein the first training set comprises labeled data obtained by performing term labeling on a first corpus based on a preset automatic labeling method;

a deleting unit, configured to delete a certain proportion of term labels in the first training set to obtain a second training set;

the first training unit is used for carrying out model training on the basis of the first training set to obtain a first term recognition model and carrying out model training on the basis of the second training set to obtain a second term recognition model;

a screening unit, configured to screen, according to the loss differences obtained by performing term recognition on a specific data set with the first term recognition model and the second term recognition model respectively, labeled data meeting the term labeling quality requirement out of the specific data set as a third training set; the specific data set is labeled data obtained by performing term labeling on a second corpus with the preset automatic labeling method; the first corpus and the second corpus belong to the same corpus;

and the second training unit is used for carrying out model training by utilizing the third training set to obtain a final required term recognition model.

Optionally, the screening unit includes:

the labeling module is used for performing term recognition on the original sentences corresponding to the labeling data in the specific data set by using the first term recognition model and the second term recognition model respectively and labeling the recognized terms; the original sentence is a sentence before term labeling is carried out based on the preset automatic labeling method;

the calculation module is used for calculating the labeling loss of the first term recognition model and the labeling loss of the second term recognition model respectively aiming at the same original sentence, and calculating the difference between the labeling loss of the first term recognition model and the labeling loss of the second term recognition model to obtain the loss difference;

and the screening module is used for screening out the marked data with the loss difference larger than a preset threshold value from the specific data set as the third training set.

Optionally, the calculation module is configured to, for the same original sentence, calculate the term labeling result of the first term recognition model against the term labeling result in the specific data set by using a preset loss function, to obtain the labeling loss of the first term recognition model; and calculate the term labeling result of the second term recognition model against the term labeling result in the specific data set by using the preset loss function, to obtain the labeling loss of the second term recognition model.

Optionally, the apparatus further comprises:

an updating unit, configured to perform term annotation again on the corpus by using the preset automatic annotation method based on the updated term dictionary if the term dictionary used by the preset automatic annotation method is updated; screening out the labeling data meeting the term labeling quality requirement from the labeling data subjected to the term labeling again; and updating the finally required term recognition model based on the screened annotation data.

Optionally, the deleting unit is configured to randomly delete a certain proportion of the term labels in the first training set to obtain the second training set;

or, determining a domain to which each term in the first training set belongs, and randomly deleting the term labels in a certain proportion for each domain respectively to obtain the second training set.

Optionally, the preset automatic labeling method is a remote supervision method.

In a sixth aspect, the present invention provides a term recognition apparatus, including:

an acquisition unit configured to acquire user data information;

an identifying unit configured to identify a commodity name in the user data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model in the first aspect;

the marking unit is used for marking the commodity name in the user data information;

the determining unit is used for determining the user preference by analyzing the user data information marked with the commodity name;

and the recommending unit is used for recommending commodities to the user according to the user preference.

In a seventh aspect, the present invention provides a term recognition apparatus, including:

the system comprises an acquisition unit, a display unit and a display unit, wherein the acquisition unit is used for acquiring data information including a name generated in a preset platform;

the identification unit is used for identifying the name of the person in the data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model in the first aspect;

the marking unit is used for marking the name of the person in the data information;

and the analysis unit is used for analyzing the data information of the marked name according to a preset name analysis rule to obtain statistical information aiming at the name.

In an eighth aspect, the present invention provides a term recognition apparatus, including:

an acquisition unit for acquiring medical data information;

the identification unit is used for identifying the medical name in the medical data information based on a term recognition model; the term recognition model is obtained by the method for establishing a term recognition model in the first aspect;

and the output unit is used for highlighting the medical name in the medical data information.

In a ninth aspect, the present invention provides a storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method according to any one of the first to fourth aspects.

In a tenth aspect, the present invention provides an electronic device comprising a storage medium and a processor;

the processor is adapted to execute instructions;

the storage medium adapted to store a plurality of instructions;

the instructions are adapted to be loaded by the processor and to perform the method of any of the first to fourth aspects.

By the above technical solution, the method and device for establishing a term recognition model and the method and device for recognizing terms provided by the present invention do not, after term-labeling the corpus with a preset automatic labeling method, directly and randomly select part of the labeled data for model training. Instead, relatively comprehensive labeled data (i.e., high-quality labeled data) is screened out of the labeled data, and model training is performed with the screened high-quality data, so that a term recognition model with higher recognition accuracy can be obtained. Specifically, a part of the labeled data is first selected as the first training set, and a copy of it with a certain proportion of the term labels deleted serves as the second training set. Next, a relatively high-quality first term recognition model is trained on the first training set, and a relatively low-quality second term recognition model is trained on the second training set. Then, the difference between the losses of the two models when performing term recognition on a specific data set (namely, data selected from the labeled corpus other than the first training set) is used to judge whether the quality of the term labeling originally performed on that data set meets the requirement, and the labeled data meeting the quality requirement is screened out as the final training set for model training. This greatly improves the recognition accuracy of the finally trained term recognition model.
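The screening step described above can be sketched as follows. This is a minimal illustration assuming per-sentence loss functions for the two trained models; the names `screen_third_training_set`, `loss_model1`, and `loss_model2` are hypothetical, and the sign convention of the difference follows the wording of claim 2:

```python
from typing import Callable

# Minimal sketch: both trained models score every labeled sentence in the
# specific data set, and sentences whose loss difference exceeds a preset
# threshold are kept as the third training set. `loss_model1`/`loss_model2`
# stand in for the per-sentence labeling losses of the first (stronger) and
# second (weaker) models; all names are illustrative assumptions.

def screen_third_training_set(
    dataset: list[tuple[str, list[str]]],           # (sentence, labels) pairs
    loss_model1: Callable[[str, list[str]], float],
    loss_model2: Callable[[str, list[str]], float],
    threshold: float,
) -> list[tuple[str, list[str]]]:
    third = []
    for sentence, labels in dataset:
        # Loss difference per claim 2: first model's loss minus the second's.
        diff = loss_model1(sentence, labels) - loss_model2(sentence, labels)
        if diff > threshold:
            third.append((sentence, labels))
    return third
```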

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

fig. 1 is a flowchart illustrating a method for building a term recognition model according to an embodiment of the present invention;

fig. 2 and fig. 3 are schematic diagrams illustrating a method for establishing a term recognition model according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a method for screening annotation data according to an embodiment of the present invention;

FIG. 5 is a block diagram illustrating an apparatus for building a term recognition model according to an embodiment of the present invention;

FIG. 6 is a block diagram showing another term recognition model building apparatus according to an embodiment of the present invention;

fig. 7 is a block diagram illustrating components of a term recognition apparatus according to an embodiment of the present invention;

fig. 8 is a block diagram showing another term recognition apparatus provided in the embodiment of the present invention;

fig. 9 is a block diagram illustrating a further term recognition apparatus according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

To improve the term recognition accuracy of a term recognition model, the embodiments of the present invention do not directly and randomly select part of the labeled data for model training after term-labeling the corpus with a preset automatic labeling method. Instead, relatively comprehensive labeled data (i.e., high-quality labeled data) is first screened out of the labeled data, and model training is then performed with the screened high-quality data, yielding a term recognition model with higher recognition accuracy. As shown in fig. 1 to 3, the method for establishing a term recognition model provided in the embodiment of the present invention mainly includes:

101. A first training set is acquired.

The first training set comprises labeled data obtained by performing term labeling on the first corpus with a preset automatic labeling method. The preset automatic labeling method includes, but is not limited to, remote supervision; the first corpus includes a preset number of original sentences (i.e., unlabeled sentences). Remote supervision is usually applied in relation extraction tasks to automatically construct labeled training data, saving the cost of manual annotation. In the term mining task, given a term dictionary and an unlabeled corpus, any character string in the corpus that is contained in the term dictionary is labeled as a term.

For example, for the sentence to be labeled "buy a long-sleeve dress and a pair of black high-heeled shoes": if the term dictionary includes "high-heeled shoes", then "high-heeled shoes" may be labeled as a term; if the term dictionary includes both "high-heeled shoes" and "dress", then both "high-heeled shoes" and "dress" may be labeled as terms.

In specific labeling, each character in the sentence may be assigned a symbol, with terms and non-terms receiving different symbols. For example, if only "high-heeled shoes" is labeled as a term in the above sentence, the tag sequence for the sentence assigns B to the first character of "high-heeled shoes", I to its remaining characters, and O to every other character, where O denotes a non-term character, B denotes the beginning boundary of a term, and I denotes a middle or ending position of a term. If the embodiment of the present invention adopts this labeling method for the first corpus, the labeled data in the first training set includes each original sentence and its corresponding tag sequence.
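The dictionary-matching labeling above can be sketched as follows. The function name, whitespace tokenization, and example dictionary are illustrative assumptions; a real implementation for Chinese text would match character spans rather than space-separated tokens:

```python
# Sketch of remote-supervision term labeling with B/I/O tags.
# Longest dictionary match wins, so "high-heeled shoes" beats "shoes".

def bio_tag(tokens, term_dict):
    """Tag each token: B for a term's first token, I for the rest, O otherwise."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match_len = 0
        # try the longest span starting at position i first
        for j in range(len(tokens), i, -1):
            if " ".join(tokens[i:j]) in term_dict:
                match_len = j - i
                break
        if match_len:
            tags[i] = "B"
            for k in range(i + 1, i + match_len):
                tags[k] = "I"
            i += match_len
        else:
            i += 1
    return tags

sentence = "buy a long-sleeve dress and a pair of black high-heeled shoes".split()
term_dict = {"dress", "high-heeled shoes"}
print(bio_tag(sentence, term_dict))
# → ['O', 'O', 'O', 'B', 'O', 'O', 'O', 'O', 'O', 'B', 'I']
```

The labeled data pair then consists of the original token sequence and this tag sequence.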

Alternatively, when the fonts in the first corpus are uniform, term labeling can be realized by changing the font of a given character string; other labeling manners may also be adopted.

In addition, the term dictionary used for labeling may cover a single domain or multiple domains. When term dictionaries of multiple domains (such as the e-commerce, medical, and computer domains) are used, the dictionaries of different domains and the sentences to be labeled should be distributed relatively uniformly, so that the finally trained term recognition model shows relatively small differences in recognition accuracy across domains.

102. And deleting a certain proportion of term labels in the first training set to obtain a second training set.

When the first training set includes terms of a single domain, a certain proportion of term labels in the first training set may be deleted at random to obtain the second training set. When the first training set includes terms of multiple domains, either a certain proportion of term labels may be deleted at random across the whole set, or the domain to which each term belongs may be determined first and the proportion deleted at random within each domain separately, to obtain the second training set. The proportion may be determined empirically; for example, experiments may show that the labeling quality of the resulting third training set is highest when the proportion is 20% to 30%.

When terms are labeled in the "BIO" manner mentioned in step 101, deleting a term label may be implemented by changing the "B" and "I" symbols of the term to "O". When terms are labeled by modifying the font, the font can be changed back to the original font.
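A minimal sketch of this deletion step, assuming BIO tag sequences as in step 101; the function name and the rounding of the deletion count are illustrative assumptions:

```python
import random

def delete_term_labels(tags, proportion, seed=None):
    """Reset a random fraction of labeled term spans back to 'O'.

    `tags` is a BIO sequence; each B...I span counts as one term.
    """
    rng = random.Random(seed)
    # collect the (start, end) indices of every term span
    spans = []
    for i, t in enumerate(tags):
        if t == "B":
            j = i + 1
            while j < len(tags) and tags[j] == "I":
                j += 1
            spans.append((i, j))
    n_delete = round(len(spans) * proportion)
    out = list(tags)
    for start, end in rng.sample(spans, n_delete):
        out[start:end] = ["O"] * (end - start)
    return out

tags = ["O", "B", "I", "O", "B", "O", "B", "I", "I", "O"]
print(delete_term_labels(tags, 0.3, seed=0))  # one of the three term spans reset to O
```

Deleting whole spans (rather than individual symbols) keeps the remaining tag sequences well-formed BIO.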

103. And carrying out model training based on the first training set to obtain a first term recognition model, and carrying out model training based on the second training set to obtain a second term recognition model.

The neural network structures used for training the first term recognition model and the second term recognition model include, but are not limited to: Bi-LSTM (Bidirectional Long Short-Term Memory network), RNN (Recurrent Neural Network), CNN (Convolutional Neural Network), and Transformer.

104. And screening out labeling data meeting the term labeling quality requirement from the specific data set as a third training set according to the loss difference obtained by respectively carrying out term recognition on the specific data set by utilizing the first term recognition model and the second term recognition model.

The specific data set is labeled data obtained after term labeling is performed on a second corpus by the preset automatic labeling method; the first corpus and the second corpus belong to the same corpus. In specific implementation, a first preset number of sentences may be taken from the corpus as the first corpus and labeled to obtain the first training set, and a second preset number of sentences may be taken as the second corpus and labeled to obtain the specific data set, with no duplicate sentences between the two corpora. Alternatively, the whole corpus may be labeled first, and then one part of the labeled data taken as the first training set and another part as the specific data set. In addition, in order to tune the parameters of the term recognition model and evaluate its recognition effect, a third corpus and a fourth corpus that do not overlap the first and second corpora may be taken from the corpus and labeled to obtain a validation set and a test set; equivalently, two further non-overlapping parts of the labeled data may serve as the validation set and the test set. The specific data set is much larger than the first training set, and the ratio of first training set : validation set : test set may be N : 1 : 1, where N is greater than 1.
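The splitting scheme above might be sketched as follows. The unit sizes, and how much larger the specific data set is than the first training set, are illustrative assumptions; the embodiment only requires disjoint parts with a first training set : validation set : test set ratio of N : 1 : 1:

```python
import random

def split_corpus(labeled, n=4, specific_units=10, seed=42):
    """Disjoint split of labeled data into first training set, validation set,
    test set (ratio n : 1 : 1), and a much larger specific data set.

    `specific_units` (units allotted to the specific set) is an assumption.
    """
    rng = random.Random(seed)
    data = labeled[:]
    rng.shuffle(data)
    unit = len(data) // (n + 2 + specific_units)
    first = data[: n * unit]
    val = data[n * unit : (n + 1) * unit]
    test_set = data[(n + 1) * unit : (n + 2) * unit]
    specific = data[(n + 2) * unit :]
    return first, val, test_set, specific

parts = split_corpus(list(range(160)), n=4)
print([len(p) for p in parts])  # → [40, 10, 10, 100]
```

Shuffling before slicing keeps the four parts disjoint while drawing them from the same corpus.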

Because the second training set is obtained by deleting part of the term labels from the first training set, it contains relatively fewer labeled terms, so the recognition accuracy of the second term recognition model trained on it is lower than that of the first term recognition model trained on the first training set. When the first term recognition model labels the original sentences in the specific data set, a labeling loss arises relative to the original term labels of the specific data set; similarly, the second term recognition model also produces a labeling loss. The larger the difference between the two labeling losses, the higher the quality of the terms originally labeled in the specific data set. Therefore, the quality of each piece of labeled data in the specific data set can be judged from the loss difference, and the better-quality labeled data selected as the training set for training the final term recognition model.

105. And carrying out model training by using the third training set to obtain a final required term recognition model.

The network structure used for training the final required term recognition model is the same as the neural network structure used for the first term recognition model and the second term recognition model.

With the method for establishing a term recognition model provided by the embodiment of the present invention, after term labeling is performed on the corpus based on a preset automatic labeling method to obtain labeled data, a part of the labeled data is not directly and randomly selected for model training; instead, relatively comprehensively labeled data (namely, high-quality labeled data) is first screened from the labeled data, and model training is then performed with the screened data, so that a term recognition model with higher recognition accuracy is obtained. Specifically, a part of the labeled data is selected as the first training set, and the labeled data obtained by deleting a certain proportion of term labels from the first training set serves as the second training set. A first term recognition model of relatively high quality is then trained on the first training set, and a second term recognition model of relatively low quality is trained on the second training set. Next, the difference between the loss of term recognition performed on the specific data set (namely, labeled data other than the first training set in the labeled corpus) by the first term recognition model and that by the second term recognition model is used to judge whether the original term labeling of the specific data set meets the quality requirement, and the labeled data meeting the requirement is screened out as the final training set for model training, thereby greatly improving the recognition accuracy of the finally trained term recognition model.

Optionally, a specific implementation manner of the step 104 may be as shown in fig. 4, and specifically includes:

1041. and respectively using the first term recognition model and the second term recognition model to perform term recognition on the original sentence corresponding to the labeling data in the specific data set, and labeling the recognized terms.

The original sentence is the sentence before term labeling is performed based on the preset automatic labeling method. For example, suppose the specific data set includes the labeled data for "buy a long-sleeve dress and a pair of black high-heeled shoes", in which "dress" and "high-heeled shoes" are labeled as terms; the corresponding original sentence is then "buy a long-sleeve dress and a pair of black high-heeled shoes". When the first term recognition model performs term recognition on the original sentence, both "dress" and "high-heeled shoes" may be labeled as terms; when the second term recognition model performs term recognition, only "dress" may be labeled.

1042. And aiming at the same original sentence, respectively calculating the labeling loss of the first term recognition model and the labeling loss of the second term recognition model, and calculating the difference between the labeling loss of the first term recognition model and the labeling loss of the second term recognition model to obtain the loss difference.

For the same original sentence, the term labeling result of the first term recognition model and the term labeling result in the specific data set are evaluated with a preset loss function to obtain the labeling loss of the first term recognition model; likewise, the term labeling result of the second term recognition model and the term labeling result in the specific data set are evaluated with the preset loss function to obtain the labeling loss of the second term recognition model. The preset loss function includes, but is not limited to, the cross-entropy loss function.

The loss function measures the degree of inconsistency between the model's predicted values and the true values; it is a non-negative real-valued function, and the smaller its value, the better the model fits the data. Here the term labeling results of the first and second term recognition models are the predicted values, and the term labeling results in the specific data set are the true values, so the labeling losses can be calculated from them.
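A minimal sketch of such a token-level cross-entropy labeling loss, assuming each model outputs a probability distribution over the O/B/I tags for every token; the data structures and names are illustrative, not from the patent:

```python
import math

def labeling_loss(pred_probs, gold_tags):
    """Average token-level cross-entropy between a model's predicted tag
    distributions and the sentence's original remote-supervision tags.

    `pred_probs` is one {tag: probability} dict per token (an assumption
    about the model's output format).
    """
    total = 0.0
    for probs, gold in zip(pred_probs, gold_tags):
        total += -math.log(probs[gold])  # penalize low probability on the gold tag
    return total / len(gold_tags)

# toy predictions for a 3-token sentence whose original tags are O B I
pred = [
    {"O": 0.9, "B": 0.05, "I": 0.05},
    {"O": 0.2, "B": 0.7, "I": 0.1},
    {"O": 0.3, "B": 0.1, "I": 0.6},
]
print(round(labeling_loss(pred, ["O", "B", "I"]), 4))  # → 0.3243
```

The same function is applied to both models' outputs on the same original sentence, and the two losses are then subtracted to obtain the loss difference.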

1043. And screening out the marked data with the loss difference larger than a preset threshold value from the specific data set as the third training set.

The greater the difference between the labeling loss of the first term recognition model and that of the second term recognition model, the higher the quality of the corresponding labeled data in the specific data set, that is, the more comprehensive its labeled terms. A difference threshold can therefore be set empirically; labeled data whose loss difference exceeds the threshold is judged to be of relatively high quality and selected as the third training set for training the final term recognition model.
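The threshold screening of step 1043 can be sketched as follows, assuming per-sentence labeling losses from step 1042 are available. The direction of the difference (second model's loss minus the first's, since the weaker model should lose more on well-labeled data) and all names here are illustrative assumptions:

```python
def screen_third_training_set(specific_set, loss_model1, loss_model2, threshold):
    """Keep labeled sentences whose loss difference exceeds the preset threshold.

    `loss_model1` / `loss_model2` map each sentence id to the labeling loss of
    the first / second term recognition model on that sentence.
    """
    third = []
    for item in specific_set:
        diff = loss_model2[item] - loss_model1[item]
        if diff > threshold:
            third.append(item)
    return third

# toy per-sentence losses
l1 = {"s1": 0.2, "s2": 0.5, "s3": 0.1}
l2 = {"s1": 0.9, "s2": 0.6, "s3": 0.8}
print(screen_third_training_set(["s1", "s2", "s3"], l1, l2, threshold=0.3))
# → ['s1', 's3']
```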

Further, as language evolves, new terms appear in each domain, so the term dictionary needs continual updating, and the term recognition model must likewise be updated to maintain recognition accuracy. Therefore, if the term dictionary used by the preset automatic labeling method is updated, the corpus may be re-labeled with the preset automatic labeling method based on the updated dictionary, labeled data meeting the term labeling quality requirement screened out from the newly labeled data, and the finally required term recognition model updated based on the screened data. The specific manner of screening out labeled data meeting the quality requirement from the re-labeled data is the same as in steps 101 to 104 and is not repeated here.

The term recognition model provided by the embodiment of the present invention can be applied to various term recognition scenarios. For example, it can recognize commodity names in user data in the e-commerce domain (such as commodity reviews, communications with merchants, and communications with e-commerce customer service); recognize person names in data from domains such as music and movie ticketing (such as lyrics, user comments, and movie profiles); and recognize drug or disease names in data from the medical domain (such as diagnoses and laboratory test reports).

Specific implementations of term recognition in the above three domains are illustrated below:

(one) E-commerce field:

and A1, acquiring user data information.

The user data information comprises commodity comment information, communication information with a merchant, communication information with e-commerce customer service and the like.

And A2, identifying the commodity name in the user data information based on the term identification model.

The term recognition model is established by adopting the establishing method of the term recognition model, and terms related when the term recognition model is established by adopting the method comprise commodity names.

A3, marking commodity name in the user data information.

A4, analyzing the user data information marked with the commodity name, determining the user preference, and recommending the commodity to the user according to the user preference.

(II) music, film ticket selling and other fields

And B1, acquiring data information including the name of the person generated in the preset platform.

The preset platform includes, but is not limited to, a music platform, a movie ticketing platform, and the like. The data information is information containing person names, such as lyrics, music profiles, user comments, and movie profiles.

And B2, identifying the names of the persons in the data information based on the term identification model.

The term recognition model is established by adopting the establishing method of the term recognition model, and terms related when the term recognition model is established by adopting the method comprise names of people.

And B3, marking the name of the person in the data information.

And B4, analyzing the data information of the marked names according to a preset name analysis rule to obtain statistical information aiming at the names.

The labeled person names include, but are not limited to, names of persons appearing in lyrics or videos and names of performers (including singers and film actors). Specifically, user comments with labeled person names can be analyzed to determine the user's preference for a person, so that other movies or songs featuring that performer can be recommended to the user; the attention paid to each person name can also be counted, and the names ranked accordingly.

(III) medical field

And C1, acquiring medical data information.

The medical data information includes a diagnostic book, a laboratory sheet, and the like.

And C2, identifying the medical name in the medical data information based on the term identification model.

The term recognition model is established by adopting the establishing method of the term recognition model, and the terms involved in establishing the term recognition model by adopting the method comprise medical names. The medical names include drug names and disease names.

And C3, highlighting the medical name in the medical data information so that the medical staff can read effective information from the medical data information quickly.

Further, according to the above embodiment of the method for establishing a term recognition model, another embodiment of the present invention further provides an apparatus for establishing a term recognition model, as shown in fig. 5, the apparatus comprising:

an obtaining unit 21, configured to obtain a first training set, where the first training set includes tagged data obtained by performing term tagging on a first corpus based on a preset automatic tagging method;

a deleting unit 22, configured to delete a certain proportion of term labels in the first training set to obtain a second training set;

a first training unit 23, configured to perform model training based on the first training set to obtain a first term recognition model, and perform model training based on the second training set to obtain a second term recognition model;

a screening unit 24, configured to screen, according to loss differences obtained by respectively performing term recognition on a specific data set by using the first term recognition model and the second term recognition model, labeling data meeting the term labeling quality requirement from the specific data set as a third training set; the specific data set is marked data obtained after term marking is carried out on a second corpus by utilizing the preset automatic marking method; the first corpus and the second corpus belong to the same corpus;

and the second training unit 25 is configured to perform model training by using the third training set to obtain a final required term recognition model.

Optionally, as shown in fig. 6, the screening unit 24 includes:

a labeling module 241, configured to perform term recognition on an original sentence corresponding to labeled data in the specific data set by using the first term recognition model and the second term recognition model, respectively, and label the recognized terms; the original sentence is a sentence before term labeling is carried out based on the preset automatic labeling method;

a calculating module 242, configured to calculate, for the same original sentence, a labeling loss of the first term recognition model and a labeling loss of the second term recognition model, and calculate a difference between the labeling loss of the first term recognition model and the labeling loss of the second term recognition model, so as to obtain the loss difference;

a screening module 243, configured to screen out, from the specific data set, the labeled data with the loss difference larger than a preset threshold as the third training set.

Optionally, the calculating module 242 is configured to calculate, for the same original sentence and by using a preset loss function, a term labeling result of the first term recognition model and a term labeling result in the specific data set, so as to obtain a labeling loss of the first term recognition model; and to calculate, for the same original sentence and by using the preset loss function, the term labeling result of the second term recognition model and the term labeling result in the specific data set, so as to obtain the labeling loss of the second term recognition model.

Optionally, as shown in fig. 6, the apparatus further includes:

an updating unit 26, configured to perform term annotation again on the corpus by using the preset automatic annotation method based on the updated term dictionary if the term dictionary used by the preset automatic annotation method is updated; screening out the labeling data meeting the term labeling quality requirement from the labeling data subjected to the term labeling again; and updating the finally required term recognition model based on the screened annotation data.

Optionally, the deleting unit 22 is configured to randomly delete the term labels in the first training set according to the certain proportion to obtain the second training set;

or, determining a domain to which each term in the first training set belongs, and randomly deleting the term labels in a certain proportion for each domain respectively to obtain the second training set.

Optionally, the preset automatic labeling method is a remote supervision method.

With the apparatus for establishing a term recognition model provided by the embodiment of the present invention, after term labeling is performed on the corpus based on a preset automatic labeling method to obtain labeled data, a part of the labeled data is not directly and randomly selected for model training; instead, relatively comprehensively labeled data (namely, high-quality labeled data) is first screened from the labeled data, and model training is then performed with the screened data, so that a term recognition model with higher recognition accuracy is obtained. Specifically, a part of the labeled data is selected as the first training set, and the labeled data obtained by deleting a certain proportion of term labels from the first training set serves as the second training set. A first term recognition model of relatively high quality is then trained on the first training set, and a second term recognition model of relatively low quality is trained on the second training set. Next, the difference between the loss of term recognition performed on the specific data set (namely, labeled data other than the first training set in the labeled corpus) by the first term recognition model and that by the second term recognition model is used to judge whether the original term labeling of the specific data set meets the quality requirement, and the labeled data meeting the requirement is screened out as the final training set for model training, thereby greatly improving the recognition accuracy of the finally trained term recognition model.

Further, according to the above embodiment of the term recognition method, another embodiment of the present invention further provides a term recognition apparatus, as shown in fig. 7, the apparatus includes:

an acquisition unit 31 for acquiring user data information;

an identifying unit 32 for identifying the commodity name in the user data information based on a term identification model; the term recognition model is obtained by adopting the establishment method of the term recognition model;

a labeling unit 33, configured to label a commodity name in the user data information;

a determination unit 34 configured to determine a user preference by analyzing the user data information labeled with the commodity name;

and the recommending unit 35 is configured to recommend the commodity to the user according to the user preference.

Further, according to the above embodiment of the term recognition method, another embodiment of the present invention further provides a term recognition apparatus, as shown in fig. 8, the apparatus includes:

an obtaining unit 41, configured to obtain data information including a name generated in a preset platform;

a recognition unit 42 for recognizing the names of persons in the data information based on a term recognition model; the term recognition model is obtained by adopting the establishment method of the term recognition model;

a labeling unit 43, configured to label a person name in the data information;

and the analysis unit 44 is configured to analyze the data information with the name of the tagged person according to a preset name analysis rule, so as to obtain statistical information for the name of the person.

Further, according to the above embodiment of the term recognition method, another embodiment of the present invention further provides a term recognition apparatus, as shown in fig. 9, the apparatus includes:

an acquisition unit 51 for acquiring medical data information;

an identifying unit 52 for identifying the medical name in the medical data information based on a term identification model; the term recognition model is obtained by adopting the establishment method of the term recognition model;

and an output unit 53, configured to highlight the medical name in the medical data information.

Further, another embodiment of the present invention also provides a storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method as described above.

When the program stored in the storage medium provided by the embodiment of the present invention is executed, after term labeling is performed on the corpus based on a preset automatic labeling method to obtain labeled data, a part of the labeled data is not directly and randomly selected for model training; instead, relatively comprehensively labeled data (namely, high-quality labeled data) is first screened from the labeled data, and model training is then performed with the screened data, so that a term recognition model with higher term recognition accuracy can be obtained.

Further, another embodiment of the present invention also provides an electronic device including a storage medium and a processor;

the processor is suitable for realizing instructions;

the storage medium adapted to store a plurality of instructions;

the instructions are adapted to be loaded by the processor and to perform the method as described above.

With the electronic device provided by the embodiment of the present invention, after term labeling is performed on the corpus based on a preset automatic labeling method to obtain labeled data, a part of the labeled data is not directly and randomly selected for model training; instead, relatively comprehensively labeled data (namely, high-quality labeled data) is first screened from the labeled data, and model training is then performed with the screened data, so that a term recognition model with higher term recognition accuracy can be obtained.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It will be appreciated that the relevant features of the method and apparatus described above are referred to one another. In addition, "first", "second", and the like in the above embodiments are for distinguishing the embodiments, and do not represent merits of the embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any references above to specific languages are provided for disclosure of enablement and practice of the present invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be understood by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of the term recognition model building method and apparatus, some or all of the components of the term recognition method and apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second, third, etc., does not indicate any ordering; these words may be interpreted as names.
