Entity revision method, entity revision device, computer equipment and readable storage medium

Document No.: 1127741. Publication date: 2020-10-02.

Reading note: This technique, "Entity revision method, entity revision device, computer equipment and readable storage medium", was designed and created by 张乐情, 李燕婷, 李果夫, 李贤杰 and 刘剑 on 2020-06-24. Its main content is as follows: The invention discloses an entity revision method comprising: acquiring a text to be revised; inputting the text to be revised into an intention recognition model, so that the intention recognition model recognizes the intention of the text to be revised and marks the text range corresponding to each recognized intention in the text to be revised; when the recognized intention is not unique, splitting the text to be revised into a plurality of sub-texts to be revised according to the marked text ranges, wherein each sub-text to be revised corresponds to exactly one intention; inputting each sub-text to be revised into an entity extraction model, so that the entity extraction model extracts the entities in the sub-text to be revised; and acquiring standard entities associated with the intention of the sub-text to be revised from a background database, and revising the extracted entities of the sub-text to be revised by using the standard entities. The invention also discloses an entity revision apparatus and a computer-readable storage medium. In addition, the invention relates to model training in artificial intelligence and to blockchain technology.

1. An entity revision method, comprising:

acquiring a text to be revised;

inputting the text to be revised into an intention recognition model, so that the intention recognition model recognizes the intention of the text to be revised and marks a text range corresponding to the recognized intention in the text to be revised;

when the identified intentions are not unique, splitting the text to be revised into a plurality of sub-texts to be revised according to the marked text range, wherein each sub-text to be revised uniquely corresponds to one intention;

inputting the sub-text to be revised into an entity extraction model so that the entity extraction model extracts the entity in the sub-text to be revised;

and acquiring a standard entity associated with the intention of the sub-text to be revised from a background database, and revising the entity in the extracted sub-text to be revised by using the standard entity.

2. The method of claim 1, further comprising:

acquiring a plurality of intention recognition training samples, wherein each intention recognition training sample comprises a history text to be revised, intentions of the history text to be revised and a text range corresponding to each intention of the history text to be revised;

when a feature word meeting a first preset rule exists in the historical text to be revised, converting each character in the feature word meeting the first preset rule into an M-dimensional vector, wherein elements in the M-dimensional vector represent that a first preset type of feature word matched with the first preset rule exists in the historical text to be revised, and M is an integer greater than or equal to 1;

and training a machine learning algorithm according to the converted M-dimensional vector to obtain the intention recognition model.

3. The method of claim 2, wherein the training of the machine learning algorithm from the transformed M-dimensional vector to obtain the intent recognition model comprises:

when the feature words meeting the first preset rule exist in the historical text to be revised, converting each character in the feature words meeting the first preset rule into an N-dimensional vector by using a first preset algorithm, wherein N is an integer greater than or equal to 1;

splicing the N-dimensional vector and the M-dimensional vector of each character in the feature words according with the first preset rule into an L-dimensional vector, wherein L is N + M;

and training the machine learning algorithm according to the L-dimensional vector obtained by splicing to obtain the intention recognition model.

4. The method of claim 1, further comprising:

acquiring a plurality of entity extraction training samples, wherein each entity extraction training sample comprises a historical sub-text to be revised and an entity in the historical sub-text to be revised;

when a feature word meeting a second preset rule exists in the historical sub-text to be revised, converting each character in the feature word meeting the second preset rule into an M′-dimensional vector, wherein elements in the M′-dimensional vector represent that a feature word of a second preset type matching the second preset rule exists in the historical sub-text to be revised, and M′ is an integer greater than or equal to 1;

and training a machine learning algorithm according to the converted M′-dimensional vector to obtain the entity extraction model.

5. The method of claim 4, wherein the training of the machine learning algorithm according to the converted M′-dimensional vector to obtain the entity extraction model comprises:

when the feature word meeting the second preset rule exists in the historical sub-text to be revised, converting each character in the feature word meeting the second preset rule into an N′-dimensional vector by using a second preset algorithm, wherein N′ is an integer greater than or equal to 1;

splicing the N′-dimensional vector and the M′-dimensional vector of each character in the feature word meeting the second preset rule into an L′-dimensional vector, wherein L′ = N′ + M′;

and training the machine learning algorithm according to the L′-dimensional vector obtained by splicing to obtain the entity extraction model.

6. The method according to claim 4, wherein the obtaining a standard entity associated with the intention of the sub-text to be revised from a background database, and revising the entity in the extracted sub-text to be revised by using the standard entity comprises:

dividing entities belonging to the same category in the sub-text to be revised into a group;

filling each group of divided entities into each row of an entity revision table for entity revision;

acquiring the standard entities from the background database, and judging whether each row of entities in the filled entity revision table is consistent with the standard entities of the corresponding category;

and if not, revising each row of entities in the entity revision table after filling by using the standard entities of the corresponding category.

7. The method of claim 6, wherein the populating each group of divided entities into each row of an entity revision table for entity revision comprises:

and filling each group of divided entities, the intention of the sub-text to be revised of each group of entities and the user information corresponding to the sub-text to be revised of each group of entities into each row of the entity revision table.

8. The method of claim 7, further comprising:

acquiring a target text input by a target user;

inputting the target text into the intention recognition model and the entity extraction model, so that the intention recognition model recognizes the intention of the target text, and the entity extraction model extracts the entity in the target text;

and screening out user information matched with the intention and the entity of the target text from the revised entity revision table, and recommending the screened user information to the target user.

9. An entity revision apparatus, comprising:

the first acquisition module is used for acquiring a text to be revised;

the first input module is used for inputting the text to be revised into an intention recognition model so that the intention recognition model recognizes the intention of the text to be revised and marks a text range corresponding to the recognized intention in the text to be revised;

the splitting module is used for splitting the text to be revised into a plurality of sub-texts to be revised according to the marked text range when the identified intention is not unique, wherein each sub-text to be revised uniquely corresponds to one intention;

the second input module is used for inputting the sub-text to be revised into an entity extraction model so as to enable the entity extraction model to extract the entity in the sub-text to be revised;

and the revision module is used for acquiring a standard entity associated with the intention of the sub-text to be revised from a background database and revising the entity in the sub-text to be revised by using the standard entity.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 8.

Technical Field

The invention relates to the technical field of computers, in particular to an entity revision method, an entity revision device, computer equipment and a computer-readable storage medium.

Background

The existing trading market generally lacks a centralized bidding platform, so discovering counterparties and publishing trading intentions are often done through instant chat tools, for example by exchanging natural-language text messages in a shared chat room.

However, the inventor has found through research that, owing to the informality of such conversations, these natural-language texts are usually unstructured and may even contain erroneous or missing information; that is, they often contain erroneous entities or missing entities.

No effective solution has yet been proposed for the technical problem that natural-language texts in the existing trading market contain erroneous or missing entities.

Disclosure of Invention

The invention aims to provide an entity revision method, an entity revision apparatus, a computer device, and a computer-readable storage medium that can solve the technical problem that natural-language texts in the existing trading market contain erroneous or missing entities.

One aspect of the present invention provides an entity revision method, including: acquiring a text to be revised; inputting the text to be revised into an intention recognition model, so that the intention recognition model recognizes the intention of the text to be revised and marks a text range corresponding to the recognized intention in the text to be revised; when the identified intentions are not unique, splitting the text to be revised into a plurality of sub-texts to be revised according to the marked text range, wherein each sub-text to be revised uniquely corresponds to one intention; inputting the sub-text to be revised into an entity extraction model so that the entity extraction model extracts the entity in the sub-text to be revised; and acquiring a standard entity associated with the intention of the sub-text to be revised from a background database, and revising the entity in the extracted sub-text to be revised by using the standard entity.

Optionally, the method further comprises: acquiring a plurality of intention recognition training samples, wherein each intention recognition training sample comprises a history text to be revised, intentions of the history text to be revised and a text range corresponding to each intention of the history text to be revised; when a feature word meeting a first preset rule exists in the historical text to be revised, converting each character in the feature word meeting the first preset rule into an M-dimensional vector, wherein elements in the M-dimensional vector represent that a first preset type of feature word matched with the first preset rule exists in the historical text to be revised, and M is an integer greater than or equal to 1; and training a machine learning algorithm according to the converted M-dimensional vector to obtain the intention recognition model.

Optionally, the training a machine learning algorithm according to the converted M-dimensional vector to obtain the intention recognition model includes: when the feature words meeting the first preset rule exist in the historical text to be revised, converting each character in the feature words meeting the first preset rule into an N-dimensional vector by using a first preset algorithm, wherein N is an integer greater than or equal to 1; splicing the N-dimensional vector and the M-dimensional vector of each character in the feature words according with the first preset rule into an L-dimensional vector, wherein L is N + M; and training the machine learning algorithm according to the L-dimensional vector obtained by splicing to obtain the intention recognition model.

Optionally, the method further comprises: acquiring a plurality of entity extraction training samples, wherein each entity extraction training sample comprises a historical sub-text to be revised and the entities in the historical sub-text to be revised; when a feature word meeting a second preset rule exists in the historical sub-text to be revised, converting each character in the feature word meeting the second preset rule into an M′-dimensional vector, wherein elements in the M′-dimensional vector represent that a feature word of a second preset type matching the second preset rule exists in the historical sub-text to be revised, and M′ is an integer greater than or equal to 1; and training a machine learning algorithm according to the converted M′-dimensional vector to obtain the entity extraction model.

Optionally, the training of the machine learning algorithm according to the converted M′-dimensional vector to obtain the entity extraction model includes: when the feature word meeting the second preset rule exists in the historical sub-text to be revised, converting each character in the feature word meeting the second preset rule into an N′-dimensional vector by using a second preset algorithm, wherein N′ is an integer greater than or equal to 1; splicing the N′-dimensional vector and the M′-dimensional vector of each character in the feature word meeting the second preset rule into an L′-dimensional vector, wherein L′ = N′ + M′; and training the machine learning algorithm according to the L′-dimensional vector obtained by splicing to obtain the entity extraction model.

Optionally, the obtaining a standard entity associated with the intention of the sub-text to be revised from a background database, and revising the extracted entity in the sub-text to be revised by using the standard entity includes: dividing entities belonging to the same category in the sub-text to be revised into a group; filling each group of divided entities into each row of an entity revision table for entity revision; acquiring the standard entities from the background database, and judging whether each row of entities in the filled entity revision table is consistent with the standard entities of the corresponding category; and if not, revising each row of entities in the entity revision table after filling by using the standard entities of the corresponding category.

Optionally, the populating each group of divided entities into each row of an entity revision table for entity revision includes: and filling each group of divided entities, the intention of the sub-text to be revised of each group of entities and the user information corresponding to the sub-text to be revised of each group of entities into each row of the entity revision table.

Optionally, the method further comprises: acquiring a target text input by a target user; inputting the target text into the intention recognition model and the entity extraction model, so that the intention recognition model recognizes the intention of the target text, and the entity extraction model extracts the entity in the target text; and screening out user information matched with the intention and the entity of the target text from the revised entity revision table, and recommending the screened user information to the target user.

Still another aspect of the present invention provides an entity revision apparatus, including: the first acquisition module is used for acquiring a text to be revised; the first input module is used for inputting the text to be revised into an intention recognition model so that the intention recognition model recognizes the intention of the text to be revised and marks a text range corresponding to the recognized intention in the text to be revised; the splitting module is used for splitting the text to be revised into a plurality of sub-texts to be revised according to the marked text range when the identified intention is not unique, wherein each sub-text to be revised uniquely corresponds to one intention; the second input module is used for inputting the sub-text to be revised into an entity extraction model so as to enable the entity extraction model to extract the entity in the sub-text to be revised; and the revision module is used for acquiring a standard entity associated with the intention of the sub-text to be revised from a background database and revising the entity in the sub-text to be revised by using the standard entity.

Yet another aspect of the present invention provides a computer device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the entity revision method of any of the above embodiments when executing the computer program.

Yet another aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the entity revision method described in any of the above embodiments. Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like, and the storage data area may store data created according to the use of a blockchain node, and the like.

In the entity revision method provided by the invention, for a text to be revised, the intention of the text is recognized by an intention recognition model, and the text range corresponding to each intention is marked in the text to be revised. Because the entities corresponding to different intentions differ, and the standard entities used for entity revision also differ to some extent, when the text to be revised carries multiple intentions it is split into multiple sub-texts to be revised according to the text range marked for each intention by the intention recognition model, so that each sub-text to be revised carries only one intention. Further, for any sub-text to be revised, the entities contained in it are extracted by an entity extraction model, and the extracted entities are then revised by using the standard entities associated with the intention of that sub-text. In this way the entities in the text are revised, which solves the technical problem that natural-language texts in the existing trading market contain erroneous or missing entities.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 schematically illustrates a flow diagram of an entity revision method according to an embodiment of the present invention;

FIG. 2 schematically illustrates a schematic diagram of an entity revision scheme according to an embodiment of the present invention;

FIG. 3 schematically shows a schematic diagram of a model training process according to an embodiment of the invention;

FIG. 4 schematically shows a block diagram of an entity revision apparatus according to an embodiment of the present invention;

FIG. 5 schematically illustrates a block diagram of a computer device suitable for implementing an entity revision method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Fig. 1 schematically shows a flowchart of an entity revision method according to an embodiment of the present invention, which may include steps S1 through S5, as shown in fig. 1, wherein:

and step S1, acquiring the text to be revised.

Taking a securities trading scenario as an example, a plurality of users publish securities information on a network platform, forming a dialog text. The dialog text is divided by user and time into a plurality of sub-texts, each of which is called a text to be revised. In this embodiment, for each user, the messages published by that user in one turn are taken as one text to be revised; that is, all text from the start position of that user's turn up to the start of the next user's turn is taken as that user's text to be revised.

Acquiring a text to be revised from the dialog text may specifically include: identifying all start position identifiers in the dialog text and the end position identifier corresponding to each start position identifier; and taking the text located between each start position identifier and its corresponding end position identifier as one text to be revised. The dialog text may also include macro-economic data.

For example, dialog text includes:

user A2019/3/2/11: 14:15

High performance price ratio non-public sale

03889Y125610.5H bead administration 01AA +/AA +6000W assessment 7.345

0.9342+2Y145678.5H Xinyang 02 AA/03000W valuation 5.5645

User B 2019/3/2 11:16:51

Interested in the second comparison

Namely 0.9342+2Y145678.5H Xingyang 02 AA/03000W valuation 5.5645

Pre-sale

1.1479+2Y 031772042.1817 south charging airport PPN001 AA/05000W valuation 6.6045

Then user A corresponds to one text to be revised, namely [ cost performance non-public sale 03889Y125610.5H bead-casting 01AA +/AA +6000W valuation 7.345 0.9342+2Y145678.5H Xingyang 02 AA/03000W valuation 5.5645 ]; and user B corresponds to one text to be revised, namely [ interested in the second one, i.e. 0.9342+2Y145678.5H Xingyang 02 AA/03000W valuation 5.5645, pre-sale 1.1479+2Y 031772042.1817 south charging airport PPN001 AA/05000W valuation 6.6045 ]. Taking the text to be revised corresponding to user A as an example, a start position identifier exists at "cost performance non-public sale", an end position identifier exists at "0.9342+2Y145678.5H Xingyang 02 AA/03000W valuation 5.5645", the text between the start position identifier and the end position identifier is the text to be revised, and the user information of user A can be read from "User A 2019/3/2 11:14:15" associated with that text to be revised.
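For illustration only, the extraction of texts to be revised from such a dialog text can be sketched in Python as follows; the turn-header pattern used as the start position identifier and the function name split_dialog are assumptions made for this example and are not prescribed by the patent:

    import re

    # A minimal sketch of step S1, assuming each user's turn begins with a header
    # line of the form "<user> <date> <time>" that serves as the start position
    # identifier; the end of a turn is the start of the next turn (or end of text).
    TURN_HEADER = re.compile(
        r"^(?P<user>.+?)\s+\d{4}/\d{1,2}/\d{1,2}\s+\d{1,2}:\d{2}:\d{2}\s*$"
    )

    def split_dialog(dialog_text: str):
        """Split a dialog text into (user_info, text_to_be_revised) pairs."""
        turns, current_user, current_lines = [], None, []
        for line in dialog_text.splitlines():
            if TURN_HEADER.match(line.strip()):
                # start position identifier: close the previous turn
                if current_user is not None:
                    turns.append((current_user, " ".join(current_lines).strip()))
                current_user, current_lines = line.strip(), []
            elif current_user is not None:
                current_lines.append(line.strip())
        if current_user is not None:
            turns.append((current_user, " ".join(current_lines).strip()))
        return turns

In this sketch each returned pair carries the header line as the user information and the concatenated message lines as that user's text to be revised.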

Step S2, inputting the text to be revised into an intention recognition model, so that the intention recognition model recognizes the intention of the text to be revised and marks the text range corresponding to the recognized intention in the text to be revised.

In this embodiment, the intention recognition model not only recognizes the intention but also marks the text range corresponding to the intention, for example by bolding, changing the font, or underlining.

For example, continuing the above example, for the text to be revised corresponding to user A the intention is selling, and the text range is the whole text to be revised; for the text to be revised corresponding to user B the intentions are buying and selling, the text range corresponding to the buying intention is [ interested in the second one, i.e. 0.9342+2Y145678.5H Xinyang 02 AA/03000W valuation 5.5645 ], and the text range corresponding to the selling intention is [ pre-sale 1.1479+2Y 031772042.1817 south charging airport PPN001 AA/05000W valuation 6.6045 ].

Optionally, before step S2 is executed, a machine learning algorithm needs to be trained on a training set to obtain the intention recognition model. In the prior art, during the training of any model the machine learns the rules of the model automatically from the sample data. However, the inventor has found through research that if the machine learns those rules only from sample data, a large amount of sample data is required, and every sample must be labeled manually, which is time-consuming and labor-intensive; whereas if only a small amount of sample data is used for training, overfitting occurs easily and the generalization performance of the model is very poor. On this basis, the inventor proposes to mark the rules of the sample data for the machine in advance by using expert rules, which saves the time the machine would spend learning the rules by itself, allows the machine to complete self-learning with a smaller amount of sample data, and thus reduces the amount of sample data required. Specifically, the method may further include steps A1 to A3, wherein:

step A1, obtaining a plurality of intention recognition training samples, wherein each intention recognition training sample comprises a history text to be revised, intentions of the history text to be revised and a text range corresponding to each intention of the history text to be revised;

step A2, when a feature word meeting a first preset rule exists in the historical text to be revised, converting each word in the feature word meeting the first preset rule into an M-dimensional vector, wherein elements in the M-dimensional vector represent that a first preset type of feature word matched with the first preset rule exists in the historical text to be revised, and M is an integer greater than or equal to 1;

and A3, training a machine learning algorithm according to the converted M-dimensional vector to obtain the intention recognition model.

In this embodiment, the machine learning algorithm is trained with the historical text to be revised as the input and with the intentions of the historical text to be revised and the text range corresponding to each intention as the output. During training, when a feature word meeting the first preset rule exists in the historical text to be revised, each character in that feature word is converted into an M-dimensional vector, and the converted M-dimensional vectors are then used to train the machine learning algorithm. It should be noted that when the machine learns rules from these intention recognition training samples, the vectors tell the machine in advance which characters combine to form which type of feature word, which is equivalent to telling the machine in advance the rules that the feature words in the training set obey; this greatly reduces the machine's self-learning time, speeds up learning, and reduces the required number of samples by an order of magnitude. The machine learning algorithm may include a Support Vector Machine (SVM), a Convolutional Neural Network (CNN), a Long Short-Term Memory network (LSTM), and the like.

For example, a certain historical text to be revised is "0.1726+2Y 145066.SH 16 quanfeng 01 AA/AA 8000w valuation 5.469 7.5", wherein 0.1726+2Y is the time to maturity (2.1726 years from now), 145066.SH is the bond code, 16 quanfeng 01 is the bond abbreviation, AA/AA are the corresponding internal and external ratings, 8000w indicates a transaction amount of eighty million, 5.469 is the ChinaBond valuation price, 7.5 is the quoted price, and the price is negotiable. The feature words of the preset types are 0.1726+2Y, 145066.SH, 16 quanfeng 01, AA/AA, 8000w, 5.469 and 7.5, respectively. Taking the first preset type "transaction amount" as an example, the corresponding feature word is "8000w", the corresponding first preset rule may be "the mantissa is w and the number is greater than 100", and "1" may be used to indicate that a feature word of the transaction-amount type exists in the historical text; then the M-dimensional vector corresponding to "8" in "8000w" may be [1], the M-dimensional vector corresponding to the first "0" may be [1], that of the second "0" may be [1], that of the third "0" may be [1], and that of "w" may be [1]. When the machine learns this part, since it recognizes that the last element of the vector corresponding to "8" is 1, it knows that a transaction-amount feature word may exist in the historical text; it then keeps recognizing all vectors whose last element is 1, and by splicing these vectors in order it learns the transaction-amount feature word "8000w".
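For illustration only, the expert-rule encoding described above can be sketched as follows; the regular expression standing in for the first preset rule and the choice of M = 1 are assumptions made for the example:

    import re

    # A minimal sketch of the expert-rule feature encoding: each character of a
    # feature word that matches the preset rule is mapped to an M-dimensional
    # vector whose element flags the matched preset type. The rule below (three
    # or more digits followed by "w") approximates "mantissa is w and the number
    # is greater than 100" and is an illustrative assumption.
    AMOUNT_RULE = re.compile(r"\d{3,}w")

    def rule_vectors(text: str, m: int = 1):
        """Return one M-dimensional rule vector per character of `text`."""
        vectors = [[0] * m for _ in text]
        for match in AMOUNT_RULE.finditer(text):
            for i in range(match.start(), match.end()):
                vectors[i][0] = 1  # marks: transaction-amount feature word present
        return vectors

    # e.g. rule_vectors("AA/AA 8000w valuation 5.469") marks every character of
    # the span "8000w" with the vector [1] and leaves all other characters at [0].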

Optionally, the step A3 may further include a step a31 to a step a33, where:

step A31, when the feature words meeting the first preset rule exist in the historical text to be revised, converting each character in the feature words meeting the first preset rule into an N-dimensional vector by using a first preset algorithm, wherein N is an integer greater than or equal to 1;

step a32, splicing an N-dimensional vector and an M-dimensional vector of each word in the feature words meeting the first preset rule into an L-dimensional vector, wherein L is N + M;

and A33, training the machine learning algorithm according to the L-dimensional vectors obtained by splicing to obtain the intention recognition model.

The first preset algorithm may be, for example, the BERT algorithm.

For example, converting "8" in "8000 w" into an N-dimensional vector using bert may be [0101110], the first "0" may correspond to [0011000], the second "0" may correspond to [0011000], the third "0" may correspond to [0011000], and the "w" may correspond to [1100010 ]. Splicing an N-dimensional vector and an M-dimensional vector of 8 in 8000w into an L-dimensional vector [01011101], splicing an N-dimensional vector and an M-dimensional vector of a first 0 into an L-dimensional vector [00110001], splicing an N-dimensional vector and an M-dimensional vector of a second 0 into an L-dimensional vector [00110001], splicing an N-dimensional vector and an M-dimensional vector of a third 0 into an L-dimensional vector [00110001], and splicing an N-dimensional bit vector and an M-dimensional vector of w into an L-dimensional vector [1100011 ]. When the machine learns the part, because the last bit in the L-dimensional vector corresponding to the '8' is recognized to be 1, the machine knows that the characteristic word of the 'transaction amount' may exist in the history text to be revised, the machine continuously recognizes all the L-dimensional vectors of which the last bit is 1, and the machine can know the characteristic word of the 'transaction amount' 8000w by splicing the L-dimensional vectors in sequence.

Preferably, to further ensure the privacy and security of the intention recognition model, the intention recognition model may also be stored in a node of a blockchain.

Step S3, when the identified intention is not unique, splitting the text to be revised into a plurality of sub-texts to be revised according to the labeled text range, wherein each sub-text to be revised uniquely corresponds to an intention.

In this embodiment, when a text to be revised expresses more than one intention, it needs to be split by intention so that each resulting sub-text to be revised expresses only one intention. Specifically, since the text range corresponding to each intention has already been marked in step S2, the text to be revised can be split according to these text ranges; the number of sub-texts to be revised obtained by splitting equals the number of intentions expressed by the text to be revised.
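For illustration only, the splitting of step S3 can be sketched as follows; the representation of the marked text ranges as (intention, start, end) spans is an assumption made for the example:

    # A minimal sketch of step S3, assuming the intention recognition model
    # returns its annotations as (intention, start, end) character spans over
    # the text to be revised.
    def split_by_intentions(text: str, annotations: list):
        """Return one (intention, sub_text) pair per marked text range."""
        if len(annotations) <= 1:
            # unique (or no) recognized intention: the text is not split
            return [(intention, text) for intention, _, _ in annotations]
        return [(intention, text[start:end]) for intention, start, end in annotations]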

Step S4, inputting the sub-text to be revised into an entity extraction model, so that the entity extraction model extracts the entity in the sub-text to be revised.

In this embodiment, the entity extraction model has a function of extracting an entity. For example, the entities are: security name, security number, and security valuation, among others.

Optionally, the entity extraction model may also convert the extracted entities into entities in a canonical format. For example, the entity extraction model includes two modules: the first module extracts the entities in the sub-text to be revised, and the second module converts the extracted entities into a canonical format. For entities such as security name, security number, and security valuation, where the canonical unit of a security valuation is w, the entity converted into canonical format for a security valuation of 1.43 million is: security valuation 143w. It should be noted that the training set corresponding to the first module includes a plurality of entity extraction training samples, each of which takes a historical sub-text to be revised as the input and the entities in that historical sub-text as the output; the training set corresponding to the second module includes a plurality of standard-entity training samples, each of which takes the entities of a historical sub-text to be revised output by the first module as the input and the entities in canonical format as the output. It should be understood that some of the entities input to the second module may already be in canonical format and some may not, while all of the entities it outputs are in canonical format. For example, if the input entities are a security valuation of 1.43 million and a security quantity of 1 million, the output entities in canonical format are a security valuation of 143w and a security quantity of 100w.
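For illustration only, the canonical-format conversion performed by the second module can be sketched as follows; the unit table and the regular expression are assumptions made for the example, and in the patent this conversion is learned from standard-entity training samples rather than hard-coded:

    import re

    # A minimal sketch of normalization, assuming the canonical unit "w" stands
    # for ten thousand, as in the examples above.
    UNIT_FACTORS = {"million": 1_000_000, "万": 10_000, "w": 10_000}

    def to_canonical(value_text: str) -> str:
        """Convert an amount like '1.43 million' into the canonical 'w' unit."""
        match = re.match(r"([\d.]+)\s*(million|万|w)", value_text, re.IGNORECASE)
        if not match:
            return value_text  # already canonical or unrecognized
        amount = float(match.group(1)) * UNIT_FACTORS[match.group(2).lower()]
        return f"{amount / UNIT_FACTORS['w']:g}w"

    # e.g. to_canonical("1.43 million") -> "143w", to_canonical("100w") -> "100w"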

Optionally, before step S4 is executed, a machine learning algorithm needs to be trained on a training set to obtain the entity extraction model. As noted above for the intention recognition model, if the machine learns the rules only from sample data, a large amount of manually labeled sample data is required, which is time-consuming and labor-intensive, while training on only a small amount of sample data easily leads to overfitting and poor generalization. Therefore, here too the rules of the sample data are marked for the machine in advance by using expert rules, which saves the machine's self-learning time, allows self-learning to be completed with a smaller amount of sample data, and reduces the amount of sample data required. Specifically, the method further includes steps B1 to B3, wherein:

step B1, obtaining a plurality of entity extraction training samples, wherein each entity extraction training sample comprises a historical sub-text to be revised and an entity in the historical sub-text to be revised;

step B2, when a feature word meeting a second preset rule exists in the historical sub-text to be revised, converting each word in the feature word meeting the second preset rule into an M ' dimensional vector, wherein elements in the M ' dimensional vector represent that a second preset type of feature word matched with the second preset rule exists in the historical sub-text to be revised, and M ' is an integer greater than or equal to 1;

and step B3, training a machine learning algorithm according to the converted M′-dimensional vector to obtain the entity extraction model.

In this embodiment, the machine learning algorithm is trained with the historical sub-text to be revised as the input and with the entities in the historical sub-text to be revised as the output. During training, when a feature word meeting the second preset rule exists in the historical sub-text to be revised, each character in that feature word is converted into an M′-dimensional vector, and the converted M′-dimensional vectors are then used to train the machine learning algorithm. It should be noted that, as with the intention recognition model, these vectors tell the machine in advance which characters combine to form which type of feature word, which is equivalent to telling the machine in advance the rules that the feature words in the entity extraction training set obey; this greatly reduces the machine's self-learning time, speeds up learning, and reduces the required number of samples by an order of magnitude.

For example, a certain historical sub-text to be revised may again be "0.1726+2Y 145066.SH 16 quanfeng 01 AA/AA 8000w valuation 5.469 7.5" described above, wherein 0.1726+2Y is the time to maturity, 145066.SH is the bond code, 16 quanfeng 01 is the bond abbreviation, AA/AA are the internal and external ratings, 8000w indicates a transaction amount of eighty million, 5.469 is the ChinaBond valuation price, and 7.5 is the quoted price. The feature words of the second preset types are 0.1726+2Y, 145066.SH, 16 quanfeng 01, AA/AA, 8000w, 5.469 and 7.5, respectively. Taking the second preset type "transaction amount" as an example, the corresponding feature word is "8000w", the corresponding second preset rule may be "the mantissa is w and the number is greater than 100", and "1" may be used to indicate that a feature word of the transaction-amount type exists in the historical sub-text; then each character of "8000w" ("8", the three "0"s and "w") may be converted into the M′-dimensional vector [1]. When the machine learns this part, since it recognizes that the last element of the vector corresponding to "8" is 1, it knows that a transaction-amount feature word may exist in the historical sub-text; it then keeps recognizing all vectors whose last element is 1, and by splicing these vectors in order it learns the transaction-amount feature word "8000w".

Optionally, the step B3 may further include a step B31 to a step B33, where:

step B31, when the feature words meeting the second preset rule exist in the historical text to be revised, converting each character in the feature words meeting the second preset rule into an N '-dimensional vector by using a second preset algorithm, wherein N' is an integer greater than or equal to 1;

step B32, splicing the N '-dimensional vector and the M' -dimensional vector of each word in the feature words that meet the second preset rule into an L '-dimensional vector, wherein L' ═ N '+ M';

and B33, training the machine learning algorithm according to the L′-dimensional vector obtained by splicing to obtain the entity extraction model.

The second preset algorithm may be, for example, the BERT algorithm. The terms "first" and "second" in this embodiment are used only to distinguish terms and do not indicate an order of steps; for example, the first preset algorithm and the second preset algorithm may both refer to the same algorithm.

For example, converting "8" in "8000 w" into an N' dimensional vector using bert may be [0101110], the first "0" may correspond to [0011000], the second "0" may correspond to [0011000], the third "0" may correspond to [0011000], and the "w" may correspond to [1100010 ]. Splicing an N ' dimension vector and an M ' dimension vector of 8 in 8000w into an L ' dimension vector [01011101], splicing an N ' dimension vector and an M ' dimension vector of a first 0 into an L ' dimension vector [00110001], splicing an N ' dimension vector and an M ' dimension vector of a second 0 into an L ' dimension vector [00110001], splicing an N ' dimension vector and an M ' dimension vector of a third 0 into an L ' dimension vector [00110001], and splicing an N ' dimension bit vector and an M ' dimension vector of w into an L ' dimension vector [1100011 ]. When the machine learns the part, because the last bit in the L ' dimensional vector corresponding to the ' 8 ' is recognized to be 1, the machine knows that the child text to be revised in the history possibly has the characteristic word of the ' transaction amount ', the machine continuously recognizes all the L ' dimensional vectors of which the last bit is 1, and the machine can know the characteristic word of the ' transaction amount ' 8000w by splicing the L ' dimensional vectors in sequence.

Preferably, to further ensure the privacy and security of the entity extraction model, the entity extraction model may also be stored in a node of a blockchain.

Step S5, obtaining standard entities associated with the intention of the sub-text to be revised from a background database, and revising the extracted entities of the sub-text to be revised by using the standard entities.

In this embodiment, revision includes error correction and completion, such as correcting erroneous entities and supplementing necessary information that was not mentioned. Standard entities may be obtained according to the intention, and the entities in the sub-text to be revised are revised on the basis of the obtained standard entities, where a standard entity is the standard, correct form of an entity.

Optionally, for convenience of management, the extracted entities may be further populated into a data table, and step S5 may further include step S51 to step S54, where:

step S51, dividing the entities belonging to the same category in the sub-text to be revised into a group;

step S52, filling each group of divided entities into each row of an entity revision table for entity revision;

step S53, obtaining the standard entity from the background database, and judging whether each row of entities in the filled entity revision table is consistent with the standard entity of the corresponding category;

in step S54, if not, each row of entities in the populated entity revision table is revised by using the standard entities of the corresponding category.

In this embodiment, entities of the same category belong to the same information chain. For example, for the sub-text to be revised [ cost performance non-public sale 03889Y125610.5H bead-casting 01AA +/AA +6000W valuation 7.345 0.9342+2Y145678.5H Xingyang 02 AA/03000W valuation 5.5645 ], one information chain is [ 03889Y125610.5H bead-casting 01AA +/AA +6000W valuation 7.345 ] and the other information chain is [ 0.9342+2Y145678.5H Xingyang 02 AA/03000W valuation 5.5645 ]. The same information chain contains entities of multiple categories; for example, the information chain "03889Y, 125610.5H, bead-casting, 01, AA+/AA+, 6000W, valuation, and 7.345" contains entities of 8 categories.

The entity revision table is a data table comprising a plurality of rows and a plurality of columns, where each column corresponds to one entity. The first row of the entity revision table holds the fixed entity names, and each remaining row is filled with one divided group of extracted entity values by matching the entity names. Further, the standard entities of each category are matched against the entities in each row of the filled entity revision table, and whenever a mismatched entity is found it is revised by using the corresponding standard entity.

For example, for a given security the security code is fixed, and this fixed, correct security code can be called a standard entity. By comparing the standard security code with the security code in the entity revision table, if they are inconsistent, or the security code is missing from the entity revision table, the standard security code is used to replace or supplement the corresponding position.
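For illustration only, the comparison and revision of steps S51 to S54 can be sketched as follows; the dict-based row representation, the category names, and the lookup of standard entities by bond code are assumptions made for the example:

    # A minimal sketch of steps S51-S54, assuming each row of the entity revision
    # table and each standard-entity record is a dict keyed by entity category.
    def revise_row(row: dict, standard_entities: dict) -> dict:
        """Correct or complete one row of the entity revision table in place."""
        standard = standard_entities.get(row.get("bond_code"))  # from the background database
        if standard is None:
            return row
        for category, standard_value in standard.items():
            value = row.get(category)
            if value is None or value != standard_value:  # missing or erroneous entity
                row[category] = standard_value            # replace or supplement it
        return row

    # e.g. revise_row({"bond_code": "145066.SH", "rating": None},
    #                 {"145066.SH": {"rating": "AA/AA", "abbreviation": "16 quanfeng 01"}})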

Optionally, step S52 may further include: filling each divided group of entities, the intention of the sub-text to be revised to which each group of entities belongs, and the user information corresponding to that sub-text to be revised into each row of the entity revision table. The method may further include steps C1 to C3, wherein: step C1, acquiring a target text input by a target user; step C2, inputting the target text into the intention recognition model and the entity extraction model, so that the intention recognition model recognizes the intention of the target text and the entity extraction model extracts the entities in the target text; and step C3, screening out, from the revised entity revision table, the user information matching the intention and the entities of the target text, and recommending the screened-out user information to the target user.

In this embodiment, each row of the entity revision table may be filled with the intention of the sub-text to be revised to which that group of entities belongs and with the user information corresponding to that sub-text. The specific implementations of steps C1 and C2 are similar to steps S1, S2, and S4 and are not repeated here. For step C3, by matching the intention and entities of the target text against the intention and entity items in the revised entity revision table, the one or more best-matching rows of data can be screened out of the revised entity revision table; the user information recorded in each screened-out row can then be extracted and recommended to the target user.

For example, if the recognized intention is selling, one or more rows of data whose intention is buying and whose entities match the entities of the target text can be screened out of the revised entity revision table, and the user information of each screened-out row is then extracted and recommended to the target user.

For another example, if the recognized intention is selling, one or more rows of data whose intention is also selling and whose entities match the entities of the target text can be screened out of the revised entity revision table, and the user information of each screened-out row is then extracted and recommended to the target user.
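For illustration only, the screening of step C3 can be sketched as follows; the field names and the default choice of matching opposite-side intentions are assumptions made for the example, since the patent describes both opposite-side and same-side matching:

    # A minimal sketch of step C3, assuming each revised row carries "intention",
    # "entities" (a dict by category), and "user_info" fields.
    def recommend_users(target_intention: str, target_entities: dict,
                        revised_rows: list, match_opposite: bool = True):
        opposite = {"buy": "sell", "sell": "buy"}
        wanted = opposite.get(target_intention, target_intention) if match_opposite \
            else target_intention
        matches = []
        for row in revised_rows:
            if row["intention"] != wanted:
                continue
            # a row matches when every entity it shares with the target text agrees
            shared = set(row["entities"]) & set(target_entities)
            if shared and all(row["entities"][c] == target_entities[c] for c in shared):
                matches.append(row["user_info"])
        return matches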

In the entity revision method provided by the invention, for a text to be revised, the intention of the text is recognized by an intention recognition model, and the text range corresponding to each intention is marked in the text to be revised. Because the entities corresponding to different intentions differ, and the standard entities used for entity revision also differ to some extent, when the text to be revised carries multiple intentions it is split into multiple sub-texts to be revised according to the text range marked for each intention by the intention recognition model, so that each sub-text to be revised carries only one intention. Further, for any sub-text to be revised, the entities contained in it are extracted by an entity extraction model, and the extracted entities are then revised by using the standard entities associated with the intention of that sub-text. In this way the entities in the text are revised, which solves the technical problem that natural-language texts in the existing trading market contain erroneous or missing entities.

FIG. 2 schematically shows a schematic diagram of an entity revision scheme according to an embodiment of the present invention.

As shown in FIG. 2, taking a securities trading scenario as an example, the text on the left is the dialog text, and the information published by any user in one turn is called a text to be revised. For any text to be revised, the domain to which it belongs (i.e. the topic classification of the text), such as the securities domain, can first be determined by a classification model. Then the text to be revised is input into the intention recognition model corresponding to that domain (i.e. trading intention analysis); when multiple intentions exist, the text to be revised is further split into a plurality of sub-texts to be revised (i.e. information text segments); entities are then extracted by the entity extraction model (i.e. meta-information extraction), and each extracted group of entities is filled into a row of the entity revision table, where each row comprises a plurality of information slots, such as information slot 1, information slot 2, ... and information slot k in the figure. The corresponding standard entities are then acquired from the background database, the entities are revised (i.e. security information matching and information correction and completion), and the revised entity revision table is stored in the trading intention database. Further, after a trader publishes pre-trading information, matching counterparty information can be screened out of the revised entity revision table and recommended to the trader.

The training processes of the intention recognition model and the entity extraction model are shown in FIG. 3, which schematically illustrates a model training process according to an embodiment of the invention. For the input data in FIG. 3, in the process of training the intention recognition model, each character of a feature word meeting the first preset rule may be converted into an N-dimensional vector by BERT, and each character of the feature word meeting the first preset rule may be converted into an M-dimensional vector by using expert rules; feature fusion is then performed to form an L-dimensional vector, and a machine learning algorithm such as a neural-network intention classifier is trained to obtain the intention recognition model, which can then perform the intention recognition function. In the process of training the entity extraction model, each character of a feature word meeting the second preset rule may be converted into an N′-dimensional vector by BERT, and each character of the feature word meeting the second preset rule may be converted into an M′-dimensional vector by using expert rules; feature fusion is then performed to form an L′-dimensional vector, and a machine learning algorithm such as an NER (Named Entity Recognition) entity extraction model is trained to obtain the entity extraction model, which can then perform the entity extraction function (i.e. keyword extraction).
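For illustration only, the final classification stage of the training pipeline in FIG. 3 can be sketched as follows; the use of scikit-learn's MLPClassifier, the mean pooling of character vectors, and the label set are assumptions made for the example rather than the architecture prescribed by the patent:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Fused L-dimensional character vectors per training text (see the
    # fuse_features sketch above) are mean-pooled into one vector per text
    # before classification, which is an illustrative simplification.
    def pool(fused_char_vectors: np.ndarray) -> np.ndarray:
        return fused_char_vectors.mean(axis=0)

    def train_intent_classifier(texts_fused: list, intents: list) -> MLPClassifier:
        X = np.stack([pool(v) for v in texts_fused])   # (num_texts, L)
        clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
        clf.fit(X, intents)                            # intents e.g. ["sell", "buy", ...]
        return clf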

An embodiment of the present invention provides an entity revision apparatus that corresponds to the entity revision method described in the above embodiments; corresponding technical features and technical effects are not detailed again in this embodiment, and reference may be made to the above embodiments where relevant. Specifically, FIG. 4 schematically shows a block diagram of an entity revision apparatus according to an embodiment of the present invention. As shown in FIG. 4, the entity revision apparatus 400 may include a first acquisition module 401, a first input module 402, a splitting module 403, a second input module 404, and a revision module 405, wherein:

a first obtaining module 401, configured to obtain a text to be revised;

a first input module 402, configured to input the text to be revised into an intention recognition model, so that the intention recognition model recognizes an intention of the text to be revised and marks a text range corresponding to the recognized intention in the text to be revised;

a splitting module 403, configured to split the text to be revised into multiple sub-texts to be revised according to a labeled text range when the identified intention is not unique, where each sub-text to be revised uniquely corresponds to one intention;

a second input module 404, configured to input the sub-text to be revised into an entity extraction model, so that the entity extraction model extracts an entity in the sub-text to be revised;

a revising module 405, configured to obtain a standard entity associated with the intention of the sub-text to be revised from a background database, and revise the entity in the extracted sub-text to be revised by using the standard entity.

Optionally, the apparatus may further include: the second acquisition module is used for acquiring a plurality of intention recognition training samples, wherein each intention recognition training sample comprises a history text to be revised, intentions of the history text to be revised and a text range corresponding to each intention of the history text to be revised; the first conversion module is used for converting each character in the feature words meeting a first preset rule into an M-dimensional vector when the feature words meeting the first preset rule exist in the historical text to be revised, wherein elements in the M-dimensional vector represent that a first preset type of feature words matched with the first preset rule exists in the historical text to be revised, and M is an integer greater than or equal to 1; and the first training module is used for training a machine learning algorithm according to the converted M-dimensional vector to obtain the intention recognition model.

Optionally, the first training module may be further configured to: when the feature words meeting the first preset rule exist in the historical text to be revised, converting each character in the feature words meeting the first preset rule into an N-dimensional vector by using a first preset algorithm, wherein N is an integer greater than or equal to 1; splicing the N-dimensional vector and the M-dimensional vector of each character in the feature words according with the first preset rule into an L-dimensional vector, wherein L is N + M; and training the machine learning algorithm according to the L-dimensional vector obtained by splicing to obtain the intention recognition model.

Optionally, the apparatus may further include: a third obtaining module, configured to obtain a plurality of entity extraction training samples, where each entity extraction training sample includes a historical sub-text to be revised and the entities in the historical sub-text to be revised; a second conversion module, configured to, when feature words meeting a second preset rule exist in the historical sub-text to be revised, convert each character in the feature words meeting the second preset rule into an M'-dimensional vector, where elements in the M'-dimensional vector indicate that a second preset type of feature word matching the second preset rule exists in the historical sub-text to be revised, and M' is an integer greater than or equal to 1; and a second training module, configured to train a machine learning algorithm on the converted M'-dimensional vectors to obtain the entity extraction model.

Optionally, the second training module may be further configured to: when the feature words meeting the second preset rule exist in the historical sub-text to be revised, convert each character in the feature words meeting the second preset rule into an N'-dimensional vector by using a second preset algorithm, where N' is an integer greater than or equal to 1; concatenate the N'-dimensional vector and the M'-dimensional vector of each character in the feature words meeting the second preset rule into an L'-dimensional vector, where L' = N' + M'; and train the machine learning algorithm on the concatenated L'-dimensional vectors to obtain the entity extraction model.
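As a hedged sketch of the entity extraction side, a per-character tagger over the L'-dimensional fused vectors can emit BIO labels from which entities are read off; the tag set and BiLSTM architecture below are assumptions for illustration, not the specific NER model of this disclosure.

```python
# Sketch: per-character BIO tagger over fused L'-dimensional vectors.
import torch
import torch.nn as nn

TAGS = ["O", "B-ENT", "I-ENT"]  # hypothetical tag set

class CharTagger(nn.Module):
    def __init__(self, l_dim: int, num_tags: int = len(TAGS)):
        super().__init__()
        self.lstm = nn.LSTM(l_dim, 128, batch_first=True, bidirectional=True)
        self.out = nn.Linear(256, num_tags)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, seq_len, L') -> per-character tag scores.
        hidden, _ = self.lstm(fused)
        return self.out(hidden)

def decode_entities(text: str, tag_ids: list[int]) -> list[str]:
    """Read entity strings off a BIO tag sequence."""
    entities, current = [], ""
    for ch, tid in zip(text, tag_ids):
        tag = TAGS[tid]
        if tag == "B-ENT":
            if current:
                entities.append(current)
            current = ch
        elif tag == "I-ENT" and current:
            current += ch
        else:
            if current:
                entities.append(current)
            current = ""
    if current:
        entities.append(current)
    return entities
```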

Optionally, the revision module may be further configured to: divide the entities belonging to the same category in the sub-text to be revised into a group; fill each group of divided entities into a respective row of an entity revision table for entity revision; obtain the standard entities from the background database and judge whether the entities in each row of the filled entity revision table are consistent with the standard entities of the corresponding category; and if not, revise the entities in that row of the filled entity revision table by using the standard entities of the corresponding category.

Optionally, when performing the step of filling each group of divided entities into a respective row of the entity revision table for entity revision, the revision module may be further configured to: fill each group of divided entities, together with the intention of the sub-text to be revised from which the group was extracted and the user information corresponding to that sub-text to be revised, into the corresponding row of the entity revision table. A hedged sketch of this revision-table logic is given below.
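The following sketch illustrates, under assumed field names and an assumed exact-match comparison, how rows of the entity revision table could be built and revised against the standard entities fetched for the corresponding intent; it is not a definitive implementation.

```python
# Sketch: group entities by category into table rows (with intent and user info),
# compare each row against the standard entity of its category, and revise.
from dataclasses import dataclass, field

@dataclass
class RevisionRow:
    category: str
    entities: list[str]
    intent: str
    user_info: str
    revised: list[str] = field(default_factory=list)

def build_rows(entities_by_category: dict[str, list[str]],
               intent: str, user_info: str) -> list[RevisionRow]:
    return [RevisionRow(cat, ents, intent, user_info)
            for cat, ents in entities_by_category.items()]

def revise_rows(rows: list[RevisionRow],
                standard_entities: dict[str, str]) -> list[RevisionRow]:
    """standard_entities: {category: canonical entity} fetched for the intent."""
    for row in rows:
        standard = standard_entities.get(row.category)
        row.revised = [standard if standard and e != standard else e
                       for e in row.entities]
    return rows

rows = build_rows({"product": ["Product A", "product-a"]},
                  intent="policy_change", user_info="user_001")
print(revise_rows(rows, {"product": "Product A"}))
```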

Optionally, the apparatus may further include: the fourth acquisition module is used for acquiring a target text input by a target user; a third input module, configured to input the target text into the intention recognition model and the entity extraction model, so that the intention recognition model recognizes an intention of the target text, and the entity extraction model extracts an entity in the target text; and the screening module is used for screening out the user information matched with the intention and the entity of the target text from the revised entity revision table and recommending the screened user information to the target user.

FIG. 5 schematically illustrates a block diagram of a computer device suitable for implementing the entity revision method according to an embodiment of the present invention. In this embodiment, the computer device 500 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack-mounted server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of a plurality of servers) capable of executing programs. As shown in fig. 5, the computer device 500 of this embodiment includes, at least but not limited to: a memory 501, a processor 502, and a network interface 503 communicatively coupled to one another via a system bus. It is noted that FIG. 5 only illustrates the computer device 500 with components 501 to 503, but it is to be understood that not all of the illustrated components are required to be implemented, and that more or fewer components may alternatively be implemented.

In this embodiment, the memory 501 includes at least one type of computer-readable storage medium, and the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 501 may be an internal storage unit of the computer device 500, such as a hard disk or an internal memory of the computer device 500. In other embodiments, the memory 501 may also be an external storage device of the computer device 500, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) provided on the computer device 500. Of course, the memory 501 may also include both an internal storage unit and an external storage device of the computer device 500. In this embodiment, the memory 501 is generally used for storing the operating system and the various application software installed in the computer device 500, such as the program code of the entity revision method. Further, the memory 501 may also be used to temporarily store various types of data that have been output or are to be output.

The processor 502 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 502 is generally used to control the overall operation of the computer device 500, for example, to run the program code of the entity revision method and to perform control and processing related to data interaction or communication of the computer device 500.

In this embodiment, the program code of the entity revision method stored in the memory 501 may be further divided into one or more program modules and executed by one or more processors (the processor 502 in this embodiment) to implement the present invention.

The network interface 503 may include a wireless network interface or a wired network interface, and is typically used to establish a communication link between the computer device 500 and other computer devices. For example, the network interface 503 is used to connect the computer device 500 to an external terminal through a network and to establish a data transmission channel and a communication link between the computer device 500 and the external terminal. The network may be a wireless or wired network such as an Intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like.

This embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, or an App application mall, on which a computer program is stored that, when executed by a processor, implements the entity revision method. Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, where the storage program area may store the operating system and the application program required for at least one function, and the storage data area may store data created according to the use of the blockchain node, and the like.

It should be noted that the blockchain referred to in the present invention is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. They may also be separately fabricated into individual integrated circuit modules, or multiple of them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and can certainly also be implemented by hardware, although in many cases the former is the better implementation.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
