Matching method and device based on text processing, computer equipment and storage medium

Document No.: 1938127. Publication date: 2021-12-07

Reading note: This technology, "Matching method and device based on text processing, computer equipment and storage medium", was designed by 杨韬 on 2021-04-25. Abstract: The embodiments of the application disclose a matching method and apparatus based on text processing, a computer device, and a storage medium. A text to be processed is acquired, wherein the text to be processed comprises a target participle (a segmented word) to be matched and an associated participle that has an association relation with the target participle on a semantic level; a candidate matching content set of the target participle is determined, wherein the candidate matching content set comprises at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information; the semantic matching degree of the target participle and the candidate matching content is calculated based on the association relation between the target participle and the associated participle and the content description information of the candidate matching content; and the target matching content of the target participle is determined and output from the candidate matching content set based on the semantic matching degree. The scheme can improve the efficiency of content matching for participles in a text.

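1. A matching method based on text processing, comprising the following steps:

acquiring a text to be processed, wherein the text to be processed comprises a target participle to be matched and an associated participle which has an association relation with the target participle on a semantic level;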

determining a candidate matching content set of the target participle, wherein the candidate matching content set comprises at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information;

calculating the semantic matching degree of the target participle and the candidate matching content based on the association relation between the target participle and the associated participle and the content description information of the candidate matching content;

and determining and outputting the target matching content of the target participle from the candidate matching content set based on the semantic matching degree.

2. The matching method based on text processing according to claim 1, wherein calculating the semantic matching degree of the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle and the content description information of the candidate matching content comprises:

calculating the semantic association degree of the target participle and the candidate matching content based on the association relation between the target participle and the associated participle, wherein the semantic association degree represents the association degree of the target participle and the candidate matching content on a semantic level;

calculating semantic similarity of the target participle and the candidate matching content based on the content description information of the candidate matching content, wherein the semantic similarity represents the similarity level of the target participle and the candidate matching content on a semantic level;

and calculating the semantic matching degree of the target participle and the candidate matching content based on the semantic association degree and the semantic similarity.

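3. The matching method based on text processing according to claim 2, wherein calculating the semantic association degree between the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle comprises:

determining a candidate matching content set of the associated participle, wherein the candidate matching content set of the associated participle comprises at least one candidate matching content of the associated participle;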

calculating semantic relevancy between candidate matching content of the target participle and candidate matching content of the associated participle based on the association relation between the target participle and the associated participle, wherein the semantic relevancy represents the correlation degree between the candidate matching content of the target participle and the candidate matching content of the associated participle on a semantic level;

and determining the semantic association degree between the target participle and the candidate matching content of the target participle based on the semantic relevancy.

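4. The matching method based on text processing according to claim 3, wherein calculating the semantic correlation between the candidate matching content of the target participle and the candidate matching content of the associated participle based on the association relationship between the target participle and the associated participle comprises:

determining a content reference set of each candidate matching content, wherein the content reference set comprises at least one reference content of the candidate matching content, and the reference content has a content reference relation with the candidate matching content;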

and calculating the semantic correlation between the candidate matching content of the target participle and the candidate matching content of the associated participle based on the content reference set corresponding to the candidate matching content of the target participle and the content reference set corresponding to the candidate matching content of the associated participle.

5. The matching method based on text processing according to claim 4, wherein calculating semantic relevance between the candidate matching content of the target participle and the candidate matching content of the associated participle based on the content reference set corresponding to the candidate matching content of the target participle and the content reference set corresponding to the candidate matching content of the associated participle comprises:

performing set operation on a content reference set corresponding to the candidate matching content of the target participle and a content reference set corresponding to the candidate matching content of the associated participle to obtain an operated target reference set, wherein the target reference set comprises at least one target reference content, the target reference content and the candidate matching content of the target participle have a content reference relationship, and the target reference content and the candidate matching content of the associated participle have a content reference relationship;

and calculating the semantic correlation between the candidate matching content of the target participle and the candidate matching content of the associated participle according to the target reference set.

6. The matching method based on text processing according to claim 2, wherein the content description information includes content introduction information and content attribute information of the candidate matching content;

calculating semantic similarity of the target participle and the candidate matching content based on the content description information of the candidate matching content, including:

acquiring context text information of the target participle in the text to be processed;

combining the content introduction information and the content attribute information to obtain combined content description information;

and calculating the semantic similarity between the target participle and the candidate matching content based on the context text information and the combined content description information.

7. The matching method based on text processing according to claim 6, wherein the content description information includes at least one item of content attribute information of the candidate matching content;

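combining the content introduction information and the content attribute information to obtain combined content description information, including:

calculating semantic relevance between the content attribute information and the context text information, wherein the semantic relevance represents the relevance of the content attribute information and the context text information on a semantic level;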

selecting target content attribute information from the at least one item of content attribute information based on the calculation result;

and combining the content introduction information and the target content attribute information to obtain combined content description information.

8. The matching method based on text processing according to claim 6, wherein calculating semantic similarity between the target participle and the candidate matching content based on the context text information and the combined content description information comprises:

acquiring a trained semantic feature extraction model;

respectively extracting the features of the context text information and the combined content description information through the semantic feature extraction model to obtain the context semantic features corresponding to the context text information and the content semantic features corresponding to the combined content description information;

and calculating the semantic similarity between the target participle and the candidate matching content based on the context semantic features and the content semantic features.

9. The matching method based on text processing according to claim 8, wherein the extracting the features of the context text information through the semantic feature extraction model to obtain the context semantic features corresponding to the context text information comprises:

performing information division on the context text information to obtain divided context text information;

performing feature conversion on the divided context text information to obtain context text features corresponding to the divided context text information;

and performing feature extraction on the context text features based on an attention mechanism through the semantic feature extraction model to obtain context semantic features corresponding to the context text features.

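10. The matching method based on text processing according to claim 8, wherein obtaining the trained semantic feature extraction model comprises:

determining a semantic feature extraction model to be trained and a sample data set required by model training, wherein the sample data set comprises a sample text, and the sample text comprises a sample participle to be matched and a sample associated participle which has an association relation with the sample participle on a semantic level;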

determining a candidate matching content set of the sample participle, wherein the candidate matching content set of the sample participle comprises at least one sample candidate matching content of the sample participle;

calculating the semantic matching degree of the sample participle and the sample candidate matching content;

and performing model training on the semantic feature extraction model to be trained based on the semantic matching degree to obtain a trained semantic feature extraction model.

11. The matching method based on text processing according to claim 2, wherein calculating the semantic matching degree of the target participle and the candidate matching content based on the semantic association degree and the semantic similarity comprises:

determining the prior importance of the candidate matching contents based on the content reference relation among the candidate matching contents;

performing fusion processing on the semantic association degree, the semantic similarity degree and the prior importance degree to obtain a fusion result;

and determining the semantic matching degree of the target participle and the candidate matching content based on the fusion result.

12. The matching method based on text processing according to claim 1, wherein determining and outputting the target matching content of the target participle from the candidate matching content set based on the semantic matching degree comprises:

based on the semantic matching degree, sorting the candidate matching contents in the candidate matching content set;

and determining and outputting the target matching content of the target participle from the candidate matching content set based on the sorting result.

13. A matching apparatus based on text processing, comprising:

an acquisition unit, configured to acquire a text to be processed, wherein the text to be processed comprises a target participle to be matched and an associated participle which has an association relation with the target participle on a semantic level;

a determining unit, configured to determine a candidate matching content set of the target participle, where the candidate matching content set includes at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information;

a calculating unit, configured to calculate the semantic matching degree of the target participle and the candidate matching content based on the association relation between the target participle and the associated participle and the content description information of the candidate matching content;

and the output unit is used for determining and outputting the target matching content of the target participle from the candidate matching content set based on the semantic matching degree.

14. An electronic device comprising a memory and a processor; the memory stores an application program, and the processor is configured to execute the application program in the memory to perform the operations of the matching method based on text processing according to any one of claims 1 to 12.

15. A storage medium storing instructions adapted to be loaded by a processor to perform the steps of the matching method based on text processing according to any one of claims 1 to 12.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a matching method and apparatus based on text processing, a computer device, and a storage medium.

Background

In text processing, content matching is performed on the participles (segmented words) in a text in order to determine the meaning that a participle represents in the text or the content to which it refers. This has very wide application in many natural language processing products.

In the research and practice of the related art, the inventors of the present application found that, when content matching is performed on a target participle in a text to be processed, the matching focuses only on the target participle itself. As a result, content matching for participles leaves room for improvement, for example in both matching efficiency and accuracy.

Disclosure of Invention

The embodiment of the application provides a matching method and device based on text processing, computer equipment and a storage medium, and the content matching efficiency aiming at word segmentation in a text can be improved.

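The embodiment of the application provides a matching method based on text processing, which comprises the following steps:

acquiring a text to be processed, wherein the text to be processed comprises a target participle to be matched and an associated participle which has an association relation with the target participle on a semantic level;

determining a candidate matching content set of the target participle, wherein the candidate matching content set comprises at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information;

calculating the semantic matching degree of the target participle and the candidate matching content based on the association relation between the target participle and the associated participle and the content description information of the candidate matching content;

and determining and outputting the target matching content of the target participle from the candidate matching content set based on the semantic matching degree.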

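Correspondingly, an embodiment of the present application further provides a matching device based on text processing, including:

an acquisition unit, configured to acquire a text to be processed, wherein the text to be processed comprises a target participle to be matched and an associated participle which has an association relation with the target participle on a semantic level;

a determining unit, configured to determine a candidate matching content set of the target participle, wherein the candidate matching content set comprises at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information;

a calculating unit, configured to calculate the semantic matching degree of the target participle and the candidate matching content based on the association relation between the target participle and the associated participle and the content description information of the candidate matching content;

and an output unit, configured to determine and output the target matching content of the target participle from the candidate matching content set based on the semantic matching degree.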

In one embodiment, the computing unit includes:

a first calculating subunit, configured to calculate a semantic association degree between the target participle and the candidate matching content based on an association relationship between the target participle and the associated participle, where the semantic association degree represents a degree of association between the target participle and the candidate matching content on a semantic level;

a second calculating subunit, configured to calculate, based on the content description information of the candidate matching content, semantic similarity between the target participle and the candidate matching content, where the semantic similarity represents a similarity level of the target participle and the candidate matching content on a semantic level;

and the third calculating subunit is used for calculating the semantic matching degree of the target participle and the candidate matching content based on the semantic association degree and the semantic similarity.

In an embodiment, the first computing subunit is configured to:

determining a candidate matching content set of the associated participle, wherein the candidate matching content set of the associated participle comprises at least one candidate matching content of the associated participle; calculating semantic relevancy between candidate matching content of the target participle and candidate matching content of the associated participle based on the association relation between the target participle and the associated participle, wherein the semantic relevancy represents the correlation degree between the candidate matching content of the target participle and the candidate matching content of the associated participle on a semantic level; and determining the semantic association degree between the target participle and the candidate matching content of the target participle based on the semantic relevancy.
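
The aggregation described above can be sketched as follows. This is a minimal illustration, not the method of the application: the function name `association_degrees`, the choice of taking the best pairwise score per associated participle and then averaging, and the toy candidates and scores are all assumptions introduced here.

```python
from typing import Callable, Dict, List


def association_degrees(
    target_candidates: List[str],
    associated_candidates: Dict[str, List[str]],
    relatedness: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score each candidate of the target participle against the candidates
    of every associated participle.

    Assumption: for each associated participle we keep the best pairwise
    relatedness, then average over the associated participles.
    """
    result: Dict[str, float] = {}
    for cand in target_candidates:
        best_per_word = [
            max((relatedness(cand, other) for other in others), default=0.0)
            for others in associated_candidates.values()
        ]
        result[cand] = sum(best_per_word) / len(best_per_word) if best_per_word else 0.0
    return result


# Toy data, invented for illustration only.
TOY_SCORES = {
    ("Apple (company)", "iPhone (smartphone)"): 0.9,
    ("apple (fruit)", "iPhone (smartphone)"): 0.1,
}


def toy_relatedness(a: str, b: str) -> float:
    # Pairs missing from the toy table are treated as unrelated.
    return TOY_SCORES.get((a, b), 0.0)


degrees = association_degrees(
    ["Apple (company)", "apple (fruit)"],
    {"iPhone": ["iPhone (smartphone)"]},
    toy_relatedness,
)
```

With the toy table, the company sense of the target participle receives the higher association degree because it is more strongly related to the only candidate of the associated participle.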

In an embodiment, the first calculating subunit is specifically configured to:

determining a content reference set of each candidate matching content, wherein the content reference set comprises at least one reference content of the candidate matching content, and the reference content has a content reference relation with the candidate matching content; and calculating the semantic correlation between the candidate matching content of the target participle and the candidate matching content of the associated participle based on the content reference set corresponding to the candidate matching content of the target participle and the content reference set corresponding to the candidate matching content of the associated participle.
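
One possible way to turn two content reference sets into a relevancy score is sketched below. The text only states that the score is based on the two sets; the Jaccard ratio used here is an assumption chosen for simplicity, and `reference_relatedness` and the example identifiers are hypothetical.

```python
from typing import Set


def reference_relatedness(refs_a: Set[str], refs_b: Set[str]) -> float:
    """Relatedness of two candidates from their content reference sets.

    Assumption: Jaccard ratio, i.e. shared reference contents divided by
    all reference contents of either candidate.
    """
    shared = refs_a & refs_b  # reference contents linked to both candidates
    union = refs_a | refs_b
    return len(shared) / len(union) if union else 0.0


# Hypothetical reference sets: identifiers of contents that have a
# content reference relation with each candidate.
score = reference_relatedness({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```

The intersection plays the role of the set of reference contents that are connected to both candidates; other normalisations of that intersection would fit the description equally well.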

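In an embodiment, the first calculating subunit is specifically configured to:

performing set operation on a content reference set corresponding to the candidate matching content of the target participle and a content reference set corresponding to the candidate matching content of the associated participle to obtain an operated target reference set, wherein the target reference set comprises at least one target reference content, the target reference content and the candidate matching content of the target participle have a content reference relationship, and the target reference content and the candidate matching content of the associated participle have a content reference relationship; and calculating the semantic correlation between the candidate matching content of the target participle and the candidate matching content of the associated participle according to the target reference set.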

In one embodiment, the content description information includes content introduction information and content attribute information of the candidate matching content; the second calculating subunit is configured to:

acquiring context text information of the target participle in the text to be processed; combining the content introduction information and the content attribute information to obtain combined content description information; and calculating the semantic similarity between the target participle and the candidate matching content based on the context text information and the combined content description information.

In an embodiment, the second calculating subunit is specifically configured to:

calculating semantic relevance between the content attribute information and the context text information, wherein the semantic relevance represents the relevance of the content attribute information and the context text information on a semantic level; selecting target content attribute information from the at least one item of content attribute information based on the calculation result; and combining the content introduction information and the target content attribute information to obtain combined content description information.
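
The attribute selection and combination steps above can be sketched as follows. Word overlap is used here only as a stand-in for the semantic relevance calculation, which the text does not specify; `overlap_relevance`, `combine_description`, the `top_k` parameter, and the sample strings are all assumptions.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lower-cased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_relevance(attribute: str, context: str) -> float:
    """Stand-in relevance: share of attribute tokens that occur in the context."""
    attr_tokens = tokens(attribute)
    if not attr_tokens:
        return 0.0
    return len(attr_tokens & tokens(context)) / len(attr_tokens)


def combine_description(
    introduction: str, attributes: List[str], context: str, top_k: int = 2
) -> str:
    """Keep the top_k attributes most relevant to the context and append
    them to the content introduction information."""
    ranked = sorted(
        attributes, key=lambda attr: overlap_relevance(attr, context), reverse=True
    )
    return " ".join([introduction] + ranked[:top_k])


combined = combine_description(
    "Apple Inc. is a technology company.",
    ["founder: Steve Jobs", "colour: red", "type: technology company"],
    "Who founded the company Apple",
    top_k=1,
)
```

Filtering attributes before combination keeps the combined description short and focused on what the surrounding context is actually about.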

In an embodiment, the second calculating subunit is specifically configured to:

acquiring a trained semantic feature extraction model; respectively extracting the features of the context text information and the combined content description information through the semantic feature extraction model to obtain the context semantic features corresponding to the context text information and the content semantic features corresponding to the combined content description information; and calculating the semantic similarity between the target participle and the candidate matching content based on the context semantic features and the content semantic features.
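
Once both inputs have been mapped to feature vectors, the similarity step can be illustrated with cosine similarity. The text does not name the similarity function, so cosine similarity is an assumption here, and the two vectors merely stand in for the outputs of the trained semantic feature extraction model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector carries no direction
    return dot / (norm_u * norm_v)


# Placeholder vectors standing in for model outputs (not real features).
context_features = [0.2, 0.7, 0.1]
content_features = [0.1, 0.8, 0.3]
similarity = cosine_similarity(context_features, content_features)
```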

In an embodiment, the second calculating subunit is specifically configured to:

performing information division on the context text information to obtain divided context text information; performing feature conversion on the divided context text information to obtain context text features corresponding to the divided context text information; and performing feature extraction on the context text features based on an attention mechanism through the semantic feature extraction model to obtain context semantic features corresponding to the context text features.
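
The three steps above (division, feature conversion, attention-based extraction) can be sketched in a few lines. This is a deliberately simplified illustration: whitespace splitting, a lookup table of vectors, and single-query scaled dot-product attention pooling are all assumptions, and a trained model would use learned parameters instead. The names `divide`, `to_features`, and `attention_pool` are hypothetical.

```python
import math
from typing import Dict, List


def divide(text: str) -> List[str]:
    """Information division: split the context text into tokens."""
    return text.lower().split()


def to_features(
    tokens: List[str], table: Dict[str, List[float]], dim: int
) -> List[List[float]]:
    """Feature conversion: look up a vector per token (zeros if unknown)."""
    return [table.get(tok, [0.0] * dim) for tok in tokens]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features: List[List[float]], query: List[float]) -> List[float]:
    """Weight token features by scaled dot-product attention against a
    query vector and return their weighted sum."""
    if not features:
        raise ValueError("at least one token feature is required")
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    return [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(len(query))
    ]


# Toy embedding table and query, invented for illustration.
table = {"apple": [1.0, 0.0], "phone": [0.0, 1.0]}
feats = to_features(divide("Apple released a phone"), table, dim=2)
pooled = attention_pool(feats, query=[1.0, 1.0])
```

The attention weights let tokens that align with the query contribute more to the resulting context semantic feature than uninformative tokens.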

In an embodiment, the second calculating subunit is specifically configured to:

determining a semantic feature extraction model to be trained and a sample data set required by model training, wherein the sample data set comprises a sample text, and the sample text comprises sample participles to be matched and sample associated participles which have an association relation with the sample participles on a semantic level; determining a candidate matching content set of the sample participle, wherein the candidate matching content set of the sample participle comprises at least one sample candidate matching content of the sample participle; calculating the semantic matching degree of the sample participle and the sample candidate matching content; and performing model training on the semantic feature extraction model to be trained based on the semantic matching degree to obtain a trained semantic feature extraction model.

In an embodiment, the third computing subunit is configured to:

determining the prior importance of the candidate matching contents based on the content reference relation among the candidate matching contents; performing fusion processing on the semantic association degree, the semantic similarity degree and the prior importance degree to obtain a fusion result; and determining the semantic matching degree of the target participle and the candidate matching content based on the fusion result.
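
A minimal sketch of the prior importance and fusion steps follows. The text does not fix either formula, so both are assumptions: prior importance is taken as each candidate's share of inbound references, and fusion is a weighted sum with illustrative weights. The names `prior_importance` and `fuse` are hypothetical.

```python
from typing import Dict, Tuple


def prior_importance(inbound_counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed prior: a candidate's share of all inbound content references."""
    total = sum(inbound_counts.values())
    return {c: (n / total if total else 0.0) for c, n in inbound_counts.items()}


def fuse(
    association: float,
    similarity: float,
    prior: float,
    weights: Tuple[float, float, float] = (0.4, 0.4, 0.2),
) -> float:
    """Assumed fusion: weighted sum of the three scores."""
    w_a, w_s, w_p = weights
    return w_a * association + w_s * similarity + w_p * prior


# Hypothetical reference counts for two candidates.
priors = prior_importance({"Apple (company)": 3, "apple (fruit)": 1})
matching_degree = fuse(association=0.9, similarity=0.8, prior=priors["Apple (company)"])
```

In practice the weights could be tuned or learned; a weighted sum is only one of several fusion schemes compatible with the description.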

In one embodiment, the output unit includes:

the sorting subunit is used for sorting the candidate matching contents in the candidate matching content set based on the semantic matching degree;

and the output subunit is used for determining and outputting the target matching content of the target word segmentation from the candidate matching content set based on the sequencing result.
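
The sorting and output steps can be sketched as below, assuming that the candidate with the highest semantic matching degree is the one output; the text only says the output is based on the sorting result. `select_target` and the sample scores are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def select_target(
    scores: Dict[str, float],
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Sort candidates by semantic matching degree (highest first) and
    return the top candidate together with the full ranking."""
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best = ranking[0][0] if ranking else None
    return best, ranking


best, ranking = select_target({"apple (fruit)": 0.21, "Apple (company)": 0.83})
```

Returning the full ranking alongside the top candidate lets a caller apply a score threshold or present several candidates when the top scores are close.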

Accordingly, the present application also provides a storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of the matching method based on text processing as described in the embodiments of the present application.

Accordingly, embodiments of the present application further provide a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the matching method based on text processing according to the embodiments of the present application.

In the embodiments of the present application, a text to be processed can be acquired, wherein the text to be processed comprises a target participle to be matched and an associated participle which has an association relation with the target participle on a semantic level; a candidate matching content set of the target participle is determined, wherein the candidate matching content set comprises at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information; the semantic matching degree of the target participle and the candidate matching content is calculated based on the association relation between the target participle and the associated participle and the content description information of the candidate matching content; and the target matching content of the target participle is determined and output from the candidate matching content set based on the semantic matching degree.

Because the scheme calculates the semantic matching degree of the target participle and the candidate matching content based on the association relation between the target participle and its associated participle, content matching does not focus on the target participle alone; it also takes into account the strong semantic correlation between the target participle and its associated participle in the text to be processed. Performing content matching on this basis improves both matching efficiency and matching accuracy. In addition, when the associated participle of the target participle is itself a participle to be matched, that is, when the text to be processed contains a plurality of participles to be matched, content matching combines the semantic relevance among those participles. Compared with performing content matching on each participle to be matched independently and in sequence, the scheme can calculate the semantic relevance among the plurality of participles to be matched simultaneously, which further improves content matching efficiency for them.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a scene schematic diagram of a matching method based on text processing according to an embodiment of the present application;

FIG. 2 is a flowchart of a matching method based on text processing according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a candidate matching content set of a matching method based on text processing according to an embodiment of the present application;

FIG. 4 is a schematic diagram illustrating semantic relevance of a matching method based on text processing according to an embodiment of the present application;

FIG. 5 is a schematic diagram illustrating content reference of a matching method based on text processing according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a relevancy network of a matching method based on text processing according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of semantic feature extraction of a matching method based on text processing according to an embodiment of the present application;

FIG. 8 is another schematic flow chart diagram of a matching method based on text processing according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a matching apparatus based on text processing according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of another matching apparatus based on text processing according to an embodiment of the present application;

FIG. 11 is another schematic structural diagram of a matching apparatus based on text processing according to an embodiment of the present application;

FIG. 12 is a schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the application provides a matching method and device based on text processing, computer equipment and a storage medium. Specifically, the embodiment of the application provides a matching device based on text processing and suitable for computer equipment. The computer device may be a terminal or a server, and the terminal may be a mobile phone, a tablet computer, a notebook computer, and the like. The server may be a single server or a server cluster composed of a plurality of servers.

In the embodiment of the present application, the matching method based on text processing is introduced by taking the case where the computer device is a server as an example.

Referring to fig. 1, a search client may be run on the terminal 20, and the terminal 20 may obtain a text to be searched sent by a user through the search client, and use the text to be searched as a text to be processed, where the text to be processed includes a target participle to be matched and an associated participle having an association relationship with the target participle on a semantic level.

The terminal 20 may transmit the text to be processed to the server 10 so that the server 10 may acquire the text to be processed accordingly. Further, the server 10 may determine a candidate matching content set of the target participle, wherein the candidate matching content set includes at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information. The server 10 may calculate the semantic matching degree between the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle and the content description information of the candidate matching content. Further, the server 10 may determine and output target matching content of the target participle from the candidate matching content set based on the calculated semantic matching degree.

The terminal 20 may correspondingly obtain the target matching content, and after generating a search result page based on the target matching content, present the search result page to the user.

Similarly, a question-answering client, a recommendation client and the like based on artificial intelligence can also run on the terminal 20, the terminal 20 can acquire a text to be processed through the client and send the text to the server 10, and the server 10 can determine and output target matching content through the matching method based on text processing, so that the terminal 20 can further interact with a user or perform data processing after acquiring the target matching content.

Each is described in detail below. It should be noted that the order in which the following embodiments are described is not intended to limit the preferred order of the embodiments.

With continued research and progress, artificial intelligence technology has been developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.

Artificial Intelligence (AI) is the theory, methods, technologies, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, and the like.

The matching method based on text processing provided by the embodiment of the present application relates to artificial intelligence technologies such as Natural Language Processing (NLP). The method may be executed by a terminal or a server, or jointly by the terminal and the server. In the embodiment of the present application, the method is described as being executed by a server; specifically, it is executed by a matching device based on text processing that is integrated in the server. As shown in fig. 2, a specific flow of the matching method based on text processing may be as follows:

101. Acquire a text to be processed, wherein the text to be processed comprises a target participle to be matched and an associated participle that has an association relationship with the target participle on a semantic level.

Text, as used herein, refers to the written form of a language, usually a sentence or a combination of sentences having a complete, systematic meaning. A sentence is a basic unit of language use and may be composed of words and phrases.

The language type of the text to be processed can be various. For example, the text to be processed can be a Chinese text, an English text, a French text, and the like; as another example, the text to be processed may be text written in a programming language; and so on.

For example, the text to be processed may be a Chinese text, such as "li X was influenced by her father from childhood, began practicing badminton in 1988, and in 1989 was selected by summer X, a well-known tennis coach, to begin practicing tennis".

As another example, the text to be processed may be an English text, such as "Influenced by his father since childhood, Zhang San began to practice badminton in 1988. In 1989, he was selected by Li Si, a famous tennis coach, to practice tennis."

The participles in the text to be processed are the words or phrases in the text to be processed, and the target participle to be matched is the participle on which content matching is to be performed. Specifically, performing content matching on a participle means matching the participle to its corresponding content.

Content is a carrier of information and may be composed of a plurality of items of information. For example, the content may include text information, picture information, audio information, video information, and the like. As another example, the content may be knowledge recorded in a knowledge base: knowledge recorded in a database may be represented as related data records in the database, and knowledge recorded in an application may be represented as web page content and the like.

For example, referring to fig. 3, each meaning item corresponds to an entity named "li X" in a knowledge base, and each entity in the knowledge base has related content information. It should therefore be noted that, in the present application, the process of performing content matching on a target participle in the text to be processed may also be described as the process of matching the named entity corresponding to the target participle to the correct entity, for example, the correct entity in the knowledge base.

In practical applications, the user may jump to the content presentation page of the corresponding entity by clicking on a semantic item in fig. 3. A polysemous entry is composed of an entry name and a plurality of meaning items. The descriptions of the different concepts that share the same entry name are called meaning items. A meaning item description is a concise description of the meaning item and is the content that best represents its attributes and characteristics. For example, the entry "apple" has a plurality of meaning items, including a fruit tree, a fruit, a company, a movie, and the like.

In practical applications, content matching for the target participle may have a plurality of different application scenarios. For example, it may be applied in search. Taking the text to be processed "X deluxe X lane" as an example, suppose that "X deluxe" is not only identified as a movie star and "X lane" as a movie, but the two entities are also linked to the corresponding entities in the knowledge base. More detailed information about the two entities can then be obtained through the knowledge base, such as the age, constellation, and representative works of "X deluxe", and the director, actors, and release date of "X lane". This information not only provides the user with rich presentation information, but also helps with a deeper understanding of the text to be processed and with retrieving more relevant content.

As another example, a question-answering system needs to analyze a user question accurately, which is also an application scenario of content matching. For example, the text to be processed may be the user question "relationship of money XX and Tsinghua University", and "money XX" and "Tsinghua University" may be matched against the knowledge records in the knowledge base so that the correct answer to the question can be obtained through the knowledge base.

As another example, content matching may also be applied in a recommendation system. For example, if the user clicks on the article "X ma son announces to establish a collaboration with X news video", it may be determined that the user may be interested in the company entity "X ma son" instead of the company entity "X ma son river", and then the user may be recommended the relevant information of the company entity "X ma son".

As an example, content matching may be applied in entity disambiguation in NLP. Entity disambiguation, sometimes also called entity linking, is a fundamental technique in NLP that aims to link named entities identified in text to the correct entities in the knowledge base. For example, suppose "li X" is recognized as a person name in the text to be processed. In one embodiment, referring to fig. 3, there are many persons named "li X" in the knowledge base, so "li X" in the text to be processed needs to be linked to the correct entity, namely the entity "Chinese female tennis celebrity". This is a process of performing content matching on the target participle "li X", that is, matching the target participle to the correct content corresponding to it.

The associated participles of the target participle are the participles in the text to be processed that have an association relationship with the target participle on a semantic level. The association relationship on the semantic level characterizes how the participles are associated with one another to form the meaning of the sentence.

As an example, if the text to be processed is "li X was influenced by her father from childhood, began practicing badminton in 1988, and in 1989 was selected by summer X, a well-known tennis coach, to begin practicing tennis", and the target participle is "li X", the associated participle of the target participle may be "summer X", because associating "li X" with "summer X" helps to understand the correct meaning of the text to be processed.

It is noted that there may be multiple associated participles. For example, if the text to be processed is "when interviewed by a reporter, X blue was asked about X mus's decision to return to X shi, and he said, 'I think X mus considered it very carefully and handled it very well'", and the target participle is "X blue", the associated participles of the target participle may include "X mus" and "X shi", because associating "X blue", "X mus" and "X shi" helps to understand the correct meaning of the text to be processed.

It should be noted that an associated participle may itself be a participle to be matched in the text to be processed; that is, the text to be processed may include more than one target participle to be matched, and the target participles have an association relationship with one another on a semantic level. For example, if the text to be processed is "when interviewed by a reporter, X blue was asked about X mus's decision to return to X shi, and he said, 'I think X mus considered it very carefully and handled it very well'", the target participles may include "X blue", "X mus" and "X shi", which have an association relationship on a semantic level; that is, associating these target participles helps to understand the correct meaning of the text to be processed.

In the present application, there may be multiple ways of obtaining the text to be processed, for example, the terminal may send the text to be processed to the server, so that the server obtains the text to be processed. Specifically, a client may be run on the terminal, and the terminal may obtain the text to be processed through the client, for example, a search client, a question and answer system client, a recommendation system client, and the like may be run on the terminal.

102. Determine a candidate matching content set of the target participle, wherein the candidate matching content set comprises at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information.

The candidate matching content of the target participle refers to content that may be matched with the target participle. For example, taking the text to be processed "li X was influenced by her father from childhood, began practicing badminton in 1988, and in 1989 was selected by summer X, a well-known tennis coach, to begin practicing tennis" as an example, the target participle may be "li X", and the candidate matching content of the target participle may be the 115 semantic items of "li X" shown in fig. 3, each of which has corresponding content information in the knowledge base.

The candidate matching content set of the target participle is a set formed by candidate matching content of the target participle, wherein at least one candidate matching content of the target participle may be included, for example, the candidate matching content set of the target participle "li X" may be a set formed by 115 semantic items shown in fig. 3.

The content description information is information describing candidate matching content, for example, the content description information may include content profile information and content attribute information of the candidate matching content, where the content profile information is related information that briefly introduces an entity corresponding to the candidate matching content, and the content attribute information is related information that describes an attribute of the entity corresponding to the candidate matching content.

As an example, referring to fig. 3, for the candidate matching content corresponding to "Chinese female tennis celebrity", the content description information may include content profile information and content attribute information. For example, the content profile information may be "li X is a Chinese female tennis celebrity who has participated in international competitions many times and has won many major honors ……"; the content attribute information may include related information about the attributes of the entity corresponding to "Chinese female tennis celebrity", for example, the following attributes: Chinese name, foreign name, alias, nationality, famous family, place of birth, date of birth, height, weight, university, etc.

The candidate matching content set of the target participle may be determined in various ways. For example, the server may obtain the candidate matching content set of the target participle by requesting the relevant data from the terminal or from other servers.

As an example, a candidate matching content set of the target participle may be stored in a database; for example, the knowledge base may include the candidate matching content set of the target participle. A data query request for obtaining the candidate matching content of the target participle may then be generated based on the target participle. Further, response data returned in response to the data query request may be obtained, and the response data may include at least one candidate matching content of the target participle. The candidate matching content set of the target participle can thus be determined based on the obtained candidate matching content.
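
The query-and-response flow described above can be sketched as follows. This is only a minimal illustration that assumes the knowledge base is an in-memory dictionary keyed by entry name; the names `CandidateContent`, `KNOWLEDGE_BASE` and `get_candidate_set` are hypothetical and do not come from the application, and the sample entries are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateContent:
    """One candidate matching content (a meaning item) with its description."""
    content_id: str
    profile: str                                     # content profile information
    attributes: dict = field(default_factory=dict)   # content attribute information


# Hypothetical in-memory knowledge base: entry name -> meaning items.
KNOWLEDGE_BASE = {
    "li X": [
        CandidateContent("li X#1", "Chinese female tennis celebrity",
                         {"nationality": "China"}),
        CandidateContent("li X#2", "placeholder for another person named li X"),
    ],
}


def get_candidate_set(target_participle, knowledge_base=KNOWLEDGE_BASE):
    """Answer a data query for the target participle with its candidate set.

    An unknown participle yields an empty list rather than an error.
    """
    return list(knowledge_base.get(target_participle, []))


candidates = get_candidate_set("li X")
print([c.content_id for c in candidates])
```

In a deployed system the dictionary lookup would presumably be replaced by a request to a database or another server, as the text describes.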

103. Calculate the semantic matching degree of the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle and on the content description information of the candidate matching content.

The semantic matching degree describes the matching degree of the target participle and the candidate matching content on the semantic level. The expression form of the semantic matching degree can be various, for example, a semantic matching score of the target participle and the candidate matching content can be calculated, and the semantic matching degree can be represented by the semantic matching score.

The semantic matching degree of the target participle and the candidate matching content may be calculated in various ways. For example, a semantic association degree may be calculated based on the association relationship between the target participle and the associated participle, a semantic similarity may be calculated based on the content description information of the candidate matching content, and the semantic matching degree may then be determined from the two calculation results. Specifically, the step "calculating the semantic matching degree between the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle and the content description information of the candidate matching content" may include:

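calculating the semantic association degree of the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle, wherein the semantic association degree represents the degree of association between the target participle and the candidate matching content on a semantic level;

calculating the semantic similarity of the target participle and the candidate matching content based on the content description information of the candidate matching content, wherein the semantic similarity represents the similarity of the target participle and the candidate matching content on a semantic level;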

and calculating the semantic matching degree of the target participle and the candidate matching content based on the semantic association degree and the semantic similarity.
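
As a rough illustration of the last step, the sketch below combines the two scores into a single matching score and picks the best candidate. The passage does not say how the two scores are combined, so the weighted sum and the weight `alpha` are assumptions, the function names are hypothetical, and the scores are made up.

```python
def semantic_matching_degree(association, similarity, alpha=0.5):
    """Combine the two scores into one matching score.

    Assumption: a convex combination with weight `alpha`; the text does not
    prescribe this particular formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * association + (1.0 - alpha) * similarity


def pick_target_content(scores, alpha=0.5):
    """scores: candidate id -> (semantic association degree, semantic similarity)."""
    return max(
        scores,
        key=lambda cid: semantic_matching_degree(*scores[cid], alpha=alpha),
    )


# Illustrative, made-up scores for two candidates of one target participle.
example_scores = {
    "A1": (0.10, 0.30),
    "A2": (0.80, 0.60),
}
best = pick_target_content(example_scores)
print(best)
```

Other combinations (for example a product or a learned ranking function) would fit the same description; the weighted sum is used here only because it is the simplest.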

The step of "calculating the semantic association degree between the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle" is described first.

The semantic association degree represents the association degree of the target participle and the candidate matching content on the semantic level. Specifically, in the text to be processed, the target participle and the associated participle thereof have an association relationship on a semantic level, and the association relationship describes how to understand the correct meaning of the text to be processed by associating the target participle with the associated participle thereof, so that it can be known that the associated participle is helpful for determining the target matching content of the target participle from the candidate matching content set of the target participle.

For example, consider the text to be processed "In 2004, li X chose to make a comeback with the encouragement and support of her husband, ginger X." The target participle may be "li X", and the associated participle of the target participle may be "ginger X". Both "li X" and "ginger X" may correspond to a plurality of entities in the knowledge base. However, the semantic association degree of the entity pair "li X (Chinese female tennis celebrity)" and "ginger X (famous tennis trainer)" is higher than that of the other entity pairs. Therefore, calculating the semantic association degree between the target participle and the associated participle helps to determine the target matching content of the target participle from its candidate matching content set.

It should be noted that after the semantic association degree between the target participle and the associated participle is calculated, the semantic association degree is not only helpful for performing content matching on the target participle, but also is similarly helpful for performing content matching on the associated participle.

For example, in the text to be processed "In 2004, li X chose to make a comeback with the encouragement and support of her husband, ginger X.", the target participle "li X" has the associated participle "ginger X". Since the semantic association degree between "li X" and "ginger X" is calculated in the process of content matching of "li X", this semantic association degree not only facilitates content matching of "li X", but also similarly facilitates content matching of "ginger X".

There are various ways of calculating the semantic association degree between the target participle and its candidate matching content based on the association relationship between the target participle and the associated participle. For example, the semantic association degree may be determined by calculating the degree of correlation, on a semantic level, between the candidate matching content of the target participle and the candidate matching content of the associated participle. Specifically, the step "calculating the semantic association degree between the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle" may include:

calculating the semantic relatedness between the candidate matching content of the target participle and the candidate matching content of the associated participle based on the association relationship between the target participle and the associated participle, wherein the semantic relatedness characterizes the degree of correlation between the candidate matching content of the target participle and the candidate matching content of the associated participle on a semantic level;

and determining the semantic association degree between the target participle and the candidate matching content of the target participle based on the semantic relatedness.

For example, taking the text to be processed "li X was influenced by her father from childhood, began practicing badminton in 1988, and in 1989 was selected by summer X, a well-known tennis coach, to begin practicing tennis" as an example, the target participle may be "li X" and the associated participle may be "summer X"; the candidate matching content of the associated participle may then be a meaning item of "summer X" in the knowledge base.

The candidate matching content set of the associated participle is a set formed by candidate matching content of the associated participle, wherein at least one candidate matching content of the associated participle may be included, for example, the candidate matching content set of the associated participle "summer X" may be a set formed by a meaning item of "summer X" in the knowledge base.

Similarly, for the manner of determining the candidate matching content set of the associated participle, reference may be made to the manner of determining the candidate matching content set of the target participle, which is not described herein again.

Further, semantic relatedness between the candidate matching content of the target participle (hereinafter referred to as first candidate content) and the candidate matching content of the associated participle (hereinafter referred to as second candidate content) can be calculated.

As an example, referring to fig. 4, the text to be processed may be "when interviewed by a reporter, X blue was asked about X mus's decision to return to X shi, and he said, 'I think X mus considered it very carefully and handled it very well'", where the target participle is "X blue" and the associated participles may be "X mus" and "X shi". The following description takes as an example the calculation of the semantic relatedness between the candidate matching content of the target participle "X blue" and the candidate matching content of the associated participle "X mus".

Referring to fig. 4, the semantic relatedness between candidate matching contents can be represented by connecting lines. Specifically, in this example, the target participle "X blue" has the following two candidate matching contents: "A1 · X blue (historian)" and "A2 · X blue (basketball player)". The associated participle "X mus" has the following two candidate matching contents: "B1 · X mus (basketball player)" and "B2 · X mus (actor, singer)". Therefore, referring to fig. 4, the semantic relatedness between the candidate matching content of the target participle "X blue" and the candidate matching content of the associated participle "X mus" can be illustrated by connecting lines.

Similarly, the semantic relatedness between the candidate matching content of "X blue" and the candidate matching content of "X shi" can also be illustrated as shown in fig. 4.

The semantic relatedness between the first candidate content and the second candidate content may be calculated in various ways. For example, in practical applications, contents reference one another: if the contents exist in the form of pages, the content reference may be a link between the pages; if the contents exist in the form of articles, the content reference may be a citation between the articles; if the contents exist in the form of program files, the content reference may be a reference between programs; and so on. The larger the intersection of the content references of two candidate matching contents, the higher their semantic relatedness; for example, the more common pages link to two entities, the higher the semantic relatedness of the two entities. Therefore, the semantic relatedness between the first candidate content and the second candidate content may be calculated based on the content references of each candidate content. Specifically, the step "calculating the semantic relatedness between the candidate matching content of the target participle and the candidate matching content of the associated participle based on the association relationship between the target participle and the associated participle" may include:

determining a content reference set corresponding to the candidate matching content of the target participle and a content reference set corresponding to the candidate matching content of the associated participle, wherein each content reference set comprises at least one reference content having a content reference relationship with the corresponding candidate matching content;

and calculating the semantic relatedness between the candidate matching content of the target participle and the candidate matching content of the associated participle based on the content reference set corresponding to the candidate matching content of the target participle and the content reference set corresponding to the candidate matching content of the associated participle.

The content reference relationship refers to a relationship of mutual reference or one-way reference between contents. The reference situation may be various, for example, if the content exists in the form of a page, the content reference may be a link between pages; for another example, if the content exists in the form of articles, the content reference may be a reference between the articles; for another example, if the content exists in the form of a program file, the content reference may be a reference between programs; and so on.

The reference content of the candidate matching content refers to content having a content reference relationship with the candidate matching content. For example, if the content exists in the form of a page, the reference content of the page a may include an in-link page and an out-link page of the page a.

Wherein, the content reference set of the candidate matching content is a set formed by the reference content of the candidate matching content. For example, the content reference set of the first candidate content is a set of reference contents of the first candidate content, and the content reference set of the second candidate content is a set of reference contents of the second candidate content.

By way of example, the content may exist in the form of pages, and the content reference may be a link between pages. Referring to fig. 5, take the candidate matching content "A2 · X blue (basketball player)" of the target participle "X blue" and the candidate matching content "B1 · X mus (basketball player)" of the associated participle "X mus" as an example. The content reference set of the candidate matching content "A2 · X blue (basketball player)" includes 5 reference contents, i.e. 5 in-link pages: "basketball team D", "XX university", "XX league most valuable player", "XX sports brand", and "XX finals most valuable player". Similarly, the content reference set of the candidate matching content "B1 · X mus (basketball player)" may be determined.

Referring to fig. 4, the target participle "X blue" and the associated participle "X mus" both have a plurality of candidate matching contents, but there is a strong semantic relatedness between their respective target matching contents, namely "A2 · X blue (basketball player)" and "B1 · X mus (basketball player)". As can be seen from fig. 5, the more reference contents two candidate matching contents have in common, the higher the semantic relatedness between them; specifically, the more common pages link to the two candidate matching contents, the higher their semantic relatedness. Thus, the semantic relatedness between candidate matching contents may be calculated based on the content reference sets of the candidate matching contents. Specifically, the step of "calculating the semantic relatedness between the candidate matching content of the target participle and the candidate matching content of the associated participle based on the content reference set corresponding to the candidate matching content of the target participle and the content reference set corresponding to the candidate matching content of the associated participle" may include:

performing set operation on a content reference set corresponding to candidate matching content of the target participle and a content reference set corresponding to candidate matching content of the associated participle to obtain an operated target reference set, wherein the target reference set comprises at least one target reference content, the target reference content has a content reference relation with the candidate matching content of the target participle, and the target reference content has a content reference relation with the candidate matching content of the associated participle;

Set operations are operations performed on sets, for example, set operations may include basic operations of sets, such as intersections, unions, relative complements, absolute complements, subsets, and the like.

As an example, referring to fig. 5, the target reference contents that have a content reference relationship both with the candidate matching content "A2 · X blue (basketball player)" of the target participle "X blue" and with the candidate matching content "B1 · X mus (basketball player)" of the associated participle "X mus" are the following three: "XX league most valuable player", "XX sports brand", and "XX finals most valuable player". After the target reference contents are determined, the target reference set consisting of them can be determined.
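
The set operation in this example is an intersection, which can be sketched in a few lines. The in-link pages of "A2" follow the list given for fig. 5 (page titles paraphrased); the in-link set of "B1" is not listed in the text, so the one used here is an assumption that contains the three shared pages plus one made-up page.

```python
# In-link pages of "A2 · X blue (basketball player)", as listed for fig. 5.
refs_a2 = {
    "basketball team D",
    "XX university",
    "XX league most valuable player",
    "XX sports brand",
    "XX finals most valuable player",
}

# Assumed in-link pages of "B1 · X mus (basketball player)"; only the three
# shared pages come from the text, "basketball team E" is a made-up filler.
refs_b1 = {
    "basketball team E",
    "XX league most valuable player",
    "XX sports brand",
    "XX finals most valuable player",
}

# The set operation: keep the pages that reference both candidates.
target_reference_set = refs_a2 & refs_b1
print(sorted(target_reference_set))
```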

Further, the semantic relatedness between candidate matching contents can be calculated according to the target reference set. In one embodiment, the contents may exist in the form of pages and the content references may be links between the pages, in which case the semantic relatedness WLM(e1, e2) between candidate matching content e1 and candidate matching content e2 may be calculated using a Wikipedia Link-based Measure (WLM) from the following quantities:

S_e1 denotes the in-link set of page e1, i.e., the set of pages that link to candidate matching content e1; S_e2 denotes the in-link set of page e2, i.e., the set of pages that link to candidate matching content e2; D denotes the set of all documents, for example, all documents of Wikipedia.
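
For orientation, a commonly published form of the link-based measure can be written with the symbols above as WLM(e1, e2) = 1 - (log max(|S_e1|, |S_e2|) - log |S_e1 ∩ S_e2|) / (log |D| - log min(|S_e1|, |S_e2|)). Treating this as the variant intended here is an assumption, and the function name `wlm`, the clipping to [0, 1], and the sample sets below are illustrative only.

```python
import math


def wlm(in_links_e1, in_links_e2, total_docs):
    """Link-based relatedness of two pages from their in-link sets.

    in_links_e1, in_links_e2: sets of pages linking to e1 and e2 (S_e1, S_e2).
    total_docs: number of documents in the collection (|D|).
    Returns a value in [0, 1]; 0.0 when the pages share no in-links.
    """
    common = len(in_links_e1 & in_links_e2)
    if common == 0:
        return 0.0
    larger = max(len(in_links_e1), len(in_links_e2))
    smaller = min(len(in_links_e1), len(in_links_e2))
    if total_docs <= smaller:
        raise ValueError("total_docs must exceed the size of both in-link sets")
    distance = (math.log(larger) - math.log(common)) / (
        math.log(total_docs) - math.log(smaller)
    )
    # Negative values can occur for weakly related pages; clip them to 0.
    return max(0.0, 1.0 - distance)


# Made-up in-link sets in a collection of 1000 documents.
s_e1 = {"p1", "p2", "p3", "p4"}
s_e2 = {"p3", "p4", "p5"}
score = wlm(s_e1, s_e2, total_docs=1000)
print(round(score, 4))
```

The score grows with the size of the shared in-link set relative to the larger of the two sets, which matches the observation that candidates linked from many common pages are more closely related.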

Similarly, for the candidate matching contents of each participle to be matched in the text to be processed, the semantic relatedness between any pair of candidate matching contents can be calculated, and a relevance network can be constructed based on the calculation results. For illustration, the semantic relatedness between candidate matching contents may be represented by connecting lines whose thickness is adjusted according to the calculated semantic relatedness; for example, the greater the semantic relatedness, the thicker the connecting line, so that the relevance network shows the semantic relatedness between the candidate matching contents more clearly. As an example, the relevance network shown in fig. 6 may be constructed, in which, by calculating the semantic relatedness between the candidate entities pairwise, the semantic relatedness among "A2 · X blue (basketball player)", "C2 · X players", and "B1 · X mus" may be determined to be the greatest.
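
The construction of such a relevance network can be sketched as follows. The sketch connects only candidates that belong to different participles, which is an assumption based on how the connecting lines are drawn in the figures; the Jaccard overlap is a simple stand-in for the relatedness measure, and all names and reference sets are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Stand-in relatedness: share of common references (not WLM itself)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_relevance_network(candidates, relatedness=jaccard):
    """Return {(node_u, node_v): weight} for candidates of different participles.

    candidates: participle -> {candidate id -> set of reference contents}.
    Pairs with zero relatedness get no connecting line.
    """
    edges = {}
    for (_, cands_1), (_, cands_2) in combinations(candidates.items(), 2):
        for id_1, refs_1 in cands_1.items():
            for id_2, refs_2 in cands_2.items():
                weight = relatedness(refs_1, refs_2)
                if weight > 0:
                    edges[(id_1, id_2)] = weight
    return edges


# Made-up reference sets, for illustration only.
candidates = {
    "X blue": {"A1": {"history journal"}, "A2": {"league MVP", "sports brand"}},
    "X mus": {"B1": {"league MVP", "sports brand"}, "B2": {"film award"}},
}
network = build_relevance_network(candidates)
print(network)
```

The edge weights play the role of the line thickness described above: a heavier edge corresponds to a thicker connecting line.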

After the semantic relatedness between the candidate matching content of the target participle and the candidate matching content of the associated participle is obtained through calculation, the semantic association degree between the target participle and the candidate matching content of the target participle can be determined based on the semantic relatedness.

For example, in an embodiment, for a plurality of candidate matching contents of the target participle, the semantic association degree between the target participle and each of its candidate matching contents may be calculated based on the constructed relevance network. The semantic association degree may be characterized by a semantic association degree score; specifically, the semantic association degree score between the target participle and each of its candidate matching contents may be calculated by the PageRank algorithm.
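
One way to read this step is to run PageRank over the relevance network and use each candidate's rank as its score. The text does not specify the PageRank variant, so the sketch below assumes a weighted, undirected graph with positive edge weights, a damping factor of 0.85, and a fixed number of power iterations; the function name and the edge weights are made up.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank on an undirected graph given as {(u, v): weight}.

    Assumes every weight is positive, so each node has a non-zero strength.
    """
    neighbors = {}
    for (u, v), weight in edges.items():
        neighbors.setdefault(u, {})[v] = weight
        neighbors.setdefault(v, {})[u] = weight
    nodes = sorted(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    # Strength of a node: total weight of the lines attached to it.
    strength = {node: sum(neighbors[node].values()) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            incoming = sum(
                rank[other] * weight / strength[other]
                for other, weight in neighbors[node].items()
            )
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Made-up relevance network: three closely related candidates and a weak pair.
edges = {
    ("A2", "B1"): 0.9,
    ("A2", "C2"): 0.8,
    ("B1", "C2"): 0.7,
    ("A1", "B2"): 0.1,
}
rank = pagerank(edges)
print(max(("A1", "A2"), key=rank.get))
```

In this toy network the candidate "A2" sits in the strongly connected group and therefore receives a higher rank than "A1".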

As another example, in an embodiment, a semantic relevance between the target participle and each candidate matching content of the target participle may be calculated based on the generated relevance network. By way of example, referring to fig. 6, the target participle "X lante" has two candidate matches: "a 1 · X lan (historian)" and "a 2 · X lan (basketball player)", and the semantic relevance between each candidate matching content of the target participle and each candidate matching content of the associated participle is shown in fig. 6, in practical application, the semantic relevance may be embodied in numbers, for example, the semantic relevance score may be used as a weight corresponding to each connecting line in fig. 6 to label each connecting line.

Further, for each candidate matching content of the target participle, a score may be determined from these weights and used as the semantic association degree score between that candidate matching content and the target participle. For example, the sum of the weights of the connecting lines attached to a candidate matching content in the relevance network may be used as the semantic association degree score of that candidate matching content and the target participle.
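The weight-sum variant can be sketched as follows, assuming the relevance network is given as a dictionary of weighted connecting lines. The candidate labels and weights are hypothetical.

```python
def weighted_degree_scores(edges, candidates):
    """Score each candidate by the sum of the weights of its connecting lines."""
    scores = {candidate: 0.0 for candidate in candidates}
    for (a, b), weight in edges.items():
        if a in scores:
            scores[a] += weight
        if b in scores:
            scores[b] += weight
    return scores


# Hypothetical weights between the target participle's candidates (A1, A2)
# and the associated participles' candidates (B1, C2).
edges = {
    ("A1", "B1"): 0.1,
    ("A1", "C2"): 0.2,
    ("A2", "B1"): 0.9,
    ("A2", "C2"): 0.8,
}
scores = weighted_degree_scores(edges, ["A1", "A2"])
```

Compared with PageRank, this score only looks at a candidate's direct connecting lines, which is cheaper but ignores how well connected its neighbors are.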

The step "calculating the semantic similarity between the target participle and the candidate matching content based on the content description information of the candidate matching content" is described below.

It should be noted that the present application does not limit the execution order of the step "calculating the semantic association degree between the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle" and the step "calculating the semantic similarity between the target participle and the candidate matching content based on the content description information of the candidate matching content". For example, the two steps may be executed simultaneously or one after the other, and when they are not executed simultaneously, either may be executed first.

The semantic similarity between the target participle and the candidate matching content describes the similarity between the target participle and the candidate matching content on a semantic level. Specifically, in the text to be processed, the target participle may be replaced with the candidate matching content, and if the candidate matching content has a higher semantic similarity with the target participle, the replaced text to be processed and the original text to be processed also have a higher similarity in a semantic level.

The semantic similarity between the target participle and the candidate matching content may be calculated from the content description information of the candidate matching content in various ways. For example, the content description information may include content profile information and content attribute information of the candidate matching content, which describe the candidate matching content at different granularities. The two can therefore be combined, and the semantic similarity between the target participle and the candidate matching content calculated from the combined result. Specifically, the step "calculating the semantic similarity between the target participle and the candidate matching content based on the content description information of the candidate matching content" may include:

acquiring context text information of the target participle in the text to be processed;

combining the content profile information and the content attribute information to obtain combined content description information;

and calculating the semantic similarity between the target participle and the candidate matching content based on the context text information and the combined content description information.

The context text of the target participle is the text surrounding the target participle in the text to be processed. For example, the text in the text to be processed other than the target participle may be used as the context text of the target participle. Correspondingly, the context text information is the information related to the context text.

As an example, the text to be processed may be "lie X began practicing tennis in 1989", where the target participle may be "lie X" and the context text information of "lie X" may be "began practicing tennis in 1989".

The context text information of the target participle in the text to be processed may be obtained in various ways. For example, the target participle in the text to be processed may be masked, and the masked text used as the context text information of the target participle. For the text to be processed "lie X began practicing tennis in 1989", the string [MASK] may be used to mask the target participle "lie X", giving the masked context text information "[MASK] began practicing tennis in 1989".
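The masking step can be sketched in a few lines. The function name `mask_target` is hypothetical, and replacing only the first occurrence of the target participle is an assumption made for this illustration.

```python
def mask_target(text, target, mask_token="[MASK]"):
    """Replace the first occurrence of the target participle with the mask token."""
    if target not in text:
        raise ValueError("target participle not found in text")
    return text.replace(target, mask_token, 1)


context = mask_target("lie X began practicing tennis in 1989", "lie X")
```

If the target participle can occur several times in the text, a character offset for the mention to be matched would be needed instead of a plain string search.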

Further, the content profile information and the content attribute information of the candidate matching content may be combined in various ways. A candidate matching content may have multiple content attributes, so the content description information may include multiple items of content attribute information, but not every item is helpful or relevant for calculating the semantic similarity between the target participle and the candidate matching content. The content attribute information may therefore be filtered first, and the information combination performed on the filtered content attribute information. Specifically, the step "combining the content profile information and the content attribute information to obtain the combined content description information" may include:

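calculating the semantic relevance between the at least one item of content attribute information and the context text information;

selecting target content attribute information from the at least one item of content attribute information based on the calculation result;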

and combining the content profile information and the target content attribute information to obtain combined content description information.

The semantic relevance between the content attribute information and the context text information represents the relevance between the content attribute information and the context text information on a semantic level.

The semantic relevance may be calculated in various ways; for example, the semantic relevance of the content attribute information and the context text information may be determined by calculating the distance between the two. Specifically, word segmentation may be performed on the context text information and the content attribute information respectively, to obtain segmented context text information, which may include at least one context participle, and segmented content attribute information, which may include at least one content attribute participle.

Pre-trained word vectors can then be used to vectorize the context participles and the content attribute participles respectively, giving a context participle vector for each context participle and a content attribute participle vector for each content attribute participle. The context participle vectors can be summed and averaged, and the result used as the context text vector corresponding to the context text information. Likewise, the content attribute participle vectors can be summed and averaged, and the result used as the content attribute vector corresponding to the content attribute information.

In this way, the semantic relevance between the content attribute information and the context text information can be calculated by calculating the vector similarity between the context text vector and the content attribute vector. The vector similarity may be calculated in various ways, for example as a cosine similarity, a Euclidean distance, a Manhattan distance, or a Pearson correlation coefficient. The semantic relevance between the content attribute information and the context text information may then be determined based on the calculated vector similarity: for example, the vector similarity may be used directly as the semantic relevance; as another example, the vector similarity may be further processed according to the service requirement and the processing result used as the semantic relevance; and so on.

After determining the semantic relatedness between the content attribute information and the context text information, the target content attribute information may be selected from the at least one item of content attribute information based on the calculation result. For example, the content attribute information may be sorted based on the calculation result, and the target content attribute information may be selected from the plurality of items of content attribute information based on the sorting result. For example, content attribute information with semantic relevance meeting a preset threshold condition may be selected as target content attribute information; for another example, content attribute information of a preset proportion may be selected as the target content attribute information based on the sorting result; and so on.
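The attribute filtering described above can be sketched as follows, assuming cosine similarity as the vector similarity and a fixed threshold as the selection rule. The two-dimensional "word vectors", the attribute names and the threshold of 0.5 are toy values invented for illustration; a real system would load pre-trained word vectors.

```python
import math


def mean_vector(tokens, word_vectors):
    """Sum and average the vectors of the tokens that have a word vector."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("no token has a word vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_attributes(context_tokens, attributes, word_vectors, threshold=0.5):
    """Return attribute names whose relevance to the context meets the
    threshold, sorted from most to least relevant."""
    context_vec = mean_vector(context_tokens, word_vectors)
    scored = []
    for name, tokens in attributes.items():
        score = cosine(context_vec, mean_vector(tokens, word_vectors))
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for score, name in scored if score >= threshold]


# Toy word vectors and attributes (hypothetical values).
word_vectors = {
    "tennis": [1.0, 0.0], "practicing": [0.9, 0.1],
    "sport": [1.0, 0.1], "athlete": [0.8, 0.2],
    "album": [0.0, 1.0], "song": [0.1, 0.9],
}
attributes = {
    "occupation": ["athlete", "sport"],
    "discography": ["album", "song"],
}
selected = select_attributes(["practicing", "tennis"], attributes, word_vectors)
```

Selecting a preset proportion instead of applying a threshold would only change the last line of `select_attributes` to slice the sorted list.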

After the content attribute information of the candidate matching content has been filtered to obtain the target content attribute information, the content profile information and the target content attribute information can be combined to obtain the combined content description information. The information may be combined in various ways, for example by concatenating the content profile information and the target content attribute information.

Optionally, in order to better characterize the target content attribute information of the candidate matching content, an attribute descriptor may be set for the target content attribute information. Specifically, each item of target content attribute information may be assigned an identification number (ID), and an attribute ID sequence may be added to indicate the attribute from which each text segment is derived. For example, referring to fig. 7, the text to be processed may be "lie X began practicing tennis in 1989", where the candidate matching content of the target participle "lie X" has the content profile information "lie X profile …" and target content attribute information that includes the birthday attribute information "2 nd 1982". The attribute ID of the content profile information may be set to 1 and the attribute ID of the birthday attribute information to 3, and the corresponding attribute descriptors may then be added to the content profile information and the birthday attribute information of the candidate matching content respectively, as shown in fig. 7. Note that each attribute ID corresponds to a vector that can be learned together with the model.
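One way to picture the attribute ID sequence is as a list running parallel to the tokens of the combined content description information. The sketch below is an assumption about the data layout, not the layout shown in fig. 7: the function name is hypothetical, whitespace splitting stands in for real word segmentation, and the segment texts are placeholders.

```python
def combine_description(segments):
    """Concatenate (attribute_id, text) segments and build a parallel ID sequence.

    Returns (tokens, attribute_ids), where attribute_ids[i] is the ID of the
    attribute that tokens[i] came from.
    """
    tokens, attribute_ids = [], []
    for attribute_id, text in segments:
        words = text.split()  # stand-in for real word segmentation
        tokens.extend(words)
        attribute_ids.extend([attribute_id] * len(words))
    return tokens, attribute_ids


# ID 1 = content profile, ID 3 = birthday, following the example in the text.
tokens, attribute_ids = combine_description([
    (1, "profile text here"),
    (3, "birthday text"),
])
```

Each ID in `attribute_ids` could then index an embedding table whose rows are learned together with the model, in the same way as the per-ID vectors mentioned above.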

After the context text information and the combined content description information are obtained, the semantic similarity between the target participle and the candidate matching content can be calculated based on them. The calculation may be performed in various ways, for example by a trained model. Specifically, the step "calculating the semantic similarity between the target participle and the candidate matching content based on the context text information and the combined content description information" may include:

acquiring a trained semantic feature extraction model;

respectively extracting the features of the context text information and the combined content description information through a semantic feature extraction model to obtain context semantic features corresponding to the context text information and content semantic features corresponding to the combined content description information;

and calculating the semantic similarity between the target participle and the candidate matching content based on the context semantic features and the content semantic features.

The semantic feature extraction model is a model for extracting semantic features from text information, and it may be of various types. For example, a Bidirectional Encoder Representations from Transformers (BERT) model may be used as the semantic feature extraction model; as another example, a Long Short-Term Memory (LSTM) network may be used; as a further example, a Gated Recurrent Unit (GRU) network may be used; and so on.

In an embodiment, referring to fig. 7, a BERT model may be used as the semantic feature extraction model in the two-tower model architecture shown in fig. 7. Specifically, in the left part of the model, feature extraction can be performed on the context text information through the BERT model to obtain the context semantic features corresponding to the context text information. Similarly, in the right part of the model, feature extraction may be performed on the combined content description information through the BERT model to obtain the content semantic features corresponding to the combined content description information.

In this application, the step of respectively performing feature extraction on the context text information and the combined content description information through a semantic feature extraction model to obtain a context semantic feature corresponding to the context text information and a content semantic feature corresponding to the combined content description information may include:

extracting the features of the context text information through a semantic feature extraction model to obtain context semantic features corresponding to the context text information;

and performing feature extraction on the combined content description information through a semantic feature extraction model to obtain content semantic features corresponding to the combined content description information.

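As an example, the step "performing feature extraction on the combined content description information through the semantic feature extraction model to obtain the content semantic features corresponding to the combined content description information" may be executed in the same manner as the step "performing feature extraction on the context text information through the semantic feature extraction model to obtain the context semantic features corresponding to the context text information" described below, and is not repeated in this application.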

In an embodiment, the semantic feature extraction model may be a BERT model, and specifically, the step "extracting features of the context text information through the semantic feature extraction model to obtain context semantic features corresponding to the context text information" may include:

performing information division on the context text information to obtain the divided context text information;

performing feature conversion on the divided context text information to obtain context text features corresponding to the divided context text information;

and performing feature extraction on the context text features based on an attention mechanism through a semantic feature extraction model to obtain context semantic features corresponding to the context text features.

Information division is the process of dividing information into sub-information of smaller granularity, and it can be implemented in various ways: for example, through character segmentation; as another example, through word segmentation; and so on.

As an example, in the present embodiment, the information division of the context text information may be realized by performing word segmentation on the context text information. Specifically, referring to fig. 7 and taking the text to be processed "lie X began practicing tennis in 1989" as an example, the target participle may be "lie X", and the target participle may be masked by [MASK] to obtain the masked context text information. Word segmentation can then be performed on the masked context text information so that it is split at word granularity, for example splitting "began practicing tennis in 1989" into individual words, to obtain the divided context text information.

In this case, the feature conversion may be a process of converting the context text information into a corresponding vector.

In an embodiment, a plurality of words in the context text obtained after the word segmentation processing can be respectively converted into corresponding word vectors through pre-trained word vectors, so as to implement feature conversion on the segmented context text information, and obtain context text features corresponding to the segmented context text information. Further, feature extraction can be performed on the context text features through the BERT model based on an attention mechanism, so that context semantic features corresponding to the context text features output by the BERT model are obtained.

Optionally, before the context text information is input into the BERT model, the special tokens "[CLS]" and "[SEP]" may be added in front of and behind the context text information respectively, in order to conform to the input specification of the BERT model, so that the context text information can be input into the BERT model and have its features extracted by the BERT model.
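The input preparation can be illustrated as below. Whitespace splitting is a simplification assumed for this sketch; an actual BERT pipeline would use the model's own tokenizer, and the function name `to_bert_input` is hypothetical.

```python
def to_bert_input(context_text):
    """Wrap the divided context text in the special tokens [CLS] and [SEP]."""
    tokens = context_text.split()  # simplification: split on whitespace
    return ["[CLS]"] + tokens + ["[SEP]"]


tokens = to_bert_input("[MASK] began practicing tennis in 1989")
```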

After the context semantic features and the content semantic features are obtained, the semantic similarity between the target participle and the candidate matching content can be calculated based on the context semantic features and the content semantic features.

In an embodiment, the semantic feature extraction model may be a BERT model, and accordingly, the obtained context semantic features and content semantic features are vectors, so the semantic similarity may be calculated by calculating the vector similarity. The vector similarity may be calculated in various ways, for example as a cosine similarity, a Euclidean distance, a Manhattan distance, or a Pearson correlation coefficient.

As an example, the semantic similarity $S$ may be calculated with reference to the following formula: $S = \cos(V_1, V_2)$, where $V_1$ represents the context semantic features and $V_2$ represents the content semantic features.

The above describes how features are extracted with the trained semantic feature extraction model. In the present application, the trained semantic feature extraction model may itself be obtained by model training. Specifically, the step "acquiring the trained semantic feature extraction model" may include:

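acquiring a sample data set, wherein the sample data set includes sample text, and the sample text includes sample participles to be matched and sample associated participles having an association relation with the sample participles on a semantic level;

determining a candidate matching content set of the sample participle, wherein the candidate matching content set includes at least one sample candidate matching content of the sample participle;

calculating the semantic matching degree of the sample participles and the sample candidate matching content;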

and performing model training on the semantic feature extraction model to be trained based on the semantic matching degree to obtain the trained semantic feature extraction model.

The sample data set is a set of sample data, and in the application, the sample data is a sample text, wherein the sample text may include sample participles to be matched and sample associated participles having an association relation with the sample participles on a semantic level.

Similarly, for the way of determining the candidate matching content set of the sample participle, reference may be made to the step "determining the candidate matching content set of the target participle"; for the way of calculating the semantic matching degree of the sample participle and the sample candidate matching content, reference may be made to the step "calculating the semantic matching degree of the target participle and the candidate matching content". Details are not repeated herein.

By way of example, the loss in the model training process may be calculated with reference to the following equation: $\mathrm{Loss} = \max(0, M - S_{-} + S_{+})$, where $M$ is a hyperparameter, $S_{-}$ denotes the negative example, and $S_{+}$ denotes the positive example.

The steps "calculating the semantic association degree of the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle" and "calculating the semantic similarity of the target participle and the candidate matching content based on the content description information of the candidate matching content" have been explained above. The step "calculating the semantic matching degree of the target participle and the candidate matching content based on the semantic association degree and the semantic similarity" is explained below.

Specifically, the step of calculating the semantic matching degree between the target participle and the candidate matching content based on the semantic association degree and the semantic similarity may include:

determining the prior importance of the candidate matching contents based on the content reference relation among the candidate matching contents;

performing fusion processing on the semantic association degree, the semantic similarity and the prior importance degree to obtain a fusion result;

and determining the semantic matching degree of the target participle and the candidate matching content based on the fusion result.

The prior importance is an importance obtained from past experience and analysis.

In an embodiment, the content may exist in the form of pages, and a content reference may be a link between pages, so the prior importance of the candidate matching content may be calculated based on the link relationship of the candidate matching content in the knowledge base. For example, based on the link relationship, the importance of the candidate matching content may be calculated by the PageRank algorithm to obtain a first prior importance.

As another example, the link probability of the candidate matching content may be calculated based on the link relationship of the candidate matching content in the encyclopedia, and the link probability used as the importance of the candidate matching content to obtain a second prior importance. For example, the candidate matching content may appear in the encyclopedia as the target of an anchor text: for the anchor text "X deluxe", if it appears 100 times in the encyclopedia and 90 of those occurrences link to "a deluxe (movie star)", the link probability of "a deluxe (movie star)" is 90/100 = 0.9.
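The link probability in this example is a simple relative frequency, which can be sketched as follows. The function name and the page labels are hypothetical; the counts reproduce the 90-out-of-100 example.

```python
from collections import Counter


def link_probabilities(link_targets):
    """Given the target page of every occurrence of one anchor text,
    return the fraction of occurrences that link to each target."""
    counts = Counter(link_targets)
    total = sum(counts.values())
    return {target: count / total for target, count in counts.items()}


# 100 occurrences of one anchor text: 90 link to one page, 10 to another.
occurrences = ["movie star page"] * 90 + ["other page"] * 10
probs = link_probabilities(occurrences)
```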

The fusion processing may be performed in various ways. For example, the semantic association degree, the semantic similarity and the prior importance can be fused through a linear model; as another example, they may be fused through a non-linear model.

In an embodiment, a linear model may be selected as a model required for fusion, and the linear model may be trained, and a trained model is obtained, so that the semantic association degree, the semantic similarity degree, and the prior importance degree may be fused by the trained model, and a fusion result is obtained.

As an example, the linear model may be trained using the semantic association degree, the semantic similarity, the first prior importance, and the second prior importance with reference to the following equation: $S = a_1 x + a_2 y + a_3 z + a_4 t$, where $x$, $y$, $z$ and $t$ are the independent variables corresponding to the semantic association degree, the semantic similarity, the first prior importance and the second prior importance respectively, and $a_1$, $a_2$, $a_3$, $a_4$ are the parameters to be trained. The loss in the model training process can likewise be calculated with reference to the following equation: $\mathrm{Loss} = \max(0, M - S_{-} + S_{+})$, where $M$ is a hyperparameter, $S_{-}$ denotes the negative example, and $S_{+}$ denotes the positive example.

After the semantic association degree, the semantic similarity and the prior importance have been fused through the trained linear model to obtain a fusion result, the semantic matching degree of the target participle and the candidate matching content can be determined based on the fusion result. For example, the calculated fusion result S may be used directly as the semantic matching degree of the target participle and the candidate matching content; as another example, the fusion result S may be further processed according to the service requirement to obtain S', and S' used as the semantic matching degree between the target participle and the candidate matching content.
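Applying a trained linear model is a weighted sum of the four inputs, S = a1·x + a2·y + a3·z + a4·t. The sketch below assumes the parameters have already been trained; the weights, candidate names and feature values are hypothetical.

```python
def fuse(features, weights):
    """Linear fusion S = a1*x + a2*y + a3*z + a4*t.

    features = (semantic association degree, semantic similarity,
                first prior importance, second prior importance)
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(a * f for a, f in zip(weights, features))


weights = [0.4, 0.4, 0.1, 0.1]  # hypothetical trained parameters a1..a4
candidates = {
    "candidate_1": [0.2, 0.3, 0.5, 0.1],
    "candidate_2": [0.9, 0.8, 0.4, 0.9],
}
fused = {name: fuse(features, weights) for name, features in candidates.items()}
```

The fused value of each candidate can then serve directly as its semantic matching degree, or be post-processed first as described above.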

104. Determine and output the target matching content of the target participle from the candidate matching content set based on the semantic matching degree.

The semantic matching degree obtained through calculation can represent the matching degree of the target participle and the candidate matching content on the semantic level, so that the candidate matching content can be ranked based on the semantic matching degree, and the candidate matching content meeting the requirement of the matching degree of the target participle on the semantic level can be selected as the target matching content based on the ranking result. Specifically, the step "determining and outputting the target matching content of the target participle from the candidate matching content set based on the semantic matching degree" may include:

sorting the candidate matching contents in the candidate matching content set based on the semantic matching degree;

and determining and outputting the target matching content of the target participle from the candidate matching content set based on the sorting result.

The sorting mode may be various, for example, the candidate matching contents in the candidate matching content set may be sorted by directly using the semantic matching degree as a sorting index; for another example, the semantic matching degree may be further processed based on the service requirement, and the processing result is used as a ranking index to rank the candidate matching content in the candidate matching content set; and so on.

The method for determining the target matching content based on the sorting result may be various, for example, the candidate matching content of which the sorting result meets the preset threshold condition may be selected as the target matching content; for another example, the candidate matching content with the preset proportion may be selected as the target matching content based on the sorting result; and so on.
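Both selection rules can be sketched on top of a simple sort. The scores, the threshold and the proportion below are hypothetical, and rounding the proportion up while keeping at least one candidate is an assumption made for this illustration.

```python
import math


def rank_candidates(match_scores):
    """Sort (candidate, score) pairs by semantic matching degree, highest first."""
    return sorted(match_scores.items(), key=lambda item: item[1], reverse=True)


def select_by_threshold(ranked, threshold):
    """Keep candidates whose score meets the preset threshold."""
    return [name for name, score in ranked if score >= threshold]


def select_top_proportion(ranked, proportion):
    """Keep the top share of candidates, rounding up and keeping at least one."""
    keep = max(1, math.ceil(len(ranked) * proportion))
    return [name for name, _ in ranked[:keep]]


match_scores = {
    "network protocol": 0.81,
    "user protocol": 0.42,
    "vocabulary entry": 0.15,
}
ranked = rank_candidates(match_scores)
by_threshold = select_by_threshold(ranked, 0.5)
by_proportion = select_top_proportion(ranked, 0.5)
```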

In an embodiment, the matching method based on text processing described in the present application may be applied to search, in which case the target matching content may be output based on the sorting result so that the search client can present the search result to the user based on the target matching content.

In another embodiment, the matching method based on text processing described in the present application may be applied to a question-answering system, in which case the target matching content may be output based on the sorting result so that the question-answering system client can generate a dialog based on the target matching content and perform question-answering interaction with the user.

In another embodiment, the matching method based on text processing described in the present application may be applied to a recommendation system, in which case the target matching content may be output based on the sorting result so that the recommendation system can generate recommendation information based on the target matching content and recommend it to the user.

As can be seen from the above, the embodiment may obtain a text to be processed, where the text to be processed includes a target participle to be matched and an associated participle having an association relationship with the target participle on a semantic level; determine a candidate matching content set of the target participle, where the candidate matching content set includes at least one candidate matching content of the target participle and each candidate matching content has corresponding content description information; calculate the semantic matching degree of the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle and the content description information of the candidate matching content; and determine and output the target matching content of the target participle from the candidate matching content set based on the semantic matching degree.

The method described in the above examples is further described in detail below by way of example.

In this embodiment, a description will be given by taking an example in which a matching device based on text processing is integrated in a server and a terminal, where the server may be a single server or a server cluster composed of a plurality of servers; the terminal can be a mobile phone, a tablet computer, a notebook computer and other equipment.

As shown in fig. 8, a matching method based on text processing includes the following specific processes:

201. The terminal sends a text to be processed to the server, where the text to be processed includes a target participle to be matched and an associated participle having an association relationship with the target participle on a semantic level.

In an embodiment, the matching method based on text processing described in the present application may be applied to search. The terminal may run a search client and obtain, through the search client, the text to be processed input by the user.

As an example, the to-be-processed text input by the user may be "what protocols are common in computers", where the to-be-processed text includes a target participle "protocol" to be matched and an associated participle "computer" having an association relation with the target participle on a semantic level.

After the terminal obtains the text to be processed input by the user through the search client, the text to be processed can be sent to the server.

202. The server acquires a text to be processed sent by the terminal.

203. The server determines a candidate matching content set of the target participle, wherein the candidate matching content set comprises at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information.

The server may determine a candidate matching content set for the target participle "protocol", where the candidate matching content set may include at least one candidate matching content of the target participle "protocol"; for example, the candidate matching contents may include "network protocol", "user protocol", "chinese vocabulary 'protocol'", and so on. Each candidate matching content has corresponding content description information, which may include, for example, content profile information and content attribute information.

204. The server calculates the semantic matching degree of the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle and the content description information of the candidate matching content.

The server can calculate the semantic matching degree between the target participle and each candidate matching content based on the association relationship between the target participle "protocol" and the associated participle "computer" and the content description information of each candidate matching content.

For example, the server may calculate a semantic association degree of the target participle and the candidate matching content based on an association relationship between the target participle and the associated participle, wherein the semantic association degree characterizes a degree of association between the target participle and the candidate matching content at a semantic level. In addition, the server may calculate semantic similarity between the target participle and the candidate matching content based on the content description information of the candidate matching content, where the semantic similarity characterizes a similarity level of the target participle and the candidate matching content at a semantic level. Furthermore, the server can calculate the semantic matching degree of the target participle and the candidate matching content based on the semantic association degree and the semantic similarity.

205. The server determines and outputs the target matching content of the target participle from the candidate matching content set based on the semantic matching degree.

For example, based on the semantic matching degree between the target participle "protocol" and each candidate matching content, the server may determine "network protocol" from the candidate matching content set as the target matching content and output it.

206. The terminal acquires the target matching content output by the server.

Correspondingly, the terminal can acquire the target matching content output by the server and generate a search result based on the target matching content, so that the search result can be presented to the user in the search client.

As can be seen from the above, in this embodiment of the present application the semantic matching degree between the target participle and the candidate matching content is calculated based on the association relationship between the target participle and its associated participle. When matching content for the target participle, the solution therefore considers not only the target participle itself but also the strong semantic correlation between the target participle and its associated participle in the text to be processed, which helps improve both matching efficiency and matching accuracy. In addition, when the associated participle of the target participle is itself a participle to be matched, that is, when the text to be processed contains a plurality of participles to be matched, content matching is performed using the semantic relevance among those participles. Compared with matching each participle to be matched independently and in sequence, the solution can calculate the semantic relevance among the plurality of participles to be matched at the same time, which further improves the content matching efficiency for them.

In order to better implement the method, correspondingly, the embodiment of the application also provides a matching device based on text processing, wherein the matching device based on text processing can be integrated in a server or a terminal. The server can be a single server or a server cluster consisting of a plurality of servers; the terminal can be a mobile phone, a tablet computer, a notebook computer and other equipment.

For example, as shown in fig. 9, the matching apparatus based on text processing may include an obtaining unit 301, a determining unit 302, a calculating unit 303, and an output unit 304, as follows:

an obtaining unit 301, configured to obtain a to-be-processed text, where the to-be-processed text includes a target participle to be matched and an associated participle having an association relationship with the target participle on a semantic level;

a determining unit 302, configured to determine a candidate matching content set of the target participle, where the candidate matching content set includes at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information;

a calculating unit 303, configured to calculate a semantic matching degree between the target participle and the candidate matching content based on an association relationship between the target participle and the associated participle and the content description information of the candidate matching content;

an output unit 304, configured to determine and output target matching content of the target participle from the candidate matching content set based on the semantic matching degree.
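The division into four units can be sketched as a small pipeline. This is a hedged illustration only: the class `MatchingApparatus`, its field names, the toy lambdas, and the choice to output the single highest-scoring candidate are assumptions for the sketch rather than the implementation described in the application.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MatchingApparatus:
    # Each callable is a hypothetical stand-in for one unit of the apparatus.
    obtain: Callable[[str], Tuple[str, List[str]]]     # obtaining unit 301: text -> (target, associated participles)
    determine: Callable[[str], List[str]]              # determining unit 302: target -> candidate set
    calculate: Callable[[str, List[str], str], float]  # calculating unit 303: -> semantic matching degree

    def match(self, text: str) -> str:
        target, associated = self.obtain(text)
        candidates = self.determine(target)
        scores = {c: self.calculate(target, associated, c) for c in candidates}
        # Output unit 304: assumed here to return the highest-scoring candidate.
        return max(scores, key=scores.get)


# Toy stand-ins with made-up scores so the sketch runs end to end.
apparatus = MatchingApparatus(
    obtain=lambda text: ("protocol", ["computer"]),
    determine=lambda target: ["network protocol", "user protocol"],
    calculate=lambda target, associated, cand: 0.9 if cand == "network protocol" else 0.4,
)
result = apparatus.match("computer protocol")
```

Passing the units in as callables mirrors the later remark that the units may be implemented as independent entities or combined arbitrarily.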

In an embodiment, referring to fig. 10, the calculating unit 303 may include:

a first calculating subunit 3031, configured to calculate a semantic association degree between the target participle and the candidate matching content based on an association relationship between the target participle and the associated participle, where the semantic association degree represents a degree of association between the target participle and the candidate matching content at a semantic level;

a second calculating subunit 3032, configured to calculate semantic similarities of the target participle and the candidate matching content based on the content description information of the candidate matching content, where the semantic similarities represent similarity levels of the target participle and the candidate matching content at a semantic level;

a third computing subunit 3033, configured to compute a semantic matching degree between the target participle and the candidate matching content based on the semantic association degree and the semantic similarity.

In an embodiment, the first calculating subunit 3031 may be configured to:

In an embodiment, the first calculating subunit 3031 may specifically be configured to:

In an embodiment, the second calculating subunit 3032 may specifically be configured to:

In an embodiment, the third computing subunit 3033 may be configured to:

In an embodiment, referring to fig. 11, the output unit 304 may include:

a sorting subunit 3041, configured to sort, based on the semantic matching degree, the candidate matching contents in the candidate matching content set;

an output subunit 3042, configured to determine and output the target matching content of the target participle from the candidate matching content set based on the sorting result.
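The two subunits above can be illustrated with a short sketch. The function names `rank_candidates` and `select_target`, the example scores, and the rule of picking the top-ranked candidate are assumptions made for illustration; the text only states that the output is determined based on the sorting result.

```python
from typing import Dict, List, Tuple


def rank_candidates(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sorting subunit: order candidates by descending semantic matching degree."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def select_target(scores: Dict[str, float]) -> str:
    """Output subunit: here assumed to pick the top-ranked candidate."""
    ranked = rank_candidates(scores)
    if not ranked:
        raise ValueError("candidate matching content set is empty")
    return ranked[0][0]


# Hypothetical semantic matching degrees for the target participle "protocol".
example_scores = {
    "user protocol": 0.45,
    "network protocol": 0.85,
    "Chinese vocabulary 'protocol'": 0.30,
}
target = select_target(example_scores)
```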

In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.

As can be seen from the above, in the matching apparatus based on text processing according to this embodiment, the obtaining unit 301 obtains a text to be processed, where the text to be processed includes a target participle to be matched and an associated participle having an association relationship with the target participle on a semantic level; the determining unit 302 determines a candidate matching content set of the target participle, where the candidate matching content set includes at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information; the calculating unit 303 calculates a semantic matching degree between the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle and on the content description information of the candidate matching content; and the output unit 304 determines and outputs the target matching content of the target participle from the candidate matching content set based on the semantic matching degree.

In addition, an embodiment of the present application further provides a computer device, which may be a server or a terminal. Fig. 12 shows a schematic structural diagram of the computer device according to the embodiment of the present application. Specifically:

the computer device may include components such as a memory 401 including one or more computer-readable storage media, a processor 402 including one or more processing cores, and a power supply 403. Those skilled in the art will appreciate that the configuration illustrated in FIG. 12 does not limit the computer device, which may include more or fewer components than those illustrated, combine some components, or arrange the components differently. Wherein:

the memory 401 may be used to store software programs and modules, and the processor 402 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 401. The memory 401 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the computer device, and the like. Further, the memory 401 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 401 may further comprise a memory controller to provide the processor 402 with access to the memory 401.

The processor 402 is the control center of the computer device. It connects the various parts of the entire computer device by using various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 401 and calling data stored in the memory 401, thereby monitoring the computer device as a whole. Optionally, the processor 402 may include one or more processing cores; preferably, the processor 402 may integrate an application processor, which primarily handles the operating system, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may also not be integrated into the processor 402.

The computer device also includes a power supply 403 (e.g., a battery) for powering the various components. Preferably, the power supply may be logically coupled to the processor 402 via a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 403 may also include one or more DC or AC power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and any other such components.

Although not shown, the computer device may further include a camera, a Bluetooth module, etc., which will not be described herein. Specifically, in this embodiment, the processor 402 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 401 according to the following instructions, and the processor 402 runs the application programs stored in the memory 401, so as to implement the following functions:

acquiring a text to be processed, wherein the text to be processed comprises a target participle to be matched and an associated participle which has an association relationship with the target participle on a semantic level; determining a candidate matching content set of the target participle, wherein the candidate matching content set comprises at least one candidate matching content of the target participle, and each candidate matching content has corresponding content description information; calculating the semantic matching degree between the target participle and the candidate matching content based on the association relationship between the target participle and the associated participle and on the content description information of the candidate matching content; and determining and outputting the target matching content of the target participle from the candidate matching content set based on the semantic matching degree.

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

As can be seen from the above, the computer device of this embodiment may calculate the semantic matching degree between the target participle and the candidate matching content based on the association relationship between the target participle and its associated participle. When matching content for the target participle, the computer device therefore considers not only the target participle itself but also the strong semantic correlation between the target participle and its associated participle in the text to be processed, which helps improve both matching efficiency and matching accuracy. In addition, when the associated participle of the target participle is itself a participle to be matched, that is, when the text to be processed contains a plurality of participles to be matched, the computer device performs content matching using the semantic relevance among those participles. Compared with matching each participle to be matched independently and in sequence, the computer device can calculate the semantic relevance among the plurality of participles to be matched at the same time, which further improves the content matching efficiency for them.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.

To this end, embodiments of the present application provide a storage medium having stored therein a plurality of instructions, which can be loaded by a processor to perform the steps in any of the matching methods based on text processing provided by embodiments of the present application. For example, the instructions may perform the steps of:

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

Wherein the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

Since the instructions stored in the storage medium can execute the steps in any matching method based on text processing provided in the embodiments of the present application, the beneficial effects that can be achieved by any matching method based on text processing provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.

According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various alternative implementations of the text-processing-based matching described above.

The matching method and device based on text processing, the computer equipment, and the storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is intended only to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.
