Training method, generation method, device and equipment of house source title generation model

Document No.: 661721  Publication date: 2021-04-27

Note: This technique, "Training method, generation method, device and equipment of house source title generation model" (房源标题生成模型的训练方法、生成方法、装置以及设备), was created by Fu Fazuo, Sun Yuzhao, Song Xin, and Cai Baiyin on 2021-03-29. Abstract: The present disclosure provides a training method for a house source title generation model, a house source title generation method, a device, an electronic device, and a storage medium, relating to the technical field of artificial intelligence. The method comprises: generating a training sample according to the house source feature word vector, the user preference code, and the house source feature word labeling information; training a preset house source title generation model; obtaining the feature word selection loss corresponding to the training sample by using a preset loss function; and adjusting the parameters of the house source title generation model according to the feature word selection loss. The trained house source title generation model is then used to obtain house source feature word selection labels and generate a house source title. The method, device, electronic device, and storage medium can automatically generate house source titles corresponding to user preferences, saving labor costs, giving the titles personalized characteristics and creative selling points, and solving the problems of highly repetitive, insufficiently personalized house source titles.

1. A training method of a house source title generation model comprises the following steps:

acquiring a house source characteristic word vector and a user preference code corresponding to a house source;

generating a training sample according to the house source characteristic word vector, the user preference code and the house source characteristic word labeling information;

training a preset house source title generation model by using the training sample, and calculating feature word selection loss corresponding to the training sample by using a preset loss function;

and adjusting parameters of the house source title generation model according to the feature word selection loss until the feature word selection loss is lower than a preset threshold value, and obtaining the trained house source title generation model.

2. The method of claim 1, wherein the house source title generation model comprises: a feature extraction layer and a full connection layer; the training of the preset house source title generation model by using the training sample, and the calculation of the feature word selection loss corresponding to the training sample by using the preset loss function comprise:

inputting the house source feature word vector into the feature extraction layer so that the feature extraction layer performs context feature extraction based on semantic relations;

inputting the feature vector sequence output by the feature extraction layer and the corresponding user preference code into the full connection layer, performing a Concat (concatenation) operation, and outputting house source feature word selection labels corresponding to the user preference code;

calculating cross-entropy information according to the house source feature word selection label, the house source feature word labeling information, and a preset loss function, and taking the cross-entropy information as the feature word selection loss; the cross-entropy information is used for measuring the difference between the house source feature word selection label and the house source feature word labeling information.

3. The method of claim 2, wherein the house source title generation model comprises: an Attention layer and a Dropout layer; the method further comprises the following steps:

inputting the feature vector sequence output by the feature extraction layer into the Attention layer, and assigning corresponding weights to the vectors in the feature vector sequence through an Attention mechanism;

passing the output of the Attention layer into the Dropout layer, wherein the Dropout layer is used to prevent model overfitting;

passing the output of the Dropout layer into the full connection layer.

4. The method of claim 1, wherein the obtaining of the house source feature word vector corresponding to the house source comprises:

acquiring house source description information corresponding to the house source;

filtering the house source description information based on a preset text length threshold;

carrying out symbol standardization processing on the filtered house source description information, and replacing numbering-type numbers in the house source description information to generate an original corpus;

generating the house source corpus based on the original corpus;

and generating the house source characteristic word vector corresponding to the house source corpus.

5. The method of claim 4, wherein the generating the house source corpus based on the original corpus comprises:

obtaining independent sentences corresponding to the original corpus based on a preset punctuation mark segmentation rule, and carrying out segmentation processing on the independent sentences to obtain a corresponding short sentence list;

splicing the short sentences in the short sentence list to obtain the house source corpus corresponding to the short sentence list;

and filtering the house source corpus based on a preset connection word filtering rule to obtain an effective house source corpus.

6. A training device for a house source title generation model comprises:

the characteristic acquisition module is used for acquiring house source characteristic word vectors and user preference codes corresponding to house sources;

the sample construction module is used for generating a training sample according to the house source characteristic word vector, the user preference code and the house source characteristic word labeling information;

the model training module is used for training a preset house source title generation model by using the training sample;

the loss determining module is used for calculating the feature word selection loss corresponding to the training sample by using a preset loss function;

and the parameter adjusting module is used for adjusting the parameters of the house source title generation model according to the feature word selection loss until the feature word selection loss is lower than a preset threshold value, so as to obtain the trained house source title generation model.

7. A house source title obtaining method comprises the following steps:

acquiring a house source characteristic word vector and a user preference code corresponding to a house source;

acquiring a house source feature word selection label based on the house source feature word vector and the user preference code by using the trained house source title generation model;

generating a house source title based on the house source feature word selection label and the house source feature words;

wherein, the house source title generation model is obtained by training through the training method of any one of claims 1 to 5.

8. A house source title acquisition apparatus, comprising:

the information acquisition module is used for acquiring the house source characteristic word vector and the user preference code corresponding to the house source;

the model application module is used for acquiring a house source feature word selection label based on the house source feature word vector and the user preference code by using the trained house source title generation model;

the title generation module is used for generating a house source title based on the house source feature word selection label and the house source feature words;

wherein, the house source title generation model is obtained by training through the training method of any one of claims 1 to 5.

9. A computer-readable storage medium, the storage medium storing a computer program for performing the method of any of the preceding claims 1-5 and/or the method of claim 7.

10. An electronic device, the electronic device comprising:

a processor; a memory for storing the processor-executable instructions;

the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of the preceding claims 1-5 and/or the method of claim 7.

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a training method for a house source title generation model, a house source title generation method, a house source title generation device, an electronic device, and a storage medium.

Background

As the real estate industry has developed, more and more house sources have converged onto platforms, so a description that truly highlights a house source has become key to guiding more users to click on it; among such descriptions, the house source title is a very important factor influencing user clicks. The house source title usually summarizes the advantages of the house source and gives the user a short, intuitive information summary; the user can decide whether to view the detailed house source information according to the title, which saves the user's time and improves the user experience. However, house source titles on current platforms are usually filled in manually by brokers, which incurs high labor costs, and the titles are not generated in a personalized way combined with user preferences, so they lack pertinence to the user and degrade the user's browsing experience on the platform.

Disclosure of Invention

The present disclosure is proposed to solve the above technical problems. The embodiment of the disclosure provides a training method of a house source title generation model, a house source title generation method and device, an electronic device and a storage medium.

According to a first aspect of the embodiments of the present disclosure, there is provided a training method for a house source title generation model, including: acquiring a house source characteristic word vector and a user preference code corresponding to a house source; generating a training sample according to the house source characteristic word vector, the user preference code and the house source characteristic word labeling information; training a preset house source title generation model by using the training sample, and calculating feature word selection loss corresponding to the training sample by using a preset loss function; and adjusting parameters of the house source title generation model according to the feature word selection loss until the feature word selection loss is lower than a preset threshold value, and obtaining the trained house source title generation model.
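The adjust-parameters-until-the-loss-falls-below-a-threshold loop described above can be sketched as follows. This is a minimal illustration only: a toy one-parameter logistic model trained by gradient descent stands in for the full house source title generation model, and the data, learning rate, and threshold are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_until_threshold(samples, loss_threshold=0.05, lr=0.5, max_steps=10000):
    """Adjust a toy model's parameters by gradient descent until the
    (feature word) selection loss drops below a preset threshold."""
    w, b = 0.0, 0.0  # model parameters
    loss = float("inf")
    for _ in range(max_steps):
        loss, gw, gb = 0.0, 0.0, 0.0
        for x, label in samples:  # label: 1 = feature word selected, 0 = not
            p = sigmoid(w * x + b)
            loss += -(label * math.log(p + 1e-12) + (1 - label) * math.log(1 - p + 1e-12))
            gw += (p - label) * x
            gb += (p - label)
        loss /= len(samples)
        if loss < loss_threshold:   # stop once loss is below the preset threshold
            break
        w -= lr * gw / len(samples)  # parameter adjustment step
        b -= lr * gb / len(samples)
    return w, b, loss

# Separable toy data: the trained loss falls below the threshold
samples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, final_loss = train_until_threshold(samples)
```

In the actual model the gradient step would update all layer weights rather than a single scalar, but the stopping criterion is the same.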

Optionally, the house source title generation model includes: a feature extraction layer and a full connection layer; the training of the preset house source title generation model by using the training sample, and the calculation of the feature word selection loss corresponding to the training sample by using the preset loss function include: inputting the house source feature word vector into the feature extraction layer so that the feature extraction layer performs context feature extraction based on semantic relations; inputting the feature vector sequence output by the feature extraction layer and the corresponding user preference code into the full connection layer, performing a Concat (concatenation) operation, and outputting house source feature word selection labels corresponding to the user preference code; calculating cross-entropy information according to the house source feature word selection label, the house source feature word labeling information, and a preset loss function, and taking the cross-entropy information as the feature word selection loss; the cross-entropy information is used for measuring the difference between the house source feature word selection label and the house source feature word labeling information.
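A minimal NumPy sketch of the Concat-and-select step described above, under stated assumptions: the dimensions are illustrative, and randomly initialized weights stand in for trained ones. The user preference code is appended to each feature vector, a full connection layer produces one logit per feature word, and thresholded sigmoid probabilities give the selection labels.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_feat, d_pref = 5, 8, 4          # sequence length, feature dim, preference-code dim (illustrative)
H = rng.normal(size=(T, d_feat))     # feature vector sequence from the extraction layer
pref = rng.normal(size=(d_pref,))    # user preference code

# Concat operation: append the preference code to each timestep's feature vector
H_cat = np.concatenate([H, np.tile(pref, (T, 1))], axis=1)   # shape (T, d_feat + d_pref)

# Full connection layer producing one logit per feature word
W = rng.normal(size=(d_feat + d_pref, 1))
b = np.zeros(1)
logits = H_cat @ W + b                # shape (T, 1)
probs = 1.0 / (1.0 + np.exp(-logits))

# Selection label: 1 if the feature word should appear in the title, else 0
selection_labels = (probs.ravel() > 0.5).astype(int)
```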

Optionally, the loss function comprises: a sigmoid cross-entropy loss function.
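For reference, the sigmoid cross-entropy named here can be written out in its standard numerically stable form in plain Python; this is a textbook formulation, not code from the disclosure.

```python
import math

def sigmoid_cross_entropy(logit, label):
    """Numerically stable sigmoid cross-entropy for one feature word:
    max(x, 0) - x*z + log(1 + exp(-|x|)), with x = logit and z = label in {0, 1}."""
    return max(logit, 0) - logit * label + math.log(1.0 + math.exp(-abs(logit)))

# A logit of 0 is maximally uncertain: the loss is log(2) for either label value.
```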

Optionally, the feature extraction layer comprises: a feature extraction layer constructed based on a BiLSTM network model.

Optionally, the house source title generation model includes: an Attention layer and a Dropout layer; the method further includes: inputting the feature vector sequence output by the feature extraction layer into the Attention layer, and assigning corresponding weights to the vectors in the feature vector sequence through an Attention mechanism; passing the output of the Attention layer into the Dropout layer, wherein the Dropout layer is used to prevent model overfitting; and passing the output of the Dropout layer into the full connection layer.
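The Attention weighting and Dropout steps can be sketched as follows. The single learned score vector and the keep probability are illustrative assumptions; the disclosure does not specify the attention parameterization, so this shows one common choice.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 8
H = rng.normal(size=(T, d))          # feature vector sequence from the BiLSTM layer

# Attention: score each vector, softmax the scores into weights, rescale the sequence
w_att = rng.normal(size=(d,))        # illustrative learned scoring vector
scores = H @ w_att                   # shape (T,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax: the weights sum to 1
H_weighted = H * weights[:, None]    # each vector scaled by its attention weight

# Dropout (training mode): randomly zero activations to reduce overfitting
keep_prob = 0.8
mask = rng.random(H_weighted.shape) < keep_prob
H_dropped = np.where(mask, H_weighted / keep_prob, 0.0)  # inverted dropout scaling
```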

Optionally, the obtaining of the house source feature word vector corresponding to the house source includes: acquiring house source description information corresponding to the house source; filtering the house source description information based on a preset text length threshold; carrying out symbol standardization processing on the filtered house source description information, and replacing numbering-type numbers in the house source description information to generate an original corpus; generating the house source corpus based on the original corpus; and generating the house source feature word vector corresponding to the house source corpus.
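One possible reading of the cleaning pipeline above, sketched in Python. The length thresholds, the full-width-to-half-width punctuation map, and the `<NUM>` placeholder are assumptions for illustration; the disclosure does not fix these specifics.

```python
import re

# Full-width (Chinese) punctuation normalized to half-width (illustrative subset)
FULLWIDTH = {"，": ",", "。": ".", "！": "!", "？": "?", "；": ";", "：": ":"}

def clean_description(text, min_len=10, max_len=500):
    """Sketch of the cleaning pipeline: length filtering, symbol
    standardization, and replacement of numbering-type numbers."""
    # 1. Filter by the preset text length threshold
    if not (min_len <= len(text) <= max_len):
        return None
    # 2. Symbol standardization: full-width punctuation to half-width
    for fw, hw in FULLWIDTH.items():
        text = text.replace(fw, hw)
    # 3. Replace numbering-type numbers (e.g. "1." or "(2)") with a marker
    text = re.sub(r"(?:^|\s)\(?\d+[.)]\s*", " <NUM> ", text)
    return text.strip()
```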

Optionally, the generating of the house source corpus based on the original corpus comprises: obtaining independent sentences corresponding to the original corpus based on a preset punctuation mark segmentation rule, and carrying out segmentation processing on the independent sentences to obtain a corresponding short sentence list; splicing the short sentences in the short sentence list to obtain the house source corpus corresponding to the short sentence list; and filtering the house source corpus based on a preset connection word filtering rule to obtain an effective house source corpus.
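The sentence segmentation, phrase splicing, and connection-word filtering steps might look like the following sketch. The punctuation rules and the connective list are illustrative assumptions (the disclosure does not specify them), and English text stands in for the Chinese house source corpus.

```python
import re

CONNECTIVES = {"and", "but", "also", "moreover"}  # illustrative connection-word list

def build_corpus(raw):
    """Sketch: split into independent sentences, segment each sentence into
    short phrases, filter leading connection words, and splice phrases back."""
    sentences = [s for s in re.split(r"[.!?]", raw) if s.strip()]
    corpus = []
    for sent in sentences:
        phrases = [p.strip() for p in re.split(r"[,;:]", sent) if p.strip()]
        cleaned = []
        for p in phrases:
            words = p.split()
            if words and words[0].lower() in CONNECTIVES:  # drop leading connective
                words = words[1:]
            if words:
                cleaned.append(" ".join(words))
        if cleaned:
            corpus.append(", ".join(cleaned))   # splice the short phrases together
    return corpus

corpus = build_corpus("South facing, and bright rooms. Near subway; but low taxes.")
```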

Optionally, the generating of the house source feature word vector corresponding to the house source corpus includes: performing word segmentation processing on the house source corpus to obtain house source feature word data; and training on the house source feature word data by using a word2vec model to obtain the house source feature word vectors.

Optionally, the obtaining of the user preference code corresponding to the house source includes: acquiring user preference label information corresponding to the house source; and encoding the user preference label information to obtain the user preference code.
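One-hot encoding of the user preference labels can be sketched as a multi-hot vector over a fixed preference vocabulary; the vocabulary below is hypothetical.

```python
# Illustrative preference vocabulary; the real label set is platform-defined
PREFERENCE_VOCAB = ["low_price", "school_district", "high_floor", "near_subway"]

def encode_preferences(labels):
    """One-hot (multi-hot) encoding of a user's preference label set:
    1 at each position whose vocabulary tag the user prefers, else 0."""
    return [1 if tag in labels else 0 for tag in PREFERENCE_VOCAB]

code = encode_preferences({"low_price", "near_subway"})
```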

Optionally, the house source description information includes: owner self-recommendation information and broker house assessment information. The user preference encoding includes: one-hot encoding.

According to a second aspect of the embodiments of the present disclosure, there is provided a training apparatus for a house source title generation model, including: a characteristic acquisition module, used for acquiring house source characteristic word vectors and user preference codes corresponding to house sources; a sample construction module, used for generating a training sample according to the house source characteristic word vector, the user preference code, and the house source characteristic word labeling information; a model training module, used for training a preset house source title generation model by using the training sample; a loss determining module, used for calculating the feature word selection loss corresponding to the training sample by using a preset loss function; and a parameter adjusting module, used for adjusting the parameters of the house source title generation model according to the feature word selection loss until the feature word selection loss is lower than a preset threshold value, so as to obtain the trained house source title generation model.

Optionally, the house source title generation model includes: a feature extraction layer and a full connection layer; the model training module is used for inputting the house source feature word vectors into the feature extraction layer so that the feature extraction layer performs context feature extraction based on semantic relations, inputting the feature vector sequence output by the feature extraction layer and the corresponding user preference code into the full connection layer, performing a Concat (concatenation) operation, and outputting house source feature word selection labels corresponding to the user preference code; the loss determining module is used for calculating cross-entropy information according to the house source feature word selection label, the house source feature word labeling information, and a preset loss function, and taking the cross-entropy information as the feature word selection loss; the cross-entropy information is used for measuring the difference between the house source feature word selection label and the house source feature word labeling information.

Optionally, the loss function comprises: a sigmoid cross-entropy loss function.

Optionally, the feature extraction layer comprises: a feature extraction layer constructed based on a BiLSTM network model.

Optionally, the house source title generation model includes: an Attention layer and a Dropout layer; the model training module is further configured to input the feature vector sequence output by the feature extraction layer into the Attention layer, and assign corresponding weights to the vectors in the feature vector sequence through an Attention mechanism; pass the output of the Attention layer into the Dropout layer, wherein the Dropout layer is used to prevent model overfitting; and pass the output of the Dropout layer into the full connection layer.

Optionally, the feature obtaining module includes: an information acquisition unit, used for acquiring the house source description information corresponding to the house source; and a corpus acquiring unit, which includes: a cleaning unit, used for filtering the house source description information based on a preset text length threshold, carrying out symbol standardization processing on the filtered house source description information, and replacing numbering-type numbers in the house source description information to generate an original corpus; a generating unit, used for generating the house source corpus based on the original corpus; and a vector generating unit, used for generating the house source feature word vector corresponding to the house source corpus.

optionally, the generating unit is specifically configured to obtain an independent sentence corresponding to the original corpus based on a preset punctuation mark segmentation rule, and perform segmentation processing on the independent sentence to obtain a corresponding short sentence list; splicing the short sentences in the short sentence list to obtain the house source linguistic data corresponding to the short sentence list; and filtering the house source linguistic data based on a preset connection word filtering rule to obtain effective house source linguistic data.

Optionally, the vector generating unit is configured to perform word segmentation processing on the house source corpus to obtain house source feature word data, and to train on the house source feature word data by using a word2vec model to obtain the house source feature word vectors.

Optionally, the feature obtaining module includes: the code generating unit is used for acquiring user preference label information corresponding to the house source; and coding the user preference label information to obtain the user preference code.

According to a third aspect of the embodiments of the present disclosure, there is provided a house source title obtaining method, including: acquiring a house source feature word vector and a user preference code corresponding to a house source; acquiring a house source feature word selection label based on the house source feature word vector and the user preference code by using the trained house source title generation model; and generating a house source title based on the house source feature word selection label and the house source feature words; wherein the house source title generation model is obtained through the training method described above.

According to a fourth aspect of the embodiments of the present disclosure, there is provided a house source title obtaining apparatus, including: an information acquisition module, used for acquiring the house source feature word vector and the user preference code corresponding to the house source; a model application module, used for acquiring a house source feature word selection label based on the house source feature word vector and the user preference code by using the trained house source title generation model; and a title generation module, used for generating a house source title based on the house source feature word selection label and the house source feature words; wherein the house source title generation model is obtained through the training method described above.

According to a fifth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-mentioned method.

According to a sixth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is used for executing the method.

Based on the training method for a house source title generation model, the house source title generation method and device, the electronic device, and the storage medium provided by the embodiments of the present disclosure, house source titles corresponding to user preferences can be generated automatically, which saves labor costs, gives the titles personalized characteristics and creative selling points, and solves the problems of high repetitiveness and insufficient personalization of house source titles; house sources can be displayed to the corresponding users in a personalized manner according to the characteristics of different users, so that the generated house source titles are more attractive and the user experience is improved.

The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.

FIG. 1 is a flow chart of one embodiment of a method for training a house source title generation model according to the present disclosure;

fig. 2 is a flowchart of obtaining house source feature word vectors in an embodiment of the training method for a house source title generation model according to the present disclosure;

FIG. 3 is a flow chart of a data cleansing process performed in an embodiment of the training method for the house source title generation model according to the present disclosure;

FIG. 4 is a flow chart of generating the house source corpus in an embodiment of the training method for the house source title generation model of the present disclosure;

FIG. 5 is a flow chart of the generation of house source feature word vectors in an embodiment of the training method of the house source title generation model of the present disclosure;

FIG. 6 is a schematic diagram of the present disclosure for performing phrase segmentation;

FIG. 7 is a flowchart of model training in an embodiment of the training method for a house source title generation model according to the present disclosure;

FIG. 8 is a schematic diagram of a house source title generation model and training according to the present disclosure;

FIG. 9 is a flow chart of one embodiment of a house source title acquisition method of the present disclosure;

FIG. 10 is a schematic structural diagram of an embodiment of a training apparatus for a house source title generation model according to the present disclosure;

FIG. 11 is a schematic structural diagram of a feature obtaining module in an embodiment of the training apparatus for a house source title generation model according to the present disclosure;

fig. 12 is a schematic structural diagram of a corpus acquiring unit in an embodiment of the training apparatus for a house source title generation model according to the present disclosure;

fig. 13 is a schematic structural diagram of an embodiment of a house source title obtaining apparatus according to the present disclosure;

FIG. 14 is a block diagram of one embodiment of an electronic device of the present disclosure.

Detailed Description

Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.

It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.

It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning, nor is the necessary logical order between them.

It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more than two and "at least one" may refer to one, two or more than two.

It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.

In addition, the term "and/or" in the present disclosure is only one kind of association relationship describing the associated object, and means that there may be three kinds of relationships, such as a and/or B, and may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.

It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.

Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

Embodiments of the present disclosure may be implemented in electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with an electronic device, such as a terminal device, computer system, or server, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.

Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

Summary of the application

In the process of implementing the present disclosure, the inventors found that house source titles on current platforms are usually filled in manually by brokers, which incurs high labor costs, and that the titles are not generated in a personalized way combined with user preferences, so they lack pertinence to the user; in addition, a large amount of house source description data, such as owner self-recommendations and broker house assessments, has accumulated on the platform, but these corpora are not well utilized at present and are not combined with user preferences to improve the browsing experience of users on the platform.

The training method for a house source title generation model and the house source title generation method provided by the present disclosure generate a training sample according to the house source feature word vector, the user preference code, and the house source feature word labeling information; train a preset house source title generation model; obtain the feature word selection loss corresponding to the training sample by using a preset loss function; and adjust the parameters of the house source title generation model according to the feature word selection loss. The trained house source title generation model is then used to obtain house source feature word selection labels and generate a house source title. In this way, house source titles corresponding to user preferences can be generated automatically, so that the titles have personalized characteristics and creative selling points, solving the problems of high repetitiveness and insufficient personalization of house source titles.

Exemplary method

Fig. 1 is a flowchart of an embodiment of a training method for a house source title generation model according to the present disclosure, where the method shown in fig. 1 includes the steps of: S101-S104. The following describes each step.

S101, acquiring a house source characteristic word vector and a user preference code corresponding to the house source.

In one embodiment, the house source may be any of various types of premises, such as a commercial building, a cottage, or commodity housing. House source feature words include words describing the floor, orientation, decoration, layout, price, taxes, building age, school district, and so on. Word embedding maps phrases, feature words, and the like to real-valued vectors, and various methods can be used to map house source feature words to house source feature word vectors. User preferences include preferences for price, floor, school district, and the like, and the user preference code can be generated by various methods.

And S102, generating a training sample according to the house source characteristic word vector, the user preference code and the house source characteristic word marking information.

In one embodiment, house source feature word labeling information is preset by the user and indicates whether each house source feature word is selected. For example, if the feature words include "high floor", "low price", and "low tax", and the user's preferences include a price preference, the corresponding labeling information might be: high floor = 0, low price = 1, low tax = 1, where "0" means not selected and "1" means selected. Various existing methods can be used to generate training samples from the house source feature word vectors, the user preference codes, and the corresponding labeling information, and to train the house source title generation model.
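As a minimal sketch of this step (all names and the toy vectors are illustrative, not from the disclosure), a training sample can be assembled by pairing each feature word vector with its 0/1 selection label and attaching the user preference code:

```python
# Minimal sketch of assembling a training sample from feature word vectors,
# a user preference code, and 0/1 selection labels. Names and values are
# hypothetical illustrations, not the disclosure's actual data format.

def build_sample(feature_word_vectors, preference_code, labels):
    """Bundle model inputs and per-word selection labels into one sample."""
    assert len(feature_word_vectors) == len(labels)
    return {
        "word_vectors": feature_word_vectors,  # model input sequence
        "preference_code": preference_code,    # e.g. one-hot user preference
        "labels": labels,                      # 1 = select the word, 0 = skip
    }

# "high floor" is not selected; "low price" and "low tax" are selected
sample = build_sample(
    feature_word_vectors=[[0.1, 0.3], [0.7, 0.2], [0.4, 0.9]],
    preference_code=[0, 1, 0],  # e.g. a price preference
    labels=[0, 1, 1],
)
```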

S103, training a preset house source title generation model by using a training sample, and calculating by using a preset loss function to obtain a feature word selection loss corresponding to the training sample. The house source title generation model can be various, such as a neural network model and the like.

And S104, adjusting parameters of the house source title generation model according to the feature word selection loss until the feature word selection loss is lower than a preset threshold value, and obtaining the trained house source title generation model.

In one embodiment, the parameters of the house source title generation model are optimized during the training stage to achieve the training goal. The model is trained based on the feature word selection loss and adjusted by existing methods such as iterative training until the feature word selection loss falls below a preset threshold.

Fig. 2 is a flowchart of obtaining a house source feature word vector in an embodiment of the training method for a house source title generation model of the present disclosure, where the method shown in fig. 2 includes the steps of: S201-S203. The following describes each step.

S201, acquiring house source description information corresponding to the house source. The house source description information comprises owner self-recommendation information, broker house evaluation information and the like.

S202, preprocessing the house source description information to obtain house source corpora.

And S203, generating house source characteristic word vectors corresponding to the house source linguistic data.

The preprocessing includes data cleansing and similar operations. During data cleansing, the house source description information is filtered against a preset text length threshold: for example, with a threshold of 50 characters, any description shorter than 50 characters is filtered out. The filtered description then undergoes symbol normalization, and numeric enumeration markers are replaced to produce the original corpus, from which the house source corpus is generated. Symbol normalization can take several forms, for example merging runs of consecutive commas, periods, or question marks into a single comma, period, or question mark. Numeric enumeration markers are text of the form "1.", "[1]", or "1)", all of which are uniformly replaced with a blank character.

In one embodiment, owner self-recommendations, broker house evaluations, and similar texts introducing house sources are collected into a corpus, and each text in the corpus is cleaned. Cleaning includes text length filtering, punctuation normalization, numeric enumeration filtering, and the like. Text length filtering: based on the length distribution of the texts in the corpus, a threshold of 50 characters is set, and texts shorter than 50 characters are filtered out. Punctuation normalization: English punctuation in the text is uniformly replaced with Chinese punctuation, and any run of consecutive commas, periods, or question marks is merged into a single symbol. Numeric enumeration filtering: enumeration markers of the form "1.", "[1]", or "1)" are uniformly replaced with blank characters.
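The cleaning rules above can be sketched as follows; the 50-character threshold follows the description, but the exact regular expressions and the `clean` function name are assumptions for illustration:

```python
import re

# Sketch of the data cleansing rules described above: length filtering,
# merging repeated punctuation, and removing numeric enumeration markers.
# The regexes are illustrative assumptions, not the disclosure's patterns.

def clean(text, min_length=50):
    if len(text) < min_length:  # text length filtering
        return None
    # merge runs of identical commas/periods/question marks into one symbol
    text = re.sub(r"([,.?])\1+", r"\1", text)
    text = re.sub(r"([，。？])\1+", r"\1", text)
    # replace enumeration markers like "1." / "[1]" / "1)" with a blank
    text = re.sub(r"\[?\d+[.)\]]", " ", text)
    return text.strip()
```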

For example, the cleaned texts are used as the original corpus, and an original corpus is established as shown in Table 1 below:

Serial number 1: The house has been decorated for less than 2 years and is move-in ready. It is on the first floor with a good garden view from the window, home-swap clients are accepted, and the house is already a commodity house.

Serial number 2: My house is on a middle floor with good lighting and no obstruction. It is a north-south-facing three-bedroom unit; the master bedroom and the living room face south, and the two secondary bedrooms face north. The house is finished and has always been kept clean and tidy, sparing the trouble and fatigue of decorating. The community's property management is run by China Railway Construction, and the community is safe and sanitary.

TABLE 1 Original corpus table

Fig. 3 is a flowchart of a data cleansing process performed in an embodiment of the training method for a house source title generation model according to the present disclosure, where the method shown in fig. 3 includes the steps of: S301-S303. The following describes each step.

S301, obtaining independent sentences corresponding to the original corpus based on a preset punctuation mark segmentation rule, and performing segmentation processing on the independent sentences to obtain corresponding short sentence lists.

In one embodiment, the punctuation segmentation rules can take several forms, for example segmenting the original corpus into independent sentences at periods, exclamation marks, question marks, and so on, then segmenting each independent sentence into short phrases at commas to generate the short sentence list.

S302, splicing the short sentences in the short sentence list to obtain house source corpora corresponding to the short sentence list. A variety of splicing processes may be used.

S303, filtering the house source corpus based on a preset connecting word filtering rule to obtain the effective house source corpus.

In one embodiment, the original corpus is split into a list of semantically independent sentences at periods, exclamation marks, question marks, and similar symbols; each sentence is then split into a list of short phrases at commas. Each phrase list is traversed, and each phrase is spliced in sequence with the phrases that follow it to generate the house source corpora. The corpora thus take a 2-gram form at phrase granularity, and a binary classification model, trained on a manually labeled subset of phrases, judges whether two phrases are semantically similar enough to be fused. Fusing phrases in 2-gram form keeps the generated corpora neither too long nor too short, making them suitable for various business scenarios; it also avoids the excessive similarity that overly short single phrases would exhibit, and introduces extra context information to assist the computation.
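The segmentation and splicing steps can be sketched as follows. For brevity this sketch only pairs each phrase with the phrase immediately after it; the function name and the example text are illustrative assumptions:

```python
import re

# Sketch of the 2-gram phrase fusion described above: split the corpus into
# sentences, split each sentence into short phrases at commas, then splice
# adjacent phrases. A hypothetical simplification of the disclosure's rule.

def two_gram_corpora(text):
    sentences = [s for s in re.split(r"[.!?。！？]", text) if s.strip()]
    corpora = []
    for sentence in sentences:
        phrases = [p.strip() for p in re.split(r"[,，]", sentence) if p.strip()]
        # splice each phrase with the one that follows it (2-gram granularity),
        # so each corpus keeps context a single short phrase would lack
        for i in range(len(phrases) - 1):
            corpora.append(phrases[i] + ", " + phrases[i + 1])
    return corpora

pairs = two_gram_corpora(
    "renovated recently, move-in ready, low tax. quiet area, good light."
)
```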

A phrase containing a connective can make the house source corpus read poorly. A connective lexicon is therefore constructed, dividing connectives into two types: prefix connectives and suffix connectives. For example, "if" usually appears in the first half of a sentence, so it is classified as a prefix connective; when it appears in the second half of a house source corpus (a 2-gram sentence), the whole phrase pair is judged invalid and filtered out. Suffix connectives are handled similarly.
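A minimal sketch of this filtering rule follows; the tiny connective sets stand in for a real lexicon, and the exact symmetric treatment of suffix connectives is an assumption:

```python
# Sketch of the prefix/suffix connective filtering rule described above.
# The connective sets are tiny illustrative stand-ins for a real lexicon.

PREFIX_CONNECTIVES = {"if", "although"}  # expected to open a sentence
SUFFIX_CONNECTIVES = {"therefore"}       # expected to close a sentence

def is_valid_corpus(two_gram):
    first, _, second = two_gram.partition(", ")
    # a prefix connective should open a sentence; one opening the SECOND
    # half of a 2-gram marks the pair as semantically broken
    if second.split() and second.split()[0] in PREFIX_CONNECTIVES:
        return False
    # suffix connectives are handled symmetrically on the first half
    if first.split() and first.split()[-1] in SUFFIX_CONNECTIVES:
        return False
    return True
```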

As shown in FIG. 6, the original corpus "The house has been decorated for less than 2 years and is move-in ready. It is on the first floor with a good garden view from the window, home-swap clients are accepted, and the house is a commodity house." is segmented to obtain the corresponding phrase list. The phrases in the list are then fused to obtain the house source corpora corresponding to the list, for example "decorated for less than 2 years, move-in ready", "garden view from the first-floor window, home-swap clients accepted", and "home-swap clients accepted, the house is a commodity house". The corpora are then filtered according to the connective filtering rule.

Fig. 4 is a flowchart of a method for generating a house source corpus in an embodiment of a training method for a house source title generation model according to the present disclosure, where the method shown in fig. 4 includes the steps of: S401-S402. The following describes each step.

S401, performing word segmentation processing on the house source corpora to obtain house source feature word data. The word segmentation process may use various segmentation algorithms, such as a dynamic-programming-based Chinese word segmenter (for example, the jieba segmentation library).

S402, training the house source feature word data by using a word2vec model to obtain a house source feature word vector.

In one embodiment, after the house source corpus is segmented by a Chinese word segmentation algorithm, word vectors are trained with a word2vec model. Various existing methods can be used to train the house source feature word data with the word2vec model to obtain the house source feature word vectors.

Fig. 5 is a flowchart of obtaining a user preference code in an embodiment of the training method for a house source title generation model of the present disclosure, where the method shown in fig. 5 includes the steps of: S501-S502. The following describes each step.

S501, user preference label information corresponding to the house source is obtained. The user preference tags may be house type preferences, price preference tags, and the like.

S502, encoding the user preference tag information to obtain the user preference code. The user preference code may be a one-hot code or the like, and the user preference tag information may be encoded using various existing methods.
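As a minimal sketch of this encoding (the tag vocabulary is a hypothetical example, and a multi-hot code over the vocabulary is assumed), it can look like:

```python
# Sketch of encoding user preference tags over a fixed tag vocabulary.
# The vocabulary below is an illustrative assumption, not from the disclosure.

PREFERENCE_TAGS = ["price", "floor", "school district", "decoration"]

def encode_preferences(user_tags):
    """One-hot style (multi-hot) code: 1 where the user has the tag."""
    return [1 if tag in user_tags else 0 for tag in PREFERENCE_TAGS]

code = encode_preferences({"price", "school district"})
# each position of the code corresponds to one tag in PREFERENCE_TAGS
```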

In one embodiment, the house source title generation model comprises a feature extraction layer and a full connection layer. Fig. 7 is a flowchart of model training in an embodiment of the training method for a house source title generation model of the present disclosure, where the method shown in fig. 7 includes the steps of: S701-S703. The following describes each step.

S701, inputting the house source feature word vectors into a feature extraction layer so as to enable the feature extraction layer to extract context features based on semantic relations.

In one embodiment, the feature extraction layer may take various forms, for example a layer built on a Bidirectional Long Short-Term Memory (BiLSTM) network model. The BiLSTM layer combines a forward LSTM and a backward LSTM, and can capture the semantic information of adjacent words in the house source feature word vectors as well as the contextual semantics between adjacent corpora.

S702, inputting the feature vector sequence output by the feature extraction layer and the corresponding user preference code into the fully connected layer, performing a Concat operation, and outputting the house source feature word selection labels corresponding to the user preference code. The fully connected layer can perform the Concat operation and output the selection labels by various methods.

And S703, selecting the label and the house source characteristic word marking information according to the house source characteristic word, and calculating the characteristic word selection loss by using a preset loss function.

In one embodiment, a house source feature word selection label is a house source feature word output by the fully connected layer together with its selection tag, which may be "0" (not selected) or "1" (selected). For example, the house source feature word vectors are input into the feature extraction layer; the resulting feature vector sequence and the corresponding user preference code are input into the fully connected layer; a Concat operation is performed; and the fully connected layer outputs selection labels such as: high floor = 1, low price = 0, low tax = 1, where high floor, low price, and low tax are house source feature words.

Cross entropy is used to measure the difference between the house source feature word selection labels and the house source feature word labeling information, and a loss function computes this cross-entropy information as the feature word selection loss. Various loss functions may be used, such as the sigmoid cross-entropy loss, and the loss can be calculated by various existing methods.
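The sigmoid cross-entropy computation can be sketched in a few lines; the function names are illustrative, and a mean over the per-word losses is assumed:

```python
import math

# Sketch of the sigmoid cross-entropy loss used to compare the model's
# per-word selection logits against the 0/1 labeling information.
# Averaging over words is an assumption; a sum would work similarly.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def selection_loss(logits, labels):
    """Mean binary cross-entropy over the feature word selection labels."""
    total = 0.0
    for logit, y in zip(logits, labels):
        p = sigmoid(logit)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# a confident, correct model yields a low loss; a wrong one a high loss
low = selection_loss([4.0, -4.0], [1, 0])
high = selection_loss([-4.0, 4.0], [1, 0])
```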

In one embodiment, the house source title generation model includes an Attention layer and a Dropout layer. The feature vector sequence output by the feature extraction layer is input into the Attention layer, which assigns a weight to each vector in the sequence through an attention mechanism. The output of the Attention layer is passed into the Dropout layer, which prevents the model from overfitting, and the output of the Dropout layer is passed into the fully connected layer. The Attention and Dropout layers can use any of the existing structures for such layers.

The BiLSTM-Attention combination learns better representation vectors for the house source corpora: the BiLSTM layer learns the forward and backward semantics, i.e., the context, of each feature word representation in a corpus, while the Attention layer synthesizes the semantics of all the learned feature word representations so that each corpus representation carries deeper semantic information. The Dropout layer thins the network structure, prevents overfitting, and speeds up convergence. Title generation is realized by learning, via the neural network, the mapping between the house source corpora and their labels, and finally using the model output.
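As a toy illustration of the attention weighting described above (not the disclosure's actual layer), the following scores each timestep vector against a query vector, normalizes the scores with a softmax, and returns the weighted sum; a real Attention layer would learn the query during training:

```python
import math

# Toy sketch of attention pooling over a feature vector sequence: dot-product
# scores, softmax weights, weighted sum. The fixed query is an illustrative
# assumption; a trained layer learns it.

def attention_pool(sequence, query):
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in sequence]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over timesteps
    dim = len(sequence[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, sequence)) for d in range(dim)
    ]
    return pooled, weights

pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[2.0, 0.0])
# the first vector aligns with the query, so it receives the larger weight
```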

The house source title generation model is built on a BiLSTM + Attention architecture. The input layer is the house source feature word vector (embedding vector) corresponding to the house source corpus; it passes through the BiLSTM layer into the Attention layer, then through Dropout into the fully connected layer, and finally a cross-entropy output is used for the selection decision. This model can perform bidirectional semantic modeling of the sentence sequence directly and can automatically weight words via the attention mechanism, yielding more robust model output. By incorporating user preferences, i.e., converting the user preference tags into a one-hot code and feeding them into the fully connected layer during training, the model can generate house source titles with personalized characteristics.

In one embodiment, as shown in FIG. 8, the house source title generation model includes a BiLSTM layer, an Attention layer, a Dropout layer, and a fully connected layer. Its input data are of two kinds: the house source corpus and the user preference information (user preference tag information).

The house source corpus is segmented into house source feature words; each feature word's vector is input into the BiLSTM layer, which learns the temporal features of the input sequence. The BiLSTM output is connected to the Attention layer, which automatically weights the word-level inputs so that the model learns richer feature representations. The Attention output is passed to the Dropout layer to avoid overfitting and strengthen the model's generalization ability, and finally the Dropout output is connected to the fully connected layer. The user preference tags are one-hot encoded and fed directly into the fully connected layer, where they are concatenated (concat) with the output derived from the house source corpus. The user's preference information thus participates directly in the final loss calculation, and the output of the fully connected layer can accommodate the user's personalized needs as far as possible while remaining diverse.

The present disclosure unifies the user's preference information and the house source corpus information in a single house source title generation model, which can mine the temporal features of the text while also introducing the user's personalized information, ensuring diverse selling-point creativity and attractive output.

Fig. 9 is a flowchart of an embodiment of a house source title obtaining method according to the present disclosure, where the method shown in fig. 9 includes the steps of: S901-S903. The following describes each step.

S901, acquiring house source characteristic word vectors corresponding to house sources and user preference codes.

And S902, using the trained house source title generation model and acquiring a house source feature word selection label based on the house source feature word vector and the user preference code.

S903, generating a house source title based on the house source feature word selection labels and the house source feature words; the house source title generation model is obtained by training with the training method of any of the above embodiments.

In one embodiment, when acquiring the house source title, the method for acquiring the house source feature word vector and the user preference code is the same as the method for acquiring the house source feature word vector and the user preference code in the house source title generation model training.

For example, the house source corpora "decorated for less than 2 years, move-in ready" and "garden view from the first-floor window, home-swap clients accepted" are obtained. Word segmentation yields a house source feature word list containing: decorated for less than 2 years, move-in ready, first floor, garden view from the window, home-swap clients accepted. The feature words in the list are trained with a word2vec model to obtain the house source feature word vectors. User preference tag information, including tags for decoration, price, floor, and so on, is obtained and encoded into a user preference code, here a one-hot code.

The house source feature word vectors are input into the BiLSTM layer to learn the temporal features of the input sequence; the BiLSTM output is connected to the Attention layer to automatically weight the word-level inputs; the Attention output is passed to the Dropout layer to avoid overfitting; and the Dropout output is connected to the fully connected layer. The user preference code is input into the fully connected layer and concatenated with the output derived from the feature word vectors, and the fully connected layer outputs the house source feature word selection labels. For example: decorated for less than 2 years = 1, move-in ready = 1, first floor = 1, garden view from the window = 0, home-swap clients accepted = 0. A house source title is then generated from the selected feature words, for example "First floor, decorated for less than 2 years, move-in ready".
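The final selection-to-title step can be sketched as follows; the joining separator and the feature word list are illustrative assumptions:

```python
# Sketch of the final step: keep the feature words whose selection label is 1
# and join them into a title. Separator and word list are illustrative.

def make_title(feature_words, selection_labels):
    chosen = [w for w, keep in zip(feature_words, selection_labels) if keep == 1]
    return ", ".join(chosen)

title = make_title(
    ["decorated less than 2 years", "move-in ready", "first floor",
     "garden view", "accepts home-swap clients"],
    [1, 1, 1, 0, 0],
)
# only the words the model selected for this user appear in the title
```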

Exemplary devices

In one embodiment, as shown in fig. 10, the present disclosure provides a training apparatus for a house source title generation model, including: a feature acquisition module 1001, a sample construction module 1002, a model training module 1003, a loss determination module 1004, and a parameter adjustment module 1005.

The feature obtaining module 1001 obtains a house source feature word vector and a user preference code corresponding to a house source. The sample construction module 1002 generates a training sample according to the house source feature word vector, the user preference code, and the house source feature word labeling information. The model training module 1003 trains the preset house source title generation model by using the training samples. The loss determination module 1004 obtains a feature word selection loss corresponding to the training sample by using a preset loss function calculation. The parameter adjusting module 1005 adjusts the parameters of the house source title generation model according to the feature word selection loss until the feature word selection loss is lower than a preset threshold value, and obtains the trained house source title generation model.

In one embodiment, the house source title generation model includes a feature extraction layer and a fully connected layer, where the feature extraction layer may be, for example, one built on a BiLSTM network model. The model training module 1003 inputs the house source feature word vectors into the feature extraction layer so that it extracts context features based on semantic relations; it then inputs the feature vector sequence output by the feature extraction layer and the corresponding user preference code into the fully connected layer, performs the Concat operation, and outputs the house source feature word selection labels corresponding to the user preference code.

The loss determination module 1004 calculates the feature word selection loss from the house source feature word selection labels and the labeling information using a preset loss function. For example, the loss determination module 1004 computes cross-entropy information as the feature word selection loss, where the loss function may be a sigmoid cross-entropy loss or the like.

In one embodiment, the house source title generation model includes an Attention layer and a Dropout layer. The model training module 1003 inputs the feature vector sequence output by the feature extraction layer into the Attention layer, which assigns weights to the vectors in the sequence through an attention mechanism; it passes the output of the Attention layer into the Dropout layer, which prevents the model from overfitting; and it passes the output of the Dropout layer into the fully connected layer.

In one embodiment, as shown in fig. 11, the feature acquisition module 1001 includes: an information acquisition unit 1011, a corpus acquisition unit 1012, a vector generation unit 1013, and a code generation unit 1014. The information acquisition unit 1011 acquires the house source description information and the user preference code corresponding to the house source. The corpus acquiring unit 1012 preprocesses the house source description information to acquire house source corpora. The vector generation unit 1013 generates a house source feature word vector corresponding to the house source corpus.

In one embodiment, the pre-processing includes a data cleansing process or the like. As shown in fig. 12, the corpus acquiring unit 1012 includes: a cleaning unit 1021 and a generation unit 1022. The cleaning unit 1021 performs filtering processing on the house source description information based on a preset text length threshold. The cleaning unit 1021 performs symbol normalization processing on the house source description information after the filtering processing, and performs replacement processing on the number type number in the house source description information to generate an original corpus.

The generating unit 1022 generates the house source corpora based on the original corpus. For example, the generating unit 1022 obtains the independent sentences corresponding to the original corpus based on a preset punctuation segmentation rule and segments them to obtain the corresponding short sentence lists; it splices the short sentences in each list to obtain the house source corpora corresponding to that list; and it filters the house source corpora based on a preset connective filtering rule to obtain the effective house source corpora.

The vector generation unit 1013 performs word segmentation processing on the house source corpus to obtain house source feature word data. The vector generation unit 1013 trains the house source feature word data using the word2vec model to obtain a house source feature word vector. The code generation unit 1014 acquires user preference tag information corresponding to the house source, performs coding processing on the user preference tag information, and acquires a user preference code.

In one embodiment, as shown in fig. 13, the present disclosure provides a house source title obtaining apparatus, including: an information acquisition module 1301, a model use module 1302, and a title generation module 1303. The information acquisition module 1301 obtains the house source feature word vectors and the user preference code corresponding to the house source. The model use module 1302 uses the trained house source title generation model to obtain the house source feature word selection labels based on the feature word vectors and the user preference code. The title generation module 1303 generates the house source title based on the selection labels and the house source feature words, where the house source title generation model is obtained by training with the training method of any of the above embodiments.

Fig. 14 is a block diagram of one embodiment of an electronic device of the present disclosure, as shown in fig. 14, the electronic device 141 includes one or more processors 1411 and a memory 1412.

Processor 1411 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in electronic device 141 to perform desired functions.

Memory 1412 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory, for example, may include: random Access Memory (RAM) and/or cache memory (cache), etc. The nonvolatile memory, for example, may include: read Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on a computer readable storage medium and executed by the processor 1411 to implement the above training method of the house title generation model and/or the house title acquisition method of the various embodiments of the present disclosure and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.

In one example, the electronic device 141 may further include an input device 1413 and an output device 1414, among others, interconnected by a bus system and/or another form of connection mechanism (not shown). The input device 1413 may include, for example, a keyboard, a mouse, and the like. The output device 1414 can output various information to the outside and may include, for example, a display, speakers, a printer, a communication network, and remote output devices connected to it.

Of course, for simplicity, only some of the components of the electronic device 141 relevant to the present disclosure are shown in fig. 14, and components such as buses, input/output interfaces, and the like are omitted. In addition, electronic device 141 may include any other suitable components depending on the particular application.

In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the training method of the house source title generation model and/or the house source title acquisition method according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.

The computer program product may include program code for performing the operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the training method of the house source title generation model and/or the house source title acquisition method according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.

The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium may include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments. However, it should be noted that the advantages, effects, and the like mentioned in the present disclosure are merely examples, not limitations, and should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not limited to those specific details.

In the above embodiments, the training method of the house source title generation model, the house source title generation method, the apparatus, the electronic device, and the storage medium generate a training sample from the house source feature word vector, the user preference code, and the house source feature word tagging information; train a preset house source title generation model; obtain the feature word selection loss corresponding to the training sample using a preset loss function; and adjust the parameters of the house source title generation model according to that loss. The trained model is then used to obtain house source feature word selection labels and generate a house source title. In this way, house source titles corresponding to user preferences can be generated automatically, giving the titles personalized characteristics and creative selling points and solving the problems of high repetition and insufficient personalization. The operation efficiency, accuracy, and robustness of the house source title generation model are also effectively improved. When generating titles, diversity is ensured while the personalized needs of users are taken into account as much as possible: a house source is presented to different users according to their characteristics, the generated titles are more attractive, and users can quickly obtain the key information of a house source, which saves time and improves the user experience.
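The training procedure summarized above can be sketched with a toy example. The logistic scorer, the concatenation of word vector and preference code, and the binary cross-entropy loss below are assumptions for illustration; the patent only specifies that a feature word selection loss is computed and used to adjust the model parameters.

```python
import math
import random

random.seed(0)

# toy training sample: 4 feature words, 3-dim word vectors, 2-dim user preference code
word_vecs = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
pref_code = [1.0, 0.0]
labels = [1, 0, 1, 0]                      # annotated feature word selection labels

X = [v + pref_code for v in word_vecs]     # per-word input: word vector + preference code
w = [0.0] * 5                              # model parameters

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

for _ in range(300):                       # training loop
    probs = [predict(x) for x in X]
    # feature word selection loss: binary cross-entropy against annotated labels
    loss = -sum(y * math.log(p + 1e-9) + (1 - y) * math.log(1 - p + 1e-9)
                for y, p in zip(labels, probs)) / len(labels)
    for j in range(5):                     # adjust parameters according to the loss
        grad_j = sum((p - y) * x[j] for p, y, x in zip(probs, labels, X)) / len(labels)
        w[j] -= 0.5 * grad_j

# trained model: selection labels for each feature word
selected = [1 if predict(x) > 0.5 else 0 for x in X]
print(selected)
```

The words whose selection label is 1 would then be assembled into the house source title, as in the apparatus embodiment.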

In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."

The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
