Punctuation mark labeling method, punctuation mark labeling apparatus, computer device and storage medium

Document No.: 49491    Publication date: 2021-09-28

Reading note: This technique, "Punctuation mark labeling method, apparatus, computer device and storage medium", was designed and created by 耿思晴, 潘晟锋, 文博 and 刘云峰 on 2021-07-22. Abstract: The application relates to a punctuation mark labeling method, apparatus, computer device, and storage medium. The method comprises the following steps: acquiring a training data set comprising a plurality of groups of sample data, each group comprising a sample text without punctuation and a corresponding punctuation mark sample label sequence; determining the number of convolution kernels of the punctuation prediction model to be trained and the length of each convolution kernel according to the text length of each sample text, and performing iterative training on the punctuation prediction model to be trained to obtain a punctuation prediction model, where the punctuation prediction model comprises a first convolution kernel for acquiring long-distance context information and a second convolution kernel for acquiring short-distance context information; inputting the text to be labeled, which has no punctuation marks, into the punctuation prediction model, outputting a punctuation mark label sequence corresponding to the text to be labeled, and integrating the text to be labeled with the punctuation mark label sequence to generate text labeled with punctuation marks. With this method, punctuation marks can be predicted more accurately for a text.

1. A punctuation mark labeling method, characterized in that the method comprises:

acquiring a training data set; the training data set comprises a plurality of groups of sample data, and each group of sample data comprises a sample text without punctuation and a corresponding punctuation mark sample label sequence;

analyzing the text length of each sample text in the training data set, and determining the number of convolution kernels and the length of each convolution kernel according to the text length of each sample text;

constructing a punctuation prediction model to be trained according to the number of the convolution kernels and the length of each convolution kernel;

performing iterative training on the punctuation prediction model to be trained based on the training data set to obtain a punctuation prediction model; the punctuation prediction model comprises a plurality of convolution kernels, and the plurality of convolution kernels comprise a first convolution kernel and a second convolution kernel; the first convolution kernel is used for acquiring long-distance context information; the second convolution kernel is used for acquiring short-distance context information;

inputting a text to be marked without punctuation marks into the punctuation prediction model, outputting a punctuation mark label sequence corresponding to the text to be marked, and integrating the text to be marked and the punctuation mark label sequence to generate a text marked with the punctuation marks.

2. The method of claim 1, wherein iteratively training the punctuation prediction model to be trained based on the training data set, resulting in a punctuation prediction model comprises:

in each iteration, inputting the sample text in the training data set into a punctuation prediction model to be trained in the current iteration, and outputting a punctuation symbol prediction label sequence corresponding to the sample text;

and adjusting model parameters of the punctuation prediction model to be trained in the current round according to the loss value between the punctuation prediction label sequence and the corresponding punctuation sample label sequence until an iteration stop condition is met, and obtaining the trained punctuation prediction model.

3. The method of claim 1, wherein inputting the text to be annotated without punctuation into the punctuation prediction model, and outputting a punctuation label sequence corresponding to the text to be annotated comprises:

inputting the text to be marked into the punctuation prediction model, and carrying out convolution processing on the text sequence to be marked through a plurality of convolution kernels in the punctuation prediction model;

splicing the feature vectors obtained by performing convolution processing on the convolution kernels; the feature vectors comprise global feature vectors extracted based on the first convolution kernel and feature vectors extracted based on the second convolution kernel and focused on keywords;

and performing punctuation prediction based on the spliced vector to obtain a punctuation mark label sequence corresponding to the text to be marked.

4. The method of claim 3, wherein the punctuation prediction based on the spliced vectors to obtain a punctuation mark label sequence corresponding to the text to be labeled comprises:

performing punctuation mark label prediction on each character in the text to be marked based on the spliced vector to obtain a punctuation prediction result corresponding to each character; the punctuation prediction result corresponding to each character comprises the probability that the character corresponds to each preset punctuation mark label;

and aiming at each character in the text to be labeled, selecting a punctuation mark label with the maximum probability value from punctuation prediction results corresponding to the characters as a punctuation mark label finally corresponding to the character to obtain a punctuation mark label sequence corresponding to the text to be labeled.

5. The method of claim 1, wherein each character in the text to be annotated has a corresponding punctuation mark label in the punctuation mark label sequence;

the integrating the text to be labeled and the punctuation mark label sequence to generate the text labeled with the punctuation mark comprises:

determining, from the punctuation mark label sequence, punctuation mark labels that have corresponding punctuation marks;

and for each determined punctuation mark label, inserting the punctuation mark corresponding to the punctuation mark label after the character in the text to be labeled that corresponds to the label, and generating the text labeled with punctuation marks.

6. The method of claim 3, wherein the punctuation prediction model is a text convolutional neural network model for punctuation prediction of text;

the inputting the text to be labeled into the punctuation prediction model and performing convolution processing on the text sequence to be labeled through a plurality of convolution kernels in the punctuation prediction model comprises:

coding each character of the text to be marked into a corresponding character vector to obtain a character vector set, and inputting the character vector set into the text convolution neural network model;

and carrying out convolution processing on the character vectors in the character vector set in parallel through the text convolution neural network model.

7. A punctuation marking device, said device comprising:

the acquisition module is used for acquiring a training data set; the training data set comprises a plurality of groups of sample data, and each group of sample data comprises a sample text without punctuation and a corresponding punctuation mark sample label sequence;

the determining module is used for analyzing the text length of each sample text in the training data set and determining the number of convolution kernels and the length of each convolution kernel according to the text length of each sample text; constructing a punctuation prediction model to be trained according to the number of the convolution kernels and the length of each convolution kernel;

the training module is used for carrying out iterative training on the punctuation prediction model to be trained on the basis of the training data set to obtain a punctuation prediction model; the punctuation prediction model comprises a plurality of convolution kernels, and the plurality of convolution kernels comprise a first convolution kernel and a second convolution kernel; the first convolution kernel is used for acquiring long-distance context information; the second convolution kernel is used for acquiring short-distance context information;

and the marking module is used for inputting the text to be marked without the punctuation marks into the punctuation prediction model, outputting punctuation mark label sequences corresponding to the text to be marked, and integrating the text to be marked and the punctuation mark label sequences to generate the text marked with the punctuation marks.

8. The apparatus of claim 7, wherein the training module is further configured to, in each iteration, input the sample text in the training data set into a punctuation prediction model to be trained in the current iteration, and output a punctuation symbol prediction tag sequence corresponding to the sample text; and adjusting model parameters of the punctuation prediction model to be trained in the current round according to the loss value between the punctuation prediction label sequence and the corresponding punctuation sample label sequence until an iteration stop condition is met, and obtaining the trained punctuation prediction model.

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.

Technical Field

The present application relates to the field of machine learning technologies, and in particular, to a punctuation mark labeling method, apparatus, computer device, and storage medium.

Background

With the rapid development of computer technology, many processing tasks can be automated with computers. In some cases, a computer is needed to automatically label punctuation marks on text that has none. For example, text obtained by automatic speech recognition usually carries no punctuation marks; the absence of punctuation greatly reduces the readability of the text and affects the accuracy of downstream task processing. Therefore, punctuation marks need to be predicted for text obtained by automatic speech recognition.

An important approach to punctuation prediction is to train a machine learning model on lexical features. At present, when a machine learning model based on a lexical feature sequence performs punctuation prediction, it can only extract context information at a fixed, single length, and this limitation makes the prediction accuracy relatively low.

Disclosure of Invention

In view of the above, it is necessary to provide a punctuation marking method, apparatus, computer device and storage medium capable of improving accuracy.

A punctuation marking method, the method comprising:

acquiring a training data set; the training data set comprises a plurality of groups of sample data, and each group of sample data comprises a sample text without punctuation and a corresponding punctuation mark sample label sequence;

analyzing the text length of each sample text in the training data set, and determining the number of convolution kernels and the length of each convolution kernel according to the text length of each sample text;

constructing a punctuation prediction model to be trained according to the number of the convolution kernels and the length of each convolution kernel;

performing iterative training on the punctuation prediction model to be trained based on the training data set to obtain a punctuation prediction model; the punctuation prediction model comprises a plurality of convolution kernels, and the plurality of convolution kernels comprise a first convolution kernel and a second convolution kernel; the first convolution kernel is used for acquiring long-distance context information; the second convolution kernel is used for acquiring short-distance context information;

inputting a text to be marked without punctuation marks into the punctuation prediction model, outputting a punctuation mark label sequence corresponding to the text to be marked, and integrating the text to be marked and the punctuation mark label sequence to generate a text marked with the punctuation marks.

In one embodiment, the iteratively training the punctuation prediction model to be trained based on the training data set to obtain the punctuation prediction model includes:

in each iteration, inputting the sample text in the training data set into a punctuation prediction model to be trained in the current iteration, and outputting a punctuation symbol prediction label sequence corresponding to the sample text;

and adjusting model parameters of the punctuation prediction model to be trained in the current round according to the loss value between the punctuation prediction label sequence and the corresponding punctuation sample label sequence until an iteration stop condition is met, and obtaining the trained punctuation prediction model.

In one embodiment, the inputting the text to be annotated without punctuation marks into the punctuation prediction model, and the outputting of the punctuation mark label sequence corresponding to the text to be annotated comprises:

inputting the text to be marked into the punctuation prediction model, and carrying out convolution processing on the text sequence to be marked through a plurality of convolution kernels in the punctuation prediction model;

splicing the feature vectors obtained by performing convolution processing on the convolution kernels; the feature vectors comprise global feature vectors extracted based on the first convolution kernel and feature vectors extracted based on the second convolution kernel and focused on keywords;

and performing punctuation prediction based on the spliced vector to obtain a punctuation mark label sequence corresponding to the text to be marked.

In one embodiment, the performing punctuation prediction based on the spliced vector to obtain a punctuation mark label sequence corresponding to the text to be labeled includes:

performing punctuation mark label prediction on each character in the text to be marked based on the spliced vector to obtain a punctuation prediction result corresponding to each character; the punctuation prediction result corresponding to each character comprises the probability that the character corresponds to each preset punctuation mark label;

and aiming at each character in the text to be labeled, selecting a punctuation mark label with the maximum probability value from punctuation prediction results corresponding to the characters as a punctuation mark label finally corresponding to the character to obtain a punctuation mark label sequence corresponding to the text to be labeled.

In one embodiment, each character in the text to be labeled has a corresponding punctuation mark label in the punctuation mark label sequence;

the integrating the text to be labeled and the punctuation mark label sequence to generate the text labeled with the punctuation mark comprises:

determining, from the punctuation mark label sequence, punctuation mark labels that have corresponding punctuation marks;

and for each determined punctuation mark label, inserting the punctuation mark corresponding to the punctuation mark label after the character in the text to be labeled that corresponds to the label, and generating the text labeled with punctuation marks.

In one embodiment, the punctuation prediction model is a text convolutional neural network model for punctuation prediction of a text;

the step of inputting the text to be labeled into the punctuation prediction model to perform convolution processing on the text sequence to be labeled through a plurality of convolution kernels in the punctuation prediction model comprises the following steps:

coding each character of the text to be marked into a corresponding character vector to obtain a character vector set, and inputting the character vector set into the text convolution neural network model;

and carrying out convolution processing on the character vectors in the character vector set in parallel through the text convolution neural network model.

A punctuation marking device, the device comprising:

the acquisition module is used for acquiring a training data set; the training data set comprises a plurality of groups of sample data, and each group of sample data comprises a sample text without punctuation and a corresponding punctuation mark sample label sequence;

the determining module is used for analyzing the text length of each sample text in the training data set and determining the number of convolution kernels and the length of each convolution kernel according to the text length of each sample text; constructing a punctuation prediction model to be trained according to the number of the convolution kernels and the length of each convolution kernel;

the training module is used for carrying out iterative training on the punctuation prediction model to be trained on the basis of the training data set to obtain a punctuation prediction model; the punctuation prediction model comprises a plurality of convolution kernels, and the plurality of convolution kernels comprise a first convolution kernel and a second convolution kernel; the first convolution kernel is used for acquiring long-distance context information; the second convolution kernel is used for acquiring short-distance context information;

and the marking module is used for inputting the text to be marked without the punctuation marks into the punctuation prediction model, outputting punctuation mark label sequences corresponding to the text to be marked, and integrating the text to be marked and the punctuation mark label sequences to generate the text marked with the punctuation marks.

A computer device comprises a memory and a processor; the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the above punctuation mark labeling method.

A computer-readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the steps of the above punctuation mark labeling method.

According to the punctuation mark labeling method, apparatus, computer device, and storage medium, a training data set is acquired; the training data set comprises a plurality of groups of sample data, and each group of sample data comprises a sample text without punctuation and a corresponding punctuation mark sample label sequence. The text length of each sample text in the training data set is analyzed, and the number of convolution kernels and the length of each convolution kernel are determined according to the text length of each sample text; a punctuation prediction model to be trained is constructed according to the number of convolution kernels and the length of each convolution kernel; and the punctuation prediction model to be trained is iteratively trained on the training data set to obtain the punctuation prediction model. By analyzing the lengths of the texts in the training data set, a suitable number of convolution kernels and a suitable length for each convolution kernel can be determined for constructing and training the model, so the resulting punctuation prediction model comprises a plurality of convolution kernels, including a first convolution kernel and a second convolution kernel of different lengths. When a text to be labeled without punctuation marks is input into the punctuation prediction model, long-distance context information can be obtained through the longer first convolution kernel and short-distance context information through the shorter second convolution kernel, so that punctuation prediction is performed based on both long-distance and short-distance context information. This ensures the comprehensiveness of the context information used for prediction, avoids the limitation of context information of a single length, and thus allows a more accurate punctuation mark label sequence to be output. Therefore, when the text to be labeled and the punctuation mark label sequence are integrated, the generated text labeled with punctuation marks is more accurate; that is, punctuation marks can be predicted more accurately for the text.

Drawings

FIG. 1 is a diagram of an exemplary embodiment of a punctuation marking method;

FIG. 2 is a flow chart illustrating a punctuation marking method according to an embodiment;

FIG. 3 is a flow diagram illustrating the punctuation mark sequence prediction step in one embodiment;

FIG. 4 is a block diagram of an exemplary punctuation marking device;

FIG. 5 is a block diagram illustrating the structure of a tagging module in one embodiment;

FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The punctuation marking method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 110 communicates with the server 120 through a network. The terminal 110 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 120 may be implemented by an independent server or a server cluster formed by a plurality of servers.

The server 120 may obtain a training data set; the training data set comprises a plurality of groups of sample data, and each group of sample data comprises a sample text without punctuation and a corresponding punctuation mark sample label sequence. The server 120 may analyze the text length of each sample text in the training data set, and determine the number of convolution kernels and the length of each convolution kernel according to the text length of each sample text; construct a punctuation prediction model to be trained according to the number of convolution kernels and the length of each convolution kernel; and perform iterative training on the punctuation prediction model to be trained based on the training data set to obtain a punctuation prediction model. The punctuation prediction model comprises a plurality of convolution kernels, and the plurality of convolution kernels comprise a first convolution kernel and a second convolution kernel; the first convolution kernel is used for acquiring long-distance context information, and the second convolution kernel is used for acquiring short-distance context information. The server 120 may input the text to be labeled without punctuation marks, acquired from the terminal 110, into the punctuation prediction model, output a punctuation mark label sequence corresponding to the text to be labeled, and integrate the text to be labeled and the punctuation mark label sequence to generate a text labeled with punctuation marks. The server 120 may return the text labeled with punctuation marks to the terminal 110.

In an embodiment, a user inputs a voice question through a microphone of the terminal 110, the terminal 110 may upload the voice question to the server 120, and the server 120 may perform speech-to-text processing on the voice question to obtain a text, where the text may be a text to be annotated without punctuation marks. The server 120 may perform punctuation prediction on the text to be labeled through the punctuation prediction model, thereby obtaining the text labeled with punctuation marks.

It should be noted that the above application environment is only an example. In some embodiments, the terminal 110 may obtain the punctuation prediction model, and after obtaining a text to be annotated without punctuation marks, the terminal 110 may use the punctuation prediction model to perform punctuation mark prediction on the text to be annotated, thereby obtaining the text annotated with punctuation marks. The terminal 110 may train the punctuation prediction model itself, or may obtain a trained punctuation prediction model sent by the server 120, which is not limited herein. It can be understood that, in the case that the punctuation prediction model is trained by the terminal 110 and the terminal 110 uses the model to predict punctuation of the text to be annotated, the punctuation mark labeling method in the embodiments of the present application is executed by the terminal 110.

In an embodiment, as shown in fig. 2, a punctuation mark labeling method is provided, and this embodiment is illustrated by applying the method to a server, it is to be understood that the method may also be applied to a terminal, may also be applied to a system including a terminal and a server, and is implemented by interaction between the terminal and the server. In this embodiment, the method includes the steps of:

s202, acquiring a training data set; the training data set comprises a plurality of groups of sample data, and each group of sample data comprises a sample text without punctuation and a corresponding punctuation mark sample label sequence.

The punctuation mark sample label sequence is a set of punctuation mark labels serving as samples. A punctuation mark label is a specific symbol used to indicate a punctuation mark. For example, if a punctuation mark is a comma, it can be represented by "C", and "C" is then the punctuation mark label.

In one embodiment, the specific symbol may include at least one of a number, a letter, a special character, and the like, without being limited thereto.

In one embodiment, the server may obtain an initial text with punctuation marks, split off the punctuation marks from the obtained initial text, and sequentially generate a punctuation mark sample label sequence according to the punctuation condition after each character in the initial text.

Specifically, each character in the initial text corresponds to a punctuation mark label, and the label indicates the punctuation condition after that character. Characters followed by no punctuation mark can all correspond to the same label, while a character followed by a punctuation mark corresponds to the label representing that mark. For example, numbers are used to represent the punctuation conditions, where 0 corresponds to no punctuation, 1 corresponds to a comma, 2 corresponds to a period, 3 corresponds to a question mark, and 4 corresponds to an exclamation mark. If the initial text is "I forgot to bring my umbrella today, will it rain?", the corresponding punctuation mark sample label sequence, with one label per character, is (0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3).
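As an illustration, the following Python sketch splits a punctuated initial text into an unpunctuated sample text and its punctuation mark sample label sequence. The label numbering follows the example above, and the sample sentence is a hypothetical Chinese sentence chosen only so that its label sequence matches the 14-element sequence in the example.

```python
# Hypothetical mapping from punctuation marks to label numbers (0 means "no punctuation follows").
PUNCT_TO_LABEL = {",": 1, "，": 1, ".": 2, "。": 2, "?": 3, "？": 3, "!": 4, "！": 4}

def make_sample(initial_text: str):
    chars, labels = [], []
    for ch in initial_text:
        if ch in PUNCT_TO_LABEL:
            if labels:                        # attach the mark's label to the preceding character
                labels[-1] = PUNCT_TO_LABEL[ch]
        else:
            chars.append(ch)
            labels.append(0)                  # default: no punctuation after this character
    return "".join(chars), labels

text, label_seq = make_sample("我今天忘记带伞了，请问会下雨吗？")
# text      == "我今天忘记带伞了请问会下雨吗"
# label_seq == [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3]
```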

In another embodiment, the training data set is pre-obtained data, which the server may directly obtain.

In one embodiment, punctuation marks may include at least one of commas, periods, question marks, exclamation marks, and the like.

And S204, analyzing the text length of each sample text in the training data set, and determining the number of convolution kernels and the length of each convolution kernel according to the text length of each sample text.

Specifically, the number of convolution kernels and the length of each convolution kernel play an important role in feature extraction from the sample texts: convolution kernels that are too few or too short reduce the accuracy of feature extraction, while convolution kernels that are too many or too long increase the system load during convolution processing.

In one embodiment, the server may analyze the text length of each sample text in the training data set and determine the preset text length range that the text lengths fall into. The server is preconfigured with a correspondence between preset text length ranges and convolution kernel selection strategies, and the convolution kernel selection strategy corresponding to the determined preset text length range can be obtained from this correspondence. A convolution kernel selection strategy specifies the number of convolution kernels and the length of each convolution kernel. For example, when the preset text length range is 5-50, the corresponding convolution kernel selection strategy is "pair a short convolution kernel of length 3 with a long convolution kernel of length 6".
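A minimal sketch of such a preset correspondence follows. The length ranges, the use of an average sample length, and the second strategy entry are illustrative assumptions; only the 5-50 range paired with kernels of length 3 and 6 comes from the example above.

```python
# Hypothetical mapping from preset text length ranges to convolution kernel selection strategies.
KERNEL_STRATEGIES = [
    # (min_len, max_len, kernel_lengths): one short and one long kernel per range
    (5, 50, (3, 6)),
    (51, 200, (5, 10)),   # assumed additional range, not taken from the text
]

def select_kernels(sample_lengths):
    """Pick a kernel selection strategy from the analyzed sample text lengths."""
    avg_len = sum(sample_lengths) / len(sample_lengths)
    for min_len, max_len, kernels in KERNEL_STRATEGIES:
        if min_len <= avg_len <= max_len:
            return kernels
    return (3, 6)   # fallback strategy

print(select_kernels([12, 20, 35]))   # -> (3, 6): a short kernel of length 3 and a long kernel of length 6
```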

S206, constructing a punctuation prediction model to be trained according to the number of convolution kernels and the length of each convolution kernel; and performing iterative training on the punctuation prediction model to be trained based on the training data set to obtain the punctuation prediction model.

The punctuation prediction model comprises a plurality of convolution kernels, wherein the plurality of convolution kernels comprise a first convolution kernel and a second convolution kernel; the first convolution kernel is used for acquiring long-distance context information; the second convolution kernel is used for acquiring short-distance context information.

It can be understood that the same text segment expresses different information in different texts, and each text segment has its own context information; for example, a sentence expresses different information in different complete paragraphs, and each sentence has its context information. The first convolution kernel in the punctuation prediction model is used to acquire long-distance context information and the second convolution kernel is used to acquire short-distance context information, so that when the punctuation prediction model performs prediction, context information at different distances can be acquired through the different convolution kernels, which improves prediction accuracy.

In one embodiment, the punctuation prediction model to be trained may be any type of network structure, and is not limited thereto.

And S208, inputting the text to be annotated without punctuation marks into the punctuation prediction model, and outputting punctuation mark label sequences corresponding to the text to be annotated.

Wherein, the output punctuation mark label sequence is a set of predicted punctuation mark labels.

Specifically, the server may perform vector conversion on each character in the text to be annotated without punctuation marks, generating a corresponding character vector for each character and thereby obtaining a character vector set. The server can input the character vector set into the punctuation prediction model and perform convolution processing on it through a plurality of convolution kernels with different lengths in the punctuation prediction model to obtain long-distance context information and short-distance context information, and then perform prediction based on the obtained context information at different distances to obtain a prediction result for each character. The prediction result describes the punctuation condition following each character, i.e. either no punctuation mark follows the character or a particular punctuation mark follows it. The server can combine the prediction results of the characters in the text to be labeled according to the order of the characters to obtain the punctuation mark label sequence of the text to be labeled.

In one embodiment, a user can input a voice question through a voice acquisition device of a terminal, the terminal can upload the voice question to the server, and the server can perform speech-to-text processing on the voice question to obtain a text, which may be a text to be annotated without punctuation marks. The server can input this text to be annotated into the punctuation prediction model to predict its punctuation marks.

S210, integrating the text to be labeled and the punctuation mark label sequence to generate the text labeled with punctuation marks.

Specifically, the punctuation mark label corresponding to each character is recorded in the punctuation mark label sequence, and the server may integrate the text to be labeled and the punctuation mark label sequence based on a preset rule to generate the text labeled with punctuation marks, that is, the required final text.

In the above punctuation mark labeling method, a training data set comprising a plurality of groups of sample data is obtained, where each group of sample data comprises a sample text without punctuation and a corresponding punctuation mark sample label sequence; the text length of each sample text in the training data set is analyzed, and the number of convolution kernels and the length of each convolution kernel are determined according to these text lengths; a punctuation prediction model to be trained is constructed according to the number of convolution kernels and the length of each convolution kernel; and the model is iteratively trained on the training data set to obtain the punctuation prediction model. By analyzing the lengths of the texts in the training data set, a suitable number of convolution kernels and a suitable length for each kernel can be determined before the model is constructed and trained. When the text to be labeled without punctuation marks is input into the punctuation prediction model, long-distance context information is obtained through the longer first convolution kernel and short-distance context information through the shorter second convolution kernel, so that punctuation prediction is based on both. This ensures the comprehensiveness of the context information used for prediction, avoids the limitation of context information of a single length, and allows a more accurate punctuation mark label sequence to be output. Therefore, when the text to be labeled and the punctuation mark label sequence are integrated, the generated text labeled with punctuation marks is more accurate; that is, punctuation marks are predicted more accurately for the text.

In addition, by selecting a proper number of convolution kernels and a proper length for each convolution kernel when constructing and training the model, the size of the model can be reasonably controlled while the final effect of the model is ensured, thereby reducing system overhead.

In one embodiment, in step S206, iteratively training the punctuation prediction model to be trained based on the training data set to obtain the punctuation prediction model includes: in each iteration, inputting the sample texts in the training data set into the punctuation prediction model to be trained in the current iteration, and outputting punctuation mark prediction label sequences corresponding to the sample texts; and adjusting the model parameters of the punctuation prediction model to be trained in the current iteration according to the loss value between the punctuation mark prediction label sequence and the corresponding punctuation mark sample label sequence until an iteration stop condition is met, thereby obtaining the trained punctuation prediction model.

The punctuation mark prediction label sequence is a set of punctuation mark labels predicted in the iterative training process.

To facilitate understanding of the punctuation mark prediction label sequence and the punctuation mark sample label sequence, an example is given. Suppose numbers are used to indicate the punctuation conditions, where 0 corresponds to no punctuation, 1 corresponds to a comma, 2 corresponds to a period, 3 corresponds to a question mark, and 4 corresponds to an exclamation mark, and suppose the initial text is "I forgot to bring my umbrella today, will it rain?". Then the punctuation mark sample label sequence pre-labeled for this sentence is (0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3). During iterative training, the punctuation mark prediction label sequence predicted by the punctuation prediction model may be (0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2), from which it can be seen that there is a difference between the punctuation mark sample label sequence (0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3) and the punctuation mark prediction label sequence (0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2).

Specifically, the server may perform vector conversion on each character of the sample texts in the training data set, generating a corresponding character vector for each character and thereby obtaining a character vector set. The server inputs the character vector set into the punctuation prediction model to be trained, performs convolution processing on it through a plurality of convolution kernels with different lengths in the model, splices the output vectors of the convolution kernels, inputs the spliced vector into the fully connected layer, and performs classification training to obtain the punctuation mark prediction label sequence. The server can compute the loss value between the punctuation mark prediction label sequence and the corresponding punctuation mark sample label sequence and adjust the model parameters of the punctuation prediction model to be trained accordingly. It will be appreciated that the model parameters of the punctuation prediction model to be trained may be adjusted in the direction of decreasing loss value.

In one embodiment, the punctuation prediction model to be trained may be a text convolutional neural network model, i.e. a neural network model for punctuation prediction of text based on the TextCNN (Text Convolutional Neural Networks) model framework. The server may input the character vector set converted from each character of the sample text into the text convolutional neural network model in order. The text convolutional neural network model can use a plurality of convolution kernels to encode and predict each character in the input character vector set in parallel, and the output vectors of the convolution kernels are spliced and input into the fully connected layer for classification training. It can be understood that training with the text convolutional neural network model allows parallel encoding and parallel prediction, which reduces the inference time of the model and improves training efficiency.

In one embodiment, the plurality of convolution kernels in the text convolutional neural network model to be trained may include a convolution kernel for acquiring short-distance context information and a convolution kernel for acquiring long-distance context information, and the text convolutional neural network model may concatenate the feature vectors output by the convolution kernels and input the concatenated vector into the fully connected layer for classification training. It can be understood that performing convolution with both a kernel for short-distance context information and a kernel for long-distance context information allows short-distance and long-distance context information to be taken into account together, so the trained punctuation prediction model is more accurate and subsequent prediction accuracy is improved.
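Putting the pieces of this embodiment together, the following PyTorch sketch illustrates, under assumed dimensions, one such training iteration: character vectors are convolved by a short and a long kernel in parallel branches, the outputs are spliced and fed to a fully connected layer, and the loss between the predicted and sample label sequences drives the parameter update. This is a minimal sketch, not the patent's exact implementation; the vocabulary size, embedding dimension, channel counts, learning rate, and batch shapes are assumptions, while the kernel lengths 3 and 6 and the five-label scheme follow the examples in this description. The padding="same" option requires PyTorch 1.9 or later.

```python
import torch
import torch.nn as nn

NUM_LABELS = 5      # 0: none, 1: comma, 2: period, 3: question mark, 4: exclamation mark
VOCAB_SIZE = 6000   # hypothetical character vocabulary size
EMB_DIM = 128       # hypothetical character vector dimension

class PunctuationTextCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        # Short kernel gathers short-distance (keyword) context, long kernel gathers long-distance context.
        self.conv_short = nn.Conv1d(EMB_DIM, 64, kernel_size=3, padding="same")
        self.conv_long = nn.Conv1d(EMB_DIM, 64, kernel_size=6, padding="same")
        self.classifier = nn.Linear(64 + 64, NUM_LABELS)   # fully connected classification layer

    def forward(self, char_ids):                       # char_ids: (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)       # (batch, EMB_DIM, seq_len)
        short_ctx = torch.relu(self.conv_short(x))     # short-distance context features
        long_ctx = torch.relu(self.conv_long(x))       # long-distance context features
        feats = torch.cat([short_ctx, long_ctx], dim=1)    # splice the feature vectors
        return self.classifier(feats.transpose(1, 2))      # per-character label logits

# One training iteration on a toy batch of sample texts and their sample label sequences.
model = PunctuationTextCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

char_ids = torch.randint(0, VOCAB_SIZE, (8, 40))   # sample texts as character ids
labels = torch.randint(0, NUM_LABELS, (8, 40))     # punctuation mark sample label sequences
logits = model(char_ids)                           # punctuation mark prediction logits
loss = loss_fn(logits.reshape(-1, NUM_LABELS), labels.reshape(-1))
loss.backward()                                    # loss between prediction and sample labels
optimizer.step()                                   # adjust model parameters toward lower loss
optimizer.zero_grad()
```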

In the above embodiment, the punctuation prediction model is iteratively trained according to the loss value between the punctuation prediction tag sequence and the corresponding punctuation sample tag sequence, so that an accurate punctuation prediction model can be obtained, and the subsequent prediction accuracy is improved.

In one embodiment, as shown in fig. 3, step S208, namely, inputting the text to be annotated without punctuation marks into the punctuation prediction model, and outputting a punctuation mark label sequence corresponding to the text to be annotated (referred to as punctuation mark label sequence prediction step for short) specifically includes the following steps:

s302, inputting the text to be marked into the punctuation prediction model, and performing convolution processing on the text sequence to be marked through a plurality of convolution cores in the punctuation prediction model.

Specifically, the server may input the text to be annotated into the punctuation prediction model. The punctuation prediction model comprises a first convolution kernel for acquiring long-distance context information and a second convolution kernel for acquiring short-distance context information, so that long-distance context information extraction and short-distance context information extraction can be performed on the text sequence to be labeled through the first convolution kernel and the second convolution kernel respectively, yielding the feature vectors. The feature vectors include the global feature vector extracted based on the first convolution kernel and the keyword-focused feature vector extracted based on the second convolution kernel.

For example, if the punctuation prediction model is a text convolutional neural network model using a second convolution kernel of length 3 and a first convolution kernel of length 6, the second convolution kernel of length 3 performs feature extraction focused on keywords, yielding the keyword-focused feature vector, and the first convolution kernel of length 6 performs global feature extraction, yielding the global feature vector.

And S304, splicing the feature vectors obtained by performing convolution processing on the convolution kernels.

It can be understood that the server may splice the extracted global feature vector and the feature vector extracted by focusing on the keyword to obtain a spliced vector.

In one embodiment, the server may directly concatenate the global feature vector and the feature vector extracted by focusing on the keyword end to end, to obtain a concatenated vector.

In other embodiments, the server may also split the global feature vector and the keyword-focused feature vector according to a preset splitting rule, and then combine and splice the split vectors according to a preset combination rule to obtain the spliced vector.

It should be noted that the vectors for performing the concatenation are not limited to the global feature vector and the feature vector focused on the keyword extraction, and may also include other vectors capable of performing a feature characterization function, which is not limited to this.

And S306, punctuation prediction is carried out on the basis of the spliced vectors to obtain punctuation mark label sequences corresponding to the texts to be marked.

Specifically, the server can perform punctuation prediction based on the spliced vector through the punctuation prediction model to obtain the punctuation mark label corresponding to each character in the text to be labeled, and then arrange the labels according to the order of the characters in the text to be labeled to obtain the punctuation mark label sequence corresponding to the text to be labeled.

In one embodiment, the server may directly generate the punctuation mark label sequence corresponding to the text to be labeled from the predicted labels in order. In another embodiment, the server may also convert the predicted labels and generate the punctuation mark label sequence corresponding to the text to be labeled from the converted labels in order.

In this embodiment, when punctuation mark prediction is performed on the text to be labeled, the global feature vector and the keyword-focused feature vector are extracted through the plurality of convolution kernels in the punctuation prediction model, and prediction is performed on the spliced vector, so that both short-distance and long-distance context information can be taken into account and more accurate prediction can be performed.

In one embodiment, performing punctuation prediction based on the spliced vector to obtain the punctuation mark label sequence corresponding to the text to be labeled comprises: performing punctuation mark label prediction on each character in the text to be annotated based on the spliced vector to obtain a punctuation prediction result corresponding to each character, where the punctuation prediction result corresponding to each character comprises the probability that the character corresponds to each preset punctuation mark label; and, for each character in the text to be labeled, selecting the punctuation mark label with the maximum probability value from the punctuation prediction result corresponding to the character as the punctuation mark label finally corresponding to that character, to obtain the punctuation mark label sequence corresponding to the text to be labeled.

Specifically, the server may perform punctuation mark label prediction on each character in the text to be annotated based on the spliced vector to obtain a punctuation prediction result corresponding to each character; the punctuation prediction result corresponding to each character comprises the probability that the character corresponds to each preset punctuation mark label. It can be understood that there are multiple preset punctuation mark labels, so each character has a corresponding probability under each preset label, and the server can select the label with the maximum probability value as the punctuation mark label finally corresponding to that character. The server can then arrange the final labels of the characters in order to obtain the punctuation mark label sequence corresponding to the text to be labeled.

For example, the preset punctuation mark labels may be the set of labels for no punctuation, comma, period, question mark, and exclamation mark. The punctuation prediction result then comprises the probabilities of five punctuation conditions: no symbol after the character, a comma after the character, a period after the character, a question mark after the character, and an exclamation mark after the character. The punctuation condition with the maximum probability value is taken as the result, and specific symbols are used to represent both the condition that no symbol follows the character and the conditions that a comma, period, question mark, or exclamation mark follows the character.

In one embodiment, the server is preset with a memory for storing punctuation mark labels, called a punctuation mark label sequence memory, whose size is the total number of characters in the text to be labeled, and a memory for storing the probabilities of the punctuation conditions, called a probability storage memory for short, whose size is the number of preset punctuation mark labels multiplied by the total number of characters in the text to be labeled multiplied by the number of bytes occupied by a floating point number; the memory occupied by each character is thus the number of punctuation mark labels multiplied by the number of bytes occupied by a floating point number. The probability of each preset punctuation mark label predicted for each character is stored in the probability storage memory, the stored probabilities can be compared, and the preset punctuation mark label with the maximum probability value is selected as the final punctuation mark label of the character. The final punctuation mark labels corresponding to the characters are then added to the punctuation mark label sequence memory in order.

In one embodiment, the server may take the position number corresponding to the maximum probability value in the probability storage memory as the final punctuation mark label of the character, and the punctuation mark labels corresponding to the characters can then be added to the punctuation mark label sequence memory in order. For example, for one character of "I forgot my umbrella, will it rain?", the prediction result is (0.1, 0.4, 0.2, 0.1, 0.2), and this prediction result is stored in the probability storage memory. It can be understood that 0.1 corresponds to position number 0 and 0.4 corresponds to position number 1; the position number 1 corresponding to the maximum probability value 0.4 is written into the punctuation mark label sequence memory as the punctuation mark label.
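A few lines of Python make this selection concrete; the probability vector is the one from the example and the position numbering is the 0-4 label scheme used throughout this description.

```python
import numpy as np

probabilities = np.array([0.1, 0.4, 0.2, 0.1, 0.2])  # one character's punctuation prediction result
label = int(np.argmax(probabilities))                # position number of the maximum probability value
print(label)                                         # -> 1, i.e. a comma follows this character
```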

In the embodiment, punctuation mark label prediction is performed on each character in the text to be annotated based on the spliced vector, so that a punctuation prediction result corresponding to each character is obtained; the punctuation prediction result corresponding to each character comprises the probability of each preset punctuation mark label, and the punctuation mark label with the maximum probability value is selected as the punctuation mark label corresponding to the character finally, so that the punctuation mark label of each character can be obtained more accurately, and a more accurate punctuation mark label sequence can be obtained.

In one embodiment, step S210, namely integrating the text to be labeled and the punctuation mark label sequence to generate the text labeled with punctuation marks, specifically includes: each character in the text to be labeled has a corresponding punctuation mark label in the punctuation mark label sequence; punctuation mark labels having corresponding punctuation marks are determined from the punctuation mark label sequence; and for each determined punctuation mark label, the punctuation mark corresponding to that label is inserted after the character in the text to be labeled that corresponds to the label, generating the text labeled with punctuation marks.

It will be appreciated that each character has a corresponding punctuation mark label in the punctuation mark label sequence, and the labels are of two types: labels that have a corresponding punctuation mark and labels that do not correspond to any punctuation mark (i.e., that indicate the character is not followed by a punctuation mark). The server may identify the punctuation mark labels having corresponding punctuation marks from the punctuation mark label sequence, and then, for each identified label having a corresponding punctuation mark, insert that punctuation mark after the character corresponding to the label, generating the text labeled with punctuation marks.

For example, suppose the punctuation mark labels for the characters of "I forgot my umbrella, will it rain?" are, in order, (0, 0, 0, 0, 1, 0, 0, 0, 0, 3). Here "0" is a label that does not correspond to any punctuation mark, while 1 and 3 are labels having corresponding punctuation marks, with 1 representing a comma and 3 representing a question mark. The server can insert the comma represented by 1 after the character corresponding to 1, insert the question mark represented by 3 after the character corresponding to 3, and insert no punctuation mark after the characters corresponding to "0", thereby obtaining the text labeled with punctuation marks, as in the sketch below.
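A minimal sketch of this integration step follows. The label-to-mark mapping (1: comma, 2: period, 3: question mark, 4: exclamation mark; 0: no mark) matches the numbering used in the examples, while the sample text and its label sequence are hypothetical and chosen only to be mutually consistent.

```python
LABEL_TO_PUNCT = {1: "，", 2: "。", 3: "？", 4: "！"}   # labels that have corresponding punctuation marks

def integrate(text: str, label_seq) -> str:
    pieces = []
    for ch, label in zip(text, label_seq):
        pieces.append(ch)
        if label in LABEL_TO_PUNCT:                  # label 0 means no punctuation follows this character
            pieces.append(LABEL_TO_PUNCT[label])     # insert the mark right after its character
    return "".join(pieces)

print(integrate("我忘记带伞了会下雨吗", [0, 0, 0, 0, 0, 1, 0, 0, 0, 3]))
# -> 我忘记带伞了，会下雨吗？
```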

In this embodiment, each character has a corresponding punctuation mark label, that is, the characters and the labels are in one-to-one correspondence. When integrating the text and the labels, only the labels having corresponding punctuation marks need to be identified from the punctuation mark label sequence; for each such label, the corresponding punctuation mark can be directly inserted after the corresponding character. In this case, the punctuation marks can be quickly inserted according to the correspondence between characters and labels, so the text labeled with punctuation marks is generated quickly.

In one embodiment, the punctuation prediction model is a text convolutional neural network model for punctuation prediction of text. Inputting the text to be labeled into the punctuation prediction model and performing convolution processing on the text sequence to be labeled through a plurality of convolution kernels in the punctuation prediction model includes: encoding each character of the text to be labeled into a corresponding character vector to obtain a character vector set, and inputting the character vector set into the text convolutional neural network model; and performing convolution processing on the character vectors in the character vector set in parallel through the text convolutional neural network model.

It can be appreciated that the text convolutional neural network model enables parallel processing. The server can encode each character of the text to be labeled into a corresponding character vector to obtain a character vector set, and input the character vector set into the text convolutional neural network model, so that the character vectors in the set are convolved in parallel by the model.

For example, if the text to be labeled is "I forgot to bring my umbrella today will it rain" (with no punctuation), each character is encoded into a character vector to obtain the character vector set corresponding to the text, and the character vector set is input into the text convolutional neural network model so that each character vector can be processed by convolution in parallel.
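A small sketch of preparing the input for this parallel pass is shown below; the vocabulary and function names are hypothetical, and `model` refers to the PunctuationTextCNN sketch given earlier in this description.

```python
import torch

# Hypothetical character vocabulary; index 0 is reserved for unknown characters.
vocab = {ch: i for i, ch in enumerate("我今天忘记带伞了请问会下雨吗", start=1)}

def encode(text: str) -> torch.Tensor:
    ids = [vocab.get(ch, 0) for ch in text]
    return torch.tensor([ids])        # shape (1, seq_len): the whole text handled in one parallel pass

char_ids = encode("我今天忘记带伞了请问会下雨吗")
# logits = model(char_ids)            # per-character punctuation label logits from the earlier sketch
```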

In this embodiment, a text convolution neural network model can be used to perform convolution processing in parallel, thereby improving the efficiency of punctuation prediction.

It should be understood that although the various steps in the flow charts of FIGS. 2-3 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution order of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 4, there is provided a punctuation marking device 400 comprising: an obtaining module 402, a determining module 404, a training module 406, and a labeling module 408, wherein:

an obtaining module 402, configured to obtain a training data set; the training data set comprises a plurality of groups of sample data, and each group of sample data comprises a sample text without punctuation and a corresponding punctuation mark sample label sequence.

A determining module 404, configured to analyze a text length of each sample text in the training data set, and determine the number of convolution kernels and a length of each convolution kernel according to the text length of each sample text; and constructing a punctuation prediction model to be trained according to the number of the convolution kernels and the length of each convolution kernel.

The training module 406 is configured to perform iterative training on the punctuation prediction model to be trained based on the training data set to obtain a punctuation prediction model; the punctuation prediction model comprises a plurality of convolution kernels, and the plurality of convolution kernels comprise a first convolution kernel and a second convolution kernel; the first convolution kernel is used for acquiring long-distance context information; the second convolution kernel is used to obtain context information for short distances.

The labeling module 408 is configured to input a to-be-labeled text without punctuation marks into the punctuation prediction model, output a punctuation mark tag sequence corresponding to the to-be-labeled text, and integrate the to-be-labeled text and the punctuation mark tag sequence to generate a text labeled with the punctuation marks.

In one embodiment, the training module 406 is further configured to, in each iteration, input the sample texts in the training data set into the punctuation prediction model to be trained in the current round and output punctuation mark prediction label sequences corresponding to the sample texts; and to adjust the model parameters of the punctuation prediction model to be trained in the current round according to the loss value between each punctuation mark prediction label sequence and the corresponding punctuation mark sample label sequence, until an iteration stop condition is met, so as to obtain the trained punctuation prediction model.
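
As an illustrative sketch of such iterative training (with an assumed optimizer, loss function and stop condition, none of which are fixed by this application):

```python
# A minimal training-loop sketch, under assumed settings: predict a label sequence for
# each sample text, compute the loss against the sample label sequence, and adjust the
# model parameters until a stop condition (here an assumed fixed epoch count) is met.
import torch
import torch.nn as nn

def train(model, data_loader, num_epochs=10, lr=1e-3, device="cpu"):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()                # per-character label loss
    model.to(device).train()
    for epoch in range(num_epochs):                  # iteration stop condition (assumed)
        for char_ids, label_ids in data_loader:      # (batch, seq_len) each
            char_ids, label_ids = char_ids.to(device), label_ids.to(device)
            logits = model(char_ids)                 # (batch, seq_len, num_labels)
            loss = criterion(logits.reshape(-1, logits.size(-1)), label_ids.reshape(-1))
            optimizer.zero_grad()
            loss.backward()      # loss between predicted and sample label sequences
            optimizer.step()     # adjust model parameters of the current round
    return model
```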

In one embodiment, as shown in FIG. 5, the labeling module 408 comprises: a model input module 408a, a model output module 408b, and a punctuation prediction module 408c; wherein:

the model input module 408a is configured to input a text to be annotated into the punctuation prediction model, so as to perform convolution processing on the text sequence to be annotated through a plurality of convolution kernels in the punctuation prediction model.

The model output module 408b is configured to splice the feature vectors obtained from the convolution processing of each convolution kernel; the feature vectors include global feature vectors extracted based on the first convolution kernel and keyword-focused feature vectors extracted based on the second convolution kernel.

The punctuation prediction module 408c is configured to perform punctuation prediction based on the spliced vectors to obtain a punctuation mark label sequence corresponding to the text to be labeled.

In one embodiment, the punctuation prediction module 408c is further configured to perform punctuation mark label prediction on each character in the text to be labeled based on the spliced vector, so as to obtain a punctuation prediction result corresponding to each character; the punctuation prediction result corresponding to each character comprises the probability that the character corresponds to each preset punctuation mark label; and aiming at each character in the text to be labeled, selecting a punctuation mark label with the maximum probability value from punctuation prediction results corresponding to the characters as a punctuation mark label finally corresponding to the character to obtain a punctuation mark label sequence corresponding to the text to be labeled.
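
For illustration, selecting, for each character, the label with the maximum probability value may be sketched as follows; the label set shown is an assumption made only for this example.

```python
# A minimal sketch of choosing, for each character, the punctuation label with the
# highest predicted probability from the per-character prediction results.
import torch

LABELS = ["O", "，", "。", "？", "！"]  # assumed label set; "O" means no punctuation

def decode_labels(logits):
    """logits: (seq_len, num_labels) scores for one text; returns one label per character."""
    probs = torch.softmax(logits, dim=-1)   # probability of each preset punctuation label
    best = probs.argmax(dim=-1)             # label with the maximum probability value
    return [LABELS[i] for i in best.tolist()]
```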

In one embodiment, each character in the text to be labeled has a corresponding punctuation mark label in the punctuation mark label sequence. The labeling module 408 is further configured to determine, from the punctuation mark label sequence, the labels that have corresponding punctuation marks; and, for each determined punctuation mark label, insert the punctuation mark corresponding to that label after the character in the text to be labeled to which the label corresponds, thereby generating the text labeled with punctuation marks.

In one embodiment, the punctuation prediction model is a text convolutional neural network model for punctuation prediction on text. The model input module 408a is further configured to encode each character of the text to be labeled into a corresponding word vector to obtain a word vector set, input the word vector set into the text convolutional neural network model, and perform convolution processing on the word vectors in the word vector set in parallel through the text convolutional neural network model.

In the above punctuation mark labeling apparatus and punctuation mark labeling method, a training data set comprising a plurality of groups of sample data is obtained, each group of sample data comprising a sample text without punctuation and a corresponding punctuation mark sample label sequence; the text length of each sample text in the training data set is analyzed, and the number of convolution kernels and the length of each convolution kernel are determined according to the text lengths; a punctuation prediction model to be trained is constructed according to the number of convolution kernels and the length of each convolution kernel; and the punctuation prediction model to be trained is iteratively trained based on the training data set to obtain the punctuation prediction model. By analyzing the lengths of the texts in the training data set, a suitable number of convolution kernels and a suitable length for each convolution kernel can be determined, so that the model is constructed for training. When the text to be labeled without punctuation marks is input into the punctuation prediction model, long-distance context information can be obtained through the longer first convolution kernel and short-distance context information can be obtained through the shorter second convolution kernel; performing punctuation prediction based on both ensures the comprehensiveness of the context information used for prediction, avoids the limitation of context information of a single length, and therefore allows a more accurate punctuation mark label sequence to be output. Integrating the text to be labeled with the punctuation mark label sequence thus generates a more accurate text labeled with punctuation marks; that is, punctuation marks can be predicted for the text more accurately.

For the specific limitations of the punctuation mark labeling apparatus, reference may be made to the above limitations on the punctuation mark labeling method, which are not repeated here. Each module in the punctuation mark labeling apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server or a terminal, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a punctuation marking method.

Those skilled in the art will appreciate that the structure shown in fig. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium and which, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
