Text processing method and related device

Document No.: 1170281    Publication date: 2020-09-18

Reading note: This technology, "Text processing method and related device", was designed and created by 吴悠 on 2020-05-14. Its main content is as follows: The application relates to the field of artificial intelligence and provides a text processing method and a related device. The text processing method comprises: acquiring a text to be detected; acquiring a title text and a body text from the text to be detected; determining a target feature vector of the text to be detected according to the title text and the body text; and inputting the target feature vector into a neural network obtained by pre-training to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that it is not off-topic. The technical solution of the embodiments of the present application can improve the efficiency and accuracy of off-topic detection on texts. The present application can be used in the field of smart education, thereby promoting the construction of smart cities.

1. A method of text processing, comprising:

acquiring a text to be detected;

acquiring a title text and a body text from the text to be detected;

determining a target feature vector of the text to be detected according to the title text and the body text;

and inputting the target feature vector into a neural network obtained by pre-training to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that the text to be detected is not off-topic.

2. The method according to claim 1, wherein determining the target feature vector of the text to be detected according to the title text and the body text comprises:

acquiring a first feature vector of the title text and a second feature vector of the body text;

determining a third feature vector according to the first feature vector and the second feature vector, wherein the third feature vector is a combined similarity feature of the title text and the body text;

and determining the target feature vector according to the first feature vector, the second feature vector and the third feature vector.

3. The method of claim 2, wherein the first feature vector is a word frequency matrix of the title text, the second feature vector is a word frequency matrix of the body text, and acquiring the first feature vector of the title text and the second feature vector of the body text comprises:

acquiring all words in the title text;

calculating the word frequency of each word in all words in the title text;

determining a word frequency matrix of the title text according to all words in a preset corpus and the word frequency of each word in all words in the title text;

acquiring all words in the body text;

calculating the word frequency of each word in all words in the body text;

and determining a word frequency matrix of the body text according to all words in the preset corpus and the word frequency of each word in all words in the body text.

4. The method of claim 3, wherein determining a third feature vector according to the first feature vector and the second feature vector comprises:

determining a word frequency inverse text matrix of the title text according to the word frequency matrix of the title text;

determining a word frequency inverse text matrix of the body text according to the word frequency matrix of the body text;

and calculating the cosine similarity of the word frequency inverse text matrix of the title text and the word frequency inverse text matrix of the body text to obtain the combined similarity feature.

5. The method according to any one of claims 2 to 4, wherein the target feature vector is obtained by splicing the first feature vector, the second feature vector and the third feature vector according to a preset sequence, and the obtaining of the detection result by inputting the target feature vector into a neural network obtained by pre-training comprises:

inputting the target feature vector into the neural network to obtain a neural network output value;

and mapping the neural network output value into a prediction probability through a normalized exponential function to obtain the detection result.

6. The method of claim 3, wherein the neural network is trained by:

acquiring a preset number of texts;

acquiring a title text of each text and a body text of each text from the preset number of texts;

processing the title text of each text and the body text of each text according to a first preset processing mode or a second preset processing mode to obtain a text sample;

determining a label value corresponding to the text sample according to the mode in which the title text of each text and the body text of each text are processed, wherein the label value is used for labeling the probability that the text sample is off-topic or the probability that the text sample is not off-topic;

inputting the text sample and the label value into the neural network to obtain a loss;

and adjusting network parameters of the neural network according to the loss.

7. The method of claim 6, wherein after the obtaining the preset number of texts, the method further comprises:

acquiring all words in the preset number of texts;

carrying out lower case conversion and de-duplication processing on all words in the preset number of texts to obtain a target word set;

calculating the word frequency of each word in the target word set;

sorting the words in the target word set in descending order of word frequency;

and obtaining the first M sorted words to form the preset corpus, wherein M is a positive integer.

8. A text processing apparatus, characterized in that the apparatus comprises:

the first acquisition module is used for acquiring a text to be detected;

the second acquisition module is used for acquiring a title text and a body text from the text to be detected;

the determining module is used for determining a target feature vector of the text to be detected according to the title text and the body text;

and the detection module is used for inputting the target feature vector into a neural network obtained by pre-training to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that the text to be detected is not off-topic.

9. An electronic device, comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of any of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of deep learning technology in artificial intelligence, and in particular, to a text processing method and related apparatus.

Background

At present, composition writing is an important part of teaching. To review a composition, it needs to be scored or evaluated along multiple dimensions, and detecting whether the text is off-topic is one of those dimensions.

However, whether a text is off-topic currently needs to be detected manually. When the number of texts to be detected is large, a large amount of time is needed, so the efficiency of off-topic detection is low. Moreover, different people are influenced by subjectivity when reviewing the same text and the detection results they obtain may differ, so the accuracy of off-topic detection is also low.

Disclosure of Invention

The application provides a text processing method and a related device, which can improve the efficiency and accuracy of off-topic detection on a text.

A first aspect of the present application provides a text processing method, including:

acquiring a text to be detected;

acquiring a title text and a body text from the text to be detected;

determining a target feature vector of the text to be detected according to the title text and the body text;

and inputting the target feature vector into a neural network obtained by pre-training to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that the text to be detected is not off-topic.

A second aspect of the present application provides a text processing apparatus, the apparatus comprising:

the first acquisition module is used for acquiring a text to be detected;

the second acquisition module is used for acquiring a title text and a body text from the text to be detected;

the determining module is used for determining a target feature vector of the text to be detected according to the title text and the body text;

and the detection module is used for inputting the target feature vector into a neural network obtained by pre-training to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that the text to be detected is not off-topic.

A third aspect of the present application provides an electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of any of the first aspects of the present application.

A fourth aspect of the present application provides a computer readable storage medium having a computer program stored thereon for execution by a processor to perform some or all of the steps described in any of the methods of the first aspect of the present application.

It can be seen that, with the text processing method and the related device provided by the application, firstly a text to be detected is acquired; secondly, a title text and a body text are acquired from the text to be detected; then, a target feature vector of the text to be detected is determined according to the title text and the body text; and finally, the target feature vector is input into a neural network obtained through pre-training to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that it is not off-topic. Therefore, when it is necessary to detect whether the text to be detected is off-topic, the target feature vector of the text to be detected is determined and input into the neural network to obtain a detection result, and whether the text to be detected is off-topic is thereby determined. On the one hand, the method does not need manual detection, which saves time and improves the efficiency of off-topic detection; on the other hand, the method detects off-topic texts through the pre-trained neural network, is not influenced by human subjectivity, and improves the accuracy of off-topic detection.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.

Fig. 1 is a schematic flowchart of a text processing method according to an embodiment of the present application;

fig. 2 is a schematic flowchart of another text processing method according to an embodiment of the present application;

fig. 3 is a schematic diagram of a text processing apparatus according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present application.

Detailed Description

The text processing method and the related device provided by the embodiments of the application can improve the efficiency and accuracy of off-topic detection on a text.

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

The following describes embodiments of the present application in detail.

Referring to fig. 1, fig. 1 is a schematic flow chart of a text processing method according to an embodiment of the present application, which can be used in the field of intelligent education to promote the construction of a smart city. As shown in fig. 1, a text processing method provided in an embodiment of the present application may include:

101. and acquiring the text to be detected.

When it is required to detect whether the text to be detected is off-topic, the text to be detected is first acquired. The text to be detected can be, for example, a Chinese text, an English text, or another type of text.

102. And acquiring a title text and a body text from the text to be detected.

After the text to be detected is obtained, a title text and a body text need to be acquired from it. In a possible implementation manner, when the text to be detected is a Chinese text or an English text, the title text and the body text can be distinguished according to font size, where the font of the title text is larger than that of the body text; alternatively, the title text and the body text can be distinguished according to position, where the title text is located at the head of the text to be detected.
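As an illustration of this step, the sketch below separates a title from a body under the assumption that the text to be detected is available as a list of (line, font size) pairs; the helper name `split_title_and_body` and this document representation are hypothetical, not part of the original method:

```python
def split_title_and_body(lines):
    """Hypothetical helper: pick the line with the largest font as the title.
    With equal font sizes, max() returns the first line, which matches the
    position rule (the title is at the head of the text)."""
    title = max(lines, key=lambda item: item[1])[0]
    body = " ".join(text for text, _ in lines if text != title)
    return title, body

document = [("My Favourite Season", 18),
            ("I like autumn best because the weather is cool.", 12),
            ("The leaves turn yellow and red.", 12)]
print(split_title_and_body(document))
```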

103. And determining a target feature vector of the text to be detected according to the title text and the body text.

Optionally, the method for determining the target feature vector of the text to be detected according to the title text and the body text may be: acquiring a first feature vector of the title text and a second feature vector of the body text; determining a third feature vector according to the first feature vector and the second feature vector, wherein the third feature vector is a combined similarity feature of the title text and the body text; and determining the target feature vector according to the first feature vector, the second feature vector and the third feature vector.

In one possible embodiment, the first feature vector may be, for example, a word frequency matrix of the title text, and the second feature vector may be, for example, a word frequency matrix of the body text.

Specifically, the method for obtaining the first feature vector of the title text may be: acquiring all words in the title text, calculating the word frequency of each word among all words in the title text, and determining the word frequency matrix of the title text according to all words in a preset corpus and the word frequency of each word in the title text. When all words in the title text are acquired, de-duplication needs to be performed. For example, the text to be detected is an English text, the title text is "how do you do", all words obtained after de-duplication are "how", "do", and "you", and the word frequencies of "how", "do", and "you" are calculated to be 1, 2, and 1, respectively. The preset corpus is set in advance; for example, if the words included in the preset corpus are "how", "do", "like", "you", and "is", then the word frequency matrix of the title text can be determined to be [1, 2, 0, 1, 0] according to the words included in the preset corpus.
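A minimal sketch of this step is shown below, assuming a simple whitespace tokenizer; the helper name `term_frequency_vector` and the tokenization are illustrative assumptions rather than part of the original method:

```python
from collections import Counter

def term_frequency_vector(text, corpus_words):
    """Count how often each preset-corpus word occurs in the given text."""
    counts = Counter(text.lower().split())                   # tokenize and count word frequencies
    return [counts.get(word, 0) for word in corpus_words]    # align counts to the preset corpus order

preset_corpus = ["how", "do", "like", "you", "is"]           # example preset corpus from the text above
title_text = "how do you do"
print(term_frequency_vector(title_text, preset_corpus))      # -> [1, 2, 0, 1, 0]
```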

Specifically, the method for obtaining the second feature vector of the body text may be: acquiring all words in the body text, calculating the word frequency of each word among all words in the body text, and determining the word frequency matrix of the body text according to all words in the preset corpus and the word frequency of each word in the body text. When all words in the body text are acquired, de-duplication also needs to be performed. The method for determining the word frequency matrix of the body text is similar to the method for determining the word frequency matrix of the title text, and for brevity, the description is omitted here.

Specifically, the method for determining the third feature vector according to the first feature vector and the second feature vector may be: determining a word frequency inverse text matrix of the title text according to the word frequency matrix of the title text; determining a word frequency inverse text matrix of the body text according to the word frequency matrix of the body text; and calculating the cosine similarity of the word frequency inverse text matrix of the title text and the word frequency inverse text matrix of the body text to obtain the combined similarity feature.

In a possible implementation manner, the target feature vector is obtained by splicing the first feature vector, the second feature vector and the third feature vector according to a preset order. For example, the target feature vector is obtained by sequentially splicing the first feature vector, the third feature vector and the second feature vector.

104. And inputting the target feature vector into a neural network obtained by pre-training to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that the text to be detected is not off-topic.

Optionally, the method for inputting the target feature vector into a neural network obtained by pre-training to obtain the detection result may be: inputting the target feature vector into the neural network to obtain a neural network output value; and mapping the neural network output value into a prediction probability through a normalized exponential function to obtain the detection result.

In a possible embodiment, the pre-trained neural network is trained by: acquiring a preset number of texts; acquiring a title text of each text and a body text of each text from the preset number of texts; processing the title text of each text and the body text of each text according to a first preset processing mode or a second preset processing mode to obtain a text sample; determining a label value corresponding to the text sample according to the mode in which the title text of each text and the body text of each text are processed, wherein the label value is used for labeling the probability that the text sample is off-topic or the probability that it is not off-topic; inputting the text sample and the label value into the neural network to obtain a loss; and adjusting network parameters of the neural network according to the loss.

Further, in a possible implementation, after the preset number of texts are obtained, the method further includes: acquiring all words in the preset number of texts; carrying out lower-case conversion and de-duplication processing on all words in the preset number of texts to obtain a target word set; calculating the word frequency of each word in the target word set; sorting the words in the target word set in descending order of word frequency; and obtaining the first M sorted words to form the preset corpus, wherein M is a positive integer.

It can be seen that, with the text processing method provided by the embodiment of the application, firstly a text to be detected is acquired; secondly, a title text and a body text are acquired from the text to be detected; then, a target feature vector of the text to be detected is determined according to the title text and the body text; and finally, the target feature vector is input into a neural network obtained through pre-training to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that it is not off-topic. Therefore, when it is necessary to detect whether the text to be detected is off-topic, the target feature vector of the text to be detected is determined and input into the neural network to obtain a detection result, and whether the text to be detected is off-topic is thereby determined. On the one hand, the method does not need manual detection, which saves time and improves the efficiency of off-topic detection; on the other hand, the method detects off-topic texts through the pre-trained neural network, is not influenced by human subjectivity, and improves the accuracy of off-topic detection.

Referring to fig. 2, fig. 2 is a schematic flow chart of another text processing method provided in the embodiment of the present application, which can be used in the field of smart education to promote the construction of a smart city. As shown in fig. 2, another text processing method provided in the embodiment of the present application may include:

201. and acquiring the text to be detected.

When it is required to detect whether the text to be detected is off-topic, the text to be detected is first acquired. The text to be detected can be, for example, a Chinese text, an English text, or another type of text.

202. And acquiring a title text and a body text from the text to be detected.

After the text to be detected is obtained, a title text and a body text need to be acquired from it. In a possible implementation manner, when the text to be detected is a Chinese text or an English text, the title text and the body text can be distinguished according to font size, where the font of the title text is larger than that of the body text; alternatively, the title text and the body text can be distinguished according to position, where the title text is located at the head of the text to be detected.

203. And acquiring a first feature vector of the title text and a second feature vector of the body text.

In one possible embodiment, the first feature vector may be, for example, a word frequency matrix of the title text, and the second feature vector may be, for example, a word frequency matrix of the body text.

Specifically, the method for obtaining the first feature vector of the title text may be: acquiring all words in the title text, calculating the word frequency of each word among all words in the title text, and determining the word frequency matrix of the title text according to all words in a preset corpus and the word frequency of each word in the title text. When all words in the title text are acquired, de-duplication needs to be performed. For example, the text to be detected is an English text, the title text is "how do you do", all words obtained after de-duplication are "how", "do", and "you", and the word frequencies of "how", "do", and "you" are calculated to be 1, 2, and 1, respectively. The preset corpus is set in advance; for example, if the words included in the preset corpus are "how", "do", "like", "you", and "is", then the word frequency matrix of the title text can be determined to be [1, 2, 0, 1, 0] according to the words included in the preset corpus.

Specifically, the method for obtaining the second feature vector of the body text may be: acquiring all words in the body text, calculating the word frequency of each word among all words in the body text, and determining the word frequency matrix of the body text according to all words in the preset corpus and the word frequency of each word in the body text. When all words in the body text are acquired, de-duplication also needs to be performed. The method for determining the word frequency matrix of the body text is similar to the method for determining the word frequency matrix of the title text, and for brevity, the description is omitted here.

204. And determining a third feature vector according to the first feature vector and the second feature vector, wherein the third feature vector is a combined similarity feature of the title text and the body text.

Specifically, the method for determining the third feature vector according to the first feature vector and the second feature vector may be: determining a word frequency inverse text (TF-IDF) matrix of the title text according to the word frequency matrix of the title text; determining a word frequency inverse text matrix of the body text according to the word frequency matrix of the body text; and calculating the cosine similarity of the word frequency inverse text matrix of the title text and the word frequency inverse text matrix of the body text to obtain the combined similarity feature. The word frequency inverse text matrix of the title text is determined from the word frequency matrix of the title text according to the following formula:

$$\text{TF-IDF}(x) = \text{TF}(x)\cdot\ln\frac{N+1}{N(x)+1} + 1$$

where TF-IDF(x) is the entry of the word frequency inverse text matrix of the title text for word x, TF(x) is the entry of the word frequency matrix of the title text for word x, x is a word in the title text, N is the number of texts included in the preset corpus, and N(x) is the number of texts in the preset corpus that contain x. That is, a TF-IDF value can be calculated for each word in the title text through the above formula, and the TF-IDF values calculated for all the words are then combined into a vector to obtain the word frequency inverse text matrix of the title text.

For example, suppose the preset corpus includes 100 texts, the title text of the text to be detected is "how do you do", and the TF-IDF value of the word "do" is to be calculated. The word "do" appears twice in the title text, and 80 texts in the preset corpus include the word "do", so that:

TF-IDF(do) = 2 × ln((100+1)/(80+1)) + 1 ≈ 1.4413

Similarly, the values of TF-IDF(how) and TF-IDF(you) can be calculated, and combining them into a vector gives the word frequency inverse text matrix of the title text.
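The original formula image is not available, so the sketch below follows the form reconstructed above (TF times the smoothed log ratio, plus one), which reproduces the worked value; treat the exact formula as an assumption:

```python
import math

def tf_idf(tf, n_texts, n_texts_with_word):
    """TF-IDF value following the reconstructed formula: TF(x) * ln((N+1)/(N(x)+1)) + 1."""
    return tf * math.log((n_texts + 1) / (n_texts_with_word + 1)) + 1

# Worked example from the text: "do" appears twice in the title text,
# and 80 of the 100 preset-corpus texts contain "do".
print(round(tf_idf(2, 100, 80), 4))   # -> 1.4413
```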

Specifically, the method for determining the word frequency inverse text matrix of the body text according to the word frequency matrix of the body text is similar to the method for determining the word frequency inverse text matrix of the title text, and for brevity, no further description is provided here.

Specifically, after the word frequency inverse text matrix of the title text and the word frequency inverse text matrix of the body text are obtained, the cosine similarity of the two matrices is calculated and used as the combined similarity feature of the title text and the body text, where the formula for calculating the cosine similarity is:

$$\text{similarity} = \cos\theta = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert}$$

where similarity refers to the cosine similarity, and A and B are the word frequency inverse text matrix of the title text and the word frequency inverse text matrix of the body text, respectively.
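A short sketch of the cosine similarity computation is given below; the TF-IDF values are placeholder numbers for illustration only:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity (A . B) / (||A|| * ||B||) between two TF-IDF vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

title_tfidf = [1.44, 1.02, 0.0, 1.10, 0.0]    # placeholder TF-IDF matrix of the title text
body_tfidf  = [0.30, 2.50, 0.80, 1.70, 0.40]  # placeholder TF-IDF matrix of the body text
print(cosine_similarity(title_tfidf, body_tfidf))
```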

205. And determining a target feature vector according to the first feature vector, the second feature vector and the third feature vector.

In a possible implementation manner, the target feature vector is obtained by splicing the first feature vector, the second feature vector and the third feature vector according to a preset order. For example, the target feature vector is obtained by sequentially splicing the first feature vector, the third feature vector and the second feature vector. For example, if the preset corpus includes 3000 words, the vector dimension of the finally obtained target feature vector is 1 × 6001.
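The splicing step could look like the sketch below, assuming a 3000-word preset corpus and the order described above (first, third, second); the placeholder values are illustrative:

```python
import numpy as np

first_feature  = np.zeros(3000)     # feature vector of the title text (placeholder values)
second_feature = np.zeros(3000)     # feature vector of the body text (placeholder values)
third_feature  = np.array([0.37])   # combined similarity feature (placeholder value)

# Splice in the preset order: first feature vector, third feature vector, second feature vector.
target_feature = np.concatenate([first_feature, third_feature, second_feature])
print(target_feature.shape)         # -> (6001,), i.e. a 1 x 6001 target feature vector
```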

206. And inputting the target feature vector into a neural network obtained by pre-training to obtain a neural network output value.

For example, the neural network obtained by pre-training is a fully-connected neural network. If the preset corpus includes 3000 words and the vector dimension of the target feature vector is 1 × 6001, then correspondingly the input layer of the fully-connected network has 6001 neurons, the hidden layer has 100 neurons, and the output layer has 2 neurons. The target feature vector is input into the neural network to obtain the neural network output value.
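A framework-level sketch of such a fully-connected network is shown below (PyTorch is used here only for illustration; the activation function and the class name are assumptions not specified in the original):

```python
import torch
import torch.nn as nn

class OffTopicDetector(nn.Module):
    """Fully-connected network with the layer sizes from the text: 6001 -> 100 -> 2."""
    def __init__(self, in_dim=6001, hidden_dim=100, out_dim=2):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, out_dim)
        self.act = nn.ReLU()                          # activation choice is an assumption

    def forward(self, x):
        return self.output(self.act(self.hidden(x)))  # neural network output value (logits)

net = OffTopicDetector()
target_feature = torch.randn(1, 6001)                 # placeholder target feature vector
output_value = net(target_feature)
probabilities = torch.softmax(output_value, dim=1)    # normalized exponential function (step 207)
print(probabilities)                                  # probabilities for the two classes
```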

207. And mapping the neural network output value into a prediction probability through a normalized exponential function to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that the text to be detected is not off-topic.

In one possible embodiment, the pre-trained neural network is trained by: acquiring a preset number of texts; acquiring a title text of each text and a body text of each text from the preset number of texts; processing the title text of each text and the body text of each text according to a first preset processing mode or a second preset processing mode to obtain a text sample; determining a label value corresponding to the text sample according to the mode in which the title text of each text and the body text of each text are processed, wherein the label value is used for labeling the probability that the text sample is off-topic or the probability that it is not off-topic; inputting the text sample and the label value into the neural network to obtain a loss; and adjusting network parameters of the neural network according to the loss.

Specifically, the text sample and the label value are input into the neural network to obtain the loss, which satisfies the following formula:

$$H(p, q) = -\sum_{i=1}^{n} p(x_i)\,\log q(x_i)$$

where H(p, q) is the cross-entropy loss function, p(x_i) is the label value, i.e., the true probability, q(x_i) is the prediction probability output by the neural network, n is the number of text samples, and i indexes the i-th text sample.

In one possible implementation, during the training of the neural network, in order to avoid overfitting, L2 regularization is applied to the weights of the neural network and a dropout strategy is used.

In a possible implementation manner, in the process of training the neural network, the Adam optimization algorithm is used to update the weights of the neural network during back propagation. Back propagation updates the weights of the neural network according to the output of the neural network, and the weight update satisfies the following formula:

$$W \leftarrow W - \eta\,\frac{\partial H}{\partial W}$$

where H is the cross-entropy loss function, η is the learning rate, and W is a weight of the neural network.
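Putting the training pieces together (the cross-entropy loss, L2 regularization, dropout, and the Adam weight update), one training step might look like the framework-level sketch below; the layer sizes come from the example above, while the dropout rate, learning rate, and weight-decay value are assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                      # same 6001 -> 100 -> 2 structure as above,
    nn.Linear(6001, 100),                   # with dropout added on the hidden layer
    nn.ReLU(),
    nn.Dropout(p=0.5),                      # dropout strategy against overfitting
    nn.Linear(100, 2),
)
criterion = nn.CrossEntropyLoss()                               # cross-entropy loss H(p, q)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,       # Adam optimization algorithm
                             weight_decay=1e-4)                 # weight decay acts as L2 regularization

samples = torch.randn(8, 6001)                   # placeholder target feature vectors of 8 text samples
labels = torch.tensor([1, 0, 1, 0, 1, 1, 0, 0])  # label values: 1 = not off-topic, 0 = off-topic

optimizer.zero_grad()
loss = criterion(model(samples), labels)         # forward pass and loss
loss.backward()                                  # back propagation computes dH/dW
optimizer.step()                                 # weight update W <- W - eta * dH/dW (Adam-scaled)
```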

The derivation is as follows. Assuming that the output layer of the neural network is layer L, the output of layer L satisfies:

$$a^{L} = \sigma\!\left(z^{L}\right) = \sigma\!\left(W^{L} a^{L-1} + b^{L}\right)$$

where z^L denotes the pre-activation (non-activated) output of layer L and σ is the activation function.

When solving for the gradient of the output-layer weights W^L, there is an intermediate dependent part:

$$\frac{\partial H}{\partial W^{L}} = \frac{\partial H}{\partial z^{L}}\cdot\frac{\partial z^{L}}{\partial W^{L}}$$

Therefore, the gradient with respect to z^L can be calculated first and recorded as:

$$\delta^{L} = \frac{\partial H}{\partial z^{L}}$$

For the pre-activation output z^l of the l-th layer, its gradient can be expressed as:

$$\delta^{l} = \frac{\partial H}{\partial z^{l}} = \left(W^{l+1}\right)^{\mathrm{T}} \delta^{l+1} \odot \sigma'\!\left(z^{l}\right)$$

According to the forward propagation algorithm:

$$z^{l} = W^{l} a^{l-1} + b^{l}$$

so the gradient of the l-th layer weights W^l can be calculated as:

$$\frac{\partial H}{\partial W^{l}} = \delta^{l}\left(a^{l-1}\right)^{\mathrm{T}}$$

Finally, the weights of the neural network are updated according to the back propagation formula above.
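The derivation above can be made concrete with the minimal NumPy sketch below. It uses tiny layer sizes for readability, a sigmoid hidden activation, a softmax output with cross-entropy, and plain gradient descent instead of Adam; all of these specifics are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden, n_out = 6, 4, 2                              # tiny sizes instead of 6001 / 100 / 2
W1, b1 = rng.standard_normal((n_hidden, n_in)) * 0.1, np.zeros((n_hidden, 1))
W2, b2 = rng.standard_normal((n_out, n_hidden)) * 0.1, np.zeros((n_out, 1))

x = rng.standard_normal((n_in, 1))                           # one target feature vector (column)
y = np.array([[1.0], [0.0]])                                 # one-hot label value (class order assumed)
eta = 0.1                                                    # learning rate

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward propagation: z^l = W^l a^(l-1) + b^l, a^l = sigma(z^l).
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = np.exp(z2) / np.exp(z2).sum()                           # softmax on the output layer

# Back propagation of the cross-entropy loss H.
delta2 = a2 - y                                              # delta^L = dH/dz^L for softmax + cross-entropy
dW2 = delta2 @ a1.T                                          # dH/dW^L = delta^L (a^(L-1))^T
delta1 = (W2.T @ delta2) * a1 * (1 - a1)                     # delta^l via the chain rule and sigma'(z^l)
dW1 = delta1 @ x.T

# Weight update W <- W - eta * dH/dW.
W2 -= eta * dW2
W1 -= eta * dW1
```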

Specifically, the method for acquiring the preset number of texts may be: a preset number of composition texts are crawled from the web, and if the number of composition texts is not sufficient, the missing portion can be supplemented with news texts and/or encyclopedia texts.

In a possible embodiment, when acquiring composition texts and/or other texts, certain proportions may be used; for example, composition texts, news texts, and encyclopedia texts account for 60%, 20%, and 20%, respectively, which may improve the source richness of the text samples. These proportions are only examples, may be modified as needed, and are not limited herein.

In another possible embodiment, when obtaining composition texts, certain proportions may also be used for different types of compositions; for example, the types of English compositions include narrative texts, contrast texts, causal texts, discussion texts, and other forms of texts, and each of these types may account for 20%, so as to improve the type richness of the text samples. The above proportions are merely examples, may be modified as required, and are not limited herein.

When obtaining news texts, certain proportions may also be used for different types of news texts; for example, the news texts include political, economic, legal, military, scientific, cultural and educational, sports, and social news texts, and each of these types may account for 12.5%, which may also improve the type richness of the text samples. The above proportions are only examples, may be changed according to requirements, and are not limited herein.

Specifically, a title text of each text and a body text of each text are obtained from the preset number of texts. For example, for a text A and a text B, the title text of text A is A_title and its body text is A_text; the title text of text B is B_title and its body text is B_text.

Specifically, the first preset processing mode is as follows: for a certain text, keeping the title text of the text and the body text of the text unchanged. The second preset processing mode is as follows: for a certain text, keeping the title text of the text unchanged and replacing the body text of the text with the body text of another text.

For example, for a text A and a text B, the text sample obtained from the title text and the body text of text A by the first preset processing mode is [A_title, A_text], the corresponding label value is 1, and the probability used for labeling that the text sample is not off-topic is 1; the text sample obtained from the title text and the body text of text A by the second preset processing mode is [A_title, B_text], the corresponding label value is 0, and the probability used for labeling that the text sample is not off-topic is 0; the text sample obtained from the title text and the body text of text B by the first preset processing mode is [B_title, B_text], the corresponding label value is 1, and the probability used for labeling that the text sample is not off-topic is 1; and the text sample obtained from the title text and the body text of text B by the second preset processing mode is [B_title, A_text], the corresponding label value is 0, and the probability used for labeling that the text sample is not off-topic is 0. That is, combining the title text of text A with the body text of text A yields a text sample that is not off-topic, whereas combining the title text of text A with the body text of text B yields a text sample that is off-topic.
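The two processing modes can be sketched as follows; the helper name `build_samples` and the representation of a text as a (title, body) pair are assumptions for illustration:

```python
def build_samples(texts):
    """Build training samples from (title, body) pairs.
    Label 1 = not off-topic (original body kept), label 0 = off-topic (body swapped)."""
    samples = []
    for i, (title, body) in enumerate(texts):
        samples.append(((title, body), 1))              # first preset processing mode
        other_body = texts[(i + 1) % len(texts)][1]     # second mode: borrow another text's body
        samples.append(((title, other_body), 0))
    return samples

texts = [("A_title", "A_text"), ("B_title", "B_text")]
for sample, label in build_samples(texts):
    print(sample, label)
# (('A_title', 'A_text'), 1)  (('A_title', 'B_text'), 0)
# (('B_title', 'B_text'), 1)  (('B_title', 'A_text'), 0)
```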

Further, after a preset number of texts are obtained, the word frequencies of all words of the preset number of texts are counted. For example, the step of counting the word frequencies of all words is as follows:

(1) all words of the preset number of texts are obtained.

For example, the text A is "How do you do", and all the words obtained include "How", "do", "you", and "do".

(2) And carrying out lower case conversion and de-duplication on all the acquired words.

For example, for all the words "How", "do", "you", and "do" acquired as described above, the lower case conversion and deduplication are performed to obtain: "how", "do", "you".

(3) And (5) counting word frequency.

For example, the word frequencies of the words "how", "do", and "you" obtained by the above lower-case conversion and de-duplication are counted to be 1, 2, and 1, respectively.

After the word frequencies of all the words are counted, all the words are sorted in descending order of word frequency, and the first M words are extracted to form the preset corpus, where M is a positive integer; for example, M may be 3000, which means that the preset corpus includes 3000 words, and is not limited herein.
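A minimal sketch of building the preset corpus is given below, assuming whitespace tokenization; the helper name `build_corpus` is illustrative:

```python
from collections import Counter

def build_corpus(texts, m):
    """Return the first M words sorted by descending word frequency,
    after lower-case conversion (de-duplication is implicit in the counter keys)."""
    counter = Counter()
    for text in texts:
        counter.update(word.lower() for word in text.split())
    return [word for word, _ in counter.most_common(m)]

texts = ["How do you do", "Do you like text processing"]
print(build_corpus(texts, m=5))   # e.g. ['do', 'you', 'how', 'like', 'text']
```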

It can be seen that, with the text processing method provided in the embodiment of the present application, when it is necessary to detect whether the text to be detected is off-topic, the target feature vector of the text to be detected is determined and input into the neural network to obtain a detection result, thereby determining whether the text to be detected is off-topic. On the one hand, the method does not need manual detection, which saves time and improves the efficiency of off-topic detection; on the other hand, the method detects off-topic texts through the pre-trained neural network, is not influenced by human subjectivity, and improves the accuracy of off-topic detection. In addition, in the process of training the neural network, a large number of text samples can be constructed from the acquired texts without manual labeling, which saves time and labor cost, and the large-scale text samples also give the trained neural network stronger robustness.

Referring to fig. 3, fig. 3 is a schematic view of a text processing apparatus according to an embodiment of the present application. As shown in fig. 3, a text processing apparatus provided in an embodiment of the present application may include:

the first obtaining module 301 is configured to obtain a text to be detected;

a second obtaining module 302, configured to obtain a title text and a body text from the text to be detected;

a determining module 303, configured to determine a target feature vector of the text to be detected according to the header text and the body text;

the detection module 304 is configured to input the target feature vector into a neural network obtained through pre-training to obtain a detection result, where the detection result includes the probability that the text to be detected is off-topic or the probability that it is not off-topic.

For specific implementation of the text processing apparatus of the present application, reference may be made to various embodiments of the text processing method, which are not described herein again.

Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present application. As shown in fig. 4, an electronic device of a hardware operating environment according to an embodiment of the present application may include:

a processor 401, such as a CPU.

The memory 402 may be a high-speed RAM or a non-volatile memory such as a disk memory.

A communication interface 403 for implementing connection communication between the processor 401 and the memory 402.

Those skilled in the art will appreciate that the configuration of the electronic device shown in fig. 4 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.

As shown in fig. 4, the memory 402 may include an operating system, a network communication module, and a text processing program. An operating system is a program that manages and controls the hardware and software resources of an electronic device, supporting the operation of text processing programs and other software or programs. The network communication module is used to enable communication between the components within the memory 402 and with other hardware and software in the electronic device.

In the electronic device shown in fig. 4, the processor 401 is configured to execute the text processing program stored in the memory 402, and implement the following steps:

acquiring a text to be detected;

acquiring a title text and a body text from the text to be detected;

determining a target feature vector of the text to be detected according to the title text and the body text;

and inputting the target feature vector into a neural network obtained by pre-training to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that the text to be detected is not off-topic.

For specific implementation of the electronic device of the present application, reference may be made to the embodiments of the text processing method, which are not described herein again.

Another embodiment of the present application provides a computer-readable storage medium storing a computer program for execution by a processor to perform the steps of:

acquiring a text to be detected;

acquiring a title text and a body text from the text to be detected;

determining a target feature vector of the text to be detected according to the title text and the body text;

and inputting the target feature vector into a neural network obtained by pre-training to obtain a detection result, wherein the detection result comprises the probability that the text to be detected is off-topic or the probability that the text to be detected is not off-topic.

For specific implementation of the computer-readable storage medium of the present application, reference may be made to the embodiments of the text processing method, which are not described herein again.

It should also be noted that, while for simplicity of explanation the foregoing method embodiments have been described as a series of actions or a combination of actions, it will be appreciated by those skilled in the art that the present application is not limited by the order of actions, as some steps may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the actions and modules referred to are not necessarily required by this application. In the foregoing embodiments, the descriptions of the respective embodiments have their respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.

The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
