English mail text data processing method, device, equipment and storage medium

Document number: 907628 Publication date: 2021-02-26

Reading note: This technology, "English mail text data processing method, device, equipment and storage medium", was designed by Qi Junhui (祁俊辉) on 2020-09-25. Main content: The invention belongs to the technical field of data processing and provides an English mail text data processing method, device, equipment and storage medium. The types and corresponding quantities of the punctuation marks in the English mail text data are obtained, and the comma ratio is determined. When the comma ratio is not less than a preset ratio threshold, the commas are corrected according to an N-Gram language model, and the comma-corrected English mail text data is then segmented into sentences. When the sentence-segmented English mail text data is found to contain a long sentence whose number of commas is greater than a preset number threshold, long sentence processing is performed on the English mail text data according to the N-Gram language model to obtain the processed English mail text data. The invention solves the problem of irregular regular-expression sentence division caused by people's differing mail-writing habits, correctly divides the mail text into sentences, and provides technical support for the subsequent mining of mail data.

1. An English mail text data processing method is characterized by comprising the following steps:

acquiring text data of an English mail to be processed;

acquiring the types and the corresponding quantity of punctuation marks in the English mail text data;

determining a comma ratio according to the types and the corresponding quantities of the punctuation marks;

when the comma ratio is not less than a preset ratio threshold, correcting the commas according to a preset N-Gram language model;

performing sentence segmentation processing on the comma-corrected English mail text data according to the preset N-Gram language model;

and when it is determined that the sentence-segmented English mail text data contains a long sentence whose number of commas is greater than a preset number threshold, performing long sentence processing on the English mail text data according to the preset N-Gram language model to obtain the processed English mail text data.

2. The method for processing text data of an english mail according to claim 1, wherein the step of obtaining text data of an english mail to be processed is followed by the steps of:

and removing irregular characters in the text data of the English mail according to a preset standard format rule.

3. The method for processing the text data of the English mail according to claim 1, wherein the step of correcting the commas according to a preset N-Gram language model when the comma ratio is not less than a preset ratio threshold comprises:

when the comma ratio is not less than the preset ratio threshold, performing sentence splitting processing on the English mail text data to obtain a primary sentence list;

extracting a primary first sentence from the primary sentence list according to a preset primary first sentence condition;

when the sentence end character of the primary first sentence is judged not to be a punctuation mark, determining a conventional punctuation mark type corresponding to the sentence end of the primary first sentence according to a preset N-Gram language model;

determining the primary sentence list without the primary first sentence as a new generation primary sentence list, and judging whether the new generation primary sentence list is empty or not; if not, returning to the step of extracting the primary initial sentence from the primary sentence list according to the preset primary initial sentence condition; if so, the comma correction processing is ended.

4. The method for processing text data of an english mail according to claim 3, wherein the step of determining a conventional punctuation mark type corresponding to the end of the primary sentence according to a preset N-Gram language model when it is determined that the end of the primary sentence is not a punctuation mark comprises:

when the sentence end character of the primary first sentence is judged not to be a punctuation mark, adding various conventional punctuation mark types to the sentence end of the primary first sentence in sequence, and calculating scores corresponding to the various conventional punctuation mark types in sequence according to a preset N-Gram language model;

and determining the conventional punctuation mark type with the highest score as the conventional punctuation mark type corresponding to the tail of the primary first sentence.

5. The method for processing the text data of the english mail according to claim 1, wherein the step of performing sentence segmentation processing on the text data of the english mail after comma correction processing according to the preset N-Gram language model comprises:

performing sentence splitting processing on the comma-corrected English mail text data to obtain a first-level sentence list;

extracting a first-level first sentence from the first-level sentence list according to a preset first-level first sentence condition;

when it is determined that the sentence-end character of the first-level first sentence is not a punctuation mark, determining the type of ending punctuation mark corresponding to the sentence end of the first-level first sentence according to the preset N-Gram language model;

determining the first-level sentence list with the first-level first sentence removed as a new first-level sentence list, and determining whether the new first-level sentence list is empty; if not, returning to the step of extracting a first-level first sentence from the first-level sentence list according to the preset first-level first sentence condition; if so, ending the sentence segmentation processing.

6. The method for processing the text data of the English mail according to claim 1, wherein the step of, when it is determined that the sentence-segmented English mail text data contains a long sentence whose number of commas is greater than a preset number threshold, performing long sentence processing on the English mail text data according to the preset N-Gram language model to obtain the processed English mail text data comprises:

when it is determined that the sentence-segmented English mail text data contains a long sentence whose number of commas is greater than the preset number threshold, cutting the long sentence at the comma positions to obtain a comma cutting list;

extracting a second-level first sentence from the comma cutting list according to a preset second-level first sentence condition;

when it is determined that the sentence-end character of the second-level first sentence is not a punctuation mark, determining a conventional punctuation mark type corresponding to the sentence end of the second-level first sentence according to the preset N-Gram language model;

determining the comma cutting list with the second-level first sentence removed as a new comma cutting list, and determining whether the new comma cutting list is empty; if not, returning to the step of extracting a second-level first sentence from the comma cutting list according to the preset second-level first sentence condition; if so, ending the long sentence processing to obtain the processed English mail text data.

7. An apparatus for processing text data of an english mail, comprising:

the text data acquisition unit is used for acquiring text data of the English mail to be processed;

a punctuation mark type and quantity obtaining unit, configured to obtain the type and the corresponding quantity of punctuation marks in the text data of the english mail;

a comma ratio determining unit, configured to determine a comma ratio according to the types and the corresponding quantities of the punctuation marks;

the comma correction unit is used for correcting the comma according to a preset N-Gram language model when the comma proportion is not less than a preset proportion threshold;

the sentence dividing processing unit is used for carrying out sentence dividing processing on the English mail text data after comma correction processing according to the preset N-Gram language model; and

and the long sentence processing unit is used for carrying out long sentence processing on the English mail text data according to the preset N-Gram language model to obtain the processed English mail text data when the long sentences of which the comma number is larger than the preset number threshold value exist in the English mail text data after sentence segmentation processing.

8. The apparatus for processing text data of english mail according to claim 7, characterized by further comprising:

and the character removing unit is used for removing irregular characters in the English mail text data according to a preset standard format rule.

9. A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of processing text data of english mail according to any one of claims 1 to 6.

10. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of the method for processing text data of an english mail according to any one of claims 1 to 6.

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to an English mail text data processing method, device, equipment and a storage medium.

Background

In mail data processing, automatically providing services such as mail summaries requires that the mail text first be cut into sentences. However, people write mail in different ways. In English mail, for example, some people use multiple spaces in place of punctuation marks, some use carriage returns instead of punctuation marks to separate sentences, and others type commas throughout without ever ending a sentence with a period, which is very irregular.

The prior art only supports sentence cutting for standard text: sentences are still cut by matching punctuation marks with regular expressions and similar means, which is poorly suited to irregular mail text data.

Therefore, existing text sentence cutting methods, which rely on regular expressions, cannot cope with the irregular sentence division caused by people's differing mail-writing habits, and their use is limited.

Disclosure of Invention

The embodiment of the invention aims to provide an English mail text data processing method that solves the problem that existing text sentence cutting methods, which rely on regular expressions, cannot cope with the irregular sentence division caused by people's differing mail-writing habits and are therefore limited in use.

The embodiment of the invention is realized in such a way that an English mail text data processing method comprises the following steps:

acquiring text data of an English mail to be processed;

acquiring the types and the corresponding quantity of punctuation marks in the English mail text data;

determining a comma ratio according to the types and the corresponding quantities of the punctuation marks;

when the comma ratio is not less than a preset ratio threshold, correcting the commas according to a preset N-Gram language model;

performing sentence segmentation processing on the comma-corrected English mail text data according to the preset N-Gram language model;

and when it is determined that the sentence-segmented English mail text data contains a long sentence whose number of commas is greater than a preset number threshold, performing long sentence processing on the English mail text data according to the preset N-Gram language model to obtain the processed English mail text data.

Another objective of an embodiment of the present invention is to provide an apparatus for processing text data of an english mail, including:

the text data acquisition unit is used for acquiring text data of the English mail to be processed;

a punctuation mark type and quantity obtaining unit, configured to obtain the type and the corresponding quantity of punctuation marks in the text data of the english mail;

a comma ratio determining unit, configured to determine a comma ratio according to the types and the corresponding quantities of the punctuation marks;

the comma correction unit is used for correcting the comma according to a preset N-Gram language model when the comma proportion is not less than a preset proportion threshold;

the sentence dividing processing unit is used for carrying out sentence dividing processing on the English mail text data after comma correction processing according to the preset N-Gram language model; and

and the long sentence processing unit is used for carrying out long sentence processing on the English mail text data according to the preset N-Gram language model to obtain the processed English mail text data when the long sentences of which the comma number is larger than the preset number threshold value exist in the English mail text data after sentence segmentation processing.

Another object of an embodiment of the present invention is to provide a computer device, including a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to execute the steps of the English mail text data processing method.

Another object of an embodiment of the present invention is to provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the processor executes the steps of the English mail text data processing method.

The English mail text data processing method provided by the embodiment of the invention determines the comma ratio from the types and corresponding quantities of the punctuation marks in the English mail text data to be processed; corrects the commas when the comma ratio is not less than a preset ratio threshold; then performs sentence segmentation processing on the comma-corrected English mail text data; and, when it is determined that the sentence-segmented English mail text data contains a long sentence whose number of commas is greater than a preset number threshold, performs long sentence processing on the English mail text data to obtain the processed English mail text data. Compared with the prior art, the invention solves the problem of irregular regular-expression sentence division caused by people's differing mail-writing habits, correctly divides the mail text into sentences, and provides technical support for the subsequent mining of mail data.

Drawings

Fig. 1 is a flowchart illustrating an implementation of a method for processing text data of an english email according to an embodiment of the present invention;

fig. 2 is a flowchart illustrating an implementation of another method for processing text data of an english email according to an embodiment of the present invention;

fig. 3 is a flowchart illustrating an implementation of another method for processing text data of an english email according to an embodiment of the present invention;

fig. 4 is a flowchart illustrating an implementation of a method for processing text data of an english email according to an embodiment of the present invention;

fig. 5 is a flowchart illustrating an implementation of a method for processing text data of an english email according to an embodiment of the present invention;

fig. 6 is a block diagram of an apparatus for processing text data of an english mail according to an embodiment of the present invention;

fig. 7 is a block diagram of another english mail text data processing apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, etc. may be used to describe various information in embodiments of the present invention, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another.

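The embodiment of the invention aims to solve the problem that existing text sentence cutting methods, which rely on regular expressions, cannot cope with the irregular sentence division caused by people's differing mail-writing habits and are therefore limited in use. To this end it provides an English mail text data processing method that determines the comma ratio of the mail text, corrects the commas when that ratio is not less than a preset ratio threshold, segments the corrected text into sentences, and further processes long sentences that contain too many commas. The mail text is thereby correctly divided into sentences, and technical support is provided for the subsequent mining of mail data.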

To further explain the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description of the embodiments, structures, features and effects according to the present invention will be given with reference to the accompanying drawings and preferred embodiments.

As shown in fig. 1, in an embodiment, an english mail text data processing method is proposed, and for convenience of description, only parts related to the embodiment of the present invention are shown, which are detailed as follows:

Step S101, obtaining the text data of the English mail to be processed.

In the embodiment of the present invention, the English mail text data is the text content of the English mail; its format may be ASCII, MIME, txt, etc., and is not specifically limited. The mail text data may be obtained from e-mail application software such as Gmail, QQ Mail, 163 Mail, Sina Mail, Sohu Mail and 126 Mail, which is likewise not specifically limited.

In a preferred embodiment of the present invention, as shown in fig. 2, after the step S101, the method further includes:

step S201, according to a preset standard format rule, removing irregular characters in the text data of the english mail.

In the embodiment of the invention, when the format of the English mail text data is ASCII, characters in the English mail text data that are not in the ASCII code table are removed. ASCII (American Standard Code for Information Interchange) is a computer character encoding system based on the Latin alphabet, mainly used to display modern English and other Western European languages; it is the most common information interchange standard and is equivalent to the international standard ISO/IEC 646.
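The character removal of step S201 can be illustrated with a minimal Python sketch. It assumes the "preset standard format rule" is simply "keep only characters in the 7-bit ASCII table"; the function name `remove_non_ascii` is hypothetical and not taken from the source.

```python
def remove_non_ascii(text: str) -> str:
    """Drop every character whose code point lies outside the 7-bit ASCII table.

    Assumption: the preset standard format rule is plain ASCII (code points 0-127).
    """
    return "".join(ch for ch in text if ord(ch) < 128)


if __name__ == "__main__":
    # The accented letter and the full-width period are not ASCII and are removed.
    print(remove_non_ascii("na\u00efve plan\u3002"))
```

Note that this deletes the offending characters outright; a real pipeline might instead map full-width punctuation to its ASCII counterpart before filtering, which the source does not specify.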

Step S102, obtaining the types and the corresponding quantities of the punctuation marks in the English mail text data.

In the embodiment of the present invention, the types of punctuation marks in the English mail text data include, but are not limited to, the comma, period, exclamation mark and question mark. By obtaining the type and corresponding quantity of each punctuation mark, the proportion of a given punctuation mark in the total number of punctuation marks in the whole English mail text data can then be calculated.

Step S103, determining the comma ratio according to the types and the corresponding quantities of the punctuation marks.

In the embodiment of the invention, big data shows that the comma is the punctuation mark most commonly used by most people when writing mail, and a comma is often typed where a typical sentence-ending punctuation mark such as a period, exclamation mark or question mark belongs. The comma ratio is therefore determined from the obtained quantity of each punctuation mark. Because English text is generally segmented into sentences only at periods, exclamation marks and question marks, in the embodiment of the present invention only these kinds of punctuation marks need to be counted together with the comma; other marks such as colons, semicolons and ellipses are treated as ordinary characters. For example, given the number of commas num_1, the number of periods num_2, the number of exclamation marks num_3 and the number of question marks num_4 counted in the mail text data, the comma ratio ratio_1 is calculated according to equation (1):

$$ratio_1 = \frac{num_1}{num_1 + num_2 + num_3 + num_4} \tag{1}$$
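The counting of step S102 and the ratio of step S103 can be sketched as follows. This is a minimal sketch, assuming the comma ratio is the number of commas divided by the total number of commas, periods, exclamation marks and question marks; the names `count_punctuation` and `comma_ratio`, and the choice to return 0.0 for text without any counted punctuation, are illustrative assumptions.

```python
from collections import Counter

# Only these four marks are counted; colons, semicolons, ellipses and so on
# are treated as ordinary characters, as described in the text.
COUNTED_MARKS = {",": "comma", ".": "period", "!": "exclamation", "?": "question"}


def count_punctuation(text: str) -> dict:
    """Return the number of occurrences of each counted punctuation type."""
    counts = Counter(ch for ch in text if ch in COUNTED_MARKS)
    return {name: counts.get(mark, 0) for mark, name in COUNTED_MARKS.items()}


def comma_ratio(text: str) -> float:
    """Commas divided by all counted punctuation marks (assumed form of equation (1))."""
    counts = count_punctuation(text)
    total = sum(counts.values())
    # Assumption: text without any counted punctuation gets a ratio of 0.0.
    return counts["comma"] / total if total else 0.0


if __name__ == "__main__":
    sample = "a, b, c, d. e!"
    print(count_punctuation(sample))
    print(comma_ratio(sample))
```

The character-level count is deliberately naive: periods inside URLs, decimals or abbreviations are counted as periods, so a production version would likely mask such tokens first.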

Step S104, when the comma ratio is not less than a preset ratio threshold, correcting the commas according to a preset N-Gram language model.

In the embodiment of the present invention, the preset ratio threshold may be set in a specific implementation according to the performance of the terminal and/or the implementation requirements, and its value is not specifically limited; for example, it may be 0.3, 0.5, 0.7, and so on. The invention preferably sets the preset ratio threshold to 0.7: if the comma ratio ratio_1 is greater than or equal to 0.7, comma correction is carried out; otherwise this step is skipped. The value 0.7 is the best value obtained through many experiments. For example, if a passage has 10 punctuation marks in total, of which 9 are commas and 1 is a period, it clearly does not match conventional writing and can be considered erroneous: marks that should have ended sentences were probably written as commas.
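The threshold decision can be sketched directly on the four counts. This is a minimal sketch under the assumption that the comma ratio is commas divided by the sum of the four counted marks; the function names are hypothetical, and 0.7 is the preferred threshold stated in the text.

```python
def comma_ratio_from_counts(num_commas: int, num_periods: int,
                            num_exclamations: int, num_questions: int) -> float:
    """Comma ratio computed from the four counts (assumed definition)."""
    total = num_commas + num_periods + num_exclamations + num_questions
    return num_commas / total if total else 0.0


def needs_comma_correction(num_commas: int, num_periods: int,
                           num_exclamations: int, num_questions: int,
                           threshold: float = 0.7) -> bool:
    """True when the comma ratio is not less than the preset ratio threshold."""
    ratio = comma_ratio_from_counts(num_commas, num_periods,
                                    num_exclamations, num_questions)
    return ratio >= threshold


if __name__ == "__main__":
    # 9 commas and 1 period out of 10 marks: ratio 0.9, correction is triggered.
    print(needs_comma_correction(9, 1, 0, 0))
    # 2 commas among 8 marks: ratio 0.25, the correction step is skipped.
    print(needs_comma_correction(2, 5, 1, 0))
```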

In the embodiment of the present invention, the English mail text data may be, for example, "I have a good friend, and her name is Li Hua, We have been friends for about two years, She is very kind, when I step into the classroom for the first time, She helps me to get familiar with the strange environment, The most important thing is that we share the same interest, so we have a lot in common, i cherish our friendship so much", in which commas are used throughout and no sentence ends with a period.

In the embodiment of the invention, to train the preset N-Gram language model, a multilingual Wikipedia corpus can be used as the training corpus: a multilingual word segmentation tool is used for word segmentation, and the N-Gram language model is then trained with an N-Gram language model training tool. When N is large, training the N-Gram language model requires a huge training corpus, the data become severely sparse, and the time complexity rises sharply, so N is set to 3 in the embodiment of the invention. Long-tail words (i.e. words whose total number of occurrences is less than 10) are marked as LongTailWord, and numbers are uniformly denoted NUMERAL. The N-Gram language model training tool can be one of the open-source tools such as SRILM, IRSTLM, Berkeley LM and KenLM.
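The corpus normalization described above (LongTailWord, NUMERAL, N = 3) can be sketched in plain Python. This is an illustrative stand-in for the preprocessing only, not for SRILM, IRSTLM, Berkeley LM or KenLM; the function names, the number pattern and the sentence padding symbols `<s>` and `</s>` are assumptions.

```python
import re
from collections import Counter

LONG_TAIL = "LongTailWord"
NUMERAL = "NUMERAL"
# Assumption: a "number" is a run of digits with optional . or , groups.
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def normalize_corpus(sentences, min_count=10):
    """Map numbers to NUMERAL and words seen fewer than min_count times to LongTailWord.

    sentences: list of token lists that have already been word-segmented.
    """
    mapped = [[NUMERAL if _NUMBER_RE.fullmatch(tok) else tok for tok in sent]
              for sent in sentences]
    freq = Counter(tok for sent in mapped for tok in sent)
    return [[tok if tok == NUMERAL or freq[tok] >= min_count else LONG_TAIL
             for tok in sent]
            for sent in mapped]


def trigram_counts(sentences):
    """Count trigrams (N = 3) over padded sentences."""
    counts = Counter()
    for sent in sentences:
        padded = ["<s>", "<s>"] + list(sent) + ["</s>"]
        counts.update(zip(padded, padded[1:], padded[2:]))
    return counts


if __name__ == "__main__":
    corpus = [["see", "you", "at", "10"] for _ in range(10)] + [["rare", "word"]]
    normalized = normalize_corpus(corpus)
    print(normalized[0], normalized[-1])
    print(trigram_counts(normalized)[("<s>", "<s>", "see")])
```

In practice the normalized text would be written to disk and passed to one of the named training tools, which also handle smoothing and back-off.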

In a preferred embodiment of the present invention, as shown in fig. 3, the step S104 includes:

Step S301, when the comma ratio is not less than the preset ratio threshold, performing sentence splitting processing on the English mail text data to obtain a primary sentence list.

In the embodiment of the invention, the English mail text data is split into sentences with a regular expression: the positions of punctuation marks in the English mail text data are identified, and primary sentences are cut at periods, exclamation marks, question marks and carriage returns to obtain the primary sentence list. Company suffixes (e.g., Inc., Corp.), personal titles and names (e.g., Mr. Wang), URLs (e.g., www.baidu.com), e-mail addresses (e.g., an address at qq.com), numbers, sequence numbers, abbreviations and the like that appear in a sentence are ignored when cutting.
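One way to realize this kind of regular-expression splitting is to mask the periods inside protected tokens, split, and then restore them. The sketch below is an assumption about how that could be done, not the patent's actual expression; the abbreviation list, the patterns and the function name `split_sentences` are illustrative.

```python
import re

# Tokens whose internal periods must not be treated as sentence boundaries.
_PROTECTED = re.compile(
    r"""
    \b(?:Mr|Mrs|Ms|Dr|Inc|Corp|Ltd|Co)\.      # titles and company suffixes
    | \b[\w.+-]+@[\w-]+(?:\.[\w-]+)+          # e-mail addresses
    | \b(?:https?://|www\.)\S*[\w/]           # URLs, without trailing punctuation
    | \b\d+(?:\.\d+)+                         # decimals and dotted sequence numbers
    """,
    re.VERBOSE,
)
_MASK = "\x00"  # placeholder that does not occur in normal mail text


def split_sentences(text: str) -> list:
    """Cut text at periods, exclamation marks, question marks and line breaks."""
    masked = _PROTECTED.sub(lambda m: m.group(0).replace(".", _MASK), text)
    parts = re.split(r"(?<=[.!?])\s+|\n+", masked)
    return [p.strip().replace(_MASK, ".") for p in parts if p.strip()]


if __name__ == "__main__":
    mail = ("Please call Mr. Wang today. Visit www.baidu.com for details! "
            "Is 3.5 ok?\nThanks")
    for sentence in split_sentences(mail):
        print(sentence)
```

A known limitation of this masking approach is that an abbreviation such as "Inc." at the very end of a sentence is also masked, so the boundary after it is missed.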

In the embodiment of the invention, primary sentence division of the English mail text data yields the primary sentence list lst = [s_1, s_2, …, s_n], where n is the number of primary sentences.

In the embodiment of the present invention, for example, the English mail text data is "The weather is fine today. i will fly a kite. My mother agreed.", where the initial letter of "i will fly a kite." is lowercase. It should belong to the same sentence as "The weather is fine today", yet a sentence boundary separates the two, which indicates a writing error. Preliminary sentence division gives lst = ["The weather is fine today.", "i will fly a kite.", "My mother agreed."].

In the embodiment of the present invention, for example, for the English mail text data "I have a good friend, and her name is Li Hua, We have been friends for about two years, She is very kind, when I step into the classroom for the first time, She helps me to get familiar with the strange environment, The most important thing is that we share the same interest, so we have a lot in common, i cherish our friendship so much", the corresponding primary sentence list is lst = ["I have a good friend", "and her name is Li Hua", "We have been friends for about two years", "She is very kind", "when I step into the classroom for the first time", "She helps me to get familiar with the strange environment", "The most important thing is that we share the same interest", "so we have a lot in common", "i cherish our friendship so much"], where the number of primary sentences n is 9.

Step S302, extracting a primary first sentence from the primary sentence list according to a preset primary first sentence condition.

In the embodiment of the present invention, a start index i = 0 and an end index j = 1 are set. If the first character of the sentence lst[j] is a lowercase letter, or a digit, or a connector such as & or -, or if its first word is entirely uppercase, then j = j + 1 is executed, and this is repeated until the condition is no longer satisfied. The sentences lst[i:j], with indexes from i up to but not including j, are then taken out as the primary first sentence (the sentence lst[j] is excluded).

In the embodiment of the present invention, for example, lst = ["The weather is fine today.", "i will fly a kite.", "My mother agreed."]. With the start index i = 0 and the end index j = 1, the sentence lst[1] = "i will fly a kite." begins with a lowercase letter, so the condition is satisfied and j = j + 1 = 2. The sentence lst[2] = "My mother agreed." satisfies none of the conditions, so lst[0:2] is taken out as the primary first sentence. As a further example, lst = ["I have a good friend", "and her name is Li Hua", "We have been friends for about two years", "She is very kind", "when I step into the classroom for the first time", "She helps me to get familiar with the strange environment", "The most important thing is that we share the same interest", "so we have a lot in common", "i cherish our friendship so much"]. Specifically, with i = 0 and j = 1, lst[0] = "I have a good friend" and lst[1] = "and her name is Li Hua"; lst[1] satisfies the condition that the first character is lowercase, so j = j + 1 = 2. lst[2] = "We have been friends for about two years" satisfies none of the above conditions, so lst[0:2] = ["I have a good friend", "and her name is Li Hua"] is taken out, and "I have a good friend, and her name is Li Hua" is the primary first sentence.
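The first-sentence extraction of step S302 can be sketched as below. The sketch reads the stated condition literally (first character lowercase, a digit, `&` or `-`, or first word entirely uppercase); the function names and the sample segments are illustrative assumptions.

```python
def continues_previous(segment: str) -> bool:
    """True if the segment looks like a continuation of the preceding segment."""
    stripped = segment.lstrip()
    if not stripped:
        return False  # assumption: empty segments never extend a sentence
    first_char = stripped[0]
    first_word = stripped.split()[0]
    # Literal reading of the condition; note that the one-letter word "I"
    # also counts as "entirely uppercase" under this reading.
    return (first_char.islower()
            or first_char.isdigit()
            or first_char in "&-"
            or first_word.isupper())


def extract_first_sentence(lst):
    """Return (lst[0:j], lst[j:]), where j is the first index not continuing lst[0]."""
    if not lst:
        return [], []
    j = 1
    while j < len(lst) and continues_previous(lst[j]):
        j += 1
    return lst[0:j], lst[j:]


if __name__ == "__main__":
    segments = ["I have a good friend", "and her name is Li Hua",
                "We share the same interest"]
    first, rest = extract_first_sentence(segments)
    print(first)
    print(rest)
```

Bounding the loop with `j < len(lst)` is an addition for safety; the text only states that j is incremented until the condition fails.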

Step S303, when the sentence end character of the primary first sentence is judged not to be the punctuation mark, determining the conventional punctuation mark type corresponding to the sentence end of the primary first sentence according to a preset N-Gram language model.

In the embodiment of the invention, when the sentence tail character of the primary first sentence is judged not to be the punctuation mark, various conventional punctuation mark types are added to the sentence tail of the primary first sentence in sequence, and the scores corresponding to the various conventional punctuation mark types are calculated in sequence according to a preset N-Gram language model; and determining the conventional punctuation mark type with the highest score as the conventional punctuation mark type corresponding to the tail of the primary first sentence. Specifically, if the sentence end character of the primary initial sentence lst [ i: j ] is not a punctuation mark, adding a comma, a period, an exclamation mark and a question mark after the primary initial sentence, respectively, calculating the N-Gram score according to a preset N-Gram language model, and using the punctuation mark with the highest score; namely, the punctuation mark type with the highest score is determined as the punctuation mark type to be added at the end of the initial sentence.

In the embodiment of the present invention, if the current primary first sentence is "The weather is fine today" and there is no punctuation mark at its end, a comma, a period, an exclamation mark and a question mark are each added to the end of the sentence, and the N-Gram scores of "The weather is fine today,", "The weather is fine today.", "The weather is fine today!" and "The weather is fine today?" are calculated. Likewise, if the primary first sentence is "I have a good friend, and her name is Li Hua" and there is no punctuation mark at its end, a comma, a period, an exclamation mark and a question mark are each added to the end of the sentence, and the N-Gram scores of "I have a good friend, and her name is Li Hua,", "I have a good friend, and her name is Li Hua.", "I have a good friend, and her name is Li Hua!" and "I have a good friend, and her name is Li Hua?" are calculated. The N-Gram score can be calculated with an existing N-Gram language model tool in the conventional manner, which is not described in detail here. If the sentence ending with a period has the highest N-Gram score, a period is added at the end of the sentence.
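The candidate scoring of step S303 can be sketched with a pluggable scoring function. The toy add-one-smoothed trigram model below is only a stand-in so the example runs on its own; it is an assumption and not the preset model, which the text says is trained with a tool such as SRILM or KenLM. All class and function names are hypothetical.

```python
import math
import re
from collections import Counter

CANDIDATE_MARKS = [",", ".", "!", "?"]


def tokenize(sentence: str) -> list:
    """Crude tokenizer: words and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", sentence)


class ToyTrigramModel:
    """Add-one smoothed trigram model, for illustration only."""

    def __init__(self, corpus):
        self.tri, self.bi, self.vocab = Counter(), Counter(), set()
        for sent in corpus:
            toks = ["<s>", "<s>"] + tokenize(sent) + ["</s>"]
            self.vocab.update(toks)
            for a, b, c in zip(toks, toks[1:], toks[2:]):
                self.tri[(a, b, c)] += 1
                self.bi[(a, b)] += 1  # count of the two-word context

    def score(self, sentence: str) -> float:
        """Log-probability of the sentence under the toy model."""
        toks = ["<s>", "<s>"] + tokenize(sentence) + ["</s>"]
        v = len(self.vocab) + 1
        return sum(math.log((self.tri[(a, b, c)] + 1) / (self.bi[(a, b)] + v))
                   for a, b, c in zip(toks, toks[1:], toks[2:]))


def best_end_mark(sentence: str, score_fn, marks=CANDIDATE_MARKS) -> str:
    """Append each candidate mark in turn and keep the highest-scoring one."""
    return max(marks, key=lambda mark: score_fn(sentence + mark))


if __name__ == "__main__":
    model = ToyTrigramModel(["The weather is fine today.",
                             "It is fine today.",
                             "Is it fine today?"])
    print(best_end_mark("The weather is fine today", model.score))
```

Because `best_end_mark` only needs a callable that maps a string to a score, a real language model can be substituted without changing the selection logic.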

Step S304, determining the primary sentence list without the primary first sentence as a new generation primary sentence list, and judging whether the new generation primary sentence list is empty or not; if not, returning to the step S302; if yes, the process proceeds to step S305.

In an embodiment of the present invention, the primary first sentence lst[i:j] is returned, and the remaining sentence list lst = [sj, sj+1, …, sn] is also returned; that is, lst = [sj, sj+1, …, sn] is the new-generation primary sentence list obtained by removing the primary first sentence lst[i:j]. When it is determined that sentences remain in the primary sentence list, i.e. the number of sentences is not 0, the process returns to the step of extracting the primary first sentence from the primary sentence list according to the preset primary first sentence condition; when no sentence remains in the primary sentence list, i.e. the number of sentences is 0, comma correction of the mail text data is complete. Specifically, in the above example, lst[0:2] is returned as the primary first sentence, and the remaining sentence list lst = ["We have become friends for about two years", "She is right", "When I step into the classroom for the first time", "She helps me to get familiar with the strange environment", "The most important thing is that we share the same interest", "so we have a lot in common", "i cherish our friendship so much"] is also returned.

In step S305, comma correction is completed, and the comma correction processing is ended.

And step S105, carrying out sentence segmentation processing on the text data of the English mail subjected to comma correction processing according to the preset N-Gram language model.

In a preferred embodiment of the present invention, as shown in fig. 4, the step S105 includes:

step S401, sentence splitting processing is carried out on the English mail text data after comma correction processing, and a first-level sentence list is obtained.

In the embodiment of the present invention, the method for performing sentence division processing on the text data of the english mail after comma correction processing may refer to the above sentence division processing manner, and details are not described herein again.

In the embodiment of the invention, the English mail text data after comma correction processing is subjected to sentence division to obtain a first-level sentence list lst = [s1, s2, …, sn], where n is the number of first-level sentences.

In the embodiment of the present invention, in the above example, the comma-corrected English mail text data obtained after comma correction is: "I have a good friend, and her name is Li hua. We have become friends for about two years. She is right. When I step into the classroom for the first time, She helps me to get familiar with the strange environment. The most important thing is that we share the same interest, so we have a lot in common, i cherish our friendship so much." Sentence division then gives the first-level sentence list lst = ["I have a good friend, and her name is Li hua.", "We have become friends for about two years.", "She is right.", "When I step into the classroom for the first time, She helps me to get familiar with the strange environment.", "The most important thing is that we share the same interest, so we have a lot in common, i cherish our friendship so much."], where the number n of first-level sentences is 5.

Step S402, extracting a primary first sentence from the primary sentence list according to a preset primary first sentence condition.

In the embodiment of the invention, the initial index i is set to 0 and the end index j is set to 1. If the last character of the sentence lst[i] is a punctuation mark, the sentences lst[i:j] with indexes i to j are taken out as the first sentence; otherwise, if the first character of the sentence lst[j] is lowercase, or the first character is a number, or the first character is a connector such as & or -, or the first word is all uppercase, the end index j is incremented by 1 (j = j + 1) until this condition is no longer satisfied.
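The extraction rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, merged sentences are joined with a single space, and a single-letter word such as "I" is not counted as an all-uppercase first word (otherwise every sentence starting with "I" would be merged into the previous one).

```python
PUNCTUATION = (",", ".", "!", "?")


def starts_like_continuation(sentence):
    """True if the sentence looks like the continuation of the previous one."""
    words = sentence.split()
    if not words:
        return False
    first_char = words[0][0]
    first_word = words[0]
    return (
        first_char.islower()
        or first_char.isdigit()
        or first_char in "&-"
        # Assumption: one-letter words such as "I" do not count as all uppercase.
        or (len(first_word) > 1 and first_word.isupper())
    )


def extract_first_sentence(lst):
    """Return (first sentence, remaining list) for a non-empty sentence list."""
    i, j = 0, 1
    # Only extend the span when lst[i] does not already end with a punctuation mark.
    if not lst[i].rstrip().endswith(PUNCTUATION):
        while j < len(lst) and starts_like_continuation(lst[j]):
            j += 1
    return " ".join(lst[i:j]), lst[j:]


first, rest = extract_first_sentence(
    ["I have a good friend, and her name is Li hua.",
     "We have become friends for about two years."]
)
print(first)  # prints "I have a good friend, and her name is Li hua."
```

Because the function returns both the extracted sentence and the remaining list, it can be called repeatedly until the remaining list is empty.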

In the embodiment of the present invention, take the existing text "The weather is fine today. i will fly a kite. My more linked." as an example (note that "i will fly a kite" begins with a lowercase i and belongs to the same sentence as "The weather is fine today", but the two are separated by a period, which is regarded as a punctuation error). Primary sentence division gives lst = ["The weather is fine today.", "i will fly a kite.", "My more linked."]. With the initial index i = 0 and the end index j = 1, the sentence lst[1] = "i will fly a kite." begins with a lowercase letter, so the condition is satisfied and the end index is incremented, giving j = 2; the sentence now examined is lst[2] = "My more linked.".

In the embodiment of the present invention, for the first-level sentence list lst = ["I have a good friend, and her name is Li hua.", "We have become friends for about two years.", "She is right.", "When I step into the classroom for the first time, She helps me to get familiar with the strange environment.", "The most important thing is that we share the same interest, so we have a lot in common, i cherish our friendship so much."], the initial index i = 0 and the end index j = 1. The last character of the sentence lst[0] is a punctuation mark, so lst[0:1] = ["I have a good friend, and her name is Li hua."] is taken out, and "I have a good friend, and her name is Li hua." is used as the first-level first sentence.

Step S403, when the sentence end character of the first-level sentence is judged not to be the punctuation mark, determining the end punctuation mark type corresponding to the sentence end of the first-level sentence according to a preset N-Gram model.

In the embodiment of the invention, when the sentence end character of the first-level first sentence is judged not to be the punctuation mark, various conventional punctuation mark types are added to the sentence end of the first-level first sentence in sequence, and the scores corresponding to the various conventional punctuation mark types are calculated in sequence according to a preset N-Gram language model; and determining the conventional punctuation mark type with the highest score as the conventional punctuation mark type corresponding to the sentence end of the first-level sentence. Specifically, if the end character of the first sentence lst [ i: j ] is not a punctuation mark, then a period, an exclamation mark, and a question mark are added after the sentence to calculate the N-Gram score, and the punctuation mark with the highest score is used.

In the embodiment of the present invention, if the first sentence is "I have a good friend, and her name is Li hua.

Step S404, determining the first-stage sentence list without the first-stage first sentence as a new-generation first-stage sentence list, and judging whether the new-generation first-stage sentence list is empty or not; if not, returning to the step S402; if yes, the process proceeds to step S405.

In an embodiment of the present invention, the first-level first sentence lst[i:j] is returned, and the remaining sentence list lst = [sj, sj+1, …, sn] is also returned; that is, lst = [sj, sj+1, …, sn] is the new-generation first-level sentence list obtained by removing the first-level first sentence lst[i:j]. When it is determined that sentences remain in the first-level sentence list, i.e. the number of sentences is not 0, the process returns to the step of extracting the first-level first sentence from the first-level sentence list according to the preset first-level first sentence condition; when no sentence remains in the first-level sentence list, i.e. the number of sentences is 0, sentence segmentation of the mail text data is complete. Specifically, in the above example, lst[0:1] = "I have a good friend, and her name is Li hua." is returned as the first-level first sentence, and the remaining sentence list lst = ["We have become friends for about two years.", "She is right.", "When I step into the classroom for the first time, She helps me to get familiar with the strange environment.", "The most important thing is that we share the same interest, so we have a lot in common, i cherish our friendship so much."] is also returned.

In step S405, the sentence segmentation process is completed, and the sentence segmentation process is ended.

And step S106, when the long sentence with commas larger than a preset number threshold exists in the English mail text data after sentence segmentation processing, performing long sentence processing on the English mail text data according to the preset N-Gram language model to obtain the processed English mail text data.

In a preferred embodiment of the present invention, as shown in fig. 5, the step S106 includes:

step S501, when it is judged that the long sentence with the comma number larger than the preset number threshold exists in the English mail text data after sentence segmentation processing, the long sentence is segmented according to the comma position to obtain a comma segmentation list.

In the embodiment of the invention, the long sentence is cut at its commas to obtain a comma cut list lst = [s1, s2, …, sn], where n is the number of comma cuts.
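As a minimal sketch of this step, the Python function below finds the sentences whose comma count exceeds a threshold and cuts each of them at its commas. The function name `find_long_sentences` and the default threshold of 1 are assumptions made for illustration; the preset number threshold itself is not fixed in this passage.

```python
def find_long_sentences(sentences, comma_threshold=1):
    """Map the index of each long sentence to its comma cut list."""
    cut_lists = {}
    for index, sentence in enumerate(sentences):
        # A sentence is "long" when it has more commas than the threshold (assumed value).
        if sentence.count(",") > comma_threshold:
            cut_lists[index] = [part.strip() for part in sentence.split(",")]
    return cut_lists


sentences = [
    "I have a good friend, and her name is Li hua.",
    "She is right.",
    "The most important thing is that we share the same interest, "
    "so we have a lot in common, i cherish our friendship so much.",
]
cuts = find_long_sentences(sentences)
print(len(cuts[2]))  # prints 3
```

Only the third sentence has more than one comma, so it is the only one cut, and its comma cut list has three entries.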

In the embodiment of the present invention, in the above example, the sentence list corresponding to the English mail text data after sentence segmentation processing is: sentence 1 is "I have a good friend, and her name is Li hua.", sentence 2 is "We have become friends for about two years", sentence 3 is "She is right.", sentence 4 is "When I step into the classroom for the first time, She helps me to get familiar with the strange environment", and sentence 5 is "The most important thing is that we share the same interest, so we have a lot in common, i cherish our friendship so much". Taking sentence 5 as an example, cutting it at the commas gives the comma cut list lst = ["The most important thing is that we share the same interest", "so we have a lot in common", "i cherish our friendship so much"], and the number n of comma cuts is 3.

Step S502, extracting a secondary first sentence from the comma cut list according to a preset secondary first sentence condition.

In the embodiment of the invention, an initial index i is set to be 0, an end index j is set to be 1, and if commas of a current sentence lst [ i: j ] are more than or equal to 2 and a first word of the sentence lst [ j ] appears in a first word statistical list, the sentence lst [ i: j ] with indexes i to j is taken out as a first sentence; if the initial of the sentence lst [ j ] is lowercase, a comma is used to join the next sentence until this condition is not met.

In the embodiment of the present invention, for the comma cut list lst = ["The most important thing is that we share the same interest", "so we have a lot in common", "i cherish our friendship so much"], the initial index i = 0 and the end index j = 1; at this time, lst[0] = "The most important thing is that we share the same interest" and lst[1] = "so we have a lot in common". The number of commas of the current sentence lst[0:1] is less than 2 and the first letter of the sentence lst[1] is lowercase, so the two are joined with a comma and j = j + 1 = 2. The number of commas of the sentence lst[0:2] is now exactly 2, and the first word "i" of the sentence lst[2] appears in the first word statistical list, so the sentence lst[0:2] = "The most important thing is that we share the same interest, so we have a lot in common" is taken out as the secondary first sentence.
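The secondary first sentence rule can be sketched in Python as below. Several points are assumptions made for illustration: the function name, the contents of the `sentence_starters` set (standing in for the first word statistical list), counting one comma per segment taken so far, and stopping when the next segment neither starts a new sentence nor begins with a lowercase letter. The N-Gram choice of the end punctuation is omitted.

```python
def split_long_sentence(sentence, sentence_starters, min_commas=2):
    """Split a long sentence into shorter sentences at suitable commas."""
    segments = [part.strip() for part in sentence.split(",") if part.strip()]
    result = []
    while segments:
        j = 1
        while j < len(segments):
            first_word = segments[j].split()[0].lower()
            # Each of the j segments taken so far was followed by a comma.
            if j >= min_commas and first_word in sentence_starters:
                break  # enough commas, and the next segment starts a new sentence
            if not segments[j][0].islower():
                break  # assumption: an uppercase start also ends the current span
            j += 1
        result.append(", ".join(segments[:j]))
        segments = segments[j:]
    return result


starters = {"i", "we", "she", "he", "the"}  # hypothetical first word statistical list
long_sentence = (
    "The most important thing is that we share the same interest, "
    "so we have a lot in common, i cherish our friendship so much."
)
for part in split_long_sentence(long_sentence, starters):
    print(part)
```

For the example sentence this yields two parts: the first two segments joined by a comma, and "i cherish our friendship so much." on its own.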

Step S503, when the sentence end character of the secondary first sentence is judged not to be the punctuation mark, determining the conventional punctuation mark type corresponding to the sentence end of the secondary first sentence according to a preset N-Gram language model.

In the embodiment of the invention, if the sentence-end character of the secondary first sentence lst[i:j] is not a punctuation mark, a comma, a period, an exclamation mark and a question mark are each added after the sentence to calculate the N-Gram score, and the punctuation mark with the highest score is used. For example, if the current secondary first sentence is "The most important thing is that we share the same interest, so we have a lot in common" and the end of the sentence has no punctuation mark, a comma, a period, an exclamation mark and a question mark are each added to the end of the sentence, and the N-Gram scores of "The most important thing is that we share the same interest, so we have a lot in common,", "The most important thing is that we share the same interest, so we have a lot in common.", "The most important thing is that we share the same interest, so we have a lot in common!" and "The most important thing is that we share the same interest, so we have a lot in common?" are calculated.

Step S504, determining the comma cut list without the secondary first sentence as a new-generation comma cut list, and judging whether the new-generation comma cut list is empty; if not, returning to the step S502; if yes, the process proceeds to step S505.

In the embodiment of the present invention, the secondary first sentence lst[i:j] is returned, and the remaining comma cut list lst = [sj, sj+1, …, sn] is also returned; that is, lst = [sj, sj+1, …, sn] is the new-generation comma cut list obtained by removing the secondary first sentence lst[i:j]. When it is determined that sentences remain in the comma cut list, i.e. the number of sentences is not 0, the process returns to the step of extracting the secondary first sentence from the comma cut list according to the preset secondary first sentence condition; when no sentence remains in the comma cut list, i.e. the number of sentences is 0, long sentence processing of the mail text data is complete. Specifically, in the above example, lst[0:2] = "The most important thing is that we share the same interest, so we have a lot in common" is returned as the secondary first sentence, and the remaining comma cut list lst = ["i cherish our friendship so much"] is also returned.

And step S505, ending the long sentence processing process to obtain the processed English mail text data.

In the embodiment of the present invention, in the above example, after the long sentence processing is finished, the final sentence list is: sentence 1 is "I have a good friend, and her name is Li hua.", sentence 2 is "We have become friends for about two years", sentence 3 is "She is right.", sentence 4 is "When I step into the classroom for the first time, She helps me to get familiar with the strange environment", sentence 5 is "The most important thing is that we share the same interest, so we have a lot in common", and sentence 6 is "i cherish our friendship so much".

The method for processing the text data of the English mail determines the comma occupation ratio by the type and the corresponding number of punctuations in the text data of the English mail to be processed, corrects the comma when the comma occupation ratio is not less than a preset ratio threshold, further performs sentence segmentation processing on the text data of the English mail after the comma correction processing, and performs long sentence processing on the text data of the English mail when judging that long sentences with comma numbers greater than a preset number threshold exist in the text data of the English mail after the sentence segmentation processing, so as to obtain the text data of the processed English mail. Compared with the prior art, the method and the device solve the problem of irregular sentence division of the regular expression caused by different habits of different people on writing the mails, can achieve the aim of correctly dividing the sentences of the mail text, and provide technical support for the subsequent mining of mail data.

As shown in fig. 6, in an embodiment, an apparatus for processing text data of an english mail is provided, which may specifically include a text data obtaining unit 610, a punctuation mark type and number obtaining unit 620, a comma ratio determining unit 630, a comma correcting unit 640, a clause processing unit 650, and a long sentence processing unit 660.

A text data obtaining unit 610, configured to obtain text data of an english email to be processed.

In the embodiment of the present invention, the English mail text data is the text content of the English mail, and its format may be ASCII, MIME, txt, etc., which is not particularly limited; in addition, the mail text data may be obtained from e-mail application software such as Gmail, QQ Mail, 163 Mail, Sina Mail, Sohu Mail, 126 Mail and the like, which is not specifically limited.

A punctuation mark type and quantity obtaining unit 620, configured to obtain the type and the corresponding quantity of punctuation marks in the text data of the english mail.

In the embodiment of the present invention, the types of punctuation marks in the text data of the english mail include, but are not limited to, comma, period, exclamation point, question mark, and the like, and the proportion of a certain punctuation mark in the whole text data of the english mail in the total number of punctuation marks can be further calculated and obtained by obtaining the type and the corresponding number of each punctuation mark.

A comma ratio determining unit 630, configured to determine the comma ratio according to the type and the corresponding number of the punctuation marks.

In the embodiment of the invention, big data shows that the comma is the punctuation mark most people use most often when writing mails, and a comma is often written where a typical sentence-ending mark such as a period, an exclamation mark or a question mark should be used; the comma ratio is therefore determined from the obtained number of each punctuation mark. In general, English text is segmented into sentences only at periods, exclamation marks and question marks, so in the embodiment of the present invention only commas and these marks need to be counted, and other marks such as colons, semicolons and ellipses are treated as ordinary characters. For example, the number of commas num_1, the number of periods num_2, the number of exclamation marks num_3 and the number of question marks num_4 in the mail text data are counted, and the comma ratio ratio_1 is then calculated according to equation (1):

$$ratio_1 = \frac{num_1}{num_1 + num_2 + num_3 + num_4} \tag{1}$$
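The counting described above can be sketched in a few lines of Python. The function name `comma_ratio` is an assumption made for illustration, and the sketch assumes the ratio is the number of commas divided by the total number of commas, periods, exclamation marks and question marks; all other characters are ignored.

```python
from collections import Counter


def comma_ratio(text):
    """Share of commas among the four counted punctuation marks."""
    counts = Counter(ch for ch in text if ch in ",.!?")
    total = sum(counts.values())
    # A text without any counted punctuation mark gets a ratio of 0.
    return counts[","] / total if total else 0.0


print(comma_ratio("a, b, c, d."))  # prints 0.75
```

Here the text has three commas and one period, so the ratio is 3 / 4.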

And a comma correction unit 640, configured to correct the comma according to a preset N-Gram language model when the comma ratio is not less than a preset proportion threshold.

In the embodiment of the present invention, the preset proportion threshold may be set according to the performance of the terminal and/or the implementation requirements in a specific implementation, and its size is not specifically limited in the embodiment of the present invention; for example, the preset proportion threshold may be 0.3, 0.5, 0.7, and the like. The invention preferably uses a preset proportion threshold of 0.7; that is, if the comma ratio ratio_1 is greater than or equal to 0.7, comma correction is carried out; otherwise, this step is skipped.
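A minimal Python sketch of this decision is given below, working directly from the four punctuation counts. The function name `should_correct_commas` and its parameter names are assumptions made for illustration; the default threshold of 0.7 follows the preferred value stated above.

```python
def should_correct_commas(num_commas, num_periods, num_exclamations,
                          num_questions, threshold=0.7):
    """Return True when the comma ratio reaches the preset proportion threshold."""
    total = num_commas + num_periods + num_exclamations + num_questions
    if total == 0:
        return False  # no counted punctuation at all, nothing to correct
    return num_commas / total >= threshold


print(should_correct_commas(7, 2, 0, 1))  # prints True
print(should_correct_commas(3, 5, 1, 1))  # prints False
```

In the first call the ratio is 7 / 10, which meets the threshold; in the second it is 3 / 10, so comma correction would be skipped.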

In the embodiment of the invention, to train the preset N-Gram language model, a multilingual Wikipedia corpus can be used as the training corpus, a multilingual word segmentation tool is used for word segmentation, and the N-Gram language model is then trained with an N-Gram language model training tool. When the value of N is large, training the N-Gram language model requires a huge training corpus, data sparsity becomes severe, and the time complexity increases greatly, so N in the N-Gram language model is set to 3 in the embodiment of the invention. Long-tail words (i.e. words whose total number of occurrences is less than 10) are uniformly marked as LongTailWord, and numbers are uniformly marked as NUMERAL. The N-Gram language model training tool can be one of the open-source tools such as SRILM, IRSTLM, BerkeleyLM and KenLM.
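The corpus normalization described above (long-tail words and numbers) can be sketched in Python as follows. The function name `normalize_corpus`, the regular expression used to recognise numbers, and the choice to replace numbers before counting word frequencies are assumptions made for illustration; the actual training would then be run with one of the tools named above.

```python
import re
from collections import Counter

NUMBER_PATTERN = re.compile(r"\d+(\.\d+)?")  # assumed definition of a "number" token


def normalize_corpus(tokenized_sentences, min_count=10):
    """Replace numbers with NUMERAL and rare words with LongTailWord."""
    # First pass: map every numeric token to the NUMERAL placeholder.
    mapped = [
        ["NUMERAL" if NUMBER_PATTERN.fullmatch(tok) else tok for tok in sent]
        for sent in tokenized_sentences
    ]
    # Second pass: count words and replace those seen fewer than min_count times.
    counts = Counter(tok for sent in mapped for tok in sent)
    return [
        [tok if tok == "NUMERAL" or counts[tok] >= min_count else "LongTailWord"
         for tok in sent]
        for sent in mapped
    ]


sentences = [["we", "met", "3", "times"]] + [["we", "met"]] * 10
print(normalize_corpus(sentences)[0])
# prints ['we', 'met', 'NUMERAL', 'LongTailWord']
```

In this toy corpus "we" and "met" each occur 11 times and are kept, "3" becomes NUMERAL, and "times" occurs once and becomes LongTailWord.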

Specifically, in the embodiment of the present invention, the English mail text data is subjected to primary sentence division to obtain a primary sentence list lst = [s1, s2, …, sn], where n is the number of primary sentences. The initial index i is set to 0 and the end index j is set to 1. If the first character of the sentence lst[j] is lowercase, or the first character is a number, or the first character is a connector such as & or -, or the first word is all uppercase, the end index j is incremented by 1 until the condition is no longer satisfied; otherwise, the sentences lst[i:j] with indexes i to j are taken out as the primary first sentence (excluding the sentence lst[j]). If the sentence-end character of the primary first sentence lst[i:j] is not a punctuation mark, a comma, a period, an exclamation mark and a question mark are each added after the primary first sentence, the N-Gram score of each is calculated according to the preset N-Gram language model, and the punctuation mark with the highest score is used. The primary first sentence lst[i:j] is then returned, and the remaining sentence list lst = [sj, sj+1, …, sn] is also returned; that is, lst = [sj, sj+1, …, sn] is the new-generation primary sentence list obtained by removing the primary first sentence lst[i:j]. When it is determined that sentences remain in the primary sentence list, i.e. the number of sentences is not 0, the process returns to the step of extracting the primary first sentence from the primary sentence list according to the preset primary first sentence condition; when no sentence remains in the primary sentence list, i.e. the number of sentences is 0, comma correction of the mail text data is complete.
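The comma correction loop described above can be sketched in Python as below. This is an illustration under stated assumptions rather than the unit's actual implementation: the function names are hypothetical, the `score` argument is any callable that returns a higher value for a more plausible sentence (standing in for the preset N-Gram language model), merged segments are rejoined with ", ", and a one-letter word such as "I" is not treated as an all-uppercase first word.

```python
MARKS = (",", ".", "!", "?")


def starts_like_continuation(segment):
    """True if the segment looks like the continuation of the previous one."""
    words = segment.split()
    if not words:
        return False
    first_char, first_word = words[0][0], words[0]
    return (
        first_char.islower()
        or first_char.isdigit()
        or first_char in "&-"
        # Assumption: one-letter words such as "I" do not count as all uppercase.
        or (len(first_word) > 1 and first_word.isupper())
    )


def comma_correct(segments, score):
    """Rebuild the text, choosing each end punctuation mark with score()."""
    corrected = []
    lst = list(segments)
    while lst:  # repeat until the sentence list is empty
        j = 1
        while j < len(lst) and starts_like_continuation(lst[j]):
            j += 1
        sentence = ", ".join(lst[:j])  # assumed way of rejoining merged segments
        if not sentence.endswith(MARKS):
            # Try each candidate mark and keep the highest-scoring one.
            sentence += max(MARKS, key=lambda mark: score(sentence + mark))
        corrected.append(sentence)
        lst = lst[j:]
    return " ".join(corrected)


def toy_score(sentence):
    # Stand-in scorer for the demonstration: it always prefers a period.
    return 1.0 if sentence.endswith(".") else 0.0


segments = ["I have a good friend", "and her name is Li Hua",
            "We have become friends for about two years"]
print(comma_correct(segments, toy_score))
```

With the stand-in scorer the output is "I have a good friend, and her name is Li Hua. We have become friends for about two years."; a real language model could also choose a comma, an exclamation mark or a question mark.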

And a sentence dividing processing unit 650, configured to perform sentence dividing processing on the english mail text data after comma correction processing according to the preset N-Gram language model.

In the embodiment of the invention, the English mail text data after comma correction processing is subjected to sentence division to obtain a first-level sentence list lst = [s1, s2, …, sn], where n is the number of first-level sentences. The initial index i is set to 0 and the end index j is set to 1. If the last character of the sentence lst[i] is a punctuation mark, the sentences lst[i:j] with indexes i to j are taken out as the first-level first sentence; otherwise, if the first character of the sentence lst[j] is lowercase, or the first character is a number, or the first character is a connector such as & or -, or the first word is all uppercase, the end index j is incremented by 1 until this condition is no longer satisfied. If the sentence-end character of the first-level first sentence lst[i:j] is not a punctuation mark, a period, an exclamation mark and a question mark are each added to the end of the sentence to calculate the N-Gram score, and the punctuation mark with the highest score is used. The first-level first sentence lst[i:j] is returned, and the remaining sentence list lst = [sj, sj+1, …, sn] is also returned; that is, lst = [sj, sj+1, …, sn] is the new-generation first-level sentence list obtained by removing the first-level first sentence lst[i:j]. When it is determined that sentences remain in the first-level sentence list, i.e. the number of sentences is not 0, the process returns to the step of extracting the first-level first sentence from the first-level sentence list according to the preset first-level first sentence condition; when no sentence remains in the first-level sentence list, i.e. the number of sentences is 0, sentence segmentation of the mail text data is complete.

And the long sentence processing unit 660 is configured to, when it is determined that a long sentence with a comma number greater than a preset number threshold exists in the english mail text data after the sentence segmentation processing, perform long sentence processing on the english mail text data according to the preset N-Gram language model to obtain processed english mail text data.

In the embodiment of the invention, the long sentence is cut at its commas to obtain a comma cut list lst = [s1, s2, …, sn], where n is the number of comma cuts. The initial index i is set to 0 and the end index j is set to 1. If the number of commas of the current sentence lst[i:j] is greater than or equal to 2 and the first word of the sentence lst[j] appears in the first word statistical list, the sentences lst[i:j] with indexes i to j are taken out as the secondary first sentence; if the first letter of the sentence lst[j] is lowercase, a comma is used to join the next sentence until this condition is no longer satisfied. If the sentence-end character of the secondary first sentence lst[i:j] is not a punctuation mark, a comma, a period, an exclamation mark and a question mark are each added after the sentence to calculate the N-Gram score, and the punctuation mark with the highest score is used. The secondary first sentence lst[i:j] is returned, and the remaining comma cut list lst = [sj, sj+1, …, sn] is also returned; that is, lst = [sj, sj+1, …, sn] is the new-generation comma cut list obtained by removing the secondary first sentence lst[i:j]. When it is determined that sentences remain in the comma cut list, i.e. the number of sentences is not 0, the process returns to the step of extracting the secondary first sentence from the comma cut list according to the preset secondary first sentence condition; when no sentence remains in the comma cut list, i.e. the number of sentences is 0, long sentence processing of the mail text data is complete.

The device for processing the text data of the english mail determines the comma occupation ratio by the type and the corresponding number of the punctuations in the text data of the english mail to be processed, corrects the comma when the comma occupation ratio is not less than a preset ratio threshold, further performs sentence segmentation processing on the text data of the english mail after the comma correction processing, and performs long sentence processing on the text data of the english mail when it is determined that long sentences with comma numbers greater than a preset number threshold exist in the text data of the english mail after the sentence segmentation processing, so as to obtain the text data of the processed english mail. Compared with the prior art, the method and the device solve the problem of irregular sentence division of the regular expression caused by different habits of different people on writing the mails, can achieve the aim of correctly dividing the sentences of the mail text, and provide technical support for the subsequent mining of mail data.

As shown in fig. 7, in an embodiment, an apparatus for processing text data of an english mail is provided, which is different from the apparatus shown in fig. 6 in that the apparatus further includes a character removing unit 710 for removing irregular characters in the text data of the english mail according to a preset standard format rule.

In the embodiment of the invention, when the format of the English mail text data is ASCII, characters in the English mail text data that are not in the ASCII code table are removed. ASCII (American Standard Code for Information Interchange) is a computer character encoding system based on the Latin alphabet, mainly used to display modern English and other Western European languages; it is the most common information interchange standard and is equivalent to the international standard ISO/IEC 646.
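For the ASCII case, the character removal can be sketched in Python as follows. The function name `remove_non_ascii` is an assumption made for illustration; the sketch keeps every character whose code point lies in the ASCII table (0 to 127) and drops all others.

```python
def remove_non_ascii(text):
    """Drop every character that is not in the ASCII code table."""
    return "".join(ch for ch in text if ord(ch) < 128)


print(remove_non_ascii("Hello, 世界! caf\u00e9"))  # prints "Hello, ! caf"
```

Note that this simply deletes accented letters such as "é" instead of mapping them to a plain ASCII letter; whether that is acceptable depends on the preset standard format rule in use.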

In one embodiment, a computer device is proposed, the computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

acquiring text data of an English mail to be processed;

respectively acquiring the types and the corresponding quantities of punctuations in the English mail text data;

determining comma occupation ratio according to the type and the corresponding number of the punctuation marks;

when the comma proportion is not less than a preset proportion threshold value, correcting the comma according to a preset N-Gram language model;

carrying out sentence segmentation processing on the English mail text data subjected to comma correction processing according to the preset N-Gram language model;

and when the long sentences with commas larger than a preset number threshold exist in the English mail text data after sentence segmentation processing, performing long sentence processing on the English mail text data according to the preset N-Gram language model to obtain the processed English mail text data.
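The order of the steps listed above can be summarised in a short Python sketch. The three stage functions are passed in as parameters because their implementations are described elsewhere; the name `process_mail_text`, the parameter names and the default comma threshold of 1 are assumptions made for illustration, while the default proportion threshold of 0.7 is the preferred value of the embodiment.

```python
def process_mail_text(text, correct_commas, segment, split_long,
                      ratio_threshold=0.7, comma_threshold=1):
    """Run the three stages in order and return the final sentence list."""
    # Count the four punctuation marks and compute the comma ratio.
    marks = [ch for ch in text if ch in ",.!?"]
    ratio = marks.count(",") / len(marks) if marks else 0.0
    # Comma correction is skipped when the ratio is below the threshold.
    if ratio >= ratio_threshold:
        text = correct_commas(text)
    result = []
    for sentence in segment(text):
        # Long sentences are handed to the long sentence stage.
        if sentence.count(",") > comma_threshold:
            result.extend(split_long(sentence))
        else:
            result.append(sentence)
    return result


# Stand-in stages used only to exercise the control flow.
def demo_correct(text):
    return text.replace(", We", ". We")


def demo_segment(text):
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def demo_split(sentence):
    return sentence.split(", ")


print(process_mail_text("Hi Tom, We met today, it was fun, see you.",
                        demo_correct, demo_segment, demo_split))
```

The stand-in stages are deliberately trivial; in the described method each of them would rely on the preset N-Gram language model.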

In one embodiment, a computer readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the steps of:

acquiring text data of an English mail to be processed;

respectively acquiring the types and the corresponding quantities of punctuations in the English mail text data;

determining comma occupation ratio according to the type and the corresponding number of the punctuation marks;

when the comma proportion is not less than a preset proportion threshold value, correcting the comma according to a preset N-Gram language model;

carrying out sentence segmentation processing on the English mail text data subjected to comma correction processing according to the preset N-Gram language model;

and when the long sentences with commas larger than a preset number threshold exist in the English mail text data after sentence segmentation processing, performing long sentence processing on the English mail text data according to the preset N-Gram language model to obtain the processed English mail text data.

It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in sequence as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated herein, the execution of the steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least a portion of the steps in various embodiments may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performance of the sub-steps or stages is not necessarily sequential, but may be performed in turn or alternately with other steps or at least a portion of the sub-steps or stages of other steps.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
