Named entity identification and extraction using genetic programming

Document No.: 1047875    Publication Date: 2020-10-09

Reading note: This technology, Named Entity Identification and Extraction Using Genetic Programming, was designed and created by 王德胜, 刘佳伟, and 章鹏 on 2020-04-24. Its main content includes: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating pattern programs using genetic algorithms are disclosed. The genetic algorithms operate on example data strings that represent categories of data to be identified or extracted by named entity recognition. In an initialization phase, initial pattern programs are generated based on example data strings representing a category of data to be identified or extracted by named entity recognition. Starting from the initial pattern programs, genetic operations are performed iteratively to generate multiple generations of offspring pattern programs. In each round of genetic operation, offspring pattern programs are generated by cross-breeding operations and mutation operations.

1. A computer-implemented method, the method comprising:

receiving a plurality of first data strings;

identifying substrings of characters from the plurality of first data strings;

obtaining a first population of candidate programs based at least in part on the plurality of first data strings, the substrings represented as individual units in the candidate programs in the first population of candidate programs;

generating a second population of candidate programs by performing an iterative genetic operation on the first population of candidate programs, the iterative genetic operation including calculating a fitness score for each candidate program in the second population of candidate programs using a fitness function and the plurality of first data strings, the fitness function evaluating a rate of match of a candidate program with the plurality of first data strings; and

a plurality of second data strings are extracted from the data stream using a first candidate program of the second population of candidate programs.

2. The method of claim 1, wherein:

the first group of candidate programs includes a first number of candidate programs,

the second population of candidate programs includes a second number of candidate programs, and

the second number is reduced from the first number.

3. The method of claim 2, wherein the second number is reduced from the first number following one or more of an exponential decay algorithm, a linear decay algorithm, or an interleaved decay algorithm.

4. The method of claim 3, further comprising setting a minimum number of candidate programs for the second population.

5. The method of claim 1, wherein the fitness function evaluates a length of a candidate program against a data string length of the plurality of first data strings.

6. The method of claim 5, wherein the plurality of first data strings includes a first set of positive example data strings, wherein each positive example data string represents a target data category for a named entity identification task, and the fitness function evaluates the length of the candidate program against an average length of all of the first set of positive example data strings.

7. The method of claim 1, wherein:

the plurality of first data strings includes a first set of positive example data strings each representing a target data category of a named entity identification task and a second set of negative example data strings each representing a data category that is not the target data category; and

the fitness function evaluates a first number of positive example data strings in the first set of positive example data strings that completely match a candidate program and a second number of negative example data strings in the second set of negative example data strings that completely match the candidate program.

8. The method of claim 1, wherein:

the plurality of first data strings includes a first set of positive example data strings each representing a target data category of a named entity identification task and a second set of negative example data strings each representing a data category that is not the target data category; and is

The fitness function evaluates a first number of characters in the first positive example data string set that match a candidate procedure and a second number of characters in the second negative example data string set that match the candidate procedure.

9. The method of claim 1, further comprising:

obtaining a plurality of third data strings, the plurality of third data strings being a subset of the plurality of second data strings; and

generating a second candidate program by performing the iterative genetic operation on a second population of the candidate programs using the plurality of third data strings.

10. The method of any preceding claim, further comprising:

grouping the plurality of first data strings into a first group of data strings and at least one second group of data strings; and

performing the iterative genetic operation separately on the first population of candidate programs using each of the first group of data strings and the at least one second group of data strings.

11. The method of any preceding claim, wherein the iterative genetic operation comprises a cross-breeding operation and a mutation operation.

12. The method of any preceding claim, wherein the candidate programs in each of the first population of candidate programs or the second population of candidate programs are regular expressions.

13. The method of claim 9, wherein:

the fitness score of the first candidate program is highest in the second population of candidate programs;

the iterative genetic operation is performed on the second population of candidate programs using the plurality of third data strings to generate a third population of candidate programs; and

the fitness score of the second candidate program is highest in the third population of candidate programs.

14. The method of claim 13, wherein the fitness score of the second candidate program is higher than the fitness score of the first candidate program, each calculated using at least one of the plurality of first data strings and the plurality of third data strings.

15. A system, comprising:

one or more processors; and

one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform the method of any of claims 1-14.

16. An apparatus comprising a plurality of means for performing the method of any one of claims 1-14.

Background

Advances in network and storage subsystem design continue to enable the processing of increasingly large data streams between and within computer systems. At the same time, the content of such data streams is subject to increasingly stringent scrutiny. For example, the collection, analysis, and storage of personal data is subject to review and supervision. Organizations must ensure that personal data is collected legally under stringent conditions. Organizations that collect and manage personal data are obligated to protect such data from abuse and illegal use and to respect the rights of the data owners. Personal data or other sensitive data includes, but is not limited to, name, date of birth, place of birth, identification number, home address, credit card number, telephone number, email address, URL, IP address, bank account number, and the like.

The classification and extraction of personal data or other sensitive data from a data stream involves named entity recognition. In general, named entity identification is an information extraction task that aims to identify and classify atomic elements in text into predefined categories, such as personal names, personal identifiers (such as a social security number "SSN" or a resident identification number), home addresses, e-mail addresses, bank account numbers, telephone numbers, credit card numbers, and the like. These predefined categories of data are referred to as "named entities" or simply "entities". Entities typically follow some type of syntactic pattern. Programs such as regular expressions, deterministic finite state automata, or symbolic finite state automata are used to specify patterns in a data stream. However, generating such programs typically involves a significant amount of expert programming effort, which is both inefficient and slow. In the era of big data and cloud-based services, service providers or platforms face the need to handle entity identification tasks over a large variety of data stream categories, which cannot be handled through manual programming.

Therefore, there is a need for efficiently generating programs for named entity recognition tasks.

Disclosure of Invention

Techniques for generating pattern programs using genetic algorithms are described herein. The genetic algorithms operate on example data strings that represent categories of data to be identified or extracted by named entity recognition. Such example data strings are referred to as "positive example" data strings. The genetic algorithm may also operate on negative example data strings, which represent data strings that are not positive example data strings, e.g., are not targets of the named entity recognition task. In an initialization phase, initial pattern programs are generated based on example data strings representing a category of data to be identified or extracted by named entity recognition. Starting from the initial pattern programs, genetic operations are performed iteratively to generate multiple generations of offspring pattern programs. In each round of genetic operation, offspring pattern programs are generated by cross-breeding operations and mutation operations. A small portion of randomly generated pattern programs is added to each generation of offspring pattern programs. A fitness function is used to determine a fitness score for each pattern program in each generation of offspring pattern programs. The fitness scores are used to filter the offspring pattern programs in a generation so that the population size of each generation of offspring pattern programs remains stable; for example, each generation includes the same number of offspring pattern programs. After the iterative genetic operation is completed, the pattern program with the highest fitness score is selected for the named entity recognition task.
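The following Python sketch illustrates the evolutionary loop described above. It is a minimal illustration only; the helper names (init_population, crossover, mutate, random_program, fitness) and the numeric parameters are hypothetical placeholders rather than the implementation of the embodiments.

import random

def evolve(positive, negative, init_population, crossover, mutate, random_program,
           fitness, generations=50, mutation_rate=0.1, random_fraction=0.05):
    # Initialization phase: initial pattern programs are derived mostly from the
    # positive example data strings, plus a few randomly generated ones.
    population = init_population(positive)

    for _ in range(generations):
        # Score every candidate with the fitness function and the example data strings.
        scored = sorted(population, key=lambda p: fitness(p, positive, negative), reverse=True)
        parents = scored[:max(2, len(scored) // 2)]      # fitter candidates become parents

        offspring = []
        while len(offspring) < len(population):          # keep the population size stable
            r = random.random()
            if r < random_fraction:
                offspring.append(random_program())       # small portion of random programs
            elif r < random_fraction + mutation_rate:
                offspring.append(mutate(random.choice(parents)))   # mutation operation
            else:
                a, b = random.sample(parents, 2)
                offspring.append(crossover(a, b))        # cross-breeding operation
        population = offspring

    # After the iterative genetic operation, select the highest-scoring pattern program.
    return max(population, key=lambda p: fitness(p, positive, negative))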

If the genetic operation fails to generate a pattern program with the desired extraction behavior, the example data strings are classified into two or more subgroups based on, for example, the type or length of each example data string. The genetic operation is performed in parallel for each subgroup of example data strings, with each subgroup generating a corresponding pattern program. The resulting pattern programs are then linked by an "or" function label.

The fitness function includes one or more factors related to: (1) the simplicity of the pattern program; (2) a first match rate of the pattern program on the positive example data strings; (3) a second match rate of the pattern program on the negative example data strings; or (4) the edit distance between the pattern program and the example data strings.

The genetic algorithm processes example data strings, each of which exactly represents a target data category to be identified by named entity recognition. These technical features provide valuable technical advantages. First, the pattern programs generated by the genetic algorithm have tailored extraction behavior, because the genetic algorithm efficiently captures and carries forward the good "genes" contained in the example data strings. In this way, the generated pattern program correctly detects and extracts data strings of the target data category. Moreover, the use of such example data strings also reduces manual input and errors in the process, because there is no need to manually identify named entities from non-representative data strings. Furthermore, the initial population of pattern programs is generated primarily, e.g., 90%, from the example data strings, which significantly reduces the amount of iterative genetic operation required to achieve a satisfactory pattern program. In the era of big data and cloud-based data services, saving computing resources is critical to managing large-scale data streams.

Further, the fitness function considers whether the pattern program matches a negative example data string that is not the target of the named entity recognition task. Thus, a pattern program selected based on the fitness function will avoid the class of data represented by the negative example data string. Thus, false positive errors will be significantly reduced, which makes the results of the named entity recognition task more reliable and meaningful. As such, the techniques herein are efficient and suitable for performing named entity identification tasks on large-scale data streams.

Also provided herein are one or more non-transitory computer-readable storage media coupled to one or more processors and having instructions stored thereon that, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with embodiments of the methods provided herein.

The present application also provides a system for implementing the methods provided herein. The system includes one or more processors and a computer-readable storage medium coupled to the one or more processors and having instructions stored thereon, which, when executed by the one or more processors, cause the one or more processors to perform operations in accordance with embodiments of the methods provided herein.

It should be understood that any combination of the aspects and features described herein may be included in accordance with the methods herein. That is, methods according to the present disclosure are not limited to the combinations of aspects and features specifically described herein, but also include any combination of the aspects and features provided.

The details of one or more embodiments of the disclosure are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings, and from the claims.

Drawings

FIG. 1 is a diagram illustrating an example of an environment that may be used to perform embodiments herein.

FIG. 2 is a diagram illustrating an example of operations according to embodiments herein.

FIG. 3 is an example program generation module that generates a pattern program according to embodiments herein.

FIG. 4 schematically illustrates an example process of generating a pattern program using a genetic algorithm according to embodiments herein.

FIG. 5A schematically illustrates an example process of generating a candidate pattern program using byte pair encoding according to embodiments herein.

FIG. 5B schematically illustrates an example operation of generating a candidate pattern program using byte pair encoding according to embodiments herein.

FIG. 6 schematically illustrates another example process of generating a pattern program using a genetic algorithm according to embodiments herein.

Like reference numbers and designations in the various drawings indicate like elements.

Detailed Description

Techniques for generating pattern programs using genetic algorithms are described herein. The genetic algorithms operate on example data strings that represent categories of data to be identified or extracted by named entity recognition. Such example data strings are referred to as "positive example" data strings. The genetic algorithm may also operate on negative example data strings, which represent data strings that are not positive example data strings, e.g., are not targets of the named entity recognition task. In an initialization phase, initial pattern programs are generated based on example data strings representing a category of data to be identified or extracted by named entity recognition. In some embodiments, byte-pair encoding techniques are used to extract frequent substrings from the example data strings, and each extracted frequent substring is treated as a single expression unit when generating the initial pattern programs. Starting from the initial pattern programs, genetic operations are performed iteratively to generate multiple generations of offspring pattern programs. In each round of genetic operation, offspring pattern programs are generated by cross-breeding operations and mutation operations. A small portion of randomly generated pattern programs is added to each generation of offspring pattern programs. A fitness function is used to determine a fitness score for each pattern program in each generation of offspring pattern programs. In some embodiments, the fitness function evaluates the length of a pattern program against the length of the example data strings, e.g., the average length of the example data strings. In some embodiments, the fitness function evaluates a first number of positive example data strings that exactly match the candidate program against a second number of negative example data strings that exactly match the candidate program. In some embodiments, the fitness function evaluates a third number of characters in the positive example data strings matched by the candidate program against a fourth number of characters in the negative example data strings matched by the candidate program.

The fitness scores are used to filter the offspring pattern programs in a generation so that the population size of each generation of offspring pattern programs remains stable. For example, each generation includes the same number of offspring pattern programs, or a smaller number of offspring pattern programs than the parent population or the population of initial pattern programs. In some embodiments, the population size of successive generations decays exponentially. After the iterative genetic operation is completed, the pattern program with the highest fitness score is selected for the named entity recognition task.
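As one hedged illustration of how a decaying population size with a floor could be scheduled, the following Python sketch uses hypothetical rate, step, and minimum values:

def next_population_size(current_size, decay="exponential", rate=0.8, step=10, minimum=50):
    # Shrink the population from one generation to the next, but never below a floor.
    # An "interleaved" schedule could alternate these two rules across generations.
    if decay == "exponential":
        new_size = int(current_size * rate)
    else:  # linear decay
        new_size = current_size - step
    return max(new_size, minimum)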

If the genetic operation fails to generate a pattern program with the desired extraction behavior, the example data strings are classified into two or more subgroups based on, for example, the type or length of each example data string. The genetic operation is performed in parallel for each subgroup of example data strings, with each subgroup generating a corresponding pattern program. The resulting pattern programs are then linked by an "or" function label.

The fitness function includes one or more factors related to: (1) the simplicity of the pattern program; (2) a first match rate of the pattern program on the positive example data strings; (3) a second match rate of the pattern program on the negative example data strings; or (4) the edit distance between the pattern program and the example data strings. In some embodiments, the simplicity of a pattern program is not evaluated as absolute simplicity, e.g., the absolute length of the pattern program, but as relative simplicity with respect to the average length of the positive example data strings. For example, a pattern program whose length is closer to the average length of the positive example data strings has a higher fitness score, whether it is longer or shorter than that average, than a pattern program whose length is further from the average length of the positive example data strings.

The present disclosure is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and may be applied in various ways that provide benefits and advantages in computing, programming, and data management in general.

FIG. 1 is an operating environment 100 for detecting data strings of a target data category from a data stream, e.g., for named entity identification. Environment 100 includes one or more users 110, a service provider 120, and one or more sources of data streams 130, all communicatively or logically connected to each other via a network 140. Each of the users 110 or service providers 120 is a computing device, such as a personal computer ("PC"), a server, a router, a network PC, a mobile device, a peer device, or another common network node. Network 140 may be one or more local area networks ("LANs") and/or one or more wide area networks ("WANs") or other networks, such as an enterprise-wide computer network, an intranet, or the Internet. The user 110, the service provider 120, and the source of the data stream 130 are classified based on their corresponding functions in the environment 100. They may reside physically on the same computing device or on physically separate computing devices, and they may belong to the same individual or legal entity or to different individuals or legal entities. For example, the user 110 may be a business unit of a financial technology company that needs sensitive data detection, identification, and review tasks performed on the data stream 130. The service provider 120 may be a technical service department of the same financial technology company or a third-party service provider that provides services related to named entity identification.

FIG. 2 illustrates a flow diagram of a process flow 200 of operations or interactions between parties in the environment 100. Referring to fig. 1 and 2 together, in an example operation 210, a user 110 specifies or provides an initial set of example data strings to a service provider 120. The data string includes a combination of one or more letters, characters, symbols, and/or other expression elements. The initial set of example data strings contains examples that accurately represent the target data categories that the user 110 desires to extract or identify from the data stream 130. That is, the example data string does not contain any characters or data bits other than the represented target data class.

The data stream 130 may be specific to the user 110, or may be shared by or applicable to multiple users 110. Similarly, a user 110 may use two or more data streams 130, and may have the same or different named entity identification tasks for each data stream 130. For different data streams 130, the user 110 may provide different initial sets of example data strings. Accordingly, the user 110 may provide named entity identification tasks to the service provider 120, each task specifying the applicable data stream 130 and the corresponding initial set of example data strings. A task may also specify the data category of the data strings to be identified. The specified data category may already be represented by the initial set of example data strings, or may be represented by example data strings as further described herein. For example, the user 110 may request that personal data be identified from the data stream 130. Example personal data include a person's name, date of birth, place of birth, identification number, home address, credit card number, telephone number, email address, URL, IP address, bank account number, and the like. In some embodiments, the user 110 provides example data strings of personal data to the service provider 120. The example data strings of personal data may include multiple categories of personal data, such as phone numbers, personal identification numbers, credit card numbers, and the like. The example data strings of personal data may also include various data formats or pattern formats of the same category of personal data. For example, example phone numbers may include the following patterns (an illustrative regular expression covering these formats is sketched after the list):

001.234.456.7899;

+1.234.456.7899;

1 234 456 7899;

(234)456 7899;

234 456 7899
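As a hedged illustration only (in the embodiments, the pattern program is produced by the genetic algorithm rather than written by hand), a single regular expression covering the five example formats above could look like the following Python snippet; the pattern itself is hypothetical:

import re

# Hypothetical pattern covering the example phone-number formats listed above.
phone_pattern = re.compile(r"(?:\+?\d{1,3}[ .])?(?:\(\d{3}\)|\d{3})[ .]?\d{3}[ .]\d{4}")

examples = ["001.234.456.7899", "+1.234.456.7899", "1 234 456 7899",
            "(234)456 7899", "234 456 7899"]
assert all(phone_pattern.fullmatch(s) for s in examples)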

In some embodiments, the example data strings themselves each exactly contain the target data category. In some embodiments, at least some of the provided example data strings each contain a first segment representing the target data category and a second segment not representing the target category. The second segment may provide context for identifying the target category of data contained in the first segment. The user 110 may identify the first segment as representing the target data category. The user 110 may also specify, in the named entity recognition task, that the service provider is to recognize the first segment as representing the target data category. In some embodiments, the example data strings may also include data strings representing data categories that are not targets of the named entity identification task of the user 110. In the description herein, a "positive example data string" refers to an example data string that represents the target data category, and a "negative example data string" refers to an example data string that represents a data category that is not the target of the named entity recognition task.

In an example operation 220, the program generation module 122 uses a genetic algorithm to generate a named entity identification program based on the example data strings provided by the user 110. The generated named entity recognition program represents a syntactic data pattern of the target data category and is referred to herein, for descriptive purposes, as a "pattern program". The pattern program may be in the form of a regular expression, a deterministic finite state automaton ("DFA"), a symbolic finite state automaton ("SFA"), or another suitable program that represents a syntactic data pattern. In some embodiments, the pattern program is generated by a genetic algorithm implemented by the program generation module 122 of the service provider 120. In the description herein, regular expressions are used as example pattern programs to illustrate the operation of the service provider 120 and/or the program generation module 122.

In some embodiments, the program generation module 122 performs an initialization operation, a synthesis operation, and a validation operation. In the initialization operation, an initial population of candidate programs is obtained. In some embodiments, the majority of the initial candidate programs are obtained based on the positive example data strings. For example, for each positive example data string, one or more candidate regular expressions are obtained whose extraction behavior is consistent with the target data category represented by that positive example data string. It should be appreciated that multiple regular expressions may be generated for each target data category. In some embodiments, some candidate programs are randomly generated. The ratio between the number of candidate programs obtained based on the positive example data strings and the number of randomly generated candidate programs is an adjustable parameter of the initialization operation. In some embodiments, the ratio is 9:1, so that the desired extraction behavior, or good "genes", of the positive example data strings is readily captured and extended in the genetic operation. The population size of the initial population of candidate regular expressions, e.g., the total number of candidate regular expressions, is another tunable parameter of the initialization operation.

In the synthesis operation, or genetic operation, the initial candidate programs evolve through the operations of the genetic algorithm. The genetic algorithm is implemented in an iterative manner. In each round of evolution, candidate programs in the parent population are synthesized to create child candidate programs. The synthesis operations may include cross-breeding operations and mutation operations. The ratio between the child candidate programs generated by cross-breeding operations and the child candidate programs generated by mutation operations is an adjustable parameter of the synthesis operation. In some embodiments, the ratio is about 9:1, and it can be adjusted to be greater or less than 9:1. The candidate programs are each evaluated by a fitness function to determine a fitness score. The fitness score represents the degree to which the extraction behavior of a pattern program is consistent with the target data category represented by the example data strings, or by other data strings used in the fitness score calculation. In some embodiments, the example data strings used to generate the initial candidate programs are also used to calculate the fitness scores of the initial candidate programs or child candidate programs. In some embodiments, the example data strings provided by the user 110 are divided into two groups: one group of example data strings is used to generate the initial candidate programs, and the other group is used to calculate the fitness scores of the candidate programs. The latter approach may help avoid over-fitting problems, if any. In the description herein, for purposes of illustration, the example data strings used to generate the initial candidate programs are also used to calculate the fitness scores of the candidate programs, which does not limit the scope herein.
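A minimal sketch of one way cross-breeding and mutation could be realized is shown below in Python. It assumes, purely for illustration, that candidate regular expressions are represented as flat lists of component tokens rather than as the syntax trees used in the embodiments, and the token set is hypothetical:

import random

MUTATION_TOKENS = ["\\d", "\\w", "[0-9]", "[a-z]", ".", "+", "?"]

def cross_breed(parent_a, parent_b):
    # Swap tails of two parent token lists at random cut points.
    i = random.randint(1, max(1, len(parent_a) - 1))
    j = random.randint(1, max(1, len(parent_b) - 1))
    return parent_a[:i] + parent_b[j:]

def mutate(parent):
    # Replace one randomly chosen component with a random token.
    child = list(parent)
    if child:
        child[random.randrange(len(child))] = random.choice(MUTATION_TOKENS)
    return child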

In some embodiments, the fitness score of a candidate program may affect its use, if any, in the next round of evolution. For example, a new parent population of candidate programs may be selected based on the fitness scores of the existing candidate programs; candidates with lower fitness scores may be filtered out as not "fit" enough to become parents for the next generation of evolution. In some examples, the probability that a candidate program is selected for cross-breeding and/or mutation depends on its fitness score. For example, a candidate program with a higher fitness score may have a higher probability of being paired with another candidate program in a cross-breeding operation. Candidate programs with higher fitness scores are also typically selected for mutation operations with higher probability, although the selection probabilities for mutation operations vary less strongly with fitness than those for cross-breeding operations.

In some embodiments, a new parent population is selected only from the latest generation of child candidate programs. In some embodiments, a new parent population is selected from the pool of all existing candidate programs based on fitness scores; for example, the existing candidate programs with higher fitness scores are selected to form the new parent population. In the description herein, a "parent" candidate program refers to a candidate program of the generation used to generate new candidate programs in a synthesis operation, and a "child" candidate program refers to a candidate program of the generation produced by a synthesis operation. If the entire population of the newest offspring is used for the next round of genetic operation, the newest child candidate programs overlap exactly with the new parent candidate programs. The term "generation" of candidate programs is used as applicable to either child or parent candidate programs.

In some embodiments, a generation of candidate programs also includes a percentage of randomly generated candidate programs, for example in a range between 5% and 15%. Mixing the candidate programs generated by the genetic operation with randomly generated candidate programs ensures that "good genes" are maintained from generation to generation while new "enhancing genes" are introduced. Thus, the fitness scores of the candidate programs typically improve from generation to generation. When the fitness score of a candidate program reaches a first threshold, or the total number of evolution rounds reaches a second threshold, the synthesis operation is complete. After the synthesis operation is completed, the candidate program with the highest fitness score is selected as the final pattern program to be used in the named entity recognition task. The final program does not have to be selected from the last generation of candidate programs; it may be selected from any generation of candidate programs.

The fitness function may have various forms and criteria, all of which are included within the scope herein. In some embodiments, the fitness function includes factors related to: the compactness of the candidate regular expression, e.g., the length of the candidate expression; a first match rate of the candidate regular expression on the positive example data string; a second match rate of the candidate regular expression on the negative example data string; or an edit distance between the candidate regular expression and the example data string.

The first match rate is calculated as the ratio of the number of positive example data strings that completely (100%) match the candidate regular expression to the total number of positive example data strings. The second match rate is calculated as the ratio of the number of negative example data strings that completely (100%) match the candidate regular expression to the total number of negative example data strings. The edit distance is determined as the minimum number of edits needed to convert the extracted data string into the target data category contained in the positive example data string. For example, in some embodiments, the characters in the example data strings that are matched by the extraction behavior of the candidate regular expression and the characters that are ignored by that extraction behavior are analyzed to determine the edit distance of the candidate regular expression.
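A minimal Python sketch of these two measurements follows; the helper names are hypothetical, and the edit distance shown is a plain character-level Levenshtein distance between two strings, whereas the embodiments compare the extracted segment with the target segment:

import re

def full_match_rate(regex, examples):
    # Fraction of example data strings matched 100% by the candidate regular expression.
    if not examples:
        return 0.0
    return sum(1 for s in examples if re.fullmatch(regex, s)) / len(examples)

def edit_distance(a, b):
    # Minimum number of single-character edits turning string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]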

In an example operation 230, the extraction module 124 extracts the target data category from the data stream 130 using the regular expression generated by the program generation module 122. Specifically, the extraction module 124 finds a data string in the data stream 130 that matches the pattern represented by the regular expression. In some embodiments, a percentage match threshold may be used to implement the extraction operation. For example, if a data string in the data stream 130 includes characters or segments that match a regular expression by a percentage greater than 55%, the extraction module 124 extracts the data string as belonging to a target data category. The percentage match threshold may be adjusted based on the configuration of the named entity identification task, e.g., a tolerance specified by the user 110 for false positives or false negatives.
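A hedged sketch of such a threshold-based extraction step in Python is shown below; the 55% figure and the character-coverage criterion follow the example above, and the function name is hypothetical:

import re

def extract_candidates(regex, data_strings, threshold=0.55):
    # Extract a data string when the fraction of its characters covered by matches
    # of the pattern program exceeds the percentage-match threshold.
    pattern = re.compile(regex)
    extracted = []
    for s in data_strings:
        matched_chars = sum(m.end() - m.start() for m in pattern.finditer(s))
        if s and matched_chars / len(s) > threshold:
            extracted.append(s)
    return extracted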

In an example operation 240, the extracted or identified data string is provided to the user 110. The user 110 may verify the provided extraction results, may confirm the correct extraction, and may identify false extractions, e.g., false positives or false negatives.

In the learning operation 250, the user 110 provides the correct extractions and/or the incorrect extractions as a training set to the service provider 120 to adjust the synthesis operation. For example, a false positive extraction is used as an additional negative example data string when training the candidate programs. False negative results, i.e., data strings in the target data category that were not extracted by the regular expression, may be provided as additional positive example data strings. With the additional example data strings, the fitness scores of the candidate programs over the generations of evolution may be recalculated, which changes the process and the results of the synthesis operation. In some embodiments, the synthesis operation is not adjusted from its starting point, but is instead retrained beginning at an intermediate generation of the evolution process. In some embodiments, the fitness scores are only recalculated for the candidate programs that have already been generated during the evolution process, i.e., no further synthesis operations are required, so that the recalculated fitness scores may result in a different candidate program being selected as the final program without generating new candidate programs. Other uses of the additional negative example data strings or additional positive example data strings are also possible and are included within the scope herein. For example, initial candidate pattern programs may be generated using a different policy than the one initially used in operation 220. As a result, a new regular expression may be generated with a fitness score higher than that of the regular expression previously used in operation 240. The fitness scores of the new regular expression and the previous regular expression are calculated using the same example data strings, e.g., at least one of the initial set of example data strings and the additional example data strings.

In some embodiments, parameters of the genetic algorithm may be adjusted in the learning operation 250. For example, the ratio between candidate programs generated from the positive example data string and randomly generated candidate programs may be adjusted based on the user's 110 feedback on the extraction results. For example, if a false negative is represented by an initial positive example data string, the learning process may reduce the percentage of randomly generated candidate programs in the genetic operation, thereby better representing the "genes" of the positive example data string in the regular expression generated by the genetic operation.

FIG. 3 is an example program generation module 122. The program generation module 122 includes an initialization unit 310, a random program generation unit 320, a synthesis unit 330, a fitness measurement unit 340, a controller 350, and a learning unit 360. The initialization unit 310 includes an example grouping unit 312, an initial program generation unit 314, and a parse tree unit 316. The initial program generation unit 314 includes a frequent substring determination unit 315. The synthesis unit 330 includes a cross-breeding unit 332 and a mutation unit 334. The controller 350 includes an adjustment unit 352.

In some embodiments, program generation module 122 and the various units therein are computer-executable instructions dedicated to various functions and operations. The executable instructions include routines, programs, objects, components, and data structures that, when executed by a processor, enable the processor to perform particular tasks or implement particular abstract data types. The various units of the program generation module 122 may reside on the same computing device or may reside on multiple computing devices functioning together in a distributed computing environment. In a distributed computing environment, the various elements of the program generation module 122 may be stored in local or remote computer storage media including computer storage devices.

The operation and function of the various software elements in the program generation module 122 are further described herein.

FIG. 4 shows an example process 400 for the program generation module 122 to generate a regular expression based on example data strings provided by the user 110. In some embodiments, the example data strings include positive example data strings and negative example data strings. Each positive example data string contains exactly an example of the target data category, such as an example birth date, an example social security number, an example resident identification number, an example bank account number, and so forth. Using the example target data category itself as the positive example data string simplifies the operation of the genetic algorithm and helps carry forward the "good genes" of the example target data category during the evolution process. Each negative example data string does not contain the target data category. In some embodiments, some negative example data strings are counterexamples, which represent exceptions to the general characteristics of the target data category. For example, although the data string "10.26.2050" appears to follow the data format of a birth date, it may be used as a counterexample of a birth date because, at the time of operation, no one has been born in the year 2050. In some embodiments, each counterexample data string contains exactly the counterexample of the target data category without any other or additional characters. Including such counterexamples enables "bad genes" to be avoided during the evolution of the genetic operation. Counterexample data strings may be identified or classified separately from other negative example data strings. In some embodiments, the example data strings initially received from the user 110 do not include any negative example data strings. In some embodiments, the example data strings received from the user 110 include only positive example data strings and counterexample data strings, and no other negative example data strings.

For purposes of illustration, the example process 400 is described with an example task of generating a regular expression based on example data strings. In an example act 410, the initialization unit 310 obtains an initial population of candidate regular expressions; for descriptive purposes, this initial population is referred to as the zeroth generation G0. Example act 410 includes sub-acts 412, 414, and 418. In sub-act 412, optionally, the example grouping unit 312 may initially group the existing example data strings into initial groups, with the goal that example data strings in the same initial group share particular pattern features to be represented by the same regular expression. For example, the initial grouping may be based on a character classification of the example data strings, e.g., whether an example data string contains word characters without digits, digits without word characters, or a mixture of digits and word characters. The initial grouping may also take into account the natural language of the word characters, such as whether the word characters are Chinese, English, Japanese, or Korean. The initial grouping may also take into account the language family of the word characters, e.g., whether the word characters belong to the Celtic, Italic, Chinese, Japanese, Slavic, or another language family. The initial grouping may also take into account the related target data categories represented by the example data strings. For example, example data strings for a birth date may contain different formats, such as "mm-dd-yy"; "mm/dd/yy"; "dd-mm-yyyy"; "yyyy.mm.dd"; or other formats. Example data strings containing different formats of birth date information are grouped together. The initial grouping may also take into account the length of the example data strings.
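One possible, hypothetical realization of such an initial grouping in Python is sketched below; the grouping key (character classification plus length) is only one of the criteria mentioned above:

def initial_group_key(example):
    # Classify an example data string by its character makeup and length.
    has_digit = any(c.isdigit() for c in example)
    has_alpha = any(c.isalpha() for c in example)
    if has_digit and not has_alpha:
        kind = "digits_only"
    elif has_alpha and not has_digit:
        kind = "letters_only"
    else:
        kind = "mixed"
    return (kind, len(example))

def group_examples(examples):
    groups = {}
    for s in examples:
        groups.setdefault(initial_group_key(s), []).append(s)
    return groups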

In some embodiments, the example grouping unit 312 assigns weights to each group of example data strings. The weights may influence the analysis of the candidate program in terms of fitness score and match rate. The weights may also affect the actual usage of each set of example data strings in the genetic operation. The weights assigned to each set of example data strings are dynamically adjustable in the genetic operation of operation 220 or in the learning operation 250. Assigning weights to different sets of example data strings helps to ensure that more important target data classes have priority represented in regular expressions generated by genetic algorithms.

In some embodiments, the program generation module 122 operates to generate a single regular expression for all of the target data categories represented by the positive example data string. The grouping of example data strings does not necessarily result in a separate genetic operation being performed on each group of example data strings. However, based on the final or intermediate results of the genetic operation, the grouping of the example data strings may be adjusted, and the genetic operation may be adjusted accordingly, as will be described in detail herein.

In some embodiments, the initial grouping operation in sub-act 412 is omitted. The program generation module 122 will first attempt to generate a single regular expression that is capable of extracting all of the target data categories represented by the example data string by default. The example grouping unit 312 may group the example data strings based on input or feedback of the adjustment unit 352 in the operation of the genetic algorithm, which is described in detail later herein.

In sub-act 414, the initial program generation unit 314 coordinates with the random program generation unit 320 to generate an initial population of candidate regular expressions, which are also referred to as "candidate programs". Specifically, in some embodiments, the initial program generation unit 314 generates candidate regular expressions based on the positive example data strings, and the random program generation unit 320 randomly generates candidate regular expressions. In some embodiments, the ratio between the number of randomly generated candidate regular expressions and the number of candidate regular expressions generated based on the positive example data strings is maintained in a range between about 1:7 and about 1:10. In some embodiments, this ratio is 1:9. In some embodiments, the ratio is controlled by the controller 350. Experimental data indicate that such a range of ratios helps ensure that the final regular expression exhibits extraction behavior that is consistent with, and further extends from, the extraction behavior of the example data strings. In some embodiments, all candidate regular expressions in the initial population are randomly generated, and the example data strings provided by the user 110 are used for the genetic operations 420, 430 described herein.

Regular expressions are typically written as strings that describe the pattern they represent. A regular expression may contain one or more of the following elements: literals such as "a"; character ranges such as "[a-z]"; negated character ranges such as "[^a-z]"; concatenations such as "a[bc]"; option operators such as "a?"; star operators such as "a*"; plus operators such as "a+"; non-greedy operators such as "a??", "a*?", or "a+?"; alternation operators such as "a|b"; or capture group operators such as "(ab)".

For at least some of the positive example data strings, two or more candidate regular expressions are generated based on each of them. As an illustrative example, an example data string of "175.8" may be represented by any of the following regular expressions (verified in the snippet after the list):

r = \d\d\d\.\d; or

r = \d+\.\d; or

r = [0-2]\d\d\.\d
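These three candidate expressions can be checked directly; a quick verification in the Python re dialect (assumed here purely for illustration) is:

import re

candidates = [r"\d\d\d\.\d", r"\d+\.\d", r"[0-2]\d\d\.\d"]
assert all(re.fullmatch(p, "175.8") for p in candidates)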

Moreover, candidate regular expressions generated for different positive example data strings may overlap. Overlapping regular expressions can be filtered out of the initial population, or can be retained in the initial population to increase the likelihood that the extraction behavior, or "genes", of such candidate regular expressions is properly represented in the initial population and fully extended in the evolution process.

In some embodiments, all candidate regular expressions are constructed by using a syntax tree, where leaf nodes are basic regular expression elements selected from the endpoint set, and non-leaf nodes represent operators, including concatenation operations and matching operations, selected from the function set (a minimal syntax-tree sketch follows the function set below). The endpoint set may include:

(1) letter constants, such as "a", "b", ... "y", "z", "A", "B", ... "Y", "Z", and the like;

(2) digit constants, such as "0", "1", ... "8", "9", and the like;

(3) symbol constants, such as ";", ",", "\", "?", "@", and the like;

(4) letter and digit ranges, such as "a-z", "A-Z", "0-9", and the like;

(5) general character classes, such as "\w", "\d", and the like;

(6) wildcards, such as ".";

(7) others.

The set of functions may include:

(1) the concatenation operator "t1t2";

(2) the group operator "(t1)";

(3) the list match operator "[t1]" and the list mismatch operator "[^t1]";

(4) the match-one-or-more operator "t1++";

(5) the match-zero-or-more operator "t1*+";

(6) the match-zero-or-one operator "t1?+";

(7) the match-minimum-to-maximum operator "t1{n,m}+", where n is the minimum and m is the maximum;

(8) and others.
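A minimal, hypothetical Python sketch of such a syntax tree is shown below, with leaf nodes drawn from the endpoint set and non-leaf nodes applying a few of the function-set operators; it builds the expression string only and does not evaluate it:

class RegexNode:
    # Leaf nodes hold a terminal from the endpoint set (e.g., "a", "\\d", "0-9");
    # non-leaf nodes hold an operator from the function set and child nodes.
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

    def to_regex(self):
        if not self.children:                     # leaf node
            return self.value
        parts = [c.to_regex() for c in self.children]
        if self.value == "concat":
            return "".join(parts)                 # "t1t2"
        if self.value == "group":
            return "(" + parts[0] + ")"           # "(t1)"
        if self.value == "list":
            return "[" + parts[0] + "]"           # "[t1]"
        if self.value == "one_or_more":
            return parts[0] + "++"                # possessive "t1++"
        raise ValueError("unsupported operator: " + self.value)

# Example: a tree for "[0-9]++\." built from the sets above.
tree = RegexNode("concat", [
    RegexNode("one_or_more", [RegexNode("list", [RegexNode("0-9")])]),
    RegexNode("\\."),
])
print(tree.to_regex())   # prints [0-9]++\.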

In some embodiments, various strategies are considered in generating the initial population of regular expressions based on the positive example data string. For example, a policy may favor simplified regular expressions over complex regular expressions. The policy may attempt to reduce or increase the function labels or types of function labels used in the regular expression. These strategies may affect the final regular expression generated by the genetic operation, which may be adjusted in the learning operation 250.

In some embodiments, the population size of the initial population of candidate regular expressions is greater than the total number of positive example data strings. For example, the population size is about 1.5-2 times the number n of positive example data strings.

In some embodiments, in step 415 of sub-act 414, the initial program generation unit 314 generates at least some of the candidate regular expressions in the initial population based on a byte pair encoding technique. For example, common pairs of consecutive bytes, or common sets of consecutive characters, in the example data strings are identified and treated as single units when generating the candidate regular expressions of the initial population. In the description herein, pairs of consecutive bytes and sets of consecutive characters are used interchangeably and are referred to as substrings for descriptive purposes. For example, in a leaf node of the syntax tree, a common pair of consecutive bytes is represented as a single expression unit. Such a common set of consecutive characters can be regarded as a good "gene" of the example data strings. By keeping a consecutive character set as a single unit, rather than as multiple individual characters, good "genes" are maintained as the genetic algorithm generates regular expressions. As a result, the runtime of the genetic algorithm is greatly reduced.

In some embodiments, frequent consecutive character sets are extracted from the positive example data strings by using byte pair encoding ("BPE"). In some embodiments, the granularity of the frequent consecutive character sets is controlled by a hyper-parameter specifying the number of training epochs. In some embodiments, a frequency threshold is set to determine whether a set of consecutive characters is sufficiently common in the positive example data strings for it to be identified as a frequent consecutive character set. Algorithm 1 below is an example of BPE encoding in the Python language. Other programming languages, such as C++, Java, or Fantom, may also be used to implement the BPE operations.

(Algorithm 1 is reproduced as an image in the original publication and is not rendered here.)
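Since the published listing is only available as an image, the following Python sketch shows one conventional BPE-style merge loop under assumed parameters (a fixed number of epochs and a minimum pair frequency); it is an illustration, not a reproduction of Algorithm 1:

from collections import Counter

def byte_pair_merges(examples, epochs=10, min_frequency=2):
    # Represent each positive example data string as a list of single-character units,
    # then repeatedly merge the most frequent pair of adjacent units.
    sequences = [list(s) for s in examples]
    merged_substrings = []
    for _ in range(epochs):
        pair_counts = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < min_frequency:
            break
        merged = a + b
        merged_substrings.append(merged)
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return merged_substrings, sequences

# For example, byte_pair_merges(["low", "lower", "newest", "widest"]) produces
# merges such as "lo", "es", and "est".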

FIG. 5A shows an example process of step 415. Referring to fig. 5A, in an example act 510, the frequent substring determination unit 315 segments the example data string to obtain a substring. In some embodiments, a random combination of consecutive characters is obtained as a substring. In some embodiments, the rules are applied to obtain substrings from example data strings. For example, a rule may specify that only logically or linguistically meaningful character sets are available as substrings. In some embodiments, the substrings are obtained by random segmentation and rule application. As an illustrative example, for the string "low", the substrings "lo", "ow", and "low" may be obtained.

In example act 520, frequent substring determination unit 315 determines frequent substrings from the substrings. In some embodiments, the frequency value of the substring is calculated based on the number of occurrences of the substring in the positive example data string. The frequency value may be calculated as:

P = m / N,

where P represents the frequency, m represents the number of occurrences of substrings in all positive example data strings, and N represents the total number of positive example data strings. A threshold frequency value may be set. If the frequency value of the substring is equal to or higher than the threshold frequency value, the substring is determined to be a frequent substring. In some embodiments, the rules may be applied in determining frequent substrings. For example, a rule may assign higher weights to frequency values computed for logically or linguistically meaningful substrings. Other ways of determining frequent substrings are possible and are included within the scope of this document.

In an example act 530, the initial program generation unit 314 generates candidate regular expressions based on the positive example data strings, where each identified frequent substring is treated as a single expression unit. For example, an identified frequent substring is not further parsed in the regular expression (regex). FIG. 5B shows an example operation of step 415 on example data strings for illustration. Referring to FIG. 5B, the character strings "low", "lower", "newest", and "widest" include the identified frequent substrings "lo" or "est". In the generated regular expressions and syntax trees, the frequent substrings "lo" and "est" are each represented as a single unit. The other word characters ("w" in "low"; "w", "e", "r" in "lower"; "n", "e", "w" in "newest"; and "w", "i", "d" in "widest") are each represented by the general character class "\w", which represents a single word character.
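A hypothetical sketch of this seeding step in Python follows; the helper name and the fallback to "\w" for unmatched word characters mirror the example above:

import re

def seed_regex(example, frequent_substrings):
    # Build an initial candidate expression from one positive example data string:
    # frequent substrings stay as single literal units, other word characters become
    # the general character class "\w", and remaining characters are escaped.
    pattern = ""
    i = 0
    units = sorted(frequent_substrings, key=len, reverse=True)
    while i < len(example):
        unit = next((s for s in units if example.startswith(s, i)), None)
        if unit:
            pattern += re.escape(unit)
            i += len(unit)
        elif example[i].isalnum():
            pattern += r"\w"
            i += 1
        else:
            pattern += re.escape(example[i])
            i += 1
    return pattern

print(seed_regex("newest", ["lo", "est"]))   # prints \w\w\west, i.e. \w \w \w followed by "est"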

In some embodiments, in step 416 of sub-act 414, the fitness measuring unit 340 calculates a fitness score for each candidate regular expression in the initial population. The fitness measuring unit 340 may use various suitable fitness algorithms to calculate the fitness score, all of which are included within the scope of this document. In some embodiments, the fitness measuring unit 340 calculates the fitness score using the following algorithm:

(Equation (2), the fitness function, appears as an image in the original publication and is not reproduced here.)

where ti represents a positive example data string; n is the total number of positive example data strings; si represents the segment of positive example data string ti that contains the target data category; R(ti) represents the segment extracted from positive example data string ti by the regular expression R; D(x1, x2) represents the edit distance between data strings or segments x1 and x2; L(R) represents the length of the regular expression; p+ represents the ratio of positive example data strings fully matched by the regular expression, e.g., R(ti) = si; p- represents the ratio of negative example data strings fully matched by the regular expression; and α, β, γ are constants that may be adjusted by the controller 350. In some embodiments, ti = si, because the positive example data string exactly represents the target data category. It should be appreciated that, although a large portion of the initial population of candidate regular expressions is generated directly from one or more positive example data strings, each such expression may not be able to extract the other positive example data strings. It should also be understood that the data strings used to calculate the fitness scores may be a different set or group of data strings than the data strings used to generate the initial candidate programs.

In some embodiments, n may be the number of all example data strings; ti represents an example data string; si represents the segment of example data string ti that contains the target data category, and si = 0 for negative example data strings that do not contain the target data category; R(ti) represents the segment extracted from example data string ti by the regular expression; and D(x1, x2) represents the edit distance between data strings or segments x1 and x2.
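Because equation (2) itself is only available as an image, the sketch below merely combines the factors named in the text (edit distance to the target segment, expression length, and the two match ratios) with illustrative signs and coefficients; the extract callable and the reuse of the edit_distance helper from the earlier sketch are assumptions:

def fitness_sketch(extract, regex_length, positives, negatives,
                   alpha=1.0, beta=1.0, gamma=1.0):
    # extract(t) returns the segment the candidate regular expression pulls out of t.
    n = max(len(positives), 1)
    distance_term = sum(edit_distance(extract(t), t) for t in positives) / n
    p_plus = sum(1 for t in positives if extract(t) == t) / n
    p_minus = sum(1 for t in negatives if extract(t) == t) / max(len(negatives), 1)
    # Lower edit distance, shorter expressions, more positive matches, and fewer
    # negative matches all raise the score in this illustrative combination.
    return -distance_term - alpha * regex_length + beta * p_plus - gamma * p_minus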

In some embodiments, β and γ are adjusted based on the fault tolerance of the user 110. For example, if the user 110 is more tolerant of false positives than of false negatives, β is increased. If the user 110 is more tolerant of false negatives than of false positives, γ is increased.

In some embodiments, weights assigned to groups of example data strings may be introduced into the fitness function:

[Weighted fitness function: the positive and negative match rates p+j and p−j of each example data string group j are weighted by the group weight w_j and summed over the m groups.]

where w_i is the weight of an example data string t_i, equal to the weight assigned to the group to which t_i belongs; w_j is the weight assigned to example data string group j; p+j is the match rate of the regular expression on the positive example data strings of group j; p−j is the match rate of the regular expression on the negative example data strings of group j; and m is the total number of example data string groups.
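
The following is a small sketch of how the group weights described above might enter the positive and negative match-rate terms. The aggregation form (a normalized weighted sum over the m groups) and the data layout are assumptions.

```python
import re

def weighted_match_rates(regex, groups):
    """Weighted p+ and p- over m groups of example data strings.

    groups: list of dicts like
        {"weight": w_j, "positives": [...], "negatives": [...]}
    Each group's match rates p_{+j} / p_{-j} are weighted by w_j.
    Assumes strictly positive group weights.
    """
    total_w = sum(g["weight"] for g in groups)
    p_pos = p_neg = 0.0
    for g in groups:
        w = g["weight"] / total_w
        if g["positives"]:
            p_pos += w * sum(bool(re.search(regex, t)) for t in g["positives"]) / len(g["positives"])
        if g["negatives"]:
            p_neg += w * sum(bool(re.search(regex, t)) for t in g["negatives"]) / len(g["negatives"])
    return p_pos, p_neg
```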

In some embodiments, the fitness function is treated as a multi-objective function to account for multiple factors that assess the fitness score of a candidate regular expression. For example, a regular expression that the fitness function regards as fit should match more positive example data strings and fewer negative example data strings. In addition, from the perspective of the individual characters contained in an example data string, a fit regular expression should match more characters in the positive example data strings and fewer characters in the negative example data strings. The length of the candidate regular expression is also evaluated. In some embodiments, the length of the regular expression is evaluated relative to the lengths of the positive example data strings: a regular expression whose length is similar to the lengths of the positive example data strings receives a better fitness score. In some embodiments, the length of the candidate regular expression is compared to the average length of the positive example data strings. The average length may be determined as the mean, median, mode, or any other average of the lengths of the positive example data strings. In some embodiments, the fitness measuring unit 340 calculates the fitness score of a candidate regular expression using the following algorithm:

[Definitions of the string-level score Ps and the character-level score Pc in terms of the positive example data strings P, the negative example data strings N, and count(r, i).]

fitness(r) = α*Ps + β*Pc + lscore    (5),

[Definition of the length score lscore, which compares len(r) with the average length of the k positive example data strings.]

where P represents the set of positive example data strings, N represents the set of negative example data strings, len() represents the length of a character string or regular expression, count(r, i) represents the number of characters in an example data string i that match the regular expression r, and k represents the total number of positive example data strings; α and β are tunable constants. The values of the constants α and β may be adjusted based on the particular genetic algorithm or named entity extraction task.
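
The following Python sketch illustrates one plausible reading of fitness function (5): a string-level score Ps, a character-level score Pc, and a length score lscore that rewards regular expressions whose length is close to the average positive example length. Because the exact definitions of Ps, Pc, and lscore are not reproduced above, the formulas below (full-match fractions, matched-character fractions, and an inverse length-difference score) are assumptions consistent with the surrounding description; here larger scores are better.

```python
import re

def matched_chars(regex, s):
    """count(r, i): number of characters of s covered by matches of regex."""
    return sum(m.end() - m.start() for m in re.finditer(regex, s))

def multi_objective_fitness(regex, positives, negatives, alpha=1.0, beta=1.0):
    """Hypothetical fitness in the spirit of function (5); larger is better."""
    # String-level score: reward full matches on positives, penalize matches on negatives.
    pos_full = sum(bool(re.fullmatch(regex, p)) for p in positives) / len(positives)
    neg_full = sum(bool(re.fullmatch(regex, n)) for n in negatives) / max(len(negatives), 1)
    ps = pos_full - neg_full

    # Character-level score: matched characters on positives vs. negatives.
    pos_chars = sum(matched_chars(regex, p) for p in positives) / sum(len(p) for p in positives)
    neg_chars = sum(matched_chars(regex, n) for n in negatives) / max(sum(len(n) for n in negatives), 1)
    pc = pos_chars - neg_chars

    # Length score: regexes close to the average positive-example length score higher.
    avg_len = sum(len(p) for p in positives) / len(positives)
    lscore = 1.0 / (1.0 + abs(len(regex) - avg_len))

    return alpha * ps + beta * pc + lscore
```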

The fitness functions (2), (3), and (5) may be used in combination. Further, the components of the fitness functions (2), (3), and (5) may be recombined in various ways. For example, algorithm (2) may be modified by replacing l(R) with lscore to generate:

[Fitness function obtained from algorithm (2) by replacing l(R) with lscore.]

In sub-act 418, the parse tree unit 316 parses each candidate regular expression in the initial G0 generation. At least some of the candidate regular expressions are parsed into two or more components. In some embodiments, a parse tree is used to represent a regular expression that is parsed into two or more components. In some embodiments, a symbolic finite state automaton is used to represent a regular expression that is parsed into two or more components. Other methods of representing the parsed regular expression and/or the correspondence between two or more components of the parsed regular expression are possible and are included within the scope of this document. In the description herein, the operation of the program generation module 122 is illustrated using a parse tree as an example, which does not limit the scope of the present disclosure.

In some embodiments, the parse tree is a selection-based parse tree that includes end nodes and non-end nodes. The leaf nodes (end nodes) of the parse tree are each labeled with an end label representing a parsed component of the regular expression. A leaf node has no child nodes and cannot be further expanded. When the leaf nodes are concatenated together, the candidate regular expression is obtained. Each internal or non-leaf node (non-end node) of the parse tree is labeled with a non-end label. The non-end labels may include a placeholder label c and a function label. The immediate children of an internal node must follow the production rules of the function label in the grammar. The placeholder label c indicates the "position" of an associated child node. The function label indicates a functional relationship of a position c, or a functional relationship among a plurality of positions c. For example, (c1c2) indicates that the child nodes associated with the two placeholders c1 and c2 are concatenated; as another example, a function label may indicate that the child node associated with a placeholder c is inverted. In some embodiments, a string conversion method is used to form the regular expression from the parse tree: the string conversion of an internal node is realized by replacing each placeholder c with the string conversion result of the child node associated with that placeholder. Other methods of forming parse trees based on regular expressions are also possible and are included within the scope of this document.
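
A minimal sketch of such a parse tree is shown below: leaf nodes hold end labels (regular expression fragments such as "lo" or "\w"), an internal node holds a concatenation function label, and string conversion replaces each placeholder with the converted child. The Node class and the label "concat" are illustrative assumptions.

```python
class Node:
    """A node of a regex parse tree.

    Leaf nodes carry an end label (a regex fragment such as 'lo', '\\w' or 'est').
    Internal nodes carry a function label; only concatenation '(c1c2...)' is shown here.
    """
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

    def to_regex(self):
        if not self.children:                     # leaf / end node
            return self.label
        if self.label == "concat":                # (c1 c2 ...): concatenate the children
            return "".join(c.to_regex() for c in self.children)
        raise ValueError(f"unknown function label: {self.label}")

# Parse tree for the candidate regex lo\w\w\w (from the example string "lower").
tree = Node("concat", [Node("lo"), Node(r"\w"), Node(r"\w"), Node(r"\w")])
print(tree.to_regex())   # lo\w\w\w
```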

In example act 420, the synthesis unit 330 synthesizes candidate programs from the parent Gp generation of candidate programs to generate the child Gp+1 generation of candidate programs. The synthesis operation includes cross-breeding and mutation operations on the parent candidate programs. The cross-breeding operation recombines the components or gene values of two or more parent candidate programs to generate "child" candidate programs, each of which includes components from each parent candidate program. For example, where a parent candidate program is represented as a parse tree, subtrees or branches of the parent parse trees may be recombined to generate a child parse tree. The mutation operation alters one or more components or gene values of a parent candidate program to generate a child candidate program. For example, where a parent candidate program is represented as a parse tree, a subtree or branch of the parse tree may be replaced with a randomly generated subtree or branch to generate a child candidate program. The function of the mutation operation is to increase the diversity of the candidate program population.

In some embodiments, the fitness scores of the candidate programs are considered in selecting candidate programs for the mutation and cross-breeding operations. For example, for a cross-breeding operation, the chance that a candidate program is selected to pair with another candidate program may increase with its fitness score. That is, a candidate program with a higher fitness score has a higher chance of being paired with another candidate program in a cross-breeding operation than a candidate program with a lower fitness score. In this way, "good genes", e.g., suitable extraction behaviors, can be carried forward to the next generation. In some embodiments, candidate programs with lower fitness scores are selected with a higher probability for mutation operations than for cross-breeding operations. This increases the chance of introducing new "genes" into the candidate program population. In some embodiments, a candidate program with a higher fitness score may have a higher probability of being selected for a mutation operation than a candidate program with a lower fitness score.
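
The following sketch shows one way the selection probabilities described above might be implemented: fitness-proportional (roulette-wheel) pairing for cross-breeding and inverted-fitness weighting for mutation. It assumes non-negative fitness scores where larger is better; the function names are illustrative.

```python
import random

def select_pair(population, scores):
    """Pick two parents with probability proportional to their fitness scores."""
    # random.choices performs weighted sampling with replacement.
    return random.choices(population, weights=scores, k=2)

def select_for_mutation(population, scores):
    """Pick one candidate, favoring lower-scoring programs for mutation."""
    max_s = max(scores)
    inverted = [max_s - s + 1e-9 for s in scores]   # lower fitness -> larger weight
    return random.choices(population, weights=inverted, k=1)[0]
```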

In some embodiments, the child Gp+1 generation of candidate programs also includes a small portion of candidate programs randomly generated by the random program generation unit 320.

In some embodiments, the child Gp+1 generation of candidate programs has the same number of candidate programs as the parent Gp generation. In the case where the synthesis operation initially generates more candidate programs than the required number, the generated candidate programs are filtered by their fitness scores: candidate programs with lower fitness scores are filtered out until the child Gp+1 generation has the same population size as the parent Gp generation.

In some embodiments, the child Gp+1 generation of candidate programs includes a first subset of candidate programs generated by the cross-breeding operation, a second subset of candidate programs generated by the mutation operation, and a third subset of randomly generated candidate programs. In some embodiments, the ratio between the first, second, and third subsets, in terms of the number of candidate programs each subset contains, remains substantially the same for all generations other than the initial G0 generation. For example, the ratio between the first subset, the second subset, and the third subset is in the range between 3:1:1 and 18:1:1. The ratio may be controlled by the number of candidate programs generated by each of the cross-breeding operation, the mutation operation, or the random generation operation. The ratio may also be controlled by selectively filtering out the candidate programs with lower fitness scores in each subset.
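
A sketch of how one offspring generation could be assembled while holding the population size constant and keeping a fixed cross-breeding : mutation : random ratio is given below. The callables crossover, mutate, and random_program stand in for the operations of the synthesis unit 330 and the random program generation unit 320 and are assumptions, as is the default 8:1:1 ratio.

```python
import random

def next_generation(parents, scores, crossover, mutate, random_program,
                    ratio=(8, 1, 1)):
    """Produce one offspring generation with a fixed subset ratio."""
    size = len(parents)
    total = sum(ratio)
    n_cross = size * ratio[0] // total
    n_mut = size * ratio[1] // total
    n_rand = size - n_cross - n_mut              # remainder goes to random programs

    offspring = []
    for _ in range(n_cross // 2 + 1):
        a, b = random.choices(parents, weights=scores, k=2)
        offspring.extend(crossover(a, b))        # cross-breeding: assumed to yield two children
    offspring = offspring[:n_cross]
    offspring += [mutate(random.choice(parents)) for _ in range(n_mut)]
    offspring += [random_program() for _ in range(n_rand)]
    return offspring
```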

In some embodiments, the ratio between the first subset of candidate programs generated by the cross-breeding operation and the second subset of candidate programs generated by the mutation operation is determined based on an initial analysis of the example data strings. For example, where the example data strings are more uniform, such as in the lengths of the example data strings or the data categories they represent, the size of the first subset is increased. Where the example data strings are more heterogeneous, the size of the second subset is increased.

In an example sub-act 422, the cross-breeding unit 332 performs the cross-breeding operation. As shown in the examples used herein, the candidate programs are regular expressions, and each regular expression is represented as a parse tree, which is a data structure suitable for cross-breeding operations. The cross-breeding operation may be performed in various ways by recombining the components of a pair of parent programs. For example, one or more of single-point, two-point (or k-point), or uniform cross-breeding may be used. Furthermore, the function labels of the parse tree may be taken into account in the cross-breeding operation. For example, one or more of partially matched, cycle, order-based, position-based, voting recombination, alternating position, or sequence construction cross-breeding may be used to properly handle the function labels in the parse tree.

In some embodiments, for a pair of candidate programs represented as parse trees, subtrees/branches of the parse trees are randomly selected to be recombined in the cross-breeding operation. That is, when an internal node is selected, the entire branch under the selected internal node, i.e., all child nodes under that internal node, is used for recombination in the cross-breeding operation. In some other embodiments, nodes of the parse tree are randomly selected, and only the selected nodes are used for recombination in the cross-breeding operation; the child nodes of a selected node (if any) are not used for recombination.

In some embodiments, only leaf nodes (end nodes) of the parse tree may be selected for recombination in the cross-breeding operation. The leaf nodes are selected randomly or based on certain constraints. For example, the chance that a leaf node is selected is related to the distance between the leaf node and the root node of the parse tree. In some embodiments, leaf nodes located farther from the root node, e.g., with more internal nodes in between, have a higher chance of being selected for recombination. In some other embodiments, leaf nodes located farther from the root node have a lower chance of being selected for recombination.

In some embodiments, the chance that an internal node is selected for recombination is related to the height of the internal node, e.g., the longest distance between the internal node and a leaf node below it. For example, internal nodes with a greater height may be more likely to be selected for recombination. As another example, internal nodes with a greater height may be less likely to be selected for recombination.

Other methods of selecting nodes in the parse tree for reassembly are possible and are included within the scope of this document. In some embodiments, the method of selecting components of candidate programs for recombination may be configured and adjusted by controller 350, as described herein.
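
As one concrete illustration of subtree recombination, the sketch below swaps a randomly chosen branch between deep copies of two parent parse trees (reusing the Node structure sketched earlier, i.e., objects with label and children attributes). The uniform node-selection policy is only one of the options discussed above.

```python
import copy
import random

def all_nodes(tree):
    """Collect every node of a parse tree (root included)."""
    nodes = [tree]
    for child in tree.children:
        nodes.extend(all_nodes(child))
    return nodes

def subtree_crossover(parent_a, parent_b):
    """Swap one randomly chosen subtree between deep copies of two parents."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    node_a = random.choice(all_nodes(child_a))
    node_b = random.choice(all_nodes(child_b))
    # Swap labels and children in place, which swaps the whole branches.
    node_a.label, node_b.label = node_b.label, node_a.label
    node_a.children, node_b.children = node_b.children, node_a.children
    return child_a, child_b
```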

In example sub-act 424, the mutation unit 334 performs the mutation operation on the candidate programs selected for mutation. As shown in the examples used herein, the candidate programs are regular expressions, and each regular expression is represented as a parse tree, which is a data structure suitable for mutation operations. The mutation operation may be performed in various ways controlled by the controller 350, all of which are included in the scope of the present disclosure. For example, one or more of bit-string mutation, flip-bit mutation, boundary mutation, non-uniform mutation, Gaussian mutation, or shrink mutation may be used.

In some embodiments, for a candidate program represented as a parse tree, a subtree/branch of the parse tree is randomly selected for the mutation operation, and a randomly generated subtree or branch replaces the selected subtree. That is, when an internal node is selected, the entire branch under the selected internal node, i.e., all child nodes under that internal node, is replaced by another subtree in the mutation operation. In some other embodiments, nodes of the parse tree are randomly selected, and only a selected node is replaced by another randomly generated node. For example, the function label of a non-end node may be replaced with a randomly generated function label. The child nodes of the selected node (if any) are not altered by the mutation.

In some embodiments, only leaf nodes (end nodes) of the parse tree are selected for mutation. The leaf nodes are selected randomly or based on certain constraints. For example, the chance that a leaf node is selected is related to the distance between the leaf node and the root node of the parse tree. In some embodiments, leaf nodes located farther from the root node, e.g., with more internal nodes in between, have a higher chance of being selected for mutation. In some other embodiments, leaf nodes located farther from the root node have a lower chance of being selected for mutation.

In some embodiments, the chance that an internal node is selected for mutation is related to the height of the internal node, e.g., the longest distance between the internal node and a leaf node below it. For example, internal nodes with a greater height may be more likely to be selected for mutation. As another example, internal nodes with a greater height may be less likely to be selected for mutation.

Other methods of selecting nodes in the parse tree for mutation operations are possible and are included within the scope of this document. In some embodiments, the method of selecting components of candidate programs for mutation may be configured and adjusted by the controller 350, as described herein.
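
A corresponding sketch of subtree mutation is shown below: a randomly chosen branch of the parse tree is replaced by a randomly generated subtree. The callable random_subtree stands in for the random program generation unit 320 and is an assumption.

```python
import copy
import random

def subtree_mutation(tree, random_subtree):
    """Replace one randomly chosen branch of a parse tree with a new random subtree."""
    mutant = copy.deepcopy(tree)
    # Walk the tree and collect every node.
    nodes, stack = [], [mutant]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children)
    target = random.choice(nodes)
    replacement = random_subtree()
    # Overwrite the selected node in place with the randomly generated branch.
    target.label, target.children = replacement.label, replacement.children
    return mutant
```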

In example sub-act 426, the random program generation unit 320 randomly generates candidate programs for the Gp+1 generation.

In example act 430, the fitness measuring unit 340 obtains fitness scores for the candidate programs. In some embodiments, the same fitness function, such as function (2), (3), (5), or (9), may be used to obtain the fitness scores of the Gp generation parent candidate programs, the Gp+1 generation child candidate programs, and the initial G0 generation candidate programs. In some embodiments, different fitness functions may be used. In some embodiments, the fitness function includes factors related to one or more of the following: the compactness of the candidate regular expression, e.g., the length of the candidate regular expression; a first match rate of the candidate regular expression on the positive example data strings; a second match rate of the candidate regular expression on the negative example data strings; or an edit distance between the candidate regular expression and the example data strings.

In example sub-act 432, the fitness measuring unit 340 optionally filters the new candidate programs based on their fitness scores. For example, new candidate programs with lower fitness scores may be removed from the population of Gp+1 generation candidate programs. In some embodiments, the filtering operation is performed separately for the first, second, and third subsets of new candidate programs, thereby maintaining the ratio between the first, second, and third subsets.

Together, acts 420 and 430 are referred to as a round of genetic operation or evolution. The genetic operations are performed iteratively, with each round of genetic operation or evolution generating a new generation of candidate programs. The controller 350 may set threshold conditions for completing or terminating the iterative genetic operation. For example, the threshold conditions include the total number of iterations reaching a threshold number, or the fitness score of a candidate program reaching a threshold fitness score. The threshold conditions may also include a round of genetic operation yielding no new benefit. A new benefit includes an improvement in fitness scores, whether of individual candidate programs or of the average score. A new benefit also includes a new candidate program that is different from any existing candidate program.
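
A minimal sketch of such a termination check is given below. The threshold values and the particular "no new benefit" test (neither the best nor the average fitness score improved this round) are placeholder assumptions.

```python
def should_terminate(iteration, best_score, avg_score, prev_best, prev_avg,
                     max_iterations=500, target_score=0.99):
    """Decide whether the iterative genetic operation should stop."""
    if iteration >= max_iterations:          # iteration budget reached
        return True
    if best_score >= target_score:           # a candidate program is good enough
        return True
    # No new benefit: neither the best nor the average fitness improved this round.
    return best_score <= prev_best and avg_score <= prev_avg
```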

In some embodiments, the controller 350 controls the population size of the candidate programs after each round of genetic operation or evolution. In some embodiments, the population size remains the same as the initial population of candidate programs. In some embodiments, the population size of the offspring generations decays following a decay algorithm. For example, the decay algorithm is:

[Decay algorithm (10): exponential decay of the offspring population size from N_pop toward the minimum population size N_min, controlled by the decay parameter λ.]

where λ is the decay parameter, λ ∈ [0, 1]; E represents the epoch size of the genetic operation, e.g., the total number of iterations; N_pop is the size of the initial population; and N_min is the minimum population size set by the controller 350. Following the decay algorithm (10), the population size of the offspring generations is continuously reduced by the parameter λ until the minimum population size N_min is reached.

Algorithm (10) is an example exponential decay algorithm. Other decay algorithms are possible and are included within the scope of this document. For example, the decay algorithm may be a linear decay or a staggered (stepped) decay, which applies different decay rules to different stages of the iteration. Example linear and staggered decay algorithms are provided below:

[Linear decay of the population size, controlled by the decay parameter k.]

or, alternatively,

[Staggered decay of the population size, applying different decay rules before and after the stage boundary E_1.]

where k is a decay parameter and k ∈ [0, 1]; b_2 is a constant; and E_1 represents the boundary of the stage, e.g., E_1 = 100.
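
The following sketch shows population-size schedules matching the behavior described above: exponential decay toward the minimum size, linear decay, and a staggered decay that switches rules at the stage boundary E_1. Because the exact formulas are not reproduced above, the specific forms (and the reuse of b2 as a second-stage decay rate) are assumptions.

```python
def exponential_decay(e, n_pop, n_min, lam=0.95):
    """Population size at iteration e, shrinking by the factor lam toward n_min."""
    return max(n_min, round(n_pop * lam ** e))

def linear_decay(e, n_pop, n_min, k=0.01):
    """Population size shrinking by a fixed fraction k of n_pop per iteration."""
    return max(n_min, round(n_pop * (1 - k * e)))

def staggered_decay(e, n_pop, n_min, k=0.005, b2=0.02, e1=100):
    """Different decay rules before and after the stage boundary e1."""
    if e <= e1:
        return max(n_min, round(n_pop * (1 - k * e)))                   # gentle decay in the first stage
    return max(n_min, round(n_pop * (1 - k * e1 - b2 * (e - e1))))      # steeper decay afterwards

print([exponential_decay(e, n_pop=200, n_min=40) for e in (0, 20, 40, 60)])  # [200, 72, 40, 40]
```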

In example act 440, the controller 350 determines whether a threshold condition for completing the iterative genetic operation has been met. If no threshold condition is met, the controller 350 controls the genetic operation to continue iterating. If one or more threshold conditions are met, the controller 350 controls the genetic operation to complete.

In example act 450, after the genetic operation is completed, the program generation module selects the candidate program with the highest fitness score as the final pattern program. The final pattern program is output to the extraction module 124 for the named entity recognition task on the data stream 130.

FIG. 6 illustrates another example process 600. Process 600 includes additional acts beyond the example process 400. The example acts 410, 420, 430, 440, and 450 in the example process 600 are similar to those in process 400, and for simplicity, the description of the similar acts is omitted for process 600.

After completing a round of genetic operations, e.g., acts 420 and 430, the controller 350 may route the operation to act 610, where the adjustment unit 352 evaluates the candidate programs generated in this round of genetic operations to determine whether the iterative genetic operation should be adjusted. Specifically, in sub-act 612, the adjustment unit 352 obtains the average fitness score of all child candidate programs in the Gp+1 generation. The average fitness score is compared with the average fitness score of the parent Gp generation candidate programs. The parameters of the genetic algorithm may be adjusted if the average fitness score of the Gp+1 generation is lower than the average fitness score of the Gp generation.

In sub-act 614, the adjustment unit 352 evaluates a regrouping of the example data strings. In some embodiments, the adjustment unit 352 analyzes, for each positive example data string, whether the positive example data string matches the candidate programs in the Gp+1 generation. A match rate is obtained for each positive example data string, calculated as the ratio of the number of matches between the positive example data string and the candidate programs to the total number of candidate programs in the Gp+1 generation. A threshold match rate, such as a 50% match rate, may be set by the controller 350. Positive example data strings having a match rate above the threshold match rate may be regrouped into a "breakthrough" group, which indicates that the extraction behaviors or "genes" of the Gp+1 generation candidate programs generally fit the particular positive example data string. Positive example data strings having a match rate below the threshold match rate may be regrouped into a "non-breakthrough" group, which indicates that the extraction behaviors or "genes" of the Gp+1 generation candidate programs generally do not match the particular positive example data string. The positive example data strings of the breakthrough group may be used for further genetic operations, e.g., for the calculation of the fitness scores of the candidate programs. The positive example data strings of the non-breakthrough group may be used to obtain another pattern program in a separate genetic operation.
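
A minimal sketch of this regrouping step is shown below; the 50% threshold mirrors the example above, and the function name regroup_positives is illustrative.

```python
import re

def regroup_positives(positives, generation, threshold=0.5):
    """Split positive examples by their match rate against a generation of regexes."""
    breakthrough, non_breakthrough = [], []
    for example in positives:
        matches = sum(bool(re.search(r, example)) for r in generation)
        rate = matches / len(generation)
        (breakthrough if rate >= threshold else non_breakthrough).append(example)
    return breakthrough, non_breakthrough
```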

In sub-act 616, the adjustment unit 352 evaluates a regrouping of the candidate programs based on their extraction behavior on different groups of positive example data strings. For example, for each group of example data strings, a fitness score and/or a positive match rate of a candidate program is calculated, and the candidate programs are grouped based on their fitness scores or positive match rates for each group of example data strings. For example, a candidate program may have a match rate of 70% for a first group of positive example data strings and a match rate of 20% for a second group of positive example data strings. That candidate program may be grouped with the candidate programs suitable for extracting the target data category represented by the first group of positive example data strings. A group of candidate programs may then be used for genetic operations within the group. For example, a candidate program may only be paired with another candidate program in the same group for cross-breeding operations.

The regrouping of the example data strings or the candidate programs may result in multiple genetic operations being performed in parallel and multiple final pattern programs being generated from the multiple genetic operations. In some embodiments, the multiple final pattern programs may be linked by an OR function in the extraction task.

In example act 620, the controller 350 determines whether an adjustment should be made to the genetic operation based on the results of the evaluation of act 610. If it is determined that one or more adjustments should be made, the controller 350 implements the adjustments to act 410 or act 420. For example, the regrouping of the positive example data strings may be used to adjust the genetic operation beginning with act 420, e.g., multiple genetic operations may be started to run in parallel. The positive example data strings of the breakthrough group and the non-breakthrough group may also be used to reshape the initial population of candidate programs at act 410. Other methods of adjusting the operation of the program generation module 122 are also possible and are included within the scope of this document.

The learning unit 360 is configured to work with the controller 350 to further train the genetic operation. For example, training data such as correct extraction results and incorrect extraction results may be used as training data strings. A generation of candidate programs may be selected as an initial training population of candidate programs to begin the training operation. In some embodiments, the last generation of candidate programs is used as the initial training population. In some embodiments, the processes 400 and 600 of FIG. 4 or FIG. 6 may perform similar operations on the initial training population of candidate programs using the training data strings. The training operation generates a new final pattern program that overcomes the shortcomings of the previous pattern program, which extracted incorrect data strings.

The systems, apparatuses, modules or units shown in the foregoing embodiments may be embodied by using a computer chip or entity, or may be implemented by using an article having a specific function. Typical embodiment devices are computers, which may be personal computers, laptop computers, cellular phones, camera phones, smart phones, personal digital assistants, media players, navigation devices, email receiving and sending devices, game consoles, tablets, wearable devices, or any combination of these devices.

For the implementation of the functions and roles of each module in the device, reference may be made to the implementation of the corresponding steps in the previous method. Details are omitted here for simplicity.

As the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the relevant description in the method embodiments for the relevant components. The device implementation described in the preceding is merely an example. Modules described as separate parts may or may not be physically separate, and parts shown as modules may or may not be physical modules, may be located in one location, or may be distributed over multiple network modules. Some or all of the modules may be selected based on actual needs to achieve the goals of the present solution. One of ordinary skill in the art will understand and appreciate embodiments of the present application without undue inventive effort.

The techniques described herein produce one or more technical effects. The genetic algorithm operates on example data strings, each of which accurately represents a target data category to be identified by named entity recognition. These technical features bring valuable technical advantages. First, a pattern program generated by the genetic algorithm will have tailored extraction behavior, because the genetic operations of the genetic algorithm efficiently capture and carry forward the good "genes" contained in the example data strings. In this way, the generated pattern program will correctly detect and extract data strings of the target data category. Moreover, the use of such example data strings also reduces manual input and errors in the process, because there is no need to manually identify named entities from non-representative data strings. Moreover, the initial population of pattern programs is generated primarily, e.g., 90%, from the example data strings, which significantly reduces the amount of iterative genetic operations required to arrive at a satisfactory pattern program. In the era of big data and cloud-based data services, saving computing resources is critical to managing large-scale data streams.

Further, the fitness function considers whether a pattern program matches the negative example data strings that are not the target of the named entity recognition task. Thus, a pattern program selected based on the fitness function will avoid the data categories represented by the negative example data strings. As a result, false positive errors will be significantly reduced, which makes the results of the named entity recognition task more reliable and meaningful. As such, the techniques herein are efficient and suitable for performing named entity recognition tasks on large-scale data streams.

The techniques operate on the various example data strings without distinguishing between them, which results in a pattern program whose function is to extract the target data category represented by all of the example data strings. In this way, the operation can be carried out fully autonomously without manual intervention. If an initial attempt to generate a single pattern program fails, the example data strings can be regrouped and the parameters of the genetic operation can be adjusted based on an evaluation of the results of the previous operations, without human intervention. In this way, the techniques generate a computer program, such as a regular expression, entirely autonomously based on example data strings representing the data category to be matched by the regular expression.

Embodiments of the described subject matter may include one or more features, alone or in combination. For example, in a first embodiment, a computer-implemented method obtains a first population of candidate programs; generates a second population of candidate programs by performing iterative genetic operations on the first population of candidate programs; and extracts a plurality of second data strings from a data stream using a first candidate program in the second population of candidate programs. The iterative genetic operations include calculating a fitness score for each candidate program in the second population of candidate programs using a fitness function and a plurality of first data strings. The fitness function evaluates a match rate of the candidate program with the plurality of first data strings.

In a second embodiment, a computer-implemented method receives a plurality of first data strings; identifies substrings from the plurality of first data strings; obtains a first population of candidate programs based at least in part on the plurality of first data strings, the substrings being represented as individual units in the candidate programs of the first population; generates a second population of candidate programs by performing iterative genetic operations on the first population of candidate programs, the iterative genetic operations including calculating a fitness score for each candidate program in the second population of candidate programs using a fitness function and the plurality of first data strings, the fitness function evaluating a match rate of the candidate program with the plurality of first data strings; and extracts a plurality of second data strings from a data stream using a first candidate program of the second population of candidate programs.

The foregoing and other described embodiments may each optionally include one or more of the following features.

The first feature, which may be combined with any of the previous or following features, specifies that the method further comprises obtaining a plurality of third data strings. The plurality of third data strings is a subset of the plurality of second data strings. The method also includes generating a second candidate program by iteratively genetically operating on a second population of candidate programs using a plurality of third data strings.

The second feature, which may be combined with any of the previous or following features, specifies that the plurality of first data strings includes a plurality of positive example data strings, each positive example data string representing a target data category of the named entity identification task.

The third feature, which may be combined with any of the previous or following features, specifies that the plurality of first data strings includes a plurality of negative example data strings, each negative example data string representing a data category that is not the target data category.

The fourth feature, which may be combined with any of the previous or following features, specifies a fitness function to evaluate a first match rate of the candidate program with respect to the plurality of positive example data strings and a second match rate of the candidate program with respect to the plurality of negative example data strings.

A fifth feature, combinable with any of the previous or following features, specifies that the method further comprises grouping the plurality of first data strings into a first set of data strings and at least one second set of data strings; and separately performing an iterative genetic operation on the first population of candidate programs using each of the first set of data strings or the at least one second set of data strings.

The sixth feature, which may be combined with any of the previous or following features, specifies a fitness function that further evaluates the compactness of the candidate program and the edit distance between the candidate program and a data string of the plurality of first data strings.

The seventh feature, which may be combined with any of the previous or following features, specifies that the iterative genetic operation includes a cross-breeding operation and a mutation operation.

The eighth feature, which may be combined with any of the previous or following features, specifies that the candidate programs in each of the first population of candidate programs or the second population of candidate programs are regular expressions.

A ninth feature, which may be combined with any of the previous or following features, specifies that a weight is assigned to each of the plurality of first data strings, and that the fitness function evaluates the weight of each of the plurality of first data strings.

A tenth feature, combinable with any of the previous or following features, specifies that the fitness score of the first candidate program is the highest in the second population of candidate programs; that iterative genetic operations are performed on the second population of candidate programs using a plurality of third data strings to generate a third population of candidate programs; and that the fitness score of the second candidate program is the highest in the third population of candidate programs.

The eleventh feature, which may be combined with any one of the previous or following features, specifies that the fitness score of the second candidate program is higher than the fitness score of the first candidate program calculated using at least one of the plurality of first data strings and the plurality of third data strings.

The twelfth feature, which may be combined with any of the previous or following features, specifies that obtaining the first population of candidate programs comprises obtaining at least a portion of the first population of candidate programs based on the plurality of first data strings.

A thirteenth feature, combinable with any of the previous or following features, specifies that the first population of candidate programs comprises a first number of candidate programs, the second population of candidate programs comprises a second number of candidate programs, and the second number is reduced from the first number.

Fourteenth feature, combinable with any of the previous or following features, specifies that the second number is reduced from the first number following one or more of an exponential decay algorithm, a linear decay algorithm, or a staggered decay algorithm.

A fifteenth feature, combinable with any of the previous or following features, specifies that the method sets a minimum number of candidate programs for the second population.

A sixteenth feature, which can be combined with any of the previous or following features, specifies a fitness function to evaluate a length of the candidate program against data string lengths of the plurality of first data strings.

A seventeenth feature, which may be combined with any of the previous or following features, specifies that the plurality of first data strings comprises a first set of positive example data strings, wherein each positive example data string represents a target data class of the named entity identification task, and the fitness function evaluates the length of the candidate program against an average length of all of the first set of positive example data strings.

An eighteenth feature, combinable with any of the previous or following features, specifies that the plurality of first data strings includes a first set of positive example data strings each representing a target data category of the named entity identification task and a second set of negative example data strings each representing a data category that is not the target data category; and that the fitness function evaluates a first number of positive example data strings in the first set of positive example data strings that completely match the candidate program and a second number of negative example data strings in the second set of negative example data strings that completely match the candidate program.

A nineteenth feature, combinable with any of the previous or following features, specifies that the plurality of first data strings includes a first set of positive example data strings each representing a target data category of the named entity identification task and a second set of negative example data strings each representing a data category that is not the target data category; and that the fitness function evaluates a first number of characters in the first set of positive example data strings that match the candidate program and a second number of characters in the second set of negative example data strings that match the candidate program.

In a third embodiment, a system comprises: one or more processors; and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform actions. The actions include: receiving a plurality of first data strings from a user; obtaining a first population of candidate programs based at least in part on the plurality of first data strings; generating a second population of candidate programs by performing iterative genetic operations on the first population of candidate programs, the iterative genetic operations including calculating a fitness score for each candidate program in the second population of candidate programs using a fitness function and the plurality of first data strings; extracting a plurality of second data strings from a data stream using a first candidate program selected from the second population of candidate programs; providing the plurality of second data strings to the user; receiving a plurality of third data strings from the user, the plurality of third data strings being a subset of the plurality of second data strings; and obtaining a second candidate program based at least in part on the plurality of third data strings and the second population of candidate programs.

In a fourth embodiment, an apparatus includes a plurality of modules and units. The plurality of modules and units include: an initial program generation unit operative to obtain a first population of candidate programs; a synthesis unit operative to generate a second population of candidate programs by performing iterative genetic operations on the first population of candidate programs; a fitness measuring unit operative to calculate a fitness score for each candidate program in the second population of candidate programs using a fitness function and a plurality of first data strings, the fitness function evaluating a match rate of the candidate program with the plurality of first data strings; and an extraction module operative to extract a plurality of second data strings from a data stream using a first candidate program selected from the first population of candidate programs and the second population of candidate programs.

In a fifth embodiment, a non-transitory computer-readable storage medium stores executable instructions that cause a processor to perform acts comprising: obtaining a first population of candidate programs; generating a second population of candidate programs by performing iterative genetic operations on the first population of candidate programs, the iterative genetic operations including calculating a fitness score for each candidate program in the second population of candidate programs using a fitness function and a plurality of first data strings; dividing the plurality of first data strings into a first subset of data strings and at least one second subset of data strings based on the fitness scores of the candidate programs in the second population of candidate programs; generating a third population of candidate programs by performing iterative genetic operations on the second population of candidate programs using the first subset of data strings; and extracting a plurality of second data strings from a data stream using a first candidate program selected from the third population of candidate programs.

Embodiments of the subject matter described herein, and the acts and operations, can be implemented in digital electronic circuitry, tangibly embodied in computer software or firmware, in computer hardware comprising the structures disclosed herein and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described herein may be implemented as one or more computer programs, e.g., one or more modules of computer program instructions encoded on a computer program carrier for execution by, or to control the operation of, data processing apparatus. For example, the computer program carrier may include one or more computer-readable storage media having instructions encoded or stored thereon. The carrier may be a tangible, non-transitory computer-readable medium such as a magnetic, magneto-optical disk or optical disk, a solid state drive, Random Access Memory (RAM), Read Only Memory (ROM), or other type of medium. Alternatively or additionally, the carrier may be an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be, or be part of, a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Computer storage media is not a propagated signal.

A computer program, which may also be referred to or described as a program, software application, app, module, software module, engine, script, or code, may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; it can be deployed in any form, including as a stand-alone program or as a module, component, engine, subroutine, or other unit suitable for execution in a computing environment, which may include one or more computers at one or more locations interconnected by a data communications network.

A computer program may, but need not, correspond to a file in a file system. The computer program may be stored in: a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document; in a single file dedicated to the program in question; or multiple coordinated files, such as files that store one or more modules, sub programs, or portions of code.

Processors for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data of a computer program for execution from a non-transitory computer-readable medium coupled to the processor.

The term "data processing apparatus" encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The data processing device may comprise special purpose logic circuitry, e.g., an FPGA (field programmable gate array), an ASIC (application specific integrated circuit), or a GPU (graphics processing unit). In addition to hardware, the apparatus can include code that creates an execution environment for the computer program, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

The processes and logic flows described herein can be performed by one or more computers or processors executing one or more computer programs to perform operations by operating on input data and generating output. The processes or logic flows can also be performed by, and in combination with, special purpose logic circuitry, e.g., an FPGA, an ASIC, a GPU, or by a combination of special purpose logic circuitry and one or more programmed computers.

A computer suitable for executing a computer program may be based on a general purpose or special purpose microprocessor, or both, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. Elements of a computer may include a central processing unit for executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or integrated in, special purpose logic circuitry.

Typically, a computer will also include or be operatively coupled to receive data from or transfer data to one or more storage devices. The storage device may be, for example, a magnetic, magneto-optical disk or optical disk, a solid state drive, or any other type of non-transitory computer readable medium. However, a computer does not necessarily have such a device. Thus, a computer may be coupled to one or more storage devices, e.g., one or more memories located locally and/or remotely. For example, a computer may include one or more local memories that are integral components of the computer; or the computer may be coupled to one or more remote memories located in a cloud network. In addition, a computer may also be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.

Components may be "coupled" to one another by being communicatively connected to one another, such as electrically or optically, either directly or via one or more intermediate components. Components may also be "coupled" to one another if one of the components is integrated into the other. For example, a storage component integrated into a processor (e.g., an L2 cache component) is "coupled" to the processor.

For interacting with a user, embodiments of the subject matter described herein may be implemented on or configured to communicate with a computer having a display device, e.g., an LCD (liquid crystal display) monitor, for displaying information to the user and an input device, e.g., a keyboard and a pointing device, through which the user may provide input to the computer, e.g., a mouse, trackball, or touch pad. Other types of devices may also be used to interact with the user, for example, feedback provided to the user may be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and user input may be received in any form, including acoustic, speech, or tactile input. In addition, the computer may interact with the user by sending and receiving documents to and from a device used by the user, for example, by sending web pages to a web browser on the user device in response to requests received from the web browser, or by interacting with an app running on the user device, such as a smartphone or electronic tablet. In addition, the computer may interact with the user by taking turns sending text messages or other forms of messages to the personal device (e.g., a smartphone running a messaging application) and receiving response messages from the user.

The term "configured" is used herein in relation to systems, apparatuses and computer program components. For a system of one or more computers configured to perform a particular operation or action, it is meant that the system has installed thereon software, firmware, hardware, or a combination thereof that in operation causes the system to perform that operation or action. For one or more computer programs configured to perform a particular operation or action, it is meant that the one or more programs include instructions, which when executed by a data processing apparatus, cause the apparatus to perform the operation or action. For a specific logic circuit configured to perform a particular operation or action, it means that the circuit has electronic logic to perform the operation or action.

While this document contains many specific embodiment details, these should not be construed as limitations on the scope of the claims, which are defined by the claims themselves, but rather as descriptions of specific features of particular embodiments. Certain features that are described herein in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the division of the various system modules and components in the embodiments described above should not be understood as requiring such division in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Specific embodiments of the subject matter have been described. Other implementations are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
