Text recognition method, text recognition device and electronic equipment based on regular matching

Document No.: 830124. Publication date: 2021-03-30.

Reading notes: This technology, "Text recognition method, text recognition device and electronic equipment based on regular matching", was designed by 熊思宇 and 朱永强 on 2020-12-28. Main content: The application provides a text recognition method, a text recognition device and electronic equipment based on regular matching, and relates to the technical field of text recognition. In the application, first, a text to be recognized is obtained, wherein the text to be recognized comprises at least one character to be recognized represented based on variable length coding. Second, a target character to be recognized is determined among the at least one character to be recognized and is converted into a target number of a target system (that is, a number expressed in a target numeral system). Then, the target number is matched based on a deterministic finite state automaton, wherein the deterministic finite state automaton is obtained by converting a target regular expression, and the character corresponding to each transition edge in the deterministic finite state automaton is represented in the target system. Finally, if the target number fails to be matched, the matching processing of the text to be recognized is stopped. Based on the method, the problem of resource waste in existing text recognition technology can be alleviated.

1. A text recognition method based on regular matching is characterized by comprising the following steps:

acquiring a text to be recognized, wherein the text to be recognized comprises at least one character to be recognized, and the character to be recognized is represented based on variable length coding;

determining a target character to be recognized in the at least one character to be recognized, and converting the target character to be recognized into a target number of a target system;

matching the target number based on a pre-obtained deterministic finite state automaton, wherein the deterministic finite state automaton is obtained by processing a target regular expression, and the character corresponding to each transition edge in the deterministic finite state automaton is represented in the target system;

and if the target number fails to be matched, stopping the matching processing of the text to be recognized.

2. The text recognition method based on regular matching according to claim 1, wherein there are a plurality of characters to be recognized, the method further comprising:

step a, if the target number is successfully matched, determining a new target character to be recognized among the characters to be recognized other than the target character to be recognized;

step b, converting the new target character to be recognized into a new target number of the target system, and matching the new target number based on the deterministic finite state automaton;

step c, if the new target number is successfully matched, determining a further new target character to be recognized among the characters to be recognized other than the target character to be recognized and the new target character to be recognized, and executing step b again based on that character;

and step d, if the new target number fails to be matched, stopping the matching processing of the text to be recognized.

3. The text recognition method based on regular matching according to claim 1 or 2, wherein the step of converting the target character to be recognized into a target number of a target system comprises:

determining the byte length of the target character to be recognized;

and converting the target character to be recognized into a target number of a target system based on a conversion rule corresponding to the byte length.

4. The method of claim 3, wherein the step of determining the byte length of the target character to be recognized comprises:

determining whether a target bit corresponding to the target character to be recognized is 0, wherein the target bit is the highest bit of the first byte of the target character to be recognized as stored in binary form in memory;

if the target bit is 0, determining that the byte length of the target character to be recognized is 1;

and if the target bit is 1, determining that the byte length of the target character to be recognized is the target bit number of the target character to be recognized, wherein the target bit number is the number of consecutive 1 bits, counted from the highest bit, in the binary data of the target character to be recognized as stored in memory.

5. The method for recognizing text based on regular matching according to claim 3, wherein the step of converting the target character to be recognized into a target number in a target system based on the conversion rule corresponding to the byte length comprises:

judging whether the byte length is greater than a preset byte length or not;

if the byte length is less than or equal to the preset byte length, converting the first byte of the target character to be recognized, as stored in binary form in memory, into the target system to obtain the target number corresponding to the target character to be recognized;

and if the byte length is greater than the preset byte length, performing binary data deletion processing on each byte of the target character to be recognized as stored in binary form in memory, and converting the retained binary data into the target system to obtain the target number corresponding to the target character to be recognized.

6. The text recognition method based on regular matching according to claim 5, wherein the step of performing binary data deletion processing on each byte of the target character to be recognized as stored in binary form in memory comprises:

for the first byte of the target character to be recognized as stored in binary form in memory, deleting the highest target-length bits of the first byte, wherein the target length is equal to the byte length plus 1;

and for each byte other than the first byte of the target character to be recognized as stored in binary form in memory, deleting the highest 2 bits of that byte.

7. The text recognition method based on regular matching according to claim 1 or 2, further comprising a step of obtaining the deterministic finite state automaton, the step comprising:

determining whether the target regular expression has a target operator, wherein the target operator is used for performing a union operation;

if the target regular expression has the target operator, converting each character corresponding to the target operator into the target system to obtain a target-system number corresponding to each character;

performing interval conversion processing based on the target-system number corresponding to each character to obtain a corresponding target interval, and updating the target regular expression based on the target interval to obtain an updated target regular expression;

and generating the corresponding deterministic finite state automaton based on the updated target regular expression.

8. A text recognition device based on regular matching, comprising:

the text acquisition module is used for acquiring a text to be recognized, wherein the text to be recognized comprises at least one character to be recognized, and the character to be recognized is represented based on variable length coding;

the character conversion module is used for determining a target character to be recognized in the at least one character to be recognized and converting the target character to be recognized into a target number of a target system;

the digital matching module is used for matching the target number based on a pre-obtained deterministic finite state automaton, wherein the deterministic finite state automaton is obtained by processing a target regular expression, and the character corresponding to each transition edge in the deterministic finite state automaton is represented in the target system;

and the matching stopping module is used for stopping the matching processing of the text to be recognized when the target number fails to be matched.

9. The regular matching based text recognition device of claim 8, wherein the character conversion module comprises:

the byte length determining submodule is used for determining the byte length of the target character to be recognized;

and the character to be recognized conversion submodule is used for converting the target character to be recognized into a target number of a target system based on a conversion rule corresponding to the byte length.

10. An electronic device, comprising:

a memory for storing a computer program;

a processor coupled to the memory for executing the computer program stored in the memory to implement the text recognition method based on regular matching according to any one of claims 1 to 7.

Technical Field

The application relates to the technical field of text recognition, in particular to a text recognition method, a text recognition device and electronic equipment based on regular matching.

Background

Regular matching is the process of judging whether a text to be matched conforms to a specified regular expression. Before matching, a Deterministic Finite Automaton (DFA) is generally created from the regular expression. The DFA is composed of state nodes and transition edges, where each state node has one or more transition edges, each indicating the next state reached when a given character is input in the current state. The characters of the text to be matched are fed into the DFA one by one, starting from its initial state node; each character fed in is compared with the characters on the transition edges of the current state node to obtain the next state node to jump to.

When a deterministic finite automaton is described in C++ code, a structure State is defined to represent a state node, and each State may contain a plurality of transition edges. A structure Edge is defined to represent a transition edge; in Edge, a variable accept_character represents the input character and a variable next_state represents the sequence number of the next state node to jump to. The accept_character values of the Edges contained in one State are all different. However, when the characters in the text and the rules are represented by variable length coding, such as UTF-8, one character may occupy 1, 2, 3 or 4 bytes, and accept_character can no longer be defined directly as a char or wchar_t type. For this reason, the UTF-8 regular expression and text content need to be converted into Unicode code point representation in advance, and regular matching is performed afterwards.
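The structures described above might look like the following minimal sketch. The names State, Edge, accept_character and next_state come from the description; the integer field type, the accepting flag and the helper step are assumptions added here for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical layout. Storing accept_character as a 32-bit integer is an
// assumption; it is wide enough for any character once it is held as a number.
struct Edge {
    std::uint32_t accept_character;  // input character on this transition edge
    int next_state;                  // sequence number of the next state node
};

struct State {
    std::vector<Edge> edges;  // accept_character values are pairwise different
    bool accepting = false;   // assumption: marks an accepting state node
};

// Hypothetical helper: returns the next state node for the given input
// character, or -1 if no transition edge of the current state accepts it.
inline int step(const std::vector<State>& dfa, int current, std::uint32_t symbol) {
    assert(current >= 0 && static_cast<std::size_t>(current) < dfa.size());
    for (const Edge& edge : dfa[current].edges) {
        if (edge.accept_character == symbol) return edge.next_state;
    }
    return -1;
}
```

A linear scan over the edges is used only to keep the sketch short; a real implementation might index the edges by character instead.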

However, the inventors have found that the above-described technique has a problem of wasting resources.

Disclosure of Invention

In view of the above, an object of the present application is to provide a text recognition method, a text recognition apparatus and an electronic device based on regular matching, so as to solve the problem of resource waste in the existing text recognition technology.

In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:

a text recognition method based on regular matching comprises the following steps:

acquiring a text to be recognized, wherein the text to be recognized comprises at least one character to be recognized, and the character to be recognized is represented based on variable length coding;

determining a target character to be recognized in the at least one character to be recognized, and converting the target character to be recognized into a target number of a target system;

matching the target number based on a pre-obtained deterministic finite state automaton, wherein the deterministic finite state automaton is obtained by processing a target regular expression, and the character corresponding to each transition edge in the deterministic finite state automaton is represented in the target system;

and if the target number fails to be matched, stopping the matching processing of the text to be recognized.

In a preferred option of the embodiment of the present application, in the text recognition method based on regular matching, there are a plurality of characters to be recognized, and the method further includes:

step a, if the target number is successfully matched, determining a new target character to be recognized among the characters to be recognized other than the target character to be recognized;

step b, converting the new target character to be recognized into a new target number of the target system, and matching the new target number based on the deterministic finite state automaton;

step c, if the new target number is successfully matched, determining a further new target character to be recognized among the characters to be recognized other than the target character to be recognized and the new target character to be recognized, and executing step b again based on that character;

and step d, if the new target number fails to be matched, stopping the matching processing of the text to be recognized.

In a preferred option of the embodiment of the present application, in the text recognition method based on regular matching, the step of converting the target character to be recognized into a target number in a target system includes:

determining the byte length of the target character to be recognized;

and converting the target character to be recognized into a target number of a target system based on a conversion rule corresponding to the byte length.

In a preferred option of the embodiment of the present application, in the text recognition method based on regular matching, the step of determining the byte length of the target character to be recognized includes:

determining whether a target bit corresponding to the target character to be recognized is 0, wherein the target bit is the highest bit of the first byte of the target character to be recognized as stored in binary form in memory;

if the target bit is 0, determining that the byte length of the target character to be recognized is 1;

and if the target bit is 1, determining that the byte length of the target character to be recognized is the target bit number of the target character to be recognized, wherein the target bit number is the number of consecutive 1 bits, counted from the highest bit, in the binary data of the target character to be recognized as stored in memory.

In a preferred option of the embodiment of the present application, in the text recognition method based on regular matching, the step of converting the target character to be recognized into a target number in a target system based on a conversion rule corresponding to the byte length includes:

judging whether the byte length is greater than a preset byte length or not;

if the byte length is less than or equal to the preset byte length, converting the first byte of the target character to be recognized, as stored in binary form in memory, into the target system to obtain the target number corresponding to the target character to be recognized;

and if the byte length is greater than the preset byte length, performing binary data deletion processing on each byte of the target character to be recognized as stored in binary form in memory, and converting the retained binary data into the target system to obtain the target number corresponding to the target character to be recognized.

In a preferred option of the embodiment of the present application, in the text recognition method based on the regular matching, the step of deleting binary data from each byte of the target character to be recognized stored in the memory in a binary form includes:

for the first byte of the target character to be recognized as stored in binary form in memory, deleting the highest target-length bits of the first byte, wherein the target length is equal to the byte length plus 1;

and for each byte other than the first byte of the target character to be recognized as stored in binary form in memory, deleting the highest 2 bits of that byte.

In a preferred option of the embodiment of the present application, in the text recognition method based on regular matching, the method further includes a step of obtaining the deterministic finite state automaton, where the step includes:

determining whether the target regular expression has a target operator, wherein the target operator is used for performing a union operation;

if the target regular expression has the target operator, converting each character corresponding to the target operator into the target system to obtain a target-system number corresponding to each character;

performing interval conversion processing based on the target-system number corresponding to each character to obtain a corresponding target interval, and updating the target regular expression based on the target interval to obtain an updated target regular expression;

and generating the corresponding deterministic finite state automaton based on the updated target regular expression.

The embodiment of the present application further provides a text recognition device based on regular matching, including:

the text acquisition module is used for acquiring a text to be recognized, wherein the text to be recognized comprises at least one character to be recognized, and the character to be recognized is represented based on variable length coding;

the character conversion module is used for determining a target character to be recognized in the at least one character to be recognized and converting the target character to be recognized into a target number of a target system;

the digital matching module is used for matching the target number based on a pre-obtained deterministic finite state automaton, wherein the deterministic finite state automaton is obtained by processing a target regular expression, and the character corresponding to each transition edge in the deterministic finite state automaton is represented in the target system;

and the matching stopping module is used for stopping the matching processing of the text to be recognized when the target number fails to be matched.

In a preferred option of the embodiment of the present application, in the text recognition device based on regular matching, the character conversion module includes:

the byte length determining submodule is used for determining the byte length of the target character to be recognized;

and the character to be recognized conversion submodule is used for converting the target character to be recognized into a target number of a target system based on a conversion rule corresponding to the byte length.

On the basis, an embodiment of the present application further provides an electronic device, including:

a memory for storing a computer program;

and the processor is connected with the memory and is used for executing the computer program stored in the memory so as to realize the text recognition method based on the regular matching.

According to the text recognition method, the text recognition device and the electronic equipment based on regular matching, one target character to be recognized among the at least one character to be recognized included in the text to be recognized is converted into a target number of the target system, so that the target number can be matched based on a pre-obtained deterministic finite state automaton, and the matching processing of the text to be recognized is stopped when that matching fails. This avoids the extra memory that would be occupied if all characters to be recognized were converted into target numbers up front, thereby alleviating the problem of resource waste in existing text recognition technology, and the method has high practical value.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

Fig. 1 is a block diagram of an electronic device according to an embodiment of the present disclosure.

Fig. 2 is a schematic flowchart of steps included in the text recognition method based on regular matching according to the embodiment of the present application.

Fig. 3 is a flowchart illustrating sub-steps included in step S120 in fig. 2.

Fig. 4 is a schematic flowchart of the step of obtaining the deterministic finite state automaton included in the text recognition method based on regular matching according to the embodiment of the present application.

Fig. 5 is a flowchart illustrating other steps included in the text recognition method based on regular matching according to the embodiment of the present application.

Fig. 6 is a block diagram illustrating functional modules included in a text recognition apparatus based on regular matching according to an embodiment of the present application.

Reference numerals: 10-electronic device; 12-memory; 14-processor; 100-text recognition apparatus based on regular matching; 110-text acquisition module; 120-character conversion module; 130-digital matching module; 140-matching stopping module.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

As shown in FIG. 1, an embodiment of the present application provides an electronic device 10 that may include a memory 12, a processor 14, and a text recognition apparatus 100 based on regular matching.

The memory 12 and the processor 14 are electrically connected, directly or indirectly, to realize data transmission or interaction. For example, they may be electrically connected to each other via one or more communication buses or signal lines. The text recognition apparatus 100 based on regular matching includes at least one software functional module that can be stored in the memory 12 in the form of software or firmware. The processor 14 is configured to execute an executable computer program stored in the memory 12, for example, the software functional modules and computer programs included in the text recognition apparatus 100 based on regular matching, so as to implement the text recognition method based on regular matching provided by the embodiment of the present application.

Alternatively, the memory 12 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The Processor 14 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), a System on Chip (SoC), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components.

It is understood that the structure shown in FIG. 1 is only an illustration. The electronic device 10 may include more or fewer components than those shown in FIG. 1, or have a different configuration from that shown in FIG. 1. For example, it may further include a communication unit for exchanging information with other devices; when the electronic device 10 is a background server, the other devices may be terminal devices such as mobile phones and computers. Through the communication unit, the text to be recognized may be acquired from the other devices, or the recognition result of the text to be recognized may be sent to the other devices.

With reference to fig. 2, an embodiment of the present application further provides a text recognition method based on regular matching, which is applicable to the electronic device 10. Wherein the method steps defined by the flow relating to the text recognition method may be implemented by the electronic device 10.

The specific process shown in FIG. 2 will be described in detail below.

Step S110, acquiring a text to be recognized.

In this embodiment, when a text needs to be recognized, the electronic device 10 may acquire the text to be recognized. The text to be recognized may include at least one character to be recognized, and the character to be recognized may be represented based on variable length coding (that is, a coding in which different characters may occupy different numbers of bytes).

Step S120, determining a target character to be recognized from the at least one character to be recognized, and converting the target character to be recognized into a target number in a target system.

In this embodiment, after obtaining the text to be recognized based on step S110, the electronic device 10 may determine one character among the at least one character to be recognized included in the text to be recognized, use it as the target character to be recognized, and then convert the target character to be recognized into a target number of a target system (the target system is a numeral system other than binary).

Step S130, matching the target number based on a pre-obtained deterministic finite state automaton.

In this embodiment, after obtaining the target number based on step S120, the electronic device 10 may perform matching processing on the target number based on the pre-obtained deterministic finite state automaton, so that a matching result corresponding to the target number is obtained.

The deterministic finite state automaton is obtained by processing a target regular expression, and the character corresponding to each transition edge in the deterministic finite state automaton is represented in the target system. If the matching result is that the target number fails to be matched, step S140 may be executed.

Step S140, stopping the matching processing of the text to be recognized.

In this embodiment, after determining that the target number matching fails based on step S130, the electronic device 10 may stop the matching process for the text to be recognized.

Based on the method, when the matching of the target number corresponding to the target character to be recognized fails, the matching processing of the text to be recognized is stopped, so that the problem that more memory resources are occupied due to the fact that all characters to be recognized are directly converted into the target number can be avoided, the problem of resource waste in the existing text recognition technology is solved, and the method has high practical value.
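The flow of steps S110 to S140 can be sketched as a loop that converts and matches one character at a time, under several assumptions. The names Dfa, Converter and match_text are hypothetical, state 0 is assumed to be the initial state, and the final check against an accepting state is an assumption that the steps above do not spell out. The conversion of step S120 is passed in as a callback, which must advance the position past the character it consumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Assumed DFA layout: transitions[state] maps a target number to the next state.
struct Dfa {
    std::vector<std::map<std::uint32_t, int>> transitions;
    std::vector<bool> accepting;  // assumption: marks accepting states
};

// Hypothetical callback standing in for step S120: converts the character that
// starts at text[pos] into its target number and advances pos past it.
using Converter = std::function<std::uint32_t(const std::string&, std::size_t&)>;

// Converts and matches one character at a time. On the first failed match the
// loop ends, so the remaining characters are never converted.
bool match_text(const Dfa& dfa, const std::string& text, const Converter& convert,
                std::size_t* converted = nullptr) {
    assert(!dfa.transitions.empty() && dfa.accepting.size() == dfa.transitions.size());
    int state = 0;  // assumption: state 0 is the initial state
    std::size_t pos = 0;
    std::size_t count = 0;  // number of characters actually converted
    bool ok = true;
    while (pos < text.size()) {
        const std::uint32_t number = convert(text, pos);  // step S120
        ++count;
        const auto& edges = dfa.transitions[state];
        const auto it = edges.find(number);               // step S130
        if (it == edges.end()) {                          // step S140: stop matching
            ok = false;
            break;
        }
        state = it->second;
    }
    if (converted != nullptr) *converted = count;
    return ok && dfa.accepting[state];
}
```

The optional converted counter exists only to make the saving visible: for a text whose first character fails to match, a single conversion is performed regardless of the text's length.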

In the first aspect, it should be noted that, in step S110, when the character to be recognized in the text to be recognized is represented based on variable length coding, a specific coding rule is not limited.

For example, in an alternative example, the character to be recognized may be represented based on UTF-8 encoding (a standard mechanism that encodes each Unicode code point as a sequence of one to four bytes in a byte stream).

In the second aspect, it should be noted that, in step S120, a specific manner for determining one target character to be recognized is not limited, and may be selected according to actual application requirements.

For example, in an alternative example, one character to be recognized may be randomly determined among the at least one character to be recognized as a target character to be recognized.

For another example, in another alternative example, one character to be recognized may be determined according to the precedence order of each character to be recognized in the at least one character to be recognized, such as taking the first (top) character to be recognized as the target character to be recognized.

It should be noted that, in step S120, a specific manner of converting the target character to be recognized into a target number in a target system is not limited, and may be selected according to actual requirements.

For example, in an alternative example, the target character to be recognized may be converted according to a predetermined conversion rule to obtain a corresponding target number.

For another example, in another alternative example, in conjunction with fig. 3, step S120 may include step S121 and step S122, which are described in detail below.

Step S121, determining the byte length of the target character to be recognized.

In the present embodiment, after obtaining the text to be recognized based on step S110 and determining the target character to be recognized based on the foregoing steps, the byte length of the target character to be recognized may be determined.

Step S122, converting the target character to be recognized into a target number of a target system based on a conversion rule corresponding to the byte length.

In this embodiment, after determining the byte length based on step S121, the target character to be recognized may be converted into a target number of the target system based on the conversion rule corresponding to that byte length. In this way, target characters to be recognized with different byte lengths are each converted by the rule suited to them, so that the obtained target numbers can effectively represent the target characters to be recognized.

Optionally, in the above example, the specific manner for determining the byte length based on step S121 is not limited, and may be selected according to the actual application requirement.

For example, in an alternative example, in order to enable the determined byte length to have a high reliability, thereby ensuring that the conversion rule corresponding to the byte length can perform effective conversion processing, the step S121 may include the following steps:

first, determining whether a target bit corresponding to the target character to be recognized is 0, wherein the target bit is the highest bit of the first byte of the target character to be recognized as stored in binary form in memory; second, if the target bit is 0, determining that the byte length of the target character to be recognized is 1; and third, if the target bit is 1, determining that the byte length of the target character to be recognized is the target bit number of the target character to be recognized, wherein the target bit number is the number of consecutive 1 bits, counted from the highest bit, in the binary data of the target character to be recognized as stored in memory.

For the above steps, in a specific application example, if the target character to be recognized is "a", which is represented in binary form in memory as "01100001", the highest bit of the first byte of the target character to be recognized is 0, so the determined byte length is 1. If the target character to be recognized is "我", which is represented in binary form in memory as "111001101000100010010001", the highest bit of the first byte of the target character to be recognized is 1, and the number of consecutive 1 bits counted from the highest bit is 3, so the determined byte length is 3.
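The byte length rule above can be sketched as follows. The function name utf8_byte_length is hypothetical, and the sketch assumes the byte passed in really is the first byte of a well-formed UTF-8 character; it performs no validation.

```cpp
#include <cassert>
#include <cstdint>

// Returns the byte length of a character from its first byte.
// Assumption: first_byte is the leading byte of a well-formed UTF-8 character.
int utf8_byte_length(std::uint8_t first_byte) {
    // Highest bit is 0: the character occupies a single byte.
    if ((first_byte & 0x80) == 0) return 1;
    // Highest bit is 1: count the consecutive 1 bits starting at the highest bit.
    int ones = 0;
    for (std::uint8_t mask = 0x80; mask != 0 && (first_byte & mask) != 0; mask >>= 1) {
        ++ones;
    }
    return ones;
}
```

For the two characters in the example, the first bytes are 0x61 ("01100001") and 0xE6 ("11100110"), which give byte lengths 1 and 3 respectively.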

Optionally, in the above example, the specific manner of converting the target character to be recognized into the target number in the target system based on step S122 is not limited, and may be selected according to the actual application requirement.

For example, in an alternative example, in order to ensure that the converted target number can effectively represent the target character to be recognized, step S122 may include the following steps:

First, determine whether the byte length is greater than a preset byte length. If the byte length is less than or equal to the preset byte length, convert the first byte of the target character to be recognized, as stored in memory in binary form, into the target system to obtain the target number corresponding to the target character to be recognized. If the byte length is greater than the preset byte length, delete part of the binary data from each byte of the target character to be recognized as stored in memory in binary form, and convert the retained binary data into the target system to obtain the target number corresponding to the target character to be recognized.

For the above steps, in a specific application example, as in the foregoing example, the target character to be recognized "a" is represented in memory in binary form as "01100001", and the corresponding byte length is 1. If the preset byte length is 1, the first byte "01100001" is converted into the target system, for example into decimal, which gives the target number 97.

That is, the character "a" may be represented by the number "97".

It can be understood that, when the byte length is greater than the preset byte length, the specific manner of performing the binary data deletion processing is not limited, and may be selected according to the actual application requirements.

For example, in an alternative example, the binary data deletion processing may be performed on each byte of the target character to be recognized, as stored in memory in binary form, based on the following steps:

First, for the first byte of the target character to be recognized as stored in memory in binary form, delete the highest bits of that byte, the number of deleted bits being the target length, where the target length equals the byte length plus 1. Second, for each byte other than the first byte of the target character to be recognized as stored in memory in binary form, delete the highest 2 bits.

For the above steps, in a specific application example, as in the foregoing example, the target character to be recognized is represented in memory in binary form as "111001101000100010010001", and the corresponding byte length is 3. If the preset byte length is 1, the byte length is greater than the preset byte length, so the target length is 3 + 1 = 4. The highest 4 bits of the first byte "11100110" are therefore deleted, leaving "0110"; the highest 2 bits of the second byte "10001000" are deleted, leaving "001000"; and the highest 2 bits of the third byte "10010001" are deleted, leaving "010001".

Based on this, "converting the retained binary data into the target system" means converting "0110001000010001" into the target system, for example into decimal, which gives the target number 25105.

That is, the character "我" may be represented by the number "25105".
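The conversion rule and the bit-deletion steps above can be sketched as follows. This is an illustrative sketch only: it assumes UTF-8 input, takes the byte length from the length of the byte string passed in, and uses a hypothetical function name `to_target_number` with a hypothetical `preset_length` parameter defaulting to 1. The result is an integer, which Python prints in decimal.

```python
def to_target_number(encoded: bytes, preset_length: int = 1) -> int:
    """Convert the bytes of one character into its target number."""
    byte_length = len(encoded)
    # Byte length within the preset length: use the first byte as-is.
    if byte_length <= preset_length:
        return encoded[0]
    # First byte: delete the highest (byte_length + 1) bits, keep the rest.
    kept_bits = 8 - (byte_length + 1)
    value = encoded[0] & ((1 << kept_bits) - 1)
    # Every other byte: delete the highest 2 bits, keep the low 6 bits.
    for byte in encoded[1:]:
        value = (value << 6) | (byte & 0b00111111)
    return value


print(to_target_number("a".encode("utf-8")))   # 97
print(to_target_number("我".encode("utf-8")))  # 25105
```

For UTF-8 input the retained bits are exactly the Unicode code point of the character, so the result agrees with Python's built-in `ord`.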

In the third aspect, it should be noted that, to ensure that step S130 can be executed effectively, the deterministic finite state automaton needs to be obtained first. With reference to fig. 4, obtaining the deterministic finite state automaton may include step S151, step S152, step S153, and step S154, which are described in detail below.

Step S151, determining whether the target regular expression has a target operator.

In this embodiment, before the matching processing is performed based on step S130, it is also necessary to determine whether the target regular expression used for the matching processing has a target operator.

The target operator is used to perform a union operation; one example is the metacharacter ".". When the target regular expression has the target operator, step S152 may be performed.

Step S152, converting each character corresponding to the target operator into the target system, and obtaining a target-system number corresponding to each character.

In this embodiment, since the target character to be recognized has been converted into a target number in the target system in step S120, in order for the matching processing to proceed effectively, when it is determined based on step S151 that the target regular expression has the target operator, each character corresponding to the target operator may be converted into the target system, so that the target-system number corresponding to each character is obtained.

Step S153, performing interval conversion processing based on the target-system number corresponding to each character to obtain a corresponding target interval, and updating the target regular expression based on the target interval to obtain an updated target regular expression.

In this embodiment, after the target-system number corresponding to each character is obtained based on step S152, the interval conversion processing may be performed based on those numbers to obtain the corresponding target interval. The target regular expression is then updated based on the target interval to obtain the updated target regular expression.

For example, for a single character such as "a" that is not numerically consecutive with any other character, the corresponding number 97 may be used as both the lower limit and the upper limit of the target interval, giving [97,97]. As another example, several numerically consecutive characters joined by the union operation, such as "a|b|c", may be represented as [97,99]. As a further example, since the decimal value of the line feed character is 10, the metacharacter "." may be represented as [0,9], [11,Max], where Max is the maximum value in the corresponding alphabet; for example, the maximum value is 255 for an alphabet based on (extended) ASCII codes.
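The interval conversion can be illustrated with a short sketch. The helper names `to_intervals` and `any_except_newline` are hypothetical, intervals are modelled as closed `(low, high)` tuples, and the line-feed value is taken from `ord("\n")` rather than hard-coded; none of this is prescribed by the application.

```python
def to_intervals(codes):
    """Merge target numbers into sorted, closed intervals [low, high]."""
    intervals = []
    for code in sorted(set(codes)):
        if intervals and code == intervals[-1][1] + 1:
            # Numerically consecutive: extend the current interval.
            intervals[-1][1] = code
        else:
            # Gap: start a new single-value interval.
            intervals.append([code, code])
    return [tuple(interval) for interval in intervals]


def any_except_newline(max_code):
    """Intervals for a metacharacter matching everything but line feed."""
    newline = ord("\n")
    return [(0, newline - 1), (newline + 1, max_code)]


print(to_intervals([ord("a")]))             # [(97, 97)]
print(to_intervals([ord(c) for c in "abc"]))  # [(97, 99)]
print(any_except_newline(255))
```

With this representation a union of many characters costs one interval per consecutive run instead of one entry per character.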

Step S154, generating a corresponding deterministic finite state automaton based on the updated target regular expression.

In this embodiment, after the updated target regular expression is obtained based on step S153, a corresponding deterministic finite state automaton may be generated based on the updated target regular expression.

Because the union operation is handled with an interval representation, a set of alternative characters can be carried by a single transition edge. This avoids having to configure a large number of transition edges, and correspondingly more state nodes, during automaton generation, and therefore effectively avoids an explosion in the number of state nodes.

Optionally, in the above example, the specific manner of generating the corresponding deterministic finite state automaton based on step S154 is not limited, and may be selected according to the actual application requirement.

For example, in an alternative example, an NFA (a non-deterministic finite state automaton, in which a given pair of state and input symbol may have several possible next states) equivalent to the updated target regular expression may be constructed based on the Thompson construction, and the NFA may then be converted into a DFA (a deterministic finite state automaton, in which each pair of state and input symbol has a uniquely determined next state) based on the subset construction.
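The NFA-to-DFA step can be sketched with a compact subset construction. The sketch makes several simplifying assumptions that are not taken from the application: the NFA is written by hand instead of being produced by the Thompson construction, it is stored as a dictionary mapping each state to a list of `(label, target)` pairs, a label of `None` stands for an epsilon edge, and interval labels are treated as atomic symbols (any two labels are either identical or disjoint).

```python
from collections import deque


def epsilon_closure(nfa, states):
    """All NFA states reachable from `states` through epsilon edges."""
    stack, closure = list(states), set(states)
    while stack:
        state = stack.pop()
        for label, target in nfa.get(state, []):
            if label is None and target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)


def subset_construction(nfa, start, accepting):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = epsilon_closure(nfa, {start})
    dfa, queue = {}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        # Group the targets of all non-epsilon edges by interval label.
        moves = {}
        for state in current:
            for label, target in nfa.get(state, []):
                if label is not None:
                    moves.setdefault(label, set()).add(target)
        dfa[current] = {label: epsilon_closure(nfa, targets)
                        for label, targets in moves.items()}
        queue.extend(dfa[current].values())
    dfa_accepting = {state for state in dfa if state & accepting}
    return dfa, start_set, dfa_accepting


# Hand-written NFA for "one or more characters in [97,99]".
nfa = {0: [((97, 99), 1)], 1: [(None, 2)], 2: [((97, 99), 2), (None, 3)]}
dfa, dfa_start, dfa_accepting = subset_construction(nfa, 0, {3})
print(len(dfa), len(dfa_accepting))  # 3 2
```

A full implementation would also have to split overlapping intervals before grouping edges, which is omitted here for brevity.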

In the fourth aspect, it should be further noted that, in the above example, after determining that the target number matching fails based on step S130, step S140 may be executed to stop performing matching processing on the text to be recognized, so that the problem of resource waste may be avoided. If it is determined that the target number is successfully matched based on step S130, the matching process may be continued to be performed on other characters to be recognized, so as to complete the recognition and matching of the text to be recognized.

Based on this, in conjunction with fig. 5, the text recognition method based on regular matching may further include step S160, step S170, step S180, and step S190, which is described in detail below.

Step S160, determining a new target character to be recognized from the characters to be recognized except the target character to be recognized.

In this embodiment, when the matching result obtained in step S130 is that the target number is successfully matched, the electronic device 10 may select, from the characters to be recognized included in the text to be recognized, a character to be recognized other than the target character to be recognized, and determine it as the new target character to be recognized. For example, the character selected may be the one immediately following the target character to be recognized: if the target character to be recognized is the first character in the text to be recognized, the new target character to be recognized is the second character.

Step S170, converting the new target character to be recognized into a new target number in the target system, and performing matching processing on the new target number based on the deterministic finite state automaton.

In this embodiment, after the new target character to be recognized is determined based on step S160, the electronic device 10 may convert the new target character to be recognized into a new target number in the target system, and then perform matching processing on the new target number based on the deterministic finite state automaton (for the specific processing procedure, reference may be made to the foregoing explanation of step S120 and step S130, and details are not repeated here).

If the new target number is successfully matched as a result of the matching process performed on the new target number, step S180 may be performed; otherwise, if the result of the matching process performed on the new target number is that the new target number fails to be matched, step S190 may be performed.

Step S180, determining a new target character to be recognized again from the characters to be recognized other than the target character to be recognized and the previous new target character to be recognized.

In this embodiment, after obtaining the result that the new target number is successfully matched based on step S170, the electronic device 10 may select, from the characters to be recognized included in the text to be recognized, one character to be recognized other than those already used as target characters to be recognized, and determine it as the new target character to be recognized. Step S170 may then be executed again based on this new target character to be recognized, and the loop repeats, so that as long as each match succeeds, every character to be recognized included in the text to be recognized is eventually matched.

And step S190, stopping the matching processing of the text to be recognized.

In this embodiment, after obtaining the result of the new target number matching failure based on step S170, the electronic device 10 may stop performing the matching process on the text to be recognized, so that the conversion process on other characters to be recognized may not be performed, and the problem of resource waste is effectively avoided.
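The loop formed by steps S160 to S190 can be sketched as a generator that converts one character at a time, combined with a matcher that stops at the first failure. All names here (`iter_target_numbers`, `match`) are hypothetical, the input is assumed to be valid UTF-8, and the DFA is a hand-written dictionary mapping each state to a list of `((low, high), next_state)` edges rather than one generated from a regular expression.

```python
def iter_target_numbers(data: bytes):
    """Lazily yield one target number per UTF-8 character."""
    i = 0
    while i < len(data):
        first = data[i]
        if not first & 0x80:
            length, value = 1, first
        else:
            # Byte length = number of leading 1 bits in the first byte.
            length = 0
            while first & (0x80 >> length):
                length += 1
            value = first & ((1 << (7 - length)) - 1)
            for byte in data[i + 1:i + length]:
                value = (value << 6) | (byte & 0x3F)
        yield value
        i += length


def match(data, dfa, start, accepting):
    """Return (matched, number of characters actually converted)."""
    state, converted = start, 0
    for number in iter_target_numbers(data):
        converted += 1
        for (low, high), next_state in dfa.get(state, []):
            if low <= number <= high:
                state = next_state
                break
        else:
            # No edge accepts this number: stop, leaving the rest unconverted.
            return False, converted
    return state in accepting, converted


# Hand-written DFA: the character 25105 followed by any number of [97,99].
dfa = {0: [((25105, 25105), 1)], 1: [((97, 99), 1)]}
print(match("我abc".encode("utf-8"), dfa, 0, {1}))   # (True, 4)
print(match("我xabc".encode("utf-8"), dfa, 0, {1}))  # (False, 2)
```

In the second call the match fails on the second character, so the three characters after it are never converted, which is the saving described above.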

With reference to fig. 6, an embodiment of the present application further provides a text recognition apparatus 100 based on regular matching, which is applicable to the electronic device 10. The regular matching-based text recognition apparatus 100 may include a text acquisition module 110, a character conversion module 120, a number matching module 130, and a matching stop module 140.

The text obtaining module 110 may be configured to obtain a text to be recognized, where the text to be recognized includes at least one character to be recognized, and the character to be recognized is represented based on variable length coding. In this embodiment, the text obtaining module 110 may be configured to execute step S110 shown in fig. 2, and reference may be made to the foregoing description of step S110 for relevant contents of the text obtaining module 110.

The character conversion module 120 may be configured to determine a target character to be recognized from the at least one character to be recognized, and convert the target character to be recognized into a target number in a target system. In this embodiment, the character conversion module 120 may be configured to perform step S120 shown in fig. 2, and reference may be made to the foregoing description of step S120 for relevant contents of the character conversion module 120.

The number matching module 130 may be configured to perform matching processing on the target number based on a pre-obtained deterministic finite state automaton, where the deterministic finite state automaton is obtained by converting a target regular expression, and the character corresponding to each transition edge in the deterministic finite state automaton is represented based on the target system. In this embodiment, the number matching module 130 may be configured to perform step S130 shown in fig. 2, and reference may be made to the foregoing description of step S130 for relevant contents of the number matching module 130.

The matching stopping module 140 may be configured to stop the matching process on the text to be recognized when the target number fails to be matched. In this embodiment, the matching stopping module 140 may be configured to execute step S140 shown in fig. 2, and reference may be made to the foregoing description of step S140 for relevant contents of the matching stopping module 140.

Optionally, in an alternative example, the character conversion module 120 may include a byte length determination sub-module and a character conversion sub-module to be recognized.

In detail, the byte length determination sub-module is configured to determine the byte length of the target character to be recognized. The to-be-recognized character conversion sub-module is configured to convert the target character to be recognized into a target number in the target system based on the conversion rule corresponding to the byte length.

In summary, according to the text recognition method, the text recognition device and the electronic device based on regular matching provided by the application, one target character to be recognized among the at least one character to be recognized included in the text to be recognized is converted into a target number in the target system, the target number is matched based on a pre-obtained deterministic finite state automaton, and the matching of the text to be recognized is stopped when the matching fails. This avoids the extra memory occupied when all characters to be recognized are converted into target numbers up front, thereby alleviating the resource waste in existing text recognition technology, and gives the method high practical value.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the portion thereof that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
