Status bit-based Chinese address deduplication method, system and device

Document No.: 682763 Published: 2021-04-30

Reading note: this technology, "Status bit-based Chinese address deduplication method, system and device" (基于状态位的中文地址去重方法、系统及设备), was designed by 虞开稳 on 2021-01-12. Abstract: The application discloses a status bit-based Chinese address deduplication method, system and device. The method comprises the following steps: acquiring original address data; obtaining the first three-level address and the suffix address string in the original address data with reference to national administrative division data; traversing the original address data by using a HashSet, and setting status bits corresponding to the original address data; checking and correcting the status bits; and removing the repeated first three-level addresses and/or suffix addresses according to the corrected status bits. The application ensures that characters repeated within an address string are found, so that redundancy can be removed from the address string.

1. A Chinese address deduplication method based on status bits is characterized by comprising the following steps:

a data acquisition step: acquiring original address data;

an address acquisition step: combining national administrative division data to obtain the first three-level address and the suffix address in the original address data;

a traversing step: traversing the original address data by using a HashSet, and setting status bits corresponding to the original address data;

a checking step: checking the status bits and correcting the status bits;

a duplicate removal step: removing the repeated first three-level addresses and/or suffix addresses according to the corrected status bits.

2. The method of claim 1, wherein the first three-level address in the address acquisition step comprises the subordination relationships among province, city, and district/county.

3. The method of claim 1, wherein the status bits are divided into repeated status bits and non-repeated status bits.

4. The method of claim 1, wherein the checking step comprises the steps of:

a status bit correction step: correcting the status bit;

a suffix checking step: performing a repeated-string suffix check on the status bit.

5. The method of claim 4, wherein the status bit correction step comprises the steps of:

a first judgment step: judging whether the repeated status bits occupy two or more consecutive positions;

a first correction step: if so, the repeated state bit is not changed, otherwise, the repeated state bit is corrected to be the non-repeated state bit.

6. The method of claim 4, wherein the suffix checking step comprises the steps of:

a setting step: setting a suffix set;

a second judgment step: judging whether a repeated address marked with the repeated status bit in the original address data matches the suffix set;

a second correction step: if so, correcting the status bit of the repeated address to the non-repeated status bit; otherwise, keeping the status bit of the repeated address unchanged.

7. A Chinese address deduplication system based on status bits is characterized by comprising:

the data acquisition module acquires original address data;

the address acquisition module is used for acquiring the first three-level address and the suffix address character string in the original address data by combining national administrative division data;

the traversal module is used for traversing the original address data by utilizing the HashSet and setting a state bit corresponding to the original address data;

the checking module is used for checking the state bit and correcting the state bit;

and the duplicate removal module is used for removing the repeated first three-level addresses and/or suffix addresses according to the corrected status bits.

8. The system of claim 7, wherein the first three-level address in the address acquisition module comprises the subordination relationships among province, city, and district/county.

9. The system of claim 7, wherein the checking module comprises:

a state bit correction unit correcting the state bit;

and a suffix check unit for performing repeat string suffix check on the status bit.

10. An apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the state bit based Chinese address deduplication method of any one of claims 1-6 when executing the computer program.

Technical Field

The invention relates to the technical field of data processing. More particularly, the present invention relates to a method, system and device for deduplicating Chinese addresses based on status bits.

Background

With the development of Chinese search engines and data mining technologies, efficient duplicate detection for Chinese addresses has developed rapidly and has attracted wide attention from industry and academia. Compared with English addresses, Chinese addresses are written in many variable ways and have complex semantics, which makes duplicate detection more challenging.

Generally, business scenarios that need to process Chinese addresses include filling in express delivery addresses, bank credit and loan checks, and personal information inquiries by official examination departments. In the express industry in particular, the number of parcels has grown exponentially with the development of the logistics industry in recent years, and accurate pickup and delivery places high demands on address accuracy. Existing data deduplication methods mainly focus on judging the similarity of text data, the dependency relationships among data, data abbreviations, and the like, and the following problems remain in Chinese address deduplication:

1. when an address is filled in, the system may restrict only the first three levels (province, city, district/county) to selection, or the whole address may even be written by hand and the final address obtained by machine scanning. A character recognition system cannot guarantee that every Chinese character is recognized accurately, so missing addresses, redundancy caused by partially repeated addresses, scanning errors, and similar problems are unavoidable;

2. duplicate detection across multiple addresses mainly targets the case where the same address has multiple expressions; it cannot remove redundant information from, or normalize, a single address;

3. apart from the first three levels (province, city, district/county), the remaining parts of an address are expressed flexibly and cannot be deduplicated in a uniform way.

Disclosure of Invention

The embodiment of the application provides a Chinese address duplication eliminating method based on state bits, which is used for at least solving the problem of subjective factor influence in the related technology.

The invention provides a Chinese address duplication eliminating method based on status bits, which comprises the following steps:

a data acquisition step: acquiring original address data;

an address acquisition step: combining national administrative division data to obtain the first three-level address and the suffix address character string in the original address data;

a traversing step: traversing the original address data by using a HashSet, and setting status bits corresponding to the original address data;

a checking step: checking the status bits and correcting the status bits;

a duplicate removal step: removing the repeated first three-level addresses and/or suffix addresses according to the corrected status bits.

As a further improvement of the present invention, the top three levels of addresses in the address obtaining step include subordination relationships between provinces, cities and counties.

As a further development of the invention, the status bits are divided into repetitive status bits and non-repetitive status bits.

As a further improvement of the present invention, the checking step specifically comprises the steps of:

a status bit correction step: correcting the status bit;

a suffix checking step: performing a repeated-string suffix check on the status bit.

As a further improvement of the present invention, the status bit correcting step specifically includes the steps of:

a first judgment step: judging whether the repeated status bits occupy two or more consecutive positions;

a first correction step: if so, the repeated state bit is not changed, otherwise, the repeated state bit is corrected to be the non-repeated state bit.

As a further improvement of the present invention, the suffix checking step specifically includes the steps of:

a setting step: setting a suffix set;

a second judgment step: judging whether a repeated address marked with the repeated status bit in the original address data matches the suffix set;

a second correction step: if so, correcting the status bit of the repeated address to the non-repeated status bit; otherwise, keeping the status bit of the repeated address unchanged.

Based on the same inventive concept, the invention also discloses a status bit-based Chinese address deduplication system based on any of the Chinese address deduplication methods described above.

The status bit-based Chinese address deduplication system comprises:

the data acquisition module acquires original address data;

the address acquisition module is used for acquiring the first three-level address and the suffix address character string in the original address data by combining national administrative division data;

the traversal module is used for traversing the original address data by utilizing the HashSet and setting a state bit corresponding to the original address data;

the checking module is used for checking the state bit and correcting the state bit;

and the duplicate removal module is used for removing the repeated first three-level addresses and/or suffix addresses according to the corrected status bits.

As a further improvement of the present invention, the top three levels of addresses in the address obtaining module include subordination relationships between provinces, cities, districts and counties.

As a further improvement of the present invention, the inspection module specifically includes:

a state bit correction unit correcting the state bit;

and a suffix check unit for performing repeat string suffix check on the status bit.

In addition, to achieve the above object, the present invention further provides a device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the status bit-based Chinese address deduplication method when executing the computer program.

Compared with the prior art, the invention has the following beneficial effects:

1. a state bit-based Chinese address deduplication method is provided, wherein the duplication condition in a Chinese address is eliminated based on the state bit, a state array with the same length as an address character string is defined and set by using a built-in data structure of Java, such as HashSet, and the duplication state is set for a specific index position, so that the duplication address in the address character string is obtained;

2. the repeated characters of the address character string are ensured to be found, the address character string is normalized or subjected to redundancy removal, relatively complete and accurate address information is provided for the downstream function of the address service, the address information can be applied to services such as public security, banking, logistics distribution and the like, the service efficiency is improved, and the cost is reduced;

3. effective address information can be represented at minimum cost, and the cost of business practice and communication is reduced;

4. the address character string duplication removal reduces the information storage cost, improves the accuracy and uniqueness of customer service in specific service scenes (express industry, bank insurance companies and the like), is favorable for clustering and mining a plurality of information in the same prefix address, and provides a foundation for the development of downstream data services;

5. the repetition of the province, city, district and county level address (Top3 level address), the repetition of the Top3 level address in the rest of the addresses except for the Top3 level address, and the partial repetition of the rest of the addresses can be effectively removed by the status bit, so that the addresses can be acquired more accurately.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a flowchart illustrating an overall method for removing duplicate addresses based on status bits according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart illustrating a main body of the present embodiment;

FIG. 3 is a flowchart illustrating the overall process of step S4 disclosed in FIG. 1;

FIG. 4 is a flowchart illustrating the whole step S41 disclosed in FIG. 3;

FIG. 5 is a flowchart illustrating the whole step S42 disclosed in FIG. 3;

FIG. 6 is a block diagram of a system architecture for removing duplicate addresses from a Chinese address based on status bits according to this embodiment;

FIG. 7 is a block diagram of a computer device according to an embodiment of the present invention.

In the above figures:

1. a data acquisition module; 2. an address acquisition module; 3. a traversal module; 4. a checking module; 5. a duplicate removal module; 41. a status bit correction unit; 411. a first judgment unit; 412. a first correcting unit; 42. a suffix checking unit; 421. a setting unit; 422. a second judgment unit; 423. a second correcting unit; 80. a bus; 81. a processor; 82. a memory; 83. a communication interface.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.

It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference to the terms "first," "second," "third," and the like in this application merely distinguishes similar objects and is not to be construed as referring to a particular ordering of objects.

The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that the functional, methodological, or structural equivalents of these embodiments or alternatives thereof fall within the scope of the present invention.

Before describing in detail the various embodiments of the present invention, the core inventive concepts of the present invention are summarized and described in detail by the following several embodiments.

The invention can remove the duplication of the Chinese address based on the status bit, effectively remove the duplicated address data and accurately obtain the effective address information.

Example one:

Referring to FIGS. 1 to 5, this example discloses an embodiment of a status bit-based Chinese address deduplication method (hereinafter referred to as the "method").

Specifically, the overall concept of the method is described first. The method is based on HashSet, a built-in Java data structure, and relies on the fact that a HashSet holds no duplicate elements. A status bit array with the same length as the address string is created, and the address string is scanned character by character. If the current character already exists in the HashSet, the status bit at the corresponding index is set to the repeated status bit; if it does not, the character is added to the HashSet and the status bit is set to the non-repeated status bit.
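The scan described above can be sketched in Java as follows. This is a minimal illustration and not the applicant's implementation: the class name `StateBitScan` and the method name `markStateBits` are hypothetical, and the sketch assumes that each address character fits in a single Java `char`, which holds for commonly used Chinese characters.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: builds the status bit array for one address string.
class StateBitScan {
    // Returns an array as long as the address: 1 = repeated, 0 = not repeated.
    static int[] markStateBits(String address) {
        int[] bits = new int[address.length()]; // every position starts at 0
        Set<Character> seen = new HashSet<>();
        for (int i = 0; i < address.length(); i++) {
            // Set.add returns false when the character is already in the set,
            // which is exactly the "repeated" case.
            boolean firstOccurrence = seen.add(address.charAt(i));
            bits[i] = firstOccurrence ? 0 : 1;
        }
        return bits;
    }
}
```

Each character is looked up once, so the scan takes time roughly proportional to the length of the address string.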

Specifically, once the scan is finished, the status bit array records where the address string repeats. The status bits are then self-corrected according to the relationship between neighbouring repeated status bits, and a status bit suffix check based on address aliases is applied. Together, these two checks ensure that the repeated characters of the address string are found, so that the address string can be normalized or stripped of redundancy.

Specifically referring to fig. 1 and 2, the method disclosed in this embodiment mainly includes the following steps:

step S1, original address data is acquired.

Then, step S2 is executed to obtain the first three-level address and the suffix address character string in the original address data by combining the national administrative division data.

Specifically, in some embodiments, the first three-level addresses (Top3-level addresses) are obtained with reference to national administrative division data. The Top3-level addresses comprise the subordination relationships among province, city, and district/county. Matching is then performed against the address data to obtain the Top3-level address of the original address data and the remaining address string (the suffix address string).
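The source does not say how the matching is carried out. Purely as an illustrative sketch, the split into a Top3-level prefix and a suffix address string could look like the Java below. The class `Top3Split`, the method `split`, and the representation of the administrative division data as a list of {province, city, district/county} triples are all assumptions, as is the simplification that the address begins with the three names written out in full.

```java
import java.util.List;

// Hypothetical helper: separates the Top3-level prefix from the suffix address string.
class Top3Split {
    // triples: each entry is {province, city, district/county} (assumed data layout).
    // Returns {top3Prefix, suffix}; the prefix is empty when no triple matches.
    static String[] split(String address, List<String[]> triples) {
        for (String[] t : triples) {
            String prefix = t[0] + t[1] + t[2];
            if (address.startsWith(prefix)) {
                return new String[] { prefix, address.substring(prefix.length()) };
            }
        }
        return new String[] { "", address };
    }
}
```

A practical matcher would also have to cope with abbreviated or missing levels, which this sketch deliberately ignores.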

Specifically, in some of these embodiments, there are essentially four address repetition modes, as follows:

1. zhejiang art museum No. 138 of south mountain of Hangzhou city, Zhejiang province, two south doors;

2. 2 units 402 of a Tiandu jazz garden in the Hangzhou region of Hangzhou, Zhejiang and Hangzhou, province;

3. sibin in Sibin Si-bin in Neizhou, Anhui Sizhou, Si-bin in Sizhou, Anhui Si-bin B8-1;

4. 300 # Wanke Yungu apartment 13-1301 Zhejiang province West lake region three-pier Zhejiang province four-pier Yungu apartment 300.

In the first address, part of the Top3-level address (province, city, district/county) recurs outside the Top3-level prefix, but this repetition must not be removed. In the second address, the Top3-level address recurs outside the Top3-level prefix and must be deduplicated. The third address likewise contains Top3-level repetition that must be deduplicated. The fourth address contains both Top3-level repetition and suffix address repetition. These four faulty addresses all involve repetition under different conditions: some repeated information must be removed and some must be retained, and the repetition may lie in the Top3-level address or in the subsequent address.

Then, step S3 is executed, the HashSet is used to traverse the original address data, and the status bit corresponding to the original address data is set.

Specifically, in some of these embodiments, the status bits are divided into two types: a repeated status bit and a non-repeated status bit. In this embodiment, the repeated status bit is marked as 1, and the non-repeated status bit is marked as 0, but the invention is not limited thereto.

Specifically, in some embodiments, the original address string is traversed, the HashSet and the state bit array are used to set the state bit under the corresponding position index, the initial state bit is set to 0 by default, and the state bit is 1 if there is a repetition after the traversal.

Then, step S4 is executed to check the status bit and correct the status bit.

Specifically, referring to fig. 3, step S4 specifically includes the following steps:

s41, correcting the state bit;

and S42, carrying out repeated string suffix check on the status bit.

Specifically, in some embodiments, after the address string has been traversed with the HashSet and the status bit array has been populated, not every character marked as repeated actually needs to be removed. The status bits are therefore checked and corrected using a status bit correction algorithm and a repeated-string suffix check, which further ensures that the status bits are accurate.

Specifically, if the address is deduplicated based only on the flag of the initial status bit for the first address of the four address repetition schemes mentioned above, the result is as follows:

In the result, the first row is the original address string, and the second row is the status bits set for it from the HashSet-based repetition check, where 0 means not repeated and 1 means repeated. The last row is the deduplicated address string. Clearly, however, "Zhejiang" must not be deleted from the string, because it is a valid part of a complete noun phrase. Deduplication based only on the initial status bits is therefore inaccurate, and the status bits need to be checked and corrected.

Specifically, referring to fig. 4, step S41 specifically includes the following steps:

S411, judging whether the repeated status bits occupy two or more consecutive positions;

S412, if so, leaving the repeated status bits unchanged; otherwise, correcting the repeated status bit to the non-repeated status bit.

Specifically, in some embodiments, a repeated status bit is considered valid only when it belongs to a run of two or more consecutive repeated status bits; otherwise the status bit is changed from 1 to 0, that is, it is self-corrected. For example, if a character has status bit 1 in the status array while the status bits of its left and right neighbours are both 0, the 1 does not form a run of two or more, so status bit correction automatically sets it to 0. The correction process is as follows:
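The illustration that accompanied this step is not reproduced in this text. As a stand-in, the Java sketch below shows one way the rule could be coded; it is not the applicant's implementation, and the class name `StateBitCorrection` and method name `correctIsolated` are hypothetical.

```java
// Hypothetical helper: resets repeated status bits that do not form a run of two or more.
class StateBitCorrection {
    // Returns a corrected copy; the input array is left untouched.
    static int[] correctIsolated(int[] bits) {
        int[] out = bits.clone();
        int i = 0;
        while (i < bits.length) {
            if (bits[i] != 1) {
                i++;
                continue;
            }
            // Find the end (exclusive) of the run of consecutive 1s starting at i.
            int j = i;
            while (j < bits.length && bits[j] == 1) {
                j++;
            }
            if (j - i < 2) {
                out[i] = 0; // a lone 1 is treated as a coincidental repeat
            }
            i = j; // continue after the run
        }
        return out;
    }
}
```

For instance, the status bits 0 1 0 1 1 0 1 become 0 0 0 1 1 0 0: the two isolated 1s are cleared and the run of two is kept.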

Specifically, referring to FIG. 5, step S42 specifically includes the following steps:

S421, setting a suffix set;

S422, judging whether a repeated address marked with the repeated status bit in the original address data matches the suffix set;

S423, if so, correcting the status bit of the repeated address to the non-repeated status bit; otherwise, keeping the status bit of the repeated address unchanged.

Specifically, in some embodiments, the repeated-substring suffix check is based on address aliases. Take the first of the four address repetition modes above: the field "Zhejiang" is an alias of "Zhejiang province" and has to be checked. When the character index positions of this field are all marked 1 and the field is followed by "art gallery", the repetition is part of a place name. A suffix set is therefore defined, containing fields such as "art gallery", "natatorium", and "teenager palace". Once a match with the suffix set is found, the repeated status is corrected to 0, because the field, although it repeats part of the Top3-level address, carries real meaning and cannot be deleted.
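A minimal Java sketch of this check is given below, under the assumption that "matching the suffix set" means the text immediately after a run of repeated status bits starts with one of the suffix fields. The class name `SuffixCheck` and method name `keepNamedPlaces` are hypothetical, and the sketch is an interpretation rather than the applicant's implementation.

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical helper: clears repeated status bits for runs that are followed by a known suffix.
class SuffixCheck {
    // Returns a corrected copy of bits; address and bits must have the same length.
    static int[] keepNamedPlaces(String address, int[] bits, Set<String> suffixes) {
        int[] out = bits.clone();
        int i = 0;
        while (i < bits.length) {
            if (bits[i] != 1) {
                i++;
                continue;
            }
            // [i, j) is a run of characters marked as repeated.
            int j = i;
            while (j < bits.length && bits[j] == 1) {
                j++;
            }
            for (String suffix : suffixes) {
                // startsWith(prefix, offset) tests the text beginning right after the run.
                if (address.startsWith(suffix, j)) {
                    Arrays.fill(out, i, j, 0); // the run is part of a place name: keep it
                    break;
                }
            }
            i = j;
        }
        return out;
    }
}
```

In the patent's example the suffix set would hold fields such as "art gallery", so a repeated "Zhejiang" directly in front of that field would be reset to 0 and survive deduplication.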

Then, step S5 is executed to remove the repeated first three-level addresses and/or suffix addresses according to the corrected status bits.

Specifically, in some embodiments, after correction and checking, the characters at the index positions whose status bit is 0 are output, yielding the deduplicated address data. For example, in the second of the four address repetition modes above, "the hough region in Hangzhou city, Zhejiang" is repeated, invalid information that must be removed. The original address string is traversed and its characters are added to the HashSet; whenever a repetition is found, the corresponding index position of the status bit array (initial value 0) is set to 1, and the deduplicated address string is then obtained from the status bits. The process is as follows:
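The illustration of this process is not reproduced in this text. The output step itself, keeping only the characters whose status bit is 0, can be sketched in Java as below; the class name `StateBitOutput` and method name `keepNonRepeated` are hypothetical.

```java
// Hypothetical helper: builds the deduplicated string from the corrected status bits.
class StateBitOutput {
    // address and bits must have the same length; characters with bit 0 are kept.
    static String keepNonRepeated(String address, int[] bits) {
        StringBuilder sb = new StringBuilder(address.length());
        for (int i = 0; i < address.length(); i++) {
            if (bits[i] == 0) {
                sb.append(address.charAt(i));
            }
        }
        return sb.toString();
    }
}
```

With the address "ABCABCXY" and status bits 0 0 0 1 1 1 0 0, the second "ABC" is dropped and the result is "ABCXY".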

The status bit-based Chinese address deduplication method of this embodiment removes repetition within a Chinese address on the basis of status bits. Using a built-in Java data structure such as HashSet, a status array with the same length as the address string is defined, and the repetition status is set at specific index positions, so that the repeated addresses within the address string are obtained. The method ensures that the repeated characters of the address string are found and that the address string is normalized or stripped of redundancy. This provides relatively complete and accurate address information to downstream address services, can be applied to services such as public security, banking, and logistics distribution, improves service efficiency, and reduces cost. Effective address information can be represented at minimum cost, which reduces the cost of business practice and communication. Deduplicating address strings reduces information storage cost, improves the accuracy and uniqueness of customer service in specific business scenarios (the express industry, banks, insurance companies, and the like), facilitates clustering and mining of multiple records under the same prefix address, and provides a foundation for downstream data services. Repetition of the province, city, and district/county level address (the Top3-level address), repetition of the Top3-level address within the rest of the address, and partial repetition within the rest of the address can all be removed effectively by means of the status bits, so that addresses are obtained more accurately.

Example two:

In combination with the status bit-based Chinese address deduplication method disclosed in the first embodiment, this embodiment discloses a specific implementation example of a status bit-based Chinese address deduplication system (hereinafter referred to as the "system").

Referring to fig. 6, the system includes:

the data acquisition module 1 acquires original address data;

the address acquisition module 2 is used for acquiring the first three-level address and the suffix address character string in the original address data by combining national administrative division data;

the traversal module 3 is used for traversing the original address data by using HashSet and setting a state bit corresponding to the original address data;

the checking module 4 is used for checking the state bit and correcting the state bit;

and the duplicate removal module 5 is used for removing the repeated first three-level addresses and/or suffix addresses according to the corrected status bits.

Specifically, in some embodiments, the first three levels of addresses in the address obtaining module 2 include subordination between province, city, district and county.

Specifically, in some of these embodiments, the status bits are divided into repetitive status bits and non-repetitive status bits.

Specifically, in some embodiments, the checking module 4 specifically includes:

a status bit correcting unit 41 that corrects the status bit;

and a suffix check unit 42 for performing repeat string suffix check on the status bit.

Specifically, in some embodiments, the status bit correcting unit 41 specifically includes:

a first judgment unit 411 that judges whether the repeated status bits occupy two or more consecutive positions;

a first correcting unit 412 that, if so, leaves the repeated status bits unchanged and, otherwise, corrects the repeated status bit to the non-repeated status bit.

Specifically, in some embodiments, the suffix checking unit 42 specifically includes:

a setting unit 421, which sets a suffix set;

a second judging unit 422, which judges whether a duplicate address, that is, a part of the original address data marked with duplicate status bits, matches the suffix set;

and a second correcting unit 423, which corrects the status bits of the duplicate address to non-duplicate status bits if it matches, and otherwise leaves the status bits of the duplicate address unchanged.
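A minimal Java sketch of the suffix check follows. It assumes that "matching the suffix set" means the marked substring is exactly one of the entries in the set, and the set contents shown here (街道, 大道, 小区, 大厦) are hypothetical, since the source does not list them; the class name `SuffixCheck` is likewise an assumption.

```java
import java.util.Arrays;
import java.util.Set;

// Illustrative sketch only; SUFFIXES is a made-up example of a suffix set.
final class SuffixCheck {

    static final Set<String> SUFFIXES = Set.of("街道", "大道", "小区", "大厦");

    /** Returns a copy of status in which runs of duplicate bits that spell a known suffix are cleared. */
    static boolean[] correct(String address, boolean[] status) {
        boolean[] out = status.clone();
        int i = 0;
        while (i < out.length) {
            if (!out[i]) {
                i++;
                continue;
            }
            int start = i;
            while (i < out.length && out[i]) {
                i++; // after the loop, i is one past the end of the marked run
            }
            if (SUFFIXES.contains(address.substring(start, i))) {
                Arrays.fill(out, start, i, false); // a legitimate repeated suffix word: keep it
            }
        }
        return out;
    }
}
```

For example, in 中山街道建设街道5号 the second 街道 would be flagged as a duplicate by a per-character traversal, but because it matches the suffix set its bits are reset and the text is preserved.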

For the parts of this embodiment that are the same as in the first embodiment, refer to the description of the first embodiment; the details are not repeated here.

Example three:

Referring to FIG. 7, this embodiment discloses a specific implementation of a computer device. The computer device may comprise a processor 81 and a memory 82 in which computer program instructions are stored.

Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.

Memory 82 may include mass storage for data or instructions. By way of example, and not limitation, memory 82 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 82 may include removable or non-removable (fixed) media, where appropriate. Memory 82 may be internal or external to the data processing apparatus, where appropriate. In particular embodiments, memory 82 is non-volatile memory. In particular embodiments, memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these. Where appropriate, the RAM may be Static Random-Access Memory (SRAM) or Dynamic Random-Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPMDRAM), Extended Data Out DRAM (EDODRAM), Synchronous DRAM (SDRAM), and the like.

The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.

The processor 81 implements any of the status-bit-based Chinese address deduplication methods of the above embodiments by reading and executing the computer program instructions stored in the memory 82.

In some of these embodiments, the computer device may also include a communication interface 83 and a bus 80. As shown in fig. 7, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.

The communication interface 83 is used to implement communication between the modules, devices, units, and/or equipment in the embodiments of the present application. The communication interface 83 may also carry out data communication with other components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.

Bus 80 includes hardware, software, or both, coupling the components of the computer device to each other. Bus 80 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example, and not limitation, bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of these. Bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable bus or interconnect is contemplated by the application.

The computer device may implement Chinese address deduplication based on status bits, thereby implementing the method described in conjunction with FIG. 1.

In addition, in combination with the status-bit-based Chinese address deduplication method of the above embodiments, the embodiments of the present application may provide a computer-readable storage medium for implementation. The computer-readable storage medium stores computer program instructions; when executed by a processor, the computer program instructions implement any of the status-bit-based Chinese address deduplication methods of the above embodiments.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

In summary, the status-bit-based Chinese address deduplication method has the following advantages. Status bits are used to remove duplication within a Chinese address: using a built-in Java data structure such as HashSet, a status array of the same length as the address string is defined, and a duplicate status is set at specific index positions, so that the duplicated parts of the address string are identified. This ensures that repeated characters within the address string itself are found and that the address string is normalized, that is, stripped of redundancy. The method provides relatively complete and accurate address information to the downstream functions of address services, and can be applied in public security, banking, logistics and distribution, and similar services to improve efficiency and reduce cost. Effective address information is represented at minimal cost, which reduces the cost of business operations and communication. Deduplicating address strings reduces information storage cost, improves the accuracy and uniqueness of customer service in specific business scenarios (the express delivery industry, banks, insurance companies, and the like), facilitates clustering and mining of multiple records that share the same prefix address, and provides a foundation for the development of downstream data services. Using the status bits, repetition within the province, city, and district or county levels (the Top3-level address), repetition of the Top3-level address within the remainder of the address, and partial repetition within the remainder itself can all be removed effectively, so that addresses are obtained more accurately.

The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.
