Method and device for comparing company names, computer equipment and storage medium

Document No.: 830121 | Published: 2021-03-30

Reading note: This technology, "Method and device for comparing company names, computer equipment and storage medium" (公司名称比对的方法、装置、计算机设备和存储介质), was designed by Lin Jianming (林建明) and Wang Wenjie (王文杰) on 2019-09-30. Main content: The application relates to a method and an apparatus for comparing company names, a computer device and a storage medium. The method comprises: acquiring a first name of a first company and a second name of a second company; preprocessing the first name and the second name; segmenting the preprocessed first name and the preprocessed second name respectively to obtain the areas of the first name and the second name; comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area; performing a weighted summation of the similarities of the areas to obtain the final similarity of the first name and the second name; and when the final similarity is greater than a preset threshold, determining that the first company and the second company are the same company. This approach takes similarity at both the text level and the pinyin level into account, tolerates wrongly written characters and abbreviated names to a certain degree, offers high stability and accuracy, improves approval efficiency, and reduces the labor and time costs of approval.

1. A method for comparing company names, the method comprising:

acquiring a first name of a first company and a second name of a second company;

preprocessing the first name and the second name;

respectively segmenting the preprocessed first name and the preprocessed second name to obtain each area of the first name and the second name;

comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area;

performing a weighted summation of the similarities of the areas to obtain the final similarity of the first name and the second name;

and when the final similarity is greater than a preset threshold, determining that the first company and the second company are the same company.

2. The method according to claim 1, wherein the segmenting the preprocessed first name and the preprocessed second name to obtain respective areas of the first name and the second name comprises:

dividing the preprocessed first name and the preprocessed second name each into a preset number of areas, wherein the areas comprise an organization area, an administrative division area, an industry information area and a company trade-name area.

3. The method of claim 2, wherein after the preprocessed first name and the preprocessed second name are segmented to obtain the respective areas of the first name and the second name, the method further comprises:

removing the organization areas of the first name and the second name, and comparing the areas other than the organization areas;

and determining the administrative similarity corresponding to the administrative division areas, the industry information similarity corresponding to the industry information areas, and the company trade-name similarity corresponding to the company trade-name areas.

4. The method according to claim 3, wherein the comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area comprises:

comparing the administrative division areas of the first name and the second name, wherein when the administrative division areas of the first name and the second name are consistent, the similarity of the administrative division areas of the first name and the second name is a first administrative similarity;

when the administrative division areas of the first name and the second name are not consistent, the similarity of the administrative division areas of the first name and the second name is a second administrative similarity;

and when at least one of the administrative division areas of the first name and the second name is empty, the similarity of the administrative division areas of the first name and the second name is a third administrative similarity.

5. The method according to claim 3, wherein the comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area comprises:

comparing the company trade-name areas of the first name and the second name to obtain a company trade-name edit distance between the company trade-name area of the first name and the company trade-name area of the second name;

comparing the industry information areas of the first name and the second name to obtain an industry information edit distance between the industry information area of the first name and the industry information area of the second name;

taking the difference between a preset natural number and the ratio of the company trade-name edit distance to the number of characters in the company trade-name area as the company trade-name similarity of the first name and the second name;

and taking the difference between the preset natural number and the ratio of the industry information edit distance to the number of characters in the industry information area as the industry information similarity of the first name and the second name.

6. The method of claim 1, wherein pre-processing the first name and the second name comprises:

cleaning the first name and the second name by deleting the special characters in the first name and the second name, wherein the special characters are characters other than Chinese characters, English letters and digits;

and unifying the formats of the cleaned first name and the cleaned second name.

7. The method of claim 1, further comprising:

when at least one of the first name and the second name is empty, setting the final similarity of the first name and the second name to a preset fixed value.

8. An apparatus for comparing company names, the apparatus comprising:

a company name acquisition module, configured to acquire a first name of a first company and a second name of a second company;

a preprocessing module, configured to preprocess the first name and the second name;

an area division module, configured to segment the preprocessed first name and the preprocessed second name respectively to obtain the areas of the first name and the second name;

an area comparison module, configured to compare each area of the first name with the corresponding area of the second name to obtain the similarity of each area;

a similarity confirming module, configured to perform a weighted summation of the similarities of the areas to obtain the final similarity of the first name and the second name, and, when the final similarity is greater than a preset threshold, to determine that the first company and the second company are the same company.

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for comparing company names, a computer device, and a storage medium.

Background

A company name is the name under which a company is established as an independent legal person. A company name generally has to be filled in whenever company business is handled, and the handling organization verifies whether the filled-in company name matches the actual company name. However, the same company name may be filled in in different ways; for example, a company named "ABCD service limited" may be entered by the client simply as "ABCD". In the conventional technology, such entries are generally reviewed manually one by one, which is time-consuming, labor-intensive and inefficient.

Disclosure of Invention

In view of the above, it is necessary to provide a method, an apparatus, a computer device and a storage medium for comparing company names, which can improve the efficiency of comparing company names.

A method for comparing company names, the method comprising:

acquiring a first name of a first company and a second name of a second company;

preprocessing the first name and the second name;

respectively segmenting the preprocessed first name and the preprocessed second name to obtain each area of the first name and the second name;

comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area;

performing a weighted summation of the similarities of the areas to obtain the final similarity of the first name and the second name;

and when the final similarity is greater than a preset threshold, determining that the first name and the second name are names of the same company.

An apparatus for company name comparison, the apparatus comprising:

a company name acquisition module, configured to acquire a first name of a first company and a second name of a second company;

a preprocessing module, configured to preprocess the first name and the second name;

an area division module, configured to segment the preprocessed first name and the preprocessed second name respectively to obtain the areas of the first name and the second name;

an area comparison module, configured to compare each area of the first name with the corresponding area of the second name to obtain the similarity of each area;

a similarity confirming module, configured to perform a weighted summation of the similarities of the areas to obtain the final similarity of the first name and the second name, and, when the final similarity is greater than a preset threshold, to determine that the first name and the second name are names of the same company.

A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the following steps when executing the computer program:

acquiring a first name of a first company and a second name of a second company;

preprocessing the first name and the second name;

respectively segmenting the preprocessed first name and the preprocessed second name to obtain each area of the first name and the second name;

comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area;

performing a weighted summation of the similarities of the areas to obtain the final similarity of the first name and the second name;

and when the final similarity is greater than a preset threshold, determining that the first name and the second name are names of the same company.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

acquiring a first name of a first company and a second name of a second company;

preprocessing the first name and the second name;

respectively segmenting the preprocessed first name and the preprocessed second name to obtain each area of the first name and the second name;

comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area;

performing a weighted summation of the similarities of the areas to obtain the final similarity of the first name and the second name;

and when the final similarity is greater than a preset threshold, determining that the first name and the second name are names of the same company.

In the above method, apparatus, computer device and storage medium for comparing company names, a first name of a first company and a second name of a second company are first preprocessed and then subjected to word segmentation, so that the first name and the second name are each segmented into a plurality of areas. Each area of the first name is compared with the corresponding area of the second name to obtain the similarity of each area, and the similarities of the areas are weighted and summed to obtain the final similarity of the first name and the second name, from which it is determined whether the first name and the second name are names of the same company. When the final similarity is greater than a preset threshold, it may be determined that the first company and the second company are the same company; when the final similarity is less than or equal to the preset threshold, the first name and the second name are not names of the same company. This way of comparing company names takes similarity at both the text level and the pinyin level into account, tolerates wrongly written characters and abbreviated names to a certain degree, offers high stability and accuracy, greatly improves approval efficiency, and reduces the labor and time costs of approval.

Drawings

FIG. 1 is a diagram of the application environment of a method for comparing company names in one embodiment;

FIG. 2 is a schematic flowchart of a method for comparing company names in one embodiment;

FIG. 3 is a flowchart of a method for comparing company names in one embodiment;

FIG. 4 is a schematic diagram of company name comparison in another embodiment;

FIG. 5 is a block diagram of an apparatus for comparing company names in one embodiment;

FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The method for comparing company names provided by the present application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. Over the network, the server 104 acquires a company name entered by the user through the terminal 102 as a first name, and acquires from a database a second name to be compared with the first name. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.

In one embodiment, as shown in FIG. 2, a method for comparing company names is provided. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:

Step 201, a first name of a first company and a second name of a second company are obtained.

A company name entered by the user through the terminal is obtained and used as the first name, and a company name pre-stored in the database is used as the second name, so that the first name and the second name can be compared.

Step 202, preprocessing the first name and the second name.

After the first name and the second name are obtained, the first name and the second name need to be preprocessed, and then the next specific comparison operation is performed.

In one embodiment, preprocessing the first name and the second name includes: cleaning the first name and the second name by deleting the special characters in them, wherein the special characters are characters other than Chinese characters, English letters and digits; and unifying the formats of the cleaned first name and the cleaned second name.

When the first name and the second name are preprocessed, a cleaning operation may be performed first, that is, the special characters in the first name and the second name are deleted. Special characters are characters other than Chinese characters, English letters and digits, such as punctuation marks, operators, underscores or dashes. After cleaning, the first name and the second name therefore contain only Chinese characters, English letters or digits.

Then, the formats of the cleaned first name and second name may be unified. For example, the first name and the second name may be converted into simplified Chinese, Chinese numerals may be converted into Arabic numerals, English words may be converted into lower case, full-width characters in English words may be converted into half-width characters, and so on.
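The cleaning and format-unification steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `preprocess_name` is hypothetical, Unicode NFKC normalization is used as one possible way to turn full-width characters into half-width ones, and width normalization is applied before cleaning so that full-width letters are not discarded as special characters. Conversion to simplified Chinese and conversion of Chinese numerals require lookup tables or third-party tools and are left out.

```python
import re
import unicodedata

# Anything that is not a CJK ideograph (basic block), an ASCII letter or a digit
# is treated as a "special character". The exact character ranges are an assumption.
_SPECIAL_CHARS = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]")


def preprocess_name(name: str) -> str:
    """Clean a company name and unify its format (illustrative sketch)."""
    # NFKC maps full-width letters, digits and punctuation to half-width forms.
    normalized = unicodedata.normalize("NFKC", name)
    # Unify English letters to lower case.
    lowered = normalized.lower()
    # Delete punctuation, operators, underscores, dashes, spaces and the like.
    return _SPECIAL_CHARS.sub("", lowered)


print(preprocess_name("ＡＢＣＤ-Service（深圳）有限公司!"))
```

Applying the same function to both names before segmentation keeps the later area-by-area comparison from being affected by differences in case, width or punctuation.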

Step 203, the preprocessed first name and the preprocessed second name are segmented respectively to obtain the areas of the first name and the second name.

After the first name and the second name are preprocessed, each of them may be divided into a plurality of areas. For example, the first name may be divided into area 1, area 2, area 3 and area 4, and the second name may likewise be divided into area 1, area 2, area 3 and area 4, so that the areas corresponding to the first name and the second name are obtained.

In one embodiment, segmenting the preprocessed first name and the preprocessed second name respectively to obtain the areas of the first name and the second name includes: dividing the preprocessed first name and the preprocessed second name each into a preset number of areas, wherein the areas comprise an organization area, an administrative division area, an industry information area and a company trade-name area.

When the preprocessed first name and the preprocessed second name are divided into areas, the areas may include an organization area, an administrative division area, an industry information area and a company trade-name area. The organization area is the part of the name that states the organizational form of the company; for example, in "XX limited company", the part corresponding to "limited company" is the organization area. The administrative division area is the part that corresponds to the administrative region in which the company is located; for example, in "Shenzhen City XX limited company" and "Shenzhen XX company", "Shenzhen City" and "Shenzhen" are the administrative division areas. The industry information area is the part that corresponds to the industry in which the company is engaged. For example, in "Shenzhen City XX internet financial service limited company" and "Guangzhou City XX information technology limited company", "Shenzhen City" and "Guangzhou City" are the administrative division areas, "internet financial service" and "information technology" are the industry information areas, "limited company" is the organization area, and "XX" is the company trade-name area, that is, the main part of the company name. In other words, a complete company name is composed of 4 parts: an organization area, an administrative division area, an industry information area and a company trade-name area.
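The text does not say how the four areas are located inside a name, so the following is only one possible sketch: a lexicon-based splitter that peels the administrative division off the front, the organization form off the end, and an industry term off the end of what remains, leaving the trade name (the "XX" part). The lexicons, the `NameAreas` class and the `segment_name` function are all hypothetical and far too small for real use.

```python
from dataclasses import dataclass

# Tiny illustrative lexicons (assumptions); a real system would need far larger ones.
ADMIN_DIVISIONS = ("深圳市", "广州市", "深圳", "广州")
ORGANIZATION_FORMS = ("有限责任公司", "股份有限公司", "有限公司", "公司")
INDUSTRY_TERMS = ("互联网金融服务", "信息技术")


@dataclass
class NameAreas:
    administrative: str = ""
    trade_name: str = ""
    industry: str = ""
    organization: str = ""


def segment_name(name: str) -> NameAreas:
    """Split a preprocessed company name into four areas (illustrative sketch)."""
    areas = NameAreas()
    rest = name
    # Administrative division: longest lexicon entry that the name starts with.
    for division in sorted(ADMIN_DIVISIONS, key=len, reverse=True):
        if rest.startswith(division):
            areas.administrative = division
            rest = rest[len(division):]
            break
    # Organization form: longest lexicon entry that the name ends with.
    for form in sorted(ORGANIZATION_FORMS, key=len, reverse=True):
        if rest.endswith(form):
            areas.organization = form
            rest = rest[: len(rest) - len(form)]
            break
    # Industry information: longest lexicon entry that ends what is left.
    for term in sorted(INDUSTRY_TERMS, key=len, reverse=True):
        if rest.endswith(term):
            areas.industry = term
            rest = rest[: len(rest) - len(term)]
            break
    # Whatever remains is taken as the trade name.
    areas.trade_name = rest
    return areas


print(segment_name("深圳市xx互联网金融服务有限公司"))
```

Matching the longest entry first keeps "股份有限公司" from being cut short as "有限公司". An area that is not found is left as an empty string.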

Step 204, each area of the first name is compared with the corresponding area of the second name to obtain the similarity of each area.

When the company names of two companies are compared, the areas into which the company names have been divided may be compared one by one. That is, the organization area of the first name may be compared with the organization area of the second name, the administrative division area of the first name with the administrative division area of the second name, the industry information area of the first name with the industry information area of the second name, and the company trade-name area of the first name with the company trade-name area of the second name, so that the similarity corresponding to each area is obtained.

In one embodiment, after the preprocessed first name and the preprocessed second name are segmented respectively to obtain the areas of the first name and the second name, the method further includes: removing the organization areas of the first name and the second name, and comparing the areas other than the organization areas; and determining the administrative similarity corresponding to the administrative division areas, the industry information similarity corresponding to the industry information areas and the company trade-name similarity corresponding to the company trade-name areas.

When the areas of the first name and the second name are compared, the organization areas of the two names may be removed first; that is, the organization areas are not compared and no similarity needs to be determined for them. The remaining areas are then compared to determine the administrative similarity corresponding to the administrative division areas, the industry information similarity corresponding to the industry information areas, and the company trade-name similarity corresponding to the company trade-name areas.

In one embodiment, comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area includes: comparing the administrative division areas of the first name and the second name, wherein when the administrative division areas of the first name and the second name are consistent, the similarity of the administrative division areas of the first name and the second name is a first administrative similarity; when the administrative division areas of the first name and the second name are inconsistent, the similarity of the administrative division areas of the first name and the second name is a second administrative similarity; and when at least one of the administrative division areas of the first name and the second name is empty, the similarity of the administrative division areas of the first name and the second name is a third administrative similarity.

The first name and the second name are thus compared area by area. When the administrative division areas of the first name and the second name are compared, three situations may arise:

In the first case, the administrative division areas of the first name and the second name are consistent, and the similarity between them can be determined as the first administrative similarity, which may be 1;

In the second case, the administrative division areas of the first name and the second name are not consistent, and the similarity between them can be determined as the second administrative similarity, which may be 0;

In the third case, at least one of the administrative division areas of the first name and the second name is empty, so it cannot be determined whether they are consistent, and the similarity between them can be determined as the third administrative similarity, which may be -1.
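The three cases can be written as a small function. This is a sketch using the example values 1, 0 and -1 given above; the function and constant names are hypothetical, and treating "consistent" as exact string equality is an assumption (under it, "深圳市" and "深圳" would count as inconsistent).

```python
FIRST_ADMIN_SIMILARITY = 1    # areas are consistent
SECOND_ADMIN_SIMILARITY = 0   # areas are inconsistent
THIRD_ADMIN_SIMILARITY = -1   # at least one area is empty


def administrative_similarity(first_area: str, second_area: str) -> int:
    """Return the administrative similarity of two administrative division areas."""
    # The empty case is checked first, because two empty strings would
    # otherwise compare as equal.
    if not first_area or not second_area:
        return THIRD_ADMIN_SIMILARITY
    if first_area == second_area:
        return FIRST_ADMIN_SIMILARITY
    return SECOND_ADMIN_SIMILARITY


print(administrative_similarity("深圳市", "广州市"))
```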

In one embodiment, comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area includes: comparing the company trade-name areas of the first name and the second name to obtain a company trade-name edit distance between the company trade-name area of the first name and the company trade-name area of the second name; comparing the industry information areas of the first name and the second name to obtain an industry information edit distance between the industry information area of the first name and the industry information area of the second name; taking the difference between a preset natural number and the ratio of the company trade-name edit distance to the number of characters in the company trade-name area as the company trade-name similarity of the first name and the second name; and taking the difference between the preset natural number and the ratio of the industry information edit distance to the number of characters in the industry information area as the industry information similarity of the first name and the second name.

Unlike the administrative division areas, the company trade-name areas and the industry information areas of the first name and the second name are compared by means of text similarity, which yields the company trade-name similarity corresponding to the company trade-name areas and the industry information similarity corresponding to the industry information areas.

When the company trade-name areas of the first name and the second name are compared, the company trade-name edit distance between the two areas and the number of characters in the company trade-name area can be obtained, so that the company trade-name similarity can be determined as: company trade-name similarity = preset natural number - company trade-name edit distance / number of characters in the company trade-name area. For example, assuming that the company trade-name areas of the first name and the second name differ by one character, so that the company trade-name edit distance is 1, that the number of characters in the company trade-name area is 3, and that the preset natural number is 1, the company trade-name similarity of the first name and the second name is 1 - 1/3 = 2/3.

The industry information areas of the first name and the second name may be compared in the same way as the company trade-name areas. The industry information edit distance between the industry information areas of the first name and the second name and the number of characters in the industry information area can be obtained, and the industry information similarity of the first name and the second name is determined as the difference between the preset natural number and the ratio of the industry information edit distance to the number of characters in the industry information area. Assuming that the industry information edit distance is X1, the number of characters in the industry information area is X, and the preset natural number is 1, the industry information similarity of the first name and the second name is 1 - X1/X.
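The edit-distance-based similarity used for the trade-name ("XX") areas and the industry information areas can be sketched as below. The edit distance is computed here as the standard Levenshtein distance, which is an assumption about what "edit distance" means in the text. The text also does not say which of the two areas supplies the "number of characters"; this sketch uses the length of the longer area. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def area_similarity(first_area: str, second_area: str, preset: int = 1) -> float:
    """preset - edit distance / number of characters (illustrative sketch)."""
    # Assumption: "number of characters" is the length of the longer area,
    # which keeps the result within [0, 1] when preset is 1.
    length = max(len(first_area), len(second_area))
    if length == 0:
        return float(preset)  # assumption: two empty areas count as identical
    return preset - edit_distance(first_area, second_area) / length


print(area_similarity("abc", "abd"))
```

With two three-character areas that differ in one character, the call reproduces the 1 - 1/3 = 2/3 example above.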

In one embodiment, the method further comprises: when at least one of the first name and the second name is empty, setting the final similarity of the first name and the second name to a preset fixed value.

When the first name and the second name are compared, if at least one of them is empty, the two names cannot be compared further, and the final similarity of the first name and the second name can be directly determined to be a preset fixed value. The preset fixed value may be set to -1, which indicates that the comparison of the first name and the second name was not successful. Accordingly, the comparison continues only when neither the first name nor the second name is empty, and in that case the final similarity of the first name and the second name lies in the range [0, 1].

Step 205, the similarities of the areas are weighted and summed to obtain the final similarity of the first name and the second name.

Step 206, when the final similarity is greater than a preset threshold, it is determined that the first company and the second company are the same company.

After the similarity of each area of the first name and the second name has been determined, that is, the administrative similarity corresponding to the administrative division areas, the industry information similarity corresponding to the industry information areas, and the company trade-name similarity corresponding to the company trade-name areas, the similarities of the areas may be weighted and summed to calculate the final similarity of the first name and the second name. Since no similarity is calculated for the organization area, final similarity = administrative similarity × Q1 + industry information similarity × Q2 + company trade-name similarity × Q3, where Q1, Q2 and Q3 are the weights of the respective areas. When the final similarity is greater than a preset threshold, it may be determined that the first company and the second company are the same company; when the final similarity is less than or equal to the preset threshold, it may be determined that the first name and the second name are not names of the same company.
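The weighted summation, the empty-name rule and the threshold decision can be combined in a short sketch. The weight values, the threshold of 0.8 and all names below are hypothetical; the text gives no concrete numbers. The text also does not say how an administrative similarity of -1 enters the sum; this sketch skips that term and renormalizes the remaining weights, which is purely an assumption made to keep the result within [0, 1].

```python
EMPTY_NAME_SIMILARITY = -1.0  # preset fixed value for an empty name


def final_similarity(first_name, second_name, admin_sim, industry_sim, trade_sim,
                     weights=(0.2, 0.3, 0.5)):
    """Weighted sum of the area similarities (illustrative sketch)."""
    # If either name is empty, no comparison is possible.
    if not first_name or not second_name:
        return EMPTY_NAME_SIMILARITY
    q1, q2, q3 = weights  # Q1: administrative, Q2: industry, Q3: trade name
    terms = [(industry_sim, q2), (trade_sim, q3)]
    # Assumption: a negative administrative similarity means "not comparable",
    # so the term is left out and the other weights are renormalized.
    if admin_sim >= 0:
        terms.append((admin_sim, q1))
    total_weight = sum(weight for _, weight in terms)
    return sum(sim * weight for sim, weight in terms) / total_weight


def is_same_company(score, threshold=0.8):
    """Names are treated as the same company only above the threshold."""
    return score > threshold


score = final_similarity("深圳市xx信息技术有限公司", "深圳xx信息技术公司", 0, 1.0, 1.0)
print(score, is_same_company(score))
```

In practice the weights and the threshold would have to be tuned on labeled pairs of names; a higher trade-name weight reflects that the trade name is described as the main part of the company name.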

In the above method for comparing company names, a first name of a first company and a second name of a second company are first preprocessed and then subjected to word segmentation, so that the first name and the second name are each segmented into a plurality of areas. Each area of the first name is compared with the corresponding area of the second name to obtain the similarity of each area, and the similarities of the areas are weighted and summed to obtain the final similarity of the first name and the second name, from which it can be determined whether the first name and the second name are names of the same company. When the final similarity is greater than a preset threshold, it may be determined that the first company and the second company are the same company; when the final similarity is less than or equal to the preset threshold, the first name and the second name are not names of the same company. This way of comparing company names takes similarity at both the text level and the pinyin level into account, tolerates wrongly written characters and abbreviated names to a certain degree, offers high stability and accuracy, greatly improves approval efficiency, and reduces the labor and time costs of approval.

In one embodiment, as shown in the flowchart of the method for comparing company names in FIG. 3, the two company names to be compared are first preprocessed, that is, the first name of the first company and the second name of the second company are cleaned, which includes conversion between simplified and traditional Chinese characters, elimination of special characters, and the like. Word segmentation is then performed on the first name and the second name, and each name is segmented to obtain its areas. The organizational structure areas of the first name and the second name may then be removed, so that no similarity is calculated for them. The areas other than the organizational structure areas are then compared, that is, the administrative similarity corresponding to the administrative division areas, the industry information similarity corresponding to the industry information areas, and the company font size similarity corresponding to the company font size areas are determined. First, the administrative division information of the first name is compared with that of the second name to determine the administrative similarity. Next, the industry information similarity and the company font size similarity of the first name and the second name are determined by a text similarity algorithm. The similarities of the areas are then weighted and summed to obtain the final similarity of the first name and the second name, and finally whether the first name and the second name belong to the same company is determined from the final similarity. When the final similarity is greater than a preset threshold, it may be determined that the first company and the second company belong to the same company; when the final similarity is less than or equal to the preset threshold, it may be determined that the first name and the second name do not belong to the same company.

In one embodiment, as shown in FIG. 4, the user enters three company names, which may be referred to as first names: "Shenzhen AAA Internet financial services Limited", "BCD Productions agency Limited in Beijing", and "EF animal Hospital". The first name and the second name are each segmented, so that the areas of the first name and the second name, namely the organizational structure area, the administrative division area, the industry information area and the company font size area, can be determined.

Here, when an area of a name is empty, that area is represented by MISSING. The administrative similarity corresponding to the administrative division areas, the industry information similarity corresponding to the industry information areas and the company font size similarity corresponding to the company font size areas can then be calculated. Finally, the similarities of the areas are weighted and summed to obtain the final similarity of the first name and the second name, and whether the first name and the second name belong to the same company is determined from the value of the final similarity. When the final similarity is greater than a preset threshold, it may be determined that the first company and the second company belong to the same company; when the final similarity is less than or equal to the preset threshold, it may be determined that the first name and the second name do not belong to the same company.
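
One way to hold the four areas of a segmented name, with MISSING marking an empty area, is sketched below. The class name `NameAreas`, its field names, and the example splits are illustrative assumptions; the actual split shown in FIG. 4 is not reproduced here.

```python
from dataclasses import dataclass

MISSING = "MISSING"  # placeholder used for an empty area


@dataclass(frozen=True)
class NameAreas:
    """The four areas of a segmented company name; empty areas hold MISSING."""
    administrative: str = MISSING
    font_size: str = MISSING
    industry: str = MISSING
    organization: str = MISSING

    def is_missing(self, area: str) -> bool:
        """Report whether the named area is empty."""
        return getattr(self, area) == MISSING


# Hypothetical splits of two English-rendered example names (assumed, not from FIG. 4).
full = NameAreas(administrative="Shenzhen", font_size="AAA",
                 industry="Internet financial services", organization="Limited")
partial = NameAreas(font_size="EF", industry="animal Hospital")
```

Keeping MISSING as an explicit value, rather than an empty string, lets later steps distinguish "area absent" from "area present but different" when choosing a similarity.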

It should be understood that although the steps in the flowcharts of FIGS. 2-4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 5, an apparatus for comparing company names is provided, including a company name acquisition module 501, a preprocessing module 502, an area segmentation module 503, an area comparison module 504 and a similarity confirmation module 505, wherein:

The company name acquisition module 501 is configured to acquire a first name of a first company and a second name of a second company.

The preprocessing module 502 is configured to preprocess the first name and the second name.

The area segmentation module 503 is configured to segment the preprocessed first name and the preprocessed second name respectively, so as to obtain each area of the first name and the second name.

The area comparison module 504 is configured to compare each area of the first name with the corresponding area of the second name, so as to obtain the similarity of each area.

The similarity confirmation module 505 is configured to perform weighted summation on the similarities of the areas to obtain the final similarity of the first name and the second name, and to determine that the first company and the second company belong to the same company when the final similarity is greater than a preset threshold.

In one embodiment, the preprocessing module 502 is further configured to clean the first name and the second name by deleting the special characters in them, where the special characters are characters other than Chinese characters, English letters and digits, and to unify the formats of the cleaned first name and the cleaned second name.
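
The cleaning step can be sketched with the standard library as below. Several choices here are assumptions rather than details from the text: "Chinese characters" is approximated by the CJK Unified Ideographs block, "unifying the format" is taken to mean folding full-width forms with NFKC and upper-casing letters, and the folding is applied before deletion so that full-width letters and digits survive. Conversion between simplified and traditional characters needs a mapping table or third-party converter and is left out.

```python
import re
import unicodedata

# Anything that is not a CJK unified ideograph, an ASCII letter or a digit
# is treated as a special character (an approximation of the rule in the text).
_SPECIAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]")


def preprocess(name: str) -> str:
    """Clean a company name and bring it to one format.

    Assumed steps: fold full-width forms to half-width (NFKC), delete
    special characters, then upper-case the remaining letters.
    """
    folded = unicodedata.normalize("NFKC", name)
    cleaned = _SPECIAL.sub("", folded)
    return cleaned.upper()
```

For example, `preprocess("深圳（AAA）互联网-金融 服务有限公司")` drops the brackets, hyphen and space and returns the remaining characters unchanged.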

In one embodiment, the area segmentation module 503 is further configured to segment the preprocessed first name and the preprocessed second name each into a preset number of areas, where the areas include an organizational structure area, an administrative division area, an industry information area and a company font size area.
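
The text does not say how the four areas are located. The sketch below assumes a simple lexicon-driven split: an administrative prefix, an organizational suffix, an industry phrase before that suffix, and whatever remains as the company font size. The three word lists are tiny hypothetical stand-ins for real dictionaries, and the function name `segment` is likewise an assumption.

```python
from typing import Dict

MISSING = "MISSING"

# Hypothetical lexicons, longest entries first so that the longest match wins.
ADMIN_WORDS = ("深圳市", "深圳", "北京市", "北京")
ORG_WORDS = ("股份有限公司", "有限责任公司", "有限公司")
INDUSTRY_WORDS = ("互联网金融服务", "科技", "贸易")


def segment(name: str) -> Dict[str, str]:
    """Split a preprocessed name into the four areas, using MISSING for empty ones."""
    areas = {"administrative": MISSING, "font_size": MISSING,
             "industry": MISSING, "organization": MISSING}
    rest = name
    for word in ADMIN_WORDS:        # administrative division as a prefix
        if rest.startswith(word):
            areas["administrative"], rest = word, rest[len(word):]
            break
    for word in ORG_WORDS:          # organizational structure as a suffix
        if rest.endswith(word):
            areas["organization"], rest = word, rest[:-len(word)]
            break
    for word in INDUSTRY_WORDS:     # industry phrase just before that suffix
        if rest.endswith(word):
            areas["industry"], rest = word, rest[:-len(word)]
            break
    if rest:                        # the remainder is taken as the font size
        areas["font_size"] = rest
    return areas
```

A production segmenter would also have to handle names where the administrative division appears in the middle of the name; this sketch ignores that case.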

In one embodiment, the area comparison module 504 is further configured to remove the organizational structure areas of the first name and the second name, to compare the areas other than the organizational structure areas, and to determine the administrative similarity corresponding to the administrative division areas, the industry information similarity corresponding to the industry information areas and the company font size similarity corresponding to the company font size areas.

In one embodiment, the area comparison module 504 is further configured to compare the administrative division areas of the first name and the second name, and when the administrative division areas of the first name and the second name are consistent, the similarity between the administrative division areas of the first name and the second name is the first administrative similarity; when the administrative division areas of the first name and the second name are inconsistent, the similarity of the administrative division areas of the first name and the second name is a second administrative similarity; and when at least one of the administrative division areas of the first name and the second name is empty, the similarity of the administrative division areas of the first name and the second name is the third administrative similarity.
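
The three-way rule for the administrative division areas can be sketched as follows. The numeric values assigned to the first, second and third administrative similarities are hypothetical, as the text names the three cases without fixing their values; the sketch also assumes the empty-area case takes priority, so that two empty areas are not counted as "consistent".

```python
MISSING = "MISSING"

# Hypothetical values; the text only names the three cases.
FIRST_ADMIN_SIM = 1.0    # both areas present and consistent
SECOND_ADMIN_SIM = 0.0   # both areas present but inconsistent
THIRD_ADMIN_SIM = 0.5    # at least one area is empty


def administrative_similarity(area1: str, area2: str) -> float:
    """Return the administrative similarity for two administrative division areas."""
    if area1 == MISSING or area2 == MISSING:
        return THIRD_ADMIN_SIM
    return FIRST_ADMIN_SIM if area1 == area2 else SECOND_ADMIN_SIM
```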

In one embodiment, the area comparison module 504 is further configured to: compare the company font size areas of the first name and the second name to obtain a company font size edit distance between them; compare the industry information areas of the first name and the second name to obtain an industry information edit distance between them; take the difference between a preset natural number and the ratio of the company font size edit distance to the number of characters in the company font size area as the company font size similarity of the first name and the second name; and take the difference between the preset natural number and the ratio of the industry information edit distance to the number of characters in the industry information area as the industry information similarity of the first name and the second name.
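
Reading the similarity as a preset natural number minus the ratio of the edit distance to the character count, the computation can be sketched as below. Three points are assumptions: the preset natural number is taken as 1, the character count is taken as the length of the longer of the two areas (the text does not say which name's area is counted), and two empty strings are treated as identical. The edit distance is the standard Levenshtein distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(area1: str, area2: str, n: int = 1) -> float:
    """n minus (edit distance / character count); n = 1 is an assumed preset value."""
    longest = max(len(area1), len(area2))  # assumed choice of character count
    if longest == 0:
        return float(n)                    # assumed: two empty strings are identical
    return n - edit_distance(area1, area2) / longest
```

With these assumptions the result stays between 0 and 1, and a single wrongly written character in a two-character font size, such as "腾讯" against "腾迅", gives a similarity of 0.5.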

In one embodiment, the similarity confirmation module 505 is further configured to set the final similarity of the first name and the second name to a preset fixed value when at least one of the first name and the second name is empty.
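
The empty-name rule amounts to a guard placed in front of the normal comparison, sketched below. The fixed value of 0.0, the treatment of whitespace-only names as empty, and the names `NULL_NAME_SIMILARITY` and `compare_names` are assumptions; the `score` argument stands for whatever full comparison is used.

```python
from typing import Callable, Optional

NULL_NAME_SIMILARITY = 0.0  # hypothetical preset fixed value


def compare_names(name1: Optional[str], name2: Optional[str],
                  score: Callable[[str, str], float]) -> float:
    """Return the preset fixed value if either name is empty, else the real score."""
    if not (name1 and name1.strip()) or not (name2 and name2.strip()):
        return NULL_NAME_SIMILARITY
    return score(name1, name2)
```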

For the specific limitations of the apparatus for comparing company names, reference may be made to the limitations of the method for comparing company names above, which are not repeated here. Each module in the apparatus may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server and whose internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data related to company name comparison. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for comparing company names.

Those skilled in the art will appreciate that the structure shown in FIG. 6 is merely a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program: acquiring a first name of a first company and a second name of a second company; preprocessing the first name and the second name; segmenting the preprocessed first name and the preprocessed second name respectively to obtain each area of the first name and the second name; comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area; performing weighted summation on the similarities of the areas to obtain the final similarity of the first name and the second name; and, when the final similarity is greater than a preset threshold, determining that the first company and the second company belong to the same company.

In one embodiment, segmenting the preprocessed first name and the preprocessed second name respectively to obtain each area of the first name and the second name includes: segmenting the preprocessed first name and the preprocessed second name each into a preset number of areas, where the areas include an organizational structure area, an administrative division area, an industry information area and a company font size area.

In one embodiment, after the preprocessed first name and second name have been segmented to obtain their respective areas, the processor further implements the following steps when executing the computer program: removing the organizational structure areas of the first name and the second name, and comparing the areas other than the organizational structure areas; and determining the administrative similarity corresponding to the administrative division areas, the industry information similarity corresponding to the industry information areas and the company font size similarity corresponding to the company font size areas.

In one embodiment, comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area includes: comparing the company font size areas of the first name and the second name to obtain a company font size edit distance between them; comparing the industry information areas of the first name and the second name to obtain an industry information edit distance between them; taking the difference between a preset natural number and the ratio of the company font size edit distance to the number of characters in the company font size area as the company font size similarity of the first name and the second name; and taking the difference between the preset natural number and the ratio of the industry information edit distance to the number of characters in the industry information area as the industry information similarity of the first name and the second name.

In one embodiment, the processor further implements the following step when executing the computer program: when at least one of the first name and the second name is empty, setting the final similarity of the first name and the second name to a preset fixed value.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, the computer program implementing the following steps when executed by a processor: acquiring a first name of a first company and a second name of a second company; preprocessing the first name and the second name; segmenting the preprocessed first name and the preprocessed second name respectively to obtain each area of the first name and the second name; comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area; performing weighted summation on the similarities of the areas to obtain the final similarity of the first name and the second name; and, when the final similarity is greater than a preset threshold, determining that the first company and the second company belong to the same company.

In one embodiment, segmenting the preprocessed first name and the preprocessed second name respectively to obtain each area of the first name and the second name includes: segmenting the preprocessed first name and the preprocessed second name each into a preset number of areas, where the areas include an organizational structure area, an administrative division area, an industry information area and a company font size area.

In one embodiment, after the preprocessed first name and second name have been segmented to obtain their respective areas, the computer program further implements the following steps when executed by the processor: removing the organizational structure areas of the first name and the second name, and comparing the areas other than the organizational structure areas; and determining the administrative similarity corresponding to the administrative division areas, the industry information similarity corresponding to the industry information areas and the company font size similarity corresponding to the company font size areas.

In one embodiment, comparing each area of the first name with the corresponding area of the second name to obtain the similarity of each area includes: comparing the company font size areas of the first name and the second name to obtain a company font size edit distance between them; comparing the industry information areas of the first name and the second name to obtain an industry information edit distance between them; taking the difference between a preset natural number and the ratio of the company font size edit distance to the number of characters in the company font size area as the company font size similarity of the first name and the second name; and taking the difference between the preset natural number and the ratio of the industry information edit distance to the number of characters in the industry information area as the industry information similarity of the first name and the second name.

In one embodiment, the computer program further implements the following step when executed by the processor: when at least one of the first name and the second name is empty, setting the final similarity of the first name and the second name to a preset fixed value.

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.
