TF-IDF word vector-based bank name batch correction method and system

Document No.: 1087393 Publication date: 2020-10-20

Abstract: This technology, TF-IDF word vector-based bank name batch correction method and system (基于tf-idf词向量的银行名称批量校正方法及系统), was designed by 李振, 张刚, 鲍东岳, 尹正, 刘昊霖, 彭加欣, 陈厚霖, 陈婷, 潘仕林, 黑小波 and 王豫丰 on 2020-07-09. The invention provides a bank name batch correction method and system based on TF-IDF word vectors. It relates to the technical field of data query, corrects bank names in batches with high accuracy and high speed, and can greatly improve the efficiency of batch transfers for enterprises and banks. The method comprises: S1, training on the processed bank name library with the TF-IDF algorithm to obtain a first word vector matrix; S2, normalizing that matrix row by row to obtain a first normalized matrix; S3, processing the bank names to be corrected to obtain a set of 2-character phrases for each bank name; S4, training on the phrase sets with the TF-IDF algorithm to obtain a second word vector matrix, and normalizing it row by row to obtain a second normalized matrix; S5, calculating the cosine similarity of the two normalized matrices, selecting the bank full name as the matching result to replace the corresponding bank name to be corrected, and appending the corresponding bank number to form the correction result for output. The technical solution provided by the invention is suitable for the batch correction of bank names.

1. A bank name batch correction method based on TF-IDF word vector technology is characterized by comprising the following steps:

S1, processing the bank name library to serve as a first training set, and processing the first training set with a TF-IDF algorithm to obtain a first word vector matrix;

S2, normalizing each row of the first word vector matrix separately to obtain a first normalization matrix;

S3, importing the bank names to be corrected in batch, and preprocessing them in batch using multiple threads to obtain a second training set corresponding to each bank name;

S4, processing the second training set with the TF-IDF algorithm to obtain a second word vector matrix corresponding to each bank name, and normalizing each row of the second word vector matrix to obtain a second normalization matrix;

S5, calculating the cosine similarity of the first normalization matrix and the second normalization matrix, selecting, according to the cosine similarity, a bank full name as the matching result to replace the corresponding bank name to be corrected, appending the corresponding bank number, and outputting the correction results in batch.

2. The bank name batch correction method based on TF-IDF word vector technology according to claim 1, wherein selecting the bank full name as the matching result according to the cosine similarity specifically comprises: multiplying the second normalization matrix by the transpose of the first normalization matrix, and selecting, for each row of the product, the bank full name corresponding to the position of the maximum value as the matching result.

3. The bank name batch correction method based on TF-IDF word vector technology according to claim 1, wherein the processing of the bank name library in step S1 comprises:

S11, filtering out non-key characters in the bank name;

S12, segmenting and recombining the filtered bank name to obtain a plurality of single characters and a plurality of 2-character phrases.

4. The bank name batch correction method based on TF-IDF word vector technology according to claim 1, wherein the normalization in step S2 or step S4 is L2-norm normalization, which makes the modulus of each row vector equal to 1.

5. The bank name batch correction method based on TF-IDF word vector technology according to claim 4, wherein the L2-norm normalization is calculated as follows:

$$x'_i = \frac{x_i}{\mathrm{norm}_2(x)}, \qquad \mathrm{norm}_2(x) = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

wherein $\mathrm{norm}_2(x)$ is the L2 norm; $x_i$ is an element of the first or second word vector matrix; $x'_i$ is the corresponding element of the first or second normalization matrix; and $x_1, x_2, \ldots, x_n$ are the values in one row vector, each being a TF-IDF value for the corresponding bank name.

6. The bank name batch correction method based on the TF-IDF word vector technology according to claim 1, wherein the TF-IDF algorithm is specifically: TF-IDF = TF × IDF;

TF is the term frequency of a word, and IDF is the inverse document frequency.

7. The TF-IDF word vector technology-based bank name batch correction method according to claim 6, wherein the TF has the calculation formula:

Figure FDA0002577294050000023

8. The natural-language word vector-based bank name matching method according to claim 6, wherein the method further comprises: S5, adding new bank names to the bank name library, and repeating S1-S4.

9. A bank name batch correction system based on TF-IDF word vector technology, wherein the correction system is used to implement the correction method according to any one of claims 1 to 8;

the correction system includes:

an IO module, used for importing the bank names to be corrected in batch and outputting the correction results in batch;

a model module, responsible for preprocessing the bank name library and the bank names to be corrected in batch and for obtaining the word vector matrix of each bank name;

and a storage and calculation module, used for storing the bank name library and the word vector matrices and for performing the calculations, which comprise normalization, cosine similarity calculation, and result matching according to the cosine similarity.

10. A financial institution name batch correction method based on TF-IDF word vector technology, wherein the correction method is directed to financial institutions including banks, and the steps of the correction method according to any one of claims 1 to 8 are performed.

[ technical field ]

The invention relates to the technical field of data query, in particular to a bank name batch correction method and system based on a TF-IDF word vector technology.

[ background of the invention ]

Transfer is a very common means of bank settlement. According to the number of accounts involved, transfers can be divided into single transfers and batch transfers: single transfers are common among individual customers, while batch transfers are common among enterprises and banks, for example batch payments made by enterprises through bank-enterprise direct connection, payroll issued through banks, and batch payments for goods made through banks. A batch payment may involve several or even dozens of banks, and for the transfer to complete successfully, each bank name and its corresponding bank number must be accurate. When faced with a large number of transfer requests, business personnel can spend hours looking up bank names.

Therefore, there is a need to develop a bank name batch correction system based on TF-IDF word vector technology to address the deficiencies of the prior art and to solve or alleviate one or more of the above problems.

[ summary of the invention ]

In view of the above, the invention provides a bank name batch correction method and system based on TF-IDF word vector technology, which can correct bank names in batch and automatically match bank numbers, offers high accuracy and high correction speed, can greatly improve the efficiency of batch transfers for enterprises and banks, and saves enterprises time.

In one aspect, the invention provides a bank name batch correction method based on a TF-IDF word vector technology, which is characterized by comprising the following steps:

S1, processing the bank name library to serve as a first training set, and processing the first training set with a TF-IDF algorithm to obtain a first word vector matrix;

S2, normalizing each row of the first word vector matrix separately to obtain a first normalization matrix;

S3, importing the bank names to be corrected in batch, and preprocessing them in batch using multiple threads to obtain a second training set corresponding to each bank name;

S4, processing the second training set with the TF-IDF algorithm to obtain a second word vector matrix corresponding to each bank name, and normalizing each row of the second word vector matrix to obtain a second normalization matrix;

S5, calculating the cosine similarity of the first normalization matrix and the second normalization matrix, selecting, according to the cosine similarity, a bank full name as the matching result to replace the corresponding bank name to be corrected, appending the corresponding bank number, and outputting the correction results in batch.

As for the above-mentioned aspects and any possible implementation manner, an implementation manner is further provided in which selecting the bank full name as the matching result according to the cosine similarity specifically comprises: multiplying the second normalization matrix by the transpose of the first normalization matrix, and selecting, for each row of the product, the bank full name corresponding to the position of the maximum value as the matching result.

The above-described aspect and any possible implementation manner further provide an implementation manner in which the processing performed on the bank name library in step S1 includes:

S11, filtering out non-key characters in the bank name;

S12, segmenting and recombining the filtered bank name to obtain a plurality of single characters and a plurality of 2-character phrases.

The above-described aspect and any possible implementation manner further provide an implementation manner in which the normalization in step S2 or step S4 is L2-norm normalization, which makes the modulus of each row vector equal to 1.

The above aspects and any possible implementation manner further provide an implementation manner in which the L2-norm normalization is calculated as follows:

$$x'_i = \frac{x_i}{\mathrm{norm}_2(x)}, \qquad \mathrm{norm}_2(x) = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

wherein $\mathrm{norm}_2(x)$ is the L2 norm; $x_i$ is an element of the first or second word vector matrix; $x'_i$ is the corresponding element of the first or second normalization matrix; and $x_1, x_2, \ldots, x_n$ are the values in one row vector, each being a TF-IDF value for the corresponding bank name.

As for the above-mentioned aspects and any possible implementation manner, there is further provided an implementation manner in which the TF-IDF algorithm is specifically: TF-IDF = TF × IDF;

TF is the term frequency of a word, and IDF is the inverse document frequency.

As described in the above aspect and any possible implementation manner, there is further provided an implementation manner in which the calculation formula of TF is: [formula image not available]; and the calculation formula of IDF is: [formula image not available].

The above-described aspects and any possible implementation manner further provide an implementation manner in which the method further comprises: S5, adding new bank names to the bank name library, and repeating S1-S4.

In another aspect, the present invention provides a bank name batch correction system based on TF-IDF word vector technology, which can implement the steps of any one of the correction methods described above;

the correction system includes:

an IO module, used for importing the bank names to be corrected in batch and outputting the correction results in batch;

a model module, responsible for preprocessing the bank name library and the bank names to be corrected in batch and for obtaining the word vector matrix of each bank name;

and a storage and calculation module, used for storing the bank name library and the word vector matrices and for performing the calculations, which comprise normalization, cosine similarity calculation, and result matching according to the cosine similarity.

The above-mentioned aspects and any possible implementation manner further provide an implementation manner, and the accuracy of the system for correcting the bank name reaches more than 99.9%.

In another aspect, the present invention provides a financial institution name batch correction method based on TF-IDF word vector technology, wherein the correction method is applied to financial institutions including banks. Its steps are the same as those of the correction method described above, except that the bank names, bank full names, bank short names, etc. are replaced by the names, full names, or short names of the corresponding financial institutions.

Compared with the prior art, the invention can obtain the following technical effects: it corrects bank names and matches bank numbers automatically, with high accuracy and high correction speed, and can greatly improve the efficiency of batch transfers for enterprises and banks and save enterprises time.

Of course, it is not necessary for any one product in which the invention is practiced to achieve all of the above-described technical effects simultaneously.

[ description of the drawings ]

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a detailed flowchart of a TF-IDF word vector training step in the bank name batch correction method according to an embodiment of the present invention;

Fig. 2 is a logic diagram of the operation of the bank name batch correction system according to an embodiment of the present invention.

[ detailed description ]

For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.

It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

To address the deficiencies of the prior art, the invention provides a bank name batch correction system based on TF-IDF word vector technology, which can correct bank names in batch and automatically match bank numbers, offers high accuracy and high correction speed, can greatly improve the efficiency of batch transfers for enterprises and banks, and saves enterprises time. The system comprises three modules: an input/output module, a model module, and a calculation and storage module. The input/output module handles the input of batches of bank names and the output of correction results; the model module is responsible for preprocessing the input data and converting it into TF-IDF word vectors; and the calculation and storage module is responsible for bank name correction and bank number matching.

The bank name batch correction method comprises the following specific contents:

Step 1: Preprocess all bank names in the bank name library to obtain training data.

The preprocessing specifically comprises:

Filtering: remove from the bank full name the words that carry no key information, including but not limited to "limited company", "joint-stock limited company", "bank" and "branch";

Cutting: segment the filtered bank full name, cutting it into parts character by character;

Combining: combine the segmented characters two at a time in their original order, discarding reverse-order combinations, to obtain 2-character phrases; these 2-character phrases, together with all the segments from the previous step, form the training set.

For example, the full name of the Beijing Shunyi branch of the Industrial and Commercial Bank of China becomes [China, Industrial-Commercial (Gongshang), Beijing, Shunyi, Zhonggong, Gongshun, Jingshun].
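The filtering, cutting and combining of Step 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the stop-phrase list `STOP_PHRASES` and the function name `preprocess` are hypothetical, and "combining" is read here as taking every ordered pair of characters (earlier character first), which is one plausible interpretation of the text.

```python
from itertools import combinations

# Hypothetical list of phrases carrying no key information.
# Longer phrases come first so they are removed before their substrings.
STOP_PHRASES = ("股份有限公司", "有限公司", "银行", "支行")


def preprocess(name, stop_phrases=STOP_PHRASES):
    """Return single characters plus ordered 2-character phrases for one name."""
    # Filtering: strip phrases that do not help tell banks apart.
    for phrase in stop_phrases:
        name = name.replace(phrase, "")
    # Cutting: split the filtered name into single characters.
    chars = list(name)
    # Combining: pair characters in their original order only,
    # so a reverse-order pair such as (later, earlier) is never produced.
    pairs = [a + b for a, b in combinations(chars, 2)]
    return chars + pairs


tokens = preprocess("工商银行顺义")
```

Under this reading, the short name 工商银行顺义 is filtered to 工商顺义 and yields four single characters followed by six ordered pairs, including 工顺 but not its reverse 顺工.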

Step 2: Feed the training set to the TF-IDF algorithm to obtain a word vector matrix, then normalize the word vectors and store the normalized matrix X in the storage and calculation system. This also yields a trained TF-IDF model, which is stored in the model system; the stored TF-IDF model contains the TF and IDF information of each word and character in the training set, together with their index numbers. Once these steps are complete, batch correction of bank names and matching of bank numbers can be carried out.

TF-IDF is calculated as: TF-IDF = TF × IDF, where TF is the term frequency and IDF is the inverse document frequency;

wherein, the calculation formula of TF is:

Figure BDA0002577294060000061

the formula for the IDF is:

Figure BDA0002577294060000062
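As an illustration of how a TF-IDF model could be fitted on the bank name library and then applied to new names, here is a minimal pure-Python sketch. It rests on assumptions: TF is taken as a token's count divided by the number of known tokens in the name, and IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1, which may differ from the formulas in the original figures. The function names `fit_idf` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Fit on a list of token lists; return (token -> column index, token -> IDF)."""
    n_docs = len(docs)
    # Document frequency: in how many names each token appears at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    vocab = {tok: i for i, tok in enumerate(sorted(df))}
    # Assumed smoothed IDF; the source's exact formula is not reproduced here.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    return vocab, idf


def tfidf_vector(doc, vocab, idf):
    """Turn one token list into a dense TF-IDF row using a fitted model."""
    # Tokens unseen during fitting are ignored.
    counts = Counter(tok for tok in doc if tok in vocab)
    total = sum(counts.values())
    vec = [0.0] * len(vocab)
    for tok, count in counts.items():
        vec[vocab[tok]] = (count / total) * idf[tok]
    return vec


library = [["a", "b"], ["a", "c"]]
vocab, idf = fit_idf(library)
row = tfidf_vector(["a", "b"], vocab, idf)
```

The token shared by both names ("a") receives a lower weight than the token unique to one name ("b"), which is the property that lets distinctive characters dominate the match.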

The word vectors are normalized by normalizing each row of the word vector matrix separately, using L2-norm normalization.

The L2-norm normalization is calculated as follows:

$$x'_i = \frac{x_i}{\mathrm{norm}_2(x)}, \qquad \mathrm{norm}_2(x) = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

wherein $\mathrm{norm}_2(x)$ is the L2 norm; $x_i$ is an element of the TF-IDF matrix; $x'_i$ is the corresponding element of the normalized matrix; and $x_1, x_2, \ldots, x_n$ in the norm formula are the values in one vector, each being a TF-IDF value obtained after a given bank name has been segmented and combined.
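The row-wise L2-norm normalization can be sketched as below. The function name `l2_normalize_rows` is hypothetical, and leaving an all-zero row unchanged is an assumption made here to avoid division by zero; the text does not say how such rows are treated.

```python
import math


def l2_normalize_rows(matrix):
    """Scale every row so that its Euclidean length (modulus) equals 1."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(value * value for value in row))
        if norm > 0:
            normalized.append([value / norm for value in row])
        else:
            # Assumption: an all-zero row is kept as-is.
            normalized.append(list(row))
    return normalized


unit_rows = l2_normalize_rows([[3.0, 4.0], [0.0, 0.0]])
```

Because every non-zero row then has length 1, the dot product of two rows equals their cosine similarity, which is what allows the later matching step to be a plain matrix multiplication.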

Step 3: Input the batch of bank names to be corrected (a name may be a bank's short name, an incomplete name, a name containing errors, etc.), for example by importing them as an Excel table in which each row is one bank name.

Step 4: Pass the table to the model system and preprocess each bank name in it in the same way as in Step 1. For example, an input short name for the Shunyi branch of the Industrial and Commercial Bank becomes, after filtering, cutting and combining: [Gongshang, Shunyi, Gongshun, …]

Step 5: Input the preprocessed data into the trained TF-IDF model to obtain the TF-IDF word vectors, and normalize them to obtain the matrix Y.

Step 6: Multiply the matrix Y from Step 5 by the transpose of the matrix X from Step 2, which amounts to a cosine similarity calculation, to obtain a new matrix Z, and find the bank full name and bank number corresponding to the position of the maximum value in each row of Z;

Step 7: Replace the original input bank name with the bank full name that was found, append the bank number, and output the result as the correction result.
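Steps 6 and 7 can be sketched as follows, assuming X and Y are already row-normalized so that a dot product equals the cosine similarity. The function name `match_names` and the sample names and numbers are hypothetical placeholders. Plain Python lists are used to keep the sketch self-contained; a numerical library would normally compute Z = Y · Xᵀ in a single call.

```python
def match_names(Y, X, full_names, bank_numbers):
    """For each row of Y, return the best library entry and its similarity."""
    results = []
    for query in Y:
        # One row of Z: similarity of this query to every library name.
        scores = [sum(q * x for q, x in zip(query, row)) for row in X]
        # Position of the maximum value in the row.
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((full_names[best], bank_numbers[best], scores[best]))
    return results


X = [[1.0, 0.0], [0.0, 1.0]]            # normalized library matrix
Y = [[0.6, 0.8], [1.0, 0.0]]            # normalized query matrix
names = ["Bank A full name", "Bank B full name"]   # placeholder full names
numbers = ["NO-A", "NO-B"]                          # placeholder bank numbers
matches = match_names(Y, X, names, numbers)
```

Returning the similarity score alongside the match is a design choice of this sketch: it would let a caller flag low-scoring corrections for manual review, which the text itself does not describe.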

Batch correction differs technically from single correction as follows. Batch correction performs one fast calculation and matching pass over the bank name lookup table uploaded by the user. Specifically, the short names are preprocessed in multiple threads to obtain a TF-IDF matrix (single correction yields only one row vector); this matrix is then multiplied directly with the stored matrix X, and all matching results are found at once. Compared with querying names one at a time in a loop, the speed is greatly improved, which suits batch matching business scenarios.
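The multi-threaded batch preprocessing could look like the sketch below, which uses Python's standard `concurrent.futures.ThreadPoolExecutor`. The function name `preprocess_batch`, its parameter `preprocess_one` (any per-name preprocessing function) and the worker count are hypothetical; the text does not specify how the threads are organized.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess_batch(names, preprocess_one, max_workers=4):
    """Apply a per-name preprocessing function to all names using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs,
        # so row i of the output still corresponds to input name i.
        return list(pool.map(preprocess_one, names))


# Stand-in preprocessing step for illustration: split a name into characters.
token_lists = preprocess_batch(["ab", "cde"], list)
```

Keeping the output order aligned with the input order matters here, because each row of the resulting TF-IDF matrix must map back to the bank name the user submitted.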

The content of the application can also be applied to batch correction processes of other financial institutions.

The invention has the following beneficial effects:

1) The method corrects bank names in batch: importing a batch of bank names with one click is enough to obtain the automatically corrected bank full names and the correctly matched bank numbers. The operation is simple and efficient, can replace a large amount of repetitive work by business personnel, and improves the working efficiency of enterprises and banks.

2) The model used by the invention is lightweight and the algorithm is simple, so it can easily be embedded into an enterprise's bank-enterprise direct connection system or a related financial system.

3) The invention uses matrix calculation, which is much faster than loop-based calculation: thousands of names can be corrected within a few minutes, compared with the hours needed for manual correction, so productivity is greatly improved.

4) The invention has high correction accuracy: in a test on 140,000 records, the accuracy reached 99.9%.

The bank name batch correction system based on the TF-IDF word vector technology provided by the embodiment of the present application is described in detail above. The above description of the embodiments is only for the purpose of helping to understand the method of the present application and its core ideas; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

As used in the specification and claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to". "Substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem and substantially achieve the technical effect. The description which follows is a preferred embodiment of the present application, but is made for the purpose of illustrating the general principles of the application and not for the purpose of limiting the scope of the application. The protection scope of the present application shall be subject to the definitions of the appended claims.

It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a good or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such good or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a commodity or system that includes the element.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects, meaning that three relationships may exist; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

The foregoing description shows and describes several preferred embodiments of the present application, but as aforementioned, it is to be understood that the application is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the application as described herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the application, which is to be protected by the claims appended hereto.
