Chi-square binning method based on secure multi-party computation

Document No.: 1875628    Publication date: 2021-11-23

Reading note: This technology, 一种基于安全多方计算的卡方分箱方法 (A chi-square binning method based on secure multi-party computation), was designed and created by 何道敬, 孙黎彤, 杜润萌, 张民, 张熙 and 廖清 on 2021-08-27. Its main content is as follows: the invention discloses a chi-square binning method based on secure multi-party computation and proposes, for the feature engineering of federated learning, a new way of computing chi-square values. Instead of encrypting all feature data and sending it to the data application party for feature preprocessing, the data provider first groups the feature data by category, mixes in false groups, marks the group categories, and then encrypts the marks and sends the grouping to the data application party. Encrypting only the group categories greatly reduces the amount of data that must be encrypted, and the data application party does not need to decrypt all the feature data, avoiding a huge resource cost. What the data provider sends to the data application party is the grouping information of the feature data; what the data application party obtains after decryption is that grouping information, which does not contain the actual feature values and to which false grouping information has been added, with real and false groups encoded and marked. Compared with transmitting desensitized data or transmitting the real data after encryption, this improves the security of data privacy.

1. A chi-square binning method based on secure multi-party computation, characterized by comprising the following steps:

Step 1: the data provider generates a public key pk and private key sk pair with a homomorphic encryption system; the feature data X = {x_0, x_1, ..., x_(n-1)}, id ∈ [0, n-1], is grouped by placing the ids of data with the same category in the feature data X into one interval, giving s groups denoted x_t, t ∈ [0, s-1], where n and s are positive integers; the real groups x_t are marked as class 1, and the group class is encrypted with the public key pk as E_x = E(1), giving the real grouping information Group_t(x_t, E_x);

Step 2: false groups are constructed by randomly dividing the ids of the feature data X into s group intervals, the same number as the real groups, denoted x_v, v ∈ [0, s-1]; the false groups are marked as class 0, and the group class is encrypted with the public key pk as E_x = E(0), giving the false grouping information Group_v(x_v, E_x);

Step 3: the real grouping information and the false grouping information are concatenated row-wise and the rows are shuffled, giving the grouping information Group_X; the data provider sends the grouping information Group_X(x_i, E_x) to the data application party;

Step 4: the data application party uses the grouping information Group_X(x_i, E_x) and the label data Y = {y_0, y_1, ..., y_i, ..., y_(n-1)}, id ∈ [0, n-1], to obtain the label data y_i corresponding to each group interval x_i; the label values y_i of each group interval x_i are summed to give the number of response samples Group_y in the interval, and from the total number of samples Group_s in the interval the number of non-response samples is computed as Group_n = Group_s - Group_y; the response sample counts Group_y, non-response sample counts Group_n and total sample counts Group_s of all group intervals, together with the group class mark E_x corresponding to each interval, are sent to the data provider;

Step 5: the data provider decrypts the group class marks E_x with the private key, obtaining the decrypted group class labels D_x; if D_x = 1 the group is a real group, and if D_x = 0 the group is a false group and the false grouping information is deleted;

Step 6: from the response sample count Group_y, non-response sample count Group_n and total sample count Group_s of each real group interval, the data provider computes the expected sample count E_ij of the j-th class of the i-th group, i ∈ [0, 2s-1], where j ∈ [0, 2) indexes the response and non-response sample classes; from the expected sample counts E_ij and the actual sample counts A_ij of two adjacent real groups, the chi-square value χ² of the two adjacent real groups is computed;

Step 7: the data provider sets a limit on the number of bins; according to the chi-square values of adjacent groups, the two groups with the smallest chi-square value are merged, the chi-square values of the adjacent groups are recomputed after merging, and merging stops when the number of bins reaches the limit, giving the chi-square binning result.

2. The chi-square binning method based on secure multi-party computation of claim 1, wherein the real group x_t of step 1 contains only the ids of the feature data, id ∈ [0, n-1], and not the actual feature values, so that leakage of the actual feature values is avoided.

3. The chi-square binning method based on secure multi-party computation of claim 1, wherein in step 2 the ids of the feature data X are randomly divided into s group intervals in order to construct false groups, which are mixed with the real groups to protect the real grouping information.

4. The chi-square binning method based on secure multi-party computation of claim 1, wherein the grouping information Group_X(x_i, E_x) of step 3 mixes the false grouping information with the real grouping information, and the classes of the false and real groups are encrypted, to protect the privacy of the feature data.

5. The chi-square binning method based on secure multi-party computation of claim 1, wherein the response sample count Group_y of step 4 is obtained as follows: the grouping information x_i contains the ids of the feature data; mapping these ids to the ids of the label data Y gives the label values corresponding to the grouping information x_i; for example, if the i-th grouping information is x_i = [0, 2], the corresponding label values are [y_0, y_2]; since a response sample has label value 1 and a non-response sample has label value 0, the label values corresponding to the grouping information are summed to give the response sample count Group_y of the group.

6. The chi-square binning method based on secure multi-party computation of claim 1, wherein the non-response sample count Group_n of step 4 is obtained as follows: the number of samples of each group is the number of ids in the grouping information x_i, i.e. the length of x_i, which gives the total sample count Group_s of the group; subtracting the response sample count from the total sample count of the group gives the non-response sample count Group_n.

7. The chi-square binning method based on secure multi-party computation of claim 1, wherein the expected sample count E_ij of the j-th class of the i-th group in step 6 is calculated as:

E_ij = (R_i × C_j) / N

where R_i is the sum of the sample counts of classes j and j+1 of the i-th group, i.e. R_i = Group_s^(i); C_j is the sum of the sample counts of the j-th class in the i-th and (i+1)-th groups, e.g. when j denotes the response sample class, C_j = Group_y^(i) + Group_y^(i+1); and N is the total number of samples of the two adjacent groups, i.e. N = Group_s^(i) + Group_s^(i+1).

8. The chi-square binning method based on secure multi-party computation of claim 1, wherein the chi-square value χ² of step 6 is calculated as:

χ² = Σ_i Σ_j (A_ij - E_ij)² / E_ij

where A_ij is the actual sample count of the j-th class in the i-th group, e.g. if j denotes the response sample class of the i-th group, A_ij = Group_y^(i), and E_ij is the expected sample count of the j-th class in the i-th group.

Technical Field

The invention belongs to the field of federated learning, and particularly relates to a chi-square binning method based on secure multi-party computation.

Background

Before federated learning begins, it is first necessary to build a data set rather than modeling directly on the raw data. The task of converting raw data into a data set is called feature engineering.

Feature selection is an important step in feature engineering. Generally, when a classification model is built, continuous variables first need to be discretized; after feature discretization the model is more stable, which reduces the risk of overfitting. During feature selection a binning operation is often performed: binning discretizes continuous feature data. Binning has many benefits. For example, it is more robust to abnormal data and reduces the interference of such data with modeling; after the feature data is discretized, each feature value has an independent weight, which introduces non-linearity into a logistic regression model and can improve its expressive power; missing feature values can be treated as an independent class and included in the model; and the sparse vectors formed after feature discretization support fast inner-product multiplication, with results that are easy to store and extend. For accurate discretization the data is partitioned into classes: two adjacent intervals can be merged if they have very similar class distributions, otherwise they should remain separate, and a low chi-square value indicates similar class distributions in two adjacent intervals. After the feature data is binned, its chi-square value is calculated; the smaller the chi-square value, the more similar the distributions, and the corresponding intervals can be merged into one bin.

During feature discretization or evaluation of a feature's predictive power, i.e. in the feature preprocessing stage of federated learning, the party lacking the feature label data needs to send its own feature data to the party holding the feature labels for joint feature preprocessing.

In most existing federated learning frameworks, to meet privacy-protection requirements, the data provider encrypts the entire feature matrix with a public key and sends the ciphertext matrix to the data application party, which then decrypts the data with the private key for computation. On large-scale data sets this approach obviously causes huge resource consumption and performance degradation. Some other approaches transmit desensitized data directly for computation, which cannot protect data privacy and does not comply with legal regulations; still others have each participant train independently and then fuse the training results, which cannot fully exploit the value of the data.

Disclosure of Invention

The invention aims to provide a novel chi-square binning method based on secure multi-party computation for accurate discretization of data. After the feature data is binned, its chi-square value is calculated; the smaller the chi-square value, the more similar the distributions, and the corresponding intervals can be merged into one bin.

The specific technical scheme for realizing the purpose of the invention is as follows:

A chi-square binning method based on secure multi-party computation comprises the following steps:

Step 1: the data provider generates a public key pk and private key sk pair with a homomorphic encryption system; the feature data X = {x_0, x_1, ..., x_(n-1)}, id ∈ [0, n-1], is grouped by placing the ids of data with the same category in the feature data X into one interval, giving s groups denoted x_t, t ∈ [0, s-1], where n and s are positive integers; the real groups x_t are marked as class 1, and the group class is encrypted with the public key pk as E_x = E(1), giving the real grouping information Group_t(x_t, E_x);
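
Purely as an illustration of this step, the following minimal sketch uses the python-paillier package (phe) as the homomorphic encryption system; the toy data, variable names and key length are assumptions, not part of the invention:

```python
from phe import paillier

# Data provider: generate the homomorphic key pair (pk, sk).
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Toy feature data held by the provider: id -> category value.
# Only the ids are grouped; the actual feature values never leave the provider.
feature_category = {0: "A", 1: "B", 2: "B", 3: "C", 4: "D", 5: "E", 6: "E", 7: "E"}

# Place ids with the same category into one interval x_t.
groups_by_cat = {}
for idx, cat in feature_category.items():
    groups_by_cat.setdefault(cat, []).append(idx)
real_groups = list(groups_by_cat.values())   # s groups, e.g. [[0], [1, 2], [3], [4], [5, 6, 7]]

# Mark each real group as class 1 and encrypt the mark with pk: E_x = E(1).
real_grouping_info = [(ids, public_key.encrypt(1)) for ids in real_groups]
```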

Step 2: false groups are constructed by randomly dividing the ids of the feature data X into s group intervals, the same number as the real groups, denoted x_v, v ∈ [0, s-1]; the false groups are marked as class 0, and the group class is encrypted with the public key pk as E_x = E(0), giving the false grouping information Group_v(x_v, E_x);
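
A sketch of the false-group construction: the helper below randomly splits the ids into s non-empty intervals, and the class-0 marks would then be encrypted with the same public key as in step 1 (the function name and toy sizes are illustrative assumptions):

```python
import random

def make_false_groups(ids, s):
    """Randomly partition the ids into s non-empty intervals (the false groups)."""
    ids = list(ids)
    random.shuffle(ids)
    cuts = sorted(random.sample(range(1, len(ids)), s - 1))   # s-1 distinct cut points
    bounds = [0] + cuts + [len(ids)]
    return [sorted(ids[bounds[k]:bounds[k + 1]]) for k in range(s)]

# 8 ids split into the same number of intervals as the real grouping (s = 5).
false_groups = make_false_groups(range(8), 5)
# false_grouping_info = [(g, public_key.encrypt(0)) for g in false_groups]
```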

Step 3: the real grouping information and the false grouping information are concatenated row-wise and the rows are shuffled, giving the grouping information Group_X; the data provider sends the grouping information Group_X(x_i, E_x) to the data application party;
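
A sketch of the row-wise mixing and shuffling of step 3; the string marks below stand in for the Paillier ciphertexts E(1)/E(0) so the snippet runs on its own:

```python
import random

def mix_and_shuffle(real_info, false_info):
    """Concatenate real and false grouping rows and shuffle them row-wise."""
    mixed = list(real_info) + list(false_info)
    random.shuffle(mixed)
    return mixed   # Group_X(x_i, E_x), sent to the data application party

group_X = mix_and_shuffle([([0], "E(1)"), ([1, 2], "E(1)")],
                          [([0, 1, 2], "E(0)")])
```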

Step 4: the data application party uses the grouping information Group_X(x_i, E_x) and the label data Y = {y_0, y_1, ..., y_i, ..., y_(n-1)}, id ∈ [0, n-1], to obtain the label data y_i corresponding to each group interval x_i; the label values y_i of each group interval x_i are summed to give the number of response samples Group_y in the interval, and from the total number of samples Group_s in the interval the number of non-response samples is computed as Group_n = Group_s - Group_y; the response sample counts Group_y, non-response sample counts Group_n and total sample counts Group_s of all group intervals, together with the group class mark E_x corresponding to each interval, are sent to the data provider;
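
A sketch of the aggregation on the data application side; it sees only id lists and opaque ciphertext marks, never the feature values (string marks again stand in for the ciphertexts, and the label vector is the toy Y used in the later example):

```python
def aggregate_per_group(group_X, labels):
    """For each received group, return (Group_y, Group_n, Group_s, E_x)."""
    stats = []
    for ids, mark in group_X:
        group_s = len(ids)                        # total samples in the interval
        group_y = sum(labels[i] for i in ids)     # response samples (label value 1)
        group_n = group_s - group_y               # non-response samples
        stats.append((group_y, group_n, group_s, mark))
    return stats                                  # sent back to the data provider

Y = [0, 1, 1, 1, 0, 0, 1, 1]                      # label data held by the application party
stats = aggregate_per_group([([0], "E(1)"), ([1, 2], "E(1)"), ([0, 1, 2], "E(0)")], Y)
# -> [(0, 1, 1, 'E(1)'), (2, 0, 2, 'E(1)'), (2, 1, 3, 'E(0)')]
```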

Step 5: the data provider decrypts the group class marks E_x with the private key, obtaining the decrypted group class labels D_x; if D_x = 1 the group is a real group, and if D_x = 0 the group is a false group and the false grouping information is deleted;

Step 6: from the response sample count Group_y, non-response sample count Group_n and total sample count Group_s of each real group interval, the data provider computes the expected sample count E_ij of the j-th class of the i-th group, i ∈ [0, 2s-1], where j ∈ [0, 2) indexes the response and non-response sample classes; from the expected sample counts E_ij and the actual sample counts A_ij of two adjacent real groups, the chi-square value χ² of the two adjacent real groups is computed;

Step 7: the data provider sets a limit on the number of bins; according to the chi-square values of adjacent groups, the two groups with the smallest chi-square value are merged, the chi-square values of the adjacent groups are recomputed after merging, and merging stops when the number of bins reaches the limit, giving the chi-square binning result.
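
The provider-side logic of steps 5-7 can be sketched as below. The marks are assumed to be already decrypted with sk (1 = real, 0 = false), the real groups are assumed to be restored to the feature's value order, and the toy counts loosely follow the example in the Detailed Description (only one false row is shown); function names are illustrative:

```python
def chi2_adjacent(a, b):
    """Chi-square of two adjacent groups; a and b are (Group_y, Group_n) pairs."""
    rows = [a, b]
    n = sum(sum(r) for r in rows)                    # N: total samples of the two groups
    col = [a[0] + b[0], a[1] + b[1]]                 # C_j: class totals over both groups
    chi2 = 0.0
    for r in rows:
        for j in range(2):
            e = sum(r) * col[j] / n                  # expected count E_ij = R_i * C_j / N
            if e > 0:
                chi2 += (r[j] - e) ** 2 / e          # add (A_ij - E_ij)^2 / E_ij
    return chi2

def chi_merge(groups, max_bins):
    """groups: (Group_y, Group_n) per real interval, in value order."""
    groups = list(groups)
    while len(groups) > max_bins:
        chis = [chi2_adjacent(groups[k], groups[k + 1]) for k in range(len(groups) - 1)]
        k = chis.index(min(chis))                    # merge the pair with the smallest chi-square
        merged = (groups[k][0] + groups[k + 1][0], groups[k][1] + groups[k + 1][1])
        groups[k:k + 2] = [merged]
    return groups

# Step 5: keep only groups whose decrypted mark is 1, then merge down to 3 bins.
decrypted = [(1, (0, 1)), (1, (2, 0)), (0, (2, 1)), (1, (1, 0)), (1, (0, 1)), (1, (2, 1))]
real = [counts for mark, counts in decrypted if mark == 1]
print(chi_merge(real, max_bins=3))
```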

The real group x_t of step 1 contains only the ids of the feature data, id ∈ [0, n-1], and not the actual feature values, so that leakage of the actual feature values is avoided.

In step 2 the ids of the feature data X are randomly divided into s group intervals in order to construct false groups, which are mixed with the real groups to protect the real grouping information.

The grouping information Group_X(x_i, E_x) of step 3 mixes the false grouping information with the real grouping information, and the classes of the false and real groups are encrypted, protecting the privacy of the feature data.

The response sample count Group_y of step 4 is obtained as follows: the grouping information x_i contains the ids of the feature data; mapping these ids to the ids of the label data Y gives the label values corresponding to the grouping information x_i; for example, if the i-th grouping information is x_i = [0, 2], the corresponding label values are [y_0, y_2]; since a response sample has label value 1 and a non-response sample has label value 0, the label values corresponding to the grouping information are summed to give the response sample count Group_y of the group.

The non-response sample count Group_n of step 4 is obtained as follows: the number of samples of each group is the number of ids in the grouping information x_i, i.e. the length of x_i, which gives the total sample count Group_s of the group; subtracting the response sample count from the total sample count of the group gives the non-response sample count Group_n.

The expected sample count E_ij of the j-th class of the i-th group in step 6 is calculated as:

E_ij = (R_i × C_j) / N

where R_i is the sum of the sample counts of classes j and j+1 of the i-th group, i.e. R_i = Group_s^(i); C_j is the sum of the sample counts of the j-th class in the i-th and (i+1)-th groups, e.g. when j denotes the response sample class, C_j = Group_y^(i) + Group_y^(i+1); and N is the total number of samples of the two adjacent groups, i.e. N = Group_s^(i) + Group_s^(i+1).

The chi-square value χ² of step 6 is calculated as:

χ² = Σ_i Σ_j (A_ij - E_ij)² / E_ij

where A_ij is the actual sample count of the j-th class in the i-th group, e.g. if j denotes the response sample class of the i-th group, A_ij = Group_y^(i), and E_ij is the expected sample count of the j-th class in the i-th group.

Advantages of the invention

In terms of security, the invention protects data privacy during chi-square binning in the feature engineering stage of federated learning. The feature data is grouped, with the data index ids of the same class forming the real grouping information, and false grouping information is added; the real group class is marked as 1 and the false group class as 0, the 0/1 codes of the group classes are encrypted, and the real grouping information is mixed with the false grouping information before being sent to the data application party. The data application party does not learn the specific feature values of any group, only the ids corresponding to the feature data, and false groups are mixed in, so the privacy of the feature data is protected.

In terms of operation efficiency, not all feature values need to be encrypted and sent to the data application party; only the group class of the feature data is encrypted, which avoids the computational overhead of encrypting and decrypting large amounts of data. The efficiency gain is especially significant on large data sets.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following specific example and the accompanying drawing. Except for the contents specifically mentioned below, the procedures, conditions and experimental methods for carrying out the present invention are common general knowledge in the art, and the present invention is not particularly limited thereto.

Examples

Taking the chi-square binning of the data provider's feature data X as an example, the chi-square binning method based on secure multi-party computation proceeds as follows:

First, the data provider places the ids of data with the same category in the feature data X into one interval; the grouping result is x_t = [0], [1, 2], [3], [4], [5, 6, 7], a total of 5 groups. These groups are real, so they are marked as class 1 and the group class is encrypted with the public key pk as E_x = E(1), giving the real grouping information Group_t(x_t, E_x), whose specific content is as follows:

x_t          E_x
[0]          E(1)
[1, 2]       E(1)
[3]          E(1)
[4]          E(1)
[5, 6, 7]    E(1)

Second, false groups are constructed by randomly dividing the ids of the feature data X into s intervals; the grouping result is x_v = [0, 1, 2], [3, 4], [5], [6], [7], 5 groups in total, the same number as the real groups. These groups are marked as false (class 0) and the group class is encrypted with the public key pk as E_x = E(0), giving the false grouping information Group_v(x_v, E_x), whose specific content is as follows:

x_v          E_x
[0, 1, 2]    E(0)
[3, 4]       E(0)
[5]          E(0)
[6]          E(0)
[7]          E(0)

Then the real grouping information Group_t(x_t, E_x) and the false grouping information Group_v(x_v, E_x) are concatenated row-wise and the rows are shuffled, giving the grouping information Group_X(x_i, E_x), which is sent to the data application party; its specific content is as follows:

x_i          E_x
[0, 1, 2]    E(0)
[3, 4]       E(0)
[0]          E(1)
[5]          E(0)
[1, 2]       E(1)
[3]          E(1)
[6]          E(0)
[7]          E(0)
[4]          E(1)
[5, 6, 7]    E(1)

Then the data application party maps the ids in the grouping information Group_X to the ids of the label data Y = {0, 1, 1, 1, 0, 0, 1, 1} to obtain the label values corresponding to each group interval; the label values y_i of each group interval x_i are summed to give the response sample count Group_y of the interval, and from the total sample count Group_s of the interval the non-response sample count is computed as

Group_n = Group_s - Group_y

Then, the number Group of response samples of all the grouping intervalsyNumber of unresponsive samples GroupnTotal number of samples GroupsAnd a packet type flag E corresponding to the packet sectionxSending the data to a data provider;

The data provider decrypts the group class marks E_x with the private key sk to recover the real grouping information: a group whose decrypted class mark equals 1 is a real group. From the response sample count Group_y, non-response sample count Group_n and total sample count Group_s of each real group interval, the expected sample count E_ij of the j-th class of the i-th group is computed, where j ∈ [0, 2) indexes the response and non-response sample classes. Taking the two adjacent real group intervals [0] and [1, 2] as an example, their chi-square value is computed; the information of the two adjacent real groups is as follows:

Group index   Interval   Group_y   Group_n   R_i (Group_s)
0             [0]        0         1         1
1             [1, 2]     2         0         2
              C_j        2         1         N = 3

For the group interval [0], the response sample count is Group_y^(0) = 0, the total sample count is Group_s^(0) = 1, and the non-response sample count is Group_n^(0) = 1. The expected sample counts of this group are E_00 = (R_0 × C_0) / N = (1 × 2) / 3 ≈ 0.67 and E_01 = (R_0 × C_1) / N = (1 × 1) / 3 ≈ 0.33.

The expected sample counts of the group [1, 2] are obtained in the same way: E_10 = (2 × 2) / 3 ≈ 1.33 and E_11 = (2 × 1) / 3 ≈ 0.67. From the expected sample counts E_ij and the actual sample counts A_ij of the two adjacent real groups, the chi-square value χ² of the two adjacent real groups is finally computed; here χ² = 3.

The data provider sets a limit on the number of bins; according to the chi-square values χ² of adjacent groups, the two groups with the smallest chi-square value are merged, the chi-square values of the adjacent groups are recomputed after merging, and merging stops when the number of bins reaches the limit, giving the chi-square binning result.
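
As a numeric check of this worked example, the following snippet recomputes the expected counts and the chi-square value of the adjacent real groups [0] and [1, 2] directly from the table above; the result is χ² = 3:

```python
# Actual counts A_ij: rows = groups [0] and [1, 2]; columns = (response, non-response).
A = [[0, 1],      # group [0]:    Group_y = 0, Group_n = 1
     [2, 0]]      # group [1, 2]: Group_y = 2, Group_n = 0

N = sum(map(sum, A))                              # total samples of the two groups: 3
R = [sum(row) for row in A]                       # R_i = Group_s of each group: [1, 2]
C = [A[0][j] + A[1][j] for j in range(2)]         # C_j = class totals: [2, 1]

E = [[R[i] * C[j] / N for j in range(2)] for i in range(2)]   # expected counts E_ij
chi2 = sum((A[i][j] - E[i][j]) ** 2 / E[i][j]
           for i in range(2) for j in range(2))

print(E)      # [[0.666..., 0.333...], [1.333..., 0.666...]]
print(chi2)   # ≈ 3.0
```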
