Chi-square binning method based on secure multi-party computation

Document No.: 1875628    Publication date: 2021-11-23

Reading note: This technology, 一种基于安全多方计算的卡方分箱方法 (A chi-square binning method based on secure multi-party computation), was designed and created by 何道敬, 孙黎彤, 杜润萌, 张民, 张熙 and 廖清 on 2021-08-27. Its main content is as follows: the invention discloses a chi-square binning method based on secure multi-party computation and proposes, for the feature engineering of federated learning, a new way of computing chi-square values. Instead of encrypting all feature data and sending it to the data application party for feature preprocessing, the data provider first groups the feature data by category, mixes in false groups, marks the group categories, and then encrypts the marks and sends the grouping to the data application party. Encrypting only the group categories greatly reduces the amount of data that must be encrypted, and the data application party does not need to decrypt all the feature data, avoiding a huge resource cost. What the data provider sends to the data application party is the grouping information of the feature data; what the data application party obtains after decryption is that grouping information, which does not contain the actual feature values and to which false grouping information has been added, with real and false groups encoded and marked. Compared with transmitting desensitized data or transmitting the real data after encryption, this improves the security of data privacy.

1. A chi-square binning method based on secure multi-party computation, characterized by comprising the following steps:

Step 1: the data provider generates a public key pk and private key sk pair with a homomorphic encryption system; the feature data X = {x_0, x_1, ..., x_(n-1)}, id ∈ [0, n-1], is grouped by placing the ids of data with the same category in the feature data X into one interval, giving s groups denoted x_t, t ∈ [0, s-1], where n and s are positive integers; the real groups x_t are marked as class 1, and the group class is encrypted with the public key pk as E_x = E(1), giving the real grouping information Group_t(x_t, E_x);

Step 2: false groups are constructed by randomly dividing the ids of the feature data X into s group intervals, the same number as the real groups, denoted x_v, v ∈ [0, s-1]; the false groups are marked as class 0, and the group class is encrypted with the public key pk as E_x = E(0), giving the false grouping information Group_v(x_v, E_x);

Step 3: the real grouping information and the false grouping information are concatenated row-wise and the rows are shuffled, giving the grouping information Group_X; the data provider sends the grouping information Group_X(x_i, E_x) to the data application party;

Step 4: the data application party uses the grouping information Group_X(x_i, E_x) and the label data Y = {y_0, y_1, ..., y_i, ..., y_(n-1)}, id ∈ [0, n-1], to obtain the label data y_i corresponding to each group interval x_i; the label values y_i of each group interval x_i are summed to give the number of response samples Group_y in the interval, and from the total number of samples Group_s in the interval the number of non-response samples is computed as Group_n = Group_s - Group_y; the response sample counts Group_y, non-response sample counts Group_n and total sample counts Group_s of all group intervals, together with the group class mark E_x corresponding to each interval, are sent to the data provider;

Step 5: the data provider decrypts the group class marks E_x with the private key, obtaining the decrypted group class labels D_x; if D_x = 1 the group is a real group, and if D_x = 0 the group is a false group and the false grouping information is deleted;

Step 6: from the response sample count Group_y, non-response sample count Group_n and total sample count Group_s of each real group interval, the data provider computes the expected sample count E_ij of the j-th class of the i-th group, i ∈ [0, 2s-1], where j ∈ [0, 2) indexes the response and non-response sample classes; from the expected sample counts E_ij and the actual sample counts A_ij of two adjacent real groups, the chi-square value χ² of the two adjacent real groups is computed;

Step 7: the data provider sets a limit on the number of bins; according to the chi-square values of adjacent groups, the two groups with the smallest chi-square value are merged, the chi-square values of the adjacent groups are recomputed after merging, and merging stops when the number of bins reaches the limit, giving the chi-square binning result.

2. The chi-square binning method based on secure multi-party computation of claim 1, wherein the real group x_t of step 1 contains only the ids of the feature data, id ∈ [0, n-1], and not the actual feature values, so that leakage of the actual feature values is avoided.

3. The chi-square binning method based on secure multi-party computation of claim 1, wherein in step 2 the ids of the feature data X are randomly divided into s group intervals in order to construct false groups, which are mixed with the real groups to protect the real grouping information.

4. The chi-square binning method based on secure multi-party computation of claim 1, wherein the grouping information Group_X(x_i, E_x) of step 3 mixes the false grouping information with the real grouping information, and the classes of the false and real groups are encrypted, to protect the privacy of the feature data.

5. The chi-square binning method based on secure multi-party computation of claim 1, wherein the response sample count Group_y of step 4 is obtained as follows: the grouping information x_i contains the ids of the feature data; mapping these ids to the ids of the label data Y gives the label values corresponding to the grouping information x_i; for example, if the i-th grouping information is x_i = [0, 2], the corresponding label values are [y_0, y_2]; since a response sample has label value 1 and a non-response sample has label value 0, the label values corresponding to the grouping information are summed to give the response sample count Group_y of the group.

6. The chi-square binning method based on secure multi-party computation of claim 1, wherein the non-response sample count Group_n of step 4 is obtained as follows: the number of samples of each group is the number of ids in the grouping information x_i, i.e. the length of x_i, which gives the total sample count Group_s of the group; subtracting the response sample count from the total sample count of the group gives the non-response sample count Group_n.

7. The chi-square binning method based on secure multi-party computation of claim 1, wherein the expected sample count E_ij of the j-th class of the i-th group in step 6 is calculated as:

E_ij = (R_i × C_j) / N

where R_i is the sum of the sample counts of classes j and j+1 of the i-th group, i.e. R_i = Group_s^(i); C_j is the sum of the sample counts of the j-th class in the i-th and (i+1)-th groups, e.g. when j denotes the response sample class, C_j = Group_y^(i) + Group_y^(i+1); and N is the total number of samples of the two adjacent groups, i.e. N = Group_s^(i) + Group_s^(i+1).

8. The chi-square binning method based on secure multi-party computation of claim 1, wherein the chi-square value χ² of step 6 is calculated as:

χ² = Σ_i Σ_j (A_ij - E_ij)² / E_ij

where A_ij is the actual sample count of the j-th class in the i-th group, e.g. if j denotes the response sample class of the i-th group, A_ij = Group_y^(i), and E_ij is the expected sample count of the j-th class in the i-th group.

Technical Field

The invention belongs to the field of federated learning, and particularly relates to a chi-square binning method based on secure multi-party computation.

Background

Before federated learning begins, it is first necessary to build a data set rather than modeling directly on the raw data. The task of converting raw data into a data set is called feature engineering.

Feature selection is an important step in feature engineering. Generally, when a classification model is built, continuous variables first need to be discretized; after feature discretization the model is more stable, which reduces the risk of overfitting. During feature selection a binning operation is often performed: binning discretizes continuous feature data. Binning has many benefits. For example, it is more robust to abnormal data and reduces the interference of such data with modeling; after the feature data is discretized, each feature value has an independent weight, which introduces non-linearity into a logistic regression model and can improve its expressive power; missing feature values can be treated as an independent class and included in the model; and the sparse vectors formed after feature discretization support fast inner-product multiplication, with results that are easy to store and extend. For accurate discretization the data is partitioned into classes: two adjacent intervals can be merged if they have very similar class distributions, otherwise they should remain separate, and a low chi-square value indicates similar class distributions in two adjacent intervals. After the feature data is binned, its chi-square value is calculated; the smaller the chi-square value, the more similar the distributions, and the corresponding intervals can be merged into one bin.

During feature discretization or evaluation of a feature's predictive power, i.e. in the feature preprocessing stage of federated learning, the party lacking the feature label data needs to send its own feature data to the party holding the feature labels for joint feature preprocessing.

In most existing federated learning frameworks, to meet privacy-protection requirements, the data provider encrypts the entire feature matrix with a public key and sends the ciphertext matrix to the data application party, which then decrypts the data with the private key for computation. On large-scale data sets this approach obviously causes huge resource consumption and performance degradation. Some other approaches transmit desensitized data directly for computation, which cannot protect data privacy and does not comply with legal regulations; still others have each participant train independently and then fuse the training results, which cannot fully exploit the value of the data.

Disclosure of Invention

The invention aims to provide a novel chi-square binning method based on secure multi-party computation for accurate discretization of data. After the feature data is binned, its chi-square value is calculated; the smaller the chi-square value, the more similar the distributions, and the corresponding intervals can be merged into one bin.

The specific technical scheme for realizing the purpose of the invention is as follows:

A chi-square binning method based on secure multi-party computation comprises the following steps:

Step 1: the data provider generates a public key pk and private key sk pair with a homomorphic encryption system; the feature data X = {x_0, x_1, ..., x_(n-1)}, id ∈ [0, n-1], is grouped by placing the ids of data with the same category in the feature data X into one interval, giving s groups denoted x_t, t ∈ [0, s-1], where n and s are positive integers; the real groups x_t are marked as class 1, and the group class is encrypted with the public key pk as E_x = E(1), giving the real grouping information Group_t(x_t, E_x);
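
Purely as an illustration of this step, the following minimal sketch uses the python-paillier package (phe) as the homomorphic encryption system; the toy data, variable names and key length are assumptions, not part of the invention:

```python
from phe import paillier

# Data provider: generate the homomorphic key pair (pk, sk).
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Toy feature data held by the provider: id -> category value.
# Only the ids are grouped; the actual feature values never leave the provider.
feature_category = {0: "A", 1: "B", 2: "B", 3: "C", 4: "D", 5: "E", 6: "E", 7: "E"}

# Place ids with the same category into one interval x_t.
groups_by_cat = {}
for idx, cat in feature_category.items():
    groups_by_cat.setdefault(cat, []).append(idx)
real_groups = list(groups_by_cat.values())   # s groups, e.g. [[0], [1, 2], [3], [4], [5, 6, 7]]

# Mark each real group as class 1 and encrypt the mark with pk: E_x = E(1).
real_grouping_info = [(ids, public_key.encrypt(1)) for ids in real_groups]
```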

Step 2: false groups are constructed by randomly dividing the ids of the feature data X into s group intervals, the same number as the real groups, denoted x_v, v ∈ [0, s-1]; the false groups are marked as class 0, and the group class is encrypted with the public key pk as E_x = E(0), giving the false grouping information Group_v(x_v, E_x);
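
A sketch of the false-group construction: the helper below randomly splits the ids into s non-empty intervals, and the class-0 marks would then be encrypted with the same public key as in step 1 (the function name and toy sizes are illustrative assumptions):

```python
import random

def make_false_groups(ids, s):
    """Randomly partition the ids into s non-empty intervals (the false groups)."""
    ids = list(ids)
    random.shuffle(ids)
    cuts = sorted(random.sample(range(1, len(ids)), s - 1))   # s-1 distinct cut points
    bounds = [0] + cuts + [len(ids)]
    return [sorted(ids[bounds[k]:bounds[k + 1]]) for k in range(s)]

# 8 ids split into the same number of intervals as the real grouping (s = 5).
false_groups = make_false_groups(range(8), 5)
# false_grouping_info = [(g, public_key.encrypt(0)) for g in false_groups]
```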

Step 3: the real grouping information and the false grouping information are concatenated row-wise and the rows are shuffled, giving the grouping information Group_X; the data provider sends the grouping information Group_X(x_i, E_x) to the data application party;
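
A sketch of the row-wise mixing and shuffling of step 3; the string marks below stand in for the Paillier ciphertexts E(1)/E(0) so the snippet runs on its own:

```python
import random

def mix_and_shuffle(real_info, false_info):
    """Concatenate real and false grouping rows and shuffle them row-wise."""
    mixed = list(real_info) + list(false_info)
    random.shuffle(mixed)
    return mixed   # Group_X(x_i, E_x), sent to the data application party

group_X = mix_and_shuffle([([0], "E(1)"), ([1, 2], "E(1)")],
                          [([0, 1, 2], "E(0)")])
```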

Step 4: the data application party uses the grouping information Group_X(x_i, E_x) and the label data Y = {y_0, y_1, ..., y_i, ..., y_(n-1)}, id ∈ [0, n-1], to obtain the label data y_i corresponding to each group interval x_i; the label values y_i of each group interval x_i are summed to give the number of response samples Group_y in the interval, and from the total number of samples Group_s in the interval the number of non-response samples is computed as Group_n = Group_s - Group_y; the response sample counts Group_y, non-response sample counts Group_n and total sample counts Group_s of all group intervals, together with the group class mark E_x corresponding to each interval, are sent to the data provider;
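
A sketch of the aggregation on the data application side; it sees only id lists and opaque ciphertext marks, never the feature values (string marks again stand in for the ciphertexts, and the label vector is the toy Y used in the later example):

```python
def aggregate_per_group(group_X, labels):
    """For each received group, return (Group_y, Group_n, Group_s, E_x)."""
    stats = []
    for ids, mark in group_X:
        group_s = len(ids)                        # total samples in the interval
        group_y = sum(labels[i] for i in ids)     # response samples (label value 1)
        group_n = group_s - group_y               # non-response samples
        stats.append((group_y, group_n, group_s, mark))
    return stats                                  # sent back to the data provider

Y = [0, 1, 1, 1, 0, 0, 1, 1]                      # label data held by the application party
stats = aggregate_per_group([([0], "E(1)"), ([1, 2], "E(1)"), ([0, 1, 2], "E(0)")], Y)
# -> [(0, 1, 1, 'E(1)'), (2, 0, 2, 'E(1)'), (2, 1, 3, 'E(0)')]
```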

Step 5: the data provider decrypts the group class marks E_x with the private key, obtaining the decrypted group class labels D_x; if D_x = 1 the group is a real group, and if D_x = 0 the group is a false group and the false grouping information is deleted;

Step 6: from the response sample count Group_y, non-response sample count Group_n and total sample count Group_s of each real group interval, the data provider computes the expected sample count E_ij of the j-th class of the i-th group, i ∈ [0, 2s-1], where j ∈ [0, 2) indexes the response and non-response sample classes; from the expected sample counts E_ij and the actual sample counts A_ij of two adjacent real groups, the chi-square value χ² of the two adjacent real groups is computed;

Step 7: the data provider sets a limit on the number of bins; according to the chi-square values of adjacent groups, the two groups with the smallest chi-square value are merged, the chi-square values of the adjacent groups are recomputed after merging, and merging stops when the number of bins reaches the limit, giving the chi-square binning result.
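
The provider-side logic of steps 5-7 can be sketched as below. The marks are assumed to be already decrypted with sk (1 = real, 0 = false), the real groups are assumed to be restored to the feature's value order, and the toy counts loosely follow the example in the Detailed Description (only one false row is shown); function names are illustrative:

```python
def chi2_adjacent(a, b):
    """Chi-square of two adjacent groups; a and b are (Group_y, Group_n) pairs."""
    rows = [a, b]
    n = sum(sum(r) for r in rows)                    # N: total samples of the two groups
    col = [a[0] + b[0], a[1] + b[1]]                 # C_j: class totals over both groups
    chi2 = 0.0
    for r in rows:
        for j in range(2):
            e = sum(r) * col[j] / n                  # expected count E_ij = R_i * C_j / N
            if e > 0:
                chi2 += (r[j] - e) ** 2 / e          # add (A_ij - E_ij)^2 / E_ij
    return chi2

def chi_merge(groups, max_bins):
    """groups: (Group_y, Group_n) per real interval, in value order."""
    groups = list(groups)
    while len(groups) > max_bins:
        chis = [chi2_adjacent(groups[k], groups[k + 1]) for k in range(len(groups) - 1)]
        k = chis.index(min(chis))                    # merge the pair with the smallest chi-square
        merged = (groups[k][0] + groups[k + 1][0], groups[k][1] + groups[k + 1][1])
        groups[k:k + 2] = [merged]
    return groups

# Step 5: keep only groups whose decrypted mark is 1, then merge down to 3 bins.
decrypted = [(1, (0, 1)), (1, (2, 0)), (0, (2, 1)), (1, (1, 0)), (1, (0, 1)), (1, (2, 1))]
real = [counts for mark, counts in decrypted if mark == 1]
print(chi_merge(real, max_bins=3))
```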

The real group x_t of step 1 contains only the ids of the feature data, id ∈ [0, n-1], and not the actual feature values, so that leakage of the actual feature values is avoided.

In step 2 the ids of the feature data X are randomly divided into s group intervals in order to construct false groups, which are mixed with the real groups to protect the real grouping information.

The grouping information Group_X(x_i, E_x) of step 3 mixes the false grouping information with the real grouping information, and the classes of the false and real groups are encrypted, protecting the privacy of the feature data.

The response sample count Group_y of step 4 is obtained as follows: the grouping information x_i contains the ids of the feature data; mapping these ids to the ids of the label data Y gives the label values corresponding to the grouping information x_i; for example, if the i-th grouping information is x_i = [0, 2], the corresponding label values are [y_0, y_2]; since a response sample has label value 1 and a non-response sample has label value 0, the label values corresponding to the grouping information are summed to give the response sample count Group_y of the group.

The non-response sample count Group_n of step 4 is obtained as follows: the number of samples of each group is the number of ids in the grouping information x_i, i.e. the length of x_i, which gives the total sample count Group_s of the group; subtracting the response sample count from the total sample count of the group gives the non-response sample count Group_n.

The expected sample count E_ij of the j-th class of the i-th group in step 6 is calculated as:

E_ij = (R_i × C_j) / N

where R_i is the sum of the sample counts of classes j and j+1 of the i-th group, i.e. R_i = Group_s^(i); C_j is the sum of the sample counts of the j-th class in the i-th and (i+1)-th groups, e.g. when j denotes the response sample class, C_j = Group_y^(i) + Group_y^(i+1); and N is the total number of samples of the two adjacent groups, i.e. N = Group_s^(i) + Group_s^(i+1).

The chi-square value χ² of step 6 is calculated as:

χ² = Σ_i Σ_j (A_ij - E_ij)² / E_ij

where A_ij is the actual sample count of the j-th class in the i-th group, e.g. if j denotes the response sample class of the i-th group, A_ij = Group_y^(i), and E_ij is the expected sample count of the j-th class in the i-th group.

Advantages of the invention

In terms of security, the invention protects data privacy during chi-square binning in the feature engineering stage of federated learning. The feature data is grouped, with the data index ids of the same class forming the real grouping information, and false grouping information is added; the real group class is marked as 1 and the false group class as 0, the 0/1 codes of the group classes are encrypted, and the real grouping information is mixed with the false grouping information before being sent to the data application party. The data application party does not learn the specific feature values of any group, only the ids corresponding to the feature data, and false groups are mixed in, so the privacy of the feature data is protected.

In terms of operation efficiency, not all feature values need to be encrypted and sent to the data application party; only the group class of the feature data is encrypted, which avoids the computational overhead of encrypting and decrypting large amounts of data. The efficiency gain is especially significant on large data sets.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following specific example and the accompanying drawing. Except for the contents specifically mentioned below, the procedures, conditions and experimental methods for carrying out the present invention are common general knowledge in the art, and the present invention is not particularly limited thereto.

Examples

Taking the chi-square binning of the data provider's feature data X as an example, the chi-square binning method based on secure multi-party computation proceeds as follows:

First, the data provider places the ids of data with the same category in the feature data X into one interval; the grouping result is x_t = [0], [1, 2], [3], [4], [5, 6, 7], a total of 5 groups. These groups are real, so they are marked as class 1 and the group class is encrypted with the public key pk as E_x = E(1), giving the real grouping information Group_t(x_t, E_x), whose specific content is as follows:

x_t          E_x
[0]          E(1)
[1, 2]       E(1)
[3]          E(1)
[4]          E(1)
[5, 6, 7]    E(1)

Second, false groups are constructed by randomly dividing the ids of the feature data X into s intervals; the grouping result is x_v = [0, 1, 2], [3, 4], [5], [6], [7], 5 groups in total, the same number as the real groups. These groups are marked as false (class 0) and the group class is encrypted with the public key pk as E_x = E(0), giving the false grouping information Group_v(x_v, E_x), whose specific content is as follows:

x_v          E_x
[0, 1, 2]    E(0)
[3, 4]       E(0)
[5]          E(0)
[6]          E(0)
[7]          E(0)

Then the real grouping information Group_t(x_t, E_x) and the false grouping information Group_v(x_v, E_x) are concatenated row-wise and the rows are shuffled, giving the grouping information Group_X(x_i, E_x), which is sent to the data application party; its specific content is as follows:

x_i          E_x
[0, 1, 2]    E(0)
[3, 4]       E(0)
[0]          E(1)
[5]          E(0)
[1, 2]       E(1)
[3]          E(1)
[6]          E(0)
[7]          E(0)
[4]          E(1)
[5, 6, 7]    E(1)

Then the data application party maps the ids in the grouping information Group_X to the ids of the label data Y = {0, 1, 1, 1, 0, 0, 1, 1} to obtain the label values corresponding to each group interval; the label values y_i of each group interval x_i are summed to give the response sample count Group_y of the interval, and from the total sample count Group_s of the interval the non-response sample count is computed as

Group_n = Group_s - Group_y

Then, the number Group of response samples of all the grouping intervalsyNumber of unresponsive samples GroupnTotal number of samples GroupsAnd a packet type flag E corresponding to the packet sectionxSending the data to a data provider;

The data provider decrypts the group class marks E_x with the private key sk to recover the real grouping information: a group whose decrypted class mark equals 1 is a real group. From the response sample count Group_y, non-response sample count Group_n and total sample count Group_s of each real group interval, the expected sample count E_ij of the j-th class of the i-th group is computed, where j ∈ [0, 2) indexes the response and non-response sample classes. Taking the two adjacent real group intervals [0] and [1, 2] as an example, their chi-square value is computed; the information of the two adjacent real groups is as follows:

Group index   Interval   Group_y   Group_n   R_i (Group_s)
0             [0]        0         1         1
1             [1, 2]     2         0         2
              C_j        2         1         N = 3

For the group interval [0], the response sample count is Group_y^(0) = 0, the total sample count is Group_s^(0) = 1, and the non-response sample count is Group_n^(0) = 1. The expected sample counts of this group are E_00 = (R_0 × C_0) / N = (1 × 2) / 3 ≈ 0.67 and E_01 = (R_0 × C_1) / N = (1 × 1) / 3 ≈ 0.33.

The expected sample counts of the group [1, 2] are obtained in the same way: E_10 = (2 × 2) / 3 ≈ 1.33 and E_11 = (2 × 1) / 3 ≈ 0.67. From the expected sample counts E_ij and the actual sample counts A_ij of the two adjacent real groups, the chi-square value χ² of the two adjacent real groups is finally computed; here χ² = 3.

The data provider sets a limit on the number of bins; according to the chi-square values χ² of adjacent groups, the two groups with the smallest chi-square value are merged, the chi-square values of the adjacent groups are recomputed after merging, and merging stops when the number of bins reaches the limit, giving the chi-square binning result.
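
As a numeric check of this worked example, the following snippet recomputes the expected counts and the chi-square value of the adjacent real groups [0] and [1, 2] directly from the table above; the result is χ² = 3:

```python
# Actual counts A_ij: rows = groups [0] and [1, 2]; columns = (response, non-response).
A = [[0, 1],      # group [0]:    Group_y = 0, Group_n = 1
     [2, 0]]      # group [1, 2]: Group_y = 2, Group_n = 0

N = sum(map(sum, A))                              # total samples of the two groups: 3
R = [sum(row) for row in A]                       # R_i = Group_s of each group: [1, 2]
C = [A[0][j] + A[1][j] for j in range(2)]         # C_j = class totals: [2, 1]

E = [[R[i] * C[j] / N for j in range(2)] for i in range(2)]   # expected counts E_ij
chi2 = sum((A[i][j] - E[i][j]) ** 2 / E[i][j]
           for i in range(2) for j in range(2))

print(E)      # [[0.666..., 0.333...], [1.333..., 0.666...]]
print(chi2)   # ≈ 3.0
```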
