Distributed and collaborative analysis of encrypted data using a deep polynomial network

Document No.: 1909601    Publication date: 2021-11-30

Abstract: The technology described herein, "Distributed and collaborative analysis of encrypted data using a deep polynomial network", was created by Shi-Xiong Zhang and Dong Yu on 2020-06-26. The present disclosure relates to a cloud-local federated or collaborative data analysis framework that provides data analysis models that are trained and hosted in back-end servers for processing data items that are pre-processed and encrypted by remote terminal devices. The data analysis model is configured to generate an encrypted output data item and then transmit the encrypted output data item to the local terminal device for decryption and post-processing. The framework functions without exposing the decryption key of the local terminal device to the backend server and the communication network. Encryption/decryption and data analysis in the back-end server are configured to efficiently process and transmit data items to provide real-time or near real-time system responses to data analysis requests from remote terminal devices.

1. A system for providing remote data analysis, comprising:

a communication interface;

a memory for storing a deep learning neural network; and

circuitry in communication with the communication interface and the memory and configured to:

receive an encrypted data item from a remote terminal device through the communication interface;

propagate the encrypted data item forward through the deep-learning neural network in encrypted form to obtain an encrypted output data item; and

transmit the encrypted output data item to the remote terminal device through the communication interface, wherein the deep-learning neural network is trained using unencrypted training data and includes neurons interconnected into multiple layers, and wherein at least one activation operation and at least one pooling operation of the deep-learning neural network are polynomial.

2. The system of claim 1, wherein the remote data analysis comprises a remote speech recognition service.

3. The system of claim 2, wherein:

the encrypted data item comprises concatenated features of frames of an audio waveform of predetermined frame duration derived at the remote terminal device using a speech perception model and subsequently encrypted at the remote terminal device;

the deep learning neural network comprises an acoustic model for processing the concatenated features encrypted by the remote terminal device into the encrypted output data item; and

the encrypted output data items of the deep-learning neural network include probability vectors corresponding to a phone codebook.

4. The system of claim 1, wherein the at least one pooling operation is polynomial using scaled average pooling.

5. The system of claim 1, wherein the at least one activation operation is polynomial using a cubic polynomial approximation of a sigmoid function.

6. The system of claim 1, wherein the encrypted data item is encrypted based on a public key at the remote terminal device.

7. The system of claim 6, wherein at least one subset of model parameters of the deep-learning neural network trained using unencrypted training data remains unencrypted for forward propagation of the encrypted data item.

8. The system of claim 7, wherein the subset of model parameters includes a plurality of weights and a plurality of batch normalization parameters.

9. The system of claim 8, wherein the subset of model parameters further comprises a plurality of convolution kernels.

10. The system of claim 1, wherein the deep-learning neural network is trained by:

initially training the deep learning neural network, wherein a set of model parameters is trained to a first precision; and

retraining the deep-learning neural network by quantizing the set of model parameters to a second precision less than the first precision during forward and backward propagation of training data.

11. The system of claim 10, wherein the quantization levels of the model parameter sets are determined by calculating a statistical distribution of the model parameter sets of the first precision such that denser quantization levels are assigned around more concentrated values of the model parameter sets.

12. The system of claim 10, wherein the first precision and the second precision are represented by a first predetermined number of parameter bits and a second predetermined number of parameter bits, respectively, of the model parameter set, and wherein the second predetermined number of parameter bits is 8.

13. The system of claim 1, wherein the deep-learning neural network comprises a perceptual model and a following acoustic model, wherein:

the encrypted data item comprises an encrypted frame of an audio waveform of a predetermined frame duration transmitted from the remote terminal device;

the perceptual model is configured to convert the encrypted frames of audio waveforms into perceptual features; and

the acoustic model is configured to convert the perceptual features into the encrypted output data items, the encrypted output data items comprising probability vectors corresponding to a phone codebook of the deep-learning neural network.

14. The system of claim 1, wherein the deep-learning neural network comprises an acoustic model and a following language model, and wherein:

the encrypted data item comprises an encrypted perceptual feature of each of a plurality of frames of an audio waveform of a predetermined frame duration transmitted from the remote terminal device;

the acoustic model is configured to convert the encrypted data item into a plurality of encrypted probability vectors corresponding to a phone codebook, each encrypted probability vector corresponding to one of the plurality of frames of an audio waveform; and

the language model is configured to convert the plurality of encrypted probability vectors into the encrypted output data item, the encrypted output data item comprising an encrypted text segment.

15. The system of claim 1, wherein the deep-learning neural network comprises a perceptual model, a following acoustic model, and a following language model, and wherein:

the encrypted data item comprises each encrypted frame of a plurality of encrypted frames of an audio waveform of a predetermined frame duration transmitted from the remote terminal device;

the perceptual model is configured to convert the plurality of encrypted frames of audio waveforms into a plurality of sets of perceptual features;

the acoustic model is configured to convert the plurality of sets of perceptual features into a plurality of encrypted probability vectors corresponding to a phone codebook, each encrypted probability vector corresponding to one of the plurality of frames of an audio waveform; and

the language model is configured to convert the plurality of encrypted probability vectors into the encrypted output data item, the encrypted output data item comprising an encrypted text segment.

16. A system for providing remote data analysis, comprising:

a terminal device;

a remote server, comprising:

a communication interface;

a memory for storing a deep learning neural network; and

circuitry in communication with the communication interface and the memory,

wherein the circuitry of the terminal device and the remote server are configured to:

encrypt, by the terminal device, a data item to obtain an encrypted data item;

send, by the terminal device, the encrypted data item to the communication interface of the remote server;

receive, by the circuitry, the encrypted data item from the terminal device via the communication interface;

propagate, by the circuitry, the encrypted data item forward through the deep-learning neural network in encrypted form to obtain an encrypted output data item;

send, by the circuitry, the encrypted output data item to the terminal device via the communication interface;

receive, by the terminal device, the encrypted output data item from the remote server; and

decrypt, by the terminal device, the encrypted output data item to obtain a decrypted data item,

wherein the deep-learning neural network is trained using unencrypted training data and includes neurons interconnected into multiple layers, and wherein at least one activation operation and at least one pooling operation of the deep-learning neural network are polynomial.

17. The system of claim 16, wherein:

the remote data analysis comprises a remote speech recognition service;

the encrypted data item comprises concatenated features of frames of an audio waveform of predetermined frame duration derived at the terminal device using a speech perception model and subsequently encrypted at the terminal device;

the deep learning neural network comprises an acoustic model for processing the concatenated features encrypted by the terminal device into the encrypted output data item; and

the encrypted output data items of the deep-learning neural network include probability vectors corresponding to a phone codebook.

18. The system of claim 16, wherein at least one subset of model parameters of the deep-learning neural network trained using unencrypted training data remains unencrypted for forward propagation of the encrypted data item.

19. The system of claim 16, wherein the deep-learning neural network is trained by:

initially training the deep learning neural network, wherein a set of model parameters is trained to a first precision; and

retraining the deep-learning neural network by quantizing the set of model parameters to a second precision less than the first precision during forward and backward propagation of training data.

20. A method for providing remote data analysis performed by a server, the server comprising a communication interface, a memory for storing a deep learning neural network, and circuitry in communication with the communication interface and the memory, the method comprising:

receiving, by the circuitry, an encrypted data item from a remote terminal device via the communication interface;

propagating forward, by the circuitry, the encrypted data item through the deep-learning neural network in encrypted form to obtain an encrypted output data item; and

transmitting, by the circuitry, the encrypted output data item to the remote terminal device via the communication interface,

wherein the deep-learning neural network is trained using unencrypted training data and includes neurons interconnected into multiple layers, and wherein at least one activation operation and at least one pooling operation of the deep-learning neural network are polynomial.

Technical Field

The present disclosure relates generally to data analysis and, more particularly, to speech recognition of encrypted data as a real-time on-demand service.

Background

Analysis of some data items may be based on complex data processing models (e.g., artificial intelligence models based on neural networks) that are more suitable for deployment in powerful and centrally managed remote backend servers. Furthermore, these models may require a great deal of effort to generate, and developers of these models may prefer centralized deployment over distributing the models to local terminal devices, to avoid leaking their algorithms. Thus, such data analysis may be provided as a remote on-demand service. For example, a local terminal device that needs such a data analysis service may transmit a data item to a remote backend server through a communication network and then receive the result after the remote backend server performs the data analysis. In some cases, the data item may be sensitive or confidential and may not be exposed to the communication network and/or remote backend servers in unencrypted form. Thus, for some security applications, it may be necessary to encrypt the data item at the local terminal device before sending a request for data analysis services to the remote backend server. Accordingly, a data analysis model deployed in a remote backend server may need to be configured to process encrypted data items without access to any decryption keys. Special data encryption/decryption algorithms may be developed that leave the data analysis model nearly invariant between encrypted and unencrypted input data, but such algorithms can be very complex and require a significant amount of time to run on the local terminal device. Such algorithms are therefore impractical for many applications that require a real-time or near real-time response to data analysis needs, including but not limited to conversational speech recognition applications.

Disclosure of Invention

The present disclosure relates to a cloud-local federated or collaborative data analysis framework that provides data analysis models that are trained and hosted in back-end servers for processing data items that are pre-processed and encrypted by remote terminal devices. The data analysis model is configured to generate an encrypted output data item and then transmit the encrypted output data item to the local terminal device for decryption and post-processing. The framework functions without exposing the secret decryption key of the local terminal device to the backend server or to the communication network between the local terminal device and the backend server. Thus, in addition to protecting the data analysis models from theft by deploying the models in a back-end server controlled by the model developer (rather than in the terminal devices), the framework also provides privacy protection for user data. Encryption/decryption and data analysis in the back-end server are configured to efficiently process and transmit data items to provide real-time or near real-time system responses to data analysis requests from remote terminal devices. For example, the framework can be applied to provide a remote real-time on-demand speech recognition service.

In one implementation, a system for providing remote data analysis is disclosed. The system comprises: a communication interface; a memory for storing a deep learning neural network; and circuitry in communication with the communication interface and the memory. The circuitry may be configured to: receive an encrypted data item from a remote terminal device through the communication interface; forward propagate the encrypted data item through the deep learning neural network in encrypted form to obtain an encrypted output data item; and transmit the encrypted output data item to the remote terminal device via the communication interface. The deep-learning neural network is trained using unencrypted training data and includes neurons interconnected into multiple layers, wherein at least one activation operation and at least one pooling operation of the deep-learning neural network are polynomial.

In the above implementation, the remote data analysis includes a remote speech recognition service. In any of the above implementations, the encrypted data item includes concatenated features of frames of the audio waveform of a predetermined frame duration derived at the remote terminal device using a speech perception model and subsequently encrypted at the remote terminal device; the deep learning neural network comprises an acoustic model for processing the concatenated features encrypted by the remote terminal device into encrypted output data items; and the encrypted output data items of the deep-learning neural network include probability vectors corresponding to the phone codebook.

In any of the above implementations, the at least one pooling operation is polynomial using scaled average pooling. In any of the above implementations, at least one of the activation operations is polynomial using a cubic polynomial approximation of a sigmoid function. In any of the above implementations, the encrypted data item may be encrypted based on a public key at the remote terminal device. In any of the above implementations, at least one subset of model parameters of the deep-learning neural network trained using unencrypted training data remains unencrypted for forward propagation of encrypted data items.
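
The two polynomial substitutions above can be sketched as follows. This is an illustrative sketch only: the cubic coefficients below are a common least-squares fit of the sigmoid over roughly [-4, 4], not values given in this disclosure, and the pooling replaces division by the window size with multiplication by a plaintext constant, so both operations use only additions and multiplications and stay compatible with homomorphic evaluation:

```python
import numpy as np

# Cubic polynomial approximation of the sigmoid (illustrative
# coefficients; this disclosure does not specify the exact values).
def poly_sigmoid(x):
    return 0.5 + 0.197 * x - 0.004 * x ** 3

# Scaled average pooling: sum pooling over non-overlapping windows of
# size k, then multiplication by the plaintext constant 1/k, avoiding
# any division of ciphertext values.
def scaled_average_pool(x, k):
    return x.reshape(-1, k).sum(axis=1) * (1.0 / k)

print(round(poly_sigmoid(0.0), 3))             # 0.5
print(scaled_average_pool(np.arange(8.0), 4))  # [1.5 5.5]
```

Because a ciphertext can only be added to or multiplied by other values under such schemes, replacing the transcendental sigmoid and the division in average pooling with these polynomial forms is what makes server-side forward propagation on encrypted inputs possible.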

In any of the above implementations, the subset of model parameters includes a plurality of weights and a plurality of batch normalization parameters. In any of the above implementations, the subset of model parameters further includes a plurality of convolution kernels.

In any of the above implementations, the deep learning neural network may be trained as follows: initially training a deep learning neural network, wherein a set of model parameters is trained to a first precision; and retraining the deep learning neural network by quantizing the set of model parameters to a second precision less than the first precision during forward and backward propagation of the training data.
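
A minimal sketch of this two-stage procedure, using a toy one-parameter linear model in place of the network (the uniform quantizer, its range, and the learning rates are illustrative assumptions, not values from this disclosure):

```python
import numpy as np

def quantize(w, num_bits=8, lo=-2.0, hi=2.0):
    # Uniform quantizer over a fixed range -- a simplified stand-in for
    # the lower-precision (second-precision) parameter representation.
    step = (hi - lo) / (2 ** num_bits - 1)
    return lo + np.round((w - lo) / step) * step

rng = np.random.default_rng(0)
x = rng.normal(size=256)
y = 1.7 * x                                 # ground-truth weight 1.7

w = 0.0
for _ in range(200):                        # stage 1: train at full precision
    w -= 0.1 * np.mean((w * x - y) * x)

for _ in range(50):                         # stage 2: retrain with quantized
    wq = quantize(w)                        # parameters in the forward and
    w -= 0.1 * np.mean((wq * x - y) * x)    # backward passes
```

After stage 2, the quantized weight sits on the low-precision grid while remaining close to the full-precision optimum, which is the point of quantization-aware retraining.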

In any of the above implementations, the quantization levels of the model parameter sets are determined by computing a statistical distribution of the model parameter sets of the first precision such that denser quantization levels are assigned around more concentrated values of the model parameter sets. In any of the above implementations, the first precision and the second precision are represented by a first predetermined number of parameter bits and a second predetermined number of parameter bits of the model parameter set, respectively, wherein the second predetermined number of parameter bits is 8.
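
One plausible realization of distribution-aware level placement is to put the levels at equally spaced quantiles of the empirical parameter distribution; this quantile method is an assumption for illustration, since the disclosure only requires that denser levels be assigned where parameter values concentrate:

```python
import numpy as np

def quantile_levels(params, num_bits):
    # Place 2**num_bits quantization levels at equally spaced quantiles
    # of the empirical parameter distribution, so regions where
    # parameters concentrate receive more closely spaced levels.
    qs = np.linspace(0.0, 100.0, 2 ** num_bits)
    return np.percentile(params, qs)

def quantize_to_levels(params, levels):
    # Snap each parameter to its nearest quantization level.
    idx = np.argmin(np.abs(params[:, None] - levels[None, :]), axis=1)
    return levels[idx]

params = np.random.default_rng(1).normal(0.0, 0.05, size=10_000)
levels = quantile_levels(params, 4)    # 16 levels, for illustration
spacing = np.diff(levels)
# Spacing is tightest near the distribution's mode and widest in the tails.
print(spacing.min() < spacing.max())   # True
```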

In any of the above implementations, the deep learning neural network includes a perceptual model and a following acoustic model. The encrypted data item includes an encrypted frame of an audio waveform of a predetermined frame duration transmitted from the remote terminal device. The perceptual model is configured to convert the encrypted frames of the audio waveform into perceptual features. The acoustic model is configured to convert the perceptual features into encrypted output data items, the encrypted output data items comprising probability vectors corresponding to a phone codebook of the deep-learning neural network.

In any of the above implementations, the deep-learning neural network includes an acoustic model and a following language model. The encrypted data item includes an encrypted perceptual characteristic of each of a plurality of frames of an audio waveform of a predetermined frame duration transmitted from the remote terminal device. The acoustic model is configured to convert the encrypted data item into a plurality of encrypted probability vectors corresponding to a phone codebook, each encrypted probability vector corresponding to one of the plurality of frames of the audio waveform. The language model is configured to convert the plurality of encrypted probability vectors into an encrypted output data item, the encrypted output data item comprising an encrypted text segment.

In any of the above implementations, the deep learning neural network includes a perception model, a following acoustic model, and a following language model. The encrypted data item includes each encrypted frame of a plurality of encrypted frames of an audio waveform of a predetermined frame duration transmitted from the remote terminal device. The perceptual model is configured to convert a plurality of encrypted frames of an audio waveform into a plurality of sets of perceptual features. The acoustic model is configured to convert the plurality of sets of perceptual features into a plurality of encrypted probability vectors corresponding to a phone codebook, each encrypted probability vector corresponding to one of a plurality of frames of an audio waveform. The language model is configured to convert the plurality of encrypted probability vectors into an encrypted output data item, the encrypted output data item comprising an encrypted text segment.
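
Schematically, the three-stage composition described above can be sketched as a simple function pipeline. The stage callables below are plain-number placeholders introduced only to show the data flow; in the disclosed system each stage would operate on homomorphically encrypted tensors inside the back-end server:

```python
# Placeholder composition of perception -> acoustic -> language stages.
def pipeline(perception, acoustic, language, encrypted_frames):
    features = [perception(f) for f in encrypted_frames]   # one per frame
    phone_probs = [acoustic(feat) for feat in features]    # one per frame
    return language(phone_probs)     # single encrypted output (text segment)

# Toy stand-in stages on plain numbers, for illustration only.
out = pipeline(lambda f: f * 2, lambda f: f + 1, sum, [1, 2, 3])
print(out)  # 15
```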

In another implementation, a system for providing remote data analysis is provided. The system comprises a terminal device and a remote server. The remote server includes: a communication interface; a memory for storing a deep learning neural network; and circuitry in communication with the communication interface and the memory. The circuitry of the terminal device and the remote server is configured to: encrypting the data item by the terminal device to obtain an encrypted data item; sending the encrypted data item to a communication interface of a remote server through a terminal device; receiving, by the circuitry, an encrypted data item from the terminal device via the communication interface; forward propagating, by the circuitry, the encrypted data item through the deep learning neural network in encrypted form to obtain an encrypted output data item; transmitting, by the circuitry, the encrypted output data item to the terminal device via the communication interface; receiving, by the terminal device, an encrypted output data item from the remote server; and decrypting the encrypted output data item by the terminal device to obtain a decrypted data item. The deep-learning neural network is trained using unencrypted training data and includes neurons interconnected into multiple layers, wherein at least one activation operation and at least one pooling operation of the deep-learning neural network are polynomial.

In the above implementation, the remote data analysis includes a remote speech recognition service. The encrypted data item comprises concatenated features of frames of the audio waveform of a predetermined frame duration derived at the terminal device using a speech perception model and subsequently encrypted at the terminal device. The deep learning neural network includes an acoustic model for processing the concatenated features encrypted by the terminal device into an encrypted output data item. The encrypted output data items of the deep-learning neural network include probability vectors corresponding to a phone codebook.

In any of the above system implementations, at least one subset of model parameters of the deep-learning neural network trained using unencrypted training data remains unencrypted for forward propagation of encrypted data items.

In yet another implementation, a method for providing remote data analysis performed by a server comprising a communication interface, a memory for storing a deep learning neural network, and circuitry in communication with the communication interface and the memory is disclosed. The method comprises the following steps: receiving, by the circuitry, an encrypted data item from the remote terminal device via the communication interface; forward propagating, by the circuitry, the encrypted data item through the deep learning neural network in encrypted form to obtain an encrypted output data item; and transmitting, by the circuitry, the encrypted output data item to the remote terminal device via the communication interface. The deep-learning neural network is trained using unencrypted training data and includes neurons interconnected into multiple layers, wherein at least one activation operation and at least one pooling operation of the deep-learning neural network are polynomial.

Drawings

FIG. 1 illustrates an example system for providing remote data analysis services.

FIGS. 2a and 2b illustrate various implementations of the system of FIG. 1 for providing data analysis services.

FIG. 3 illustrates an example implementation of distributed and collaborative data analysis for providing encrypted data items.

FIG. 4 illustrates an example implementation for providing distributed and collaborative data analysis of data items encrypted using a particular encryption scheme.

FIG. 5 illustrates an example data analysis pipeline for speech recognition of digitized input audio waveforms.

FIG. 6 illustrates an example implementation for providing distributed and collaborative speech recognition of encrypted audio data items.

FIG. 7 illustrates another example implementation for providing distributed and collaborative speech recognition of encrypted audio data items.

FIG. 8 illustrates another example implementation for providing distributed and collaborative speech recognition of encrypted audio data items.

FIG. 9 illustrates yet another example implementation for providing distributed and collaborative speech recognition of encrypted audio data items.

FIG. 10 illustrates a more detailed implementation of the distributed and collaborative speech recognition process of FIG. 6, showing a deep learning neural network deployed in a back-end server.

FIG. 11 illustrates a polynomial representation of a conventional deep learning neural network for speech recognition.

FIG. 12 illustrates a logic flow for training a deep polynomial neural network.

FIG. 13 illustrates an example quantization of a model parameter value space for a deep polynomial network.

FIG. 14 illustrates retraining of a deep polynomial network using floating-point calculations but with spatially non-uniform quantization of the floating-point values into a low-resolution fixed-point value space.

FIG. 15 illustrates various electronic elements of a computing device that may be used as the terminal device or server device in FIG. 1.

Detailed Description

Analysis of complex data may rely on data processing pipelines that require powerful computing power and large memory space. Such a data processing pipeline may include various types of data processing components and data processing models, and may be hosted in a backend server to remotely serve local end devices. In particular, these data processing pipelines may be hosted in back-end servers in the form of virtual machines that utilize virtual computing resources distributed in a cloud platform. A local terminal device that needs such a data analysis service can send a data item and a request for processing the data item to a remote backend server through a communication network and then receive the result after the remote backend server performs the requested data analysis.

In many security applications, data items requiring data analysis services may be sensitive or confidential and may not be exposed to the communication network and/or remote backend servers in unencrypted form. For these applications, the data item may need to be encrypted before it leaves the local terminal device and before the request for data analysis services is sent to the remote backend server. The back-end server may be provided with the encrypted data item for security purposes without having to access a decryption key, and therefore must provide a data analysis service by processing the data item in encrypted form. Thus, the data processing components and data processing models included in the data analysis pipeline hosted in the back-end server may need to be able to process encrypted data.

The present disclosure relates to a cloud-local federated or collaborative data analysis framework that provides data analysis models that are trained and hosted in back-end servers for processing data items that are pre-processed and encrypted by remote terminal devices. A data analysis model hosted in the backend server generates an encrypted output data item, which is then transmitted to the local terminal device requesting the data analysis service for decryption and post-processing. The framework disclosed herein functions without exposing the decryption key of the local terminal device to the backend server and the communication network between the local terminal device and the backend server. Thus, the framework disclosed herein provides data privacy protection. Encryption/decryption and data analysis in the back-end are configured to efficiently process and transmit data items to provide real-time or near real-time system responses to data analysis requests from remote end devices.

Furthermore, the data analysis models hosted in the back-end server and the operation of these data analysis models and their training are adapted and modified so that the data analysis models can process the data items in encrypted form. The same data analysis model may be used to provide services to different clients, each with its own decryption key. The framework and data analysis models may be used to provide remote on-demand speech recognition services and other types of data analysis services.

FIG. 1 illustrates an example system 100 for providing remote data analysis services. The system 100 includes a backend server 112 deployed in a cloud platform 110 and a remote terminal device 120. The backend server 112 and the remote terminal device are connected through a public or private communication network 102. Exemplary cloud platforms suitable for deploying the backend server 112 include, but are not limited to, Amazon's AWS™, Google Cloud, and Microsoft Azure™. Although a single cloud platform 110 is shown in FIG. 1, backend servers may be deployed across multiple cloud platforms. As further shown in FIG. 1, the backend servers may instead be deployed as dedicated servers 130 rather than as virtual machines in one or more cloud platforms. Likewise, the dedicated server 130 is connected to the remote terminal device 120 through the public or private communication network 102.

The system 100 also includes a cloud-based repository or database 114 and/or a non-cloud-based repository or database 132, each connected to the communication network 102 for storing various data analysis models as well as the various input, intermediate, and final data items processed by the data analysis models and pipelines. The terminal devices 120 may include, but are not limited to, desktop computers, laptop computers, tablet computers, mobile phones, personal digital assistants, wearable devices, and the like, as illustrated by terminal devices 122, 124, 126, and 128. The communication network 102 may include, for example, any combination of wired networks, wireless networks, access networks, and core networks having network protocol stacks configured to send and receive data.

FIGS. 2a and 2b illustrate a number of exemplary implementations for providing remote data analysis services to terminal devices 120 from backend servers 112 deployed in the cloud 110. As shown in FIG. 2a, a data item prepared or pre-processed at the terminal device 120 for further data analysis by the back-end server 112 is denoted by f. The data item may be transmitted to the back-end server 112, as indicated by arrow 203. The data analysis performed by the back-end server may be represented by m(), as shown at 204. Thus, the output data item after data analysis is performed by the back-end server 112 may be represented by m(f), as shown at 206. The processed data item m(f) may then be sent by the back-end server 112 to the terminal device 120, as indicated by arrow 207. The terminal device 120 may also post-process the data item m(f), as specified by an application (not shown in FIG. 2a) in the terminal device 120 requesting the remote data analysis service.

In many applications involving, for example, sensitive and confidential medical, financial, enterprise, or other private data, it may not be safe to provide data analysis services under the implementation of FIG. 2a. In particular, in the data transmission and communication scheme shown in FIG. 2a, the data item f in unencrypted form is exposed to the potentially untrusted communication channel 203 as well as to the backend server 112. Likewise, the processed data item m(f) in unencrypted form is further exposed to the untrusted communication channel 207. The integrity of the data items f and m(f) can be compromised.

Fig. 2b shows an implementation modified from fig. 2a for adding protection to data items f and m(f). As shown in fig. 2b, data item f is first encrypted by terminal device 120, as indicated at 208, before it is sent by terminal device 120 to backend server 112 for data analysis services. The encryption function performed by the terminal device 120 may be represented by E(). In this way, the encrypted data item (210) represented by E(f) (rather than f) may be sent by the terminal device 120 to the backend server 112. The decryption key may be kept secret by the terminal device 120 and may not be exposed to the communication channels 203 and 207 and the backend server 112. The backend server 112 may then process the encrypted data item E(f) 210 using a data analysis processing function m(), as shown at 204, to obtain a processed data item m(E(f)), as indicated at 212. The processed data item m(E(f)) 212 may then be transmitted from the backend server 112 to the remote terminal device 120. Then, when the processed data item m(E(f)) 212 is received, terminal device 120 may perform decryption D(), as shown at 214, to obtain the decrypted data item D(m(E(f))) before post-processing (not shown in fig. 2b), as shown at 206.

In the implementation of fig. 2b, the data analysis function m() is typically developed to process unencrypted data items, rather than encrypted data items, because the back-end server 112 does not have access to any client's secret decryption key. In order for the implementation of fig. 2b to function successfully when processing encrypted data items using the same data analysis function m(), the decryption process 214 of fig. 2b needs to recover, or nearly recover, the data items that would be generated if the unencrypted input data items were processed directly by the data analysis function m(). This is illustrated in fig. 3, where the decryption result D(m(E(f))) at the terminal device 120 is restored to m(f) (otherwise, fig. 3 is the same as fig. 2b).

For the schemes of fig. 2b and fig. 3 to achieve D(m(E(f))) = m(f) for any data analysis function m(), the encryption algorithm E() and the decryption algorithm D() may need to be implemented in some special and unconventional manner. Specifically, conventional encryption schemes lock the data so that the above condition is difficult, if not impossible, to satisfy without first decrypting at the back-end server 112. A non-conventional encryption process that satisfies the above condition may be referred to as Homomorphic Encryption (HE). The HE process may be designed to allow calculation and processing of data items in encrypted form. The decryption may be performed after the calculation and processing of the encrypted data item, and the decrypted output of the processed encrypted data item may be the same as the output of the same calculation and processing m() performed on the corresponding unencrypted data item. This type of encryption is referred to as homomorphic because the data computation and processing (alternatively referred to as data transformation) has the same effect on unencrypted data items and encrypted data items. Fully homomorphic encryption schemes may be complex, may require computational resources that are impractical for the terminal device, and may also require computational time that is not suitable for applications requiring real-time responses.

Some encryption algorithms may be partially homomorphic, in that only some types of data transformations (the function m() above), rather than arbitrary data processing operations, have the same effect on unencrypted data items and encrypted data items. For example, an encryption algorithm may include multiplying an input number by 10, and the corresponding decryption algorithm may include dividing by 10. For simple data transformation operations such as addition, this encryption is homomorphic. For example, the unencrypted digital data "1" and "2" may be encrypted to "10" and "20" (multiplied by 10), respectively, according to this particular encryption algorithm. The simple addition data transformation m() generates the digital data "3" when used to process the unencrypted data and the digital data "30" when used to process the encrypted data. The encrypted output data "30" is decrypted to "3" (divided by 10), the same as the result of performing the data transformation directly on the unencrypted data.
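The multiply-by-10 illustration above can be sketched in a few lines of Python (a toy scheme for exposition only; it provides no security whatsoever):

```python
def encrypt(x):
    """Toy 'encryption': multiply by 10. Illustration only, not secure."""
    return x * 10

def decrypt(x):
    """Toy 'decryption': divide by 10."""
    return x // 10

def m(a, b):
    """Data transformation: simple addition."""
    return a + b

# Addition commutes with the toy scheme: D(m(E(1), E(2))) == m(1, 2) == 3
assert m(encrypt(1), encrypt(2)) == 30
assert decrypt(m(encrypt(1), encrypt(2))) == 3 == m(1, 2)
```

A multiplication m(a, b) = a × b would break this toy scheme (10 × 20 decrypts to 20, not 2), which is exactly the "partially homomorphic" limitation described above.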

Thus, in some implementations of the present disclosure, an effective encryption algorithm that is not completely homomorphic may be used in conjunction with data analysis that involves only one set of data processing operations, which is limited to a set of operations that effectively homomorph the encryption algorithm. A data analysis pipeline containing data processing operations that are not within the set of operations may be approximated or modified to include only data processing operations from the set of operations and adaptively trained. Through such modifications to the data analysis pipeline, the encryption algorithm may be homomorphic.

Fig. 4 shows an example of using an efficient encryption algorithm that is homomorphic for a single data analysis or data transformation operation m() involving simple multiplication. The encryption/decryption algorithm shown in fig. 4 may be based on a conventional RSA scheme using a public key for encryption and a private key (secret key) for decryption. In the example of fig. 4, the public key may be (23, 143) and the private key may be (47, 143). The encryption algorithm may include performing a power function using the first component of the public key and then performing a modulo (Mod) operation using the second component of the public key, as shown at 406 and 408. The decryption process may include performing a power function using the first component of the private key and then performing a Mod operation using the second component of the private key, as shown at 418. In the example of fig. 4, the data item to be processed by the local terminal device 120 is the vector f = (f1, f2) = (7, 3). The data analysis function m() consists of a simple multiplication of the components of the input data vector, m(f) = f1 × f2, as indicated at 422.

Continuing with fig. 4, the local terminal device 120 first uses the public key to encrypt the data item f = (f1, f2) (as shown at 406 and 408) to generate an encrypted vector E(f) = (E(f1), E(f2)) = (2, 126) and sends the encrypted vector to the backend server 112 deployed in the cloud 110, as shown at 410. The back-end server 112 then performs the data analysis function m() to generate a processed data item m(E(f)) = E(f1) × E(f2) = 2 × 126 = 252, as shown at 412 and 414. The encrypted output 414 of the data analysis process 412 may then be transmitted by the back-end server 112 to the remote terminal device 120 and then decrypted by the remote terminal device 120, as shown at 416 and 418. In this example, the encrypted output 414 after decryption using the private key is "21", as shown at 418, the same as the result obtained by performing the data analysis operation m() directly on the unencrypted input data item f = (7, 3), as shown at 422 and 420.
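Using the example keys of fig. 4, the multiplicative homomorphism can be checked with textbook RSA (shown here without padding and with a deliberately tiny modulus; real deployments use neither):

```python
# Textbook RSA with the Fig. 4 example keys: public (23, 143), private (47, 143).
e, d, n = 23, 47, 143

def encrypt(x):
    return pow(x, e, n)   # power function, then Mod, as at 406/408

def decrypt(c):
    return pow(c, d, n)   # power function, then Mod, as at 418

f1, f2 = 7, 3
c1, c2 = encrypt(f1), encrypt(f2)
assert (c1, c2) == (2, 126)               # E(f) = (2, 126), as at 410
assert c1 * c2 == 252                     # m(E(f)) computed on ciphertexts, as at 414
assert decrypt(c1 * c2) == f1 * f2 == 21  # decryption recovers m(f), as at 418/420
```

The check works because RSA encryption is a pure modular power: E(f1) × E(f2) mod n = (f1 × f2)^e mod n, so multiplication carried out on ciphertexts decrypts to the product of the plaintexts.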

In some other implementations, the data analysis operation may include a combination of multiplications and additions of the input data items. In other words, the data analysis function m() may be a polynomial function of the input data items. An efficient homomorphic encryption/decryption algorithm can be designed for such data analysis functions. In particular, an efficient homomorphic encryption/decryption algorithm can be developed for a low-order polynomial data analysis function m(). As shown in the examples below, an m() that is not a low-order polynomial function may be approximated by a combination of multiple low-order polynomial functions. In other words, the data analysis function m() may be converted into polynomial form. As a result, an efficient homomorphic encryption/decryption algorithm may be used for such modified or approximated data analysis functions.

In practical applications, the data analysis function m() provided by the back-end server 112 to the terminal device 120 of fig. 1-4 may be constructed to analyze complex input data items. For example, backend server 112 (and/or 130) may provide speech recognition services to terminal device 120. Such speech recognition services may be provided through a speech-to-text Application Program Interface (API). Such an API may be used by the terminal device to request the back-end server to process an encrypted data item (data item f of fig. 2-4) formatted or converted from a digital audio waveform containing speech. The terminal device 120 may also receive the processed data item from the back-end server through the API, decrypt the processed data item, and further post-process the decrypted data item to obtain a text segment corresponding to the speech contained in the digital audio waveform. Based on the speech recognition services provided by the backend server 112 using the API, powerful applications can be developed for the terminal device 120, including but not limited to speech-based control, user dialog using natural speech conversation, and speech transcription and dictation.

While the following further example implementations are provided in the context of audio processing and speech recognition services, the underlying principles are applicable to other types of remote data analysis involving different types of data and different types of data processing, including but not limited to data classification (e.g., image classification and text classification), data clustering, object detection, segmentation, and recognition in digital images (e.g., face segmentation and recognition), and the like.

Fig. 5 shows an exemplary data processing pipeline 500 for speech recognition. The data processing pipeline 500 includes an audio waveform acquisition 502 for generating a digitized audio waveform 503. The digital audio waveform 503 may be divided into frames of a predetermined duration (e.g., 10-100 ms), and the audio frames are then processed by the perceptual model 504. The perceptual model 504 may be designed to convert frames of the digital waveform 503 into speech features 506. The perceptual model can be used to model the physiology of the human vocal cords and vocal tract with a set of speech feature parameters. Thus, the perceptual model is able to transform or encode digital audio frames containing speech into a representation based on a set of speech feature parameters. For example, the perceptual model may be based on algebraic, relaxed, or low-delay code-excited linear prediction (CELP).

Continuing with the description of FIG. 5, the speech features 506 may then be processed frame-by-frame by the acoustic model 508 to generate a probability vector representing, for each phone in a set of phones in a phone codebook, the probability that the audio frame contains that phone. As will be described in more detail below, the acoustic model 508 may be based on a pre-trained deep learning neural network. The codebook probability vectors 510 for a set of consecutive audio frames may also be processed by a phoneme and language model 512 to generate recognized text 514 by detecting the phonemes, words, and sentences contained in the set of audio frames.

Depending on the application, the various processing models in FIG. 5 may be implemented and distributed in the local terminal device 120 or the backend server 112 (or 130). As a result, as shown in the exemplary implementations of fig. 6-9, portions of the speech recognition pipeline 500 may be implemented in the local terminal device 120 and other portions of the pipeline 500 may be implemented in the remote server 112, which may be hosted in a cloud platform. In any of fig. 6 to 9, the data item generated by the local terminal device 120 before the encryption process is denoted by f; the data item generated by the local terminal device 120 after encryption is denoted E(f); the data item generated by the back-end server 112 after processing E(f) is denoted m(E(f)); and the data item after decryption by the local terminal device 120 is denoted D(m(E(f))), as shown in fig. 6-9, consistent with the notation in fig. 2-4.

In one example implementation shown in fig. 6, the acoustic model 508 may be provided by the back-end server 112 as a service 602, and the other models and data processing may be located at the local terminal device 120. In particular, the local terminal device 120 may be configured to generate a digital audio waveform 503 through an acquisition process 502 and implement a perceptual model 504 to generate speech features. The speech features may then be encrypted (208), and the encrypted speech features may then be transmitted to the back-end server 112. The back-end server 112 then processes the encrypted speech features using the acoustic model 508 to generate encrypted phone probability vectors, which are then decrypted by the local terminal device 120 (214) and post-processed by the phoneme and language model 512 at the local terminal device 120 to generate the recognized text 514.

In another example implementation shown in fig. 7, the acoustic model 508 and the phoneme and language model 512 may be provided by the back-end server 112 as a service 702, and the other models and data processing may be located at the local terminal device 120. In particular, the local terminal device 120 may be configured to generate a digital audio waveform 503 through an acquisition process 502 and implement a perceptual model 504 to generate speech features. The speech features may then be encrypted (208), and the encrypted speech features may then be transmitted to the back-end server 112. The back-end server 112 then processes the encrypted speech features using the acoustic model 508 to generate encrypted phone probability vectors, then processes the encrypted phone probability vectors using the phoneme and language model 512 to generate encrypted text m(E(f)), which is then decrypted by the local terminal device 120 (214) to generate the recognized text 514.

In another example implementation shown in fig. 8, the perceptual model 504 and the acoustic model 508 may be provided by the backend server 112 as a service 802, and the other models and data processing may be located at the local terminal device 120. In particular, the local terminal device 120 may be configured to generate a digital audio waveform 503 using the acquisition process 502. The local terminal device 120 may also encrypt frames of the digital waveform (208) to generate an encrypted digital audio waveform. The encrypted digital audio waveform may then be transmitted to the back-end server 112. The back-end server 112 then processes the encrypted digital audio waveform using the perceptual model 504 to generate encrypted speech features, which are then processed by the acoustic model 508 to generate an encrypted phoneme codebook probability vector m(E(f)). The encrypted phoneme codebook probability vector may then be decrypted (214) by the local terminal device to generate a decrypted phoneme codebook probability vector. The decrypted phoneme codebook probability vectors may then be post-processed by the phoneme and language model 512 at the local terminal device to generate the recognized text 514.

In yet another example implementation shown in fig. 9, the perceptual model 504, the acoustic model 508, and the phoneme and language model 512 may all be provided by the back-end server 112 as a service 902, and the other data processing may be located at the local terminal device 120. In particular, the local terminal device 120 may be configured to generate a digital audio waveform 503 using the acquisition process 502. The local terminal device 120 may also encrypt (208) the frames of the digital waveform to generate an encrypted digital audio waveform E(f). The encrypted digital audio waveform may then be transmitted to the back-end server 112. The back-end server 112 then processes the encrypted digital audio waveform using the perceptual model 504 to generate encrypted speech features, which are then processed by the acoustic model 508 to generate encrypted phoneme codebook probability vectors. The phoneme codebook probability vectors may then be processed by the phoneme and language model 512 to generate encrypted text. The encrypted text may then be transmitted to the local terminal device for decryption (214) to generate the unencrypted, recognized text 514.

In the implementations of fig. 6-9, the data processing models included in the back-end servers used to provide services 602, 702, 802, and 902 may be polynomialized, i.e., reduced to low-degree polynomial operations, such that an encryption and decryption scheme such as that described in fig. 4 becomes homomorphic with respect to services 602, 702, 802, and 902. The following example focuses on the implementation 600 shown in fig. 6, where only the acoustic model 508 is hosted in the back-end server 112 and the remaining data processing required for the speech recognition application is performed at the local terminal device 120. The underlying principles for building and training one or more data analysis models hosted in the backend server 112 may be applied to the other implementations and scenarios of fig. 7-9, in which the data processing models are distributed differently between the backend server 112 and the remote terminal devices 120.

In some implementations, the acoustic model 508 of fig. 6 can be based on a deep learning neural network. For example, the acoustic model may include a deep learning Convolutional Neural Network (CNN) with multiple cascaded convolutional, pooling, rectification, and/or fully connected neuron layers with a large number of kernels, weights, biases, and other parameters. These model parameters may be determined by training the CNN using a sufficient set of pre-labeled input data. In an exemplary training process of the CNN model, each of a number of labeled training data sets may be propagated forward through the neuron layers of the CNN network using predetermined connectivity and training parameters to compute a label loss or error at the output. Back-propagation through the interconnected neuron layers may then be performed in the opposite direction while the training parameters are adjusted by gradient descent to reduce the label loss. The forward- and backward-propagation training process iterates over all training data sets until the neural network produces a set of training parameters that provides a converged minimum total loss between the labels predicted by the neural network and the labels previously associated with the training data sets. The converged model may then include the final set of training parameters and the neural connectivity.
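The forward/backward training loop described above can be sketched as follows, here for a single hypothetical dense layer followed by the square activation z → z² used later in this disclosure (the data, layer size, iteration count, and learning rate are illustrative assumptions, not values from the disclosure):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))                 # 64 labeled training examples
W_true = rng.normal(size=(8, 1)) * 0.5
y = (X @ W_true) ** 2                        # synthetic labels

W = rng.normal(size=(8, 1)) * 0.1            # training parameters
lr = 1e-3
losses = []
for _ in range(500):
    z = X @ W                                # forward: dense layer
    pred = z ** 2                            # forward: polynomial activation
    err = pred - y
    losses.append(float(np.mean(err ** 2)))  # label loss at the output
    grad_W = X.T @ (err * 2 * z) * (2 / len(X))  # backward: chain rule
    W -= lr * grad_W                         # gradient-descent update

assert losses[-1] < losses[0]                # loss decreases toward convergence
```

A production acoustic model would of course use a full deep-learning framework, many layers, and real labeled speech data; the sketch only shows the iterate-forward-then-backward structure of the training loop.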

The above deep learning CNN may be constructed for use as the acoustic model 508 to process speech features generated by the perceptual model 504 and encrypted by the encryption process 208 of fig. 6. For example, a fully connected output layer may include a phone codebook having a predetermined number of components. The phone codebook may contain N components that represent the most common phones found in human speech. For example, the number N of phone components in the phone codebook may be 100-10000, or may be any other number. These components may be fully connected to the previous layer of the neural network. Once trained, such a deep-learning neural network may be used to process the encrypted speech features of an input audio frame to generate a probability vector having N components, each component representing the probability that the input audio frame contains the corresponding phone.

Fig. 10 further illustrates the implementation of fig. 6, wherein the acoustic model 508 is hosted in the back-end server 112 and implemented as a deep-learning neural network as described above. In particular, the frames of the digital audio waveform 503 are processed into speech features 1002 at the local terminal device 120. The speech features 1002 are then encrypted by the encryption process 208 in the local terminal device 120 to generate encrypted speech features 1004, which are then processed in the back-end server 112 using the deep learning neural network 508 to generate encrypted phone probability vectors, also referred to in fig. 10 as encrypted posteriors 1006. The encrypted posteriors 1006 are then decrypted at the local terminal device 120 by the decryption process 214 to generate a decrypted phone probability vector 1008. Because the encryption and decryption are homomorphic with respect to the operations included in the deep-learning neural network 508, the decrypted phone probability vector 1008 corresponds to the original probability vector that would be obtained if the deep-learning neural network 508 processed the original speech features in unencrypted form.

Conventional multi-layer neural networks may include data processing operations that are not low-order polynomial functions. For example, typical neuron activation functions, such as the sigmoid function, are not polynomials. As another example, the max-pooling operation following a convolution operation is also not a polynomial. Thus, a typical multi-layer deep learning neural network can be modified, or polynomialized, into low-degree polynomial form to maintain the homomorphism of the encryption/decryption algorithm of fig. 10. This modified deep-learning neural network may be trained and used as the acoustic model 508 for processing the encrypted speech features 1004 in FIG. 10. The modified deep learning neural network may be referred to as a Deep Polynomial Network (DPN) or a deep learning polynomial network.

An exemplary modification of the various layers of a typical deep learning neural network to include only low-order polynomial operations is shown in fig. 11 and described in more detail below. For example, a dense layer, such as a fully connected layer of a deep learning neural network, involves only linear multiplications and additions, and thus contains only low-order polynomial functions. It can therefore be used directly under homomorphic encryption without modification, as shown at 1102 in FIG. 11.

The batch normalization layer of the deep learning neural network also involves only multiplication and addition, e.g., y = γ·(x − μ)/σ + β (where γ, μ, σ, and β are batch normalization parameters), and can therefore be implemented directly for homomorphic encryption without a low-order polynomial approximation, as shown at 1104 of FIG. 11.

Convolution operations in the convolutional layer of deep learning CNN basically involve the dot product of the weight vector (or kernel) and the output of the feed layer. Thus, the convolution operation also involves only multiplication and addition, without the need for additional polynomial approximations, as shown at 1106 of FIG. 11.

However, typical activation functions used in deep learning neural networks are generally not polynomials, but may be approximated by low-order polynomial functions. For example, the ReLU (rectified linear unit) function z → max(0, z) used in the rectification layer of the deep-learning neural network may be approximated by the low-degree polynomial p(z) = z², as shown at 1108 of fig. 11. In another example, the sigmoid activation function σ(z) = 1/(1 + e^(−z)) may be approximated by the low-degree polynomial p(z) = 1/2 + z/4 − z³/48, as shown at 1110 in FIG. 11.

Pooling operations in a pooling layer of a deep-learning neural network (in particular a deep-learning CNN) are generally not polynomial and can be approximated by a low-degree polynomial function. For example, max pooling is not polynomial, but may be approximated by (x1^d + x2^d + … + xn^d)^(1/d). For simplicity, the parameter d may be set to "1", so the max pooling function above may be approximated by scaled average pooling, a first-order polynomial operation.
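The low-degree approximations of fig. 11 can be written out directly (the tolerance in the check below is an illustrative choice; the approximations are accurate only near z = 0):

```python
import math

def relu(z):
    return max(0.0, z)

def relu_poly(z):
    return z * z                          # p(z) = z^2

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_poly(z):
    return 0.5 + z / 4 - z ** 3 / 48      # low-degree approximation around z = 0

def sum_pool(xs):
    return sum(xs)                        # d = 1: scaled average (sum) pooling

# Near z = 0 the sigmoid approximation is tight.
assert abs(sigmoid_poly(0.5) - sigmoid(0.5)) < 1e-3
# Sum pooling uses only additions, i.e., a first-order polynomial operation.
assert sum_pool([1.0, 2.0, 3.0]) == 6.0
```

Because the approximations degrade away from zero, inputs to these layers are typically kept in a small numeric range (e.g., via the batch normalization discussed above) so that the polynomial substitutes track the original activations.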

It will be appreciated by those of ordinary skill in the art that the above low order polynomial approximations are merely examples and that other polynomial approximations are also contemplated.

The modified deep learning neural network, or Deep Polynomial Network (DPN), described above may be trained and tested using a labeled training data set and a labeled test data set. In some implementations, training may be performed using unencrypted speech data. Training using unencrypted data is particularly advantageous in an environment where the acoustic model implemented in a deep polynomial network is provided as a service to many different clients with different public encryption keys. This is because, if the training process were performed on encrypted data, multiple encrypted versions of the training data set would have to be generated based on the public keys of all potential clients and used to train multiple versions of the DPN individually. By using unencrypted training data, a single version of the DPN can be trained and used for all clients.

During the training process, encryption of the various training parameters may not be required. Thus, the trained deep polynomial model may include unencrypted model parameters (weights, biases, kernels, etc.) and network connectivity. Keeping the model parameters unencrypted during training also avoids having to prepare a client-specific model. Further, when the trained polynomial network is used to process encrypted speech features from a remote terminal device of a particular client (associated with a particular public key), forward propagation through some or all layers of the polynomial network may be performed while keeping one or more model parameters of those layers unencrypted. For example, the weight parameter W in a dense layer (fully connected output layer) of the deep-learning neural network may be kept unencrypted during forward propagation for the data analysis service. Given the encrypted input to the polynomial network, a naive approach to forward propagation in this computationally intensive layer is to first encrypt the weight parameters using the client's public key and then perform forward propagation through the dense layer in the encrypted domain as E(W)^T E(x) (where E(x) is the output of the previous layer), so that after decryption by the remote terminal device the exact value of W^T x is available. However, this process may be computationally intensive and unnecessary. Instead, the more efficient operation W^T E(x), without encrypting the weight parameters, may be used in forward propagation through the dense layer. Similarly, for batch normalization, when the trained polynomial network is used to process the input speech features of a particular client, it may not be necessary to encrypt parameters such as the batch normalization parameters γ, μ, σ, and β above. In some implementations, these batch normalization parameters may be merged with the adjacent dense layer to provide the dense layer with the modified weight parameter W_new = diag(γ/σ)W and the modified bias b_new = b + W^T β − W^T (μ·γ/σ).
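Assuming the batch normalization is applied to the input of the dense layer (an assumption consistent with the W_new and b_new formulas above; shapes and values below are made up for illustration), the merge can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 4, 3
W = rng.normal(size=(n_in, n_out))          # dense layer computes W^T x + b
b = rng.normal(size=n_out)
gamma, beta = rng.uniform(0.5, 1.5, n_in), rng.normal(size=n_in)
mu, sigma = rng.normal(size=n_in), rng.uniform(0.5, 1.5, n_in)
x = rng.normal(size=n_in)

# Reference: batch normalization followed by the dense layer.
y_ref = W.T @ (gamma * (x - mu) / sigma + beta) + b

# Merged: W_new = diag(gamma/sigma) W, b_new = b + W^T beta - W^T (mu*gamma/sigma)
W_new = np.diag(gamma / sigma) @ W
b_new = b + W.T @ beta - W.T @ (mu * gamma / sigma)
y_merged = W_new.T @ x + b_new

assert np.allclose(y_ref, y_merged)
```

Folding the normalization into the dense weights this way removes one whole layer of homomorphic multiplications and additions at inference time.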

The encryption may be homomorphic (additively and multiplicatively) with respect to the deep polynomial network. However, the speed of encryption and decryption will be significantly affected by the degree of the deep polynomial network. To achieve higher computation speeds, it may be desirable to configure the deep polynomial network to operate at low-bit fixed-point precision. Otherwise, when the system operates on floating-point numbers, homomorphic encryption and decryption will be very slow and unsuitable for real-time applications such as conversational speech recognition. For example, a typical training process for neural networks on GPUs uses 32-bit or higher floating-point precision for the training parameters, and floating-point computations for forward and backward propagation. A deep polynomial network model trained and operated at 32-bit or higher floating-point precision would significantly increase the required encryption and decryption time. On the other hand, applying low-bit fixed-point post-training quantization to the high-precision model parameters of a deep polynomial model trained using floating-point calculations, and then using this quantized deep polynomial network model to process input data items using fixed-point calculations, can result in significant performance degradation, because the training process was not adapted to low-precision model parameters and fixed-point calculations.

In some implementations, rather than applying post-training quantization, the training process of the deep polynomial network model may itself be limited to fixed-point calculations, and the model parameters of the deep learning polynomial model may be limited to fixed-point precision. However, such implementations may not account for the non-uniform statistical distribution of the model parameter values. For example, weight parameter values at, e.g., 8-bit fixed-point precision may be concentrated in a small portion of the 8-bit value range of the model parameters, whereas a training process based on fixed-point calculations relies on uniform value resolution over the entire 8-bit value range of the parameter space. The crowded portion of the 8-bit value range is thus given the same data resolution as other portions of the range with much sparser parameter occupancy. When training the deep polynomial network at floating-point precision (e.g., 32-bit instead of 8-bit precision), this is less of a problem because, even though a training process based on floating-point calculations also uses uniform data resolution over the range of parameter values, the crowded portion of the value range may still have sufficient resolution/precision to produce a reasonably accurate model, since the total number of bits available in the floating-point representation of the model parameters is large.

In some implementations of the present disclosure, non-uniform quantization of the value space of model parameters may be incorporated into the training process such that training may be performed using floating point operations to compute the model parameters and gradients, but the computed model parameters may be dynamically quantized at each layer of the depth polynomial network and used in the next layer during forward and backward propagation of the training process. Additionally, quantizing the calculated floating point model parameters and their gradients to fixed point integer precision at each layer may be performed unevenly over the fixed point value space and based on the statistical distribution of the values of the model parameters.

For example, as shown in logic flow 1200 of fig. 12, the depth polynomial network model may first be trained using floating point precision (without any quantization) to obtain the model parameter set in floating point precision (1202). After this initial training, the deep polynomial network may be retrained using the initially trained floating point model parameters as a starting point, as shown at 1204. The retraining process 1204 is performed using the training data set and further by incorporating non-uniform quantization to convert the initially trained polynomial network with floating point precision into a polynomial network with parameters with lower fixed point precision and suitable for processing the input data items using fixed point calculations (so that the encryption algorithm discussed above is effectively homomorphic).

As illustrated by loop arrow 1205 and block 1206 of fig. 12, the retraining process 1204 may be performed in multiple iterations using the training data set. In the retraining logic flow 1204, steps 1208, 1210, and 1212 are associated with forward propagation in the retraining process, and steps 1214 and 1216 are associated with backward propagation in the retraining process.

In each iteration of the loop formed by 1205 and 1206, statistics of each group of floating-point model parameters (weights, biases, etc.) of the deep polynomial network may be evaluated at each layer during forward propagation to determine the distribution of each group of parameters in the floating-point value space, as shown at 1208. The model parameters may be grouped by parameter type, since each type of model parameter may have a very different range and distribution of values. Accordingly, the statistics of each type of model parameter in the floating-point value space can be evaluated separately. For example, the model parameters at each network layer may be grouped into weight parameters, bias parameters, activation parameters, and the like.
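The per-type statistics step at 1208 can be sketched as follows; the names and the particular summary statistics collected here are illustrative, not prescribed by the disclosure:

```python
# Sketch: model parameters at a layer are grouped by kind (weights,
# biases, ...) and each group's value distribution is summarized
# separately, since the ranges and distributions differ by type.
from statistics import mean, stdev

def group_stats(layer_params):
    """layer_params: dict mapping parameter type -> flat list of values."""
    return {kind: {"min": min(vs), "max": max(vs),
                   "mean": mean(vs), "std": stdev(vs)}
            for kind, vs in layer_params.items()}
```

Each group's statistics then feed that group's own quantizer in the next step, rather than one quantizer being shared across all parameter types.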

The floating-point value space for each group may then be quantized into Q segments (quantization levels) based on the statistics of that group's value distribution, as shown at 1210. Q may be determined by the fixed-point precision. For example, if 8-bit fixed-point precision is used, the floating-point value space may be quantized into Q = 2^8 = 256 segments (quantization levels). The quantization levels may be non-uniform: portions of the floating-point value space that are more crowded than others may be assigned denser quantization levels. Any suitable algorithm may be used to determine the quantization levels. For example, the quantization levels for a set of parameters may be based on the Lloyd-Max quantization algorithm. In some implementations, specific quantization constraints may be applied. For example, zero may always be maintained as one of the Q quantization levels, regardless of whether any model parameters fall at that level. This is because zero may have special meaning, e.g., for the zero-padding function commonly used in convolutional neural networks, including deep polynomial networks, and should therefore be designated as a quantization level.
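A minimal sketch of such a quantizer follows, assuming the classic Lloyd iteration (nearest-level assignment alternated with centroid updates) and pinning zero as a level, as discussed above; all names are illustrative:

```python
# Illustrative Lloyd-Max-style quantizer for one parameter group. Level
# 0.0 is pinned so that zero (e.g., from zero-padding) survives
# quantization exactly.

def lloyd_max_levels(values, q, iters=20, pin_zero=True):
    lo, hi = min(values), max(values)
    levels = [lo + (hi - lo) * i / (q - 1) for i in range(q)]  # uniform start
    if pin_zero:
        levels[min(range(q), key=lambda i: abs(levels[i]))] = 0.0
    for _ in range(iters):
        buckets = [[] for _ in range(q)]
        for v in values:                      # assign to nearest level
            buckets[min(range(q), key=lambda i: abs(levels[i] - v))].append(v)
        for i, b in enumerate(buckets):       # move level to bucket centroid
            if b and not (pin_zero and levels[i] == 0.0):
                levels[i] = sum(b) / len(b)
    return sorted(levels)
```

Each returned level would then be mapped to one of the Q fixed-point integer codes; the pinned zero guarantees one code decodes exactly to zero.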

As further shown in step 1212 of the retraining logic flow 1204 of fig. 12, after quantization of the model parameter value space for the sets of model parameters at a particular layer, the forward propagation may proceed to the next layer by performing floating point calculations using the quantized model parameters.

As further illustrated in step 1214 of the retraining logic flow 1204 of fig. 12, during backward propagation, the gradients of the model parameters at each network layer may be obtained using floating-point calculations. Statistics of the distribution of the gradient values can then be computed, followed by non-uniform quantization of the gradients into Q quantization levels determined from those statistics and the quantization algorithm, similar to the quantization of the other model parameters described above. The quantized gradients may then be propagated back to the preceding layer, as shown in step 1216.

Thus, the initial training and retraining process for the deep polynomial network incorporates non-uniform quantization of the model parameters and gradients. The quantization levels may be determined dynamically, on the fly, during the retraining process, and may be determined separately for each type of parameter and each network layer based on the value distribution statistics of the dynamically obtained parameters. The resulting deep polynomial model therefore includes only model parameters at fixed-point precision, yet remains reasonably accurate when those fixed-point parameters are used in conjunction with fixed-point forward propagation to process input data items.

Fig. 13 illustrates an example quantization of a model parameter (or gradient) value space. In FIG. 13, 1302 shows the statistical distribution (histogram) of an exemplary set of weight parameters for a layer of a deep polynomial network in a uniform floating-point parameter value space 1304 between a minimum value 1303 and a maximum value 1305. In the example shown at 1302 and 1304, the weight parameter values are concentrated in the middle of the range and sparse at both ends of the value space.

Continuing with fig. 13, 1306 shows quantization of the floating-point value space 1304 into Q uniform quantization levels, represented by the points shown at 1306, each corresponding to one of Q fixed-point integer values. Quantizing the weight parameters in the high-resolution floating-point value space 1304 directly into a uniform lower-resolution fixed-point value space 1306 wastes quantization levels at both ends of the parameter value space and provides relatively few quantization levels in the regions where the weight parameters are concentrated. FIG. 13 also shows a Lloyd-Max quantization 1308 that maps the floating-point parameter value space to Q non-uniform quantization levels, represented by the points in 1308, each corresponding to one of Q fixed-point integer values. As shown at 1308, denser quantization levels are assigned to the more concentrated regions of the floating-point value space, based on the statistical distribution 1302. The Lloyd-Max scheme described above is only one of many examples; other non-uniform quantization schemes may be applied.
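The advantage illustrated in fig. 13 can also be checked numerically: starting from uniform levels over the weight range, Lloyd-style centroid updates reallocate levels toward the crowded mid-range and reduce the mean-squared quantization error (each Lloyd step is non-increasing in distortion). The sketch below uses synthetic Gaussian-distributed weights as a stand-in for a real parameter distribution:

```python
import random

def nearest(x, levels):
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))

def mse(values, levels):
    return sum((v - levels[nearest(v, levels)]) ** 2 for v in values) / len(values)

random.seed(7)
weights = [random.gauss(0.0, 0.05) for _ in range(2000)]   # crowded mid-range
q = 16
lo, hi = min(weights), max(weights)
uniform = [lo + (hi - lo) * i / (q - 1) for i in range(q)]

levels = list(uniform)
for _ in range(10):                                        # Lloyd iterations
    buckets = [[] for _ in range(q)]
    for w in weights:
        buckets[nearest(w, levels)].append(w)
    levels = [sum(b) / len(b) if b else levels[i] for i, b in enumerate(buckets)]

assert mse(weights, levels) < mse(weights, uniform)        # non-uniform wins
```

The final `levels` cluster near zero, where the synthetic weights are concentrated, echoing the dense points in region 1308 of fig. 13.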

FIG. 14 illustrates retraining of a deep polynomial network using floating-point calculations that are non-uniformly quantized into a fixed-point value space during forward propagation through multiple network layers. As shown in fig. 14, the floating-point outputs of the various network operations (e.g., weight multiplication, activation, and other operations, shown at 1402, 1404, and 1406) are dynamically quantized during retraining according to the non-uniform Lloyd-Max quantization levels of fig. 13, as shown at 1408, 1410, and 1412. The quantized outputs are then used in the floating-point calculations of the next layer, as shown by arrows 1401 and 1403.

The implementation of fig. 12 above begins with an initial deep polynomial network that has already been derived from a conventional non-polynomial deep learning neural network or deep learning convolutional neural network. In some other implementations, as shown in Table 1, the conventional non-polynomial network model may be trained first, before conversion into the corresponding deep polynomial network. After this initial training of the non-polynomial neural network, the corresponding deep polynomial network model may be initialized with model parameters derived from those of the initially trained non-polynomial model (step 1 of Table 1). Retraining of the deep polynomial network model may then be performed at the mini-batch level of the training data set, as shown in Table 1 below (steps 2-4 of Table 1).
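Since the body of Table 1 is not reproduced in this excerpt, the following minimal runnable sketch follows only the steps described in the surrounding text: a single-weight polynomial model y = (w·x)² is seeded from a pretrained weight (step 1) and retrained per mini-batch with both the gradient and the updated weight snapped to a fixed grid of quantization levels (steps 2-4). All names and the toy grid are illustrative:

```python
def snap(x, levels):
    """Snap a value to the nearest quantization level."""
    return min(levels, key=lambda lv: abs(lv - x))

def retrain(w_init, batches, levels, lr=0.05):
    w = snap(w_init, levels)                     # step 1: seed from pretrained model
    for batch in batches:                        # steps 2-4: mini-batch retraining
        g = 0.0
        for x, y in batch:
            pred = (w * x) ** 2                  # polynomial (square) activation
            g += 2 * (pred - y) * 2 * w * x * x  # d(squared error)/dw
        g = snap(g / len(batch), levels)         # quantize the gradient
        w = snap(w - lr * g, levels)             # quantize the updated weight
    return w
```

With a grid of step 0.25 and a pretrained weight of 0.5, a few quantized mini-batch updates recover the target weight 1.0 on toy data generated by y = x².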

TABLE 1

In a particular implementation for training a deep polynomial network model and using the trained model for speech recognition, the model may be trained using the Computational Network Toolkit (CNTK). Homomorphic encryption is implemented using the Microsoft™ SEAL (Simple Encrypted Arithmetic Library). The Switchboard task and a voice-assistant task are used to evaluate the effectiveness of the trained DPN model.

For the Switchboard task, the 309-hour data set and NIST 2000 Hub5 are used as the training and test sets, respectively. The speech features used in this exemplary setup are 40-dimensional log filterbank (LFB) features with utterance-level cepstral mean normalization (CMN). The output of the DPN comprises 9000 tied triphone states. The polynomial approximation is verified on two conventional neural network models. The first model, a deep learning neural network (DNN), comprises a 6-layer ReLU network with batch normalization and 2048 units in each layer. The second model, a deep learning convolutional neural network (CNN), comprises a 17-layer network: 3 convolutional layers with 96 kernels of size 3 x 3, followed by max pooling; 4 convolutional layers with 192 kernels of size 3 x 3, followed by max pooling; 4 convolutional layers with 384 kernels of size 3 x 3, followed by max pooling; and then two dense layers of 4096 units each and one softmax layer. Both example models use [t-30, t+10] as the input context. The outputs of each model and of its corresponding deep polynomial approximation network are then processed by a language model with a vocabulary size of 226k. The WERs (word error rates) of the first and second models and of the corresponding deep polynomial network models are shown in Table 2. All models were trained using the CE (cross-entropy) criterion. The deep polynomial networks were trained according to the algorithm of Table 1.

TABLE 2

For the voice-assistant task, training was performed using 3400 hours of American English data and testing using 6 hours of data (5500 utterances). The speech features of the acoustic model used in this setup are 87-dimensional LFB features with utterance-level CMN. The neural network in this setup has the same structure as the first model of the Switchboard example above, but with 9404 tied triphone output states. Table 3 summarizes the WER for this task and the average latency per utterance (including encryption, AM scoring, decryption, and decoding).

TABLE 3

Finally, FIG. 15 illustrates an exemplary computer system 1500 for implementing any of the terminal or server computing components described above. Computer system 1500 may include a communication interface 1502, system circuitry 1504, an input/output (I/O) interface 1506, a memory 1509, and display circuitry 1508, the display circuitry 1508 generating a machine interface 1510, either locally or remotely, for display, e.g., in a web browser running on a local or remote machine. Machine interface 1510 and I/O interface 1506 may include GUIs, touch-sensitive displays, voice or facial recognition inputs, buttons, switches, speakers, and other user interface elements. Other examples of I/O interfaces 1506 include microphones, video and still image cameras, headphones and microphone input/output jacks, Universal Serial Bus (USB) connectors, memory card slots, and other types of inputs. I/O interfaces 1506 may also include magnetic or optical media interfaces (e.g., CDROM or DVD drives), serial and parallel bus interfaces, and keyboard and mouse interfaces.

The communication interface 1502 may include a wireless transmitter and receiver ("transceiver") 1512 and any antenna 1514 used by the transmit and receive circuitry of the transceiver 1512. The transceiver 1512 and antenna 1514 may support Wi-Fi network communications, e.g., communications based on any version of IEEE 802.11, e.g., 802.11n or 802.11 ac. Communication interface 1502 may also include a wired transceiver 1516. The wired transceiver 1516 may provide a physical layer interface to any of a wide range of communication protocols, such as any type of ethernet, Data Over Cable Service Interface Specification (DOCSIS), Digital Subscriber Line (DSL), Synchronous Optical Network (SONET), or other protocol.

The memory 1509 may be used to store various initial data, intermediate data, or final data. Memory 1509 may be separate from or integrated with one or more of repositories 114 and 130 of fig. 1. The memory 1509 may be a centralized or distributed memory, and may be local or remote to the computer system 1500. For example, the memory 1509 may be remotely hosted by a cloud computing service provider.

The system circuitry 1504 may include any combination of hardware, software, firmware, or other circuitry. For example, the system circuitry 1504 may be implemented using one or more system on a chip (SoC), Application Specific Integrated Circuit (ASIC), microprocessor, discrete analog and digital circuits, and other circuits. The system circuitry 1504 is part of an implementation of any desired functionality associated with the system 100 of fig. 1. As just one example, the system circuitry 1504 may include one or more instruction processors 1518 and memory 1520. For example, memory 1520 stores control instructions 1526 and an operating system 1524. In one implementation, instruction processor 1518 executes control instructions 1526 and an operating system 1524 to perform any desired functions associated with any of the terminal and server components of fig. 1.

The methods, apparatus, processes, and logic described above may be implemented in many different ways and in many different combinations of hardware and software. For example, all or part of an implementation may be a circuit including an instruction processor, such as a Central Processing Unit (CPU), microcontroller, or microprocessor; an Application Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), or a Field Programmable Gate Array (FPGA); or circuitry comprising discrete logic devices or other circuit components, including analog circuit components, digital circuit components, or both; or any combination thereof. By way of example, the circuit may include discrete interconnected hardware components, and/or may be combined on a single integrated circuit die, distributed over multiple integrated circuit dies, or implemented in a multi-chip module (MCM) of multiple integrated circuit dies in a common package.

The circuitry may further include or access instructions for execution by the circuitry. The instructions may be stored in a tangible storage medium other than a transitory signal, such as a flash memory, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM); or on a magnetic or optical disk, such as a Compact Disc Read Only Memory (CDROM), Hard Disk Drive (HDD), or other magnetic or optical disk; or stored in or on another machine-readable medium. An article of manufacture, such as a computer program product, may comprise a storage medium and instructions stored in or on the medium, and when executed by circuitry in a device, the instructions may cause the device to carry out any of the processes described above or shown in the figures.

Implementations may be distributed as circuitry over multiple system components, such as over multiple processors and memories, optionally including multiple distributed processing systems. Parameters, databases, and other data structures may be separately stored and managed, may be incorporated into a single memory or database, may be logically and physically organized in many different ways, and may be implemented in many different ways, including as data structures such as linked lists, hash tables, arrays, records, objects, or implicit storage mechanisms. A program can be a part of a single program (e.g., a subroutine), a separate program, distributed across multiple memories and processors, or implemented in many different ways, such as in a library, e.g., a shared library (e.g., a Dynamic Link Library (DLL)). For example, a DLL may store instructions that, when executed by circuitry, perform any of the processes described above or shown in the figures.

From the foregoing, it can be seen that the present disclosure provides a system and method for cloud-local federated or collaborative data analysis based on a data processing model that is trained and hosted in a backend server for processing data items that are pre-processed and encrypted by remote terminal devices. The data analysis model hosted in the backend server generates an encrypted output data item, which is then transmitted to the local terminal device that requested the data analysis service for decryption and post-processing. The framework functions without exposing the decryption key of the local terminal device either to the backend server or to the communication network between the local terminal device and the backend server, and thus provides data privacy protection. Encryption/decryption and data analysis in the backend server are configured to process and transmit data items efficiently, providing real-time or near-real-time responses to data analysis requests from remote terminal devices. The data analysis model hosted in the backend server, its operations, and its training process are adapted and modified so that the model can process data items in encrypted form; no decryption key is required to build or train the model. The same data analysis model may be used to serve different clients, each with its own decryption key. The framework and data analysis models may be used to provide remote on-demand speech recognition services and other data analysis services.
